Genetics AI Platform Blog


Sachin Kale · AI Software Engineer

Building a Domain-Specific LLM for Molecular Biology

System Architecture

Request Flow: User → React App → FastAPI → HuggingFace Inference Endpoint → Genetics LLM → Response

AWS Infrastructure: Route 53 (DNS) → AWS EKS (Container Orchestration) → ECR (Image Registry) → Secrets Manager (Credentials)
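
To make the FastAPI → HuggingFace Inference Endpoint hop in the request flow concrete, here is a minimal sketch of how the API layer might build the outgoing call. The endpoint URL, the token, and the helper name `build_generation_request` are placeholders invented for illustration; in the setup described here the token would presumably be read from Secrets Manager. The payload uses the `inputs` / `parameters` shape commonly accepted by text-generation endpoints.

```python
import json
import urllib.request

# Hypothetical values: the real endpoint URL and token are not given in this post.
ENDPOINT_URL = "https://example.endpoints.huggingface.cloud"
HF_TOKEN = "hf_placeholder_token"


def build_generation_request(prompt: str, max_new_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST request for a text-generation endpoint."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": 0.7},
    }
    return urllib.request.Request(
        ENDPOINT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {HF_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_generation_request("What does Cas9 do in CRISPR gene editing?")
# Sending it would look like: urllib.request.urlopen(request, timeout=30)
```

The request is only constructed, not sent, so the sketch runs without network access or real credentials.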

1. Genetics LLM

A domain-adapted large language model specialized for genetics and molecular biology research. Compared with general-purpose LLMs, it shows improved accuracy and contextual understanding on domain-specific queries.

Training:

  • Base Model: Qwen2-1.5B fine-tuned with LoRA (rank=16, alpha=32), a parameter-efficient method that keeps the base weights frozen and so reduces the risk of catastrophic forgetting
  • Dataset: Curated 98K Q&A pairs covering CRISPR mechanisms, DNA sequencing technologies, hereditary diseases, gene expression, and molecular diagnostics
  • Infrastructure: 3 epochs on an NVIDIA A100 GPU (40GB VRAM) via HuggingFace training infrastructure, using bf16 mixed precision
Qwen2-1.5B · LoRA/PEFT · HuggingFace · PyTorch
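
To give a feel for what rank=16 and alpha=32 mean in practice, the sketch below works through the arithmetic for a single square projection matrix. The hidden size of 1536 is an assumption about Qwen2-1.5B (check the model's config.json), and which layers were actually adapted is not stated here, so this covers one matrix only. It uses the standard LoRA scaling of alpha / rank.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


RANK = 16      # from the training setup above
ALPHA = 32     # from the training setup above
HIDDEN = 1536  # assumed hidden size for Qwen2-1.5B, not stated in this post

scaling = ALPHA / RANK   # the low-rank update B @ A is multiplied by alpha / rank
full = HIDDEN * HIDDEN   # weights in one frozen square projection matrix
adapter = lora_trainable_params(HIDDEN, HIDDEN, RANK)

print(f"scaling factor: {scaling}")
print(f"full matrix:    {full:,} weights")
print(f"LoRA adapter:   {adapter:,} weights ({adapter / full:.2%} of the matrix)")
```

Under these assumptions the adapter trains about 2% of the weights of each adapted matrix, which is why a 1.5B model fits comfortably on a single 40GB GPU.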

2. Genetics API (inference)

Production-grade REST API providing scalable, low-latency access to the Genetics LLM for real-time inference.

Architecture:

  • Framework: FastAPI with async request handling and automatic OpenAPI documentation
  • Backend: HuggingFace Inference Endpoints with GPU acceleration (NVIDIA T4)
  • Security: API key-based authentication with rate limiting and request validation
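
The security layer can be sketched without any framework: a constant-time API key check followed by a per-key token bucket. This is one common way to implement rate limiting, not necessarily the one used here. The key value, the limits, and the names `TokenBucket` and `check_request` are made up for illustration. In FastAPI, logic like this would typically sit in a dependency that raises 401 or 429.

```python
import hmac
import time
from typing import Optional

VALID_KEYS = {"demo-key-123"}  # hypothetical key; real keys would live in a secret store
buckets = {}                   # one TokenBucket per API key


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, capacity: int, rate: float, now: Optional[float] = None):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill according to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def check_request(api_key: str, now: Optional[float] = None) -> int:
    """Return the HTTP status the API would answer with: 200, 401 or 429."""
    supplied = api_key.encode("utf-8")
    # compare_digest avoids leaking key contents through timing differences.
    if not any(hmac.compare_digest(supplied, k.encode("utf-8")) for k in VALID_KEYS):
        return 401
    if api_key not in buckets:
        buckets[api_key] = TokenBucket(capacity=5, rate=1.0, now=now)
    return 200 if buckets[api_key].allow(now) else 429


# Six requests at the same instant: the sixth exceeds the burst of five.
statuses = [check_request("demo-key-123", now=0.0) for _ in range(6)]
print(statuses)  # [200, 200, 200, 200, 200, 429]
```

The `now` parameter exists only so the behaviour can be checked deterministically; production code would rely on the monotonic clock.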

Infrastructure:

  • Compute: AWS EKS (Kubernetes) with a 2-node managed cluster
  • Network: Application Load Balancer with SSL/TLS termination via ACM
  • CI/CD: GitHub Actions pipeline with automated ECR push and kubectl rollout
FastAPI · HF Endpoints · Docker · EKS

3. Genetics AI Assistant (Chatbot)

Interactive web application enabling natural language conversations with the Genetics LLM for researchers and students.

Stack:

  • Frontend: React 18 with TypeScript for type-safe development, built with Vite
  • Styling: Bootstrap CSS with a custom design system
  • Deployment: S3 + CloudFront for global CDN distribution

Features:

  • Real-time streaming responses with typing indicators
  • Conversation history with local storage persistence
  • Mobile-responsive layout with example prompts
React 18 · TypeScript · Bootstrap · CloudFront
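
Streaming responses need a wire format between the API and the browser. The transport is not specified here, so the sketch below assumes Server-Sent Events, a common choice for token streaming. The `[DONE]` end marker and the `sse_frames` helper are conventions assumed for this example. In FastAPI, a generator like this could be handed to a `StreamingResponse` with media type `text/event-stream`.

```python
import json
from typing import Iterable, Iterator


def sse_frames(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap each text chunk in a Server-Sent Events frame, then signal completion."""
    for chunk in chunks:
        # JSON-encode the chunk so newlines inside it cannot break the frame boundary.
        payload = json.dumps({"text": chunk})
        yield f"data: {payload}\n\n"
    yield "data: [DONE]\n\n"  # assumed end-of-stream marker the client stops on


frames = list(sse_frames(["CRISPR ", "uses a guide RNA", "."]))
for frame in frames:
    print(repr(frame))
```

On the client side, the typing indicator would stay visible until the end marker arrives.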

4. AI Vision/Image Metadata Studio for Machine Learning

Professional image annotation tool for creating high-quality labeled datasets for machine learning models. Supports polygon annotations, time-lapse comparison with auto-detection, and multiple export formats.

  • Generate labeled datasets for Vision Transformers, Diffusion Models, and Object Detection
  • Annotate individual images with regions and points
  • Add captions and metadata for training
  • Export in COCO JSON, YOLO, HuggingFace JSONL, and CSV formats
  • Import metadata files to restore annotations for any image

Annotation Features:

  • Tools: Polygon regions, point markers, measurements with scale calibration
  • Metadata: Labels, notes, confidence scores, captions, and tags
  • Histopathologic Images: Automated cell and nuclei detection and segmentation in microscopy images through the Cellpose API, using Cellpose's nuclei and cyto3 models
  • Export: COCO JSON, YOLO, HuggingFace JSONL, CSV formats
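
As an illustration of the export step, the sketch below converts one polygon region into a COCO annotation entry and a YOLO detection line. The function names and the sample polygon are invented for this example, and the tool's internal data model is not shown here. COCO stores the bounding box as absolute [x, y, width, height] in pixels, while YOLO expects a class index followed by the box centre and size normalized to the image dimensions.

```python
import json


def polygon_area(points):
    """Shoelace formula; points are (x, y) pixel coordinates listed in order."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def bounding_box(points):
    """Return (x_min, y_min, width, height) enclosing the polygon."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


def to_coco_annotation(points, annotation_id, image_id, category_id):
    x, y, w, h = bounding_box(points)
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,
        "segmentation": [[c for p in points for c in p]],  # flat [x1, y1, x2, y2, ...]
        "bbox": [x, y, w, h],
        "area": polygon_area(points),
        "iscrowd": 0,
    }


def to_yolo_line(points, class_index, image_width, image_height):
    x, y, w, h = bounding_box(points)
    cx = (x + w / 2) / image_width
    cy = (y + h / 2) / image_height
    return f"{class_index} {cx:.6f} {cy:.6f} {w / image_width:.6f} {h / image_height:.6f}"


polygon = [(100, 50), (300, 50), (300, 150), (100, 150)]  # sample region, 400x200 image
coco = to_coco_annotation(polygon, annotation_id=1, image_id=1, category_id=1)
yolo = to_yolo_line(polygon, class_index=0, image_width=400, image_height=200)
print(json.dumps(coco))
print(yolo)  # 0 0.500000 0.500000 0.500000 0.500000
```

Note that the YOLO line keeps only the bounding box, so the polygon outline survives in the COCO export but not in this YOLO variant.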

Time-Lapse Comparison:

  • Auto-Detection: Region tracking across image series using color profiling
  • Analytics: Area progression charts with growth statistics
  • Demo Data: Histopathology samples (Lung, Liver, Breast, Cancer)
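
The growth statistics behind an area progression chart can be sketched as follows, assuming the region areas for each frame have already been measured in pixels. The sample areas, the 0.5 microns-per-pixel calibration, and the function names are made-up values for illustration.

```python
def calibrate(area_px, microns_per_pixel):
    """Convert an area in square pixels to square microns."""
    return area_px * microns_per_pixel ** 2


def growth_stats(areas):
    """Percent change between consecutive frames and from first to last frame.

    Assumes at least two frames and non-zero areas.
    """
    steps = [
        (later - earlier) * 100.0 / earlier
        for earlier, later in zip(areas, areas[1:])
    ]
    overall = (areas[-1] - areas[0]) * 100.0 / areas[0]
    return {"step_percent": steps, "overall_percent": overall}


areas_px = [2000.0, 2500.0, 3000.0]  # hypothetical region area in each frame
areas_um2 = [calibrate(a, microns_per_pixel=0.5) for a in areas_px]
stats = growth_stats(areas_um2)
print(areas_um2)  # [500.0, 625.0, 750.0]
print(stats)      # {'step_percent': [25.0, 20.0], 'overall_percent': 50.0}
```

Percent changes are independent of the calibration factor, but calibrated areas are what make absolute values comparable across images taken at different magnifications.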

Infrastructure:

  • Frontend: React 19 + TypeScript + Vite + Canvas API
  • Backend: AWS SAM (Lambda + API Gateway + DynamoDB)
  • Deployment: S3 + CloudFront with GitHub Actions CI/CD
React 19 · TypeScript · Canvas API · AWS SAM · Lambda · S3 · CloudFront