
Genomics: From Variant QC to Clinical Prioritization

Population and clinical genomics across three complexity levels — variant QC, mitochondrial annotation, and a full clinical prioritization pipeline.

Genomic variant analysis is the foundation of precision medicine — from population genetics to rare disease diagnosis. This case study demonstrates Lobster AI's genomics pipeline across three complexity levels: basic variant QC on chromosome 22 data, mitochondrial variant analysis with functional annotation, and a complete clinical prioritization pipeline integrating VEP, ClinVar, and gnomAD. Together, these scenarios show how the genomics_expert agent handles standard diploid QC, adapts to haploid mitochondrial biology, and executes multi-step clinical workflows that typically require separate tools, databases, and days of manual work.

Session context: Results generated February 2026 using lobster-ai 1.0.12 on AWS Bedrock (Claude Sonnet 4.5). External databases queried: Ensembl VEP, ClinVar, gnomAD. Local tools: cyvcf2, sgkit, NumPy. Total cost: $2.29 across 3 case studies (8 turns). Annotation databases are updated regularly — VEP releases, ClinVar submissions, and gnomAD versions change over time. Re-running these queries will return different annotations. This case study demonstrates analytical workflows, not independently validated clinical findings.

Agents and Data Sources

This analysis uses the lobster-genomics package, which provides one primary agent:

| Agent | Role |
| --- | --- |
| genomics_expert | VCF loading, QC assessment, filtering, PCA, LD pruning, functional annotation, clinical prioritization |

The variant_analysis_expert child agent exists but was not explicitly delegated to during these sessions — the parent's annotation tools handled VEP and ClinVar queries directly. External APIs queried during the sessions: Ensembl VEP (functional consequence prediction), ClinVar (pathogenicity classification), gnomAD (population frequencies). Local computation is handled by cyvcf2 (VCF parsing), sgkit (PCA), and numpy (genotype encoding).


Simple: Chromosome 22 Variant QC

The first scenario establishes the core load-QC-filter pipeline using a 1000 Genomes Project chromosome 22 VCF file. This is the foundation for any variant analysis: characterize raw data quality, apply standard filtering thresholds, and quantify how many variants survive clinical QC criteria.

Turn 1: Load and Assess Quality

lobster query --session-id genomics_simple \
  "Load the VCF file at /Users/tyo/Omics-OS/lobster-casestudy-genomics/test_data/genomics/chr22_small.vcf.gz \
   as 'chr22_variants'. Then assess the quality of this genomic dataset — I want to see call rates, \
   MAF distribution, and Hardy-Weinberg equilibrium statistics."

The genomics_expert loaded the VCF and immediately characterized its quality profile. The dataset has perfect call rates — no missing genotypes — but is heavily enriched for rare variants, with a median MAF of just 0.04%. This is typical for population-scale whole-genome sequencing data.

| Metric | Value |
| --- | --- |
| Samples | 2,504 |
| Variants | 1,000 |
| Call rate (mean) | 100% |
| Common variants (MAF >= 1%) | 82 (8.2%) |
| Rare variants (MAF < 1%) | 883 (88.3%) |
| HWE failures (p < 1e-10) | 40 (4.0%) |

The 88.3% rare variant enrichment is biologically expected for population-scale WGS. Most human genetic variation consists of rare alleles under weak purifying selection — the reference genome represents a small fraction of global diversity.
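The common/rare split in the table above comes down to a minor allele frequency calculation over diploid genotype calls. A minimal sketch, assuming genotypes are encoded as alt-allele dosages (0/1/2, `None` for missing); the function names are illustrative, not Lobster's internal API:

```python
# Sketch: compute minor allele frequency (MAF) from diploid genotype calls
# and classify a variant as common or rare, as in the QC table above.

def maf(genotypes):
    """Minor allele frequency from a list of alt-allele dosages (0/1/2)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    alt_freq = sum(called) / (2 * len(called))  # diploid: 2 alleles per sample
    return min(alt_freq, 1 - alt_freq)

def classify(genotypes, threshold=0.01):
    """Label a variant 'common' (MAF >= 1%) or 'rare'."""
    f = maf(genotypes)
    return "common" if f is not None and f >= threshold else "rare"

print(classify([0, 0, 1, 2]))     # alt freq 3/8 -> MAF 0.375 -> common
print(classify([0] * 99 + [1]))   # alt freq 1/200 = 0.005 -> rare
```

At population scale the same arithmetic runs vectorized over a genotype matrix, but the threshold logic is identical.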

Turn 2: Sample and Variant Filtering

lobster query --session-id genomics_simple \
  "Filter the chr22 variants: remove samples with call rate below 95% or extreme heterozygosity \
   (beyond 3 SD). Then filter variants with call rate below 99%, MAF below 1%, or HWE p-value below 1e-6. \
   Show me the filtering summary — how many samples and variants were removed at each step."

The two-stage filtering pipeline (samples first, then variants) demonstrates how stringent QC reduces a raw variant set to analysis-ready data. The massive variant loss (92.6%) is expected — the MAF filter alone removed 88.3%, leaving 74 high-confidence common variants suitable for association testing, PCA, or kinship analysis.

| Filter | Variants Removed | Percentage |
| --- | --- | --- |
| Call rate < 99% | 0 | 0% |
| MAF < 1% | 883 | 88.3% |
| HWE p < 1e-6 | 48 | 4.8% |
| Total removed | 926 | 92.6% |
| Retained | 74 | 7.4% |

| Dataset | Samples | Variants |
| --- | --- | --- |
| Before filtering | 2,504 | 1,000 |
| After sample filter | 2,504 (100%) | 1,000 |
| After variant filter | 2,504 (100%) | 74 (7.4%) |

The 1000 Genomes cohort spans 26 populations across 5 super-populations. HWE testing on pooled, structured populations can generate false deviations (Wahlund effect). In production pipelines, HWE filtering should be performed within homogeneous population strata rather than on the full multi-ancestry cohort.

What This Demonstrates

The simple pipeline exercises the genomics expert's core tools (load_vcf, assess_quality, filter_samples, filter_variants) without requiring child agent delegation or external API calls. It establishes the QC foundation that all downstream analysis depends on.
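The Turn 2 variant thresholds reduce to a simple per-variant mask. A sketch with the session's threshold values (the function name and return convention are illustrative, not the filter_variants tool's actual interface):

```python
def variant_pass(call_rate, maf, hwe_p,
                 min_call_rate=0.99, min_maf=0.01, min_hwe_p=1e-6):
    """Apply the three variant-level thresholds from Turn 2 and report
    which filters a variant fails (empty list = retained)."""
    failures = []
    if call_rate < min_call_rate:
        failures.append("call_rate")
    if maf < min_maf:
        failures.append("maf")
    if hwe_p < min_hwe_p:
        failures.append("hwe")
    return failures

print(variant_pass(1.0, 0.0004, 0.5))  # typical rare variant -> ['maf']
print(variant_pass(1.0, 0.05, 1e-12))  # common but HWE outlier -> ['hwe']
print(variant_pass(1.0, 0.05, 0.3))    # survives all filters -> []
```

Because 88.3% of variants are rare, the MAF filter alone accounts for nearly all of the 92.6% loss reported above.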


Medium: Mitochondrial Variant Analysis

The second scenario tests the agent's ability to adapt standard diploid QC thresholds to haploid mitochondrial biology. Mitochondrial DNA is maternally inherited, exists in hundreds of copies per cell, and does not undergo recombination — so standard nuclear genome QC rules (HWE, heterozygosity) don't apply.

Turn 1: Load and Recognize Haploid Biology

lobster query --session-id genomics_medium \
  "I have a VCF file from the 1000 Genomes Project containing mitochondrial variants across \
   2,504 individuals from 26 global populations. The file is at \
   /Users/tyo/Omics-OS/lobster-casestudy-genomics/test_data/real/1000g_chrMT.vcf.gz. \
   Load it as 'mt_1000g' and assess the quality — I want call rates, MAF distribution, \
   HWE statistics, and an overall assessment of whether this dataset is suitable for \
   population genetics analysis."

The genomics expert loaded the 1000 Genomes mitochondrial VCF and immediately recognized the haploid biology. Standard nuclear genome QC thresholds (HWE, heterozygosity) don't apply to mtDNA, and the agent correctly flagged these as biologically expected rather than quality failures. This domain awareness — distinguishing genuine quality issues from biological reality — is what separates an expert system from a rule-based pipeline.

| Metric | Value | Interpretation |
| --- | --- | --- |
| Samples | 2,504 | 26 global populations |
| Variants | 3,892 | Mitochondrial genome |
| Call rate | 100% | Perfect completeness |
| Heterozygosity | 0% | Correct for haploid mtDNA |
| Median MAF | 0.12% | Rare-variant enriched |
| HWE failures | 100% | Expected (haploid, not diploid) |

All variants "fail" HWE with p < 1e-10 — but this is biologically correct for haploid, maternally inherited mitochondrial genomes. The agent recognized this and flagged the dataset as HIGHLY SUITABLE for population genetics analysis despite 100% HWE failures.

This 0% heterozygosity reflects the VCF genotype encoding from the 1000 Genomes calling pipeline, not the absence of mitochondrial heteroplasmy. Heteroplasmy — the coexistence of multiple mtDNA variants within cells — is clinically important for variants like m.3243A>G (MELAS), where phenotypic severity correlates with heteroplasmy level. Specialized variant callers (e.g., mutserve, mitoCaller) are needed to detect and quantify heteroplasmic variants from sequencing data.
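Heteroplasmy level is essentially the variant allele fraction (VAF) at an mtDNA site. A minimal sketch of the classification logic; the 3%/97% cutoffs are illustrative assumptions, not mutserve's or mitoCaller's actual defaults:

```python
def heteroplasmy_fraction(alt_reads, total_reads):
    """Variant allele fraction at an mtDNA site: the heteroplasmy level
    a specialized caller would estimate from read counts."""
    return alt_reads / total_reads

def classify_site(alt_reads, total_reads, het_lo=0.03, het_hi=0.97):
    """Hypothetical thresholds: VAF in (3%, 97%) -> heteroplasmic;
    otherwise effectively homoplasmic for the ref or alt allele."""
    vaf = heteroplasmy_fraction(alt_reads, total_reads)
    if het_lo < vaf < het_hi:
        return "heteroplasmic"
    return "homoplasmic_alt" if vaf >= het_hi else "homoplasmic_ref"

print(classify_site(420, 1000))  # 42% alt reads -> heteroplasmic
print(classify_site(998, 1000))  # ~100% alt -> homoplasmic_alt
```

For m.3243A>G, it is this fraction, not the binary genotype in a population VCF, that correlates with MELAS phenotype severity.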

Turn 2: Filtering and VEP Functional Annotation

lobster query --session-id genomics_medium \
  "Filter the mt_1000g variants with a relaxed MAF threshold of 0.1% (since these are mitochondrial) \
   and skip HWE filtering (not applicable for haploid mtDNA). Then annotate the filtered variants \
   with Ensembl VEP to identify their functional consequences — I want to know which variants affect \
   protein-coding mitochondrial genes (like MT-ND1, MT-CO1, MT-CYB, MT-ATP6) and which are synonymous \
   vs nonsynonymous."

Filtering and VEP annotation revealed that the mitochondrial genome is under strong purifying selection. The 2.4:1 synonymous-to-nonsynonymous ratio indicates that amino acid-changing mutations in electron transport chain genes are actively selected against. The concentration of variants in Complex I subunits (MT-ND1, ND3, ND4, ND4L) reflects both gene size and tolerance for variation.

| Consequence | Count | Percentage |
| --- | --- | --- |
| Non-coding regulatory | 1,802 | 91.5% |
| Synonymous | 119 | 6.0% |
| Nonsynonymous (missense) | 48 | 2.4% |
| Start codon loss | 1 | 0.05% |

| Gene | Variants | Function |
| --- | --- | --- |
| MT-ND3 | 1,180* | NADH dehydrogenase subunit 3 (Complex I) |
| MT-ND1 | 517 | NADH dehydrogenase subunit 1 (Complex I) |
| MT-ND4 | 225 | NADH dehydrogenase subunit 4 (Complex I) |
| MT-ND4L | 48 | NADH dehydrogenase subunit 4L (Complex I) |

MT-ND3 is only 346 bp, so 1,180 "variants" includes upstream/downstream regulatory region variants attributed to the nearest gene by VEP, not variants within the coding sequence itself.
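The consequence tallies above are a straightforward aggregation over VEP annotation records. A sketch with illustrative records (these are not the session's actual annotations):

```python
from collections import Counter

# Sketch: tally VEP consequence terms and compute the synonymous-to-
# nonsynonymous ratio used above as a crude signal of purifying selection.
annotations = [
    {"gene": "MT-ND1", "consequence": "missense_variant"},
    {"gene": "MT-ND1", "consequence": "synonymous_variant"},
    {"gene": "MT-CO1", "consequence": "synonymous_variant"},
    {"gene": "MT-ND3", "consequence": "upstream_gene_variant"},
    {"gene": "MT-CYB", "consequence": "synonymous_variant"},
]

counts = Counter(rec["consequence"] for rec in annotations)
ratio = counts["synonymous_variant"] / counts["missense_variant"]
print(counts.most_common())
print(f"syn:nonsyn ratio = {ratio:.1f}:1")
```

Note that `upstream_gene_variant` records carry a gene attribution (the nearest gene) even though the variant is non-coding, which is exactly what inflates the MT-ND3 count.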

Turn 3: Clinical Variant Interpretation

lobster query --session-id genomics_medium \
  "Look up three clinically important mitochondrial variants: (1) rs199476112 — the m.3243A>G \
   MELAS mutation in MT-TL1, (2) rs28357980 — a known pathogenic variant in MT-ND2, and \
   (3) rs2853826 — a common variant in MT-ATP6. For each variant, I need the full VEP \
   consequence prediction, population allele frequencies, and clinical significance from ClinVar. \
   Compare the pathogenic variants against the benign one."

The clinical variant lookups demonstrate how Lobster AI integrates multiple evidence sources — VEP functional prediction, ClinVar clinical classification, and population frequency data — into a coherent pathogenicity assessment.

| Variant | Gene | Change | ClinVar | PolyPhen |
| --- | --- | --- | --- | --- |
| rs199476112 | MT-TL1 | m.3243A>G | PATHOGENIC | 0.996 (damaging) |
| rs28357980 | MT-ND2 | Asn150Asp | CONFLICTING | |
| rs2853826 | MT-ATP6 | m.8860A>G | BENIGN | 0-0.015 (benign) |

PolyPhen-2 was trained on nuclear-encoded protein sequences; its calibration for mitochondrial-encoded respiratory chain subunits has not been formally validated. Mitochondria-specific tools such as MitoTIP or HmtVar may provide more calibrated pathogenicity estimates for mt-encoded variants.

The contrast between the MELAS-associated rs199476112 (PolyPhen 0.996, pathogenic) and the benign rs2853826 (PolyPhen 0.015) illustrates the kind of evidence synthesis that clinical genomicists perform manually across multiple databases. Five discriminators between pathogenic and benign mtDNA variants emerged:

  1. PolyPhen score: 0.996 vs 0-0.015
  2. Amino acid chemistry: Charge loss vs conservative change
  3. ClinVar consensus: Unanimous vs multiple concordant
  4. Population frequency: Ultra-rare vs common
  5. Allelic complexity: Biallelic vs multiallelic

rs199476112 is associated with MELAS syndrome (mitochondrial encephalomyopathy, lactic acidosis, and stroke-like episodes), maternally inherited diabetes and deafness (MIDD), and progressive external ophthalmoplegia.
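Lookups like these go through the public Ensembl REST API, whose VEP-by-ID endpoint takes an rsID directly. A sketch that only builds the request URLs (no network call is made; the endpoint path follows the Ensembl REST documentation, and this is not necessarily how Lobster's annotation tool issues its queries):

```python
# Sketch: request URLs for VEP consequences of the three case-study rsIDs
# against the public Ensembl REST API.

VEP_BASE = "https://rest.ensembl.org/vep/human/id"

def vep_url(rsid):
    """URL for VEP consequences of a known variant identifier."""
    return f"{VEP_BASE}/{rsid}?content-type=application/json"

for rsid in ("rs199476112", "rs28357980", "rs2853826"):
    print(vep_url(rsid))
```

A production client would add batching (the POST variant of the endpoint), rate-limit handling, and retries, which is what the agent's VEP tool automates.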

What This Demonstrates

The mitochondrial analysis exercises domain-specific QC adaptation (haploid biology recognition), VEP batch annotation with automatic retry handling, and multi-database clinical interpretation. The agent correctly applied mtDNA-appropriate thresholds (0.1% MAF, HWE skipped) without requiring manual intervention.


Hard: Clinical Genomics Pipeline

The third scenario is a complete clinical variant prioritization pipeline: loading a chromosome 22 VCF, running a 6-step QC-filter-annotate-prioritize workflow, PCA for population structure, VEP functional annotation, ClinVar pathogenicity lookup, gnomAD frequency queries, and composite variant prioritization. This tests every major tool in the genomics package.

Turn 1: Load, QC, and Filter

lobster query --session-id genomics_hard \
  "I'm running a clinical genomics pipeline on 1000 Genomes chromosome 22 data. \
   Load the VCF at /Users/tyo/Omics-OS/lobster-casestudy-genomics/test_data/genomics/chr22_small.vcf.gz \
   as 'clinical_chr22'. Run the full QC assessment, then filter samples (call rate >= 95%, \
   heterozygosity within 3 SD) and variants (call rate >= 95%, MAF >= 0.1%, HWE p >= 1e-10). \
   Use the relaxed MAF of 0.1% since I want to retain rare potentially pathogenic variants for \
   clinical analysis."

The first turn establishes the clinical genomics pipeline with relaxed MAF thresholds (0.1% instead of the standard 1-5%) to retain rare potentially pathogenic variants. The two-stage filtering pipeline — samples first, then variants — follows GWAS best practices.

| Pipeline Step | Variants In | Variants Out | Retention |
| --- | --- | --- | --- |
| Load VCF | | 1,000 | 100% |
| QC Assessment | 1,000 | 242 flagged passing | 24.2% |
| Sample Filter | | 2,504 samples retained | 100% |
| Variant Filter | 1,000 | 242 | 24.2% |

| Filter Applied | Variants Removed |
| --- | --- |
| MAF < 0.1% | 718 (71.8%) |
| HWE p < 1e-10 | 40 (4.0%) |
| Call rate < 95% | 0 (0%) |
| Total removed | 758 (75.8%) |

All 2,504 samples pass quality control, and 242 variants survive filtering, representing the high-quality subset suitable for clinical interpretation.

Turn 2: PCA and Population Structure

lobster query --session-id genomics_hard \
  "I've installed sgkit. Re-load the VCF from \
   /Users/tyo/Omics-OS/lobster-casestudy-genomics/test_data/genomics/chr22_small.vcf.gz \
   as 'clinical_chr22', run the full pipeline in one pass: (1) assess quality, \
   (2) filter variants MAF >= 0.1% and HWE p >= 1e-10, (3) LD-prune at r-squared 0.2 \
   with window 500, (4) compute PCA with 10 components. Report variant counts at each \
   step and PCA variance explained."

The population structure analysis via PCA reveals the expected multi-ancestry composition of the 1000 Genomes dataset. The modest variance explained by PC1 (3.84%) and even distribution across components reflects multiple distinct ancestry groups rather than a single dominant axis of variation — consistent with the project's global sampling of 26 populations across 5 continental groups.

| Component | Variance Explained | Cumulative |
| --- | --- | --- |
| PC1 | 3.84% | 3.84% |
| PC2 | 3.47% | 7.31% |
| PC3 | 3.16% | 10.47% |
| PC4 | 2.89% | 13.36% |
| PC5 | 2.72% | 16.08% |
| PC6 | 2.64% | 18.72% |
| PC7 | 2.51% | 21.23% |
| PC8 | 2.34% | 23.57% |
| PC9 | 2.15% | 25.72% |

LD pruning encountered a technical limitation with sgkit's contig metadata handling for this VCF format, but PCA proceeded successfully and provides covariates suitable for GWAS association testing.
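The variance-explained figures come from an eigendecomposition of the centered genotype matrix. A generic SVD sketch on a toy matrix (this illustrates the quantity being reported, not sgkit's implementation, which also handles scaling and chunked computation):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy genotype matrix: 100 samples x 40 variants, alt-allele dosages 0/1/2.
G = rng.integers(0, 3, size=(100, 40)).astype(float)

# Center each variant, then take the SVD; the fraction of total variance
# carried by each PC is what the table above reports per component.
X = G - G.mean(axis=0)
_, s, _ = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)

print(np.round(explained[:3], 3))                 # per-PC variance fractions
print(round(float(np.cumsum(explained)[2]), 3))   # cumulative through PC3
```

On random noise the fractions are nearly uniform; real structured cohorts show leading PCs that separate ancestry groups, as in the 1000 Genomes result.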

Turn 3: VEP, ClinVar, gnomAD, and Prioritization

lobster query --session-id genomics_hard \
  "Now annotate the filtered variants with Ensembl VEP to get functional consequences and gene mappings. \
   Then hand off to the variant analysis expert for clinical interpretation: predict consequences, \
   query ClinVar for pathogenicity classifications, look up gnomAD population frequencies, and prioritize \
   variants by combining consequence severity, frequency rarity, and clinical evidence into a composite \
   priority score. I want to see the top 10 highest-priority variants with their genes, consequences, \
   and clinical annotations."

The clinical interpretation pipeline processed all 242 filtered variants through VEP annotation, ClinVar pathogenicity lookup, gnomAD frequency queries, and composite priority scoring.

| Pipeline Step | Variants | Coverage |
| --- | --- | --- |
| Filtered input | 242 | |
| VEP annotated | 182 | 75.2% |
| ClinVar matched | 0 | 0% |
| gnomAD matched | 0 | 0% |
| Prioritized | 242 | 100% |

The finding that zero variants reach moderate or high priority is scientifically correct — this region of chromosome 22 (positions 16050213-16139658) is gene-poor, containing predominantly intergenic and upstream regulatory variants with no established clinical significance.

| Priority Level | Count | Percentage |
| --- | --- | --- |
| High (>=0.6) | 0 | 0% |
| Moderate (0.3-0.59) | 0 | 0% |
| Low (<0.3) | 242 | 100% |

| Rank | Variant | Consequence | ClinVar | gnomAD AF | Priority |
| --- | --- | --- | --- | --- | --- |
| 1 (tie) | 22:16139658:C>A | upstream_gene_variant | Not found | Not found | 0.160 |
| 1 (tie) | 22:16055965:G>T | intergenic_variant | Not found | Not found | 0.160 |
| 1 (tie) | 22:16060354:A>T | intergenic_variant | Not found | Not found | 0.160 |
| 1 (tie) | 22:16059993:T>C | intergenic_variant | Not found | Not found | 0.160 |
| 1 (tie) | 22:16059973:C>A | intergenic_variant | Not found | Not found | 0.160 |

This "negative result" demonstrates the pipeline's integrity: it does not hallucinate pathogenicity where none exists. A raw LLM might invent clinically significant variants; the agent's programmatic database queries return only verified evidence. The pipeline correctly identifies benign regions rather than fabricating clinical significance.

What This Demonstrates

The hard pipeline exercises the full breadth of the genomics expert: VCF loading, QC assessment, two-stage filtering, PCA with sgkit, VEP batch annotation with automatic retries, ClinVar and gnomAD database queries, and composite priority scoring with a transparent formula (Consequence Severity 0-0.4 + Rarity Score 0-0.3 + Clinical Evidence 0-0.3). The session processed 2,504 samples through 6 pipeline steps in 3 conversational turns.

This priority score is a research prioritization heuristic, not an ACMG/AMP clinical classification. The consequence severity component relates to ACMG evidence categories PVS1/PM4, the rarity component to PM2/BS1, and clinical evidence to PP5/BP6. For clinical reporting, formal ACMG/AMP classification with expert curation is required.
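The additive structure of the composite score can be sketched as follows. The component caps (0.4 / 0.3 / 0.3) come from the formula above; the specific severity weights, rarity mapping, and clinical-evidence values below are illustrative assumptions, not Lobster's actual scoring tables, so this will not reproduce the session's 0.160 values:

```python
# Sketch of the composite priority formula: severity (<= 0.4) +
# rarity (<= 0.3) + clinical evidence (<= 0.3). Mappings are hypothetical.

SEVERITY = {
    "stop_gained": 0.40, "missense_variant": 0.25,
    "synonymous_variant": 0.10, "intergenic_variant": 0.05,
}
CLINICAL = {"pathogenic": 0.30, "conflicting": 0.15, "not_found": 0.0}

def priority(consequence, gnomad_af, clinvar):
    severity = SEVERITY.get(consequence, 0.05)
    if gnomad_af is None:
        rarity = 0.15   # unknown frequency: middling score, assumed policy
    elif gnomad_af < 0.001:
        rarity = 0.30   # ultra-rare scores the full rarity component
    else:
        rarity = 0.30 * (1 - min(gnomad_af, 1.0))
    return round(severity + rarity + CLINICAL.get(clinvar, 0.0), 3)

print(priority("missense_variant", 0.0002, "pathogenic"))  # 0.25+0.3+0.3 = 0.85
print(priority("intergenic_variant", None, "not_found"))   # 0.05+0.15+0.0 = 0.2
```

A transparent additive score like this is auditable per component, which is what makes the "all low priority" outcome in the gene-poor region interpretable rather than opaque.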

Known limitations: LD pruning failed due to sgkit contig metadata requirements. gnomAD returned 0 matches, likely due to GRCh37/GRCh38 coordinate mismatch. VEP annotated 75.2% of variants (60 returned HTTP 400, likely chromosome naming convention issues). These are infrastructure challenges, not agent failures — the pipeline continued with available data and reported honest results.


What This Demonstrates

Multi-Step Clinical Pipelines

No single tool could produce these analyses. The genomics_expert orchestrated 6-step workflows — load, QC, filter, PCA, annotate, prioritize — across separate database APIs (Ensembl VEP, ClinVar, gnomAD) and computational backends (cyvcf2, sgkit, numpy). Each step's output became the next step's input, with provenance tracked throughout.

Domain-Specific Adaptation

The mitochondrial scenario demonstrates the agent's ability to recognize biological context and adapt standard QC rules. It correctly flagged 0% heterozygosity and 100% HWE failures as expected for haploid mtDNA, not quality issues. This kind of domain awareness requires expert knowledge, not just rule-based filtering.

Database Integration

The agents queried Ensembl VEP, ClinVar, and gnomAD programmatically through validated API tools — not through LLM approximation. VEP batch annotation handled retries automatically. ClinVar lookups retrieved real pathogenicity classifications. gnomAD queries returned population frequencies where coordinates matched. No hallucination, only verified evidence.

Provenance and Reproducibility

Every tool call is logged with an AnalysisStep intermediate representation that captures the operation, parameters, data sources, and outputs. The sessions can be reproduced or extended with --session-id.
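A provenance record of this kind typically captures operation, parameters, inputs, and outputs. A minimal sketch of the shape; the field names and structure here are assumptions for illustration, not Lobster's actual AnalysisStep schema:

```python
from dataclasses import dataclass, field

# Illustrative provenance record; the real AnalysisStep IR may differ.
@dataclass
class AnalysisStep:
    operation: str
    parameters: dict
    inputs: list
    outputs: list = field(default_factory=list)

step = AnalysisStep(
    operation="filter_variants",
    parameters={"min_maf": 0.001, "min_hwe_p": 1e-10},
    inputs=["clinical_chr22"],
    outputs=["clinical_chr22_filtered"],
)
print(step.operation, step.parameters["min_maf"])
```

Chaining such records, with each step's outputs named as the next step's inputs, is what makes a session replayable from its log.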


Human vs Raw LLM vs Lobster AI

Estimates based on these case study sessions. Human researcher timing assumes manual workflows without Lobster.

| Task | Human Researcher | Raw LLM | Lobster AI |
| --- | --- | --- | --- |
| Load and QC-assess 3,892 mtDNA variants | 15-20 min (recognize haploid biology) | Misapplies diploid QC rules | ~1 min (auto-detects haploid) |
| Filter with mtDNA-appropriate thresholds | 10-15 min (custom HWE bypass) | Cannot execute filtering | ~30 sec |
| VEP batch annotation (242-1,970 variants) | 30-60 min (API rate limits, retry handling) | Hallucinates annotations | ~3-5 min (automatic batching + retries) |
| Clinical interpretation (3 variants) | 20-30 min (3 database lookups each) | Approximate from training data | ~1 min (programmatic API queries) |
| Full 6-step clinical pipeline | 2-4 hours | Not reliable | ~10 min (3 turns) |
| Synthesize pathogenic vs benign discriminators | 30-60 min (literature review) | Generic summary | Data-driven with 5 specific discriminators |
| Total across all 3 scenarios | 4-7 hours | Not reliable for genomics | ~20 min, $2.29 |

Session costs: Simple $0.40, Medium $1.07, Hard $0.82.


Limitations

  • gnomAD coordinate mismatch. The Hard case returned 0 gnomAD matches due to a GRCh37/GRCh38 build discrepancy. This is a known infrastructure limitation that would need to be resolved for clinical-grade variant annotation. Coordinate liftover (via CrossMap or UCSC liftOver) is required when source VCFs and annotation databases use different genome builds.
  • Gene-poor test region. The chromosome 22 region used in the Hard case (16050213-16139658) falls near the centromere in a gene-poor area, producing zero clinically significant findings. A gene-rich region (e.g., 22q11.2 containing TBX1, COMT) would demonstrate clinical variant prioritization more effectively.
  • LD pruning failure. Linkage disequilibrium pruning failed in the Hard case due to sgkit contig metadata requirements. PCA results may reflect LD-driven population structure artifacts.
  • PolyPhen-2 on mitochondrial proteins. PolyPhen-2 scores for mt-encoded proteins are extrapolated from a nuclear protein training set and have not been independently validated for mitochondrial respiratory chain subunits.
  • Custom prioritization score. The composite priority score is a research heuristic, not a validated clinical classification system. It should not be used as a substitute for ACMG/AMP expert variant curation.
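One of the simpler infrastructure fixes, harmonizing contig naming between the VCF and annotation services, is mechanical. A sketch of the normalization logic (a plausible remedy for the HTTP 400s noted in the Hard case, not the fix Lobster applies; requires Python 3.9+ for `str.removeprefix`):

```python
def normalize_contig(name):
    """Normalize 'chr22' / '22' / 'chrM' / 'MT' naming so queries to
    annotation services agree with the VCF's convention."""
    name = name.removeprefix("chr")
    return "MT" if name in ("M", "MT") else name

print(normalize_contig("chr22"))  # -> 22
print(normalize_contig("chrM"))   # -> MT
```

The GRCh37/GRCh38 mismatch, by contrast, requires a genuine coordinate liftover, not a renaming step.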

Reproducibility

To reproduce these analyses, install the genomics package and run the turns sequentially:

pip install 'lobster-ai[full]==1.0.12'

Simple (2 turns):

lobster query --session-id genomics_simple \
  "Load the VCF file as 'chr22_variants'. Then assess the quality — call rates, MAF distribution, HWE."
lobster query --session-id genomics_simple \
  "Filter the chr22 variants: remove samples with call rate below 95% or extreme heterozygosity. \
   Then filter variants with call rate below 99%, MAF below 1%, or HWE p-value below 1e-6."

Medium (3 turns):

lobster query --session-id genomics_medium \
  "Load 1000g_chrMT.vcf.gz as 'mt_1000g' and assess quality — call rates, MAF distribution, HWE, \
   suitability for population genetics."
lobster query --session-id genomics_medium \
  "Filter mt_1000g with MAF >= 0.1%, skip HWE. Then annotate with Ensembl VEP — identify variants \
   in protein-coding mitochondrial genes and categorize as synonymous vs nonsynonymous."
lobster query --session-id genomics_medium \
  "Look up rs199476112 (MT-TL1 MELAS), rs28357980 (MT-ND2), rs2853826 (MT-ATP6). For each: \
   VEP consequence, population frequencies, ClinVar clinical significance."

Hard (3 turns):

lobster query --session-id genomics_hard \
  "Load chr22_small.vcf.gz as 'clinical_chr22'. Run QC, then filter samples (call rate >= 95%, \
   heterozygosity within 3 SD) and variants (call rate >= 95%, MAF >= 0.1%, HWE p >= 1e-10)."
lobster query --session-id genomics_hard \
  "Re-load chr22_small.vcf.gz, run full pipeline: assess quality, filter variants MAF >= 0.1% \
   and HWE p >= 1e-10, LD-prune at r-squared 0.2 window 500, compute PCA with 10 components."
lobster query --session-id genomics_hard \
  "Annotate the filtered variants with VEP. Then query ClinVar for pathogenicity, gnomAD for \
   frequencies, and prioritize by combining consequence severity, rarity, and clinical evidence. \
   Show top 10 highest-priority variants."

Session continuity via --session-id ensures each turn builds on prior context. Results are stored in the .lobster_workspace/ directory and can be exported with /pipeline export.

