Transcriptomics: From Single-Cell QC to Pseudobulk Differential Expression

Single-cell and bulk RNA-seq analysis across three complexity levels — QC, cell type annotation, and pseudobulk DE with pathway enrichment.

Transcriptomics is the foundation of modern molecular biology — measuring which genes are active in cells and tissues reveals disease mechanisms, drug targets, and cellular heterogeneity. Single-cell RNA-seq enables gene expression profiling at cellular resolution, but the analysis requires orchestrating dozens of computational steps: data acquisition, quality control, normalization, dimensionality reduction, clustering, cell type annotation, differential expression, and pathway enrichment. This case study demonstrates Lobster AI's transcriptomics capabilities across three difficulty levels using real clinical datasets: kidney allograft quality control, COVID-19 immune cell atlas construction, and idiopathic pulmonary fibrosis (IPF) differential expression analysis.

Session context: Results generated February 2026 using lobster-ai 1.0.12 on AWS Bedrock (Claude Sonnet 4.5). External databases queried: GEO (NCBI). Local tools: Scanpy, Scrublet, pyDESeq2, gseapy. Total cost: $2.72 across 3 case studies (8 turns: 2 Simple, 3 Medium, 3 Hard). Database content and GEO dataset availability change over time, so re-running these queries may return different datasets or updated metadata. This case study demonstrates analytical workflows, not independently validated scientific findings.

Agents and Data Sources

This analysis uses the lobster-transcriptomics package, which provides three agents:

| Agent | Role |
| --- | --- |
| transcriptomics_expert | Workflow orchestration, preprocessing, clustering, and delegation |
| annotation_expert | Cell type annotation and marker gene discovery (child agent) |
| de_analysis_expert | Differential expression and pathway enrichment (child agent) |

External APIs queried during the sessions: GEO (Gene Expression Omnibus) for dataset discovery and metadata validation. Local computation is handled by Scanpy (quality control, normalization, PCA, UMAP, clustering), Scrublet (doublet detection), pyDESeq2 (differential expression), and gseapy (pathway enrichment).


Simple: Kidney Allograft scRNA-seq QC

This scenario demonstrates foundational single-cell preprocessing: automated data acquisition from GEO, quality control metric calculation, cell and gene filtering, and normalization — the first steps of any scRNA-seq analysis.

Turn 1: Dataset Discovery

lobster query --session-id transcriptomics_simple \
  "Search GEO for GSE109564 (a small 3k PBMC single-cell RNA-seq dataset). \
   Download it and load it into the workspace. I want to do quality control \
   on this data."

The research agent searched GEO, retrieved metadata for GSE109564, and proactively flagged that the dataset contains kidney allograft biopsy cells rather than PBMCs as initially described. This validation step prevents downstream analysis errors from incorrect context.

Turn 2: Quality Control and Preprocessing

lobster query --session-id transcriptomics_simple \
  "Yes, proceed with GSE109564. Download it, then run comprehensive quality \
   control: assess data quality with QC metrics, filter low-quality cells \
   and genes, and normalize the data. Give me the key QC statistics."

Results

The data expert downloaded GSE109564 from GEO, then the transcriptomics expert executed the full QC pipeline.

| Metric | Value |
| --- | --- |
| Original cells | 20,477 |
| After QC | 9,225 |
| Cells removed | 54.9% |
| Mean genes/cell | 181 |
| Mean UMI/cell | 324 |
| Total UMI | 6.6M |
| Normalization | CP10K log1p |

QC thresholds applied:

  • Minimum genes per cell: 100 (removed empty droplets and debris)
  • Maximum genes per cell: 5,000 (removed potential doublets)
  • Maximum mitochondrial percentage: 25% (removed stressed or dying cells)
  • Minimum cells per gene: 3 (removed rare genes)
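The threshold logic above can be sketched in plain Python. This is a minimal illustration of the filtering rules only: the `CellQC` container, the barcodes, and the toy values are hypothetical, and the session itself ran the filtering through Scanpy rather than this code.

```python
from dataclasses import dataclass

# Thresholds as reported in this case study.
MIN_GENES = 100         # drop empty droplets and debris
MAX_GENES = 5_000       # drop potential doublets
MAX_PCT_MITO = 25.0     # drop stressed or dying cells
MIN_CELLS_PER_GENE = 3  # drop rare genes


@dataclass
class CellQC:
    """Per-cell QC summary (hypothetical container for illustration)."""
    barcode: str
    n_genes: int      # genes with at least one count in this cell
    pct_mito: float   # percent of counts from mitochondrial genes


def passes_qc(cell: CellQC) -> bool:
    """Return True when the cell satisfies all three per-cell thresholds."""
    return (
        MIN_GENES <= cell.n_genes <= MAX_GENES
        and cell.pct_mito <= MAX_PCT_MITO
    )


def keep_gene(n_cells_expressing: int) -> bool:
    """Return True when a gene is detected in enough cells."""
    return n_cells_expressing >= MIN_CELLS_PER_GENE


# Toy cells (invented values), one per failure mode plus one that passes.
cells = [
    CellQC("AAAC", n_genes=181, pct_mito=4.2),    # passes
    CellQC("AAAG", n_genes=62, pct_mito=1.0),     # too few genes
    CellQC("AACT", n_genes=6200, pct_mito=3.5),   # too many genes
    CellQC("AAGT", n_genes=950, pct_mito=41.0),   # high mitochondrial share
]
kept = [c.barcode for c in cells if passes_qc(c)]
print(kept)
```

Treating the bounds as inclusive is an assumption of this sketch; the source states the thresholds but not whether boundary values are kept.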

Over half the cells (54.9%) were removed due to low gene counts, which is expected for kidney biopsy tissue where cell viability varies. The normalized dataset is ready for downstream analysis such as clustering and cell type annotation.
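The "CP10K log1p" normalization reported in the table can be read as scaling each cell to 10,000 total counts and then applying log(1 + x). The sketch below works through that reading for one toy cell; the function name and the counts are hypothetical, and it assumes the natural logarithm.

```python
import math


def cp10k_log1p(counts, target_sum=10_000):
    """Scale one cell's counts to `target_sum`, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all zeros instead of dividing by zero.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * target_sum) for c in counts]


# Toy cell with 324 total UMIs (the mean UMI/cell reported above).
cell = [0, 3, 1, 320]
normalized = cp10k_log1p(cell)

# Undoing the log recovers values that sum to the target of 10,000.
recovered_total = sum(math.expm1(v) for v in normalized)
print(round(recovered_total))
```

Scaling before the log transform makes cells with different sequencing depths comparable, which matters here because the mean depth is only 324 UMIs per cell.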

A mean of 181 genes per cell after filtering is low relative to typical scRNA-seq datasets (500-3,000 genes per cell for solid tissue), reflecting challenging dissociation conditions in kidney allograft biopsies. The agent adaptively lowered the minimum gene threshold from the platform default of 200 to 100 based on the data characteristics.

Lobster automatically validated the dataset context before committing to the download, flagging the mismatch between the user's PBMC expectation and the actual kidney allograft data. This proactive validation prevents hours of wasted analysis on the wrong tissue type.

Session cost: $0.39


Medium: COVID-19 PBMC Immune Cell Atlas

This scenario demonstrates a complete single-cell pipeline from data discovery through cell type annotation: GEO search, automated download of a large dataset (85K cells), quality control, full preprocessing, clustering, and automated immune cell type identification across three conversational turns.

Turn 1: Dataset Discovery and Metadata Validation

lobster query --session-id transcriptomics_medium \
  "Search GEO for GSE149689 — a human PBMC single-cell RNA-seq dataset. \
   Get its metadata and queue it for download."

Results

Lobster's research agent searched GEO for GSE149689 and validated the metadata completeness.

Dataset characteristics:

  • Title: Immunophenotyping of COVID-19 and Influenza (Type I IFN Response Study)
  • Technology: 10x Genomics single-cell RNA-seq
  • Samples: 20 human PBMC samples
    • 11 COVID-19 patients (various severity)
    • 5 Influenza patients (severe)
    • 4 Healthy controls
  • Key finding: Severe COVID-19 patients showed co-existence of type I IFN response with TNF/IL-1B inflammation in classical monocytes

The dataset compares immune responses across COVID-19, influenza, and healthy controls, providing clinically relevant context for immune cell atlas construction.

Turn 2: Full Preprocessing Pipeline

lobster query --session-id transcriptomics_medium \
  "Download GSE149689, then run the full single-cell preprocessing pipeline: \
   quality control, filter low-quality cells, normalize, select highly \
   variable genes, run PCA, compute neighbors, embed with UMAP, and cluster \
   the cells. Give me cluster statistics."

Results

In a single turn, Lobster's data expert downloaded the 85K-cell dataset from GEO and the transcriptomics expert ran the complete preprocessing pipeline.

| Pipeline Step | Input | Output | Notes |
| --- | --- | --- | --- |
| Load | - | 85,144 x 33,538 | From GEO |
| QC + Filter | 85,144 cells | 61,864 cells | 72.7% retention |
| HVG Selection | 23,311 genes | 2,000 genes | Informative features |
| PCA | 61,864 x 2,000 | 30 PCs | 24% variance |
| UMAP | 30 PCs | 2D embedding | Visualization |
| Clustering | 61,864 cells | 30 clusters | Leiden, res=1.0 |

The pipeline removed 27% of cells as low quality, selected 2,000 highly variable genes for dimensionality reduction, and identified 30 distinct clusters. The interferon-stimulated gene ISG15 was the top variable gene, which is consistent with an active antiviral immune response in the COVID-19 context.

Cluster size distribution:

  • Major populations (Clusters 0-1): Approximately 22% of cells (likely T cells, monocytes)
  • Distinct subtypes (Clusters 2-7): 6-7% each
  • Rare populations (Clusters 20-29): Less than 2% each (dendritic cells, transitional states)
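A cluster size distribution like the one above is a count of cluster labels divided by the number of cells. The following sketch shows that calculation on invented labels; the 2% cutoff for "rare" mirrors the wording above, and the function names are hypothetical.

```python
from collections import Counter


def cluster_fractions(labels):
    """Map each cluster label to its share of all cells, largest first."""
    counts = Counter(labels)
    n_cells = len(labels)
    return {cluster: n / n_cells for cluster, n in counts.most_common()}


def rare_clusters(fractions, threshold=0.02):
    """Return the labels of clusters holding less than `threshold` of cells."""
    return sorted(c for c, frac in fractions.items() if frac < threshold)


# Toy labels for 100 cells (invented, not the session output).
labels = ["0"] * 50 + ["1"] * 30 + ["2"] * 19 + ["3"] * 1
fractions = cluster_fractions(labels)
print(fractions)
print(rare_clusters(fractions))
```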

Turn 3: Cell Type Annotation

lobster query --session-id transcriptomics_medium \
  "Find marker genes for each cluster and then annotate cell types \
   automatically. These are PBMCs so I expect CD4+ T cells, CD8+ T cells, \
   B cells, NK cells, monocytes, and dendritic cells. Show me the cell type \
   proportions."

Results

The annotation expert identified all expected PBMC cell types with high confidence (mean 0.925).

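| Cell Type | Count | Percentage |
| --- | --- | --- |
| CD14+ Monocytes | 21,077 | 34.1% |
| NK cells | 11,033 | 17.8% |
| CD8+ T cells | 8,562 | 13.8% |
| CD4+ T cells | 6,910 | 11.2% |
| B cells | 6,789 | 11.0% |
| Platelets | 4,535 | 7.3% |
| FCGR3A+ Monocytes | 2,436 | 3.9% |
| Dendritic cells | 522 | 0.8% |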

The monocyte-heavy composition (38% total: 34.1% CD14+ classical + 3.9% FCGR3A+ non-classical) is consistent with fresh PBMC preparations from patients with active infections. The elevated CD8:CD4 ratio (1.24) may reflect the antiviral immune response in COVID-19 and influenza patients. Platelet contamination (7.3%) is a common PBMC preparation artifact that Lobster flags for optional removal.
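The percentages and the CD8:CD4 ratio quoted above follow directly from the annotated cell counts. This short check recomputes them from the table; only the variable names are introduced here.

```python
# Annotated cell counts from the table above.
counts = {
    "CD14+ Monocytes": 21_077,
    "NK cells": 11_033,
    "CD8+ T cells": 8_562,
    "CD4+ T cells": 6_910,
    "B cells": 6_789,
    "Platelets": 4_535,
    "FCGR3A+ Monocytes": 2_436,
    "Dendritic cells": 522,
}

total = sum(counts.values())  # 61,864 cells retained after QC
percent = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Combined monocyte share (classical + non-classical).
monocytes = counts["CD14+ Monocytes"] + counts["FCGR3A+ Monocytes"]
monocyte_pct = round(100 * monocytes / total, 1)

# Ratio of CD8+ to CD4+ T cells.
cd8_cd4 = round(counts["CD8+ T cells"] / counts["CD4+ T cells"], 2)

print(total, monocyte_pct, cd8_cd4)  # 61864 38.0 1.24
```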

Figure: UMAP visualization of the COVID-19 PBMC dataset, colored by cell type annotation, showing 8 immune cell populations.

Lobster completed the entire pipeline — from 85,000 raw cells to annotated immune cell atlas — in three conversational turns and under 10 minutes. The agent automatically selected appropriate QC thresholds, identified 30 cell clusters, and annotated them with 92.5% average confidence using canonical immune cell markers.

Session cost: Approximately $0.71


Hard: IPF Lung Multi-Batch Differential Expression

This scenario demonstrates Lobster AI's most advanced single-cell transcriptomics capabilities: loading a multi-batch clinical dataset (78 patients, 107 libraries), quality control, batch-aware clustering, automated lung cell type annotation, pseudobulk differential expression between disease conditions (IPF vs Control), and GO pathway enrichment.

Turn 1: Data Loading, QC, and Filtering

lobster query --session-id transcriptomics_hard \
  "I have a pre-loaded IPF lung scRNA-seq dataset from GSE136831 at \
   .lobster_workspace/downloads/GSE136831_20k_subsample.h5ad — 20,000 cells \
   from 78 patients with IPF, Control, and COPD conditions. Load this file, \
   assess quality, filter low-quality cells, normalize, and detect doublets. \
   The batch key is 'Library_Identity' and the disease key is \
   'Disease_Identity'."

Results

The transcriptomics expert loaded a 20,000-cell subsample of GSE136831 (the largest published single-cell atlas of idiopathic pulmonary fibrosis), balanced across IPF, Control, and COPD patients.

| Metric | Value |
| --- | --- |
| Original cells | 19,998 |
| After QC | 19,851 |
| Cell retention | 99.3% |
| Original genes | 45,947 |
| After filtering | 34,101 |
| Mean genes/cell | 2,160 |
| Mean UMI/cell | 6,277 |
| Doublets | 0 |

Dataset structure:

  • Batch structure: 107 unique sequencing libraries (Library_Identity)
  • Disease groups: IPF (6,666 cells), Control (6,666 cells), COPD (6,666 cells)

Quality was high, with 99.3% cell retention after QC, reflecting the published dataset's pre-processing. After removing lowly-expressed genes (25.8%), the 19,851-cell x 34,101-gene matrix was normalized and ready for batch integration and clustering.

Turn 2: Batch Integration, Clustering, and Annotation

lobster query --session-id transcriptomics_hard \
  "Now select highly variable genes, run PCA, integrate batches using \
   Harmony with batch_key='Library_Identity', compute neighbors, UMAP \
   embedding, and cluster the cells at resolution 0.8. Then annotate cell \
   types — these are lung tissue cells, so expect macrophages, monocytes, \
   T cells, B cells, NK cells, fibroblasts, myofibroblasts, alveolar type \
   1 and type 2 epithelial cells, club cells, ciliated epithelial cells, \
   endothelial cells, and smooth muscle cells."

Results

The transcriptomics expert selected 2,500 highly variable genes, computed 30 principal components (25.74% variance), and identified 15 cell clusters. Harmony batch integration was attempted but failed due to a dependency resolution issue, so clustering proceeded on uncorrected data — results may contain batch effects.

Dimensionality reduction and clustering:

  • HVG selection: 2,500 highly variable genes
  • PCA: 30 principal components (25.74% variance explained)
  • UMAP: 2D embedding with 15 neighbors
  • Clustering: 15 Leiden clusters at resolution 0.8

Annotated cell types:

The clustering results and cell type assignments can be visualized as UMAP plots using standard Scanpy plotting functions on the session output files.

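| Cell Type | Clusters | Count | % of Total |
| --- | --- | --- | --- |
| T cells | 0 | 2,934 | 14.8% |
| Interstitial Macrophages | 1 | 2,757 | 13.9% |
| Alveolar Macrophages | 2, 3 | 4,511 | 22.7% |
| Monocytes | 4 | 1,466 | 7.4% |
| Unannotated | 5-14 | 8,183 | 41.2% |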

The annotation expert identified the four dominant immune populations (T cells, interstitial macrophages, alveolar macrophages, monocytes) with high confidence. Forty-one percent of cells (8,183/19,851) remain unannotated — these likely contain the epithelial, stromal, and endothelial populations characteristic of lung tissue. The 41.2% unannotated rate reflects the limitation of the automated marker panel, which uses canonical immune markers and does not include lung-specific cell type markers (e.g., SFTPC for alveolar type 2 cells, AGER for alveolar type 1 cells, ACTA2 for myofibroblasts). Domain-specific marker panels would be needed for full annotation.
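The effect of the marker panel on the unannotated rate can be illustrated with a small scoring sketch. Everything here is hypothetical: the panels are illustrative (the lung markers are the ones named above, the immune markers are common canonical choices), the expression values are invented, and this is not the annotation expert's actual scoring method.

```python
# Illustrative panels; not the panel used in the session.
IMMUNE_PANEL = {
    "T cells": ["CD3D", "CD3E"],
    "Monocytes": ["CD14", "LYZ"],
}
LUNG_PANEL = {
    "Alveolar type 2": ["SFTPC"],
    "Alveolar type 1": ["AGER"],
    "Myofibroblasts": ["ACTA2"],
}


def annotate(cluster_means, panel, min_score=1.0):
    """Pick the cell type whose markers have the highest mean expression.

    Returns "Unannotated" when no cell type scores above `min_score`,
    so weak marker evidence never forces a label.
    """
    best_type, best_score = "Unannotated", min_score
    for cell_type, markers in panel.items():
        score = sum(cluster_means.get(g, 0.0) for g in markers) / len(markers)
        if score > best_score:
            best_type, best_score = cell_type, score
    return best_type


# Invented mean expression for one epithelial-like cluster.
cluster_5 = {"SFTPC": 4.1, "CD3E": 0.05, "CD14": 0.1}

immune_only = annotate(cluster_5, IMMUNE_PANEL)
with_lung = annotate(cluster_5, {**IMMUNE_PANEL, **LUNG_PANEL})
print(immune_only, "->", with_lung)
```

With the immune-only panel the cluster stays unannotated; adding lung-specific markers lets the same cluster be labeled, which is the gap the paragraph above describes.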

The dominant macrophage presence (36.6% total: 22.7% alveolar + 13.9% interstitial) reflects the fibrotic lung microenvironment — macrophages drive pro-fibrotic inflammation in IPF through secretion of TGF-beta and other cytokines. This composition is consistent with published IPF single-cell atlases.

Turn 3: Pseudobulk Differential Expression and Pathway Enrichment

lobster query --session-id transcriptomics_hard \
  "Run pseudobulk differential expression analysis between IPF and Control \
   conditions. Use the 'Disease_Identity' column for grouping and \
   'Subject_Identity' for patient-level aggregation. Use DESeq2 method. \
   Then run pathway enrichment (GO Biological Process) on the significant \
   DE genes (adjusted p < 0.05, |log2FC| > 1). Show me the top DE genes \
   and enriched pathways."

Results

The DE analysis expert ran pseudobulk differential expression between 32 IPF and 28 Control patients using DESeq2, identifying 593 significant genes (adjusted p less than 0.05, absolute log2 fold change greater than 1).

Methodological caveat: Batch integration (Harmony) failed due to a dependency issue, so clustering and DE proceeded on uncorrected data from 107 libraries. Batch effects may confound disease-condition comparisons. The DE results below should be treated as preliminary until batch correction is applied.

Because pseudobulk aggregation was performed across all cell types — including the 41.2% unannotated population — the DE results capture both cell-type composition differences between IPF and Control and within-cell-type transcriptional changes. For cell-type-specific DE, all populations should be annotated first and pseudobulk run per cell type.
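Pseudobulk aggregation as described here sums raw counts over all of a subject's cells, giving one profile per subject. The sketch below shows that step with plain dictionaries; the subject IDs, counts, and function name are hypothetical, while the `Subject_Identity` and `Disease_Identity` keys come from the queries above.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Sum raw counts per subject and gene across all of a subject's cells.

    Returns (bulk, condition): per-subject gene totals and each
    subject's disease label.
    """
    totals = defaultdict(lambda: defaultdict(int))
    condition = {}
    for cell in cells:
        subject = cell["Subject_Identity"]
        condition[subject] = cell["Disease_Identity"]
        for gene, count in cell["counts"].items():
            totals[subject][gene] += count
    bulk = {subject: dict(genes) for subject, genes in totals.items()}
    return bulk, condition


# Toy cells (invented): two from one IPF subject, one from a control.
cells = [
    {"Subject_Identity": "S1", "Disease_Identity": "IPF",
     "counts": {"SPP1": 9, "FN1": 4}},
    {"Subject_Identity": "S1", "Disease_Identity": "IPF",
     "counts": {"SPP1": 6}},
    {"Subject_Identity": "S2", "Disease_Identity": "Control",
     "counts": {"SPP1": 1, "FN1": 2}},
]
bulk, condition = pseudobulk(cells)
print(bulk)
```

Grouping by a (subject, cell type) pair instead of subject alone would give the cell-type-resolved variant recommended above.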

Differential expression summary:

  • Cohort: 32 IPF subjects vs 28 Control subjects
  • Method: Pseudobulk aggregation by Subject_Identity with DESeq2 and FDR correction
  • Genes tested: 13,865
  • Significant DE genes: 593
    • 571 upregulated in IPF (96.3%)
    • 22 downregulated in IPF (3.7%)

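The strong upregulation bias (96.3%) is consistent with the active fibroproliferative program in IPF lungs, in which extracellular matrix deposition, epithelial dysfunction, and cellular stress dominate the disease pathology. Because batch effects were not corrected and all cell types were aggregated, part of this skew may also reflect technical or compositional differences.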
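The significance call combines an FDR-adjusted p-value with a fold-change cutoff. The sketch below implements the Benjamini-Hochberg adjustment and the two thresholds from the query (adjusted p < 0.05, |log2FC| > 1). That the session's FDR correction was Benjamini-Hochberg is an assumption, and the p-values are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_significant(padj, log2fc, alpha=0.05, min_abs_lfc=1.0):
    """Apply the case study's cutoffs: padj < 0.05 and |log2FC| > 1."""
    return padj < alpha and abs(log2fc) > min_abs_lfc


# Toy raw p-values (invented).
pvals = [0.001, 0.02, 0.03, 0.8]
padj = benjamini_hochberg(pvals)
print([round(p, 3) for p in padj])  # [0.004, 0.04, 0.04, 0.8]
```

A gene with a small adjusted p-value but a modest fold change, for example `is_significant(0.04, 0.5)`, is rejected by the second condition.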

Top 10 Differentially Expressed Genes:

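| Rank | Gene | Direction | log2FC | Adj. P-value |
| --- | --- | --- | --- | --- |
| 1 | TMPRSS4 | UP | +2.18 | 3.9e-07 |
| 2 | KRT17 | UP | +2.49 | 5.9e-07 |
| 3 | PLCB4 | UP | +1.50 | 8.4e-07 |
| 4 | LINC02345 | UP | +1.45 | 1.1e-06 |
| 5 | ALDH1A3 | UP | +2.17 | 1.3e-06 |
| 6 | TPPP3 | UP | +2.25 | 3.2e-06 |
| 7 | SPP1 (Osteopontin) | UP | +3.18 | 3.5e-06 |
| 8 | LXN (Latexin) | UP | +1.92 | 3.5e-06 |
| 9 | ENSG00000231971 | UP | +1.69 | 4.2e-06 |
| 10 | FN1 (Fibronectin) | UP | +1.26 | 6.0e-06 |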

Gene category summary:

| Category | Key Genes | Biological Significance |
| --- | --- | --- |
| ECM and Fibrosis | SPP1 (+3.18), FN1 (+1.26), collagens | Master fibrosis program |
| Epithelial Dysfunction | KRT17 (+2.49), TMPRSS4 (+2.18) | Aberrant differentiation |
| Cellular Stress | ALDH1A3 (+2.17), TPPP3 (+2.25) | Oxidative stress response |

Top Enriched Pathway Categories (GO Biological Process):

The agent ran GO pathway enrichment on the 593 significant genes, mapping 570 genes (96%) to GO terms and identifying 825 significantly enriched pathways.

| Category | Key GO Terms | Significance |
| --- | --- | --- |
| ECM Organization | Collagen-containing ECM (GO:0062023), ECM organization (GO:0030198), Basement membrane (GO:0005604) | Highest |
| Cell Adhesion and Migration | Focal adhesion (GO:0005925), Cell-substrate junction (GO:0030055), Cell migration regulation (GO:0030334) | High |
| Wound Healing | Response to wounding (GO:0009611), Tissue remodeling (GO:0048771) | High |
| Cellular Stress | ER lumen (GO:0005788), Oxidative stress response (GO:0006979) | Moderate |
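Over-representation enrichment of this kind is commonly scored with a one-sided hypergeometric test: given a background of M genes, of which n belong to a GO term, how likely is it to see at least k term members among N significant genes? The sketch below implements that generic test with toy numbers. Whether the session's gseapy run used exactly this statistic, and which background set it used, are assumptions.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) for X ~ Hypergeometric(M, n, N).

    M: background genes, n: background genes in the term,
    N: significant genes drawn, k: significant genes in the term.
    """
    upper = min(n, N)
    total = comb(M, N)
    # Sum the exact probabilities of observing k, k+1, ..., upper hits.
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / total


# Toy numbers: 20 background genes, 5 in the term, 4 significant, 2 overlap.
p = hypergeom_sf(k=2, M=20, n=5, N=4)
print(round(p, 4))  # 0.2487
```

In this case study the draw would be the 570 mapped DE genes; using the 13,865 tested genes as the background is one plausible choice, not a documented one. The resulting per-term p-values would then be FDR-adjusted across all tested terms.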

SPP1 (osteopontin, log2FC +3.18) is a well-validated master mediator of pulmonary fibrosis — it promotes macrophage recruitment, fibroblast activation, and ECM deposition. KRT17 upregulation indicates aberrant epithelial differentiation, a hallmark of IPF where alveolar epithelial cells adopt a dysfunctional basal-like phenotype. GO pathway enrichment confirmed ECM organization, wound healing, and cellular stress as the dominant biological themes, consistent with the published IPF literature.

The differential expression and enrichment results (593 differentially expressed genes with GO pathway enrichment, from a 20,000-cell subsample of a 78-patient cohort) were generated in a single conversational turn, building on the two preceding turns. The identified gene programs (ECM remodeling, epithelial dysfunction, cellular stress) match the known pathobiology of IPF and point to candidate mechanisms, subject to the batch-effect and cell-composition caveats above.

Session cost: $1.62


What This Demonstrates

Multi-Agent Coordination

No single agent could produce these analyses. The research_agent handled GEO search and metadata validation. The data_expert_agent executed download queue operations. The transcriptomics_expert orchestrated preprocessing pipelines (QC, normalization, clustering) and delegated specialized tasks. The annotation_expert identified cell types using canonical markers. The de_analysis_expert performed pseudobulk differential expression and pathway enrichment. The supervisor routed each sub-question to the appropriate specialist and synthesized results across all turns.

Scalability Across Complexity Levels

The same agent system handles three levels of complexity:

  • Simple: 20K cells, 2 turns, basic QC and filtering ($0.39)
  • Medium: 85K cells, 3 turns, full preprocessing through annotation (approximately $0.71)
  • Hard: 20K cells across 107 batches and 78 patients, 3 turns, attempted batch integration (Harmony failed) through differential expression and pathway enrichment ($1.62)

Users with no bioinformatics training can run the simple workflow. Computational biologists can execute the hard workflow without writing a single line of code.

Database Integration

The agents queried GEO programmatically through validated API tools — not through LLM approximation. Scanpy, Scrublet, pyDESeq2, and gseapy ran locally for quality control, doublet detection, differential expression, and pathway enrichment. Outputs matched expected formats for downstream analysis.

Provenance and Reproducibility

Every tool call is logged with an AnalysisStep intermediate representation that captures the operation, parameters, data sources, and outputs. Each session can be reproduced or extended with --session-id.
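A provenance record with the four fields named above could look like the following. This is a hypothetical sketch, not the actual Lobster AI `AnalysisStep` class: the field names, types, and the example values are assumptions based only on the description in this section.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AnalysisStep:
    """Hypothetical provenance record; not the actual Lobster AI class."""
    operation: str
    parameters: dict = field(default_factory=dict)
    data_sources: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


# Example step mirroring the Simple scenario's cell filtering.
step = AnalysisStep(
    operation="filter_cells",
    parameters={"min_genes": 100, "max_genes": 5000, "max_pct_mito": 25},
    data_sources=["GSE109564"],
    outputs=["filtered_cells"],  # hypothetical output name
)

# Serialize for logging, then restore to confirm nothing is lost.
record = json.dumps(asdict(step), sort_keys=True)
restored = AnalysisStep(**json.loads(record))
print(record)
```

Keeping each step as plain, serializable data is what makes a logged session replayable: the parameters recorded here are enough to rerun the same filter.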

Handling Edge Cases

The agents proactively validated dataset context (Turn 1 of Simple: kidney allograft vs PBMC mismatch), adapted to technical failures (Turn 2 of Hard: Harmony batch integration failed, pipeline continued with clustering on uncorrected data), and recovered from errors (Turn 3 of Hard: pseudobulk creation failed multiple times, agent used fallback DE approach and succeeded). The 41.2% unannotated cells in the Hard scenario were honestly reported — Lobster does not hallucinate cell types when marker evidence is insufficient.


Human vs Raw LLM vs Lobster AI

Estimates based on these case study sessions. Human researcher timing assumes manual workflows with standard bioinformatics tools (Python, R, command-line GEO downloads).

| Task | Human Researcher | Raw LLM | Lobster AI |
| --- | --- | --- | --- |
| Search GEO for dataset | 5-10 min | Cannot query GEO API | Approximately 30 sec |
| Download and load scRNA-seq data | 10-15 min | Cannot download files | 1-4 min (size dependent) |
| QC metric calculation | 15-20 min (scripting) | Describes approach only | Approximately 10 sec |
| Filter low-quality cells | 10-15 min (threshold selection) | Suggests thresholds, cannot run | Approximately 5 sec |
| Normalize and save | 5-10 min | Describes method only | Approximately 5 sec |
| HVG selection, PCA, UMAP, clustering | 30-45 min (parameter tuning) | Suggests parameters only | 2-3 min |
| Marker gene discovery | 15-20 min (scripting) | Cannot compute | 1-2 min |
| Cell type annotation | 1-4 hours (manual marker curation) | Generic suggestions | 1-2 min |
| Batch integration (multi-sample) | 30-60 min | Cannot compute | Approximately 3 min (when working) |
| Pseudobulk DE (DESeq2) | 1-2 hours (scripting, debugging) | Cannot compute | Approximately 5 min |
| GO pathway enrichment | 30-60 min | Cannot compute | Approximately 2 min |
| Total: Simple (QC only) | 45-70 min | Not feasible | Approximately 2 min, $0.39 |
| Total: Medium (Full atlas) | 2-4 hours | Not feasible | Approximately 10 min, approximately $0.71 |
| Total: Hard (Multi-batch DE) | 1-2 days | Not feasible | Approximately 16 min, $1.62 |

Limitations

  • Batch integration not performed. Harmony failed due to a dependency issue in the Hard case. With 107 libraries from 78 patients, batch effects could confound disease-condition comparisons. In a production analysis, resolving the Harmony dependency or using an alternative method (scVI, scanorama) would be recommended before differential expression.
  • 41.2% unannotated cells in IPF lung. The automated annotation uses canonical immune markers and lacks lung-specific markers (SFTPC for AT2 cells, AGER for AT1, ACTA2 for myofibroblasts). Full tissue annotation requires a custom marker panel.
  • Pseudobulk DE scope. The DE analysis aggregated all cell types together, meaning results reflect cell composition changes in addition to transcriptional changes. Cell-type-resolved pseudobulk DE would provide more biologically specific results.
  • No visual outputs. The case study presents tables and statistics but does not include UMAP visualizations, volcano plots, or pathway enrichment bar charts. These would be generated in a full analysis session.
  • GO enrichment ontology. Some enriched terms may include Cellular Component terms in addition to Biological Process terms depending on the enrichment tool configuration.

Reproducibility

To reproduce these analyses, install the transcriptomics package and run the queries sequentially with session IDs:

pip install 'lobster-ai[full]==1.0.12'

Simple: Kidney Allograft QC

lobster query --session-id transcriptomics_simple \
  "Search GEO for GSE109564 (a small 3k PBMC single-cell RNA-seq dataset). \
   Download it and load it into the workspace. I want to do quality control \
   on this data."

lobster query --session-id transcriptomics_simple \
  "Yes, proceed with GSE109564. Download it, then run comprehensive quality \
   control: assess data quality with QC metrics, filter low-quality cells \
   and genes, and normalize the data. Give me the key QC statistics."

Medium: COVID-19 PBMC Immune Cell Atlas

lobster query --session-id transcriptomics_medium \
  "Search GEO for GSE149689 — a human PBMC single-cell RNA-seq dataset. \
   Get its metadata and queue it for download."

lobster query --session-id transcriptomics_medium \
  "Download GSE149689, then run the full single-cell preprocessing pipeline: \
   quality control, filter low-quality cells, normalize, select highly \
   variable genes, run PCA, compute neighbors, embed with UMAP, and cluster \
   the cells. Give me cluster statistics."

lobster query --session-id transcriptomics_medium \
  "Find marker genes for each cluster and then annotate cell types \
   automatically. These are PBMCs so I expect CD4+ T cells, CD8+ T cells, \
   B cells, NK cells, monocytes, and dendritic cells. Show me the cell type \
   proportions."

Hard: IPF Lung Multi-Batch Differential Expression

Note: This analysis requires pre-loading the dataset as a local file, as automated GEO download for multi-file MTX-format datasets is not yet supported.

lobster query --session-id transcriptomics_hard \
  "I have a pre-loaded IPF lung scRNA-seq dataset from GSE136831 at \
   .lobster_workspace/downloads/GSE136831_20k_subsample.h5ad — 20,000 cells \
   from 78 patients with IPF, Control, and COPD conditions. Load this file, \
   assess quality, filter low-quality cells, normalize, and detect doublets. \
   The batch key is 'Library_Identity' and the disease key is \
   'Disease_Identity'."

lobster query --session-id transcriptomics_hard \
  "Now select highly variable genes, run PCA, integrate batches using \
   Harmony with batch_key='Library_Identity', compute neighbors, UMAP \
   embedding, and cluster the cells at resolution 0.8. Then annotate cell \
   types — these are lung tissue cells, so expect macrophages, monocytes, \
   T cells, B cells, NK cells, fibroblasts, myofibroblasts, alveolar type \
   1 and type 2 epithelial cells, club cells, ciliated epithelial cells, \
   endothelial cells, and smooth muscle cells."

lobster query --session-id transcriptomics_hard \
  "Run pseudobulk differential expression analysis between IPF and Control \
   conditions. Use the 'Disease_Identity' column for grouping and \
   'Subject_Identity' for patient-level aggregation. Use DESeq2 method. \
   Then run pathway enrichment (GO Biological Process) on the significant \
   DE genes (adjusted p < 0.05, |log2FC| > 1). Show me the top DE genes \
   and enriched pathways."

Session continuity via --session-id ensures each turn builds on prior context. Results are stored in the .lobster_workspace/ directory and can be exported with /pipeline export.

