
Research: From Literature Mining to Dataset Discovery

Literature search and dataset discovery across three complexity levels — CRISPR base editing review, spatial transcriptomics datasets, and single-cell multi-omics method comparison.

Scientific literature grows exponentially — PubMed adds 1.5 million citations per year, bioRxiv posts 250+ preprints daily, and GEO accumulates 50,000+ datasets annually. A bioinformatics researcher trying to survey a fast-moving field faces hours of manual database queries, PDF downloads, and metadata extraction. This case study follows Lobster AI's research agents through three increasingly complex literature discovery tasks — from simple publication searches to multi-database rapid literature surveys with cross-paper methods comparison and dataset validation.

Session context: Results generated February 2026 using lobster-ai 1.0.12 on AWS Bedrock (Claude Sonnet 4.5). External databases queried: PubMed, PMC (full-text), GEO, bioRxiv. Total cost: $1.12 across 3 case studies (8 turns). Literature databases are updated daily — re-running these queries will return different papers and datasets as new publications are indexed. Session files preserving exact results are stored in .lobster_workspace/ for reproducibility. This case study demonstrates analytical workflows, not independently validated findings.

Agents and Data Sources

This analysis uses the lobster-research package, which provides two complementary agents with distinct capabilities:

| Agent | Role | Network Access |
| --- | --- | --- |
| research_agent | Literature search, publication analysis, metadata extraction | Online (PubMed, PMC, bioRxiv, GEO APIs) |
| data_expert_agent | Dataset download execution, modality detection, data loading | Offline (executes from download queue only) |

The research_agent has no child agents — complexity is measured by query breadth and tool orchestration rather than parent-child delegation. External APIs queried during sessions: PubMed (literature search), PMC (full-text extraction), bioRxiv (preprints), GEO (dataset metadata and downloads), and SRA (sequence read archives).

The workflow runs in two stages: research_agent performs all online operations (searching, fetching, validating), then hands off to data_expert_agent, which downloads offline from the queue. This separation supports reproducibility: data downloads can be retried, audited, or executed on different infrastructure without re-querying external APIs.
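
The online/offline handoff can be pictured with a small sketch. This is not Lobster's implementation: `QueueEntry`, `DownloadQueue`, `online_prepare`, `offline_execute`, and the example URL are hypothetical names invented here, and real queue entry IDs carry a hash suffix that the sketch omits.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class QueueEntry:
    """Hypothetical record written by the online stage."""
    entry_id: str
    accession: str
    strategy: str
    urls: List[str] = field(default_factory=list)
    status: str = "pending"


class DownloadQueue:
    """In-memory stand-in for a persisted download queue."""

    def __init__(self) -> None:
        self._entries: Dict[str, QueueEntry] = {}

    def add(self, entry: QueueEntry) -> None:
        self._entries[entry.entry_id] = entry

    def get(self, entry_id: str) -> QueueEntry:
        return self._entries[entry_id]


def online_prepare(queue: DownloadQueue, accession: str,
                   strategy: str, urls: List[str]) -> str:
    """Online stage: record everything the download needs, then enqueue."""
    entry_id = f"queue_{accession}"  # simplified ID, no hash suffix
    queue.add(QueueEntry(entry_id, accession, strategy, list(urls)))
    return entry_id


def offline_execute(queue: DownloadQueue, entry_id: str,
                    fetch: Callable[[str], object]) -> str:
    """Offline stage: uses only what the queue entry already contains."""
    entry = queue.get(entry_id)
    try:
        for url in entry.urls:
            fetch(url)
        entry.status = "completed"
    except OSError:
        # The entry stays queued, so a retry needs no new search.
        entry.status = "failed"
    return entry.status


queue = DownloadQueue()
entry_id = online_prepare(queue, "GSE272362", "MATRIX_FIRST",
                          ["https://example.org/placeholder.tar"])
fetched: List[str] = []
print(offline_execute(queue, entry_id, fetched.append))
```

Because the second stage reads only the stored entry, a failed download can be retried or moved to another machine without repeating the search step.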


Simple: CRISPR Base Editing Literature Review

The first scenario demonstrates rapid literature survey and computational methods extraction from a fast-moving field with rich PubMed coverage.

Turn 1: Search Recent High-Impact Papers

The first query establishes the landscape of recent CRISPR base editing therapeutics publications.

lobster query --session-id research_simple \
  "Search PubMed for the 5 most recent high-impact papers on \
   CRISPR base editing in human disease therapy published in 2024-2025. \
   For each paper, give me the PMID, title, journal, and a one-sentence \
   summary of the key finding."

The research_agent queried PubMed with date filters and returned 5 papers from December 2024 to December 2025, revealing a clear trend toward prime editing dominance (4 of 5 papers) with emphasis on clinical translation from rare disease cohorts.

| # | PMID | Journal | Key Finding |
| --- | --- | --- | --- |
| 1 | 41390734 | Nature Communications | ML-designed compact ABE: 27% size reduction, 133.5-fold precision improvement |
| 2 | 41414712 | HGG Advances | Prime editing corrected de novo GDF11 nonsense mutation from patient |
| 3 | 41421338 | Molecular Cell | Novel DIMMER circuits reduce off-target editing across two orders of magnitude |
| 4 | 41455771 | Cell Death & Disease | Prime editing modeled multiple eIF2B pathogenic mutations in iPSCs |
| 5 | 41465342 | Int J Mol Sci | Template-jumping prime editing targets F9 gene (hemophilia B therapy) |
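
For readers who want to see what a date-filtered PubMed search looks like at the API level, here is a minimal sketch using the public NCBI E-utilities `esearch` endpoint. The search term is an assumption written for illustration; the Boolean query the agent actually constructed is not shown in the session output. The code only builds the URL and does not contact NCBI.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term: str, mindate: str, maxdate: str,
                            retmax: int = 5) -> str:
    """Build an esearch URL restricted to a publication-date window."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",   # filter on publication date
        "mindate": mindate,   # YYYY/MM/DD
        "maxdate": maxdate,
        "sort": "pub_date",   # most recent first
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{ESEARCH}?{urlencode(params)}"


# Illustrative term only; not the query used in the session.
url = build_pubmed_search_url(
    '"base editing" AND CRISPR AND therapy', "2024/01/01", "2025/12/31"
)
print(url)
print(parse_qs(urlparse(url).query)["term"][0])
```

Recording the exact term, date window, and result count alongside each session would make a search like this independently repeatable.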

Turn 2: Extract Computational Methods

The second query dives deep into the computational pipeline of the ML-designed adenine base editor.

lobster query --session-id research_simple \
  "For the paper on protein-nucleic acid language model-assisted adenine \
   base editor design (PMID 41390734), extract the full computational methods: \
   what software, algorithms, parameters, and validation approaches were used? \
   Also get the full abstract."

The agent extracted a complete computational pipeline from the Nature Communications paper, revealing a sophisticated 5-step validation cascade that progressively filtered 150 generated sequences to 20 experimental candidates.

ML Architecture:

| Component | Details |
| --- | --- |
| Base Model | ESM-2 (transfer learning, 650M parameters) |
| Novel Components | Nucleic acid encoder, editing position encoder, masked autoregressive decoder |
| Training Data | 34,255 TadA sequences (UniProtKB) + 27 TadA-8e variants |
| Pre-training Optimizer | Adam (beta1=0.9, beta2=0.999, lr=1e-06) |
| Generation | Temperature=1.0, top-p=0.9, mask strategy <5 consecutive tokens |
| Output | 150 sequences (73 mutations, 39 insertions, 38 truncations) |

Multi-Tool Validation Pipeline:

| Step | Tool | Filter/Metric |
| --- | --- | --- |
| 1 | AlphaFold2 (ColabFold v1.5.5) | pLDDT >= 84 |
| 2 | ESM-1v | Mean log-likelihood (21/150 > wild-type) |
| 3 | ESM-IF | Structure-based sequence likelihood |
| 4 | Rosetta | Energy within 100 units, charge within 50 units of WT |
| 5 | AlphaFold3 | Binary complex prediction (protein + ssDNA) |

The 5-step validation cascade demonstrates how modern protein engineering combines language models (ESM-2, ESM-1v), structure prediction (AlphaFold2/3), physics-based scoring (Rosetta), and inverse folding (ESM-IF) in a multi-tool consensus workflow. This level of methods detail — including exact hyperparameters, training data sizes, and software versions — is typically scattered across main text, supplementary materials, and GitHub repositories. The research agent extracted it from PMC full-text in under 1 minute.
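
The idea of a progressive filter cascade can be sketched in a few lines. This is an illustration rather than the paper's pipeline: it covers only the steps with a numeric criterion in the table above, assumes each criterion acts as a hard sequential filter, and uses invented candidate and wild-type scores.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    name: str
    plddt: float           # AlphaFold2 confidence
    esm1v_loglik: float    # ESM-1v mean log-likelihood
    rosetta_energy: float  # Rosetta energy units
    charge: float


# Invented wild-type reference values, for illustration only.
WILD_TYPE = Candidate("WT", plddt=90.0, esm1v_loglik=-1.20,
                      rosetta_energy=-500.0, charge=2.0)

Filter = Tuple[str, Callable[[Candidate], bool]]

FILTERS: List[Filter] = [
    ("pLDDT >= 84", lambda c: c.plddt >= 84),
    ("ESM-1v log-likelihood > wild-type",
     lambda c: c.esm1v_loglik > WILD_TYPE.esm1v_loglik),
    ("Rosetta energy within 100 units of WT",
     lambda c: abs(c.rosetta_energy - WILD_TYPE.rosetta_energy) <= 100),
    ("Charge within 50 units of WT",
     lambda c: abs(c.charge - WILD_TYPE.charge) <= 50),
]


def run_cascade(candidates: Sequence[Candidate],
                filters: Sequence[Filter]) -> Tuple[List[Candidate], List[int]]:
    """Apply filters in order; return survivors and the count after each step."""
    survivors = list(candidates)
    counts = [len(survivors)]
    for _label, keep in filters:
        survivors = [c for c in survivors if keep(c)]
        counts.append(len(survivors))
    return survivors, counts


# Invented candidates: v2 fails step 1, v3 fails step 2, v4 fails step 3.
candidates = [
    Candidate("v1", 88.0, -1.00, -450.0, 3.0),
    Candidate("v2", 80.0, -0.90, -480.0, 2.0),
    Candidate("v3", 86.0, -1.50, -490.0, 1.0),
    Candidate("v4", 91.0, -0.90, -350.0, 2.0),
]
survivors, counts = run_cascade(candidates, FILTERS)
print([c.name for c in survivors], counts)
```

Ordering the cheap checks first means the expensive structure-based steps only run on candidates that have already passed the earlier ones.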

Cost and Performance

| Metric | Value |
| --- | --- |
| Session ID | research_simple |
| Turns | 2 |
| Total Time | ~2 minutes |
| Total Cost | $0.26 |
| Total Tokens | 71,601 |

Medium: Spatial Transcriptomics Dataset Discovery Pipeline

The second scenario exercises the full research-to-download pipeline: literature search, GEO cross-referencing, metadata validation, and download queue management.

Turn 1: Search Spatial Transcriptomics Papers with GEO Data

The first query targets a clinically relevant cancer domain with spatial profiling requirements.

lobster query --session-id research_medium \
  "Search PubMed for recent papers on spatial transcriptomics in \
   pancreatic ductal adenocarcinoma (PDAC) tumor microenvironment \
   published in 2024-2025. I need papers that have deposited their \
   spatial transcriptomics data in GEO. Find at least 3 papers and \
   for each give me the PMID, title, journal, and any GEO accession \
   numbers mentioned."

The agent identified 5 spatial transcriptomics studies in PDAC, each with confirmed GEO deposits, spanning cancer-associated fibroblasts, tertiary lymphoid structures, and primary-metastatic heterogeneity themes. The table below lists 4 of them.

| # | PMID | Journal | GEO | Samples | Platform |
| --- | --- | --- | --- | --- | --- |
| 1 | 40154487 | Cancer Cell (Mar 2025) | GSE274103 | 5 PDAC (FFPE) | 10x Visium |
| 2 | 40815230 | Cancer Immunol Res (Nov 2025) | GSE277116 | 28 PDAC | 10x Visium |
| 3 | 39294496 | Nature Genetics (Sep 2024) | GSE272362 | 30 (10 primary, 3 normal, 12 hepatic mets, 5 LN mets) | 10x Visium |
| 4 | 38798691 | Research Square (May 2024) | GSE240078 | 36 (23 NAT, 13 naive) | GeoMx DSP |

The strongest candidate for downstream analysis is GSE272362 from Nature Genetics: the largest 10x Visium cohort in the table (30 specimens), with matched primary and metastatic samples, providing a spatial atlas of tumor microenvironment heterogeneity across anatomical sites.

Turn 2: Validate Dataset Metadata and Queue Download

The second query pre-flights the dataset before committing to a large download.

lobster query --session-id research_medium \
  "Validate the metadata for GSE272362 (the primary vs metastatic \
   PDAC spatial atlas from Nature Genetics). Check: how many samples, \
   what platform, what organism, any supplementary files. Then prepare \
   it for download and add it to the download queue."

The research agent validated GSE272362's metadata through the GEO API, confirming 100% field completeness and producing a queue entry with a recommended download strategy.

| Attribute | Value |
| --- | --- |
| Accession | GSE272362 |
| Organism | Homo sapiens |
| Total specimens | 30 (10 primary, 3 normal, 12 hepatic mets, 5 LN mets) |
| Total spots | 91,496 |
| Platform | GPL24676 (10x Visium) |
| Public since | July 31, 2024 |
| Supplementary files | 2 |
| Metadata completeness | 100% |
| Download strategy | MATRIX_FIRST (85% confidence) |
| Queue entry ID | queue_GSE272362_3bb5f772 |
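
One plausible reading of "metadata completeness" is the share of required fields that are populated. The sketch below assumes that definition and a hypothetical list of required fields; how Lobster actually computes the figure is not documented here.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical required fields, chosen to mirror the prompt's checklist.
REQUIRED_FIELDS = ("accession", "organism", "platform",
                   "sample_count", "supplementary_files")


def completeness(metadata: Dict[str, object],
                 required: Sequence[str] = REQUIRED_FIELDS
                 ) -> Tuple[float, List[str]]:
    """Return (fraction of required fields present, list of missing fields)."""
    missing = [name for name in required
               if metadata.get(name) is None or metadata.get(name) == ""]
    return (len(required) - len(missing)) / len(required), missing


# Values taken from the validation table above.
gse272362 = {
    "accession": "GSE272362",
    "organism": "Homo sapiens",
    "platform": "GPL24676",
    "sample_count": 30,
    "supplementary_files": 2,
}
score, missing = completeness(gse272362)
print(f"{score:.0%}", missing)
```

Returning the missing field names along with the score keeps the pre-flight check actionable when a record falls short.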

Turn 3: Execute Download (Graceful Degradation)

The third query attempts to download the dataset from the prepared queue entry.

lobster query --session-id research_medium \
  "Download the GSE272362 dataset from the download queue \
   (entry queue_GSE272362_3bb5f772). Once downloaded, list the \
   available modalities and show me the basic shape and content \
   of the data."

The data_expert_agent exhausted 5 download strategies (MATRIX_FIRST, SUPPLEMENTARY_FIRST, H5_FIRST, RAW_FIRST, MATRIX_FIRST+union), downloading approximately 260 MB of supplementary files. However, Visium spatial transcriptomics data uses a non-standard multi-file structure (spatial coordinates, tissue images, spot-level matrices in custom subdirectories) that the current GEO download pipeline cannot parse as standard count matrices.

| Strategy Attempted | Result | Reason |
| --- | --- | --- |
| MATRIX_FIRST | Failed | No processed matrix in strategy config |
| SUPPLEMENTARY_FIRST | Failed | Spatial format (images, coordinates) not parseable as count matrix |
| H5_FIRST | Failed | H5AD not available in standard GEO format |
| RAW_FIRST | Failed | Raw files require spatial-specific processing pipeline |
| MATRIX_FIRST + union | Failed | Same underlying format issue |

Rather than failing silently, the agent provided structured recovery suggestions:

| Option | Action | Rationale |
| --- | --- | --- |
| 1 | Manual investigation of GSE272362 | Check exact file formats on GEO web |
| 2 | Try GSE274103 or GSE277116 | Alternative PDAC spatial datasets |
| 3 | Wait and retry | Possible GEO server issues |

This demonstrates Lobster's fail-safe design for data retrieval. When automated download fails, the system exhausts all available strategies, provides a structured diagnosis, and recommends actionable alternatives rather than leaving the user with a generic error message. The agent correctly identified a current limitation (Visium spatial format support) and suggested two peer datasets as alternatives.
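
The fallback pattern described above (try each strategy in order, record why each one failed, and return suggestions if none succeeds) can be sketched as follows. The strategy names and failure reasons come from the tables above; `StrategyError`, `Attempt`, `DownloadReport`, and `download_with_fallback` are hypothetical names, and the strategy functions are stubs that always fail.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


class StrategyError(Exception):
    """Raised by a strategy that cannot produce usable data."""


@dataclass
class Attempt:
    strategy: str
    ok: bool
    reason: str


@dataclass
class DownloadReport:
    accession: str
    attempts: List[Attempt]
    result: Optional[object]
    suggestions: List[str]


Strategy = Tuple[str, Callable[[str], object]]


def download_with_fallback(accession: str, strategies: Sequence[Strategy],
                           suggestions: Sequence[str] = ()) -> DownloadReport:
    """Try strategies in order; stop at the first success."""
    attempts: List[Attempt] = []
    for name, run in strategies:
        try:
            result = run(accession)
        except StrategyError as exc:
            attempts.append(Attempt(name, False, str(exc)))
            continue
        attempts.append(Attempt(name, True, ""))
        return DownloadReport(accession, attempts, result, [])
    # Every strategy failed: return the diagnosis plus next steps.
    return DownloadReport(accession, attempts, None, list(suggestions))


def failing(reason: str) -> Callable[[str], object]:
    """Build a stub strategy that always fails with the given reason."""
    def run(accession: str) -> object:
        raise StrategyError(reason)
    return run


strategies = [
    ("MATRIX_FIRST", failing("No processed matrix in strategy config")),
    ("SUPPLEMENTARY_FIRST", failing("Spatial format not parseable as count matrix")),
    ("H5_FIRST", failing("H5AD not available in standard GEO format")),
    ("RAW_FIRST", failing("Raw files require spatial-specific processing pipeline")),
    ("MATRIX_FIRST + union", failing("Same underlying format issue")),
]
report = download_with_fallback(
    "GSE272362", strategies,
    suggestions=["Manual investigation of GSE272362",
                 "Try GSE274103 or GSE277116",
                 "Wait and retry"],
)
for attempt in report.attempts:
    print(attempt.strategy, "failed:", attempt.reason)
print(report.suggestions)
```

Collecting a reason per attempt is what turns a generic failure into a diagnosis the user can act on.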

Cost and Performance

| Metric | Value |
| --- | --- |
| Session ID | research_medium |
| Turns | 3 |
| Total Time | ~5 minutes (includes 5 download strategy attempts) |
| Total Cost | $0.30 |
| Total Tokens | 91,544 |

Hard: Single-Cell Multi-Omics Integration Review

The third scenario is the most demanding research workflow — combining literature search across multiple databases, full-text analysis, cross-paper synthesis, and dataset validation in a single session.

Turn 1: Multi-Database Literature Search

The first query executes a comprehensive search across PubMed and bioRxiv with specific filtering criteria.

lobster query --session-id research_hard \
  "I'm conducting a comprehensive methods comparison for single-cell \
   multi-omics integration methods in cancer immunology. Search PubMed \
   AND bioRxiv for papers from 2024-2025 that describe new computational \
   methods or benchmarks for integrating scRNA-seq with scATAC-seq \
   or CITE-seq data in tumor samples. I need at least 5 papers. \
   For each: PMID/DOI, title, journal, and any GEO/SRA/Zenodo \
   accession numbers for deposited data. Prioritize papers that \
   benchmarked multiple integration methods."

The research agent executed 7 targeted searches across PubMed and bioRxiv, identifying 8 papers with automatic triage by benchmark scope. The table below lists 6 of them.

Priority-Ranked Paper Summary:

| Priority | Title | PMID | Journal | Methods Benchmarked | Key Accessions |
| --- | --- | --- | --- | --- | --- |
| High | scGALA: Graph link prediction cell alignment | 41298467 | Nat Commun (2025) | 14 methods | GSE261228, GSE232073, GSE230827, GSE232074 |
| High | CelLink: Weak feature linkage integration | 41335468 | Nucleic Acids Res (2025) | 11 methods | PANC-DB, SeuratData |
| High | scCotag: Co-optimal transport integration | 41446270 | bioRxiv (2025) | 6 methods | GitHub (pending) |
| Medium | Comparison of scRNA+scATAC integration methods | 41675510 | Quant Biol (2025) | Multiple | Not extracted |
| Medium | SpatialEx: Histology-anchored integration | 41407925 | Nat Methods (2025) | Novel method | Supplementary |
| Medium | Glioblastoma scMultiome radiation response | 41573875 | bioRxiv (2025) | Application | Pending |

Three papers stood out for comprehensive method comparisons: scGALA benchmarked 14 integration methods, CelLink benchmarked 11, and scCotag benchmarked 6. Four GEO accessions and multiple public repository links were extracted directly from search results.

Turn 2: Methods Extraction + Dataset Validation

The second query extracts detailed computational methods from the top-ranked paper and validates its associated dataset.

lobster query --session-id research_hard \
  "For the scGALA paper (PMID 41298467), extract the complete \
   computational methods section including: (1) all software and \
   their versions, (2) hyperparameters for the graph attention \
   network, (3) benchmark evaluation metrics, (4) computing \
   resources used. Also validate the metadata for GSE261228 — \
   check sample count, organism, platform, and data availability. \
   Queue it for download if it looks good."

The agent extracted the complete computational architecture from the Nature Communications full text and simultaneously validated the associated tri-omics MPAL dataset.

scGALA Critical Hyperparameters:

| Component | Setting |
| --- | --- |
| Optimizer | Adam |
| Learning Rate | 1e-3 |
| Scheduler | Cosine annealing (patience=10 epochs) |
| Edge Masking | 30% random (uniform across edge types) |
| Dropout Rate | 50% (spatial tasks) |
| K-NN | K=20 (both intra- and inter-dataset) |
| Model Size | 476,000 parameters |

Benchmark Metrics and Performance:

| Category | Metric | scGALA Improvement |
| --- | --- | --- |
| Biological Conservation | ARI | 14.7-48.6% |
| Biological Conservation | NMI | 7.7-17.0% |
| Alignment | FOSCTTM | 12.4% |
| Label Transfer | Cohen's Kappa | 19.2% avg (up to 66.8%) |
| Batch Correction | Graph Connectivity | 7.5% |
| Booster Mode | Clustering Accuracy | 67.8% when wrapping other methods |
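
Of the metrics above, FOSCTTM (fraction of samples closer than the true match) is the least self-explanatory. For paired cells measured in two modalities, it asks how many other cells sit closer to a cell than its own counterpart does; lower is better, and 0 means every cell's nearest cross-modality neighbour is its true match. The sketch below is a plain-Python illustration on toy coordinates. It normalizes by n - 1, which is one common convention, and is not taken from the scGALA code.

```python
import math
from typing import Sequence

Point = Sequence[float]


def foscttm(x: Sequence[Point], y: Sequence[Point]) -> float:
    """Mean fraction of samples closer than the true match.

    x[i] and y[i] are embeddings of the same cell in two modalities.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least 2 cells")
    total = 0.0
    for i in range(n):
        d_true = math.dist(x[i], y[i])
        # Cells in y that are closer to x[i] than its true match y[i].
        closer_in_y = sum(math.dist(x[i], y[j]) < d_true
                          for j in range(n) if j != i)
        # Cells in x that are closer to y[i] than its true match x[i].
        closer_in_x = sum(math.dist(x[j], y[i]) < d_true
                          for j in range(n) if j != i)
        total += (closer_in_y + closer_in_x) / (2 * (n - 1))
    return total / n


rna = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
atac_aligned = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
atac_swapped = [(2.0, 0.0), (1.0, 0.0), (0.0, 0.0)]  # first and last cells swapped

print(foscttm(rna, atac_aligned))  # perfectly aligned toy data
print(foscttm(rna, atac_swapped))  # two of three cells mismatched
```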

GSE261228 Validation:

| Attribute | Value |
| --- | --- |
| Accession | GSE261228 |
| Title | Multiomic Single Cell Sequencing of MPAL (normal bone marrow PIP-Seq) |
| Organism | Homo sapiens |
| Samples | 14 adult MPAL patients |
| Technology | PIP-Seq (DNA + RNA + protein) |
| Platform | GPL24676 |
| Status | Public (since Mar 13, 2024) |
| Queue Entry | queue_GSE261228_84056a40 |

Turn 3: Cross-Paper Methods Comparison

The third query synthesizes methods from multiple papers into a structured comparison table.

lobster query --session-id research_hard \
  "Now extract the methods from the CelLink paper (PMID 41335468) \
   and compare them side-by-side with scGALA. I want a structured \
   comparison table covering: (1) algorithmic approach, (2) cell \
   count scalability, (3) modalities supported, (4) benchmark \
   metrics used, (5) number of datasets tested, (6) key advantages \
   and limitations. Also read the full abstract of the scCotag \
   bioRxiv paper (PMID 41446270) and add it to the comparison."

The agent extracted complete methods from the second paper (CelLink from Nucleic Acids Research) and the abstract from the third (scCotag from bioRxiv), then produced a 3-way structured comparison.

Core Algorithm Comparison:

| Dimension | scGALA | CelLink | scCotag |
| --- | --- | --- | --- |
| Core Algorithm | VGAE + Graph Attention Network | Balanced OT to Iterative Unbalanced OT | Prior-informed Co-Optimal Transport + VAE |
| Learning Paradigm | Self-supervised (30% edge masking) | Optimization-based (no training) | Supervised deep learning (4,000 epochs) |
| Model Size | 476,000 parameters | N/A (optimization) | VAE: 2x256 layers |
| Hyperparameters | 11 critical | 5 parameters | 9 parameters |
| Training Time | GPU required | N/A (no training) | 4,000 epochs |

Scalability Comparison:

| Metric | scGALA | CelLink | scCotag |
| --- | --- | --- | --- |
| Max Cells Tested | 161,764 | 100,000 | ~10,000 |
| Runtime (10K cells) | Not reported | 2-5 min | Not reported |
| Hardware | GPU cluster | Apple M1 Pro (8-core, 16GB RAM) | GPU (likely) |

Modality Support:

| Modality | scGALA | CelLink | scCotag |
| --- | --- | --- | --- |
| scRNA + scATAC | Yes | Yes (via imputation) | Yes (primary focus) |
| CITE-seq (RNA+ADT) | Yes | Yes (4 datasets) | No |
| Spatial Transcriptomics | Yes (161K spots) | No | No |
| Spatial Proteomics (CODEX) | No | Yes (primary strength) | No |
| Tri-omics (DNA+RNA+Protein) | Yes (MPAL PIP-Seq) | No | No |

Use Case Recommendations:

| Use Case | Recommended Method | Rationale |
| --- | --- | --- |
| Large-scale atlas (>100K cells) | scGALA | Proven on 161K cells |
| Consumer hardware (no GPU) | CelLink | CPU-only, 2-5 min for 10K cells |
| Spatial proteomics (CODEX/MIBI-TOF) | CelLink | Only method tested on imaging data |
| scRNA+scATAC integration | scCotag | Prior-informed with gene-peak overlap |
| Extreme imbalance (1:10+ ratios) | CelLink | Tested on 1:76 ratio |
| Tri-omics (DNA+RNA+Protein) | scGALA | MPAL PIP-Seq dataset |
| Boosting existing workflows | scGALA | Wraps 14 methods |

This comparison table — covering 7 dimensions with specific numbers and use-case guidance — is the kind of synthesis that typically takes a researcher days of reading and tabulation. The research agent produced it in a single conversational turn by leveraging PMC full-text extraction and structured analysis across three papers simultaneously. Key insight: scGALA is the universal booster (enhances 14 existing methods), CelLink is the CPU-friendly option for extreme imbalance, and scCotag specializes in scRNA+scATAC integration with biological priors.

Cost and Performance

| Metric | Value |
| --- | --- |
| Session ID | research_hard |
| Turns | 3 |
| Total Time | ~8 minutes |
| Total Cost | $0.56 |
| Total Tokens | 151,798 |
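
The per-session figures reported in the three Cost and Performance tables add up to the totals quoted in the session context ($1.12 across 8 turns). A short check, using only numbers from those tables:

```python
# Per-session figures copied from the three Cost and Performance tables.
sessions = {
    "research_simple": {"turns": 2, "cost_usd": 0.26, "tokens": 71_601},
    "research_medium": {"turns": 3, "cost_usd": 0.30, "tokens": 91_544},
    "research_hard": {"turns": 3, "cost_usd": 0.56, "tokens": 151_798},
}

total_turns = sum(s["turns"] for s in sessions.values())
total_cost = round(sum(s["cost_usd"] for s in sessions.values()), 2)
total_tokens = sum(s["tokens"] for s in sessions.values())

# Derived figure (not reported in the source): cost per 1,000 tokens.
cost_per_1k = {
    name: 1000 * s["cost_usd"] / s["tokens"] for name, s in sessions.items()
}

print(total_turns, total_cost, total_tokens)
for name, value in cost_per_1k.items():
    print(f"{name}: ${value:.4f} per 1K tokens")
```

The cost per 1,000 tokens is similar across the three sessions, so the higher cost of the hard case tracks its larger token volume rather than a different rate.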

What This Demonstrates

Multi-Tool Orchestration

No single database query produces these analyses. The research agent orchestrated PubMed searches, bioRxiv preprint retrieval, PMC full-text extraction, GEO metadata validation, and cross-database accession linking — all within natural language conversations. The data expert agent managed download queue execution with 5 fallback strategies and structured failure reporting.

Cross-Database Synthesis

The hard case study required querying PubMed, bioRxiv, PMC, and GEO within a single session, then synthesizing results across 3 papers into a structured comparison table with specific hyperparameters, benchmark metrics, and use-case recommendations. This workflow would typically require separate API scripts, manual PDF reading, and Excel tabulation over 1-2 days. Lobster compressed it into about 8 minutes.

The Hard case demonstrates rapid literature survey capabilities, not formal systematic review methodology. Formal systematic reviews require a registered protocol (PROSPERO), PRISMA-compliant reporting, and explicit inclusion/exclusion criteria. Lobster AI accelerates the literature screening and synthesis phases but does not replace the methodological framework of evidence synthesis.

Structured Failure Handling

When the Visium spatial data download failed due to non-standard file formats, the system didn't produce a cryptic error — it exhausted all 5 download strategies, identified the root cause (spatial format incompatibility), and recommended specific alternative datasets. This fail-safe design ensures users understand why operations fail and what to do next.


Human vs Raw LLM vs Lobster AI

Estimates based on these case study sessions. Human researcher timing assumes manual workflows without automation.

| Task | Human Researcher | Raw LLM | Lobster AI |
| --- | --- | --- | --- |
| Search PubMed with date filters | 10-15 min | Cannot query APIs | ~30 sec |
| Extract computational methods from full text | 30-60 min | Hallucinates parameters/versions | ~1 min (PMC extraction) |
| Cross-reference PubMed + GEO accessions | 20-30 min | Cannot access multiple databases | Automatic |
| Validate GEO dataset metadata | 15-20 min | Cannot access GEO API | ~15 sec |
| Compare methods across 3 papers | 3-5 hours | Generic, misses details | ~2 min (structured synthesis) |
| Execute download with fallback strategies | 30-60 min | Cannot download | ~3 min (5 strategies automated) |
| Simple case (2 turns) | 1.5-2.5 hours | Not reliable | ~2 min, $0.26 |
| Medium case (3 turns) | 2-3 hours | Not possible | ~5 min, $0.30 |
| Hard case (3 turns) | 1-2 days | Not reliable | ~8 min, $0.56 |

Limitations

  • Search strategies are not disclosed. The case study shows natural language prompts but not the actual PubMed Boolean queries constructed by the agent. Without the exact query strings, search dates, and total result counts, the searches are not independently reproducible to systematic review standards.
  • Literature results are date-sensitive. PubMed, GEO, and bioRxiv indexes change daily. The specific papers and datasets returned in these case studies reflect the database state at the time of the session.
  • Spatial data download not supported. The Visium spatial transcriptomics format (GSE272362) could not be downloaded despite 5 strategy attempts. Spatial data formats require specialized handling beyond standard GEO matrix downloads.
  • Not a systematic review. The Hard case produces a structured methods comparison, not a PRISMA-compliant systematic review. Formal evidence synthesis requires registered protocols, explicit inclusion/exclusion criteria, and risk-of-bias assessment.
  • Full-text access varies. Methods extraction from PMC full-text depends on open-access availability. Paywalled publications may only yield abstract-level information.

Reproducibility

To reproduce these analyses, install the research package and run the queries sequentially with session continuity:

pip install 'lobster-ai[full]'

Simple case:

lobster query --session-id research_simple \
  "Search PubMed for the 5 most recent high-impact papers on \
   CRISPR base editing in human disease therapy published in 2024-2025. \
   For each paper, give me the PMID, title, journal, and a one-sentence \
   summary of the key finding."
lobster query --session-id research_simple \
  "For the paper on protein-nucleic acid language model-assisted adenine \
   base editor design (PMID 41390734), extract the full computational methods."

Medium case:

lobster query --session-id research_medium \
  "Search PubMed for recent papers on spatial transcriptomics in \
   pancreatic ductal adenocarcinoma published in 2024-2025 that have \
   deposited data in GEO. Find at least 3 papers with GEO accessions."
lobster query --session-id research_medium \
  "Validate the metadata for GSE272362. Then prepare it for download \
   and add it to the download queue."
lobster query --session-id research_medium \
  "Download the GSE272362 dataset from the download queue."

Hard case:

lobster query --session-id research_hard \
  "Search PubMed AND bioRxiv for papers from 2024-2025 on single-cell \
   multi-omics integration methods benchmarking scRNA-seq with scATAC-seq \
   or CITE-seq in tumor samples. Need at least 5 papers with GEO accessions."
lobster query --session-id research_hard \
  "For the scGALA paper (PMID 41298467), extract complete computational \
   methods and validate GSE261228 metadata. Queue it for download."
lobster query --session-id research_hard \
  "Extract methods from CelLink paper (PMID 41335468) and compare with \
   scGALA. Add scCotag (PMID 41446270) abstract to the comparison."

Session continuity via --session-id ensures each turn builds on prior context. Results are stored in the .lobster_workspace/ directory and can be exported with /pipeline export.
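
For scripted reruns, the same turn-by-turn pattern can be driven from Python. This is a convenience sketch, not part of Lobster: `lobster_query_argv` and `run_session` are hypothetical helpers that use only the `lobster query --session-id` invocation shown above. The example prints the commands and does not execute them.

```python
import shlex
import subprocess
from typing import List, Sequence


def lobster_query_argv(session_id: str, prompt: str) -> List[str]:
    """Build the argument list for one turn of a session."""
    return ["lobster", "query", "--session-id", session_id, prompt]


def run_session(session_id: str, prompts: Sequence[str]) -> None:
    """Run turns in order; stop at the first failing command."""
    for prompt in prompts:
        subprocess.run(lobster_query_argv(session_id, prompt), check=True)


medium_prompts = [
    "Validate the metadata for GSE272362. Then prepare it for download "
    "and add it to the download queue.",
    "Download the GSE272362 dataset from the download queue.",
]

# Print the commands instead of running them.
for prompt in medium_prompts:
    print(shlex.join(lobster_query_argv("research_medium", prompt)))
```

Reusing one session ID for every turn, in order, is what lets the download step refer to the queue entry created by the step before it.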

