Research: From Literature Mining to Dataset Discovery
Literature search and dataset discovery across three complexity levels — CRISPR base editing review, spatial transcriptomics datasets, and single-cell multi-omics method comparison.
Scientific literature grows exponentially — PubMed adds 1.5 million citations per year, bioRxiv posts 250+ preprints daily, and GEO accumulates 50,000+ datasets annually. A bioinformatics researcher trying to survey a fast-moving field faces hours of manual database queries, PDF downloads, and metadata extraction. This case study follows Lobster AI's research agents through three increasingly complex literature discovery tasks — from simple publication searches to multi-database rapid literature surveys with cross-paper methods comparison and dataset validation.
Session context: Results generated February 2026 using lobster-ai 1.0.12 on AWS Bedrock (Claude Sonnet 4.5). External databases queried: PubMed, PMC (full-text), GEO, bioRxiv. Total cost: $1.12 across 3 case studies (8 turns). Literature databases are updated daily — re-running these queries will return different papers and datasets as new publications are indexed. Session files preserving exact results are stored in .lobster_workspace/ for reproducibility. This case study demonstrates analytical workflows, not independently validated findings.
Agents and Data Sources
This analysis uses the lobster-research package, which provides two complementary agents with distinct capabilities:
| Agent | Role | Network Access |
|---|---|---|
research_agent | Literature search, publication analysis, metadata extraction | Online (PubMed, PMC, bioRxiv, GEO APIs) |
data_expert_agent | Dataset download execution, modality detection, data loading | Offline (executes from download queue only) |
The research_agent has no child agents — complexity is measured by query breadth and tool orchestration rather than parent-child delegation. External APIs queried during sessions: PubMed (literature search), PMC (full-text extraction), bioRxiv (preprints), GEO (dataset metadata and downloads), and SRA (sequence read archives).
The research workflow runs in two phases: research_agent performs all online operations (searching, fetching, validating), then hands off to data_expert_agent for offline downloads from the queue. This separation ensures reproducibility — data downloads can be retried, audited, or executed on different infrastructure without re-querying external APIs.
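The online/offline split can be sketched as plain data flowing between two phases. This is an illustrative sketch, not Lobster's actual internals; the function names and queue-entry fields are assumptions:

```python
def online_phase(accessions):
    # Stand-in for research_agent: in the real system this is where GEO
    # metadata would be validated via the API. Here we only emit queue
    # entries as plain, serializable data.
    return [{"accession": a, "strategy": "MATRIX_FIRST"} for a in accessions]

def offline_phase(queue, fetch):
    # Stand-in for data_expert_agent: consumes the queue with an injected
    # fetch function, so downloads can be retried or replayed on different
    # infrastructure without re-querying any online API.
    return {entry["accession"]: fetch(entry) for entry in queue}

queue = online_phase(["GSE272362"])
results = offline_phase(queue, lambda e: f"downloaded:{e['accession']}")
```

Because the queue is inert data, the offline phase can be swapped out (mock fetcher in tests, real downloader in production) without touching the online phase.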
Simple: CRISPR Base Editing Literature Review
The first scenario demonstrates rapid literature survey and computational methods extraction from a fast-moving field with rich PubMed coverage.
Turn 1: Search Recent High-Impact Papers
The first query establishes the landscape of recent CRISPR base editing therapeutics publications.
lobster query --session-id research_simple \
"Search PubMed for the 5 most recent high-impact papers on \
CRISPR base editing in human disease therapy published in 2024-2025. \
For each paper, give me the PMID, title, journal, and a one-sentence \
summary of the key finding."

The research_agent queried PubMed with date filters and returned 5 papers from December 2024 to December 2025, revealing a clear trend toward prime editing dominance (4 of 5 papers) with emphasis on clinical translation from rare disease cohorts.
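Under the hood, a date-filtered PubMed search maps onto NCBI's E-utilities esearch endpoint. The sketch below only builds the request URL (no network call is made); the search term is a hypothetical reconstruction, since the agent's actual Boolean query is not disclosed:

```python
from urllib.parse import urlencode

def pubmed_esearch_url(term, mindate, maxdate, retmax=5):
    """Build an NCBI E-utilities esearch URL with a publication-date filter."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {
        "db": "pubmed",
        "term": term,          # Boolean query string
        "datetype": "pdat",    # filter on publication date
        "mindate": mindate,
        "maxdate": maxdate,
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{base}?{urlencode(params)}"

# Hypothetical query approximating the natural-language request above
url = pubmed_esearch_url(
    '("base editing" OR "prime editing") AND therapy[Title/Abstract]',
    "2024/01/01", "2025/12/31",
)
```

Fetching the URL returns PMIDs as JSON, which an efetch call can then expand into titles, journals, and abstracts.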
| # | PMID | Journal | Key Finding |
|---|---|---|---|
| 1 | 41390734 | Nature Communications | ML-designed compact ABE: 27% size reduction, 133.5-fold precision improvement |
| 2 | 41414712 | HGG Advances | Prime editing corrected de novo GDF11 nonsense mutation from patient |
| 3 | 41421338 | Molecular Cell | Novel DIMMER circuits reduce off-target editing by two orders of magnitude |
| 4 | 41455771 | Cell Death & Disease | Prime editing modeled multiple eIF2B pathogenic mutations in iPSCs |
| 5 | 41465342 | Int J Mol Sci | Template-jumping prime editing targets F9 gene (hemophilia B therapy) |
Turn 2: Extract Computational Methods
The second query dives deep into the computational pipeline of the ML-designed adenine base editor.
lobster query --session-id research_simple \
"For the paper on protein-nucleic acid language model-assisted adenine \
base editor design (PMID 41390734), extract the full computational methods: \
what software, algorithms, parameters, and validation approaches were used? \
Also get the full abstract."

The agent extracted a complete computational pipeline from the Nature Communications paper, revealing a sophisticated 5-step validation cascade that progressively filtered 150 generated sequences to 20 experimental candidates.
ML Architecture:
| Component | Details |
|---|---|
| Base Model | ESM-2 (transfer learning, 650M parameters) |
| Novel Components | Nucleic acid encoder, editing position encoder, masked autoregressive decoder |
| Training Data | 34,255 TadA sequences (UniProtKB) + 27 TadA-8e variants |
| Pre-training Optimizer | Adam (beta1=0.9, beta2=0.999, lr=1e-06) |
| Generation | Temperature=1.0, top-p=0.9, mask strategy <5 consecutive tokens |
| Output | 150 sequences (73 mutations, 39 insertions, 38 truncations) |
Multi-Tool Validation Pipeline:
| Step | Tool | Filter/Metric |
|---|---|---|
| 1 | AlphaFold2 (ColabFold v1.5.5) | pLDDT >= 84 |
| 2 | ESM-1v | Mean log-likelihood (21/150 > wild-type) |
| 3 | ESM-IF | Structure-based sequence likelihood |
| 4 | Rosetta | Energy within 100 units, charge within 50 units of WT |
| 5 | AlphaFold3 | Binary complex prediction (protein + ssDNA) |
The 5-step validation cascade demonstrates how modern protein engineering combines language models (ESM-2, ESM-1v), structure prediction (AlphaFold2/3), physics-based scoring (Rosetta), and inverse folding (ESM-IF) in a multi-tool consensus workflow. This level of methods detail — including exact hyperparameters, training data sizes, and software versions — is typically scattered across main text, supplementary materials, and GitHub repositories. The research agent extracted it from PMC full-text in under 1 minute.
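The cascade pattern itself is simple to express: each step is a named predicate, candidates failing any step are dropped, and a per-step survivor count is recorded. A minimal sketch where the thresholds mirror steps 1-2 of the table but the candidate scores are fabricated for illustration:

```python
def filter_cascade(candidates, steps):
    """Apply named filters in order, recording survivors after each step."""
    report = []
    for name, keep in steps:
        candidates = [c for c in candidates if keep(c)]
        report.append((name, len(candidates)))
    return candidates, report

# Fabricated candidates standing in for generated TadA variants
candidates = [
    {"id": "seq1", "plddt": 90.0, "esm1v_vs_wt": +0.20},
    {"id": "seq2", "plddt": 80.0, "esm1v_vs_wt": +0.50},  # fails pLDDT
    {"id": "seq3", "plddt": 86.0, "esm1v_vs_wt": -0.10},  # fails ESM-1v
    {"id": "seq4", "plddt": 92.0, "esm1v_vs_wt": +0.30},
]
steps = [
    ("AlphaFold2 pLDDT >= 84", lambda c: c["plddt"] >= 84),
    ("ESM-1v above wild-type", lambda c: c["esm1v_vs_wt"] > 0),
]
survivors, report = filter_cascade(candidates, steps)
```

The per-step report is what makes a cascade auditable: it shows exactly where the 150-to-20 attrition happened.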
Cost and Performance
| Metric | Value |
|---|---|
| Session ID | research_simple |
| Turns | 2 |
| Total Time | ~2 minutes |
| Total Cost | $0.26 |
| Total Tokens | 71,601 |
Medium: Spatial Transcriptomics Dataset Discovery Pipeline
The second scenario exercises the full research-to-download pipeline: literature search, GEO cross-referencing, metadata validation, and download queue management.
Turn 1: Search Spatial Transcriptomics Papers with GEO Data
The first query targets a clinically relevant cancer domain with spatial profiling requirements.
lobster query --session-id research_medium \
"Search PubMed for recent papers on spatial transcriptomics in \
pancreatic ductal adenocarcinoma (PDAC) tumor microenvironment \
published in 2024-2025. I need papers that have deposited their \
spatial transcriptomics data in GEO. Find at least 3 papers and \
for each give me the PMID, title, journal, and any GEO accession \
numbers mentioned."

The agent identified five spatial transcriptomics studies in PDAC, each with a confirmed GEO deposit, spanning cancer-associated fibroblasts, tertiary lymphoid structures, and primary-metastatic heterogeneity; four are tabulated below.
| # | PMID | Journal | GEO | Samples | Platform |
|---|---|---|---|---|---|
| 1 | 40154487 | Cancer Cell (Mar 2025) | GSE274103 | 5 PDAC (FFPE) | 10x Visium |
| 2 | 40815230 | Cancer Immunol Res (Nov 2025) | GSE277116 | 28 PDAC | 10x Visium |
| 3 | 39294496 | Nature Genetics (Sep 2024) | GSE272362 | 30 (10 primary, 3 normal, 12 hepatic mets, 5 LN mets) | 10x Visium |
| 4 | 38798691 | Research Square (May 2024) | GSE240078 | 36 (23 NAT, 13 naive) | GeoMx DSP |
The strongest candidate for downstream analysis is GSE272362 from Nature Genetics — the largest cohort (30 specimens) with matched primary and metastatic samples, providing a unique spatial atlas of tumor microenvironment heterogeneity across anatomical sites.
Turn 2: Validate Dataset Metadata and Queue Download
The second query pre-flights the dataset before committing to a large download.
lobster query --session-id research_medium \
"Validate the metadata for GSE272362 (the primary vs metastatic \
PDAC spatial atlas from Nature Genetics). Check: how many samples, \
what platform, what organism, any supplementary files. Then prepare \
it for download and add it to the download queue."

The research agent validated GSE272362's metadata through the GEO API, confirming 100% field completeness and producing a queue entry with a recommended download strategy.
| Attribute | Value |
|---|---|
| Accession | GSE272362 |
| Organism | Homo sapiens |
| Total specimens | 30 (10 primary, 3 normal, 12 hepatic mets, 5 LN mets) |
| Total spots | 91,496 |
| Platform | GPL24676 (10x Visium) |
| Public since | July 31, 2024 |
| Supplementary files | 2 |
| Metadata completeness | 100% |
| Download strategy | MATRIX_FIRST (85% confidence) |
| Queue entry ID | queue_GSE272362_3bb5f772 |
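The queue entry ID above (queue_GSE272362_3bb5f772) follows a recognizable accession-plus-short-hash pattern. A sketch of one way such an ID might be derived; the SHA-1 scheme here is an assumption, not Lobster's documented behavior:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class QueueEntry:
    accession: str
    strategy: str
    confidence: float
    entry_id: str = field(init=False)

    def __post_init__(self):
        # Short, deterministic suffix so re-queuing the same accession with
        # the same strategy yields the same entry ID (hashing scheme assumed).
        digest = hashlib.sha1(
            f"{self.accession}:{self.strategy}".encode()
        ).hexdigest()[:8]
        self.entry_id = f"queue_{self.accession}_{digest}"

entry = QueueEntry("GSE272362", "MATRIX_FIRST", confidence=0.85)
```

A deterministic ID lets the offline phase deduplicate retries of the same download request.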
Turn 3: Execute Download (Graceful Degradation)
The third query attempts to download the dataset from the prepared queue entry.
lobster query --session-id research_medium \
"Download the GSE272362 dataset from the download queue \
(entry queue_GSE272362_3bb5f772). Once downloaded, list the \
available modalities and show me the basic shape and content \
of the data."

The data_expert_agent exhausted 5 download strategies (MATRIX_FIRST, SUPPLEMENTARY_FIRST, H5_FIRST, RAW_FIRST, MATRIX_FIRST+union), downloading approximately 260 MB of supplementary files. However, Visium spatial transcriptomics data uses a non-standard multi-file structure (spatial coordinates, tissue images, spot-level matrices in custom subdirectories) that the current GEO download pipeline cannot parse as standard count matrices.
| Strategy Attempted | Result | Reason |
|---|---|---|
| MATRIX_FIRST | Failed | No processed matrix in strategy config |
| SUPPLEMENTARY_FIRST | Failed | Spatial format (images, coordinates) not parseable as count matrix |
| H5_FIRST | Failed | H5AD not available in standard GEO format |
| RAW_FIRST | Failed | Raw files require spatial-specific processing pipeline |
| MATRIX_FIRST + union | Failed | Same underlying format issue |
Rather than failing silently, the agent provided structured recovery suggestions:
| Option | Action | Rationale |
|---|---|---|
| 1 | Manual investigation of GSE272362 | Check exact file formats on GEO web |
| 2 | Try GSE274103 or GSE277116 | Alternative PDAC spatial datasets |
| 3 | Wait and retry | Possible GEO server issues |
This demonstrates Lobster's fail-safe design for data retrieval. When automated download fails, the system exhausts all available strategies, provides a structured diagnosis, and recommends actionable alternatives rather than leaving the user with a generic error message. The agent correctly identified a current limitation (Visium spatial format support) and suggested two peer datasets as alternatives.
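The fallback behavior can be modeled as a loop over strategies that returns either data or a structured failure report. Illustrative only: the strategy names match the table, but the implementation is assumed:

```python
def download_with_fallback(accession, strategies):
    """Try each strategy in order; on total failure, return the diagnosis."""
    failures = []
    for name, attempt in strategies:
        try:
            return attempt(accession), failures
        except RuntimeError as err:
            failures.append({"strategy": name, "reason": str(err)})
    return None, failures  # exhausted: caller receives every failure reason

def unsupported(reason):
    # Helper producing a strategy that always fails with a given diagnosis
    def attempt(accession):
        raise RuntimeError(reason)
    return attempt

strategies = [
    ("MATRIX_FIRST", unsupported("no processed matrix")),
    ("SUPPLEMENTARY_FIRST", unsupported("spatial format not parseable")),
]
data, failures = download_with_fallback("GSE272362", strategies)
```

Returning the accumulated failure list, rather than raising on the last attempt, is what enables the structured recovery table shown above.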
Cost and Performance
| Metric | Value |
|---|---|
| Session ID | research_medium |
| Turns | 3 |
| Total Time | ~5 minutes (includes 5 download strategy attempts) |
| Total Cost | $0.30 |
| Total Tokens | 91,544 |
Hard: Single-Cell Multi-Omics Integration Review
The third scenario is the most demanding research workflow — combining literature search across multiple databases, full-text analysis, cross-paper synthesis, and dataset validation in a single session.
Turn 1: Multi-Database Systematic Search
The first query executes a comprehensive search across PubMed and bioRxiv with specific filtering criteria.
lobster query --session-id research_hard \
"I'm conducting a comprehensive methods comparison for single-cell \
multi-omics integration methods in cancer immunology. Search PubMed \
AND bioRxiv for papers from 2024-2025 that describe new computational \
methods or benchmarks for integrating scRNA-seq with scATAC-seq \
or CITE-seq data in tumor samples. I need at least 5 papers. \
For each: PMID/DOI, title, journal, and any GEO/SRA/Zenodo \
accession numbers for deposited data. Prioritize papers that \
benchmarked multiple integration methods."

The research agent executed 7 targeted searches across PubMed and bioRxiv, identifying 8 papers and triaging them by benchmark scope; the six highest-priority papers are summarized below.
Priority-Ranked Paper Summary:
| Priority | Title | PMID | Journal | Methods Benchmarked | Key Accessions |
|---|---|---|---|---|---|
| High | scGALA: Graph link prediction cell alignment | 41298467 | Nat Commun (2025) | 14 methods | GSE261228, GSE232073, GSE230827, GSE232074 |
| High | CelLink: Weak feature linkage integration | 41335468 | Nucleic Acids Res (2025) | 11 methods | PANC-DB, SeuratData |
| High | scCotag: Co-optimal transport integration | 41446270 | bioRxiv (2025) | 6 methods | GitHub (pending) |
| Medium | Comparison of scRNA+scATAC integration methods | 41675510 | Quant Biol (2025) | Multiple | Not extracted |
| Medium | SpatialEx: Histology-anchored integration | 41407925 | Nat Methods (2025) | Novel method | Supplementary |
| Medium | Glioblastoma scMultiome radiation response | 41573875 | bioRxiv (2025) | Application | Pending |
Three papers stood out for comprehensive method comparisons: scGALA benchmarked 14 integration methods, CelLink benchmarked 11, and scCotag benchmarked 6. Four GEO accessions and multiple public repository links were extracted directly from search results.
Turn 2: Methods Extraction + Dataset Validation
The second query extracts detailed computational methods from the top-ranked paper and validates its associated dataset.
lobster query --session-id research_hard \
"For the scGALA paper (PMID 41298467), extract the complete \
computational methods section including: (1) all software and \
their versions, (2) hyperparameters for the graph attention \
network, (3) benchmark evaluation metrics, (4) computing \
resources used. Also validate the metadata for GSE261228 — \
check sample count, organism, platform, and data availability. \
Queue it for download if it looks good."

The agent extracted the complete computational architecture from the Nature Communications full text and simultaneously validated the associated tri-omics MPAL dataset.
scGALA Critical Hyperparameters:
| Component | Setting |
|---|---|
| Optimizer | Adam |
| Learning Rate | 1e-3 |
| Scheduler | Cosine annealing (patience=10 epochs) |
| Edge Masking | 30% random (uniform across edge types) |
| Dropout Rate | 50% (spatial tasks) |
| K-NN | K=20 (both intra- and inter-dataset) |
| Model Size | 476,000 parameters |
Benchmark Metrics and Performance:
| Category | Metric | scGALA Improvement |
|---|---|---|
| Biological Conservation | ARI | 14.7-48.6% |
| Biological Conservation | NMI | 7.7-17.0% |
| Alignment | FOSCTTM | 12.4% |
| Label Transfer | Cohen's Kappa | 19.2% avg (up to 66.8%) |
| Batch Correction | Graph Connectivity | 7.5% |
| Booster Mode | Clustering Accuracy | 67.8% when wrapping other methods |
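Among the benchmark metrics, FOSCTTM (Fraction Of Samples Closer Than the True Match) is straightforward to compute from two aligned embeddings: for each cell, count how many cells in the other modality sit closer than its true counterpart. A numpy sketch of the standard definition (not code from any of the papers):

```python
import numpy as np

def foscttm(x, y):
    """FOSCTTM for aligned embeddings: row i of x matches row i of y.

    0.0 means every cell's true match is its nearest cross-modality
    neighbor (perfect alignment); random alignment gives ~0.5.
    """
    d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)  # (n, n) distances
    n = d.shape[0]
    true_d = np.diag(d)
    # Fraction of y-cells closer to x_i than its true match, and symmetrically
    frac_x = (d < true_d[:, None]).sum(axis=1) / (n - 1)
    frac_y = (d < true_d[None, :]).sum(axis=0) / (n - 1)
    return float((frac_x.mean() + frac_y.mean()) / 2)
```

With identical embeddings the score is exactly 0; shuffling one modality pushes it toward 0.5, which is why lower FOSCTTM indicates better alignment.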
GSE261228 Validation:
| Attribute | Value |
|---|---|
| Accession | GSE261228 |
| Title | Multiomic Single Cell Sequencing of MPAL (normal bone marrow PIP-Seq) |
| Organism | Homo sapiens |
| Samples | 14 adult MPAL patients |
| Technology | PIP-Seq (DNA + RNA + protein) |
| Platform | GPL24676 |
| Status | Public (since Mar 13, 2024) |
| Queue Entry | queue_GSE261228_84056a40 |
Turn 3: Cross-Paper Methods Comparison
The third query synthesizes methods from multiple papers into a structured comparison table.
lobster query --session-id research_hard \
"Now extract the methods from the CelLink paper (PMID 41335468) \
and compare them side-by-side with scGALA. I want a structured \
comparison table covering: (1) algorithmic approach, (2) cell \
count scalability, (3) modalities supported, (4) benchmark \
metrics used, (5) number of datasets tested, (6) key advantages \
and limitations. Also read the full abstract of the scCotag \
bioRxiv paper (PMID 41446270) and add it to the comparison."

The agent extracted complete methods from the second paper (CelLink, Nucleic Acids Research) and the abstract from the third (scCotag, bioRxiv), then produced a 3-way structured comparison.
Core Algorithm Comparison:
| Dimension | scGALA | CelLink | scCotag |
|---|---|---|---|
| Core Algorithm | VGAE + Graph Attention Network | Balanced OT followed by iterative unbalanced OT | Prior-informed Co-Optimal Transport + VAE |
| Learning Paradigm | Self-supervised (30% edge masking) | Optimization-based (no training) | Supervised deep learning (4,000 epochs) |
| Model Size | 476,000 parameters | N/A (optimization) | VAE: 2x256 layers |
| Hyperparameters | 11 critical | 5 parameters | 9 parameters |
| Training Requirements | GPU required | N/A (no training) | 4,000 epochs |
Scalability Comparison:
| Metric | scGALA | CelLink | scCotag |
|---|---|---|---|
| Max Cells Tested | 161,764 | 100,000 | ~10,000 |
| Runtime (10K cells) | Not reported | 2-5 min | Not reported |
| Hardware | GPU cluster | Apple M1 Pro (8-core, 16GB RAM) | GPU (likely) |
Modality Support:
| Modality | scGALA | CelLink | scCotag |
|---|---|---|---|
| scRNA + scATAC | Yes | Yes (via imputation) | Yes (primary focus) |
| CITE-seq (RNA+ADT) | Yes | Yes (4 datasets) | No |
| Spatial Transcriptomics | Yes (161K spots) | No | No |
| Spatial Proteomics (CODEX) | No | Yes (primary strength) | No |
| Tri-omics (DNA+RNA+Protein) | Yes (MPAL PIP-Seq) | No | No |
Use Case Recommendations:
| Use Case | Recommended Method | Rationale |
|---|---|---|
| Large-scale atlas (>100K cells) | scGALA | Proven on 161K cells |
| Consumer hardware (no GPU) | CelLink | CPU-only, 2-5 min for 10K cells |
| Spatial proteomics (CODEX/MIBI-TOF) | CelLink | Only method tested on imaging data |
| scRNA+scATAC integration | scCotag | Prior-informed with gene-peak overlap |
| Extreme imbalance (1:10+ ratios) | CelLink | Tested on 1:76 ratio |
| Tri-omics (DNA+RNA+Protein) | scGALA | MPAL PIP-Seq dataset |
| Boosting existing workflows | scGALA | Wraps 14 methods |
This comparison table — covering 7 dimensions with specific numbers and use-case guidance — is the kind of synthesis that typically takes a researcher days of reading and tabulation. The research agent produced it in a single conversational turn by leveraging PMC full-text extraction and structured analysis across three papers simultaneously. Key insight: scGALA is the universal booster (enhances 14 existing methods), CelLink is the CPU-friendly option for extreme imbalance, and scCotag specializes in scRNA+scATAC integration with biological priors.
Cost and Performance
| Metric | Value |
|---|---|
| Session ID | research_hard |
| Turns | 3 |
| Total Time | ~8 minutes |
| Total Cost | $0.56 |
| Total Tokens | 151,798 |
What This Demonstrates
Multi-Tool Orchestration
No single database query produces these analyses. The research agent orchestrated PubMed searches, bioRxiv preprint retrieval, PMC full-text extraction, GEO metadata validation, and cross-database accession linking — all within natural language conversations. The data expert agent managed download queue execution with 5 fallback strategies and structured failure reporting.
Cross-Database Synthesis
The hard case study required querying PubMed, bioRxiv, PMC, and GEO simultaneously, then synthesizing results across 3 papers into a structured comparison table with specific hyperparameters, benchmark metrics, and use-case recommendations. This workflow would typically require separate API scripts, manual PDF reading, and Excel tabulation over 1-2 days. Lobster compressed it into 8 minutes.
The Hard case demonstrates rapid literature survey capabilities, not formal systematic review methodology. Formal systematic reviews require a registered protocol (PROSPERO), PRISMA-compliant reporting, and explicit inclusion/exclusion criteria. Lobster AI accelerates the literature screening and synthesis phases but does not replace the methodological framework of evidence synthesis.
Structured Failure Handling
When the Visium spatial data download failed due to non-standard file formats, the system didn't produce a cryptic error — it exhausted all 5 download strategies, identified the root cause (spatial format incompatibility), and recommended specific alternative datasets. This fail-safe design ensures users understand why operations fail and what to do next.
Human vs Raw LLM vs Lobster AI
Estimates based on these case study sessions. Human researcher timing assumes manual workflows without automation.
| Task | Human Researcher | Raw LLM | Lobster AI |
|---|---|---|---|
| Search PubMed with date filters | 10-15 min | Cannot query APIs | ~30 sec |
| Extract computational methods from full text | 30-60 min | Hallucinates parameters/versions | ~1 min (PMC extraction) |
| Cross-reference PubMed + GEO accessions | 20-30 min | Cannot access multiple databases | Automatic |
| Validate GEO dataset metadata | 15-20 min | Cannot access GEO API | ~15 sec |
| Compare methods across 3 papers | 3-5 hours | Generic, misses details | ~2 min (structured synthesis) |
| Execute download with fallback strategies | 30-60 min | Cannot download | ~3 min (5 strategies automated) |
| Simple case (2 turns) | 1.5-2.5 hours | Not reliable | ~2 min, $0.26 |
| Medium case (3 turns) | 2-3 hours | Not possible | ~5 min, $0.30 |
| Hard case (3 turns) | 1-2 days | Not reliable | ~8 min, $0.56 |
Limitations
- Search strategies are not disclosed. The case study shows natural language prompts but not the actual PubMed Boolean queries constructed by the agent. Without the exact query strings, search dates, and total result counts, the searches are not independently reproducible to systematic review standards.
- Literature results are date-sensitive. PubMed, GEO, and bioRxiv indexes change daily. The specific papers and datasets returned in these case studies reflect the database state at the time of the session.
- Spatial data download not supported. The Visium spatial transcriptomics format (GSE272362) could not be downloaded despite 5 strategy attempts. Spatial data formats require specialized handling beyond standard GEO matrix downloads.
- Not a systematic review. The Hard case produces a structured methods comparison, not a PRISMA-compliant systematic review. Formal evidence synthesis requires registered protocols, explicit inclusion/exclusion criteria, and risk-of-bias assessment.
- Full-text access varies. Methods extraction from PMC full-text depends on open-access availability. Paywalled publications may only yield abstract-level information.
Reproducibility
To reproduce these analyses, install the research package and run the queries sequentially with session continuity:
pip install 'lobster-ai[full]'

Simple case:
lobster query --session-id research_simple \
"Search PubMed for the 5 most recent high-impact papers on \
CRISPR base editing in human disease therapy published in 2024-2025. \
For each paper, give me the PMID, title, journal, and a one-sentence \
summary of the key finding."

lobster query --session-id research_simple \
"For the paper on protein-nucleic acid language model-assisted adenine \
base editor design (PMID 41390734), extract the full computational methods."

Medium case:
lobster query --session-id research_medium \
"Search PubMed for recent papers on spatial transcriptomics in \
pancreatic ductal adenocarcinoma published in 2024-2025 that have \
deposited data in GEO. Find at least 3 papers with GEO accessions."

lobster query --session-id research_medium \
"Validate the metadata for GSE272362. Then prepare it for download \
and add it to the download queue."

lobster query --session-id research_medium \
"Download the GSE272362 dataset from the download queue."

Hard case:
lobster query --session-id research_hard \
"Search PubMed AND bioRxiv for papers from 2024-2025 on single-cell \
multi-omics integration methods benchmarking scRNA-seq with scATAC-seq \
or CITE-seq in tumor samples. Need at least 5 papers with GEO accessions."lobster query --session-id research_hard \
"For the scGALA paper (PMID 41298467), extract complete computational \
methods and validate GSE261228 metadata. Queue it for download."

lobster query --session-id research_hard \
"Extract methods from CelLink paper (PMID 41335468) and compare with \
scGALA. Add scCotag (PMID 41446270) abstract to the comparison."

Session continuity via --session-id ensures each turn builds on prior context. Results are stored in the .lobster_workspace/ directory and can be exported with /pipeline export.
What's Next?
Metabolomics: From LC-MS Quality Control to NMR Pathway Profiling
Metabolomics analysis across three complexity levels — LC-MS plasma QC, NMR dataset discovery, and type 2 diabetes metabolic profiling with pathway annotation.
Machine Learning: From Data Preparation to Survival Prediction
ML pipelines across three complexity levels — data preparation, biomarker discovery with bootstrap feature importance, and Cox proportional hazards survival analysis.