Research: From Literature Mining to Dataset Discovery
Literature search and dataset discovery across three complexity levels — CRISPR base editing review, spatial transcriptomics datasets, and single-cell multi-omics method comparison.
Scientific literature grows exponentially — PubMed adds 1.5 million citations per year, bioRxiv posts 250+ preprints daily, and GEO accumulates 50,000+ datasets annually. A bioinformatics researcher trying to survey a fast-moving field faces hours of manual database queries, PDF downloads, and metadata extraction. This case study follows Lobster AI's research agents through three increasingly complex literature discovery tasks — from simple publication searches to multi-database rapid literature surveys with cross-paper methods comparison and dataset validation.
Session context: Results generated February 2026 using lobster-ai 1.0.12 on AWS Bedrock (Claude Sonnet 4.5). External databases queried: PubMed, PMC (full-text), GEO, bioRxiv. Total cost: $1.12 across 3 case studies (8 turns). Literature databases are updated daily — re-running these queries will return different papers and datasets as new publications are indexed. Session files preserving exact results are stored in .lobster_workspace/ for reproducibility. This case study demonstrates analytical workflows, not independently validated findings.
Agents and Data Sources
This analysis uses the lobster-research package, which provides two complementary agents with distinct capabilities:
| Agent | Role | Network Access |
|---|---|---|
research_agent | Literature search, publication analysis, metadata extraction | Online (PubMed, PMC, bioRxiv, GEO APIs) |
data_expert_agent | Dataset download execution, modality detection, data loading | Offline (executes from download queue only) |
The research_agent has no child agents — complexity is measured by query breadth and tool orchestration rather than parent-child delegation. External APIs queried during sessions: PubMed (literature search), PMC (full-text extraction), bioRxiv (preprints), GEO (dataset metadata and downloads), and SRA (sequence read archives).
The research workflow runs in two phases: research_agent performs all online operations (searching, fetching, validating), then hands off to data_expert_agent for offline downloads from the queue. This separation ensures reproducibility — data downloads can be retried, audited, or executed on different infrastructure without re-querying external APIs.
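The online/offline split can be sketched as plain data flowing between two phases. This is an illustrative sketch, not Lobster's actual internals; the function names and queue-entry fields are assumptions:

```python
def online_phase(accessions):
    # Stand-in for research_agent: in the real system this is where GEO
    # metadata would be validated via the API. Here we only emit queue
    # entries as plain, serializable data.
    return [{"accession": a, "strategy": "MATRIX_FIRST"} for a in accessions]

def offline_phase(queue, fetch):
    # Stand-in for data_expert_agent: consumes the queue with an injected
    # fetch function, so downloads can be retried or replayed on different
    # infrastructure without re-querying any online API.
    return {entry["accession"]: fetch(entry) for entry in queue}

queue = online_phase(["GSE272362"])
results = offline_phase(queue, lambda e: f"downloaded:{e['accession']}")
```

Because the queue is inert data, the offline phase can be swapped out (mock fetcher in tests, real downloader in production) without touching the online phase.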
Simple: CRISPR Base Editing Literature Review
The first scenario demonstrates rapid literature survey and computational methods extraction from a fast-moving field with rich PubMed coverage.
Turn 1: Search Recent High-Impact Papers
The first query establishes the landscape of recent CRISPR base editing therapeutics publications.
lobster query --session-id research_simple \
"Search PubMed for the 5 most recent high-impact papers on \
CRISPR base editing in human disease therapy published in 2024-2025. \
For each paper, give me the PMID, title, journal, and a one-sentence \
summary of the key finding."

The research_agent queried PubMed with date filters and returned 5 papers from December 2024 to December 2025, revealing a clear trend toward prime editing dominance (4 of 5 papers) with emphasis on clinical translation from rare disease cohorts.
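Under the hood, a date-filtered PubMed search maps onto NCBI's E-utilities esearch endpoint. The sketch below only builds the request URL (no network call is made); the search term is a hypothetical reconstruction, since the agent's actual Boolean query is not disclosed:

```python
from urllib.parse import urlencode

def pubmed_esearch_url(term, mindate, maxdate, retmax=5):
    """Build an NCBI E-utilities esearch URL with a publication-date filter."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {
        "db": "pubmed",
        "term": term,          # Boolean query string
        "datetype": "pdat",    # filter on publication date
        "mindate": mindate,
        "maxdate": maxdate,
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{base}?{urlencode(params)}"

# Hypothetical query approximating the natural-language request above
url = pubmed_esearch_url(
    '("base editing" OR "prime editing") AND therapy[Title/Abstract]',
    "2024/01/01", "2025/12/31",
)
```

Fetching the URL returns PMIDs as JSON, which an efetch call can then expand into titles, journals, and abstracts.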
| # | PMID | Journal | Key Finding |
|---|---|---|---|
| 1 | 41390734 | Nature Communications | ML-designed compact ABE: 27% size reduction, 133.5-fold precision improvement |
| 2 | 41414712 | HGG Advances | Prime editing corrected de novo GDF11 nonsense mutation from patient |
| 3 | 41421338 | Molecular Cell | Novel DIMMER circuits reduce off-target editing by two orders of magnitude |
| 4 | 41455771 | Cell Death & Disease | Prime editing modeled multiple eIF2B pathogenic mutations in iPSCs |
| 5 | 41465342 | Int J Mol Sci | Template-jumping prime editing targets F9 gene (hemophilia B therapy) |
Turn 2: Extract Computational Methods
The second query dives deep into the computational pipeline of the ML-designed adenine base editor.
lobster query --session-id research_simple \
"For the paper on protein-nucleic acid language model-assisted adenine \
base editor design (PMID 41390734), extract the full computational methods: \
what software, algorithms, parameters, and validation approaches were used? \
Also get the full abstract."

The agent extracted a complete computational pipeline from the Nature Communications paper, revealing a sophisticated 5-step validation cascade that progressively filtered 150 generated sequences to 20 experimental candidates.
ML Architecture:
| Component | Details |
|---|---|
| Base Model | ESM-2 (transfer learning, 650M parameters) |
| Novel Components | Nucleic acid encoder, editing position encoder, masked autoregressive decoder |
| Training Data | 34,255 TadA sequences (UniProtKB) + 27 TadA-8e variants |
| Pre-training Optimizer | Adam (beta1=0.9, beta2=0.999, lr=1e-06) |
| Generation | Temperature=1.0, top-p=0.9, mask strategy <5 consecutive tokens |
| Output | 150 sequences (73 mutations, 39 insertions, 38 truncations) |
Multi-Tool Validation Pipeline:
| Step | Tool | Filter/Metric |
|---|---|---|
| 1 | AlphaFold2 (ColabFold v1.5.5) | pLDDT >= 84 |
| 2 | ESM-1v | Mean log-likelihood (21/150 > wild-type) |
| 3 | ESM-IF | Structure-based sequence likelihood |
| 4 | Rosetta | Energy within 100 units, charge within 50 units of WT |
| 5 | AlphaFold3 | Binary complex prediction (protein + ssDNA) |
The 5-step validation cascade demonstrates how modern protein engineering combines language models (ESM-2, ESM-1v), structure prediction (AlphaFold2/3), physics-based scoring (Rosetta), and inverse folding (ESM-IF) in a multi-tool consensus workflow. This level of methods detail — including exact hyperparameters, training data sizes, and software versions — is typically scattered across main text, supplementary materials, and GitHub repositories. The research agent extracted it from PMC full-text in under 1 minute.
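The cascade pattern itself is simple to express: each step is a named predicate, candidates failing any step are dropped, and a per-step survivor count is recorded. A minimal sketch where the thresholds mirror steps 1-2 of the table but the candidate scores are fabricated for illustration:

```python
def filter_cascade(candidates, steps):
    """Apply named filters in order, recording survivors after each step."""
    report = []
    for name, keep in steps:
        candidates = [c for c in candidates if keep(c)]
        report.append((name, len(candidates)))
    return candidates, report

# Fabricated candidates standing in for generated TadA variants
candidates = [
    {"id": "seq1", "plddt": 90.0, "esm1v_vs_wt": +0.20},
    {"id": "seq2", "plddt": 80.0, "esm1v_vs_wt": +0.50},  # fails pLDDT
    {"id": "seq3", "plddt": 86.0, "esm1v_vs_wt": -0.10},  # fails ESM-1v
    {"id": "seq4", "plddt": 92.0, "esm1v_vs_wt": +0.30},
]
steps = [
    ("AlphaFold2 pLDDT >= 84", lambda c: c["plddt"] >= 84),
    ("ESM-1v above wild-type", lambda c: c["esm1v_vs_wt"] > 0),
]
survivors, report = filter_cascade(candidates, steps)
```

The per-step report is what makes a cascade auditable: it shows exactly where the 150-to-20 attrition happened.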
Cost and Performance
| Metric | Value |
|---|---|
| Session ID | research_simple |
| Turns | 2 |
| Total Time | ~2 minutes |
| Total Cost | $0.26 |
| Total Tokens | 71,601 |
Medium: Spatial Transcriptomics Dataset Discovery Pipeline
The second scenario exercises the full research-to-download pipeline: literature search, GEO cross-referencing, metadata validation, and download queue management.
Turn 1: Search Spatial Transcriptomics Papers with GEO Data
The first query targets a clinically relevant cancer domain with spatial profiling requirements.
lobster query --session-id research_medium \
"Search PubMed for recent papers on spatial transcriptomics in \
pancreatic ductal adenocarcinoma (PDAC) tumor microenvironment \
published in 2024-2025. I need papers that have deposited their \
spatial transcriptomics data in GEO. Find at least 3 papers and \
for each give me the PMID, title, journal, and any GEO accession \
numbers mentioned."

The agent identified five spatial transcriptomics studies in PDAC, each with a confirmed GEO deposit, spanning cancer-associated fibroblasts, tertiary lymphoid structures, and primary-metastatic heterogeneity; four are tabulated below.
| # | PMID | Journal | GEO | Samples | Platform |
|---|---|---|---|---|---|
| 1 | 40154487 | Cancer Cell (Mar 2025) | GSE274103 | 5 PDAC (FFPE) | 10x Visium |
| 2 | 40815230 | Cancer Immunol Res (Nov 2025) | GSE277116 | 28 PDAC | 10x Visium |
| 3 | 39294496 | Nature Genetics (Sep 2024) | GSE272362 | 30 (10 primary, 3 normal, 12 hepatic mets, 5 LN mets) | 10x Visium |
| 4 | 38798691 | Research Square (May 2024) | GSE240078 | 36 (23 NAT, 13 naive) | GeoMx DSP |
The strongest candidate for downstream analysis is GSE272362 from Nature Genetics — the largest cohort (30 specimens) with matched primary and metastatic samples, providing a unique spatial atlas of tumor microenvironment heterogeneity across anatomical sites.
Turn 2: Validate Dataset Metadata and Queue Download
The second query pre-flights the dataset before committing to a large download.
lobster query --session-id research_medium \
"Validate the metadata for GSE272362 (the primary vs metastatic \
PDAC spatial atlas from Nature Genetics). Check: how many samples, \
what platform, what organism, any supplementary files. Then prepare \
it for download and add it to the download queue."

The research agent validated GSE272362's metadata through the GEO API, confirming 100% field completeness and producing a queue entry with a recommended download strategy.
| Attribute | Value |
|---|---|
| Accession | GSE272362 |
| Organism | Homo sapiens |
| Total specimens | 30 (10 primary, 3 normal, 12 hepatic mets, 5 LN mets) |
| Total spots | 91,496 |
| Platform | GPL24676 (10x Visium) |
| Public since | July 31, 2024 |
| Supplementary files | 2 |
| Metadata completeness | 100% |
| Download strategy | MATRIX_FIRST (85% confidence) |
| Queue entry ID | queue_GSE272362_3bb5f772 |
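The queue entry ID above (queue_GSE272362_3bb5f772) follows a recognizable accession-plus-short-hash pattern. A sketch of one way such an ID might be derived; the SHA-1 scheme here is an assumption, not Lobster's documented behavior:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class QueueEntry:
    accession: str
    strategy: str
    confidence: float
    entry_id: str = field(init=False)

    def __post_init__(self):
        # Short, deterministic suffix so re-queuing the same accession with
        # the same strategy yields the same entry ID (hashing scheme assumed).
        digest = hashlib.sha1(
            f"{self.accession}:{self.strategy}".encode()
        ).hexdigest()[:8]
        self.entry_id = f"queue_{self.accession}_{digest}"

entry = QueueEntry("GSE272362", "MATRIX_FIRST", confidence=0.85)
```

A deterministic ID lets the offline phase deduplicate retries of the same download request.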
Turn 3: Execute Download (Graceful Degradation)
The third query attempts to download the dataset from the prepared queue entry.
lobster query --session-id research_medium \
"Download the GSE272362 dataset from the download queue \
(entry queue_GSE272362_3bb5f772). Once downloaded, list the \
available modalities and show me the basic shape and content \
of the data."

The data_expert_agent exhausted 5 download strategies (MATRIX_FIRST, SUPPLEMENTARY_FIRST, H5_FIRST, RAW_FIRST, MATRIX_FIRST+union), downloading approximately 260 MB of supplementary files. However, Visium spatial transcriptomics data uses a non-standard multi-file structure (spatial coordinates, tissue images, spot-level matrices in custom subdirectories) that the current GEO download pipeline cannot parse as standard count matrices.
| Strategy Attempted | Result | Reason |
|---|---|---|
| MATRIX_FIRST | Failed | No processed matrix in strategy config |
| SUPPLEMENTARY_FIRST | Failed | Spatial format (images, coordinates) not parseable as count matrix |
| H5_FIRST | Failed | H5AD not available in standard GEO format |
| RAW_FIRST | Failed | Raw files require spatial-specific processing pipeline |
| MATRIX_FIRST + union | Failed | Same underlying format issue |
Rather than failing silently, the agent provided structured recovery suggestions:
| Option | Action | Rationale |
|---|---|---|
| 1 | Manual investigation of GSE272362 | Check exact file formats on GEO web |
| 2 | Try GSE274103 or GSE277116 | Alternative PDAC spatial datasets |
| 3 | Wait and retry | Possible GEO server issues |
This demonstrates Lobster's fail-safe design for data retrieval. When automated download fails, the system exhausts all available strategies, provides a structured diagnosis, and recommends actionable alternatives rather than leaving the user with a generic error message. The agent correctly identified a current limitation (Visium spatial format support) and suggested two peer datasets as alternatives.
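The fallback behavior can be modeled as a loop over strategies that returns either data or a structured failure report. Illustrative only: the strategy names match the table, but the implementation is assumed:

```python
def download_with_fallback(accession, strategies):
    """Try each strategy in order; on total failure, return the diagnosis."""
    failures = []
    for name, attempt in strategies:
        try:
            return attempt(accession), failures
        except RuntimeError as err:
            failures.append({"strategy": name, "reason": str(err)})
    return None, failures  # exhausted: caller receives every failure reason

def unsupported(reason):
    # Helper producing a strategy that always fails with a given diagnosis
    def attempt(accession):
        raise RuntimeError(reason)
    return attempt

strategies = [
    ("MATRIX_FIRST", unsupported("no processed matrix")),
    ("SUPPLEMENTARY_FIRST", unsupported("spatial format not parseable")),
]
data, failures = download_with_fallback("GSE272362", strategies)
```

Returning the accumulated failure list, rather than raising on the last attempt, is what enables the structured recovery table shown above.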
Cost and Performance
| Metric | Value |
|---|---|
| Session ID | research_medium |
| Turns | 3 |
| Total Time | ~5 minutes (includes 5 download strategy attempts) |
| Total Cost | $0.30 |
| Total Tokens | 91,544 |
Hard: Single-Cell Multi-Omics Integration Review
The third scenario is the most demanding research workflow — combining literature search across multiple databases, full-text analysis, cross-paper synthesis, and dataset validation in a single session.
Turn 1: Multi-Database Systematic Search
The first query executes a comprehensive search across PubMed and bioRxiv with specific filtering criteria.
lobster query --session-id research_hard \
"I'm conducting a comprehensive methods comparison for single-cell \
multi-omics integration methods in cancer immunology. Search PubMed \
AND bioRxiv for papers from 2024-2025 that describe new computational \
methods or benchmarks for integrating scRNA-seq with scATAC-seq \
or CITE-seq data in tumor samples. I need at least 5 papers. \
For each: PMID/DOI, title, journal, and any GEO/SRA/Zenodo \
accession numbers for deposited data. Prioritize papers that \
benchmarked multiple integration methods."

The research agent executed 7 targeted searches across PubMed and bioRxiv, identifying 8 papers and triaging them by benchmark scope; the six highest-priority papers are summarized below.
Priority-Ranked Paper Summary:
| Priority | Title | PMID | Journal | Methods Benchmarked | Key Accessions |
|---|---|---|---|---|---|
| High | scGALA: Graph link prediction cell alignment | 41298467 | Nat Commun (2025) | 14 methods | GSE261228, GSE232073, GSE230827, GSE232074 |
| High | CelLink: Weak feature linkage integration | 41335468 | Nucleic Acids Res (2025) | 11 methods | PANC-DB, SeuratData |
| High | scCotag: Co-optimal transport integration | 41446270 | bioRxiv (2025) | 6 methods | GitHub (pending) |
| Medium | Comparison of scRNA+scATAC integration methods | 41675510 | Quant Biol (2025) | Multiple | Not extracted |
| Medium | SpatialEx: Histology-anchored integration | 41407925 | Nat Methods (2025) | Novel method | Supplementary |
| Medium | Glioblastoma scMultiome radiation response | 41573875 | bioRxiv (2025) | Application | Pending |
Three papers stood out for comprehensive method comparisons: scGALA benchmarked 14 integration methods, CelLink benchmarked 11, and scCotag benchmarked 6. Four GEO accessions and multiple public repository links were extracted directly from search results.
Turn 2: Methods Extraction + Dataset Validation
The second query extracts detailed computational methods from the top-ranked paper and validates its associated dataset.
lobster query --session-id research_hard \
"For the scGALA paper (PMID 41298467), extract the complete \
computational methods section including: (1) all software and \
their versions, (2) hyperparameters for the graph attention \
network, (3) benchmark evaluation metrics, (4) computing \
resources used. Also validate the metadata for GSE261228 — \
check sample count, organism, platform, and data availability. \
Queue it for download if it looks good."

The agent extracted the complete computational architecture from the Nature Communications full text and simultaneously validated the associated tri-omics MPAL dataset.
scGALA Critical Hyperparameters:
| Component | Setting |
|---|---|
| Optimizer | Adam |
| Learning Rate | 1e-3 |
| Scheduler | Cosine annealing (patience=10 epochs) |
| Edge Masking | 30% random (uniform across edge types) |
| Dropout Rate | 50% (spatial tasks) |
| K-NN | K=20 (both intra- and inter-dataset) |
| Model Size | 476,000 parameters |
Benchmark Metrics and Performance:
| Category | Metric | scGALA Improvement |
|---|---|---|
| Biological Conservation | ARI | 14.7-48.6% |
| Biological Conservation | NMI | 7.7-17.0% |
| Alignment | FOSCTTM | 12.4% |
| Label Transfer | Cohen's Kappa | 19.2% avg (up to 66.8%) |
| Batch Correction | Graph Connectivity | 7.5% |
| Booster Mode | Clustering Accuracy | 67.8% when wrapping other methods |
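Among the benchmark metrics, FOSCTTM (Fraction Of Samples Closer Than the True Match) is straightforward to compute from two aligned embeddings: for each cell, count how many cells in the other modality sit closer than its true counterpart. A numpy sketch of the standard definition (not code from any of the papers):

```python
import numpy as np

def foscttm(x, y):
    """FOSCTTM for aligned embeddings: row i of x matches row i of y.

    0.0 means every cell's true match is its nearest cross-modality
    neighbor (perfect alignment); random alignment gives ~0.5.
    """
    d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)  # (n, n) distances
    n = d.shape[0]
    true_d = np.diag(d)
    # Fraction of y-cells closer to x_i than its true match, and symmetrically
    frac_x = (d < true_d[:, None]).sum(axis=1) / (n - 1)
    frac_y = (d < true_d[None, :]).sum(axis=0) / (n - 1)
    return float((frac_x.mean() + frac_y.mean()) / 2)
```

With identical embeddings the score is exactly 0; shuffling one modality pushes it toward 0.5, which is why lower FOSCTTM indicates better alignment.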
GSE261228 Validation:
| Attribute | Value |
|---|---|
| Accession | GSE261228 |
| Title | Multiomic Single Cell Sequencing of MPAL (normal bone marrow PIP-Seq) |
| Organism | Homo sapiens |
| Samples | 14 adult MPAL patients |
| Technology | PIP-Seq (DNA + RNA + protein) |
| Platform | GPL24676 |
| Status | Public (since Mar 13, 2024) |
| Queue Entry | queue_GSE261228_84056a40 |
Turn 3: Cross-Paper Methods Comparison
The third query synthesizes methods from multiple papers into a structured comparison table.
lobster query --session-id research_hard \
"Now extract the methods from the CelLink paper (PMID 41335468) \
and compare them side-by-side with scGALA. I want a structured \
comparison table covering: (1) algorithmic approach, (2) cell \
count scalability, (3) modalities supported, (4) benchmark \
metrics used, (5) number of datasets tested, (6) key advantages \
and limitations. Also read the full abstract of the scCotag \
bioRxiv paper (PMID 41446270) and add it to the comparison."

The agent extracted complete methods from the second paper (CelLink, Nucleic Acids Research) and the abstract from the third (scCotag, bioRxiv), then produced a 3-way structured comparison.
Core Algorithm Comparison:
| Dimension | scGALA | CelLink | scCotag |
|---|---|---|---|
| Core Algorithm | VGAE + Graph Attention Network | Balanced OT followed by iterative unbalanced OT | Prior-informed Co-Optimal Transport + VAE |
| Learning Paradigm | Self-supervised (30% edge masking) | Optimization-based (no training) | Supervised deep learning (4,000 epochs) |
| Model Size | 476,000 parameters | N/A (optimization) | VAE: 2x256 layers |
| Hyperparameters | 11 critical | 5 parameters | 9 parameters |
| Training Requirements | GPU required | N/A (no training) | 4,000 epochs |
Scalability Comparison:
| Metric | scGALA | CelLink | scCotag |
|---|---|---|---|
| Max Cells Tested | 161,764 | 100,000 | ~10,000 |
| Runtime (10K cells) | Not reported | 2-5 min | Not reported |
| Hardware | GPU cluster | Apple M1 Pro (8-core, 16GB RAM) | GPU (likely) |
Modality Support:
| Modality | scGALA | CelLink | scCotag |
|---|---|---|---|
| scRNA + scATAC | Yes | Yes (via imputation) | Yes (primary focus) |
| CITE-seq (RNA+ADT) | Yes | Yes (4 datasets) | No |
| Spatial Transcriptomics | Yes (161K spots) | No | No |
| Spatial Proteomics (CODEX) | No | Yes (primary strength) | No |
| Tri-omics (DNA+RNA+Protein) | Yes (MPAL PIP-Seq) | No | No |
Use Case Recommendations:
| Use Case | Recommended Method | Rationale |
|---|---|---|
| Large-scale atlas (>100K cells) | scGALA | Proven on 161K cells |
| Consumer hardware (no GPU) | CelLink | CPU-only, 2-5 min for 10K cells |
| Spatial proteomics (CODEX/MIBI-TOF) | CelLink | Only method tested on imaging data |
| scRNA+scATAC integration | scCotag | Prior-informed with gene-peak overlap |
| Extreme imbalance (1:10+ ratios) | CelLink | Tested on 1:76 ratio |
| Tri-omics (DNA+RNA+Protein) | scGALA | MPAL PIP-Seq dataset |
| Boosting existing workflows | scGALA | Wraps 14 methods |
This comparison table — covering 7 dimensions with specific numbers and use-case guidance — is the kind of synthesis that typically takes a researcher days of reading and tabulation. The research agent produced it in a single conversational turn by leveraging PMC full-text extraction and structured analysis across three papers simultaneously. Key insight: scGALA is the universal booster (enhances 14 existing methods), CelLink is the CPU-friendly option for extreme imbalance, and scCotag specializes in scRNA+scATAC integration with biological priors.
Cost and Performance
| Metric | Value |
|---|---|
| Session ID | research_hard |
| Turns | 3 |
| Total Time | ~8 minutes |
| Total Cost | $0.56 |
| Total Tokens | 151,798 |
What This Demonstrates
Multi-Tool Orchestration
No single database query produces these analyses. The research agent orchestrated PubMed searches, bioRxiv preprint retrieval, PMC full-text extraction, GEO metadata validation, and cross-database accession linking — all within natural language conversations. The data expert agent managed download queue execution with 5 fallback strategies and structured failure reporting.
Cross-Database Synthesis
The hard case study required querying PubMed, bioRxiv, PMC, and GEO simultaneously, then synthesizing results across 3 papers into a structured comparison table with specific hyperparameters, benchmark metrics, and use-case recommendations. This workflow would typically require separate API scripts, manual PDF reading, and Excel tabulation over 1-2 days. Lobster compressed it into 8 minutes.
The Hard case demonstrates rapid literature survey capabilities, not formal systematic review methodology. Formal systematic reviews require a registered protocol (PROSPERO), PRISMA-compliant reporting, and explicit inclusion/exclusion criteria. Lobster AI accelerates the literature screening and synthesis phases but does not replace the methodological framework of evidence synthesis.
Structured Failure Handling
When the Visium spatial data download failed due to non-standard file formats, the system didn't produce a cryptic error — it exhausted all 5 download strategies, identified the root cause (spatial format incompatibility), and recommended specific alternative datasets. This fail-safe design ensures users understand why operations fail and what to do next.
Human vs Raw LLM vs Lobster AI
Estimates based on these case study sessions. Human researcher timing assumes manual workflows without automation.
| Task | Human Researcher | Raw LLM | Lobster AI |
|---|---|---|---|
| Search PubMed with date filters | 10-15 min | Cannot query APIs | ~30 sec |
| Extract computational methods from full text | 30-60 min | Hallucinates parameters/versions | ~1 min (PMC extraction) |
| Cross-reference PubMed + GEO accessions | 20-30 min | Cannot access multiple databases | Automatic |
| Validate GEO dataset metadata | 15-20 min | Cannot access GEO API | ~15 sec |
| Compare methods across 3 papers | 3-5 hours | Generic, misses details | ~2 min (structured synthesis) |
| Execute download with fallback strategies | 30-60 min | Cannot download | ~3 min (5 strategies automated) |
| Simple case (2 turns) | 1.5-2.5 hours | Not reliable | ~2 min, $0.26 |
| Medium case (3 turns) | 2-3 hours | Not possible | ~5 min, $0.30 |
| Hard case (3 turns) | 1-2 days | Not reliable | ~8 min, $0.56 |
Limitations
- Search strategies are not disclosed. The case study shows natural language prompts but not the actual PubMed Boolean queries constructed by the agent. Without the exact query strings, search dates, and total result counts, the searches are not independently reproducible to systematic review standards.
- Literature results are date-sensitive. PubMed, GEO, and bioRxiv indexes change daily. The specific papers and datasets returned in these case studies reflect the database state at the time of the session.
- Spatial data download not supported. The Visium spatial transcriptomics format (GSE272362) could not be downloaded despite 5 strategy attempts. Spatial data formats require specialized handling beyond standard GEO matrix downloads.
- Not a systematic review. The Hard case produces a structured methods comparison, not a PRISMA-compliant systematic review. Formal evidence synthesis requires registered protocols, explicit inclusion/exclusion criteria, and risk-of-bias assessment.
- Full-text access varies. Methods extraction from PMC full-text depends on open-access availability. Paywalled publications may only yield abstract-level information.
Reproducibility
To reproduce these analyses, install the research package and run the queries sequentially with session continuity:
pip install 'lobster-ai[full]'

Simple case:
lobster query --session-id research_simple \
"Search PubMed for the 5 most recent high-impact papers on \
CRISPR base editing in human disease therapy published in 2024-2025. \
For each paper, give me the PMID, title, journal, and a one-sentence \
summary of the key finding."

lobster query --session-id research_simple \
"For the paper on protein-nucleic acid language model-assisted adenine \
base editor design (PMID 41390734), extract the full computational methods."

Medium case:
lobster query --session-id research_medium \
"Search PubMed for recent papers on spatial transcriptomics in \
pancreatic ductal adenocarcinoma published in 2024-2025 that have \
deposited data in GEO. Find at least 3 papers with GEO accessions."

lobster query --session-id research_medium \
"Validate the metadata for GSE272362. Then prepare it for download \
and add it to the download queue."

lobster query --session-id research_medium \
"Download the GSE272362 dataset from the download queue."

Hard case:
lobster query --session-id research_hard \
"Search PubMed AND bioRxiv for papers from 2024-2025 on single-cell \
multi-omics integration methods benchmarking scRNA-seq with scATAC-seq \
or CITE-seq in tumor samples. Need at least 5 papers with GEO accessions."lobster query --session-id research_hard \
"For the scGALA paper (PMID 41298467), extract complete computational \
methods and validate GSE261228 metadata. Queue it for download."

lobster query --session-id research_hard \
"Extract methods from CelLink paper (PMID 41335468) and compare with \
scGALA. Add scCotag (PMID 41446270) abstract to the comparison."

Session continuity via --session-id ensures each turn builds on prior context. Results are stored in the .lobster_workspace/ directory and can be exported with /pipeline export.
What's Next?
Metabolomics: From LC-MS Quality Control to NMR Pathway Profiling
Metabolomics analysis across three complexity levels — LC-MS plasma QC, NMR dataset discovery, and type 2 diabetes metabolic profiling with pathway annotation.
Machine Learning: From Data Preparation to Survival Prediction
ML pipelines across three complexity levels — data preparation, biomarker discovery with bootstrap feature importance, and Cox proportional hazards survival analysis.