Omics-OS Docs

Research

Data discovery and research planning agents

lobster-research
Free · Beginner

Research planning: dataset discovery across GEO, PRIDE, MetaboLights, and more — with dynamic database routing

Input
GEO IDs, MTBLS IDs, PXD IDs, Keywords, H5AD, CSV, MTX
Output
Dataset Metadata, Loaded Data, Analysis Plans, Sample Info
Agents (2)
├── research_agent: Literature discovery, multi-database search (online)
└── data_expert_agent: Download execution and data loading (offline)
pip install lobster-research

Agents

research_agent

Primary interface for research planning and data discovery. Uses dynamic database routing via OmicsTypeRegistry to automatically route queries to the best databases based on the omics type.

Capabilities:

  • Research question decomposition
  • Dynamic database routing (queries route to preferred databases based on omics type)
  • Dataset recommendation across GEO, PRIDE, MetaboLights, and more
  • Literature-informed analysis planning
  • Workflow orchestration

data_expert

Specialized agent for data loading, exploration, and management.

Capabilities:

  • Multi-database dataset download (GEO, PRIDE, MetaboLights, SRA)
  • Data format detection and loading
  • Metadata exploration
  • Data quality assessment

Example:

"Find MetaboLights datasets for liver disease metabolomics"

Services

lobster-research bundles a data management service with the package:

| Service | Purpose |
| --- | --- |
| ModalityDetectionService | Auto-detect data modality (scRNA-seq, bulk, proteomics, etc.) |

The service is installed automatically with the agent package.

Example Workflows

Dataset Discovery

User: I'm studying the immune response in COVID-19 patients.
      What single-cell RNA-seq datasets are available?

[research_agent]
- Queries GEO via fast_dataset_search
- Returns 10+ datasets with titles, sample counts, and technology
- Ranks by relevance to research question
- Recommends top datasets for loading
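
The internals of fast_dataset_search are not documented on this page. As a rough illustration only, a keyword search against GEO can be expressed with the public NCBI E-utilities esearch endpoint. The function name, the keyword-joining rule, and the default result count below are assumptions made for this sketch, not the package's API.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint; "gds" is the GEO DataSets database.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_search_url(keywords, max_results=10):
    """Build an esearch URL for GEO series matching all keywords (hypothetical helper)."""
    # AND the keywords together and restrict hits to series (GSE) records.
    term = " AND ".join(f"({kw})" for kw in keywords) + " AND gse[Entry Type]"
    params = {"db": "gds", "term": term, "retmax": max_results, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Build (but do not send) a query for the example question above.
url = build_geo_search_url(["COVID-19", "single cell RNA-seq"])
print(url)
```

Fetching the URL returns a list of record IDs; titles, sample counts, and technology would need a follow-up esummary call, which is omitted here.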

Data Loading

User: Load GSE150728 and show me what's in it

[data_expert_agent]
- Downloads dataset files from GEO
- Auto-detects format (H5AD, MTX, CSV)
- Loads into DataManagerV2
- Reports: ~8,000 cells, ~30,000 genes, sample metadata
- Suggests next steps (QC, clustering)
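
The auto-detection step can be pictured as a lookup on the file suffix. The sketch below is an assumption about how such a check might look; the real detector is not described here and may also inspect file contents. The table and function names are hypothetical.

```python
from pathlib import Path

# Hypothetical suffix-to-format table covering the formats named above.
FORMAT_BY_SUFFIX = {
    ".h5ad": "H5AD",
    ".mtx": "MTX",
    ".csv": "CSV",
}
COMPRESSION_SUFFIXES = {".gz", ".bz2"}


def detect_format(filename):
    """Return a format label for a file name, or None if it is not recognized."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    # Drop trailing compression suffixes so "matrix.mtx.gz" is seen as MTX.
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes:
        return None
    return FORMAT_BY_SUFFIX.get(suffixes[-1])


for name in ["matrix.mtx.gz", "counts.csv", "atlas.h5ad", "README"]:
    print(name, "->", detect_format(name))
```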

Metabolomics Discovery

User: Search MetaboLights for liver disease metabolomics data

[research_agent]
- Routes to MetaboLights first (preferred DB for metabolomics queries)
- Falls back to Metabolomics Workbench, then GEO
- Recognizes MTBLS* and ST* accession types
- Returns studies with platform info (LC-MS, GC-MS, NMR)
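
Accession recognition of this kind is commonly done with prefix patterns. The following is a minimal sketch under the assumption that each accession is a known prefix followed by digits; the actual patterns used by research_agent are not shown on this page, and the names below are hypothetical.

```python
import re

# Assumed patterns: prefix plus digits. Real accession grammars may be stricter.
ACCESSION_PATTERNS = {
    "MetaboLights": re.compile(r"^MTBLS\d+$"),
    "Metabolomics Workbench": re.compile(r"^ST\d+$"),
    "GEO": re.compile(r"^GSE\d+$"),
    "PRIDE": re.compile(r"^PXD\d+$"),
}


def classify_accession(accession):
    """Return the database an accession appears to belong to, or None."""
    token = accession.strip().upper()
    for database, pattern in ACCESSION_PATTERNS.items():
        if pattern.match(token):
            return database
    return None


print(classify_accession("MTBLS1"))
print(classify_accession("ST000001"))
```

Free-text queries such as "liver disease" match no pattern and return None, which is where keyword search would take over.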

End-to-End Discovery and Analysis

User: Find COVID-19 scRNA-seq data and run clustering

[research_agent → data_expert_agent → transcriptomics_expert]
- research_agent discovers datasets, recommends GSE150728
- data_expert_agent downloads and loads data
- transcriptomics_expert runs QC → normalization → clustering → UMAP
- Entire pipeline runs in a single session

Download Queue

Large datasets (>100 MB) are queued for background download so the chat remains responsive:

User: Download GSE158055

[data_expert_agent]
- Queues dataset for background download
- Chat remains responsive during transfer
- Automatic retry on network failures
- Modality created when download completes

The queue also supports concurrent downloads, so several large datasets can transfer at once.
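
To make the queueing behavior concrete, here is a minimal sketch of a background worker with retry, using only the Python standard library. It is not the package's implementation: the retry budget, the use of binary megabytes for the 100 MB threshold, and every function name are assumptions, and the transfer itself is simulated so no network is touched.

```python
import queue
import threading

LARGE_DATASET_BYTES = 100 * 1024 * 1024  # the >100 MB threshold (binary MB assumed)
MAX_ATTEMPTS = 3  # assumed retry budget


def needs_background_download(size_bytes):
    """Decide whether a dataset is large enough to be queued."""
    return size_bytes > LARGE_DATASET_BYTES


def download_with_retry(fetch, accession, max_attempts=MAX_ATTEMPTS):
    """Call fetch(accession), retrying on ConnectionError up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(accession)
        except ConnectionError:
            if attempt == max_attempts:
                raise


def worker(jobs, fetch, results):
    """Drain the queue in a background thread so the caller is never blocked."""
    while True:
        accession = jobs.get()
        if accession is None:  # sentinel: stop the worker
            jobs.task_done()
            break
        try:
            results[accession] = download_with_retry(fetch, accession)
        except ConnectionError:
            results[accession] = None  # gave up after all attempts
        finally:
            jobs.task_done()


attempts = {"count": 0}


def flaky_fetch(accession):
    # Simulated transfer: fails once, then succeeds.
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise ConnectionError("simulated network failure")
    return f"{accession} downloaded"


jobs, results = queue.Queue(), {}
thread = threading.Thread(target=worker, args=(jobs, flaky_fetch, results), daemon=True)
thread.start()
jobs.put("GSE158055")  # returns immediately; the worker handles the transfer
jobs.put(None)
jobs.join()
print(results)
```

A production queue would normally wait between attempts (for example with exponential backoff) and run several workers for concurrent downloads; both are left out to keep the sketch short.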

Batch Processing (Publication Queue)

For systematic literature reviews, research_agent can populate a publication queue for batch processing:

User: Add these papers to the publication queue: PMID 30643258, PMID 31018141

[research_agent]
- Creates publication queue entries for each PMID
- Extracts GEO/SRA identifiers via NCBI E-Link
- Sets status to HANDOFF_READY for metadata_assistant

This is a secondary workflow for importing curated publication lists (RIS files, systematic reviews). For interactive dataset discovery, use the primary fast_dataset_search workflow above.
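
The E-Link step can be sketched with the public NCBI elink endpoint, which maps PubMed IDs to linked records in another Entrez database. The helper names and the PMID-matching regular expression are assumptions for illustration; how research_agent actually parses messages and stores queue entries is not described here.

```python
import re
from urllib.parse import urlencode

# Public NCBI E-utilities link endpoint.
ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def extract_pmids(message):
    """Pull PMIDs out of free text such as 'PMID 30643258' (hypothetical helper)."""
    return re.findall(r"PMID[:\s]*(\d+)", message, flags=re.IGNORECASE)


def build_elink_url(pmids, target_db="gds"):
    """Build an elink URL from PubMed to target_db ('gds' for GEO, 'sra' for SRA)."""
    params = [("dbfrom", "pubmed"), ("db", target_db), ("retmode", "json")]
    # Repeating the id parameter asks E-Link for a separate link set per PMID.
    params += [("id", pmid) for pmid in pmids]
    return f"{ELINK_URL}?{urlencode(params)}"


message = "Add these papers to the publication queue: PMID 30643258, PMID 31018141"
pmids = extract_pmids(message)
elink_url = build_elink_url(pmids)
print(elink_url)
```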

Dependencies

lobster-research requires data access and parsing libraries:

| Library | Purpose |
| --- | --- |
| Bio.Entrez | NCBI E-utilities API access |
| requests | HTTP requests for data download |
| BeautifulSoup4 | HTML/XML parsing |
| pandas | Data manipulation |

These are installed automatically with the package.

Configuration

# .lobster_workspace/config.toml
enabled = ["research_agent", "data_expert"]

Sub-Agent Architecture

research_agent (supervisor-accessible, online)
    |
    | creates DownloadQueueEntry
    |
    v
download queue (queue-based coordination)
    |
    | supervisor polls queue
    |
    v
data_expert_agent (supervisor-accessible, ZERO online access)

The agents are coordinated through the download queue, maintaining strict separation between online discovery (research_agent) and offline execution (data_expert_agent). This architecture ensures:

  1. Security boundary - data_expert has ZERO online access
  2. Delegation pattern - supervisor routes to data_expert when queue has PENDING entries
  3. Async coordination - research_agent queues work, data_expert executes from queue
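
The delegation rule in point 2 can be sketched as follows. Only the DownloadQueueEntry name and the PENDING status come from the text above; the fields, the COMPLETED status, and the next_agent function are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class QueueStatus(Enum):
    PENDING = "PENDING"
    COMPLETED = "COMPLETED"  # assumed terminal status


@dataclass
class DownloadQueueEntry:
    accession: str  # assumed field
    status: QueueStatus = QueueStatus.PENDING


def next_agent(queue_entries):
    """Supervisor rule: hand off to data_expert while any entry is PENDING."""
    if any(entry.status is QueueStatus.PENDING for entry in queue_entries):
        return "data_expert"
    return None  # nothing to execute; no delegation needed


# research_agent queues work; the supervisor then polls the queue.
entries = [DownloadQueueEntry("GSE150728")]
print(next_agent(entries))

# Once data_expert has executed the entry, the supervisor stops delegating.
entries[0].status = QueueStatus.COMPLETED
print(next_agent(entries))
```

Because the two agents only share queue entries, the online side never has to call the offline side directly.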

Integration with Other Agents

research_agent often initiates workflows that delegate to domain-specific agents:

research_agent (planning)
  -> data_expert (loading GSE data)
    -> transcriptomics_expert (analysis)
      -> visualization_expert (plots)

This multi-agent workflow allows:

  1. Intelligent planning - research_agent understands the scientific question
  2. Data acquisition - data_expert handles GEO/SRA downloads
  3. Domain analysis - transcriptomics_expert runs the appropriate pipeline
  4. Visualization - visualization_expert generates publication-ready figures

Dynamic Database Routing

The research agent uses OmicsTypeRegistry to dynamically route queries to the best databases based on the detected omics type. The routing table is generated at runtime from registered omics types — not a static list.

| Omics Type | Preferred Databases | Accession Types |
| --- | --- | --- |
| Transcriptomics | GEO, SRA | GSE*, SRP* |
| Proteomics | PRIDE, MassIVE, GEO | PXD*, MSV* |
| Genomics | GEO, SRA, dbGaP | GSE*, SRP* |
| Metabolomics | MetaboLights, Metabolomics Workbench, GEO | MTBLS*, ST* |
| Metagenomics | SRA, GEO, MG-RAST | SRP*, GSE* |

External packages can register new omics types via the lobster.omics_types entry point; the research agent's routing then picks them up automatically.
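
A registry-driven routing table can be sketched in a few lines. This is not the real OmicsTypeRegistry: the class, its methods, and the OmicsType fields are assumptions, and entry-point discovery is left out so the example stays self-contained. The "Lipidomics" type is a made-up plugin used only to show the table updating.

```python
from dataclasses import dataclass, field


@dataclass
class OmicsType:
    name: str
    preferred_databases: list
    accession_prefixes: list = field(default_factory=list)


class OmicsTypeRegistrySketch:
    """Hypothetical stand-in for a registry that drives database routing."""

    def __init__(self):
        self._types = {}

    def register(self, omics_type):
        self._types[omics_type.name.lower()] = omics_type

    def route(self, omics_name):
        """Return databases in preference order, or an empty list if unknown."""
        entry = self._types.get(omics_name.lower())
        return list(entry.preferred_databases) if entry else []

    def routing_table(self):
        # Built from whatever is registered at call time, not from a static list.
        return {t.name: t.preferred_databases for t in self._types.values()}


registry = OmicsTypeRegistrySketch()
registry.register(
    OmicsType("Metabolomics", ["MetaboLights", "Metabolomics Workbench", "GEO"], ["MTBLS", "ST"])
)
registry.register(OmicsType("Proteomics", ["PRIDE", "MassIVE", "GEO"], ["PXD", "MSV"]))

# A later registration (for example from a plugin) changes the table immediately.
registry.register(OmicsType("Lipidomics", ["MetaboLights"], ["MTBLS"]))
print(registry.routing_table())
```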

Supported Data Types

The data_expert agent loads datasets from multiple databases in the following data types and file formats:

| Data Type | File Formats | Auto-Detection |
| --- | --- | --- |
| scRNA-seq | H5AD, MTX, H5 | Yes |
| Bulk RNA-seq | Counts matrix, FPKM | Yes |
| Microarray | Series matrix | Yes |
| Proteomics | MaxQuant output | Yes |
| Metabolomics | MAF, CSV/TSV, mzML | Yes |
