Research

lobster-research

Free · Beginner

Research planning: dataset discovery across GEO, PRIDE, MetaboLights, and more — with dynamic database routing

Input

GEO IDs, MTBLS IDs, PXD IDs, Keywords, H5AD, CSV, MTX

Output

Dataset Metadata, Loaded Data, Analysis Plans, Sample Info

Agents (2)

├── research_agent — Literature discovery, multi-database search (online)

└── data_expert_agent — Download execution and data loading (offline)

pip install lobster-research

Agents

research_agent

Primary interface for research planning and data discovery. It uses OmicsTypeRegistry to route each query to the preferred databases for its omics type.

Capabilities:

Research question decomposition
Dynamic database routing (queries route to preferred databases based on omics type)
Dataset recommendation across GEO, PRIDE, MetaboLights, and more
Literature-informed analysis planning
Workflow orchestration

data_expert

Specialized agent for data loading, exploration, and management.

Capabilities:

Multi-database dataset download (GEO, PRIDE, MetaboLights, SRA)
Data format detection and loading
Metadata exploration
Data quality assessment

Example:

"Find MetaboLights datasets for liver disease metabolomics"

Services

lobster-research bundles a data management service with the package:

Service	Purpose
ModalityDetectionService	Auto-detect data modality (scRNA-seq, bulk, proteomics, etc.)

The service is installed automatically with the agent package.

Example Workflows

Dataset Discovery

User: I'm studying the immune response in COVID-19 patients.
      What single-cell RNA-seq datasets are available?

[research_agent]
- Queries GEO via fast_dataset_search
- Returns 10+ datasets with titles, sample counts, and technology
- Ranks by relevance to research question
- Recommends top datasets for loading

Data Loading

User: Load GSE150728 and show me what's in it

[data_expert_agent]
- Downloads dataset files from GEO
- Auto-detects format (H5AD, MTX, CSV)
- Loads into DataManagerV2
- Reports: ~8,000 cells, ~30,000 genes, sample metadata
- Suggests next steps (QC, clustering)

Metabolomics Discovery

User: Search MetaboLights for liver disease metabolomics data

[research_agent]
- Routes to MetaboLights first (preferred DB for metabolomics queries)
- Falls back to Metabolomics Workbench, then GEO
- Recognizes MTBLS* and ST* accession types
- Returns studies with platform info (LC-MS, GC-MS, NMR)
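The accession recognition step above could be sketched as a small prefix matcher. This is an illustration only: the function name classify_accession and the exact patterns are assumptions built from the accession types named on this page, not the package's actual implementation.

```python
import re
from typing import Optional

# Patterns are assumptions derived from the accession types listed on this
# page (GSE, SRP, PXD, MSV, MTBLS, ST); the real matching rules may differ.
ACCESSION_PATTERNS = {
    "GEO": re.compile(r"^GSE\d+$"),
    "SRA": re.compile(r"^SRP\d+$"),
    "PRIDE": re.compile(r"^PXD\d+$"),
    "MassIVE": re.compile(r"^MSV\d+$"),
    "MetaboLights": re.compile(r"^MTBLS\d+$"),
    "Metabolomics Workbench": re.compile(r"^ST\d+$"),
}


def classify_accession(text: str) -> Optional[str]:
    """Return the database an accession appears to belong to, or None."""
    token = text.strip().upper()
    for database, pattern in ACCESSION_PATTERNS.items():
        if pattern.match(token):
            return database
    return None
```

A free-text query such as "liver disease" matches no pattern and returns None, which is the case where keyword search would apply instead.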

End-to-End Discovery and Analysis

User: Find COVID-19 scRNA-seq data and run clustering

[research_agent → data_expert_agent → transcriptomics_expert]
- research_agent discovers datasets, recommends GSE150728
- data_expert_agent downloads and loads data
- transcriptomics_expert runs QC → normalization → clustering → UMAP
- Entire pipeline runs in a single session

Download Queue

Large datasets (>100 MB) are queued for background download so the chat remains responsive:

User: Download GSE158055

[data_expert_agent]
- Queues dataset for background download
- Chat remains responsive during transfer
- Automatic retry on network failures
- Modality created when download completes

The download queue also supports concurrent downloads.
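A minimal sketch of the background-download idea, using only the Python standard library. The function start_download_worker, the "completed" and "failed" labels, and the retry count are hypothetical; the package's own queue is not shown on this page and may work differently.

```python
import queue
import threading
from typing import Callable, Dict


def start_download_worker(
    jobs: "queue.Queue[str]",
    download: Callable[[str], None],
    results: Dict[str, str],
    max_retries: int = 3,
) -> threading.Thread:
    """Start a daemon thread that drains `jobs` and retries failed downloads."""

    def worker() -> None:
        while True:
            accession = jobs.get()  # blocks until a job is queued
            try:
                for attempt in range(1, max_retries + 1):
                    try:
                        download(accession)
                        results[accession] = "completed"  # hypothetical label
                        break
                    except OSError:  # includes ConnectionError and timeouts
                        if attempt == max_retries:
                            results[accession] = "failed"  # hypothetical label
            finally:
                jobs.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

The caller only puts an accession on the queue and returns immediately, which is what keeps the foreground responsive. Starting several workers on the same queue gives concurrent downloads. A production version would likely add a backoff delay between attempts, omitted here for brevity.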

Batch Processing (Publication Queue)

For systematic literature reviews, research_agent can populate a publication queue for batch processing:

User: Add these papers to the publication queue: PMID 30643258, PMID 31018141

[research_agent]
- Creates publication queue entries for each PMID
- Extracts GEO/SRA identifiers via NCBI E-Link
- Sets status to HANDOFF_READY for metadata_assistant

This is a secondary workflow for importing curated publication lists (RIS files, systematic reviews). For interactive dataset discovery, use the primary fast_dataset_search workflow above.
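The publication queue steps listed above might look roughly like the sketch below. Only the HANDOFF_READY status comes from this page; the PublicationQueueEntry class, its fields, the "PENDING" initial status, and the resolve_datasets callback are hypothetical. In practice the callback might wrap an NCBI E-Link lookup (for example through Bio.Entrez.elink), which is left out so the sketch runs without network access.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class PublicationQueueEntry:
    """Hypothetical queue entry; field names are illustrative assumptions."""
    pmid: str
    status: str = "PENDING"  # assumed initial status
    dataset_ids: List[str] = field(default_factory=list)


def enqueue_publications(
    pmids: Iterable[str],
    resolve_datasets: Callable[[str], Iterable[str]],
) -> List[PublicationQueueEntry]:
    """Create one entry per PMID, attach linked dataset IDs, mark it ready."""
    entries = []
    for pmid in pmids:
        entry = PublicationQueueEntry(pmid=pmid)
        # resolve_datasets stands in for the identifier extraction step.
        entry.dataset_ids = list(resolve_datasets(pmid))
        entry.status = "HANDOFF_READY"  # status name taken from this page
        entries.append(entry)
    return entries
```

Passing the resolver in as a callback keeps the queue logic testable with a stub and separates it from the online lookup.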

Dependencies

lobster-research requires data access and parsing libraries:

Library	Purpose
Bio.Entrez	NCBI E-utilities API access
requests	HTTP requests for data download
BeautifulSoup4	HTML/XML parsing
pandas	Data manipulation

These are installed automatically with the package.

Configuration

# .lobster_workspace/config.toml
enabled = ["research_agent", "data_expert"]

Sub-Agent Architecture

research_agent (supervisor-accessible, online)
    |
    | creates DownloadQueueEntry
    |
    v
download queue (queue-based coordination)
    |
    | supervisor polls queue
    |
    v
data_expert_agent (supervisor-accessible, ZERO online access)

The agents are coordinated through the download queue, maintaining strict separation between online discovery (research_agent) and offline execution (data_expert_agent). This architecture ensures:

Security boundary - data_expert has ZERO online access
Delegation pattern - supervisor routes to data_expert when queue has PENDING entries
Async coordination - research_agent queues work, data_expert executes from queue
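The delegation pattern described above can be sketched as a function that inspects the queue and decides whether to hand off. DownloadQueueEntry and the PENDING status are named on this page; the entry's fields, the other status values, and the next_agent function are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class QueueStatus(Enum):
    PENDING = "pending"          # named on this page
    IN_PROGRESS = "in_progress"  # hypothetical
    COMPLETED = "completed"      # hypothetical


@dataclass
class DownloadQueueEntry:
    """Illustrative entry; the real class may carry more fields."""
    accession: str
    status: QueueStatus = QueueStatus.PENDING


def next_agent(entries: List[DownloadQueueEntry]) -> Optional[str]:
    """Return 'data_expert_agent' when any entry is PENDING, else None."""
    if any(entry.status is QueueStatus.PENDING for entry in entries):
        return "data_expert_agent"
    return None
```

Because the two agents share only queue entries, research_agent never has to call data_expert_agent directly, which is what preserves the online/offline boundary.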

Integration with Other Agents

research_agent often initiates workflows that delegate to domain-specific agents:

research_agent (planning)
  -> data_expert (loading GSE data)
    -> transcriptomics_expert (analysis)
      -> visualization_expert (plots)

This multi-agent workflow allows:

Intelligent planning - research_agent understands the scientific question
Data acquisition - data_expert handles GEO/SRA downloads
Domain analysis - transcriptomics_expert runs the appropriate pipeline
Visualization - visualization_expert generates publication-ready figures

Dynamic Database Routing

The research agent uses OmicsTypeRegistry to route each query to the preferred databases for its detected omics type. The routing table is generated at runtime from the registered omics types rather than from a static list.

Omics Type	Preferred Databases	Accession Types
Transcriptomics	GEO, SRA	GSE, SRP
Proteomics	PRIDE, MassIVE, GEO	PXD, MSV
Genomics	GEO, SRA, dbGaP	GSE, SRP
Metabolomics	MetaboLights, Metabolomics Workbench, GEO	MTBLS, ST
Metagenomics	SRA, GEO, MG-RAST	SRP, GSE

External packages can register new omics types via the lobster.omics_types entry point, and the research agent's routing updates automatically.
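To make the routing idea concrete, here is a minimal sketch of a registry that maps omics types to an ordered list of preferred databases and can pick up additional types from an entry-point group. The classes OmicsType and RoutingRegistry and their methods are hypothetical stand-ins, not the actual OmicsTypeRegistry API. The sketch also assumes each entry point resolves to an OmicsType-like object, which this page does not confirm.

```python
from dataclasses import dataclass
from importlib.metadata import entry_points
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class OmicsType:
    """Hypothetical description of one omics type."""
    name: str
    preferred_databases: Tuple[str, ...]  # ordered: first choice first
    accession_prefixes: Tuple[str, ...]


class RoutingRegistry:
    """Illustrative stand-in for a runtime routing table."""

    def __init__(self) -> None:
        self._types: Dict[str, OmicsType] = {}

    def register(self, omics_type: OmicsType) -> None:
        self._types[omics_type.name.lower()] = omics_type

    def route(self, omics_name: str) -> List[str]:
        """Return databases to try in order, or [] for an unknown type."""
        omics_type = self._types.get(omics_name.lower())
        return list(omics_type.preferred_databases) if omics_type else []

    def load_plugins(self, group: str = "lobster.omics_types") -> int:
        """Register every object exposed under the entry-point group.

        Requires Python 3.10+ for entry_points(group=...). Assumes each
        entry point loads to an OmicsType-like object.
        """
        loaded = 0
        for entry_point in entry_points(group=group):
            self.register(entry_point.load())
            loaded += 1
        return loaded


registry = RoutingRegistry()
registry.register(
    OmicsType(
        name="Metabolomics",
        preferred_databases=("MetaboLights", "Metabolomics Workbench", "GEO"),
        accession_prefixes=("MTBLS", "ST"),
    )
)
```

With the metabolomics row from the table registered, registry.route("metabolomics") yields MetaboLights first, then the fallbacks in order. Because the table is built from whatever has been registered, a plugin that adds a new type changes routing without edits to the agent itself.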

Supported Data Types

The data_expert agent loads the following data types and file formats:

Data Type	File Formats	Auto-Detection
scRNA-seq	H5AD, MTX, H5	Yes
Bulk RNA-seq	Counts matrix, FPKM	Yes
Microarray	Series matrix	Yes
Proteomics	MaxQuant output	Yes
Metabolomics	MAF, CSV/TSV, mzML	Yes
