Research
Data discovery and research planning agents
Research planning: dataset discovery across GEO, PRIDE, MetaboLights, and more — with dynamic database routing
Agents
research_agent
Primary interface for research planning and data discovery. Uses dynamic database routing via OmicsTypeRegistry to automatically route queries to the best databases based on the omics type.
Capabilities:
- Research question decomposition
- Dynamic database routing (queries route to preferred databases based on omics type)
- Dataset recommendation across GEO, PRIDE, MetaboLights, and more
- Literature-informed analysis planning
- Workflow orchestration
data_expert
Specialized agent for data loading, exploration, and management.
Capabilities:
- Multi-database dataset download (GEO, PRIDE, MetaboLights, SRA)
- Data format detection and loading
- Metadata exploration
- Data quality assessment
Example:
"Find MetaboLights datasets for liver disease metabolomics"
Services
lobster-research includes data management services bundled with the package:
| Service | Purpose |
|---|---|
| ModalityDetectionService | Auto-detect data modality (scRNA-seq, bulk, proteomics, etc.) |
The service is installed automatically with the agent package.
Example Workflows
Dataset Discovery
User: I'm studying the immune response in COVID-19 patients.
What single-cell RNA-seq datasets are available?
[research_agent]
- Queries GEO via fast_dataset_search
- Returns 10+ datasets with titles, sample counts, and technology
- Ranks by relevance to research question
- Recommends top datasets for loading
Data Loading
User: Load GSE150728 and show me what's in it
[data_expert_agent]
- Downloads dataset files from GEO
- Auto-detects format (H5AD, MTX, CSV)
- Loads into DataManagerV2
- Reports: ~8,000 cells, ~30,000 genes, sample metadata
- Suggests next steps (QC, clustering)
Metabolomics Discovery
User: Search MetaboLights for liver disease metabolomics data
[research_agent]
- Routes to MetaboLights first (preferred DB for metabolomics queries)
- Falls back to Metabolomics Workbench, then GEO
- Recognizes MTBLS* and ST* accession types
- Returns studies with platform info (LC-MS, GC-MS, NMR)
End-to-End Discovery and Analysis
User: Find COVID-19 scRNA-seq data and run clustering
[research_agent → data_expert_agent → transcriptomics_expert]
- research_agent discovers datasets, recommends GSE150728
- data_expert_agent downloads and loads data
- transcriptomics_expert runs QC → normalization → clustering → UMAP
- Entire pipeline runs in a single session
Download Queue
Large datasets (>100 MB) are queued for background download so the chat remains responsive:
User: Download GSE158055
[data_expert_agent]
- Queues dataset for background download
- Chat remains responsive during transfer
- Automatic retry on network failures
- Modality created when download completes
The download queue handles large files without blocking the chat, supports concurrent downloads, and automatically retries on failure.
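The queueing behavior described above can be illustrated with a minimal Python sketch. It is not the package's implementation: DownloadJob, run_worker, download_in_background, the COMPLETED and FAILED status names, and the max_attempts default are all hypothetical, and fetch stands in for whatever function actually transfers the files.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class DownloadJob:
    """Hypothetical stand-in for a queued download (not the package's class)."""
    accession: str
    attempts: int = 0
    status: str = "PENDING"


def run_worker(jobs: "queue.Queue[DownloadJob]",
               fetch: Callable[[str], None],
               max_attempts: int = 3) -> None:
    """Drain the queue, retrying each job when fetch raises ConnectionError."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return  # nothing left to do, let the thread finish
        while job.attempts < max_attempts:
            job.attempts += 1
            try:
                fetch(job.accession)
                job.status = "COMPLETED"
                break
            except ConnectionError:
                job.status = "FAILED"  # overwritten if a later attempt succeeds
        jobs.task_done()


def download_in_background(
    accessions: Iterable[str],
    fetch: Callable[[str], None],
    workers: int = 2,
) -> Tuple[List[DownloadJob], List[threading.Thread]]:
    """Queue every accession, start worker threads, and return immediately."""
    jobs: "queue.Queue[DownloadJob]" = queue.Queue()
    tracked = [DownloadJob(accession) for accession in accessions]
    for job in tracked:
        jobs.put(job)
    threads = [
        threading.Thread(target=run_worker, args=(jobs, fetch), daemon=True)
        for _ in range(workers)
    ]
    for thread in threads:
        thread.start()
    return tracked, threads
```

Because download_in_background returns as soon as the worker threads have started, the caller (in this analogy, the chat loop) stays free while transfers proceed. Joining the threads is only needed at the point where the results are required.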
Batch Processing (Publication Queue)
For systematic literature reviews, research_agent can populate a publication queue for batch processing:
User: Add these papers to the publication queue: PMID 30643258, PMID 31018141
[research_agent]
- Creates publication queue entries for each PMID
- Extracts GEO/SRA identifiers via NCBI E-Link
- Sets status to HANDOFF_READY for metadata_assistant
This is a secondary workflow for importing curated publication lists (RIS files, systematic reviews). For interactive dataset discovery, use the primary fast_dataset_search workflow above.
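As a rough illustration of the publication-queue steps, the sketch below parses PMIDs from a request and builds one entry per paper. Only the HANDOFF_READY status comes from this page. PublicationEntry, build_publication_queue, the initial PENDING status, and the rule that an entry is handed off only when identifiers were found are assumptions. The resolve_identifiers callable is a placeholder for the NCBI E-Link lookup, which is not reproduced here.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

# Matches "PMID 30643258", "PMID:30643258", or "pmid 30643258".
PMID_PATTERN = re.compile(r"PMID[:\s]*(\d+)", re.IGNORECASE)


@dataclass
class PublicationEntry:
    """Hypothetical queue entry; the field names are illustrative only."""
    pmid: str
    identifiers: List[str] = field(default_factory=list)
    status: str = "PENDING"  # assumed initial status


def build_publication_queue(
    request: str,
    resolve_identifiers: Callable[[str], Iterable[str]],
) -> List[PublicationEntry]:
    """Create one entry per PMID found in the request text."""
    entries = []
    for pmid in PMID_PATTERN.findall(request):
        entry = PublicationEntry(pmid=pmid)
        # Placeholder for the E-Link step that maps a PMID to GEO/SRA IDs.
        entry.identifiers = list(resolve_identifiers(pmid))
        if entry.identifiers:
            entry.status = "HANDOFF_READY"
        entries.append(entry)
    return entries
```

Passing the lookup in as a callable keeps the parsing and status logic testable without network access.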
Dependencies
lobster-research requires data access and parsing libraries:
| Library | Purpose |
|---|---|
| Bio.Entrez | NCBI E-utilities API access |
| requests | HTTP requests for data download |
| BeautifulSoup4 | HTML/XML parsing |
| pandas | Data manipulation |
These are installed automatically with the package.
Configuration
# .lobster_workspace/config.toml
enabled = ["research_agent", "data_expert"]
Sub-Agent Architecture
research_agent (supervisor-accessible, online)
|
| creates DownloadQueueEntry
|
v
download queue (queue-based coordination)
|
| supervisor polls queue
|
v
data_expert_agent (supervisor-accessible, ZERO online access)
The agents are coordinated through the download queue, maintaining strict separation between online discovery (research_agent) and offline execution (data_expert_agent). This architecture ensures:
- Security boundary - data_expert has ZERO online access
- Delegation pattern - supervisor routes to data_expert when queue has PENDING entries
- Async coordination - research_agent queues work, data_expert executes from queue
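The coordination pattern in this section can be sketched in a few lines of Python. The names DownloadQueueEntry and PENDING appear on this page, but the fields, the other status values, and the DownloadQueue, supervisor_route, and data_expert_step helpers below are assumptions made for illustration, not the package's actual API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class EntryStatus(Enum):
    PENDING = "PENDING"
    IN_PROGRESS = "IN_PROGRESS"  # assumed status name
    COMPLETED = "COMPLETED"      # assumed status name


@dataclass
class DownloadQueueEntry:
    """Illustrative entry; the real class may carry different fields."""
    accession: str
    database: str
    status: EntryStatus = EntryStatus.PENDING


class DownloadQueue:
    """Minimal in-memory queue shared by both agents."""

    def __init__(self) -> None:
        self._entries: List[DownloadQueueEntry] = []

    def add(self, entry: DownloadQueueEntry) -> None:
        # Called by the online side (research_agent in the diagram).
        self._entries.append(entry)

    def next_pending(self) -> Optional[DownloadQueueEntry]:
        for entry in self._entries:
            if entry.status is EntryStatus.PENDING:
                return entry
        return None


def supervisor_route(download_queue: DownloadQueue) -> Optional[str]:
    """Delegate to data_expert only while PENDING entries exist."""
    return "data_expert" if download_queue.next_pending() is not None else None


def data_expert_step(
    download_queue: DownloadQueue,
    execute: Callable[[DownloadQueueEntry], None],
) -> Optional[str]:
    """Process one entry; the executing side only ever sees queue entries."""
    entry = download_queue.next_pending()
    if entry is None:
        return None
    entry.status = EntryStatus.IN_PROGRESS
    execute(entry)
    entry.status = EntryStatus.COMPLETED
    return entry.accession
```

In this sketch the two sides never call each other directly. The queue is the only shared object, which is what makes the separation between discovery and execution easy to enforce.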
Integration with Other Agents
research_agent often initiates workflows that delegate to domain-specific agents:
research_agent (planning)
-> data_expert (loading GSE data)
-> transcriptomics_expert (analysis)
-> visualization_expert (plots)
This multi-agent workflow allows:
- Intelligent planning - research_agent understands the scientific question
- Data acquisition - data_expert handles GEO/SRA downloads
- Domain analysis - transcriptomics_expert runs the appropriate pipeline
- Visualization - visualization_expert generates publication-ready figures
Dynamic Database Routing
The research agent uses OmicsTypeRegistry to route each query to the preferred databases for the detected omics type. The routing table is generated at runtime from the registered omics types rather than read from a static list.
| Omics Type | Preferred Databases | Accession Types |
|---|---|---|
| Transcriptomics | GEO, SRA | GSE*, SRP* |
| Proteomics | PRIDE, MassIVE, GEO | PXD*, MSV* |
| Genomics | GEO, SRA, dbGaP | GSE*, SRP* |
| Metabolomics | MetaboLights, Metabolomics Workbench, GEO | MTBLS*, ST* |
| Metagenomics | SRA, GEO, MG-RAST | SRP*, GSE* |
External packages can register new omics types via the lobster.omics_types entry point. Because the routing table is generated from OmicsTypeRegistry, the research agent's routing picks up newly registered types automatically.
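To make the registry idea concrete, here is a minimal Python sketch of a routing table that is derived from registered types. It does not reproduce OmicsTypeRegistry: OmicsTypeSpec, RoutingRegistry, and every method name are hypothetical, and the lowercase type names are an arbitrary choice. The database lists and accession patterns are copied from the table above.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class OmicsTypeSpec:
    """Hypothetical description of one omics type."""
    name: str
    preferred_databases: Tuple[str, ...]  # ordered: first is tried first
    accession_patterns: Tuple[str, ...]   # shell-style patterns such as "GSE*"


class RoutingRegistry:
    """Illustrative registry; routing is computed from what is registered."""

    def __init__(self) -> None:
        self._types: Dict[str, OmicsTypeSpec] = {}

    def register(self, spec: OmicsTypeSpec) -> None:
        self._types[spec.name] = spec

    def databases_for(self, omics_type: str) -> List[str]:
        """Preferred database first, fallbacks after it."""
        return list(self._types[omics_type].preferred_databases)

    def types_for_accession(self, accession: str) -> List[str]:
        """Every registered type whose patterns match the accession."""
        return [
            spec.name
            for spec in self._types.values()
            if any(fnmatchcase(accession, p) for p in spec.accession_patterns)
        ]

    def routing_table(self) -> Dict[str, List[str]]:
        """Built on demand, so newly registered types appear immediately."""
        return {
            name: list(spec.preferred_databases)
            for name, spec in self._types.items()
        }


registry = RoutingRegistry()
registry.register(
    OmicsTypeSpec("transcriptomics", ("GEO", "SRA"), ("GSE*", "SRP*"))
)
registry.register(
    OmicsTypeSpec(
        "metabolomics",
        ("MetaboLights", "Metabolomics Workbench", "GEO"),
        ("MTBLS*", "ST*"),
    )
)
```

With these two registrations, registry.databases_for("metabolomics") yields MetaboLights, then Metabolomics Workbench, then GEO, which mirrors the fallback order in the Metabolomics Discovery workflow. Note that several omics types in the table share accession prefixes (GSE*, SRP*), so an accession alone may match more than one type. How external packages are discovered through the entry point is not shown here.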
Supported Data Types
The data_expert agent supports the following data types and file formats, drawn from multiple databases:
| Data Type | File Formats | Auto-Detection |
|---|---|---|
| scRNA-seq | H5AD, MTX, H5 | Yes |
| Bulk RNA-seq | Counts matrix, FPKM | Yes |
| Microarray | Series matrix | Yes |
| Proteomics | MaxQuant output | Yes |
| Metabolomics | MAF, CSV/TSV, mzML | Yes |
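For a sense of what format auto-detection can look like, the sketch below guesses a file format from the file name alone. It is a simplified illustration and not the logic of ModalityDetectionService: detect_format is a hypothetical helper, and the naming conventions it relies on (GEO files ending in _series_matrix.txt, a MaxQuant file called proteinGroups.txt, MetaboLights MAF files ending in _maf.tsv) are assumptions about typical file names. A real detector would also inspect file contents.

```python
from pathlib import Path
from typing import Optional

# File suffix -> format label, using the labels from the table above.
_SUFFIX_FORMATS = {
    ".h5ad": "H5AD",
    ".mtx": "MTX",
    ".h5": "H5",
    ".mzml": "mzML",
    ".csv": "CSV",
    ".tsv": "TSV",
}

_COMPRESSION_SUFFIXES = (".gz", ".bz2")


def detect_format(filename: str) -> Optional[str]:
    """Guess a file format from its name; returns None when unknown."""
    name = Path(filename).name.lower()

    # Look through one layer of compression, e.g. "matrix.mtx.gz".
    for suffix in _COMPRESSION_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break

    # Name-based conventions are checked before generic suffixes,
    # otherwise a MAF file would be reported as plain TSV.
    if name.endswith("_series_matrix.txt"):
        return "Series matrix"
    if name == "proteingroups.txt":
        return "MaxQuant output"
    if name.endswith("_maf.tsv"):
        return "MAF"

    return _SUFFIX_FORMATS.get(Path(name).suffix)
```

Checking the specific naming conventions before the generic suffix lookup is the main design choice here, since several formats in the table share common extensions such as .txt and .tsv.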