Metadata
Publication queue processing, ontology standardization, ID mapping, and metadata filtering
Agents
metadata_assistant
Specialized agent for processing publication queues and filtering metadata from research workflows.
Capabilities:
- Publication queue batch processing
- Cross-database ID mapping (PubMed ↔ GEO ↔ SRA)
- Semantic ontology matching: MONDO diseases, UBERON tissues, Cell Ontology cell types (with vector-search extra)
- Cross-database knowledgebase ID mapping (MONDO ↔ UMLS ↔ MeSH)
- Metadata filtering by criteria
- Sample annotation aggregation
- Unified CSV export with publication context
Example Workflows
Cross-Database ID Mapping
User: Map these PubMed IDs to their corresponding GEO datasets:
30643258, 31018141
[metadata_assistant]
- Queries NCBI E-Link for PubMed → GEO mapping
- Resolves GEO series (GSE) and samples (GSM)
- Returns mapping table with dataset metadata
- Stores in publication queue for data_expert handoff
Metadata Filtering
User: Show me all loaded datasets with tissue=liver
and organism=human
[metadata_assistant]
- Scans modality metadata across loaded datasets
- Applies filter criteria (tissue, organism, disease, etc.)
- Returns matching samples with publication context
- Exports filtered results to CSV
Filter Microbiome Samples
User: Filter my microbiome data to include only gut samples from
healthy adults
[metadata_assistant]
- Applies microbiome_filtering_service with criteria
- Filters by: body_site=gut, health_status=healthy, age_group=adult
- Returns filtered dataset with sample IDs
- Exports unified CSV
Batch Publication Queue Processing
For systematic literature reviews with curated publication lists:
User: Process the publication queue and export metadata for
all HANDOFF_READY entries
[metadata_assistant]
- Loads publication queue entries with HANDOFF_READY status
- Extracts GEO/SRA identifiers via NCBI E-Link
- Aggregates sample + publication context
- Exports unified CSV with all metadata
Semantic Disease Matching
User: Match these disease terms to MONDO ontology:
glioblastoma, lung adenocarcinoma, T2D
[metadata_assistant]
- Embeds each term with SapBERT (768-dim)
- Queries MONDO collection via ChromaDB cosine search
- Returns standardized matches with confidence scores:
glioblastoma → MONDO:0018177 (confidence: 0.96)
lung adenocarcinoma → MONDO:0005061 (confidence: 0.94)
T2D → MONDO:0005148 (type 2 diabetes mellitus, confidence: 0.91)
Semantic ontology matching requires the vector-search extra. Install with pip install 'lobster-ai[vector-search]'. Without it, disease matching falls back to keyword lookup with 4 hardcoded diseases. See the Semantic Search Guide for details.
The publication queue workflow is designed for batch processing of imported publication lists (RIS files, systematic reviews). For interactive dataset discovery, use the research agent's fast_dataset_search instead.
Services
lobster-metadata includes metadata management services bundled with the package:
| Service | Purpose |
|---|---|
| PublicationProcessingService | Batch process publication queue entries |
| GEOService | GEO database access and metadata extraction |
| SRAService | SRA database access and run metadata |
| ArrayExpressService | ArrayExpress data access |
| PRIDEService | PRIDE proteomics database access |
| ENAService | European Nucleotide Archive access |
| FilteringService | Metadata filtering by criteria |
| MicrobiomeAmpliconService | 16S/ITS amplicon metadata filtering |
| MicrobiomeShotgunService | Shotgun metagenomics metadata filtering |
| DiseaseOntologyService | Disease, tissue, and cell type ontology matching (with vector-search extra) |
| VectorSearchService | Vector search engine — backends, embedders, rerankers (with vector-search extra) |
The vector search infrastructure (~1,900 LOC) is bundled with lobster-metadata since it is the primary consumer. Install with pip install 'lobster-metadata[vector-search]' or pip install 'lobster-ai[vector-search]'.
All other services are installed automatically with the base package.
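The semantic matching on this page embeds a term and then runs a cosine search over an ontology collection. The idea can be illustrated without the vector-search extra. The following is a minimal sketch, not the VectorSearchService implementation: it uses toy 3-dimensional vectors in place of 768-dimensional SapBERT embeddings and a plain list in place of a ChromaDB collection. The function names `cosine_similarity` and `top_match` and the vectors themselves are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def top_match(query_vec, collection):
    """Return (term_id, score) for the entry most similar to query_vec."""
    scored = [
        (term_id, cosine_similarity(query_vec, vec))
        for term_id, vec in collection
    ]
    return max(scored, key=lambda pair: pair[1])


# Toy collection: the vectors are made up for illustration only.
# A real collection would hold embeddings produced by an embedder.
toy_collection = [
    ("MONDO:0018177", [0.9, 0.1, 0.0]),
    ("MONDO:0005061", [0.1, 0.9, 0.1]),
    ("MONDO:0005148", [0.0, 0.2, 0.9]),
]

best_id, score = top_match([0.8, 0.2, 0.1], toy_collection)
print(best_id, round(score, 2))
```

A production backend would typically replace the linear scan with an index, but the ranking criterion is the same: the highest cosine score is reported as the match.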
Dependencies
lobster-metadata requires database access and metadata parsing libraries:
| Library | Purpose |
|---|---|
| Bio.Entrez | NCBI E-utilities API (PubMed, GEO, SRA) |
| requests | HTTP requests for database APIs |
| pandas | Metadata manipulation and CSV export |
| lxml | XML parsing for database responses |
These are installed automatically with the package.
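To give a rough idea of what a PubMed → GEO lookup through NCBI E-Link involves, here is a minimal sketch that builds the request URL and parses the XML response. It deliberately uses only the Python standard library rather than Bio.Entrez, requests, or lxml, so it can be read and run without network access. The `elink.fcgi` endpoint and the `dbfrom=pubmed` / `db=gds` parameters belong to the public E-utilities API. The helper names `build_elink_url` and `parse_elink_xml` are hypothetical and are not taken from lobster-metadata.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def build_elink_url(pmids):
    """Build an E-Link URL that links PubMed IDs to GEO DataSets (gds).

    The `id` parameter is repeated once per PMID so that each PMID
    is reported in its own LinkSet in the response.
    """
    params = [("dbfrom", "pubmed"), ("db", "gds")]
    params += [("id", str(pmid)) for pmid in pmids]
    return f"{ELINK_URL}?{urlencode(params)}"


def parse_elink_xml(xml_text):
    """Map each source PMID to the list of linked GEO DataSets UIDs."""
    mapping = {}
    root = ET.fromstring(xml_text)
    for linkset in root.iter("LinkSet"):
        pmid = linkset.findtext("IdList/Id")
        uids = [el.text for el in linkset.findall("LinkSetDb/Link/Id")]
        mapping[pmid] = uids
    return mapping


url = build_elink_url([30643258, 31018141])
print(url)
```

The returned UIDs are internal GEO DataSets identifiers. Resolving them to GSE and GSM accessions takes a further lookup, which this sketch does not cover.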
Configuration
# .lobster_workspace/config.toml
enabled = ["metadata_assistant"]
Publication Queue Integration
metadata_assistant works downstream from research_agent and data_expert:
research_agent
-> Creates publication queue entries
-> Extracts identifiers via NCBI E-Link
-> Sets status to HANDOFF_READY
|
v
metadata_assistant
-> Batch processes HANDOFF_READY entries
-> Applies filter criteria
-> Exports unified CSV with publication + sample context
This workflow allows:
- Batch processing - Handle multiple publications at once
- Context preservation - Maintain publication → dataset → sample linkage
- Flexible filtering - Apply domain-specific criteria (microbiome, transcriptomics, etc.)
- Unified export - Single CSV with all relevant metadata
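The publication → dataset → sample linkage can be pictured as flattening nested queue entries into one CSV row per sample. The following is a minimal sketch under stated assumptions: the entry layout (keys `pmid`, `status`, `datasets`, `accession`, `samples`) and the function name `export_unified_csv` are hypothetical, since the actual queue schema is not shown here. Only the HANDOFF_READY status comes from this page.

```python
import csv
import io


def export_unified_csv(queue_entries):
    """Flatten HANDOFF_READY entries into CSV text, one row per sample."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["pmid", "dataset", "sample", "tissue"],
        lineterminator="\n",
    )
    writer.writeheader()
    for entry in queue_entries:
        # Skip entries that are not ready for handoff.
        if entry["status"] != "HANDOFF_READY":
            continue
        for dataset in entry["datasets"]:
            for sample in dataset["samples"]:
                # Each row carries its publication and dataset context.
                writer.writerow({
                    "pmid": entry["pmid"],
                    "dataset": dataset["accession"],
                    "sample": sample["id"],
                    "tissue": sample["tissue"],
                })
    return buffer.getvalue()
```

Repeating the publication and dataset identifiers on every sample row keeps the file self-describing, at the cost of some redundancy.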
Metadata Filtering Criteria
Supported filter dimensions:
| Dimension | Examples |
|---|---|
| Tissue/Organ | brain, liver, gut, blood |
| Disease State | healthy, cancer, diabetes, COVID-19 |
| Treatment | drug-treated, vehicle, untreated |
| Cell Type | T cells, neurons, fibroblasts |
| Age Group | adult, pediatric, elderly |
| Sex | male, female, mixed |
| Organism | human, mouse, rat |
Filters are applied via natural language or structured criteria.
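As an illustration of what structured criteria might look like, the sketch below treats them as a dictionary mapping a dimension to a required value and keeps only samples that satisfy every entry. This is an assumption made for the example, not the FilteringService API: the names `matches` and `filter_samples`, the lowercase keys, and the case-insensitive equality rule are all hypothetical.

```python
def matches(sample, criteria):
    """True if the sample satisfies every criterion.

    Comparison is case-insensitive equality; a missing field never matches.
    """
    for key, wanted in criteria.items():
        value = sample.get(key)
        if value is None or str(value).lower() != str(wanted).lower():
            return False
    return True


def filter_samples(samples, criteria):
    """Return the samples that match all criteria, preserving order."""
    return [s for s in samples if matches(s, criteria)]


# Placeholder sample records for illustration only.
samples = [
    {"sample_id": "S1", "tissue": "liver", "organism": "human"},
    {"sample_id": "S2", "tissue": "liver", "organism": "mouse"},
    {"sample_id": "S3", "tissue": "Brain", "organism": "Human"},
]

selected = filter_samples(samples, {"tissue": "liver", "organism": "human"})
print([s["sample_id"] for s in selected])
```

Exact matching is the simplest possible rule. Free-text metadata often needs synonym handling or ontology standardization before a filter like this becomes reliable.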