Omics-OS Docs

Metadata

Publication queue processing and metadata filtering agent

lobster-metadata
Free · Intermediate

Publication queue processing, ontology standardization, ID mapping, and metadata filtering

Input
PubMed IDs · GEO IDs · SRA IDs · Publication Queue
Output
Unified CSV · Mapped Identifiers · Filtered Metadata
Agents (1)
└── metadata_assistant: Publication queue processing and metadata filtering
pip install lobster-metadata

Agents

metadata_assistant

Specialized agent for processing publication queues and filtering metadata from research workflows.

Capabilities:

  • Publication queue batch processing
  • Cross-database ID mapping (PubMed ↔ GEO ↔ SRA)
  • Semantic ontology matching — MONDO diseases, UBERON tissues, Cell Ontology cell types (with vector-search extra)
  • Ontology cross-reference mapping (MONDO ↔ UMLS ↔ MeSH)
  • Metadata filtering by criteria
  • Sample annotation aggregation
  • Unified CSV export with publication context

Example Workflows

Cross-Database ID Mapping

User: Map these PubMed IDs to their corresponding GEO datasets:
      30643258, 31018141

[metadata_assistant]
- Queries NCBI E-Link for PubMed → GEO mapping
- Resolves GEO series (GSE) and samples (GSM)
- Returns mapping table with dataset metadata
- Stores in publication queue for data_expert handoff
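The mapping step above hinges on parsing NCBI E-Link responses. A minimal stdlib sketch, assuming a hypothetical E-Link XML payload (real responses come from `elink.fcgi` via Bio.Entrez and carry more fields; the GEO UID below is illustrative):

```python
import xml.etree.ElementTree as ET

# Hypothetical E-Link response for a PubMed -> GEO (gds) link query,
# shaped like the output of elink.fcgi?dbfrom=pubmed&db=gds.
ELINK_XML = """<?xml version="1.0"?>
<eLinkResult>
  <LinkSet>
    <IdList><Id>30643258</Id></IdList>
    <LinkSetDb>
      <DbTo>gds</DbTo>
      <LinkName>pubmed_gds</LinkName>
      <Link><Id>200123456</Id></Link>
    </LinkSetDb>
  </LinkSet>
</eLinkResult>"""

def parse_elink(xml_text):
    """Map each source PubMed ID to its linked GEO (gds) UIDs."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for linkset in root.findall("LinkSet"):
        pmid = linkset.findtext("IdList/Id")
        geo_ids = [i.text for i in linkset.findall("LinkSetDb/Link/Id")]
        mapping[pmid] = geo_ids
    return mapping

print(parse_elink(ELINK_XML))  # {'30643258': ['200123456']}
```

Resolving the returned gds UIDs to GSE/GSM accessions is a second lookup against the GEO DataSets database.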

Metadata Filtering

User: Show me all loaded datasets with tissue=liver
      and organism=human

[metadata_assistant]
- Scans modality metadata across loaded datasets
- Applies filter criteria (tissue, organism, disease, etc.)
- Returns matching samples with publication context
- Exports filtered results to CSV
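The filtering logic can be sketched with plain dictionaries; the field names (`tissue`, `organism`) follow the filter dimensions listed later on this page, and the sample records are hypothetical:

```python
# Hypothetical sample metadata records.
samples = [
    {"sample_id": "GSM001", "tissue": "liver", "organism": "human"},
    {"sample_id": "GSM002", "tissue": "brain", "organism": "human"},
    {"sample_id": "GSM003", "tissue": "liver", "organism": "mouse"},
]

def filter_samples(records, **criteria):
    """Keep records matching every criterion (case-insensitive equality)."""
    def matches(rec):
        return all(str(rec.get(k, "")).lower() == str(v).lower()
                   for k, v in criteria.items())
    return [r for r in records if matches(r)]

hits = filter_samples(samples, tissue="liver", organism="human")
print([r["sample_id"] for r in hits])  # ['GSM001']
```

In practice the agent accepts the criteria as natural language and translates them into structured key/value filters like these.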

Filter Microbiome Samples

User: Filter my microbiome data to include only gut samples from
      healthy adults

[metadata_assistant]
- Applies microbiome_filtering_service with criteria
- Filters by: body_site=gut, health_status=healthy, age_group=adult
- Returns filtered dataset with sample IDs
- Exports unified CSV

Batch Publication Queue Processing

For systematic literature reviews with curated publication lists:

User: Process the publication queue and export metadata for
      all HANDOFF_READY entries

[metadata_assistant]
- Loads publication queue entries with HANDOFF_READY status
- Extracts GEO/SRA identifiers via NCBI E-Link
- Aggregates sample + publication context
- Exports unified CSV with all metadata
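The status-gated batch export can be sketched as follows; the `QueueEntry` fields are a simplified stand-in for real queue entries, which carry fuller publication context:

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical, simplified queue entry.
@dataclass
class QueueEntry:
    pmid: str
    geo_id: str
    status: str

queue = [
    QueueEntry("30643258", "GSE123456", "HANDOFF_READY"),
    QueueEntry("31018141", "GSE130000", "PENDING"),
]

def export_handoff_ready(entries):
    """Write HANDOFF_READY entries to a unified CSV (here: an in-memory buffer)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["pmid", "geo_id", "status"])
    writer.writeheader()
    for entry in entries:
        if entry.status == "HANDOFF_READY":
            writer.writerow(vars(entry))
    return buf.getvalue()

print(export_handoff_ready(queue))
```

Entries that are still `PENDING` stay in the queue untouched, so the export can be re-run as more publications become handoff-ready.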

Semantic Disease Matching

User: Match these disease terms to MONDO ontology:
      glioblastoma, lung adenocarcinoma, T2D

[metadata_assistant]
- Embeds each term with SapBERT (768-dim)
- Queries MONDO collection via ChromaDB cosine search
- Returns standardized matches with confidence scores:
  glioblastoma → MONDO:0018177 (confidence: 0.96)
  lung adenocarcinoma → MONDO:0005061 (confidence: 0.94)
  T2D → MONDO:0005148 (type 2 diabetes mellitus, confidence: 0.91)
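The nearest-neighbor step behind these matches is a cosine search over term embeddings. A toy sketch with 4-dimensional vectors standing in for 768-dim SapBERT embeddings (the vectors are invented; the MONDO IDs mirror the example above):

```python
import math

# Toy embeddings; real vectors are 768-dim SapBERT outputs stored in ChromaDB.
mondo_index = {
    "MONDO:0018177 glioblastoma":        [0.9, 0.1, 0.0, 0.1],
    "MONDO:0005061 lung adenocarcinoma": [0.1, 0.9, 0.1, 0.0],
    "MONDO:0005148 type 2 diabetes":     [0.0, 0.1, 0.9, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_match(query_vec, index):
    """Return the (label, vector) entry most similar to the query."""
    return max(index.items(), key=lambda kv: cosine(query_vec, kv[1]))

label, _ = best_match([0.85, 0.15, 0.05, 0.1], mondo_index)
print(label)  # MONDO:0018177 glioblastoma
```

The reported confidence score corresponds to this similarity value for the top-ranked ontology term.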

Semantic ontology matching requires the vector-search extra. Install with pip install 'lobster-ai[vector-search]'. Without it, disease matching falls back to keyword lookup with 4 hardcoded diseases. See the Semantic Search Guide for details.

The publication queue workflow is designed for batch processing of imported publication lists (RIS files, systematic reviews). For interactive dataset discovery, use the research agent's fast_dataset_search instead.

Services

lobster-metadata bundles the following metadata management services:

| Service | Purpose |
| --- | --- |
| PublicationProcessingService | Batch process publication queue entries |
| GEOService | GEO database access and metadata extraction |
| SRAService | SRA database access and run metadata |
| ArrayExpressService | ArrayExpress data access |
| PRIDEService | PRIDE proteomics database access |
| ENAService | European Nucleotide Archive access |
| FilteringService | Metadata filtering by criteria |
| MicrobiomeAmpliconService | 16S/ITS amplicon metadata filtering |
| MicrobiomeShotgunService | Shotgun metagenomics metadata filtering |
| DiseaseOntologyService | Disease, tissue, and cell type ontology matching (with vector-search extra) |
| VectorSearchService | Vector search engine: backends, embedders, rerankers (with vector-search extra) |

The vector search infrastructure (~1,900 LOC) is bundled with lobster-metadata since it is the primary consumer. Install with pip install 'lobster-metadata[vector-search]' or pip install 'lobster-ai[vector-search]'.

All other services are installed automatically with the base package.

Dependencies

lobster-metadata requires database access and metadata parsing libraries:

| Library | Purpose |
| --- | --- |
| Bio.Entrez | NCBI E-utilities API (PubMed, GEO, SRA) |
| requests | HTTP requests for database APIs |
| pandas | Metadata manipulation and CSV export |
| lxml | XML parsing for database responses |

These are installed automatically with the package.

Configuration

# .lobster_workspace/config.toml
enabled = ["metadata_assistant"]

Publication Queue Integration

metadata_assistant works downstream from research_agent and data_expert:

research_agent
  -> Creates publication queue entries
  -> Extracts identifiers via NCBI E-Link
  -> Sets status to HANDOFF_READY
    |
    v
metadata_assistant
  -> Batch processes HANDOFF_READY entries
  -> Applies filter criteria
  -> Exports unified CSV with publication + sample context

This workflow allows:

  1. Batch processing - Handle multiple publications at once
  2. Context preservation - Maintain publication → dataset → sample linkage
  3. Flexible filtering - Apply domain-specific criteria (microbiome, transcriptomics, etc.)
  4. Unified export - Single CSV with all relevant metadata

Metadata Filtering Criteria

Supported filter dimensions:

| Dimension | Examples |
| --- | --- |
| Tissue/Organ | brain, liver, gut, blood |
| Disease State | healthy, cancer, diabetes, COVID-19 |
| Treatment | drug-treated, vehicle, untreated |
| Cell Type | T cells, neurons, fibroblasts |
| Age Group | adult, pediatric, elderly |
| Sex | male, female, mixed |
| Organism | human, mouse, rat |

Filters are applied via natural language or structured criteria.
