Metadata

lobster-metadata

Free · Intermediate

Publication queue processing, ontology standardization, ID mapping, and metadata filtering

Input

PubMed IDs, GEO IDs, SRA IDs, Publication Queue

Output

Unified CSV, Mapped Identifiers, Filtered Metadata

Agents (1)

└── metadata_assistant — Publication queue processing and metadata filtering

pip install lobster-metadata

Agents

metadata_assistant

Specialized agent for processing publication queues and filtering metadata from research workflows.

Capabilities:

Publication queue batch processing
Cross-database ID mapping (PubMed ↔ GEO ↔ SRA)
Semantic ontology matching — MONDO diseases, UBERON tissues, Cell Ontology cell types (with vector-search extra)
Cross-database knowledgebase ID mapping (MONDO ↔ UMLS ↔ MeSH)
Metadata filtering by criteria
Sample annotation aggregation
Unified CSV export with publication context

Example Workflows

Cross-Database ID Mapping

User: Map these PubMed IDs to their corresponding GEO datasets:
      30643258, 31018141

[metadata_assistant]
- Queries NCBI E-Link for PubMed → GEO mapping
- Resolves GEO series (GSE) and samples (GSM)
- Returns mapping table with dataset metadata
- Stores in publication queue for data_expert handoff
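
The PubMed → GEO lookup in the first step can be sketched against the public NCBI E-utilities ELink endpoint. This is a minimal illustration, not the agent's actual implementation: the helper names `build_elink_url` and `parse_elink_json` are hypothetical, the GEO DataSets database name `gds` and the JSON response shape (`linksets` → `linksetdbs` → `links`) are assumptions to check against the E-utilities documentation, and no network request is made here.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_elink_url(pubmed_ids, target_db="gds"):
    """Build an ELink URL mapping PubMed IDs to records in target_db.

    Hypothetical helper; "gds" (GEO DataSets) is assumed as the target.
    """
    params = [("dbfrom", "pubmed"), ("db", target_db), ("retmode", "json")]
    # One id parameter per PMID asks ELink for a separate link set per input ID.
    params += [("id", str(pmid)) for pmid in pubmed_ids]
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"


def parse_elink_json(payload):
    """Return {pmid: [linked_uid, ...]} from a decoded ELink JSON payload.

    Assumes the payload has a top-level "linksets" list whose items carry
    "ids" and, when links exist, "linksetdbs" entries with "links".
    """
    mapping = {}
    for linkset in payload.get("linksets", []):
        linked = []
        for linksetdb in linkset.get("linksetdbs", []):
            linked.extend(str(uid) for uid in linksetdb.get("links", []))
        for pmid in linkset.get("ids", []):
            mapping[str(pmid)] = linked
    return mapping


# Build the request URL for the two PubMed IDs from the example above.
print(build_elink_url(["30643258", "31018141"]))
```

Fetching the URL (for example with `requests`) and passing the decoded JSON to `parse_elink_json` would give one list of linked GEO UIDs per PubMed ID. A PubMed ID with no links maps to an empty list.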

Metadata Filtering

User: Show me all loaded datasets with tissue=liver
      and organism=human

[metadata_assistant]
- Scans modality metadata across loaded datasets
- Applies filter criteria (tissue, organism, disease, etc.)
- Returns matching samples with publication context
- Exports filtered results to CSV
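
The filter-then-export steps above amount to an exact-match filter over per-sample metadata followed by CSV serialization. The sketch below uses only the standard library. The function names, the record layout, and the sample values are hypothetical placeholders, not the package's FilteringService API.

```python
import csv
import io


def filter_samples(samples, criteria):
    """Keep samples whose metadata matches every criterion (case-insensitive)."""
    def matches(sample):
        return all(
            str(sample.get(key, "")).strip().lower() == str(value).strip().lower()
            for key, value in criteria.items()
        )
    return [sample for sample in samples if matches(sample)]


def to_csv(rows):
    """Serialize rows to CSV text; the header is the sorted union of keys."""
    if not rows:
        return ""
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing keys are written as empty cells
    return buffer.getvalue()


# Placeholder records, for illustration only.
samples = [
    {"sample_id": "S1", "tissue": "Liver", "organism": "human"},
    {"sample_id": "S2", "tissue": "brain", "organism": "human"},
    {"sample_id": "S3", "tissue": "liver", "organism": "mouse"},
]
selected = filter_samples(samples, {"tissue": "liver", "organism": "human"})
print(to_csv(selected))
```

Only `S1` satisfies both criteria here, because every criterion must match. Requiring all criteria rather than any keeps the result set conservative as more filters are added.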

Filter Microbiome Samples

User: Filter my microbiome data to include only gut samples from
      healthy adults

[metadata_assistant]
- Applies microbiome_filtering_service with criteria
- Filters by: body_site=gut, health_status=healthy, age_group=adult
- Returns filtered dataset with sample IDs
- Exports unified CSV

Batch Publication Queue Processing

For systematic literature reviews with curated publication lists:

User: Process the publication queue and export metadata for
      all HANDOFF_READY entries

[metadata_assistant]
- Loads publication queue entries with HANDOFF_READY status
- Extracts GEO/SRA identifiers via NCBI E-Link
- Aggregates sample + publication context
- Exports unified CSV with all metadata
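
To make the "sample + publication context" aggregation concrete, the following sketch flattens queue entries into one row per sample, with the publication fields repeated on every row. The `QueueEntry` class, its field names, and the identifiers are hypothetical. The package's real queue schema is not shown in this document.

```python
from dataclasses import dataclass, field


@dataclass
class QueueEntry:
    """Hypothetical stand-in for a publication queue entry."""
    pmid: str
    title: str
    status: str
    samples: list = field(default_factory=list)  # list of per-sample dicts


def unified_rows(queue, status="HANDOFF_READY"):
    """Flatten entries with the given status into one row per sample."""
    rows = []
    for entry in queue:
        if entry.status != status:
            continue  # entries in any other status are left untouched
        for sample in entry.samples:
            # Publication context comes first, then the sample's own fields.
            rows.append({"pmid": entry.pmid, "title": entry.title, **sample})
    return rows


# Placeholder queue, for illustration only.
queue = [
    QueueEntry("PMID_A", "Study A", "HANDOFF_READY",
               [{"sample_id": "S1", "tissue": "gut"},
                {"sample_id": "S2", "tissue": "gut"}]),
    QueueEntry("PMID_B", "Study B", "PENDING",
               [{"sample_id": "S3", "tissue": "blood"}]),
]
for row in unified_rows(queue):
    print(row)
```

Each output row carries its source publication, so the publication → sample linkage survives the export to a single flat CSV.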

Semantic Disease Matching

User: Match these disease terms to MONDO ontology:
      glioblastoma, lung adenocarcinoma, T2D

[metadata_assistant]
- Embeds each term with SapBERT (768-dim)
- Queries MONDO collection via ChromaDB cosine search
- Returns standardized matches with confidence scores:
  glioblastoma → MONDO:0018177 (confidence: 0.96)
  lung adenocarcinoma → MONDO:0005061 (confidence: 0.94)
  T2D → MONDO:0005148 (type 2 diabetes mellitus, confidence: 0.91)
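
The ranking step behind these matches is a nearest-neighbor search by cosine similarity. The sketch below shows only that arithmetic on toy 3-dimensional vectors with placeholder term IDs. It does not call SapBERT or ChromaDB, and how the agent turns similarity into the reported confidence score is not specified here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def best_match(query_vec, collection):
    """Return (term_id, score) for the closest vector in {term_id: vector}."""
    scored = (
        (term_id, cosine_similarity(query_vec, vec))
        for term_id, vec in collection.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Toy embeddings standing in for 768-dimensional vectors (illustration only).
collection = {
    "TERM:A": [1.0, 0.0, 0.0],
    "TERM:B": [0.0, 1.0, 0.0],
}
term_id, score = best_match([0.9, 0.1, 0.0], collection)
print(term_id, round(score, 3))
```

A vector database replaces the linear scan in `best_match` with an approximate index, but the similarity measure being ranked is the same.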

Semantic ontology matching requires the vector-search extra; install it with pip install 'lobster-ai[vector-search]'. Without it, disease matching falls back to a keyword lookup that covers only 4 hardcoded diseases. See the Semantic Search Guide for details.

The publication queue workflow is designed for batch processing of imported publication lists (RIS files, systematic reviews). For interactive dataset discovery, use the research agent's fast_dataset_search instead.

Services

lobster-metadata includes metadata management services bundled with the package:

Service	Purpose
PublicationProcessingService	Batch process publication queue entries
GEOService	GEO database access and metadata extraction
SRAService	SRA database access and run metadata
ArrayExpressService	ArrayExpress data access
PRIDEService	PRIDE proteomics database access
ENAService	European Nucleotide Archive access
FilteringService	Metadata filtering by criteria
MicrobiomeAmpliconService	16S/ITS amplicon metadata filtering
MicrobiomeShotgunService	Shotgun metagenomics metadata filtering
DiseaseOntologyService	Disease, tissue, and cell type ontology matching (with `vector-search` extra)
VectorSearchService	Vector search engine — backends, embedders, rerankers (with `vector-search` extra)

The vector search infrastructure (~1,900 LOC) is bundled with lobster-metadata since it is the primary consumer. Install with pip install 'lobster-metadata[vector-search]' or pip install 'lobster-ai[vector-search]'.

All other services are installed automatically with the base package.

Dependencies

lobster-metadata requires database access and metadata parsing libraries:

Library	Purpose
Bio.Entrez (Biopython)	NCBI E-utilities API (PubMed, GEO, SRA)
requests	HTTP requests for database APIs
pandas	Metadata manipulation and CSV export
lxml	XML parsing for database responses

These are installed automatically with the package.

Configuration

# .lobster_workspace/config.toml
enabled = ["metadata_assistant"]

Publication Queue Integration

metadata_assistant works downstream from research_agent and data_expert:

research_agent
  -> Creates publication queue entries
  -> Extracts identifiers via NCBI E-Link
  -> Sets status to HANDOFF_READY
    |
    v
metadata_assistant
  -> Batch processes HANDOFF_READY entries
  -> Applies filter criteria
  -> Exports unified CSV with publication + sample context

This workflow allows:

Batch processing - Handle multiple publications at once
Context preservation - Maintain publication → dataset → sample linkage
Flexible filtering - Apply domain-specific criteria (microbiome, transcriptomics, etc.)
Unified export - Single CSV with all relevant metadata

Metadata Filtering Criteria

Supported filter dimensions:

Dimension	Examples
Tissue/Organ	brain, liver, gut, blood
Disease State	healthy, cancer, diabetes, COVID-19
Treatment	drug-treated, vehicle, untreated
Cell Type	T cells, neurons, fibroblasts
Age Group	adult, pediatric, elderly
Sex	male, female, mixed
Organism	human, mouse, rat

Filters are applied via natural language or structured criteria.
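
As an illustration of the structured form, the sketch below parses `key=value` clauses, as written in the workflow examples (`tissue=liver and organism=human`), into a criteria dictionary. The `parse_criteria` helper and its accepted syntax are assumptions made for this example. The agent's natural-language handling is not shown.

```python
import re


def parse_criteria(expression):
    """Parse 'key=value' clauses joined by 'and' or commas into a dict."""
    criteria = {}
    clauses = re.split(r"\s+and\s+|\s*,\s*", expression.strip(), flags=re.IGNORECASE)
    for clause in clauses:
        if not clause:
            continue  # tolerate empty input and trailing separators
        key, sep, value = clause.partition("=")
        if not sep or not key.strip() or not value.strip():
            raise ValueError(f"Expected key=value, got: {clause!r}")
        criteria[key.strip().lower()] = value.strip()
    return criteria


print(parse_criteria("tissue=liver and organism=human"))
print(parse_criteria("body_site=gut, health_status=healthy, age_group=adult"))
```

Keys are lowercased so that `Tissue=liver` and `tissue=liver` address the same dimension. Values keep their original case and are left for the filter to normalize.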
