Semantic Search & Ontology Matching
Match biomedical terms to standardized ontology concepts using vector embeddings
Overview
Lobster AI includes an optional semantic vector search infrastructure for matching biomedical terms against standardized ontology concepts. Instead of relying on exact keyword matching, semantic search uses SapBERT embeddings (a PubMedBERT-based model trained on 4M+ UMLS synonym pairs) to find the closest ontology concept by meaning — handling synonyms, abbreviations, and spelling variations.
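To make "closest by meaning" concrete, here is a minimal sketch of nearest-concept lookup by cosine similarity. The 3-dimensional vectors are invented for illustration (real SapBERT embeddings are 768-dimensional), and `cosine_similarity` and `closest_concept` are hypothetical helper names, not part of Lobster's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors standing in for real concept embeddings.
CONCEPT_VECTORS = {
    "glioblastoma": [0.9, 0.1, 0.0],
    "type 2 diabetes mellitus": [0.0, 0.2, 0.9],
    "Crohn's disease": [0.1, 0.9, 0.2],
}


def closest_concept(query_vector):
    """Return (label, similarity) for the concept most similar to the query."""
    scored = [
        (label, cosine_similarity(query_vector, vector))
        for label, vector in CONCEPT_VECTORS.items()
    ]
    return max(scored, key=lambda pair: pair[1])


# A query vector placed near "glioblastoma", as a synonym or abbreviation might be.
label, score = closest_concept([0.8, 0.2, 0.1])
print(label, round(score, 3))
```

The point of the sketch is that the query never has to share characters with the concept label; only the vectors need to be close.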
Supported Ontologies
| Ontology | Alias | Source | Terms | Examples |
|---|---|---|---|---|
| MONDO | disease | Monarch Disease Ontology | ~30K | glioblastoma, T2D, Crohn's disease |
| UBERON | tissue | Uber-anatomy Ontology | ~15K | liver, prefrontal cortex, jejunum |
| Cell Ontology | cell_type | Cell Ontology (CL) | ~2.5K | CD8+ T cell, oligodendrocyte, hepatocyte |
When You Need This
- Disease standardization: Map varied disease names ("GBM", "glioblastoma multiforme", "grade IV astrocytoma") to MONDO IDs
- Tissue harmonization: Unify tissue labels across datasets from different studies
- Cell type annotation: Match cell type labels to Cell Ontology concepts
- Cross-dataset integration: Ensure consistent terminology before merging datasets
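As an illustration of the disease standardization and cross-dataset use cases, the sketch below applies a label-to-concept mapping to two small datasets before merging them. The mapping is written by hand here and stands in for whatever the matching step would return; `LABEL_TO_CONCEPT` and `harmonize` are hypothetical names, and the sample records are invented.

```python
from collections import Counter

# Hand-written stand-in for matcher output: raw label -> (ontology ID, canonical name).
LABEL_TO_CONCEPT = {
    "gbm": ("MONDO:0018177", "glioblastoma"),
    "glioblastoma multiforme": ("MONDO:0018177", "glioblastoma"),
    "grade iv astrocytoma": ("MONDO:0018177", "glioblastoma"),
}


def harmonize(records, column="disease"):
    """Return copies of the records with an ontology ID and canonical name attached."""
    harmonized = []
    for record in records:
        key = record[column].strip().lower()
        concept_id, concept_name = LABEL_TO_CONCEPT.get(key, (None, None))
        harmonized.append(
            {**record, f"{column}_id": concept_id, f"{column}_name": concept_name}
        )
    return harmonized


# Two invented studies that label the same disease differently.
study_a = [
    {"sample": "A1", "disease": "GBM"},
    {"sample": "A2", "disease": "glioblastoma multiforme"},
]
study_b = [
    {"sample": "B1", "disease": "Grade IV astrocytoma"},
    {"sample": "B2", "disease": "unknown tumor"},
]

merged = harmonize(study_a) + harmonize(study_b)
counts = Counter(row["disease_id"] for row in merged)
print(counts)
```

Unmatched labels are kept with a `None` ID rather than dropped, so they can be reviewed by hand before the merge.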
Semantic search is optional. Without it, agents fall back to keyword matching with a small built-in dictionary. No errors occur — matching quality is just reduced.
Installation
Five installation methods are available:
# 1. Via lobster init (interactive — prompts after Docling)
lobster init
# 2. Via lobster init (non-interactive)
lobster init --install-vector-search --anthropic-key sk-...
# 3. Via pip directly
pip install 'lobster-ai[vector-search]'
# 4. As part of full install (includes all extras)
pip install 'lobster-ai[full]'
# 5. Via uv tool (global install)
uv tool install 'lobster-ai[vector-search,anthropic]'
What gets installed:
- lobster-metadata[vector-search]: Metadata package with vector search infrastructure
- chromadb>=1.0.0: Persistent vector store
- sentence-transformers>=2.2.0: SapBERT embedding model
The lobster-ai[vector-search] extra delegates to lobster-metadata[vector-search], which owns the vector search module. The end-user install command is unchanged.
Verify installation:
python -c "import chromadb; import sentence_transformers; print('Semantic Search available')"
# Or check via lobster status
lobster status
# Shows: Optional Capabilities: ✓ Semantic Search
Quick Start
Once installed, semantic search is used automatically by agents in chat:
Disease Matching
You: Match these disease terms to MONDO ontology: glioblastoma, lung adenocarcinoma, T2D
[metadata_assistant]
- Embeds each term with SapBERT (768-dim)
- Queries MONDO collection via ChromaDB
- Returns:
glioblastoma → MONDO:0018177 (confidence: 0.96)
lung adenocarcinoma → MONDO:0005061 (confidence: 0.94)
T2D → MONDO:0005148 (type 2 diabetes mellitus, confidence: 0.91)
Tissue Standardization
You: Standardize my tissue annotations to UBERON terms
[metadata_assistant]
- Reads tissue column from loaded dataset
- Matches each unique value against UBERON
- Maps: "brain cortex" → UBERON:0000956 (cerebral cortex)
- Maps: "gut" → UBERON:0000160 (intestine)
Cell Type Matching
You: What cell types match "CD8+ T cells" in Cell Ontology?
[metadata_assistant]
- Queries Cell Ontology collection
- Returns top matches with confidence scores
- CD8-positive, alpha-beta T cell (CL:0000625, confidence: 0.97)
How It Works
Architecture
VectorSearchConfig.from_env()
├── create_backend() → ChromaDB (default) | FAISS | pgvector
├── create_embedder() → SapBERT (default) | MiniLM | OpenAI
└── create_reranker() → CrossEncoder | Cohere | None (default)
↓
VectorSearchService
├── query(text, collection, top_k) → SearchResponse
├── query_batch(texts, collection) → List[SearchResponse]
└── match_ontology(text, ontology) → List[OntologyMatch]
Data Flow (First Use)
On first use, ontology data is automatically downloaded from S3:
1. User calls match_ontology("glioblastoma", "disease")
2. VectorSearchService checks ChromaDB for "mondo_v2024_01" collection
3. Collection not found → downloads from S3:
lobster-ontology-data.s3.amazonaws.com/v1/mondo_sapbert_768.tar.gz
4. Extracts to ~/.lobster/ontology_cache/ → copies to vector_store/
5. SapBERT embeds "glioblastoma" → 768-dim vector
6. ChromaDB cosine search → top-k results
7. Distance → similarity conversion → List[OntologyMatch]
Subsequent queries skip the download step; the data is cached locally.
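The last step of the flow, turning distances into similarity-ranked matches, could look roughly like the sketch below. It assumes the backend reports cosine distance (0 means identical), so similarity is taken as 1 minus the distance; the actual conversion used by Lobster is not shown in this guide. `Match` is a hypothetical stand-in for `OntologyMatch`, and the distances are invented.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical stand-in for OntologyMatch."""
    term_id: str
    label: str
    score: float


def distances_to_matches(ids, labels, distances, top_k=3):
    """Convert cosine distances into matches sorted by descending similarity."""
    matches = []
    for term_id, label, distance in zip(ids, labels, distances):
        similarity = 1.0 - distance  # assumes cosine distance = 1 - cosine similarity
        similarity = max(0.0, min(1.0, similarity))  # clamp, since cosine distance can reach 2
        matches.append(Match(term_id, label, round(similarity, 4)))
    matches.sort(key=lambda match: match.score, reverse=True)
    return matches[:top_k]


# Invented raw results, deliberately out of order.
matches = distances_to_matches(
    ids=["MONDO:0005061", "MONDO:0018177"],
    labels=["lung adenocarcinoma", "glioblastoma"],
    distances=[0.62, 0.04],
)
for match in matches:
    print(match.term_id, match.label, match.score)
```

Clamping to the 0 to 1 range is a design choice made here so that scores read like the confidence values in the chat examples; a different backend metric would need a different conversion.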
Configuration
All configuration is via environment variables. Defaults work for most users.
| Variable | Default | Options | Purpose |
|---|---|---|---|
| LOBSTER_VECTOR_BACKEND | chromadb | chromadb, faiss, pgvector | Vector store backend |
| LOBSTER_EMBEDDING_PROVIDER | sapbert | sapbert, minilm, openai | Embedding model |
| LOBSTER_VECTOR_STORE_PATH | ~/.lobster/vector_store/ | Any path | Persistent storage directory |
| LOBSTER_RERANKER | none | cross_encoder, cohere, none | Optional reranking step |
| LOBSTER_VECTOR_CLOUD_URL | (unset) | URL | Cloud ChromaDB endpoint (future) |
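The table above can be read as a small settings loader. The sketch below is one way such a loader might resolve defaults and reject unknown options; it is not Lobster's `VectorSearchConfig`, and `VectorSearchSettings`, `settings_from_env`, and the validation behavior are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# Allowed options, copied from the configuration table.
ALLOWED = {
    "LOBSTER_VECTOR_BACKEND": {"chromadb", "faiss", "pgvector"},
    "LOBSTER_EMBEDDING_PROVIDER": {"sapbert", "minilm", "openai"},
    "LOBSTER_RERANKER": {"cross_encoder", "cohere", "none"},
}


@dataclass(frozen=True)
class VectorSearchSettings:
    """Hypothetical settings container, not the real VectorSearchConfig."""
    backend: str
    embedding_provider: str
    store_path: Path
    reranker: str
    cloud_url: Optional[str]


def _choice(env, name, default):
    """Read an option from env and check it against the allowed values."""
    value = env.get(name, default).strip().lower()
    if value not in ALLOWED[name]:
        raise ValueError(f"{name}={value!r}; expected one of {sorted(ALLOWED[name])}")
    return value


def settings_from_env(env=None):
    """Build settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return VectorSearchSettings(
        backend=_choice(env, "LOBSTER_VECTOR_BACKEND", "chromadb"),
        embedding_provider=_choice(env, "LOBSTER_EMBEDDING_PROVIDER", "sapbert"),
        store_path=Path(env.get("LOBSTER_VECTOR_STORE_PATH", "~/.lobster/vector_store/")).expanduser(),
        reranker=_choice(env, "LOBSTER_RERANKER", "none"),
        cloud_url=env.get("LOBSTER_VECTOR_CLOUD_URL") or None,
    )


# Passing a plain dict keeps the example independent of the real environment.
settings = settings_from_env({"LOBSTER_EMBEDDING_PROVIDER": "minilm"})
print(settings)
```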
Example (use lightweight embeddings):
export LOBSTER_EMBEDDING_PROVIDER=minilm
Example (custom storage location):
export LOBSTER_VECTOR_STORE_PATH=/data/lobster/vectors
Agent Integration
metadata_assistant
The metadata agent gains semantic ontology matching when vector-search is installed:
- DiseaseOntologyService upgrades from keyword matching (4 diseases) to embedding-based matching (~30K MONDO concepts)
- Cross-database ID mapping tool (create_cross_database_id_mapping_tool) maps between MONDO, UMLS, MeSH, and other knowledgebases
- Tissue and cell type matching available for dataset harmonization
genomics_expert
The genomics agent gains new knowledgebase tools:
- Variant Consequence Prediction — Ensembl VEP integration for predicting variant effects (SIFT, PolyPhen scores)
- Sequence Retrieval — Fetch genomic, cDNA, CDS, and protein sequences from Ensembl
annotation_expert (Planned)
Cell type annotation against Cell Ontology is planned for integration with the transcriptomics annotation expert.
Fallback Behavior
When vector-search is not installed, the system degrades gracefully:
- Disease matching falls back to keyword matching with 4 hardcoded diseases (CRC, UC, CD, Healthy)
- Tissue/cell type matching is not available (no errors, tools simply not registered)
- No import errors — all vector search dependencies use import guards:
try:
    import chromadb
    HAS_CHROMADB = True
except ImportError:
    HAS_CHROMADB = False
# Tools only added to agent when deps available
if HAS_CHROMADB:
    tools.append(create_ontology_match_tool(data_manager))
This means agents work normally; they just have fewer tools available.
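The degraded path can be pictured as a dispatch between a semantic matcher and a small keyword dictionary. The sketch below is a simplified illustration, not the DiseaseOntologyService implementation: `has_module`, `match_disease`, and `KEYWORD_DISEASES` are hypothetical names, and the expansions of CRC, UC, and CD are assumed from common usage.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


# Assumed expansions of the four keyword entries named in the list above.
KEYWORD_DISEASES = {
    "crc": "colorectal cancer",
    "uc": "ulcerative colitis",
    "cd": "Crohn's disease",
    "healthy": "healthy",
}


def match_disease(term, semantic_matcher=None):
    """Use the semantic matcher when one is supplied, otherwise fall back to keywords."""
    if semantic_matcher is not None:
        return semantic_matcher(term)
    return KEYWORD_DISEASES.get(term.strip().lower())


# Both optional dependencies must be present for the semantic path.
SEMANTIC_AVAILABLE = has_module("chromadb") and has_module("sentence_transformers")

print("semantic search available:", SEMANTIC_AVAILABLE)
print(match_disease("CRC"))
print(match_disease("glioblastoma"))  # not in the keyword dictionary
```

Returning `None` for unknown terms, rather than raising, mirrors the "no errors, reduced quality" behavior described in this section.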
Disk Space Requirements
| Component | Size | Location |
|---|---|---|
| chromadb + sentence-transformers | ~150 MB | site-packages |
| SapBERT model (first use) | ~420 MB | ~/.cache/huggingface/ |
| Ontology data (3 tarballs) | ~50 MB each | ~/.lobster/ontology_cache/ |
| ChromaDB vector store | ~200 MB | ~/.lobster/vector_store/ |
| Total first-use download | ~800 MB | |
After first use, only the ChromaDB vector store is accessed — no re-downloads needed.
Troubleshooting
Model download fails
# Test model download directly
python -c "from sentence_transformers import SentenceTransformer; m = SentenceTransformer('cambridgeltl/SapBERT-from-PubMedBERT-fulltext'); print('OK')"
# If the default cache location is short on space, point the HuggingFace cache elsewhere
export HF_HOME=/path/with/space/.cache/huggingface
# If download is interrupted, clear cache and retry
rm -rf ~/.cache/huggingface/hub/models--cambridgeltl--SapBERT-from-PubMedBERT-fulltext
Disk space issues
# Check current usage
du -sh ~/.lobster/vector_store/ ~/.lobster/ontology_cache/ ~/.cache/huggingface/
# Free space by removing ontology cache (will re-download on next use)
rm -rf ~/.lobster/ontology_cache/
CUDA vs CPU
SapBERT runs on CPU by default and is fast enough for interactive use (~50ms per query). No GPU is required. If you have a CUDA GPU available, sentence-transformers will use it automatically for faster batch operations.
ChromaDB version conflicts
# Ensure compatible version
pip install 'chromadb>=1.0.0'
# If you see sqlite3 errors on older Linux
pip install pysqlite3-binary
Related Documentation:
- Optional Dependencies Guide — Overview of all optional components
- Installation Guide — Main installation instructions
- Metadata Agent — Agent that uses semantic matching
- Architecture Overview — Disease Ontology Service architecture
- Vector Search Internals — Power user reference