Semantic Search & Ontology Matching

Match biomedical terms to standardized ontology concepts using vector embeddings

Overview

Lobster AI includes an optional semantic vector search infrastructure for matching biomedical terms against standardized ontology concepts. Instead of relying on exact keyword matching, semantic search uses SapBERT embeddings (a PubMedBERT-based model trained on 4M+ UMLS synonym pairs) to find the closest ontology concept by meaning — handling synonyms, abbreviations, and spelling variations.
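At its core, "closest concept by meaning" is a nearest-neighbor lookup under cosine similarity between embedding vectors. A toy sketch of that idea, using invented 3-dimensional vectors (real SapBERT embeddings are 768-dimensional, and the numbers below are made up purely for illustration):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

# Invented stand-ins for SapBERT concept embeddings.
concepts = {
    "MONDO:0018177 glioblastoma": [0.9, 0.1, 0.1],
    "MONDO:0005148 type 2 diabetes mellitus": [0.1, 0.9, 0.2],
}
query = [0.85, 0.15, 0.1]  # pretend embedding of the user term "GBM"

# Nearest concept by cosine similarity wins, even though "GBM" shares
# no keywords with "glioblastoma".
best = max(concepts, key=lambda c: cosine(query, concepts[c]))
print(best)  # the glioblastoma concept
```

This is why abbreviations and spelling variants still match: the model places them near the canonical concept in embedding space, so no exact string overlap is needed.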

Supported Ontologies

| Ontology | Alias | Source | Terms | Examples |
|---|---|---|---|---|
| MONDO | disease | Monarch Disease Ontology | ~30K | glioblastoma, T2D, Crohn's disease |
| UBERON | tissue | Uber-anatomy Ontology | ~15K | liver, prefrontal cortex, jejunum |
| Cell Ontology | cell_type | Cell Ontology (CL) | ~2.5K | CD8+ T cell, oligodendrocyte, hepatocyte |

When You Need This

  • Disease standardization: Map varied disease names ("GBM", "glioblastoma multiforme", "grade IV astrocytoma") to MONDO IDs
  • Tissue harmonization: Unify tissue labels across datasets from different studies
  • Cell type annotation: Match cell type labels to Cell Ontology concepts
  • Cross-dataset integration: Ensure consistent terminology before merging datasets
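For cross-dataset integration, harmonization boils down to mapping each free-text label onto a single shared ontology ID before merging. A toy illustration with a hand-written mapping (the real label-to-ID mapping would come from the matcher, and the UBERON ID here is just the standard liver term used for illustration):

```python
# Hand-written stand-in for matcher output: free-text tissue label -> UBERON ID.
harmonized = {
    "liver": "UBERON:0002107",
    "hepatic tissue": "UBERON:0002107",
}

# Two studies that labeled the same tissue differently.
dataset_a = ["liver", "liver"]
dataset_b = ["hepatic tissue"]

# After harmonization, both collapse to one ontology ID and can be merged safely.
merged = [harmonized[label] for label in dataset_a + dataset_b]
print(set(merged))  # a single shared ID
```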

Semantic search is optional. Without it, agents fall back to keyword matching with a small built-in dictionary. No errors occur — matching quality is just reduced.

Installation

Five installation methods are available:

# 1. Via lobster init (interactive — prompts after Docling)
lobster init

# 2. Via lobster init (non-interactive)
lobster init --install-vector-search --anthropic-key sk-...

# 3. Via pip directly
pip install 'lobster-ai[vector-search]'

# 4. As part of full install (includes all extras)
pip install 'lobster-ai[full]'

# 5. Via uv tool (global install)
uv tool install 'lobster-ai[vector-search,anthropic]'

What gets installed:

  • lobster-metadata[vector-search] — Metadata package with vector search infrastructure
    • chromadb>=1.0.0 — Persistent vector store
    • sentence-transformers>=2.2.0 — SapBERT embedding model

The lobster-ai[vector-search] extra delegates to lobster-metadata[vector-search], which owns the vector search module. The end-user install command is unchanged.

Verify installation:

python -c "import chromadb; import sentence_transformers; print('Semantic Search available')"

# Or check via lobster status
lobster status
# Shows: Optional Capabilities: ✓ Semantic Search

Quick Start

Once installed, semantic search is used automatically by agents in chat:

Disease Matching

You: Match these disease terms to MONDO ontology: glioblastoma, lung adenocarcinoma, T2D

[metadata_assistant]
- Embeds each term with SapBERT (768-dim)
- Queries MONDO collection via ChromaDB
- Returns:
  glioblastoma → MONDO:0018177 (confidence: 0.96)
  lung adenocarcinoma → MONDO:0005061 (confidence: 0.94)
  T2D → MONDO:0005148 (type 2 diabetes mellitus, confidence: 0.91)

Tissue Standardization

You: Standardize my tissue annotations to UBERON terms

[metadata_assistant]
- Reads tissue column from loaded dataset
- Matches each unique value against UBERON
- Maps: "brain cortex" → UBERON:0001870 (cerebral cortex)
- Maps: "gut" → UBERON:0000160 (intestine)

Cell Type Matching

You: What cell types match "CD8+ T cells" in Cell Ontology?

[metadata_assistant]
- Queries Cell Ontology collection
- Returns top matches with confidence scores
- CD8-positive, alpha-beta T cell (CL:0000625, confidence: 0.97)

How It Works

Architecture

VectorSearchConfig.from_env()
    ├── create_backend()     → ChromaDB (default) | FAISS | pgvector
    ├── create_embedder()    → SapBERT (default) | MiniLM | OpenAI
    └── create_reranker()    → CrossEncoder | Cohere | None (default)

VectorSearchService
    ├── query(text, collection, top_k)     → SearchResponse
    ├── query_batch(texts, collection)     → List[SearchResponse]
    └── match_ontology(text, ontology)     → List[OntologyMatch]
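The return types in the diagram above can be pictured as small records. The exact fields are not documented on this page, so the shapes below are assumptions for illustration only, not the actual classes from lobster-metadata:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class OntologyMatch:
    # Field names are assumed, not taken from the real package.
    concept_id: str    # e.g. "MONDO:0018177"
    label: str         # canonical concept name
    confidence: float  # similarity score in [0, 1]

@dataclass
class SearchResponse:
    # Assumed shape for what query() returns.
    query: str
    matches: List[OntologyMatch]

resp = SearchResponse(
    query="glioblastoma",
    matches=[OntologyMatch("MONDO:0018177", "glioblastoma", 0.96)],
)
```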

Data Flow (First Use)

On first use, ontology data is automatically downloaded from S3:

1. User calls match_ontology("glioblastoma", "disease")
2. VectorSearchService checks ChromaDB for "mondo_v2024_01" collection
3. Collection not found → downloads from S3:
   lobster-ontology-data.s3.amazonaws.com/v1/mondo_sapbert_768.tar.gz
4. Extracts to ~/.lobster/ontology_cache/ → copies to vector_store/
5. SapBERT embeds "glioblastoma" → 768-dim vector
6. ChromaDB cosine search → top-k results
7. Distance → similarity conversion → List[OntologyMatch]

Subsequent queries skip the download step — data is cached locally.
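Step 7's distance-to-similarity conversion is straightforward if the collection uses ChromaDB's cosine space, where reported distance is 1 minus cosine similarity. A minimal sketch under that assumption (the result shape and the final pairing are illustrative, not the package's actual code):

```python
def distance_to_similarity(distance: float) -> float:
    """ChromaDB's cosine space reports distance = 1 - cosine_similarity."""
    return 1.0 - distance

# Illustrative raw result in ChromaDB's nested query-result layout:
# one inner list per query text.
raw = {"ids": [["MONDO:0018177"]], "distances": [[0.04]]}

# Pair each concept ID with its confidence score.
matches = [
    (concept_id, distance_to_similarity(d))
    for concept_id, d in zip(raw["ids"][0], raw["distances"][0])
]
print(matches)
```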

Configuration

All configuration is via environment variables. Defaults work for most users.

| Variable | Default | Options | Purpose |
|---|---|---|---|
| LOBSTER_VECTOR_BACKEND | chromadb | chromadb, faiss, pgvector | Vector store backend |
| LOBSTER_EMBEDDING_PROVIDER | sapbert | sapbert, minilm, openai | Embedding model |
| LOBSTER_VECTOR_STORE_PATH | ~/.lobster/vector_store/ | Any path | Persistent storage directory |
| LOBSTER_RERANKER | none | cross_encoder, cohere, none | Optional reranking step |
| LOBSTER_VECTOR_CLOUD_URL | (unset) | URL | Cloud ChromaDB endpoint (future) |

Example — use lightweight embeddings:

export LOBSTER_EMBEDDING_PROVIDER=minilm

Example — custom storage location:

export LOBSTER_VECTOR_STORE_PATH=/data/lobster/vectors
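The variables above suggest how a `from_env()`-style loader resolves settings: read each variable, fall back to the table's default when unset. This is a sketch of the pattern, not the actual `VectorSearchConfig` implementation:

```python
import os
from dataclasses import dataclass

@dataclass
class VectorConfig:
    backend: str
    embedding_provider: str
    store_path: str
    reranker: str

def from_env() -> VectorConfig:
    # Defaults mirror the configuration table above.
    return VectorConfig(
        backend=os.environ.get("LOBSTER_VECTOR_BACKEND", "chromadb"),
        embedding_provider=os.environ.get("LOBSTER_EMBEDDING_PROVIDER", "sapbert"),
        store_path=os.environ.get(
            "LOBSTER_VECTOR_STORE_PATH",
            os.path.expanduser("~/.lobster/vector_store/"),
        ),
        reranker=os.environ.get("LOBSTER_RERANKER", "none"),
    )

cfg = from_env()
print(cfg.backend)  # "chromadb" unless LOBSTER_VECTOR_BACKEND is set
```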

Agent Integration

metadata_assistant

The metadata agent gains semantic ontology matching when vector-search is installed:

  • DiseaseOntologyService upgrades from keyword matching (4 diseases) to embedding-based matching (~30K MONDO concepts)
  • Cross-database ID mapping tool (create_cross_database_id_mapping_tool) maps between MONDO, UMLS, MeSH, and other knowledgebases
  • Tissue and cell type matching available for dataset harmonization

genomics_expert

The genomics agent gains new knowledgebase tools:

  • Variant Consequence Prediction — Ensembl VEP integration for predicting variant effects (SIFT, PolyPhen scores)
  • Sequence Retrieval — Fetch genomic, cDNA, CDS, and protein sequences from Ensembl

annotation_expert (Planned)

Cell type annotation against Cell Ontology is planned for integration with the transcriptomics annotation expert.

Fallback Behavior

When vector-search is not installed, the system degrades gracefully:

  • Disease matching falls back to keyword matching with 4 hardcoded diseases (CRC, UC, CD, Healthy)
  • Tissue/cell type matching is not available (no errors, tools simply not registered)
  • No import errors — all vector search dependencies use import guards:
try:
    import chromadb
    HAS_CHROMADB = True
except ImportError:
    HAS_CHROMADB = False

# Tools only added to agent when deps available
if HAS_CHROMADB:
    tools.append(create_ontology_match_tool(data_manager))

This means agents work normally — they just have fewer tools available.

Disk Space Requirements

| Component | Size | Location |
|---|---|---|
| chromadb + sentence-transformers | ~150 MB | site-packages |
| SapBERT model (first use) | ~420 MB | ~/.cache/huggingface/ |
| Ontology data (3 tarballs) | ~50 MB each | ~/.lobster/ontology_cache/ |
| ChromaDB vector store | ~200 MB | ~/.lobster/vector_store/ |
| Total first-use download | ~800 MB | |

After first use, only the ChromaDB vector store is accessed — no re-downloads needed.

Troubleshooting

Model download fails

# Test model download directly
python -c "from sentence_transformers import SentenceTransformer; m = SentenceTransformer('cambridgeltl/SapBERT-from-PubMedBERT-fulltext'); print('OK')"

# If behind a firewall, set HuggingFace cache
export HF_HOME=/path/with/space/.cache/huggingface

# If download is interrupted, clear cache and retry
rm -rf ~/.cache/huggingface/hub/models--cambridgeltl--SapBERT-from-PubMedBERT-fulltext

Disk space issues

# Check current usage
du -sh ~/.lobster/vector_store/ ~/.lobster/ontology_cache/ ~/.cache/huggingface/

# Free space by removing ontology cache (will re-download on next use)
rm -rf ~/.lobster/ontology_cache/

CUDA vs CPU

SapBERT runs on CPU by default and is fast enough for interactive use (~50ms per query). No GPU is required. If you have a CUDA GPU available, sentence-transformers will use it automatically for faster batch operations.
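If you prefer to force the choice rather than rely on auto-detection, `SentenceTransformer` accepts a `device` argument. A small guarded helper that works whether or not torch is importable (the helper name is ours, not part of Lobster):

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        # torch absent: sentence-transformers would not run anyway,
        # but "cpu" is the safe answer for any other consumer.
        return "cpu"

# Usage sketch (commented out to avoid the ~420 MB model download):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer(
#     "cambridgeltl/SapBERT-from-PubMedBERT-fulltext", device=pick_device()
# )
```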

ChromaDB version conflicts

# Ensure compatible version
pip install 'chromadb>=1.0.0'

# If you see sqlite3 errors on older Linux
pip install pysqlite3-binary
