Vector Search Internals
Deep dive into the vector search architecture: backends, embedders, rerankers, ontology collections, and S3 auto-download
Overview
This page covers the internal architecture of Lobster AI's vector search infrastructure. For usage and installation, see the Semantic Search Guide.
The vector search system is config-driven — a single switching point (VectorSearchConfig.from_env()) creates the backend, embedder, and reranker from environment variables:
```
VectorSearchConfig.from_env()
├── create_backend() → ChromaDB | FAISS | pgvector
├── create_embedder() → SapBERT | MiniLM | OpenAI
└── create_reranker() → CrossEncoder | Cohere | None
        ↓
VectorSearchService
├── query(text, collection, top_k) → SearchResponse
├── query_batch(texts, collection) → List[SearchResponse]
└── match_ontology(text, ontology) → List[OntologyMatch]
```

All components implement base ABCs (BaseVectorBackend, BaseEmbedder, BaseReranker), making it straightforward to add new backends or models.
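The single-switching-point pattern can be sketched with a few lines of standard-library Python. This is an illustrative stand-in, not the actual VectorSearchConfig implementation — field names and the exact fallback behavior are assumptions; only the environment-variable names and defaults come from the reference table later on this page.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class VectorSearchConfigSketch:
    """Illustrative stand-in for VectorSearchConfig (field names are assumptions)."""

    backend: str
    embedder: str
    reranker: Optional[str]

    @classmethod
    def from_env(cls) -> "VectorSearchConfigSketch":
        # Each component is selected by one environment variable,
        # falling back to the documented defaults.
        reranker = os.getenv("LOBSTER_RERANKER", "none")
        return cls(
            backend=os.getenv("LOBSTER_VECTOR_BACKEND", "chromadb"),
            embedder=os.getenv("LOBSTER_EMBEDDING_PROVIDER", "sapbert"),
            reranker=None if reranker == "none" else reranker,
        )
```

Centralizing the switch this way means callers never branch on backend type themselves; they receive a fully assembled service.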
Backend Options
ChromaDB (Default)
The primary backend. Uses ChromaDB's PersistentClient for durable local storage.
| Aspect | Details |
|---|---|
| Type | Persistent local vector store |
| Storage | ~/.lobster/vector_store/ (configurable) |
| Dependencies | chromadb>=1.0.0 |
| Performance | 30-50ms per query |
| Best for | Default use, local installations, development |
ChromaDB stores embeddings in an SQLite-backed persistent directory. Collections survive process restarts.
FAISS (Ephemeral)
In-memory vector search using Facebook's FAISS library. Useful for ephemeral workloads or testing.
| Aspect | Details |
|---|---|
| Type | In-memory (ephemeral) |
| Storage | None (lost on process exit) |
| Dependencies | faiss-cpu or faiss-gpu |
| Performance | Sub-millisecond queries |
| Best for | Testing, benchmarks, ephemeral environments |
FAISS does not persist data between sessions: ontology collections must be reloaded on each startup, which adds latency on first use.
pgvector (Future)
PostgreSQL-based vector storage for cloud deployments. Currently a stub — the interface is defined but not yet implemented.
| Aspect | Details |
|---|---|
| Type | PostgreSQL extension |
| Storage | Remote database |
| Status | Stub (interface only) |
| Best for | Cloud deployments, shared infrastructure |
Embedder Options
SapBERT (Primary)
The default and recommended embedder for biomedical terminology.
| Aspect | Details |
|---|---|
| Model | cambridgeltl/SapBERT-from-PubMedBERT-fulltext |
| Dimensions | 768 |
| Training | 4M+ UMLS synonym pairs |
| Size | ~420 MB (downloaded on first use) |
| Best for | All biomedical ontology matching |
SapBERT is specifically trained on biomedical synonyms, making it the best choice for matching disease names, tissue terms, and cell types against ontology concepts.
MiniLM (Lightweight)
A smaller, general-purpose model for resource-constrained environments.
| Aspect | Details |
|---|---|
| Model | all-MiniLM-L6-v2 |
| Dimensions | 384 |
| Size | ~80 MB |
| Best for | Low-memory environments, quick testing |
Lower biomedical accuracy than SapBERT but faster and smaller.
OpenAI (API-Based)
Uses OpenAI's embedding API for environments where local model hosting is not possible.
| Aspect | Details |
|---|---|
| Model | text-embedding-3-small |
| Dimensions | 1536 |
| Requires | OPENAI_API_KEY environment variable |
| Best for | Environments without GPU/CPU capacity for local models |
Using the OpenAI embedder requires network access and incurs API costs. SapBERT is recommended for most users since it runs locally with no API calls.
Reranker Pipeline
Rerankers provide an optional second-pass scoring step after the initial vector search. By default, no reranker is used (LOBSTER_RERANKER=none).
Cross-Encoder
Uses MS MARCO MiniLM as a cross-encoder to re-score candidate matches:
| Aspect | Details |
|---|---|
| Model | MS MARCO MiniLM |
| Effect | Re-ranks top-k results by pairwise relevance |
| When useful | When initial results need refinement |
Cohere
Uses the Cohere API reranker:
| Aspect | Details |
|---|---|
| Requires | COHERE_API_KEY environment variable |
| Effect | API-based reranking |
None (Default)
No reranking step. The initial vector search results are returned directly. This is sufficient for most ontology matching use cases since the pre-built collections are already optimized.
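The two-pass shape of the pipeline can be sketched independently of any model. Here the pairwise scorer is a toy token-overlap function standing in for a cross-encoder; names and signatures are illustrative, not Lobster's API:

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[Tuple[str, float]],      # (text, first-pass vector score)
    score_pair: Callable[[str, str], float],  # stand-in for a cross-encoder
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Second pass: re-score each (query, candidate) pair and re-sort.

    With no reranker configured, callers simply keep `candidates` as-is.
    """
    rescored = [(text, score_pair(query, text)) for text, _ in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)[:top_k]


def overlap(a: str, b: str) -> float:
    """Toy pairwise scorer: Jaccard token overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)
```

The key property is that the reranker sees the query and each candidate *together*, so it can catch matches the independent first-pass embeddings ranked too low.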
Ontology Collections
Three ontology collections are pre-built and hosted on S3:
| Alias | Canonical Name | Source | Terms | Tarball |
|---|---|---|---|---|
| disease | mondo_v2024_01 | MONDO | ~30K | mondo_sapbert_768.tar.gz |
| tissue | uberon_v2024_01 | UBERON | ~15K | uberon_sapbert_768.tar.gz |
| cell_type | cell_ontology_v2024_01 | Cell Ontology | ~2.5K | cell_ontology_sapbert_768.tar.gz |
Each collection contains:
- Pre-computed SapBERT embeddings (768-dim) for every ontology concept
- Concept metadata (ID, name, synonyms, parent terms)
- ChromaDB-compatible format for direct import
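Conceptually, matching against such a collection is just ranking the pre-computed concept embeddings by similarity to the query embedding. A dependency-free sketch (2-dimensional vectors for readability; the real collections are 768-dimensional):

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def match(
    query_vec: List[float],
    collection: Dict[str, List[float]],  # concept ID -> pre-computed embedding
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank ontology concepts by cosine similarity to the query embedding."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in collection.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:top_k]
```

The backend's job is to do exactly this, but with an index structure instead of a linear scan.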
S3 Auto-Download
On first use, ontology data is downloaded automatically from S3:
1. VectorSearchService.match_ontology("glioblastoma", "disease")
2. ChromaDB backend checks for collection "mondo_v2024_01"
3. Collection not found → _ensure_ontology_data() triggered
4. Downloads: lobster-ontology-data.s3.amazonaws.com/v1/mondo_sapbert_768.tar.gz
5. Validates checksum → extracts to ~/.lobster/ontology_cache/
6. Copies to vector_store/ → collection now available
7. Subsequent queries skip download (cache hit)

S3 URLs:
```
https://lobster-ontology-data.s3.amazonaws.com/v1/mondo_sapbert_768.tar.gz
https://lobster-ontology-data.s3.amazonaws.com/v1/uberon_sapbert_768.tar.gz
https://lobster-ontology-data.s3.amazonaws.com/v1/cell_ontology_sapbert_768.tar.gz
```

Cache locations:
- Download cache: ~/.lobster/ontology_cache/
- Vector store: ~/.lobster/vector_store/ (configurable via LOBSTER_VECTOR_STORE_PATH)
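The download-validate-extract flow above can be sketched with the standard library. The function name mirrors `_ensure_ontology_data` from the flow, but the body — checksum algorithm, directory layout, error handling — is an assumption for illustration:

```python
import hashlib
import tarfile
from pathlib import Path
from urllib.request import urlretrieve


def ensure_ontology_data(
    url: str, expected_sha256: str, cache_dir: Path, store_dir: Path
) -> Path:
    """Illustrative download -> validate -> extract flow (not Lobster's actual code)."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    store_dir.mkdir(parents=True, exist_ok=True)
    tarball = cache_dir / url.rsplit("/", 1)[-1]
    if not tarball.exists():
        urlretrieve(url, tarball)           # step 4: download into the cache
    digest = hashlib.sha256(tarball.read_bytes()).hexdigest()
    if digest != expected_sha256:           # step 5: validate checksum
        tarball.unlink()                    # corrupted: delete so a retry re-downloads
        raise ValueError(f"checksum mismatch for {tarball.name}")
    with tarfile.open(tarball) as tf:       # steps 5-6: extract into the vector store
        tf.extractall(store_dir)
    return store_dir
```

Note the corruption path: deleting the bad tarball is exactly why the manual `rm -rf` recovery below works — the next query re-enters the download branch.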
Corruption handling: If a cached tarball is corrupted, delete it and re-run — the system re-downloads automatically:
```
rm -rf ~/.lobster/ontology_cache/mondo_sapbert_768*
rm -rf ~/.lobster/vector_store/mondo_v2024_01/
```

Building Custom Collections
The build script at scripts/build_ontology_embeddings.py generates ChromaDB collections from OBO ontology files:
```
# Build MONDO embeddings (requires ontology OBO file)
python scripts/build_ontology_embeddings.py --ontology mondo --output ./build/
```

This is used internally to produce the S3-hosted tarballs. Users do not need to run this unless building custom ontology collections.
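The input side of such a build is the OBO format. The build script's internals are not shown here; the sketch below is only a tiny, hypothetical parser for the two tags a collection minimally needs (real OBO files carry many more tag types — synonyms, is_a parents, obsolete flags):

```python
from typing import Dict, List


def parse_obo_terms(text: str) -> List[Dict[str, str]]:
    """Extract id and name from [Term] stanzas of an OBO document (sketch only)."""
    terms: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {}
            terms.append(current)
        elif terms and ": " in line:
            key, value = line.split(": ", 1)
            if key in ("id", "name"):
                current[key] = value
    return terms
```

Each parsed term would then be embedded (name plus synonyms) and written into a ChromaDB collection in the format described above.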
Environment Variable Reference
| Variable | Default | Options | Purpose |
|---|---|---|---|
| LOBSTER_VECTOR_BACKEND | chromadb | chromadb, faiss, pgvector | Vector store backend |
| LOBSTER_EMBEDDING_PROVIDER | sapbert | sapbert, minilm, openai | Embedding model |
| LOBSTER_VECTOR_STORE_PATH | ~/.lobster/vector_store/ | Any path | Persistent storage directory |
| LOBSTER_RERANKER | none | cross_encoder, cohere, none | Optional reranking step |
| LOBSTER_VECTOR_CLOUD_URL | (unset) | URL | Cloud ChromaDB endpoint (future) |
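For example, a shell profile selecting the lightweight ephemeral stack (values taken from the table above) might look like:

```shell
# Ephemeral FAISS backend with the small MiniLM embedder, no reranker
export LOBSTER_VECTOR_BACKEND=faiss
export LOBSTER_EMBEDDING_PROVIDER=minilm
export LOBSTER_RERANKER=none
export LOBSTER_VECTOR_STORE_PATH="$HOME/.lobster/vector_store"
```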
Package Ownership
As of v1.0.7, vector search infrastructure lives in the lobster-metadata package. This follows the project rule "services travel with their primary agent package" — the primary consumers are metadata_assistant and annotation_expert.
```
# Canonical import paths (v1.0.7+)
from lobster.services.vector import VectorSearchService
from lobster.services.vector.config import VectorSearchConfig
from lobster.services.vector.backends.chromadb_backend import ChromaDBBackend
from lobster.services.vector.embeddings.base import BaseEmbedder
from lobster.services.vector.rerankers.base import BaseReranker
from lobster.services.vector.ontology_graph import load_ontology_graph, get_neighbors

# Search schemas remain in core (unchanged)
from lobster.core.schemas.search import OntologyMatch, SearchResult, SearchBackend
```

Old import paths (lobster.core.vector.*) still work via deprecation shims that re-export from the new location with a DeprecationWarning. These shims will be removed in v2.0.0. Update your imports to use lobster.services.vector.*.
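The shim mechanism itself is the standard PEP 562 module-level `__getattr__` pattern. The self-contained demo below uses stand-in module names built at runtime — it does not import the real lobster packages, and the shim body is an assumption about the general technique, not Lobster's exact code:

```python
import sys
import types
import warnings

# Stand-in for the new canonical module (think lobster.services.vector).
new_pkg = types.ModuleType("services_vector_demo")
new_pkg.VectorSearchService = type("VectorSearchService", (), {})
sys.modules["services_vector_demo"] = new_pkg

# Stand-in shim for the old location (think lobster.core.vector).
shim = types.ModuleType("core_vector_demo")


def _deprecated_getattr(name):
    # PEP 562: called when an attribute is not found in the module dict.
    warnings.warn(
        f"core_vector_demo.{name} is deprecated; import from services_vector_demo",
        DeprecationWarning,
        stacklevel=2,
    )
    return getattr(sys.modules["services_vector_demo"], name)


shim.__getattr__ = _deprecated_getattr
sys.modules["core_vector_demo"] = shim

# Old-style access still resolves, but emits a DeprecationWarning.
import core_vector_demo

svc_cls = core_vector_demo.VectorSearchService
```

Because the shim re-exports rather than copies, old and new import paths hand back the very same objects, so `isinstance` checks keep working across the migration.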
Source Code Reference
| File | Purpose | Lines |
|---|---|---|
packages/lobster-metadata/lobster/services/vector/service.py | VectorSearchService — main orchestrator | 353 |
packages/lobster-metadata/lobster/services/vector/config.py | VectorSearchConfig — env-var factory | 207 |
packages/lobster-metadata/lobster/services/vector/__init__.py | Lazy exports via __getattr__() | — |
packages/lobster-metadata/lobster/services/vector/backends/chromadb_backend.py | ChromaDB + S3 auto-download | 485 |
packages/lobster-metadata/lobster/services/vector/backends/faiss_backend.py | FAISS in-memory backend | — |
packages/lobster-metadata/lobster/services/vector/backends/pgvector_backend.py | PostgreSQL stub | — |
packages/lobster-metadata/lobster/services/vector/embeddings/sapbert.py | SapBERT embedder | 131 |
packages/lobster-metadata/lobster/services/vector/embeddings/minilm.py | MiniLM embedder | — |
packages/lobster-metadata/lobster/services/vector/embeddings/openai_embedder.py | OpenAI embedder | — |
packages/lobster-metadata/lobster/services/vector/rerankers/cross_encoder_reranker.py | Cross-encoder reranker | — |
packages/lobster-metadata/lobster/services/vector/rerankers/cohere_reranker.py | Cohere API reranker | — |
packages/lobster-metadata/lobster/services/vector/ontology_graph.py | NetworkX graph traversal | 190 |
lobster/core/schemas/search.py | Search response models (stays in core) | — |
lobster/core/schemas/ontology.py | Disease/ontology models (stays in core) | — |
lobster/services/data_access/ensembl_service.py | Ensembl REST API client | 371 |
lobster/services/data_access/uniprot_service.py | UniProt REST API client | 303 |
scripts/build_ontology_embeddings.py | Ontology embedding builder | — |
Related Documentation:
- Semantic Search Guide — User-facing guide
- Architecture Overview — Disease Ontology Service section
- Metadata Agent — Primary consumer of ontology matching
- Optional Dependencies — Installation overview