Vector Search Internals
Deep dive into the vector search architecture: backends, embedders, rerankers, ontology collections, and S3 auto-download
Overview
This page covers the internal architecture of Lobster AI's vector search infrastructure. For usage and installation, see the Semantic Search Guide.
The vector search system is config-driven — a single switching point (VectorSearchConfig.from_env()) creates the backend, embedder, and reranker from environment variables:
```
VectorSearchConfig.from_env()
├── create_backend() → ChromaDB | FAISS | pgvector
├── create_embedder() → SapBERT | MiniLM | OpenAI
└── create_reranker() → CrossEncoder | Cohere | None
        ↓
VectorSearchService
├── query(text, collection, top_k) → SearchResponse
├── query_batch(texts, collection) → List[SearchResponse]
└── match_ontology(text, ontology) → List[OntologyMatch]
```

All components implement base ABCs (BaseVectorBackend, BaseEmbedder, BaseReranker), making it straightforward to add new backends or models.
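The single-switching-point pattern can be sketched with a few lines of standard-library Python. This is an illustrative stand-in, not the actual VectorSearchConfig implementation — field names and the exact fallback behavior are assumptions; only the environment-variable names and defaults come from the reference table later on this page.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class VectorSearchConfigSketch:
    """Illustrative stand-in for VectorSearchConfig (field names are assumptions)."""

    backend: str
    embedder: str
    reranker: Optional[str]

    @classmethod
    def from_env(cls) -> "VectorSearchConfigSketch":
        # Each component is selected by one environment variable,
        # falling back to the documented defaults.
        reranker = os.getenv("LOBSTER_RERANKER", "none")
        return cls(
            backend=os.getenv("LOBSTER_VECTOR_BACKEND", "chromadb"),
            embedder=os.getenv("LOBSTER_EMBEDDING_PROVIDER", "sapbert"),
            reranker=None if reranker == "none" else reranker,
        )
```

Centralizing the switch this way means callers never branch on backend type themselves; they receive a fully assembled service.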
Backend Options
ChromaDB (Default)
The primary backend. Uses ChromaDB's PersistentClient for durable local storage.
| Aspect | Details |
|---|---|
| Type | Persistent local vector store |
| Storage | ~/.lobster/vector_store/ (configurable) |
| Dependencies | chromadb>=1.0.0 |
| Performance | 30-50ms per query |
| Best for | Default use, local installations, development |
ChromaDB stores embeddings in an SQLite-backed persistent directory. Collections survive process restarts.
FAISS (Ephemeral)
In-memory vector search using Facebook's FAISS library. Useful for ephemeral workloads or testing.
| Aspect | Details |
|---|---|
| Type | In-memory (ephemeral) |
| Storage | None (lost on process exit) |
| Dependencies | faiss-cpu or faiss-gpu |
| Performance | Sub-millisecond queries |
| Best for | Testing, benchmarks, ephemeral environments |
FAISS does not persist data between sessions: ontology collections must be reloaded on each startup, which adds latency on first use.
pgvector (Future)
PostgreSQL-based vector storage for cloud deployments. Currently a stub — the interface is defined but not yet implemented.
| Aspect | Details |
|---|---|
| Type | PostgreSQL extension |
| Storage | Remote database |
| Status | Stub (interface only) |
| Best for | Cloud deployments, shared infrastructure |
Embedder Options
SapBERT (Primary)
The default and recommended embedder for biomedical terminology.
| Aspect | Details |
|---|---|
| Model | cambridgeltl/SapBERT-from-PubMedBERT-fulltext |
| Dimensions | 768 |
| Training | 4M+ UMLS synonym pairs |
| Size | ~420 MB (downloaded on first use) |
| Best for | All biomedical ontology matching |
SapBERT is specifically trained on biomedical synonyms, making it the best choice for matching disease names, tissue terms, and cell types against ontology concepts.
MiniLM (Lightweight)
A smaller, general-purpose model for resource-constrained environments.
| Aspect | Details |
|---|---|
| Model | all-MiniLM-L6-v2 |
| Dimensions | 384 |
| Size | ~80 MB |
| Best for | Low-memory environments, quick testing |
Lower biomedical accuracy than SapBERT but faster and smaller.
OpenAI (API-Based)
Uses OpenAI's embedding API for environments where local model hosting is not possible.
| Aspect | Details |
|---|---|
| Model | text-embedding-3-small |
| Dimensions | 1536 |
| Requires | OPENAI_API_KEY environment variable |
| Best for | Environments without GPU/CPU capacity for local models |
Using the OpenAI embedder requires network access and incurs API costs. SapBERT is recommended for most users since it runs locally with no API calls.
Reranker Pipeline
Rerankers provide an optional second-pass scoring step after the initial vector search. By default, no reranker is used (LOBSTER_RERANKER=none).
Cross-Encoder
Uses MS MARCO MiniLM as a cross-encoder to re-score candidate matches:
| Aspect | Details |
|---|---|
| Model | MS MARCO MiniLM |
| Effect | Re-ranks top-k results by pairwise relevance |
| When useful | When initial results need refinement |
Cohere
Uses the Cohere API reranker:
| Aspect | Details |
|---|---|
| Requires | COHERE_API_KEY environment variable |
| Effect | API-based reranking |
None (Default)
No reranking step. The initial vector search results are returned directly. This is sufficient for most ontology matching use cases since the pre-built collections are already optimized.
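The two-pass shape of the pipeline can be sketched independently of any model. Here the pairwise scorer is a toy token-overlap function standing in for a cross-encoder; names and signatures are illustrative, not Lobster's API:

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[Tuple[str, float]],      # (text, first-pass vector score)
    score_pair: Callable[[str, str], float],  # stand-in for a cross-encoder
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Second pass: re-score each (query, candidate) pair and re-sort.

    With no reranker configured, callers simply keep `candidates` as-is.
    """
    rescored = [(text, score_pair(query, text)) for text, _ in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)[:top_k]


def overlap(a: str, b: str) -> float:
    """Toy pairwise scorer: Jaccard token overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)
```

The key property is that the reranker sees the query and each candidate *together*, so it can catch matches the independent first-pass embeddings ranked too low.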
Ontology Collections
Three ontology collections are pre-built and hosted on S3:
| Alias | Canonical Name | Source | Terms | Tarball |
|---|---|---|---|---|
| disease | mondo_v2024_01 | MONDO | ~30K | mondo_sapbert_768.tar.gz |
| tissue | uberon_v2024_01 | UBERON | ~15K | uberon_sapbert_768.tar.gz |
| cell_type | cell_ontology_v2024_01 | Cell Ontology | ~2.5K | cell_ontology_sapbert_768.tar.gz |
Each collection contains:
- Pre-computed SapBERT embeddings (768-dim) for every ontology concept
- Concept metadata (ID, name, synonyms, parent terms)
- ChromaDB-compatible format for direct import
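Conceptually, matching against such a collection is just ranking the pre-computed concept embeddings by similarity to the query embedding. A dependency-free sketch (2-dimensional vectors for readability; the real collections are 768-dimensional):

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def match(
    query_vec: List[float],
    collection: Dict[str, List[float]],  # concept ID -> pre-computed embedding
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank ontology concepts by cosine similarity to the query embedding."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in collection.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:top_k]
```

The backend's job is to do exactly this, but with an index structure instead of a linear scan.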
S3 Auto-Download
On first use, ontology data is downloaded automatically from S3:
1. VectorSearchService.match_ontology("glioblastoma", "disease")
2. ChromaDB backend checks for collection "mondo_v2024_01"
3. Collection not found → _ensure_ontology_data() triggered
4. Downloads: lobster-ontology-data.s3.amazonaws.com/v1/mondo_sapbert_768.tar.gz
5. Validates checksum → extracts to ~/.lobster/ontology_cache/
6. Copies to vector_store/ → collection now available
7. Subsequent queries skip download (cache hit)

S3 URLs:
```
https://lobster-ontology-data.s3.amazonaws.com/v1/mondo_sapbert_768.tar.gz
https://lobster-ontology-data.s3.amazonaws.com/v1/uberon_sapbert_768.tar.gz
https://lobster-ontology-data.s3.amazonaws.com/v1/cell_ontology_sapbert_768.tar.gz
```

Cache locations:
- Download cache: ~/.lobster/ontology_cache/
- Vector store: ~/.lobster/vector_store/ (configurable via LOBSTER_VECTOR_STORE_PATH)
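The download-validate-extract flow above can be sketched with the standard library. The function name mirrors `_ensure_ontology_data` from the flow, but the body — checksum algorithm, directory layout, error handling — is an assumption for illustration:

```python
import hashlib
import tarfile
from pathlib import Path
from urllib.request import urlretrieve


def ensure_ontology_data(
    url: str, expected_sha256: str, cache_dir: Path, store_dir: Path
) -> Path:
    """Illustrative download -> validate -> extract flow (not Lobster's actual code)."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    store_dir.mkdir(parents=True, exist_ok=True)
    tarball = cache_dir / url.rsplit("/", 1)[-1]
    if not tarball.exists():
        urlretrieve(url, tarball)           # step 4: download into the cache
    digest = hashlib.sha256(tarball.read_bytes()).hexdigest()
    if digest != expected_sha256:           # step 5: validate checksum
        tarball.unlink()                    # corrupted: delete so a retry re-downloads
        raise ValueError(f"checksum mismatch for {tarball.name}")
    with tarfile.open(tarball) as tf:       # steps 5-6: extract into the vector store
        tf.extractall(store_dir)
    return store_dir
```

Note the corruption path: deleting the bad tarball is exactly why the manual `rm -rf` recovery below works — the next query re-enters the download branch.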
Corruption handling: If a cached tarball is corrupted, delete it and re-run — the system re-downloads automatically:
```
rm -rf ~/.lobster/ontology_cache/mondo_sapbert_768*
rm -rf ~/.lobster/vector_store/mondo_v2024_01/
```

Building Custom Collections
The build script at scripts/build_ontology_embeddings.py generates ChromaDB collections from OBO ontology files:
```
# Build MONDO embeddings (requires ontology OBO file)
python scripts/build_ontology_embeddings.py --ontology mondo --output ./build/
```

This is used internally to produce the S3-hosted tarballs. Users do not need to run this unless building custom ontology collections.
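The input side of such a build is the OBO format. The build script's internals are not shown here; the sketch below is only a tiny, hypothetical parser for the two tags a collection minimally needs (real OBO files carry many more tag types — synonyms, is_a parents, obsolete flags):

```python
from typing import Dict, List


def parse_obo_terms(text: str) -> List[Dict[str, str]]:
    """Extract id and name from [Term] stanzas of an OBO document (sketch only)."""
    terms: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {}
            terms.append(current)
        elif terms and ": " in line:
            key, value = line.split(": ", 1)
            if key in ("id", "name"):
                current[key] = value
    return terms
```

Each parsed term would then be embedded (name plus synonyms) and written into a ChromaDB collection in the format described above.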
Environment Variable Reference
| Variable | Default | Options | Purpose |
|---|---|---|---|
| LOBSTER_VECTOR_BACKEND | chromadb | chromadb, faiss, pgvector | Vector store backend |
| LOBSTER_EMBEDDING_PROVIDER | sapbert | sapbert, minilm, openai | Embedding model |
| LOBSTER_VECTOR_STORE_PATH | ~/.lobster/vector_store/ | Any path | Persistent storage directory |
| LOBSTER_RERANKER | none | cross_encoder, cohere, none | Optional reranking step |
| LOBSTER_VECTOR_CLOUD_URL | (unset) | URL | Cloud ChromaDB endpoint (future) |
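For example, a shell profile selecting the lightweight ephemeral stack (values taken from the table above) might look like:

```shell
# Ephemeral FAISS backend with the small MiniLM embedder, no reranker
export LOBSTER_VECTOR_BACKEND=faiss
export LOBSTER_EMBEDDING_PROVIDER=minilm
export LOBSTER_RERANKER=none
export LOBSTER_VECTOR_STORE_PATH="$HOME/.lobster/vector_store"
```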
Package Ownership
As of v1.0.7, vector search infrastructure lives in the lobster-metadata package. This follows the project rule "services travel with their primary agent package" — the primary consumers are metadata_assistant and annotation_expert.
```
# Canonical import paths (v1.0.7+)
from lobster.services.vector import VectorSearchService
from lobster.services.vector.config import VectorSearchConfig
from lobster.services.vector.backends.chromadb_backend import ChromaDBBackend
from lobster.services.vector.embeddings.base import BaseEmbedder
from lobster.services.vector.rerankers.base import BaseReranker
from lobster.services.vector.ontology_graph import load_ontology_graph, get_neighbors

# Search schemas remain in core (unchanged)
from lobster.core.schemas.search import OntologyMatch, SearchResult, SearchBackend
```

Old import paths (lobster.core.vector.*) still work via deprecation shims that re-export from the new location with a DeprecationWarning. These shims will be removed in v2.0.0. Update your imports to use lobster.services.vector.*.
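The shim mechanism itself is the standard PEP 562 module-level `__getattr__` pattern. The self-contained demo below uses stand-in module names built at runtime — it does not import the real lobster packages, and the shim body is an assumption about the general technique, not Lobster's exact code:

```python
import sys
import types
import warnings

# Stand-in for the new canonical module (think lobster.services.vector).
new_pkg = types.ModuleType("services_vector_demo")
new_pkg.VectorSearchService = type("VectorSearchService", (), {})
sys.modules["services_vector_demo"] = new_pkg

# Stand-in shim for the old location (think lobster.core.vector).
shim = types.ModuleType("core_vector_demo")


def _deprecated_getattr(name):
    # PEP 562: called when an attribute is not found in the module dict.
    warnings.warn(
        f"core_vector_demo.{name} is deprecated; import from services_vector_demo",
        DeprecationWarning,
        stacklevel=2,
    )
    return getattr(sys.modules["services_vector_demo"], name)


shim.__getattr__ = _deprecated_getattr
sys.modules["core_vector_demo"] = shim

# Old-style access still resolves, but emits a DeprecationWarning.
import core_vector_demo

svc_cls = core_vector_demo.VectorSearchService
```

Because the shim re-exports rather than copies, old and new import paths hand back the very same objects, so `isinstance` checks keep working across the migration.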
Source Code Reference
| File | Purpose | Lines |
|---|---|---|
packages/lobster-metadata/lobster/services/vector/service.py | VectorSearchService — main orchestrator | 353 |
packages/lobster-metadata/lobster/services/vector/config.py | VectorSearchConfig — env-var factory | 207 |
packages/lobster-metadata/lobster/services/vector/__init__.py | Lazy exports via __getattr__() | — |
packages/lobster-metadata/lobster/services/vector/backends/chromadb_backend.py | ChromaDB + S3 auto-download | 485 |
packages/lobster-metadata/lobster/services/vector/backends/faiss_backend.py | FAISS in-memory backend | — |
packages/lobster-metadata/lobster/services/vector/backends/pgvector_backend.py | PostgreSQL stub | — |
packages/lobster-metadata/lobster/services/vector/embeddings/sapbert.py | SapBERT embedder | 131 |
packages/lobster-metadata/lobster/services/vector/embeddings/minilm.py | MiniLM embedder | — |
packages/lobster-metadata/lobster/services/vector/embeddings/openai_embedder.py | OpenAI embedder | — |
packages/lobster-metadata/lobster/services/vector/rerankers/cross_encoder_reranker.py | Cross-encoder reranker | — |
packages/lobster-metadata/lobster/services/vector/rerankers/cohere_reranker.py | Cohere API reranker | — |
packages/lobster-metadata/lobster/services/vector/ontology_graph.py | NetworkX graph traversal | 190 |
lobster/core/schemas/search.py | Search response models (stays in core) | — |
lobster/core/schemas/ontology.py | Disease/ontology models (stays in core) | — |
lobster/services/data_access/ensembl_service.py | Ensembl REST API client | 371 |
lobster/services/data_access/uniprot_service.py | UniProt REST API client | 303 |
scripts/build_ontology_embeddings.py | Ontology embedding builder | — |
Related Documentation:
- Semantic Search Guide — User-facing guide
- Architecture Overview — Disease Ontology Service section
- Metadata Agent — Primary consumer of ontology matching
- Optional Dependencies — Installation overview