
Vector Search Internals

Deep dive into the vector search architecture: backends, embedders, rerankers, ontology collections, and S3 auto-download

Overview

This page covers the internal architecture of Lobster AI's vector search infrastructure. For usage and installation, see the Semantic Search Guide.

The vector search system is config-driven — a single switching point (VectorSearchConfig.from_env()) creates the backend, embedder, and reranker from environment variables:

VectorSearchConfig.from_env()
    ├── create_backend()     → ChromaDB | FAISS | pgvector
    ├── create_embedder()    → SapBERT | MiniLM | OpenAI
    └── create_reranker()    → CrossEncoder | Cohere | None

VectorSearchService
    ├── query(text, collection, top_k)     → SearchResponse
    ├── query_batch(texts, collection)     → List[SearchResponse]
    └── match_ontology(text, ontology)     → List[OntologyMatch]

All components implement abstract base classes (BaseVectorBackend, BaseEmbedder, BaseReranker), making it straightforward to add new backends or models.
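As a rough illustration of the switching point, here is a minimal sketch of an env-driven factory. The environment variables and defaults match the reference table on this page, but the dataclass fields and internals are assumptions — the real VectorSearchConfig lives in lobster.services.vector.config and does more (it also constructs the backend, embedder, and reranker objects):

```python
import os
from dataclasses import dataclass

# Hypothetical minimal sketch of the env-driven config factory.
# The real VectorSearchConfig (lobster.services.vector.config) also
# builds the backend/embedder/reranker instances from these values.
@dataclass
class VectorSearchConfig:
    backend: str
    embedder: str
    reranker: str

    @classmethod
    def from_env(cls) -> "VectorSearchConfig":
        # Defaults mirror the environment-variable reference on this page.
        return cls(
            backend=os.environ.get("LOBSTER_VECTOR_BACKEND", "chromadb"),
            embedder=os.environ.get("LOBSTER_EMBEDDING_PROVIDER", "sapbert"),
            reranker=os.environ.get("LOBSTER_RERANKER", "none"),
        )

# Clear any inherited settings, then override just the backend.
os.environ.pop("LOBSTER_EMBEDDING_PROVIDER", None)
os.environ.pop("LOBSTER_RERANKER", None)
os.environ["LOBSTER_VECTOR_BACKEND"] = "faiss"

cfg = VectorSearchConfig.from_env()
print(cfg.backend, cfg.embedder, cfg.reranker)  # faiss sapbert none
```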

Backend Options

ChromaDB (Default)

The primary backend. Uses ChromaDB's PersistentClient for durable local storage.

Type: Persistent local vector store
Storage: ~/.lobster/vector_store/ (configurable)
Dependencies: chromadb>=1.0.0
Performance: 30-50 ms per query
Best for: Default use, local installations, development

ChromaDB stores embeddings in an SQLite-backed persistent directory. Collections survive process restarts.

FAISS (Ephemeral)

In-memory vector search using Facebook's FAISS library. Useful for ephemeral workloads or testing.

Type: In-memory (ephemeral)
Storage: None (lost on process exit)
Dependencies: faiss-cpu or faiss-gpu
Performance: Sub-millisecond queries
Best for: Testing, benchmarks, ephemeral environments

FAISS does not persist data between sessions. Ontology collections must be re-loaded on each startup, which adds latency on first use.

pgvector (Future)

PostgreSQL-based vector storage for cloud deployments. Currently a stub — the interface is defined but not yet implemented.

Type: PostgreSQL extension
Storage: Remote database
Status: Stub (interface only)
Best for: Cloud deployments, shared infrastructure

Embedder Options

SapBERT (Primary)

The default and recommended embedder for biomedical terminology.

Model: cambridgeltl/SapBERT-from-PubMedBERT-fulltext
Dimensions: 768
Training: 4M+ UMLS synonym pairs
Size: ~420 MB (downloaded on first use)
Best for: All biomedical ontology matching

SapBERT is specifically trained on biomedical synonyms, making it the best choice for matching disease names, tissue terms, and cell types against ontology concepts.
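The matching step itself reduces to nearest-neighbor search over embeddings. The toy sketch below shows the idea with cosine similarity; the 3-dimensional vectors are made up for illustration (real SapBERT vectors are 768-dimensional, and the real lookup goes through the vector backend, not a Python loop):

```python
from math import sqrt

# Toy illustration of embedding-based ontology matching. The vectors
# below are invented 3-dim stand-ins for 768-dim SapBERT embeddings.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Pre-computed concept embeddings (one per ontology concept).
concepts = {
    "glioblastoma": [0.9, 0.1, 0.2],
    "neuroblastoma": [0.2, 0.8, 0.3],
}
query = [0.88, 0.15, 0.25]  # pretend embedding of the free-text query "GBM"

# A synonym like "GBM" lands near "glioblastoma" in embedding space,
# so the nearest concept wins even with zero string overlap.
best = max(concepts, key=lambda c: cosine(query, concepts[c]))
print(best)  # glioblastoma
```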

MiniLM (Lightweight)

A smaller, general-purpose model for resource-constrained environments.

Model: all-MiniLM-L6-v2
Dimensions: 384
Size: ~80 MB
Best for: Low-memory environments, quick testing

Lower biomedical accuracy than SapBERT but faster and smaller.

OpenAI (API-Based)

Uses OpenAI's embedding API for environments where local model hosting is not possible.

Model: text-embedding-3-small
Dimensions: 1536
Requires: OPENAI_API_KEY environment variable
Best for: Environments without GPU/CPU capacity for local models

Using the OpenAI embedder requires network access and incurs API costs. SapBERT is recommended for most users since it runs locally with no API calls.

Reranker Pipeline

Rerankers provide an optional second-pass scoring step after the initial vector search. By default, no reranker is used (LOBSTER_RERANKER=none).

Cross-Encoder

Uses MS MARCO MiniLM as a cross-encoder to re-score candidate matches:

Model: MS MARCO MiniLM
Effect: Re-ranks top-k results by pairwise relevance
When useful: When initial results need refinement

Cohere

Uses the Cohere API reranker:

Requires: COHERE_API_KEY environment variable
Effect: API-based reranking

None (Default)

No reranking step. The initial vector search results are returned directly. This is sufficient for most ontology matching use cases since the pre-built collections are already optimized.
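The retrieve-then-rerank pattern can be sketched in a few lines. Here a stub token-overlap scorer stands in for the real cross-encoder (which scores each (query, candidate) pair jointly with a model such as MS MARCO MiniLM); only the control flow is meant to match:

```python
# Sketch of the two-stage retrieve-then-rerank pipeline. The pair_scorer
# is pluggable: in production it would be a cross-encoder model, here it
# is a trivial token-overlap stub so the example runs anywhere.
def rerank(query, candidates, pair_scorer, top_k=3):
    scored = [(pair_scorer(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]

def overlap_scorer(query, candidate):
    # Jaccard overlap of word sets -- illustrative only, not a real reranker.
    q, c = set(query.lower().split()), set(candidate.lower().split())
    return len(q & c) / len(q | c)

# Candidates as they might come back from the first-pass vector search.
initial_hits = ["lung disease", "adenoma", "lung adenocarcinoma", "small cell lung carcinoma"]
top2 = rerank("lung adenocarcinoma", initial_hits, overlap_scorer, top_k=2)
print(top2)  # ['lung adenocarcinoma', 'lung disease']
```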

Ontology Collections

Three ontology collections are pre-built and hosted on S3:

Alias      Canonical Name            Source          Terms   Tarball
disease    mondo_v2024_01            MONDO           ~30K    mondo_sapbert_768.tar.gz
tissue     uberon_v2024_01           UBERON          ~15K    uberon_sapbert_768.tar.gz
cell_type  cell_ontology_v2024_01    Cell Ontology   ~2.5K   cell_ontology_sapbert_768.tar.gz

Each collection contains:

  • Pre-computed SapBERT embeddings (768-dim) for every ontology concept
  • Concept metadata (ID, name, synonyms, parent terms)
  • ChromaDB-compatible format for direct import
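Conceptually, each stored concept carries the fields listed above. The dataclass below is a hypothetical shape for one record — the real response models live in lobster.core.schemas.search and may use different field names:

```python
from dataclasses import dataclass, field

# Hypothetical per-concept record mirroring the bullet list above.
# The actual schemas are defined in lobster.core.schemas.search.
@dataclass
class ConceptRecord:
    concept_id: str                                # ontology identifier
    name: str                                      # canonical concept name
    synonyms: list = field(default_factory=list)   # alternative labels
    parents: list = field(default_factory=list)    # parent term IDs
    embedding: list = field(default_factory=list)  # 768-dim SapBERT vector

rec = ConceptRecord(
    concept_id="MONDO:...",  # placeholder, not a real ID
    name="glioblastoma",
    synonyms=["GBM", "glioblastoma multiforme"],
)
print(rec.name, len(rec.synonyms))  # glioblastoma 2
```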

S3 Auto-Download

On first use, ontology data is downloaded automatically from S3:

1. VectorSearchService.match_ontology("glioblastoma", "disease")
2. ChromaDB backend checks for collection "mondo_v2024_01"
3. Collection not found → _ensure_ontology_data() triggered
4. Downloads: lobster-ontology-data.s3.amazonaws.com/v1/mondo_sapbert_768.tar.gz
5. Validates checksum → extracts to ~/.lobster/ontology_cache/
6. Copies to vector_store/ → collection now available
7. Subsequent queries skip download (cache hit)
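The steps above can be sketched as a cache-or-download routine. The alias-to-collection mapping and URL layout follow the tables on this page, but the function name and signature are invented, and download/checksum/extract are stubbed so only the control flow remains:

```python
import tempfile
from pathlib import Path

# Hypothetical sketch of steps 2-7: check the store, download on miss.
# The real logic lives in the ChromaDB backend's _ensure_ontology_data().
ONTOLOGIES = {
    "disease": ("mondo_v2024_01", "mondo_sapbert_768.tar.gz"),
    "tissue": ("uberon_v2024_01", "uberon_sapbert_768.tar.gz"),
    "cell_type": ("cell_ontology_v2024_01", "cell_ontology_sapbert_768.tar.gz"),
}
BASE_URL = "https://lobster-ontology-data.s3.amazonaws.com/v1/"

def ensure_ontology_data(alias, store_dir, download):
    """Return (collection_name, downloaded?) for an ontology alias."""
    collection, tarball = ONTOLOGIES[alias]
    target = Path(store_dir) / collection
    if target.exists():                   # cache hit: skip the download
        return collection, False
    download(BASE_URL + tarball, target)  # fetch, verify checksum, extract
    return collection, True

downloaded_urls = []
def fake_download(url, target):           # stub standing in for S3 + tar
    downloaded_urls.append(url)
    Path(target).mkdir(parents=True)

with tempfile.TemporaryDirectory() as store:
    first = ensure_ontology_data("disease", store, fake_download)   # cold start
    second = ensure_ontology_data("disease", store, fake_download)  # cache hit
print(first, second)  # ('mondo_v2024_01', True) ('mondo_v2024_01', False)
```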

S3 URLs:

https://lobster-ontology-data.s3.amazonaws.com/v1/mondo_sapbert_768.tar.gz
https://lobster-ontology-data.s3.amazonaws.com/v1/uberon_sapbert_768.tar.gz
https://lobster-ontology-data.s3.amazonaws.com/v1/cell_ontology_sapbert_768.tar.gz

Cache locations:

  • Download cache: ~/.lobster/ontology_cache/
  • Vector store: ~/.lobster/vector_store/ (configurable via LOBSTER_VECTOR_STORE_PATH)

Corruption handling: If a cached tarball is corrupted, delete it and re-run — the system re-downloads automatically:

rm -rf ~/.lobster/ontology_cache/mondo_sapbert_768*
rm -rf ~/.lobster/vector_store/mondo_v2024_01/

Building Custom Collections

The build script at scripts/build_ontology_embeddings.py generates ChromaDB collections from OBO ontology files:

# Build MONDO embeddings (requires ontology OBO file)
python scripts/build_ontology_embeddings.py --ontology mondo --output ./build/

This is used internally to produce the S3-hosted tarballs. Users do not need to run this unless building custom ontology collections.

Environment Variable Reference

Variable                     Default                     Options                        Purpose
LOBSTER_VECTOR_BACKEND       chromadb                    chromadb, faiss, pgvector      Vector store backend
LOBSTER_EMBEDDING_PROVIDER   sapbert                     sapbert, minilm, openai        Embedding model
LOBSTER_VECTOR_STORE_PATH    ~/.lobster/vector_store/    Any path                       Persistent storage directory
LOBSTER_RERANKER             none                        cross_encoder, cohere, none    Optional reranking step
LOBSTER_VECTOR_CLOUD_URL     (unset)                     URL                            Cloud ChromaDB endpoint (future)

Package Ownership

As of v1.0.7, vector search infrastructure lives in the lobster-metadata package. This follows the project rule "services travel with their primary agent package" — the primary consumers are metadata_assistant and annotation_expert.

# Canonical import paths (v1.0.7+)
from lobster.services.vector import VectorSearchService
from lobster.services.vector.config import VectorSearchConfig
from lobster.services.vector.backends.chromadb_backend import ChromaDBBackend
from lobster.services.vector.embeddings.base import BaseEmbedder
from lobster.services.vector.rerankers.base import BaseReranker
from lobster.services.vector.ontology_graph import load_ontology_graph, get_neighbors

# Search schemas remain in core (unchanged)
from lobster.core.schemas.search import OntologyMatch, SearchResult, SearchBackend

Old import paths (lobster.core.vector.*) still work via deprecation shims that re-export from the new location with DeprecationWarning. These shims will be removed in v2.0.0. Update your imports to use lobster.services.vector.*.
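A shim of this kind is typically built on PEP 562 module-level __getattr__. The sketch below shows the mechanism; in a real shim this function would be named __getattr__ at the top level of the old module, and the stdlib json module stands in for the relocated lobster.services.vector so the example is self-contained:

```python
import importlib
import warnings

# Minimal sketch of a deprecation shim (PEP 562 module __getattr__).
# 'json' is a stand-in target; a real shim would resolve names from
# lobster.services.vector and re-export them under the old path.
_NEW_MODULE = "json"

def shim_getattr(name):
    new_mod = importlib.import_module(_NEW_MODULE)
    try:
        attr = getattr(new_mod, name)
    except AttributeError:
        raise AttributeError(f"module has no attribute {name!r}") from None
    warnings.warn(
        f"importing {name} from the old location is deprecated; "
        f"use {_NEW_MODULE}.{name} instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return attr

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    dumps = shim_getattr("dumps")  # resolves json.dumps and warns once
print(dumps({"ok": True}), len(caught))  # {"ok": true} 1
```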

Source Code Reference

  • packages/lobster-metadata/lobster/services/vector/service.py — VectorSearchService, the main orchestrator (353 lines)
  • packages/lobster-metadata/lobster/services/vector/config.py — VectorSearchConfig, the env-var factory (207 lines)
  • packages/lobster-metadata/lobster/services/vector/__init__.py — lazy exports via __getattr__()
  • packages/lobster-metadata/lobster/services/vector/backends/chromadb_backend.py — ChromaDB backend + S3 auto-download (485 lines)
  • packages/lobster-metadata/lobster/services/vector/backends/faiss_backend.py — FAISS in-memory backend
  • packages/lobster-metadata/lobster/services/vector/backends/pgvector_backend.py — PostgreSQL stub
  • packages/lobster-metadata/lobster/services/vector/embeddings/sapbert.py — SapBERT embedder (131 lines)
  • packages/lobster-metadata/lobster/services/vector/embeddings/minilm.py — MiniLM embedder
  • packages/lobster-metadata/lobster/services/vector/embeddings/openai_embedder.py — OpenAI embedder
  • packages/lobster-metadata/lobster/services/vector/rerankers/cross_encoder_reranker.py — cross-encoder reranker
  • packages/lobster-metadata/lobster/services/vector/rerankers/cohere_reranker.py — Cohere API reranker
  • packages/lobster-metadata/lobster/services/vector/ontology_graph.py — NetworkX graph traversal (190 lines)
  • lobster/core/schemas/search.py — search response models (stays in core)
  • lobster/core/schemas/ontology.py — disease/ontology models (stays in core)
  • lobster/services/data_access/ensembl_service.py — Ensembl REST API client (371 lines)
  • lobster/services/data_access/uniprot_service.py — UniProt REST API client (303 lines)
  • scripts/build_ontology_embeddings.py — ontology embedding builder
