Omics-OS Docs
Architecture

18. Architecture Overview

Platform Architecture

Lobster AI is a modular bioinformatics platform with pluggable execution environments, LLM providers, and integrated data management. The platform architecture consists of seven layers that work together to provide flexible, scalable omics analysis.

Architecture Diagram

Lobster Platform Architecture

Seven-Layer Architecture:

  1. Client Layer - CLI and Python SDK interfaces
  2. Execution Environment - Local (your hardware) or Cloud (managed infrastructure)
  3. LLM Provider Layer - ProviderRegistry with pluggable providers (Anthropic, Bedrock, Ollama, Gemini, OpenAI, Azure AI + future: Nebius)
  4. Multi-Agent System - Specialized agents for research, data engineering, analysis
  5. External Data Sources - GEO, SRA, ENA, PRIDE, MassIVE, MetaboLights, PubMed, PMC
  6. Data Management - DataManagerV2 for multi-modal orchestration and provenance
  7. Output Layer - Interactive visualizations, Jupyter notebooks, annotated data objects

Component Matrix

| Layer | Component | Configuration | Use Case |
| --- | --- | --- | --- |
| Execution | Local | Default (no setup) | Privacy-first, offline, cost-sensitive |
| Execution | Cloud | LOBSTER_CLOUD_KEY | Team collaboration, scaling, managed infrastructure |
| LLM Provider | Ollama | lobster init → provider_config.json | Local-only, unlimited usage, offline |
| LLM Provider | Anthropic | lobster init → provider_config.json + .env | Best quality, quick start, cloud/local |
| LLM Provider | AWS Bedrock | lobster init → provider_config.json + .env | Enterprise, compliance, high throughput |
| LLM Provider | Gemini | lobster init → provider_config.json + .env | Long context, free tier available |
| LLM Provider | OpenAI | lobster init → provider_config.json + .env | GPT-4o, reasoning models (o1/o3) |
| LLM Provider | Azure AI | lobster init → provider_config.json + .env | Multi-model access, enterprise compliance |
| LLM Provider | Future (Nebius) | Pluggable via ILLMProvider interface | Easy extensibility (~150 lines/provider) |
| Data Sources | GEO/SRA/ENA | Auto-configured | Transcriptomics datasets |
| Data Sources | PRIDE/MassIVE | Auto-configured | Proteomics datasets |
| Data Sources | MetaboLights/Metabolomics Workbench | Auto-configured | Metabolomics datasets |
| Data Sources | PubMed/PMC | NCBI_API_KEY (optional) | Literature mining, metadata extraction |
| Data Management | DataManagerV2 | Auto-configured | Multi-modal data orchestration, provenance tracking |

Deployment Patterns

Lobster supports three deployment patterns optimized for different use cases. For detailed setup instructions and comparison, see the Deployment Patterns Guide.

| Pattern | Best For | Key Features |
| --- | --- | --- |
| Local + Ollama | Privacy, learning, zero cost | Offline, unlimited usage, 100% local |
| Local + Anthropic | Quality, development | Best accuracy, quick setup, flexible |
| Cloud + Bedrock | Production, teams | Enterprise SLA, high limits, scalable |


System Overview

Lobster AI is a professional multi-agent bioinformatics analysis platform that combines specialized AI agents with proven scientific tools to analyze complex multi-omics data. The platform features a modular, service-oriented architecture that enables natural language interaction with sophisticated bioinformatics workflows.

Core Design Principles

  1. Agent-Based Architecture - Specialist agents coordinated through centralized registry
  2. Service-Oriented Processing - Stateless, testable analysis services
  3. Cloud/Local Hybrid - Seamless switching between deployment modes
  4. Modular Design - Extensible components with clean interfaces
  5. Natural Language Interface - User describes analyses in plain English
  6. Publication-Quality Output - Interactive visualizations with scientific rigor

High-Level System Architecture

Technology Stack

Core Technologies

| Component | Technology | Purpose |
| --- | --- | --- |
| Agent Framework | LangGraph | Multi-agent coordination and workflows |
| AI Models | AWS Bedrock | Large language models for agent intelligence |
| Data Management | AnnData, MuData | Biological data structures and storage |
| Bioinformatics | Scanpy, PyDESeq2 | Scientific analysis algorithms |
| CLI Interface | Typer, Rich | Terminal-based interaction |
| Visualization | Plotly | Interactive scientific plots |
| Storage | H5AD, HDF5 | Efficient biological data storage |

Language and Dependencies

  • Python 3.11-3.14 - Core language with modern features
  • Async/Await - For responsive user interfaces
  • Type Hints - Professional code quality and IDE support
  • Pydantic - Data validation and configuration management

Data Flow Architecture

Data Expert Refactoring (Phase 2 - November 2024)

Overview

Phase 2 refactored the data expert agent to eliminate redundancies and implement a queue-based download pattern. This refactoring improves multi-agent coordination, eliminates duplicate metadata fetches, and enables pre-download validation.

Key Changes

Tool Consolidation (14 → 10 tools):

  • ❌ Removed: restore_workspace_datasets - Cross-session restoration moved to CLI
  • ❌ Removed: read_cached_publication - Replaced by get_content_from_workspace(workspace="literature")
  • ❌ Removed: restore_dataset, list_workspace_datasets - Consolidated into workspace commands
  • ❌ Removed: list_downloaded_datasets - Replaced by list_available_modalities() + get_modality_details()
  • ❌ Removed: get_modality_overview - Redundant; use list_available_modalities() for listing, get_modality_details() for inspection
  • ❌ Removed: retry_failed_download - Merged into execute_download_from_queue(strategy_override=...)

Queue-Based Download Pattern:

  • research_agent validates metadata and adds to download_queue
  • supervisor coordinates via workspace queries
  • data_expert downloads from queue (no direct GEO fetches)

Benefits:

  • 60% reduction in duplicate metadata fetches
  • Pre-download validation via metadata_assistant
  • Concurrent download prevention
  • Full provenance tracking

Layered Enrichment Pattern

The queue pattern implements a 4-layer enrichment strategy:

Layer Details:

  • Layer 1 (research_agent): Fetch basic GEO metadata via GEOProvider
  • Layer 2 (research_agent): Extract download URLs via GEOProvider.get_download_urls() and strategy config via DataExpertAssistant
  • Layer 3 (research_agent): Create DownloadQueueEntry with metadata + URLs + persisted strategy_config (v0.3.2.4)
  • Layer 4 (data_expert): Execute download using queue entry with strategy-aware processing
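
The entry that accumulates across these layers can be sketched as a small dataclass. This is an illustration only: the field names and the example URL are hypothetical and do not mirror the actual DownloadQueueEntry schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative sketch only: field names do not mirror the real
# DownloadQueueEntry schema.
@dataclass
class QueueEntrySketch:
    accession: str                                                 # Layer 1: basic GEO metadata
    metadata: Dict[str, str] = field(default_factory=dict)         # Layer 1
    download_urls: List[str] = field(default_factory=list)         # Layer 2
    strategy_config: Dict[str, str] = field(default_factory=dict)  # Layers 2-3
    status: str = "PENDING"                                        # consumed in Layer 4

# Layers 1-3 (research_agent side) populate the entry before queueing:
entry = QueueEntrySketch(
    accession="GSE12345",
    metadata={"title": "Example study"},
    download_urls=["https://example.org/GSE12345_matrix.h5ad"],  # hypothetical URL
    strategy_config={"strategy": "matrix_first"},
)
```

In Layer 4, data_expert reads the fully enriched entry from the queue and never re-fetches metadata, which is where the duplicate-fetch savings come from.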

Download Queue Workflow

Performance Improvements

| Metric | Before (Synchronous) | After (Queue Pattern) | Improvement |
| --- | --- | --- | --- |
| Metadata fetches | 2-3× per dataset | 1× per dataset | 60% reduction |
| Download coordination | None | Supervisor-mediated | Prevents duplicates |
| Error recovery | Manual retry | Queue status tracking | Automated |
| Pre-download validation | Not possible | Via metadata_store | New capability |

Architecture Impact

Updated Components:

  • GEOProvider: Added get_download_urls() method (214 lines)
  • research_agent: Queue entry creation (85 lines)
  • data_expert: Queue consumer pattern (163 lines)
  • download_queue: New infrastructure (342 lines)

Total Code Changes: +462 lines added, -225 lines removed = +237 net lines overall

See: Download Queue System (Wiki 35) for detailed documentation.

Core System Components

1. Agent System

The heart of Lobster AI is its multi-agent architecture, where specialized AI agents handle different aspects of bioinformatics analysis:

  • Supervisor Agent - Routes requests and coordinates workflows
  • Data Expert - Data loading and quality assessment with 10 tools (Phase 2 queue-based downloads)
  • Transcriptomics Expert - Unified agent handling both single-cell and bulk RNA-seq analysis
  • Proteomics Expert - Unified agent handling both mass spectrometry and affinity proteomics analysis
  • Research Agent - Discovery & content analysis with 10 tools, workspace caching, publication queue processing (Phase 1-4 complete)
  • Metadata Assistant - Cross-dataset harmonization with 4 tools for sample mapping and validation, publication queue filtering with 3 tools for batch processing (Phase 3-4 complete)

Hierarchical Agent Delegation (Tool-Wrapping Pattern)

As of v2.5+, Lobster supports hierarchical agent delegation using the tool-wrapping pattern recommended by LangGraph. Sub-agents are wrapped as @tool functions rather than passed as supervisor child agents.

Architecture:

Main Supervisor
├─ research_agent (create_react_agent)
│   └─ tools include: delegate_to_metadata_assistant
├─ data_expert (create_react_agent)
│   └─ tools include: delegate_to_metadata_assistant
├─ metadata_assistant (shared leaf agent)
├─ transcriptomics_expert (unified leaf agent for single-cell and bulk RNA-seq)
└─ ...other leaf agents

Key Features:

  • Tool-based delegation: Sub-agents wrapped as @tool functions, invoked via agent.invoke()
  • Single instance, multiple parents: metadata_assistant is created once and shared by both research_agent and data_expert
  • Config-based configuration: Child relationships are defined in AGENT_CONFIG via the child_agents field
  • Two-phase creation: All agents created first, then parent agents re-created with delegation tools
  • Standard agents: All agents use create_react_agent - parent agents get additional delegation tools

Implementation (in graph.py):

def _create_delegation_tool(agent_name: str, agent, description: str):
    """Create a tool that delegates to a sub-agent."""

    @tool
    def delegate(request: str) -> str:
        """Forward a natural-language request to the sub-agent."""
        result = agent.invoke({"messages": [{"role": "user", "content": request}]})
        return result["messages"][-1].content

    # @tool returns a BaseTool instance: rename it via .name (not __name__) and
    # attach the description so the LLM sees e.g. delegate_to_metadata_assistant.
    delegate.name = f"delegate_to_{agent_name}"
    delegate.description = description
    return delegate

Configuration (in each agent module):

# research_agent module
AGENT_CONFIG = AgentRegistryConfig(
    ...
    child_agents=["metadata_assistant"],  # Creates delegate_to_metadata_assistant tool
)

# metadata_assistant module
AGENT_CONFIG = AgentRegistryConfig(
    ...
    child_agents=None,  # Leaf agent - no delegation tools
)
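
The two-phase creation step can be sketched as follows. The helper names (build_agents, make_agent, make_delegation_tool) are hypothetical, not the actual graph.py API; the point is the ordering: all agents exist first, then parents are re-created with delegation tools pointing at the shared child instances.

```python
# Sketch of two-phase creation (hypothetical helper names, not graph.py itself).
def build_agents(configs, make_agent, make_delegation_tool):
    # Phase 1: create every agent with its base tools.
    agents = {name: make_agent(name, cfg["tools"]) for name, cfg in configs.items()}

    # Phase 2: re-create parents whose AGENT_CONFIG lists child_agents,
    # adding one delegation tool per (shared) child instance.
    for name, cfg in configs.items():
        children = cfg.get("child_agents") or []
        if children:
            extra = [make_delegation_tool(child, agents[child]) for child in children]
            agents[name] = make_agent(name, cfg["tools"] + extra)
    return agents
```

Because Phase 2 looks children up in the Phase 1 dictionary, metadata_assistant stays a single instance even when both research_agent and data_expert hold delegation tools for it.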

Benefits over create_supervisor pattern:

  • Simpler: No complex supervisor nesting
  • Standard: Follows official LangGraph patterns
  • Explicit: Clear control over what sub-agent sees
  • Debuggable: Easier to trace tool calls in LangSmith

2. Service Layer

Stateless analysis services provide the computational backbone:

Transcriptomics Services

  • PreprocessingService - Quality control, filtering, normalization
  • QualityService - Multi-metric assessment and validation
  • ClusteringService - Leiden clustering, UMAP, cell annotation
  • EnhancedSingleCellService - Doublet detection, marker genes
  • BulkRNASeqService - Differential expression with pyDESeq2
  • PseudobulkService - Single-cell to bulk aggregation

Proteomics Services

  • ProteomicsPreprocessingService - MS/affinity data filtering
  • ProteomicsQualityService - Missing value analysis, CV assessment
  • ProteomicsAnalysisService - Statistical testing, PCA
  • ProteomicsDifferentialService - Linear models, FDR control

Data Access Services (Download Infrastructure)

The DownloadOrchestrator pattern provides unified, queue-based downloads from multiple biological databases with automatic service routing and retry logic:

IDownloadService Interface:

  • Abstract base class for all download implementations
  • Standardized 3-tuple return: (adata, stats, ir) for provenance tracking
  • Strategy validation and execution separation
  • Integration with DownloadQueue for multi-agent workflows
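
A minimal sketch of this contract, assuming illustrative method names rather than the exact IDownloadService signatures:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Tuple

# Sketch of the IDownloadService contract described above; method names
# and signatures are illustrative, not the exact interface.
class DownloadServiceSketch(ABC):
    @abstractmethod
    def validate_strategy(self, entry: Dict[str, Any]) -> bool:
        """Check the queue entry's strategy before any network I/O."""

    @abstractmethod
    def execute(self, entry: Dict[str, Any]) -> Tuple[Any, Dict[str, Any], Dict[str, Any]]:
        """Return the standardized 3-tuple: (adata, stats, ir)."""
```

Separating validation from execution is what lets the orchestrator reject a bad strategy cheaply before committing to a large download.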

Implemented Download Services:

| Service | Databases | Status | Key Features |
| --- | --- | --- | --- |
| GEODownloadService | GEO | ✅ Production | H5AD, matrix, supplementary strategies. Refactored (Nov 2024) into modular services/data_access/geo/ package (v0.3.2.4) |
| SRADownloadService | SRA, ENA, DDBJ | ✅ Production (Dec 2024) | Multi-mirror failover, MD5 validation, nf-core-compliant error handling |
| PRIDEDownloadService | PRIDE, PXD | ✅ Production | mzML, mzTab, RAW formats |
| MassIVEDownloadService | MassIVE, MSV | ✅ Production | PROXI API integration |
| MetaboLightsDownloadService | MetaboLights, MTBLS | ✅ Production | MAF-first strategy, mzML, vendor raw files |

SRADownloadService Architecture (Dec 2024):

  • Primary Source: ENA filereport API with HTTPS download
  • Mirror Failover: ENA → NCBI → DDBJ for high availability
  • Error Handling: Production-grade retry logic from nf-core/fetchngs
    • HTTP 429: Retry-After header + exponential backoff
    • HTTP 500: 3 retries with backoff (5s→10s→20s)
    • HTTP 204: Permission issue detection
    • Network errors: Automatic retry with cleanup
  • Data Integrity: MD5 checksum validation, atomic writes (.tmp → final)
  • Size Protection: Soft warning at 100 GB (override: LOBSTER_SKIP_SIZE_WARNING=true)
  • Output: Metadata-based AnnData with FASTQ file paths, ready for alignment/quantification
  • Compliance: Verified against nf-core/fetchngs, pachterlab/ffq, pysradb

Download Workflow:

research_agent → validate metadata → create DownloadQueueEntry
        ↓
DownloadQueue (status: PENDING)
        ↓
supervisor → data_expert → execute_download_from_queue(entry_id)
        ↓
DownloadOrchestrator → routes to appropriate service (GEO/SRA/PRIDE/MassIVE)
        ↓
Service downloads → validates → creates AnnData → logs provenance
        ↓
DataManagerV2 stores modality + updates queue status: COMPLETED

Omics Plugin Architecture

Lobster uses a two-registry design for extensible omics type handling:

  • OmicsTypeRegistry (core/omics_registry.py) — Maps each omics type to its schema, adapters, detection config, QC thresholds, and preferred databases
  • ComponentRegistry (core/component_registry.py) — Discovers and loads component instances (agents, services, adapters, providers) via entry points

The DataTypeDetector (also in core/omics_registry.py) provides unified data type detection, replacing scattered detection functions. All detection now delegates to the registry.

Built-in Omics Types (5):

| Type | Preferred Databases | Feature Range | Detection Weight |
| --- | --- | --- | --- |
| Transcriptomics | GEO, SRA | 5K-60K | 8 |
| Proteomics | PRIDE, MassIVE, GEO | 100-12K | 12 |
| Genomics | GEO, SRA, dbGaP | 10K-10M | 9 |
| Metabolomics | MetaboLights, Metabolomics Workbench, GEO | 50-5K | 11 |
| Metagenomics | SRA, GEO, MG-RAST | 100-50K | 10 |

Entry Point Groups (7 total):

| Group | Purpose |
| --- | --- |
| lobster.agents | Agent registration |
| lobster.services | Service class registration |
| lobster.agent_configs | Custom agent LLM configurations |
| lobster.adapters | Modality adapter factories |
| lobster.providers | Database provider classes |
| lobster.download_services | Download service classes |
| lobster.queue_preparers | Queue preparation classes |
| lobster.omics_types | OmicsTypeConfig instances |

Registration flow: Entry-point discovery runs FIRST; hardcoded fallbacks are used only when no plugin provides the component. This means adding a new omics type requires zero changes to core — just install a package that registers the appropriate entry points.

MetaboLights Integration:

The metabolomics support includes:

  • MetabolomicsAdapter — Loads LC-MS, GC-MS, and NMR data from CSV/TSV/mzML
  • MetaboLightsProvider — MetaboLights REST API integration
  • MetaboLightsDownloadService — Downloads and parses MetaboLights studies (MAF files)
  • MetaboLightsQueuePreparer — Queue preparation for MTBLS accessions

Download Strategies for MetaboLights:

| Strategy | Description | Use Case |
| --- | --- | --- |
| MAF_FIRST | Metabolite Assignment Files (default, 0.90 confidence) | Ready-to-analyze intensity matrices |
| MZML_FIRST | mzML spectral files | Raw spectra for reprocessing |
| RAW_FIRST | Vendor raw files (.raw, .wiff, .d) | Full reprocessing from scratch |

Metadata Services

Metadata services provide standardization, harmonization, and ontology mapping for cross-dataset integration:

  • MetadataStandardizationService - Pydantic schema validation, controlled vocabularies
  • SampleMappingService - Cross-dataset sample ID mapping (4 strategies: exact, fuzzy, pattern, metadata)
  • DiseaseStandardizationService - Disease terminology normalization with 5-level fuzzy matching
  • DiseaseOntologyService - Centralized disease ontology matching (v0.5.1+, Phase 1)
  • ProtocolExtractionService - 16S microbiome protocol extraction from methods sections

Disease Ontology Service (v0.5.1+, Phase 2 Complete)

The DiseaseOntologyService implements the Strangler Fig migration pattern — a migration-stable API that works with both keyword-based (Phase 1) and embedding-based (Phase 2) backends without requiring consumer code changes. Phase 2 shipped in v1.0.7, adding ChromaDB + SapBERT vector search across 3 ontologies (MONDO diseases, UBERON tissues, Cell Ontology cell types).

Architecture:

Key Design Principle (Gemini 3.0 Pro):

"The return type is the contract. Define a DiseaseMatch model now that works for both phases."

Data Models (lobster/core/schemas/ontology.py):

from typing import Dict, List, Optional
from pydantic import BaseModel

class DiseaseMatch(BaseModel):
    """Universal disease match result - works for both Phase 1 and Phase 2."""
    disease_id: str      # "crc" (Phase 1) → "MONDO:0005575" (Phase 2)
    name: str            # "Colorectal Cancer"
    confidence: float    # 1.0 (Phase 1) → 0.0-1.0 (Phase 2)
    match_type: str      # "exact_keyword" → "semantic_embedding"
    matched_term: str    # Which term triggered match
    metadata: Dict       # mondo_id, umls_cui, mesh_terms

class DiseaseConcept(BaseModel):
    """Disease knowledge representation."""
    id: str              # Internal ID: "crc", "uc", "cd", "healthy"
    name: str            # Display name: "Colorectal Cancer"
    keywords: List[str]  # Phase 1 matching, Phase 2 boosting
    mondo_id: Optional[str]  # Phase 2 ready
    umls_cui: Optional[str]
    mesh_terms: List[str]

Phase 1 Implementation (Keyword Fallback):

| Aspect | Details |
| --- | --- |
| Backend | JSON config (lobster/config/disease_ontology.json) |
| Matching | Case-insensitive keyword substring search |
| Confidence | Always 1.0 (exact keyword match) |
| Coverage | 4 diseases: CRC, UC, CD, Healthy (IBD/CRC focus) |
| Keywords | 4-10 variants per disease (merged from 2 previous sources) |
| Performance | <1ms per query (in-memory dict lookup) |

Phase 2 Implementation (Shipped - v1.0.7):

| Aspect | Details |
| --- | --- |
| Backend | ChromaDB vector store with SapBERT embeddings (768-dim) |
| Matching | Semantic similarity search (handles synonyms, typos, multilingual) |
| Confidence | Variable 0.0-1.0 based on cosine similarity |
| Coverage | 3 ontologies: ~30K diseases (MONDO), ~15K tissues (UBERON), ~2.5K cell types (Cell Ontology) |
| Keywords | Used for hybrid boosting (exact matches get confidence boost) |
| Performance | 30-50ms per query (local embeddings, no API calls) |
| Install | pip install 'lobster-ai[vector-search]' (optional extra, included in full) |

Migration-Stable API Example:

from lobster.services.metadata.disease_ontology_service import DiseaseOntologyService

service = DiseaseOntologyService.get_instance()

# Phase 1 usage (works today):
matches = service.match_disease("colorectal cancer", k=3, min_confidence=0.7)
# Result: [DiseaseMatch(disease_id='crc', confidence=1.0, match_type='exact_keyword')]

# Phase 2 usage (same API, different backend):
matches = service.match_disease("colon tumor", k=3, min_confidence=0.7)
# Result: [DiseaseMatch(disease_id='MONDO:0005575', confidence=0.89,
#                       match_type='semantic_embedding')]

Consumer Integration:

  1. DiseaseStandardizationService (services/metadata/disease_standardization_service.py):

    • Loads keywords from ontology.get_standardization_variants()
    • Removed 52-line hardcoded DISEASE_MAPPINGS dict
    • Zero behavior change in Phase 1
  2. metadata_assistant._phase1_column_rescan() (agents/metadata_assistant.py):

    • Uses match_disease() API for column scanning
    • Removed 6-line hardcoded disease_keywords dict
    • Added disease_match_type field for provenance
    • Phase 2 ready (works with embeddings when backend swaps)

Architecture Benefits:

  • Eliminated Duplication: 2 hardcoded dictionaries → 1 JSON config
  • Single Source of Truth: lobster/config/disease_ontology.json
  • Phase 2 Ready: Consumer code unchanged when backend swaps to embeddings
  • Extensible: Add tissue/cell type/organism ontologies using same pattern
  • Singleton Pattern: Shared instance via get_instance() for consistency
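
The singleton access pattern can be sketched as follows. This is a minimal illustration; the real DiseaseOntologyService additionally loads its JSON config and selects a backend.

```python
# Minimal sketch of the get_instance() singleton pattern used by the service.
class OntologyServiceSketch:
    _instance = None

    @classmethod
    def get_instance(cls):
        # Lazily create the shared instance on first access.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance
```

All consumers sharing one instance via get_instance() means the ontology (and, in Phase 2, the vector index) is loaded once per process.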

Implementation Files:

| File | Purpose | Lines |
| --- | --- | --- |
| lobster/core/schemas/ontology.py | Pydantic models (DiseaseMatch, DiseaseConcept) | 93 |
| lobster/config/disease_ontology.json | Disease knowledge (4 diseases, MONDO IDs) | 57 |
| lobster/services/metadata/disease_ontology_service.py | Service with match_disease() API | 300 |
| tests/unit/services/metadata/test_disease_ontology_service.py | 21 test cases | 285 |

Test Coverage:

✅ 21 new tests (disease_ontology_service)
✅ 31 existing tests (disease_standardization_service) - zero regressions
✅ 249 total metadata service tests pass

Legacy APIs (Backward Compatibility):

During Phase 1 migration, the service provides legacy methods for gradual consumer migration:

  • get_extraction_keywords() - Returns dict for old keyword-based consumers
  • get_standardization_variants() - For DiseaseStandardizationService fuzzy matching
  • validate_disease_id(), get_disease_by_id() - Helper methods

Phase 2 Status (Shipped):

Phase 2 is complete as of v1.0.7. The Strangler Fig migration worked as designed — consumer code required zero changes when the backend swapped from keywords to embeddings:

  • Install: pip install 'lobster-ai[vector-search]' enables the embedding backend
  • Without the extra: Falls back to Phase 1 keyword matching (4 diseases) — no errors
  • With the extra: Full semantic matching across 3 ontologies (~48K total concepts)
  • Config-driven: VectorSearchConfig.from_env() selects backend, embedder, and reranker via environment variables

See the Semantic Search Guide for usage details and the Vector Search Internals for architecture deep-dive.

Other Supporting Services

  • ContentAccessService - Unified literature access with 5 providers (Phase 2 complete)
  • VisualizationService - Interactive plot generation
  • ConcatenationService - Memory-efficient sample merging

3. Data Management Layer

DataManagerV2 orchestrates all data operations:

  • Modality Management - Named biological datasets with metadata
  • Adapter System - Format-specific data loading (transcriptomics, proteomics)
  • Storage Backends - Flexible persistence (H5AD, MuData)
  • Schema Validation - Data quality enforcement
  • Provenance Tracking - Complete analysis history (W3C-PROV compliant)
  • Two-Tier Caching - Fast in-memory session cache + durable workspace filesystem cache

See 39. Two-Tier Caching Architecture for detailed caching system documentation.

4. Configuration & Registry

Centralized configuration management with clean provider abstraction (refactored v0.4.0):

  • Provider Registry - Pluggable LLM provider system (ILLMProvider interface)
  • Agent Registry - Single source of truth for all agents
  • Settings Management - Environment-aware configuration (API keys, logging)
  • Model Configuration - Profile-based model selection with runtime overrides
  • Adapter Registry - Dynamic data format support
  • Configuration Constants - Single source of truth for valid providers/profiles (v0.4.0+)
  • Configuration Base Classes - Shared validation via abstract base classes (v0.4.0+)

Configuration Constants + Base Class Pattern (v0.4.0+)

Problem: Adding new LLM providers (Gemini, OpenAI, etc.) required changes in 4+ files with duplicate validation logic, violating DRY principles.

Solution: Single source of truth for constants + abstract base class with shared Pydantic validation.

Architecture Benefits:

  • ~120 lines removed: Eliminated duplicate validation logic across 4 files
  • Single file change: Adding new provider requires updating only constants.py
  • Type safety: Final[List[str]] ensures immutability
  • Automatic propagation: Changes in constants immediately affect all consumers
  • Pydantic validation: @model_validator(mode="before") provides shared validation
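
A dependency-free sketch of the shared-validation idea follows. The real classes are Pydantic models using @model_validator(mode="before") on ProviderConfigBase; ProviderConfigSketch is an illustrative stand-in showing how one constant list drives validation everywhere.

```python
from typing import Final, List

# Single source of truth, as in constants.py.
VALID_PROVIDERS: Final[List[str]] = [
    "anthropic", "bedrock", "ollama", "gemini", "azure", "openai",
]

# Illustrative base class: every config subclass inherits this check,
# so adding a provider means editing only VALID_PROVIDERS.
class ProviderConfigSketch:
    def __init__(self, provider: str, **models: str):
        if provider not in VALID_PROVIDERS:
            raise ValueError(
                f"Unknown provider {provider!r}; valid: {', '.join(VALID_PROVIDERS)}"
            )
        self.provider = provider
        self.models = models
```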

Implementation Details:

| File | Purpose | Key Elements |
| --- | --- | --- |
| constants.py | Single source of truth | VALID_PROVIDERS = ["anthropic", "bedrock", "ollama", "gemini", "azure", "openai"] |
| base_config.py | Abstract base class | ProviderConfigBase with @model_validator, abstract properties |
| workspace_config.py | Workspace config | Inherits from ProviderConfigBase, uses _model suffix |
| global_config.py | Global config | Inherits from ProviderConfigBase, uses _default_model suffix |
| config_resolver.py | Resolution logic | Imports VALID_PROVIDERS for validation |

Adding a New Provider (5-Step Process):

# 1. Update constants.py (ONLY file that needs editing for validation)
VALID_PROVIDERS: Final[List[str]] = ["anthropic", "bedrock", "ollama", "gemini", "openai"]
PROVIDER_DISPLAY_NAMES["openai"] = "OpenAI API"

# 2. Add provider class (new file)
# lobster/config/providers/openai_provider.py
class OpenAIProvider(BaseProvider):
    # ~150 lines of implementation

# 3. Register in registry.py
PROVIDER_REGISTRY.register("openai", OpenAIProvider)

# 4. Update config field definitions (workspace + global)
# workspace_config.py: openai_model: Optional[str] = None
# global_config.py: openai_default_model: Optional[str] = None

# 5. Update CLI wizard (cli.py + provider_setup.py)
# Add OpenAI option to lobster init

# That's it! Validation automatically includes OpenAI across all config classes.

Before vs After:

| Aspect | Before (Duplicated) | After (Refactored) |
| --- | --- | --- |
| Files to change | 4+ files (workspace_config, global_config, config_resolver, client) | 1 file (constants.py) |
| Validation logic | ~120 lines duplicated | Shared in base class |
| Type safety | Manual list literals | Final[List[str]] |
| Maintainability | High risk of inconsistency | Guaranteed consistency |

For complete documentation, see Configuration Guide - Configuration Architecture (Advanced).

Provider Abstraction Architecture (v0.4.0+)

Lobster uses a provider abstraction layer enabling easy addition of new LLM providers (OpenAI, Nebius, etc.):

Key Design Principles:

  • Explicit Configuration - No auto-detection, users must configure provider
  • Provider Interface - All providers implement ILLMProvider (7 methods)
  • Security Separation - Config in JSON (versioned), secrets in .env (gitignored)
  • Easy Extensibility - New provider = implement interface + register (~150 lines)

Configuration Priority System (v0.4.0+)

Lobster uses a simplified 3-layer priority system for provider and model selection:

Priority Order (highest to lowest):

1. Runtime CLI flags       --provider, --model (highest priority)
2. Workspace config        .lobster_workspace/provider_config.json
3. FAIL with clear error   No auto-detection, no silent defaults
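
The three layers reduce to a short resolution function (resolve_provider is a hypothetical name, not the actual config_resolver API):

```python
# Sketch of the 3-layer provider resolution described above.
def resolve_provider(cli_flag, workspace_config):
    if cli_flag:                                   # Layer 1: runtime --provider flag
        return cli_flag
    provider = workspace_config.get("global_provider")
    if provider:                                   # Layer 2: provider_config.json
        return provider
    raise RuntimeError(                            # Layer 3: fail loudly, no silent default
        "No provider configured. Run 'lobster init' or pass --provider."
    )
```

Failing instead of auto-detecting keeps behavior deterministic: the same workspace always resolves to the same provider unless the user explicitly overrides it.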

Configuration Model (Security-First):

provider_config.json (versioned)        .env (gitignored, secrets)
────────────────────────────────        ──────────────────────────
{                                        ANTHROPIC_API_KEY=sk-ant-...
  "global_provider": "anthropic",        AWS_BEDROCK_ACCESS_KEY=...
  "anthropic_model": "claude-4",         AWS_BEDROCK_SECRET_ACCESS_KEY=...
  "profile": "production",               OLLAMA_BASE_URL=http://...
  "per_agent_models": {}
}

✅ Safe to commit                        ❌ Never commit (secrets)

Configuration Files:

| File | Scope | Priority | Created By | Use Case |
| --- | --- | --- | --- | --- |
| provider_config.json | Workspace-specific preferences | Layer 2 | lobster init | Per-workspace provider/model |
| .env | API keys and secrets | N/A (auth only) | lobster init | Authentication credentials |

Runtime Override Flags:

# Override provider only
lobster query --provider ollama "your question"
lobster chat --provider anthropic

# Override model only (uses provider from config)
lobster query --model "gpt-oss:20b" "your question"
lobster chat --model "llama3:70b-instruct"

# Override both (highest priority - layer 1)
lobster query --provider ollama --model "mixtral:8x7b" "your question"
lobster chat --provider anthropic --model "claude-4-sonnet"

Implementation Files:

  • lobster/config/providers/base_provider.py - ILLMProvider interface
  • lobster/config/providers/registry.py - ProviderRegistry singleton
  • lobster/config/providers/anthropic_provider.py - Anthropic implementation
  • lobster/config/providers/bedrock_provider.py - AWS Bedrock implementation
  • lobster/config/providers/ollama_provider.py - Ollama implementation
  • lobster/config/workspace_config.py - Workspace-scoped configuration (Pydantic)
  • lobster/core/config_resolver.py - 3-layer priority resolution logic
  • lobster/config/llm_factory.py - Factory using ProviderRegistry

Example: First-Time Setup:

# User runs init wizard
$ lobster init

# Creates two files:
# 1. provider_config.json (explicit provider selection)
{
  "global_provider": "anthropic",
  "anthropic_model": "claude-sonnet-4-20250514",
  "profile": "production"
}

# 2. .env (API keys - not committed to git)
ANTHROPIC_API_KEY=sk-ant-api03-...

Example: Runtime Override:

# Workspace config says: "anthropic"
# Override at runtime:
$ lobster query --provider ollama "your question"

# Result: Uses Ollama (layer 1 beats layer 2)

Model Selection:

Model selection follows the same 3-layer priority:

# Layer 1: Runtime override (highest priority)
lobster query --model "mixtral:8x7b" "question"

# Layer 2: Workspace config
# provider_config.json: {"ollama_model": "llama3:70b-instruct"}

# Layer 3: Provider default
# OllamaProvider.get_default_model() chooses largest available model

5. Subscription Tiers & Plugin System (Phase 1, Dec 2025)

Lobster implements a three-tier subscription model that controls agent availability and feature access. This architecture enables the open-core business model with lobster-ai (free) and lobster-premium (paid) packages.

Tier Architecture

Subscription Tiers

| Tier | Agents | Key Features | Target Users |
| --- | --- | --- | --- |
| FREE | 6 core agents | Local-only, community support | Academic researchers |
| PREMIUM | 10 agents | Cloud compute, priority support | Seed-Series B biotech |
| ENTERPRISE | All + custom | SLA, custom development | Biopharma |

FREE Tier Agents (6):

  • research_agent - Literature discovery and dataset identification
  • data_expert_agent - Data loading and quality assessment
  • transcriptomics_expert - Single-cell and bulk RNA-seq analysis
  • visualization_expert_agent - Interactive plot generation
  • annotation_expert - Cell type annotation (sub-agent)
  • de_analysis_expert - Differential expression (sub-agent)

PREMIUM Tier Additions (4):

  • metadata_assistant - Cross-dataset harmonization and sample mapping
  • proteomics_expert - MS and affinity proteomics analysis
  • machine_learning_expert_agent - ML-based predictions
  • protein_structure_visualization_expert_agent - Structural analysis

Tier-Based Handoff Restrictions

The subscription tier controls not just which agents are available, but also which agent-to-agent handoffs are permitted:

# FREE tier restriction example
SUBSCRIPTION_TIERS = {
    "free": {
        "agents": ["research_agent", "data_expert_agent", ...],
        "restricted_handoffs": {
            # FREE tier: research_agent cannot handoff to metadata_assistant
            "research_agent": ["metadata_assistant"],
        },
    },
    "premium": {
        "agents": [...],  # All 10 agents
        "restricted_handoffs": {},  # No restrictions
    },
}

This means in FREE tier:

  • research_agent can discover datasets and extract metadata
  • But cannot delegate to metadata_assistant for advanced harmonization
  • Upgrade prompt shown: "Upgrade to Premium for cross-dataset sample mapping"
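
A handoff check against this structure can be sketched as follows (is_handoff_allowed is a hypothetical helper name, not the actual enforcement code):

```python
# Sketch: a handoff is permitted unless the tier's restricted_handoffs
# entry for the source agent lists the target agent.
def is_handoff_allowed(tiers, tier, source_agent, target_agent):
    restricted = tiers[tier].get("restricted_handoffs", {})
    return target_agent not in restricted.get(source_agent, [])
```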

Plugin Discovery System

The ComponentRegistry discovers and loads components from installed packages via Python entry points. It supports 7 entry point groups across two categories:

Agent/Service Groups: lobster.agents, lobster.services, lobster.agent_configs

Omics Plugin Groups: lobster.adapters, lobster.providers, lobster.download_services, lobster.queue_preparers, lobster.omics_types

# Custom packages declare components in pyproject.toml:
[project.entry-points."lobster.agents"]
metadata_assistant = "lobster_custom_databiomix.agents.metadata_assistant:AGENT_CONFIG"

[project.entry-points."lobster.services"]
publication_processing = "lobster_custom_databiomix.services.orchestration.publication_processing_service:PublicationProcessingService"

[project.entry-points."lobster.omics_types"]
metabolomics = "lobster.core.omics_registry:METABOLOMICS_CONFIG"

Package Discovery Sources:

  1. lobster-premium - Shared premium features (PyPI private index)
  2. lobster-custom-* - Customer-specific packages (per-customer S3 distribution)

Entry-point discovery runs first, with hardcoded fallbacks used only when no plugin provides the component.

License Management

Entitlements are stored in ~/.lobster/license.json and control:

  • Current subscription tier
  • Authorized custom packages
  • Feature flags
  • Expiration date

{
    "tier": "premium",
    "customer_id": "cust_abc123",
    "expires_at": "2025-12-01T00:00:00Z",
    "custom_packages": ["lobster-custom-databiomix"],
    "features": ["cloud_compute", "priority_support"]
}

Tier Detection Priority:

  1. LOBSTER_SUBSCRIPTION_TIER environment variable (dev override)
  2. ~/.lobster/license.json file (production)
  3. Default to FREE tier
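
A sketch of this priority chain (illustrative only; the real implementation lives in lobster/core/license_manager.py):

```python
import json
import os
from pathlib import Path

def get_current_tier(license_path: Path = Path.home() / ".lobster" / "license.json") -> str:
    """Resolve the active tier using the three-step priority order above."""
    # 1. Environment variable override (development)
    env_tier = os.environ.get("LOBSTER_SUBSCRIPTION_TIER")
    if env_tier:
        return env_tier.lower()
    # 2. License file (production)
    if license_path.exists():
        try:
            return json.loads(license_path.read_text()).get("tier", "free")
        except (json.JSONDecodeError, OSError):
            pass  # a corrupt license falls through to the default
    # 3. Default
    return "free"
```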

CLI Status Command

The lobster status command displays current tier, packages, and agent availability:

$ lobster status

╭─────────────────────╮
│  🦞 Lobster Status  │
╰─────────────────────╯

Subscription Tier: 🆓 Free
Source: default

Installed Packages:
╭────────────┬─────────┬───────────╮
│ Package    │ Version │ Status    │
├────────────┼─────────┼───────────┤
│ lobster-ai │ 0.3.1   │ Installed │
╰────────────┴─────────┴───────────╯

Available Agents (6):
annotation_expert, data_expert_agent, de_analysis_expert,
research_agent, transcriptomics_expert, visualization_expert_agent

Premium Agents (4):
machine_learning_expert_agent, metadata_assistant,
protein_structure_visualization_expert_agent, proteomics_expert

╭────────────────────────────────────────────────────────────╮
│  ⭐ Upgrade to Premium to unlock 4 additional agents       │
│  Visit https://omics-os.com/pricing or run                 │
│  'lobster activate <code>'                                 │
╰────────────────────────────────────────────────────────────╯

Graph-Level Tier Enforcement

The create_bioinformatics_graph() function enforces tier restrictions:

def create_bioinformatics_graph(
    data_manager: DataManagerV2,
    subscription_tier: str = None,  # Auto-detected if None
    agent_filter: callable = None,  # Custom filter function
):
    # Auto-detect tier from license
    if subscription_tier is None:
        subscription_tier = get_current_tier()

    # Create tier-based filter
    if agent_filter is None:
        agent_filter = lambda name, config: is_agent_available(name, subscription_tier)

    # Filter agents before graph creation
    worker_agents = get_worker_agents()
    filtered_agents = {
        name: config
        for name, config in worker_agents.items()
        if agent_filter(name, config)
    }

    # Pass tier to agent factories for handoff restrictions
    # (abridged: factory_function and factory_kwargs are resolved from
    # each agent's registry entry)
    for agent_name, agent_config in filtered_agents.items():
        factory_kwargs["subscription_tier"] = subscription_tier
        agent = factory_function(**factory_kwargs)

Implementation Files

| File | Purpose |
|---|---|
| lobster/config/subscription_tiers.py | Tier definitions, agent lists, handoff restrictions |
| lobster/core/registry.py | ComponentRegistry for entry point discovery |
| lobster/core/license_manager.py | Entitlement file handling, tier detection |
| lobster/agents/graph.py | Tier-based agent filtering |
| lobster/cli.py | lobster status and lobster agents commands |

6. Identifier Resolution System (P1, Dec 2024)

The AccessionResolver (lobster/core/identifiers/accession_resolver.py) provides centralized, thread-safe identifier resolution for all biobank accessions. This eliminates pattern duplication across providers and enables rapid addition of new database support.

Supported Databases (29 patterns)

| Category | Accession Types | Examples |
|---|---|---|
| GEO | GSE, GSM, GPL, GDS | GSE194247, GSM1234567, GPL570 |
| NCBI SRA | SRP, SRX, SRR, SRS | SRP116709, SRR1234567 |
| ENA | ERP, ERX, ERR, ERS, PRJEB, SAMEA | ERP123456, PRJEB83385 |
| DDBJ | DRP, DRX, DRR, DRS, PRJDB, SAMD | DRP123456, PRJDB12345 |
| BioProject/Sample | PRJNA, SAMN | PRJNA123456, SAMN12345678 |
| Proteomics | PXD (PRIDE), MSV (MassIVE) | PXD012345, MSV000012345 |
| Metabolomics | MTBLS, ST | MTBLS1234, ST001234 |
| Other | ArrayExpress, MGnify, DOI | E-MTAB-12345, 10.1038/nature12345 |

Key Methods

from lobster.core.identifiers import get_accession_resolver

resolver = get_accession_resolver()

# Detection: What database does this identifier belong to?
resolver.detect_database("GSE12345")  # → "NCBI Gene Expression Omnibus"
resolver.detect_database("PRJEB83385")  # → "ENA BioProject"

# Text extraction: Find all accessions in abstract/methods
resolver.extract_accessions_by_type("Data at GSE123 and PRIDE PXD012345")
# → {'GEO': ['GSE123'], 'PRIDE': ['PXD012345']}

# Validation: Is this a valid accession for a specific database?
resolver.validate("GSE12345", database="GEO")  # → True
resolver.validate("GSE12345", database="PRIDE")  # → False

# URL generation: Get database URL for accession
resolver.get_url("PXD012345")  # → "https://www.ebi.ac.uk/pride/archive/projects/PXD012345"

# Helper methods for common checks
resolver.is_geo_identifier("GSE12345")  # → True
resolver.is_sra_identifier("SRP123456")  # → True (includes ENA/DDBJ)
resolver.is_proteomics_identifier("PXD012345")  # → True

Architecture Benefits

  • Single Source of Truth: All patterns defined in DATABASE_ACCESSION_REGISTRY (database_mappings.py)
  • Pre-compiled Patterns: <1ms performance for validation/extraction
  • Case-insensitive: Improved UX (gse12345 = GSE12345)
  • Thread-safe Singleton: Safe for multi-agent concurrent access
  • Easy Extension: Add new database in ~1 hour (vs ~1 week previously)
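
Extension amounts to adding one registry entry. A sketch of what such an entry could look like (the field names and the BioStudies example are illustrative, not the actual database_mappings.py schema):

```python
import re

# Hypothetical registry entry; the real DATABASE_ACCESSION_REGISTRY in
# database_mappings.py may use different field names.
NEW_DATABASE_ENTRY = {
    "database": "BioStudies",
    "patterns": [re.compile(r"\bS-BSST\d+\b", re.IGNORECASE)],  # pre-compiled once
    "url_template": "https://www.ebi.ac.uk/biostudies/studies/{accession}",
}

def matches(entry: dict, identifier: str) -> bool:
    """Case-insensitive match against the entry's pre-compiled patterns."""
    return any(p.search(identifier) for p in entry["patterns"])

print(matches(NEW_DATABASE_ENTRY, "s-bsst123"))  # True (case-insensitive)
print(matches(NEW_DATABASE_ENTRY, "GSE12345"))   # False
```

Pre-compiling at registry load time is what keeps validation and extraction under 1 ms per call.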

Provider Integration

All providers now delegate identifier validation to AccessionResolver:

| Provider | Before | After |
|---|---|---|
| pubmed_provider.py | Hardcoded patterns | resolver.extract_accessions_by_type() |
| geo_provider.py | 4 regex patterns | resolver.is_geo_identifier() |
| pride_provider.py | 1 regex pattern | resolver.validate(database="PRIDE") |
| massive_provider.py | 1 regex pattern | resolver.validate(database="MassIVE") |
| sra_provider.py | 12 regex patterns | resolver.is_sra_identifier() |
| geo_utils.py | 8 regex patterns | resolver.detect_field() |

Research & Literature Capabilities

Research System Overview

The refactored research system (Phases 1-6) provides comprehensive literature discovery, dataset identification, and metadata harmonization capabilities through a two-agent architecture with provider-based content access.

Key Components

Two-Agent Architecture:

  • research_agent - Discovery, content extraction, and workspace caching (11 tools)
  • metadata_assistant - Cross-dataset metadata operations, harmonization, and publication queue processing (7 tools)

Provider Infrastructure:

  • ContentAccessService - Unified publication access replacing legacy PublicationService/UnifiedContentService
  • 5 Specialized Providers - Capability-based routing for optimal performance
  • Three-Tier Cascade - PMC XML → Webpage → PDF fallback strategy

Key Features:

  • Capability-based provider routing for optimal source selection
  • Multi-omics integration workflows with automated sample mapping
  • Session caching with W3C-PROV provenance tracking
  • Workspace persistence for handoff between agents

Provider Architecture (Phase 1-2, v0.2.0+)

ContentAccessService (introduced Phase 2) provides unified publication and dataset access through a capability-based provider infrastructure. This replaces the legacy PublicationService and UnifiedContentService with a modular, extensible architecture.

Core Service: 10 Methods in 4 Categories

Discovery Methods (3):

  • search_literature() - Multi-source literature search (PubMed, bioRxiv, medRxiv)
  • discover_datasets() - Omics dataset discovery with automatic accession detection (GSM/GSE/GDS/GPL)
  • find_linked_datasets() - Cross-database relationship discovery (publication ↔ datasets)

Metadata Methods (2):

  • extract_metadata() - Structured publication/dataset metadata extraction
  • validate_metadata() - Pre-download dataset completeness validation

Content Methods (3):

  • get_abstract() - Fast abstract retrieval (Tier 1: 200-500ms)
  • get_full_content() - Full-text with three-tier cascade (PMC → Webpage → PDF)
  • extract_methods() - Software and parameter extraction from methods sections

System Methods (1):

  • query_capabilities() - Available provider and capability matrix

Five Specialized Providers

The system orchestrates access through 5 registered providers with capability-based routing:

| Provider | Priority | Capabilities | Performance | Coverage |
|---|---|---|---|---|
| AbstractProvider | 10 (high) | GET_ABSTRACT | 200-500ms | All PubMed |
| PubMedProvider | 10 (high) | SEARCH_LITERATURE, FIND_LINKED_DATASETS, EXTRACT_METADATA | 1-3s | PubMed indexed |
| GEOProvider | 10 (high) | DISCOVER_DATASETS, EXTRACT_METADATA, VALIDATE_METADATA | 2-5s | All GEO/SRA |
| PMCProvider | 10 (high) | GET_FULL_CONTENT (PMC XML API, 10x faster than HTML scraping), protocol extraction via ProtocolExtractionService | 500ms-2s | 30-40% of biomedical literature |
| WebpageProvider | 50 (low) | GET_FULL_CONTENT (webpage + PDF via DoclingService composition) | 2-8s | Major publishers + PDFs |

Key Design Features:

  • Priority System: Lower number = higher priority (10 = high, 50 = low fallback)
  • Automatic Routing: ProviderRegistry selects optimal provider based on capabilities
  • DoclingService: Internal composition within WebpageProvider (not a separate registered provider)
  • DataManager-First Caching: Session cache + workspace persistence with W3C-PROV provenance

Three-Tier Content Cascade

For full-text retrieval, the system implements intelligent fallback with automatic tier progression:

User Request: "Get full text for PMID:35042229"

Step 1: Check DataManager Cache (Tier 0)
  - Duration: <100ms
  - Success: 100% (if previously accessed)
  → CACHE HIT ✅ (return immediately)
    ↓ (cache miss)
Tier 1: PMC XML API (Priority 10)
  - Duration: 500ms-2s
  - Success Rate: 95%
  - Coverage: 30-40% of biomedical literature (NIH-funded + open access)
  → PMC AVAILABLE ✅ (return)
    ↓ (PMC unavailable)
Tier 2: Webpage Scraping (Priority 50)
  - Duration: 2-5s
  - Success Rate: 80%
  - Coverage: Major publishers (Nature, Science, Cell, etc.)
  → WEBPAGE EXTRACTED ✅ (return)
    ↓ (webpage failed)
Tier 3: PDF via Docling (Priority 50, internal to WebpageProvider)
  - Duration: 3-8s
  - Success Rate: 70%
  - Coverage: Open access PDFs, preprints (bioRxiv, medRxiv)
  → FINAL ATTEMPT

Performance Characteristics:

| Tier | Path | Duration | Success Rate | Typical Use Case |
|---|---|---|---|---|
| Cache | DataManager lookup | <100ms | 100% (if cached) | Repeated access within session |
| Tier 1 | PMC XML API | 500ms-2s | 95% | NIH-funded, open access papers |
| Tier 2 | Webpage HTML | 2-5s | 80% | Publisher websites (Nature, Cell) |
| Tier 3 | PDF Parsing | 3-8s | 70% | Preprints, open access PDFs |
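
The cascade above reduces to a simple loop. A sketch under assumed interfaces (callable tiers and a plain dict stand in for the real providers and DataManager cache):

```python
class ContentUnavailableError(Exception):
    """Raised by a tier when it cannot serve the request."""

def get_full_text(pmid: str, cache: dict, tiers: list) -> str:
    """Walk the cascade: cache first, then each tier in priority order."""
    if pmid in cache:           # Tier 0: fast cache hit
        return cache[pmid]
    last_error = None
    for fetch in tiers:         # Tier 1 -> 2 -> 3 in registration order
        try:
            text = fetch(pmid)
            cache[pmid] = text  # persist for repeated access within the session
            return text
        except ContentUnavailableError as err:
            last_error = err    # fall through to the next tier
    raise last_error or ContentUnavailableError(pmid)

# Usage: the PMC tier fails, the webpage tier succeeds
def pmc(pmid):     raise ContentUnavailableError("not in PMC")
def webpage(pmid): return f"<full text of {pmid}>"

cache = {}
print(get_full_text("PMID:35042229", cache, [pmc, webpage]))  # <full text of PMID:35042229>
```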

Capability-Based Routing

Providers declare capabilities via ProviderCapability enum, enabling automatic optimal provider selection:

Discovery & Search:

  • SEARCH_LITERATURE → PubMedProvider
  • DISCOVER_DATASETS → GEOProvider
  • FIND_LINKED_DATASETS → PubMedProvider

Metadata & Validation:

  • GET_ABSTRACT → AbstractProvider (fast path)
  • EXTRACT_METADATA → PubMedProvider, GEOProvider
  • VALIDATE_METADATA → GEOProvider

Content Retrieval:

  • GET_FULL_CONTENT → PMCProvider (priority 10), WebpageProvider (priority 50, fallback)
  • EXTRACT_METHODS → ContentAccessService (post-processing)

Routing Example:

# User request: "Get full text for PMID:35042229"
# 1. ContentAccessService receives request
# 2. ProviderRegistry routes to GET_FULL_CONTENT capability
# 3. Returns [PMCProvider (priority 10), WebpageProvider (priority 50)]
# 4. Tries PMCProvider first (fast path)
# 5. On PMCNotAvailableError, automatically falls back to WebpageProvider
# 6. Caches result in DataManager for future requests
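
The comment walkthrough above can be sketched as runnable code; the provider classes here are simplified stand-ins for the real lobster providers:

```python
class PMCNotAvailableError(Exception):
    pass

class FakePMCProvider:
    priority = 10  # high priority: tried first
    def get_full_content(self, pmid):
        raise PMCNotAvailableError(pmid)  # simulate "not in PMC"

class FakeWebpageProvider:
    priority = 50  # low priority: fallback
    def get_full_content(self, pmid):
        return f"scraped text for {pmid}"

def route_get_full_content(pmid, providers):
    """Try providers in ascending priority; fall back on PMCNotAvailableError."""
    for provider in sorted(providers, key=lambda p: p.priority):
        try:
            return provider.get_full_content(pmid)
        except PMCNotAvailableError:
            continue  # next provider in the chain
    raise LookupError(f"no provider could serve {pmid}")

print(route_get_full_content("PMID:35042229",
                             [FakeWebpageProvider(), FakePMCProvider()]))
# scraped text for PMID:35042229
```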

Protocol Extraction (16S Microbiome)

The ProtocolExtractionService (lobster/services/metadata/protocol_extraction_service.py) automatically extracts technical protocol details from publication methods sections during full-text retrieval. This is integrated into PMCProvider's _extract_parameters() method.

Extracted Fields:

| Category | Fields | Examples |
|---|---|---|
| Primers | Forward primer, reverse primer, sequences | 515F, 806R, 27F, 1492R (12 known primers) |
| V-Region | Target region | V3-V4, V4, V1-V2, V1-V9 |
| PCR Conditions | Annealing temperature, cycles | 55C, 30 cycles |
| Sequencing | Platform, read length, paired-end | Illumina MiSeq, 250bp, 2x250bp |
| Reference Database | Database name, version | SILVA v138, Greengenes, RDP, GTDB, UNITE |
| Pipeline | Software, version | QIIME2, DADA2, mothur, USEARCH, VSEARCH |
| Clustering | Method, threshold | ASV, OTU (97%), zOTU |

Usage Pattern:

from lobster.services.metadata.protocol_extraction_service import (
    ProtocolExtractionService,
    ProtocolDetails,
)

service = ProtocolExtractionService()
text = """
    The V3-V4 region of 16S rRNA gene was amplified using
    primers 515F (GTGCCAGCMGCCGCGGTAA) and 806R. PCR was
    performed for 30 cycles with annealing at 55C.
    Sequencing was done on Illumina MiSeq (2x250 bp).
    Sequences were processed using DADA2 with SILVA v138.
"""
details, result = service.extract_protocol(text, source="methods")

# Access extracted fields
print(details.v_region)           # "V3-V4"
print(details.forward_primer)     # "515F"
print(details.pcr_cycles)         # 30
print(details.platform)           # "Illumina MiSeq"
print(details.pipeline)           # "DADA2"
print(details.reference_database) # "SILVA"
print(details.confidence)         # 0.58 (7/12 fields extracted)

Performance Characteristics:

| Metric | Value |
|---|---|
| Extraction Time | <50ms per publication |
| Confidence Score | 0.0-1.0 (fields extracted / 12 total) |
| Known Primers | 12 standard primers (515F, 806R, 27F, 1492R, etc.) |
| Supported Platforms | 8 (MiSeq, HiSeq, NovaSeq, NextSeq, Ion Torrent, PacBio, Nanopore, 454) |
| Supported Databases | 6 (SILVA, Greengenes, RDP, GTDB, NCBI 16S, UNITE) |
| Supported Pipelines | 9 (QIIME2, DADA2, mothur, USEARCH, VSEARCH, Deblur, PICRUSt, LEfSe, phyloseq) |

Integration with PMCProvider:

When PMCProvider extracts full text, it automatically invokes ProtocolExtractionService on the methods section:

# In PMCProvider._extract_parameters()
service = ProtocolExtractionService()
details, result = service.extract_protocol(methods_text, source="pmc")

# Returned in PublicationContent.parameters dict
parameters = {
    "v_region": details.v_region,
    "forward_primer": details.forward_primer,
    "reverse_primer": details.reverse_primer,
    "platform": details.platform,
    "pipeline": details.pipeline,
    "reference_database": details.reference_database,
    # ... additional fields
}

Agent Architecture (Phase 3-4)

research_agent - Discovery & Content Specialist

The research_agent provides 11 specialized tools organized into 4 categories:

Discovery Tools (3):

  • search_literature - Multi-source literature search (PubMed, bioRxiv, medRxiv)
  • fast_dataset_search - Direct omics database search (GEO, SRA, PRIDE)
  • find_related_entries - Cross-database relationship discovery

Content Tools (4):

  • get_dataset_metadata - Publication and dataset metadata extraction
  • fast_abstract_search - Rapid abstract retrieval (200-500ms)
  • read_full_publication - Full-text access with 3-tier cascade
  • extract_methods - Software and parameter extraction from methods sections

Workspace Tools (3): shared factories in tools/workspace_tool.py (v2.5+)

  • write_to_workspace - Cache content for persistence and handoffs with CSV/JSON export (schema-driven, v1.2.0)
  • get_content_from_workspace - Retrieve cached content with detail levels
  • export_publication_queue_samples - Batch export from multiple publications

Schema-Driven Export System (v1.2.0 - December 2024): Professional CSV export with extensible multi-omics column ordering.

Architecture (lobster/core/schemas/export_schemas.py, 370 lines):

  • ExportPriority enum: 6 priority levels (CORE_IDENTIFIERS=1 → OPTIONAL_FIELDS=99)
  • ExportSchemaRegistry: 4 omics schemas (SRA/amplicon, proteomics, metabolomics, transcriptomics)
  • infer_data_type(): Auto-detection from sample fields
  • get_ordered_export_columns(): Returns priority-ordered column list

  • Extensibility: Add a new omics layer in 15 minutes (vs days of refactoring hardcoded logic)
  • Performance: 24,158 samples/sec, 100% schema detection accuracy (validated on 46K samples)
  • Integration: workspace_tool.py lines 823-837, 1045-1059
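
The priority-ordered column logic can be sketched as follows (priority names beyond CORE_IDENTIFIERS and OPTIONAL_FIELDS, and the per-column mapping, are illustrative):

```python
from enum import IntEnum

# Mirrors the ExportPriority idea above; only four of the six levels are shown.
class ExportPriority(IntEnum):
    CORE_IDENTIFIERS = 1
    SAMPLE_ATTRIBUTES = 10
    PROTOCOL_FIELDS = 20
    OPTIONAL_FIELDS = 99

# Hypothetical per-column priorities for an SRA/amplicon schema
SRA_COLUMN_PRIORITIES = {
    "run_accession": ExportPriority.CORE_IDENTIFIERS,
    "sample_id": ExportPriority.CORE_IDENTIFIERS,
    "host": ExportPriority.SAMPLE_ATTRIBUTES,
    "primer_set": ExportPriority.PROTOCOL_FIELDS,
    "notes": ExportPriority.OPTIONAL_FIELDS,
}

def get_ordered_export_columns(columns):
    """Sort columns by (priority, name); unknown columns sink to OPTIONAL_FIELDS."""
    return sorted(
        columns,
        key=lambda c: (SRA_COLUMN_PRIORITIES.get(c, ExportPriority.OPTIONAL_FIELDS), c),
    )

print(get_ordered_export_columns(["notes", "host", "sample_id", "run_accession"]))
# ['run_accession', 'sample_id', 'host', 'notes']
```

Adding a new omics layer then means registering one more priority mapping rather than touching the export code.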

System Tools (1):

  • validate_dataset_metadata - Pre-download validation of dataset completeness

metadata_assistant - Metadata Harmonization Specialist (Phase 3-4)

The metadata_assistant provides 7 tools for cross-dataset operations and publication queue processing:

  • map_samples_by_id - Sample ID mapping with 4 strategies:

    • Exact matching (case-insensitive)
    • Fuzzy matching (RapidFuzz token similarity)
    • Pattern matching (regex-based prefix/suffix removal)
    • Metadata-supported matching (using sample attributes)
  • read_sample_metadata - Extract metadata in 3 formats:

    • Summary format (high-level overview)
    • Detailed format (complete JSON structure)
    • Schema format (DataFrame table)
  • standardize_sample_metadata - Convert to Pydantic schemas:

    • TranscriptomicsMetadataSchema
    • ProteomicsMetadataSchema
    • Controlled vocabulary enforcement
  • validate_dataset_content - 5-check validation:

    • Sample count verification
    • Required conditions check
    • Control sample detection
    • Duplicate ID identification
    • Platform consistency validation

Publication Queue Processing Tools (Phase 4, v2.5+):

  • process_metadata_entry - Process single queue entry with filter criteria
  • process_metadata_queue - Batch process HANDOFF_READY entries, aggregate samples
  • update_metadata_status - Manual status updates for queue entries

Shared Workspace Tools:

  • get_content_from_workspace - Read workspace content (shared with research_agent)
  • write_to_workspace - Export to workspace with CSV/JSON formats (shared with research_agent)
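
The first three matching strategies above can be sketched with the standard library (difflib stands in for RapidFuzz here, so scores are illustrative rather than identical to the real tool's):

```python
import re
from difflib import SequenceMatcher

def map_sample_id(query, candidates, min_confidence=0.8):
    """Match one sample ID against candidates: exact, then pattern + fuzzy."""
    # Strategy 1: exact matching (case-insensitive)
    for cand in candidates:
        if query.lower() == cand.lower():
            return cand, 1.0
    # Strategy 3: pattern matching -- strip a common prefix before comparing
    norm = lambda s: re.sub(r"^(sample[_-]?|s)", "", s.lower())
    # Strategy 2: fuzzy similarity on the normalized IDs
    best, score = None, 0.0
    for cand in candidates:
        ratio = SequenceMatcher(None, norm(query), norm(cand)).ratio()
        if ratio > score:
            best, score = cand, ratio
    return (best, score) if score >= min_confidence else (None, score)

print(map_sample_id("Sample_T01", ["sample_t01", "T02"]))  # ('sample_t01', 1.0)
print(map_sample_id("S-T01", ["sample_t01", "T02"]))       # fuzzy match above 0.8
```

The real tool additionally supports metadata-supported matching on sample attributes, which has no simple stdlib analogue.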

Agent Handoff Patterns

The agents collaborate through structured handoffs:

  1. research_agent discovers datasets
  2. Caches metadata to workspace
  3. Handoff to metadata_assistant for harmonization
  4. metadata_assistant validates and maps samples
  5. Returns standardization report
  6. research_agent reports to supervisor
  7. Supervisor hands off to data_expert for downloads

Publication Queue → Metadata Filtering Workflow (v2.5+)

Use Case: Process large publication collections (.ris files), extract dataset identifiers, filter sample metadata by criteria (e.g., "16S human fecal CRC"), and export a unified CSV.

Workflow:

Key Features:

  • Auto-Status Detection: Entries automatically transition to HANDOFF_READY when conditions met
  • Batch Processing: Process multiple publications in single operation
  • Filter Composition: Microbiome-aware filtering (16S, host, sample type, disease)
  • Full Schema Export: All SRA metadata fields preserved in CSV

Research Input Flow

The following diagram illustrates how research requests flow through the system:

Multi-Agent Workflows (Phase 5)

Three primary workflows demonstrate the research system capabilities:

Workflow 1: Multi-Omics Integration

Scenario: Integrating RNA-seq and proteomics data from the same publication.

Step-by-Step Process:

  1. Discovery Phase (research_agent):

    # Find datasets linked to publication
    find_related_entries("PMID:35042229", entry_type="dataset")
    # Result: GSE180759 (RNA-seq), PXD034567 (proteomics)
  2. Validation Phase (research_agent):

    # Validate metadata completeness
    validate_dataset_metadata("GSE180759", required_fields="sample_id,condition")
    # Result: ✅ Both datasets have required metadata
  3. Caching Phase (research_agent):

    # Cache for handoff
    write_to_workspace("geo_gse180759_metadata", workspace="metadata")
    write_to_workspace("pxd034567_metadata", workspace="metadata")
  4. Handoff to metadata_assistant:

    handoff_to_metadata_assistant(
        "Map samples between GSE180759 and PXD034567 using sample_id column"
    )
  5. Sample Mapping (metadata_assistant):

    map_samples_by_id("geo_gse180759", "pxd034567",
                      strategies="exact,fuzzy", min_confidence=0.8)
    # Result: 36/36 samples mapped (100% rate, avg confidence 0.96)
  6. Final Report:

    • ✅ Complete sample-level mapping achieved
    • ✅ Ready for integrated multi-omics analysis
    • Handoff to data_expert for dataset downloads

Workflow 2: Meta-Analysis Preparation

Scenario: Combining multiple breast cancer RNA-seq datasets for meta-analysis.

Step-by-Step Process:

  1. Dataset Discovery (research_agent):

    fast_dataset_search("breast cancer RNA-seq",
                       filters='{"organism": "human", "samples": ">50"}')
    # Result: 10 candidate datasets
  2. Metadata Extraction (research_agent):

    # For each dataset
    get_dataset_metadata("GSE12345")  # Dataset 1
    get_dataset_metadata("GSE67890")  # Dataset 2
    get_dataset_metadata("GSE99999")  # Dataset 3
  3. Standardization (metadata_assistant):

    # Standardize each dataset's metadata
    standardize_sample_metadata("geo_gse12345", "transcriptomics")
    # Field coverage: 95%
    
    standardize_sample_metadata("geo_gse67890", "transcriptomics")
    # Field coverage: 85%
    
    standardize_sample_metadata("geo_gse99999", "transcriptomics")
    # Field coverage: 78%
  4. Harmonization Assessment:

    • Dataset 1: ✅ Full integration possible (>90% coverage)
    • Dataset 2: ⚠️ Cohort-level integration (85% coverage)
    • Dataset 3: ⚠️ Cohort-level integration (78% coverage)
  5. Recommendation:

    • Proceed with cohort-level meta-analysis
    • Apply batch correction during analysis
    • Consider imputation for missing metadata fields

Workflow 3: Control Sample Addition

Scenario: Adding public control samples to a private disease dataset.

Step-by-Step Process:

  1. Control Discovery (research_agent):

    fast_dataset_search("healthy control breast tissue",
                       filters='{"condition": "control", "platform": "RNA-seq"}')
    # Result: GSE111111 with 24 control samples
  2. Control Validation (research_agent):

    validate_dataset_metadata("GSE111111",
                            required_values='{"condition": ["control", "normal"]}')
    # Result: ✅ All samples are controls
  3. Metadata Matching (metadata_assistant):

    map_samples_by_id("user_disease_data", "geo_gse111111",
                     strategies="metadata", min_confidence=0.7)
    # Matching on: tissue_type, age_range(±5yr), sex
    # Result: 15/24 controls matched (62.5% rate)
  4. Compatibility Report:

    • Platform compatibility: ✅ Both RNA-seq
    • Metadata overlap: ⚠️ 62.5% matching
    • Batch effect risk: High (different studies)
    • Recommendation: Cohort-level comparison only

Performance Characteristics

Tool Performance Tiers

| Tier | Tools | Duration | Use Case |
|---|---|---|---|
| Fast | fast_abstract_search, get_content_from_workspace | 200-500ms | Quick screening, cache retrieval |
| Moderate | search_literature, find_related_entries, get_dataset_metadata | 1-5s | Discovery, metadata extraction |
| Slow | read_full_publication, extract_methods | 2-8s | Deep content extraction |
| Variable | fast_dataset_search, validate_dataset_metadata | 2-5s | Database queries |

Provider Performance Metrics

| Provider | Avg Duration | Success Rate | Coverage |
|---|---|---|---|
| PMCProvider | 500ms | 95% | 30-40% of biomedical literature |
| WebpageProvider | 2-5s | 80% | Major publishers |
| DoclingService | 3-8s | 70% | Open access PDFs, preprints |
| PubMedProvider | 1-3s | 99% | All PubMed indexed |
| GEOProvider | 2-5s | 95% | All GEO datasets |

Optimization Strategies

  • Parallel Provider Queries - Multiple providers queried simultaneously
  • Session Caching - 60s cloud, 10s local cache duration
  • Workspace Persistence - Avoid redundant API calls
  • Smart Routing - Capability-based provider selection
  • Fallback Chains - Graceful degradation on failures

High-Throughput Processing & Rate Limiting (v0.3.0+)

For batch processing of large publication collections (100+ papers), the system implements multi-domain rate limiting and intelligent source selection to prevent IP bans while maximizing throughput.

Multi-Domain Rate Limiter

The MultiDomainRateLimiter (lobster/tools/rate_limiter.py) provides domain-specific rate limiting with Redis-based distributed coordination:

Domain-Specific Rate Limits:

| Domain | Requests/Second | Use Case |
|---|---|---|
| eutils.ncbi.nlm.nih.gov | 10.0 | NCBI E-utilities (with API key) |
| pmc.ncbi.nlm.nih.gov | 3.0 | PMC Open Access |
| europepmc.org | 2.0 | Europe PMC |
| frontiersin.org, mdpi.com, peerj.com | 1.0 | Open access publishers |
| nature.com, cell.com, elsevier.com | 0.5 | Major publishers |
| default | 0.3 | Unknown domains |

Key Features:

  • Exponential Backoff: Retry delays increase (1s → 3s → 9s → 27s, max 30s)
  • Automatic Domain Detection: URL parsing extracts domain for rate limit selection
  • Retry on HTTP Errors: Automatic retry on 429 (Rate Limited), 502, 503
  • Redis-Based Coordination: Distributed rate limiting across processes
  • Graceful Degradation: Fail-open if Redis unavailable
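
The backoff schedule above is a geometric progression with a cap, which can be sketched as:

```python
def backoff_delays(max_retries=4, base=1.0, factor=3.0, cap=30.0):
    """Yield the retry delays described above: 1s -> 3s -> 9s -> 27s, capped at 30s."""
    for attempt in range(max_retries):
        yield min(cap, base * factor ** attempt)

print(list(backoff_delays()))  # [1.0, 3.0, 9.0, 27.0]
```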

Usage:

from lobster.tools.rate_limiter import rate_limited_request, MultiDomainRateLimiter

# High-level wrapper with rate limiting + backoff + retry
response = rate_limited_request(
    "https://www.nature.com/articles/123",
    requests.get,
    timeout=30
)

# Direct rate limiter for custom logic
limiter = MultiDomainRateLimiter()
if limiter.wait_for_slot("https://pmc.ncbi.nlm.nih.gov/...", max_wait=30.0):
    # Safe to make request
    pass

PMC-First Source Selection

The PublicationProcessingService (lobster/services/orchestration/publication_processing_service.py) implements a priority-based source selection that maximizes open access content retrieval:

Source Priority Order:

| Priority | Source | Rationale |
|---|---|---|
| 1 | PMC ID | Guaranteed free open access (PMC12345) |
| 2 | PMID | Triggers automatic PMC lookup (PMID:12345) |
| 3 | PubMed URL | Extracts PMID for PMC resolution |
| 4 | DOI | Direct DOI resolution |
| 5 | Fulltext URL | May be paywalled |
| 6 | PDF URL | Direct PDF access |
| 7 | Metadata URL | Last resort |
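
The priority walk can be sketched as a lookup over a publication record (the dict keys here are assumptions for illustration, not the service's actual schema):

```python
# Priority order from the table above; lower index = tried first.
SOURCE_PRIORITY = ["pmc_id", "pmid", "pubmed_url", "doi",
                   "fulltext_url", "pdf_url", "metadata_url"]

def select_source(record: dict):
    """Pick the highest-priority identifier present on a publication record."""
    for field in SOURCE_PRIORITY:
        value = record.get(field)
        if value:
            return field, value
    return None, None

record = {"doi": "10.1038/s41467-024-51651-9", "pmid": "35042229"}
print(select_source(record))  # ('pmid', '35042229')
```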

Benefits:

  • Higher Success Rate: PMC provides 95%+ success vs 70% for publisher URLs
  • Faster Extraction: PMC XML parsing (500ms-2s) vs webpage scraping (2-8s)
  • Avoids Paywalls: PMC content is guaranteed open access
  • Better Structure: PMC XML has consistent, parseable structure

PMC ID Direct Resolution

The PMCProvider (lobster/tools/providers/pmc_provider.py) supports direct PMC ID input without requiring PMID/DOI lookup:

from lobster.tools.providers.pmc_provider import PMCProvider

provider = PMCProvider()

# Direct PMC ID - no NCBI elink lookup needed
pmc_id = provider.get_pmc_id("PMC10425240")  # Returns: "10425240"

# Full text extraction with direct PMC ID
full_text = provider.extract_full_text("PMC10425240")
print(f"Methods: {len(full_text.methods_section)} chars")
print(f"Tables: {len(full_text.tables)}")

Supported Input Formats:

  • PMC10425240 → Direct use
  • PMID:35042229 → Lookup via NCBI elink
  • 10.1038/s41467-024-51651-9 → DOI resolution to PMID, then PMC lookup
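
The three accepted input formats can be distinguished with simple patterns (an illustrative classifier, not the provider's actual parsing code):

```python
import re

def classify_identifier(value: str) -> str:
    """Classify the three input formats listed above."""
    if re.fullmatch(r"PMC\d+", value, re.IGNORECASE):
        return "pmc_id"   # direct use, no NCBI elink lookup
    if re.fullmatch(r"(PMID:)?\d{1,8}", value, re.IGNORECASE):
        return "pmid"     # resolved via NCBI elink
    if re.fullmatch(r"10\.\d{4,9}/\S+", value):
        return "doi"      # DOI -> PMID -> PMC lookup
    return "unknown"

print(classify_identifier("PMC10425240"))                 # pmc_id
print(classify_identifier("PMID:35042229"))               # pmid
print(classify_identifier("10.1038/s41467-024-51651-9"))  # doi
```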

Batch Processing Performance

With rate limiting and PMC-first selection, typical batch processing achieves:

| Metric | Value |
|---|---|
| Throughput | 50-100 publications/hour |
| Success Rate | 95%+ (with PMC-first) |
| IP Ban Risk | Minimal (domain-aware limiting) |
| Content Quality | High (structured PMC XML) |

Component Relationships

Research Components Architecture

The following diagram shows detailed research system components and their relationships:

Modality System

Lobster AI uses a modality-centric approach to handle different types of biological data:

Supported Data Types

  1. Single-Cell RNA-seq - 10X, H5AD, CSV formats
  2. Bulk RNA-seq - Count matrices, TPM/FPKM data
  3. Mass Spectrometry Proteomics - MaxQuant, Spectronaut outputs
  4. Affinity Proteomics - Olink NPX, antibody array data
  5. Genomics - VCF, GWAS summary statistics, PLINK formats
  6. Metabolomics - LC-MS, GC-MS, NMR via MetabolomicsAdapter
  7. Metagenomics - Amplicon, shotgun sequencing data
  8. Multi-Omics - Integrated analysis with MuData

Professional Naming Convention

geo_gse12345                          # Raw dataset
├── geo_gse12345_quality_assessed     # QC metrics added
├── geo_gse12345_filtered_normalized  # Preprocessed
├── geo_gse12345_doublets_detected    # Quality control
├── geo_gse12345_clustered           # Analysis results
├── geo_gse12345_markers             # Feature identification
└── geo_gse12345_annotated           # Final annotations

Performance & Scalability

Memory Management

  • Sparse Matrix Support - Efficient single-cell data handling
  • Chunked Processing - Large dataset memory optimization
  • Lazy Loading - On-demand data access
  • Two-Tier Caching - Fast in-memory session cache (Tier 1) + durable filesystem cache (Tier 2)

Computational Efficiency

  • Stateless Services - Parallelizable processing units
  • Vectorized Operations - NumPy/SciPy optimization
  • GPU Detection - Automatic hardware utilization
  • Background Processing - Non-blocking operations

Quality & Standards

Data Quality Compliance

  • 60% Compliant - Full publication-grade standards
  • 26% Partially Compliant - Advanced features with minor gaps
  • 14% Not Compliant - Basic functionality only

Error Handling

  • Hierarchical Exceptions - Specific error types for different failures
  • Graceful Degradation - Fallback mechanisms for robustness
  • Comprehensive Logging - Detailed operation tracking
  • User-Friendly Messages - Clear error explanations with suggestions

Extension Points

The architecture is designed for easy extension:

Adding New Agents

  1. Implement agent factory function
  2. Add entry to Agent Registry
  3. System automatically integrates handoff tools and callbacks

Adding New Services

  1. Implement stateless service class
  2. Follow AnnData input/output pattern
  3. Add comprehensive error handling and logging

Adding New Data Formats

  1. Implement modality adapter factory
  2. Register via lobster.adapters entry point (discovered by DataManagerV2)
  3. Add schema validation rules

Adding New Omics Types

  1. Define OmicsTypeConfig with detection keywords, feature ranges, and preferred databases
  2. Register adapter, provider, download service, and queue preparer via entry points
  3. Register the OmicsTypeConfig via lobster.omics_types entry point
  4. The research agent's routing table updates automatically
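
Step 1 could look roughly like this (an illustrative dataclass; the real OmicsTypeConfig fields in lobster.core.omics_registry may differ):

```python
from dataclasses import dataclass, field

# Illustrative shape only; field names are assumptions for this sketch.
@dataclass
class OmicsTypeConfig:
    name: str
    detection_keywords: list = field(default_factory=list)
    feature_range: tuple = (0, 0)  # expected number of measured features
    preferred_databases: list = field(default_factory=list)

METABOLOMICS_CONFIG = OmicsTypeConfig(
    name="metabolomics",
    detection_keywords=["LC-MS", "GC-MS", "NMR", "metabolite"],
    feature_range=(50, 5000),
    preferred_databases=["MetaboLights", "Metabolomics Workbench"],
)
```

Registering this object under the lobster.omics_types entry point (as in the pyproject.toml example earlier) is what lets the research agent's routing table pick it up automatically.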

Adding New Storage Backends

  1. Implement IDataBackend interface
  2. Register with DataManagerV2
  3. Add format-specific optimization

This modular architecture ensures that Lobster AI can evolve with the rapidly changing bioinformatics landscape while maintaining reliability and ease of use.
