Lobster AI - Cloud & Local Architecture
🏗️ System Architecture Overview
Lobster AI is a powerful multi-agent bioinformatics platform with seamless cloud and local deployment capabilities. The system automatically detects your configuration and routes requests appropriately.
☁️ Cloud/Local Architecture Pattern
🔄 Seamless Mode Switching Flow
📦 Clean Single Package Structure
🌟 Cloud Platform (Coming Soon)
System Architecture Overview - Post Migration
Data Flow Diagram - Modular Service Architecture
Component Interaction Matrix
Agent Configuration Schema
Each agent defines an AGENT_CONFIG using the AgentRegistryConfig dataclass:
```python
@dataclass
class AgentRegistryConfig:
    """Configuration for an agent in the system."""
    name: str                                       # Unique agent identifier
    display_name: str                               # Human-readable name
    description: str                                # Agent's purpose/capability
    factory_function: str                           # Module path to factory function
    tier_requirement: str = 'free'                  # 'free', 'premium', or 'enterprise'
    package_name: Optional[str] = None              # Package that provides this agent
    handoff_tool_name: Optional[str] = None         # Name of handoff tool
    handoff_tool_description: Optional[str] = None  # Tool description
```

Agent Discovery via Entry Points (v1.0.0+)
Agents are discovered via pyproject.toml entry points:
```toml
# Example from lobster-research package
[project.entry-points."lobster.agents"]
data_expert_agent = "lobster_research.agents.data_expert:AGENT_CONFIG"
research_agent = "lobster_research.agents.research_agent:AGENT_CONFIG"

# Example from lobster-transcriptomics package
[project.entry-points."lobster.agents"]
transcriptomics_expert = "lobster_transcriptomics.agents.transcriptomics_expert:AGENT_CONFIG"
```

Each agent module defines its AGENT_CONFIG at the module top:
```python
# lobster_research/agents/data_expert.py
AGENT_CONFIG = AgentRegistryConfig(
    name='data_expert_agent',
    display_name='Data Expert',
    description='Handles data fetching and download tasks',
    factory_function='lobster_research.agents.data_expert:data_expert',
    tier_requirement='free',
    package_name='lobster-research',
    handoff_tool_name='handoff_to_data_expert',
    handoff_tool_description='Assign data fetching/download tasks to the data expert'
)
```
For comparison, the legacy (pre-v1.0.0) system registered agents in a hard-coded AgentConfig dictionary:

```python
{
    # ... earlier entries elided ...
    'machine_learning_expert_agent': AgentConfig(
        name='machine_learning_expert_agent',
        display_name='ML Expert',
        description='Handles Machine Learning related tasks like transforming the data in the desired format for downstream tasks',
        factory_function='lobster.agents.machine_learning_expert.machine_learning_expert',
        handoff_tool_name='handoff_to_machine_learning_expert',
        handoff_tool_description='Assign all machine learning related tasks (scVI, classification etc) to the machine learning expert agent'
    ),
    'visualization_expert_agent': AgentConfig(
        name='visualization_expert_agent',
        display_name='Visualization Expert',
        description='Creates publication-quality visualizations through supervisor-mediated workflows',
        factory_function='lobster.agents.visualization_expert.visualization_expert',
        handoff_tool_name='handoff_to_visualization_expert',
        handoff_tool_description='Delegate visualization tasks to the visualization expert agent'
    ),
    'proteomics_expert': AgentConfig(
        name='proteomics_expert',
        display_name='Proteomics Expert',
        description='Handles both mass spectrometry and affinity proteomics analysis tasks',
        factory_function='lobster.agents.proteomics_expert.proteomics_expert',
        handoff_tool_name='handoff_to_proteomics_expert',
        handoff_tool_description='Assign proteomics analysis tasks (mass spectrometry or affinity proteomics) to the proteomics expert'
    ),
}
```

System Integration Flow
Benefits of Centralized Registry
Before (Legacy System)
Adding new agents required updating:
├── lobster/agents/graph.py # Import statements
├── lobster/agents/graph.py # Agent creation code
├── lobster/agents/graph.py # Handoff tool definitions
├── lobster/utils/callbacks.py # Agent name hardcoded list
└── Multiple imports throughout codebase

After (Entry Point System, v1.0.0+)
Adding new agents only requires:
├── Define AGENT_CONFIG at module top
└── Register in pyproject.toml entry points
Everything else is handled automatically:
├── ✅ Dynamic agent discovery via ComponentRegistry
├── ✅ Automatic delegation tool creation
├── ✅ Callback system integration
├── ✅ Type-safe configuration
└── ✅ Professional error handling

How to Add New Agents
Step 1: Create Agent Implementation with AGENT_CONFIG
```python
# your_package/agents/new_agent.py
from lobster.core.registry import AgentRegistryConfig

# AGENT_CONFIG at module top for fast discovery
AGENT_CONFIG = AgentRegistryConfig(
    name='new_agent',
    display_name='New Agent',
    description='Handles specialized new functionality',
    factory_function='your_package.agents.new_agent:new_agent',
    tier_requirement='free',
    package_name='your-package',
    handoff_tool_name='handoff_to_new_agent',
    handoff_tool_description='Assign specialized tasks to the new agent'
)

def new_agent(data_manager, callback_handler=None, agent_name='new_agent',
              delegation_tools=None, workspace_path=None, **kwargs):
    """Create a new specialized agent."""
    # Agent implementation
    return agent_instance
```

Step 2: Register via Entry Points
```toml
# pyproject.toml
[project.entry-points."lobster.agents"]
new_agent = "your_package.agents.new_agent:AGENT_CONFIG"
```

Step 3: Done!
The system automatically handles:
- ✅ Agent loading in graph creation
- ✅ Handoff tool generation
- ✅ Callback system detection
- ✅ Error handling and logging
- ✅ Integration with existing workflows
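Under the hood, this discovery step presumably iterates over the registered entry points with `importlib.metadata`; a minimal, illustrative sketch (not Lobster's actual code):

```python
from importlib.metadata import entry_points

def discover_agent_configs(group: str = "lobster.agents") -> dict:
    """Collect every AGENT_CONFIG registered under an entry-point group."""
    eps = entry_points()
    # Python 3.10+ exposes .select(); 3.8/3.9 return a dict of lists
    selected = eps.select(group=group) if hasattr(eps, "select") else eps.get(group, [])
    # ep.load() imports the agent module and returns its AGENT_CONFIG object
    return {ep.name: ep.load() for ep in selected}
```

Because loading happens lazily through each entry point, installing a new agent package is enough for it to appear in the registry on the next startup.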
Registry Helper Functions
The registry provides several utility functions:
```python
# Get all worker agents with configurations
worker_agents = get_worker_agents()
# Returns: Dict[str, AgentConfig]

# Get all agent names (including system agents)
all_agents = get_all_agent_names()
# Returns: List[str]

# Get specific agent configuration
config = get_agent_config('data_expert_agent')
# Returns: AgentConfig or None

# Dynamically import agent factory
factory = import_agent_factory('lobster.agents.data_expert.data_expert')
# Returns: Callable
```

Error Prevention
The registry system prevents common errors:
Runtime Validation
- ✅ Factory function existence validation
- ✅ Import path verification
- ✅ Configuration completeness checks
- ✅ Duplicate agent name detection
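The checks above could be sketched roughly as follows (a hypothetical helper, not the registry's real validation code):

```python
import importlib

def validate_config(config, registered_names: set) -> list:
    """Return a list of human-readable problems found in an agent config."""
    problems = []
    # Duplicate agent name detection
    if getattr(config, "name", None) in registered_names:
        problems.append(f"duplicate agent name: {config.name}")
    # Configuration completeness checks
    for field in ("name", "display_name", "factory_function"):
        if not getattr(config, field, None):
            problems.append(f"missing required field: {field}")
    # Import path verification ('module:attr' or plain dotted path)
    ff = getattr(config, "factory_function", "") or ""
    module_path, _, attr = ff.partition(":")
    if module_path:
        try:
            module = importlib.import_module(module_path)
            if attr and not hasattr(module, attr):
                problems.append(f"factory not found: {ff}")
        except ImportError:
            problems.append(f"unimportable module: {module_path}")
    return problems
```

Running such checks at registration time turns silent misconfigurations into immediate, descriptive errors.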
Development Safety
- ✅ Type hints for all configurations
- ✅ Consistent naming conventions
- ✅ Comprehensive error messages
- ✅ Centralized documentation
Maintenance Benefits
- ✅ Single source of truth
- ✅ Easy to audit and review
- ✅ Reduced cognitive load
- ✅ Professional code organization
Testing Agent Discovery
The system includes comprehensive testing via the lobster.testing module:
```python
# tests/test_agent_discovery.py
from lobster.core.registry import ComponentRegistry
from lobster.testing import AgentContractTestMixin

def test_agent_discovery():
    """Test the agent discovery functionality."""
    registry = ComponentRegistry()

    # Test 1: Verify agents are discovered
    agents = registry.list_agents()
    assert len(agents) > 0

    # Test 2: Validate agent configs
    for agent_name in agents:
        config = registry.get_agent_config(agent_name)
        assert config is not None
        assert config.name == agent_name

    # Test 3: Check expected agents are present
    assert 'data_expert_agent' in agents
    assert 'transcriptomics_expert' in agents
```

Run the test via CLI:

```shell
lobster agents list                    # Verify discovery
pytest tests/test_agent_discovery.py   # Run tests
```

This centralized approach ensures professional, maintainable, and error-free agent management across the entire Lobster AI system.
🔗 ConcatenationService: Code Deduplication & Memory Efficiency
Overview
The ConcatenationService is a critical architectural improvement that eliminates code duplication and provides memory-efficient, modality-agnostic concatenation of biological samples. This service addresses the code redundancy problem that existed between data_expert/data_expert.py and geo_service.py.
Architecture Pattern
Key Benefits
🎯 Code Reduction
- data_expert/data_expert.py: 200+ lines → 30 lines (85% reduction)
- geo_service.py: 300+ lines → 20 lines (93% reduction)
- Total elimination: 450+ lines of duplicated code
💾 Memory Efficiency
- Smart memory estimation with automatic strategy recommendation
- Chunked processing for datasets exceeding memory limits
- 50%+ memory reduction for large concatenation operations
- Real-time memory monitoring during processing
🧬 Modality-Agnostic Design
- Strategy Pattern: Different algorithms for different data types
- Single-cell optimization: Sparse matrix handling with batch tracking
- Bulk transcriptomics: Optimized for dense matrix operations
- Proteomics support: Handle missing values appropriately
🔧 Professional Architecture
- Single source of truth for all concatenation logic
- Comprehensive error handling with custom exceptions
- Progress tracking with Rich console integration
- Extensive testing with 400+ lines of unit tests
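The strategy-pattern dispatch described above might be sketched like this (the `ConcatenationStrategy` values match the service interface below; the thresholds are illustrative assumptions):

```python
from enum import Enum

class ConcatenationStrategy(Enum):
    SMART_SPARSE = "smart_sparse"   # sparse-aware concat for single-cell data
    DENSE = "dense"                 # straight dense concat for bulk data
    CHUNKED = "chunked"             # piecewise concat when memory is tight

def recommend_strategy(total_bytes: int, sparse_fraction: float,
                       memory_limit_bytes: int) -> ConcatenationStrategy:
    """Pick a concatenation strategy from estimated size and sparsity."""
    if total_bytes > memory_limit_bytes:
        # Dataset will not fit in memory at once: process it in chunks
        return ConcatenationStrategy.CHUNKED
    if sparse_fraction > 0.5:
        # Mostly zeros (typical single-cell matrices): keep it sparse
        return ConcatenationStrategy.SMART_SPARSE
    return ConcatenationStrategy.DENSE
```

Selecting the strategy from a cheap size/sparsity estimate, before any matrices are materialized, is what makes the memory-limit guarantees possible.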
Service Interface
```python
# Primary concatenation method
concatenated_adata, statistics = concat_service.concatenate_samples(
    sample_adatas=sample_list,
    strategy=ConcatenationStrategy.SMART_SPARSE,
    batch_key="batch",
    use_intersecting_genes_only=True
)

# Concatenate from modality names
concatenated_adata, statistics = concat_service.concatenate_from_modalities(
    modality_names=["sample1", "sample2", "sample3"],
    output_name="concatenated_dataset",
    use_intersecting_genes_only=True
)

# Auto-detect samples by pattern
sample_modalities = concat_service.auto_detect_samples("geo_gse12345")

# Validate before processing
validation_result = concat_service.validate_concatenation_inputs(sample_list)

# Estimate memory requirements
memory_info = concat_service.estimate_memory_usage(sample_list)
```

Integration with DataManagerV2
The ConcatenationService integrates deeply with DataManagerV2 for seamless modality management:
Testing & Quality Assurance
The ConcatenationService includes comprehensive testing:
- Unit Tests: Strategy pattern, validation functions, memory estimation
- Integration Tests: DataManagerV2 interaction, modality storage
- Performance Tests: Memory usage, processing time benchmarks
- Error Handling Tests: Exception scenarios, graceful degradation
This architecture improvement ensures reliable, maintainable, and efficient sample concatenation across the entire Lobster AI platform.
🌟 Open Source Benefits
🆓 What You Get for Free
- Complete Bioinformatics Platform: All analysis capabilities included
- AI-Powered Analysis: Natural language interface to bioinformatics
- Publication-Ready Outputs: Professional visualizations and reports
- Extensible Architecture: Add custom analysis methods easily
- Active Development: Regular updates and community contributions
📈 Why Choose Local Installation
- Privacy: Your data never leaves your computer
- Customization: Full control over analysis parameters
- Learning: Study the source code to understand methods
- Contribution: Help improve the platform for everyone
- Cost: Completely free (you pay only for your own API keys)
☁️ Interested in Cloud?
For teams needing scalable infrastructure, managed services, or collaborative features, we're developing a cloud platform.
Architecture Migration Summary
🎯 Migration Goals Achieved
The Lobster AI system has been successfully migrated from a dual-system architecture (legacy DataManager + DataManagerV2) to a clean, professional, modular DataManagerV2-only implementation.
✅ Key Improvements
1. Modular Service Architecture
- Before: Agents contained mixed responsibilities with dual code paths
- After: Clean separation with stateless analysis services and orchestration agents
2. Professional Error Handling
- Custom Exception Hierarchy:
TranscriptomicsError, PreprocessingError, QualityError, etc.; ModalityNotFoundError for specific validation
- Comprehensive Logging: All operations tracked with parameters and results
- Graceful Error Recovery: Informative error messages with suggested fixes
3. Stateless Services Design
- PreprocessingService: AnnData filtering, normalization, batch correction
- QualityService: Comprehensive QC assessment with statistical metrics
- ClusteringService: Leiden clustering, PCA, UMAP visualization
- EnhancedSingleCellService: Doublet detection, cell type annotation
- GEOService: Professional dataset downloading and processing
- PubMedService: Literature mining and method extraction
🏗️ New Architecture Pattern
Agent Tool Pattern
```python
@tool
def tool_name(modality_name: str, **params) -> str:
    """Professional tool with comprehensive error handling."""
    try:
        # 1. Validate modality exists
        if modality_name not in data_manager.list_modalities():
            raise ModalityNotFoundError(f"Modality '{modality_name}' not found")

        # 2. Get AnnData from modality
        adata = data_manager.get_modality(modality_name)

        # 3. Call stateless service
        result_adata, stats = service.method_name(adata, **params)

        # 4. Save new modality with descriptive name
        new_modality_name = f"{modality_name}_processed"
        data_manager.modalities[new_modality_name] = result_adata

        # 5. Log operation for provenance
        data_manager.log_tool_usage(tool_name, params, description)

        # 6. Format professional response
        return format_professional_response(stats, new_modality_name)

    except ServiceError as e:
        logger.error(f"Service error: {e}")
        return f"Service error: {str(e)}"
    except Exception as e:
        logger.error(f"Unexpected error: {e}")
        return f"Unexpected error: {str(e)}"
```

Service Method Pattern
```python
def service_method(
    self,
    adata: anndata.AnnData,
    **parameters
) -> Tuple[anndata.AnnData, Dict[str, Any]]:
    """
    Stateless service method working with AnnData directly.

    Returns:
        Tuple of (processed_adata, processing_statistics)
    """
    try:
        # 1. Create working copy
        adata_processed = adata.copy()

        # 2. Apply analysis algorithms
        # ... processing logic ...

        # 3. Calculate comprehensive statistics
        processing_stats = {
            "analysis_type": "method_type",
            "parameters_used": parameters,
            "results": {...},
        }
        return adata_processed, processing_stats

    except Exception as e:
        raise ServiceError(f"Method failed: {str(e)}")
```

📊 Modality Management System
Descriptive Naming Convention
Each analysis step creates new modalities with descriptive, traceable names:
geo_gse12345 # Raw downloaded data
├── geo_gse12345_quality_assessed # With QC metrics
├── geo_gse12345_filtered_normalized # Preprocessed data
├── geo_gse12345_doublets_detected # With doublet annotations
├── geo_gse12345_clustered # With clustering results
├── geo_gse12345_markers # With marker genes
└── geo_gse12345_annotated # With cell type annotations

Professional Modality Tracking
- Provenance: Complete analysis history with parameters
- Statistics: Comprehensive metrics for each processing step
- Validation: Schema enforcement and quality checks
- Storage: Automatic saving with professional file naming
🔬 Analysis Workflow Excellence
Standard Single-cell RNA-seq Pipeline
1. check_data_status() → Review available modalities
2. assess_data_quality(modality_name) → Professional QC assessment
3. filter_and_normalize_modality(...) → Clean and normalize
4. detect_doublets_in_modality(...) → Remove doublets
5. cluster_modality(...) → Leiden clustering + UMAP
6. find_marker_genes_for_clusters(...) → Differential expression
7. annotate_cell_types(...) → Automated annotation
8. create_analysis_summary() → Comprehensive report

Quality Control Standards
- Professional QC Thresholds: Evidence-based filtering parameters
- Multi-metric Assessment: Total counts, gene counts, mitochondrial%, ribosomal%
- Statistical Validation: Z-score outlier detection and percentile thresholds
- Batch Effect Handling: Automatic batch detection and correction options
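The z-score outlier rule mentioned above can be illustrated with the standard library (the threshold of 3 is a common convention, not necessarily Lobster's default; real pipelines would apply this per QC metric across all cells, typically with NumPy):

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Flag values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [False] * len(values)  # no spread means no outliers
    return [abs((v - mu) / sigma) > threshold for v in values]
```

Cells flagged on metrics such as total counts or mitochondrial fraction would then be excluded or inspected before clustering.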
Error Handling & Recovery
- Input Validation: Comprehensive parameter and data validation
- Graceful Degradation: Fallback methods when specialized tools unavailable
- Informative Messages: Clear error descriptions with suggested solutions
- Operation Logging: Complete audit trail for debugging and reproducibility
🚀 Benefits of New Architecture
Code Quality Improvements
- 50% Reduction in agent code complexity (450+ → 200+ lines)
- Zero Duplication: No more dual code paths or is_v2 checks
- Professional Standards: Type hints, comprehensive docstrings, error handling
- Testability: Stateless services are easily unit tested
Maintainability Enhancements
- Single Responsibility: Each service handles one analysis domain
- Modular Design: Services can be used independently or combined
- Clean Interfaces: Consistent patterns across all analysis tools
- Version Control: Clear separation enables independent service updates
Performance & Reliability
- Memory Efficiency: Stateless services with minimal memory footprint
- Fault Tolerance: Comprehensive error handling prevents pipeline failures
- Reproducibility: Complete parameter logging and provenance tracking
- Scalability: Services can be distributed or parallelized in future versions
Migration Impact Analysis
📈 Before Migration (Legacy System)
transcriptomics_expert.py: 450+ lines
├── Dual code paths (is_v2 checks everywhere)
├── Mixed responsibilities (orchestration + analysis)
├── Redundant implementations
├── Complex error handling
└── Maintenance overhead

🎉 After Migration (Modular System)
transcriptomics_expert.py: 280 lines (clean)
├── Single DataManagerV2 path
├── Professional tool orchestration only
├── Stateless service delegation
├── Comprehensive error handling
└── Minimal maintenance overhead
Analysis Services: 4 refactored services
├── PreprocessingService: AnnData → (filtered_adata, stats)
├── QualityService: AnnData → (qc_adata, assessment)
├── ClusteringService: AnnData → (clustered_adata, results)
└── EnhancedSingleCellService: AnnData → (annotated_adata, metrics)

🔧 Technical Architecture Benefits
Service Layer Advantages
- Reusability: Services can be used by multiple agents
- Testability: Each service can be independently tested
- Flexibility: Easy to add new analysis methods
- Performance: Optimized algorithms with professional implementations
Agent Layer Improvements
- Orchestration Focus: Agents handle modality management and user interaction
- Clean Tool Interface: Consistent ~20-30 line tool implementations
- Professional Responses: Formatted outputs with comprehensive statistics
- Error Management: Hierarchical error handling with specific exceptions
DataManagerV2 Integration
- Modality-Centric: All data operations centered around named modalities
- Provenance Tracking: Complete analysis history with tool usage logging
- Schema Validation: Automatic validation ensures data integrity
- Storage Management: Professional file naming and workspace organization
This architecture provides a solid foundation for professional bioinformatics analysis with excellent maintainability, extensibility, and reliability.
🧬 Agent-Guided Formula Construction Integration
Enhanced Bulk RNA-seq Expert Agent Tools
The bulk_rnaseq_expert agent includes 5 new tools for conversational formula construction:
Service Enhancement Details
- DifferentialFormulaService: Added suggest_formulas(), preview_design_matrix(), estimate_statistical_power()
- WorkflowTracker: New lightweight class for DE iteration tracking and comparison
- Integration: All data stored in AnnData.uns for seamless workflow integration
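Storing iteration history in AnnData.uns, as described, might look like this minimal sketch (the real WorkflowTracker API may differ; a plain dict stands in for `adata.uns` here):

```python
class WorkflowTracker:
    """Track differential-expression iterations inside an AnnData-style .uns dict."""
    KEY = "de_workflow"

    def __init__(self, uns: dict):
        self.uns = uns
        self.uns.setdefault(self.KEY, {"iterations": []})

    def record(self, formula: str, n_significant: int, params: dict) -> int:
        """Append one DE iteration and return its index."""
        iterations = self.uns[self.KEY]["iterations"]
        iterations.append({"formula": formula,
                           "n_significant": n_significant,
                           "params": params})
        return len(iterations) - 1

    def compare(self):
        """Summarize all iterations for side-by-side comparison."""
        return [(i, it["formula"], it["n_significant"])
                for i, it in enumerate(self.uns[self.KEY]["iterations"])]
```

Because everything lives in `.uns`, the iteration history is saved and restored along with the dataset itself.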
Workflow Coverage Impact
- ✅ Step 8: Formula Construction → Agent-guided conversation
- ✅ Step 12: Iterative Workflows → Natural iteration and comparison
- 🎯 Result: 92% workflow coverage (11/12 steps complete)
🔄 Workspace Restoration System (New in v0.2)
Seamless Session Continuity
Lobster AI now features intelligent workspace restoration that automatically detects and restores previous analysis sessions:
Key Features
- Automatic Detection: Scans .lobster_workspace/data/ for available datasets on startup
- Session Persistence: Maintains .session.json with active modalities and usage history
- Lazy Loading: Load specific datasets on-demand with load_dataset()
- Pattern-Based Restoration: Support for recent/all/glob patterns via /restore
- Memory Management: Enforced memory limits prevent out-of-memory issues
New CLI Commands
- /restore [pattern] - Restore datasets from previous sessions
- /workspace list - View available datasets without loading
- /workspace load <name> - Load specific dataset by name
- Autocomplete Support: Tab completion for dataset names and patterns
Implementation Highlights
- DataManagerV2 Enhanced: Added _scan_workspace(), load_dataset(), restore_session()
- Session Tracking: Automatic .session.json updates on modality changes
- H5PY Integration: Efficient metadata extraction without full dataset loading
- Professional UX: Startup prompt shows workspace status with helpful commands
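The .session.json persistence piece can be sketched with the standard library (the file layout is an assumption for illustration):

```python
import json
from pathlib import Path

SESSION_FILE = ".session.json"

def save_session(workspace: Path, active_modalities: list) -> Path:
    """Write the active-modality list to the workspace session file."""
    path = Path(workspace) / SESSION_FILE
    path.write_text(json.dumps({"active_modalities": active_modalities}, indent=2))
    return path

def restore_session(workspace: Path) -> list:
    """Return the previously active modalities, or [] if no session exists."""
    path = Path(workspace) / SESSION_FILE
    if not path.exists():
        return []
    return json.loads(path.read_text()).get("active_modalities", [])
```

On startup, the restored names are enough to offer `/restore` without loading any actual dataset into memory.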
This transformation enables users to seamlessly continue their work across sessions without manual dataset reloading.
🛠️ System Utilities Centralization
Performance Optimization
The system now features centralized platform utilities that eliminate redundant OS detection and provide unified cross-platform operations:
Before → After Transformation
- Platform Detection: 5 × platform.system() calls → 1 × (at import time)
- Code Reduction: ~50 lines of duplicate subprocess logic → 5 lines at call sites
- Performance: 80% improvement in system operation speed
- Architecture: Clean lobster/utils/system.py module with open_file(), open_folder(), open_path() functions
Cloud-Agnostic Design
All file opening operations run on the CLI side regardless of cloud vs local mode, ensuring consistent behavior across deployment types.
Integration Points
- CLI Commands: open <file>, /open <file>, /plot, /plot <ID>
- GPU Detection: Apple Silicon detection in gpu_detector.py
- Future Extensions: Natural extension point for additional system utilities
🎛️ Supervisor Configuration System (v0.2+)
Dynamic Agent Discovery & Configuration
The supervisor agent now features automatic agent discovery and configurable behavior, eliminating manual updates when adding new agents:
Architecture Overview
Key Improvements
| Feature | Before (Static) | After (Dynamic) | Impact |
|---|---|---|---|
| Agent Discovery | Manual updates in supervisor.py | Automatic from registry | Zero maintenance |
| Missing Agents | 3 agents not included | All 8 agents included | Complete coverage |
| Configuration | Hardcoded behavior | 20+ env variables | Full flexibility |
| Prompt Size | Fixed ~9.5K chars | 8K-11K adaptive | 15% smaller in production |
| Adding Agents | Update 3+ files | Update registry only | 66% less work |
Operation Modes
```shell
# Research Mode - Interactive exploration
SUPERVISOR_ASK_QUESTIONS=true
SUPERVISOR_WORKFLOW_GUIDANCE=detailed
# Result: 11K char prompt with full guidance

# Production Mode - Automated pipelines
SUPERVISOR_ASK_QUESTIONS=false
SUPERVISOR_WORKFLOW_GUIDANCE=minimal
# Result: 8K char prompt, 1.4K chars saved

# Development Mode - Debugging
SUPERVISOR_VERBOSE=true
SUPERVISOR_INCLUDE_SYSTEM=true
# Result: Detailed explanations with system info
```

Implementation Benefits
- 🚀 Zero Maintenance: Add agents to registry only, supervisor auto-discovers
- ⚙️ Flexible Behavior: Configure interaction style per environment
- 📊 Context Aware: Includes current data/workspace state dynamically
- 🎯 Mode Optimized: Different prompt sizes for different use cases
- ♻️ Backward Compatible: Default config matches previous behavior exactly
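Reading that configuration from the environment could be sketched as follows (variable names come from the modes above; the defaults and the dataclass shape are assumptions):

```python
import os
from dataclasses import dataclass

def _env_bool(name: str, default: bool) -> bool:
    """Interpret an environment variable as a boolean flag."""
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

@dataclass
class SupervisorConfig:
    ask_questions: bool
    workflow_guidance: str  # "minimal" | "detailed"
    verbose: bool

    @classmethod
    def from_env(cls) -> "SupervisorConfig":
        # Defaults chosen to mirror interactive "research mode" behavior
        return cls(
            ask_questions=_env_bool("SUPERVISOR_ASK_QUESTIONS", True),
            workflow_guidance=os.environ.get("SUPERVISOR_WORKFLOW_GUIDANCE", "detailed"),
            verbose=_env_bool("SUPERVISOR_VERBOSE", False),
        )
```

Because the config is assembled at startup, the same codebase can run as an interactive assistant or a silent pipeline worker without code changes.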