Lobster AI - Cloud & Local Architecture
🏗️ System Architecture Overview
Lobster AI is a powerful multi-agent bioinformatics platform with seamless cloud and local deployment capabilities. The system automatically detects your configuration and routes requests appropriately.
☁️ Cloud/Local Architecture Pattern
🔄 Seamless Mode Switching Flow
📦 Clean Single Package Structure
🌟 Cloud Platform (Coming Soon)
System Architecture Overview - Post Migration
Data Flow Diagram - Modular Service Architecture
Component Interaction Matrix
Agent Configuration Schema
Each agent defines an AGENT_CONFIG using the AgentRegistryConfig dataclass:
```python
@dataclass
class AgentRegistryConfig:
    """Configuration for an agent in the system."""
    name: str                                       # Unique agent identifier
    display_name: str                               # Human-readable name
    description: str                                # Agent's purpose/capability
    factory_function: str                           # Module path to factory function
    tier_requirement: str = 'free'                  # 'free', 'premium', or 'enterprise'
    package_name: Optional[str] = None              # Package that provides this agent
    handoff_tool_name: Optional[str] = None         # Name of handoff tool
    handoff_tool_description: Optional[str] = None  # Tool description
```

Agent Discovery via Entry Points (v1.0.0+)
Agents are discovered via pyproject.toml entry points:
```toml
# Example from lobster-research package
[project.entry-points."lobster.agents"]
data_expert_agent = "lobster_research.agents.data_expert:AGENT_CONFIG"
research_agent = "lobster_research.agents.research_agent:AGENT_CONFIG"

# Example from lobster-transcriptomics package
[project.entry-points."lobster.agents"]
transcriptomics_expert = "lobster_transcriptomics.agents.transcriptomics_expert:AGENT_CONFIG"
```

Each agent module defines its AGENT_CONFIG at the module top:
```python
# lobster_research/agents/data_expert.py
AGENT_CONFIG = AgentRegistryConfig(
    name='data_expert_agent',
    display_name='Data Expert',
    description='Handles data fetching and download tasks',
    factory_function='lobster_research.agents.data_expert:data_expert',
    tier_requirement='free',
    package_name='lobster-research',
    handoff_tool_name='handoff_to_data_expert',
    handoff_tool_description='Assign data fetching/download tasks to the data expert'
)
```
For comparison, the legacy (pre-v1.0.0) system registered agents in a hard-coded AgentConfig dictionary:

```python
{
    # ... earlier entries elided ...
    'machine_learning_expert_agent': AgentConfig(
        name='machine_learning_expert_agent',
        display_name='ML Expert',
        description='Handles Machine Learning related tasks like transforming the data in the desired format for downstream tasks',
        factory_function='lobster.agents.machine_learning_expert.machine_learning_expert',
        handoff_tool_name='handoff_to_machine_learning_expert',
        handoff_tool_description='Assign all machine learning related tasks (scVI, classification etc) to the machine learning expert agent'
    ),
    'visualization_expert_agent': AgentConfig(
        name='visualization_expert_agent',
        display_name='Visualization Expert',
        description='Creates publication-quality visualizations through supervisor-mediated workflows',
        factory_function='lobster.agents.visualization_expert.visualization_expert',
        handoff_tool_name='handoff_to_visualization_expert',
        handoff_tool_description='Delegate visualization tasks to the visualization expert agent'
    ),
    'proteomics_expert': AgentConfig(
        name='proteomics_expert',
        display_name='Proteomics Expert',
        description='Handles both mass spectrometry and affinity proteomics analysis tasks',
        factory_function='lobster.agents.proteomics_expert.proteomics_expert',
        handoff_tool_name='handoff_to_proteomics_expert',
        handoff_tool_description='Assign proteomics analysis tasks (mass spectrometry or affinity proteomics) to the proteomics expert'
    ),
}
```

System Integration Flow
Benefits of Centralized Registry
Before (Legacy System)
Adding new agents required updating:
├── lobster/agents/graph.py # Import statements
├── lobster/agents/graph.py # Agent creation code
├── lobster/agents/graph.py # Handoff tool definitions
├── lobster/utils/callbacks.py # Agent name hardcoded list
└── Multiple imports throughout codebase

After (Entry Point System, v1.0.0+)
Adding new agents only requires:
├── Define AGENT_CONFIG at module top
└── Register in pyproject.toml entry points
Everything else is handled automatically:
├── ✅ Dynamic agent discovery via ComponentRegistry
├── ✅ Automatic delegation tool creation
├── ✅ Callback system integration
├── ✅ Type-safe configuration
└── ✅ Professional error handling

How to Add New Agents
Step 1: Create Agent Implementation with AGENT_CONFIG
```python
# your_package/agents/new_agent.py
from lobster.core.registry import AgentRegistryConfig

# AGENT_CONFIG at module top for fast discovery
AGENT_CONFIG = AgentRegistryConfig(
    name='new_agent',
    display_name='New Agent',
    description='Handles specialized new functionality',
    factory_function='your_package.agents.new_agent:new_agent',
    tier_requirement='free',
    package_name='your-package',
    handoff_tool_name='handoff_to_new_agent',
    handoff_tool_description='Assign specialized tasks to the new agent'
)

def new_agent(data_manager, callback_handler=None, agent_name='new_agent',
              delegation_tools=None, workspace_path=None, **kwargs):
    """Create a new specialized agent."""
    # Agent implementation
    return agent_instance
```

Step 2: Register via Entry Points
```toml
# pyproject.toml
[project.entry-points."lobster.agents"]
new_agent = "your_package.agents.new_agent:AGENT_CONFIG"
```

Step 3: Done!
The system automatically handles:
- ✅ Agent loading in graph creation
- ✅ Handoff tool generation
- ✅ Callback system detection
- ✅ Error handling and logging
- ✅ Integration with existing workflows
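Under the hood, this discovery step presumably iterates over the registered entry points with `importlib.metadata`; a minimal, illustrative sketch (not Lobster's actual code):

```python
from importlib.metadata import entry_points

def discover_agent_configs(group: str = "lobster.agents") -> dict:
    """Collect every AGENT_CONFIG registered under an entry-point group."""
    eps = entry_points()
    # Python 3.10+ exposes .select(); 3.8/3.9 return a dict of lists
    selected = eps.select(group=group) if hasattr(eps, "select") else eps.get(group, [])
    # ep.load() imports the agent module and returns its AGENT_CONFIG object
    return {ep.name: ep.load() for ep in selected}
```

Because loading happens lazily through each entry point, installing a new agent package is enough for it to appear in the registry on the next startup.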
Registry Helper Functions
The registry provides several utility functions:
```python
# Get all worker agents with configurations
worker_agents = get_worker_agents()
# Returns: Dict[str, AgentConfig]

# Get all agent names (including system agents)
all_agents = get_all_agent_names()
# Returns: List[str]

# Get specific agent configuration
config = get_agent_config('data_expert_agent')
# Returns: AgentConfig or None

# Dynamically import agent factory
factory = import_agent_factory('lobster.agents.data_expert.data_expert')
# Returns: Callable
```

Error Prevention
The registry system prevents common errors:
Runtime Validation
- ✅ Factory function existence validation
- ✅ Import path verification
- ✅ Configuration completeness checks
- ✅ Duplicate agent name detection
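The checks above could be sketched roughly as follows (a hypothetical helper, not the registry's real validation code):

```python
import importlib

def validate_config(config, registered_names: set) -> list:
    """Return a list of human-readable problems found in an agent config."""
    problems = []
    # Duplicate agent name detection
    if getattr(config, "name", None) in registered_names:
        problems.append(f"duplicate agent name: {config.name}")
    # Configuration completeness checks
    for field in ("name", "display_name", "factory_function"):
        if not getattr(config, field, None):
            problems.append(f"missing required field: {field}")
    # Import path verification ('module:attr' or plain dotted path)
    ff = getattr(config, "factory_function", "") or ""
    module_path, _, attr = ff.partition(":")
    if module_path:
        try:
            module = importlib.import_module(module_path)
            if attr and not hasattr(module, attr):
                problems.append(f"factory not found: {ff}")
        except ImportError:
            problems.append(f"unimportable module: {module_path}")
    return problems
```

Running such checks at registration time turns silent misconfigurations into immediate, descriptive errors.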
Development Safety
- ✅ Type hints for all configurations
- ✅ Consistent naming conventions
- ✅ Comprehensive error messages
- ✅ Centralized documentation
Maintenance Benefits
- ✅ Single source of truth
- ✅ Easy to audit and review
- ✅ Reduced cognitive load
- ✅ Professional code organization
Testing Agent Discovery
The system includes comprehensive testing via the lobster.testing module:
```python
# tests/test_agent_discovery.py
from lobster.core.registry import ComponentRegistry
from lobster.testing import AgentContractTestMixin

def test_agent_discovery():
    """Test the agent discovery functionality."""
    registry = ComponentRegistry()

    # Test 1: Verify agents are discovered
    agents = registry.list_agents()
    assert len(agents) > 0

    # Test 2: Validate agent configs
    for agent_name in agents:
        config = registry.get_agent_config(agent_name)
        assert config is not None
        assert config.name == agent_name

    # Test 3: Check expected agents are present
    assert 'data_expert_agent' in agents
    assert 'transcriptomics_expert' in agents
```

Run the test via CLI:

```shell
lobster agents list                    # Verify discovery
pytest tests/test_agent_discovery.py   # Run tests
```

This centralized approach ensures professional, maintainable, and error-free agent management across the entire Lobster AI system.
🔗 ConcatenationService: Code Deduplication & Memory Efficiency
Overview
The ConcatenationService is a critical architectural improvement that eliminates code duplication and provides memory-efficient, modality-agnostic concatenation of biological samples. This service addresses the code redundancy problem that existed between data_expert/data_expert.py and geo_service.py.
Architecture Pattern
Key Benefits
🎯 Code Reduction
- data_expert/data_expert.py: 200+ lines → 30 lines (85% reduction)
- geo_service.py: 300+ lines → 20 lines (93% reduction)
- Total elimination: 450+ lines of duplicated code
💾 Memory Efficiency
- Smart memory estimation with automatic strategy recommendation
- Chunked processing for datasets exceeding memory limits
- 50%+ memory reduction for large concatenation operations
- Real-time memory monitoring during processing
🧬 Modality-Agnostic Design
- Strategy Pattern: Different algorithms for different data types
- Single-cell optimization: Sparse matrix handling with batch tracking
- Bulk transcriptomics: Optimized for dense matrix operations
- Proteomics support: Handle missing values appropriately
🔧 Professional Architecture
- Single source of truth for all concatenation logic
- Comprehensive error handling with custom exceptions
- Progress tracking with Rich console integration
- Extensive testing with 400+ lines of unit tests
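The strategy-pattern dispatch described above might be sketched like this (the `ConcatenationStrategy` values match the service interface below; the thresholds are illustrative assumptions):

```python
from enum import Enum

class ConcatenationStrategy(Enum):
    SMART_SPARSE = "smart_sparse"   # sparse-aware concat for single-cell data
    DENSE = "dense"                 # straight dense concat for bulk data
    CHUNKED = "chunked"             # piecewise concat when memory is tight

def recommend_strategy(total_bytes: int, sparse_fraction: float,
                       memory_limit_bytes: int) -> ConcatenationStrategy:
    """Pick a concatenation strategy from estimated size and sparsity."""
    if total_bytes > memory_limit_bytes:
        # Dataset will not fit in memory at once: process it in chunks
        return ConcatenationStrategy.CHUNKED
    if sparse_fraction > 0.5:
        # Mostly zeros (typical single-cell matrices): keep it sparse
        return ConcatenationStrategy.SMART_SPARSE
    return ConcatenationStrategy.DENSE
```

Selecting the strategy from a cheap size/sparsity estimate, before any matrices are materialized, is what makes the memory-limit guarantees possible.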
Service Interface
```python
# Primary concatenation method
concatenated_adata, statistics = concat_service.concatenate_samples(
    sample_adatas=sample_list,
    strategy=ConcatenationStrategy.SMART_SPARSE,
    batch_key="batch",
    use_intersecting_genes_only=True
)

# Concatenate from modality names
concatenated_adata, statistics = concat_service.concatenate_from_modalities(
    modality_names=["sample1", "sample2", "sample3"],
    output_name="concatenated_dataset",
    use_intersecting_genes_only=True
)

# Auto-detect samples by pattern
sample_modalities = concat_service.auto_detect_samples("geo_gse12345")

# Validate before processing
validation_result = concat_service.validate_concatenation_inputs(sample_list)

# Estimate memory requirements
memory_info = concat_service.estimate_memory_usage(sample_list)
```

Integration with DataManagerV2
The ConcatenationService integrates deeply with DataManagerV2 for seamless modality management:
Testing & Quality Assurance
The ConcatenationService includes comprehensive testing:
- Unit Tests: Strategy pattern, validation functions, memory estimation
- Integration Tests: DataManagerV2 interaction, modality storage
- Performance Tests: Memory usage, processing time benchmarks
- Error Handling Tests: Exception scenarios, graceful degradation
This architecture improvement ensures reliable, maintainable, and efficient sample concatenation across the entire Lobster AI platform.
🌟 Open Source Benefits
🆓 What You Get for Free
- Complete Bioinformatics Platform: All analysis capabilities included
- AI-Powered Analysis: Natural language interface to bioinformatics
- Publication-Ready Outputs: Professional visualizations and reports
- Extensible Architecture: Add custom analysis methods easily
- Active Development: Regular updates and community contributions
📈 Why Choose Local Installation
- Privacy: Your data never leaves your computer
- Customization: Full control over analysis parameters
- Learning: Study the source code to understand methods
- Contribution: Help improve the platform for everyone
- Cost: Completely free (you pay only for your own API keys)
☁️ Interested in Cloud?
For teams needing scalable infrastructure, managed services, or collaborative features, we're developing a cloud platform.
Architecture Migration Summary
🎯 Migration Goals Achieved
The Lobster AI system has been successfully migrated from a dual-system architecture (legacy DataManager + DataManagerV2) to a clean, professional, modular DataManagerV2-only implementation.
✅ Key Improvements
1. Modular Service Architecture
- Before: Agents contained mixed responsibilities with dual code paths
- After: Clean separation with stateless analysis services and orchestration agents
2. Professional Error Handling
- Custom Exception Hierarchy:
TranscriptomicsError, PreprocessingError, QualityError, etc.; ModalityNotFoundError for specific validation
- Comprehensive Logging: All operations tracked with parameters and results
- Graceful Error Recovery: Informative error messages with suggested fixes
3. Stateless Services Design
- PreprocessingService: AnnData filtering, normalization, batch correction
- QualityService: Comprehensive QC assessment with statistical metrics
- ClusteringService: Leiden clustering, PCA, UMAP visualization
- EnhancedSingleCellService: Doublet detection, cell type annotation
- GEOService: Professional dataset downloading and processing
- PubMedService: Literature mining and method extraction
🏗️ New Architecture Pattern
Agent Tool Pattern
```python
@tool
def tool_name(modality_name: str, **params) -> str:
    """Professional tool with comprehensive error handling."""
    try:
        # 1. Validate modality exists
        if modality_name not in data_manager.list_modalities():
            raise ModalityNotFoundError(f"Modality '{modality_name}' not found")

        # 2. Get AnnData from modality
        adata = data_manager.get_modality(modality_name)

        # 3. Call stateless service
        result_adata, stats = service.method_name(adata, **params)

        # 4. Save new modality with descriptive name
        new_modality_name = f"{modality_name}_processed"
        data_manager.modalities[new_modality_name] = result_adata

        # 5. Log operation for provenance
        data_manager.log_tool_usage(tool_name, params, description)

        # 6. Format professional response
        return format_professional_response(stats, new_modality_name)

    except ServiceError as e:
        logger.error(f"Service error: {e}")
        return f"Service error: {str(e)}"
    except Exception as e:
        logger.error(f"Unexpected error: {e}")
        return f"Unexpected error: {str(e)}"
```

Service Method Pattern
```python
def service_method(
    self,
    adata: anndata.AnnData,
    **parameters
) -> Tuple[anndata.AnnData, Dict[str, Any]]:
    """
    Stateless service method working with AnnData directly.

    Returns:
        Tuple of (processed_adata, processing_statistics)
    """
    try:
        # 1. Create working copy
        adata_processed = adata.copy()

        # 2. Apply analysis algorithms
        # ... processing logic ...

        # 3. Calculate comprehensive statistics
        processing_stats = {
            "analysis_type": "method_type",
            "parameters_used": parameters,
            "results": {...},
        }
        return adata_processed, processing_stats

    except Exception as e:
        raise ServiceError(f"Method failed: {str(e)}")
```

📊 Modality Management System
Descriptive Naming Convention
Each analysis step creates new modalities with descriptive, traceable names:
geo_gse12345 # Raw downloaded data
├── geo_gse12345_quality_assessed # With QC metrics
├── geo_gse12345_filtered_normalized # Preprocessed data
├── geo_gse12345_doublets_detected # With doublet annotations
├── geo_gse12345_clustered # With clustering results
├── geo_gse12345_markers # With marker genes
└── geo_gse12345_annotated # With cell type annotations

Professional Modality Tracking
- Provenance: Complete analysis history with parameters
- Statistics: Comprehensive metrics for each processing step
- Validation: Schema enforcement and quality checks
- Storage: Automatic saving with professional file naming
🔬 Analysis Workflow Excellence
Standard Single-cell RNA-seq Pipeline
1. check_data_status() → Review available modalities
2. assess_data_quality(modality_name) → Professional QC assessment
3. filter_and_normalize_modality(...) → Clean and normalize
4. detect_doublets_in_modality(...) → Remove doublets
5. cluster_modality(...) → Leiden clustering + UMAP
6. find_marker_genes_for_clusters(...) → Differential expression
7. annotate_cell_types(...) → Automated annotation
8. create_analysis_summary() → Comprehensive report

Quality Control Standards
- Professional QC Thresholds: Evidence-based filtering parameters
- Multi-metric Assessment: Total counts, gene counts, mitochondrial%, ribosomal%
- Statistical Validation: Z-score outlier detection and percentile thresholds
- Batch Effect Handling: Automatic batch detection and correction options
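The z-score outlier rule mentioned above can be illustrated with the standard library (the threshold of 3 is a common convention, not necessarily Lobster's default; real pipelines would apply this per QC metric across all cells, typically with NumPy):

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Flag values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [False] * len(values)  # no spread means no outliers
    return [abs((v - mu) / sigma) > threshold for v in values]
```

Cells flagged on metrics such as total counts or mitochondrial fraction would then be excluded or inspected before clustering.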
Error Handling & Recovery
- Input Validation: Comprehensive parameter and data validation
- Graceful Degradation: Fallback methods when specialized tools unavailable
- Informative Messages: Clear error descriptions with suggested solutions
- Operation Logging: Complete audit trail for debugging and reproducibility
🚀 Benefits of New Architecture
Code Quality Improvements
- 50% Reduction in agent code complexity (450+ → 200+ lines)
- Zero Duplication: No more dual code paths or is_v2 checks
- Professional Standards: Type hints, comprehensive docstrings, error handling
- Testability: Stateless services are easily unit tested
Maintainability Enhancements
- Single Responsibility: Each service handles one analysis domain
- Modular Design: Services can be used independently or combined
- Clean Interfaces: Consistent patterns across all analysis tools
- Version Control: Clear separation enables independent service updates
Performance & Reliability
- Memory Efficiency: Stateless services with minimal memory footprint
- Fault Tolerance: Comprehensive error handling prevents pipeline failures
- Reproducibility: Complete parameter logging and provenance tracking
- Scalability: Services can be distributed or parallelized in future versions
Migration Impact Analysis
📈 Before Migration (Legacy System)
transcriptomics_expert.py: 450+ lines
├── Dual code paths (is_v2 checks everywhere)
├── Mixed responsibilities (orchestration + analysis)
├── Redundant implementations
├── Complex error handling
└── Maintenance overhead

🎉 After Migration (Modular System)
transcriptomics_expert.py: 280 lines (clean)
├── Single DataManagerV2 path
├── Professional tool orchestration only
├── Stateless service delegation
├── Comprehensive error handling
└── Minimal maintenance overhead
Analysis Services: 4 refactored services
├── PreprocessingService: AnnData → (filtered_adata, stats)
├── QualityService: AnnData → (qc_adata, assessment)
├── ClusteringService: AnnData → (clustered_adata, results)
└── EnhancedSingleCellService: AnnData → (annotated_adata, metrics)

🔧 Technical Architecture Benefits
Service Layer Advantages
- Reusability: Services can be used by multiple agents
- Testability: Each service can be independently tested
- Flexibility: Easy to add new analysis methods
- Performance: Optimized algorithms with professional implementations
Agent Layer Improvements
- Orchestration Focus: Agents handle modality management and user interaction
- Clean Tool Interface: Consistent ~20-30 line tool implementations
- Professional Responses: Formatted outputs with comprehensive statistics
- Error Management: Hierarchical error handling with specific exceptions
DataManagerV2 Integration
- Modality-Centric: All data operations centered around named modalities
- Provenance Tracking: Complete analysis history with tool usage logging
- Schema Validation: Automatic validation ensures data integrity
- Storage Management: Professional file naming and workspace organization
This architecture provides a solid foundation for professional bioinformatics analysis with excellent maintainability, extensibility, and reliability.
🧬 Agent-Guided Formula Construction Integration
Enhanced Bulk RNA-seq Expert Agent Tools
The bulk_rnaseq_expert agent includes 5 new tools for conversational formula construction:
Service Enhancement Details
- DifferentialFormulaService: Added suggest_formulas(), preview_design_matrix(), estimate_statistical_power()
- WorkflowTracker: New lightweight class for DE iteration tracking and comparison
- Integration: All data stored in AnnData.uns for seamless workflow integration
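Storing iteration history in AnnData.uns, as described, might look like this minimal sketch (the real WorkflowTracker API may differ; a plain dict stands in for `adata.uns` here):

```python
class WorkflowTracker:
    """Track differential-expression iterations inside an AnnData-style .uns dict."""
    KEY = "de_workflow"

    def __init__(self, uns: dict):
        self.uns = uns
        self.uns.setdefault(self.KEY, {"iterations": []})

    def record(self, formula: str, n_significant: int, params: dict) -> int:
        """Append one DE iteration and return its index."""
        iterations = self.uns[self.KEY]["iterations"]
        iterations.append({"formula": formula,
                           "n_significant": n_significant,
                           "params": params})
        return len(iterations) - 1

    def compare(self):
        """Summarize all iterations for side-by-side comparison."""
        return [(i, it["formula"], it["n_significant"])
                for i, it in enumerate(self.uns[self.KEY]["iterations"])]
```

Because everything lives in `.uns`, the iteration history is saved and restored along with the dataset itself.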
Workflow Coverage Impact
- ✅ Step 8: Formula Construction → Agent-guided conversation
- ✅ Step 12: Iterative Workflows → Natural iteration and comparison
- 🎯 Result: 92% workflow coverage (11/12 steps complete)
🔄 Workspace Restoration System (New in v0.2)
Seamless Session Continuity
Lobster AI now features intelligent workspace restoration that automatically detects and restores previous analysis sessions:
Key Features
- Automatic Detection: Scans .lobster_workspace/data/ for available datasets on startup
- Session Persistence: Maintains .session.json with active modalities and usage history
- Lazy Loading: Load specific datasets on-demand with load_dataset()
- Pattern-Based Restoration: Support for recent/all/glob patterns via /restore
- Memory Management: Enforced memory limits prevent out-of-memory issues
New CLI Commands
- /restore [pattern] - Restore datasets from previous sessions
- /workspace list - View available datasets without loading
- /workspace load <name> - Load specific dataset by name
- Autocomplete Support: Tab completion for dataset names and patterns
Implementation Highlights
- DataManagerV2 Enhanced: Added _scan_workspace(), load_dataset(), restore_session()
- Session Tracking: Automatic .session.json updates on modality changes
- H5PY Integration: Efficient metadata extraction without full dataset loading
- Professional UX: Startup prompt shows workspace status with helpful commands
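The .session.json persistence piece can be sketched with the standard library (the file layout is an assumption for illustration):

```python
import json
from pathlib import Path

SESSION_FILE = ".session.json"

def save_session(workspace: Path, active_modalities: list) -> Path:
    """Write the active-modality list to the workspace session file."""
    path = Path(workspace) / SESSION_FILE
    path.write_text(json.dumps({"active_modalities": active_modalities}, indent=2))
    return path

def restore_session(workspace: Path) -> list:
    """Return the previously active modalities, or [] if no session exists."""
    path = Path(workspace) / SESSION_FILE
    if not path.exists():
        return []
    return json.loads(path.read_text()).get("active_modalities", [])
```

On startup, the restored names are enough to offer `/restore` without loading any actual dataset into memory.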
This transformation enables users to seamlessly continue their work across sessions without manual dataset reloading.
🛠️ System Utilities Centralization
Performance Optimization
The system now features centralized platform utilities that eliminate redundant OS detection and provide unified cross-platform operations:
Before → After Transformation
- Platform Detection: 5 × platform.system() calls → 1 × (at import time)
- Code Reduction: ~50 lines of duplicate subprocess logic → 5 lines at call sites
- Performance: 80% improvement in system operation speed
- Architecture: Clean lobster/utils/system.py module with open_file(), open_folder(), open_path() functions
Cloud-Agnostic Design
All file opening operations run on the CLI side regardless of cloud vs local mode, ensuring consistent behavior across deployment types.
Integration Points
- CLI Commands: open <file>, /open <file>, /plot, /plot <ID>
- GPU Detection: Apple Silicon detection in gpu_detector.py
- Future Extensions: Natural extension point for additional system utilities
🎛️ Supervisor Configuration System (v0.2+)
Dynamic Agent Discovery & Configuration
The supervisor agent now features automatic agent discovery and configurable behavior, eliminating manual updates when adding new agents:
Architecture Overview
Key Improvements
| Feature | Before (Static) | After (Dynamic) | Impact |
|---|---|---|---|
| Agent Discovery | Manual updates in supervisor.py | Automatic from registry | Zero maintenance |
| Missing Agents | 3 agents not included | All 8 agents included | Complete coverage |
| Configuration | Hardcoded behavior | 20+ env variables | Full flexibility |
| Prompt Size | Fixed ~9.5K chars | 8K-11K adaptive | 15% smaller in production |
| Adding Agents | Update 3+ files | Update registry only | 66% less work |
Operation Modes
```shell
# Research Mode - Interactive exploration
SUPERVISOR_ASK_QUESTIONS=true
SUPERVISOR_WORKFLOW_GUIDANCE=detailed
# Result: 11K char prompt with full guidance

# Production Mode - Automated pipelines
SUPERVISOR_ASK_QUESTIONS=false
SUPERVISOR_WORKFLOW_GUIDANCE=minimal
# Result: 8K char prompt, 1.4K chars saved

# Development Mode - Debugging
SUPERVISOR_VERBOSE=true
SUPERVISOR_INCLUDE_SYSTEM=true
# Result: Detailed explanations with system info
```

Implementation Benefits
- 🚀 Zero Maintenance: Add agents to registry only, supervisor auto-discovers
- ⚙️ Flexible Behavior: Configure interaction style per environment
- 📊 Context Aware: Includes current data/workspace state dynamically
- 🎯 Mode Optimized: Different prompt sizes for different use cases
- ♻️ Backward Compatible: Default config matches previous behavior exactly
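Reading that configuration from the environment could be sketched as follows (variable names come from the modes above; the defaults and the dataclass shape are assumptions):

```python
import os
from dataclasses import dataclass

def _env_bool(name: str, default: bool) -> bool:
    """Interpret an environment variable as a boolean flag."""
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

@dataclass
class SupervisorConfig:
    ask_questions: bool
    workflow_guidance: str  # "minimal" | "detailed"
    verbose: bool

    @classmethod
    def from_env(cls) -> "SupervisorConfig":
        # Defaults chosen to mirror interactive "research mode" behavior
        return cls(
            ask_questions=_env_bool("SUPERVISOR_ASK_QUESTIONS", True),
            workflow_guidance=os.environ.get("SUPERVISOR_WORKFLOW_GUIDANCE", "detailed"),
            verbose=_env_bool("SUPERVISOR_VERBOSE", False),
        )
```

Because the config is assembled at startup, the same codebase can run as an interactive assistant or a silent pipeline worker without code changes.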