# Machine Learning



import { AgentHero } from '@/components/AgentHero';

<Callout type="warn">
  **In Development** — This package is not yet published to PyPI. APIs, tool signatures, and agent behavior will change before release.
</Callout>

<AgentHero
  name="lobster-ml"
  tier="free"
  problem="ML data preparation: feature selection, survival analysis, cross-validation, and model interpretability for omics data"
  inputs={["AnnData", "Expression Matrices", "Survival Data", "Multi-Omics"]}
  outputs={["Selected Features", "Cox Models", "SHAP Values", "MOFA Factors", "Enriched Pathways"]}
  install="pip install lobster-ml"
  difficulty="advanced"
  agents={[{
  name: "machine_learning_expert",
  role: "ML preparation and sub-agent routing",
  children: [
    { name: "feature_selection_expert", role: "Biomarker discovery and feature ranking" },
    { name: "survival_analysis_expert", role: "Cox models and Kaplan-Meier analysis" }
  ]
}]}
/>

Agents [#agents]

machine_learning_expert [#machine_learning_expert]

The main orchestrator for machine learning workflows, coordinating between specialized sub-agents.

**Capabilities:**

* ML data preparation and feature engineering
* Data splitting and framework export (PyTorch, TensorFlow)
* Delegation to feature selection and survival analysis sub-agents
* Multi-omics integration via MOFA
* Pathway enrichment analysis via INDRA

feature_selection_expert [#feature_selection_expert]

Specialized agent for biomarker discovery and feature selection in high-dimensional omics data.

**Capabilities:**

* Stability selection (Meinshausen & Bühlmann selection probabilities)
* LASSO and Elastic Net regularization
* Variance-based filtering with chunked computation for large matrices
* Importance ranking and automatic feature detection
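To make the idea of a selection probability concrete, here is a minimal stdlib-only sketch of stability selection. It is not the package's implementation: the base selector (top-k by absolute difference of class means) is a deliberately simple stand-in for LASSO or tree importances, and the function names, the half-subsampling scheme, and the toy data are all assumptions made for illustration.

```python
import random
from statistics import mean


def abs_mean_difference(rows, labels, j):
    """Toy base selector: score feature j by |mean(class 1) - mean(class 0)|."""
    pos = [r[j] for r, y in zip(rows, labels) if y == 1]
    neg = [r[j] for r, y in zip(rows, labels) if y == 0]
    return abs(mean(pos) - mean(neg))


def stability_selection(rows, labels, n_rounds=50, top_k=2, seed=0):
    """Return, per feature, the fraction of subsamples in which it was selected."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    counts = [0] * n_features
    idx_pos = [i for i, y in enumerate(labels) if y == 1]
    idx_neg = [i for i, y in enumerate(labels) if y == 0]
    for _ in range(n_rounds):
        # Draw half of each class without replacement so both classes are present.
        sub = (rng.sample(idx_pos, max(1, len(idx_pos) // 2))
               + rng.sample(idx_neg, max(1, len(idx_neg) // 2)))
        sub_rows = [rows[i] for i in sub]
        sub_labels = [labels[i] for i in sub]
        scores = [abs_mean_difference(sub_rows, sub_labels, j)
                  for j in range(n_features)]
        ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
        for j in ranked[:top_k]:
            counts[j] += 1
    return [c / n_rounds for c in counts]


# Toy data: features 0 and 1 carry signal (shifted by 5 in class 1), 2-4 are noise.
rng = random.Random(42)
labels = [0] * 20 + [1] * 20
rows = [[rng.gauss(5 * y if j < 2 else 0, 1) for j in range(5)] for y in labels]
probs = stability_selection(rows, labels)
print([round(p, 2) for p in probs])
```

Features that are selected in nearly every subsample are the "stable" ones; a threshold on this probability (for example 0.6 to 0.9) is a common way to decide which features to keep.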

survival_analysis_expert [#survival_analysis_expert]

Specialized agent for time-to-event analysis and risk stratification.

**Capabilities:**

* Cox proportional hazards models (unregularized and regularized)
* Kaplan-Meier survival curves with median survival and restricted mean survival time (RMST)
* C-index reporting with three-tier validation (test set preferred, then CV, then training)
* Threshold optimization with censoring-aware handling
* Risk stratification and hazard ratio computation
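For readers new to Kaplan-Meier curves, the estimator and the median survival time can be sketched in a few lines of plain Python. This is an illustrative sketch with hypothetical function names, not the code the agent runs. The `survival` extra lists scikit-survival, which ships its own `kaplan_meier_estimator`, but this page does not say which calls the agent uses.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    events[i] is 1 if the event was observed and 0 if the subject was censored.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        leaving = sum(1 for d in durations if d == t)  # events plus censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def median_survival(curve):
    """First time at which estimated survival is 0.5 or lower, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Six subjects; the ones with event == 0 are censored.
durations = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(durations, events)
print(curve)
print(median_survival(curve))  # 5
```

Censored subjects never lower the curve themselves, but they do leave the risk set, which is why the drop at each later event time gets larger.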

Example Workflows [#example-workflows]

ML Feature Preparation [#ml-feature-preparation]

```text
User: Prepare my scRNA-seq data for machine learning classification

[machine_learning_expert]
- Loads AnnData expression matrix
- Scales features and handles missing values
- Applies SMOTE for class imbalance (marks synthetic samples)
- Splits into train/test sets
- Exports to PyTorch-compatible format
```
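The SMOTE step in the workflow above, including the marking of synthetic samples, can be illustrated with a simplified stdlib sketch. It assumes a numeric feature table held as lists; the function name and the `is_synthetic` flag are hypothetical, and the `imbalanced` extra's imbalanced-learn library provides the production-grade version.

```python
import math
import random


def smote_minority(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority samples by interpolating toward near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
majority_count = 10  # balance the minority class up to this size
new_rows = smote_minority(minority, majority_count - len(minority))

# Keep a flag so synthetic rows can be excluded from evaluation later.
samples = (
    [{"x": x, "is_synthetic": False} for x in minority]
    + [{"x": x, "is_synthetic": True} for x in new_rows]
)
print(sum(s["is_synthetic"] for s in samples), "synthetic of", len(samples))
```

Keeping the flag matters because synthetic rows should normally stay out of test folds and out of any downstream biological interpretation.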

Biomarker Discovery (Feature Selection) [#biomarker-discovery-feature-selection]

```text
User: Find the most stable biomarkers that distinguish alpha
      cells from beta cells in my pancreas scRNA-seq data

[machine_learning_expert delegates to feature_selection_expert]
- Runs stability selection (50 bootstrap rounds)
- Uses Random Forest or XGBoost importance scoring
- Applies variance filter to remove low-information features
- Reports top stable features with selection probabilities
- Expects biologically meaningful genes (e.g., INS, GCG, SST)
```

Survival Analysis [#survival-analysis]

```text
User: Run survival analysis using the selected biomarkers
      with my clinical outcome data

[machine_learning_expert delegates to survival_analysis_expert]
- Fits Cox PH model on selected features
- Validates with C-index (test set preferred)
- Generates Kaplan-Meier curves for risk groups
- Reports hazard ratios and confidence intervals
- Saves model to workspace/models/
```
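The C-index used for validation in this workflow is the fraction of comparable patient pairs that the model ranks in the right order. Below is a minimal sketch of Harrell's C in plain Python, plus a helper that picks the most trustworthy available estimate. The helper name and the `test` / `cv` / `training` keys are assumptions for illustration; scikit-survival's `concordance_index_censored` is the library-grade equivalent.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ordered correctly by risk score."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i had an observed event before j's time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in risk count as half
    return concordant / comparable


def preferred_c_index(metrics):
    """Pick the C-index from the best available source: test, then CV, then training."""
    for source in ("test", "cv", "training"):
        if metrics.get(source) is not None:
            return source, metrics[source]
    raise ValueError("no C-index available")


times = [2, 4, 6, 8]
events = [1, 1, 0, 1]          # third subject is censored
risk = [0.9, 0.4, 0.5, 0.1]    # higher score = higher predicted risk
c = concordance_index(times, events, risk)
print(c)  # 0.8: four of five comparable pairs are ordered correctly
print(preferred_c_index({"training": 0.91, "cv": 0.74}))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking; a training-set C-index is optimistic, which is why it is the last resort.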

Multi-Omics Integration [#multi-omics-integration]

```text
User: Integrate my transcriptomics and proteomics data
      and run feature selection on the combined space

[machine_learning_expert]
- Validates sample overlap between modalities
- Runs MOFA-based integration (factors in adata.obsm['X_mofa'])
- Delegates to feature_selection_expert with feature_space_key="X_mofa"
- Reports top factors and pathway enrichment via INDRA
```
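The first step of this workflow, validating sample overlap, amounts to intersecting sample IDs across modalities before any integration is attempted. A minimal sketch follows; the function name and the `min_overlap` threshold are hypothetical and not documented parameters of the package.

```python
def shared_samples(modalities, min_overlap=0.5):
    """Return sample IDs present in every modality, or raise if overlap is too low.

    modalities maps a modality name to its list of sample IDs.
    """
    id_sets = [set(ids) for ids in modalities.values()]
    common = set.intersection(*id_sets)
    smallest = min(len(s) for s in id_sets)
    if smallest == 0 or len(common) / smallest < min_overlap:
        raise ValueError(
            f"only {len(common)} shared samples across {list(modalities)}"
        )
    return sorted(common)


modalities = {
    "transcriptomics": ["S1", "S2", "S3", "S4", "S5"],
    "proteomics": ["S2", "S3", "S4", "S5", "S6"],
}
common_ids = shared_samples(modalities)
print(common_ids)  # ['S2', 'S3', 'S4', 'S5']
```

Both matrices would then be subset and reordered to `common_ids` so that row i refers to the same sample in every modality.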

Dependencies [#dependencies]

lobster-ml requires `lobster-ai` as its core dependency. Domain-specific libraries are organized as optional extras:

| Extra                | Libraries         | Purpose                         |
| -------------------- | ----------------- | ------------------------------- |
| **ml**               | torch, scvi-tools | Deep learning, scVI embeddings  |
| **survival**         | scikit-survival   | Cox models, Kaplan-Meier        |
| **interpretability** | shap, interpret   | SHAP values, model explanations |
| **imbalanced**       | imbalanced-learn  | SMOTE, class balancing          |
| **tuning**           | hyperopt          | Hyperparameter optimization     |
| **full**             | All of the above  | Complete ML stack               |

Install with extras:

```bash
pip install "lobster-ml[survival]"        # Just survival analysis
pip install "lobster-ml[full]"            # Everything
```
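After installing, it can be handy to check which optional libraries are actually importable. The sketch below uses only the standard library. The mapping from extras to import names (`scvi` for scvi-tools, `sksurv` for scikit-survival, `imblearn` for imbalanced-learn) reflects those libraries' usual import names, but the helper itself is hypothetical and not part of lobster-ml.

```python
from importlib.util import find_spec

# Import names of the libraries listed for each extra in the table above.
EXTRA_MODULES = {
    "ml": ["torch", "scvi"],
    "survival": ["sksurv"],
    "interpretability": ["shap", "interpret"],
    "imbalanced": ["imblearn"],
    "tuning": ["hyperopt"],
}


def available_extras():
    """Map each extra to True when all of its libraries can be located."""
    return {
        extra: all(find_spec(module) is not None for module in modules)
        for extra, modules in EXTRA_MODULES.items()
    }


status = available_extras()
for extra, ok in status.items():
    print(f"{extra:<18}{'installed' if ok else 'missing'}")
```

`find_spec` only locates a module without importing it, so the check stays fast even when heavy libraries such as torch are present.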

Services [#services]

lobster-ml includes specialized ML services:

| Service                            | Purpose                                             |
| ---------------------------------- | --------------------------------------------------- |
| **FeatureSelectionService**        | Stability selection, LASSO, variance filtering      |
| **SurvivalAnalysisService**        | Cox PH models, Kaplan-Meier, threshold optimization |
| **CrossValidationService**         | Stratified k-fold, nested CV, time series CV        |
| **InterpretabilityService**        | SHAP values, per-class explanations                 |
| **MLPreprocessingService**         | SMOTE balancing, scaling, missing value handling    |
| **MultiOmicsIntegrationService**   | MOFA-based multi-omics factor analysis              |
| **PathwayEnrichmentBridgeService** | GO/Reactome enrichment via INDRA Discovery API      |

Services follow the standard 3-tuple return pattern and are accessed internally by the agents.
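As an illustration of the stratified k-fold strategy listed for CrossValidationService, here is a small stdlib sketch that keeps class proportions roughly equal across folds. It is a conceptual stand-in with a hypothetical function name, not the service's API.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs with similar class balance in each fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for pos, i in enumerate(indices):
            folds[pos % n_splits].append(i)
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for f, fold in enumerate(folds) if f != k for i in fold)
        yield train, test


labels = ["alpha"] * 10 + ["beta"] * 5
splits = list(stratified_kfold(labels, n_splits=5))
for train, test in splits:
    print(len(train), [labels[i] for i in test])
```

With 10 alpha and 5 beta cells, every test fold holds two alpha cells and one beta cell, so the minority class is never missing from an evaluation fold.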

Configuration [#configuration]

Enable ML agents in your workspace config:

```toml
# .lobster_workspace/config.toml
enabled = ["machine_learning_expert", "feature_selection_expert", "survival_analysis_expert"]
```

Or use a preset:

```toml
preset = "ml-full"
```

Sub-Agent Architecture [#sub-agent-architecture]

```text
machine_learning_expert (supervisor, accessible from main supervisor)
|-- feature_selection_expert (sub-agent, not directly accessible)
|-- survival_analysis_expert (sub-agent, not directly accessible)
```

Sub-agents are accessed through `machine_learning_expert` delegation, not directly from the supervisor. This architecture allows the orchestrator to:

1. **Route tasks** - Determine which sub-agent is needed based on the analysis type
2. **Manage state** - Track ML pipeline progress across preprocessing, selection, and modeling
3. **Synthesize results** - Combine outputs from feature selection and survival analysis into unified reports
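The routing and access rules above can be pictured as a small lookup structure. The sketch below is purely illustrative: keyword matching stands in for whatever decision logic the orchestrator really uses, and the keyword lists and function names are invented for this example.

```python
# Only the orchestrator is visible to the main supervisor; its children are not.
HIERARCHY = {
    "machine_learning_expert": [
        "feature_selection_expert",
        "survival_analysis_expert",
    ],
}

# Hypothetical keyword lists, for illustration only.
ROUTES = {
    "feature_selection_expert": ("biomarker", "feature selection", "lasso", "stability"),
    "survival_analysis_expert": ("survival", "cox", "kaplan-meier", "hazard"),
}


def route(task):
    """Return the agent that should handle a task; default to the orchestrator."""
    text = task.lower()
    for agent, keywords in ROUTES.items():
        if any(keyword in text for keyword in keywords):
            return agent
    return "machine_learning_expert"


def directly_accessible(agent):
    """True only for agents the main supervisor can call without delegation."""
    return agent in HIERARCHY


print(route("Fit a Cox model on my cohort"))
print(route("Find stable biomarkers"))
print(route("Split my data into train and test sets"))
print(directly_accessible("survival_analysis_expert"))  # False
```

Tasks that match no sub-agent stay with the orchestrator, which mirrors the preparation and export work it handles itself.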

import { NextSteps } from '@/components/NextSteps';
import { Rocket, GraduationCap, Settings } from 'lucide-react';

<NextSteps
  items={[
{
  href: "/docs/getting-started",
  title: "Getting Started",
  description: "Quick setup guide to start analyzing bioinformatics data",
  icon: <Rocket />
},
{
  href: "/docs/agents/transcriptomics",
  title: "Transcriptomics Agent",
  description: "Single-cell and bulk RNA-seq analysis to prepare data for ML workflows",
  icon: <GraduationCap />
},
{
  href: "/docs/getting-started/configuration",
  title: "Configuration",
  description: "Configure agent settings, model profiles, and workspace preferences",
  icon: <Settings />
}
]}
/>
