Machine Learning: From Data Preparation to Survival Prediction

ML pipelines across three complexity levels — data preparation, biomarker discovery with bootstrap feature importance, and Cox proportional hazards survival analysis.

Machine learning on multi-omics data requires rigorous feature engineering, domain-specific statistical methods, and careful validation before clinical deployment. This case study demonstrates Lobster AI's machine learning agents across three complexity levels: preparing hepatocellular carcinoma data for ML readiness assessment, discovering breast cancer biomarkers with bootstrap feature importance and pathway enrichment, and building a complete survival prediction pipeline with Cox proportional hazards modeling and risk stratification. The progression from simple data preparation through biomarker discovery to survival analysis mirrors the real-world ML workflow for translational cancer research.

Session context: Results generated February 2026 using lobster-ai 1.0.12 on AWS Bedrock (Claude Sonnet 4.5). External databases queried: GEO (NCBI), INDRA (pathway enrichment). Local tools: scikit-learn, XGBoost, lifelines, scikit-survival (optional). Total cost: $17.77 across 3 case studies (8 turns). The survival analysis in the Hard case uses annotations derived from PAM50 molecular subtype and NHG grade — not patient-level survival endpoints. This case study demonstrates analytical workflows, not independently validated clinical biomarkers.

Agents and Data Sources

This analysis uses the lobster-ml package, which provides a parent-child agent hierarchy for machine learning workflows:

Agent	Role
`machine_learning_expert`	Parent agent — ML readiness assessment, feature preparation, workflow orchestration
`feature_selection_expert`	Child agent — Bootstrap feature importance (XGBoost/Random Forest), LASSO, variance-based methods
`survival_analysis_expert`	Child agent — Cox proportional hazards, risk stratification, Kaplan-Meier analysis

External data sources queried: GEO (Gene Expression Omnibus) via research_agent, and the INDRA Discovery API for pathway enrichment. Local computation is handled by scikit-learn (StandardScaler, train_test_split, LassoCV), scanpy (highly variable gene selection), and scikit-survival (Cox regression, log-rank test).

Simple: Hepatocellular Carcinoma Data Preparation

The first scenario demonstrates ML readiness assessment and feature preparation for a 30-sample hepatocellular carcinoma (HCC) bulk RNA-seq dataset. HCC is the most common primary liver cancer and a major cause of cancer mortality worldwide. Tumor-vs-normal classification from bulk RNA-seq is a canonical ML task that validates whether expression signatures can distinguish malignant from healthy tissue.

Turn 1: Dataset Discovery and Download

The research agent searched GEO for small bulk RNA-seq cancer datasets and identified two candidates: GSE304512 (HCC, 38 samples) and GSE266302 (pseudomyxoma peritonei, 20 samples). Both datasets have clearly labeled tumor and normal groups and were queued for download.

lobster query -p bedrock --session-id ml_simple \
  "Search GEO for a small bulk RNA-seq dataset comparing cancer vs normal tissue. \
   I need something with fewer than 100 samples, clearly labeled groups (e.g. tumor \
   vs normal), and expression counts. Find a suitable dataset and download it."

Dataset	Accession	Disease	Samples	Groups
HCC bulk RNA-seq	GSE304512	Hepatocellular carcinoma	38	19 tumor, 19 normal
PMP bulk RNA-seq	GSE266302	Pseudomyxoma peritonei	20	Cancer vs normal

Turn 2: Download Recovery and ML Feature Preparation

The second turn demonstrated Lobster's resilience when encountering unavailable datasets. After two failed download attempts (GSE304512 and GSE50760 format incompatibility), the system autonomously searched for alternatives and successfully downloaded GSE135631 — a 30-sample HCC dataset with balanced tumor and normal groups.

lobster query -p bedrock --session-id ml_simple \
  "Download GSE304512. Once loaded, use the machine learning expert to check \
   ML readiness of the dataset, prepare features using highly variable gene \
   selection with 2000 features and z-score scaling, then create stratified \
   train/test splits (80/20) using the tissue type labels."

Dataset Summary (GSE135631):

Metric	Value
Accession	GSE135631
Disease	Hepatocellular carcinoma (HCC)
Samples	30 total (15 tumor, 15 normal)
Genes	55,789
Normalization	TPM
Sparsity	32.4%

Download Recovery Log:

Attempt	Accession	Outcome	Recovery Action
1	GSE304512	Failed	Autonomous search for alternatives
2	GSE50760	Failed	Second autonomous search
3	GSE135631	Success	Proceeded to ML preparation

The machine_learning_expert assessed ML readiness and flagged the 32.4% sparsity and small sample size (30 samples — risk of overfitting). Highly variable gene selection with 2,000 features failed due to infinity values in variance calculation on TPM-normalized data (a known edge case when genes have near-zero mean expression). The agent gracefully fell back to using the full 55,789-gene feature set with z-score scaling.

When HVG selection fails on TPM data, Lobster automatically falls back to z-score scaling across all features. This ensures the pipeline continues even when dimensionality reduction encounters numerical instability. The fallback is scientifically reasonable, though downstream models will train on more features than originally requested.
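The fallback logic can be illustrated with a minimal sketch. It assumes one plausible way non-finite values arise (a dispersion computed as variance divided by mean, which breaks for unexpressed genes); it is not scanpy's or Lobster's actual implementation, and `prepare_features` is a hypothetical helper name.

```python
import numpy as np

def prepare_features(X, n_top=2000):
    """Return (scaled matrix, method label) for a samples x genes array."""
    mean = X.mean(axis=0)
    var = X.var(axis=0, ddof=1)
    # Dispersion divides by the mean, so zero-mean genes yield inf or nan.
    with np.errstate(divide="ignore", invalid="ignore"):
        dispersion = var / mean
    if np.all(np.isfinite(dispersion)):
        keep = np.argsort(dispersion)[::-1][:n_top]
        X, method = X[:, keep], "hvg"
    else:
        method = "all_features"  # fallback: keep every gene
    # Z-score each gene; constant genes get std 1 to avoid dividing by zero.
    std = X.std(axis=0)
    std[std == 0] = 1.0
    return (X - X.mean(axis=0)) / std, method

# Synthetic TPM-like matrix (assumption: not the GSE135631 data).
rng = np.random.default_rng(0)
tpm = rng.gamma(shape=2.0, scale=50.0, size=(30, 200))
tpm[:, :5] = 0.0  # five unexpressed genes trigger the fallback
scaled, method = prepare_features(tpm, n_top=50)
print(method, scaled.shape)
```

In this sketch the five all-zero genes force the fallback path, so every gene is kept and scaled. Dropping those genes before calling the function lets the selection branch run instead.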

ML Preparation Pipeline:

Step	Method	Input	Output	Status
ML readiness check	`check_ml_ready_modalities`	GSE135631 (30 x 55,789)	Ready (small-sample warning)	Pass
HVG selection (2000 features)	scanpy `highly_variable_genes`	55,789 genes (TPM)	FAILED (infinity in variance)	Failed
Feature scaling (fallback)	Z-score (StandardScaler)	55,789 genes	Scaled, 6.4% sparsity	Pass
Stratified split	80/20, seed=42	30 samples	24 train / 6 test	Pass

Train/Test Split:

Set	Samples	Tumor	Normal	Proportion
Training	24	12	12	80%
Test	6	3	3	20%
Total	30	15	15	100%

The final stratified 80/20 split produced properly balanced training (24 samples) and test (6 samples) sets ready for classifier training. This demonstrates Lobster's ability to recover autonomously from data acquisition failures and prepare ML-ready datasets even when feature selection encounters edge cases.
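A stratified split of this shape can be sketched with the standard library alone. The helper name `stratified_split` is hypothetical, and the sketch will not reproduce the exact sample assignment of scikit-learn's `train_test_split`; it only shows how shuffling within each class preserves the 15/15 balance.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # shuffle within the class only
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)

# 15 tumor and 15 normal samples, matching the group sizes reported above.
labels = ["tumor"] * 15 + ["normal"] * 15
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With only three test samples per class, any accuracy estimate from this split moves in large steps, which is one reason the small-sample warning matters.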

Medium: Breast Cancer Biomarker Discovery

The second scenario demonstrates the full ML biomarker discovery pipeline: dataset acquisition, dual-method feature selection (bootstrap feature importance + LASSO), and biological validation via pathway enrichment. Breast cancer is the most commonly diagnosed cancer worldwide, and estrogen receptor (ER) status is the most important molecular classifier for treatment decisions.

Turn 1: Dataset Discovery

The research agent searched GEO for breast cancer bulk RNA-seq datasets with at least two clear clinical groups and identified GSE240671 (122 samples, neoadjuvant chemotherapy responders vs non-responders).

lobster query -p bedrock --session-id ml_medium \
  "Search GEO for a bulk RNA-seq dataset studying breast cancer with at least \
   two clear groups (e.g. treatment responders vs non-responders, or ER-positive \
   vs ER-negative, or luminal vs basal). I need 30-80 samples with gene expression \
   counts. Find and download the best candidate."

Turn 2: Recovery and ML Pipeline

The initial download of GSE240671 failed due to unparseable file formats (likely raw FASTQ only). Lobster recovered autonomously by searching for an alternative dataset and found the SCAN-B cohort (GSE81538) — a 405-sample Swedish breast cancer gene expression dataset published in Nature Medicine.

lobster query -p bedrock --session-id ml_medium \
  "Option 1 - find an alternative breast cancer RNA-seq dataset. Specifically \
   look for one with processed count matrices available (not raw FASTQ). Once \
   found and downloaded, proceed with the ML pipeline: check readiness, prepare \
   2000 features with variance selection, then run bootstrap feature selection \
   with XGBoost (100 features, 10 bootstrap rounds) using the group \
   labels for classification."

Dataset Summary (GSE81538):

Metric	Value
GEO Accession	GSE81538
Cohort	SCAN-B (Sweden Cancerome Analysis Network — Breast)
Samples	405
Genes	18,802
Missing values	0%
Target variable	ER status
Class balance	~70% ER-positive / ~30% ER-negative

The ML pipeline proceeded end-to-end: readiness assessment confirmed good data quality (0% missing values, ER status labels with a moderate ~70/30 class imbalance), variance selection reduced 18,802 genes to the top 2,000, and bootstrap feature importance with 10 bootstrap rounds identified 100 candidate biomarkers.

ML Feature Preparation:

Step	Input	Output	Method
Raw data	405 x 18,802	—	Gene expression counts
Variance selection	18,802 genes	2,000 genes	Top 2,000 by variance
Scaling	2,000 genes	2,000 genes	Z-score standardization
Imputation	2,000 genes	2,000 genes	Zero imputation
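The three preparation steps in the table can be sketched in NumPy as follows. This is an illustration on synthetic data, not Lobster's implementation; `variance_select_and_scale` is a hypothetical name, and the matrix is smaller than the real 405 x 18,802 input.

```python
import numpy as np

def variance_select_and_scale(X, n_top=2000):
    """Keep the n_top highest-variance columns, z-score them, zero-impute."""
    variances = np.nanvar(X, axis=0)
    top = np.sort(np.argsort(variances)[::-1][:n_top])  # keep original gene order
    selected = X[:, top]
    mean = np.nanmean(selected, axis=0)
    std = np.nanstd(selected, axis=0)
    std[std == 0] = 1.0  # guard against constant genes
    scaled = (selected - mean) / std
    # After z-scoring, zero imputation is equivalent to mean imputation.
    return np.nan_to_num(scaled, nan=0.0), top

rng = np.random.default_rng(0)
gene_scales = rng.uniform(0.1, 5.0, size=500)
expr = rng.normal(size=(405, 500)) * gene_scales  # synthetic expression matrix
features, kept = variance_select_and_scale(expr, n_top=100)
print(features.shape)
```

Because the reported dataset has 0% missing values, the imputation step is a no-op there; it is shown only to mirror the table.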

Top Bootstrap-Selected Biomarkers:

Rank	Gene	Importance	Selection Frequency	Biological Relevance
1	AGR3	0.0142	100%	ER-regulated, anterior gradient family
2	GRPR	0.0104	100%	Gastrin-releasing peptide receptor
3	AGR2	0.0094	100%	ER-regulated, validated ER+ marker
4	PGR	0.0089	100%	FDA-approved clinical biomarker
5	PPP1R14C	0.0080	100%	Phosphatase regulatory subunit
6	AFF3	0.0072	100%	Estrogen-responsive transcription factor
7	LPIN1	0.0061	100%	Lipid metabolism, estrogen-regulated

PGR (progesterone receptor) is an FDA-approved clinical biomarker used routinely in breast cancer treatment decisions. Its appearance as the 4th-ranked bootstrap-selected feature, selected in all 10 bootstrap rounds, is consistent with known ER biology and serves as a sanity check on Lobster's ML pipeline. AGR2 and AGR3 are well-validated estrogen receptor markers in the literature.
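The "Importance" and "Selection Frequency" columns come from resampling the cohort and re-ranking features each round. The sketch below shows that bookkeeping on synthetic data. As a loudly labeled assumption, it swaps the XGBoost importance for a simple standardized mean difference so that it runs with NumPy alone; `bootstrap_selection` is a hypothetical name.

```python
import numpy as np

def bootstrap_selection(X, y, n_rounds=10, top_k=100, seed=0):
    """Average an importance score over bootstrap rounds and count top-k hits."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    importance_sum = np.zeros(n_features)
    hits = np.zeros(n_features)
    for _ in range(n_rounds):
        idx = rng.integers(0, n_samples, size=n_samples)  # resample with replacement
        Xb, yb = X[idx], y[idx]
        # Stand-in importance (assumption): absolute class-mean difference over std.
        diff = np.abs(Xb[yb == 1].mean(axis=0) - Xb[yb == 0].mean(axis=0))
        score = diff / (Xb.std(axis=0) + 1e-12)
        importance_sum += score
        hits[np.argsort(score)[::-1][:top_k]] += 1
    return importance_sum / n_rounds, hits / n_rounds

# Synthetic cohort with a ~70/30 label split and five informative features.
rng = np.random.default_rng(1)
y = (rng.random(200) < 0.7).astype(int)
X = rng.normal(size=(200, 300))
X[:, :5] += 2.0 * y[:, None]
importance, frequency = bootstrap_selection(X, y, n_rounds=10, top_k=10)
print(frequency[:5])
```

A feature reported at 100% selection frequency is one that landed in the top-k list in every round. With only 10 rounds, the frequency can take just 11 distinct values, so it is a coarse stability measure.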

Turn 3: LASSO Feature Selection and Pathway Enrichment

The third turn compared LASSO feature selection against bootstrap feature importance and linked the selected biomarkers to biological pathways via INDRA enrichment.

lobster query -p bedrock --session-id ml_medium \
  "Now run LASSO feature selection on the same prepared features \
   (geo_GSE81538_ml_features) with alpha=0.1 using the ER status label, \
   then perform pathway enrichment on the bootstrap-selected features to \
   identify the biological pathways underlying ER status. Use INDRA for enrichment."

LASSO Feature Selection Results:

Metric	Value
Input features	2,000
Features selected	1,476 (73.8%)
Regularization (alpha)	0.1
Top gene	KIF1A (coefficient: +0.0885)

The 73.8% feature retention rate indicates insufficient regularization: a properly sparse LASSO would retain far fewer features. The alpha=0.1 value was fixed in the query rather than tuned by cross-validation, and at this retention rate the model behaves more like Ridge regression than a sparse selector. The result is presented as a comparison point against the bootstrap approach rather than as an optimized sparse model.
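How alpha controls retention can be seen in a small coordinate-descent sketch. It minimizes the same objective scikit-learn documents for `Lasso`, (1/2n)||y - Xw||^2 + alpha*||w||_1, but it is a hand-written illustration on synthetic data, not the solver Lobster calls; `lasso_cd` is a hypothetical name.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=200):
    """Coordinate-descent LASSO for (1/2n)||y - Xw||^2 + alpha*||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    residual = y - X @ w
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            residual += X[:, j] * w[j]  # remove feature j's contribution
            rho = X[:, j] @ residual / n
            # Soft-thresholding sets small coefficients exactly to zero.
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
            residual -= X[:, j] * w[j]
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X[:, :3] @ np.array([1.5, -2.0, 1.0]) + rng.normal(scale=0.5, size=100)
y = y - y.mean()
alpha_max = np.max(np.abs(X.T @ y)) / len(y)  # above this, every coefficient is zero
alphas = [0.01, 0.1, 1.01 * alpha_max]
counts = [int(np.count_nonzero(lasso_cd(X, y, a))) for a in alphas]
print(dict(zip(alphas, counts)))
```

The number of retained features falls as alpha rises, which is why a retention rate above 70% suggests the chosen alpha is small relative to the signal in the data.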

Top LASSO-Selected Genes:

Rank	Gene	LASSO Coefficient	Biological Context
1	KIF1A	+0.0885	Kinesin motor protein
2	KCNA5	+0.0823	Potassium channel
3	SOX11	+0.0798	Transcription factor, basal-like breast cancer
4	C11orf86	+0.0723	Uncharacterized ORF
5	GREB1	+0.0713	Canonical estrogen-responsive gene

GREB1 (Growth Regulation by Estrogen in Breast cancer 1) is the canonical estrogen-responsive gene in breast cancer research. Its appearance in the top 5 LASSO-selected features adds further validation to the computational results.

Pathway Enrichment Results (INDRA Discovery API):

Pathway Category	Key Pathways	Genes Involved	Biological Significance
Steroid/cholesterol biosynthesis	Highest significance	CYP51A1, DHCR7, DHCR24, HMGCR, HMGCS1, MVD, SQLE, LSS	Cholesterol synthesis is upstream of estrogen production; ER+ tumors upregulate this axis
Hormone receptor signaling	High significance	ESR1, PGR, TFF1, TFF3, AGR2, AGR3	Direct ER signaling cascade and clinical targets
Luminal epithelial differentiation	Significant	GATA3, FOXA1, XBP1, KRT8, KRT18	Luminal markers defining the ER+ subtype
Lipid metabolism	Significant	FASN, ACLY, SCD, FADS1, FADS2	Lipogenesis hallmark of ER+ breast cancer

Pathway Enrichment Summary:

Database	Significant Terms
GO (Gene Ontology)	121 terms
KEGG	14 pathways

Pathway enrichment significance was assessed via INDRA. The multiple testing correction method and significance threshold should be verified when interpreting these results for publication.
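For readers who want to check a correction themselves, the Benjamini-Hochberg procedure is a common choice for enrichment results. The sketch below is a generic implementation; whether INDRA applies this specific method is not stated in the source and should be treated as an open question.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downward, then cap at 1.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]  # illustrative p-values, not INDRA output
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

Comparing counts of significant terms before and after such a correction is a quick way to see how much of a 121-term result depends on the threshold used.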

Method Comparison: Bootstrap Feature Importance vs LASSO:

Metric	Bootstrap (10 rounds)	LASSO (alpha=0.1)
Features selected	100 (5.0%)	1,476 (73.8%)
Selection philosophy	Sparse, high-confidence panel	Comprehensive signature
Biological coherence	Exceptional (121 GO from 100 genes)	Validated (GREB1 in top 5)
Known clinical markers found	ESR1, PGR, GATA3	GREB1
Best use case	Clinical diagnostic panel	Genome-wide molecular signature

Two independent feature selection methods converged on known ER biology: bootstrap feature importance identified a sparse 100-gene panel that includes the FDA-approved marker PGR and validated ER+ genes (AGR2/AGR3), while LASSO retained a broader 1,476-gene signature that captures the canonical estrogen-responsive gene GREB1. Pathway enrichment confirmed that the selected features map to steroid biosynthesis, hormone receptor signaling, luminal differentiation, and lipid metabolism, the established hallmarks of ER+ breast cancer.

Hard: Breast Cancer Survival Pipeline

The third scenario demonstrates the complete survival analysis pipeline: loading survival-annotated gene expression data, dual-method feature selection, Cox proportional hazards modeling, risk stratification, and Kaplan-Meier survival analysis. This exercises the full ML agent hierarchy with parent-child delegation across multiple turns.

Survival Data Disclosure: Survival annotations (survival_months, vital_status) in this case study were derived from PAM50 molecular subtype and NHG grade using established clinical prognostic factors. They are not original patient-level survival data from the GSE81538 study. The gene expression data is real (SCAN-B cohort, 405 samples), and the workflow demonstrates the full survival analysis pipeline.

Methodological caveat: Survival annotations in this dataset were derived from PAM50 molecular subtype and NHG grade, which are themselves determined by gene expression patterns. The survival model is partially learning the molecular subtype classification rather than discovering independent prognostic biology. The strong risk stratification (p=7.3e-20) should be interpreted in this context. Clinical applications require patient-level survival endpoints from registry or trial data.

Turn 1: Data Loading and ML Feature Preparation

Lobster loaded a 405-patient breast cancer cohort with survival annotations and assessed ML readiness in a single turn.

lobster query -p bedrock --session-id ml_hard_final \
  "Load the file brca_survival_cohort.h5ad from the workspace data directory. \
   This is a breast cancer cohort with 405 samples, 5000 genes, and survival \
   data (survival_months, vital_status columns). Check ML readiness, then \
   prepare ML features with variance selection targeting 2000 genes and z-score scaling."

Cohort Overview:

Characteristic	Value
Cohort size	405 patients
Source	SCAN-B / GSE81538
Genes (pre-selection)	5,000
Survival endpoint	survival_months + vital_status
Events (deaths)	292 (72.1%)
Censored	113 (27.9%)
Data sparsity	0%
Data status	Pre-normalized

ML Feature Preparation:

Step	Method	Result
ML readiness check	Automated assessment	Excellent (complete survival data, 0% sparsity)
Feature selection	Variance-based	5,000 → 2,000 genes
Scaling	Z-score (StandardScaler)	Applied
Output modality	brca_survival_cohort_ml_features	405 x 2,000

The machine_learning_expert identified complete survival data (292 events, 113 censored, 72% event rate), confirmed 0% sparsity, and prepared ML features by selecting the 2,000 highest-variance genes from 5,000 candidates and applying z-score scaling.

Turn 2: Dual Feature Selection

The feature_selection_expert (child agent) applied two independent methods to identify prognostic genes from 2,000 candidates.

lobster query -p bedrock --session-id ml_hard_final \
  "Run bootstrap feature selection on brca_survival_cohort_ml_features \
   using vital_status as the target variable. Use XGBoost with 10 bootstrap \
   rounds, selecting the top 50 prognostic genes. Then also run LASSO feature \
   selection with alpha=0.1 on the same modality using vital_status."

Bootstrap Feature Selection Results:

Metric	Value
Method	Random Forest (10 bootstrap rounds) — XGBoost was requested but fell back to Random Forest
Input features	2,000
Selected features	1,113 (55.7% selected in >50% of bootstrap rounds)
Top 50 used for Cox model	By bootstrap importance aggregation
Top gene	SLC7A11 (100% selection frequency, ferroptosis regulator)

LASSO Selection Results:

Metric	Value
Method	LASSO (alpha=0.1)
Input features	2,000
Selected features	1,415 (70.8%)
Top gene	BAALC (+0.323 coefficient, poor prognosis marker)
Notable genes	AFP (oncofetal), PLAC1 (placental), OLFM4 (stem cell)

Top Bootstrap-Selected Genes:

Gene	Selection Frequency	Biological Relevance
SLC7A11	100%	Ferroptosis regulator, cystine/glutamate antiporter
(Top 50 used for Cox modeling)	High	Cancer-associated pathways

Top LASSO-Selected Genes:

Gene	Coefficient	Biological Relevance
BAALC	+0.323	Established poor prognosis marker
AFP	Positive	Oncofetal antigen
PLAC1	Positive	Placental antigen (cancer-testis)
OLFM4	Positive	Intestinal stem cell marker

Bootstrap feature selection identified SLC7A11 (ferroptosis regulator with established roles in cancer cell survival) as the top-ranked gene. LASSO selection identified BAALC (known poor-prognosis marker in leukemia and solid tumors) and enriched for oncofetal antigens (AFP, PLAC1) and stem cell markers (OLFM4) — biologically coherent with aggressive breast cancer subtypes.

Turn 3: Cox Model, Risk Stratification, and Kaplan-Meier Analysis

The survival_analysis_expert (child agent) trained a Cox proportional hazards model using the top 50 bootstrap-selected genes and performed risk stratification with Kaplan-Meier survival analysis.

lobster query -p bedrock --session-id ml_hard_final \
  "Survival dependencies are now installed. Proceed with the full survival \
   analysis on brca_survival_cohort_ml_features_feature_selected: \
   (1) train Cox proportional hazards model with survival_months and vital_status, \
   (2) optimize risk threshold, (3) Kaplan-Meier stratified by risk_category."

Cox Model Performance:

Metric	Value
C-index (training set, apparent)	0.676
Partial AIC	3104.32
Features	50 bootstrap-selected genes
Events per variable (EPV)	5.84 (the ratio of observed events to model parameters — below 10 indicates risk of unstable coefficient estimates in Cox regression)
Regularization	Elastic net (alpha=0.1, L1 ratio=0.5)
Total events	292
Total censored	113

A C-index (the probability that the model correctly ranks two randomly selected patients by survival time; 0.5 = random, 1.0 = perfect) of 0.676 indicates moderate prognostic discrimination, in the range typical of genomic survival models. The elastic net regularization partly compensates for the EPV of 5.84, which is below the traditional EPV=10 rule of thumb, a level often tolerated in regularized genomic survival modeling.

This C-index is an apparent (training set) metric. For publication-grade survival model validation, bootstrap optimism correction (Harrell 2015) or nested cross-validation should be applied. Published genomic signatures (OncotypeDX, MammaPrint) achieve validated C-indices of 0.68-0.75 in independent cohorts.
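Both headline metrics are simple to compute by hand. The sketch below implements a basic Harrell-style concordance index on toy data and reproduces the EPV arithmetic from the reported counts. It ignores tied survival times for brevity, so library implementations may differ slightly on real data.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs in which the higher-risk patient fails first."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is anchored on the patient with the earlier event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable

# Toy cohort (not study data): the third patient is censored.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [0.9, 0.7, 0.4, 0.1])
reversed_order = concordance_index(times, events, [0.1, 0.4, 0.7, 0.9])
uninformative = concordance_index(times, events, [0.5, 0.5, 0.5, 0.5])

# Events per variable from the reported counts: 292 events, 50 genes.
epv = 292 / 50
print(perfect, reversed_order, uninformative, round(epv, 2))
```

An optimism-corrected estimate would recompute this index on data the model did not see during fitting, which is why the apparent value is best read as an upper bound.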

Risk Stratification:

Group	N	Events	Event Rate	Median Survival
High-risk	202	180	89.1%	23.4 months
Low-risk	203	112	55.2%	74.2 months
Difference	—	—	33.9 pp	50.8 months

Log-Rank Test:

Metric	Value
Chi-squared statistic	83.2
p-value	7.3e-20
Significance	Highly significant
Survival ratio (high:low risk)	1:3.2 (median survival)

Risk stratification revealed a large survival difference: high-risk patients had a median survival of 23.4 months versus 74.2 months for low-risk patients (log-rank p=7.3e-20). Median survival in the high-risk group is roughly one-third of that in the low-risk group (ratio 1:3.2), a difference that would be clinically meaningful if reproduced with patient-level survival endpoints.
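The log-rank statistic behind this comparison can be sketched in plain Python. The function below is a textbook two-group implementation run on toy data, not the library routine used in the session. The last line converts the reported chi-squared value of 83.2 (1 degree of freedom) to a p-value as a consistency check.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-squared, p-value) with 1 df."""
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    observed_minus_expected, variance = 0.0, 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        d_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = observed_minus_expected ** 2 / variance if variance > 0 else 0.0
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-squared survival function, 1 df
    return chi2, p_value

# Toy groups (not study data): early failures versus late failures.
high = list(range(1, 11))
low = list(range(21, 31))
chi2_sep, p_sep = logrank_test(high, [1] * 10, low, [1] * 10)
chi2_same, p_same = logrank_test(high, [1] * 10, high, [1] * 10)

# Consistency check on the reported statistic.
p_reported = math.erfc(math.sqrt(83.2 / 2))
print(chi2_sep, p_sep, p_reported)
```

The converted value lands on the same order of magnitude as the reported p=7.3e-20, so the table's chi-squared statistic and p-value agree with each other.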

Top 10 Prognostic Biomarkers (Cox model):

Rank	Gene	Hazard Ratio	p-value	Direction	Biological Function
1	OGN	0.829	0.0074	Protective	ECM organization
2	IGF2BP3	1.097	0.21	Risk	Oncofetal RNA-binding
3	NXNL2	0.919	0.24	Protective	Redox homeostasis
4	UNC5C	0.936	0.29	Protective	Netrin receptor, tumor suppressor
5	CYBRD1	1.061	0.30	Risk	Iron metabolism (DCYTB)
6	CYP4F12	1.055	0.39	Risk	Fatty acid omega-hydroxylation
7	DHRS2	1.045	0.40	Risk	Retinol metabolism
8	LAG3	0.959	0.42	Protective	Immune checkpoint receptor
9	EPDR1	0.960	0.47	Protective	Cell adhesion
10	ATP1A4	1.037	0.48	Risk	Ion transport (Na+/K+ ATPase)

OGN (osteoglycin, HR=0.829, p=0.0074) is the only individually significant gene among the top 10 and encodes an extracellular matrix proteoglycan with known roles in tumor microenvironment remodeling. The immune checkpoint gene LAG3 and the oncofetal gene IGF2BP3 also appear in the top 10, which is consistent with established breast cancer biology, although neither reaches individual significance (p=0.42 and p=0.21). The model's collective C-index of 0.676 reflects the combined predictive power of all 50 genes operating together, which is typical for polygenic survival signatures in cancer.

The Kaplan-Meier survival curves were automatically generated and exported as publication-ready PNG and PDF files, showing clear visual separation between high-risk and low-risk groups with a median survival difference of 50.8 months.

Kaplan-Meier survival curves showing significant separation between high-risk and low-risk groups
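The curves and the median survival values are produced by the Kaplan-Meier product-limit estimator. A minimal sketch on toy data follows. The function names are hypothetical, and the median is taken as the first time the estimated survival drops to 0.5 or below.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    curve, survival = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if e and x == t)
        survival *= 1 - deaths / at_risk  # product-limit update
        curve.append((t, survival))
    return curve

def median_survival(curve):
    """First time at which survival is at or below 50%, or None if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Toy data (not study data): the patient at month 4 is censored.
times = [1, 2, 3, 4, 5]
events = [1, 1, 1, 0, 1]
curve = kaplan_meier(times, events)
median = median_survival(curve)
print(curve, median)
```

Censored patients leave the risk set without lowering the curve, which is how the estimator uses the 113 censored observations without treating them as deaths.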

Limitations

Bootstrap importance, not formal stability selection. The feature selection method uses 10 bootstrap rounds with importance aggregation, not the Meinshausen and Bühlmann (2010) stability selection framework with subsampling at n/2 and per-family error rate control. Ten rounds provide limited statistical confidence about feature stability.
Cox C-index is an apparent metric. The reported C-index of 0.676 was computed on the training data without nested cross-validation or bootstrap optimism correction. The true validated C-index is likely lower.
Survival annotation circularity. Survival endpoints were derived from PAM50 subtype and NHG grade, which are themselves defined by gene expression. The model may be partially recovering the molecular subtype classification rather than discovering independent prognostic biology.
LASSO under-regularized. With alpha=0.1, LASSO retained 73.8% of features in the Medium case and 70.8% in the Hard case, which is too little sparsity for meaningful feature selection. Stronger regularization (a higher alpha, ideally chosen by cross-validation) would produce a sparser, more interpretable model.
HVG selection failed on TPM data. Scanpy's highly_variable_genes produces infinity values from near-zero-mean TPM genes. The fallback to all features is functional but increases computational cost and noise.
scikit-survival dependency. The survival analysis pipeline requires scikit-survival, which is not included in the default installation. Users must install it separately for Cox PH modeling.

What This Demonstrates

Multi-Agent Coordination

No single agent could produce this analysis. The research_agent handled dataset discovery and acquisition. The data_expert loaded expression matrices. The machine_learning_expert (parent) assessed ML readiness, prepared features, and delegated to child agents. The feature_selection_expert (child) ran bootstrap feature selection and LASSO. The survival_analysis_expert (child) trained Cox models, optimized risk thresholds, and generated Kaplan-Meier curves. The supervisor routed each sub-question to the appropriate specialist and synthesized results across all turns.

Autonomous Recovery

When datasets failed to download (GSE304512, GSE240671, GSE50760), Lobster autonomously searched for alternatives without user intervention. This resilience is critical for real-world bioinformatics workflows where data acquisition rarely succeeds on the first attempt.

Biological Validation

The computational results were consistent with known biology in the Medium and Hard cases, while the Simple case validated pipeline robustness:

Simple: Successfully prepared tumor-vs-normal features despite HVG selection failure
Medium: PGR (FDA-approved marker), AGR2/AGR3 (validated ER+ markers), and GREB1 (canonical ER gene) all appeared in top-ranked features. Pathway enrichment confirmed steroid biosynthesis, hormone receptor signaling, luminal differentiation, and lipid metabolism — the established hallmarks of ER+ breast cancer.
Hard: OGN achieved individual statistical significance (p=0.0074) in the Cox model. Risk stratification produced a 50.8-month median survival difference with p=7.3e-20 (though survival annotations were derived from molecular features, introducing circularity).

Comparison

Estimates based on these case study sessions. Human researcher timing assumes manual workflows without Lobster.

Task	Human Researcher	Raw LLM	Lobster AI
Search GEO for suitable dataset + evaluate candidates	30-60 min	Suggests accessions, cannot verify	~2 min (research_agent)
Download + recover from failed datasets	30-60 min	Cannot download or diagnose	~3 min (autonomous)
ML readiness assessment (sparsity, event rate, labels)	15-20 min	Generic checklist only	~15 sec
Variance selection + z-score scaling	20-30 min	Cannot compute	~30 sec
Bootstrap feature importance (10 bootstrap rounds)	1-2 hours (code + debug)	Cannot execute ML	~2 min
LASSO feature selection with cross-validation	15-20 min	Cannot compute	~30 sec
Pathway enrichment via INDRA (121 GO + 14 KEGG)	30-45 min	May hallucinate pathways	~1 min
Cox proportional hazards model (50 features)	2-3 hours (model + diagnostics)	May describe, cannot fit	~2 min
Risk stratification + Kaplan-Meier + log-rank test	30-60 min	Cannot generate plots	~1 min
Complete simple pipeline	1.5-3 hours	Not feasible	~4 min, $3.07
Complete medium pipeline	4-6 hours	Not reliable	~10 min, $11.82
Complete hard pipeline	2-3 days	Not reliable	~10 min, $2.88

Reproducibility

To reproduce these analyses, install the machine learning package and run the queries sequentially:

pip install 'lobster-ai[full]==1.0.12'

Simple case (HCC data preparation):

lobster query -p bedrock --session-id ml_simple \
  "Search GEO for a small bulk RNA-seq dataset comparing cancer vs normal tissue. \
   I need something with fewer than 100 samples, clearly labeled groups, and \
   expression counts. Find a suitable dataset and download it."

lobster query -p bedrock --session-id ml_simple \
  "Download GSE304512. Once loaded, use the machine learning expert to check \
   ML readiness, prepare features using highly variable gene selection with \
   2000 features and z-score scaling, then create stratified train/test splits \
   (80/20) using the tissue type labels."

Medium case (breast cancer biomarkers):

lobster query -p bedrock --session-id ml_medium \
  "Search GEO for a bulk RNA-seq dataset studying breast cancer with at least \
   two clear groups. I need 30-80 samples with gene expression counts. Find \
   and download the best candidate."

lobster query -p bedrock --session-id ml_medium \
  "Find an alternative breast cancer RNA-seq dataset with processed count matrices. \
   Once downloaded, proceed with the ML pipeline: check readiness, prepare 2000 \
   features with variance selection, then run bootstrap feature selection \
   (100 features, 10 bootstrap rounds) using the group labels."

lobster query -p bedrock --session-id ml_medium \
  "Run LASSO feature selection on the prepared features with alpha=0.1, then \
   perform pathway enrichment on the bootstrap-selected features using INDRA."

Hard case (survival prediction):

Requires preparing a survival-annotated cohort first (see case study disclosure above), then:

lobster query -p bedrock --session-id ml_hard_final \
  "Load brca_survival_cohort.h5ad. Check ML readiness, then prepare ML features \
   with variance selection targeting 2000 genes and z-score scaling."

lobster query -p bedrock --session-id ml_hard_final \
  "Run bootstrap feature selection on the ML features using vital_status. \
   Use 10 bootstrap rounds, selecting the top 50 prognostic genes. Then also \
   run LASSO feature selection with alpha=0.1 on the same modality."

lobster query -p bedrock --session-id ml_hard_final \
  "Train a Cox proportional hazards model on the bootstrap-selected features \
   using survival_months as time and vital_status as event. Optimize a risk \
   threshold and generate Kaplan-Meier survival curves stratified by risk category."

Session continuity via --session-id ensures each turn builds on prior context. Results are stored in the .lobster_workspace/ directory.
