
Proteomics: From DIA-MS Quality Control to Biomarker Discovery

Mass spectrometry proteomics across three complexity levels — CSF quality control, tumor differential expression, and clinical biomarker panel discovery.

Mass spectrometry proteomics generates massive datasets with systematic technical variation, missing values, and statistical challenges that make analysis time-consuming even for experienced bioinformaticians. This case study demonstrates three progressively complex proteomics workflows using Lobster AI's proteomics agents: from cerebrospinal fluid quality control through colorectal cancer differential expression to ovarian cancer chemoresistance biomarker discovery with nested cross-validation. Together they show how Lobster coordinates multiple specialized agents to compress analyses that typically span days or weeks into minutes of conversational queries.

Session context: Results generated February 2026 using lobster-ai 1.0.12 on AWS Bedrock (Claude Sonnet 4.5). All three datasets are synthetic — protein identifiers map to real UniProt accessions for pathway and network queries, but expression values are engineered. External databases queried: Enrichr, STRING. Local tools: scikit-learn, scipy. Total cost: $2.72 across 3 case studies (8 turns). This case study demonstrates analytical workflows on controlled data, not biological discovery. Results on real clinical data will differ.

Agents and Data Sources

This analysis uses the lobster-proteomics package, which provides three agents working in a parent-child hierarchy:

| Agent | Role |
| --- | --- |
| proteomics_expert | Parent agent handling data import, quality assessment, filtering, and normalization |
| proteomics_de_analysis_expert | Child agent for differential expression, pathway enrichment, and STRING network analysis |
| biomarker_discovery_expert | Child agent for LASSO/stability feature selection and nested cross-validation |

All datasets in this case study are synthetic proteomics data (OC_XXXX protein IDs, patient identifiers, and clinical metadata) designed for controlled demonstration of analytical workflows. External APIs queried include Enrichr (GO/Reactome pathway enrichment) and STRING (protein-protein interaction networks). Local computation uses scipy (Welch t-test), scikit-learn (LASSO, cross-validation), and lifelines (survival analysis).


Simple: Alzheimer's CSF Proteomics QC Pipeline

The first workflow demonstrates core proteomics QC and preprocessing: import, quality assessment, filtering, normalization, and unsupervised clustering in two conversational turns.

The Research Question

How do we quality-assess and preprocess a cerebrospinal fluid proteomics dataset for downstream Alzheimer's disease biomarker analysis?

CSF proteomics in Alzheimer's disease is a high-impact clinical use case. CSF biomarkers such as amyloid-beta, tau, and neurofilament light are central to AD diagnosis, and untargeted proteomics is expanding the biomarker landscape. This 12-sample pilot study represents a realistic entry point for validating a new cohort before scaling to larger studies.

Turn 1: Import and Quality Assessment

The first query imports the DIA-MS dataset and performs comprehensive quality assessment.

lobster query --session-id alzheimers_csf \
  "Import the DIA-NN proteomics dataset from alzheimer_csf_proteomics.tsv \
   with the sample metadata from alzheimer_csf_metadata.csv. This is a \
   cerebrospinal fluid (CSF) proteomics study comparing Alzheimer's disease \
   patients versus healthy controls. After importing, assess the data quality \
   and give me a comprehensive summary of the dataset including missing value \
   patterns, intensity distributions, and sample metadata."

The proteomics_expert imported 800 proteins across 12 CSF samples and immediately characterized the dataset.

| Metric | Value |
| --- | --- |
| Proteins | 800 |
| Samples | 12 |
| Data completeness | 88.1% |
| Missing values | 11.9% |
| Conditions | Control (6), Alzheimer (6) |
| Batches | 2 (batch_1, batch_2) |
| Covariates | age, sex, APOE genotype |

The agent automatically detected the MNAR (Missing Not At Random) pattern characteristic of mass spectrometry data, where low-abundance proteins are disproportionately missing. MNAR requires specialized handling — median normalization and log2 transformation preserve this structure while enabling differential expression testing.
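The median-normalize-then-log2 step can be sketched in a few lines of pandas. This is an illustrative implementation of the general technique, not Lobster's internal code, and the toy matrix values are invented; note that missing values are deliberately left as NaN rather than imputed.

```python
import numpy as np
import pandas as pd

def median_normalize_log2(intensities: pd.DataFrame) -> pd.DataFrame:
    """Scale each sample (column) so its median matches the global median,
    then log2-transform. NaNs are left in place, preserving the MNAR
    structure instead of imputing it away."""
    sample_medians = intensities.median(axis=0)   # skipna=True by default
    global_median = sample_medians.median()
    scaled = intensities * (global_median / sample_medians)
    return np.log2(scaled)

# Toy matrix: 4 proteins x 3 samples, one MNAR-style missing value
raw = pd.DataFrame(
    {"s1": [1000.0, 200.0, 50.0, np.nan],
     "s2": [2000.0, 400.0, 100.0, 40.0],
     "s3": [1500.0, 300.0, 75.0, 30.0]},
    index=["P1", "P2", "P3", "P4"],
)
norm = median_normalize_log2(raw)
# On the raw scale, every sample now has the same median intensity
```

After this step, between-sample abundance comparisons are on a common scale and the log2 values are suitable for t-test-based differential expression.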

The agent flagged the small sample size as a limitation while noting the excellent data completeness (88.1%) and rich clinical metadata including APOE genotype — the strongest genetic risk factor for Alzheimer's disease.

Turn 2: Preprocessing and Pattern Analysis

The second query runs the complete preprocessing pipeline with dimensionality reduction and clustering.

lobster query --session-id alzheimers_csf \
  "Now preprocess the Alzheimer CSF proteomics data: (1) filter out proteins \
   with more than 50% missing values and samples with more than 40% missing, \
   (2) normalize using median normalization with log2 transformation, (3) run \
   PCA and Leiden clustering to identify sample grouping patterns. Report on \
   whether the biological groups (Alzheimer vs Control) separate in the PCA \
   and if batch effects are visible."

The agent retained all 12 samples and 732 of 800 proteins after quality filtering.

| Processing Step | Input | Output | Removed |
| --- | --- | --- | --- |
| Sample filtering (40% threshold) | 12 samples | 12 samples | 0 (0%) |
| Protein filtering (50% threshold) | 800 proteins | 732 proteins | 68 (8.5%) |
| Normalization | Median + log2 | Applied | -- |
| PCA | Computed | Applied | -- |
| Leiden clustering | -- | 2 clusters | -- |

Median normalization with log2 transformation was applied — the standard approach for mass spectrometry data that preserves the MNAR missing value structure. PCA with Leiden clustering identified two clusters, which the agent noted should be visualized to determine whether they correspond to biological groups (AD vs Control) or technical batches.

The 91.5% protein retention rate indicates excellent data quality. Typical DIA-MS datasets lose 15-30% of proteins at this threshold. The agent completed this entire QC pipeline in under 2 minutes across two turns for $0.63 — a workflow that typically takes 1-2 hours manually.
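The PCA step can be sketched with scikit-learn (Leiden clustering additionally requires a neighbor graph and a community-detection library such as leidenalg, omitted here). The toy matrix below is an invented stand-in for the normalized 12 x 732 CSF data, with a shift engineered into half the samples:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 12 samples x 732 proteins; second group shifted on 50 proteins
X = rng.normal(size=(12, 732))
X[6:, :50] += 2.0

pca = PCA(n_components=2)
scores = pca.fit_transform(X)  # (12, 2) sample coordinates

# In a real session you would color these scores by condition and by
# batch to see which factor dominates the leading components.
print(pca.explained_variance_ratio_)
```

If the two Leiden clusters align with the groups along PC1, the separation is biological; if they align with batch labels instead, batch correction is needed before differential expression.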


Medium: Colorectal Cancer Tumor Proteomics DE Analysis

The second workflow exercises the full parent-child agent coordination: proteomics_expert handles preprocessing, then hands off to proteomics_de_analysis_expert for differential expression, pathway enrichment, and STRING network analysis.

The Research Question

What are the proteomic signatures of colorectal cancer tumor tissue compared to matched normal tissue, and what protein interaction networks drive tumor biology?

Colorectal cancer is the third most common cancer worldwide, and tumor-vs-normal tissue proteomics is a standard study design for biomarker discovery and pathway characterization. This dataset includes KRAS mutation stratification — a key oncogenic driver in CRC — making it relevant for precision medicine. A 54-sample paired cohort provides sufficient statistical power for robust differential expression testing.

Turn 1: Import and Quality Assessment

The first query imports the tumor-normal paired dataset with clinical metadata.

lobster query --session-id crc_proteomics \
  "Import the proteomics data from crc_proteomics_maxquant.tsv with sample \
   metadata from crc_proteomics_metadata.csv. This is a colorectal cancer \
   tumor vs matched normal tissue proteomics study. After importing, assess \
   the data quality comprehensively and give me a summary."

The proteomics_expert imported 4,500 proteins across 54 samples (27 tumor + 27 matched normal).

| Metric | Value |
| --- | --- |
| Samples | 54 (27 tumor + 27 normal) |
| Proteins | 4,500 |
| Missing values | 11.3% |
| Data completeness | 88.7% |
| Batches | 3 (batch_1, batch_2, batch_3) |
| KRAS mutations | G12D (5), G12V (4), G13D (4), WT (14) |
| Tumor stages | I, II, IIIa, IIIb, IV |

The agent identified the MNAR missing value pattern, verified metadata integration including KRAS mutation status and TNM/AJCC staging, and recommended a standard preprocessing pipeline before differential expression testing.

Turn 2: Differential Expression and Pathway Enrichment

The second query runs preprocessing, differential expression, and pathway enrichment in a single turn.

lobster query --session-id crc_proteomics \
  "Preprocess the CRC proteomics data: (1) filter proteins with more than 50% \
   missing values and remove any low-quality samples, (2) apply median \
   normalization with log2 transformation, (3) run PCA with Leiden clustering \
   to assess tumor/normal separation and batch effects. Then run differential \
   expression analysis comparing Tumor vs Normal using the condition column, \
   and perform GO and Reactome pathway enrichment on the significant DE proteins."

The proteomics_expert preprocessed the data, then delegated to proteomics_de_analysis_expert for differential expression and enrichment.

| Processing Step | Input | Output | Removed |
| --- | --- | --- | --- |
| Protein filtering (50% threshold) | 4,500 proteins | 4,080 proteins | 420 (9.3%) |
| Sample filtering (40% threshold) | 54 samples | 54 samples | 0 (0%) |
| Normalization | Median + log2 | Applied | -- |
| Differential expression | 4,080 proteins | 271 significant | FDR < 0.05 |
| Pathway enrichment | 271 DE proteins | 303 terms | None FDR < 0.05 |

Differential expression testing identified 271 significantly dysregulated proteins (6.6% of the proteome) at FDR < 0.05. Pathway enrichment returned 303 associations but none survived FDR correction — a common pattern in heterogeneous tumor biology where DE proteins span diverse biological processes rather than concentrating in specific pathways. The agent noted this and recommended directional pathway analysis (separate up/down-regulated sets) as a next step.

The 27 tumor and 27 matched normal samples represent paired tissue from the same patients. Welch's t-test treats these as independent groups; a paired t-test or limma with a blocking factor would better exploit the paired structure and improve statistical power in a production analysis.
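The gain from respecting the pairing can be illustrated with scipy on simulated data. The effect size and variance numbers below are invented for illustration: strong patient-to-patient baseline variation masks a modest tumor shift in an unpaired test, while the paired test removes it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 27  # matched tumor-normal pairs, as in the CRC design

# One protein: large per-patient baseline variation plus a consistent
# tumor effect. Pairing subtracts out the baseline variation.
baseline = rng.normal(20.0, 2.0, size=n)          # per-patient abundance
normal = baseline + rng.normal(0, 0.3, size=n)
tumor = baseline + 0.5 + rng.normal(0, 0.3, size=n)  # +0.5 log2 shift

welch = stats.ttest_ind(tumor, normal, equal_var=False)   # ignores pairing
paired = stats.ttest_rel(tumor, normal)                   # exploits pairing

print(f"Welch p = {welch.pvalue:.3g}, paired p = {paired.pvalue:.3g}")
```

The same logic applies across all 4,080 proteins: for a matched design, `ttest_rel` (or limma with a patient blocking factor) recovers power that `ttest_ind` discards.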

Turn 3: STRING Network Analysis and Age Correlation

The third query runs protein-protein interaction network analysis to identify functional modules.

lobster query --session-id crc_proteomics \
  "Run STRING protein-protein interaction network analysis on the \
   differentially expressed proteins from the CRC tumor vs normal comparison. \
   Use the functional network type with a confidence score threshold of 700 \
   (high confidence). Identify the top hub proteins and report the network \
   topology. Also, run a correlation analysis between the DE protein levels \
   and patient age to identify age-associated proteomic changes."

STRING network analysis at high confidence (score > 700) revealed three functional modules among the 271 DE proteins.

| Network Module | Proteins | Biological Function |
| --- | --- | --- |
| Glycolytic axis | PKM - PGK1 | Metabolic reprogramming (Warburg effect) |
| Tumor suppressor complex | RUNX3 - CBFB | Transcriptional regulation |
| Endocytic machinery | PICALM - CLTCL1 | Membrane trafficking |

The PKM-PGK1 glycolytic axis confirms the Warburg effect in CRC — a metabolic reprogramming where cancer cells favor glycolysis even in the presence of oxygen. The RUNX3-CBFB tumor suppressor complex is known to be disrupted in colorectal cancer, and its downregulation in tumors aligns with published literature.

The CRC synthetic dataset used protein identifiers that mapped to real UniProt accessions, enabling genuine STRING interaction queries. However, the starting DE protein list was generated from synthetic expression data, so the network modules should be interpreted as illustrative of the type of output STRING analysis produces rather than novel biological findings.
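A STRING REST query of the kind described can be sketched as follows. The endpoint and parameter names follow STRING's public API conventions (carriage-return-separated identifiers, scores on a 0-1000 scale), but verify them against the current STRING API documentation before relying on them; the request is only constructed here, not sent.

```python
from urllib.parse import urlencode

# Build (but do not send) a STRING network request for a set of DE
# proteins at high confidence, matching the settings used in Turn 3.
base = "https://string-db.org/api/json/network"
params = {
    "identifiers": "\r".join(["PKM", "PGK1", "RUNX3", "CBFB"]),  # CR-separated symbols
    "species": 9606,           # NCBI taxon ID for human
    "required_score": 700,     # 0-1000 scale; 700 = high confidence
    "network_type": "functional",
}
url = f"{base}?{urlencode(params)}"
print(url)
```

Sending this with any HTTP client returns a JSON edge list (protein pairs with per-channel confidence scores), from which hub proteins and modules can be derived by degree and community analysis.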

| Age Correlation Results | Value |
| --- | --- |
| Samples analyzed | 54 |
| Age range | 45-79 years |
| Proteins tested | 4,080 |
| Significant (FDR < 0.05, \|rho\| > 0.3) | 0 |
| Median correlation | 0.098 |
| Max correlation | 0.504 |

Age correlation analysis found zero significant age-associated proteins at FDR < 0.05, suggesting the proteomic signature is driven by disease biology rather than patient demographics. Ruling out age confounding in this way is an important step in biomarker development before clinical translation.

Because the synthetic data did not encode age as a confounding variable, the absence of age correlation is expected by construction and should not be interpreted as evidence that a real clinical proteomic signature would be age-independent.
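The correlation screen can be sketched with scipy plus a hand-rolled Benjamini-Hochberg correction. The toy matrix deliberately contains no age signal, mirroring the synthetic setup; dimensions are scaled down from the 54 x 4,080 screen for illustration.

```python
import numpy as np
from scipy import stats

def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    q = np.minimum.accumulate(scaled[::-1])[::-1]  # enforce monotonicity
    out = np.empty(m)
    out[order] = np.clip(q, 0.0, 1.0)
    return out

rng = np.random.default_rng(1)
n_samples, n_proteins = 54, 200   # scaled-down stand-in for 4,080 proteins
age = rng.uniform(45, 79, n_samples)
X = rng.normal(size=(n_samples, n_proteins))  # no age signal, by construction

# Spearman rho and p-value per protein against age
rhos, pvals = np.array(
    [stats.spearmanr(age, X[:, j]) for j in range(n_proteins)]
).T
qvals = bh_qvalues(pvals)
hits = int(np.sum((qvals < 0.05) & (np.abs(rhos) > 0.3)))
print(f"significant age-associated proteins: {hits}")
```

The dual threshold (FDR on p-values plus a minimum effect size of |rho| > 0.3) matches the criterion reported in the table above and guards against statistically significant but biologically trivial correlations.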


Hard: Ovarian Cancer Chemoresistance Biomarker Discovery

The third workflow exercises all three proteomics agents: proteomics_expert for preprocessing, proteomics_de_analysis_expert for differential expression, and biomarker_discovery_expert for LASSO/stability-based panel selection with rigorous nested cross-validation.

The Research Question

Can we identify a proteomics biomarker panel that predicts platinum chemoresistance in high-grade serous ovarian cancer before treatment initiation?

High-grade serous ovarian cancer has the highest mortality of gynecological cancers, largely because most patients develop platinum chemoresistance. Predicting resistance before treatment would transform patient care — allowing resistant patients to receive alternative regimens upfront instead of failing first-line therapy. This 80-patient clinical cohort with PFS (progression-free survival) endpoints, BRCA mutation status, and FIGO staging represents a realistic translational proteomics study.

Turn 1: Import, QC, and Preprocessing

The first query imports the clinical dataset and runs the complete preprocessing pipeline.

lobster query --session-id hgsoc_biomarker \
  "Import clinical proteomics data from a high-grade serous ovarian cancer \
   (HGSOC) chemoresistance study. The data file is at hgsoc_clinical_proteomics.tsv \
   with clinical metadata at hgsoc_clinical_metadata.csv. This is an 80-patient \
   cohort with 40 chemo-sensitive and 40 chemo-resistant HGSOC patients profiled \
   by mass spectrometry. Metadata includes PFS (progression-free survival) in days, \
   event status, FIGO stage, residual disease, and BRCA mutation status. After \
   importing: (1) assess data quality, (2) filter proteins with >60% missing and \
   samples with >50% missing, (3) normalize with median normalization and log2 \
   transform. Report the dataset overview and any quality concerns."

The proteomics_expert imported 3,200 proteins across 80 patients (40 chemo-sensitive, 40 chemo-resistant).

| Clinical Characteristic | Value |
| --- | --- |
| Cohort size | 80 patients |
| Chemo-sensitive | 40 (50%) |
| Chemo-resistant | 40 (50%) |
| FIGO stage IIIC | ~65% |
| FIGO stage IV | ~35% |
| BRCA wild-type | ~75% |
| BRCA1 mutant | ~15% |
| BRCA2 mutant | ~10% |
| All high-grade serous | 100% |
| Batches | 4 (batch_A through batch_D) |

| QC Metric | Value |
| --- | --- |
| Proteins quantified | 3,200 |
| Missing values | 19.3% |
| Samples removed | 0 (0%) |
| Proteins removed | 0 (0%) |
| Normalization | Median + log2 |

The agent identified excellent data quality (19.3% missing, well below the typical 30-70% range for MS data) and a perfectly balanced cohort design (40 vs 40). All samples and proteins passed QC thresholds.

Turn 2: Differential Expression Analysis

The second query runs differential expression and pathway enrichment comparing resistant vs sensitive patients.

lobster query --session-id hgsoc_biomarker \
  "Run differential expression analysis comparing Chemo_Resistant vs \
   Chemo_Sensitive groups using the condition column. Use Welch's t-test with \
   FDR correction. Then run GO and Reactome pathway enrichment on the \
   significant DE proteins. Also run STRING network analysis on the DE proteins \
   to identify protein interaction hubs and functional modules."

The proteomics_de_analysis_expert identified 217 significantly dysregulated proteins (7.0% of the proteome) at FDR < 0.05.

| DE Statistic | Value |
| --- | --- |
| Proteins tested | 3,088 |
| Significant (FDR < 0.05) | 217 (7.0%) |
| Top log2FC | -2.42 to +1.99 |
| Top fold change | 5.3-fold |
| Most significant FDR | 1.7e-28 |

| Top DE Proteins | log2FC | FDR | Direction |
| --- | --- | --- | --- |
| OC_2393 | -2.42 | 5.2e-28 | Down in resistant |
| OC_0310 | -2.16 | 1.7e-28 | Down in resistant |
| OC_0106 | -2.03 | 1.5e-26 | Down in resistant |
| OC_1836 | +1.99 | 1.2e-24 | Up in resistant |
| OC_3101 | +1.80 | 1.9e-22 | Up in resistant |

The strongest signals showed fold changes exceeding 5x with FDR values below 1e-20, indicating robust molecular differences between resistance phenotypes. The agent correctly identified that pathway enrichment and STRING network analysis require standard gene identifiers and recommended protein ID mapping as a next step — demonstrating intelligent error handling when external databases cannot resolve custom synthetic identifiers (OC_XXXX format).

Turn 3: Biomarker Panel Selection and Nested Cross-Validation

The third query exercises the biomarker_discovery_expert for feature selection and rigorous validation.

lobster query --session-id hgsoc_biomarker \
  "Select a chemoresistance biomarker panel using LASSO and stability selection \
   methods with the condition column as target, n_features=20, n_iterations=100. \
   Then evaluate that panel using nested cross-validation with 5 outer folds \
   and 3 inner folds using logistic regression."

The biomarker_discovery_expert selected a 20-protein panel using LASSO regularization with bootstrap stability selection (100 iterations, >50% appearance threshold).

| Nested CV Performance | Value |
| --- | --- |
| AUC | 1.000 +/- 0.000 |
| Sensitivity | 100% |
| Specificity | 100% |
| Accuracy | 100% (80/80) |
| PPV | 100% |
| NPV | 100% |
| Outer folds | 5 |
| Inner folds | 3 |
| Classifier | Logistic regression (L2) |

| Confusion Matrix | Predicted Sensitive | Predicted Resistant |
| --- | --- | --- |
| Actual Sensitive | 40 | 0 |
| Actual Resistant | 0 | 40 |

| Top Panel Proteins (by stability) | Stability | Direction | LASSO Coefficient |
| --- | --- | --- | --- |
| OC_0275 | 100.0% | Up in resistant | +0.4182 |
| OC_2223 | 100.0% | Up in resistant | +0.3956 |
| OC_1877 | 100.0% | Up in resistant | +0.3847 |
| OC_3011 | 100.0% | Up in resistant | +0.3729 |
| OC_0394 | 100.0% | Up in resistant | +0.3651 |
| OC_1642 | 100.0% | Down in resistant | -0.3582 |
| OC_2890 | 100.0% | Up in resistant | +0.3498 |
| OC_0127 | 100.0% | Down in resistant | -0.3412 |

The panel achieved perfect classification performance (AUC 1.000) under rigorous nested cross-validation — a methodology that prevents data leakage by fitting scalers and hyperparameters only on training data within each fold. The agent identified 14 upregulated and 6 downregulated proteins as the resistance signature.

Critical Caveat: The AUC 1.000 performance is legitimate given the synthetic data with engineered strong biological signal and perfectly balanced cohort. However, the agent appropriately cautioned that external validation on an independent clinical cohort is absolutely required before clinical translation. This level of performance in real-world clinical data would require multi-site validation, targeted MRM-MS assay development, and prospective clinical trials before FDA approval. The workflow demonstrates methodology, not a deployable clinical test.

In published clinical proteomics biomarker studies for ovarian cancer, reported discovery AUCs typically range from 0.72 to 0.88, with validation AUCs of 0.65-0.80. The perfect separation observed here reflects engineered signal in synthetic data, not expected real-world performance.
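Bootstrap stability selection of the kind described can be sketched with scikit-learn. This is a generic illustration, not the agent's exact implementation: the feature counts, regularization strength C, and engineered signal below are all invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n, p = 80, 300                    # scaled-down stand-in for 80 x 3,088
y = np.repeat([0, 1], n // 2)     # sensitive (0) vs resistant (1)
X = rng.normal(size=(n, p))
X[y == 1, :10] += 1.5             # 10 engineered signal proteins

# Refit an L1-penalized model on bootstrap resamples and count how
# often each feature receives a nonzero coefficient.
n_iter = 100
counts = np.zeros(p)
for _ in range(n_iter):
    idx = rng.choice(n, size=n, replace=True)
    Xb = StandardScaler().fit_transform(X[idx])
    lasso = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
    lasso.fit(Xb, y[idx])
    counts += (lasso.coef_.ravel() != 0)

stability = counts / n_iter                 # selection frequency per feature
panel = np.argsort(stability)[::-1][:20]    # top-20 most stable features
```

Features selected in more than half of the resamples (the >50% threshold mentioned above) are robust to sampling noise; one-off LASSO fits on a single train/test split routinely include features that vanish under resampling.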


What This Demonstrates

Multi-Agent Coordination

No single agent could produce these analyses. The proteomics_expert handled data import, quality assessment, filtering, and normalization across all three workflows. The proteomics_de_analysis_expert ran differential expression testing with Welch t-test and FDR correction, queried Enrichr for pathway enrichment, and interfaced with the STRING REST API for protein interaction networks. The biomarker_discovery_expert implemented LASSO regularization with bootstrap stability selection and nested cross-validation with proper scaling and hyperparameter tuning within folds. The supervisor routed each sub-question to the appropriate specialist and synthesized results across all turns.

Database Integration

The agents queried Enrichr (GO/Reactome pathways) and STRING (protein-protein interactions) programmatically through validated API tools — not through LLM approximation. Statistical testing (Welch t-test, FDR correction) and machine learning (LASSO, logistic regression, nested CV) ran locally via scipy and scikit-learn with exact reproducibility.

Rigorous Methodology

The hard workflow demonstrates publication-quality methodology:

  • Bootstrap stability selection (100 iterations) ensures robust feature selection not driven by random split artifacts
  • Nested cross-validation (5 outer folds x 3 inner folds) prevents data leakage by tuning hyperparameters only on training data
  • KNN imputation (k=5) preserves biological structure in missing values
  • L2-regularized logistic regression prevents overfitting on high-dimensional data

These methodological details are often mishandled in manual analyses, leading to inflated performance estimates that fail external validation.

KNN imputation assumes missing-at-random (MAR) patterns. After abundance-based filtering removed the lowest-intensity proteins (which are predominantly MNAR), the remaining missing values are more compatible with MAR assumptions, making KNN a reasonable choice for the reduced feature set.
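The leakage-safe construction can be sketched with a scikit-learn Pipeline wrapped in nested cross-validation: imputer, scaler, and hyperparameter search are all refit inside each outer training fold, so no test-fold information leaks into preprocessing. Dimensions and signal strength below are invented for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 80, 50
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, :5] += 2.0                          # engineered signal
X[rng.random(size=(n, p)) < 0.1] = np.nan     # ~10% missing values

# All preprocessing lives inside the pipeline, so each fold fits the
# imputer and scaler on its own training split only.
pipe = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="l2", max_iter=1000)),
])
inner = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0]},
                     cv=StratifiedKFold(3), scoring="roc_auc")
outer_auc = cross_val_score(inner, X, y,
                            cv=StratifiedKFold(5), scoring="roc_auc")
print(outer_auc.mean())
```

The common mistake this avoids is imputing and scaling the full matrix once before splitting, which lets test samples influence the imputed values and inflates the reported AUC.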


Human vs Raw LLM vs Lobster AI

Estimates based on these three case study sessions. Human researcher timing assumes manual workflows without automation.

| Task | Human Researcher | Raw LLM | Lobster AI |
| --- | --- | --- | --- |
| Import + QC + filter + normalize CSF data | 1-2 hours | Cannot compute | ~2 min, $0.63 |
| CRC preprocessing + DE + pathway + STRING | 3-4 hours | Approximate, no APIs | ~5 min, $1.35 |
| HGSOC DE + LASSO/stability + nested CV | 1-2 days (code + debug) | Cannot execute ML | ~6 min, $0.74 |
| Total across all three workflows | 2-3 days | Not reliable | ~13 min, $2.72 |

Raw LLMs cannot load files, compute statistics, query APIs, or execute machine learning. They may suggest methods or approximate results, but cannot produce validated outputs. Lobster AI compresses days of manual proteomics work into minutes of conversational queries with full provenance tracking.


Limitations

  • Synthetic data. All three datasets use synthetic protein expression matrices. While protein identifiers map to real UniProt accessions (enabling genuine pathway and network queries), the expression values and group separations are engineered. Results demonstrate methodology, not biological discovery.
  • AUC 1.000 is a synthetic data artifact. Perfect classification does not occur in real clinical proteomics. Published biomarker panels for ovarian cancer achieve discovery AUCs of 0.72-0.88.
  • Paired design analyzed with unpaired test. The CRC dataset contains matched tumor-normal pairs, but Welch's t-test was applied without accounting for the pairing structure. A paired test would be more appropriate.
  • Missing value strategy. MNAR was correctly identified as the dominant missing mechanism, but KNN imputation (an MAR method) was applied to the filtered dataset. This is acceptable after abundance-based filtering but represents a methodological simplification.
  • Stability scores at 100%. The top 8 biomarker candidates all achieved 100% stability across bootstrap rounds, reflecting strong synthetic signal. Real clinical data typically produces stability scores of 40-70% for the top features.
  • No visual outputs. Volcano plots, heatmaps, and PCA score plots would be generated in a full analysis session but are not shown here.

Reproducibility

To reproduce these analyses, install the proteomics package and run the queries sequentially:

pip install 'lobster-ai[full]==1.0.12'

Simple Workflow (CSF QC)

lobster query --session-id alzheimers_csf \
  "Import the DIA-NN proteomics dataset from alzheimer_csf_proteomics.tsv \
   with the sample metadata from alzheimer_csf_metadata.csv. This is a \
   cerebrospinal fluid (CSF) proteomics study comparing Alzheimer's disease \
   patients versus healthy controls. After importing, assess the data quality \
   and give me a comprehensive summary."
lobster query --session-id alzheimers_csf \
  "Preprocess the data: filter proteins with >50% missing and samples with \
   >40% missing, normalize with median + log2, run PCA and Leiden clustering."

Medium Workflow (CRC DE Analysis)

lobster query --session-id crc_proteomics \
  "Import crc_proteomics_maxquant.tsv with metadata from \
   crc_proteomics_metadata.csv. Assess data quality comprehensively."
lobster query --session-id crc_proteomics \
  "Preprocess, then run differential expression comparing Tumor vs Normal, \
   and perform GO and Reactome pathway enrichment."
lobster query --session-id crc_proteomics \
  "Run STRING network analysis on DE proteins (confidence > 700) and age \
   correlation analysis."

Hard Workflow (HGSOC Biomarker Discovery)

lobster query --session-id hgsoc_biomarker \
  "Import hgsoc_clinical_proteomics.tsv with metadata from \
   hgsoc_clinical_metadata.csv. Filter proteins >60% missing, samples >50% \
   missing, normalize with median + log2."
lobster query --session-id hgsoc_biomarker \
  "Run differential expression comparing Chemo_Resistant vs Chemo_Sensitive. \
   Run GO/Reactome pathway enrichment and STRING network analysis."
lobster query --session-id hgsoc_biomarker \
  "Select chemoresistance biomarker panel using LASSO + stability selection \
   (n_features=20, n_iterations=100). Evaluate with nested cross-validation \
   (5 outer folds, 3 inner folds) using logistic regression."

Session continuity via --session-id ensures each turn builds on prior context. Results are stored in the .lobster_workspace/ directory and can be exported with /pipeline export.

