Omics-OS Docs
Tutorials

Proteomics Analysis TutorialPremium

This comprehensive tutorial demonstrates how to analyze both mass spectrometry and affinity proteomics data using Lobster AI's specialized proteomics platfor...

PremiumThis is premium content. Upgrade for full access

This comprehensive tutorial demonstrates how to analyze both mass spectrometry and affinity proteomics data using Lobster AI's specialized proteomics platform with missing value handling, normalization strategies, and publication-ready visualizations.

Overview

In this tutorial, you will learn to:

  • Analyze mass spectrometry proteomics data (DDA/DIA workflows)
  • Process affinity proteomics data (Olink panels, antibody arrays)
  • Handle missing value patterns (MNAR vs MCAR)
  • Perform proteomics-specific quality control
  • Conduct differential protein expression analysis
  • Generate proteomics visualization dashboards
  • Integrate proteomics with transcriptomics data

Prerequisites

  • Lobster AI installed and configured (see Installation Guide)
  • API keys set up in your .env file
  • Understanding of proteomics data characteristics
  • Familiarity with mass spectrometry or affinity proteomics concepts

Part 1: Mass Spectrometry Proteomics Analysis

Tutorial Dataset: DIA-MS Proteomics

We'll use a DIA-MS (Data-Independent Acquisition) dataset with:

  • 24 samples (12 treatment, 12 control)
  • 2,847 proteins quantified
  • Typical 42% missing values
  • MaxQuant output format
  • Peptide-to-protein aggregation

Step 1: Starting MS Proteomics Analysis

# Start Lobster AI
lobster chat
🦞 LOBSTER by Omics-OS
Multi-Agent Bioinformatics Analysis System v2.0

Ready for comprehensive proteomics analysis!

Step 2: Load MS Proteomics Data

Load your MaxQuant proteinGroups.txt file:

🦞 You: "Load the mass spectrometry proteomics data from proteinGroups.txt. This is MaxQuant output from a DIA experiment comparing treatment vs control samples."

Expected Output:

🦞 Lobster: I'll load your MS proteomics data and assess its characteristics...

βœ“ Loaded MaxQuant proteinGroups.txt file
βœ“ Extracted LFQ intensity values for 24 samples
βœ“ Protein annotations parsed (gene names, descriptions)
βœ“ Detected 2,847 protein groups
βœ“ Missing value assessment: 42.3% missing (typical for MS data)
βœ“ Sample metadata inferred from column headers
βœ“ Stored as modality: 'ms_proteomics_lfq'

MS Proteomics Dataset Summary:
- Samples: 24 (treatment n=12, control n=12)
- Proteins: 2,847 protein groups
- Missing values: 42.3% (1,203 complete proteins)
- Dynamic range: 6.2 orders of magnitude
- Data type: Label-free quantification (LFQ)
- Platform: DIA-MS (MaxQuant processed)

Step 3: MS Proteomics Quality Control

Assess data quality and missing value patterns:

🦞 You: "Perform comprehensive quality control including missing value pattern analysis, sample correlation assessment, and batch effect detection"

Expected Output:

🦞 Lobster: I'll perform MS proteomics-specific quality control analysis...

βœ“ Missing value pattern analysis completed
βœ“ MNAR vs MCAR classification performed
βœ“ Sample correlation matrix calculated
βœ“ Coefficient of variation analysis finished
βœ“ Batch effect assessment completed
βœ“ Protein abundance distribution analyzed
βœ“ Generated comprehensive QC dashboard

Quality Control Results:

Missing Value Analysis:
- Missing at Random (MAR): 18.7% of missing values
- Missing Not at Random (MNAR): 81.3% of missing values
- Low abundance bias confirmed (typical for MS data)
- Missing values concentrated in low-abundance proteins

Sample Quality Metrics:
- Protein identifications per sample: 1,456 - 2,203
- Sample correlations: 0.78 - 0.94 (good)
- Coefficient of variation (median): 23.4% (acceptable)
- Outlier samples: 0 (all samples passed QC)

Batch Effects:
- Batch effect detected: MINIMAL (3.2% variance explained)
- No systematic bias observed
- No correction needed

Step 4: MS Data Preprocessing

Preprocess with appropriate normalization and filtering:

🦞 You: "Filter low-quality proteins, apply log2 transformation, perform median normalization, and handle missing values using appropriate imputation strategies"

Expected Output:

🦞 Lobster: I'll preprocess your MS proteomics data with platform-appropriate methods...

βœ“ Protein filtering applied (valid values in β‰₯50% of samples)
βœ“ Log2 transformation performed
βœ“ Median normalization completed (sample-wise centering)
βœ“ Missing value imputation: MNAR (left-shifted normal distribution)
βœ“ Missing value imputation: MAR (k-nearest neighbors)
βœ“ Data scaling and standardization applied
βœ“ Stored processed data as: 'ms_proteomics_lfq_processed'

Preprocessing Results:
- Proteins retained: 2,847 β†’ 1,923 proteins (67.5%)
- Missing values before imputation: 42.3%
- Missing values after imputation: 0% (all imputed)
- Normalization method: Median centering + log2
- Imputation strategy: Hybrid (MNAR + MAR)
- Quality score: EXCELLENT (ready for analysis)

Step 5: MS Differential Protein Expression

Perform statistical testing for differential expression:

🦞 You: "Run differential protein expression analysis comparing treatment vs control using linear models with empirical Bayes moderation and FDR correction"

Expected Output:

🦞 Lobster: I'll perform differential protein expression analysis using statistical models optimized for proteomics data...

βœ“ Linear model fitted with empirical Bayes moderation
βœ“ Statistical testing completed for treatment vs control
βœ“ Multiple testing correction applied (Benjamini-Hochberg FDR)
βœ“ Effect size calculations completed
βœ“ Confidence intervals estimated
βœ“ Protein annotation enriched
βœ“ Stored results as: 'ms_proteomics_lfq_differential_results'

Differential Expression Results:

Treatment vs Control:
- Total proteins tested: 1,923 proteins
- Significantly DE proteins (FDR < 0.05): 347 proteins (18.0%)
- Up-regulated: 198 proteins (log2FC > 0)
- Down-regulated: 149 proteins (log2FC < 0)
- Strong changes (|log2FC| > 1): 89 proteins (4.6%)

Top Regulated Proteins:
- Most up-regulated: STAT1 (log2FC = 2.34, FDR = 1.2e-12)
- Most down-regulated: MYC (log2FC = -1.87, FDR = 3.4e-9)
- Highest significance: IFNG (log2FC = 1.45, FDR = 2.1e-15)

Statistical Summary:
- Mean |log2FC|: 0.43
- Median p-value: 0.12
- FDR threshold met: 347/1923 proteins (18.0%)

Step 6: MS Proteomics Visualization

Generate comprehensive visualizations:

🦞 You: "Create proteomics visualization dashboard including volcano plots, missing value heatmaps, protein correlation networks, and QC summary panels"

Expected Output:

🦞 Lobster: I'll create a comprehensive proteomics visualization suite...

βœ“ Volcano plot generated with statistical significance overlay
βœ“ Missing value pattern heatmap created
βœ“ Protein correlation network constructed
βœ“ PCA analysis with sample groupings completed
βœ“ Intensity distribution plots generated
βœ“ QC metrics dashboard assembled
βœ“ Interactive exploration panels created

Generated MS Proteomics Visualizations:
- volcano_plot_ms.html: Interactive volcano plot with protein annotations
- missing_value_heatmap.html: Missing value patterns across samples
- protein_correlation_network.html: NetworkX protein-protein correlations
- pca_samples_ms.html: PCA colored by treatment groups
- intensity_distributions.html: Protein abundance distributions
- cv_analysis.html: Coefficient of variation analysis
- qc_dashboard_ms.html: Comprehensive quality control summary
- comprehensive_proteomics_dashboard.html: Multi-panel interactive dashboard

Alternative: Spectronaut DIA-MS Analysis

If you're using Spectronaut (Biognosys) instead of MaxQuant, Lobster AI provides native support for both long-format and matrix-format Spectronaut exports.

Step 2a: Load Spectronaut Data

🦞 You: "Load the Spectronaut DIA proteomics data from spectronaut_report.tsv. Apply Q-value filtering at 0.01 and log2 transformation."

Expected Output:

🦞 Lobster: I'll load your Spectronaut DIA-MS data with automatic format detection...

βœ“ Detected Spectronaut long format file
βœ“ Identified sample column: R.FileName
βœ“ Using quantity column: PG.Quantity (linear scale)
βœ“ Applied Q-value filtering (threshold = 0.01)
βœ“ Removed 347 rows with Q-value > 0.01
βœ“ Log2 transformation applied (pseudocount = 1.0)
βœ“ Detected 24 samples from cleaned filenames
βœ“ Identified 2,634 protein groups
βœ“ Filtered contaminants: 45 proteins (CON__, KERATIN patterns)
βœ“ Filtered reverse hits: 12 proteins (REV__ patterns)
βœ“ Stored as modality: 'spectronaut_dia_proteomics'

Spectronaut DIA-MS Dataset Summary:
- Samples: 24 (treatment n=12, control n=12)
- Proteins: 2,577 protein groups (after filtering)
- Missing values: 38.4% (typical for DIA-MS)
- Q-value filtered: 347 protein measurements removed
- Data type: DIA-MS quantification (log2-transformed)
- Platform: Spectronaut (Biognosys)
- Contaminants removed: 45 proteins
- Reverse hits removed: 12 proteins

Key Advantages of Spectronaut Parser:

  • Automatic Format Detection: Handles both long-format and matrix-format exports
  • Built-in Q-value Filtering: FDR control at protein group level
  • Protein Group Support: Correctly handles semicolon-separated protein IDs
  • Sample Name Cleaning: Removes .raw, .d, .wiff file extensions automatically
  • Gene Symbol Indexing: Uses gene names for biologist-friendly output

Alternative Loading Options:

# Matrix format with custom parameters
🦞 You: "Load spectronaut_matrix.xlsx using PG.Normalised column with Q-value threshold 0.05"

# Long format with relaxed filtering for exploratory analysis
🦞 You: "Load spectronaut_report.csv with Q-value 0.05 and keep contaminants for QC review"

# Without log transformation (for downstream tools requiring linear scale)
🦞 You: "Load spectronaut_data.tsv without log transformation"

After loading Spectronaut data, continue with the same workflow:

  • Step 3: Quality Control (same QC metrics apply)
  • Step 4: Preprocessing (already includes log2 transformation)
  • Step 5: Differential Expression (same statistical methods)
  • Step 6: Visualization (same plotting functions)

Part 2: Affinity Proteomics Analysis

Now we'll analyze an Olink targeted proteomics panel with:

  • 96 samples (4 conditions Γ— 24 samples each)
  • 384 proteins (Olink Explore panel)
  • Low missing values (18%)
  • NPX (Normalized Protein eXpression) values
  • High technical reproducibility

Step 7: Load Affinity Proteomics Data

🦞 You: "Load the Olink proteomics data from the NPX file. This is targeted proteomics with 384 proteins across 96 samples from 4 different conditions."

Expected Output:

🦞 Lobster: I'll load your Olink affinity proteomics data...

βœ“ Loaded Olink NPX data file
βœ“ Extracted normalized protein expression values
βœ“ Sample metadata parsed from headers
βœ“ Protein annotations imported (UniProt IDs, gene names)
βœ“ Quality flags assessed for all measurements
βœ“ Stored as modality: 'olink_affinity_proteomics'

Affinity Proteomics Dataset Summary:
- Samples: 96 (4 conditions Γ— 24 samples each)
- Proteins: 384 targeted proteins (Olink Explore panel)
- Missing values: 18.2% (excellent for affinity proteomics)
- Data type: NPX (Normalized Protein eXpression)
- Platform: Olink Proximity Extension Assay (PEA)
- Technical CV: 8.7% median (excellent reproducibility)

Step 8: Affinity Proteomics Quality Assessment

🦞 You: "Assess the quality of the Olink data including antibody performance metrics, coefficient of variation analysis, and cross-sample correlations"

Expected Output:

🦞 Lobster: I'll perform affinity proteomics-specific quality assessment...

βœ“ Antibody validation metrics calculated
βœ“ Technical reproducibility assessment completed
βœ“ Cross-sample correlation analysis performed
βœ“ Detection frequency analysis finished
βœ“ Dynamic range assessment completed
βœ“ Quality flag validation performed

Affinity Proteomics Quality Results:

Antibody Performance:
- High-performance antibodies: 312/384 (81.3%)
- Medium-performance antibodies: 58/384 (15.1%)
- Low-performance antibodies: 14/384 (3.6%)
- Failed quality checks: 0/384 (0%)

Technical Metrics:
- Median CV within replicates: 8.7% (excellent)
- Detection frequency > 50%: 367/384 proteins (95.6%)
- Dynamic range: 4.8 log2 units (good)
- Cross-sample correlation: 0.89 - 0.97 (excellent)

Sample Quality:
- Outlier samples: 2/96 (2.1%) - flagged for review
- Missing data per sample: 12.3% - 24.1%
- Overall data quality: EXCELLENT

Step 9: Affinity Proteomics Statistical Analysis

🦞 You: "Perform differential expression analysis across the 4 conditions using ANOVA with post-hoc pairwise comparisons and multiple testing correction"

Expected Output:

🦞 Lobster: I'll perform multi-group differential analysis for your Olink data...

βœ“ One-way ANOVA performed across 4 conditions
βœ“ Post-hoc pairwise comparisons completed (6 contrasts)
βœ“ Multiple testing correction applied (FDR < 0.05)
βœ“ Effect size calculations finished
βœ“ Pathway enrichment analysis performed
βœ“ Stored results as: 'olink_affinity_proteomics_anova_results'

Multi-Group Analysis Results:

ANOVA Results (4 conditions):
- Total proteins tested: 384 proteins
- Significantly different (FDR < 0.05): 156 proteins (40.6%)
- Strong effect size (etaΒ² > 0.1): 89 proteins (23.2%)
- Very strong effect (etaΒ² > 0.25): 34 proteins (8.9%)

Pairwise Comparisons (selected):
Condition A vs B: 67 DE proteins
Condition A vs C: 89 DE proteins
Condition A vs D: 124 DE proteins
Condition B vs C: 45 DE proteins
Condition B vs D: 78 DE proteins
Condition C vs D: 56 DE proteins

Top Regulated Proteins:
- Highest F-statistic: IL6 (F = 45.7, FDR = 2.1e-18)
- Largest effect size: TNF (etaΒ² = 0.34, FDR = 5.7e-12)
- Most consistent: CRP (significant in 5/6 comparisons)

Step 10: Affinity Proteomics Visualization

🦞 You: "Create comprehensive visualizations for the Olink data including multi-group comparisons, antibody performance metrics, and pathway analysis results"

Expected Output:

🦞 Lobster: I'll create specialized affinity proteomics visualizations...

βœ“ Multi-group volcano plots generated
βœ“ Antibody performance dashboard created
βœ“ Protein expression profiles across conditions plotted
βœ“ Pathway enrichment network visualization created
βœ“ Sample clustering with condition annotations completed
βœ“ Technical reproducibility plots generated

Generated Affinity Proteomics Visualizations:
- multigroup_volcano.html: Volcano plots for all pairwise comparisons
- antibody_performance_dashboard.html: CV analysis and detection frequencies
- protein_expression_profiles.html: Expression patterns across conditions
- pathway_network.html: Enriched pathways as interactive networks
- sample_clustering_olink.html: Hierarchical clustering with annotations
- technical_reproducibility.html: CV analysis and quality metrics
- comprehensive_olink_dashboard.html: Complete analysis dashboard

Part 3: Multi-Omics Integration

Step 11: Proteomics-Transcriptomics Integration

Integrate proteomics data with RNA-seq results:

🦞 You: "Integrate the MS proteomics results with bulk RNA-seq data from the same samples to identify concordant changes at mRNA and protein levels"

Expected Output:

🦞 Lobster: I'll integrate your proteomics and transcriptomics data for multi-omics analysis...

βœ“ Gene-protein mapping established (UniProt ↔ Ensembl)
βœ“ Sample correspondence verified across platforms
βœ“ Correlation analysis between mRNA and protein levels performed
βœ“ Concordance analysis for differential expression completed
βœ“ Multi-omics pathway analysis performed
βœ“ Integrated visualization dashboard created

Multi-Omics Integration Results:

Gene-Protein Correspondence:
- Matched genes/proteins: 1,456 pairs
- RNA-protein correlation (overall): r = 0.42 (typical)
- High correlation pairs (r > 0.7): 287 genes/proteins (19.7%)
- Low correlation pairs (r < 0.2): 445 genes/proteins (30.5%)

Differential Expression Concordance:
- Both RNA and protein DE (same direction): 89 genes (6.1%)
- RNA DE only: 1,234 genes (84.8%)
- Protein DE only: 258 proteins (17.7%)
- Opposite direction changes: 23 genes (1.6%)

Key Biological Insights:
- Post-translational regulation evident in 258 protein-only changes
- Transcriptional control dominant for 1,234 RNA-only changes
- Concordant changes likely represent core regulatory programs
- Opposite changes suggest negative feedback mechanisms

Step 12: Pathway-Level Integration

🦞 You: "Perform pathway-level integration analysis to identify biological processes that are consistently altered across both transcriptomics and proteomics data"

Expected Output:

🦞 Lobster: I'll perform pathway-level multi-omics integration...

βœ“ Gene set enrichment analysis performed for both datasets
βœ“ Pathway overlap analysis completed
βœ“ Multi-omics pathway scoring calculated
βœ“ Concordance metrics computed for biological processes
βœ“ Integrated pathway networks constructed

Pathway-Level Integration Results:

Consistently Altered Pathways (RNA + Protein):
1. Immune system process - RNA FDR: 2.1e-15, Protein FDR: 5.7e-8
2. Inflammatory response - RNA FDR: 1.2e-12, Protein FDR: 3.4e-6
3. Cytokine signaling - RNA FDR: 6.8e-11, Protein FDR: 1.9e-5

RNA-Specific Pathways:
1. Transcription regulation - RNA FDR: 4.5e-18
2. Cell cycle - RNA FDR: 2.3e-14
3. DNA repair - RNA FDR: 8.9e-12

Protein-Specific Pathways:
1. Protein folding - Protein FDR: 1.2e-9
2. Post-translational modification - Protein FDR: 5.6e-7
3. Protein transport - Protein FDR: 2.8e-6

Integration Score: 0.67 (good concordance at pathway level)

Step 13: Export and Reporting

Export comprehensive proteomics analysis results:

🦞 You: "/export proteomics-analysis"

Expected Output:

🦞 Lobster: Exporting comprehensive proteomics analysis results...

βœ“ MS proteomics results exported (differential expression tables)
βœ“ Olink affinity proteomics results exported (ANOVA results)
βœ“ Multi-omics integration results exported
βœ“ Quality control metrics exported
βœ“ Protein annotations and metadata exported
βœ“ Statistical summaries generated
βœ“ Publication-ready figures exported

Exported Proteomics Files:
πŸ“Š MS Proteomics Data:
- ms_differential_results.csv: Differential protein expression results
- ms_normalized_data.csv: Processed protein intensities
- ms_qc_metrics.csv: Quality control summaries

πŸ“Š Affinity Proteomics Data:
- olink_anova_results.csv: Multi-group statistical results
- olink_pairwise_comparisons.csv: All pairwise contrast results
- olink_antibody_performance.csv: Antibody validation metrics

πŸ“Š Multi-Omics Integration:
- multiomics_correlation_results.csv: RNA-protein correlations
- multiomics_pathway_analysis.csv: Integrated pathway results
- gene_protein_concordance.csv: Differential expression concordance

🎨 Visualizations:
- figures/ms_proteomics/: Mass spectrometry visualizations
- figures/affinity_proteomics/: Olink analysis plots
- figures/multiomics/: Integration analysis plots
- figures/publication/: High-resolution publication figures

πŸ“‹ Analysis Documentation:
- proteomics_analysis_summary.txt: Complete analysis summary
- methods_proteomics.txt: Methods description for publication
- analysis_parameters.json: All analysis settings

Advanced Proteomics Analysis Options

Time-Series Proteomics

🦞 You: "Analyze temporal changes in protein expression over 6 time points using spline regression and cluster analysis"

Spatial Proteomics

🦞 You: "Analyze spatial proteomics data to identify subcellular localization changes upon treatment"

Protein Complex Analysis

🦞 You: "Identify protein complexes and analyze how complex compositions change between conditions"

Cross-Platform Validation

🦞 You: "Validate MS proteomics findings using targeted analysis with selected reaction monitoring (SRM)"

Troubleshooting Common Issues

Issue 1: High Missing Values in MS Data

🦞 You: "My MS data has >60% missing values - is this normal?"

Solution: Normal for MS data. Use appropriate MNAR imputation and consider protein filtering strategies.

Issue 2: Poor Sample Correlations

🦞 You: "Sample correlations are below 0.7 - what should I check?"

Solution: Check for batch effects, sample degradation, or technical issues. Consider outlier removal.

Issue 3: Low Protein Coverage

🦞 You: "I'm only detecting 500 proteins in my MS experiment"

Solution: Check sample preparation, LC-MS settings, and database search parameters.

🦞 You: "Several Olink samples failed quality control"

Solution: Check sample integrity, pipetting accuracy, and plate effects.

Best Practices

MS Proteomics

  1. Missing Values: Use MNAR-appropriate imputation for low-abundance proteins
  2. Normalization: Apply median centering before statistical testing
  3. Filtering: Remove proteins with <50% valid values across samples
  4. Statistics: Use empirical Bayes methods for small sample sizes

Affinity Proteomics

  1. Quality Control: Monitor CV values and detection frequencies
  2. Antibody Validation: Check antibody performance metrics regularly
  3. Technical Replicates: Include technical replicates for CV assessment
  4. Cross-Platform: Validate findings across multiple panels when possible

Multi-Omics Integration

  1. Sample Matching: Ensure proper sample correspondence across platforms
  2. Timing: Account for different extraction and processing times
  3. Normalization: Use platform-appropriate normalization methods
  4. Interpretation: Consider biological vs technical correlations

Next Steps

After completing this tutorial, explore:

  1. Single-Cell Tutorial - Single-cell proteomics analysis
  2. Bulk RNA-seq Tutorial - Transcriptomics integration
  3. Advanced Analysis Cookbook - Specialized proteomics workflows
  4. Custom Agent Tutorial - Create proteomics-specific agents

Summary

You have successfully:

  • βœ… Analyzed mass spectrometry proteomics data with proper missing value handling
  • βœ… Processed affinity proteomics data with antibody performance assessment
  • βœ… Performed differential protein expression analysis with appropriate statistics
  • βœ… Created comprehensive proteomics visualization dashboards
  • βœ… Integrated proteomics with transcriptomics data for multi-omics insights
  • βœ… Exported publication-ready results and figures
  • βœ… Learned platform-specific best practices and troubleshooting

This comprehensive workflow demonstrates Lobster AI's advanced proteomics analysis capabilities, handling the unique challenges of protein-level data analysis with professional-grade algorithms and specialized visualizations.

On this page