Omics-OS Docs
Tutorials

Bulk RNA-seq Analysis TutorialPremium

This comprehensive tutorial demonstrates how to perform bulk RNA-seq differential expression analysis using Lobster AI with pyDESeq2 integration, formula-bas...

PremiumThis is premium content. Upgrade for full access

This comprehensive tutorial demonstrates how to perform bulk RNA-seq differential expression analysis using Lobster AI with pyDESeq2 integration, formula-based experimental design, and publication-ready results.

Overview

In this tutorial, you will learn to:

  • Load and process bulk RNA-seq count matrices
  • Design complex experimental formulas using R-style syntax
  • Perform differential expression analysis with pyDESeq2
  • Handle batch effects and complex designs
  • Create publication-quality visualizations
  • Export results for downstream analysis

Prerequisites

  • Lobster AI installed and configured (see Installation Guide)
  • API keys set up in your .env file
  • Understanding of experimental design concepts
  • Familiarity with RNA-seq data formats

Tutorial Dataset

We'll use GSE123456 (example dataset), a bulk RNA-seq experiment with:

  • 24 samples (12 treatment, 12 control)
  • 2 batches (sequencing runs)
  • 2 time points (early, late response)
  • ~20,000 genes quantified
  • Complex design: ~condition + batch + time + condition:time

Step 1: Starting Your Analysis

Start Lobster AI and prepare for bulk RNA-seq analysis:

# Start Lobster AI with enhanced CLI
lobster chat

Welcome screen with professional interface:

🦞 LOBSTER by Omics-OS
Multi-Agent Bioinformatics Analysis System v0.2

Ready for bulk RNA-seq differential expression analysis!

Step 2: Data Loading and Inspection

⚠️ NEW in v0.2+: Quantification files are now loaded directly via CLI /read command (no longer requires agent interaction).

Load per-sample quantification files using the CLI:

/read /path/to/kallisto_output

Or for Salmon:

/read /path/to/salmon_output

Expected Directory Structure:

quantification_output/
β”œβ”€β”€ sample1/
β”‚   └── abundance.tsv  (or quant.sf for Salmon)
β”œβ”€β”€ sample2/
β”‚   └── abundance.tsv
└── sample3/
    └── abundance.tsv

Expected Output:

πŸ“ Detected quantification directory with Kallisto files
πŸ” Auto-detected: Kallisto format
πŸ“Š Merging 24 sample files...

βœ… Successfully loaded quantification data!

πŸ“Š Quantification Tool: Kallisto
πŸ“ Source Directory: /path/to/kallisto_output
πŸ“ˆ Data Shape: 24 samples Γ— 19,847 genes
πŸ’Ύ Modality Created: kallisto_output

Sample IDs:
control_rep1, control_rep2, treatment_rep1, treatment_rep2...

βœ… Ready for analysis!
Use "show me the data status" to view loaded datasets.

After loading, use natural language to interact with the agent:

🦞 You: "Show me the data status and run quality control"

Expected Agent Response:

🦞 Lobster: I'll check the quantification data and run QC...

πŸ“Š Bulk RNA-seq modality 'kallisto_output' ready for analysis:
- Shape: 24 samples Γ— 19,847 genes
- Sample metadata: sample_id, source...
- Quality metrics available

Next Steps:
1. Quality assessment
2. Filtering and normalization
3. Differential expression analysis

Key Features:

  • Direct CLI Loading: Use /read command - no need to ask the agent
  • Automatic Tool Detection: CLI detects Kallisto vs Salmon automatically
  • Per-Sample Merging: Merges abundance.tsv files from each sample subdirectory
  • Correct Orientation: Automatically transposes to samples Γ— genes format
  • Sample Names: Extracted from subdirectory names
  • Quality Validation: Verifies file integrity and consistency

Option B: Loading Count Matrices (Traditional)

Load pre-computed count matrix and sample metadata:

🦞 You: "Load the bulk RNA-seq count matrix from counts.csv and sample metadata from metadata.csv. The data has treatment vs control samples with batch effects."

Expected Output:

🦞 Lobster: I'll load your bulk RNA-seq data and inspect the experimental design...

βœ“ Loaded count matrix: 24 samples Γ— 19,847 genes
βœ“ Loaded sample metadata with experimental factors
βœ“ Detected design factors: condition, batch, time_point
βœ“ Count matrix format validated (integer counts)
βœ“ Sample-metadata correspondence verified
βœ“ Stored as modality: 'bulk_rnaseq_counts'

Experimental Design Summary:
- Conditions: treatment (n=12), control (n=12)
- Batches: batch1 (n=12), batch2 (n=12)
- Time points: early (n=12), late (n=12)
- Design matrix: 24 samples Γ— 4 conditions
- Balanced design: βœ“ (equal samples per group)

Inspect the loaded data:

🦞 You: "/data"

Shows dataset overview including sample counts, design factors, and data quality metrics.

Step 3: Quality Control Assessment

Assess data quality and detect potential issues:

🦞 You: "Perform quality control analysis including sample correlation, batch effect detection, and count distribution assessment"

Expected Output:

🦞 Lobster: I'll perform comprehensive quality control for your bulk RNA-seq data...

βœ“ Sample correlation analysis completed
βœ“ Batch effect assessment performed
βœ“ Count distribution analysis finished
βœ“ Library size normalization factors calculated
βœ“ Gene filtering recommendations generated
βœ“ Generated QC visualization dashboard

Quality Control Results:
- Library sizes: 15.2M - 28.7M counts (good range)
- Sample correlations: 0.85 - 0.98 (excellent)
- Batch effect detected: MODERATE (batch explains 12% variance)
- Low-count genes: 3,247 genes (< 10 counts total)
- Recommended filtering: Remove genes with < 1 CPM in < 3 samples
- Overall data quality: EXCELLENT

Generated QC visualizations:

🦞 You: "/plots"

QC plots include:

  • sample_correlation_heatmap.html - Sample-to-sample correlations
  • library_size_distribution.html - Count distribution per sample
  • pca_batch_effects.html - PCA showing batch and treatment effects
  • count_density_plots.html - Count distribution analysis

Step 4: Experimental Design and Formula Construction

Create the design matrix using R-style formulas:

🦞 You: "Create a design matrix using the formula ~condition + batch + time_point + condition:time_point to model treatment effects, batch effects, and time-dependent responses"

Expected Output:

🦞 Lobster: I'll construct the design matrix using your experimental formula...

βœ“ Formula parsed: ~condition + batch + time_point + condition:time_point
βœ“ Design matrix constructed: 24 samples Γ— 6 coefficients
βœ“ Factor levels identified and ordered
βœ“ Interaction terms calculated
βœ“ Model rank and identifiability verified
βœ“ Contrast matrix prepared for hypothesis testing

Design Matrix Structure:
- Intercept: baseline (control, batch1, early)
- conditiontreatment: main treatment effect
- batchbatch2: batch correction term
- time_pointlate: main time effect
- conditiontreatment:time_pointlate: treatmentΓ—time interaction
- Full rank: βœ“ (all contrasts estimable)

Available Contrasts:
1. Treatment vs Control (main effect)
2. Early vs Late response (time effect)
3. TreatmentΓ—Time interaction
4. Treatment effect at early time point
5. Treatment effect at late time point

Step 5: Differential Expression Analysis

Perform differential expression with pyDESeq2:

🦞 You: "Run differential expression analysis using pyDESeq2 with the design formula. Test for the main treatment effect and the treatmentΓ—time interaction."

Expected Output:

🦞 Lobster: I'll run differential expression analysis using pyDESeq2...

βœ“ pyDESeq2 analysis initialized with design formula
βœ“ Size factor normalization completed
βœ“ Dispersion estimation finished (gene-wise and fitted)
βœ“ Statistical testing performed for specified contrasts
βœ“ Multiple testing correction applied (Benjamini-Hochberg)
βœ“ Results tables generated with effect sizes
βœ“ Stored results as: 'bulk_rnaseq_counts_deseq2_results'

Differential Expression Results:

Main Treatment Effect (Treatment vs Control):
- Total genes tested: 16,600 genes
- Significantly DE genes (FDR < 0.05): 2,847 genes
- Up-regulated: 1,423 genes (log2FC > 0)
- Down-regulated: 1,424 genes (log2FC < 0)
- Largest effect: GENE123 (log2FC = 4.2, FDR = 1.2e-45)

TreatmentΓ—Time Interaction:
- Significantly interacting genes (FDR < 0.05): 456 genes
- Strong interactions (|log2FC| > 1): 187 genes
- Early-specific response: 89 genes
- Late-specific response: 98 genes

Step 6: Multiple Contrast Testing

Test specific biological hypotheses:

🦞 You: "Test additional contrasts: treatment effect at early time point, treatment effect at late time point, and time effect in control samples only"

Expected Output:

🦞 Lobster: I'll test additional biological contrasts using the fitted model...

βœ“ Custom contrast matrices constructed
βœ“ Statistical tests performed for each contrast
βœ“ Effect sizes and confidence intervals calculated
βœ“ Multiple testing correction applied per contrast
βœ“ Comparative analysis between contrasts completed

Additional Contrast Results:

Treatment Effect at Early Time Point:
- DE genes (FDR < 0.05): 1,891 genes
- Average log2FC: +0.73 (treatment higher)
- Top pathway: Immune Response (p = 2.3e-12)

Treatment Effect at Late Time Point:
- DE genes (FDR < 0.05): 3,204 genes
- Average log2FC: +1.15 (stronger response)
- Top pathway: Cell Cycle (p = 5.1e-18)

Time Effect in Control Samples:
- DE genes (FDR < 0.05): 567 genes
- Average log2FC: +0.34 (modest changes)
- Top pathway: Circadian Rhythm (p = 1.8e-8)

Key Finding: Treatment effect amplifies over time (interaction significant)

Step 7: Publication-Quality Visualizations

Generate comprehensive visualization suite:

🦞 You: "Create publication-quality visualizations including volcano plots, MA plots, heatmaps of top DE genes, and PCA plots colored by experimental factors"

Expected Output:

🦞 Lobster: I'll create a comprehensive visualization suite...

βœ“ Volcano plots generated for all contrasts
βœ“ MA plots created showing fold-change vs expression
βœ“ Expression heatmaps for top DE genes created
βœ“ PCA analysis with factor coloring completed
βœ“ Sample clustering dendrogram generated
βœ“ Interactive dashboards created for data exploration
βœ“ All plots optimized for publication (300 DPI, proper fonts)

Generated Visualizations:
- volcano_treatment_main.html: Volcano plot for main treatment effect
- volcano_interaction.html: Volcano plot for treatmentΓ—time interaction
- ma_plot_treatment.html: MA plot showing fold-change distribution
- heatmap_top_genes.html: Expression heatmap (top 100 DE genes)
- pca_experimental_factors.html: PCA colored by condition and batch
- sample_clustering.html: Hierarchical clustering of samples
- gene_expression_profiles.html: Expression trajectories over time
- comprehensive_dashboard.html: Interactive multi-panel dashboard

Step 8: Pathway and Functional Analysis

Perform pathway enrichment analysis:

🦞 You: "Run pathway enrichment analysis on the up-regulated and down-regulated genes using Gene Ontology and KEGG pathways"

Expected Output:

🦞 Lobster: I'll perform pathway enrichment analysis on your DE gene lists...

βœ“ Gene set enrichment analysis completed
βœ“ GO Biological Process enrichment performed
βœ“ KEGG pathway analysis finished
βœ“ Reactome pathway analysis completed
βœ“ Enrichment statistics calculated (hypergeometric test)
βœ“ Multiple testing correction applied (FDR)
βœ“ Pathway visualization networks generated

Pathway Enrichment Results:

Up-regulated Genes (Treatment > Control):
Top GO Terms:
1. Immune system process (GO:0002376) - 134 genes, FDR = 2.1e-23
2. Response to stimulus (GO:0050896) - 298 genes, FDR = 5.7e-19
3. Cell activation (GO:0001775) - 89 genes, FDR = 1.2e-16

Top KEGG Pathways:
1. Cytokine-cytokine receptor interaction (hsa04060) - 45 genes, FDR = 3.4e-12
2. T cell receptor signaling pathway (hsa04660) - 32 genes, FDR = 8.9e-9
3. NF-kappa B signaling pathway (hsa04064) - 28 genes, FDR = 2.1e-8

Down-regulated Genes (Treatment < Control):
Top GO Terms:
1. Metabolic process (GO:0008152) - 287 genes, FDR = 1.7e-15
2. Cellular biosynthetic process (GO:0044249) - 198 genes, FDR = 4.2e-12
3. Gene expression (GO:0010467) - 156 genes, FDR = 7.8e-11

Step 9: Results Export and Reporting

Export comprehensive results:

🦞 You: "/export bulk-rnaseq-results"

Expected Output:

🦞 Lobster: Exporting comprehensive bulk RNA-seq analysis results...

βœ“ DE results tables exported (CSV format)
βœ“ Normalized counts matrix exported
βœ“ Sample metadata with QC metrics exported
βœ“ Statistical summaries generated
βœ“ Pathway enrichment results exported
βœ“ Analysis parameters logged
βœ“ Publication-ready figures exported (SVG + PNG)

Exported Files:
πŸ“Š Data Files:
- de_results_main_treatment.csv: Main treatment effect results
- de_results_interaction.csv: TreatmentΓ—time interaction results
- normalized_counts_matrix.csv: Normalized expression values
- sample_metadata_qc.csv: Sample info with QC metrics

πŸ“ˆ Statistical Summaries:
- analysis_summary.txt: Complete statistical summary
- pathway_enrichment_results.csv: All pathway results
- gene_annotations.csv: Gene symbols and descriptions

🎨 Visualizations:
- figures/publication/: High-resolution publication figures
- figures/interactive/: Interactive HTML plots
- figures/supplementary/: Additional analysis plots

πŸ“‹ Documentation:
- analysis_parameters.json: All analysis settings
- session_log.txt: Complete analysis history
- methods_description.txt: Analysis methods for publication

Step 10: Advanced Analysis Options

Gene Set Analysis with Custom Gene Lists

🦞 You: "Perform gene set enrichment analysis using custom gene sets from literature - analyze if our treatment signature overlaps with known drug response signatures"

Temporal Analysis

🦞 You: "Create temporal expression profiles showing how gene expression changes over time for both treatment and control conditions"

Batch Effect Correction

🦞 You: "Apply ComBat batch correction and re-run the differential expression analysis to compare results with and without batch correction"

Integration with External Databases

🦞 You: "Query the Connectivity Map (CMap) database to identify potential drug compounds that could reverse our treatment signature"

Working with Complex Designs

Multi-Factor Experiments

For experiments with multiple factors (e.g., treatment, time, cell type):

🦞 You: "Design a three-way interaction model: ~treatment * time * celltype with proper contrast specification"

Paired Sample Analysis

For paired/matched sample designs:

🦞 You: "Analyze paired samples using the formula ~patient + condition to account for patient-specific effects"

Time Series Analysis

For temporal experiments with multiple time points:

🦞 You: "Model time as a continuous variable and identify genes with linear, quadratic, or cubic temporal patterns"

Troubleshooting Common Issues

Issue 1: Low Count Genes

🦞 You: "Many genes have very low counts - should I filter them?"

Solution: Filter genes with < 1 CPM in < 3 samples (or minimum group size).

Issue 2: Batch Effects Too Strong

🦞 You: "Batch effects are overwhelming the biological signal"

Solution: Use RUVSeq or sva for batch correction, or include batch as blocking factor.

Issue 3: No Significant Genes

🦞 You: "My analysis found no significantly DE genes"

Solution: Check sample sizes, effect sizes, and consider less stringent FDR thresholds.

Issue 4: Design Matrix Issues

🦞 You: "Getting 'matrix not full rank' error"

Solution: Check for confounded factors or redundant terms in the design.

Best Practices

  1. Sample Size: Minimum n=3 per group, preferably nβ‰₯6 for robust results
  2. Quality Control: Always inspect data before analysis
  3. Multiple Testing: Use FDR correction for genome-wide testing
  4. Effect Size: Report fold changes with statistical significance
  5. Validation: Validate key findings with qRT-PCR or independent datasets
  6. Documentation: Keep detailed records of analysis parameters

Integration with Other Analyses

Convert to Single-Cell Format

🦞 You: "Simulate single-cell data from bulk RNA-seq for method comparison"

Meta-Analysis

🦞 You: "Combine my results with published datasets for meta-analysis"

Proteomics Integration

🦞 You: "Compare my RNA-seq results with proteomics data from the same experiment"

Next Steps

After completing this tutorial, explore:

  1. Proteomics Tutorial - Integrate with proteomics data
  2. Advanced Analysis Cookbook - Specialized workflows
  3. Custom Agent Tutorial - Create specialized analysis agents
  4. Single-Cell Tutorial - Compare bulk vs single-cell approaches

Summary

You have successfully:

  • βœ… Loaded and quality-controlled bulk RNA-seq data
  • βœ… Designed complex experimental models with interactions
  • βœ… Performed differential expression with pyDESeq2
  • βœ… Tested multiple biological contrasts
  • βœ… Generated publication-quality visualizations
  • βœ… Performed pathway enrichment analysis
  • βœ… Exported comprehensive results for publication
  • βœ… Learned advanced analysis strategies

This workflow demonstrates Lobster AI's sophisticated bulk RNA-seq analysis capabilities using natural language interaction with professional-grade statistical methods and visualization tools.

On this page