Bulk RNA-seq Analysis TutorialPremium
This comprehensive tutorial demonstrates how to perform bulk RNA-seq differential expression analysis using Lobster AI with pyDESeq2 integration, formula-bas...
This comprehensive tutorial demonstrates how to perform bulk RNA-seq differential expression analysis using Lobster AI with pyDESeq2 integration, formula-based experimental design, and publication-ready results.
Overview
In this tutorial, you will learn to:
- Load and process bulk RNA-seq count matrices
- Design complex experimental formulas using R-style syntax
- Perform differential expression analysis with pyDESeq2
- Handle batch effects and complex designs
- Create publication-quality visualizations
- Export results for downstream analysis
Prerequisites
- Lobster AI installed and configured (see Installation Guide)
- API keys set up in your
.envfile - Understanding of experimental design concepts
- Familiarity with RNA-seq data formats
Tutorial Dataset
We'll use GSE123456 (example dataset), a bulk RNA-seq experiment with:
- 24 samples (12 treatment, 12 control)
- 2 batches (sequencing runs)
- 2 time points (early, late response)
- ~20,000 genes quantified
- Complex design:
~condition + batch + time + condition:time
Step 1: Starting Your Analysis
Start Lobster AI and prepare for bulk RNA-seq analysis:
# Start Lobster AI with enhanced CLI
lobster chatWelcome screen with professional interface:
π¦ LOBSTER by Omics-OS
Multi-Agent Bioinformatics Analysis System v0.2
Ready for bulk RNA-seq differential expression analysis!Step 2: Data Loading and Inspection
Option A: Loading Kallisto/Salmon Quantification Files (Recommended)
β οΈ NEW in v0.2+: Quantification files are now loaded directly via CLI /read command (no longer requires agent interaction).
Load per-sample quantification files using the CLI:
/read /path/to/kallisto_outputOr for Salmon:
/read /path/to/salmon_outputExpected Directory Structure:
quantification_output/
βββ sample1/
β βββ abundance.tsv (or quant.sf for Salmon)
βββ sample2/
β βββ abundance.tsv
βββ sample3/
βββ abundance.tsvExpected Output:
π Detected quantification directory with Kallisto files
π Auto-detected: Kallisto format
π Merging 24 sample files...
β
Successfully loaded quantification data!
π Quantification Tool: Kallisto
π Source Directory: /path/to/kallisto_output
π Data Shape: 24 samples Γ 19,847 genes
πΎ Modality Created: kallisto_output
Sample IDs:
control_rep1, control_rep2, treatment_rep1, treatment_rep2...
β
Ready for analysis!
Use "show me the data status" to view loaded datasets.After loading, use natural language to interact with the agent:
π¦ You: "Show me the data status and run quality control"Expected Agent Response:
π¦ Lobster: I'll check the quantification data and run QC...
π Bulk RNA-seq modality 'kallisto_output' ready for analysis:
- Shape: 24 samples Γ 19,847 genes
- Sample metadata: sample_id, source...
- Quality metrics available
Next Steps:
1. Quality assessment
2. Filtering and normalization
3. Differential expression analysisKey Features:
- Direct CLI Loading: Use
/readcommand - no need to ask the agent - Automatic Tool Detection: CLI detects Kallisto vs Salmon automatically
- Per-Sample Merging: Merges abundance.tsv files from each sample subdirectory
- Correct Orientation: Automatically transposes to samples Γ genes format
- Sample Names: Extracted from subdirectory names
- Quality Validation: Verifies file integrity and consistency
Option B: Loading Count Matrices (Traditional)
Load pre-computed count matrix and sample metadata:
π¦ You: "Load the bulk RNA-seq count matrix from counts.csv and sample metadata from metadata.csv. The data has treatment vs control samples with batch effects."Expected Output:
π¦ Lobster: I'll load your bulk RNA-seq data and inspect the experimental design...
β Loaded count matrix: 24 samples Γ 19,847 genes
β Loaded sample metadata with experimental factors
β Detected design factors: condition, batch, time_point
β Count matrix format validated (integer counts)
β Sample-metadata correspondence verified
β Stored as modality: 'bulk_rnaseq_counts'
Experimental Design Summary:
- Conditions: treatment (n=12), control (n=12)
- Batches: batch1 (n=12), batch2 (n=12)
- Time points: early (n=12), late (n=12)
- Design matrix: 24 samples Γ 4 conditions
- Balanced design: β (equal samples per group)Inspect the loaded data:
π¦ You: "/data"Shows dataset overview including sample counts, design factors, and data quality metrics.
Step 3: Quality Control Assessment
Assess data quality and detect potential issues:
π¦ You: "Perform quality control analysis including sample correlation, batch effect detection, and count distribution assessment"Expected Output:
π¦ Lobster: I'll perform comprehensive quality control for your bulk RNA-seq data...
β Sample correlation analysis completed
β Batch effect assessment performed
β Count distribution analysis finished
β Library size normalization factors calculated
β Gene filtering recommendations generated
β Generated QC visualization dashboard
Quality Control Results:
- Library sizes: 15.2M - 28.7M counts (good range)
- Sample correlations: 0.85 - 0.98 (excellent)
- Batch effect detected: MODERATE (batch explains 12% variance)
- Low-count genes: 3,247 genes (< 10 counts total)
- Recommended filtering: Remove genes with < 1 CPM in < 3 samples
- Overall data quality: EXCELLENTGenerated QC visualizations:
π¦ You: "/plots"QC plots include:
sample_correlation_heatmap.html- Sample-to-sample correlationslibrary_size_distribution.html- Count distribution per samplepca_batch_effects.html- PCA showing batch and treatment effectscount_density_plots.html- Count distribution analysis
Step 4: Experimental Design and Formula Construction
Create the design matrix using R-style formulas:
π¦ You: "Create a design matrix using the formula ~condition + batch + time_point + condition:time_point to model treatment effects, batch effects, and time-dependent responses"Expected Output:
π¦ Lobster: I'll construct the design matrix using your experimental formula...
β Formula parsed: ~condition + batch + time_point + condition:time_point
β Design matrix constructed: 24 samples Γ 6 coefficients
β Factor levels identified and ordered
β Interaction terms calculated
β Model rank and identifiability verified
β Contrast matrix prepared for hypothesis testing
Design Matrix Structure:
- Intercept: baseline (control, batch1, early)
- conditiontreatment: main treatment effect
- batchbatch2: batch correction term
- time_pointlate: main time effect
- conditiontreatment:time_pointlate: treatmentΓtime interaction
- Full rank: β (all contrasts estimable)
Available Contrasts:
1. Treatment vs Control (main effect)
2. Early vs Late response (time effect)
3. TreatmentΓTime interaction
4. Treatment effect at early time point
5. Treatment effect at late time pointStep 5: Differential Expression Analysis
Perform differential expression with pyDESeq2:
π¦ You: "Run differential expression analysis using pyDESeq2 with the design formula. Test for the main treatment effect and the treatmentΓtime interaction."Expected Output:
π¦ Lobster: I'll run differential expression analysis using pyDESeq2...
β pyDESeq2 analysis initialized with design formula
β Size factor normalization completed
β Dispersion estimation finished (gene-wise and fitted)
β Statistical testing performed for specified contrasts
β Multiple testing correction applied (Benjamini-Hochberg)
β Results tables generated with effect sizes
β Stored results as: 'bulk_rnaseq_counts_deseq2_results'
Differential Expression Results:
Main Treatment Effect (Treatment vs Control):
- Total genes tested: 16,600 genes
- Significantly DE genes (FDR < 0.05): 2,847 genes
- Up-regulated: 1,423 genes (log2FC > 0)
- Down-regulated: 1,424 genes (log2FC < 0)
- Largest effect: GENE123 (log2FC = 4.2, FDR = 1.2e-45)
TreatmentΓTime Interaction:
- Significantly interacting genes (FDR < 0.05): 456 genes
- Strong interactions (|log2FC| > 1): 187 genes
- Early-specific response: 89 genes
- Late-specific response: 98 genesStep 6: Multiple Contrast Testing
Test specific biological hypotheses:
π¦ You: "Test additional contrasts: treatment effect at early time point, treatment effect at late time point, and time effect in control samples only"Expected Output:
π¦ Lobster: I'll test additional biological contrasts using the fitted model...
β Custom contrast matrices constructed
β Statistical tests performed for each contrast
β Effect sizes and confidence intervals calculated
β Multiple testing correction applied per contrast
β Comparative analysis between contrasts completed
Additional Contrast Results:
Treatment Effect at Early Time Point:
- DE genes (FDR < 0.05): 1,891 genes
- Average log2FC: +0.73 (treatment higher)
- Top pathway: Immune Response (p = 2.3e-12)
Treatment Effect at Late Time Point:
- DE genes (FDR < 0.05): 3,204 genes
- Average log2FC: +1.15 (stronger response)
- Top pathway: Cell Cycle (p = 5.1e-18)
Time Effect in Control Samples:
- DE genes (FDR < 0.05): 567 genes
- Average log2FC: +0.34 (modest changes)
- Top pathway: Circadian Rhythm (p = 1.8e-8)
Key Finding: Treatment effect amplifies over time (interaction significant)Step 7: Publication-Quality Visualizations
Generate comprehensive visualization suite:
π¦ You: "Create publication-quality visualizations including volcano plots, MA plots, heatmaps of top DE genes, and PCA plots colored by experimental factors"Expected Output:
π¦ Lobster: I'll create a comprehensive visualization suite...
β Volcano plots generated for all contrasts
β MA plots created showing fold-change vs expression
β Expression heatmaps for top DE genes created
β PCA analysis with factor coloring completed
β Sample clustering dendrogram generated
β Interactive dashboards created for data exploration
β All plots optimized for publication (300 DPI, proper fonts)
Generated Visualizations:
- volcano_treatment_main.html: Volcano plot for main treatment effect
- volcano_interaction.html: Volcano plot for treatmentΓtime interaction
- ma_plot_treatment.html: MA plot showing fold-change distribution
- heatmap_top_genes.html: Expression heatmap (top 100 DE genes)
- pca_experimental_factors.html: PCA colored by condition and batch
- sample_clustering.html: Hierarchical clustering of samples
- gene_expression_profiles.html: Expression trajectories over time
- comprehensive_dashboard.html: Interactive multi-panel dashboardStep 8: Pathway and Functional Analysis
Perform pathway enrichment analysis:
π¦ You: "Run pathway enrichment analysis on the up-regulated and down-regulated genes using Gene Ontology and KEGG pathways"Expected Output:
π¦ Lobster: I'll perform pathway enrichment analysis on your DE gene lists...
β Gene set enrichment analysis completed
β GO Biological Process enrichment performed
β KEGG pathway analysis finished
β Reactome pathway analysis completed
β Enrichment statistics calculated (hypergeometric test)
β Multiple testing correction applied (FDR)
β Pathway visualization networks generated
Pathway Enrichment Results:
Up-regulated Genes (Treatment > Control):
Top GO Terms:
1. Immune system process (GO:0002376) - 134 genes, FDR = 2.1e-23
2. Response to stimulus (GO:0050896) - 298 genes, FDR = 5.7e-19
3. Cell activation (GO:0001775) - 89 genes, FDR = 1.2e-16
Top KEGG Pathways:
1. Cytokine-cytokine receptor interaction (hsa04060) - 45 genes, FDR = 3.4e-12
2. T cell receptor signaling pathway (hsa04660) - 32 genes, FDR = 8.9e-9
3. NF-kappa B signaling pathway (hsa04064) - 28 genes, FDR = 2.1e-8
Down-regulated Genes (Treatment < Control):
Top GO Terms:
1. Metabolic process (GO:0008152) - 287 genes, FDR = 1.7e-15
2. Cellular biosynthetic process (GO:0044249) - 198 genes, FDR = 4.2e-12
3. Gene expression (GO:0010467) - 156 genes, FDR = 7.8e-11Step 9: Results Export and Reporting
Export comprehensive results:
π¦ You: "/export bulk-rnaseq-results"Expected Output:
π¦ Lobster: Exporting comprehensive bulk RNA-seq analysis results...
β DE results tables exported (CSV format)
β Normalized counts matrix exported
β Sample metadata with QC metrics exported
β Statistical summaries generated
β Pathway enrichment results exported
β Analysis parameters logged
β Publication-ready figures exported (SVG + PNG)
Exported Files:
π Data Files:
- de_results_main_treatment.csv: Main treatment effect results
- de_results_interaction.csv: TreatmentΓtime interaction results
- normalized_counts_matrix.csv: Normalized expression values
- sample_metadata_qc.csv: Sample info with QC metrics
π Statistical Summaries:
- analysis_summary.txt: Complete statistical summary
- pathway_enrichment_results.csv: All pathway results
- gene_annotations.csv: Gene symbols and descriptions
π¨ Visualizations:
- figures/publication/: High-resolution publication figures
- figures/interactive/: Interactive HTML plots
- figures/supplementary/: Additional analysis plots
π Documentation:
- analysis_parameters.json: All analysis settings
- session_log.txt: Complete analysis history
- methods_description.txt: Analysis methods for publicationStep 10: Advanced Analysis Options
Gene Set Analysis with Custom Gene Lists
π¦ You: "Perform gene set enrichment analysis using custom gene sets from literature - analyze if our treatment signature overlaps with known drug response signatures"Temporal Analysis
π¦ You: "Create temporal expression profiles showing how gene expression changes over time for both treatment and control conditions"Batch Effect Correction
π¦ You: "Apply ComBat batch correction and re-run the differential expression analysis to compare results with and without batch correction"Integration with External Databases
π¦ You: "Query the Connectivity Map (CMap) database to identify potential drug compounds that could reverse our treatment signature"Working with Complex Designs
Multi-Factor Experiments
For experiments with multiple factors (e.g., treatment, time, cell type):
π¦ You: "Design a three-way interaction model: ~treatment * time * celltype with proper contrast specification"Paired Sample Analysis
For paired/matched sample designs:
π¦ You: "Analyze paired samples using the formula ~patient + condition to account for patient-specific effects"Time Series Analysis
For temporal experiments with multiple time points:
π¦ You: "Model time as a continuous variable and identify genes with linear, quadratic, or cubic temporal patterns"Troubleshooting Common Issues
Issue 1: Low Count Genes
π¦ You: "Many genes have very low counts - should I filter them?"Solution: Filter genes with < 1 CPM in < 3 samples (or minimum group size).
Issue 2: Batch Effects Too Strong
π¦ You: "Batch effects are overwhelming the biological signal"Solution: Use RUVSeq or sva for batch correction, or include batch as blocking factor.
Issue 3: No Significant Genes
π¦ You: "My analysis found no significantly DE genes"Solution: Check sample sizes, effect sizes, and consider less stringent FDR thresholds.
Issue 4: Design Matrix Issues
π¦ You: "Getting 'matrix not full rank' error"Solution: Check for confounded factors or redundant terms in the design.
Best Practices
- Sample Size: Minimum n=3 per group, preferably nβ₯6 for robust results
- Quality Control: Always inspect data before analysis
- Multiple Testing: Use FDR correction for genome-wide testing
- Effect Size: Report fold changes with statistical significance
- Validation: Validate key findings with qRT-PCR or independent datasets
- Documentation: Keep detailed records of analysis parameters
Integration with Other Analyses
Convert to Single-Cell Format
π¦ You: "Simulate single-cell data from bulk RNA-seq for method comparison"Meta-Analysis
π¦ You: "Combine my results with published datasets for meta-analysis"Proteomics Integration
π¦ You: "Compare my RNA-seq results with proteomics data from the same experiment"Next Steps
After completing this tutorial, explore:
- Proteomics Tutorial - Integrate with proteomics data
- Advanced Analysis Cookbook - Specialized workflows
- Custom Agent Tutorial - Create specialized analysis agents
- Single-Cell Tutorial - Compare bulk vs single-cell approaches
Summary
You have successfully:
- β Loaded and quality-controlled bulk RNA-seq data
- β Designed complex experimental models with interactions
- β Performed differential expression with pyDESeq2
- β Tested multiple biological contrasts
- β Generated publication-quality visualizations
- β Performed pathway enrichment analysis
- β Exported comprehensive results for publication
- β Learned advanced analysis strategies
This workflow demonstrates Lobster AI's sophisticated bulk RNA-seq analysis capabilities using natural language interaction with professional-grade statistical methods and visualization tools.