Single-Cell RNA-seq Analysis Tutorial
This comprehensive tutorial demonstrates how to perform complete single-cell RNA-seq analysis using Lobster AI, from data acquisition to biological interpretation.
Overview
In this tutorial, you will learn to:
- Download single-cell datasets from GEO using natural language
- Perform quality control and filtering
- Normalize and cluster cells
- Identify cell types and marker genes
- Create publication-ready visualizations
- Export and interpret results
Prerequisites
- Lobster AI installed and configured (see Installation Guide)
- API keys set up in your .env file
- Basic understanding of single-cell RNA-seq concepts
Tutorial Dataset
We'll use GSE109564, a well-characterized single-cell dataset of immune cells from the tumor microenvironment. This dataset contains:
- ~5,000 cells
- ~20,000 genes
- Multiple immune cell types
- Excellent for demonstrating clustering and annotation
Step 1: Starting Lobster AI
Start the interactive chat interface with the enhanced CLI:
# Start Lobster AI with Rich CLI interface
lobster chat
You'll see the welcome screen with orange branding:
🦞 lobster chat
┌─ Welcome ────────────────────────────────────────────────────────────┐
│ 🦞 LOBSTER by Omics-OS │
│ Multi-Agent Bioinformatics Analysis System v0.2 │
│ │
│ 🧬 Key Tasks: │
│ • Analyze RNA-seq data │
│ • Generate visualizations and plots │
│ • Extract insights from bioinformatics datasets │
│ • Access GEO & literature databases │
└──────────────────────────────────────────────────────────────────────┘
🦞 ~/projects ▸
Step 2: Download and Load Data
Use natural language to download the dataset:
🦞 You: "Download GSE109564 from GEO - it's a single-cell RNA-seq dataset of immune cells from the tumor microenvironment"
Expected Output:
🦞 Lobster: I'll help you download the GSE109564 dataset from GEO...
✓ Found GSE109564: Single-cell RNA sequencing of tumor-infiltrating immune cells
✓ Dataset contains: 5,063 cells × 20,729 genes
✓ Downloaded and processed count matrix
✓ Added sample metadata and experimental design
✓ Stored as modality: 'geo_gse109564'
Dataset successfully loaded! This appears to be a high-quality single-cell dataset
from tumor-infiltrating immune cells. Ready for analysis.
Verify the data was loaded:
🦞 You: "/data"
This shows current dataset information including cell counts, gene counts, and basic statistics.
Step 3: Quality Assessment
Assess the quality of the loaded dataset:
🦞 You: "Assess the quality of the loaded single-cell data and show me basic statistics"
Expected Output:
🦞 Lobster: I'll perform comprehensive quality assessment of your single-cell data...
✓ Quality metrics calculated for 5,063 cells
✓ Mitochondrial gene percentage: mean 8.2%, max 45%
✓ Ribosomal gene percentage: mean 12.4%
✓ Total UMI count: mean 2,847 per cell
✓ Genes detected: mean 1,234 per cell
✓ Generated quality control visualizations
Quality Assessment Summary:
- 287 cells (5.7%) have high mitochondrial content (>25%)
- 134 cells (2.6%) have very low gene counts (<200 genes)
- 89 cells (1.8%) have extremely high UMI counts (potential doublets)
- Overall data quality: EXCELLENT
- Recommended filtering: Remove 510 low-quality cells
Check the generated plots:
🦞 You: "/plots"
You'll see quality control plots including:
- qc_violin_plot.html - Distribution of QC metrics
- qc_scatter_plot.html - Gene count vs UMI count correlation
- mitochondrial_gene_plot.html - Mitochondrial gene content analysis
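The thresholds in the QC summary above translate into simple per-cell checks. The following is a minimal Python sketch, not Lobster AI's implementation: the function names `qc_metrics` and `passes_qc` and the toy counts are hypothetical, and it assumes human gene symbols in which mitochondrial genes start with `MT-`. The >25% mitochondrial and <200 gene cutoffs come from the expected output; no doublet cutoff is shown because the output does not state one.

```python
# Illustrative sketch only; helper names and toy data are hypothetical.
MITO_PCT_MAX = 25.0  # flag cells above this mitochondrial percentage
MIN_GENES = 200      # flag cells with fewer detected genes than this


def qc_metrics(counts):
    """Compute per-cell QC metrics from a {gene: UMI count} mapping."""
    total = sum(counts.values())
    # Assumption: human mitochondrial genes carry the "MT-" prefix.
    mito = sum(n for gene, n in counts.items() if gene.startswith("MT-"))
    return {
        "total_umis": total,
        "n_genes": sum(1 for n in counts.values() if n > 0),
        "pct_mito": 100.0 * mito / total if total else 0.0,
    }


def passes_qc(metrics, min_genes=MIN_GENES, mito_pct_max=MITO_PCT_MAX):
    """Return True when a cell clears both the gene-count and mito cutoffs."""
    return metrics["n_genes"] >= min_genes and metrics["pct_mito"] <= mito_pct_max


# Toy cells with three genes each, so min_genes is lowered for the demo.
cells = {
    "cell_a": {"CD3D": 6, "CD8A": 3, "MT-CO1": 1},    # 10% mitochondrial
    "cell_b": {"CD3D": 1, "MT-CO1": 4, "MT-ND1": 5},  # 90% mitochondrial
}
kept = [
    name
    for name, counts in cells.items()
    if passes_qc(qc_metrics(counts), min_genes=2)
]
print(kept)  # prints ['cell_a']
```

The `min_genes=2` override exists only so the three-gene toy cells are not all rejected; on real data the default cutoffs would apply to the full count matrix.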
Step 4: Data Preprocessing and Filtering
Filter out low-quality cells and prepare data for analysis:
🦞 You: "Filter low-quality cells based on the QC metrics, normalize the data, and identify highly variable genes"
Expected Output:
🦞 Lobster: I'll preprocess your single-cell data with quality-based filtering...
✓ Filtered cells: 5,063 → 4,553 cells (510 cells removed)
✓ Filtered genes: 20,729 → 18,345 genes (2,384 low-expression genes removed)
✓ Log1p normalization applied
✓ Highly variable genes identified: 2,000 genes selected
✓ Data scaled and centered for downstream analysis
✓ Stored filtered data as: 'geo_gse109564_filtered_normalized'
Preprocessing complete! Your data is now ready for clustering and dimensional reduction.
Quality-filtered dataset: 4,553 high-quality cells × 18,345 genes
Step 5: Clustering and Dimensionality Reduction
Perform clustering to identify cell populations:
🦞 You: "Perform PCA, compute neighbors, run Leiden clustering, and generate UMAP visualization"
Expected Output:
🦞 Lobster: I'll perform clustering and dimensional reduction analysis...
✓ Principal component analysis: 50 components computed
✓ Neighbor graph constructed (15 neighbors)
✓ Leiden clustering completed: 12 clusters identified
✓ UMAP embedding generated (2D projection)
✓ Cluster validation metrics calculated
✓ Stored clustered data as: 'geo_gse109564_clustered'
Clustering Results:
- 12 distinct cell clusters identified
- Silhouette score: 0.73 (excellent separation)
- Largest cluster: 847 cells (18.6%)
- Smallest cluster: 89 cells (2.0%)
- Generated interactive UMAP visualization
View the clustering results:
🦞 You: "/plots"
New visualizations include:
- umap_clusters.html - Interactive UMAP with cluster colors
- cluster_composition.html - Cell count per cluster
- pca_variance_plot.html - Principal component variance explained
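The silhouette score reported in the clustering output summarizes how well clusters are separated. For each cell it compares the mean distance to the other cells in its own cluster (a) with the mean distance to the nearest other cluster (b), as (b - a) / max(a, b). Values near 1 suggest compact, well-separated clusters, and values near or below 0 suggest overlap. The sketch below assumes Euclidean distance on toy 2D coordinates; which embedding and distance Lobster AI uses is not stated in this tutorial, and `silhouette_score` here is an illustrative helper rather than a Lobster AI function.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Mean distance from point i to the given members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention: singleton clusters score 0
            continue
        a = mean_dist(i, own)
        b = min(
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight groups that sit far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
print(f"good={good:.2f} bad={bad:.2f}")  # prints good=0.90 bad=-0.45
```

This pairwise version is quadratic in the number of cells, so it is meant for understanding the metric rather than for scoring thousands of cells.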
Step 6: Marker Gene Identification
Find genes that characterize each cluster:
🦞 You: "Find marker genes for each cluster using differential expression analysis and create a heatmap"
Expected Output:
🦞 Lobster: I'll identify marker genes for each cluster...
✓ Differential expression analysis completed (Wilcoxon rank-sum test)
✓ 2,847 significant marker genes identified (FDR < 0.05)
✓ Top markers per cluster: 25 genes selected
✓ Expression heatmap generated
✓ Marker gene table exported
✓ Stored results as: 'geo_gse109564_markers'
Top Marker Genes by Cluster:
- Cluster 0: CD8A, CD8B, GZMB (CD8+ T cells)
- Cluster 1: CD4, IL7R, CCR7 (CD4+ T cells)
- Cluster 2: CD14, LYZ, S100A9 (Monocytes)
- Cluster 3: CD79A, MS4A1, CD19 (B cells)
- Cluster 4: KLRF1, NCR1, NKG7 (NK cells)
[... additional clusters]
Step 7: Cell Type Annotation
⚠️ Important: Marker Validation Required
CRITICAL: Built-in marker gene templates are preliminary and not scientifically validated.
Before using automatic annotation, you will be prompted to:
- Provide custom validated markers for your specific tissue/context, OR
- Explicitly acknowledge the limitations of built-in preliminary markers
Recommended workflow for production analysis:
# Option A: Provide custom markers (RECOMMENDED)
🦞 You: "I want to annotate cell types using custom markers. Here are my validated markers for PBMC:
- CD8+ T cells: CD3D, CD3E, CD8A, CD8B, GZMK
- CD4+ T cells: CD3D, CD3E, CD4, IL7R, CCR7
- B cells: CD19, MS4A1, CD79A, CD79B
- NK cells: GNLY, NKG7, KLRD1, NCR1, PRF1
- Monocytes: CD14, LYZ, S100A8, S100A9
Please annotate the clusters using these markers."
# Option B: Use reference-based tools (RECOMMENDED)
🦞 You: "Use Azimuth reference-based annotation for PBMC cell types"
# Option C: Acknowledge limitations and use built-in (NOT RECOMMENDED for production)
# The agent will warn you and ask for explicit confirmation
Why custom markers matter:
- Built-in templates lack evidence scoring (AUC, logFC, specificity)
- Not validated against reference atlases (Azimuth, CellTypist, HCA)
- May contain mouse genes or activation/injury markers
- SASP/senescence and tumor detection are not reliable
See Manual Annotation Guide for details on providing custom markers.
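To make the idea of marker-based annotation concrete, here is a minimal Python sketch that scores each cluster against the custom marker sets from Option A and picks the best-scoring cell type. It is an illustration under stated assumptions, not Lobster AI's annotation method: `score_clusters` is a hypothetical helper, the per-cluster mean expression values are invented toy numbers, and the score is a plain average with none of the evidence metrics (AUC, logFC, specificity) mentioned above.

```python
# Marker sets copied from the Option A prompt in this tutorial.
MARKER_SETS = {
    "CD8+ T cells": ["CD3D", "CD3E", "CD8A", "CD8B", "GZMK"],
    "CD4+ T cells": ["CD3D", "CD3E", "CD4", "IL7R", "CCR7"],
    "B cells": ["CD19", "MS4A1", "CD79A", "CD79B"],
    "NK cells": ["GNLY", "NKG7", "KLRD1", "NCR1", "PRF1"],
    "Monocytes": ["CD14", "LYZ", "S100A8", "S100A9"],
}


def score_clusters(cluster_means, marker_sets):
    """Label each cluster with the cell type whose markers average highest.

    cluster_means maps cluster id -> {gene: mean normalized expression}.
    Genes missing from a cluster's mapping are skipped, which can favor
    marker sets with few measured genes; real pipelines handle this more
    carefully.
    """
    annotations = {}
    for cluster, means in cluster_means.items():
        scores = {}
        for cell_type, markers in marker_sets.items():
            present = [means[g] for g in markers if g in means]
            scores[cell_type] = sum(present) / len(present) if present else 0.0
        best = max(scores, key=scores.get)
        annotations[cluster] = (best, scores[best])
    return annotations


# Hypothetical per-cluster mean expression values, for illustration only.
toy_means = {
    "0": {"CD3D": 2.1, "CD3E": 1.9, "CD8A": 2.4, "CD8B": 2.0, "GZMK": 1.6,
          "CD4": 0.1, "IL7R": 0.4, "CCR7": 0.2, "CD14": 0.0, "LYZ": 0.1},
    "2": {"CD14": 2.5, "LYZ": 3.0, "S100A8": 2.2, "S100A9": 2.3, "CD3D": 0.1},
}
annotations = score_clusters(toy_means, MARKER_SETS)
for cluster, (cell_type, score) in annotations.items():
    print(cluster, cell_type, round(score, 2))
```

Because CD8+ and CD4+ T cells share CD3D and CD3E, the lineage-specific genes (CD8A/CD8B versus CD4/IL7R/CCR7) decide between them, which is one reason validated, context-specific marker lists matter.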
Example: Using Built-in Markers (After Acknowledgment)
If you proceed with built-in markers after acknowledging limitations:
🦞 You: "Based on the marker genes, annotate the clusters with immune cell types. Use standard immune cell markers for annotation."
Expected Output:
🦞 Lobster: I'll annotate clusters based on marker gene expression patterns...
✓ Cell type annotation completed using marker genes
✓ 12 clusters annotated with immune cell types
✓ Annotation confidence scores calculated
✓ Updated UMAP with cell type labels
✓ Stored annotated data as: 'geo_gse109564_annotated'
Cell Type Annotations:
- Cluster 0 (847 cells): CD8+ T cells - Cytotoxic
- Cluster 1 (623 cells): CD4+ T cells - Helper
- Cluster 2 (543 cells): Monocytes - Classical
- Cluster 3 (421 cells): B cells - Mature
- Cluster 4 (298 cells): NK cells - Activated
- Cluster 5 (287 cells): CD8+ T cells - Memory
- Cluster 6 (234 cells): Dendritic cells - Conventional
- Cluster 7 (198 cells): CD4+ T cells - Regulatory
- Cluster 8 (156 cells): Macrophages - M1
- Cluster 9 (123 cells): Plasma cells
- Cluster 10 (89 cells): Dendritic cells - Plasmacytoid
- Cluster 11 (67 cells): Neutrophils
Annotation confidence: 94% (high confidence based on canonical markers)
Step 8: Advanced Visualization
Create comprehensive visualizations:
🦞 You: "Create a comprehensive visualization dashboard showing UMAP with cell types, marker gene expression, and cluster statistics"
Expected Output:
🦞 Lobster: I'll create a comprehensive visualization dashboard...
✓ Multi-panel UMAP visualization created
✓ Marker gene expression overlays generated
✓ Cell type composition plots created
✓ Interactive dashboard with filtering options
✓ Cluster statistics summary generated
Generated Visualizations:
- umap_celltypes.html: UMAP colored by cell type
- marker_expression_umap.html: Top marker genes overlaid on UMAP
- celltype_composition.html: Cell type proportions and statistics
- expression_heatmap.html: Marker genes × cell types heatmap
- qc_by_celltype.html: Quality metrics by cell type
- comprehensive_dashboard.html: Interactive multi-panel dashboard
Step 9: Export Results
Export your analysis for further use:
🦞 You: "/export results"
Expected Output:
🦞 Lobster: Exporting analysis results...
✓ Exported annotated data (H5AD format): geo_gse109564_annotated.h5ad
✓ Exported marker genes: marker_genes_table.csv
✓ Exported cell metadata: cell_annotations.csv
✓ Exported cluster statistics: cluster_summary.csv
✓ Exported visualizations: plots/ directory
✓ Analysis provenance: analysis_history.json
Export complete! Files saved to: /path/to/workspace/exports/
Step 10: Biological Interpretation
Ask Lobster to interpret the results:
🦞 You: "Interpret the biological significance of these immune cell populations in the tumor microenvironment context"
Expected Output:
🦞 Lobster: Based on the single-cell analysis, here's the biological interpretation...
🧬 Tumor Microenvironment Analysis:
Key Findings:
1. **Diverse Immune Infiltration**: 11 distinct immune cell types identified
2. **T Cell Dominance**: CD8+ and CD4+ T cells comprise 32% of total cells
3. **Active Immune Response**: High expression of cytotoxic markers (GZMB, PRF1)
4. **Regulatory Elements**: Presence of Tregs (198 cells) may suppress immunity
Clinical Implications:
- High CD8+ T cell infiltration suggests active anti-tumor response
- Presence of M1 macrophages indicates pro-inflammatory environment
- Treg population may limit therapeutic efficacy
- NK cell activation (NKG7+) supports innate immunity
Recommended Follow-up:
- Pseudobulk analysis for population-level statistics
- Trajectory analysis for T cell activation states
- Ligand-receptor analysis for cell-cell communication
Working with the Results
Accessing Your Data
# Check available datasets
🦞 You: "/files"
# Read specific files
🦞 You: "/read marker_genes_table.csv"
# View workspace structure
🦞 You: "/tree"
Advanced Analysis Options
# Convert to pseudobulk for statistical analysis
🦞 You: "Convert the annotated single-cell data to pseudobulk format for differential expression analysis between cell types"
# Perform pathway analysis
🦞 You: "Run pathway enrichment analysis on the marker genes for each cell type"
# Export for external tools
🦞 You: "Export the data in Seurat format for R analysis"
Troubleshooting Common Issues
Issue 1: Download Fails
🦞 You: "The GEO download failed with a timeout error"
Solution: Check internet connection and try smaller datasets first.
Issue 2: Poor Clustering
🦞 You: "The clustering results don't look good - I see poorly separated clusters"
Solution: Adjust resolution parameter or filtering thresholds.
Issue 3: Missing Cell Types
🦞 You: "Some clusters don't have clear cell type annotations"
Solution: Check additional marker genes or use reference-based annotation.
Best Practices
- Quality Control: Always inspect QC metrics before filtering
- Parameter Testing: Try different clustering resolutions for optimal results
- Marker Validation: Verify cell type annotations with literature
- Visualization: Use interactive plots to explore data thoroughly
- Documentation: Export analysis history for reproducibility
Next Steps
After completing this tutorial, consider:
- Bulk RNA-seq Tutorial - Convert to pseudobulk and perform population-level analysis
- Proteomics Tutorial - Integrate with proteomics data
- Advanced Analysis - Trajectory analysis, cell-cell communication
- Custom Workflows - Create specialized analysis agents
Summary
You have successfully:
- ✅ Downloaded and loaded a single-cell dataset from GEO
- ✅ Performed comprehensive quality control
- ✅ Filtered and normalized the data
- ✅ Identified 12 distinct immune cell populations
- ✅ Annotated clusters with biological cell types
- ✅ Generated publication-ready visualizations
- ✅ Exported results for further analysis
- ✅ Interpreted biological significance
This workflow covers a complete single-cell RNA-seq analysis in Lobster AI, from GEO download to biological interpretation, driven by natural language prompts and professional-grade bioinformatics algorithms.