Single-Cell RNA-seq Analysis Tutorial
This comprehensive tutorial demonstrates how to perform complete single-cell RNA-seq analysis using Lobster AI, from data acquisition to biological interpret...
This comprehensive tutorial demonstrates how to perform complete single-cell RNA-seq analysis using Lobster AI, from data acquisition to biological interpretation.
Overview
In this tutorial, you will learn to:
- Download single-cell datasets from GEO using natural language
- Perform quality control and filtering
- Normalize and cluster cells
- Identify cell types and marker genes
- Create publication-ready visualizations
- Export and interpret results
Prerequisites
- Lobster AI installed and configured (see Installation Guide)
- API keys set up in your
.envfile - Basic understanding of single-cell RNA-seq concepts
Tutorial Dataset
We'll use GSE109564, a well-characterized single-cell dataset of immune cells from the tumor microenvironment. This dataset contains:
- ~5,000 cells
- ~20,000 genes
- Multiple immune cell types
- Excellent for demonstrating clustering and annotation
Step 1: Starting Lobster AI
Start the interactive chat interface with the enhanced CLI:
# Start Lobster AI with Rich CLI interface
lobster chatYou'll see the welcome screen with orange branding:
🦞 lobster chat
┌─ Welcome ────────────────────────────────────────────────────────────┐
│ 🦞 LOBSTER by Omics-OS │
│ Multi-Agent Bioinformatics Analysis System v0.2 │
│ │
│ 🧬 Key Tasks: │
│ • Analyze RNA-seq data │
│ • Generate visualizations and plots │
│ • Extract insights from bioinformatics datasets │
│ • Access GEO & literature databases │
└──────────────────────────────────────────────────────────────────────┘
🦞 ~/projects ▸Step 2: Download and Load Data
Use natural language to download the dataset:
🦞 You: "Download GSE109564 from GEO - it's a single-cell RNA-seq dataset of immune cells from the tumor microenvironment"Expected Output:
🦞 Lobster: I'll help you download the GSE109564 dataset from GEO...
✓ Found GSE109564: Single-cell RNA sequencing of tumor-infiltrating immune cells
✓ Dataset contains: 5,063 cells × 20,729 genes
✓ Downloaded and processed count matrix
✓ Added sample metadata and experimental design
✓ Stored as modality: 'geo_gse109564'
Dataset successfully loaded! This appears to be a high-quality single-cell dataset
from tumor-infiltrating immune cells. Ready for analysis.Verify the data was loaded:
🦞 You: "/data"This shows current dataset information including cell counts, gene counts, and basic statistics.
Step 3: Quality Assessment
Assess the quality of the loaded dataset:
🦞 You: "Assess the quality of the loaded single-cell data and show me basic statistics"Expected Output:
🦞 Lobster: I'll perform comprehensive quality assessment of your single-cell data...
✓ Quality metrics calculated for 5,063 cells
✓ Mitochondrial gene percentage: mean 8.2%, max 45%
✓ Ribosomal gene percentage: mean 12.4%
✓ Total UMI count: mean 2,847 per cell
✓ Genes detected: mean 1,234 per cell
✓ Generated quality control visualizations
Quality Assessment Summary:
- 287 cells (5.7%) have high mitochondrial content (>25%)
- 134 cells (2.6%) have very low gene counts (<200 genes)
- 89 cells (1.8%) have extremely high UMI counts (potential doublets)
- Overall data quality: EXCELLENT
- Recommended filtering: Remove 510 low-quality cellsCheck the generated plots:
🦞 You: "/plots"You'll see quality control plots including:
qc_violin_plot.html- Distribution of QC metricsqc_scatter_plot.html- Gene count vs UMI count correlationmitochondrial_gene_plot.html- Mitochondrial gene content analysis
Step 4: Data Preprocessing and Filtering
Filter out low-quality cells and prepare data for analysis:
🦞 You: "Filter low-quality cells based on the QC metrics, normalize the data, and identify highly variable genes"Expected Output:
🦞 Lobster: I'll preprocess your single-cell data with quality-based filtering...
✓ Filtered cells: 5,063 → 4,553 cells (510 cells removed)
✓ Filtered genes: 20,729 → 18,345 genes (2,384 low-expression genes removed)
✓ Log1p normalization applied
✓ Highly variable genes identified: 2,000 genes selected
✓ Data scaled and centered for downstream analysis
✓ Stored filtered data as: 'geo_gse109564_filtered_normalized'
Preprocessing complete! Your data is now ready for clustering and dimensional reduction.
Quality-filtered dataset: 4,553 high-quality cells × 18,345 genesStep 5: Clustering and Dimensionality Reduction
Perform clustering to identify cell populations:
🦞 You: "Perform PCA, compute neighbors, run Leiden clustering, and generate UMAP visualization"Expected Output:
🦞 Lobster: I'll perform clustering and dimensional reduction analysis...
✓ Principal component analysis: 50 components computed
✓ Neighbor graph constructed (15 neighbors)
✓ Leiden clustering completed: 12 clusters identified
✓ UMAP embedding generated (2D projection)
✓ Cluster validation metrics calculated
✓ Stored clustered data as: 'geo_gse109564_clustered'
Clustering Results:
- 12 distinct cell clusters identified
- Silhouette score: 0.73 (excellent separation)
- Largest cluster: 847 cells (18.6%)
- Smallest cluster: 89 cells (2.0%)
- Generated interactive UMAP visualizationView the clustering results:
🦞 You: "/plots"New visualizations include:
umap_clusters.html- Interactive UMAP with cluster colorscluster_composition.html- Cell count per clusterpca_variance_plot.html- Principal component variance explained
Step 6: Marker Gene Identification
Find genes that characterize each cluster:
🦞 You: "Find marker genes for each cluster using differential expression analysis and create a heatmap"Expected Output:
🦞 Lobster: I'll identify marker genes for each cluster...
✓ Differential expression analysis completed (Wilcoxon rank-sum test)
✓ 2,847 significant marker genes identified (FDR < 0.05)
✓ Top markers per cluster: 25 genes selected
✓ Expression heatmap generated
✓ Marker gene table exported
✓ Stored results as: 'geo_gse109564_markers'
Top Marker Genes by Cluster:
- Cluster 0: CD8A, CD8B, GZMB (CD8+ T cells)
- Cluster 1: CD4, IL7R, CCR7 (CD4+ T cells)
- Cluster 2: CD14, LYZ, S100A9 (Monocytes)
- Cluster 3: CD79A, MS4A1, CD19 (B cells)
- Cluster 4: KLRF1, NCR1, NKG7 (NK cells)
[... additional clusters]Step 7: Cell Type Annotation
⚠️ Important: Marker Validation Required
CRITICAL: Built-in marker gene templates are preliminary and not scientifically validated.
Before using automatic annotation, you will be prompted to:
- Provide custom validated markers for your specific tissue/context, OR
- Explicitly acknowledge the limitations of built-in preliminary markers
Recommended workflow for production analysis:
# Option A: Provide custom markers (RECOMMENDED)
🦞 You: "I want to annotate cell types using custom markers. Here are my validated markers for PBMC:
- CD8+ T cells: CD3D, CD3E, CD8A, CD8B, GZMK
- CD4+ T cells: CD3D, CD3E, CD4, IL7R, CCR7
- B cells: CD19, MS4A1, CD79A, CD79B
- NK cells: GNLY, NKG7, KLRD1, NCR1, PRF1
- Monocytes: CD14, LYZ, S100A8, S100A9
Please annotate the clusters using these markers."
# Option B: Use reference-based tools (RECOMMENDED)
🦞 You: "Use Azimuth reference-based annotation for PBMC cell types"
# Option C: Acknowledge limitations and use built-in (NOT RECOMMENDED for production)
# The agent will warn you and ask for explicit confirmationWhy custom markers matter:
- Built-in templates lack evidence scoring (AUC, logFC, specificity)
- Not validated against reference atlases (Azimuth, CellTypist, HCA)
- May contain mouse genes or activation/injury markers
- SASP/senescence and tumor detection are not reliable
See Manual Annotation Guide for details on providing custom markers.
Example: Using Built-in Markers (After Acknowledgment)
If you proceed with built-in markers after acknowledging limitations:
🦞 You: "Based on the marker genes, annotate the clusters with immune cell types. Use standard immune cell markers for annotation."Expected Output:
🦞 Lobster: I'll annotate clusters based on marker gene expression patterns...
✓ Cell type annotation completed using marker genes
✓ 12 clusters annotated with immune cell types
✓ Annotation confidence scores calculated
✓ Updated UMAP with cell type labels
✓ Stored annotated data as: 'geo_gse109564_annotated'
Cell Type Annotations:
- Cluster 0 (847 cells): CD8+ T cells - Cytotoxic
- Cluster 1 (623 cells): CD4+ T cells - Helper
- Cluster 2 (543 cells): Monocytes - Classical
- Cluster 3 (421 cells): B cells - Mature
- Cluster 4 (298 cells): NK cells - Activated
- Cluster 5 (287 cells): CD8+ T cells - Memory
- Cluster 6 (234 cells): Dendritic cells - Conventional
- Cluster 7 (198 cells): CD4+ T cells - Regulatory
- Cluster 8 (156 cells): Macrophages - M1
- Cluster 9 (123 cells): Plasma cells
- Cluster 10 (89 cells): Dendritic cells - Plasmacytoid
- Cluster 11 (67 cells): Neutrophils
Annotation confidence: 94% (high confidence based on canonical markers)Step 8: Advanced Visualization
Create comprehensive visualizations:
🦞 You: "Create a comprehensive visualization dashboard showing UMAP with cell types, marker gene expression, and cluster statistics"Expected Output:
🦞 Lobster: I'll create a comprehensive visualization dashboard...
✓ Multi-panel UMAP visualization created
✓ Marker gene expression overlays generated
✓ Cell type composition plots created
✓ Interactive dashboard with filtering options
✓ Cluster statistics summary generated
Generated Visualizations:
- umap_celltypes.html: UMAP colored by cell type
- marker_expression_umap.html: Top marker genes overlaid on UMAP
- celltype_composition.html: Cell type proportions and statistics
- expression_heatmap.html: Marker genes × cell types heatmap
- qc_by_celltype.html: Quality metrics by cell type
- comprehensive_dashboard.html: Interactive multi-panel dashboardStep 9: Export Results
Export your analysis for further use:
🦞 You: "/export results"Expected Output:
🦞 Lobster: Exporting analysis results...
✓ Exported annotated data (H5AD format): geo_gse109564_annotated.h5ad
✓ Exported marker genes: marker_genes_table.csv
✓ Exported cell metadata: cell_annotations.csv
✓ Exported cluster statistics: cluster_summary.csv
✓ Exported visualizations: plots/ directory
✓ Analysis provenance: analysis_history.json
Export complete! Files saved to: /path/to/workspace/exports/Step 10: Biological Interpretation
Ask Lobster to interpret the results:
🦞 You: "Interpret the biological significance of these immune cell populations in the tumor microenvironment context"Expected Output:
🦞 Lobster: Based on the single-cell analysis, here's the biological interpretation...
🧬 Tumor Microenvironment Analysis:
Key Findings:
1. **Diverse Immune Infiltration**: 11 distinct immune cell types identified
2. **T Cell Dominance**: CD8+ and CD4+ T cells comprise 32% of total cells
3. **Active Immune Response**: High expression of cytotoxic markers (GZMB, PRF1)
4. **Regulatory Elements**: Presence of Tregs (198 cells) may suppress immunity
Clinical Implications:
- High CD8+ T cell infiltration suggests active anti-tumor response
- Presence of M1 macrophages indicates pro-inflammatory environment
- Treg population may limit therapeutic efficacy
- NK cell activation (NKG7+) supports innate immunity
Recommended Follow-up:
- Pseudobulk analysis for population-level statistics
- Trajectory analysis for T cell activation states
- Ligand-receptor analysis for cell-cell communicationWorking with the Results
Accessing Your Data
# Check available datasets
🦞 You: "/files"
# Read specific files
🦞 You: "/read marker_genes_table.csv"
# View workspace structure
🦞 You: "/tree"Advanced Analysis Options
# Convert to pseudobulk for statistical analysis
🦞 You: "Convert the annotated single-cell data to pseudobulk format for differential expression analysis between cell types"
# Perform pathway analysis
🦞 You: "Run pathway enrichment analysis on the marker genes for each cell type"
# Export for external tools
🦞 You: "Export the data in Seurat format for R analysis"Troubleshooting Common Issues
Issue 1: Download Fails
🦞 You: "The GEO download failed with a timeout error"Solution: Check internet connection and try smaller datasets first.
Issue 2: Poor Clustering
🦞 You: "The clustering results don't look good - I see poorly separated clusters"Solution: Adjust resolution parameter or filtering thresholds.
Issue 3: Missing Cell Types
🦞 You: "Some clusters don't have clear cell type annotations"Solution: Check additional marker genes or use reference-based annotation.
Best Practices
- Quality Control: Always inspect QC metrics before filtering
- Parameter Testing: Try different clustering resolutions for optimal results
- Marker Validation: Verify cell type annotations with literature
- Visualization: Use interactive plots to explore data thoroughly
- Documentation: Export analysis history for reproducibility
Next Steps
After completing this tutorial, consider:
- Bulk RNA-seq Tutorial - Convert to pseudobulk and perform population-level analysis
- Proteomics Tutorial - Integrate with proteomics data
- Advanced Analysis - Trajectory analysis, cell-cell communication
- Custom Workflows - Create specialized analysis agents
Summary
You have successfully:
- ✅ Downloaded and loaded a single-cell dataset from GEO
- ✅ Performed comprehensive quality control
- ✅ Filtered and normalized the data
- ✅ Identified 12 distinct immune cell populations
- ✅ Annotated clusters with biological cell types
- ✅ Generated publication-ready visualizations
- ✅ Exported results for further analysis
- ✅ Interpreted biological significance
This complete workflow demonstrates Lobster AI's power for single-cell RNA-seq analysis using natural language interactions and professional-grade bioinformatics algorithms.
Proteomics Analysis Tutorial
This comprehensive tutorial demonstrates how to analyze both mass spectrometry and affinity proteomics data using Lobster AI's specialized proteomics platfor...
Agents API Reference
The Agents API provides specialized AI agents for different analytical domains in bioinformatics. Each agent is designed as an expert in its specific area, o...