Single-Cell RNA-seq Analysis Tutorial


This tutorial walks through a complete single-cell RNA-seq analysis with Lobster AI, from data acquisition to biological interpretation.

Overview

In this tutorial, you will learn to:

- Download single-cell datasets from GEO using natural language
- Perform quality control and filtering
- Normalize and cluster cells
- Identify cell types and marker genes
- Create publication-ready visualizations
- Export and interpret results

Prerequisites

- Lobster AI installed and configured (see Installation Guide)
- API keys set up in your .env file
- Basic understanding of single-cell RNA-seq concepts
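
A quick way to confirm the keys are in place before starting is to parse the .env file yourself. The sketch below assumes plain KEY=VALUE lines. The key names `EXAMPLE_API_KEY` and `OTHER_KEY` are hypothetical placeholders, so substitute whichever keys the Installation Guide asks for.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text, required):
    """Return the required keys that are absent or set to an empty value."""
    env = parse_env(text)
    return [key for key in required if not env.get(key)]


# Hypothetical key names, for illustration only.
sample = "# provider credentials\nEXAMPLE_API_KEY=abc123\nOTHER_KEY=\n"
print(missing_keys(sample, ["EXAMPLE_API_KEY", "OTHER_KEY"]))
```

In practice you would pass the contents of your own .env file, for example `open(".env").read()`, instead of the inline sample.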

Tutorial Dataset

We'll use GSE109564, a single-cell dataset of immune cells from the tumor microenvironment. Its size and composition make it well suited to demonstrating clustering and annotation:

- ~5,000 cells
- ~20,000 genes
- Multiple immune cell types

Step 1: Starting Lobster AI

Start the interactive chat interface with the enhanced CLI:

# Start Lobster AI with Rich CLI interface
lobster chat

You'll see the welcome screen with orange branding:

🦞 lobster chat

┌─ Welcome ────────────────────────────────────────────────────────────┐
│  🦞 LOBSTER by Omics-OS                                             │
│  Multi-Agent Bioinformatics Analysis System v0.2                    │
│                                                                      │
│  🧬 Key Tasks:                                                       │
│  • Analyze RNA-seq data                                             │
│  • Generate visualizations and plots                                │
│  • Extract insights from bioinformatics datasets                    │
│  • Access GEO & literature databases                                │
└──────────────────────────────────────────────────────────────────────┘

🦞 ~/projects ▸

Step 2: Download and Load Data

Use natural language to download the dataset:

🦞 You: "Download GSE109564 from GEO - it's a single-cell RNA-seq dataset of immune cells from the tumor microenvironment"

Expected Output:

🦞 Lobster: I'll help you download the GSE109564 dataset from GEO...

✓ Found GSE109564: Single-cell RNA sequencing of tumor-infiltrating immune cells
✓ Dataset contains: 5,063 cells × 20,729 genes
✓ Downloaded and processed count matrix
✓ Added sample metadata and experimental design
✓ Stored as modality: 'geo_gse109564'

Dataset successfully loaded! This appears to be a high-quality single-cell dataset
from tumor-infiltrating immune cells. Ready for analysis.

Verify the data was loaded:

🦞 You: "/data"

The /data command shows information about the currently loaded dataset, including the cell count, gene count, and basic statistics.

Step 3: Quality Assessment

Assess the quality of the loaded dataset:

🦞 You: "Assess the quality of the loaded single-cell data and show me basic statistics"

Expected Output:

🦞 Lobster: I'll perform comprehensive quality assessment of your single-cell data...

✓ Quality metrics calculated for 5,063 cells
✓ Mitochondrial gene percentage: mean 8.2%, max 45%
✓ Ribosomal gene percentage: mean 12.4%
✓ Total UMI count: mean 2,847 per cell
✓ Genes detected: mean 1,234 per cell
✓ Generated quality control visualizations

Quality Assessment Summary:
- 287 cells (5.7%) have high mitochondrial content (>25%)
- 134 cells (2.6%) have very low gene counts (<200 genes)
- 89 cells (1.8%) have extremely high UMI counts (potential doublets)
- Overall data quality: EXCELLENT
- Recommended filtering: Remove 510 low-quality cells

Check the generated plots:

🦞 You: "/plots"

You'll see quality control plots including:

- qc_violin_plot.html - Distribution of QC metrics
- qc_scatter_plot.html - Gene count vs UMI count correlation
- mitochondrial_gene_plot.html - Mitochondrial gene content analysis
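
To make the QC numbers concrete, here is a minimal sketch of how the per-cell metrics and flags in the summary can be computed. It is not Lobster's implementation. The 25% mitochondrial and 200-gene thresholds come from the summary above; the `MT-` gene prefix and the optional `max_umi` cutoff are assumptions, since the tutorial does not state the UMI threshold used to flag potential doublets.

```python
def cell_qc(counts, genes, mito_prefix="MT-"):
    """Compute QC metrics for one cell from counts aligned with gene names."""
    total = sum(counts)
    n_genes = sum(1 for c in counts if c > 0)
    mito = sum(c for g, c in zip(genes, counts)
               if g.upper().startswith(mito_prefix))
    pct_mito = 100.0 * mito / total if total else 0.0
    return {"total_umi": total, "n_genes": n_genes, "pct_mito": pct_mito}


def qc_flags(metrics, max_pct_mito=25.0, min_genes=200, max_umi=None):
    """Return the reasons a cell would be filtered out (empty list = keep)."""
    flags = []
    if metrics["pct_mito"] > max_pct_mito:
        flags.append("high_mito")
    if metrics["n_genes"] < min_genes:
        flags.append("low_genes")
    # max_umi is an assumed, user-supplied cutoff for potential doublets.
    if max_umi is not None and metrics["total_umi"] > max_umi:
        flags.append("high_umi")
    return flags


# Toy example with four genes; min_genes is lowered to fit the toy data.
genes = ["MT-CO1", "MT-ND1", "CD3D", "LYZ"]
stressed = cell_qc([30, 20, 40, 10], genes)
print(stressed, qc_flags(stressed, min_genes=3))
```

Inspecting the flags per cell before filtering makes it easier to see which threshold removes the most cells.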

Step 4: Data Preprocessing and Filtering

Filter out low-quality cells and prepare data for analysis:

🦞 You: "Filter low-quality cells based on the QC metrics, normalize the data, and identify highly variable genes"

Expected Output:

🦞 Lobster: I'll preprocess your single-cell data with quality-based filtering...

✓ Filtered cells: 5,063 → 4,553 cells (510 cells removed)
✓ Filtered genes: 20,729 → 18,345 genes (2,384 low-expression genes removed)
✓ Log1p normalization applied
✓ Highly variable genes identified: 2,000 genes selected
✓ Data scaled and centered for downstream analysis
✓ Stored filtered data as: 'geo_gse109564_filtered_normalized'

Preprocessing complete! Your data is now ready for clustering and dimensional reduction.
Quality-filtered dataset: 4,553 high-quality cells × 18,345 genes
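
The normalization and gene-selection steps can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the method Lobster runs: the `target_sum` of 10,000 is an assumed scaling factor, and ranking genes by plain variance stands in for the mean-variance modelling that production tools typically use to pick highly variable genes.

```python
import math
from statistics import pvariance


def normalize_log1p(matrix, target_sum=10_000):
    """Scale each cell (row) to target_sum total counts, then apply log1p."""
    normalized = []
    for row in matrix:
        total = sum(row)
        scale = target_sum / total if total else 0.0
        normalized.append([math.log1p(c * scale) for c in row])
    return normalized


def top_variable_genes(matrix, genes, n_top):
    """Rank genes by the variance of their normalized expression across cells."""
    ranked = []
    for j, gene in enumerate(genes):
        column = [row[j] for row in matrix]
        ranked.append((pvariance(column), gene))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [gene for _, gene in ranked[:n_top]]


# Toy counts: 4 cells x 3 genes. ACTB is constant, CD8A varies the most.
genes = ["ACTB", "CD8A", "LYZ"]
counts = [[50, 40, 10], [50, 0, 50], [50, 45, 5], [50, 5, 45]]
print(top_variable_genes(normalize_log1p(counts), genes, n_top=2))
```

Restricting downstream steps to the most variable genes reduces noise from genes that are flat across all cells, which is why the tutorial output keeps only 2,000 of the 18,345 genes for clustering.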

Step 5: Clustering and Dimensionality Reduction

Perform clustering to identify cell populations:

🦞 You: "Perform PCA, compute neighbors, run Leiden clustering, and generate UMAP visualization"

Expected Output:

🦞 Lobster: I'll perform clustering and dimensional reduction analysis...

✓ Principal component analysis: 50 components computed
✓ Neighbor graph constructed (15 neighbors)
✓ Leiden clustering completed: 12 clusters identified
✓ UMAP embedding generated (2D projection)
✓ Cluster validation metrics calculated
✓ Stored clustered data as: 'geo_gse109564_clustered'

Clustering Results:
- 12 distinct cell clusters identified
- Silhouette score: 0.73 (excellent separation)
- Largest cluster: 847 cells (18.6%)
- Smallest cluster: 67 cells (1.5%)
- Generated interactive UMAP visualization

View the clustering results:

🦞 You: "/plots"

New visualizations include:

- umap_clusters.html - Interactive UMAP with cluster colors
- cluster_composition.html - Cell count per cluster
- pca_variance_plot.html - Principal component variance explained
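
The silhouette score reported in the clustering results measures how much closer each cell is to its own cluster than to the nearest other cluster, on a scale from -1 to 1. The sketch below computes it from scratch with Euclidean distance on toy 2D points. Which coordinates Lobster uses for this metric (PCA components or the UMAP embedding) is not stated in this tutorial, so treat the input as an assumption.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_distance(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = mean_distance(i, own)  # cohesion: mean distance within own cluster
        b = min(mean_distance(i, members)  # separation: nearest other cluster
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_score(points, [0, 0, 1, 1]))  # well separated: close to 1
print(silhouette_score(points, [0, 1, 0, 1]))  # mixed labels: negative
```

Comparing this score across several clustering resolutions is one way to choose between candidate cluster counts.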

Step 6: Marker Gene Identification

Find genes that characterize each cluster:

🦞 You: "Find marker genes for each cluster using differential expression analysis and create a heatmap"

Expected Output:

🦞 Lobster: I'll identify marker genes for each cluster...

✓ Differential expression analysis completed (Wilcoxon rank-sum test)
✓ 2,847 significant marker genes identified (FDR < 0.05)
✓ Top markers per cluster: 25 genes selected
✓ Expression heatmap generated
✓ Marker gene table exported
✓ Stored results as: 'geo_gse109564_markers'

Top Marker Genes by Cluster:
- Cluster 0: CD8A, CD8B, GZMB (CD8+ T cells)
- Cluster 1: CD4, IL7R, CCR7 (CD4+ T cells)
- Cluster 2: CD14, LYZ, S100A9 (Monocytes)
- Cluster 3: CD79A, MS4A1, CD19 (B cells)
- Cluster 4: KLRF1, NCR1, NKG7 (NK cells)
[... additional clusters]
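
Two ideas sit behind the marker output above: a rank-based comparison of one cluster against the rest, and a false discovery rate (FDR) correction across many genes. The sketch below illustrates both under simplifying assumptions. `rank_sum_auc` expresses the Mann-Whitney U statistic (the statistic behind the Wilcoxon rank-sum test) as an AUC between 0 and 1, and `benjamini_hochberg` adjusts a list of p-values that you already have. It does not compute the p-values themselves, and it is not Lobster's code.

```python
def rank_sum_auc(in_cluster, rest):
    """Mann-Whitney U for in_cluster vs rest, scaled to AUC = U / (n1 * n2).

    0.5 means no separation; values near 1.0 mean the gene is almost
    always higher inside the cluster. Ties count as half a win.
    """
    wins = 0.0
    for x in in_cluster:
        for y in rest:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(in_cluster) * len(rest))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy expression values for one gene, and toy p-values for four genes.
print(rank_sum_auc([5.0, 6.0, 7.0], [1.0, 2.0, 3.0]))
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([g for g, q in zip(["G1", "G2", "G3", "G4"], adjusted) if q < 0.05])
```

A gene is then reported as a marker when its adjusted p-value falls below the chosen FDR cutoff, which is 0.05 in the output above.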

Step 7: Cell Type Annotation

⚠️ Important: Marker Validation Required

CRITICAL: Built-in marker gene templates are preliminary and not scientifically validated.

Before using automatic annotation, you will be prompted to:

- Provide custom validated markers for your specific tissue/context, OR
- Explicitly acknowledge the limitations of built-in preliminary markers

Recommended workflow for production analysis:

# Option A: Provide custom markers (RECOMMENDED)
🦞 You: "I want to annotate cell types using custom markers. Here are my validated markers for PBMC:
- CD8+ T cells: CD3D, CD3E, CD8A, CD8B, GZMK
- CD4+ T cells: CD3D, CD3E, CD4, IL7R, CCR7
- B cells: CD19, MS4A1, CD79A, CD79B
- NK cells: GNLY, NKG7, KLRD1, NCR1, PRF1
- Monocytes: CD14, LYZ, S100A8, S100A9
Please annotate the clusters using these markers."

# Option B: Use reference-based tools (RECOMMENDED)
🦞 You: "Use Azimuth reference-based annotation for PBMC cell types"

# Option C: Acknowledge limitations and use built-in (NOT RECOMMENDED for production)
# The agent will warn you and ask for explicit confirmation

Why custom markers matter:

- Built-in templates lack evidence scoring (AUC, logFC, specificity)
- Not validated against reference atlases (Azimuth, CellTypist, HCA)
- May contain mouse genes or activation/injury markers
- SASP/senescence and tumor detection are not reliable

See Manual Annotation Guide for details on providing custom markers.
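
To show what annotating with custom markers amounts to, here is a minimal scoring sketch. It assigns each cluster the cell type whose marker set has the highest mean expression in that cluster. The marker sets are the ones from Option A above; the per-cluster mean expression values are made up for illustration, and the scoring rule is an assumption rather than the agent's actual method.

```python
def annotate_clusters(cluster_means, marker_sets):
    """Label each cluster with the best-scoring marker set.

    cluster_means: {cluster_id: {gene: mean expression}}
    marker_sets:   {cell_type: [marker genes]}
    Genes missing from a cluster's means are treated as zero expression.
    """
    labels = {}
    for cluster, means in cluster_means.items():
        scores = {
            cell_type: sum(means.get(g, 0.0) for g in genes) / len(genes)
            for cell_type, genes in marker_sets.items()
        }
        best = max(scores, key=scores.get)
        labels[cluster] = (best, round(scores[best], 3))
    return labels


marker_sets = {
    "CD8+ T cells": ["CD3D", "CD3E", "CD8A", "CD8B", "GZMK"],
    "CD4+ T cells": ["CD3D", "CD3E", "CD4", "IL7R", "CCR7"],
    "B cells": ["CD19", "MS4A1", "CD79A", "CD79B"],
    "NK cells": ["GNLY", "NKG7", "KLRD1", "NCR1", "PRF1"],
    "Monocytes": ["CD14", "LYZ", "S100A8", "S100A9"],
}

# Made-up mean expression values for two clusters.
cluster_means = {
    "0": {"CD3D": 2.0, "CD3E": 1.8, "CD8A": 2.5, "CD8B": 2.2, "GZMK": 1.5,
          "CD4": 0.1, "IL7R": 0.4, "CCR7": 0.2, "LYZ": 0.1},
    "2": {"CD14": 2.4, "LYZ": 3.0, "S100A8": 2.8, "S100A9": 2.9, "CD3D": 0.1},
}
print(annotate_clusters(cluster_means, marker_sets))
```

Because CD3D and CD3E appear in both T cell sets, the lineage-specific genes (CD8A/CD8B versus CD4/IL7R/CCR7) decide between them, which is one reason context-specific validated markers matter.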

Example: Using Built-in Markers (After Acknowledgment)

If you proceed with built-in markers after acknowledging limitations:

🦞 You: "Based on the marker genes, annotate the clusters with immune cell types. Use standard immune cell markers for annotation."

Expected Output:

🦞 Lobster: I'll annotate clusters based on marker gene expression patterns...

✓ Cell type annotation completed using marker genes
✓ 12 clusters annotated with immune cell types
✓ Annotation confidence scores calculated
✓ Updated UMAP with cell type labels
✓ Stored annotated data as: 'geo_gse109564_annotated'

Cell Type Annotations:
- Cluster 0 (847 cells): CD8+ T cells - Cytotoxic
- Cluster 1 (623 cells): CD4+ T cells - Helper
- Cluster 2 (543 cells): Monocytes - Classical
- Cluster 3 (421 cells): B cells - Mature
- Cluster 4 (298 cells): NK cells - Activated
- Cluster 5 (287 cells): CD8+ T cells - Memory
- Cluster 6 (234 cells): Dendritic cells - Conventional
- Cluster 7 (198 cells): CD4+ T cells - Regulatory
- Cluster 8 (156 cells): Macrophages - M1
- Cluster 9 (123 cells): Plasma cells
- Cluster 10 (89 cells): Dendritic cells - Plasmacytoid
- Cluster 11 (67 cells): Neutrophils

Annotation confidence: 94% (high confidence based on canonical markers)

Step 8: Advanced Visualization

Create comprehensive visualizations:

🦞 You: "Create a comprehensive visualization dashboard showing UMAP with cell types, marker gene expression, and cluster statistics"

Expected Output:

🦞 Lobster: I'll create a comprehensive visualization dashboard...

✓ Multi-panel UMAP visualization created
✓ Marker gene expression overlays generated
✓ Cell type composition plots created
✓ Interactive dashboard with filtering options
✓ Cluster statistics summary generated

Generated Visualizations:
- umap_celltypes.html: UMAP colored by cell type
- marker_expression_umap.html: Top marker genes overlaid on UMAP
- celltype_composition.html: Cell type proportions and statistics
- expression_heatmap.html: Marker genes × cell types heatmap
- qc_by_celltype.html: Quality metrics by cell type
- comprehensive_dashboard.html: Interactive multi-panel dashboard

Step 9: Export Results

Export your analysis for further use:

🦞 You: "/export results"

Expected Output:

🦞 Lobster: Exporting analysis results...

✓ Exported annotated data (H5AD format): geo_gse109564_annotated.h5ad
✓ Exported marker genes: marker_genes_table.csv
✓ Exported cell metadata: cell_annotations.csv
✓ Exported cluster statistics: cluster_summary.csv
✓ Exported visualizations: plots/ directory
✓ Analysis provenance: analysis_history.json

Export complete! Files saved to: /path/to/workspace/exports/
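
Once exported, the CSV files can be inspected with a few lines of standard-library code. The sketch below tallies cells per cell type from the contents of cell_annotations.csv. The column names `cell_id` and `cell_type` are assumptions, since this tutorial does not document the export schema, so check the header of your own file first.

```python
import csv
import io
from collections import Counter


def composition(csv_text, column="cell_type"):
    """Return {cell type: (cell count, percent of total)} from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[column] for row in reader)
    total = sum(counts.values())
    return {name: (n, 100.0 * n / total) for name, n in counts.most_common()}


# Inline sample standing in for the exported file (assumed columns).
sample_csv = (
    "cell_id,cell_type\n"
    "c1,B cells\n"
    "c2,NK cells\n"
    "c3,B cells\n"
    "c4,B cells\n"
)
for name, (n, pct) in composition(sample_csv).items():
    print(f"{name}: {n} cells ({pct:.1f}%)")
```

For a real export, read the file from the exports directory and pass its text to `composition`.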

Step 10: Biological Interpretation

Ask Lobster to interpret the results:

🦞 You: "Interpret the biological significance of these immune cell populations in the tumor microenvironment context"

Expected Output:

🦞 Lobster: Based on the single-cell analysis, here's the biological interpretation...

🧬 Tumor Microenvironment Analysis:

Key Findings:
1. **Diverse Immune Infiltration**: 12 distinct immune cell populations identified
2. **T Cell Dominance**: CD8+ and CD4+ T cells comprise 32% of total cells
3. **Active Immune Response**: High expression of cytotoxic markers (GZMB, PRF1)
4. **Regulatory Elements**: Presence of Tregs (198 cells) may suppress immunity

Clinical Implications:
- High CD8+ T cell infiltration suggests active anti-tumor response
- Presence of M1 macrophages indicates pro-inflammatory environment
- Treg population may limit therapeutic efficacy
- NK cell activation (NKG7+) supports innate immunity

Recommended Follow-up:
- Pseudobulk analysis for population-level statistics
- Trajectory analysis for T cell activation states
- Ligand-receptor analysis for cell-cell communication

Working with the Results

Accessing Your Data

# Check available datasets
🦞 You: "/files"

# Read specific files
🦞 You: "/read marker_genes_table.csv"

# View workspace structure
🦞 You: "/tree"

Advanced Analysis Options

# Convert to pseudobulk for statistical analysis
🦞 You: "Convert the annotated single-cell data to pseudobulk format for differential expression analysis between cell types"

# Perform pathway analysis
🦞 You: "Run pathway enrichment analysis on the marker genes for each cell type"

# Export for external tools
🦞 You: "Export the data in Seurat format for R analysis"
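
Pseudobulk conversion, the first option above, means summing raw counts over all cells that share a sample and a cell type, so that each group behaves like one bulk RNA-seq profile. The sketch below shows the aggregation on toy data. The sample identifiers are hypothetical, and how Lobster groups cells internally is not described in this tutorial.

```python
from collections import defaultdict


def pseudobulk(cells, n_genes):
    """Sum raw counts per (sample, cell_type) group.

    cells: iterable of (sample, cell_type, counts) with counts of length n_genes.
    Returns {(sample, cell_type): summed counts}.
    """
    sums = defaultdict(lambda: [0] * n_genes)
    for sample, cell_type, counts in cells:
        bucket = sums[(sample, cell_type)]
        for j, c in enumerate(counts):
            bucket[j] += c
    return dict(sums)


# Toy data: two genes, hypothetical samples s1 and s2.
cells = [
    ("s1", "B cells", [1, 2]),
    ("s1", "B cells", [3, 4]),
    ("s1", "NK cells", [5, 0]),
    ("s2", "B cells", [1, 1]),
]
for group, counts in pseudobulk(cells, n_genes=2).items():
    print(group, counts)
```

Aggregating raw counts rather than normalized values keeps the result compatible with count-based differential expression methods.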

Troubleshooting Common Issues

Issue 1: Download Fails

🦞 You: "The GEO download failed with a timeout error"

Solution: Check your internet connection, and try a smaller dataset first to confirm that downloads work.

Issue 2: Poor Clustering

🦞 You: "The clustering results don't look good - I see poorly separated clusters"

Solution: Adjust the clustering resolution parameter or the QC filtering thresholds, then re-run the clustering step.

Issue 3: Missing Cell Types

🦞 You: "Some clusters don't have clear cell type annotations"

Solution: Check additional marker genes or use reference-based annotation.

Best Practices

- Quality Control: Always inspect QC metrics before filtering
- Parameter Testing: Try different clustering resolutions for optimal results
- Marker Validation: Verify cell type annotations with literature
- Visualization: Use interactive plots to explore data thoroughly
- Documentation: Export analysis history for reproducibility

Next Steps

After completing this tutorial, consider:

- Bulk RNA-seq Tutorial - Convert to pseudobulk and perform population-level analysis
- Proteomics Tutorial - Integrate with proteomics data
- Advanced Analysis - Trajectory analysis, cell-cell communication
- Custom Workflows - Create specialized analysis agents

Summary

You have successfully:

✅ Downloaded and loaded a single-cell dataset from GEO
✅ Performed comprehensive quality control
✅ Filtered and normalized the data
✅ Identified 12 distinct immune cell populations
✅ Annotated clusters with biological cell types
✅ Generated publication-ready visualizations
✅ Exported results for further analysis
✅ Interpreted biological significance

This workflow covers a complete single-cell RNA-seq analysis in Lobster AI, driven by natural-language prompts from download through biological interpretation.
