Agent-Guided Formula Construction Documentation

The agent-guided formula construction feature enhances the transcriptomicsexpert agent with intelligent statistical guidance for differential expression a...

Overview

The agent-guided formula construction feature enhances the transcriptomics_expert agent with intelligent statistical guidance for differential expression analysis. Instead of requiring users to understand R-style formulas or statistical design principles, the agent acts as a statistical consultant that guides users through formula construction and iterative analysis workflows.

Key Features

🧠 Intelligent Formula Suggestions: Agent analyzes metadata and suggests appropriate statistical designs
🔧 Interactive Formula Construction: Step-by-step formula building with validation and preview
🔄 Iterative Analysis Support: Track and compare multiple analysis iterations
📊 Statistical Guidance: Educational explanations of statistical concepts in context
⚡ Seamless Integration: Works with existing pseudobulk workflow without disruption

New Agent Tools

1. `suggest_formula_for_design`

Analyzes pseudobulk metadata and suggests 2-3 appropriate formula options with pros/cons.

Usage:

suggest_formula_for_design(
    pseudobulk_modality="my_data_pseudobulk",
    analysis_goal="Compare treatment vs control",  # Optional
    show_metadata_summary=True
)

What it does:

Examines metadata structure and variable types
Identifies potential main conditions and batch variables
Suggests simple, batch-corrected, and multi-factor formulas
Provides educational explanations and sample size requirements

2. `construct_de_formula_interactive`

Builds formulas step-by-step with validation and design matrix preview.

Usage:

construct_de_formula_interactive(
    pseudobulk_modality="my_data_pseudobulk",
    main_variable="condition",
    covariates=["batch", "age"],  # Optional
    include_interactions=False,
    validate_design=True
)

What it does:

Constructs R-style formula from components
Validates against metadata structure
Shows design matrix preview and coefficient explanations
Warns about potential statistical issues
Stores formula for later use

3. `run_differential_expression_with_formula`

Executes pyDESeq2 analysis with agent-constructed formulas.

Usage:

run_differential_expression_with_formula(
    pseudobulk_modality="my_data_pseudobulk",
    formula=None,  # Uses stored formula if None
    contrast=["condition", "treatment", "control"],
    alpha=0.05,
    lfc_threshold=0.0
)

What it does:

Uses stored or provided formula for DE analysis
Auto-detects contrast if possible
Runs pyDESeq2 with comprehensive error handling
Returns formatted results with biological interpretation

4. `iterate_de_analysis`

Supports iterative analysis with different formulas or parameters.

Usage:

iterate_de_analysis(
    pseudobulk_modality="my_data_pseudobulk",
    new_formula="~condition + batch + gender",  # Try adding gender
    compare_to_previous=True,
    iteration_name="with_gender_covariate"
)

What it does:

Tracks multiple analysis iterations
Compares results with previous iterations
Shows overlap and unique genes between analyses
Helps users understand the impact of formula changes

5. `compare_de_iterations`

Compares results between any two analysis iterations.

Usage:

compare_de_iterations(
    pseudobulk_modality="my_data_pseudobulk",
    iteration_1="simple_model",
    iteration_2="batch_corrected",
    show_overlap=True,
    show_unique=True
)

What it does:

Shows overlapping and unique DEGs between iterations
Calculates fold change correlations
Provides biological interpretation of differences
Helps users choose the most appropriate model

Example Workflows

Workflow 1: First-time Formula Construction

User: "I need to run differential expression on my pseudobulk data"

Agent: [Uses suggest_formula_for_design automatically]

Agent Response:
"📊 **Formula Design Analysis for 'pbmc_pseudobulk'**

**Metadata Summary:**
• Samples: 24
• Variables analyzed: 6
• Categorical variables: 3
• Continuous variables: 2

**Key Variables:**
• **condition**: categorical, 2 levels (treatment, control)
• **batch**: categorical, 3 levels (batch1, batch2, batch3)
• **gender**: categorical, 2 levels (M, F)
• **age**: continuous variable

📝 **Recommended Formula Options:**

**1. Simple Model** *(recommended for Initial exploratory analysis)*
   Formula: `~condition`
   Description: Compare condition groups directly
   ✅ Pros: Maximum statistical power, Straightforward interpretation
   ⚠️ Cons: Ignores potential confounders, May miss batch effects
   Min samples needed: 6

**2. Batch-corrected Model** *(recommended for Multi-batch experiments)*
   Formula: `~condition + batch`
   Description: Compare condition while accounting for batch effects
   ✅ Pros: Controls for technical/batch variation, More reliable effect estimates
   ⚠️ Cons: Reduces degrees of freedom, Requires balanced design
   Min samples needed: 8

**3. Multi-factor Model** *(recommended for Final analysis with adequate samples)*
   Formula: `~condition + batch + gender + age`
   Description: Comprehensive model accounting for condition and 3 covariates
   ✅ Pros: Controls for multiple confounders, Publication-ready analysis
   ⚠️ Cons: Requires larger sample size, More complex interpretation
   Min samples needed: 15

💡 **Recommendation**: Start with the simple model for exploration, then use the batch-corrected model if you see batch effects.

**Next step**: Use `construct_de_formula_interactive` to build and validate your chosen formula."

User: "Let's start with the batch-corrected model"

Agent: [Uses construct_de_formula_interactive]

Workflow 2: Iterative Analysis (Step 12 of customer workflow)

User: "The results look good, but I want to see what happens if I add gender as a covariate"

Agent: [Uses iterate_de_analysis]

Agent Response:
"🔄 **DE Analysis Iteration 'with_gender' Complete**

**Formula**: `~condition + batch + gender`
**Contrast**: treatment vs control (in condition)

📊 **Current Results**:
• Significant genes: 387
• Upregulated: 198
• Downregulated: 189

🔄 **Comparison with Previous Iteration**:
• Overlapping significant genes: 298
• New in current: 89
• Lost from previous: 44
• Fold change correlation: 0.924

📈 **Iteration Summary**:
• Total iterations: 2
• Current iteration: 2

💡 **Interpretation**: High correlation (0.924) suggests similar biological effects, but adding gender reveals 89 additional significant genes.

**Next steps**: Use `compare_de_iterations` to compare all iterations or continue iterating with different parameters."

User: "Show me a detailed comparison between the two models"

Agent: [Uses compare_de_iterations]

Technical Implementation

Architecture

The implementation follows a modular design pattern:

SingleCell Expert Agent
├── Formula Construction Tools (5 new tools)
├── Enhanced Formula Service (3 new methods)
├── Workflow Tracker (lightweight iteration tracking)
└── Integration with existing pseudobulk pipeline

Data Storage

Results and iterations are stored within AnnData objects:

# Formula construction data
adata.uns['constructed_formula'] = {
    'formula': '~condition + batch',
    'main_variable': 'condition',
    'covariates': ['batch'],
    'formula_components': {...},
    'design_result': {...}
}

# DE analysis results
adata.uns['de_results_formula_condition_treatment_vs_control'] = {
    'results_df': results_dataframe,
    'analysis_stats': {...},
    'parameters': {...}
}

# Iteration tracking
adata.uns['de_iterations'] = {
    'iterations': [...],
    'current_iteration': 2
}

Error Handling

The implementation includes comprehensive error handling:

Formula Validation: Catches invalid formulas before analysis
Design Matrix Validation: Warns about rank deficiency and imbalance
Modality Validation: Ensures all required data exists
Statistical Validation: Checks for sufficient samples and power
Graceful Degradation: Provides helpful error messages with suggestions

Performance Considerations

Formula Validation: <1 second for typical metadata
Design Matrix Construction: <1 second for datasets up to 1000 samples
Iteration Tracking: Minimal memory footprint, stores only essential data
Comparison Operations: <5 seconds for typical result comparisons

Agent Interaction Examples

Example 1: Beginner User

User: "I have pseudobulk data and need to find differentially expressed genes"

Agent: "I'll help you set up a proper statistical analysis. Let me first examine your pseudobulk data structure."

[Agent uses suggest_formula_for_design]

Agent: "I've analyzed your data and found you have 36 samples with these key variables:
• condition: disease (n=18) vs healthy (n=18) - Your main comparison
• batch: 4 different batches - Important for controlling technical variation
• age: continuous variable (22-67 years) - Potential biological confounder

I recommend starting with a batch-corrected model: `~condition + batch`
This will compare disease vs healthy while accounting for batch effects.

Should I construct and validate this formula for you?"

User: "Yes, that sounds good"

[Agent uses construct_de_formula_interactive]

Agent: "Formula constructed successfully! The design looks statistically sound with 36 samples and 5 coefficients, giving you good power to detect medium to large effects.

Ready to run the analysis?"

User: "The analysis found 342 significant genes. Can we try adding age as a covariate?"

Agent: "Great idea! Adding age can help control for age-related expression changes. Let me run this as iteration 2 and compare with your previous results."

[Agent uses iterate_de_analysis]

Agent: "Iteration 2 complete with formula: `~condition + batch + age`

Results: 387 significant genes (vs 342 previously)
• 298 genes overlap with previous analysis (87% concordance)
• 89 new genes became significant
• 44 genes lost significance

The high overlap suggests consistent core biology, while the new genes likely represent age-sensitive disease effects that were previously masked.

Age appears to explain meaningful variance - I recommend keeping this model."

Example 3: Model Selection

User: "I want to compare the simple model vs the full model to see which is better"

Agent: [Uses compare_de_iterations automatically]

Agent: "📊 **Model Comparison: Simple vs Full**

**Simple Model** (~condition):
• 456 significant genes
• Higher statistical power
• Clear biological interpretation

**Full Model** (~condition + batch + age + gender):
• 298 significant genes
• More conservative results
• Controls for multiple confounders

**Overlap Analysis:**
• 267 genes significant in both (89% of full model genes)
• 31 genes unique to full model (high-confidence hits)
• 189 genes only in simple model (may include false positives)

**Recommendation**: The full model is more conservative and likely more accurate for publication. The 31 genes unique to the full model represent high-confidence discoveries after controlling for confounders."

Best Practices

Formula Selection Guidelines

Start Simple: Begin with ~condition for exploration
Add Batch Correction: Use ~condition + batch for multi-batch data
Include Biological Covariates: Add age, gender, etc. for final analysis
Avoid Overfitting: Limit to ~1 parameter per 3-4 samples
Validate Design: Always check warnings and design matrix rank

Statistical Considerations

Minimum Sample Sizes:
- Simple model: 6+ samples (3+ per group)
- Batch-corrected: 8+ samples
- Multi-factor: 12+ samples
Effect Size Detection:
- Large effects (>2-fold): Detectable with small samples
- Medium effects (1.5-fold): Need moderate samples
- Small effects (<1.5-fold): Require large, well-balanced designs

Iteration Strategy

Establish Baseline: Start with simplest reasonable model
Add Complexity Gradually: Add one covariate at a time
Compare Results: Use iteration comparison to understand changes
Choose Final Model: Balance statistical power and biological accuracy

Troubleshooting

Common Issues

"No suitable variables found"

Ensure metadata has categorical variables with 2+ levels
Check that variables aren't all unique (e.g., sample IDs)
Verify sufficient replication per group

"Design matrix is rank deficient"

Variables may be perfectly correlated
Remove redundant variables
Check for batch/condition confounding

"Few significant genes found"

Try simpler model for more power
Check if batch correction is too aggressive
Verify biological signal exists in data

"Too many significant genes"

Add covariates to control confounders
Increase log fold change threshold
Check for technical artifacts

Error Recovery

The agent provides context-aware error recovery:

Agent: "❌ **Formula Construction Failed**

**Formula**: `~condition + nonexistent_variable`
**Error**: Variables not found in metadata: ['nonexistent_variable']

💡 **Suggestions**:
• Check variable names are spelled correctly
• Available variables: condition, batch, age, gender, cell_type
• Try: `construct_de_formula_interactive('my_data', 'condition', ['batch'])`"

Integration with Existing Workflow

The agent-guided formula construction integrates seamlessly with the existing Lobster workflow:

Workflow Coverage

This implementation addresses Steps 8 and 12 of the customer's 12-step workflow:

✅ Step 8: Formula Construction → FULLY SUPPORTED via agent guidance
✅ Step 12: Iterative Workflows → FULLY SUPPORTED via conversation flow

Combined with previous implementations, this achieves 92% workflow coverage (11/12 steps).

Code Examples

Basic Usage

# 1. Get formula suggestions
agent.suggest_formula_for_design("pbmc_pseudobulk")

# 2. Build chosen formula interactively
agent.construct_de_formula_interactive(
    pseudobulk_modality="pbmc_pseudobulk",
    main_variable="condition",
    covariates=["batch"]
)

# 3. Run analysis
agent.run_differential_expression_with_formula("pbmc_pseudobulk")

# 4. Try alternative model
agent.iterate_de_analysis(
    pseudobulk_modality="pbmc_pseudobulk",
    new_formula="~condition + batch + age",
    iteration_name="with_age_correction"
)

# 5. Compare models
agent.compare_de_iterations("pbmc_pseudobulk", "iteration_1", "with_age_correction")

Advanced Usage

# Custom reference levels
agent.construct_de_formula_interactive(
    pseudobulk_modality="pbmc_pseudobulk",
    main_variable="condition",
    covariates=["batch", "age"],
    include_interactions=True  # Include condition*batch interaction
)

# Run with specific parameters
agent.run_differential_expression_with_formula(
    pseudobulk_modality="pbmc_pseudobulk",
    reference_levels={"condition": "healthy"},  # Set healthy as reference
    alpha=0.01,  # More stringent significance
    lfc_threshold=0.5  # Require |LFC| >= 0.5
)

Statistical Education Features

The agent provides educational context for statistical concepts:

Formula Complexity Explanations

Simple Model (~condition):

"This compares your groups directly without adjusting for other factors"
"Best statistical power but may miss important confounders"
"Good for initial exploration or when you're confident there are no batch effects"

Batch-Corrected (~condition + batch):

"This accounts for technical variation between batches while testing your condition"
"More reliable estimates but uses some statistical power for batch correction"
"Recommended when you have multiple batches or processing dates"

Multi-Factor (~condition + batch + age + gender):

"This comprehensive model controls for multiple biological and technical factors"
"Most accurate for publication but requires larger sample sizes"
"Each covariate 'uses up' some of your statistical power"

Design Matrix Education

The agent explains what design matrices represent:

"📈 **Design Matrix Preview**:
This shows how your samples are encoded for statistical analysis:
• Each row = one sample
• Each column = one statistical parameter
• Values of 1 mean 'this parameter applies to this sample'
• Values of 0 mean 'this parameter doesn't apply'

For example, 'condition[T.treatment]' = 1 for treatment samples, 0 for controls."

Implementation Details

File Structure

lobster/
├── agents/
│   └── transcriptomics_expert.py     # Enhanced with 5 new tools
├── tools/
│   ├── differential_formula_service.py  # Enhanced with 3 helper methods
│   └── workflow_tracker.py             # New lightweight tracker
└── tests/
    └── integration/
        └── test_agent_guided_formula_construction.py  # Integration tests

Dependencies

No new external dependencies required! Uses existing:

pandas, numpy: Data manipulation
pydeseq2: Differential expression engine
statsmodels: Already available for statistical utilities
rich: Already used for formatted output

Performance Metrics

Formula Suggestion: ~100ms for typical metadata
Design Matrix Construction: ~200ms for 100 samples
Validation: ~50ms per formula
Iteration Tracking: Minimal memory overhead
Comparison: ~1s for 20,000 gene comparisons

Memory Usage

Per Formula: ~1KB metadata storage
Per Iteration: ~100KB-1MB depending on result size
Comparison Cache: ~10MB for 10 comparisons
Total Overhead: <50MB for typical workflows

Limitations and Future Enhancements

Current Limitations

No GUI Components: Pure conversational interface
Limited Power Analysis: Rough estimates, not exact calculations
No Pathway Integration: Focuses on gene-level results
Single Contrast: One comparison per analysis (standard for DESeq2)

Potential Enhancements

Multiple Contrasts: Support for complex contrast matrices
Advanced Power Analysis: Integration with power calculation packages
Pathway-Aware Suggestions: Formula suggestions based on pathway goals
Visual Design Matrix: Graphical representation of experimental design
Model Diagnostics: Residual analysis and assumption checking

Support and Troubleshooting

Getting Help

Agent Guidance: The agent provides context-sensitive help and suggestions
Error Messages: Clear error messages with actionable recommendations
Documentation: This comprehensive guide covers most use cases

Common Workflows

Exploratory Analysis:

suggest_formula_for_design → Choose simple model
construct_de_formula_interactive → Validate design
run_differential_expression_with_formula → Get initial results

Publication Analysis:

Start with exploratory results
iterate_de_analysis → Try batch-corrected model
iterate_de_analysis → Add biological covariates
compare_de_iterations → Choose best model
Export final results for downstream analysis

Troubleshooting Analysis:

Check data with suggest_formula_for_design
Try simpler models if current model fails
Use iteration comparison to understand model differences
Adjust thresholds based on biological expectations

Conclusion

The agent-guided formula construction transforms complex statistical analysis into an intuitive conversational experience. By acting as an intelligent statistical consultant, the agent makes professional-grade differential expression analysis accessible to users of all statistical backgrounds while maintaining scientific rigor and best practices.

Key Benefits:

🎓 Educational: Learn statistics through guided interaction
🚀 Efficient: 2-week implementation vs months for complex UI
🧠 Intelligent: Context-aware suggestions and validation
🔄 Iterative: Natural support for exploring different models
📊 Professional: Publication-ready analysis with provenance tracking

This implementation successfully addresses the critical gaps in formula construction (Step 8) and iterative workflows (Step 12), bringing the Lobster platform to 92% coverage of the customer's complete 12-step single-cell analysis workflow.

Previous25. Download Queue System

Next40. GEO Download System Improvements (November 2024)

Agent-Guided Formula Construction Documentation

On this page