Agent-Guided Formula Construction Documentation
The agent-guided formula construction feature enhances the transcriptomicsexpert agent with intelligent statistical guidance for differential expression a...
Overview
The agent-guided formula construction feature enhances the transcriptomics_expert agent with intelligent statistical guidance for differential expression analysis. Instead of requiring users to understand R-style formulas or statistical design principles, the agent acts as a statistical consultant that guides users through formula construction and iterative analysis workflows.
Key Features
- 🧠 Intelligent Formula Suggestions: Agent analyzes metadata and suggests appropriate statistical designs
- 🔧 Interactive Formula Construction: Step-by-step formula building with validation and preview
- 🔄 Iterative Analysis Support: Track and compare multiple analysis iterations
- 📊 Statistical Guidance: Educational explanations of statistical concepts in context
- ⚡ Seamless Integration: Works with existing pseudobulk workflow without disruption
New Agent Tools
1. suggest_formula_for_design
Analyzes pseudobulk metadata and suggests 2-3 appropriate formula options with pros/cons.
Usage:
suggest_formula_for_design(
pseudobulk_modality="my_data_pseudobulk",
analysis_goal="Compare treatment vs control", # Optional
show_metadata_summary=True
)What it does:
- Examines metadata structure and variable types
- Identifies potential main conditions and batch variables
- Suggests simple, batch-corrected, and multi-factor formulas
- Provides educational explanations and sample size requirements
2. construct_de_formula_interactive
Builds formulas step-by-step with validation and design matrix preview.
Usage:
construct_de_formula_interactive(
pseudobulk_modality="my_data_pseudobulk",
main_variable="condition",
covariates=["batch", "age"], # Optional
include_interactions=False,
validate_design=True
)What it does:
- Constructs R-style formula from components
- Validates against metadata structure
- Shows design matrix preview and coefficient explanations
- Warns about potential statistical issues
- Stores formula for later use
3. run_differential_expression_with_formula
Executes pyDESeq2 analysis with agent-constructed formulas.
Usage:
run_differential_expression_with_formula(
pseudobulk_modality="my_data_pseudobulk",
formula=None, # Uses stored formula if None
contrast=["condition", "treatment", "control"],
alpha=0.05,
lfc_threshold=0.0
)What it does:
- Uses stored or provided formula for DE analysis
- Auto-detects contrast if possible
- Runs pyDESeq2 with comprehensive error handling
- Returns formatted results with biological interpretation
4. iterate_de_analysis
Supports iterative analysis with different formulas or parameters.
Usage:
iterate_de_analysis(
pseudobulk_modality="my_data_pseudobulk",
new_formula="~condition + batch + gender", # Try adding gender
compare_to_previous=True,
iteration_name="with_gender_covariate"
)What it does:
- Tracks multiple analysis iterations
- Compares results with previous iterations
- Shows overlap and unique genes between analyses
- Helps users understand the impact of formula changes
5. compare_de_iterations
Compares results between any two analysis iterations.
Usage:
compare_de_iterations(
pseudobulk_modality="my_data_pseudobulk",
iteration_1="simple_model",
iteration_2="batch_corrected",
show_overlap=True,
show_unique=True
)What it does:
- Shows overlapping and unique DEGs between iterations
- Calculates fold change correlations
- Provides biological interpretation of differences
- Helps users choose the most appropriate model
Example Workflows
Workflow 1: First-time Formula Construction
User: "I need to run differential expression on my pseudobulk data"
Agent: [Uses suggest_formula_for_design automatically]
Agent Response:
"📊 **Formula Design Analysis for 'pbmc_pseudobulk'**
**Metadata Summary:**
• Samples: 24
• Variables analyzed: 6
• Categorical variables: 3
• Continuous variables: 2
**Key Variables:**
• **condition**: categorical, 2 levels (treatment, control)
• **batch**: categorical, 3 levels (batch1, batch2, batch3)
• **gender**: categorical, 2 levels (M, F)
• **age**: continuous variable
📝 **Recommended Formula Options:**
**1. Simple Model** *(recommended for Initial exploratory analysis)*
Formula: `~condition`
Description: Compare condition groups directly
✅ Pros: Maximum statistical power, Straightforward interpretation
⚠️ Cons: Ignores potential confounders, May miss batch effects
Min samples needed: 6
**2. Batch-corrected Model** *(recommended for Multi-batch experiments)*
Formula: `~condition + batch`
Description: Compare condition while accounting for batch effects
✅ Pros: Controls for technical/batch variation, More reliable effect estimates
⚠️ Cons: Reduces degrees of freedom, Requires balanced design
Min samples needed: 8
**3. Multi-factor Model** *(recommended for Final analysis with adequate samples)*
Formula: `~condition + batch + gender + age`
Description: Comprehensive model accounting for condition and 3 covariates
✅ Pros: Controls for multiple confounders, Publication-ready analysis
⚠️ Cons: Requires larger sample size, More complex interpretation
Min samples needed: 15
💡 **Recommendation**: Start with the simple model for exploration, then use the batch-corrected model if you see batch effects.
**Next step**: Use `construct_de_formula_interactive` to build and validate your chosen formula."
User: "Let's start with the batch-corrected model"
Agent: [Uses construct_de_formula_interactive]Workflow 2: Iterative Analysis (Step 12 of customer workflow)
User: "The results look good, but I want to see what happens if I add gender as a covariate"
Agent: [Uses iterate_de_analysis]
Agent Response:
"🔄 **DE Analysis Iteration 'with_gender' Complete**
**Formula**: `~condition + batch + gender`
**Contrast**: treatment vs control (in condition)
📊 **Current Results**:
• Significant genes: 387
• Upregulated: 198
• Downregulated: 189
🔄 **Comparison with Previous Iteration**:
• Overlapping significant genes: 298
• New in current: 89
• Lost from previous: 44
• Fold change correlation: 0.924
📈 **Iteration Summary**:
• Total iterations: 2
• Current iteration: 2
💡 **Interpretation**: High correlation (0.924) suggests similar biological effects, but adding gender reveals 89 additional significant genes.
**Next steps**: Use `compare_de_iterations` to compare all iterations or continue iterating with different parameters."
User: "Show me a detailed comparison between the two models"
Agent: [Uses compare_de_iterations]Technical Implementation
Architecture
The implementation follows a modular design pattern:
SingleCell Expert Agent
├── Formula Construction Tools (5 new tools)
├── Enhanced Formula Service (3 new methods)
├── Workflow Tracker (lightweight iteration tracking)
└── Integration with existing pseudobulk pipelineData Storage
Results and iterations are stored within AnnData objects:
# Formula construction data
adata.uns['constructed_formula'] = {
'formula': '~condition + batch',
'main_variable': 'condition',
'covariates': ['batch'],
'formula_components': {...},
'design_result': {...}
}
# DE analysis results
adata.uns['de_results_formula_condition_treatment_vs_control'] = {
'results_df': results_dataframe,
'analysis_stats': {...},
'parameters': {...}
}
# Iteration tracking
adata.uns['de_iterations'] = {
'iterations': [...],
'current_iteration': 2
}Error Handling
The implementation includes comprehensive error handling:
- Formula Validation: Catches invalid formulas before analysis
- Design Matrix Validation: Warns about rank deficiency and imbalance
- Modality Validation: Ensures all required data exists
- Statistical Validation: Checks for sufficient samples and power
- Graceful Degradation: Provides helpful error messages with suggestions
Performance Considerations
- Formula Validation: <1 second for typical metadata
- Design Matrix Construction: <1 second for datasets up to 1000 samples
- Iteration Tracking: Minimal memory footprint, stores only essential data
- Comparison Operations: <5 seconds for typical result comparisons
Agent Interaction Examples
Example 1: Beginner User
User: "I have pseudobulk data and need to find differentially expressed genes"
Agent: "I'll help you set up a proper statistical analysis. Let me first examine your pseudobulk data structure."
[Agent uses suggest_formula_for_design]
Agent: "I've analyzed your data and found you have 36 samples with these key variables:
• condition: disease (n=18) vs healthy (n=18) - Your main comparison
• batch: 4 different batches - Important for controlling technical variation
• age: continuous variable (22-67 years) - Potential biological confounder
I recommend starting with a batch-corrected model: `~condition + batch`
This will compare disease vs healthy while accounting for batch effects.
Should I construct and validate this formula for you?"
User: "Yes, that sounds good"
[Agent uses construct_de_formula_interactive]
Agent: "Formula constructed successfully! The design looks statistically sound with 36 samples and 5 coefficients, giving you good power to detect medium to large effects.
Ready to run the analysis?"Example 2: Iterative Refinement
User: "The analysis found 342 significant genes. Can we try adding age as a covariate?"
Agent: "Great idea! Adding age can help control for age-related expression changes. Let me run this as iteration 2 and compare with your previous results."
[Agent uses iterate_de_analysis]
Agent: "Iteration 2 complete with formula: `~condition + batch + age`
Results: 387 significant genes (vs 342 previously)
• 298 genes overlap with previous analysis (87% concordance)
• 89 new genes became significant
• 44 genes lost significance
The high overlap suggests consistent core biology, while the new genes likely represent age-sensitive disease effects that were previously masked.
Age appears to explain meaningful variance - I recommend keeping this model."Example 3: Model Selection
User: "I want to compare the simple model vs the full model to see which is better"
Agent: [Uses compare_de_iterations automatically]
Agent: "📊 **Model Comparison: Simple vs Full**
**Simple Model** (~condition):
• 456 significant genes
• Higher statistical power
• Clear biological interpretation
**Full Model** (~condition + batch + age + gender):
• 298 significant genes
• More conservative results
• Controls for multiple confounders
**Overlap Analysis:**
• 267 genes significant in both (89% of full model genes)
• 31 genes unique to full model (high-confidence hits)
• 189 genes only in simple model (may include false positives)
**Recommendation**: The full model is more conservative and likely more accurate for publication. The 31 genes unique to the full model represent high-confidence discoveries after controlling for confounders."Best Practices
Formula Selection Guidelines
- Start Simple: Begin with
~conditionfor exploration - Add Batch Correction: Use
~condition + batchfor multi-batch data - Include Biological Covariates: Add
age,gender, etc. for final analysis - Avoid Overfitting: Limit to ~1 parameter per 3-4 samples
- Validate Design: Always check warnings and design matrix rank
Statistical Considerations
-
Minimum Sample Sizes:
- Simple model: 6+ samples (3+ per group)
- Batch-corrected: 8+ samples
- Multi-factor: 12+ samples
-
Effect Size Detection:
- Large effects (>2-fold): Detectable with small samples
- Medium effects (1.5-fold): Need moderate samples
- Small effects (<1.5-fold): Require large, well-balanced designs
Iteration Strategy
- Establish Baseline: Start with simplest reasonable model
- Add Complexity Gradually: Add one covariate at a time
- Compare Results: Use iteration comparison to understand changes
- Choose Final Model: Balance statistical power and biological accuracy
Troubleshooting
Common Issues
"No suitable variables found"
- Ensure metadata has categorical variables with 2+ levels
- Check that variables aren't all unique (e.g., sample IDs)
- Verify sufficient replication per group
"Design matrix is rank deficient"
- Variables may be perfectly correlated
- Remove redundant variables
- Check for batch/condition confounding
"Few significant genes found"
- Try simpler model for more power
- Check if batch correction is too aggressive
- Verify biological signal exists in data
"Too many significant genes"
- Add covariates to control confounders
- Increase log fold change threshold
- Check for technical artifacts
Error Recovery
The agent provides context-aware error recovery:
Agent: "❌ **Formula Construction Failed**
**Formula**: `~condition + nonexistent_variable`
**Error**: Variables not found in metadata: ['nonexistent_variable']
💡 **Suggestions**:
• Check variable names are spelled correctly
• Available variables: condition, batch, age, gender, cell_type
• Try: `construct_de_formula_interactive('my_data', 'condition', ['batch'])`"Integration with Existing Workflow
The agent-guided formula construction integrates seamlessly with the existing Lobster workflow:
Workflow Coverage
This implementation addresses Steps 8 and 12 of the customer's 12-step workflow:
- ✅ Step 8: Formula Construction → FULLY SUPPORTED via agent guidance
- ✅ Step 12: Iterative Workflows → FULLY SUPPORTED via conversation flow
Combined with previous implementations, this achieves 92% workflow coverage (11/12 steps).
Code Examples
Basic Usage
# 1. Get formula suggestions
agent.suggest_formula_for_design("pbmc_pseudobulk")
# 2. Build chosen formula interactively
agent.construct_de_formula_interactive(
pseudobulk_modality="pbmc_pseudobulk",
main_variable="condition",
covariates=["batch"]
)
# 3. Run analysis
agent.run_differential_expression_with_formula("pbmc_pseudobulk")
# 4. Try alternative model
agent.iterate_de_analysis(
pseudobulk_modality="pbmc_pseudobulk",
new_formula="~condition + batch + age",
iteration_name="with_age_correction"
)
# 5. Compare models
agent.compare_de_iterations("pbmc_pseudobulk", "iteration_1", "with_age_correction")Advanced Usage
# Custom reference levels
agent.construct_de_formula_interactive(
pseudobulk_modality="pbmc_pseudobulk",
main_variable="condition",
covariates=["batch", "age"],
include_interactions=True # Include condition*batch interaction
)
# Run with specific parameters
agent.run_differential_expression_with_formula(
pseudobulk_modality="pbmc_pseudobulk",
reference_levels={"condition": "healthy"}, # Set healthy as reference
alpha=0.01, # More stringent significance
lfc_threshold=0.5 # Require |LFC| >= 0.5
)Statistical Education Features
The agent provides educational context for statistical concepts:
Formula Complexity Explanations
Simple Model (~condition):
- "This compares your groups directly without adjusting for other factors"
- "Best statistical power but may miss important confounders"
- "Good for initial exploration or when you're confident there are no batch effects"
Batch-Corrected (~condition + batch):
- "This accounts for technical variation between batches while testing your condition"
- "More reliable estimates but uses some statistical power for batch correction"
- "Recommended when you have multiple batches or processing dates"
Multi-Factor (~condition + batch + age + gender):
- "This comprehensive model controls for multiple biological and technical factors"
- "Most accurate for publication but requires larger sample sizes"
- "Each covariate 'uses up' some of your statistical power"
Design Matrix Education
The agent explains what design matrices represent:
"📈 **Design Matrix Preview**:
This shows how your samples are encoded for statistical analysis:
• Each row = one sample
• Each column = one statistical parameter
• Values of 1 mean 'this parameter applies to this sample'
• Values of 0 mean 'this parameter doesn't apply'
For example, 'condition[T.treatment]' = 1 for treatment samples, 0 for controls."Implementation Details
File Structure
lobster/
├── agents/
│ └── transcriptomics_expert.py # Enhanced with 5 new tools
├── tools/
│ ├── differential_formula_service.py # Enhanced with 3 helper methods
│ └── workflow_tracker.py # New lightweight tracker
└── tests/
└── integration/
└── test_agent_guided_formula_construction.py # Integration testsDependencies
No new external dependencies required! Uses existing:
pandas,numpy: Data manipulationpydeseq2: Differential expression enginestatsmodels: Already available for statistical utilitiesrich: Already used for formatted output
Performance Metrics
- Formula Suggestion: ~100ms for typical metadata
- Design Matrix Construction: ~200ms for 100 samples
- Validation: ~50ms per formula
- Iteration Tracking: Minimal memory overhead
- Comparison: ~1s for 20,000 gene comparisons
Memory Usage
- Per Formula: ~1KB metadata storage
- Per Iteration: ~100KB-1MB depending on result size
- Comparison Cache: ~10MB for 10 comparisons
- Total Overhead: <50MB for typical workflows
Limitations and Future Enhancements
Current Limitations
- No GUI Components: Pure conversational interface
- Limited Power Analysis: Rough estimates, not exact calculations
- No Pathway Integration: Focuses on gene-level results
- Single Contrast: One comparison per analysis (standard for DESeq2)
Potential Enhancements
- Multiple Contrasts: Support for complex contrast matrices
- Advanced Power Analysis: Integration with power calculation packages
- Pathway-Aware Suggestions: Formula suggestions based on pathway goals
- Visual Design Matrix: Graphical representation of experimental design
- Model Diagnostics: Residual analysis and assumption checking
Support and Troubleshooting
Getting Help
- Agent Guidance: The agent provides context-sensitive help and suggestions
- Error Messages: Clear error messages with actionable recommendations
- Documentation: This comprehensive guide covers most use cases
Common Workflows
Exploratory Analysis:
suggest_formula_for_design→ Choose simple modelconstruct_de_formula_interactive→ Validate designrun_differential_expression_with_formula→ Get initial results
Publication Analysis:
- Start with exploratory results
iterate_de_analysis→ Try batch-corrected modeliterate_de_analysis→ Add biological covariatescompare_de_iterations→ Choose best model- Export final results for downstream analysis
Troubleshooting Analysis:
- Check data with
suggest_formula_for_design - Try simpler models if current model fails
- Use iteration comparison to understand model differences
- Adjust thresholds based on biological expectations
Conclusion
The agent-guided formula construction transforms complex statistical analysis into an intuitive conversational experience. By acting as an intelligent statistical consultant, the agent makes professional-grade differential expression analysis accessible to users of all statistical backgrounds while maintaining scientific rigor and best practices.
Key Benefits:
- 🎓 Educational: Learn statistics through guided interaction
- 🚀 Efficient: 2-week implementation vs months for complex UI
- 🧠 Intelligent: Context-aware suggestions and validation
- 🔄 Iterative: Natural support for exploring different models
- 📊 Professional: Publication-ready analysis with provenance tracking
This implementation successfully addresses the critical gaps in formula construction (Step 8) and iterative workflows (Step 12), bringing the Lobster platform to 92% coverage of the customer's complete 12-step single-cell analysis workflow.
46. Multi-Omics Integration
Multi-omics integration combines data from multiple biological measurement modalities (transcriptomics, proteomics, metabolomics, etc.) to gain comprehensive...
48. Manual Sample Enrichment Workflow
Problem: SRA sample metadata is often incomplete - missing disease, demographics (age/sex), tissue details despite this information existing in the sourc...