# Provenance Tracking
W3C-PROV compliant analysis tracking with ProvenanceTracker
The `ProvenanceTracker` provides W3C-PROV-compliant tracking of all analysis operations. This enables full reproducibility, audit trails for regulatory compliance, and automatic generation of Jupyter notebooks from recorded workflows.
## Overview

`ProvenanceTracker` captures:
- **Activities** - Operations performed (normalization, clustering, etc.)
- **Entities** - Data artifacts (modalities, files, results)
- **Agents** - Software/services that performed operations
- **Lineage** - DAG of how outputs derive from inputs
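These four record types form a small graph: activities consume and produce entities, and lineage falls out of walking that graph backwards. As a conceptual sketch (plain dictionaries, not Lobster's internal storage):

```python
# Conceptual PROV records: each activity links input and output entities.
activities = [
    {"id": "act:load", "agent": "agent:adapter", "inputs": [], "outputs": ["ent:raw"]},
    {"id": "act:normalize", "agent": "agent:norm", "inputs": ["ent:raw"], "outputs": ["ent:norm"]},
    {"id": "act:cluster", "agent": "agent:cluster", "inputs": ["ent:norm"], "outputs": ["ent:clustered"]},
]

def lineage(entity_id):
    """Walk the DAG backwards: every activity that (transitively) produced entity_id."""
    history = []
    frontier = [entity_id]
    while frontier:
        target = frontier.pop()
        for act in activities:
            if target in act["outputs"]:
                history.append(act["id"])
                frontier.extend(act["inputs"])
    return list(reversed(history))

print(lineage("ent:clustered"))  # ['act:load', 'act:normalize', 'act:cluster']
```

The real tracker stores richer records (timestamps, parameters, checksums), but the derivation query is the same backward traversal.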
## Basic Usage
`ProvenanceTracker` is accessed via `DataManagerV2`:
```python
from lobster.core.data_manager_v2 import DataManagerV2

dm = DataManagerV2(enable_provenance=True)

# Access the provenance tracker
tracker = dm.provenance
```

## Logging Tool Usage
The standard pattern for tracking operations:
```python
# The service returns a 3-tuple: (result, stats, ir)
result, stats, ir = clustering_service.cluster(adata, resolution=1.0)

# Log to provenance with IR for notebook export
dm.log_tool_usage(
    tool_name="cluster_cells",
    parameters={"modality": "my_dataset", "resolution": 1.0},
    description="Leiden clustering at resolution 1.0",
    ir=ir,  # AnalysisStep enables notebook generation
)
```

## Getting All Activities
```python
# Get all recorded activities
activities = dm.provenance.get_all_activities()

for activity in activities:
    print(f"{activity['type']}: {activity['description']}")
    print(f"  Parameters: {activity['parameters']}")
    print(f"  Timestamp: {activity['timestamp']}")
```

## Activity Records
Each activity contains:
```python
{
    "id": "lobster:activity:abc123...",
    "type": "clustering",  # Activity type
    "agent": "lobster:agent:clustering_service",
    "timestamp": "2024-01-15T10:30:00Z",
    "inputs": [{"entity": "...", "role": "input_data"}],
    "outputs": [{"entity": "...", "role": "clustered_data"}],
    "parameters": {"resolution": 1.0},
    "description": "Leiden clustering at resolution 1.0",
    "software_versions": {
        "scanpy": "1.9.6",
        "anndata": "0.10.3",
        "lobster": "1.0.0"
    },
    "ir": {  # AnalysisStep for notebook export
        "operation": "scanpy.tl.leiden",
        "code_template": "sc.tl.leiden(adata, resolution={{ resolution }})",
        "imports": ["import scanpy as sc"],
        ...
    }
}
```

## Creating Activities Manually
For custom operations:
```python
from lobster.core.analysis_ir import AnalysisStep

# Create IR for the operation
ir = AnalysisStep(
    operation="custom.my_analysis",
    tool_name="my_tool",
    description="Custom analysis step",
    library="custom",
    code_template="my_custom_function(adata, param={{ param }})",
    imports=["from my_module import my_custom_function"],
    parameters={"param": 42}
)

# Create activity with IR
activity_id = tracker.create_activity(
    activity_type="custom_analysis",
    agent="my_tool",
    inputs=[{"entity": input_entity_id, "role": "input"}],
    outputs=[{"entity": output_entity_id, "role": "output"}],
    parameters={"param": 42},
    description="Performed custom analysis",
    ir=ir
)
```

## Entity Management
Entities represent data artifacts:
```python
# Create entity for a dataset
entity_id = tracker.create_entity(
    entity_type="modality_data",
    uri="/path/to/data.h5ad",
    format="h5ad",
    metadata={"n_cells": 5000, "n_genes": 20000}
)
```
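The entity checksum is a content hash of the referenced file. A minimal sketch of such a helper with `hashlib` (illustrative only, not Lobster's actual implementation):

```python
import hashlib
import tempfile

def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large .h5ad files are never read whole."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a throwaway file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example bytes")
    demo_path = tmp.name

print(file_checksum(demo_path))
```

Chunked reading keeps memory flat regardless of file size, which matters for multi-gigabyte matrices.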
A checksum is calculated automatically for file-backed entities.

## Agent Registry
Agents represent software components:
```python
# Create agent record
agent_id = tracker.create_agent(
    name="ClusteringService",
    agent_type="software",
    version="1.0.0",
    description="Leiden clustering implementation"
)
```

## Lineage Tracking
Get the complete derivation history:
```python
# Get lineage for an entity
lineage = tracker.get_lineage(entity_id)

for activity in lineage:
    print(f"{activity['type']}: {activity['description']}")
```

## Export Formats
### Jupyter Notebook Export
The primary export format is a reproducible Python notebook:
```python
# Export session as notebook
path = dm.export_notebook(
    name="my_analysis",
    description="Single-cell QC and clustering workflow",
    filter_strategy="successful"  # "successful" | "all" | "manual"
)
```

The notebook contains:
- All imports from IR records
- Code cells generated from `code_template` fields
- Parameters substituted via Jinja2
- Markdown documentation
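The exporter renders each `code_template` with Jinja2; the substitution step can be emulated with a small stand-in (hypothetical helper, shown only to illustrate how parameters become code cells):

```python
import re

def render_template(code_template, parameters):
    """Substitute {{ name }} placeholders, Jinja2-style (illustrative stand-in)."""
    def repl(match):
        name = match.group(1)
        # repr() quotes strings and prints numbers plainly, yielding valid Python
        return repr(parameters[name])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", repl, code_template)

cell = render_template(
    "sc.tl.leiden(adata, resolution={{ resolution }})",
    {"resolution": 1.0},
)
print(cell)  # sc.tl.leiden(adata, resolution=1.0)
```

The real exporter uses Jinja2 proper, so templates may also use filters and defaults beyond simple placeholder substitution.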
### JSON Export
For programmatic access:
```python
import json

# Export as dictionary
provenance_data = tracker.to_dict()
# Contains namespace, activities, entities, agents

with open("provenance.json", "w") as f:
    json.dump(provenance_data, f, indent=2)
```

### Import from JSON
```python
# Restore from saved provenance
with open("provenance.json") as f:
    data = json.load(f)

tracker.from_dict(data)
```

## AnnData Integration
Provenance can be embedded in AnnData objects:
```python
# Add provenance to AnnData; stored in adata.uns['provenance']
adata = tracker.add_to_anndata(adata)

# Extract from AnnData
success = tracker.extract_from_anndata(adata)
if success:
    print("Restored provenance from AnnData")
```

## Specialized Logging Methods
Convenience methods for common operations:
```python
# Log data loading
activity_id = tracker.log_data_loading(
    source_path="/data/raw.h5ad",
    output_entity_id=entity_id,
    adapter_name="TranscriptomicsAdapter",
    parameters={"validate": True}
)

# Log data processing
activity_id = tracker.log_data_processing(
    input_entity_id=input_id,
    output_entity_id=output_id,
    processing_type="normalization",
    agent_name="NormalizationService",
    parameters={"method": "log1p"},
    description="Log-normalize counts"
)

# Log data saving
activity_id = tracker.log_data_saving(
    input_entity_id=entity_id,
    output_path="/data/processed.h5ad",
    backend_name="H5ADBackend",
    parameters={"compression": "gzip"}
)
```

## Software Version Tracking
Versions are captured automatically:
```python
# Versions are tracked per activity
activity = tracker.get_all_activities()[-1]
versions = activity["software_versions"]
```
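Conceptually, automatic capture queries the installed distributions at log time. A sketch with `importlib.metadata` (hypothetical helper, not necessarily Lobster's exact mechanism):

```python
import sys
from importlib import metadata

def capture_versions(packages):
    """Record the interpreter version plus each distribution's installed version."""
    versions = {"python": sys.version.split()[0]}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None  # not installed in this environment
    return versions

print(capture_versions(["scanpy", "anndata"]))
```

Recording versions per activity (rather than once per session) means an exported notebook documents exactly which library versions produced each step.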
The result maps package names to installed versions, for example `{"scanpy": "1.9.6", "anndata": "0.10.3", "lobster": "1.0.0"}`.

## Session Management
Provenance is scoped to the DataManagerV2 session:
```python
# Create an isolated session
dm = DataManagerV2(workspace_path="./analysis", enable_provenance=True)

# All operations are tracked within this session
dm.log_tool_usage(...)

# Export before the session ends
dm.export_notebook("session_workflow", "Analysis session")
```

## Cloud Integration
For Omics-OS Cloud deployments, provenance integrates with:
- **Centralized storage** - Activities stored in a cloud database
- **Cross-session lineage** - Track derivation across sessions
- **Compliance exports** - Generate audit reports
- **Collaboration** - Share reproducible workflows
## Best Practices
- **Always pass IR** - `log_tool_usage(..., ir=ir)` enables notebook export
- **Use descriptive operations** - `scanpy.pp.normalize_total`, not just `normalize`
- **Track all inputs/outputs** - Complete lineage requires all entities
- **Export regularly** - Save notebooks before long-running operations
- **Include parameter schemas** - Helps notebook generation with defaults
## API Reference
### ProvenanceTracker Methods
| Method | Description |
|---|---|
| `create_activity(...)` | Record a new operation |
| `create_entity(...)` | Register a data artifact |
| `create_agent(...)` | Register a software agent |
| `get_lineage(entity_id)` | Get derivation history |
| `get_all_activities()` | List all recorded activities |
| `to_dict()` | Export as dictionary |
| `from_dict(data)` | Import from dictionary |
| `add_to_anndata(adata)` | Embed in AnnData |
| `extract_from_anndata(adata)` | Restore from AnnData |
### DataManagerV2 Provenance Methods
| Method | Description |
|---|---|
| `log_tool_usage(tool, params, ir)` | Log operation with IR |
| `export_notebook(name, description)` | Generate a Jupyter notebook |
| `run_notebook(path, modality)` | Execute a saved notebook |