Omics-OS Docs
Architecture

Data Integrity, Security, and Compliance

Comprehensive Reference: Security architecture, compliance features, and deployment guidance Available: Lobster AI v0.3.4+ Compliance: 21 CFR Part 11, ALCOA+, GxP, HIPAA, GDPR, ISO/IEC 27001, SOC 2


1. Overview

1.1 Purpose and Audience

What this document covers: Lobster AI's comprehensive security architecture, data integrity features, compliance capabilities, and deployment guidance for regulated environments. This reference enables:

  • QA teams to assess Lobster for regulated use (GxP, HIPAA, clinical trials)
  • DevOps teams to deploy securely (local, cloud, validated environments)
  • Compliance officers to map Lobster features to regulatory requirements (21 CFR Part 11, ALCOA+, ISO 27001)
  • Enterprise customers to understand security posture and compliance readiness

Target audiences:

| Role | Primary Needs | Key Sections |
|---|---|---|
| QA / Compliance | Assess GxP readiness, audit trails | 3, 10 |
| DevOps / IT Security | Deploy securely, monitor systems | 9, 11 |
| Analysts | Understand data integrity, use securely | 2, 5, 8 |
| Enterprise Buyers | Evaluate security for procurement | 1.2, 10.1 |

1.2 Why Security Matters in Bioinformatics

Bioinformatics presents unique security challenges:

  1. Sensitive data - Patient genomic data (HIPAA), clinical trial results (GxP), proprietary research (trade secrets)
  2. Reproducibility crisis - Over 70% of researchers have tried and failed to reproduce another scientist's experiments (Nature 2016 survey)
  3. Data integrity - Single base pair error can invalidate conclusions, impact patient care
  4. Regulatory complexity - FDA, EMA, HIPAA, GDPR all impose different requirements
  5. Long-term value - Analyses must remain valid for 7-10+ years (regulatory retention)

Lobster's security philosophy:

| Principle | Implementation | Benefit |
|---|---|---|
| Security by default | W3C-PROV enabled, integrity manifests automatic | No opt-in required |
| Audit everything | Every operation logged with attribution | Complete audit trail |
| Cryptographic proof | SHA-256 hashes, RSA-2048 signatures | Tamper-evident records |
| Principle of least privilege | Workspace isolation, subscription tiers | Minimal attack surface |
| Graceful degradation | Local mode (max security) or cloud (scalability) | Flexible deployment |
| Standards compliance | W3C-PROV, NIST algorithms, ISO formats | Industry best practices |

1.3 Compliance Coverage Matrix

What regulations does Lobster support?

| Regulation | Current Status | Deployment Mode | Key Features |
|---|---|---|---|
| 21 CFR Part 11 | ✅ Ready | Local + Cloud | Audit trails, tamper-evidence, validation support |
| ALCOA+ | ✅ Ready | Local + Cloud | All 9 principles implemented (see 10.1) |
| GxP (GAMP 5) | ⚠️ Partial | Local (Cat 4 ready), Cloud (validation TBD) | IQ/OQ/PQ templates available |
| HIPAA | ⚠️ Conditional | Local (ready), Cloud (BAA required) | Encryption, audit logs, access control |
| GDPR | ⚠️ Conditional | Local (ready), Cloud (region + DPA) | Data residency, anonymization, retention |
| ISO/IEC 27001 | ✅ Ready | Local + Cloud | Information security controls (A.8.1-A.8.24) |
| SOC 2 Type II | ⚠️ Partial | Cloud (AWS certified), Lobster (pending) | AWS inherits certification |

Feature coverage by section:

| Section | Regulation Support | Key Features |
|---|---|---|
| [2] Data Integrity Manifest | 21 CFR Part 11 § 11.10(a) | SHA-256 hashes, tamper-evidence |
| [3] Audit Trail | 21 CFR Part 11 § 11.10(d,e) | W3C-PROV, AnalysisStep IR, session tracking |
| [4] Access Control | HIPAA, GDPR, ISO 27001 | License management, tier enforcement, API keys |
| [5] Secure Execution | 21 CFR Part 11 § 11.10(k) | Subprocess isolation, forbidden modules |
| [6] Data Protection | HIPAA, GDPR | Workspace isolation, concurrent access protection |
| [7] Network Security | ISO 27001 A.13 | Rate limiting, timeout handling, HTTPS |
| [8] Validation | 21 CFR Part 11 § 11.10(k) | Schema validation, pre-download checks |
| [9] Deployment | SOC 2, HIPAA | Docker, S3 encryption, AWS security |
| [10] Compliance | GxP, 21 CFR Part 11 | ALCOA+ mapping, deployment patterns, SOPs |
| [11] Best Practices | All | Environment security, access control, monitoring |

1.4 Document Structure

How to use this guide:

For quick assessment (QA teams, 30 minutes):

  1. Read 1.3 Compliance Coverage Matrix - Understand regulation support
  2. Read 10.1 GxP-Ready Checklist - See ALCOA+ and 21 CFR Part 11 mapping
  3. Read 10.2 Deployment Patterns - Choose deployment model
  4. Review 10.3 SOPs - Template procedures

For deep technical review (DevOps, 2-4 hours):

  1. Read entire document (Sections 2-11)
  2. Review linked detailed documentation (wiki pages)
  3. Test features in staging environment
  4. Validate with IQ/OQ/PQ scripts (10.4)

For compliance audit (inspectors, 1-2 hours):

  1. Start with 3. Audit Trail - Verify W3C-PROV implementation
  2. Review 2. Data Integrity Manifest - Understand cryptographic controls
  3. Check 10. Compliance Features - Map to regulations
  4. Request provenance export and verify hashes

Navigation tips:

  • Each section starts with "What it is" (executive summary)
  • Tables provide quick reference (capabilities, compliance benefits)
  • Code examples show practical usage
  • "For complete details" links to authoritative documentation (no duplication)

2. Data Integrity Manifest

2.1 What You'll See

When you export a notebook using /pipeline export, the second cell contains a data integrity manifest:

## 🔒 Data Integrity Manifest

**Purpose**: Cryptographic verification of data integrity (ALCOA+ compliance)

{
  "data_integrity_manifest": {
    "generated_at": "2026-01-01T14:23:45.123456",
    "provenance": {
      "session_id": "session_20260101_142000",
      "sha256": "7f83b1657ff1fc53b92dc18148a1d65dfc2d4b1fa3d677284addd200126d9069",
      "activities": 15,
      "entities": 8
    },
    "input_files": {
      "geo_gse109564.h5ad": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
    },
    "system": {
      "lobster_version": "0.3.4",
      "git_commit": "dd2c126f",
      "python_version": "3.13.9",
      "platform": "darwin"
    }
  }
}

Understanding the Manifest

Provenance Section

"provenance": {
  "session_id": "session_20260101_142000",
  "sha256": "7f83b165...",
  "activities": 15,
  "entities": 8
}

What this proves:

  • Links notebook to specific analysis session
  • Cryptographic hash of the session's audit trail
  • Documents scope: 15 analysis steps, 8 data entities

Input Files Section

"input_files": {
  "geo_gse109564.h5ad": "e3b0c442...",
  "geo_gse109564_filtered.h5ad": "5d41402a..."
}

What this proves:

  • Exact data files used in analysis
  • Each file has unique cryptographic fingerprint
  • Any modification changes the hash

System Section

"system": {
  "lobster_version": "0.3.4",
  "git_commit": "dd2c126f",
  "python_version": "3.13.9",
  "platform": "darwin"
}

What this proves:

  • Exact software version documented
  • Enables long-term reproducibility
  • Environment can be reconstructed
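The system fields shown above are written by Lobster itself during export, but they can be approximated with the standard library if you want to record a comparable snapshot of your own environment. This is an illustrative sketch, not Lobster's implementation; the use of `git rev-parse` is an assumption.

```python
import platform
import subprocess
import sys

def capture_system_info():
    """Collect fields like those in the manifest's "system" section.

    Illustrative only: the real manifest is produced by Lobster during
    export; the git call here is an assumption for demonstration.
    """
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"  # not inside a git checkout, or git unavailable
    return {
        "git_commit": commit,
        "python_version": platform.python_version(),
        "platform": sys.platform,  # e.g. "darwin", "linux"
    }
```

Archiving such a snapshot alongside external scripts makes it easier to reconstruct the environment years later.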

How to Verify Data Integrity

Basic Verification

macOS/Linux:

shasum -a 256 geo_gse109564.h5ad
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

Windows PowerShell:

Get-FileHash -Algorithm SHA256 geo_gse109564.h5ad

Python:

import hashlib

def verify_file_hash(filepath, expected_hash):
    """Verify file matches expected SHA-256 hash."""
    sha256 = hashlib.sha256()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            sha256.update(chunk)

    actual_hash = sha256.hexdigest()
    if actual_hash == expected_hash:
        print("✅ VERIFIED: File hash matches manifest")
        return True
    else:
        print("❌ MISMATCH: File has been modified!")
        print(f"Expected: {expected_hash}")
        print(f"Actual:   {actual_hash}")
        return False

# Usage
verify_file_hash(
    "geo_gse109564.h5ad",
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
)

Automated Verification Script

Create a verification script for your notebooks:

#!/usr/bin/env python3
"""Verify data integrity for Lobster AI notebook."""

import json
import hashlib
import nbformat
from pathlib import Path

def verify_notebook_integrity(notebook_path, data_directory):
    """Verify all input files match manifest hashes."""
    # Read notebook
    with open(notebook_path) as f:
        nb = nbformat.read(f, as_version=4)

    # Find manifest cell
    manifest = None
    for cell in nb.cells:
        if "data_integrity_manifest" in cell.source:
            # Extract the JSON object by tracking brace depth
            # (a naive endswith("}") check would stop at the first
            # nested closing brace and produce invalid JSON)
            json_lines = []
            depth = 0
            for line in cell.source.split("\n"):
                if not json_lines and not line.strip().startswith("{"):
                    continue
                json_lines.append(line)
                depth += line.count("{") - line.count("}")
                if depth == 0:
                    break
            manifest = json.loads("\n".join(json_lines))
            break

    if not manifest:
        print("❌ No integrity manifest found in notebook")
        return False

    # Verify each input file
    input_files = manifest["data_integrity_manifest"]["input_files"]
    all_verified = True

    for filename, expected_hash in input_files.items():
        filepath = Path(data_directory) / filename

        if not filepath.exists():
            print(f"⚠️  {filename}: File not found")
            all_verified = False
            continue

        # Calculate hash
        sha256 = hashlib.sha256()
        with open(filepath, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                sha256.update(chunk)
        actual_hash = sha256.hexdigest()

        if actual_hash == expected_hash:
            print(f"✅ {filename}: Verified")
        else:
            print(f"❌ {filename}: HASH MISMATCH")
            print(f"   Expected: {expected_hash}")
            print(f"   Actual:   {actual_hash}")
            all_verified = False

    return all_verified

# Usage
if __name__ == "__main__":
    import sys
    if len(sys.argv) != 3:
        print("Usage: verify_integrity.py <notebook.ipynb> <data_directory>")
        sys.exit(1)

    verified = verify_notebook_integrity(sys.argv[1], sys.argv[2])
    sys.exit(0 if verified else 1)

Save as: verify_integrity.py

Usage:

python verify_integrity.py my_analysis.ipynb ~/.lobster/

Common Scenarios

Scenario 1: Hashes Match ✅

$ shasum -a 256 geo_gse109564.h5ad
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

Meaning: File is authentic and unchanged Action: Proceed with analysis review


Scenario 2: Hash Mismatch ❌

$ shasum -a 256 geo_gse109564.h5ad
a1b2c3d4... (DIFFERENT HASH)

Possible Causes:

  1. File was re-downloaded or updated (intentional)
  2. File corruption (disk error, network issue)
  3. File was modified (accidental or malicious)

Action:

  1. Check file modification date
  2. Verify with original data source
  3. If intentional update: Re-run analysis to get new notebook with updated hashes
  4. If unexpected: Investigate security incident

Scenario 3: File Not Found

$ shasum -a 256 geo_gse109564.h5ad
shasum: geo_gse109564.h5ad: No such file or directory

Meaning: Data file has been moved or deleted

Action:

  1. Check if file was archived
  2. Restore from backup if needed
  3. Cannot reproduce analysis without original file

Why This Matters

For Regulatory Compliance

| Principle | Requirement | How Manifest Helps |
|---|---|---|
| ALCOA+ "Original" | Prove data is authentic | SHA-256 verifies file identity |
| ALCOA+ "Accurate" | Detect tampering | Hash mismatch reveals changes |
| 21 CFR Part 11 | Tamper-evident records | Cryptographic binding |
| GxP Audit Trail | Document system state | Version info captured |

For Scientific Reproducibility

Problem: "Which version of the data did I use?"

Solution: The hash uniquely identifies the exact file version:

  • Same data = Same hash (every time)
  • Different data = Different hash (guaranteed)
  • Cannot fake a hash (computationally infeasible to forge)
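These properties can be demonstrated directly with the standard library: the same bytes always hash to the same digest, while changing even one byte produces an entirely different one.

```python
import hashlib

# Identical input always yields the identical digest...
h1 = hashlib.sha256(b"ACGT" * 1000).hexdigest()
h2 = hashlib.sha256(b"ACGT" * 1000).hexdigest()
assert h1 == h2

# ...while flipping a single byte produces a completely different digest.
h3 = hashlib.sha256(b"ACGA" + b"ACGT" * 999).hexdigest()
assert h3 != h1
```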

Best Practices

1. Verify Hashes Before Review

When reviewing a colleague's notebook:

# Extract hashes from the manifest and verify each input file,
# e.g. with the verify_integrity.py script shown earlier:
#   python verify_integrity.py colleague_analysis.ipynb data/
# Only proceed with code review if all hashes match (exit code 0)

2. Archive Data with Notebooks

Store notebooks alongside their input data:

analysis_project/
├── my_analysis.ipynb          # Notebook with manifest
├── data/
│   ├── geo_gse109564.h5ad     # Input file
│   └── metadata.csv           # Metadata file
└── verify_integrity.py        # Verification script

3. Include Verification in SOPs

Standard Operating Procedure Example:

  1. Analyst exports notebook
  2. QA verifies hashes
  3. If verified → Review code
  4. If mismatch → Investigate before review

4. Document Hash Verification

Keep records of verification:

Analysis: GSE109564_clustering
Notebook: my_analysis.ipynb
Verified: 2026-01-01 by QA-UserName
Hash Status: ✅ All inputs verified
Reviewer: [Signature]

FAQ

Q: Is this automatic?

A: Yes. Every notebook export includes the manifest automatically. No extra steps required.

Q: Does this slow down my analysis?

A: No. Hashing happens only during export (~0.5 seconds for typical files). Zero impact on analysis performance.

Q: What if I need to update my data?

A: Re-run the analysis and export a new notebook. The new notebook will have new hashes reflecting the updated data. Both notebooks remain valid records of what data was used at each point in time.

Q: Can I use this in non-regulated environments?

A: Absolutely! Even outside GxP environments, data integrity verification is a scientific best practice. It helps you:

  • Track which version of data was used
  • Prevent accidental use of wrong files
  • Document your analysis provenance

Q: What hash algorithm is used?

A: SHA-256 (Secure Hash Algorithm 256-bit). This is:

  • NIST-approved standard
  • Used by GitHub, Bitcoin, SSL certificates
  • Collision-resistant (virtually impossible to find duplicates)
  • Industry standard for data integrity


2.2 H5AD Validation and Compression (v3.4.2+)

What it is: Lobster includes utilities for validating H5AD file integrity and optimizing storage via compression. These features ensure data quality and efficient storage in production deployments.

H5AD validation (core/utils/h5ad_utils.py):

| Check | Purpose | Error Detection |
|---|---|---|
| File format | Verify valid HDF5 structure | Detects corrupted files |
| Required keys | Check for .obs, .var, .X | Detects incomplete files |
| Shape consistency | Verify n_obs × n_vars matches | Detects truncated data |
| Compression valid | Test gzip/lzf decompression | Detects compression errors |
| Metadata present | Check for .uns metadata | Detects missing annotations |

Validation usage:

import scanpy as sc

from lobster.core.utils.h5ad_utils import validate_h5ad

# Validate H5AD file before analysis
is_valid, error_msg = validate_h5ad("geo_gse109564.h5ad")

if is_valid:
    print("✅ H5AD file is valid")
    adata = sc.read_h5ad("geo_gse109564.h5ad")
else:
    print(f"❌ Validation failed: {error_msg}")
    # Handle error (re-download, investigate corruption)

H5AD compression (storage optimization):

| Compression | Method | Ratio | Speed | Use Case |
|---|---|---|---|---|
| gzip (level 6) | Deflate | 5-10x | Medium | Default, balanced |
| gzip (level 9) | Deflate | 8-15x | Slow | Long-term archival |
| lzf | LZF | 3-5x | Fast | Real-time processing |

Compression usage:

from pathlib import Path

from lobster.core.utils.h5ad_utils import compress_h5ad

# Compress H5AD for archival
original_size = Path("geo_gse109564.h5ad").stat().st_size
compress_h5ad("geo_gse109564.h5ad", compression="gzip", compression_opts=9)
compressed_size = Path("geo_gse109564.h5ad").stat().st_size

print(f"Original: {original_size / 1e9:.2f} GB")
print(f"Compressed: {compressed_size / 1e9:.2f} GB")
print(f"Ratio: {original_size / compressed_size:.1f}x")
# Output: Original: 2.4 GB, Compressed: 0.3 GB, Ratio: 8.0x

Compliance benefits:

  • Data integrity - Pre-load validation catches corruption
  • Storage efficiency - 5-10x compression reduces costs
  • Quality assurance - Automated validation in CI/CD
  • Audit trail - Validation results logged to provenance

For complete implementation details, see:


2.3 Atomic File Operations

What it is: Lobster uses atomic file operations (temp file + fsync + atomic replace) to ensure crash-safe writes for critical files (session metadata, queues, provenance logs). This prevents data corruption from crashes, power failures, or kill signals.

Atomic write pattern:

import json
import os
from pathlib import Path

def atomic_write_json(path: Path, data: dict):
    """Crash-safe JSON write."""
    temp_path = path.with_suffix('.tmp')

    # Step 1: Write to temp file
    with open(temp_path, 'w') as f:
        json.dump(data, f, indent=2)
        f.flush()
        os.fsync(f.fileno())  # Force write to disk (bypass OS cache)

    # Step 2: Atomic replace (POSIX guarantee)
    os.replace(temp_path, path)  # Atomic on POSIX systems

Why atomic writes matter:

| Scenario | Without Atomic Writes | With Atomic Writes |
|---|---|---|
| Crash during write | Partial data written, file corrupted | Temp file discarded, original intact |
| Power failure | File may contain garbage | Temp file or complete file, never partial |
| Kill signal | Incomplete JSON, parse error | Complete file or original preserved |
| Concurrent access | Race conditions, corruption | Combined with file locks, safe |

Protected files (using atomic writes):

  • .session.json - Session metadata
  • provenance.json - W3C-PROV audit trail
  • download_queue.jsonl - Download queue entries
  • publication_queue.jsonl - Publication queue entries
  • cache_metadata.json - Cache tracking

POSIX atomicity guarantee:

os.replace(src, dst) on POSIX:
- Atomically replaces dst with src
- If dst exists: overwritten atomically
- If crash occurs: dst is either old OR new (never partial)
- Thread-safe + process-safe (when combined with locks)

Compliance benefits:

  • Data integrity - Crash-safe writes prevent corruption
  • ALCOA+ "Accurate" - File contents always valid
  • Audit trail integrity - Provenance never corrupted
  • Reliability - Production-ready for 24/7 operation

For complete implementation details, see:


3. Audit Trail & Provenance

3.1 W3C-PROV Compliance

What it is: Lobster implements the World Wide Web Consortium (W3C) PROV standard for complete audit trails of all analysis operations. Every action is recorded as a directed acyclic graph (DAG) with three key components:

  • Activities: What was done (e.g., clustering, quality control, differential expression)
  • Entities: What data was used and generated (datasets, plots, metadata files)
  • Agents: Who/what performed the action (singlecell_expert, data_expert, human users)

This creates an immutable, traceable record from raw data download through final publication-ready results.

Why it matters for compliance:

| Compliance Principle | Requirement | How W3C-PROV Helps |
|---|---|---|
| ALCOA+ "Traceable" | Complete operation history | DAG links all operations to source data |
| ALCOA+ "Attributable" | User/agent identification | Every activity attributed to specific agent |
| 21 CFR Part 11 § 11.10(e) | Audit trail requirements | Timestamped, immutable activity log |
| ISO/IEC 27001:2022 | Change logging | Complete provenance graph exportable as JSON |

Key capabilities:

| Feature | Description | Compliance Benefit |
|---|---|---|
| Activity tracking | All operations logged with parameters | Complete audit trail |
| Entity lineage | Data provenance from source to result | Traceability |
| Agent attribution | User/agent identification for each step | Accountability |
| Temporal ordering | Timestamp-based activity sequencing | Contemporaneous recording |
| Exportable format | W3C-PROV JSON standard | Portability & long-term archival |
| Query interface | Programmatic provenance queries | Audit support |

Quick start:

# View current session provenance
lobster status

# Export provenance for specific session (W3C-PROV JSON)
# Provenance automatically saved to: .lobster_workspace/provenance.json

Example provenance graph (simplified):

[PubMed Search] → [GEO GSE109564 Metadata] → [Download Dataset]
                                                      ↓
                                            [geo_gse109564.h5ad]
                                                      ↓
                                             [Quality Control]
                                                      ↓
                                           [Filter Cells/Genes]
                                                      ↓
                                             [Normalization]
                                                      ↓
                                               [Clustering]
                                                      ↓
                                            [Annotated Dataset]

For complete implementation details, see:


3.2 AnalysisStep Intermediate Representation (IR)

What it is: Every analysis operation returns an AnalysisStep object that captures the complete specification for reproducing that step. This IR (Intermediate Representation) enables:

  1. Notebook export - Generates executable Python code from analysis history
  2. Parameter validation - Ensures reproducibility through schema enforcement
  3. Audit trail integration - Links provenance to executable protocols
  4. Method documentation - Self-documenting analysis workflows

3-Tuple Pattern (all services follow this):

def analyze(adata: AnnData, **params) -> Tuple[AnnData, Dict[str, Any], AnalysisStep]:
    # ... processing ...
    return processed_adata, stats, analysis_step_ir

What gets recorded in AnalysisStep:

| Field | Purpose | Example |
|---|---|---|
| operation | Method name | "scanpy.pp.filter_cells" |
| tool_name | Service method | "quality_service.assess_quality" |
| description | Human explanation | "Filter cells based on QC metrics" |
| library | Software library | "scanpy", "pyDESeq2" |
| code_template | Jinja2 template | "sc.pp.filter_cells(adata, min_genes={{ min_genes }})" |
| imports | Required imports | ["import scanpy as sc"] |
| parameters | Actual values used | {"min_genes": 200, "max_genes": 8000} |
| parameter_schema | Validation rules | Types, defaults, constraints |
| input_entities | Data dependencies | ["geo_gse109564.h5ad"] |
| output_entities | Generated data | ["geo_gse109564_filtered.h5ad"] |
| execution_context | Runtime metadata | Timestamps, agent, session ID |

Compliance benefits:

  • Complete method documentation - Every parameter recorded
  • Parameter validation - Schema prevents invalid configurations
  • Reproducible protocols - Code templates generate executable notebooks
  • Audit trail integration - AnalysisStep embedded in W3C-PROV activities
  • ALCOA+ "Accurate" - Parameter schema prevents data entry errors

Example AnalysisStep (clustering):

AnalysisStep(
    operation="scanpy.tl.leiden",
    tool_name="clustering_service.perform_clustering",
    description="Leiden clustering with resolution 0.5",
    library="scanpy",
    code_template="sc.tl.leiden(adata, resolution={{ resolution }})",
    imports=["import scanpy as sc"],
    parameters={"resolution": 0.5, "random_state": 42},
    parameter_schema={
        "resolution": {"type": "float", "default": 1.0, "min": 0.0},
        "random_state": {"type": "int", "default": 0}
    },
    input_entities=["geo_gse109564_normalized.h5ad"],
    output_entities=["geo_gse109564_clustered.h5ad"]
)
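Notebook export renders `code_template` with the recorded `parameters` via Jinja2. The idea can be shown with a minimal stand-in renderer (pure Python, only simple `{{ name }}` placeholders; the real export path uses Jinja2 proper):

```python
import re

def render_template(code_template, parameters):
    """Minimal stand-in for Jinja2 rendering of an AnalysisStep template.

    Handles only bare {{ name }} placeholders; illustrative, not the
    actual export implementation.
    """
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: repr(parameters[m.group(1)]),
        code_template,
    )

line = render_template(
    "sc.tl.leiden(adata, resolution={{ resolution }})",
    {"resolution": 0.5},
)
# line == "sc.tl.leiden(adata, resolution=0.5)"
```

Because the parameters are the values actually used at run time, the generated notebook reproduces the analysis exactly.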

For complete implementation details, see:


3.3 Session and Tool Usage Tracking

What it is: Lobster maintains session-level metadata that tracks all tool invocations, agent handoffs, and data operations across multi-turn conversations. This enables:

  • Cross-session traceability - Continue analysis from previous sessions
  • Usage auditing - Track which agents/tools were used and when
  • Session restoration - Recover from interruptions
  • Compliance reporting - Generate audit reports per session

Session metadata structure:

{
  "session_id": "session_20260101_142000",
  "created_at": "2026-01-01T14:20:00.123456Z",
  "last_updated": "2026-01-01T15:45:32.789012Z",
  "workspace_path": "/Users/analyst/.lobster_workspace",
  "modalities": {
    "geo_gse109564": {
      "created_at": "2026-01-01T14:23:15Z",
      "n_obs": 5000,
      "n_vars": 2000,
      "layers": ["counts", "normalized"],
      "last_modified": "2026-01-01T15:30:00Z"
    }
  },
  "tool_usage": [
    {
      "tool_name": "search_pubmed",
      "agent": "research_agent",
      "timestamp": "2026-01-01T14:20:30Z",
      "parameters": {"query": "single-cell CRISPR screening"},
      "result": "Found 15 publications"
    },
    {
      "tool_name": "download_geo_dataset",
      "agent": "data_expert",
      "timestamp": "2026-01-01T14:23:00Z",
      "parameters": {"accession": "GSE109564"},
      "result": "Downloaded 5000 cells × 2000 genes"
    }
  ],
  "agent_handoffs": [
    {
      "from": "supervisor",
      "to": "research_agent",
      "timestamp": "2026-01-01T14:20:15Z",
      "reason": "User requested literature search"
    },
    {
      "from": "research_agent",
      "to": "data_expert",
      "timestamp": "2026-01-01T14:22:45Z",
      "reason": "Download queue entry created for GSE109564"
    }
  ]
}

Key capabilities:

| Feature | Description | Compliance Benefit |
|---|---|---|
| Unique session IDs | Timestamp-based unique identifiers | Session-level traceability |
| UTC timestamps | All times in UTC (ISO 8601) | Contemporaneous recording |
| Agent attribution | Every action linked to agent | ALCOA+ "Attributable" |
| Tool usage log | Complete invocation history | Audit trail support |
| Cross-session continuity | --session-id latest continues work | Analysis reproducibility |
| Automatic backup | Session saved after each operation | Crash recovery |

Session commands:

# Start new session with custom ID
lobster query --session-id "project_gse109564" "Download GSE109564 and cluster"

# Continue previous session
lobster query --session-id latest "Add differential expression analysis"

# View current session status
lobster status

# Export session (includes provenance + metadata)
# Automatically saved to: .lobster_workspace/.session.json

Compliance benefits:

  • ALCOA+ "Contemporaneous" - Timestamped in real-time
  • ALCOA+ "Attributable" - User/agent identification
  • 21 CFR Part 11 § 11.10(e) - Session-level audit capability
  • ISO/IEC 27001 - Access logging requirements

For complete implementation details, see:


3.4 Provenance Hash and Tamper-Evidence

What it is: Lobster creates a cryptographic hash (SHA-256) of the complete provenance graph (activities + entities + agents) and embeds it in the notebook's Data Integrity Manifest. This creates a tamper-evident link between the notebook and its audit trail.

How it works:

  1. Analysis phase: Provenance graph built as operations execute
  2. Export phase: Complete provenance serialized to JSON
  3. Hash calculation: SHA-256 computed over canonical JSON representation
  4. Manifest embedding: Hash included in notebook's second cell
  5. Verification: Recompute hash from provenance.json and compare

Verification guarantees:

| Property | Guarantee | Attack Prevention |
|---|---|---|
| Immutability | Any provenance modification changes hash | Prevents retroactive edits |
| Binding | Hash proves notebook ↔ provenance linkage | Prevents data substitution |
| Completeness | Hash covers all activities/entities | Prevents omission of steps |
| Non-repudiation | Provenance includes agent attribution | Accountability enforcement |

Example integrity manifest (provenance section):

{
  "data_integrity_manifest": {
    "generated_at": "2026-01-01T15:45:00Z",
    "provenance": {
      "session_id": "session_20260101_142000",
      "sha256": "7f83b1657ff1fc53b92dc18148a1d65dfc2d4b1fa3d677284addd200126d9069",
      "activities": 15,
      "entities": 8,
      "agents": 3,
      "time_span": {
        "start": "2026-01-01T14:20:00Z",
        "end": "2026-01-01T15:30:00Z"
      }
    },
    "input_files": { ... },
    "system": { ... }
  }
}

Verification workflow:

import hashlib
import json

# Read provenance from workspace
with open(".lobster_workspace/provenance.json") as f:
    provenance = json.load(f)

# Compute hash (canonical JSON, sorted keys)
provenance_json = json.dumps(provenance, sort_keys=True)
computed_hash = hashlib.sha256(provenance_json.encode()).hexdigest()

# Compare to manifest hash
manifest_hash = "7f83b1657ff1fc53b92dc18148a1d65dfc2d4b1fa3d677284addd200126d9069"

if computed_hash == manifest_hash:
    print("✅ VERIFIED: Provenance matches notebook manifest")
else:
    print("❌ TAMPERED: Provenance has been modified!")

Compliance benefits:

  • 21 CFR Part 11 § 11.10(a) - Tamper-evident audit trails
  • ALCOA+ "Original" - Proves audit trail authenticity
  • ISO/IEC 27001:2022 - Integrity monitoring
  • GxP - Supports 21 CFR Part 11 requirements for electronic records

For complete implementation details, see:


4. Access Control & Authentication

4.1 License Management System

What it is: Lobster uses a cryptographic license service (AWS-hosted) to validate entitlements and enforce subscription tiers. The system uses server-side RSA signing with client-side verification via JWKS (JSON Web Key Set), following industry-standard JWT/JWS patterns.

Architecture (AWS Serverless):

  • AWS Lambda - License service endpoints (Python 3.12, ARM64)
  • API Gateway - REST API (https://x6gm9vfgl5.execute-api.us-east-1.amazonaws.com/v1)
  • DynamoDB - Entitlements, customers, audit logs
  • AWS KMS - RSA-2048 signing key (HSM-backed, private key never leaves AWS)
  • S3 + CloudFront - JWKS public endpoint for signature verification

How it works (5-step activation):

  1. User activates: lobster activate <cloud-key>
  2. CLI calls AWS license service: POST /api/v1/activate
  3. Service validates key, signs entitlement via AWS KMS (server-side)
  4. Entitlement saved to: ~/.lobster/license.json
  5. CLI verifies signature via JWKS on each run (client-side)

Entitlement file structure:

{
  "cloud_key": "lbstr_abc123...",
  "customer_id": "cust_databiomix",
  "subscription_tier": "premium",
  "features": ["metadata_assistant", "proteomics_expert"],
  "issued_at": "2026-01-01T12:00:00Z",
  "expires_at": "2027-01-01T12:00:00Z",
  "signature": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
  "revocation_status": {
    "last_checked": "2026-01-01T18:00:00Z",
    "is_revoked": false
  }
}

Security properties:

| Property | Implementation | Benefit |
|---|---|---|
| Server-side signing | Private key in AWS KMS (never exposed) | Cannot be compromised by client |
| Client-side verification | JWKS public key from S3 | Standard JWT pattern, offline verification |
| Tamper-evident | RSA-2048 signature validation | Any modification invalidates entitlement |
| Revocation checking | Periodic status checks (24h interval) | Supports license revocation |
| Audit logging | DynamoDB AuditLogs table | Complete activation/refresh history |

CLI commands:

# Activate cloud license
lobster activate lbstr_abc123...

# Check license status and tier
lobster status

# Output shows:
# Subscription Tier: premium
# Features: metadata_assistant, proteomics_expert
# Expires: 2027-01-01

Compliance benefits:

  • Access control - Tier-based feature restrictions
  • Audit trail - All activations logged
  • Non-repudiation - Cryptographic signatures prove entitlements
  • Revocation support - Invalidate compromised keys

For complete implementation details, see:


4.2 Subscription Tier Enforcement

What it is: Role-based access control (RBAC) implemented via three subscription tiers (FREE, PREMIUM, ENTERPRISE). Each tier unlocks specific agents and features, enforced at CLI startup and agent creation.

Three-tier model:

| Tier | Agents | Features | Use Case |
|---|---|---|---|
| FREE | 7 agents | Core workflows | Open-source, academic, individual researchers, education |
| PREMIUM | +2 agents | Advanced workflows | Biotech, CRO, small teams, commercial use |
| ENTERPRISE | +custom packages (via lobster-custom-*) | Customer-specific, proprietary agents | Pharma, large orgs, validated environments |

FREE tier agents:

  • supervisor - basic analysis
  • research_agent - literature search
  • data_expert - data loading
  • transcriptomics_expert - RNA-seq
  • visualization_expert - plotting
  • machine_learning_expert - ML models
  • protein_structure_visualization_expert - structural biology

PREMIUM adds:

  • metadata_assistant - publication processing
  • proteomics_expert - mass spec

Enforcement mechanism (4 layers):

  1. License Manager (core/license_manager.py) - Validates tier at CLI startup
  2. ComponentRegistry (core/registry.py) - Checks tier via tier_requirement in AGENT_CONFIG
  3. Handoff Restrictions (config/subscription_tiers.py) - Prevents unauthorized delegation
  4. Entry Point Discovery - Premium packages registered via lobster.agents entry points

Example tier checking:

from lobster.config.subscription_tiers import is_agent_available

# metadata_assistant is PREMIUM-only, so this check returns False on FREE tier
if is_agent_available("metadata_assistant", current_tier="free"):
    create_agent()
else:
    # Show upgrade message
    print("⚠️ metadata_assistant requires PREMIUM tier")

Handoff restrictions (FREE tier example):

# supervisor can handoff to research_agent (✅)
# supervisor can handoff to data_expert (✅)
# research_agent CANNOT handoff to metadata_assistant (❌ - PREMIUM only)
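
The restrictions above amount to a per-tier allow-list lookup; a minimal sketch, assuming a hypothetical `FREE_TIER_HANDOFFS` mapping and `can_handoff` helper (the shipped rules live in config/subscription_tiers.py and may differ):

```python
# Hypothetical FREE-tier handoff allow-list (illustrative, not the shipped config)
FREE_TIER_HANDOFFS = {
    "supervisor": {"research_agent", "data_expert"},
    "research_agent": {"data_expert"},
}

def can_handoff(source: str, target: str, handoffs=FREE_TIER_HANDOFFS) -> bool:
    """Return True if `source` may delegate to `target` under the tier rules."""
    return target in handoffs.get(source, set())
```

For example, `can_handoff("supervisor", "research_agent")` is allowed, while `can_handoff("research_agent", "metadata_assistant")` is rejected because metadata_assistant is PREMIUM-only.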

Graceful degradation:

User: "Process my publication queue and filter metadata"
Lobster (FREE): "⚠️ Publication queue processing requires PREMIUM tier
                 (metadata_assistant agent). Available with PREMIUM subscription.
                 Visit https://omics-os.com/pricing for upgrade options."

Compliance benefits:

  • Access control - Feature restrictions without authentication overhead
  • Commercial licensing - Supports AGPL-3.0 + commercial model
  • Audit trail - Tier logged in session metadata
  • Customer segmentation - Different capabilities per contract



4.3 API Key Security

What it is: Secure management of third-party API credentials (LLM providers, NCBI, cloud services) via environment variables, workspace-level configuration, and secret management integration.

Supported API keys:

| Key | Purpose | Required | Rate Limit Impact |
|-----|---------|----------|-------------------|
| ANTHROPIC_API_KEY | Anthropic Direct LLM | Conditional* | N/A |
| AWS_BEDROCK_ACCESS_KEY | AWS Bedrock LLM | Conditional* | N/A |
| AWS_BEDROCK_SECRET_ACCESS_KEY | AWS Bedrock LLM | Conditional* | N/A |
| GOOGLE_API_KEY | Google Gemini LLM (v0.4.0+) | Conditional* | N/A |
| NCBI_API_KEY | NCBI E-utilities | Optional | 3 → 10 req/s |
| LOBSTER_CLOUD_KEY | Cloud service activation | Optional | N/A |

*At least one LLM provider is required (Anthropic, AWS Bedrock, Google Gemini, or local Ollama)

Configuration hierarchy (priority order):

  1. Workspace-level .env - Per-project keys (highest priority)
  2. Global ~/.lobster/.env - User-wide keys
  3. Environment variables - System-level (e.g., CI/CD)
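
The hierarchy above reduces to a first-match lookup across the three sources; a minimal sketch, assuming a hypothetical `resolve_api_key` helper with naive KEY=VALUE parsing (the real loader may differ):

```python
import os
from pathlib import Path
from typing import Optional

def _read_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines from a .env file (ignores comments)."""
    if not path.exists():
        return {}
    pairs = {}
    for line in path.read_text().splitlines():
        if "=" in line and not line.lstrip().startswith("#"):
            key, _, value = line.partition("=")
            pairs[key.strip()] = value.strip()
    return pairs

def resolve_api_key(name: str, workspace: Path) -> Optional[str]:
    """Workspace .env > global ~/.lobster/.env > process environment."""
    for source in (
        _read_env_file(workspace / ".env"),                 # 1. workspace-level
        _read_env_file(Path.home() / ".lobster" / ".env"),  # 2. global
        os.environ,                                         # 3. system environment
    ):
        if name in source:
            return source[name]
    return None
```

A workspace-level key therefore always shadows a globally configured or exported one.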

Security best practices:

| Practice | Implementation | Security Benefit |
|----------|----------------|------------------|
| Never commit keys | .gitignore includes .env files | Prevents credential leaks |
| Use .env templates | .env.example (no real keys) | Safe to commit, guides setup |
| Rotate regularly | Quarterly for NCBI, on team changes for LLMs | Limits exposure window |
| Scope appropriately | Workspace > Global > System | Principle of least privilege |
| Validate on startup | CLI checks key validity | Fail fast on invalid keys |

Example secure setup:

# ✅ GOOD: Use .env file (gitignored)
cat > .env << EOF
ANTHROPIC_API_KEY=sk-ant-api03-...
NCBI_API_KEY=abc123...
EOF

# ✅ GOOD: Workspace-specific keys
mkdir -p ~/project1/.lobster_workspace
echo "ANTHROPIC_API_KEY=sk-ant-project1..." > ~/project1/.env

# ❌ BAD: Hardcode in scripts (committed to git)
export ANTHROPIC_API_KEY="sk-ant-api03-..."  # Will be committed!

# ❌ BAD: Share keys across users
echo "ANTHROPIC_API_KEY=shared-key" > /etc/lobster/.env  # Security risk

Enterprise secret management:

# AWS Secrets Manager integration
export ANTHROPIC_API_KEY=$(aws secretsmanager get-secret-value \
  --secret-id lobster/anthropic-key \
  --query SecretString \
  --output text)

# HashiCorp Vault integration
export ANTHROPIC_API_KEY=$(vault kv get -field=api_key secret/lobster/anthropic)

# CI/CD environment injection (GitHub Actions example)
# Stored in repository secrets, injected at runtime
- name: Run Lobster analysis
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: lobster query "Analyze GSE109564"

Compliance benefits:

  • Access control - Workspace-level isolation
  • Audit trail - Key usage logged (not key values)
  • Credential rotation - Supports regular key updates
  • Principle of least privilege - Scoped per project



4.4 Cloud vs Local Security Models

What it is: Lobster supports two deployment modes with different security postures: Local Mode (default, maximum data control) and Cloud Mode (managed infrastructure, cloud key required).

Local Mode (Default):

| Property | Status | Details |
|----------|--------|---------|
| Data location | ✅ Local machine | Never leaves your environment |
| Network egress | ⚠️ Minimal | Only for data downloads (GEO, PubMed) |
| Workspace isolation | ✅ Full control | User manages permissions |
| API key management | ⚠️ User-managed | Store in .env files |
| Suitable for | ✅ Sensitive data | HIPAA, confidential, air-gapped |
| Compliance | ✅ GxP-ready | Full audit trail, local storage |

Cloud Mode (LOBSTER_CLOUD_KEY set):

| Property | Status | Details |
|----------|--------|---------|
| Data location | ⚠️ Cloud API | Data sent to cloud service |
| Network dependency | ⚠️ Required | Must have internet connection |
| License validation | ✅ Automatic | Cloud key verified via AWS |
| Scalable compute | ✅ Managed | Auto-scaling infrastructure |
| API key management | ✅ Server-side | No local credential storage |
| Suitable for | ⚠️ Non-sensitive | Validated, non-PHI data |
| Compliance | ⚠️ Check BAA | HIPAA requires Business Associate Agreement |

Decision matrix (choosing a mode):

| Requirement | Local | Cloud |
|-------------|-------|-------|
| Air-gapped environment | ✅ | ❌ |
| Sensitive patient data (HIPAA) | ✅ | ⚠️ (BAA required) |
| Large-scale processing (100s of datasets) | ⚠️ | ✅ |
| No infrastructure management | ❌ | ✅ |
| Maximum data control | ✅ | ⚠️ |
| Regulatory compliance (21 CFR Part 11) | ✅ | ⚠️ (validated cloud) |
| Multi-user collaboration | ⚠️ | ✅ |
| Cost predictability | ✅ (compute only) | ⚠️ (usage-based) |

Hybrid deployment pattern (recommended):

Development & Exploration → Local Mode
├─ Literature search
├─ Dataset discovery
├─ Small-scale testing
└─ Sensitive data analysis

Production & Scale → Cloud Mode
├─ Large batch processing
├─ Multi-user workflows
├─ Non-sensitive data
└─ Managed infrastructure

Switching modes:

# Local mode (default)
unset LOBSTER_CLOUD_KEY
lobster chat

# Cloud mode
export LOBSTER_CLOUD_KEY=lbstr_abc123...
lobster activate $LOBSTER_CLOUD_KEY
lobster chat  # Now uses cloud backend

Security considerations (cloud mode):

  • Data sovereignty: Understand where data is processed (AWS region)
  • Compliance: Verify BAA for HIPAA, DPA for GDPR
  • Network security: Use VPN/private endpoints for enterprise
  • Access logs: Cloud service logs all API calls
  • Data retention: Understand cloud provider's retention policies

Compliance benefits:

  • Flexibility - Choose security posture per use case
  • Data control - Local mode for regulated data
  • Scalability - Cloud mode for production
  • Audit trail - Both modes support W3C-PROV



5. Secure Code Execution

5.1 Custom Code Execution Service

What it is: Lobster includes a CustomCodeExecutionService that allows analysts to execute arbitrary Python code for edge-case data manipulations (e.g., complex metadata transformations, custom filtering). This feature balances flexibility (handle unexpected scenarios) with security (protect the system).

Use cases:

  • Complex metadata filtering not covered by standard tools
  • Custom data transformations (e.g., multi-column merging)
  • Edge-case QC operations
  • Format conversions for specialized databases

Security model (Phase 1 - Current):

| Control | Implementation | Protection |
|---------|----------------|------------|
| Subprocess isolation | Runs in separate Python process | Crash isolation (subprocess failure doesn't crash main) |
| Timeout enforcement | 300-second default limit | Prevents infinite loops |
| Workspace-only access | Working directory set to workspace | Cannot access files outside workspace |
| Forbidden module blocking | AST analysis + import hooks | Blocks subprocess, os.system, importlib, eval, exec |
| Output capture | Stdout/stderr redirected | All output logged to provenance |
| Error handling | Exception isolation | User-friendly error messages |

Known limitations (acceptable for local CLI, NOT for cloud SaaS):

  • ⚠️ Network access not restricted - Code can make HTTP requests
  • ⚠️ File permissions not sandboxed - Uses user's OS permissions
  • ⚠️ Resource limits basic - Only timeout, no CPU/memory quotas
  • NOT suitable for cloud SaaS - Requires Docker isolation (Phase 2)

Testing rigor:

  • 30+ security test files in tests/manual/custom_code_execution/
  • 201+ attack vectors tested across 6 categories:
    • File system attacks (path traversal, permission bypass)
    • Network attacks (data exfiltration, SSRF)
    • Resource exhaustion (memory bombs, CPU thrashing)
    • Privilege escalation (setuid, sudo abuse)
    • Code injection (eval, exec, import manipulation)
    • Crash attacks (segfault, assertion failures)

Example usage:

# Via agent tool
execute_custom_code("""
import pandas as pd

# Load metadata from workspace
metadata = pd.read_csv(WORKSPACE / 'metadata.csv')

# Complex filtering (example: multiple conditions)
filtered = metadata[
    (metadata['disease'] == 'cancer') &
    (metadata['tissue'].isin(['lung', 'breast'])) &
    (metadata['age'] > 50)
]

# Save to exports directory (user-facing files)
OUTPUT_DIR.mkdir(exist_ok=True)
filtered.to_csv(OUTPUT_DIR / 'filtered_metadata.csv', index=False)
""", persist=True)

Compliance benefits:

  • Audit trail - All custom code logged to provenance
  • Reproducibility - Code captured in AnalysisStep IR
  • Isolation - Subprocess protects main process
  • Timeout enforcement - Prevents runaway processes
  • ⚠️ Limited sandboxing - Suitable for local, NOT cloud



5.2 Security Controls

What it is: Multi-layered security controls to prevent malicious or accidental misuse of custom code execution.

1. Forbidden Module Blocking (AST-based static analysis):

FORBIDDEN_MODULES = [
    'subprocess',       # Prevent shell command execution
    'os.system',        # Prevent system calls
    'os.popen',         # Prevent pipe-based execution
    'importlib',        # Prevent dynamic imports
    '__import__',       # Prevent import manipulation
    'eval',             # Prevent code evaluation
    'exec',             # Prevent code execution
    'compile',          # Prevent bytecode compilation
    'open',             # Restricted to workspace paths only
]

Example blocked code:

# ❌ BLOCKED: Attempt to use subprocess
import subprocess
subprocess.run(['rm', '-rf', '/'])  # Detected and blocked at import time

# ❌ BLOCKED: Dynamic import
importlib.import_module('os').system('evil')  # Detected via AST analysis

# ❌ BLOCKED: Code evaluation
eval("__import__('os').system('evil')")  # Detected via AST analysis
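
The static check can be illustrated with Python's ast module; a minimal sketch of the idea (my own illustration, not Lobster's actual implementation, which also uses import hooks):

```python
import ast

# Names treated as forbidden for this illustration
FORBIDDEN_NAMES = {"subprocess", "importlib", "eval", "exec", "compile", "__import__"}
FORBIDDEN_CALLS = {"os.system", "os.popen"}

def find_violations(code: str) -> list:
    """Return forbidden imports/calls found by walking the AST (no execution)."""
    violations = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            violations += [a.name for a in node.names
                           if a.name.split(".")[0] in FORBIDDEN_NAMES]
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in FORBIDDEN_NAMES:
                violations.append(node.module)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN_NAMES:        # eval(...), exec(...)
                violations.append(node.func.id)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if isinstance(node.func.value, ast.Name):
                dotted = f"{node.func.value.id}.{node.func.attr}"
                if dotted in FORBIDDEN_CALLS:           # os.system(...), os.popen(...)
                    violations.append(dotted)
    return violations
```

Running `find_violations("import subprocess")` flags the import, while ordinary pandas code passes with no findings.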

2. Timeout Enforcement:

import subprocess
import sys

# Subprocess killed after timeout (raises subprocess.TimeoutExpired)
try:
    process = subprocess.run(
        [sys.executable, script_path],
        timeout=300,  # 5 minutes default
        capture_output=True,
        text=True,
        cwd=str(workspace_path)  # Workspace boundary
    )
except subprocess.TimeoutExpired:
    pass  # Child killed after 300s; a user-friendly timeout error is surfaced

Example timeout protection:

# ❌ BLOCKED: Infinite loop (killed after 300s)
while True:
    pass

# ❌ BLOCKED: Excessive computation
for i in range(10**15):
    x = i ** 2  # Killed after timeout

3. Workspace Boundary Enforcement:

# Custom code executes in workspace directory
# All file paths relative to workspace
WORKSPACE = Path(workspace_path)  # e.g., /Users/analyst/.lobster_workspace
OUTPUT_DIR = WORKSPACE / "exports"  # User-facing files

# Accessing files outside workspace requires absolute paths
# (user's OS permissions apply)

Example workspace access:

# ✅ ALLOWED: Read/write within workspace
data = pd.read_csv(WORKSPACE / 'metadata.csv')
data.to_csv(OUTPUT_DIR / 'result.csv')

# ⚠️ USER PERMISSION: Access outside workspace (if OS allows)
external_data = pd.read_csv('/external/path/data.csv')  # OS permissions apply
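
A path-containment check like the one implied above can be sketched with pathlib; `is_within_workspace` is a hypothetical helper name, not Lobster's API:

```python
from pathlib import Path

def is_within_workspace(candidate: str, workspace: Path) -> bool:
    """True if `candidate` resolves to a location inside `workspace`.

    resolve() collapses '..' segments and symlinks, so traversal
    attempts like 'data/../../etc/passwd' are rejected.
    """
    resolved = (workspace / candidate).resolve()
    try:
        resolved.relative_to(workspace.resolve())
        return True
    except ValueError:
        return False
```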

4. Provenance Logging (all custom code logged):

# Every execution logged to provenance
log_tool_usage(
    tool_name="execute_custom_code",
    parameters={"code": code_snippet, "persist": True},
    stats={"execution_time_ms": 1234, "output_lines": 5},
    ir=AnalysisStep(...)  # Complete code captured for reproducibility
)

Compliance benefits:

  • Attack surface reduction - Forbidden modules blocked
  • Resource protection - Timeout prevents DoS
  • Workspace isolation - Limited file access scope
  • Complete audit trail - All code logged
  • Reproducibility - Code captured in AnalysisStep

5.3 Deployment Recommendations

What it is: Guidance on when CustomCodeExecutionService is appropriate for different deployment environments.

Deployment decision matrix:

| Environment | Status | Rationale | Recommendation |
|-------------|--------|-----------|----------------|
| Local CLI | ✅ Production-ready | Trusted users, local data control, user's OS permissions | Deploy with current security model |
| Enterprise (on-premise) | ⚠️ Conditional | Assess risk tolerance, trusted users, air-gapped OK | Deploy with current model OR wait for Phase 2 |
| Cloud SaaS | ❌ Requires Phase 2 | Untrusted users, need full isolation, network restrictions | Wait for Docker sandboxing |
| Academic/Research | ✅ Production-ready | Trusted users, flexibility > security, local control | Deploy with current model |
| Regulated (GxP) | ⚠️ Conditional | Risk assessment required, consider code review workflow | Evaluate per use case |

Phase 2 enhancements (roadmap for cloud SaaS):

| Enhancement | Technology | Benefit |
|-------------|------------|---------|
| Docker sandboxing | gVisor or Kata Containers | Full isolation (filesystem, network, process) |
| Network isolation | Docker bridge mode | No outbound connections allowed |
| Resource quotas | cgroups | CPU, memory, disk limits |
| Read-only input mounts | Docker volumes | Input data immutable |
| Runtime security scanning | Falco or Sysdig | Detect anomalous behavior |
| Egress firewall | iptables/nftables | Block all external connections |

Phase 2 timeline: Estimated 4-6 weeks implementation + testing

Risk assessment questions (for enterprise deployment):

  1. Who executes code? Trusted employees? External analysts?
  2. What data sensitivity? PHI/PII? Confidential? Public?
  3. Network environment? Air-gapped? Internet-connected?
  4. Risk tolerance? Low (wait for Phase 2)? Medium (conditional)? High (deploy now)?
  5. Code review workflow? Manual review? Automated scanning? None?

Recommendations by scenario:

Academic Lab (trusted users, public data)
→ ✅ Deploy now with Phase 1 security

Biotech Startup (small team, confidential data, on-premise)
→ ✅ Deploy now + code review workflow

Pharma Enterprise (GxP, validated environment)
→ ⚠️ Conditional: Risk assessment + code review + SOP

Cloud SaaS Provider (untrusted users, multi-tenant)
→ ❌ Wait for Phase 2 (Docker isolation)

Compliance considerations:

  • Audit trail - All code logged (GxP requirement)
  • Reproducibility - Code captured in notebooks (21 CFR Part 11)
  • ⚠️ Validation - Phase 2 required for IQ/OQ/PQ (validated environments)
  • ⚠️ Data integrity - Phase 1 OK for local, Phase 2 for cloud



5.4 Best Practices for Custom Code

What it is: Operational guidance for analysts using custom code execution and administrators deploying the feature.

For Analysts (using custom code):

✅ GOOD: Simple data manipulation

execute_custom_code("""
import pandas as pd

# Load data
metadata = pd.read_csv(WORKSPACE / 'metadata.csv')

# Filter by condition
cancer_samples = metadata[metadata['disease'] == 'cancer']

# Save to exports (user-facing files)
OUTPUT_DIR.mkdir(exist_ok=True)
cancer_samples.to_csv(OUTPUT_DIR / 'cancer_metadata.csv', index=False)
""", persist=True)

✅ GOOD: Custom QC checks

execute_custom_code("""
import pandas as pd
import numpy as np

# Load metadata
meta = pd.read_csv(WORKSPACE / 'metadata.csv')

# Custom QC: Check for missing values
missing_counts = meta.isnull().sum()
qc_pass = missing_counts.max() < 10  # Threshold: < 10 missing per column

# Save QC report
report = pd.DataFrame({
    'column': missing_counts.index,
    'missing_count': missing_counts.values,
    'qc_status': ['PASS' if c < 10 else 'FAIL' for c in missing_counts]
})
report.to_csv(OUTPUT_DIR / 'qc_report.csv', index=False)
""", persist=True)

❌ BAD: Attempting forbidden operations

execute_custom_code("""
import subprocess  # ❌ BLOCKED at import time
subprocess.run(['rm', '-rf', '/'])  # Won't execute

import os
os.system('evil command')  # ❌ BLOCKED (os.system forbidden)

eval("__import__('os').system('evil')")  # ❌ BLOCKED (eval forbidden)
""")

❌ BAD: Inefficient operations (timeout risk)

execute_custom_code("""
# ❌ Will timeout after 300 seconds
while True:
    pass

# ❌ Excessive memory usage (may crash subprocess)
data = [0] * (10 ** 10)  # 10 billion elements
""")

For Administrators (deploying Lobster):

1. Local Deployment Checklist:

# ✅ Verify workspace permissions (user-only access)
chmod 700 ~/.lobster_workspace

# ✅ Set timeout (optional, default 300s)
export LOBSTER_CUSTOM_CODE_TIMEOUT=600  # 10 minutes

# ✅ Enable audit logging (automatically enabled)
lobster query "Your analysis request"

# ✅ Review provenance logs periodically
cat ~/.lobster_workspace/provenance.json | jq '.activities[] | select(.tool_name == "execute_custom_code")'

2. Enterprise Deployment Checklist:

# ⚠️ Risk assessment (required)
# - Document: Who executes code? What data sensitivity?
# - Review: Security controls sufficient for risk tolerance?

# ✅ Code review workflow (recommended)
# - Require peer review for custom code blocks
# - Document in SOP: "Custom code must be reviewed by lead analyst"

# ✅ Periodic audit (quarterly)
# - Review custom code usage via provenance logs
# - Identify patterns, create standard tools for common operations

# ✅ User training
# - Document allowed patterns (data filtering, QC checks)
# - Document forbidden patterns (network access, subprocess)

3. Monitoring & Alerting (enterprise):

# Example: Monitor custom code usage
import json
from pathlib import Path

def audit_custom_code_usage(workspace_path):
    """Generate custom code usage report."""
    provenance_path = Path(workspace_path).expanduser() / "provenance.json"  # expand '~'
    with open(provenance_path) as f:
        prov = json.load(f)

    custom_code_activities = [
        act for act in prov.get('activities', [])
        if act.get('tool_name') == 'execute_custom_code'
    ]

    print(f"Total custom code executions: {len(custom_code_activities)}")
    for act in custom_code_activities:
        print(f"  - {act['timestamp']}: {act['agent']} (session: {act['session_id']})")

# Run monthly
audit_custom_code_usage("~/.lobster_workspace")

4. Standard Operating Procedure (SOP template):

## SOP: Custom Code Execution in Lobster AI

**Purpose**: Define approved use of custom code execution feature

**Scope**: All analysts using Lobster AI for bioinformatics analysis

**Approved Use Cases**:
1. Complex metadata filtering (>3 conditions)
2. Custom QC checks not covered by standard tools
3. Format conversions for specialized databases

**Forbidden Operations**:
1. Network requests (data exfiltration risk)
2. File access outside workspace (data leakage risk)
3. Subprocess execution (security risk)

**Review Process**:
1. Analyst documents custom code in lab notebook
2. Lead analyst reviews code before execution
3. QA team audits custom code usage quarterly

**Audit Trail**:
- All custom code logged to provenance.json
- Captured in notebook exports (Data Integrity Manifest)
- Reviewable via `lobster status` command

Compliance benefits:

  • Documented workflows - SOPs for custom code usage
  • Audit trail - Complete logging via provenance
  • Training - Clear guidance on allowed patterns
  • Periodic review - Quarterly audits recommended
  • Risk mitigation - Code review workflow for sensitive environments



6. Data Protection & Isolation

6.1 Workspace Isolation

What it is: Lobster uses a workspace-based architecture where each analysis session operates in an isolated directory. This provides data isolation between projects, prevents cross-contamination, and enables clean archival of complete analyses.

Workspace resolution order (priority):

  1. --workspace CLI flag - Explicit path (highest priority)
  2. LOBSTER_WORKSPACE environment variable - Project/session-level configuration
  3. Current directory default - ./.lobster_workspace (automatic creation)
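
That priority order reduces to a few lines; a sketch assuming a hypothetical `resolve_workspace` helper (the actual CLI logic may differ):

```python
import os
from pathlib import Path

def resolve_workspace(cli_flag=None) -> Path:
    """Resolve the workspace using the documented priority order."""
    if cli_flag:                                     # 1. --workspace flag wins
        return Path(cli_flag).expanduser()
    env_value = os.environ.get("LOBSTER_WORKSPACE")  # 2. environment variable
    if env_value:
        return Path(env_value).expanduser()
    return Path.cwd() / ".lobster_workspace"         # 3. current-directory default
```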

Workspace structure:

.lobster_workspace/
├── .session.json              # Current session metadata (multi-process safe)
├── provenance.json            # W3C-PROV audit trail
├── dataset_name.h5ad          # Modality data files
├── plots/                     # Visualizations (PNG, HTML)
│   ├── umap_plot.html
│   └── qc_metrics.png
├── exports/                   # User-facing files (v2.4+)
│   ├── metadata_filtered.csv  # Exported metadata
│   └── de_results.tsv         # Differential expression
├── literature/                # Cached papers (PDF, TXT)
│   └── PMID_12345678.txt
├── metadata/                  # Sample metadata (CSV, TSV)
│   └── GSE109564_metadata.csv
├── download_queue.jsonl       # Download coordination (multi-process safe)
└── publication_queue.jsonl    # Publication processing (multi-process safe)

Security properties:

| Property | Implementation | Security Benefit |
|----------|----------------|------------------|
| Per-project isolation | Separate workspace per project | No cross-project data leakage |
| Clean archival | Zip workspace = complete analysis | Easy backup, transfer, compliance |
| Multi-user support | Each user has own workspace | No data mixing in shared environments |
| Permission inheritance | OS-level permissions (chmod 700) | Access control via filesystem |
| Centralized exports | exports/ for user files (v2.4+) | Clear distinction: internal vs user-facing |

Best practices:

# ✅ GOOD: Dedicated workspace per project
lobster chat --workspace ~/projects/gse109564_analysis/

# ✅ GOOD: Set global workspace for long session
export LOBSTER_WORKSPACE=~/current_project
lobster chat

# ✅ GOOD: Archive complete analysis
tar -czf gse109564_analysis.tar.gz ~/projects/gse109564_analysis/

# ❌ BAD: Mix multiple projects in same workspace
lobster query "Analyze GSE12345"
lobster query "Analyze GSE67890"  # Mixed in same workspace, confusing provenance

Multi-user deployment (shared server):

# Each user gets isolated workspace
export LOBSTER_WORKSPACE=/shared/workspaces/$USER
chmod 700 /shared/workspaces/$USER  # User-only access

# Or project-based workspaces
export LOBSTER_WORKSPACE=/shared/projects/project_123
# Set group permissions for team collaboration
chmod 770 /shared/projects/project_123
chgrp bioinfo_team /shared/projects/project_123

Compliance benefits:

  • Data isolation - HIPAA/GDPR data separation
  • Complete audit trail - All files in one location
  • Clean archival - 21 CFR Part 11 electronic records
  • Access control - OS-level permission enforcement
  • Multi-user support - Enterprise deployment ready



6.2 Concurrent Access Protection

What it is: Lobster implements multi-process safe file locking for shared state files (download queue, publication queue, session metadata). This prevents race conditions, data corruption, and lost updates in concurrent scenarios.

Protected files (multi-process safe):

| File | Protection | Use Case |
|------|------------|----------|
| download_queue.jsonl | InterProcessFileLock | Multiple agents downloading datasets |
| publication_queue.jsonl | InterProcessFileLock | Batch publication processing |
| .session.json | InterProcessFileLock | Session state updates |
| cache_metadata.json | InterProcessFileLock | Cache tracking |

Implementation (core/queue_storage.py):

1. Cross-platform file locking:

# POSIX (macOS, Linux): fcntl.flock; Windows: msvcrt.locking
import os, platform
if platform.system() == 'Windows':
    import msvcrt
else:
    import fcntl

class InterProcessFileLock:
    """File-based lock for cross-process coordination."""
    def __init__(self, lock_path):
        self.lock_path = lock_path

    def __enter__(self):
        self.fd = os.open(self.lock_path, os.O_RDWR | os.O_CREAT)
        if platform.system() == 'Windows':
            msvcrt.locking(self.fd, msvcrt.LK_LOCK, 1)
        else:
            fcntl.flock(self.fd, fcntl.LOCK_EX)  # Exclusive lock
        return self

    def __exit__(self, *exc):
        os.close(self.fd)  # Closing the fd releases the lock on both platforms

2. Thread + process locking:

from contextlib import contextmanager

@contextmanager
def queue_file_lock(thread_lock, lock_path):
    """Combines threading.Lock + file lock."""
    with thread_lock:  # Thread-safe within process
        with InterProcessFileLock(lock_path):  # Process-safe across processes
            yield

3. Atomic writes (crash-safe):

import json
import os
from pathlib import Path

def atomic_write_json(path: Path, data: dict):
    """Crash-safe JSON write (temp file + fsync + atomic replace)."""
    temp_path = path.with_suffix('.tmp')

    # Write to temp file
    with open(temp_path, 'w') as f:
        json.dump(data, f, indent=2)
        f.flush()
        os.fsync(f.fileno())  # Force write to disk

    # Atomic replace (POSIX guarantees atomicity)
    os.replace(temp_path, path)

Example usage (download queue):

# Thread A and Thread B both trying to update queue
with queue_file_lock(self._lock, self._lock_path):
    # Critical section - guaranteed exclusive access
    entries = self._read_all_entries()
    entries.append(new_entry)
    atomic_write_jsonl(self._queue_path, entries)
# Lock released - other threads/processes can proceed

Concurrency scenarios protected:

| Scenario | Without Locking | With Locking |
|----------|-----------------|--------------|
| Concurrent writes | Lost updates, corruption | Sequential writes, no loss |
| Read during write | Partial data read | Read waits for write completion |
| Process crash during write | Corrupted file | Temp file discarded, original intact |
| Multi-user access | Race conditions | Coordinated access |

Performance characteristics:

  • Lock overhead: ~1-5ms per acquisition (local filesystem)
  • Blocking: Writers wait for exclusive access (FIFO order)
  • Scalability: Suitable for 10-100 concurrent processes (local FS limitation)
  • Not recommended for: NFS/SMB (file locking issues), high-frequency updates (>100 ops/sec)

Compliance benefits:

  • Data integrity - Prevents corruption in concurrent scenarios
  • Crash safety - Atomic writes prevent partial updates
  • Multi-user support - Enterprise deployment ready
  • Audit trail integrity - Session metadata protected
  • Queue reliability - Download/publication queues consistent



6.3 Session Management and Data Restoration

What it is: Lobster maintains persistent session state that enables cross-session continuity, analysis restoration, and historical provenance queries. Sessions support multi-turn conversations with automatic state management.

Session lifecycle:

Session Creation → Multi-Turn Analysis → Session Snapshot → Restoration
     ↓                    ↓                    ↓                ↓
session_123.json    Updates per tool      .session.json     Resume analysis
  (metadata)         (provenance log)       (checkpoint)     (--session-id)

Session metadata structure (saved to .session.json):

{
  "session_id": "session_20260101_142000",
  "created_at": "2026-01-01T14:20:00.123456Z",
  "last_updated": "2026-01-01T15:45:32.789012Z",
  "workspace_path": "/Users/analyst/.lobster_workspace",
  "subscription_tier": "premium",
  "modalities": {
    "geo_gse109564": {
      "created_at": "2026-01-01T14:23:15Z",
      "n_obs": 5000,
      "n_vars": 2000,
      "layers": ["counts", "normalized"],
      "last_modified": "2026-01-01T15:30:00Z",
      "file_path": "geo_gse109564.h5ad",
      "size_bytes": 12345678
    }
  },
  "tool_usage_count": 15,
  "agent_handoffs_count": 4
}
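
A session ID in the `session_YYYYMMDD_HHMMSS` form shown above can be generated in one line; a sketch with a hypothetical `new_session_id` helper (not necessarily how Lobster names it):

```python
from datetime import datetime, timezone

def new_session_id(now=None) -> str:
    """Timestamp-based session ID, e.g. session_20260101_142000."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("session_%Y%m%d_%H%M%S")
```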

Key capabilities:

| Feature | Description | Use Case |
|---------|-------------|----------|
| Session continuity | --session-id latest resumes previous session | Multi-day analysis |
| Named sessions | --session-id "project_gse109564" | Long-term projects |
| Automatic checkpointing | Session saved after each operation | Crash recovery |
| Cross-session restoration | Load data from previous session | Reproduce results |
| Historical queries | Query provenance from any session | Audit support |

Session commands:

# Start new named session
lobster query --session-id "project_gse109564" "Download GSE109564 and cluster"

# Continue most recent session
lobster query --session-id latest "Add differential expression"

# Resume specific session
lobster query --session-id "session_20260101_142000" "Export results"

# View current session
lobster status
# Output:
# Session ID: session_20260101_142000
# Workspace: /Users/analyst/.lobster_workspace
# Modalities: 3 datasets loaded
# Tool usage: 15 operations
# Last updated: 2026-01-01 15:45:32 UTC

Session restoration workflow:

# Automatically handled by DataManagerV2
class DataManagerV2:
    def restore_session(self, session_id: str):
        """Restore complete session state."""
        # 1. Load session metadata
        session_meta = self._load_session_file(session_id)

        # 2. Restore modalities (lazy loading)
        for modality_name, info in session_meta['modalities'].items():
            self.modalities[modality_name] = self._load_h5ad(info['file_path'])

        # 3. Restore provenance context
        self.provenance.load_session_activities(session_id)

        # 4. Resume analysis
        return session_meta

Security considerations:

| Aspect | Protection | Benefit |
|--------|------------|---------|
| Session files | Workspace permissions (chmod 700) | User-only access |
| Session IDs | Timestamp-based (not guessable) | Prevents session hijacking |
| Automatic backup | Written after each operation | Crash recovery |
| No sensitive data | API keys NOT stored in session | Credential protection |

Compliance benefits:

  • ALCOA+ "Contemporaneous" - Real-time session updates
  • ALCOA+ "Traceable" - Complete session history
  • Crash recovery - Automatic checkpointing
  • Historical audit - Query any previous session
  • Multi-day analysis - Session continuity support



7. Network Security & Rate Limiting

7.1 Redis Rate Limiter Architecture

What it is: Lobster uses a Redis-backed token bucket rate limiter to prevent API rate limit violations when accessing external databases (NCBI, GEO, PubMed, PRIDE, MassIVE, etc.). This ensures good API citizenship and prevents 429 errors that interrupt analysis workflows.

Architecture:

  • Token bucket algorithm - Tokens replenish over time, requests consume tokens
  • Redis connection pool - Thread-safe, health-check enabled (30s interval)
  • Graceful degradation - Fail-open if Redis unavailable (warning only)
  • Cross-process coordination - Multiple lobster processes share rate limit
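
Stripped of Redis, the token bucket itself is compact; an in-memory sketch for intuition (Lobster's limiter keeps bucket state in Redis so that multiple processes share it):

```python
import time

class TokenBucket:
    """Tokens refill at `rate` per second, bursting up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False
```

A 10 req/s NCBI bucket would be `TokenBucket(rate=10, capacity=10)`: a request either consumes a token immediately or waits until one refills.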

Deployment modes:

| Mode | Redis Required | Behavior | Use Case |
|------|----------------|----------|----------|
| Development | No | Warning logged, no blocking | Single-user, local analysis |
| Production | Yes | Rate limits enforced | Multi-user, shared server |
| CI/CD | Optional | Warning-only (no Redis) | Automated testing |

Setup:

# Development: No Redis needed (warning only)
lobster query "Search PubMed for CRISPR papers"
# Warning: Redis not available, rate limiting disabled

# Production: Redis for multi-process coordination
docker run -d -p 6379:6379 redis:alpine
export REDIS_URL=redis://localhost:6379
lobster query "Search PubMed for CRISPR papers"
# Rate limits enforced across all processes

Key features:

| Feature | Implementation | Benefit |
|---------|----------------|---------|
| Connection pooling | redis.ConnectionPool with health checks | Auto-recovery from stale connections |
| Thread-safe | Double-checked locking for lazy init | Safe for multi-threaded agents |
| Process-safe | Redis keys with TTL | Cross-process coordination |
| Automatic retry | Exponential backoff on 429 errors | Transparent error recovery |
| Provider-specific | Separate keys per domain | Precise rate limit compliance |

Example usage (automatic):

from lobster.tools.rate_limiter import get_rate_limiter

# Decorator automatically applied to provider methods
@get_rate_limiter().with_rate_limit(domain="ncbi")
def search_pubmed(query: str):
    # Rate limit enforced before API call
    # If limit exceeded: waits for token availability
    return ncbi_api.search(query)

Redis key structure:

rate_limit:ncbi       → Token bucket for NCBI (10 req/s with API key)
rate_limit:pmc        → Token bucket for PMC (3 req/s)
rate_limit:geo        → Token bucket for GEO (10 req/s)
rate_limit:pride      → Token bucket for PRIDE (2 req/s)

Compliance benefits:

  • API compliance - Respects provider rate limits (NCBI Terms of Service)
  • Reliability - Prevents 429 errors that interrupt workflows
  • Good citizenship - Prevents overloading public databases
  • Multi-user support - Coordinates rate limits across users
  • Audit trail - Rate limit violations logged to provenance



7.2 Multi-Domain Rate Limiting

What it is: Lobster integrates with 29+ external databases (genomics, proteomics, metabolomics, literature), each with different rate limits. The system enforces provider-specific limits to ensure compliance and prevent access denial.

Rate limits by provider:

| Domain | Base Rate Limit | With API Key | Enforcement | Protocol |
|---|---|---|---|---|
| NCBI E-utilities | 3 req/s | 10 req/s | Redis + backoff | HTTPS |
| PMC Open Access | 3 req/s | N/A | Redis + backoff | HTTPS |
| GEO | 10 req/s | N/A | Redis + backoff | HTTPS/FTP |
| SRA | 3 req/s | 10 req/s (same NCBI key) | Redis + backoff | HTTPS |
| PRIDE | 2 req/s | N/A | Redis + backoff | HTTPS |
| MassIVE | 1 req/s | N/A | Redis + backoff | HTTPS |
| MetaboLights | 2 req/s | N/A | Redis + backoff | HTTPS |
| Publisher APIs | 0.3-2.0 req/s | Varies | Redis + backoff | HTTPS |
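Read as configuration, the table is a lookup from domain to allowed request rate. A hedged sketch (the `RATE_LIMITS` dict and `rate_for` helper are hypothetical names; the actual registry lives inside `lobster.tools.rate_limiter`):

```python
# (base req/s, req/s with API key) per the table above
RATE_LIMITS = {
    "ncbi":         (3.0, 10.0),
    "pmc":          (3.0, None),
    "geo":          (10.0, None),
    "sra":          (3.0, 10.0),   # shares the NCBI API key
    "pride":        (2.0, None),
    "massive":      (1.0, None),
    "metabolights": (2.0, None),
}

def rate_for(domain: str, has_api_key: bool = False) -> float:
    """Return the enforced request rate for a domain."""
    base, keyed = RATE_LIMITS[domain]
    return keyed if has_api_key and keyed is not None else base
```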

NCBI API key benefits (recommended):

| Metric | Without Key | With Key | Improvement |
|---|---|---|---|
| Rate limit | 3 req/s | 10 req/s | 3.3x faster |
| Batch PubMed search | ~300 queries/min | ~1000 queries/min | 3.3x faster |
| Large dataset download | Rate-limited | Priority queue | Better service |

Setup NCBI API key (free, 2-minute registration):

# 1. Register at: https://www.ncbi.nlm.nih.gov/account/settings/
# 2. Generate API key
# 3. Add to .env file
echo "NCBI_API_KEY=your-key-here" >> .env

# 4. Verify (rate limit increases automatically)
lobster query "Search PubMed for 500 papers on CRISPR"
# Completes in ~30s instead of ~100s

Exponential backoff (automatic retry):

# Automatic retry on 429 Too Many Requests
@rate_limiter.with_rate_limit(domain="ncbi")
def fetch_data(accession):
    """Automatically retries with exponential backoff."""
    # Retry schedule: 1s, 2s, 4s, 8s, 16s (max 5 attempts)
    # Total wait time: ~31s max
    return api.get(accession)

Rate limit error handling:

# Example internal implementation
import time

def _handle_rate_limit_error(response, attempt):
    """Handle 429 Too Many Requests."""
    if response.status_code == 429:
        # Retry-After arrives as a string header; fall back to exponential backoff
        retry_after = int(response.headers.get('Retry-After', 2 ** attempt))
        logger.warning(f"Rate limit exceeded, retrying in {retry_after}s")
        time.sleep(retry_after)
        return True  # Retry
    return False  # Don't retry

Multi-domain coordination (example workflow):

User: "Search PubMed for CRISPR papers, download top 10 datasets from GEO, fetch protein structures from PDB"

research_agent → NCBI rate limiter (10 req/s with key)

data_expert → GEO rate limiter (10 req/s)

protein_structure_visualization_expert → PDB rate limiter (10 req/s)

Each domain has independent token bucket → No cross-domain interference

Monitoring & alerting (enterprise):

# Example: Monitor rate limit usage
def check_rate_limit_health():
    """Check Redis connection and rate limit status."""
    limiter = get_rate_limiter()

    if not limiter.is_available():
        logger.error("⚠️ Redis unavailable - rate limiting disabled")
        return False

    # Check token availability for critical domains
    critical_domains = ["ncbi", "geo", "pride"]
    for domain in critical_domains:
        tokens = limiter.get_available_tokens(domain)
        if tokens < 10:  # Low token threshold
            logger.warning(f"⚠️ Low tokens for {domain}: {tokens}")

    return True

# Run periodically (e.g., every 5 minutes)
check_rate_limit_health()

Compliance benefits:

  • Terms of Service compliance - Respects NCBI, GEO, PRIDE rate limits
  • Prevents access denial - Avoids IP blocking from excessive requests
  • Good API citizenship - Responsible use of public resources
  • Multi-domain coordination - Independent limits per provider
  • Audit trail - All API calls logged to provenance

For complete implementation details, see:


7.3 API Timeout and Error Handling

What it is: Lobster implements robust timeout handling and error recovery for all external API calls, ensuring graceful degradation when network issues occur or APIs are unavailable.

Timeout configuration:

| Timeout Type | Default | Configurable | Purpose |
|---|---|---|---|
| Connection timeout | 10s | Yes | Time to establish connection |
| Read timeout | 30s | Yes | Time to receive first byte |
| Total timeout | 5min | Yes | Maximum request duration |
| Retry timeout | 2min | Yes | Maximum retry duration |

Error handling strategy:

| Error Type | Retry? | Backoff | Max Attempts | Logged? |
|---|---|---|---|---|
| Connection errors | ✅ Yes | Exponential | 5 | ✅ Yes |
| 429 Rate limit | ✅ Yes | Exponential | 5 | ✅ Yes |
| 5xx Server errors | ✅ Yes | Exponential | 3 | ✅ Yes |
| 4xx Client errors | ❌ No | N/A | 1 | ✅ Yes |
| Timeout | ✅ Yes | Linear | 3 | ✅ Yes |
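The matrix above boils down to a small decision function. A sketch under that reading (`retry_policy` is a hypothetical name, not Lobster's internal API):

```python
def retry_policy(status_code=None, timed_out=False, conn_error=False):
    """Map an error to (retry, backoff, max_attempts) per the table above."""
    if conn_error:
        return (True, "exponential", 5)   # connection errors
    if timed_out:
        return (True, "linear", 3)        # timeouts
    if status_code == 429:
        return (True, "exponential", 5)   # rate limited
    if status_code is not None and 500 <= status_code < 600:
        return (True, "exponential", 3)   # server errors
    return (False, None, 1)               # 4xx and everything else: fail fast
```

Every outcome, retried or not, is logged to provenance.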

Example timeout configuration:

# Provider-specific timeouts (in provider classes)
import requests

class PubMedProvider:
    DEFAULT_TIMEOUT = (10, 30)  # (connect, read) in seconds

    def search(self, query: str):
        url = self._build_search_url(query)  # hypothetical helper; URL construction omitted
        try:
            response = requests.get(
                url,
                timeout=self.DEFAULT_TIMEOUT,
                headers={"User-Agent": "Lobster-AI/0.3.4"}
            )
            response.raise_for_status()
            return response.json()
        except requests.Timeout:
            logger.warning("PubMed search timed out, retrying...")
            raise  # Automatic retry via decorator
        except requests.ConnectionError:
            logger.error("Connection to PubMed failed")
            raise ProviderError("Network error")

Retry with exponential backoff:

# Automatic retry logic (internal)
import time
import requests

def retry_with_backoff(func, max_attempts=5):
    """Exponential backoff: 1s, 2s, 4s, 8s, 16s."""
    for attempt in range(max_attempts):
        try:
            return func()
        except (requests.Timeout, requests.ConnectionError) as e:
            if attempt == max_attempts - 1:
                raise  # Final attempt failed
            wait_time = 2 ** attempt
            logger.warning(f"Retry {attempt + 1}/{max_attempts} in {wait_time}s")
            time.sleep(wait_time)

Network security best practices:

| Practice | Implementation | Security Benefit |
|---|---|---|
| HTTPS only | All API calls use HTTPS | Encrypted communication |
| SSL verification | verify=True (default) | Prevents MITM attacks |
| Timeouts enforced | All requests have timeout | Prevents hanging |
| User-Agent header | Lobster-AI/version | Identifies client, enables rate limit cooperation |
| Error logging | All failures logged to provenance | Audit trail, debugging |

Example error handling (user-facing):

User: "Download GSE12345 from GEO"

# Scenario 1: Network timeout
research_agent: "⚠️ Network timeout while accessing GEO. Retrying... (attempt 1/5)"
# ... exponential backoff ...
research_agent: "✅ Successfully downloaded GSE12345 (retry 3 succeeded)"

# Scenario 2: Rate limit exceeded
research_agent: "⚠️ Rate limit exceeded (429). Waiting 8 seconds before retry..."
# ... automatic backoff ...
research_agent: "✅ Request succeeded after rate limit backoff"

# Scenario 3: Permanent failure
research_agent: "❌ Failed to download GSE12345 after 5 attempts. GEO may be unavailable. Please try again later."

Monitoring network health (enterprise):

# Check recent network errors
cat ~/.lobster_workspace/provenance.json | \
  jq '.activities[] | select(.status == "error") | .error_message'

# Count errors by type
cat ~/.lobster_workspace/provenance.json | \
  jq '.activities[] | select(.status == "error") | .error_type' | \
  sort | uniq -c

Compliance benefits:

  • Graceful degradation - Analysis continues despite transient failures
  • Audit trail - All network errors logged to provenance
  • Security - HTTPS + SSL verification enforced
  • Reliability - Automatic retry with backoff
  • User experience - Clear error messages, transparent retry

For complete implementation details, see:


8. Validation & Data Quality

8.1 Schema Validation

What it is: Lobster uses Pydantic-based schema validation for all modality data (transcriptomics, proteomics, metabolomics, metagenomics). This enforces data integrity and standardization at load time, preventing downstream analysis errors.

Schema architecture:

  • Per-modality schemas - Domain-specific validation rules
  • Pydantic models - Type checking, constraint validation, automatic coercion
  • Pre-load validation - Errors caught before data enters workspace
  • Quality checks - Missing value thresholds, column requirements

Supported schemas (core/schemas/):

| Schema | Purpose | Key Validations |
|---|---|---|
| transcriptomics_schema.py | RNA-seq QC metrics | Min cells/genes, count thresholds, QC column checks |
| proteomics_schema.py | Mass spec data | Missing value limits, intensity ranges, peptide columns |
| metabolomics_schema.py | Metabolite data | m/z ranges, retention times, peak intensity |
| metagenomics_schema.py | 16S/shotgun data | Taxonomy levels, abundance validation |
| database_mappings.py | Accession patterns | 29 database identifier formats |

Example validation (transcriptomics):

# Simplified from lobster.core.schemas.transcriptomics_schema
from typing import List
from pydantic import BaseModel, Field, validator

# Validate H5AD metadata before loading
class TranscriptomicsMetadata(BaseModel):
    n_obs: int = Field(gt=0, description="Number of observations (cells)")
    n_vars: int = Field(gt=0, description="Number of variables (genes)")
    layers: List[str] = Field(..., description="Required layers")
    obs_columns: List[str] = Field(..., description="Observation annotations")

    @validator("layers")
    def check_required_layers(cls, v):
        required = ["counts"]
        if not any(layer in v for layer in required):
            raise ValueError(f"Missing required layer: {required}")
        return v

    @validator("n_obs")
    def check_minimum_cells(cls, v):
        if v < 10:
            raise ValueError(f"Too few cells: {v} (minimum: 10)")
        return v

Validation workflow:

User: "Load my single-cell data"

data_expert → load_modality()

ModalityAdapter.load() → H5AD file read

TranscriptomicsMetadata.validate() → Schema checks
    ├─ ✅ PASS → Data loaded into workspace
    └─ ❌ FAIL → ValidationError with details

Example error:
"ValidationError: n_obs=5 (minimum: 10 cells required)"
"ValidationError: Missing required layer: 'counts'"

Validation categories:

| Category | Checks | Example |
|---|---|---|
| Structure | Required columns, layers | .obs['cell_type'], .layers['counts'] |
| Thresholds | Min/max values | n_obs >= 10, n_vars >= 200 |
| Data types | Type enforcement | Integer counts, float normalized values |
| Consistency | Cross-field validation | len(obs) == adata.shape[0] |
| Quality | QC metric ranges | pct_counts_mt < 20% |
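The Consistency and Quality rows amount to simple cross-field checks. A minimal sketch (the function and its signature are hypothetical; in Lobster these checks run inside the Pydantic schemas):

```python
def check_consistency(n_obs: int, obs_rows: int, pct_counts_mt: float) -> list:
    """Collect human-readable errors for the cross-field and QC rules above."""
    errors = []
    if obs_rows != n_obs:  # len(obs) must equal adata.shape[0]
        errors.append(f"obs table has {obs_rows} rows, expected {n_obs}")
    if pct_counts_mt >= 20:  # mitochondrial fraction threshold
        errors.append(f"pct_counts_mt={pct_counts_mt}% exceeds 20% limit")
    return errors
```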

Benefits for analysts:

# ❌ WITHOUT VALIDATION: Silent failures, corrupted analysis
adata = load_data("bad_file.h5ad")  # Missing 'counts' layer
sc.pp.normalize_total(adata, layer="counts")  # KeyError: 'counts'

# ✅ WITH VALIDATION: Early error detection
adata = load_data("bad_file.h5ad")
# ValidationError: Missing required layer: 'counts'
# Fix data before analysis, avoid wasted time

Compliance benefits:

  • ALCOA+ "Accurate" - Data integrity enforced
  • ALCOA+ "Complete" - Required fields validated
  • Quality assurance - Pre-analysis QC
  • Audit trail - Validation results logged to provenance
  • Reproducibility - Schema version captured in metadata

For complete implementation details, see:


8.2 Accession Validation

What it is: Lobster uses a centralized AccessionResolver to validate and parse identifiers from 29 public databases (GEO, SRA, PRIDE, MassIVE, MetaboLights, etc.). This prevents typos, detects invalid accessions, and enables database-specific download strategies.

Supported databases (29 patterns):

| Database | Pattern Examples | Category |
|---|---|---|
| GEO | GSE109564, GSM*, GPL*, GDS* | Genomics |
| SRA | SRP*, SRX*, SRR*, SRS* | Sequencing |
| ENA/DDBJ | ERP*, ERX*, ERR*, DRP* | Sequencing |
| BioProject/BioSample | PRJNA*, SAMN* | Metadata |
| PRIDE | PXD012345 | Proteomics |
| MassIVE | MSV000082048 | Proteomics |
| MetaboLights | MTBLS123 | Metabolomics |
| ArrayExpress | E-MTAB-, E-GEOD- | Microarrays |
| DOI | 10.1234/example | Publications |
DOI10.1234/examplePublications

Architecture:

  • Thread-safe singleton - Single instance via get_accession_resolver()
  • Pre-compiled regex - 29 patterns compiled at import time
  • Case-insensitive - gse12345 = GSE12345 (better UX)
  • URL generation - Automatic URL construction per database
  • Centralized source - core/schemas/database_mappings.py (single source of truth)
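The singleton-plus-precompiled-regex design can be shown in miniature (illustrative only, with two of the 29 patterns; the real entry point is `get_accession_resolver()`):

```python
import re
import threading

# Two sample patterns; the full set lives in core/schemas/database_mappings.py
_PATTERNS = {
    "geo": re.compile(r"^GSE\d+$", re.IGNORECASE),  # case-insensitive, per the docs
    "sra": re.compile(r"^SRP\d+$", re.IGNORECASE),
}

class MiniResolver:
    def detect_database(self, accession: str):
        for db, pattern in _PATTERNS.items():
            if pattern.match(accession):
                return db
        return None

_lock = threading.Lock()
_instance = None

def get_mini_resolver():
    """Thread-safe lazy singleton via double-checked locking."""
    global _instance
    if _instance is None:
        with _lock:
            if _instance is None:
                _instance = MiniResolver()
    return _instance
```

Pre-compiling the patterns once at import time avoids recompiling a regex on every validation call.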

Key methods:

from lobster.core.identifiers.accession_resolver import get_accession_resolver

resolver = get_accession_resolver()

# Detect database type
db = resolver.detect_database("GSE109564")
# Returns: "geo"

# Validate accession
is_valid = resolver.validate("GSE109564", "geo")
# Returns: True

# Extract accessions from text
accessions = resolver.extract_accessions_by_type("Check GSE109564 and SRP123456")
# Returns: {"geo": ["GSE109564"], "sra": ["SRP123456"]}

# Generate URL
url = resolver.get_url("GSE109564", "geo")
# Returns: "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109564"

Pre-download validation (prevents failed downloads):

User: "Download INVALID12345"

research_agent → validate accession

AccessionResolver.detect_database("INVALID12345")
    ├─ ✅ Matched pattern → Proceed to download
    └─ ❌ No match → "❌ Invalid accession format: INVALID12345"

User: "Download GSE109564"

AccessionResolver.detect_database("GSE109564")  → "geo"

AccessionResolver.validate("GSE109564", "geo")  → True

Create DownloadQueueEntry(accession="GSE109564", database="geo")

data_expert → execute_download_from_queue()

Helper methods (commonly used):

# Check if GEO identifier
if resolver.is_geo_identifier("GSE109564"):
    # Handle GEO-specific logic
    pass

# Check if SRA identifier
if resolver.is_sra_identifier("SRP123456"):
    # Handle SRA-specific logic
    pass

# Check if proteomics identifier
if resolver.is_proteomics_identifier("PXD012345"):
    # Handle PRIDE-specific logic
    pass

Migration from hardcoded patterns (✅ Done):

# ❌ OLD: Hardcoded regex in every provider (duplicated, error-prone)
class GEOProvider:
    def _validate(self, accession):
        if not re.match(r"^GSE\d+$", accession):  # Duplicated across 10+ files
            raise ValueError("Invalid")

# ✅ NEW: Centralized validation (single source of truth)
class GEOProvider:
    def _validate(self, accession):
        if not get_accession_resolver().validate(accession, "geo"):
            raise ValueError("Invalid")

Compliance benefits:

  • Data integrity - Invalid accessions rejected early
  • Audit trail - All validation logged to provenance
  • Error prevention - Typos caught before expensive downloads
  • Consistency - Single source of truth for patterns
  • Extensibility - Add new databases in one place

For complete implementation details, see:


8.3 Pre-Download Validation

What it is: Lobster performs multi-layer validation before initiating dataset downloads. This prevents wasted time, bandwidth, and storage on invalid or problematic datasets.

Validation layers (executed sequentially):

| Layer | Checks | Example | Failure Action |
|---|---|---|---|
| 1. Accession format | Regex pattern match | GSE109564 valid, INVALID123 invalid | Reject immediately |
| 2. Database detection | Identify data source | GSE* → GEO, PXD* → PRIDE | Route to correct service |
| 3. Metadata fetch | API call for dataset info | Sample count, file sizes, organism | Log metadata to queue entry |
| 4. Availability check | Verify dataset exists | HTTP HEAD request | Mark as FAILED if 404 |
| 5. Size estimation | Check file sizes | Warn if >10 GB | User confirmation required |
| 6. Queue uniqueness | Prevent duplicate downloads | Check existing queue entries | Skip if already queued |

Example validation workflow:

User: "Download GSE109564"

Layer 1: Accession format validation
├─ AccessionResolver.validate("GSE109564", "geo")
└─ ✅ PASS: Valid GEO accession

Layer 2: Database detection
├─ AccessionResolver.detect_database("GSE109564")
└─ ✅ PASS: Detected as "geo"

Layer 3: Metadata fetch
├─ GEOProvider.get_metadata("GSE109564")
├─ Returns: {"n_samples": 5000, "organism": "Homo sapiens", "platform": "GPL24676"}
└─ ✅ PASS: Metadata retrieved

Layer 4: Availability check
├─ requests.head(f"https://ftp.ncbi.nlm.nih.gov/geo/series/GSE109nnn/GSE109564/")
└─ ✅ PASS: Status 200 (dataset exists)

Layer 5: Size estimation
├─ Estimated size: 1.2 GB
└─ ✅ PASS: Below 10 GB threshold (no confirmation needed)

Layer 6: Queue uniqueness
├─ DownloadQueue.check_existing("GSE109564")
└─ ✅ PASS: Not already in queue

Result: Create DownloadQueueEntry(status=PENDING)
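The six layers behave as a fail-fast pipeline: each records a result, and the first failure stops the run. A sketch under that assumption (`run_validation_layers` and the two stub checks are hypothetical, not Lobster's internal API):

```python
import re

def run_validation_layers(accession, layers):
    """Run (name, check) pairs in order; stop at the first failure."""
    results = {}
    for name, check in layers:
        ok, detail = check(accession)
        results[name] = ("PASS" if ok else "FAIL") + (f" ({detail})" if detail else "")
        if not ok:
            return False, results
    return True, results

# Stand-ins for the first two layers only:
LAYERS = [
    ("accession_format", lambda a: (bool(re.match(r"^GSE\d+$", a)), None)),
    ("database_detection", lambda a: (True, "geo")),
]
```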

Early rejection examples (prevents wasted resources):

# Invalid accession format
User: "Download BADFORMAT123"
→ ❌ Rejected at Layer 1: "Invalid accession format: BADFORMAT123"

# Dataset doesn't exist
User: "Download GSE999999999"
→ ❌ Rejected at Layer 4: "Dataset not found: GSE999999999 (404)"

# Already in queue
User: "Download GSE109564" (twice)
→ ⚠️ Skipped at Layer 6: "GSE109564 already in download queue (status: PENDING)"

Size warnings (large datasets):

User: "Download GSE150614"

Layer 5: Size estimation
├─ Estimated size: 15 GB
└─ ⚠️ WARNING: Large dataset detected

research_agent: "⚠️ GSE150614 is large (~15 GB). Download may take 10-30 minutes depending on network speed. Proceed? (yes/no)"

User: "yes"
→ ✅ Proceed to download

User: "no"
→ ❌ Download cancelled

Queue status tracking (all stages logged):

{
  "entry_id": "download_20260101_142000",
  "accession": "GSE109564",
  "database": "geo",
  "status": "PENDING",
  "validation_results": {
    "accession_format": "PASS",
    "database_detection": "PASS (geo)",
    "metadata_fetch": "PASS (5000 samples)",
    "availability_check": "PASS (200 OK)",
    "size_estimation": "PASS (1.2 GB)",
    "queue_uniqueness": "PASS"
  },
  "created_at": "2026-01-01T14:20:00Z"
}

Benefits:

| Benefit | Impact | Example |
|---|---|---|
| Time savings | Reject invalid accessions in <1s vs 5-10min download attempt | Invalid format caught immediately |
| Bandwidth savings | Skip non-existent datasets | 404 check prevents failed downloads |
| Storage savings | Prevent duplicate downloads | Uniqueness check avoids re-downloading |
| User experience | Clear error messages | "Invalid format" vs generic failure |
| Audit trail | All validations logged | Compliance support |

Compliance benefits:

  • ALCOA+ "Accurate" - Data integrity verified before download
  • ALCOA+ "Complete" - Metadata completeness checked
  • Resource efficiency - Prevents wasted bandwidth/storage
  • Audit trail - Validation results logged to provenance
  • Quality assurance - Pre-download QC

For complete implementation details, see:


9. Deployment Security

9.1 Docker Deployment

What it is: Lobster provides containerized deployment via Docker for consistent, reproducible environments across development, staging, and production. Two container types support different use cases: CLI (local analysis) and Server (cloud API).

Container types:

| Container | Image | Purpose | Published | Use Case |
|---|---|---|---|---|
| CLI | omicsos/lobster:latest | Local analysis | ✅ Docker Hub | Individual users, CI/CD |
| Server | (private) | Cloud API service | ❌ Private only | Enterprise cloud deployment |

CLI container architecture:

FROM python:3.11-slim

# Security: Non-root user
RUN useradd -m -u 1000 lobster
USER lobster

# Install lobster-ai package (user install; CLI lands in ~/.local/bin)
RUN pip install --no-cache-dir --user lobster-ai
ENV PATH="/home/lobster/.local/bin:${PATH}"

# Workspace mounted at runtime
WORKDIR /workspace

ENTRYPOINT ["lobster"]

Running CLI container:

# Basic usage
docker run -v $(pwd):/workspace omicsos/lobster:latest query "Analyze GSE109564"

# With API keys
docker run \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  -e NCBI_API_KEY=$NCBI_API_KEY \
  -v $(pwd):/workspace \
  omicsos/lobster:latest chat

# With persistent workspace
docker run \
  -v $(pwd)/.lobster_workspace:/workspace \
  omicsos/lobster:latest query "Download GSE109564"

# With Redis for rate limiting
docker network create lobster-net
docker run -d --name redis --network lobster-net redis:alpine
docker run \
  --network lobster-net \
  -e REDIS_URL=redis://redis:6379 \
  -v $(pwd):/workspace \
  omicsos/lobster:latest query "Search PubMed"

Security properties:

| Property | Implementation | Security Benefit |
|---|---|---|
| Non-root user | UID 1000 (lobster) | Prevents privilege escalation |
| Minimal base image | python:3.11-slim | Reduced attack surface |
| No secrets in image | API keys via env vars | Prevents credential leaks |
| Read-only filesystem | Optional --read-only flag | Immutable container |
| Resource limits | --memory, --cpus flags | DoS prevention |

Multi-service deployment (docker-compose):

version: '3.8'

services:
  redis:
    image: redis:alpine
    restart: unless-stopped
    volumes:
      - redis-data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 30s

  lobster-cli:
    image: omicsos/lobster:latest
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - REDIS_URL=redis://redis:6379
    volumes:
      - ./workspace:/workspace
    depends_on:
      - redis

volumes:
  redis-data:

Compliance benefits:

  • Reproducibility - Fixed environment, version pinning
  • Isolation - Container-level separation
  • Audit trail - Image tags document versions
  • Portability - Run anywhere (local, cloud, HPC)
  • Security - Non-root user, minimal attack surface

For complete implementation details, see:


9.2 S3 Backend Security

What it is: Lobster supports Amazon S3 as a storage backend for workspaces, enabling cloud-native deployments with centralized data management. The S3 backend implements AWS security best practices for data protection.

Architecture:

  • S3DataBackend - Implements IDataBackend interface
  • boto3 integration - AWS SDK for Python
  • Workspace prefix - Per-workspace isolation in bucket
  • Server-side encryption - AES-256 or KMS encryption
  • Access control - IAM policies + bucket policies

S3 workspace structure:

s3://lobster-workspaces/
├── user1_project_gse109564/           # Workspace prefix
│   ├── .session.json                   # Session metadata
│   ├── provenance.json                 # W3C-PROV audit trail
│   ├── geo_gse109564.h5ad              # Modality data
│   ├── plots/
│   │   └── umap_plot.html
│   └── exports/
│       └── metadata_filtered.csv
├── user2_project_xyz/
│   └── ...

Security configuration:

| Security Control | Implementation | Compliance Benefit |
|---|---|---|
| Encryption at rest | S3 SSE-S3 (AES-256) or SSE-KMS | HIPAA, GDPR compliance |
| Encryption in transit | HTTPS (TLS 1.2+) | Data protection |
| Access control | IAM roles + bucket policies | Principle of least privilege |
| Versioning | S3 versioning enabled | Data recovery, audit trail |
| Logging | S3 access logs + CloudTrail | Audit support |
| MFA delete | Required for object deletion | Accidental deletion prevention |
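Per-workspace isolation plus enforced encryption reduces to building the prefixed object key and passing the SSE header on upload. A sketch (the `workspace_key` helper is hypothetical; the boto3 call itself is shown only as a comment):

```python
def workspace_key(prefix: str, relpath: str) -> str:
    """Object key under a per-workspace prefix in the shared bucket."""
    return f"{prefix.rstrip('/')}/{relpath.lstrip('/')}"

# ExtraArgs that satisfy the deny-unencrypted bucket policy below
SSE_ARGS = {"ServerSideEncryption": "AES256"}

# e.g. with a boto3 client:
# s3.upload_file("provenance.json", "lobster-workspaces",
#                workspace_key("user1_project_gse109564", "provenance.json"),
#                ExtraArgs=SSE_ARGS)
```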

IAM policy example (least privilege):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::lobster-workspaces/user1_*/*",
        "arn:aws:s3:::lobster-workspaces"
      ],
      "Condition": {
        "StringLike": {
          "s3:prefix": ["user1_*"]
        }
      }
    }
  ]
}

Bucket policy example (enforce encryption):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedObjectUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::lobster-workspaces/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "AES256"
        }
      }
    }
  ]
}

Usage (automatic backend selection):

# Configure S3 backend (environment variables)
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export LOBSTER_S3_BUCKET=lobster-workspaces
export LOBSTER_S3_WORKSPACE_PREFIX=user1_project_gse109564

# Lobster automatically uses S3 backend
lobster query "Analyze GSE109564"
# Data stored to: s3://lobster-workspaces/user1_project_gse109564/

Data lifecycle policies (cost optimization + compliance):

{
  "Rules": [
    {
      "Id": "TransitionToIA",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        }
      ],
      "NoncurrentVersionTransitions": [
        {
          "NoncurrentDays": 7,
          "StorageClass": "GLACIER"
        }
      ]
    },
    {
      "Id": "ExpireOldVersions",
      "Status": "Enabled",
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 90
      }
    }
  ]
}

Compliance benefits:

  • HIPAA compliance - Encryption at rest + in transit, access logs
  • GDPR compliance - Data residency (region selection), encryption
  • 21 CFR Part 11 - Audit trails (CloudTrail), versioning
  • SOC 2 - AWS SOC 2 certification + Lobster audit trail
  • Cost optimization - Lifecycle policies for long-term storage

For complete implementation details, see:


9.3 AWS License Service Deployment

What it is: Lobster's license service is deployed as an AWS serverless application using Lambda, API Gateway, DynamoDB, and KMS. This section covers the security architecture and deployment best practices for the license service.

Architecture overview:

API Gateway (REST) → Lambda (Python 3.12) → DynamoDB + KMS
       ↓                    ↓                       ↓
  HTTPS only          ARM64 runtime         RSA-2048 signing
  Rate limiting       256 MB memory         Private key in HSM
  IAM auth            10s timeout           Automatic rotation

Security layers:

| Layer | Control | Implementation |
|---|---|---|
| API Gateway | Rate limiting | 1000 req/sec burst, 5000 req/sec steady |
| API Gateway | HTTPS enforcement | TLS 1.2+ required |
| Lambda | IAM execution role | Least privilege (DynamoDB, KMS only) |
| Lambda | VPC isolation | Optional (private subnet) |
| DynamoDB | Encryption at rest | AWS-managed keys (KMS) |
| DynamoDB | Point-in-time recovery | Automatic backups |
| KMS | HSM-backed keys | FIPS 140-2 Level 2 validated |
| KMS | Key rotation | Automatic annual rotation |
| CloudWatch | Audit logging | All API calls logged |

Lambda execution role (least privilege):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:Query"
      ],
      "Resource": [
        "arn:aws:dynamodb:us-east-1:*:table/LobsterEntitlements",
        "arn:aws:dynamodb:us-east-1:*:table/LobsterCustomers",
        "arn:aws:dynamodb:us-east-1:*:table/LobsterAuditLogs"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "kms:Sign",
        "kms:GetPublicKey"
      ],
      "Resource": "arn:aws:kms:us-east-1:*:key/license-signing-key"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:us-east-1:*:log-group:/aws/lambda/LobsterLicenseService:*"
    }
  ]
}

DynamoDB table security:

| Table | Encryption | Backup | TTL |
|---|---|---|---|
| Entitlements | KMS | PITR | No |
| Customers | KMS | PITR | No |
| AuditLogs | KMS | PITR | 90 days |
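The 90-day TTL on AuditLogs works by stamping each item with an epoch-seconds expiry attribute; DynamoDB deletes the item once that time has passed. A sketch of the arithmetic (function name hypothetical):

```python
def audit_log_ttl(created_epoch: int, retention_days: int = 90) -> int:
    """Epoch-seconds value for the DynamoDB TTL attribute on an AuditLogs item."""
    return created_epoch + retention_days * 86_400  # 86,400 seconds per day
```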

Deployment (AWS CDK):

# Install dependencies
cd lobster-cloud
source .venv/bin/activate
pip install -r requirements.txt

# Deploy license service
export AWS_REGION=us-east-1
export JSII_SILENCE_WARNING_UNTESTED_NODE_VERSION=1

cdk deploy LobsterLicenseService \
  --context signing_key_arn=arn:aws:kms:us-east-1:*:key/your-key-id \
  --context environment=production

# Output:
# API Gateway URL: https://x6gm9vfgl5.execute-api.us-east-1.amazonaws.com/v1
# JWKS URL: https://d123abc456.cloudfront.net/.well-known/jwks.json

Monitoring & alerting:

# CloudWatch alarms (CDK automatically creates)
- High error rate (>5% errors)
- High latency (>1s P99)
- DynamoDB throttling
- KMS rate limiting
- Lambda concurrency limit

# CloudWatch Insights queries
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() by bin(5m)

Security best practices:

  1. API Gateway: Enable AWS WAF for DDoS protection
  2. Lambda: Use VPC endpoints for DynamoDB/KMS (no internet)
  3. KMS: Enable automatic key rotation (annual)
  4. DynamoDB: Enable point-in-time recovery (PITR)
  5. CloudWatch: Set up alarms for security events
  6. IAM: Regularly audit IAM roles and policies

Compliance benefits:

  • SOC 2 - AWS SOC 2 certified services
  • HIPAA - DynamoDB encryption, audit logging
  • GDPR - Data residency (region selection)
  • 21 CFR Part 11 - Audit trails (CloudTrail, CloudWatch)
  • FedRAMP - AWS GovCloud deployment option

For complete implementation details, see:


10. Compliance Features for Regulated Environments

10.1 GxP-Ready Checklist

What it is: Lobster implements multiple features that align with Good Practice (GxP) requirements for regulated pharmaceutical and clinical research. This checklist maps Lobster features to GxP principles and regulatory requirements.

ALCOA+ Principles (FDA Data Integrity Guidance):

| Principle | Requirement | Lobster Implementation | Section Reference |
|---|---|---|---|
| Attributable | Who performed the action? | Agent attribution in W3C-PROV, session metadata tracks users | 3.1 |
| Legible | Can data be read/understood? | Human-readable JSON, Plotly visualizations, markdown reports | 2.0 |
| Contemporaneous | Recorded in real-time? | UTC timestamps on all operations, immediate session updates | 3.3 |
| Original | First recording preserved? | W3C-PROV provenance, immutable activity log | 3.1 |
| Accurate | Free from errors? | Schema validation, pre-download validation, QC checks | 8.1 |
| + Complete | All data present? | Required field validation, session metadata complete | 8.1 |
| + Consistent | Data relationships valid? | Cross-field validation, modality compatibility checks | 8.3 |
| + Enduring | Long-term preservation? | H5AD/MuData formats, S3 archival, notebook exports | 9.2 |
| + Available | Accessible when needed? | Session restoration, workspace archival, S3 backend | 6.3 |

21 CFR Part 11 Requirements (Electronic Records):

| Regulation | Requirement | Lobster Implementation | Status |
|---|---|---|---|
| § 11.10(a) | System validation | Testing framework, CI/CD, reproducible builds | ✅ Ready |
| § 11.10(b) | Ability to generate accurate copies | Notebook export, workspace archival, provenance export | ✅ Ready |
| § 11.10(d) | Limiting system access to authorized individuals | Workspace permissions, subscription tiers, license validation | ✅ Ready |
| § 11.10(e) | Secure timestamped audit trails | W3C-PROV with UTC timestamps, session tracking | ✅ Ready |
| § 11.10(e) | Audit trail review | Provenance queries, session status, activity logs | ✅ Ready |
| § 11.10(f) | Operational system checks | Schema validation, accession validation, QC metrics | ✅ Ready |
| § 11.10(g) | Authority checks | Subscription tier enforcement, license manager | ✅ Ready |

ISO/IEC 27001:2022 (Information Security):

| Control | Category | Lobster Implementation | Section |
|---|---|---|---|
| A.8.1 | Asset management | Modality tracking, workspace inventory | 6.1 |
| A.8.10 | Information deletion | Modality removal with audit trail | 6.3 |
| A.8.15 | Access logging | Session metadata, tool usage tracking | 3.3 |
| A.8.16 | Audit logs | W3C-PROV provenance, DynamoDB audit logs | 3.1 |
| A.8.24 | Cryptographic controls | SHA-256 hashing, RSA-2048 signatures, KMS | 2.0, 4.1 |

Compliance readiness matrix:

| Regulation | Current Status | Deployment Mode | Notes |
|---|---|---|---|
| 21 CFR Part 11 | ✅ Ready | Local + Cloud | Full audit trail, validation, access control |
| HIPAA | ⚠️ Conditional | Local (ready), Cloud (BAA required) | Encryption, access logs available |
| GDPR | ⚠️ Conditional | Local (ready), Cloud (region + DPA) | Data residency configurable |
| GxP (GAMP 5) | ⚠️ Partial | Local (ready for Cat 4), Cloud (validation TBD) | IQ/OQ/PQ documentation needed |
| ISO/IEC 27001 | ✅ Ready | Local + Cloud | Information security controls implemented |
| SOC 2 Type II | ⚠️ Partial | Cloud (AWS certified), Lobster (audit pending) | Inherits AWS certification for infrastructure |

Quick deployment checklist (regulated environments):

# 1. ✅ Enable all audit features
export LOBSTER_ENABLE_PROVENANCE=true  # Default: enabled
export LOBSTER_ENABLE_INTEGRITY_MANIFEST=true  # Default: enabled

# 2. ✅ Configure secure workspace
mkdir -p /validated/workspaces/project_gse109564
chmod 700 /validated/workspaces/project_gse109564
export LOBSTER_WORKSPACE=/validated/workspaces/project_gse109564

# 3. ✅ Use PREMIUM tier (metadata_assistant for publication processing)
lobster activate lbstr_premium_key_abc123

# 4. ✅ Enable Redis for rate limiting (multi-user)
docker run -d -p 6379:6379 redis:alpine
export REDIS_URL=redis://localhost:6379

# 5. ✅ Set up API keys securely
cat > .env << EOF
ANTHROPIC_API_KEY=sk-ant-api03-...
NCBI_API_KEY=abc123...
EOF
chmod 600 .env

# 6. ✅ Verify compliance features
lobster status
# Check: Provenance enabled, Subscription tier, Workspace path

Compliance benefits:

  • Complete ALCOA+ coverage - All 9 principles supported
  • 21 CFR Part 11 ready - Electronic records & signatures
  • Multi-regulation support - HIPAA, GDPR, GxP, ISO 27001
  • Audit-ready - Comprehensive provenance + integrity manifests
  • Validation-friendly - Reproducible notebooks, fixed environments

For complete implementation details, see:


10.2 Deployment Patterns for Regulated Environments

What it is: Recommended deployment architectures for different regulatory compliance levels (GxP, HIPAA, GDPR, SOC 2).

Pattern 1: Academic/Research (minimal compliance):

Deployment: Local CLI
Security: Basic
Compliance: Internal only

Components:
├── Local machine (macOS/Linux/Windows)
├── Lobster CLI (pip install lobster-ai)
├── Local workspace (~/.lobster_workspace)
└── API keys in .env file

Suitable for:
- Academic research (public data)
- Exploratory analysis
- Individual researchers

Pattern 2: Biotech Startup (moderate compliance):

Deployment: Local CLI + shared workspaces
Security: Enhanced
Compliance: GLP, internal QA

Components:
├── Shared Linux server
├── Lobster CLI (Docker container)
├── Redis (rate limiting coordination)
├── Shared workspaces (/shared/projects/*)
├── API keys in HashiCorp Vault
└── Weekly provenance audits

Suitable for:
- Small biotech companies (5-20 users)
- Confidential but non-GxP data
- Internal QA requirements

Pattern 3: Pharma Enterprise (full compliance):

Deployment: Validated environment
Security: Maximum
Compliance: GxP (GAMP Cat 4), 21 CFR Part 11

Components:
├── Validated Linux environment (air-gapped)
├── Lobster CLI (Docker, validated image)
├── Redis (high availability cluster)
├── S3 backend (encrypted, versioned, WORM)
├── API keys in AWS Secrets Manager
├── Automated IQ/OQ/PQ testing
├── Change control process
└── Annual validation review

Suitable for:
- Pharmaceutical companies (GxP data)
- Clinical trials (patient data)
- Regulatory submissions (FDA, EMA)

Pattern 4: Cloud SaaS (multi-tenant):

Deployment: AWS serverless
Security: Maximum + isolation
Compliance: HIPAA (BAA), SOC 2, GDPR

Components:
├── AWS Lambda (auto-scaling)
├── API Gateway (rate limiting)
├── DynamoDB (encrypted)
├── S3 (per-tenant workspaces)
├── Redis ElastiCache (rate limiting)
├── KMS (encryption keys)
├── CloudWatch (audit logs)
└── AWS WAF (DDoS protection)

Suitable for:
- Multi-tenant SaaS
- Managed service offering
- Large-scale processing (100s of users)

Comparison matrix:

| Requirement | Academic | Biotech | Pharma | Cloud SaaS |
|---|---|---|---|---|
| Setup time | 10 min | 2 hours | 2-4 weeks | 1-2 weeks |
| Compliance | None | Internal QA | GxP validated | HIPAA/SOC 2 |
| Cost | $0 | $100-500/mo | $5K-20K one-time | $10K-50K setup |
| Security | Basic | Enhanced | Maximum | Maximum |
| User capacity | 1-5 | 5-20 | 20-100 | 100-1000s |

Deployment decision tree:

Q: Do you handle patient data (PHI/PII)?
├─ YES: Pattern 3 (Pharma) or Pattern 4 (Cloud with BAA)
└─ NO: Continue

Q: Do you need GxP validation?
├─ YES: Pattern 3 (Pharma)
└─ NO: Continue

Q: Do you have >20 users?
├─ YES: Pattern 4 (Cloud SaaS)
└─ NO: Continue

Q: Do you need multi-user coordination?
├─ YES: Pattern 2 (Biotech)
└─ NO: Pattern 1 (Academic)

Compliance benefits:

  • Flexible deployment - Choose pattern per compliance needs
  • Scalability - Start simple, upgrade as regulations require
  • Cost-effective - Pay only for compliance level needed
  • Audit-ready - All patterns support provenance tracking



10.3 Standard Operating Procedures (SOPs)

What it is: Template SOPs for integrating Lobster AI into regulated workflows. These templates can be customized for specific organizational requirements.

SOP 1: Data Analysis with Lobster AI (template):

## SOP-LOBSTER-001: Bioinformatics Data Analysis

**Purpose**: Standardize use of Lobster AI for bioinformatics analysis in GxP environment

**Scope**: All analysts performing bioinformatics analysis on GxP data

**Responsibilities**:
- Analyst: Execute analysis, document decisions
- Lead Analyst: Review analysis, approve results
- QA: Verify data integrity, audit provenance

**Procedure**:

1. **Session Initialization**
   - Create dedicated workspace: `lobster chat --workspace /validated/project_name/`
   - Verify subscription tier: `lobster status` (PREMIUM required for GxP)
   - Document session ID in lab notebook

2. **Data Loading**
   - Use validated data sources only (GEO, internal repositories)
   - Verify accession format before download
   - Check Data Integrity Manifest after download
   - Document: Dataset ID, download date, file hash

3. **Analysis Execution**
   - Follow validated workflows (clustering, DE, etc.)
   - Document all custom code with justification
   - Review QC metrics at each step
   - Save all plots to workspace/plots/

4. **Notebook Export**
   - Export pipeline: `/pipeline export`
   - Verify Data Integrity Manifest present
   - Verify provenance hash included
   - Archive notebook + data + provenance.json

5. **Review and Approval**
   - Lead analyst reviews notebook
   - QA verifies file hashes match manifest
   - Approve for downstream use (submission, publication)
   - Document approval in QMS (Quality Management System)

**Audit Trail**:
- All operations logged to provenance.json
- Session metadata saved to .session.json
- Notebook includes Data Integrity Manifest
- Hashes verify data authenticity

**Revision History**:
- Version 1.0: Initial SOP (2026-01-01)

SOP 2: Hash Verification for Data Integrity (template):

## SOP-LOBSTER-002: Data Integrity Verification

**Purpose**: Verify cryptographic hashes in Lobster AI notebooks

**Scope**: All notebooks used for regulatory submissions or GxP decisions

**Procedure**:

1. **Notebook Receipt**
   - Receive notebook file (.ipynb) from analyst
   - Receive data files referenced in notebook
   - Receive provenance.json from workspace

2. **Hash Extraction**
   - Open notebook in Jupyter or text editor
   - Locate "🔒 Data Integrity Manifest" cell (cell 2)
   - Extract input_files section with SHA-256 hashes

3. **Hash Verification**
   ```bash
   # For each file in input_files
   shasum -a 256 filename.h5ad
   # Compare output to manifest hash
   ```

4. **Provenance Verification**
   ```bash
   # Verify provenance hash
   python verify_provenance_hash.py provenance.json manifest_hash
   ```

5. **Documentation**
   - Record verification results in QA log
   - If PASS: Approve notebook for review
   - If FAIL: Return to analyst for investigation

**Acceptance Criteria**:
- ✅ All file hashes match manifest
- ✅ Provenance hash matches manifest
- ✅ System info documented (lobster version, git commit)

**Revision History**:
- Version 1.0: Initial SOP (2026-01-01)
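
Steps 2-4 above can be scripted for QA batch review. A minimal sketch, assuming the extracted `input_files` section has been parsed into a `{relative_path: sha256_hex}` dict (the exact manifest layout is defined by the notebook exporter, and `verify_provenance_hash.py` from step 4 is not reproduced here):

```python
"""Sketch of SOP-LOBSTER-002 hash verification (steps 2-3)."""
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Chunked SHA-256 so multi-gigabyte .h5ad files never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_input_files(manifest: dict, base_dir: Path) -> dict:
    """Return {file: 'PASS' | 'FAIL' | 'MISSING'} for each manifest entry."""
    results = {}
    for rel_path, expected in manifest.items():
        path = base_dir / rel_path
        if not path.exists():
            results[rel_path] = "MISSING"
        else:
            results[rel_path] = "PASS" if sha256_file(path) == expected else "FAIL"
    return results
```

Any `FAIL` or `MISSING` result is recorded in the QA log and the notebook is returned to the analyst (step 5).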

**SOP 3: Custom Code Review** (template):

## SOP-LOBSTER-003: Custom Code Review

**Purpose**: Review and approve custom code blocks in Lobster analyses

**Scope**: All analyses using execute_custom_code tool

**Procedure**:

1. **Pre-Execution Review** (required for GxP):
   - Analyst documents custom code justification
   - Lead analyst reviews code for:
     - Security risks (subprocess, network access)
     - Data integrity risks (file manipulation)
     - Scientific validity (correct operations)
   - Approval documented in lab notebook

2. **Post-Execution Audit** (quarterly):
   - QA extracts custom code from provenance.json
   - Review for patterns (can standardize?)
   - Check for forbidden operations
   - Document findings in QA report

**Forbidden Patterns** (auto-blocked):
- ❌ subprocess, os.system, importlib
- ❌ eval, exec, compile
- ❌ File access outside workspace

**Allowed Patterns**:
- ✅ Pandas filtering (complex conditions)
- ✅ Custom QC checks
- ✅ Format conversions

**Revision History**:
- Version 1.0: Initial SOP (2026-01-01)

Compliance benefits:

  • Standardized procedures - Consistent workflows across teams
  • Documentation - SOPs required for GxP validation
  • Training - Clear guidance for analysts and QA
  • Audit support - Procedures documented for inspections

10.4 Validation Testing for GxP

What it is: Guidance for performing Installation Qualification (IQ), Operational Qualification (OQ), and Performance Qualification (PQ) testing for Lobster AI in validated environments.

GAMP 5 Category: Category 4 (Configurable software)
Validation approach: Risk-based, leveraging vendor testing

IQ (Installation Qualification) - Verify correct installation:

| Test | Procedure | Acceptance Criteria |
|---|---|---|
| Version verification | `lobster --version` | Matches validated version |
| Dependency check | `pip list \| grep -E "(scanpy\|anndata\|pydeseq2\|plotly)"` | All required packages present |
| Workspace creation | `lobster query "test" --workspace /validated/test/` | Workspace created successfully |
| Provenance enabled | Check `.lobster_workspace/provenance.json` exists | File created |
| License validation | `lobster status` | PREMIUM tier active |

IQ checklist:

# 1. Version check
lobster --version
# Expected: Lobster AI CLI v0.3.4 (or specified version)

# 2. Dependency verification
pip list | grep -E "(scanpy|anndata|pydeseq2|plotly)"

# 3. Create test workspace
lobster query "test installation" --workspace /validated/iq_test/
ls /validated/iq_test/.lobster_workspace/

# 4. Verify provenance file
cat /validated/iq_test/.lobster_workspace/provenance.json

# 5. Check license
lobster status
# Expected: Subscription Tier: premium

# 6. Document results in IQ report

OQ (Operational Qualification) - Verify features work correctly:

| Test | Procedure | Acceptance Criteria |
|---|---|---|
| Data download | Download GSE109564 | Data loaded, hash in manifest |
| Quality control | Run QC on test dataset | QC metrics calculated |
| Clustering | Perform Leiden clustering | Clusters assigned |
| Visualization | Generate UMAP plot | Plot saved to plots/ |
| Notebook export | `/pipeline export` | Notebook with integrity manifest |
| Hash verification | Verify SHA-256 hashes | All hashes match |
| Provenance query | Query session activities | Complete audit trail |

OQ test script (automated):

#!/usr/bin/env python3
"""OQ test script for Lobster AI validation."""

import subprocess
import json
import hashlib
from pathlib import Path

WORKSPACE = Path("/validated/oq_test/.lobster_workspace")

def run_oq_test():
    """Execute operational qualification tests."""
    tests = [
        ("Download dataset", "Download GSE109564 and assess quality"),
        ("Clustering", "Cluster the data with resolution 0.5"),
        ("Export notebook", "Export the analysis pipeline"),
    ]

    for test_name, command in tests:
        print(f"\n{'='*60}")
        print(f"OQ Test: {test_name}")
        print(f"{'='*60}")

        result = subprocess.run(
            ["lobster", "query", command, "--workspace", str(WORKSPACE.parent)],
            capture_output=True,
            text=True,
            timeout=600
        )

        if result.returncode == 0:
            print(f"✅ PASS: {test_name}")
        else:
            print(f"❌ FAIL: {test_name}")
            print(f"Error: {result.stderr}")
            return False

    # Verify provenance file
    prov_path = WORKSPACE / "provenance.json"
    if prov_path.exists():
        with open(prov_path) as f:
            prov = json.load(f)
        print(f"\n✅ Provenance file exists: {len(prov.get('activities', []))} activities")
    else:
        print("\n❌ Provenance file missing")
        return False

    return True

if __name__ == "__main__":
    success = run_oq_test()
    exit(0 if success else 1)

PQ (Performance Qualification) - Verify performance with real data:

| Test | Procedure | Acceptance Criteria |
|---|---|---|
| Large dataset | Download GSE150614 (15 GB) | Completes in <30 min |
| Complex analysis | Full workflow (QC → cluster → DE) | Completes in <2 hours |
| Batch processing | Process 10 datasets | All complete successfully |
| Concurrent users | 5 users simultaneous | No conflicts, no errors |
| Data integrity | Verify hashes for all outputs | All hashes match |

PQ acceptance criteria:

  • ✅ Performance meets specifications (time limits)
  • ✅ Results scientifically valid (QC metrics within range)
  • ✅ Data integrity maintained (all hashes verify)
  • ✅ Audit trail complete (all operations logged)
  • ✅ Reproducibility confirmed (re-run generates same results)

Validation documentation structure:

validation_package/
├── VP_001_Validation_Plan.pdf
├── IQ_001_Installation_Qualification.pdf
│   ├── Test cases 1-6
│   ├── Screenshots
│   └── Signatures (analyst, QA, manager)
├── OQ_001_Operational_Qualification.pdf
│   ├── Test cases 1-7
│   ├── Test data
│   └── Signatures
├── PQ_001_Performance_Qualification.pdf
│   ├── Test cases 1-5
│   ├── Performance data
│   └── Signatures
└── Summary_Report.pdf
    ├── Validation summary
    ├── Deviations (if any)
    └── Final approval signatures

Compliance benefits:

  • GxP validation - IQ/OQ/PQ documented
  • Risk-based - GAMP 5 Category 4 approach
  • Audit-ready - Complete validation package
  • Reproducible - Automated test scripts
  • Change control - Re-validation on version updates



11. Security Best Practices

11.1 Environment Configuration Security

What it is: Best practices for securely configuring Lobster AI environments (development, staging, production) to prevent credential leaks, unauthorized access, and configuration drift.

Configuration hierarchy (secure defaults):

| Level | File Location | Priority | Use Case | Security |
|---|---|---|---|---|
| Workspace | `./project/.env` | Highest | Project-specific keys | Project isolation |
| Global | `~/.lobster/.env` | Medium | User-wide keys | User isolation |
| System | `/etc/lobster/.env` | Low | System-wide (enterprise) | Shared configs only |
| Environment | `export VAR=value` | Lowest | CI/CD, containers | Ephemeral |
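
To illustrate the precedence above, a minimal resolver sketch (Lobster's actual configuration loader is internal and may differ; only the search paths come from the table):

```python
import os
from pathlib import Path

# Highest priority first: workspace, then global, then system .env files.
SEARCH_ORDER = [
    Path("./.env"),
    Path.home() / ".lobster" / ".env",
    Path("/etc/lobster/.env"),
]


def resolve(key, search_order=SEARCH_ORDER):
    """Return the first definition of `key` found, else fall back to the process environment."""
    for env_file in search_order:
        if not env_file.exists():
            continue
        for line in env_file.read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                k, _, v = line.partition("=")
                if k.strip() == key:
                    return v.strip()
    return os.environ.get(key)  # lowest priority: exported variables
```

First file that defines a key wins, which is why a workspace `.env` can safely override a shared system config.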

Secure .env file setup:

# ✅ GOOD: Create workspace-specific .env (highest priority)
cat > ~/project1/.env << EOF
# LLM Provider (required)
ANTHROPIC_API_KEY=sk-ant-api03-...

# Optional: NCBI API key (3x rate limit increase)
NCBI_API_KEY=abc123...

# Optional: Cloud key (for cloud mode)
LOBSTER_CLOUD_KEY=lbstr_premium_...
EOF

# Secure permissions (user-only)
chmod 600 ~/project1/.env

# ✅ GOOD: Add to .gitignore (prevent commits)
echo ".env" >> .gitignore
echo "*.env" >> .gitignore
git add .gitignore

Common security mistakes:

# ❌ BAD: Commit secrets to git
git add .env
git commit -m "Add config"  # Credentials leaked to history!

# ❌ BAD: World-readable permissions
chmod 644 .env  # Anyone can read API keys

# ❌ BAD: Hardcode in scripts
echo 'export ANTHROPIC_API_KEY="sk-ant-..."' >> setup.sh
git add setup.sh  # Key committed to git

# ❌ BAD: Share keys across users
echo "ANTHROPIC_API_KEY=shared" > /etc/lobster/.env  # Security risk

# ❌ BAD: Store keys in plaintext in cloud
aws s3 cp .env s3://public-bucket/.env  # Publicly accessible!

Environment-specific best practices:

Development:

# Use personal API keys (not shared)
cat > .env << EOF
ANTHROPIC_API_KEY=$PERSONAL_ANTHROPIC_KEY
NCBI_API_KEY=$PERSONAL_NCBI_KEY
EOF
chmod 600 .env

# Use .env.example for team (no real keys)
cat > .env.example << EOF
ANTHROPIC_API_KEY=sk-ant-api03-your-key-here
NCBI_API_KEY=your-ncbi-key-here
EOF
git add .env.example

Staging:

# Use staging-specific keys (rotated monthly)
export ANTHROPIC_API_KEY=$(aws secretsmanager get-secret-value \
  --secret-id staging/lobster/anthropic \
  --query SecretString --output text)

# Separate workspace
export LOBSTER_WORKSPACE=/staging/workspaces/

Production:

# Use production keys (rotated quarterly)
export ANTHROPIC_API_KEY=$(vault kv get -field=api_key secret/prod/lobster/anthropic)

# Read-only data sources
export LOBSTER_WORKSPACE=/production/workspaces/
chmod 500 /production/data/  # Read + execute only

# Enable all audit features
export LOBSTER_ENABLE_PROVENANCE=true
export LOBSTER_ENABLE_INTEGRITY_MANIFEST=true

Secret rotation schedule:

| Secret Type | Rotation Frequency | Trigger | Process |
|---|---|---|---|
| NCBI API keys | Quarterly | Calendar | Generate new key, update .env, test |
| Anthropic keys | Quarterly | Calendar | Rotate via Anthropic Console |
| AWS credentials | On departure | Team member leaves | Revoke IAM access, issue new keys |
| Lobster cloud keys | Annual | Subscription renewal | License service auto-renews |
| SSH keys | Bi-annually | Calendar | Generate new keypair |

Compliance benefits:

  • Credential protection - Secure storage, rotation policies
  • Access control - File permissions enforce isolation
  • Audit trail - Configuration changes logged
  • Incident response - Rapid key rotation on compromise
  • Principle of least privilege - Workspace > global > system



11.2 Access Control Best Practices

What it is: Operational best practices for managing user access, workspace permissions, and data isolation in multi-user deployments.

User access model (recommended):

| User Type | Access Level | Workspace | Subscription Tier | Use Case |
|---|---|---|---|---|
| Analyst | Read/Write own workspace | `~/workspaces/$USERNAME/` | FREE/PREMIUM | Individual analysis |
| Lead Analyst | Read all team workspaces | `/team/workspaces/*/` | PREMIUM | Team oversight |
| QA | Read-only all workspaces | `/validated/workspaces/*/` | PREMIUM | Audit and review |
| Admin | Full system access | All paths | ENTERPRISE | System maintenance |

Filesystem permissions (Linux/macOS):

# Individual analyst workspace (user-only)
mkdir -p ~/workspaces/analyst1
chmod 700 ~/workspaces/analyst1  # rwx------
chown analyst1:analyst1 ~/workspaces/analyst1

# Team workspace (group access)
mkdir -p /team/project_gse109564
chmod 770 /team/project_gse109564  # rwxrwx---
chown analyst1:bioinfo_team /team/project_gse109564

# QA workspace (read-only for QA team)
mkdir -p /validated/project_gse109564
chmod 750 /validated/project_gse109564  # rwxr-x---
chown analyst1:qa_team /validated/project_gse109564

# Archive workspace (read-only for everyone)
mkdir -p /archives/completed_analyses
chmod 555 /archives/completed_analyses  # r-xr-xr-x

Docker multi-user deployment:

# docker-compose.yml (multi-user)
version: '3.8'

services:
  lobster-analyst1:
    image: omicsos/lobster:latest
    user: "1000:1000"  # UID:GID for analyst1
    environment:
      - ANTHROPIC_API_KEY=${ANALYST1_API_KEY}
      - REDIS_URL=redis://redis:6379
    volumes:
      - /workspaces/analyst1:/workspace:rw  # Read-write own workspace
      - /team/shared:/team:rw                # Read-write team workspace
      - /validated:/validated:ro             # Read-only validated data

  lobster-qa:
    image: omicsos/lobster:latest
    user: "2000:2000"  # UID:GID for QA user
    environment:
      - ANTHROPIC_API_KEY=${QA_API_KEY}
    volumes:
      - /validated:/workspace:ro             # Read-only access
      - /archives:/archives:ro               # Read-only archives

Access logging (audit trail):

# Monitor workspace access (Linux)
auditctl -w /validated/workspaces/ -p rwa -k lobster_access

# View access logs
ausearch -k lobster_access

# Or use inotify for real-time monitoring
inotifywait -m -r -e access,modify,create,delete /validated/workspaces/

Access control checklist (enterprise):

# 1. ✅ Verify user isolation
ls -la ~/workspaces/
# Each user should only see their own directory

# 2. ✅ Test read-only restrictions (QA user)
su - qa_user
lobster query "Load data" --workspace /validated/project/
# Should succeed (read-only)

echo "test" > /validated/project/.lobster_workspace/unauthorized.txt
# Should fail (permission denied)

# 3. ✅ Verify subscription tier enforcement
su - analyst_free
lobster status
# Should show: Subscription Tier: free

lobster query "Process publication queue"
# Should fail: metadata_assistant requires PREMIUM

# 4. ✅ Audit trail verification
cat /validated/project/.lobster_workspace/.session.json | jq '.tool_usage'
# Should show all operations with timestamps + agent attribution

Compliance benefits:

  • Access control - OS-level + subscription tier enforcement
  • Data isolation - Per-user/per-team workspaces
  • Audit trail - All access logged
  • Principle of least privilege - Role-based permissions
  • Multi-user support - Enterprise deployment ready

11.3 Data Handling Best Practices

What it is: Operational guidance for securely handling sensitive data (PHI, PII, confidential) with Lobster AI.

Data classification (example):

| Classification | Examples | Lobster Deployment | Compliance |
|---|---|---|---|
| Public | GEO datasets, published papers | Local or Cloud | None |
| Internal | Unpublished experiments | Local (recommended) | Internal QA |
| Confidential | Proprietary assays, IP | Local only | NDA, trade secret |
| Sensitive | Patient data (PHI/PII) | Local (validated) | HIPAA, GDPR |

Handling sensitive data:

# ✅ GOOD: Local mode for PHI/PII
unset LOBSTER_CLOUD_KEY  # Ensure local mode
export LOBSTER_WORKSPACE=/encrypted/phi_data/project_xyz
lobster chat --workspace /encrypted/phi_data/project_xyz

# ✅ GOOD: Encrypted filesystem (Linux)
cryptsetup luksFormat /dev/sdb1
cryptsetup luksOpen /dev/sdb1 encrypted_data
mkfs.ext4 /dev/mapper/encrypted_data
mount /dev/mapper/encrypted_data /encrypted/phi_data

# ✅ GOOD: Automatic workspace cleanup (after archival)
tar -czf project_archive.tar.gz /encrypted/phi_data/project_xyz
shasum -a 256 project_archive.tar.gz >> archive_manifest.txt
rm -rf /encrypted/phi_data/project_xyz  # After archival only

# ❌ BAD: Cloud mode with PHI (without BAA)
export LOBSTER_CLOUD_KEY=lbstr_...
lobster query "Analyze patient data"  # PHI sent to cloud (HIPAA violation!)

Data retention policies (example):

| Data Type | Retention | Storage | Deletion |
|---|---|---|---|
| Raw data | 7 years | S3 Glacier | Automated (lifecycle) |
| Analysis results | 3 years | S3 Standard | Manual review |
| Provenance logs | 7 years | S3 Glacier | Automated |
| Notebooks | Permanent | S3 Standard-IA | Never (archival) |
| Temporary files | 30 days | Local disk | Automated cleanup |
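
The automated Glacier transitions and deletions in the table above map directly to an S3 lifecycle configuration. A sketch (bucket name and prefixes are illustrative; 2555 days ≈ 7 years):

```json
{
  "Rules": [
    {
      "ID": "raw-data-glacier-7y",
      "Filter": { "Prefix": "raw/" },
      "Status": "Enabled",
      "Transitions": [{ "Days": 30, "StorageClass": "GLACIER" }],
      "Expiration": { "Days": 2555 }
    },
    {
      "ID": "provenance-glacier-7y",
      "Filter": { "Prefix": "provenance/" },
      "Status": "Enabled",
      "Transitions": [{ "Days": 30, "StorageClass": "GLACIER" }],
      "Expiration": { "Days": 2555 }
    }
  ]
}
```

Apply with `aws s3api put-bucket-lifecycle-configuration --bucket <bucket> --lifecycle-configuration file://lifecycle.json`.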

Data anonymization (for PHI):

# ✅ GOOD: Anonymize metadata before analysis
execute_custom_code("""
import pandas as pd

# Load metadata with PHI
metadata = pd.read_csv(WORKSPACE / 'metadata_with_phi.csv')

# Remove PHI columns
phi_columns = ['patient_id', 'patient_name', 'date_of_birth', 'ssn']
metadata_anon = metadata.drop(columns=phi_columns)

# Generate anonymous IDs
metadata_anon['sample_id'] = [f"SAMPLE_{i:06d}" for i in range(len(metadata_anon))]

# Save anonymized version
metadata_anon.to_csv(OUTPUT_DIR / 'metadata_anonymized.csv', index=False)

# Delete original (after verification)
# os.remove(WORKSPACE / 'metadata_with_phi.csv')  # Manual step
""")

Data transfer security:

# ✅ GOOD: Encrypted transfer (SCP)
scp -i ~/.ssh/id_rsa -C \
  analyst1@server:/workspaces/project/analysis.tar.gz \
  ~/local_copy/

# ✅ GOOD: Verify integrity after transfer
shasum -a 256 ~/local_copy/analysis.tar.gz
# Compare to hash from manifest

# ✅ GOOD: Use SFTP for large files
sftp analyst1@server
sftp> get /workspaces/project/large_dataset.h5ad

# ❌ BAD: Unencrypted transfer
ftp analyst1@server  # Plain FTP (no encryption)
scp -o "StrictHostKeyChecking=no" ...  # Disables host verification (MITM risk)

Compliance benefits:

  • Data protection - Classification-based handling
  • HIPAA compliance - PHI isolation, local mode
  • GDPR compliance - Data retention, anonymization
  • Audit trail - All data operations logged
  • Incident response - Clear procedures for data breaches

11.4 Monitoring and Incident Response

What it is: Proactive monitoring, alerting, and incident response procedures for Lobster AI deployments.

Monitoring stack (recommended):

| Component | Tool | Purpose | Alert Threshold |
|---|---|---|---|
| System health | `lobster status` | Check CLI functionality | Errors in output |
| Disk usage | `df -h` | Monitor workspace size | >80% full |
| Provenance logs | `jq` queries | Audit trail analysis | Error rate >5% |
| Redis health | `redis-cli ping` | Rate limiter availability | Connection failures |
| API errors | Log aggregation | Network failure tracking | >10 failures/hour |

Health check script (automated monitoring):

#!/usr/bin/env python3
"""Lobster AI health check for monitoring systems."""

import subprocess
import json
from pathlib import Path

def check_lobster_health():
    """Run health checks and return status."""
    checks = {
        "cli_available": False,
        "workspace_writable": False,
        "provenance_enabled": False,
        "redis_available": False,
        "disk_space_ok": False
    }

    # 1. CLI availability
    result = subprocess.run(["lobster", "--version"], capture_output=True)
    checks["cli_available"] = result.returncode == 0

    # 2. Workspace writable
    workspace = Path.home() / ".lobster_workspace"
    try:
        test_file = workspace / ".health_check"
        test_file.touch()
        test_file.unlink()
        checks["workspace_writable"] = True
    except OSError:
        pass

    # 3. Provenance enabled
    result = subprocess.run(
        ["lobster", "query", "test", "--workspace", str(workspace.parent)],
        capture_output=True,
        timeout=30
    )
    prov_file = workspace / "provenance.json"
    checks["provenance_enabled"] = prov_file.exists()

    # 4. Redis availability (if configured)
    result = subprocess.run(["redis-cli", "ping"], capture_output=True)
    checks["redis_available"] = b"PONG" in result.stdout

    # 5. Disk space (fail when the workspace filesystem is >= 95% full)
    result = subprocess.run(["df", str(workspace)], capture_output=True, text=True)
    fields = result.stdout.strip().splitlines()[-1].split()
    checks["disk_space_ok"] = len(fields) > 4 and int(fields[4].rstrip("%")) < 95

    # Report
    all_ok = all(checks.values())
    if all_ok:
        print("✅ ALL CHECKS PASSED")
        return 0
    else:
        print("❌ HEALTH CHECK FAILURES:")
        for check, status in checks.items():
            if not status:
                print(f"  - {check}: FAIL")
        return 1

if __name__ == "__main__":
    exit(check_lobster_health())

Alerting rules (example for Prometheus/Grafana):

# Lobster health monitoring
groups:
  - name: lobster_alerts
    interval: 5m
    rules:
      - alert: LobsterHighErrorRate
        expr: rate(lobster_errors_total[5m]) > 0.05
        annotations:
          summary: "Lobster error rate >5% in last 5 minutes"

      - alert: LobsterWorkspaceFull
        expr: node_filesystem_avail_bytes{mountpoint="/workspaces"} / node_filesystem_size_bytes < 0.2
        annotations:
          summary: "Workspace disk <20% free"

      - alert: RedisDown
        expr: redis_up == 0
        annotations:
          summary: "Redis unavailable - rate limiting disabled"

      - alert: HighAPILatency
        expr: histogram_quantile(0.95, rate(lobster_api_duration_seconds[5m])) > 5
        annotations:
          summary: "95th percentile API latency >5 seconds"

Incident response procedures:

Incident 1: Suspected data corruption

1. **Immediate Actions** (within 1 hour):
   - Isolate affected workspace (chmod 000)
   - Notify QA team and data owner
   - Preserve logs (copy provenance.json, .session.json)

2. **Investigation** (within 4 hours):
   - Verify file hashes against manifest
   - Check provenance for unexpected operations
   - Review access logs (who accessed workspace?)
   - Identify root cause (corruption, tampering, bug)

3. **Recovery** (within 24 hours):
   - Restore from backup (if corruption)
   - Re-run analysis from validated data (if tampering)
   - Document incident in QA log

4. **Prevention** (within 1 week):
   - Fix root cause (bug fix, permission change)
   - Update SOP if procedural issue
   - Re-train users if human error

Incident 2: API key compromise

1. **Immediate Actions** (within 30 minutes):
   - Revoke compromised key (Anthropic Console / AWS IAM)
   - Generate new key
   - Update .env files on all systems
   - Notify security team

2. **Investigation** (within 2 hours):
   - Review API usage logs (unusual activity?)
   - Check git history (was key committed?)
   - Identify exposure vector (how was key leaked?)

3. **Remediation** (within 4 hours):
   - Rotate all related keys (defense in depth)
   - Update .gitignore (prevent future commits)
   - Scan git history for other secrets

4. **Prevention** (within 1 week):
   - Implement pre-commit hooks (detect secrets)
   - User training (secure credential handling)
   - Consider secret management system (Vault, Secrets Manager)
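
The pre-commit hook mentioned above can be wired up with the `pre-commit` framework and a secret scanner such as gitleaks; a sketch (pin `rev` to a release validated for your environment):

```yaml
# .pre-commit-config.yaml - block commits containing credentials
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4  # example tag; pin to a validated release
    hooks:
      - id: gitleaks
```

Enable per clone with `pre-commit install`; the hook then scans staged changes before every commit.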

Monitoring dashboard (example metrics):

Lobster AI - Production Dashboard

System Health:
├── CLI Status: ✅ Healthy
├── Workspace Disk: 65% used (warning at 80%)
├── Redis: ✅ Connected (15ms latency)
└── API Keys: ✅ Valid (expires: 2027-01-01)

Usage Metrics (last 24h):
├── Queries: 1,245
├── Downloads: 87
├── Errors: 12 (0.96% error rate)
└── Avg query time: 2.3 minutes

Rate Limiting:
├── NCBI: 8,432 requests (no throttling)
├── GEO: 1,234 requests (no throttling)
└── PRIDE: 45 requests (no throttling)

Security Events:
├── Failed license validations: 0
├── Workspace permission errors: 2 (investigate)
└── API key rotation: Due in 15 days

Compliance benefits:

  • Proactive monitoring - Issues detected early
  • Incident response - Clear procedures documented
  • Audit trail - All incidents logged
  • Continuous improvement - Metrics drive optimization
  • Regulatory readiness - Demonstrates control



12. Future Enhancements

12.1 Security Roadmap (Phases 2-4)

What's next: Lobster AI's security roadmap includes four phases. Phase 1 (current) provides production-ready security for local CLI deployments. Future phases target cloud SaaS, full GxP validation, and advanced compliance automation.

Phase 2: Enhanced Sandboxing (Target: Q2 2026)

| Enhancement | Technology | Benefit | Effort |
|---|---|---|---|
| Docker sandboxing for custom code | gVisor or Kata Containers | Full isolation (filesystem, network, process) | 4-6 weeks |
| Network isolation | Docker bridge mode + iptables | No outbound connections | 2 weeks |
| Resource quotas | cgroups (CPU, memory, disk) | DoS prevention, fair resource allocation | 2 weeks |
| Read-only input mounts | Docker volumes | Input data immutable | 1 week |
| Runtime security scanning | Falco or Sysdig | Detect anomalous behavior in real-time | 3 weeks |

Phase 2 deliverables:

  • ✅ Cloud SaaS ready (multi-tenant isolation)
  • ✅ Custom code sandboxing (untrusted users)
  • ✅ Network egress firewall (prevent data exfiltration)
  • ✅ Resource limits (prevent resource exhaustion)

Phase 3: HIPAA & SOC 2 Certification (Target: Q3 2026)

| Enhancement | Technology | Benefit | Effort |
|---|---|---|---|
| Business Associate Agreement | Legal + technical controls | HIPAA-compliant cloud | 6-8 weeks |
| SOC 2 Type II audit | Independent auditor | Third-party validation | 12-16 weeks |
| PHI de-identification | Automated PII scrubbing | Safe use of patient data | 4 weeks |
| Breach notification | Automated alerting | HIPAA § 164.404 compliance | 2 weeks |
| Access logs (HIPAA) | Enhanced logging + retention | HIPAA § 164.312(b) compliance | 3 weeks |

Phase 3 deliverables:

  • ✅ HIPAA-compliant cloud service (with BAA)
  • ✅ SOC 2 Type II certified
  • ✅ PHI de-identification workflows
  • ✅ HIPAA audit trail enhancements

Phase 4: Full GxP Validation (Target: Q4 2026)

| Enhancement | Technology | Benefit | Effort |
|---|---|---|---|
| IQ/OQ/PQ automation | Automated validation framework | Reduces validation time from 2-4 weeks to 2-3 days | 6 weeks |
| Electronic signatures | 21 CFR Part 11 § 11.50/11.70 | Secure, auditable approvals | 4 weeks |
| Change control integration | Git-based workflow + approvals | GAMP 5 change control | 4 weeks |
| CAPA tracking | Corrective/Preventive Action log | Quality management | 3 weeks |
| Validation package generator | Auto-generate IQ/OQ/PQ docs | 90% reduction in validation effort | 8 weeks |

Phase 4 deliverables:

  • ✅ Full GxP validation support (GAMP 5 Cat 4)
  • ✅ Electronic signatures (21 CFR Part 11 compliant)
  • ✅ Automated validation package generation
  • ✅ Change control + CAPA tracking

12.2 Feature Roadmap

Data Integrity (near-term):

| Feature | Description | Timeline | Compliance Impact |
|---|---|---|---|
| Runtime hash verification | Auto-verify hashes when notebook re-runs | Q1 2026 | ALCOA+ "Accurate" |
| Visual hash indicators | Green ✅ / Red ❌ in notebook cells | Q1 2026 | User experience |
| Hash history tracking | Track hash changes for evolving datasets | Q2 2026 | Data lineage |
| Batch verification CLI | `lobster verify --workspace /path/` | Q1 2026 | QA automation |

Access Control (mid-term):

| Feature | Description | Timeline | Compliance Impact |
|---|---|---|---|
| LDAP/Active Directory | Enterprise authentication integration | Q2 2026 | ISO 27001 A.9.2 |
| Role-based permissions | Fine-grained workspace access control | Q2 2026 | Principle of least privilege |
| Audit user actions | User-level attribution (not just agent) | Q2 2026 | ALCOA+ "Attributable" |
| MFA enforcement | Two-factor authentication | Q3 2026 | Enhanced security |

Compliance Automation (long-term):

| Feature | Description | Timeline | Compliance Impact |
|---|---|---|---|
| Auto-generate compliance reports | 21 CFR Part 11 compliance report from provenance | Q3 2026 | Reduces audit burden |
| ALCOA+ validator | Automatic checks for ALCOA+ compliance | Q3 2026 | Quality assurance |
| Regulatory submission package | FDA/EMA submission-ready exports | Q4 2026 | Streamlines submissions |
| GxP dashboard | Real-time compliance metrics | Q4 2026 | Continuous monitoring |

12.3 Community Feedback

What to expect: Lobster AI's security roadmap is influenced by customer feedback, regulatory changes, and industry best practices. Contributions welcome!

Request a feature:

  • GitHub Issues: Report feature requests
  • Enterprise customers: Contact via customer success team
  • Community discussion: GitHub Discussions

Upcoming based on customer requests:

  1. GDPR right-to-erasure - Automated data deletion workflows (Q2 2026)
  2. Data residency controls - Region-specific S3 backends (Q2 2026)
  3. Audit report templates - Pre-built compliance reports (Q3 2026)
  4. Validation test library - IQ/OQ/PQ test templates (Q3 2026)

13. Related Documentation

13.1 Security & Compliance Documentation

This wiki page: Security architecture, compliance features, deployment guidance (executive summaries + deep links)

Detailed technical documentation:

Customer-facing documentation:


13.2 Architecture & Implementation Documentation

Core architecture:

Key subsystems:

Developer documentation:


13.3 Configuration & Deployment Documentation

Configuration guides:

User guides:


13.4 API & Developer Reference

API documentation:

Tutorials:


13.5 Troubleshooting & Support

Troubleshooting:

Support channels:


Technical Implementation (Advanced)

Architecture

The manifest is generated by the NotebookExporter class during its export() method:

  1. Hash Calculation - SHA-256 of each input file (chunked for memory efficiency)
  2. Provenance Hash - Fingerprint of the session's audit trail
  3. System Info - Captures Lobster version, Git commit, Python version
  4. Manifest Cell - Inserted as second cell in notebook (after header)
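Step 1 above can be sketched as a chunked SHA-256 helper. This is a minimal illustration, not Lobster's exact implementation (the real method is `_calculate_file_hash()` listed below; the 1 MB chunk size here is an assumption):

```python
import hashlib
from pathlib import Path


def calculate_file_hash(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 hex digest of a file in fixed-size chunks.

    Reading chunk-by-chunk keeps memory use constant even for
    multi-gigabyte inputs such as H5AD or FASTQ files.
    """
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting hex digest is what gets recorded per input file in the manifest cell.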

Code Location

  • Implementation: lobster/core/notebook_exporter.py
  • Methods:
    • _create_integrity_manifest_cell() - Creates manifest
    • _get_input_file_hashes() - Hashes data files
    • _get_provenance_hash() - Hashes session
    • _calculate_file_hash() - SHA-256 computation
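For manual re-verification against a manifest, a sketch like the following can be used. The `verify_manifest` helper and the manifest layout (a mapping of relative file path to recorded hex digest) are assumptions for illustration, not Lobster's schema:

```python
import hashlib
from pathlib import Path


def verify_manifest(manifest: dict[str, str], base_dir: Path) -> dict[str, str]:
    """Compare recorded SHA-256 hashes against files on disk.

    Returns one status per file: 'ok', 'mismatch', or 'missing'.
    """
    results = {}
    for rel_path, recorded in manifest.items():
        path = base_dir / rel_path
        if not path.exists():
            results[rel_path] = "missing"
            continue
        digest = hashlib.sha256()
        with path.open("rb") as fh:
            while chunk := fh.read(1 << 20):  # 1 MB chunks
                digest.update(chunk)
        results[rel_path] = "ok" if digest.hexdigest() == recorded else "mismatch"
    return results
```

A 'mismatch' result means the file changed after export; 'missing' means the data was moved or deleted relative to the notebook.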

Future Enhancements

Planned improvements to the integrity manifest (runtime verification, visual hash indicators, hash history, and a batch verification CLI) are listed with timelines in section 12.2.

Support

For questions about data integrity features:

  • GitHub Issues: Report issues
  • Documentation: See this guide
  • Compliance questions: Contact your organization's QA/compliance team


Last Updated: 2026-01-01
Document Version: 2.0
Sections: 13 (Overview, Data Integrity, Audit Trail, Access Control, Secure Execution, Data Protection, Network Security, Validation, Deployment, Compliance, Best Practices, Future Enhancements, Related Documentation)
Compliance Coverage: 21 CFR Part 11, ALCOA+, GxP, HIPAA, GDPR, ISO/IEC 27001, SOC 2
