Omics-OS Docs
Advanced

Workspace Content Service

The WorkspaceContentService provides structured, type-safe caching of research content (publications, datasets, metadata) in the DataManagerV2 workspace....

Overview

The WorkspaceContentService provides structured, type-safe caching of research content (publications, datasets, metadata) in the DataManagerV2 workspace. Introduced in Lobster v0.2+, it replaces manual JSON file operations with a centralized service using Pydantic schemas for validation and enum-based type safety.

Key Benefits:

  • Type Safety: Pydantic models validate all cached content
  • Enum-Based Validation: ContentType and RetrievalLevel enums prevent string typos
  • Automatic File Management: Professional naming conventions and directory organization
  • Level-Based Retrieval: Flexible detail levels (summary/methods/samples/platform/full)
  • Workspace Integration: Seamless integration with DataManagerV2 and research_agent tools

Two-Tier Architecture:

research_agent tools (write_to_workspace, get_content_from_workspace)

       WorkspaceContentService (validation, file I/O)

          DataManagerV2 workspace directory

         literature/  data/  metadata/  exports/  (JSON/CSV files)

Architecture

Content Types (Enum)

from lobster.tools.workspace_content_service import ContentType

class ContentType(str, Enum):
    PUBLICATION = "publication"      # Research papers (PubMed, PMC, bioRxiv)
    DATASET = "dataset"              # GEO, SRA, PRIDE datasets
    METADATA = "metadata"            # Sample mappings, validation results, QC reports
    EXPORTS = "exports"              # Analysis results and data exports
    DOWNLOAD_QUEUE = "download_queue"    # Download queue entries (JSONL)
    PUBLICATION_QUEUE = "publication_queue"  # Publication queue entries (JSONL)

Workspace Directory Mapping:

  • ContentType.PUBLICATIONworkspace/literature/*.json
  • ContentType.DATASETworkspace/data/*.json
  • ContentType.METADATAworkspace/metadata/*.json
  • ContentType.EXPORTSworkspace/exports/*.*
  • ContentType.DOWNLOAD_QUEUEworkspace/.lobster/queues/download_queue.jsonl
  • ContentType.PUBLICATION_QUEUEworkspace/.lobster/queues/publication_queue.jsonl

Retrieval Levels (Enum)

from lobster.tools.workspace_content_service import RetrievalLevel

class RetrievalLevel(str, Enum):
    SUMMARY = "summary"      # Key-value overview (title, authors, sample count)
    METHODS = "methods"      # Methods section (publications only)
    SAMPLES = "samples"      # Sample IDs and metadata (datasets only)
    PLATFORM = "platform"    # Platform/technology info (datasets only)
    FULL = "full"            # All available content

Level-Specific Fields:

Content TypeSummaryMethodsSamplesPlatformFull
Publicationidentifier, title, authors, journal, year, keywordsidentifier, title, methodsN/AN/AAll fields
Datasetidentifier, title, sample_count, organismN/Aidentifier, sample_count, samplesidentifier, platform, platform_idAll fields
Metadataidentifier, content_type, description, related_datasetsN/AN/AN/AAll fields

Pydantic Content Schemas

PublicationContent

from lobster.tools.workspace_content_service import PublicationContent

pub = PublicationContent(
    identifier="PMID:35042229",
    title="Single-cell RNA-seq reveals...",
    authors=["Smith J", "Jones A"],
    journal="Nature",
    year=2022,
    abstract="We performed single-cell RNA-seq...",
    methods="Cells were processed using 10X Chromium...",
    full_text="...",  # Complete paper text
    keywords=["single-cell", "RNA-seq", "cancer"],
    source="PMC",  # PMC, PubMed, bioRxiv
    cached_at="2025-01-12T10:30:00",  # ISO 8601 timestamp
    url="https://pubmed.ncbi.nlm.nih.gov/35042229/"
)

Fields:

  • identifier (required): PMID, DOI, or bioRxiv ID
  • title, authors, journal, year: Bibliographic metadata
  • abstract, methods, full_text: Content sections
  • keywords: Publication keywords (MeSH terms, author keywords)
  • source (required): Provider (PMC, PubMed, bioRxiv, medRxiv)
  • cached_at (required): ISO 8601 timestamp
  • url: Publication URL

DatasetContent

from lobster.tools.workspace_content_service import DatasetContent

dataset = DatasetContent(
    identifier="GSE123456",
    title="Single-cell RNA-seq of aging brain",
    platform="Illumina NovaSeq 6000",
    platform_id="GPL24676",
    organism="Homo sapiens",
    sample_count=12,
    samples={
        "GSM1": {"age": 25, "tissue": "brain"},
        "GSM2": {"age": 65, "tissue": "brain"}
    },
    experimental_design="Age comparison: young (n=6) vs old (n=6)",
    summary="Dataset comparing transcriptional changes...",
    pubmed_ids=["35042229"],
    source="GEO",  # GEO, SRA, PRIDE
    cached_at="2025-01-12T10:30:00",
    url="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE123456"
)

Fields:

  • identifier (required): GSE, SRA, PRIDE accession
  • title, summary: Dataset descriptions
  • platform, platform_id: Technology information
  • organism: Species (e.g., Homo sapiens, Mus musculus)
  • sample_count (required): Number of samples (≥0)
  • samples: Dictionary mapping sample IDs to metadata
  • experimental_design: Study design description
  • pubmed_ids: Associated publications
  • source (required): Repository (GEO, SRA, PRIDE)
  • cached_at (required): ISO 8601 timestamp
  • url: Dataset URL

MetadataContent

from lobster.tools.workspace_content_service import MetadataContent

metadata = MetadataContent(
    identifier="gse12345_to_gse67890_mapping",
    content_type="sample_mapping",
    description="Sample ID mapping between two datasets",
    data={
        "exact_matches": 10,
        "fuzzy_matches": 5,
        "unmapped": 2,
        "mapping_rate": 0.88
    },
    related_datasets=["GSE12345", "GSE67890"],
    source="SampleMappingService",
    cached_at="2025-01-12T10:30:00"
)

Fields:

  • identifier (required): Unique metadata identifier
  • content_type (required): Type descriptor (sample_mapping, validation, qc_report, etc.)
  • description: Human-readable description
  • data (required): Arbitrary JSON-serializable content
  • related_datasets: Related dataset accessions
  • source (required): Tool or service name
  • cached_at (required): ISO 8601 timestamp

Service API

Initialization

from lobster.core.data_manager_v2 import DataManagerV2
from lobster.tools.workspace_content_service import WorkspaceContentService

data_manager = DataManagerV2(workspace_path="~/.lobster_workspace")
workspace_service = WorkspaceContentService(data_manager=data_manager)

Directory Structure Created:

workspace_path/
├── literature/    # Publications (PublicationContent)
├── data/          # Datasets (DatasetContent)
└── metadata/      # Metadata (MetadataContent)

Writing Content

from lobster.tools.workspace_content_service import (
    PublicationContent,
    ContentType,
    WorkspaceContentService
)
from datetime import datetime

# Create content model
pub_content = PublicationContent(
    identifier="PMID:35042229",
    title="Single-cell analysis of aging",
    authors=["Smith J", "Jones A"],
    journal="Nature",
    year=2022,
    abstract="Abstract text...",
    methods="Methods text...",
    source="PMC",
    cached_at=datetime.now().isoformat()
)

# Write to workspace
cache_path = workspace_service.write_content(
    content=pub_content,
    content_type=ContentType.PUBLICATION
)
# Returns: "/workspace/literature/pmid_35042229.json"

Naming Convention:

  • Identifier sanitized: lowercase, special characters → underscores
  • PMID:35042229pmid_35042229.json
  • GSE123456gse123456.json
  • DOI:10.1038/s41586-021-12345-6doi_10_1038_s41586_021_12345_6.json

Reading Content

Basic Retrieval

from lobster.tools.workspace_content_service import ContentType, RetrievalLevel

# Read full content
full_content = workspace_service.read_content(
    identifier="PMID:35042229",
    content_type=ContentType.PUBLICATION,
    level=RetrievalLevel.FULL
)
# Returns: Dict with all fields

# Read summary only
summary = workspace_service.read_content(
    identifier="PMID:35042229",
    content_type=ContentType.PUBLICATION,
    level=RetrievalLevel.SUMMARY
)
# Returns: Dict with identifier, title, authors, journal, year, keywords

# Read methods section
methods = workspace_service.read_content(
    identifier="PMID:35042229",
    content_type=ContentType.PUBLICATION,
    level=RetrievalLevel.METHODS
)
# Returns: Dict with identifier, title, methods

Dataset Retrieval Examples

# Get dataset summary
summary = workspace_service.read_content(
    identifier="GSE123456",
    content_type=ContentType.DATASET,
    level=RetrievalLevel.SUMMARY
)
# Returns: identifier, title, sample_count, organism

# Get sample metadata
samples = workspace_service.read_content(
    identifier="GSE123456",
    content_type=ContentType.DATASET,
    level=RetrievalLevel.SAMPLES
)
# Returns: identifier, sample_count, samples, experimental_design

# Get platform information
platform = workspace_service.read_content(
    identifier="GSE123456",
    content_type=ContentType.DATASET,
    level=RetrievalLevel.PLATFORM
)
# Returns: identifier, platform, platform_id, organism

Listing Content

# List all cached content
all_content = workspace_service.list_content()
# Returns: List[Dict] with all publications, datasets, metadata

# List only publications
publications = workspace_service.list_content(
    content_type=ContentType.PUBLICATION
)
# Returns: List[Dict] with publication metadata

# List only datasets
datasets = workspace_service.list_content(
    content_type=ContentType.DATASET
)
# Returns: List[Dict] with dataset metadata

List Result Format:

[
    {
        "identifier": "PMID:35042229",
        "title": "Single-cell analysis...",
        "authors": ["Smith J", "Jones A"],
        "cached_at": "2025-01-12T10:30:00",
        "_content_type": "publication",  # Added by service
        "_file_path": "/workspace/literature/pmid_35042229.json"  # Added by service
    },
    # ... more items
]

Deleting Content

# Delete cached publication
deleted = workspace_service.delete_content(
    identifier="PMID:35042229",
    content_type=ContentType.PUBLICATION
)
# Returns: True if deleted, False if not found

Workspace Statistics

stats = workspace_service.get_workspace_stats()
# Returns:
# {
#     "total_items": 42,
#     "publications": 15,
#     "datasets": 20,
#     "metadata": 7,
#     "total_size_mb": 12.5,
#     "cache_dir": "/workspace/cache/content"
# }

Centralized Exports Directory (v1.0+)

As of version 1.0, all user-facing data exports (CSV, TSV, Excel) are written to a centralized exports directory for easy discovery.

Directory Structure:

workspace_path/
├── literature/    # Publications (PublicationContent)
├── data/          # Datasets (DatasetContent)
├── metadata/      # Metadata (MetadataContent)
└── exports/       # 🆕 User-facing CSV/TSV/Excel exports (v1.0+)

Why Centralized Exports?

  • Single Location: Customers know exactly where to find exported files
  • Easy Discovery: No hunting across multiple subdirectories
  • Clean Organization: Separates cached JSON (metadata/) from final outputs (exports/)
  • Predictable: All tools write to same location

Getting Exports Directory:

exports_dir = workspace_service.get_exports_directory(create=True)
# Returns: Path("workspace_path/exports")

Listing Export Files:

# List all exports
files = workspace_service.list_export_files()
# Returns: [
#     {
#         "name": "aggregated_samples.csv",
#         "path": Path("workspace_path/exports/aggregated_samples.csv"),
#         "size": 1024567,
#         "modified": "2025-01-12T14:30:00",
#         "category": "metadata"  # metadata, results, plots, custom
#     },
#     ...
# ]

# Filter by pattern
csv_files = workspace_service.list_export_files(pattern="*.csv")

# Filter by category
metadata_exports = workspace_service.list_export_files(category="metadata")

File Categorization: Files are automatically categorized based on naming conventions:

  • metadata_* → "metadata" (sample tables, mappings)
  • results_* → "results" (analysis outputs)
  • plot_* → "plots" (visualizations)
  • Other → "custom"

Usage in Custom Code:

# In execute_custom_code, OUTPUT_DIR variable is pre-configured
df.to_csv(OUTPUT_DIR / "my_results.csv")  # Saves to workspace/exports/

Unified Metadata View: The /metadata CLI command now shows exports alongside other sources:

sources = workspace_service.get_all_metadata_sources()
# Returns: {
#     "in_memory": [...],           # metadata_store entries
#     "workspace_files": [...],      # workspace/metadata/*.json
#     "exports": [...],              # workspace/exports/*.csv
#     "deprecated": [...]            # workspace/metadata/exports/*.csv (old location)
# }

Deprecation Warning: The old workspace/metadata/exports/ location is deprecated. A warning is shown if files exist there:

⚠️ Found 3 files in deprecated location: workspace/metadata/exports/
New exports go to workspace/exports/. Consider migrating:
    mv workspace/metadata/exports/* workspace/exports/

Integration with research_agent Tools

The research_agent provides two tools that use WorkspaceContentService under the hood:

write_to_workspace Tool

Purpose: Cache research content for persistent access and specialist handoff.

Usage Pattern:

# In research_agent tool
from lobster.tools.workspace_content_service import (
    ContentType,
    PublicationContent,
    WorkspaceContentService
)

@tool
def write_to_workspace(identifier: str, workspace: str, content_type: str = None) -> str:
    # 1. Initialize service
    workspace_service = WorkspaceContentService(data_manager=data_manager)

    # 2. Map workspace categories to ContentType enum
    workspace_to_content_type = {
        "literature": ContentType.PUBLICATION,
        "data": ContentType.DATASET,
        "metadata": ContentType.METADATA,
    }

    # 3. Validate workspace category
    if workspace not in workspace_to_content_type:
        return f"Error: Invalid workspace '{workspace}'"

    # 4. Retrieve content from data_manager
    if identifier in data_manager.metadata_store:
        content_data = data_manager.metadata_store[identifier]
    elif identifier in data_manager.list_modalities():
        adata = data_manager.get_modality(identifier)
        content_data = {...}  # Extract metadata
    else:
        return f"Error: Identifier '{identifier}' not found"

    # 5. Create Pydantic model
    content_model = PublicationContent(
        identifier=identifier,
        # ... populate fields
        cached_at=datetime.now().isoformat()
    )

    # 6. Write using service
    cache_path = workspace_service.write_content(
        content=content_model,
        content_type=workspace_to_content_type[workspace]
    )

    return f"Cached to {cache_path}"

Naming Conventions:

  • Publications: publication_PMID12345 or publication_DOI...
  • Datasets: dataset_GSE12345
  • Metadata: metadata_GSE12345_samples

Example:

# Cache publication after reading
> "I just read PMID:35042229. Please cache it for later."
 write_to_workspace("publication_PMID35042229", workspace="literature", content_type="publication")

# Cache dataset metadata
> "Cache GSE123456 metadata for validation."
 write_to_workspace("dataset_GSE123456", workspace="data", content_type="dataset")

get_content_from_workspace Tool

Purpose: Retrieve cached research content with flexible detail levels.

Unified Architecture (v2.6+)

As of version 2.6, get_content_from_workspace uses a unified adapter-based architecture that provides consistent behavior across all workspace types.

Key Improvements:

  • Consistent API: All workspaces support the same operations (list, filter, retrieve)
  • Unified Formatting: Status emojis, titles, and details formatted consistently
  • Type Safety: Internal WorkspaceItem TypedDict ensures defensive field access
  • Error Handling: No more KeyError crashes on missing fields

Architecture Diagram:

User Query → Dispatcher → Adapter → WorkspaceItem[] → Formatter → Markdown
                 ↓           ↓             ↓              ↓
            5 workspaces  Normalize   Unified      Consistent
                         data types   structure     output

Adapters:

  1. _adapt_general_content() - literature, data, metadata workspaces
  2. _adapt_download_queue() - download queue entries
  3. _adapt_publication_queue() - publication queue entries

WorkspaceItem Structure:

class WorkspaceItem(TypedDict, total=False):
    identifier: str          # Primary ID
    workspace: str           # Category
    type: str                # Item type
    status: Optional[str]    # For queues
    priority: Optional[int]  # For queues
    title: Optional[str]     # Display title
    cached_at: Optional[str] # ISO timestamp
    details: Optional[str]   # Summary/metadata

Benefits:

  • Agents can use same mental model for all workspaces
  • No workspace-specific error handling needed
  • Easy to add new workspace types (one adapter function)
  • Backward compatible (same output format)

Usage Pattern (Simplified)

@tool
def get_content_from_workspace(
    identifier: str = None,
    workspace: str = None,
    level: str = "summary"
) -> str:
    # 1. Initialize service
    workspace_service = WorkspaceContentService(data_manager=data_manager)

    # 2. Map strings to enums
    workspace_to_content_type = {
        "literature": ContentType.PUBLICATION,
        "data": ContentType.DATASET,
        "metadata": ContentType.METADATA,
    }

    level_to_retrieval = {
        "summary": RetrievalLevel.SUMMARY,
        "methods": RetrievalLevel.METHODS,
        "samples": RetrievalLevel.SAMPLES,
        "platform": RetrievalLevel.PLATFORM,
        "metadata": RetrievalLevel.FULL,
    }

    # 3. List mode (no identifier)
    if identifier is None:
        content_type_filter = workspace_to_content_type[workspace] if workspace else None
        all_cached = workspace_service.list_content(content_type=content_type_filter)
        return format_list_response(all_cached)

    # 4. Retrieve mode (with identifier)
    retrieval_level = level_to_retrieval[level]

    # Try each content type if workspace not specified
    content_types_to_try = (
        [workspace_to_content_type[workspace]] if workspace
        else list(ContentType)
    )

    for content_type in content_types_to_try:
        try:
            cached_content = workspace_service.read_content(
                identifier=identifier,
                content_type=content_type,
                level=retrieval_level
            )
            return format_response(cached_content, level)
        except FileNotFoundError:
            continue

    return f"Error: Identifier '{identifier}' not found"

Examples:

# List all cached content
> "What content do I have cached?"
 get_content_from_workspace()

# List publications only
> "Show me cached publications."
 get_content_from_workspace(workspace="literature")

# Get publication methods section
> "Show methods from PMID:35042229."
 get_content_from_workspace(
    identifier="publication_PMID35042229",
    workspace="literature",
    level="methods"
)

# Get dataset samples
> "Show sample IDs for GSE123456."
 get_content_from_workspace(
    identifier="dataset_GSE123456",
    workspace="data",
    level="samples"
)

# Get full metadata
> "Show full metadata for my sample mapping."
 get_content_from_workspace(
    identifier="metadata_gse12345_to_gse67890_mapping",
    workspace="metadata",
    level="metadata"
)

Common Workflows

Workflow 1: Cache Publication for Later Analysis

# 1. Search literature
search_literature("BRCA1 breast cancer", max_results=5)

# 2. Read full publication
read_full_publication("PMID:35042229")
# → Content automatically cached in metadata_store

# 3. Cache to workspace
write_to_workspace(
    identifier="publication_PMID35042229",
    workspace="literature",
    content_type="publication"
)

# 4. Later: Retrieve methods section
get_content_from_workspace(
    identifier="publication_PMID35042229",
    workspace="literature",
    level="methods"
)

Workflow 2: Cache Dataset Before Handoff to Specialist

# 1. Discover dataset
find_related_entries("PMID:35042229", entry_type="dataset")
# → Found: GSE123456

# 2. Get dataset metadata
get_dataset_metadata("GSE123456")
# → Metadata stored in metadata_store

# 3. Cache to workspace before handoff
write_to_workspace(
    identifier="dataset_GSE123456",
    workspace="data",
    content_type="dataset"
)

# 4. Hand off to metadata_assistant
handoff_to_metadata_assistant(
    instructions="Validate GSE123456 for treatment_response field. "
                "Dataset cached in data workspace."
)

Workflow 3: Multiple Detail Levels

# Start with summary
get_content_from_workspace(
    identifier="dataset_GSE123456",
    workspace="data",
    level="summary"
)
# → Returns: title, sample_count, organism

# Need more details? Get samples
get_content_from_workspace(
    identifier="dataset_GSE123456",
    workspace="data",
    level="samples"
)
# → Returns: sample IDs and metadata

# Need platform info?
get_content_from_workspace(
    identifier="dataset_GSE123456",
    workspace="data",
    level="platform"
)
# → Returns: platform, platform_id, organism

# Need everything?
get_content_from_workspace(
    identifier="dataset_GSE123456",
    workspace="data",
    level="metadata"
)
# → Returns: all fields

Best Practices

Naming Conventions

Follow Professional Naming:

  • Lowercase identifiers
  • Underscores for separators
  • Descriptive prefixes
# ✅ Good
"publication_PMID35042229"
"dataset_GSE123456"
"metadata_gse12345_to_gse67890_mapping"

# ❌ Bad
"PMID:35042229"  # Contains colon
"GSE 123456"      # Contains space
"mapping-12345"   # Ambiguous prefix

Content Validation

Always Use Pydantic Models:

# ✅ Good - Validation enforced
pub_content = PublicationContent(
    identifier="PMID:35042229",
    source="PMC",
    cached_at=datetime.now().isoformat()
)
workspace_service.write_content(pub_content, ContentType.PUBLICATION)

# ❌ Bad - No validation
raw_dict = {"identifier": "PMID:35042229"}  # Missing required fields
# Will fail validation

Error Handling

Handle FileNotFoundError:

from lobster.tools.workspace_content_service import ContentType, RetrievalLevel

try:
    content = workspace_service.read_content(
        identifier="publication_PMID12345",
        content_type=ContentType.PUBLICATION,
        level=RetrievalLevel.SUMMARY
    )
except FileNotFoundError as e:
    logger.warning(f"Content not found: {e}")
    # List available content
    available = workspace_service.list_content(ContentType.PUBLICATION)
    logger.info(f"Available publications: {[c['identifier'] for c in available]}")

Level Selection

Choose Appropriate Detail Level:

Use CaseRecommended LevelWhy
Quick overviewSUMMARYFast, minimal data transfer
Replication protocolMETHODSFocused on procedures
Sample alignmentSAMPLESJust sample metadata
Platform validationPLATFORMTechnology compatibility check
Full exportFULLComplete content for archival

Workspace Organization

Categorize Content by Type:

# Literature review project
workspace_service.write_content(pub1, ContentType.PUBLICATION)  # → literature/
workspace_service.write_content(pub2, ContentType.PUBLICATION)  # → literature/

# Dataset analysis project
workspace_service.write_content(dataset1, ContentType.DATASET)  # → data/
workspace_service.write_content(dataset2, ContentType.DATASET)  # → data/

# Metadata operations
workspace_service.write_content(mapping, ContentType.METADATA)  # → metadata/

Backward Compatibility

Maintain Tool Signatures:

  • Both tools (write_to_workspace, get_content_from_workspace) maintain original signatures
  • String-based parameters at tool level
  • Enum conversion happens internally
  • Same response formats as before refactoring

Performance Considerations

Caching Strategy

When to Cache:

  • ✅ After expensive operations (PDF parsing, full-text extraction)
  • ✅ Before handing off to other agents (context preservation)
  • ✅ When content will be reused (literature reviews, multi-step workflows)

When NOT to Cache:

  • ❌ Temporary scratch data
  • ❌ Duplicates of in-memory modalities
  • ❌ Large binary files (use modalities storage instead)

File Size Management

Monitor Workspace Size:

stats = workspace_service.get_workspace_stats()
if stats["total_size_mb"] > 100:
    logger.warning("Workspace size exceeding 100MB")
    # Consider cleaning old cached content

Delete Old Content:

# Remove cached content no longer needed
workspace_service.delete_content(
    identifier="old_publication_PMID12345",
    content_type=ContentType.PUBLICATION
)

Troubleshooting

Common Issues

Issue: "File has been modified since read"

  • Cause: Auto-formatter/linter running between Read and Edit
  • Solution: Read larger context window (400+ lines) before editing

Issue: "Invalid workspace 'xyz'"

  • Cause: Typo in workspace parameter
  • Solution: Use enum mapping: "literature", "data", or "metadata"

Issue: "Invalid detail level 'abc'"

  • Cause: Unsupported level string
  • Solution: Use valid levels: "summary", "methods", "samples", "platform", "metadata"

Issue: "ValidationError: Field required"

  • Cause: Missing required Pydantic fields
  • Solution: Check schema requirements (identifier, source, cached_at)

Debugging

Enable Debug Logging:

import logging
logging.basicConfig(level=logging.DEBUG)

# Service operations will log:
# - File paths created
# - Content validated
# - Errors encountered

Inspect Workspace Contents:

# Check cached files
ls -lh ~/.lobster_workspace/literature/
ls -lh ~/.lobster_workspace/data/
ls -lh ~/.lobster_workspace/metadata/

# View JSON content
cat ~/.lobster_workspace/literature/pmid_35042229.json | jq .

Migration from Manual JSON Handling

Before (Manual Implementation)

# Old approach - manual file operations
import json
from pathlib import Path

cache_dir = Path(workspace_path) / "literature"
cache_file = cache_dir / f"{identifier.lower()}.json"

# Write
with open(cache_file, "w") as f:
    json.dump({"identifier": identifier, ...}, f)

# Read
with open(cache_file, "r") as f:
    content = json.load(f)

# List
cached_files = list(cache_dir.glob("*.json"))

After (WorkspaceContentService)

# New approach - service-based
from lobster.tools.workspace_content_service import (
    WorkspaceContentService,
    PublicationContent,
    ContentType,
    RetrievalLevel
)

workspace_service = WorkspaceContentService(data_manager=data_manager)

# Write
pub_content = PublicationContent(identifier=identifier, ...)
workspace_service.write_content(pub_content, ContentType.PUBLICATION)

# Read
content = workspace_service.read_content(
    identifier, ContentType.PUBLICATION, RetrievalLevel.SUMMARY
)

# List
cached_list = workspace_service.list_content(ContentType.PUBLICATION)

Benefits:

  • ✅ Pydantic validation (catch errors early)
  • ✅ Enum type safety (no string typos)
  • ✅ Automatic directory management
  • ✅ Level-based filtering (no manual if/elif chains)
  • ✅ Professional naming (automatic sanitization)

Version History

VersionChanges
v0.2+Initial implementation with Pydantic schemas, enum-based validation, two-tier architecture

See Also

  • WorkspaceContentService Source: lobster/tools/workspace_content_service.py (714 lines)
  • Pydantic Schemas: PublicationContent, DatasetContent, MetadataContent
  • Integration: research_agent tools (write_to_workspace, get_content_from_workspace)
  • Testing: tests/integration/test_workspace_content_service.py

On this page