content-core

CLAUDE.md•4.38 KiB

# Content Core

Library for extracting, cleaning, and summarizing content from URLs, files, and text.

## Commands

- **Install dependencies**: `uv sync --group dev`
- **Run tests**: `make test` or `uv run pytest -v`
- **Run single test**: `uv run pytest -k "test_name"`
- **Linting**: `make ruff` (runs `ruff check . --fix`)
- **Build package**: `uv build`
- **Build docs**: `make build-docs`

## Codebase Structure

```
src/content_core/
├── __init__.py              # CLI entry points (ccore, cclean, csum) and public API
├── config.py                # Configuration loading, engine selection, retry/proxy settings
├── models.py                # ModelFactory for Esperanto LLM/STT model caching
├── templated_message.py     # LLM prompt execution with Jinja templates
├── logging.py               # Loguru configuration
│
├── common/                  # Shared infrastructure (see common/CLAUDE.md)
│   ├── exceptions.py        # Exception hierarchy
│   ├── retry.py             # Retry decorators for transient failures
│   ├── state.py             # Pydantic state models for LangGraph
│   ├── types.py             # Type aliases (DocumentEngine, UrlEngine)
│   └── utils.py             # Input content processing
│
├── processors/              # Format-specific extractors (see processors/CLAUDE.md)
│   ├── pdf.py               # PDF/EPUB via PyMuPDF
│   ├── url.py               # URL extraction (jina/firecrawl/crawl4ai/bs4)
│   ├── audio.py             # Audio transcription via Esperanto STT
│   ├── video.py             # Video-to-audio via moviepy
│   ├── youtube.py           # YouTube transcript extraction
│   ├── office.py            # Office docs (docx/pptx/xlsx)
│   ├── text.py              # Plain text files
│   └── docling.py           # Optional Docling integration
│
├── content/                 # High-level workflows
│   ├── extraction/          # LangGraph extraction workflow
│   │   └── graph.py         # Main extraction state graph
│   ├── identification/      # File type detection
│   │   └── file_detector.py
│   ├── cleanup/             # Content cleaning via LLM
│   └── summary/             # Content summarization via LLM
│
├── tools/                   # LangChain tool wrappers (see tools/CLAUDE.md)
│   ├── extract.py           # extract_content_tool
│   ├── cleanup.py           # cleanup_content_tool
│   └── summarize.py         # summarize_content_tool
│
└── mcp/                     # MCP server for AI assistant integration
    └── server.py            # FastMCP server implementation
```

## Architecture

**Data flow**: Input -> LangGraph workflow -> Processor -> Output

1. `ProcessSourceInput` received via API or CLI
2. `content/extraction/graph.py` routes to appropriate processor based on source type
3. Processor extracts content and returns state updates
4. `ProcessSourceOutput` returned to caller

**Key patterns**:

- LangGraph StateGraph orchestrates extraction workflow
- Processors are stateless functions that take `ProcessSourceState` and return dict updates
- Retry decorators handle transient failures for network/API operations
- Configuration loaded from YAML with env var overrides

## Integration

- **Esperanto**: LLM and STT model abstraction via `ModelFactory`
- **LangGraph**: Workflow orchestration in `content/extraction/graph.py`
- **LangChain**: Tool wrappers in `tools/` for agent integration
- **ai-prompter**: Template rendering in `templated_message.py`

## Gotchas

- Import aliases: `content_core.extraction` = `content_core.content.extraction`
- `docling` is optional: check `DOCLING_AVAILABLE` before using
- Proxy must be passed through state or config, not set globally on requests
- All async operations should use retry decorators for resilience
- `ModelFactory` caches models but invalidates on proxy change

## Code Style

- **Formatting**: Follow PEP 8
- **Imports**: Organize by standard library, third-party, local
- **Error handling**: Use custom exceptions from `common/exceptions.py`
- **Documentation**: Update docs when changing functionality
- **Tests**: Write unit tests for new code; integration tests for workflows

## Release Process

1. Run `make test` to verify everything works
2. Update version in `pyproject.toml`
3. Run `uv sync` to update the lock file
4. Commit changes
5. Merge to main (if in a branch)
6. Tag the release
7. Push to GitHub
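
As an illustration of the processor pattern described under Architecture (route by source type, run a stateless processor, merge the returned dict into the state), here is a minimal standalone sketch. It deliberately avoids LangGraph and Pydantic so it runs on the standard library alone. Every name in it (`SketchState`, `extract_text`, `extract_url`, `PROCESSORS`, `run`) is hypothetical and not taken from the codebase; the real `ProcessSourceState` and the routing in `content/extraction/graph.py` may look quite different.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Optional


# Hypothetical stand-in for ProcessSourceState; the real model lives in
# common/state.py and its fields are not shown in this document.
@dataclass(frozen=True)
class SketchState:
    source_type: str               # e.g. "text" or "url"
    raw: str                       # raw input payload
    content: Optional[str] = None  # filled in by a processor


Processor = Callable[[SketchState], Dict[str, object]]


def extract_text(state: SketchState) -> Dict[str, object]:
    """Stateless processor: reads the state and returns only the updates."""
    return {"content": state.raw.strip()}


def extract_url(state: SketchState) -> Dict[str, object]:
    """Placeholder: a real processor would fetch and parse the page."""
    return {"content": f"[fetched] {state.raw}"}


# Hypothetical routing table keyed by source type.
PROCESSORS: Dict[str, Processor] = {
    "text": extract_text,
    "url": extract_url,
}


def run(state: SketchState) -> SketchState:
    """Pick a processor by source type and merge its updates into a new state."""
    processor = PROCESSORS.get(state.source_type)
    if processor is None:
        raise ValueError(f"unsupported source type: {state.source_type}")
    updates = processor(state)
    return replace(state, **updates)
```

Calling `run(SketchState(source_type="text", raw="  hello  "))` returns a new state whose `content` is the stripped text. Returning dict updates instead of mutating the state keeps each processor easy to test in isolation, which is the property the "stateless functions" bullet appears to be after.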
