What can you do with this server?

This MCP server provides AI-powered content extraction from URLs and files through Content Core's intelligent auto-detection engine.

• **Extract from URLs** - Retrieve clean, structured content from web pages using smart engine selection (Firecrawl → Jina → BeautifulSoup fallback)
• **Process documents** - Extract text from PDF, Word, PowerPoint, Excel, Markdown, HTML, and EPUB files (Docling → Enhanced PyMuPDF fallback)
• **Transcribe media** - Convert video (MP4, AVI, MOV) and audio (MP3, WAV, M4A) to text using OpenAI Whisper speech-to-text
• **Extract from images** - Process JPG, PNG, and TIFF images with OCR text recognition
• **Handle archives** - Extract and analyze content from ZIP, TAR, and GZ files
• **Automatic optimization** - The 'auto' engine intelligently selects the best extraction method based on content type
• **Structured output** - Returns JSON responses with extracted content and metadata, and supports multiple formats (text, JSON, XML)
• **Multiple interfaces** - Access through CLI commands, Python library, MCP server, Raycast extension, and macOS Services

Which integrations are available for this server?

• [Langchain](/mcp/servers/integrations/langchain) - Exposes a set of compatible tools for the Langchain framework, enabling extraction, cleaning, and summarization directly within Langchain agents and chains.
• [macOS](/mcp/servers/integrations/macos) - Enables right-click integration with Finder through Services, allowing content extraction and summarization from any supported file with clipboard or TextEdit output.
• [OpenAI](/mcp/servers/integrations/openai) - Integrates with OpenAI services for transcription (Whisper) and content processing, allowing for AI-powered content extraction and summarization.
• [Python](/mcp/servers/integrations/python) - Provides a library for programmatic access to content extraction, cleaning, and summarization, with asynchronous functionality and customizable options.
• [Raycast](/mcp/servers/integrations/raycast) - Offers an extension with smart auto-detection commands for extracting and summarizing content from URLs and files, with multiple output options and visual feedback.

Content Core


Content Core is a powerful, AI-powered content extraction and processing platform that transforms any source into clean, structured content. Extract text from websites, transcribe videos, process documents, and generate AI summaries—all through a unified interface with multiple integration options.

🚀 What You Can Do

Extract content from anywhere:

📄 Documents - PDF, Word, PowerPoint, Excel, Markdown, HTML, EPUB
🎥 Media - Videos (MP4, AVI, MOV) with automatic transcription
🎵 Audio - MP3, WAV, M4A with speech-to-text conversion
🌐 Web - Any URL with intelligent content extraction
🖼️ Images - JPG, PNG, TIFF with OCR text recognition
📦 Archives - ZIP, TAR, GZ with content analysis

Process with AI:

✨ Clean & format extracted content automatically
📝 Generate summaries with customizable styles (bullet points, executive summary, etc.)
🎯 Context-aware processing - explain to a child, technical summary, action items
🔄 Smart engine selection - automatically chooses the best extraction method


🛠️ Multiple Ways to Use

🖥️ Command Line (Zero Install)

    # Extract content from any source
    uvx --from "content-core" ccore https://example.com
    uvx --from "content-core" ccore document.pdf

    # Generate AI summaries
    uvx --from "content-core" csum video.mp4 --context "bullet points"

🤖 Claude Desktop Integration

One-click setup with Model Context Protocol (MCP) - extract content directly in Claude conversations.

🔍 Raycast Extension

Smart auto-detection commands:

Extract Content - Full interface with format options
Summarize Content - 9 summary styles available
Quick Extract - Instant clipboard extraction

🖱️ macOS Right-Click Integration

Right-click any file in Finder → Services → Extract or Summarize content instantly.

🐍 Python Library

    import content_core as cc

    # Extract from any source
    result = await cc.extract("https://example.com/article")
    summary = await cc.summarize_content(result, context="explain to a child")

⚡ Key Features

🎯 Intelligent Auto-Detection: Automatically selects the best extraction method based on content type and available services
🔧 Smart Engine Selection:
- URLs: Firecrawl → Jina → BeautifulSoup fallback chain
- Documents: Docling → Enhanced PyMuPDF → Simple extraction fallback
- Media: OpenAI Whisper transcription
- Images: OCR with multiple engine support
📊 Enhanced PDF Processing: Advanced PyMuPDF engine with quality flags, table detection, and optional OCR for mathematical formulas
🌍 Multiple Integrations: CLI, Python library, MCP server, Raycast extension, macOS Services
⚡ Zero-Install Options: Use uvx for instant access without installation
🧠 AI-Powered Processing: LLM integration for content cleaning and summarization
🔄 Asynchronous: Built with asyncio for efficient processing
🐍 Pure Python Implementation: No system dependencies required - simplified installation across all platforms
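
The fallback chains listed above can be pictured as trying each engine in order until one returns usable content. The following is a minimal sketch of that idea, not Content Core's actual implementation: `extract_with_fallback` and the three `fake_*` engines are hypothetical names introduced only for illustration, and treating empty output as a failure is an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple

# An engine takes a source string and returns extracted text.
Engine = Callable[[str], str]


def extract_with_fallback(
    source: str, engines: Sequence[Tuple[str, Engine]]
) -> Tuple[str, str]:
    """Try each (name, engine) pair in order; return (name, text) from the first success."""
    last_error: Optional[Exception] = None
    for name, engine in engines:
        try:
            text = engine(source)
            if text.strip():  # assumption: empty output counts as a failure
                return name, text
        except Exception as exc:  # a real implementation would catch narrower errors
            last_error = exc
    raise RuntimeError(f"all engines failed for {source!r}") from last_error


# Hypothetical stand-ins for real engines, for illustration only.
def fake_firecrawl(url: str) -> str:
    raise ConnectionError("no API key configured")


def fake_jina(url: str) -> str:
    return ""  # simulate an empty response


def fake_bs4(url: str) -> str:
    return f"plain text of {url}"


chain = [("firecrawl", fake_firecrawl), ("jina", fake_jina), ("beautifulsoup", fake_bs4)]
used, text = extract_with_fallback("https://example.com", chain)
print(used)
```

Here the first engine raises and the second returns nothing, so the last engine in the chain supplies the result.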

Getting Started

Installation

Install Content Core using pip - no system dependencies required!

    # Basic installation (PyMuPDF + BeautifulSoup/Jina extraction; MCP server support is now included by default)
    pip install content-core

    # Full installation with enhanced document processing (adds Docling)
    # The quotes keep shells such as zsh from interpreting the square brackets
    pip install "content-core[docling]"

Note: Unlike many content extraction tools, Content Core uses pure Python implementations and doesn't require system libraries like libmagic. This ensures consistent, hassle-free installation across Windows, macOS, and Linux.

Alternatively, if you’re developing locally:

    # Clone the repository
    git clone https://github.com/lfnovo/content-core
    cd content-core

    # Install with uv
    uv sync

Command-Line Interface

Content Core provides three CLI commands for extracting, cleaning, and summarizing content: ccore, cclean, and csum. These commands support input from text, URLs, files, or piped data (e.g., via cat file | command).

Zero-install usage with uvx:

    # Extract content
    uvx --from "content-core" ccore https://example.com

    # Clean content
    uvx --from "content-core" cclean "messy content"

    # Summarize content
    uvx --from "content-core" csum "long text" --context "bullet points"

ccore - Extract Content

Extracts content from text, URLs, or files, with optional formatting. Usage:

ccore [-f|--format xml|json|text] [-d|--debug] [content]

Options:

-f, --format: Output format (xml, json, or text). Default: text.
-d, --debug: Enable debug logging.
content: Input content (text, URL, or file path). If omitted, reads from stdin.

Examples:

    # Extract from a URL as text
    ccore https://example.com

    # Extract from a file as JSON
    ccore -f json document.pdf

    # Extract from piped text as XML
    echo "Sample text" | ccore --format xml
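
If you want to drive the CLI from a script, one option is to build the argument list in Python and hand it to `subprocess`. This is a sketch under the assumption that `uvx` is on your PATH; `build_ccore_command` and `run_ccore` are hypothetical helper names, and only the `--format` and `--debug` flags documented above are used.

```python
import subprocess
from typing import List


def build_ccore_command(source: str, fmt: str = "text", debug: bool = False) -> List[str]:
    """Build the argument list for `ccore`, run through uvx so nothing needs installing."""
    if fmt not in ("text", "json", "xml"):
        raise ValueError(f"unsupported format: {fmt}")
    cmd = ["uvx", "--from", "content-core", "ccore", "--format", fmt]
    if debug:
        cmd.append("--debug")
    cmd.append(source)
    return cmd


def run_ccore(source: str, fmt: str = "text") -> str:
    """Run ccore and return its standard output; raises CalledProcessError on failure."""
    result = subprocess.run(
        build_ccore_command(source, fmt), capture_output=True, text=True, check=True
    )
    return result.stdout


# Only print the command here; call run_ccore(...) when uvx and network access are available.
print(" ".join(build_ccore_command("document.pdf", fmt="json")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.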

cclean - Clean Content

Cleans content by removing unnecessary formatting, spaces, or artifacts. Accepts text, JSON, XML input, URLs, or file paths. Usage:

cclean [-d|--debug] [content]

Options:

-d, --debug: Enable debug logging.
content: Input content to clean (text, URL, file path, JSON, or XML). If omitted, reads from stdin.

Examples:

    # Clean a text string
    cclean " messy text "

    # Clean piped JSON
    echo '{"content": " messy text "}' | cclean

    # Clean content from a URL
    cclean https://example.com

    # Clean a file's content
    cclean document.txt

csum - Summarize Content

Summarizes content with an optional context to guide the summary style. Accepts text, JSON, XML input, URLs, or file paths.

Usage:

csum [--context "context text"] [-d|--debug] [content]

Options:

--context: Context for summarization (e.g., "explain to a child"). Default: none.
-d, --debug: Enable debug logging.
content: Input content to summarize (text, URL, file path, JSON, or XML). If omitted, reads from stdin.

Examples:

    # Summarize text
    csum "AI is transforming industries."

    # Summarize with context
    csum --context "in bullet points" "AI is transforming industries."

    # Summarize piped content
    cat article.txt | csum --context "one sentence"

    # Summarize content from URL
    csum https://example.com

    # Summarize a file's content
    csum document.txt

Quick Start

You can quickly integrate content-core into your Python projects to extract, clean, and summarize content from various sources.

    import content_core as cc
    from content_core.common import ProcessSourceInput

    # Extract content from a URL, file, or text
    result = await cc.extract("https://example.com/article")

    # Clean messy content
    cleaned_text = await cc.clean("...messy text with [brackets] and extra spaces...")

    # Summarize content with optional context
    summary = await cc.summarize_content("long article text", context="explain to a child")

    # Extract audio with custom speech-to-text model
    result = await cc.extract(ProcessSourceInput(
        file_path="interview.mp3",
        audio_provider="openai",
        audio_model="whisper-1"
    ))
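
The `await` calls above need an async context, such as a notebook or a coroutine started with `asyncio.run`. As a sketch of how several sources might be extracted concurrently from a plain script, the helper below takes the extraction coroutine as a parameter. `extract_many` and `fake_extract` are hypothetical names; `fake_extract` stands in for `cc.extract` so the example runs without network access.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def extract_many(
    sources: List[str],
    extract: Callable[[str], Awaitable[Any]],
) -> Dict[str, Any]:
    """Run one extraction per source concurrently and map each source to its result."""
    results = await asyncio.gather(*(extract(s) for s in sources))
    return dict(zip(sources, results))


async def fake_extract(source: str) -> str:
    """Stand-in for cc.extract so the sketch runs offline."""
    await asyncio.sleep(0)
    return f"content of {source}"


results = asyncio.run(extract_many(["a.pdf", "https://example.com"], fake_extract))
print(results)
```

In a real script you would pass `cc.extract` in place of `fake_extract`.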

Documentation

For more information on how to use the Content Core library, including details on AI model configuration and customization, refer to our Usage Documentation.

MCP Server Integration

Content Core includes a Model Context Protocol (MCP) server that enables seamless integration with Claude Desktop and other MCP-compatible applications. The MCP server exposes Content Core's powerful extraction capabilities through a standardized protocol.

Quick Setup with Claude Desktop

    # Install Content Core (MCP server included)
    pip install content-core

    # Or use directly with uvx (no installation required)
    uvx --from "content-core" content-core-mcp

Add to your claude_desktop_config.json:

    {
      "mcpServers": {
        "content-core": {
          "command": "uvx",
          "args": [
            "--from",
            "content-core",
            "content-core-mcp"
          ]
        }
      }
    }

For detailed setup instructions, configuration options, and usage examples, see our MCP Documentation.
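
If your claude_desktop_config.json already lists other servers, the entry can be merged in rather than pasted over the file. The sketch below shows one way to do that with the standard library; `add_content_core_server` is a hypothetical helper, and the demo writes to a temporary directory because the real location of the config file depends on your platform.

```python
import json
import tempfile
from pathlib import Path

# The server entry from the configuration shown above.
CONTENT_CORE_ENTRY = {
    "command": "uvx",
    "args": ["--from", "content-core", "content-core-mcp"],
}


def add_content_core_server(config_path: Path) -> dict:
    """Add the content-core entry to an MCP client config, keeping existing servers."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config.setdefault("mcpServers", {})["content-core"] = CONTENT_CORE_ENTRY
    config_path.write_text(json.dumps(config, indent=2))
    return config


# Demo against a throwaway file that already defines another (made-up) server.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "claude_desktop_config.json"
    path.write_text(json.dumps({"mcpServers": {"other": {"command": "node"}}}))
    merged = add_content_core_server(path)
    print(sorted(merged["mcpServers"]))
```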

Enhanced PDF Processing

Content Core features an optimized PyMuPDF extraction engine with significant improvements for scientific documents and complex PDFs.

Key Improvements

🔬 Mathematical Formula Extraction: Enhanced quality flags eliminate formula placeholders
📊 Automatic Table Detection: Tables converted to markdown format for LLM consumption
🔧 Quality Text Rendering: Better ligature, whitespace, and image-text integration
⚡ Optional OCR Enhancement: Selective OCR for formula-heavy pages (requires Tesseract)

Configuration for Scientific Documents

For documents with heavy mathematical content, enable OCR enhancement:

    # In cc_config.yaml
    extraction:
      pymupdf:
        enable_formula_ocr: true   # Enable OCR for formula-heavy pages
        formula_threshold: 3       # Min formulas per page to trigger OCR
        ocr_fallback: true         # Graceful fallback if OCR fails

    # Runtime configuration
    from content_core.config import set_pymupdf_ocr_enabled
    set_pymupdf_ocr_enabled(True)

Requirements for OCR Enhancement

    # Install Tesseract OCR (optional, for formula enhancement)

    # macOS
    brew install tesseract

    # Ubuntu/Debian
    sudo apt-get install tesseract-ocr

Note: OCR is optional - you get improved PDF extraction automatically without any additional setup.

macOS Services Integration

Content Core provides powerful right-click integration with macOS Finder, allowing you to extract and summarize content from any file without installation. Choose between clipboard or TextEdit output for maximum flexibility.

Available Services

Create 4 convenient services for different workflows:

Extract Content → Clipboard - Quick copy for immediate pasting
Extract Content → TextEdit - Review before using
Summarize Content → Clipboard - Quick summary copying
Summarize Content → TextEdit - Formatted summary with headers

Quick Setup

Install uv (if not already installed):
curl -LsSf https://astral.sh/uv/install.sh | sh
Create services manually using Automator (5 minutes setup)

Usage

Right-click any supported file in Finder → Services → Choose your option:

PDFs, Word docs - Instant text extraction
Videos, audio files - Automatic transcription
Images - OCR text recognition
Web content - Clean text extraction
Multiple files - Batch processing support

Features

Zero-install processing: Uses uvx for isolated execution
Multiple output options: Clipboard or TextEdit display
System notifications: Visual feedback on completion
Wide format support: 20+ file types supported
Batch processing: Handle multiple files at once
Keyboard shortcuts: Assignable hotkeys for power users

For complete setup instructions with copy-paste scripts, see macOS Services Documentation.

Raycast Extension

Content Core provides a powerful Raycast extension with smart auto-detection that handles both URLs and file paths seamlessly. Extract and summarize content directly from your Raycast interface without switching applications.

Quick Setup

From Raycast Store (coming soon):

Open Raycast and search for "Content Core"
Install the extension by luis_novo
Configure API keys in preferences

Manual Installation:

Download the extension from the repository
Open Raycast → "Import Extension"
Select the raycast-content-core folder

Commands

🔍 Extract Content - Smart URL/file detection with full interface

Auto-detects URLs vs file paths in real-time
Multiple output formats (Text, JSON, XML)
Drag & drop support for files
Rich results view with metadata

📝 Summarize Content - AI-powered summaries with customizable styles

9 different summary styles (bullet points, executive summary, etc.)
Auto-detects source type with visual feedback
One-click snippet creation and quicklinks

⚡ Quick Extract - Instant extraction to clipboard

Type → Tab → Paste source → Enter
No UI, works directly from command bar
Perfect for quick workflows

Features

Smart Auto-Detection: Instantly recognizes URLs vs file paths
Zero Installation: Uses uvx for Content Core execution
Rich Integration: Keyboard shortcuts, clipboard actions, Raycast snippets
All File Types: Documents, videos, audio, images, archives
Visual Feedback: Real-time type detection with icons

For detailed setup, configuration, and usage examples, see Raycast Extension Documentation.

Using with Langchain

For users integrating with the Langchain framework, content-core exposes a set of compatible tools. These tools, located in the src/content_core/tools directory, allow you to leverage content-core extraction, cleaning, and summarization capabilities directly within your Langchain agents and chains.

You can import and use these tools like any other Langchain tool. For example:

    from content_core.tools import extract_content_tool, cleanup_content_tool, summarize_content_tool
    from langchain.agents import initialize_agent, AgentType

    # `llm` must be a Langchain-compatible language model that you have already created
    tools = [extract_content_tool, cleanup_content_tool, summarize_content_tool]
    agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
    agent.run("Extract the content from https://example.com and then summarize it.")

Refer to the source code in src/content_core/tools for specific tool implementations and usage details.

Basic Usage

The core functionality revolves around the extract_content function.

    import asyncio
    from content_core.extraction import extract_content

    async def main():
        # Extract from raw text
        text_data = await extract_content({"content": "This is my sample text content."})
        print(text_data)

        # Extract from a URL (uses 'auto' engine by default)
        url_data = await extract_content({"url": "https://www.example.com"})
        print(url_data)

        # Extract from a local video file (gets transcript, engine='auto' by default)
        video_data = await extract_content({"file_path": "path/to/your/video.mp4"})
        print(video_data)

        # Extract from a local markdown file (engine='auto' by default)
        md_data = await extract_content({"file_path": "path/to/your/document.md"})
        print(md_data)

        # Per-execution override with Docling for documents
        doc_data = await extract_content({
            "file_path": "path/to/your/document.pdf",
            "document_engine": "docling",
            "output_format": "html"
        })
        print(doc_data)

        # Per-execution override with Firecrawl for URLs
        url_data = await extract_content({
            "url": "https://www.example.com",
            "url_engine": "firecrawl"
        })
        print(url_data)

    if __name__ == "__main__":
        asyncio.run(main())

(See src/content_core/notebooks/run.ipynb for more detailed examples.)

Docling Integration

Content Core supports an optional Docling-based extraction engine for rich document formats (PDF, DOCX, PPTX, XLSX, Markdown, AsciiDoc, HTML, CSV, Images).

Enabling Docling

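Docling is not selected explicitly by default. The document engine defaults to "auto", which tries Docling first when it is available and otherwise falls back to the enhanced PyMuPDF engine. To force Docling, set the document engine to "docling"; to avoid it entirely, set the engine to "simple".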

Via configuration file

In your cc_config.yaml or custom config, set:

    extraction:
      document_engine: docling   # 'auto' (default), 'simple', or 'docling'
      url_engine: auto           # 'auto' (default), 'simple', 'firecrawl', or 'jina'
      docling:
        output_format: markdown  # markdown | html | json

Programmatically in Python

    import content_core as cc
    from content_core.config import set_document_engine, set_url_engine, set_docling_output_format

    # switch document engine to Docling
    set_document_engine("docling")

    # switch URL engine to Firecrawl
    set_url_engine("firecrawl")

    # choose output format: 'markdown', 'html', or 'json'
    set_docling_output_format("html")

    # now extract as usual with cc.extract
    result = await cc.extract("document.pdf")

Configuration

Configuration settings (like API keys for external services, logging levels) can be managed through environment variables or .env files, loaded automatically via python-dotenv.

Example .env:

    OPENAI_API_KEY=your-key-here
    GOOGLE_API_KEY=your-key-here

    # Engine Selection (optional)
    CCORE_DOCUMENT_ENGINE=auto  # auto, simple, docling
    CCORE_URL_ENGINE=auto       # auto, simple, firecrawl, jina

    # Audio Processing (optional)
    CCORE_AUDIO_CONCURRENCY=3   # Number of concurrent audio transcriptions (1-10, default: 3)

    # Esperanto Timeout Configuration (optional)
    ESPERANTO_LLM_TIMEOUT=300   # Language model timeout in seconds (default: 300, max: 3600)
    ESPERANTO_STT_TIMEOUT=3600  # Speech-to-text timeout in seconds (default: 3600, max: 3600)

Engine Selection via Environment Variables

For deployment scenarios like MCP servers or Raycast extensions, you can override the extraction engines using environment variables:

CCORE_DOCUMENT_ENGINE: Force document engine (auto, simple, docling)
CCORE_URL_ENGINE: Force URL engine (auto, simple, firecrawl, jina)
CCORE_AUDIO_CONCURRENCY: Number of concurrent audio transcriptions (1-10, default: 3)

These variables take precedence over config file settings and provide explicit control for different deployment scenarios.
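
The precedence rule above (environment variable first, then config file, then the default) can be sketched as a small lookup. This is an illustration rather than Content Core's own code: `resolve_document_engine` is a hypothetical function, and skipping values outside the documented list is an assumption about how invalid input might be handled.

```python
import os
from typing import Mapping, Optional

DOCUMENT_ENGINES = ("auto", "simple", "docling")


def resolve_document_engine(
    config: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the document engine: environment variable, then config file, then 'auto'."""
    env = os.environ if env is None else env
    for candidate in (env.get("CCORE_DOCUMENT_ENGINE"), config.get("document_engine")):
        if candidate in DOCUMENT_ENGINES:  # assumption: unknown values are skipped
            return candidate
    return "auto"


file_config = {"document_engine": "simple"}
print(resolve_document_engine(file_config, env={"CCORE_DOCUMENT_ENGINE": "docling"}))
print(resolve_document_engine(file_config, env={}))
```

The first call prints `docling` because the environment variable wins; the second prints `simple` because only the config file sets a value.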

Audio Processing Configuration

Content Core processes long audio files by splitting them into segments and transcribing them in parallel for improved performance. You can control the concurrency level to balance speed with API rate limits:

Default: 3 concurrent transcriptions
Range: 1-10 concurrent transcriptions
Configuration: Set via CCORE_AUDIO_CONCURRENCY environment variable or extraction.audio.concurrency in cc_config.yaml

Higher concurrency values can speed up processing of long audio/video files but may hit API rate limits. Lower values are more conservative and suitable for accounts with lower API quotas.
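
Bounded parallelism of this kind is commonly built with a semaphore. The sketch below shows the general technique, assuming nothing about Content Core's internals: `transcribe_segments` and `fake_transcribe` are hypothetical names, and the fake transcriber only records how many calls overlap.

```python
import asyncio
from typing import Awaitable, Callable, List


async def transcribe_segments(
    segments: List[str],
    transcribe: Callable[[str], Awaitable[str]],
    concurrency: int = 3,
) -> List[str]:
    """Transcribe segments in parallel, never running more than `concurrency` at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(segment: str) -> str:
        async with semaphore:
            return await transcribe(segment)

    # gather preserves input order, so the transcript pieces stay in sequence
    return list(await asyncio.gather(*(bounded(s) for s in segments)))


# Track how many fake transcriptions overlap, to show the limit is respected.
stats = {"active": 0, "peak": 0}


async def fake_transcribe(segment: str) -> str:
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)  # stand-in for a speech-to-text API call
    stats["active"] -= 1
    return segment.upper()


parts = asyncio.run(
    transcribe_segments([f"seg{i}" for i in range(8)], fake_transcribe, concurrency=3)
)
print(stats["peak"], parts[:2])
```

Raising `concurrency` shortens wall-clock time until the API's rate limit becomes the bottleneck, which is the trade-off described above.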

Retry Configuration

Content Core includes automatic retry logic for transient failures in external operations (network requests, API calls, transcription). Retries use exponential backoff with jitter to handle temporary issues gracefully.

Supported operations:

youtube - YouTube video title and transcript fetching (5 retries, 2-60s backoff)
url_api - URL extraction via Jina/Firecrawl APIs (3 retries, 1-30s backoff)
url_network - Network operations like HEAD requests, BeautifulSoup (3 retries, 0.5-10s backoff)
audio - Audio transcription API calls (3 retries, 2-30s backoff)
llm - LLM API calls for cleanup/summary (3 retries, 1-30s backoff)
download - Remote file downloads (3 retries, 1-15s backoff)
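
For readers unfamiliar with the pattern, exponential backoff with jitter can be sketched as follows. This is a generic illustration, not the library's implementation: the "full jitter" variant (a random delay between zero and the capped exponential value) is an assumption, as is counting retries in addition to the first attempt. `backoff_delay` and `retry` are hypothetical names.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base_delay: float, max_delay: float, rng: random.Random) -> float:
    """Delay before retry number `attempt` (0-based): exponential, capped, with full jitter."""
    ceiling = min(max_delay, base_delay * (2 ** attempt))
    return rng.uniform(0, ceiling)


def retry(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`; on failure, wait and retry up to `max_retries` more times."""
    rng = random.Random()
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(backoff_delay(attempt, base_delay, max_delay, rng))
    raise AssertionError("unreachable")


calls = {"n": 0}


def flaky() -> str:
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Pass a no-op sleep so the demo finishes instantly.
result = retry(flaky, max_retries=3, base_delay=1.0, max_delay=30.0, sleep=lambda _: None)
print(result, calls["n"])
```

The jitter spreads retries from many clients over time so they do not all hit the service at the same instant.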

Environment variable overrides:

    # Override retry settings per operation type
    CCORE_YOUTUBE_MAX_RETRIES=10   # Max retry attempts (1-20)
    CCORE_YOUTUBE_BASE_DELAY=3     # Base delay in seconds (0.1-60)
    CCORE_YOUTUBE_MAX_DELAY=120    # Max delay in seconds (1-300)

    # Same pattern for other operations:
    CCORE_URL_API_MAX_RETRIES=5
    CCORE_AUDIO_MAX_RETRIES=5
    CCORE_LLM_MAX_RETRIES=5
    CCORE_DOWNLOAD_MAX_RETRIES=5

For detailed configuration, see our Usage Documentation.

Timeout Configuration

Content Core uses the Esperanto library for AI model interactions and supports configurable timeouts for different operations. Timeouts prevent requests from hanging indefinitely and ensure reliable processing.

Configuration Methods (in priority order):

Config Files (highest priority): Set in cc_config.yaml or models_config.yaml
Environment Variables: Provide global defaults via ESPERANTO_LLM_TIMEOUT and ESPERANTO_STT_TIMEOUT when a timeout isn't specified in configuration files

Default Timeouts:

Speech-to-Text: 3600 seconds (1 hour) - for very long audio files
Language Models: 300-600 seconds - for content processing operations
Cleanup Model: 600 seconds (10 minutes) - handles large content with 8000 max tokens
Summary Model: 300 seconds (5 minutes) - for content summarization

Environment Variable Overrides:

    # Override language model timeout globally (used when config files omit a timeout)
    export ESPERANTO_LLM_TIMEOUT=300

    # Override speech-to-text timeout globally (used when config files omit a timeout)
    export ESPERANTO_STT_TIMEOUT=3600

Valid Range: 1 to 3600 seconds (1 hour maximum)
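
Note that the priority here is the reverse of engine selection: a timeout set in a config file wins, and the environment variable only supplies a default. A minimal sketch of that lookup follows; `resolve_timeout` is a hypothetical function, and rejecting out-of-range values with an error is an assumption, since the text above only states the valid range.

```python
from typing import Mapping, Optional

MIN_TIMEOUT, MAX_TIMEOUT = 1, 3600  # valid range in seconds, as documented above


def resolve_timeout(
    config_value: Optional[int],
    env: Mapping[str, str],
    env_var: str,
    default: int,
) -> int:
    """Pick a timeout in seconds: config file value, else environment variable, else default."""
    if config_value is not None:
        timeout = config_value
    elif env.get(env_var):
        timeout = int(env[env_var])
    else:
        timeout = default
    if not MIN_TIMEOUT <= timeout <= MAX_TIMEOUT:
        # assumption: invalid values are rejected rather than clamped
        raise ValueError(f"timeout must be between {MIN_TIMEOUT} and {MAX_TIMEOUT} seconds")
    return timeout


env = {"ESPERANTO_LLM_TIMEOUT": "120"}
print(resolve_timeout(None, env, "ESPERANTO_LLM_TIMEOUT", default=300))
print(resolve_timeout(600, env, "ESPERANTO_LLM_TIMEOUT", default=300))
```

The first call prints `120` (environment variable, because the config omits a timeout); the second prints `600` (the config file value takes priority).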

For more details on Esperanto timeout configuration, see the Esperanto documentation.

Custom Prompt Templates

Content Core allows you to define custom prompt templates for content processing. By default, the library uses built-in prompts located in the prompts directory. However, you can create your own prompt templates and store them in a dedicated directory. To specify the location of your custom prompts, set the PROMPT_PATH environment variable in your .env file or system environment.

Example .env with custom prompt path:

    OPENAI_API_KEY=your-key-here
    GOOGLE_API_KEY=your-key-here
    PROMPT_PATH=/path/to/your/custom/prompts

When a prompt template is requested, Content Core will first look in the custom directory specified by PROMPT_PATH (if set and exists). If the template is not found there, it will fall back to the default built-in prompts. This allows you to override specific prompts while still using the default ones for others.
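
The lookup order described above can be sketched with `pathlib`. This is an illustration only: `find_prompt_template` is a hypothetical function, and the template file names and the `.jinja` extension in the demo are made up for the example. In real use, the custom directory would come from the PROMPT_PATH environment variable.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_prompt_template(name: str, builtin_dir: Path, custom_dir: Optional[Path] = None) -> Path:
    """Return the custom template if it exists, otherwise the built-in one."""
    if custom_dir is not None and custom_dir.is_dir():
        candidate = custom_dir / name
        if candidate.is_file():
            return candidate
    fallback = builtin_dir / name
    if fallback.is_file():
        return fallback
    raise FileNotFoundError(f"no prompt template named {name!r}")


# Demo: override one template, leave the other to the built-in directory.
with tempfile.TemporaryDirectory() as tmp:
    builtin = Path(tmp) / "builtin"
    custom = Path(tmp) / "custom"
    builtin.mkdir()
    custom.mkdir()
    (builtin / "summarize.jinja").write_text("built-in summary prompt")
    (builtin / "cleanup.jinja").write_text("built-in cleanup prompt")
    (custom / "summarize.jinja").write_text("my summary prompt")

    chosen = {
        name: find_prompt_template(name, builtin, custom).parent.name
        for name in ("summarize.jinja", "cleanup.jinja")
    }

print(chosen)
```

Only the overridden template resolves to the custom directory; the other still comes from the built-in prompts.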

Development

To set up a development environment:

    # Clone the repository
    git clone <repository-url>
    cd content-core

    # Create virtual environment and install dependencies
    uv venv
    source .venv/bin/activate
    uv sync --group dev

    # Run tests
    make test

    # Lint code
    make lint

    # See all commands
    make help

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributing

Contributions are welcome! Please see our Contributing Guide for more details on how to get started.
