content-core
This MCP server provides AI-powered content extraction from URLs and files through Content Core's intelligent auto-detection engine.
• Extract from URLs - Retrieve clean, structured content from web pages using smart engine selection (Firecrawl → Jina → BeautifulSoup fallback)
• Process documents - Extract text from PDF, Word, PowerPoint, Excel, Markdown, HTML, and EPUB files (Docling → Enhanced PyMuPDF fallback)
• Transcribe media - Convert video (MP4, AVI, MOV) and audio (MP3, WAV, M4A) to text using OpenAI Whisper speech-to-text
• Extract from images - Process JPG, PNG, and TIFF images with OCR text recognition
• Handle archives - Extract and analyze content from ZIP, TAR, and GZ files
• Automatic optimization - The 'auto' engine intelligently selects the best extraction method based on content type
• Structured output - Returns JSON responses with extracted content, metadata, and supports multiple formats (text, JSON, XML)
• Multiple interfaces - Access through CLI commands, Python library, MCP server, Raycast extension, and macOS Services
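The routing done by the 'auto' engine can be illustrated with a stdlib-only sketch. This is a hypothetical illustration of the idea (guess the content type, then dispatch to an extraction category), not Content Core's actual implementation:

```python
import mimetypes
from urllib.parse import urlparse

def detect_category(source: str) -> str:
    """Illustrative sketch of 'auto'-style routing; not Content Core's real logic."""
    if urlparse(source).scheme in ("http", "https"):
        return "web"
    mime, _ = mimetypes.guess_type(source)
    if mime is None:
        return "unknown"
    if mime.startswith(("audio/", "video/")):
        return "media"      # route to speech-to-text
    if mime.startswith("image/"):
        return "image"      # route to OCR
    if mime in ("application/zip", "application/gzip", "application/x-tar"):
        return "archive"
    return "document"       # PDF, DOCX, Markdown, HTML, ...

print(detect_category("https://example.com/article"))  # web
print(detect_category("talk.mp4"))                     # media
print(detect_category("report.pdf"))                   # document
```
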
Exposes a set of LangChain-compatible tools, enabling extraction, cleaning, and summarization capabilities directly within LangChain agents and chains.
Enables right-click integration with macOS Finder through Services, allowing content extraction and summarization from any supported file with options for clipboard or TextEdit output.
Integrates with OpenAI services for transcription (Whisper) and content processing, allowing for AI-powered content extraction and summarization.
Provides a Python library for programmatic access to content extraction, cleaning, and summarization capabilities, with asynchronous functionality and customizable options.
Offers a Raycast extension with smart auto-detection commands for extracting and summarizing content from various sources, including URLs and files, with multiple output options and visual feedback.
Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type @ followed by the MCP server name and your instructions, e.g., "@content-core extract the main points from this article: https://example.com/tech-news"
That's it! The server will respond to your query, and you can continue using it as needed.
Here is a step-by-step guide with screenshots.
Content Core
Extract, process, and summarize content from URLs, files, and text through a unified async Python API, CLI, or MCP server.
Supported Formats
| Category | Formats |
| --- | --- |
| Web | URLs, HTML pages, YouTube videos, Reddit posts |
| Documents | PDF, DOCX, PPTX, XLSX, EPUB, Markdown, plain text |
| Media | MP3, WAV, M4A, FLAC, OGG (audio); MP4, AVI, MOV, MKV (video) |
Related MCP server: Fetch MCP Server
Quick Start
pip install content-core

import content_core
result = await content_core.extract_content(url="https://example.com")
print(result.content)
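Because extract_content is a coroutine, calling it from a plain script (outside an async context) requires asyncio.run. A minimal sketch of that wiring, with a stub coroutine standing in for content_core.extract_content since only the call shape matters here:

```python
import asyncio

# Stub standing in for content_core.extract_content (illustration only).
async def extract_content(url: str):
    return type("Result", (), {"content": f"extracted from {url}"})()

def main() -> str:
    # asyncio.run drives the coroutine to completion from synchronous code.
    result = asyncio.run(extract_content(url="https://example.com"))
    return result.content

print(main())  # extracted from https://example.com
```
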
Or with zero install:

uvx content-core extract "https://example.com"

CLI Usage
Content Core provides a unified content-core command with subcommands for extraction, summarization, and running the MCP server.
Extract
# From a URL
content-core extract "https://example.com"
# From a file
content-core extract document.pdf
# With JSON output
content-core extract document.pdf --format json
# With a specific engine
content-core extract "https://example.com" --engine firecrawl
# From stdin
echo "some text" | content-core extract

Summarize
# Summarize text
content-core summarize "Long article text here..."
# With context
content-core summarize "Long text" --context "bullet points"
# From stdin
cat article.txt | content-core summarize --context "explain to a child"

MCP Server
content-core mcp

Configuration
# Set persistent config
content-core config set llm_provider anthropic
content-core config set llm_model claude-sonnet-4-20250514
# List current config
content-core config list
# Delete a config value
content-core config delete llm_provider

Config is stored in ~/.content-core/config.toml. Priority: command flags > env vars > config file > defaults.
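The priority order (command flags > env vars > config file > defaults) behaves like a chained lookup where the first layer that defines a key wins. A stdlib sketch of the idea, with illustrative values rather than Content Core's actual resolution code:

```python
import os
from collections import ChainMap

defaults = {"url_engine": "auto"}                  # built-in defaults (illustrative)
config_file = {"llm_provider": "anthropic"}        # ~/.content-core/config.toml
env = {k[len("CCORE_"):].lower(): v
       for k, v in os.environ.items() if k.startswith("CCORE_")}
flags = {}                                         # e.g. --engine firecrawl

# Earlier mappings shadow later ones: flags > env > config file > defaults.
settings = ChainMap(flags, env, config_file, defaults)
print(settings["llm_provider"])   # from the config file layer
print(settings["url_engine"])     # falls through to the default
```
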
Zero-Install with uvx
All commands work without installation using uvx:
uvx content-core extract "https://example.com"
uvx content-core summarize "text" --context "one sentence"
uvx content-core mcp

Python API
Extraction
import content_core
# From a URL
result = await content_core.extract_content(url="https://example.com")
# From a file
result = await content_core.extract_content(file_path="document.pdf")
# From text
result = await content_core.extract_content(content="some text")
# With engine override
from content_core import ContentCoreConfig
config = ContentCoreConfig(url_engine="firecrawl")
result = await content_core.extract_content(url="https://example.com", config=config)

Summarization
import content_core
summary = await content_core.summarize("long article text", context="bullet points")

Configuration
from content_core import ContentCoreConfig
config = ContentCoreConfig(
url_engine="firecrawl",
document_engine="docling",
audio_concurrency=5,
)
result = await content_core.extract_content(url="https://example.com", config=config)

MCP Integration
Content Core includes a Model Context Protocol (MCP) server for use with Claude Desktop and other MCP-compatible applications.
Add to your claude_desktop_config.json:
{
"mcpServers": {
"content-core": {
"command": "uvx",
"args": ["content-core", "mcp"],
"env": {
"OPENAI_API_KEY": "sk-..."
}
}
}
}

The MCP server exposes two tools: extract_content and summarize_content. Both return plain text.
For detailed setup, see the MCP documentation.
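If claude_desktop_config.json already lists other servers, the content-core entry should be merged in rather than overwriting the file. A stdlib-only sketch of that merge, using the command, args, and env keys from the snippet above:

```python
import json

# Entry from the configuration snippet above.
entry = {
    "command": "uvx",
    "args": ["content-core", "mcp"],
    "env": {"OPENAI_API_KEY": "sk-..."},
}

def add_server(config_text: str) -> str:
    """Merge the content-core entry into an existing config, keeping other servers."""
    config = json.loads(config_text) if config_text.strip() else {}
    config.setdefault("mcpServers", {})["content-core"] = entry
    return json.dumps(config, indent=2)

print(add_server('{"mcpServers": {"other": {"command": "npx"}}}'))
```
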
Claude Code Skill
Content Core includes a SKILL.md that teaches AI agents how to use it for extracting content from external sources. To make it available in your Claude Code project, copy it to your skills directory:
# Download the skill
curl -o .claude/skills/content-core/SKILL.md --create-dirs \
https://raw.githubusercontent.com/lfnovo/content-core/main/SKILL.md

Once installed, Claude Code can use content-core to extract content from URLs, documents, and media files, either via the CLI (uvx content-core) or via MCP if configured.
AI Providers
Content Core uses Esperanto to support multiple LLM and STT providers. Switch providers by changing the config — no code changes needed:
# Use Anthropic for summarization
content-core config set llm_provider anthropic
content-core config set llm_model claude-sonnet-4-20250514
# Use Groq for transcription
content-core config set stt_provider groq
content-core config set stt_model whisper-large-v3

Supported providers include OpenAI, Anthropic, Google, Groq, DeepSeek, Ollama, and more. See the Esperanto documentation for the full list.
Configuration
Content Core uses ContentCoreConfig powered by pydantic-settings. Settings are resolved in priority order: constructor args > env vars (CCORE_*) > config file (~/.content-core/config.toml) > defaults.
Environment Variables
| Variable | Description | Default |
| --- | --- | --- |
| `CCORE_URL_ENGINE` | URL extraction engine | `auto` |
| `CCORE_DOCUMENT_ENGINE` | Document extraction engine | `auto` |
| `CCORE_AUDIO_CONCURRENCY` | Concurrent audio transcriptions (1-10) | |
| | Crawl4AI Docker API URL (omit for local browser mode) | - |
| | Custom Firecrawl API URL for self-hosted instances | - |
| | Firecrawl proxy mode | |
| | Wait time in ms before extraction | |
| `CCORE_LLM_PROVIDER` | LLM provider for summarization | - |
| `CCORE_LLM_MODEL` | LLM model for summarization | - |
| `CCORE_STT_PROVIDER` | Speech-to-text provider | - |
| `CCORE_STT_MODEL` | Speech-to-text model | - |
| | Speech-to-text timeout in seconds | - |
| | Preferred YouTube transcript languages | - |
API keys for external services are set via their standard environment variables (e.g., OPENAI_API_KEY, FIRECRAWL_API_KEY, JINA_API_KEY).
Proxy Configuration
Content Core reads standard HTTP_PROXY / HTTPS_PROXY / NO_PROXY environment variables automatically. No additional configuration is needed.
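This matches standard Python behavior: urllib (and libraries built on it) discover proxies from these environment variables via urllib.request.getproxies. A quick check, with a hypothetical proxy address:

```python
import os
import urllib.request

os.environ["HTTPS_PROXY"] = "http://proxy.internal:8080"  # hypothetical proxy

# getproxies() scans the environment for *_proxy / *_PROXY variables.
proxies = urllib.request.getproxies()
print(proxies.get("https"))  # http://proxy.internal:8080
```
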
Optional Dependencies
# Docling for advanced document parsing (PDF, DOCX, PPTX, XLSX)
pip install content-core[docling]
# Crawl4AI for local browser-based URL extraction
pip install content-core[crawl4ai]
python -m playwright install --with-deps
# LangChain tool wrappers
pip install content-core[langchain]
# All optional features
pip install content-core[docling,crawl4ai,langchain]

Using with LangChain
When installed with the langchain extra, Content Core provides LangChain-compatible tool wrappers:
from content_core.tools import extract_content_tool, summarize_content_tool
tools = [extract_content_tool, summarize_content_tool]

Documentation
Usage Guide -- Python API details, configuration, and examples
Processors -- How content extraction works for each format
MCP Server -- Claude Desktop and MCP integration
Development
git clone https://github.com/lfnovo/content-core
cd content-core
uv sync --group dev
# Run tests
make test
# Lint
make ruff

License
This project is licensed under the MIT License.
Contributing
Contributions are welcome! Please see our Contributing Guide for details.