Content Core
Content Core is a powerful, AI-powered content extraction and processing platform that transforms any source into clean, structured content. Extract text from websites, transcribe videos, process documents, and generate AI summaries—all through a unified interface with multiple integration options.
🚀 What You Can Do
Extract content from anywhere:
- 📄 Documents - PDF, Word, PowerPoint, Excel, Markdown, HTML, EPUB
- 🎥 Media - Videos (MP4, AVI, MOV) with automatic transcription
- 🎵 Audio - MP3, WAV, M4A with speech-to-text conversion
- 🌐 Web - Any URL with intelligent content extraction
- 🖼️ Images - JPG, PNG, TIFF with OCR text recognition
- 📦 Archives - ZIP, TAR, GZ with content analysis
Process with AI:
- ✨ Clean & format extracted content automatically
- 📝 Generate summaries with customizable styles (bullet points, executive summary, etc.)
- 🎯 Context-aware processing - explain to a child, technical summary, action items
- 🔄 Smart engine selection - automatically chooses the best extraction method
🛠️ Multiple Ways to Use
🖥️ Command Line (Zero Install)
🤖 Claude Desktop Integration
One-click setup with Model Context Protocol (MCP) - extract content directly in Claude conversations.
🔍 Raycast Extension
Smart auto-detection commands:
- Extract Content - Full interface with format options
- Summarize Content - 9 summary styles available
- Quick Extract - Instant clipboard extraction
🖱️ macOS Right-Click Integration
Right-click any file in Finder → Services → Extract or Summarize content instantly.
🐍 Python Library
⚡ Key Features
- 🎯 Intelligent Auto-Detection: Automatically selects the best extraction method based on content type and available services
- 🔧 Smart Engine Selection:
  - URLs: Firecrawl → Jina → BeautifulSoup fallback chain
  - Documents: Docling → Enhanced PyMuPDF → Simple extraction fallback
  - Media: OpenAI Whisper transcription
  - Images: OCR with multiple engine support
- 📊 Enhanced PDF Processing: Advanced PyMuPDF engine with quality flags, table detection, and optional OCR for mathematical formulas
- 🌍 Multiple Integrations: CLI, Python library, MCP server, Raycast extension, macOS Services
- ⚡ Zero-Install Options: Use uvx for instant access without installation
- 🧠 AI-Powered Processing: LLM integration for content cleaning and summarization
- 🔄 Asynchronous: Built with asyncio for efficient processing
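The fallback chains listed under Smart Engine Selection can be pictured with a short sketch. This is not Content Core's implementation: the function name `extract_with_fallback` and the three stub engines are hypothetical stand-ins, and the rule that an empty result counts as a failure is an assumption made for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

# An engine is a (name, callable) pair; the callable returns text or raises.
Engine = Tuple[str, Callable[[str], Optional[str]]]


def extract_with_fallback(source: str, engines: Sequence[Engine]) -> Tuple[str, str]:
    """Try each engine in order and return (engine_name, text) from the first success."""
    errors = []
    for name, engine in engines:
        try:
            text = engine(source)
        except Exception as exc:
            # A failing or unavailable engine just moves us down the chain.
            errors.append(f"{name}: {exc}")
            continue
        if text:
            return name, text
        # Assumption: an empty result is treated like a failure.
        errors.append(f"{name}: empty result")
    raise RuntimeError("all engines failed: " + "; ".join(errors))


# Hypothetical stand-ins for the real engines, used only to exercise the chain.
def firecrawl_stub(url: str) -> str:
    raise RuntimeError("no API key configured")


def jina_stub(url: str) -> str:
    return ""


def beautifulsoup_stub(url: str) -> str:
    return f"plain text from {url}"


URL_CHAIN = [
    ("firecrawl", firecrawl_stub),
    ("jina", jina_stub),
    ("beautifulsoup", beautifulsoup_stub),
]

if __name__ == "__main__":
    print(extract_with_fallback("https://example.com", URL_CHAIN))
```

Keeping the chain as plain data makes the order easy to change per content type, which is roughly what "smart engine selection" implies.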
Getting Started
Installation
Install Content Core using pip:
Alternatively, if you’re developing locally:
Command-Line Interface
Content Core provides three CLI commands for extracting, cleaning, and summarizing content: ccore, cclean, and csum. These commands support input from text, URLs, files, or piped data (e.g., via cat file | command).
Zero-install usage with uvx:
ccore - Extract Content
Extracts content from text, URLs, or files, with optional formatting. Usage:
Options:
- -f, --format: Output format (xml, json, or text). Default: text.
- -d, --debug: Enable debug logging.
- content: Input content (text, URL, or file path). If omitted, reads from stdin.
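If you call ccore from a Python script rather than a shell, the options above map onto an argument list. The following is a minimal sketch: the helper name `build_ccore_args` is hypothetical, and only the flags documented above are used.

```python
import shlex
from typing import List, Optional

# Output formats listed in the options above.
VALID_FORMATS = ("text", "json", "xml")


def build_ccore_args(content: Optional[str] = None,
                     fmt: str = "text",
                     debug: bool = False) -> List[str]:
    """Build an argv list for the ccore command from the documented options."""
    if fmt not in VALID_FORMATS:
        raise ValueError(f"unsupported format {fmt!r}; expected one of {VALID_FORMATS}")
    args = ["ccore", "--format", fmt]
    if debug:
        args.append("--debug")
    if content is not None:
        # When content is omitted, ccore reads from stdin instead.
        args.append(content)
    return args


if __name__ == "__main__":
    # Print a shell-safe rendering of the command without running it.
    print(shlex.join(build_ccore_args("https://example.com", fmt="json")))
```

To actually run the command, pass the list to `subprocess.run(args, capture_output=True, text=True)`; this requires the ccore command to be available on your PATH.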
Examples:
cclean - Clean Content
Cleans content by removing unnecessary formatting, spaces, or artifacts. Accepts text, JSON, XML input, URLs, or file paths. Usage:
Options:
- -d, --debug: Enable debug logging.
- content: Input content to clean (text, URL, file path, JSON, or XML). If omitted, reads from stdin.
Examples:
csum - Summarize Content
Summarizes content with an optional context to guide the summary style. Accepts text, JSON, XML input, URLs, or file paths.
Usage:
Options:
- --context: Context for summarization (e.g., "explain to a child"). Default: none.
- -d, --debug: Enable debug logging.
- content: Input content to summarize (text, URL, file path, JSON, or XML). If omitted, reads from stdin.
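Because csum reads from stdin when no content argument is given, text can be piped to it from Python. Here is a sketch under stated assumptions: the helper names `csum_command` and `summarize_text` are hypothetical, the `run` parameter exists only so the call can be swapped out in tests, and the summary is assumed to arrive on standard output.

```python
import subprocess
from typing import Callable, List, Optional


def csum_command(context: Optional[str] = None, debug: bool = False) -> List[str]:
    """Build the csum argv using only the documented --context and --debug options."""
    cmd = ["csum"]
    if context:
        cmd += ["--context", context]
    if debug:
        cmd.append("--debug")
    return cmd


def summarize_text(text: str,
                   context: Optional[str] = None,
                   run: Callable = subprocess.run) -> str:
    """Pipe text to csum on stdin and return whatever it prints (assumed to be the summary)."""
    result = run(
        csum_command(context),
        input=text,           # sent to the process on stdin
        capture_output=True,
        text=True,
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()
```

Calling `summarize_text(article, context="explain to a child")` requires csum to be installed; with `check=True`, a failed run surfaces as an exception instead of an empty summary.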
Examples:
Quick Start
You can quickly integrate content-core into your Python projects to extract, clean, and summarize content from various sources.
Documentation
For more information on how to use the Content Core library, including details on AI model configuration and customization, refer to our Usage Documentation.
MCP Server Integration
Content Core includes a Model Context Protocol (MCP) server that enables seamless integration with Claude Desktop and other MCP-compatible applications. The MCP server exposes Content Core's powerful extraction capabilities through a standardized protocol.
Quick Setup with Claude Desktop
Add to your claude_desktop_config.json:
For detailed setup instructions, configuration options, and usage examples, see our MCP Documentation.
Enhanced PDF Processing
Content Core features an optimized PyMuPDF extraction engine with significant improvements for scientific documents and complex PDFs.
Key Improvements
- 🔬 Mathematical Formula Extraction: Enhanced quality flags eliminate <!-- formula-not-decoded --> placeholders
- 📊 Automatic Table Detection: Tables converted to markdown format for LLM consumption
- 🔧 Quality Text Rendering: Better ligature, whitespace, and image-text integration
- ⚡ Optional OCR Enhancement: Selective OCR for formula-heavy pages (requires Tesseract)
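To make "tables converted to markdown format" concrete, the sketch below renders rows that have already been detected as a markdown table. It is illustrative only and not the library's code: `rows_to_markdown` is a hypothetical helper, table detection itself is not shown, and the first row is assumed to be the header.

```python
from typing import Sequence


def rows_to_markdown(rows: Sequence[Sequence[object]]) -> str:
    """Render rows as a markdown table, treating the first row as the header."""
    if not rows:
        return ""
    width = len(rows[0])

    def format_row(row: Sequence[object]) -> str:
        # Escape pipes and flatten newlines so each cell stays on one table line.
        cells = [str(c).replace("|", "\\|").replace("\n", " ").strip() for c in row]
        # Pad short rows so every line has the same number of columns.
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells[:width]) + " |"

    lines = [format_row(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines += [format_row(row) for row in rows[1:]]
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_markdown([["Symbol", "Value"], ["c", "299792458"]]))
```

Markdown tables are compact and most LLMs handle them well, which is presumably why this output format was chosen.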
Configuration for Scientific Documents
For documents with heavy mathematical content, enable OCR enhancement:
Requirements for OCR Enhancement
Note: OCR is optional - you get improved PDF extraction automatically without any additional setup.
macOS Services Integration
Content Core provides powerful right-click integration with macOS Finder, allowing you to extract and summarize content from any file without installation. Choose between clipboard or TextEdit output for maximum flexibility.
Available Services
Create 4 convenient services for different workflows:
- Extract Content → Clipboard - Quick copy for immediate pasting
- Extract Content → TextEdit - Review before using
- Summarize Content → Clipboard - Quick summary copying
- Summarize Content → TextEdit - Formatted summary with headers
Quick Setup
- Install uv (if not already installed):
- Create services manually using Automator (about 5 minutes)
Usage
Right-click any supported file in Finder → Services → Choose your option:
- PDFs, Word docs - Instant text extraction
- Videos, audio files - Automatic transcription
- Images - OCR text recognition
- Web content - Clean text extraction
- Multiple files - Batch processing support
Features
- Zero-install processing: Uses uvx for isolated execution
- Multiple output options: Clipboard or TextEdit display
- System notifications: Visual feedback on completion
- Wide format support: 20+ file types supported
- Batch processing: Handle multiple files at once
- Keyboard shortcuts: Assignable hotkeys for power users
For complete setup instructions with copy-paste scripts, see macOS Services Documentation.
Raycast Extension
Content Core provides a powerful Raycast extension with smart auto-detection that handles both URLs and file paths seamlessly. Extract and summarize content directly from your Raycast interface without switching applications.
Quick Setup
From Raycast Store (coming soon):
- Open Raycast and search for "Content Core"
- Install the extension by luis_novo
- Configure API keys in preferences
Manual Installation:
- Download the extension from the repository
- Open Raycast → "Import Extension"
- Select the raycast-content-core folder
Commands
🔍 Extract Content - Smart URL/file detection with full interface
- Auto-detects URLs vs file paths in real-time
- Multiple output formats (Text, JSON, XML)
- Drag & drop support for files
- Rich results view with metadata
📝 Summarize Content - AI-powered summaries with customizable styles
- 9 different summary styles (bullet points, executive summary, etc.)
- Auto-detects source type with visual feedback
- One-click snippet creation and quicklinks
⚡ Quick Extract - Instant extraction to clipboard
- Type → Tab → Paste source → Enter
- No UI, works directly from command bar
- Perfect for quick workflows
Features
- Smart Auto-Detection: Instantly recognizes URLs vs file paths
- Zero Installation: Uses uvx for Content Core execution
- Rich Integration: Keyboard shortcuts, clipboard actions, Raycast snippets
- All File Types: Documents, videos, audio, images, archives
- Visual Feedback: Real-time type detection with icons
For detailed setup, configuration, and usage examples, see Raycast Extension Documentation.
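The URL-versus-file-path detection described above can be approximated in a few lines. This is a sketch of the idea, not the extension's actual logic: `detect_source_type` is a hypothetical name, and the rule used here (an http or https scheme plus a host means URL, everything else is a file path) is an assumption.

```python
from urllib.parse import urlparse


def detect_source_type(source: str) -> str:
    """Classify an input string as 'url' or 'file' using a simple heuristic."""
    candidate = source.strip()
    parsed = urlparse(candidate)
    # Assumption: only http(s) inputs with a host component count as URLs.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    # Anything else, including Windows paths like C:\docs\a.pdf, is treated as a path.
    return "file"


if __name__ == "__main__":
    for example in ("https://example.com/paper.pdf", "~/Documents/report.pdf"):
        print(example, "->", detect_source_type(example))
```

Checking the scheme explicitly keeps drive letters and relative paths from being mistaken for URLs.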
Using with LangChain
For users integrating with the LangChain framework, content-core exposes a set of compatible tools. These tools, located in the src/content_core/tools directory, let you use content-core's extraction, cleaning, and summarization capabilities directly within your LangChain agents and chains.
You can import and use these tools like any other Langchain tool. For example:
Refer to the source code in src/content_core/tools for specific tool implementations and usage details.
Basic Usage
The core functionality revolves around the extract_content function.
(See src/content_core/notebooks/run.ipynb for more detailed examples.)
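Since the library is built on asyncio, several sources can be extracted concurrently. The sketch below shows that pattern with a stand-in coroutine: `fake_extract_content` is hypothetical, and the request keys `url` and `file_path` are assumptions for illustration. Check the notebook referenced above for the real signature of extract_content before swapping it in.

```python
import asyncio
from typing import Dict, List


async def fake_extract_content(request: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical stand-in for the library's extract_content coroutine."""
    await asyncio.sleep(0.01)  # simulate network or disk I/O
    source_type, value = next(iter(request.items()))
    return {"source_type": source_type, "content": f"extracted from {value}"}


async def extract_many(requests: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Run all extractions concurrently; results keep the order of the requests."""
    return list(await asyncio.gather(*(fake_extract_content(r) for r in requests)))


if __name__ == "__main__":
    results = asyncio.run(extract_many([
        {"url": "https://example.com"},   # assumed request shape
        {"file_path": "report.pdf"},      # assumed request shape
    ]))
    for item in results:
        print(item)
```

`asyncio.gather` returns results in the order the awaitables were passed, so each output can be matched to its input by position.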
Docling Integration
Content Core supports an optional Docling-based extraction engine for rich document formats (PDF, DOCX, PPTX, XLSX, Markdown, AsciiDoc, HTML, CSV, Images).
Enabling Docling
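Docling is not the default document engine, but automatic engine selection can still choose it (see the Documents fallback chain under Key Features). To make sure Docling is never used, set the document engine to "simple".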
Via configuration file
In your cc_config.yaml or custom config, set:
Programmatically in Python
Configuration
Configuration settings (such as API keys for external services and logging levels) can be managed through environment variables or .env files, loaded automatically via python-dotenv.
Example .env:
Engine Selection via Environment Variables
For deployment scenarios like MCP servers or Raycast extensions, you can override the extraction engines using environment variables:
- CCORE_DOCUMENT_ENGINE: Force document engine (auto, simple, docling)
- CCORE_URL_ENGINE: Force URL engine (auto, simple, firecrawl, jina)
These variables take precedence over config file settings and provide explicit control for different deployment scenarios.
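The precedence rule (environment variable over config file) can be sketched as follows. This is an illustration, not Content Core's code: `resolve_engine` is a hypothetical helper, falling back to "auto" when nothing is set is an assumption, and so is rejecting unknown values with an error.

```python
import os
from typing import Mapping, Optional

# Allowed values per variable, as listed above.
ALLOWED_ENGINES = {
    "CCORE_DOCUMENT_ENGINE": ("auto", "simple", "docling"),
    "CCORE_URL_ENGINE": ("auto", "simple", "firecrawl", "jina"),
}


def resolve_engine(var: str,
                   config_value: Optional[str] = None,
                   env: Mapping[str, str] = os.environ) -> str:
    """Return the engine to use: environment variable first, then config, then 'auto'."""
    allowed = ALLOWED_ENGINES[var]
    # Assumption: an unset or empty variable falls through to the config value.
    value = (env.get(var) or config_value or "auto").strip().lower()
    if value not in allowed:
        # Assumption: unknown values are rejected rather than silently ignored.
        raise ValueError(f"{var}={value!r} is not one of {allowed}")
    return value


if __name__ == "__main__":
    print(resolve_engine("CCORE_URL_ENGINE", config_value="firecrawl"))
```

Passing `env` explicitly makes the precedence easy to test without touching the real process environment.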
Custom Prompt Templates
Content Core allows you to define custom prompt templates for content processing. By default, the library uses built-in prompts located in the prompts directory. However, you can create your own prompt templates and store them in a dedicated directory. To specify the location of your custom prompts, set the PROMPT_PATH environment variable in your .env file or system environment.
Example .env with custom prompt path:
When a prompt template is requested, Content Core first looks in the custom directory specified by PROMPT_PATH (if it is set and the directory exists). If the template is not found there, it falls back to the default built-in prompts. This lets you override specific prompts while keeping the defaults for the rest.
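The lookup order described above amounts to a two-directory search. The sketch below is a hedged illustration rather than the library's code: `find_prompt_template` is a hypothetical name, and the template file names used in the demo are made up.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def find_prompt_template(name: str,
                         default_dir: Path,
                         custom_dir: Optional[str] = None) -> Path:
    """Return the template path, preferring the custom directory over the defaults."""
    if custom_dir is None:
        custom_dir = os.environ.get("PROMPT_PATH")
    search_dirs = []
    # The custom directory is only consulted if it is set and actually exists.
    if custom_dir and Path(custom_dir).is_dir():
        search_dirs.append(Path(custom_dir))
    search_dirs.append(default_dir)
    for directory in search_dirs:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"prompt template {name!r} not found")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        default, custom = Path(tmp) / "default", Path(tmp) / "custom"
        default.mkdir()
        custom.mkdir()
        # Hypothetical template names, used only for this demonstration.
        (default / "summarize.txt").write_text("built-in summarize prompt")
        (default / "clean.txt").write_text("built-in clean prompt")
        (custom / "summarize.txt").write_text("my summarize prompt")
        print(find_prompt_template("summarize.txt", default, str(custom)).parent.name)
        print(find_prompt_template("clean.txt", default, str(custom)).parent.name)
```

In the demo, only the overridden template resolves to the custom directory; the other one still comes from the defaults.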
Development
To set up a development environment:
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contributing
Contributions are welcome! Please see our Contributing Guide for more details on how to get started.