# Content Core
[License: MIT](https://opensource.org/licenses/MIT)
[PyPI](https://badge.fury.io/py/content-core)
[Downloads](https://pepy.tech/project/content-core)
[GitHub](https://github.com/lfnovo/content-core)
[Issues](https://github.com/lfnovo/content-core/issues)
[Code style: Black](https://github.com/psf/black)
[Ruff](https://github.com/astral-sh/ruff)
**Content Core** is a powerful, AI-powered content extraction and processing platform that transforms any source into clean, structured content. Extract text from websites, transcribe videos, process documents, and generate AI summaries, all through a unified interface with multiple integration options.
## What You Can Do

**Extract content from anywhere:**

- **Documents** - PDF, Word, PowerPoint, Excel, Markdown, HTML, EPUB
- **Media** - Videos (MP4, AVI, MOV) with automatic transcription
- **Audio** - MP3, WAV, M4A with speech-to-text conversion
- **Web** - Any URL with intelligent content extraction
- **Images** - JPG, PNG, TIFF with OCR text recognition
- **Archives** - ZIP, TAR, GZ with content analysis

**Process with AI:**

- **Clean & format** extracted content automatically
- **Generate summaries** with customizable styles (bullet points, executive summary, etc.)
- **Context-aware processing** - explain to a child, technical summary, action items
- **Smart engine selection** - automatically chooses the best extraction method
## Multiple Ways to Use

### Command Line (Zero Install)
```bash
# Extract content from any source
uvx --from "content-core" ccore https://example.com
uvx --from "content-core" ccore document.pdf
# Generate AI summaries
uvx --from "content-core" csum video.mp4 --context "bullet points"
```
### Claude Desktop Integration
One-click setup with Model Context Protocol (MCP) - extract content directly in Claude conversations.
### Raycast Extension
Smart auto-detection commands:
- **Extract Content** - Full interface with format options
- **Summarize Content** - 9 summary styles available
- **Quick Extract** - Instant clipboard extraction
### macOS Right-Click Integration

Right-click any file in Finder → Services → Extract or Summarize content instantly.

### Python Library
```python
import content_core as cc
# Extract from any source
result = await cc.extract("https://example.com/article")
summary = await cc.summarize_content(result, context="explain to a child")
```
## Key Features

* **Intelligent Auto-Detection:** Automatically selects the best extraction method based on content type and available services
* **Smart Engine Selection:**
  * **URLs:** Firecrawl → Jina → BeautifulSoup fallback chain
  * **Documents:** Docling → Enhanced PyMuPDF → Simple extraction fallback
  * **Media:** OpenAI Whisper transcription
  * **Images:** OCR with multiple engine support
* **Enhanced PDF Processing:** Advanced PyMuPDF engine with quality flags, table detection, and optional OCR for mathematical formulas
* **Multiple Integrations:** CLI, Python library, MCP server, Raycast extension, macOS Services
* **Zero-Install Options:** Use `uvx` for instant access without installation
* **AI-Powered Processing:** LLM integration for content cleaning and summarization
* **Asynchronous:** Built with `asyncio` for efficient processing
* **Pure Python Implementation:** No system dependencies required - simplified installation across all platforms
## Getting Started
### Installation
Install Content Core using `pip` - **no system dependencies required!**
```bash
# Basic installation (PyMuPDF + BeautifulSoup/Jina extraction; MCP server included)
pip install content-core

# Full installation with enhanced document processing (adds Docling)
pip install content-core[docling]
```
> **Note:** Unlike many content extraction tools, Content Core uses pure Python implementations and doesn't require system libraries like libmagic. This ensures consistent, hassle-free installation across Windows, macOS, and Linux.
Alternatively, if you're developing locally:
```bash
# Clone the repository
git clone https://github.com/lfnovo/content-core
cd content-core
# Install with uv
uv sync
```
### Command-Line Interface
Content Core provides three CLI commands for extracting, cleaning, and summarizing content: `ccore`, `cclean`, and `csum`. These commands support input from text, URLs, files, or piped data (e.g., via `cat file | command`).
**Zero-install usage with uvx:**
```bash
# Extract content
uvx --from "content-core" ccore https://example.com
# Clean content
uvx --from "content-core" cclean "messy content"
# Summarize content
uvx --from "content-core" csum "long text" --context "bullet points"
```
#### ccore - Extract Content
Extracts content from text, URLs, or files, with optional formatting.
Usage:
```bash
ccore [-f|--format xml|json|text] [-d|--debug] [content]
```
Options:
- `-f`, `--format`: Output format (xml, json, or text). Default: text.
- `-d`, `--debug`: Enable debug logging.
- `content`: Input content (text, URL, or file path). If omitted, reads from stdin.
Examples:
```bash
# Extract from a URL as text
ccore https://example.com
# Extract from a file as JSON
ccore -f json document.pdf
# Extract from piped text as XML
echo "Sample text" | ccore --format xml
```
#### cclean - Clean Content
Cleans content by removing unnecessary formatting, spaces, or artifacts. Accepts text, JSON, XML input, URLs, or file paths.
Usage:
```bash
cclean [-d|--debug] [content]
```
Options:
- `-d`, `--debug`: Enable debug logging.
- `content`: Input content to clean (text, URL, file path, JSON, or XML). If omitted, reads from stdin.
Examples:
```bash
# Clean a text string
cclean " messy text "
# Clean piped JSON
echo '{"content": " messy text "}' | cclean
# Clean content from a URL
cclean https://example.com
# Clean a file's content
cclean document.txt
```
#### csum - Summarize Content
Summarizes content with an optional context to guide the summary style. Accepts text, JSON, XML input, URLs, or file paths.
Usage:
```bash
csum [--context "context text"] [-d|--debug] [content]
```
Options:
- `--context`: Context for summarization (e.g., "explain to a child"). Default: none.
- `-d`, `--debug`: Enable debug logging.
- `content`: Input content to summarize (text, URL, file path, JSON, or XML). If omitted, reads from stdin.
Examples:
```bash
# Summarize text
csum "AI is transforming industries."
# Summarize with context
csum --context "in bullet points" "AI is transforming industries."
# Summarize piped content
cat article.txt | csum --context "one sentence"
# Summarize content from URL
csum https://example.com
# Summarize a file's content
csum document.txt
```
## Quick Start
You can quickly integrate `content-core` into your Python projects to extract, clean, and summarize content from various sources.
```python
import content_core as cc
# Extract content from a URL, file, or text
result = await cc.extract("https://example.com/article")
# Clean messy content
cleaned_text = await cc.clean("...messy text with [brackets] and extra spaces...")
# Summarize content with optional context
summary = await cc.summarize_content("long article text", context="explain to a child")
```
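These calls use `await`, so they must run inside an async context; here is a minimal runnable sketch (the URL is a placeholder):

```python
import asyncio
import content_core as cc

async def main():
    # Extract from a source, then summarize the extracted result
    result = await cc.extract("https://example.com/article")
    summary = await cc.summarize_content(result, context="explain to a child")
    print(summary)

asyncio.run(main())
```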
## Documentation
For more information on how to use the Content Core library, including details on AI model configuration and customization, refer to our [Usage Documentation](docs/usage.md).
## MCP Server Integration
Content Core includes a Model Context Protocol (MCP) server that enables seamless integration with Claude Desktop and other MCP-compatible applications. The MCP server exposes Content Core's powerful extraction capabilities through a standardized protocol.
<a href="https://glama.ai/mcp/servers/@lfnovo/content-core">
<img width="380" height="200" src="https://glama.ai/mcp/servers/@lfnovo/content-core/badge" />
</a>
### Quick Setup with Claude Desktop
```bash
# Install Content Core (MCP server included)
pip install content-core
# Or use directly with uvx (no installation required)
uvx --from "content-core" content-core-mcp
```
Add to your `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "content-core": {
      "command": "uvx",
      "args": [
        "--from",
        "content-core",
        "content-core-mcp"
      ]
    }
  }
}
```
For detailed setup instructions, configuration options, and usage examples, see our [MCP Documentation](docs/mcp.md).
## Enhanced PDF Processing
Content Core features an optimized PyMuPDF extraction engine with significant improvements for scientific documents and complex PDFs.
### Key Improvements
- **Mathematical Formula Extraction**: Enhanced quality flags eliminate `<!-- formula-not-decoded -->` placeholders
- **Automatic Table Detection**: Tables converted to markdown format for LLM consumption
- **Quality Text Rendering**: Better ligature, whitespace, and image-text integration
- **Optional OCR Enhancement**: Selective OCR for formula-heavy pages (requires Tesseract)
### Configuration for Scientific Documents
For documents with heavy mathematical content, enable OCR enhancement:
```yaml
# In cc_config.yaml
extraction:
  pymupdf:
    enable_formula_ocr: true   # Enable OCR for formula-heavy pages
    formula_threshold: 3       # Min formulas per page to trigger OCR
    ocr_fallback: true         # Graceful fallback if OCR fails
```
```python
# Runtime configuration
from content_core.config import set_pymupdf_ocr_enabled
set_pymupdf_ocr_enabled(True)
```
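For example, here is a minimal sketch that enables the OCR path at runtime and then extracts a local PDF (the file path is a placeholder, and Tesseract must be installed for OCR to run):

```python
import asyncio
import content_core as cc
from content_core.config import set_pymupdf_ocr_enabled

async def main():
    # Turn on selective OCR for formula-heavy pages, then extract as usual
    set_pymupdf_ocr_enabled(True)
    result = await cc.extract("path/to/paper.pdf")
    print(result)

asyncio.run(main())
```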
### Requirements for OCR Enhancement
```bash
# Install Tesseract OCR (optional, for formula enhancement)
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
```
**Note**: OCR is optional - you get improved PDF extraction automatically without any additional setup.
## macOS Services Integration
Content Core provides powerful right-click integration with macOS Finder, allowing you to extract and summarize content from any file without installation. Choose between clipboard or TextEdit output for maximum flexibility.
### Available Services
Create **4 convenient services** for different workflows:
- **Extract Content → Clipboard** - Quick copy for immediate pasting
- **Extract Content → TextEdit** - Review before using
- **Summarize Content → Clipboard** - Quick summary copying
- **Summarize Content → TextEdit** - Formatted summary with headers
### Quick Setup
1. **Install uv** (if not already installed):
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
2. **Create services manually** using Automator (about 5 minutes of setup)
### Usage
**Right-click any supported file** in Finder → **Services** → Choose your option:
- **PDFs, Word docs** - Instant text extraction
- **Videos, audio files** - Automatic transcription
- **Images** - OCR text recognition
- **Web content** - Clean text extraction
- **Multiple files** - Batch processing support
### Features
- **Zero-install processing**: Uses `uvx` for isolated execution
- **Multiple output options**: Clipboard or TextEdit display
- **System notifications**: Visual feedback on completion
- **Wide format support**: 20+ file types supported
- **Batch processing**: Handle multiple files at once
- **Keyboard shortcuts**: Assignable hotkeys for power users
For complete setup instructions with copy-paste scripts, see [macOS Services Documentation](docs/macos.md).
## Raycast Extension
Content Core provides a powerful Raycast extension with smart auto-detection that handles both URLs and file paths seamlessly. Extract and summarize content directly from your Raycast interface without switching applications.
### Quick Setup
**From Raycast Store** (coming soon):
1. Open Raycast and search for "Content Core"
2. Install the extension by `luis_novo`
3. Configure API keys in preferences
**Manual Installation**:
1. Download the extension from the repository
2. Open Raycast → "Import Extension"
3. Select the `raycast-content-core` folder
### Commands
**Extract Content** - Smart URL/file detection with full interface
- Auto-detects URLs vs file paths in real-time
- Multiple output formats (Text, JSON, XML)
- Drag & drop support for files
- Rich results view with metadata
**Summarize Content** - AI-powered summaries with customizable styles
- 9 different summary styles (bullet points, executive summary, etc.)
- Auto-detects source type with visual feedback
- One-click snippet creation and quicklinks
**Quick Extract** - Instant extraction to clipboard
- Type → Tab → Paste source → Enter
- No UI, works directly from command bar
- Perfect for quick workflows
### Features
- **Smart Auto-Detection**: Instantly recognizes URLs vs file paths
- **Zero Installation**: Uses `uvx` for Content Core execution
- **Rich Integration**: Keyboard shortcuts, clipboard actions, Raycast snippets
- **All File Types**: Documents, videos, audio, images, archives
- **Visual Feedback**: Real-time type detection with icons
For detailed setup, configuration, and usage examples, see [Raycast Extension Documentation](docs/raycast.md).
## Using with Langchain
For users integrating with the [Langchain](https://python.langchain.com/) framework, `content-core` exposes a set of compatible tools. These tools, located in the `src/content_core/tools` directory, allow you to leverage `content-core` extraction, cleaning, and summarization capabilities directly within your Langchain agents and chains.
You can import and use these tools like any other Langchain tool. For example:
```python
from content_core.tools import extract_content_tool, cleanup_content_tool, summarize_content_tool
from langchain.agents import initialize_agent, AgentType

# llm = ...  # any Langchain-compatible chat model you have already configured

tools = [extract_content_tool, cleanup_content_tool, summarize_content_tool]
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
agent.run("Extract the content from https://example.com and then summarize it.")
```
Refer to the source code in `src/content_core/tools` for specific tool implementations and usage details.
## Basic Usage
The core functionality revolves around the `extract_content` function.
```python
import asyncio
from content_core.extraction import extract_content

async def main():
    # Extract from raw text
    text_data = await extract_content({"content": "This is my sample text content."})
    print(text_data)

    # Extract from a URL (uses 'auto' engine by default)
    url_data = await extract_content({"url": "https://www.example.com"})
    print(url_data)

    # Extract from a local video file (gets transcript, engine='auto' by default)
    video_data = await extract_content({"file_path": "path/to/your/video.mp4"})
    print(video_data)

    # Extract from a local markdown file (engine='auto' by default)
    md_data = await extract_content({"file_path": "path/to/your/document.md"})
    print(md_data)

    # Per-execution override with Docling for documents
    doc_data = await extract_content({
        "file_path": "path/to/your/document.pdf",
        "document_engine": "docling",
        "output_format": "html"
    })
    print(doc_data)

    # Per-execution override with Firecrawl for URLs
    url_data = await extract_content({
        "url": "https://www.example.com",
        "url_engine": "firecrawl"
    })
    print(url_data)

if __name__ == "__main__":
    asyncio.run(main())
```
(See `src/content_core/notebooks/run.ipynb` for more detailed examples.)
## Docling Integration
Content Core supports an optional Docling-based extraction engine for rich document formats (PDF, DOCX, PPTX, XLSX, Markdown, AsciiDoc, HTML, CSV, Images).
### Enabling Docling
Docling is not installed by default; it is added by the `content-core[docling]` extra. With the default `auto` document engine, Content Core uses Docling when it is available; if you don't want to use it, set the document engine to `simple`.
#### Via configuration file
In your `cc_config.yaml` or custom config, set:
```yaml
extraction:
  document_engine: docling   # 'auto' (default), 'simple', or 'docling'
  url_engine: auto           # 'auto' (default), 'simple', 'firecrawl', or 'jina'
  docling:
    output_format: markdown  # markdown | html | json
```
#### Programmatically in Python
```python
import content_core as cc
from content_core.config import set_document_engine, set_url_engine, set_docling_output_format

# switch document engine to Docling
set_document_engine("docling")

# switch URL engine to Firecrawl
set_url_engine("firecrawl")

# choose output format: 'markdown', 'html', or 'json'
set_docling_output_format("html")

# now use cc.extract as usual (inside an async context)
result = await cc.extract("document.pdf")
```
## Configuration
Configuration settings (like API keys for external services, logging levels) can be managed through environment variables or `.env` files, loaded automatically via `python-dotenv`.
Example `.env`:
```plaintext
OPENAI_API_KEY=your-key-here
GOOGLE_API_KEY=your-key-here
# Engine Selection (optional)
CCORE_DOCUMENT_ENGINE=auto # auto, simple, docling
CCORE_URL_ENGINE=auto # auto, simple, firecrawl, jina
```
### Engine Selection via Environment Variables
For deployment scenarios like MCP servers or Raycast extensions, you can override the extraction engines using environment variables:
- **`CCORE_DOCUMENT_ENGINE`**: Force document engine (`auto`, `simple`, `docling`)
- **`CCORE_URL_ENGINE`**: Force URL engine (`auto`, `simple`, `firecrawl`, `jina`)
These variables take precedence over config file settings and provide explicit control for different deployment scenarios.
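As a rough sketch, the same override can also be applied from Python by setting the variables before the library loads its configuration (the values below are illustrative, and this assumes the variables are read when `content_core` is imported):

```python
import os

# Force specific engines for this process; valid values are listed above
os.environ["CCORE_DOCUMENT_ENGINE"] = "simple"
os.environ["CCORE_URL_ENGINE"] = "jina"

import content_core as cc  # imported after the variables are set so they take effect
```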
### Custom Prompt Templates
Content Core allows you to define custom prompt templates for content processing. By default, the library uses built-in prompts located in the `prompts` directory. However, you can create your own prompt templates and store them in a dedicated directory. To specify the location of your custom prompts, set the `PROMPT_PATH` environment variable in your `.env` file or system environment.
Example `.env` with custom prompt path:
```plaintext
OPENAI_API_KEY=your-key-here
GOOGLE_API_KEY=your-key-here
PROMPT_PATH=/path/to/your/custom/prompts
```
When a prompt template is requested, Content Core will first look in the custom directory specified by `PROMPT_PATH` (if set and exists). If the template is not found there, it will fall back to the default built-in prompts. This allows you to override specific prompts while still using the default ones for others.
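As a minimal sketch, you can also point Content Core at a custom prompt directory from Python, assuming `PROMPT_PATH` is read when prompts are loaded (the directory path is a placeholder):

```python
import os

# Templates in this directory override built-in prompts with the same name
os.environ["PROMPT_PATH"] = "/path/to/your/custom/prompts"

import content_core as cc  # prompts not found under PROMPT_PATH fall back to the defaults
```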
## Development
To set up a development environment:
```bash
# Clone the repository
git clone <repository-url>
cd content-core
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate
uv sync --group dev
# Run tests
make test
# Lint code
make lint
# See all commands
make help
```
## License
This project is licensed under the [MIT License](LICENSE). See the [LICENSE](LICENSE) file for details.
## Contributing
Contributions are welcome! Please see our [Contributing Guide](CONTRIBUTING.md) for more details on how to get started.