Content Core

Extract, process, and summarize content from URLs, files, and text through a unified async Python API, a CLI, or an MCP server.

Supported Formats

Category	Formats
Web	URLs, HTML pages, YouTube videos, Reddit posts
Documents	PDF, DOCX, PPTX, XLSX, EPUB, Markdown, plain text
Media	MP3, WAV, M4A, FLAC, OGG (audio); MP4, AVI, MOV, MKV (video)

Quick Start

pip install content-core

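import asyncio

import content_core


async def main():
    # extract_content is a coroutine, so it must be awaited inside an event loop
    result = await content_core.extract_content(url="https://example.com")
    print(result.content)


asyncio.run(main())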

Or run it without installing:

uvx content-core extract "https://example.com"

CLI Usage

Content Core provides a single content-core command with subcommands for extraction, summarization, and the MCP server.

Extract

# From a URL
content-core extract "https://example.com"

# From a file
content-core extract document.pdf

# With JSON output
content-core extract document.pdf --format json

# With a specific engine
content-core extract "https://example.com" --engine firecrawl

# From stdin
echo "some text" | content-core extract
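
When driving the CLI from a script, it can help to build the argument list in one place. The following is a minimal sketch and not part of Content Core: build_extract_command and run_extract are hypothetical helper names, and the sketch uses only the extract subcommand and the --format and --engine flags shown above. It assumes content-core is available on PATH.

```python
import subprocess
from typing import List, Optional


def build_extract_command(
    source: str,
    output_format: Optional[str] = None,
    engine: Optional[str] = None,
) -> List[str]:
    """Build the argv list for `content-core extract` (hypothetical helper)."""
    cmd = ["content-core", "extract", source]
    if output_format is not None:
        cmd += ["--format", output_format]
    if engine is not None:
        cmd += ["--engine", engine]
    return cmd


def run_extract(source: str, **options: Optional[str]) -> str:
    """Run the CLI and return its stdout; raises CalledProcessError on failure."""
    completed = subprocess.run(
        build_extract_command(source, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems with URLs that contain characters such as & or ?.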

Summarize

# Summarize text
content-core summarize "Long article text here..."

# With context
content-core summarize "Long text" --context "bullet points"

# From stdin
cat article.txt | content-core summarize --context "explain to a child"

MCP Server

content-core mcp

Configuration

# Set persistent config
content-core config set llm_provider anthropic
content-core config set llm_model claude-sonnet-4-20250514

# List current config
content-core config list

# Delete a config value
content-core config delete llm_provider

Configuration is stored in ~/.content-core/config.toml. Precedence, highest first: command-line flags > environment variables > config file > defaults.

Zero-Install Execution with uvx

Every command also works without installation via uvx:

uvx content-core extract "https://example.com"
uvx content-core summarize "text" --context "one sentence"
uvx content-core mcp

Python API

Extract

import content_core

# From a URL
result = await content_core.extract_content(url="https://example.com")

# From a file
result = await content_core.extract_content(file_path="document.pdf")

# From text
result = await content_core.extract_content(content="some text")

# With engine override
from content_core import ContentCoreConfig
config = ContentCoreConfig(url_engine="firecrawl")
result = await content_core.extract_content(url="https://example.com", config=config)
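
Because extract_content is a coroutine, several sources can be extracted concurrently. The sketch below shows one way to do that with a bounded number of in-flight calls. extract_many and fake_extract are hypothetical names introduced here for illustration; the extraction function is passed in as a parameter so the example runs without Content Core installed.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List


async def extract_many(
    extract: Callable[..., Awaitable[Any]],
    urls: Iterable[str],
    limit: int = 3,
) -> List[Any]:
    """Call `extract(url=...)` for each URL, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def extract_one(url: str) -> Any:
        async with semaphore:
            return await extract(url=url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(extract_one(url) for url in urls))


async def fake_extract(url: str) -> str:
    """Offline stand-in for content_core.extract_content (illustration only)."""
    await asyncio.sleep(0)
    return f"content of {url}"
```

In real use, content_core.extract_content would be passed as the extract argument, for example `await extract_many(content_core.extract_content, urls)`.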

Summarize

import content_core

summary = await content_core.summarize("long article text", context="bullet points")

Configuration

from content_core import ContentCoreConfig

config = ContentCoreConfig(
    url_engine="firecrawl",
    document_engine="docling",
    audio_concurrency=5,
)
result = await content_core.extract_content(url="https://example.com", config=config)

MCP Integration

Content Core includes a Model Context Protocol (MCP) server for use with Claude Desktop and other MCP-compatible applications.

Add the following to claude_desktop_config.json:

{
  "mcpServers": {
    "content-core": {
      "command": "uvx",
      "args": ["content-core", "mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

The MCP server exposes two tools, extract_content and summarize_content. Both return plain text.

See the MCP documentation for detailed setup.
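
If claude_desktop_config.json already lists other servers, the content-core entry has to be merged in rather than pasted over the file. The following is a minimal sketch of that merge using only the standard library. add_content_core_server and update_config_file are hypothetical helpers, and the caller supplies the path to the config file, since its location is not stated here.

```python
import json
from pathlib import Path
from typing import Any, Dict


def add_content_core_server(config: Dict[str, Any], api_key: str) -> Dict[str, Any]:
    """Return a copy of `config` with the content-core MCP entry added."""
    updated = dict(config)
    servers = dict(updated.get("mcpServers", {}))
    servers["content-core"] = {
        "command": "uvx",
        "args": ["content-core", "mcp"],
        "env": {"OPENAI_API_KEY": api_key},
    }
    updated["mcpServers"] = servers
    return updated


def update_config_file(path: Path, api_key: str) -> None:
    """Read, update, and rewrite a claude_desktop_config.json file."""
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    path.write_text(
        json.dumps(add_content_core_server(existing, api_key), indent=2) + "\n",
        encoding="utf-8",
    )
```

Entries for other servers are left untouched; an existing content-core entry is replaced.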

Claude Code Skill

Content Core includes a SKILL.md that teaches AI agents how to extract content from external sources. To make it available in a Claude Code project, copy it into the skills directory:

# Download the skill
curl -o .claude/skills/content-core/SKILL.md --create-dirs \
  https://raw.githubusercontent.com/lfnovo/content-core/main/SKILL.md

Once installed, Claude Code can use content-core to extract content from URLs, documents, and media files, either through the CLI (uvx content-core) or through MCP if it is configured.

AI Providers

Content Core uses Esperanto to support multiple LLM and STT providers. Switching providers only requires a configuration change, with no code changes:

# Use Anthropic for summarization
content-core config set llm_provider anthropic
content-core config set llm_model claude-sonnet-4-20250514

# Use Groq for transcription
content-core config set stt_provider groq
content-core config set stt_model whisper-large-v3

Supported providers include OpenAI, Anthropic, Google, Groq, DeepSeek, Ollama, and others. See the Esperanto documentation for the full list.

Configuration

Content Core uses ContentCoreConfig, which is built on pydantic-settings. Settings are resolved in order of precedence, highest first: constructor arguments > environment variables (CCORE_*) > config file (~/.content-core/config.toml) > defaults.
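
The precedence order can be pictured as a stack of layers in which the first layer that defines a key wins. The sketch below illustrates that idea with the standard library only; it is not how ContentCoreConfig is implemented. resolve, env_layer, and DEFAULTS are hypothetical names, the defaults are limited to a few values from this README, and type coercion of environment strings is left out.

```python
from collections import ChainMap
from typing import Any, Dict, Mapping

# Illustrative subset of defaults; not the library's full list
DEFAULTS: Dict[str, Any] = {
    "url_engine": "auto",
    "document_engine": "auto",
    "audio_concurrency": 3,
}


def env_layer(environ: Mapping[str, str], prefix: str = "CCORE_") -> Dict[str, str]:
    """Map CCORE_URL_ENGINE=... to {"url_engine": ...}, ignoring other variables."""
    return {
        name[len(prefix):].lower(): value
        for name, value in environ.items()
        if name.startswith(prefix)
    }


def resolve(
    constructor_args: Mapping[str, Any],
    environ: Mapping[str, str],
    file_values: Mapping[str, Any],
    defaults: Mapping[str, Any] = DEFAULTS,
) -> Dict[str, Any]:
    """Earlier layers win: constructor > environment > file > defaults."""
    layers = ChainMap(
        dict(constructor_args),
        env_layer(environ),
        dict(file_values),
        dict(defaults),
    )
    return dict(layers)
```

For example, with url_engine passed to the constructor as "firecrawl" and CCORE_URL_ENGINE set to "jina", the resolved value is "firecrawl".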

Environment Variables

Variable	Description	Default
`CCORE_URL_ENGINE`	URL extraction engine (`auto`, `simple`, `firecrawl`, `jina`, `crawl4ai`)	`auto`
`CCORE_DOCUMENT_ENGINE`	Document extraction engine (`auto`, `simple`, `docling`)	`auto`
`CCORE_AUDIO_CONCURRENCY`	Number of concurrent audio transcriptions (1-10)	`3`
`CRAWL4AI_API_URL`	Crawl4AI Docker API URL (omit for local browser mode)	-
`FIRECRAWL_API_URL`	Custom Firecrawl API URL for self-hosted instances	-
`CCORE_FIRECRAWL_PROXY`	Firecrawl proxy mode (`auto`, `basic`, `stealth`)	`auto`
`CCORE_FIRECRAWL_WAIT_FOR`	Wait time before extraction (ms)	`3000`
`CCORE_LLM_PROVIDER`	LLM provider for summarization	-
`CCORE_LLM_MODEL`	LLM model for summarization	-
`CCORE_STT_PROVIDER`	Speech-to-text (STT) provider	-
`CCORE_STT_MODEL`	Speech-to-text (STT) model	-
`CCORE_STT_TIMEOUT`	Speech-to-text (STT) timeout (seconds)	-
`CCORE_YOUTUBE_LANGUAGES`	Preferred languages for YouTube transcripts	-
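
A deployment script may want to check these values before starting a long job. The following is a minimal sketch of such a pre-flight check, based only on the allowed values and range listed in the table above. check_ccore_env is a hypothetical helper; Content Core performs its own validation, and this sketch makes no claim to match it.

```python
from typing import List, Mapping

# Allowed values copied from the table above
URL_ENGINES = {"auto", "simple", "firecrawl", "jina", "crawl4ai"}
DOCUMENT_ENGINES = {"auto", "simple", "docling"}


def check_ccore_env(environ: Mapping[str, str]) -> List[str]:
    """Return a list of problems found; an empty list means nothing looks wrong."""
    problems = []

    url_engine = environ.get("CCORE_URL_ENGINE", "auto")
    if url_engine not in URL_ENGINES:
        problems.append(f"CCORE_URL_ENGINE: unknown engine {url_engine!r}")

    document_engine = environ.get("CCORE_DOCUMENT_ENGINE", "auto")
    if document_engine not in DOCUMENT_ENGINES:
        problems.append(f"CCORE_DOCUMENT_ENGINE: unknown engine {document_engine!r}")

    raw = environ.get("CCORE_AUDIO_CONCURRENCY", "3")
    try:
        in_range = 1 <= int(raw) <= 10
    except ValueError:
        in_range = False
    if not in_range:
        problems.append(
            f"CCORE_AUDIO_CONCURRENCY: expected an integer from 1 to 10, got {raw!r}"
        )

    return problems
```

Calling `check_ccore_env(os.environ)` after `import os` checks the current process environment.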

API keys for external services are set through their standard environment variables (for example OPENAI_API_KEY, FIRECRAWL_API_KEY, JINA_API_KEY).

Proxy Configuration

Content Core automatically reads the standard HTTP_PROXY / HTTPS_PROXY / NO_PROXY environment variables. No additional configuration is required.
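
To confirm which proxy settings a process actually sees, the variables can be inspected from Python. The sketch below uses urllib.request.getproxies from the standard library. It shows what Python itself detects, which is an assumption about, not a guarantee of, what the HTTP clients inside Content Core will use. effective_proxies and summarize_proxies are hypothetical helper names.

```python
import urllib.request
from typing import Dict


def effective_proxies() -> Dict[str, str]:
    """Return the proxy settings Python's standard library detects."""
    # Keys are lowercase schemes such as "http", "https", and "no"
    return urllib.request.getproxies()


def summarize_proxies(proxies: Dict[str, str]) -> str:
    """Format detected settings for logging, one scheme per line."""
    if not proxies:
        return "no proxy configured"
    return "\n".join(
        f"{scheme}: {target}" for scheme, target in sorted(proxies.items())
    )
```

Running `print(summarize_proxies(effective_proxies()))` in the same shell that launches content-core is a quick way to catch a missing or misspelled variable.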

Optional Dependencies

# Docling for advanced document parsing (PDF, DOCX, PPTX, XLSX)
# (the quotes keep shells such as zsh from expanding the brackets)
pip install "content-core[docling]"

# Crawl4AI for local browser-based URL extraction
pip install "content-core[crawl4ai]"
python -m playwright install --with-deps

# LangChain tool wrappers
pip install "content-core[langchain]"

# All optional features
pip install "content-core[docling,crawl4ai,langchain]"

Using with LangChain

With the langchain extra installed, Content Core provides LangChain-compatible tool wrappers:

from content_core.tools import extract_content_tool, summarize_content_tool

tools = [extract_content_tool, summarize_content_tool]

Documentation

Usage Guide -- Python API details, configuration, and examples
Processors -- How content extraction works for each format
MCP Server -- Claude Desktop and MCP integration

Development

git clone https://github.com/lfnovo/content-core
cd content-core

uv sync --group dev

# Run tests
make test

# Lint
make ruff

License

This project is licensed under the MIT License.

Contributing

Contributions are welcome! See the contributing guide for details.