# AGENTS.md - Project Documentation for AI Agents
## Project Overview
**ClaimsMCP** is a Model Context Protocol (MCP) server that implements the "Claimify" research methodology for extracting verifiable factual claims from text. It is a production-ready implementation that uses MCP sampling (with OpenAI's structured outputs API as a fallback) and follows the peer-reviewed approach from "Towards Effective Extraction and Evaluation of Factual Claims" by Metropolitansky & Larson (2025).
**Key Insight**: This is NOT just another text extraction tool. It implements a sophisticated 4-stage academic pipeline that filters out opinions, resolves ambiguities, and produces atomic, self-contained claims.
## Architecture at a Glance
```
User Text → MCP Server → Pipeline → LLM Client → MCP Sampling (primary)
                            ↓            ↓ (or fallback)
      [Splitting → Selection → Disambiguation → Decomposition]
                            ↓            ↓
                   Structured Claims   OpenAI API
```
## Core Components
### 1. `claimify_server.py` - MCP Server Entry Point
- **Purpose**: Exposes claim extraction as an MCP tool via stdio
- **Protocol**: Uses Model Context Protocol for client communication
- **Tool**: `extract_claims(text_to_process, question)` - the main interface
- **Important**: Uses stdio (stdin/stdout), NOT HTTP - this is intentional for MCP
- **Async**: Built on asyncio, runs single-threaded event loop
**Key Implementation Details**:
- Returns JSON array of claims as string
- Error handling wraps all exceptions
- Validates API keys on startup
- Logs to stderr (stdout reserved for MCP protocol)
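A minimal sketch of how a tool like this can be exposed over stdio with the MCP Python SDK's `FastMCP` helper (illustrative only, not the project's actual code; the real server is asyncio-based, runs the full Claimify pipeline, and validates configuration on startup):
```python
import json
import sys

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("claimify-local")

@mcp.tool()
def extract_claims(text_to_process: str, question: str = "") -> str:
    """Extract verifiable claims and return them as a JSON array string."""
    try:
        claims: list[str] = []  # the real server runs the Claimify pipeline here
        return json.dumps(claims)
    except Exception as exc:
        # Errors are logged to stderr (stdout carries the MCP protocol)
        # and returned to the client rather than raised.
        print(f"extract_claims failed: {exc}", file=sys.stderr)
        return json.dumps({"error": str(exc)})

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```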
### 2. `pipeline.py` - Core Extraction Logic
- **Purpose**: Orchestrates the 4-stage Claimify methodology
- **Architecture**: Each stage is a separate function with structured I/O
**The 4 Stages**:
1. **Sentence Splitting** (`split_into_sentences`)
- Uses NLTK's punkt tokenizer
- Handles paragraphs and list items
- Maintains sentence boundaries accurately
2. **Selection Stage** (`run_selection_stage`)
- Filters for sentences with verifiable propositions
- Removes opinions, speculation, generic statements
- Uses surrounding context (p=5, f=5 sentences)
- Returns: `('verifiable', sentence)` or `('unverifiable', None)`
3. **Disambiguation Stage** (`run_disambiguation_stage`)
- Resolves pronouns, acronyms, partial names
- Addresses referential and structural ambiguity
- Only proceeds if ambiguity can be resolved with context
- Returns: `('resolved', sentence)` or `('unresolvable', None)`
4. **Decomposition Stage** (`run_decomposition_stage`)
- Breaks sentences into atomic claims
- Adds clarifying context in [brackets]
- Ensures each claim is self-contained
- Returns: `List[str]` of claims
**Flow Control**:
- Each sentence is processed through the stages in order
- Early exit if any stage filters out the sentence
- Final output is deduplicated list of all claims
**Context Windows**:
- Fixed at p=5, f=5 (preceding/following sentences)
- Based on paper's experimental findings
- Balance between context richness and token cost
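Putting the flow control and context windows together, a sketch of the per-sentence loop (argument lists are illustrative assumptions; the real stage functions live in `pipeline.py`):
```python
def process_sentence(idx: int, sentences: list[str], question: str,
                     p: int = 5, f: int = 5) -> list[str]:
    # Context window: up to p preceding and f following sentences.
    context = " ".join(sentences[max(0, idx - p): idx + f + 1])
    sentence = sentences[idx]

    status, verifiable = run_selection_stage(sentence, context, question)
    if status != "verifiable":
        return []  # opinion, speculation, or generic statement: early exit

    status, resolved = run_disambiguation_stage(verifiable, context, question)
    if status != "resolved":
        return []  # ambiguity could not be resolved from context: early exit

    # Decomposition returns the atomic, self-contained claims for this sentence.
    return run_decomposition_stage(resolved, context, question)
```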
### 3. `llm_client.py` - LLM Communication Layer
- **Purpose**: Handles all LLM communication via MCP sampling (primary) or OpenAI API (fallback)
- **Key Feature**: Tries MCP sampling first, falls back to OpenAI's `beta.chat.completions.parse` API
- **Type Safety**: All responses validated against Pydantic models (via JSON parsing for sampling, native for API)
- **Model Support**: MCP sampling works with any model; the OpenAI API fallback requires a structured-outputs model (gpt-4o-2024-08-06, gpt-4o-mini, or gpt-4o)
**Critical Methods**:
- `make_structured_request(system_prompt, user_prompt, response_model, stage)` - Main entry point, tries sampling then API
- `_make_sampling_request()` - Uses MCP `session.create_message()` with retry logic for malformed JSON
- `_make_openai_request()` - Direct OpenAI API call with structured outputs
- `_extract_json_from_text()` - Parses JSON from text responses (handles markdown code blocks)
- Returns parsed Pydantic model or None on failure
- Handles refusals and validation errors explicitly
**Retry Logic**:
- MCP sampling responses are parsed as JSON and validated with Pydantic
- On validation failure, retries up to `SAMPLING_MAX_RETRIES` times with error hints
- Validation errors are included in retry prompts to guide LLM corrections
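A minimal sketch of this retry loop, using placeholder helpers (the real logic lives in `_make_sampling_request()` and `_extract_json_from_text()`):
```python
import json

from pydantic import ValidationError

def parse_with_retries(send_request, user_prompt, response_model, max_retries=2):
    """Illustrative only: send_request stands in for MCP session.create_message()."""
    prompt = user_prompt
    for _ in range(max_retries + 1):
        text = send_request(prompt)
        try:
            # The real client uses _extract_json_from_text() to strip markdown fences first.
            data = json.loads(text)
            return response_model.model_validate(data)
        except (json.JSONDecodeError, ValidationError) as err:
            # Feed the validation error back as a hint so the LLM can self-correct.
            prompt = f"{user_prompt}\n\nPrevious response was invalid ({err}). Return only valid JSON."
    return None
```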
**Logging Strategy**:
- Controlled by `LOG_LLM_CALLS` env var
- Logs to stderr or file (configurable)
- Captures: method used (sampling/API), prompts, responses, token usage, duration
- Each call numbered for tracing through pipeline
### 4. `structured_models.py` - Pydantic Response Schemas
- **Purpose**: Define strict response formats for structured outputs
- **Type Safety**: Enforced by OpenAI API, not just validation
**Models**:
```python
from typing import List, Optional

from pydantic import BaseModel

# Claim is defined first so DecompositionResponse can reference it.
class Claim(BaseModel):
    text: str
    verifiable: bool = True  # always True; guides LLM thinking

class SelectionResponse(BaseModel):
    sentence: str
    thought_process: str  # 4-step chain of thought
    final_submission: str  # Literal["Contains...", "Does NOT contain..."] in the source
    sentence_with_only_verifiable_information: Optional[str]

class DisambiguationResponse(BaseModel):
    incomplete_names_acronyms_abbreviations: str
    linguistic_ambiguity_analysis: str
    changes_needed: Optional[str]
    decontextualized_sentence: Optional[str]

class DecompositionResponse(BaseModel):
    sentence: str
    referential_terms: Optional[str]
    max_clarified_sentence: str
    proposition_range: str  # e.g. "3-5"
    propositions: List[str]
    final_claims: List[Claim]
```
**Why Structured Outputs**:
- No regex parsing failures
- Guaranteed schema compliance
- Explicit refusal handling
- Better error messages
- Reduced retry logic needs
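For reference, a sketch of how the OpenAI fallback path uses these models with `beta.chat.completions.parse` (the prompt text is abbreviated; the actual call sits in `_make_openai_request()`):
```python
from openai import OpenAI

from structured_models import SelectionResponse

client = OpenAI()  # reads OPENAI_API_KEY from the environment
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "<selection-stage system prompt>"},
        {"role": "user", "content": "Apple Inc. was founded in 1976 by Steve Jobs."},
    ],
    response_format=SelectionResponse,
)
message = completion.choices[0].message
if message.refusal:          # explicit refusal handling
    result = None
else:
    result = message.parsed  # a validated SelectionResponse instance
```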
### 5. `structured_prompts.py` - Stage Prompts
- **Purpose**: Contains the academic methodology as system prompts
- **Source**: Adapted from Metropolitansky & Larson (2025) paper
- **Modification**: Optimized for both structured outputs (OpenAI) and JSON parsing (MCP sampling)
- **JSON Instructions**: Each prompt includes explicit JSON schema requirements for MCP sampling compatibility
**Multi-Language Support**:
- All prompts include critical language requirement
- Content preserved in original language
- Structural keywords in English
- Examples demonstrate preservation
**Prompt Engineering Notes**:
- Very detailed with many examples (from paper)
- Uses chain-of-thought for selection/disambiguation
- Explicit rules about what to filter
- Examples cover edge cases
- **NEW**: Explicit JSON format instructions at end of each prompt for sampling compatibility
## Data Flow Example
**Input**: "Apple Inc. was founded in 1976 by Steve Jobs. The company is incredibly innovative."
1. **Splitting**:
- ["Apple Inc. was founded in 1976 by Steve Jobs.", "The company is incredibly innovative."]
2. **Selection** (Sentence 1):
- Status: verifiable
- Output: "Apple Inc. was founded in 1976 by Steve Jobs."
3. **Selection** (Sentence 2):
- Status: unverifiable (opinion about "innovative")
- Filtered out
4. **Disambiguation** (Sentence 1):
- Status: resolved
- Output: "Apple Inc. was founded in 1976 by Steve Jobs." (no changes needed)
5. **Decomposition** (Sentence 1):
- Output: ["Apple Inc. was founded in 1976", "Steve Jobs founded Apple Inc."]
**Final Claims**: ["Apple Inc. was founded in 1976", "Steve Jobs founded Apple Inc."]
## Environment Configuration
**Required**:
- None (can run with just MCP sampling)
**Optional**:
- `OPENAI_API_KEY` - For OpenAI API fallback when MCP sampling unavailable
- `LLM_MODEL` - Default: "gpt-4o-2024-08-06" (only used for OpenAI API fallback)
**MCP Sampling Configuration**:
- `SAMPLING_MAX_TOKENS_SELECTION` - Default: "500" (max tokens for selection stage)
- `SAMPLING_MAX_TOKENS_DISAMBIGUATION` - Default: "400" (max tokens for disambiguation stage)
- `SAMPLING_MAX_TOKENS_DECOMPOSITION` - Default: "800" (max tokens for decomposition stage)
- `SAMPLING_MAX_RETRIES` - Default: "2" (max retries for malformed JSON responses)
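A sketch of reading these settings with the documented defaults (illustrative; the project's actual config loading may differ):
```python
import os

# Per-stage token limits and retry count, with the documented defaults.
SAMPLING_MAX_TOKENS = {
    "selection": int(os.getenv("SAMPLING_MAX_TOKENS_SELECTION", "500")),
    "disambiguation": int(os.getenv("SAMPLING_MAX_TOKENS_DISAMBIGUATION", "400")),
    "decomposition": int(os.getenv("SAMPLING_MAX_TOKENS_DECOMPOSITION", "800")),
}
SAMPLING_MAX_RETRIES = int(os.getenv("SAMPLING_MAX_RETRIES", "2"))
```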
**Logging**:
- `LOG_LLM_CALLS` - Default: "true" (set to "false" to disable)
- `LOG_OUTPUT` - Default: "stderr" (or "file")
- `LOG_FILE` - Default: "claimify_llm.log" (if LOG_OUTPUT=file)
**Model Compatibility**:
- **MCP Sampling**: Works with any model the client supports
- **OpenAI API Fallback**: Only these models support structured outputs:
- gpt-4o-2024-08-06 (recommended)
- gpt-4o-mini (faster/cheaper)
- gpt-4o (latest)
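As an illustration of this gate (the real check is `supports_structured_outputs()`; its implementation may differ):
```python
def supports_structured_outputs(model: str) -> bool:
    # Only these models support the OpenAI structured-outputs fallback.
    return model in {"gpt-4o-2024-08-06", "gpt-4o-mini", "gpt-4o"}
```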
## Common Agent Tasks
### Adding a New Stage
1. Create Pydantic model in `structured_models.py`
2. Write system prompt in `structured_prompts.py`
3. Add `run_<stage>_stage()` function in `pipeline.py`
4. Add parsing function `parse_structured_<stage>_output()`
5. Insert into pipeline flow in `ClaimifyPipeline.run()`
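A hypothetical skeleton for steps 3-4 (every identifier below is a placeholder, not an existing project name):
```python
from typing import Optional, Tuple

def run_example_stage(llm_client, sentence: str, context: str,
                      system_prompt: str, response_model) -> Tuple[str, Optional[str]]:
    """Hypothetical stage: return ('ok', rewritten_sentence) or ('filtered', None)."""
    response = llm_client.make_structured_request(
        system_prompt=system_prompt,                        # from structured_prompts.py
        user_prompt=f"Context:\n{context}\n\nSentence: {sentence}",
        response_model=response_model,                      # from structured_models.py
        stage="example",
    )
    if response is None:                                    # API error or refusal
        return ("filtered", None)
    return ("ok", response.sentence)  # assumes the placeholder model has a `sentence` field
```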
### Modifying Prompts
**Location**: `structured_prompts.py`
**Guidelines**:
- Keep multi-language requirement
- Maintain example structure
- Test with various languages
- Update Pydantic model if changing fields
- Structured outputs enforce schema, so less format instruction needed
### Debugging Pipeline Issues
**Check logs first**:
```bash
# Enable detailed logging (in .env or the shell environment)
LOG_LLM_CALLS=true
LOG_OUTPUT=stderr  # or "file"
```
**Log sections**:
- Each LLM call numbered
- Stage identifier
- Full prompts and responses
- Token usage
- Duration
**Common issues**:
- Sentence filtered at selection → check thought_process field
- Can't disambiguate → check linguistic_ambiguity_analysis
- No claims extracted → check all stages in logs
- Model errors → verify model supports structured outputs
### Testing Changes
**Manual testing**:
```python
from llm_client import LLMClient
from pipeline import ClaimifyPipeline
client = LLMClient()
pipeline = ClaimifyPipeline(client, "What is the history of Apple?")
claims = pipeline.run("Apple Inc. was founded in 1976 by Steve Jobs.")
print(claims)
```
**Automated tests**:
```bash
python test_claimify.py
```
### Performance Considerations
**Token costs**:
- Each sentence requires up to 3 LLM calls (selection, disambiguation, decomposition); sentences filtered at an earlier stage skip the later calls
- Context windows add 10 sentences worth of tokens per call
- Long documents can be expensive
**Optimization strategies**:
- Break large texts into logical chunks
- Use gpt-4o-mini for cost savings (still supports structured outputs)
- Consider pre-filtering obvious non-factual content
- Batch similar sentences if modifying pipeline
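A sketch of the first strategy, pre-chunking a long document by paragraph before running the pipeline (an assumption about usage, not a built-in feature):
```python
def extract_from_document(pipeline, document: str) -> list[str]:
    claims: list[str] = []
    # Split on blank lines so each chunk carries only its own context window.
    for chunk in (part.strip() for part in document.split("\n\n")):
        if chunk:
            claims.extend(pipeline.run(chunk))
    return list(dict.fromkeys(claims))  # deduplicate while preserving order
```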
**Timing**:
- Expect ~10 seconds per sentence (3 API calls with gpt-4o)
- Faster with gpt-4o-mini
- Network latency is main factor
## Error Handling Patterns
**Pipeline level**:
- Each stage returns status tuple
- Early exit on non-verifiable/unresolvable
- Continues processing remaining sentences on individual failures
**LLM Client level**:
- Returns None on API errors
- Logs all errors to stderr
- Handles refusals explicitly
- Automatic retries only for MCP sampling JSON validation (up to `SAMPLING_MAX_RETRIES`); the OpenAI structured-outputs path relies on schema enforcement instead of retries
**Server level**:
- Wraps all tool calls in try/except
- Returns error message to client
- Validates environment on startup
- Fails fast if configuration invalid
## Code Style and Patterns
**Type hints**:
- Use throughout (Python 3.10+)
- Pydantic models for structured data
- TypeVar for generic Pydantic responses
**Async patterns**:
- Server is async (MCP requirement)
- Pipeline is synchronous
- LLM client is synchronous
- No parallelization currently (could be added)
**Logging**:
- Use stderr for logs (stdout is MCP protocol)
- Check `self.logger` before logging
- Structured log messages with stage/call info
**Documentation**:
- Docstrings on all public functions
- References to paper sections where applicable
- Examples in docstrings
## MCP Integration Notes
**Client Configuration** (e.g., Cursor):
```json
{
  "mcpServers": {
    "claimify-local": {
      "command": "/path/to/venv/bin/python",
      "args": ["/path/to/claimify_server.py"]
    }
  }
}
```
**Tool Interface**:
```json
{
  "name": "extract_claims",
  "inputSchema": {
    "text_to_process": "string (required)",
    "question": "string (optional, provides context)"
  }
}
```
**Important**:
- Server communicates via stdio, not HTTP
- All logs must go to stderr
- Tool returns JSON string (not raw list)
- Client deserializes JSON response
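As an illustration of the last two points, a client deserializes the tool's JSON string result (values shown match the data-flow example above):
```python
import json

# The extract_claims tool returns a JSON array serialized as a string.
raw_result = '["Apple Inc. was founded in 1976", "Steve Jobs founded Apple Inc."]'
claims = json.loads(raw_result)
print(claims[0])  # Apple Inc. was founded in 1976
```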
## Dependencies
**Core**:
- `openai>=1.0.0` - API client with structured outputs
- `mcp>=1.0.0` - Model Context Protocol server
- `pydantic>=2.0.0` - Response validation
- `nltk>=3.8.0` - Sentence tokenization
- `python-dotenv>=1.0.0` - Environment config
**NLTK Data**:
- punkt_tab tokenizer (auto-downloaded on first run)
- Fallback to punkt if punkt_tab unavailable
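A sketch of that bootstrap behavior, assuming standard NLTK calls (the project's exact download logic may differ):
```python
import nltk

try:
    nltk.data.find("tokenizers/punkt_tab")
except LookupError:
    if not nltk.download("punkt_tab"):   # needs network access on first run
        nltk.download("punkt")           # fall back to the older punkt data
```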
## Research Context
**Paper**: "Towards Effective Extraction and Evaluation of Factual Claims" (Metropolitansky & Larson, 2025)
**Key Methodology**:
- Multi-stage pipeline with context awareness
- Emphasis on decontextualization
- Atomic claim decomposition
- Verifiability as core criterion
**This Implementation**:
- Uses structured outputs (not in original paper)
- Simplified prompt format (structure enforced by API)
- Added comprehensive logging
- Production-ready error handling
- MCP server interface
**Not an official implementation** - prompts modified for structured outputs
## Security Considerations
**API Keys**:
- Loaded from .env file
- Never logged or exposed
- Validated on startup
**Input Validation**:
- Text length not hard-limited (trust MCP client)
- No SQL/command injection risks (only API calls)
- Structured outputs constrain responses to a fixed schema, limiting what injected instructions can alter in the output
**Output Safety**:
- LLM refusals handled explicitly
- No arbitrary code execution
- Content filtering by OpenAI
## Common Gotchas
1. **"Model does not support structured outputs"**
- Only specific models work
- Update LLM_MODEL in .env
- Check `supports_structured_outputs()` logic
2. **NLTK punkt not found**
- Auto-downloads on first run
- May need internet connection
- Fallback to older punkt version
3. **Empty claims list**
- Check logs to see which stage filtered
- Input may be all opinions/generic statements
- Try simpler factual sentences first
4. **MCP client can't connect**
- Verify absolute paths in config
- Check virtual environment activation
- Ensure python executable is correct
5. **Language not preserved**
- Prompts have critical language requirement
- Report if model not following instructions
- Check logs for actual LLM responses
6. **Logs interfering with MCP**
- Always log to stderr, never stdout
- Stdout reserved for MCP protocol
- Use LOG_OUTPUT=file if needed
## Future Enhancement Ideas
**Performance**:
- Parallel processing of independent sentences
- Caching for identical context windows
- Batching API calls
- Streaming responses for long documents
**Features**:
- Support for other LLM providers (Anthropic, local models)
- Confidence scores for claims
- Claim relationship tracking
- Multi-document claim deduplication
**Quality**:
- Few-shot examples in prompts
- Active learning from user feedback
- A/B testing different prompt variations
- Evaluation metrics against paper's benchmarks
**Developer Experience**:
- Web UI for testing
- Detailed validation reports
- Claim visualization
- Interactive disambiguation
## Getting Help
**For development issues**:
1. Check logs with `LOG_LLM_CALLS=true`
2. Review this AGENTS.md
3. Consult README.md for setup
4. Read the original paper for methodology
**For extending the system**:
1. Start with `pipeline.py` to understand flow
2. Review `structured_models.py` for data structures
3. Study `structured_prompts.py` for prompt engineering
4. Test with `test_claimify.py`
## Quick Reference
**File purposes**:
- `claimify_server.py` - MCP server, entry point
- `pipeline.py` - Core 4-stage extraction logic
- `llm_client.py` - LLM communication (MCP sampling with OpenAI API fallback)
- `structured_models.py` - Pydantic response schemas
- `structured_prompts.py` - System prompts for each stage
- `test_claimify.py` - Test suite
- `requirements.txt` - Python dependencies
- `setup.py` - Package installation
- `env.example` - Environment template
- `README.md` - User documentation
- `LICENSE` - Apache 2.0
**Key functions**:
- `ClaimifyPipeline.run(text)` - Main extraction
- `split_into_sentences(text)` - Stage 1
- `run_selection_stage()` - Stage 2
- `run_disambiguation_stage()` - Stage 3
- `run_decomposition_stage()` - Stage 4
- `LLMClient.make_structured_request()` - API calls
**Environment variables**:
- `OPENAI_API_KEY` - Optional (only needed for the OpenAI API fallback)
- `LLM_MODEL` - Model selection
- `LOG_LLM_CALLS` - Enable/disable logging
- `LOG_OUTPUT` - Where to log
- `LOG_FILE` - Log file path
---
**Last Updated**: October 31, 2025
**Project Status**: Production-ready
**Python Version**: 3.10+
**License**: Apache 2.0