# Deep Research Examples
This directory contains example outputs from the `deep-research` workflow, demonstrating how foundry-mcp conducts automated, multi-phase research on topics.
## Available Examples
| Example | Report | Audit | Description |
|---------|--------|-------|-------------|
| **LLM Judges** | `llm-judges-report.md` | `llm-judges-audit.jsonl` | Techniques, architectures, and evaluation methods for LLM-as-a-Judge *(earlier iteration)* |
| **Conversation-Based Assessment** | `cba-report.md` | `cba-audit.jsonl` | Methodologies, frameworks, and AI applications in educational/professional assessment |
---
# Example 1: LLM Judges
> **Note:** This example is from an earlier iteration of the deep research workflow (v0.8.0). The current workflow has additional phases, improved source gathering, and enhanced synthesis capabilities.
This section documents the LLM Judges research output.
## Research Query
> "LLM Judges: techniques, architectures, evaluation methods, and applications for using large language models as automated evaluators and judges"
## Workflow Overview
The deep research workflow executes in five distinct phases:
### Phase 1: Planning
The system analyzes the query and generates targeted sub-queries to explore different facets of the topic. For this research, it generated 12 sub-queries covering:
- Core architectures (pairwise comparison, direct scoring)
- Known biases (positional, verbosity, self-preference)
- Mitigation techniques (Chain-of-Thought, position swapping)
- Advanced approaches (Judge Assembly, hybrid frameworks)
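The generated sub-queries are recorded in the audit trail (described below) as a `planning_result` event. Here is a minimal sketch of pulling them back out; the `sub_queries` key inside `data` is an assumption, since only the event type itself is documented in this README:

```python
import json

def load_sub_queries(audit_path: str) -> list[str]:
    """Collect sub-queries from planning_result events in the audit JSONL."""
    sub_queries: list[str] = []
    with open(audit_path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event.get("event_type") == "planning_result":
                # "sub_queries" is an assumed field name, not documented here.
                sub_queries.extend(event.get("data", {}).get("sub_queries", []))
    return sub_queries

print(load_sub_queries("llm-judges-audit.jsonl"))
```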
### Phase 2: Gathering
Each sub-query is executed against multiple search providers in parallel:
- **Tavily** - 12 queries
- **Perplexity** - 12 queries
- **Google** - 12 queries
- **Semantic Scholar** - 12 queries
Total: 48 search queries across 4 providers, yielding 156 unique sources from 64 distinct domains.
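Conceptually, the fan-out looks like the sketch below. It is illustrative only, not foundry-mcp's internal implementation; `search_one_provider` is a hypothetical stand-in for the real provider clients.

```python
import asyncio

PROVIDERS = ["tavily", "perplexity", "google", "semantic_scholar"]

async def search_one_provider(provider: str, query: str) -> list[dict]:
    # Hypothetical stand-in for a real provider client; returns placeholder hits.
    await asyncio.sleep(0)
    return [{"url": f"https://example.com/{provider}?q={query}", "title": query}]

async def gather_sources(sub_queries: list[str]) -> list[dict]:
    # Fan every sub-query out to every provider concurrently:
    # 12 sub-queries x 4 providers = 48 searches in this example.
    tasks = [
        search_one_provider(provider, query)
        for query in sub_queries
        for provider in PROVIDERS
    ]
    results = await asyncio.gather(*tasks)
    # Deduplicate by URL so a source surfaced by several providers counts once.
    unique = {source["url"]: source for batch in results for source in batch}
    return list(unique.values())

sources = asyncio.run(gather_sources([f"llm judges sub-query {i}" for i in range(12)]))
print(len(sources), "unique sources")
```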
### Phase 3: Analysis
Findings are synthesized, conflicts are identified, and knowledge gaps are noted for refinement iterations.
### Phase 4: Synthesis
A final report is generated with executive summary, key findings organized by theme, analysis of supporting/conflicting evidence, limitations, and actionable conclusions.
### Phase 5: Refinement
The workflow iterates up to 3 times, identifying gaps and generating additional sub-queries to fill them.
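Put together, the control flow is roughly the loop below. This is a schematic sketch, not the actual workflow code; the phase helpers are trivial stand-ins for the phases described above, included only so the loop runs.

```python
MAX_ITERATIONS = 3

# Trivial stand-ins for the real phases, just to make the loop runnable.
def plan(query): return [f"{query}: facet {i}" for i in range(4)]
def gather(sub_queries): return [{"query": q, "snippet": "..."} for q in sub_queries]
def analyze(query, sources): return [f"finding over {len(sources)} sources"], []
def refine(gaps): return [f"follow-up: {gap}" for gap in gaps]
def synthesize(query, findings): return f"# Report: {query}\n" + "\n".join(findings)

def deep_research(query: str) -> str:
    sub_queries = plan(query)                     # Phase 1: Planning
    sources, findings = [], []
    for _ in range(MAX_ITERATIONS):
        sources += gather(sub_queries)            # Phase 2: Gathering
        findings, gaps = analyze(query, sources)  # Phase 3: Analysis
        if not gaps:                              # stop early when no gaps remain
            break
        sub_queries = refine(gaps)                # Phase 5: Refinement fills the gaps
    return synthesize(query, findings)            # Phase 4: Synthesis

print(deep_research("LLM judges"))
```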
## Statistics
| Metric | Value |
|--------|-------|
| Total Iterations | 3 |
| Sub-queries Generated | 12 |
| Search Queries Executed | 48 |
| Sources Examined | 156 |
| Unique Source Domains | 64 |
| Key Findings | 12 |
| Knowledge Gaps | 6 |
| Total Tokens Used | 129,685 |
| Duration | ~74 seconds |
## Files in This Directory
| File | Description |
|------|-------------|
| `llm-judges-report.md` | The final synthesized research report |
| `llm-judges-audit.jsonl` | Detailed audit trail of every operation (JSONL format) |
## Audit Trail Structure
The audit file (`llm-judges-audit.jsonl`) contains one JSON object per line, recording:
```json
{
  "timestamp": "2026-01-01T01:18:35.518082Z",
  "event_id": "94c477f3916948558059faefd5a6d856",
  "event_type": "workflow_complete",
  "research_id": "deepres-906a9d34c7b2",
  "phase": "synthesis",
  "iteration": 3,
  "level": "info",
  "data": {
    "source_count": 156,
    "finding_count": 12,
    "total_tokens_used": 129685,
    "search_provider_stats": {
      "tavily": 12,
      "perplexity": 12,
      "google": 12,
      "semantic_scholar": 12
    }
  }
}
```
Event types include:
- `workflow_start` / `workflow_complete` - Session lifecycle
- `phase_start` / `phase_complete` - Phase transitions with timing
- `planning_result` - Sub-queries generated
- `gathering_provider_result` - Per-provider search results
- `analysis_result` - Findings and gaps extracted
- `synthesis_result` - Report generation
- `refinement_result` - Gap-filling iterations
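Since every record carries `timestamp`, `event_type`, `phase`, and `iteration` (as in the example above), the audit file can be analyzed directly. Here is a small sketch that tallies events by type and reconstructs per-phase timing from `phase_start`/`phase_complete` pairs; it assumes each `phase_complete` shares its phase and iteration with the matching `phase_start`:

```python
import json
from collections import Counter
from datetime import datetime

def summarize_audit(path: str) -> None:
    counts: Counter[str] = Counter()
    started: dict[tuple, datetime] = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            counts[event["event_type"]] += 1
            ts = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
            key = (event.get("phase"), event.get("iteration"))
            if event["event_type"] == "phase_start":
                started[key] = ts
            elif event["event_type"] == "phase_complete" and key in started:
                seconds = (ts - started.pop(key)).total_seconds()
                print(f"{key[0]} (iteration {key[1]}): {seconds:.1f}s")
    print(dict(counts))

summarize_audit("llm-judges-audit.jsonl")
```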
## Usage
To run your own deep research:
```bash
# Start research (runs in background)
foundry research deep-research \
  --query "Your research topic here" \
  --max-iterations 3
# Check progress
foundry research deep-research-status --research-id <id>
# Get final report
foundry research deep-research-report --research-id <id>
```
Or via MCP tool calls:
```python
# Start
{"action": "deep-research", "query": "...", "max_iterations": 3}
# Status (shows live progress)
{"action": "deep-research-status", "research_id": "..."}
# Report
{"action": "deep-research-report", "research_id": "..."}
```
## Key Takeaways from This Research
The research revealed that LLM-as-a-Judge is a powerful but systematically biased paradigm:
1. **Human-level agreement** - GPT-4 achieves >80% agreement with human annotators, on par with the level of agreement between the human annotators themselves
2. **Three critical biases** require active mitigation:
- **Position bias** - First option favored in pairwise comparisons
- **Verbosity bias** - Longer responses rated higher regardless of accuracy
- **Self-preference bias** - Models favor outputs from their own family
3. **Mandatory mitigations** - Position swapping and Chain-of-Thought prompting are essential (see the position-swap sketch after this list)
4. **Domain-specific validation** - For technical tasks like code evaluation, use "Judge Assembly" patterns combining LLM reasoning with deterministic checks (execution, linting)
5. **Hybrid frameworks** - Co-Eval approaches augment LLM judgment with objective metrics to reduce hallucinated scoring
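The position swapping called out in takeaway 3 is mechanically simple: judge each pair twice with the candidate order reversed, and only accept a verdict that survives the swap. Below is a minimal sketch, where `judge_pair` is any LLM judge call returning "A" or "B"; the toy judge at the bottom is deliberately position-biased to show the mitigation working, and none of this is a foundry-mcp API.

```python
from typing import Callable

Judge = Callable[[str, str, str], str]  # (prompt, first_answer, second_answer) -> "A" or "B"

def judge_with_position_swap(judge_pair: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Run the judge twice with the order reversed to cancel out position bias."""
    verdict = judge_pair(prompt, answer_a, answer_b)   # A presented first
    swapped = judge_pair(prompt, answer_b, answer_a)   # B presented first
    swapped = {"A": "B", "B": "A"}[swapped]            # map back to original labels
    return verdict if verdict == swapped else "tie"    # disagreement = position-driven verdict

# Toy judge that always prefers whichever answer it sees first (pure position bias);
# the swap detects the inconsistency and returns "tie" instead of a biased verdict.
always_first: Judge = lambda prompt, first, second: "A"
print(judge_with_position_swap(always_first, "Which answer is better?", "answer 1", "answer 2"))
```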
## Source Diversity
The research drew from 64 unique domains including:
- Academic sources: arxiv.org, neurips.cc, aclanthology.org, openreview.net
- Industry blogs: cameronrwolfe.substack.com, eugeneyan.com, wandb.ai
- Documentation: docs.ragas.io, langchain-opentutorial.gitbook.io
- Research tools: Semantic Scholar and Google Scholar references
---
# Example 2: Conversation-Based Assessment
This section documents the Conversation-Based Assessment (CBA) research output.
## Research Query
> "Conversation based assessment: methodologies, frameworks, applications in education and professional settings, AI-powered conversational assessment tools, validity and reliability considerations, best practices for design and implementation"
## Workflow Overview
The research explored conversation-based assessment across multiple dimensions:
### Phase 1: Planning
The system generated 4 targeted sub-queries covering:
- Core methodologies and frameworks (ORID, Caring Assessments)
- AI applications in recruitment and healthcare
- Educational efficacy and validity considerations
- Best practices for implementation
### Phase 2: Gathering
The sub-queries were executed across multiple search providers, yielding 27 unique sources.
### Phases 3-5: Analysis, Synthesis, Refinement
Findings were synthesized across the healthcare, education, and professional domains, with gap analysis driving a refinement pass.
## Statistics
| Metric | Value |
|--------|-------|
| Total Iterations | 2 |
| Sub-queries Generated | 4 |
| Sources Examined | 44 |
| Key Findings | 4 |
| Knowledge Gaps | 2 |
| Total Tokens Used | ~275,000 |
## Files
| File | Description |
|------|-------------|
| `cba-report.md` | The final synthesized research report |
| `cba-audit.jsonl` | Detailed audit trail of every operation |
| `cba-session.json` | Full session state including all sources and findings |
## Key Takeaways
1. **Structured Frameworks Matter**: ORID (Objective, Reflective, Interpretive, Decisional) ensures cognitive depth beyond simple recall
2. **AI Validity Varies by Domain**:
- **Healthcare**: High validity for screening (depression scales, medical Q&A)
- **Recruitment**: Strong market validation for technical skill assessment
- **Education**: Engagement ≠ Learning - positive feedback doesn't guarantee improved outcomes
3. **Critical Biases**: Insufficient data on linguistic diversity and neurodiverse populations
4. **Hybrid Approaches Recommended**: AI for initial screening; human oversight for complex pedagogical goals
## Source Diversity
The research drew from diverse domains:
- Healthcare: JAMA Network, ScienceDirect, PubMed Central
- Education: SAGE Journals, ETS Research, ResearchGate
- Professional: Gartner, iMocha, Testlify, Metaview
- Frameworks: Better Evaluation, SFJ Awards