---
title: "CQS Evaluation Pipeline — Slide-Ready Overview"
description: Four-stage evaluation pipeline for FCSM 2026 knowledge representation study
date: 2026-02-16
status: active
usage: backup/appendix slides for FCSM presentation
---
## Pipeline Overview (one slide)
```mermaid
flowchart LR
subgraph S1["Stage 1: Generate"]
direction TB
S1_IN["39 queries\n3 conditions"]
S1_OUT["117 responses\n(JSONL)"]
S1_IN --> S1_OUT
end
subgraph S2["Stage 2: Judge"]
direction TB
S2_IN["3 pairwise\ncomparisons"]
S2_OUT["2,106 scores\n(D1-D5)"]
S2_IN --> S2_OUT
end
subgraph S3["Stage 3: Fidelity"]
direction TB
S3_IN["117 responses\nvs tool returns"]
S3_OUT["claim verification\n(match/mismatch)"]
S3_IN --> S3_OUT
end
subgraph S4["Stage 4: Analysis"]
direction TB
S4_IN["scores +\nfidelity"]
S4_OUT["effect sizes\np-values\ntables"]
S4_IN --> S4_OUT
end
S1 --> S2
S1 --> S3
S2 --> S4
S3 --> S4
```
**Speaker notes:** Four-stage pipeline. Stage 1 generates responses under three experimental conditions. Stage 2 scores them using a multi-vendor LLM judge panel. Stage 3 independently verifies factual claims against tool returns. Stage 4 synthesizes scores and fidelity into statistical analysis. Stages 2 and 3 each consume Stage 1 output independently; they measure different things (quality judgment vs factual accuracy).
---
## Stage 1: Response Generation (one slide)
```mermaid
flowchart TD
Q["39 Curated Queries\n(41% normal, 59% edge cases)"]
Q --> C["Control\nData tools only\nNo methodology"]
Q --> R["RAG\nData tools +\nRetrieved document chunks"]
Q --> P["Pragmatics\nData tools +\nCurated expert judgment\nvia MCP tool"]
C --> OUT["117 Responses\n(JSONL with full provenance)"]
R --> OUT
P --> OUT
style C fill:#f5f5f5,stroke:#999
style R fill:#fff3e0,stroke:#e65100
style P fill:#e3f2fd,stroke:#1565c0
```
**Speaker notes:** All three conditions get the same data tools — get_census_data and explore_variables. The only variable is how methodology support is delivered. Control gets none. RAG gets methodology via standard retrieval-augmented generation — document chunks retrieved by semantic similarity, injected into the system prompt. Pragmatics gets methodology via a curated MCP tool that delivers expert judgment structured around specific statistical concepts. Same model, same queries, same API access. The question is whether the form of knowledge representation matters.
Key design decision: the production MCP server bundles pragmatics in every data response. For the experiment, we sanitize tool results for the control and RAG conditions, stripping the pragmatics field before the model sees the payload. The full payload is still logged for fidelity verification.
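The sanitization step can be sketched as below. The field name `pragmatics`, the function shape, and the sample payload are illustrative assumptions, not the server's actual schema:

```python
# Per-condition sanitization of MCP tool results (illustrative sketch).
import copy


def sanitize_tool_result(result: dict, condition: str) -> dict:
    """Strip the curated pragmatics field for control and RAG so the model
    only sees it under the pragmatics condition. The caller is expected to
    log the original payload separately for fidelity verification."""
    if condition == "pragmatics":
        return result
    cleaned = copy.deepcopy(result)
    cleaned.pop("pragmatics", None)
    return cleaned


raw = {"data": {"B01003_001E": 331449281},
       "pragmatics": "Prefer 5-year ACS estimates for small geographies."}
assert "pragmatics" not in sanitize_tool_result(raw, "control")
assert "pragmatics" in sanitize_tool_result(raw, "pragmatics")
```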
---
## Stage 2: LLM-as-Judge Scoring (one slide)
```mermaid
flowchart TD
PAIRS["3 Pairwise Comparisons"]
PAIRS --> P1["RAG vs Pragmatics"]
PAIRS --> P2["Control vs Pragmatics"]
PAIRS --> P3["Control vs RAG"]
subgraph JUDGE["Per Comparison"]
direction TB
ANON["Anonymized\nResponse A / Response B"]
VENDOR["3 Vendors\nAnthropic · OpenAI · Google"]
PASSES["6 Passes\n(3 A-first, 3 B-first)"]
RUBRIC["CQS Rubric\nD1-D5, scored 0-2\n+ confidence + reasoning"]
end
P1 --> JUDGE
P2 --> JUDGE
P3 --> JUDGE
JUDGE --> SCORES["2,106 JudgeRecords"]
```
**Speaker notes:** Each comparison is scored by three independent LLM vendors — this detects self-enhancement bias, where a model scores its own outputs higher. Presentation is counterbalanced: each query is scored with both A-first and B-first orderings to detect position bias. Six passes per vendor per query enable test-retest reliability measurement. The judge never sees condition labels — just anonymized Response A and Response B. Five dimensions: source selection, methodology, uncertainty communication, definitions, reproducibility. Each is scored 0-2 with confidence and free-text reasoning.
Why three comparisons instead of two? Control vs each is baseline validation. RAG vs Pragmatics is the core research question — does curation outperform retrieval? That deserves direct measurement, not a derived estimate.
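The counterbalancing scheme can be sketched as a pass schedule. Vendor names and the record shape are illustrative assumptions; the point is the arithmetic: 3 vendors × 6 passes (3 A-first, 3 B-first) per query per comparison, and 39 queries × 3 comparisons × 18 passes = 2,106 records.

```python
# Sketch of the counterbalanced judging schedule (illustrative names).
from itertools import product

VENDORS = ["anthropic", "openai", "google"]


def judging_passes(query_id: str, comparison: tuple[str, str]):
    """Yield anonymized judging tasks; the judge sees only A/B slots."""
    left, right = comparison  # e.g. ("rag", "pragmatics")
    for vendor, rep, flipped in product(VENDORS, range(3), (False, True)):
        a, b = (right, left) if flipped else (left, right)
        # Condition labels are logged for analysis but never shown to the judge.
        yield {"query": query_id, "vendor": vendor, "rep": rep, "A": a, "B": b}


passes = list(judging_passes("q01", ("rag", "pragmatics")))
assert len(passes) == 18          # 3 vendors x 6 passes
assert 39 * 3 * len(passes) == 2106  # total JudgeRecords
```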
---
## Stage 3: Fidelity Verification (one slide)
```mermaid
flowchart LR
RESP["117 Responses"]
RESP --> CHECK["Automated\nClaim Extraction"]
CHECK --> VERIFY{"Verify each claim\nagainst tool returns"}
VERIFY --> MATCH["match"]
VERIFY --> MIS["mismatch"]
VERIFY --> NOSRC["no_source"]
VERIFY --> CALC["calculation\ncorrect/incorrect"]
MATCH --> SCORE["Fidelity Score\n= (match + calc_correct)\n÷ total claims"]
MIS --> SCORE
NOSRC --> SCORE
CALC --> SCORE
```
**Speaker notes:** Stage 3 is the trustworthiness verification stage. D6 (Grounding) is a binary gate — treatment conditions ground in authoritative sources by design; control does not. Automated fidelity provides claim-level verification: every quantitative claim in the response is extracted and compared against the actual tool call data. In V2, all three conditions make tool calls, so all three get fidelity verification. RAG responses are additionally verified against retrieved chunks.
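Under the verdict labels in the diagram, the fidelity score reduces to a simple ratio. A minimal sketch, assuming each extracted claim carries one verdict string (the actual claim extractor and verdict schema may differ):

```python
# Fidelity score = (match + calculation_correct) / total claims.
from collections import Counter


def fidelity_score(verdicts: list[str]) -> float:
    """Fraction of claims verified against tool returns."""
    if not verdicts:
        return 0.0
    counts = Counter(verdicts)
    return (counts["match"] + counts["calculation_correct"]) / len(verdicts)


claims = ["match", "match", "mismatch", "calculation_correct", "no_source"]
assert fidelity_score(claims) == 0.6  # 3 of 5 claims verified
```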
---
## Stage 4: Three-Group Analysis (one slide)
```mermaid
flowchart TD
INPUT["2,106 Judge Scores\n+ 117 Fidelity Reports"]
INPUT --> AGG["Aggregate to Query Level\n(n=39 experimental units)"]
AGG --> OMNIBUS["Friedman Test\n(repeated measures)"]
AGG --> PAIRWISE["Wilcoxon Post-Hoc\n(Bonferroni α = 0.0167)"]
AGG --> EFFECT["Cohen's d\nper dimension\nper comparison\n(bootstrap 95% CI)"]
OMNIBUS --> REPORT["Publication Tables"]
PAIRWISE --> REPORT
EFFECT --> REPORT
subgraph BIAS["Bias Checks"]
B1["Position bias"]
B2["Self-enhancement bias"]
B3["Verbosity bias"]
end
AGG --> BIAS
BIAS --> REPORT
```
**Speaker notes:** The experimental unit is the query, not the judge record. We aggregate across vendors and passes to get one score per query per condition before running statistical tests. Friedman test is the omnibus — are the three conditions different? Then Wilcoxon signed-rank post-hoc on each pair with Bonferroni correction for three comparisons. Effect sizes reported as Cohen's d with bootstrap confidence intervals. Three bias checks run on every analysis: position bias (does A always win?), self-enhancement (does Anthropic's judge favor Anthropic's caller outputs?), and verbosity (do longer responses score higher regardless of quality?). If any bias exceeds its threshold, it's flagged and reported.
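The test sequence above can be sketched with SciPy on query-level scores. The scores below are synthetic placeholders, and paired Cohen's d (mean difference over SD of differences) is one common convention; the study's actual aggregation and bootstrap code may differ:

```python
# Stage 4 sketch: omnibus, post-hoc, and effect size on n=39 query scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.uniform(0, 2, 39)              # synthetic query-level scores
rag = control + rng.normal(0.2, 0.3, 39)
prag = control + rng.normal(0.5, 0.3, 39)

# Omnibus: Friedman test across the three repeated-measures conditions.
stat, p_omnibus = stats.friedmanchisquare(control, rag, prag)

# Post-hoc: Wilcoxon signed-rank per pair, Bonferroni alpha = 0.05 / 3.
alpha = 0.05 / 3
pairs = {"rag_vs_prag": (rag, prag),
         "ctrl_vs_prag": (control, prag),
         "ctrl_vs_rag": (control, rag)}
results = {name: stats.wilcoxon(x, y).pvalue < alpha
           for name, (x, y) in pairs.items()}

# Effect size: paired Cohen's d for control vs pragmatics.
diff = prag - control
d = diff.mean() / diff.std(ddof=1)
```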