# Lab Notes — Anthropic Parse Failure Gap Discovery
**Date:** 2026-02-21
**Severity:** Data integrity — 3 unusable judge records in production dataset
---
## Discovery
During numbers registry construction, programmatic verification of `results/v2_redo/stage2/rag_vs_pragmatics_20260216_092144.jsonl` revealed 3 records with `preference: parse_failed`:
| query_id | pass_number | vendor |
|----------|-------------|--------|
| AMB-003 | 2 | Anthropic |
| PER-001c | 1 | Anthropic |
| PER-001c | 4 | Anthropic |
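The registry check that surfaced these records amounts to a scan of the JSONL for the `parse_failed` sentinel. A minimal sketch, assuming the field names shown in the table above (the function name is hypothetical):

```python
import json

def find_parse_failures(path):
    """Scan a stage-2 judge JSONL file and list records marked parse_failed."""
    failures = []
    with open(path) as fh:
        for line in fh:
            rec = json.loads(line)
            if rec.get("preference") == "parse_failed":
                failures.append((rec["query_id"], rec["pass_number"], rec["vendor"]))
    return failures
```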
## What Happened
1. **Feb 16:** rag_vs_pragmatics production run completed. QC noted "3 Anthropic parse failures (1.3%) — records stored with `preference: parse_failed`" in `talks/fcsm_2026/2026-02-16_stage2_rag_vs_pragmatics_production.md`. Documented but not flagged for backfill.
2. **Feb 18-19:** control_vs_pragmatics had a different issue — 3 missing Google records (not parse failures, but records that were never generated). These were backfilled on Feb 19 from a separate targeted run and merged into the main JSONL.
3. **Feb 19:** Final aggregate analysis ran with `parse failures: 3` noted in output. The `aggregate_analysis.py` script excludes parse_failed records per VR-072, so the 3 records were silently dropped from all statistical computations.
4. **Feb 21:** Numbers registry verification caught the gap. The 3 parse failures were never backfilled because the Feb 19 backfill effort focused on the control_vs_pragmatics Google gap, which was the blocking issue at the time.
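The silent drop in step 3 comes from the VR-072 exclusion rule in `aggregate_analysis.py`. A sketch of that filter, assuming the same record layout as above (the helper name is hypothetical):

```python
def usable_records(records):
    """VR-072 (sketch): drop parse_failed judge records before any statistics.

    The filter is silent by design — excluded records produce no warning
    or follow-up task, which is how the 3-record gap went unnoticed.
    """
    return [r for r in records if r.get("preference") != "parse_failed"]
```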
## Impact
- AMB-003 has 5/6 Anthropic passes (missing pass 2) in rag_vs_pragmatics
- PER-001c has 4/6 Anthropic passes (missing passes 1 and 4) in rag_vs_pragmatics
- `aggregate_analysis.py` computes query-level medians across available passes, so these queries carry slightly reduced Anthropic representation
- Statistical impact is likely negligible (3/2106 = 0.14%), but the completeness claim is violated: we reported 702 usable records per comparison, and rag_vs_pragmatics actually has 699
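The effect of a missing pass on the query-level statistic can be seen in a toy median computation. The integer scores below are invented for illustration, and the function name is hypothetical:

```python
from statistics import median

def query_level_median(scores_by_pass):
    """Median over whichever passes are present (sketch).

    A dropped pass shrinks n rather than raising an error,
    so a gap like AMB-003's missing pass 2 is easy to miss.
    """
    return median(scores_by_pass.values())

# All 6 passes present vs. pass 2 missing (as for AMB-003):
full = {1: 2, 2: 9, 3: 4, 4: 5, 5: 6, 6: 7}
gapped = {k: v for k, v in full.items() if k != 2}
```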
## Corrective Action
CC task `2026-02-21_backfill_anthropic_parse_failures.md` created to:
1. Re-run Anthropic judge on the 3 specific query/pass combinations
2. Remove parse_failed records, merge backfill
3. QC verification (0 parse failures)
4. Re-run aggregate analysis
5. Update numbers registry if any certified values shift
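Step 2's remove-and-merge can be sketched as a keyed replacement over the main JSONL, assuming backfill records carry the same `query_id`/`pass_number`/`vendor` fields (the function name is hypothetical):

```python
def merge_backfill(main_records, backfill_records):
    """Replace parse_failed records with their backfilled counterparts (sketch).

    A parse_failed record is dropped only when a backfill record exists for
    the same (query_id, pass_number, vendor) key, so an incomplete backfill
    run cannot silently delete data.
    """
    keys = {(r["query_id"], r["pass_number"], r["vendor"]) for r in backfill_records}
    kept = [r for r in main_records
            if not (r.get("preference") == "parse_failed"
                    and (r["query_id"], r["pass_number"], r["vendor"]) in keys)]
    return kept + backfill_records
```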
## Root Cause
No systematic process to flag and track parse failures for backfill. The QC script (VR-072) reports them but doesn't create follow-up tasks. The Feb 19 backfill was triggered by a different failure mode (missing records, not parse failures) and didn't sweep for other gaps.
## Prevention
The `verify_registry_counts.py` V&V script (CC task `2026-02-21_verify_registry_counts.md`) now checks for parse failures as part of SD-006 verification. Future runs will catch this class of issue programmatically.
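The check added to `verify_registry_counts.py` boils down to two invariants: zero `parse_failed` records, and a full complement of passes per query/vendor pair. A sketch under those assumptions (the expected pass count and function name are illustrative):

```python
from collections import Counter

def verify_judge_records(records, expected_passes=6):
    """Return the issues a registry verification should flag (sketch)."""
    issues = {}
    n_failed = sum(1 for r in records if r.get("preference") == "parse_failed")
    if n_failed:
        issues["parse_failed"] = n_failed
    counts = Counter((r["query_id"], r["vendor"]) for r in records)
    short = {k: n for k, n in counts.items() if n != expected_passes}
    if short:
        issues["incomplete"] = short
    return issues  # an empty dict means the file passes
```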