# Agent Validation Tool Design (LangChain + GitHub Copilot API)
## Executive Summary
**Final Solution**: GitHub Copilot Chat API (not local LLM, not manual orchestration)
**Why This Approach**:
- ✅ **Leverages existing subscription** (no new costs)
- ✅ **High quality** (GPT-4 + Claude models)
- ✅ **Simple setup** (just authenticate, no 18GB download)
- ✅ **Fast** (cloud inference)
- ✅ **Auto-updated** (always latest models)
- ✅ **Pure Node.js** (no Python needed!)
**Timeline**: ~6.5 hours end-to-end (5 min setup + 4 hrs build + ~2.5 hrs benchmarks and validation)
---
## Problem Statement
**Goal**: Automatically test agent preambles without manual orchestration.
**Requirements**:
- ✅ Load agent preambles as system prompts
- ✅ Execute benchmark tasks
- ✅ Capture outputs
- ✅ Score against rubrics
- ✅ Automated (no human-in-loop)
- ✅ Use existing GitHub Copilot access
**Solution**: LangChain + GitHub Copilot Chat API
---
## Key Decision: Why LangChain + GitHub Copilot?
### Why Not Manual Testing?
- ❌ Slow (1-2 hours per agent)
- ❌ Not reproducible (human variance)
- ❌ Can't batch test
- ❌ No CI/CD integration
### Why Not Local LLM (Ollama)?
- ⚠️ Requires an ~18GB model download
- ⚠️ Manual maintenance
- ⚠️ Lower quality than GPT-4/Claude
- ✅ **Copilot wins on quality, setup, and speed** (local inference wins only on offline use and raw compute cost)
### Why GitHub Copilot? ✅
- ✅ Already have access (existing subscription)
- ✅ No setup (just authenticate)
- ✅ Best quality (GPT-4 + Claude)
- ✅ Fast (cloud inference)
- ✅ Auto-updated (always latest models)
---
## WHY LANGCHAIN + GITHUB COPILOT
### What This Solution Provides
1. **GitHub Copilot Chat API Integration**
- Use your existing Copilot subscription
- Claude 3.5 Sonnet quality (Copilot routes to GPT-4 and Claude models)
- No additional API costs
- Fast response times
2. **Agent Testing Framework** (LangChain)
- Unit tests for components
- Integration tests for full workflows
- `agentevals` package for trajectory evaluation
3. **Orchestration** (LangChain)
- Load custom system prompts (agent preambles)
- Execute tasks programmatically
- Capture conversation history
4. **Evaluation Tools**
- LLM-as-judge for automated scoring
- Custom evaluators for rubrics
- Batch testing support
### Advantages Over Local LLM
| Feature | Local LLM (Ollama) | GitHub Copilot API |
|---------|-------------------|-------------------|
| **Quality** | Good (Qwen2.5 32B) | Excellent (GPT-4 + Claude) |
| **Setup** | Download ~18GB model | Use existing auth |
| **Speed** | Medium (local inference) | Fast (cloud) |
| **Cost** | Free (local compute) | Included in Copilot |
| **Maintenance** | Manual updates | Auto-updated |
**Winner**: GitHub Copilot API - better quality, simpler setup, leverages existing subscription.
---
## ARCHITECTURE
```
┌─────────────────────────────────────────────┐
│ Validation Tool (TypeScript)                │
├─────────────────────────────────────────────┤
│ 1. Load agent preamble from .md file        │
│ 2. Create LangChain agent with system       │
│    prompt = preamble content                │
│ 3. Execute benchmark task                   │
│ 4. Capture output + conversation history    │
│ 5. Score with LLM-as-judge evaluator        │
│ 6. Generate report                          │
└──────────────────────┬──────────────────────┘
                       │
                       ▼
         ┌───────────────────────────┐
         │ GitHub Copilot Chat API   │
         │ (GPT-4 + Claude models)   │
         │ - Uses existing Copilot   │
         │ - No setup required       │
         └─────────────┬─────────────┘
                       │
                       ▼
           ┌───────────────────────┐
           │ Output                │
           │ - Raw transcript      │
           │ - Scores (0-100)      │
           │ - Comparison reports  │
           └───────────────────────┘
```
---
## SETUP (One-Time)
### Step 1: Authenticate with GitHub Copilot
```bash
# Install GitHub CLI (if not already)
brew install gh
# Authenticate (one-time)
gh auth login
# Verify
gh auth status
```
### Step 2: Install Copilot API Proxy
```bash
cd /Users/timothysweet/src/GRAPH-RAG-TODO-main
# Install copilot-api globally (Pure Node.js!)
npm install -g copilot-api
# Start Copilot proxy server (runs in background)
copilot-api start &
# Verify it's running
curl http://localhost:4141/v1/models
```
**What this does**:
- Runs a local server that proxies GitHub Copilot API
- Exposes it as an OpenAI-compatible endpoint at `http://localhost:4141`
- **No Python needed!**
### Step 3: Install LangChain Dependencies
```bash
cd /Users/timothysweet/src/GRAPH-RAG-TODO-main
# Install LangChain with OpenAI integration
npm install @langchain/core @langchain/openai langchain
# Install TypeScript tools
npm install -g ts-node typescript
```
### Step 4: Verify Setup
```bash
# Test with Node.js (dummy API key required by the OpenAI client, but not used by the proxy)
node -e "const {ChatOpenAI} = require('@langchain/openai'); const llm = new ChatOpenAI({apiKey: 'dummy-key-not-used', configuration: {baseURL: 'http://localhost:4141/v1'}}); llm.invoke('Hello!').then(r => console.log('✅ Copilot Response:', r.content));"
# Expected output: "✅ Copilot Response: Hi! How can I assist you today?"
```
**Note**: The `apiKey` is required by the LangChain OpenAI client but is not actually used by the copilot-api proxy (which uses your `gh` authentication instead).
---
## IMPLEMENTATION
### Tool Structure
```
tools/
├── validate-agent.ts           # Main validation script
├── llm-client.ts               # GitHub Copilot + LangChain wrapper
├── evaluators/
│   ├── bug-discovery.ts        # Bug discovery evaluator
│   ├── root-cause.ts           # Root cause analysis evaluator
│   └── methodology.ts          # Methodology evaluator
└── report-generator.ts         # Report formatting

benchmarks/
├── debug-benchmark.json        # Task + rubric
├── research-benchmark.json
└── implementation-benchmark.json

validation-output/
├── 2025-10-15_claudette-debug-v1.0.0.json   # Raw transcript
└── 2025-10-15_claudette-debug-v1.0.0.md     # Readable report
```
### Code: `tools/llm-client.ts`
```typescript
import { ChatOpenAI } from '@langchain/openai';
import { HumanMessage, SystemMessage } from '@langchain/core/messages';
import fs from 'fs/promises';
export interface AgentConfig {
preamblePath: string;
temperature?: number;
maxTokens?: number;
}
/**
* Client for GitHub Copilot Chat API via copilot-api proxy
* (Pure Node.js - no Python needed!)
*/
export class CopilotAgentClient {
private llm: ChatOpenAI;
private systemPrompt: string = '';
constructor(config: AgentConfig) {
// Use copilot-api proxy (OpenAI-compatible endpoint)
this.llm = new ChatOpenAI({
openAIApiKey: 'dummy-key-not-used', // Required by OpenAI client but not used by proxy
configuration: {
baseURL: 'http://localhost:4141/v1', // copilot-api proxy
},
      temperature: config.temperature ?? 0.7, // '??' (not '||') so an explicit temperature of 0.0 is respected
      maxTokens: config.maxTokens ?? 8000,
});
}
async loadPreamble(path: string): Promise<void> {
this.systemPrompt = await fs.readFile(path, 'utf-8');
}
async execute(task: string): Promise<{
output: string;
conversationHistory: Array<{ role: string; content: string }>;
tokens: { input: number; output: number };
}> {
const messages = [
new SystemMessage(this.systemPrompt),
new HumanMessage(task),
];
const response = await this.llm.invoke(messages);
return {
output: response.content.toString(),
conversationHistory: [
{ role: 'system', content: this.systemPrompt },
{ role: 'user', content: task },
{ role: 'assistant', content: response.content.toString() },
],
tokens: {
input: this.estimateTokens(this.systemPrompt + task),
output: this.estimateTokens(response.content.toString()),
},
};
}
private estimateTokens(text: string): number {
    // Rough estimate: 1 token ≈ 4 characters
return Math.ceil(text.length / 4);
}
}
```
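A minimal usage sketch of the client (the preamble path below is illustrative, and the copilot-api proxy is assumed to be running on port 4141):

```typescript
import { CopilotAgentClient } from './llm-client';

async function smokeTest(): Promise<void> {
  // Path is an example; any agent preamble .md file works.
  const client = new CopilotAgentClient({
    preamblePath: 'docs/agents/claudette-debug.md',
    temperature: 0.0, // honored because the client uses '??', not '||'
  });
  await client.loadPreamble('docs/agents/claudette-debug.md');

  const result = await client.execute('Summarize your debugging methodology in three bullets.');
  console.log(result.output);
  console.log(`~${result.tokens.input + result.tokens.output} tokens (estimated)`);
}

smokeTest().catch(console.error);
```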
### Code: `tools/validate-agent.ts`
```typescript
import { CopilotAgentClient } from './llm-client';
import { evaluateAgent } from './evaluators';
import { generateReport } from './report-generator';
import fs from 'fs/promises';
import path from 'path';
interface BenchmarkTask {
name: string;
description: string;
task: string;
rubric: {
categories: Array<{
name: string;
maxPoints: number;
criteria: string[];
}>;
};
}
async function validateAgent(
agentPath: string,
benchmarkPath: string,
outputDir: string
): Promise<void> {
  console.log(`\nValidating agent: ${agentPath}`);
  console.log(`Benchmark: ${benchmarkPath}\n`);
// 1. Load benchmark
const benchmark: BenchmarkTask = JSON.parse(
await fs.readFile(benchmarkPath, 'utf-8')
);
// 2. Initialize agent with GitHub Copilot
const client = new CopilotAgentClient({
preamblePath: agentPath,
temperature: 0.0,
maxTokens: 8000,
});
await client.loadPreamble(agentPath);
// 3. Execute benchmark task
  console.log('Executing benchmark task...');
  const result = await client.execute(benchmark.task);
  console.log(`✅ Task completed in ${result.tokens.input + result.tokens.output} tokens\n`);
  // 4. Evaluate output
  console.log('Evaluating output against rubric...');
  const scores = await evaluateAgent(result.output, benchmark.rubric);
  console.log(`Total score: ${scores.total}/100\n`);
// 5. Generate report
const timestamp = new Date().toISOString().split('T')[0];
const agentName = path.basename(agentPath, '.md');
const outputPath = path.join(outputDir, `${timestamp}_${agentName}`);
await fs.mkdir(outputDir, { recursive: true });
// Save raw output
await fs.writeFile(
`${outputPath}.json`,
JSON.stringify(
{
timestamp: new Date().toISOString(),
agent: agentPath,
benchmark: benchmarkPath,
result,
scores,
},
null,
2
)
);
// Save readable report
const report = generateReport({
agent: agentName,
benchmark: benchmark.name,
result,
scores,
});
await fs.writeFile(`${outputPath}.md`, report);
  console.log(`Report saved to: ${outputPath}.md`);
}
// CLI usage
const [agentPath, benchmarkPath] = process.argv.slice(2);
if (!agentPath || !benchmarkPath) {
console.error('Usage: npm run validate <agent.md> <benchmark.json>');
process.exit(1);
}
validateAgent(
agentPath,
benchmarkPath,
'validation-output'
).catch(console.error);
```
### Code: `tools/evaluators/index.ts`
```typescript
import { ChatOpenAI } from '@langchain/openai';
interface Rubric {
categories: Array<{
name: string;
maxPoints: number;
criteria: string[];
}>;
}
interface Scores {
categories: Record<string, number>;
total: number;
feedback: Record<string, string>;
}
export async function evaluateAgent(
agentOutput: string,
rubric: Rubric
): Promise<Scores> {
// Use GitHub Copilot for evaluation (LLM-as-judge)
const evaluator = new ChatOpenAI({
openAIApiKey: 'dummy-key-not-used', // Required by OpenAI client but not used by proxy
configuration: {
baseURL: 'http://localhost:4141/v1', // copilot-api proxy
},
temperature: 0.0, // Deterministic scoring
});
const scores: Scores = {
categories: {},
total: 0,
feedback: {},
};
// Evaluate each category
for (const category of rubric.categories) {
const evaluationPrompt = `
You are an expert evaluator. Score the following agent output against this rubric category:
**Category**: ${category.name} (Max: ${category.maxPoints} points)
**Criteria**:
${category.criteria.map((c, i) => `${i + 1}. ${c}`).join('\n')}
**Agent Output**:
${agentOutput}
**Instructions**:
1. Assign a score from 0 to ${category.maxPoints} based on how well the output meets the criteria.
2. Provide brief feedback explaining the score.
3. Format your response EXACTLY as:
SCORE: <number>
FEEDBACK: <explanation>
`.trim();
const response = await evaluator.invoke(evaluationPrompt);
const responseText = response.content.toString();
// Parse score
const scoreMatch = responseText.match(/SCORE:\s*(\d+)/);
const feedbackMatch = responseText.match(/FEEDBACK:\s*(.+)/s);
const score = scoreMatch ? parseInt(scoreMatch[1], 10) : 0;
const feedback = feedbackMatch ? feedbackMatch[1].trim() : 'No feedback provided';
scores.categories[category.name] = Math.min(score, category.maxPoints);
scores.feedback[category.name] = feedback;
scores.total += scores.categories[category.name];
}
return scores;
}
```
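A quick sketch of calling the evaluator directly, useful for testing the judge in isolation (the rubric below is a toy example, not one of the real benchmark rubrics; the proxy must be running):

```typescript
import { evaluateAgent } from './evaluators';

// Toy single-category rubric for demonstration; real rubrics live in benchmarks/*.json.
const toyRubric = {
  categories: [
    {
      name: 'Clarity',
      maxPoints: 10,
      criteria: ['Explanation is unambiguous', 'Steps are ordered logically'],
    },
  ],
};

async function demo(): Promise<void> {
  const scores = await evaluateAgent(
    '1. Reproduce the bug. 2. Bisect to the offending commit. 3. Fix and add a regression test.',
    toyRubric
  );
  console.log(`Clarity: ${scores.categories['Clarity']}/10`);
  console.log(scores.feedback['Clarity']);
}

demo().catch(console.error);
```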
### Code: `tools/report-generator.ts`
```typescript
interface ReportData {
agent: string;
benchmark: string;
result: {
output: string;
conversationHistory: Array<{ role: string; content: string }>;
tokens: { input: number; output: number };
};
scores: {
categories: Record<string, number>;
total: number;
feedback: Record<string, string>;
};
}
export function generateReport(data: ReportData): string {
return `
# Agent Validation Report
**Agent**: ${data.agent}
**Benchmark**: ${data.benchmark}
**Date**: ${new Date().toISOString().split('T')[0]}
**Total Score**: ${data.scores.total}/100
---
## Scoring Breakdown
${Object.entries(data.scores.categories)
.map(
([category, score]) => `
### ${category}: ${score} points
**Feedback**: ${data.scores.feedback[category]}
`
)
.join('\n')}
---
## Agent Output
\`\`\`
${data.result.output}
\`\`\`
---
## Token Usage
- **Input**: ${data.result.tokens.input} tokens
- **Output**: ${data.result.tokens.output} tokens
- **Total**: ${data.result.tokens.input + data.result.tokens.output} tokens
---
## Conversation History
${data.result.conversationHistory
.map(
(msg) => `
### ${msg.role.toUpperCase()}
\`\`\`
${msg.content}
\`\`\`
`
)
.join('\n')}
`.trim();
}
```
---
## USAGE
### Validate Single Agent
```bash
npm run validate -- \
docs/agents/claudette-debug.md \
benchmarks/debug-benchmark.json
```
Output:
```
Validating agent: docs/agents/claudette-debug.md
Benchmark: benchmarks/debug-benchmark.json
Executing benchmark task...
✅ Task completed in 12,451 tokens
Evaluating output against rubric...
Bug Discovery: 32/35
Root Cause Analysis: 18/20
Methodology: 19/20
Process Quality: 14/15
Production Impact: 9/10
Total score: 92/100
Report saved to: validation-output/2025-10-15_claudette-debug.md
```
### Batch Validate Multiple Agents
```bash
# Create batch validation script
npm run validate:batch -- \
docs/agents/claudette-debug.md \
docs/agents/generated-debug-v1.md \
docs/agents/generated-debug-v2.md
```
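The `validate:batch` script does not exist yet; a minimal sketch that reuses the single-agent script as-is (the shared benchmark path and sequential execution are assumptions):

```typescript
// tools/validate-batch.ts (hypothetical wiring for the `validate:batch` script above)
import { execFileSync } from 'child_process';

const agentPaths = process.argv.slice(2);
const benchmarkPath = 'benchmarks/debug-benchmark.json'; // assumption: one shared benchmark

for (const agentPath of agentPaths) {
  // Shell out to the existing CLI so validate-agent.ts needs no changes.
  // Sequential on purpose: keeps load on the local proxy predictable.
  execFileSync('npx', ['ts-node', 'tools/validate-agent.ts', agentPath, benchmarkPath], {
    stdio: 'inherit',
  });
}
```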
### Compare to Baseline
```bash
npm run validate:compare -- \
--baseline docs/agents/claudette-debug.md \
--candidate docs/agents/generated-debug-v1.md \
--benchmark benchmarks/debug-benchmark.json
```
Output:
```
Comparison Report
Baseline (claudette-debug): 92/100
Candidate (generated-debug-v1): 88/100
Delta: -4 points
Gaps:
- Bug Discovery: -3 pts (stopped after 6 bugs)
- Process Quality: -1 pt (missed cleanup)
```
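Like `validate:batch`, the compare script is not implemented above; a hedged sketch that runs both agents and diffs the JSON reports written by `validateAgent` (flag parsing omitted; per-category gap reporting left out for brevity):

```typescript
// tools/validate-compare.ts (hypothetical wiring for `validate:compare`)
import { execFileSync } from 'child_process';
import fs from 'fs/promises';
import path from 'path';

async function compare(baseline: string, candidate: string, benchmark: string): Promise<void> {
  const date = new Date().toISOString().split('T')[0];

  for (const agent of [baseline, candidate]) {
    execFileSync('npx', ['ts-node', 'tools/validate-agent.ts', agent, benchmark], { stdio: 'inherit' });
  }

  // Report filenames follow the `${date}_${agentName}.json` convention used by validateAgent.
  const load = async (agent: string) =>
    JSON.parse(await fs.readFile(`validation-output/${date}_${path.basename(agent, '.md')}.json`, 'utf-8'));

  const base = await load(baseline);
  const cand = await load(candidate);
  const delta = cand.scores.total - base.scores.total;

  console.log('Comparison Report');
  console.log(`  Baseline (${path.basename(baseline, '.md')}): ${base.scores.total}/100`);
  console.log(`  Candidate (${path.basename(candidate, '.md')}): ${cand.scores.total}/100`);
  console.log(`  Delta: ${delta > 0 ? '+' : ''}${delta} points`);
}
```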
---
## TESTING THE AGENTINATOR (Two-Hop)
### Automated Two-Hop Validation
```typescript
// tools/validate-agentinator.ts
import fs from 'fs/promises';
import { CopilotAgentClient } from './llm-client';
// Assumes validateAgent is exported from tools/validate-agent.ts.
import { validateAgent } from './validate-agent';

async function validateAgentinator(
  agentinatorPath: string,
  requirement: string,
  benchmarkPath: string,
  baselineScore: number
): Promise<void> {
  console.log('Two-Hop Validation: Agentinator\n');
  // Hop 1: Generate agent
  console.log('Hop 1: Generating agent with Agentinator...');
  const agentClient = new CopilotAgentClient({
    preamblePath: agentinatorPath,
  });
  await agentClient.loadPreamble(agentinatorPath);
  const generatedAgent = await agentClient.execute(requirement);
  // Save generated agent
  const generatedPath = 'generated-agents/debug-v1.md';
  await fs.writeFile(generatedPath, generatedAgent.output);
  console.log(`✅ Agent generated: ${generatedPath}\n`);
  // Hop 2: Validate generated agent
  console.log('Hop 2: Validating generated agent...');
  await validateAgent(generatedPath, benchmarkPath, 'validation-output');
  // Load scores from the report written by validateAgent
  const reportPath = `validation-output/${new Date().toISOString().split('T')[0]}_debug-v1.json`;
  const report = JSON.parse(await fs.readFile(reportPath, 'utf-8'));
  // Compare to baseline: pass if within 10 points of (or above) the baseline
  const delta = report.scores.total - baselineScore;
  const success = delta >= -10;
  console.log(`\nAgentinator Validation Results:`);
  console.log(`  Baseline: ${baselineScore}/100`);
  console.log(`  Generated: ${report.scores.total}/100`);
  console.log(`  Delta: ${delta > 0 ? '+' : ''}${delta} points`);
  console.log(`  Status: ${success ? '✅ PASS' : '❌ FAIL'}\n`);
  if (success) {
    console.log('✅ Agentinator produces agents within 10 pts of baseline!');
  } else {
    console.log('❌ Agentinator needs improvement. Gap too large.');
  }
}
```
Usage:
```bash
npm run validate:agentinator -- \
--agentinator docs/agents/claudette-agentinator.md \
--requirement "Design a debug agent like claudette-debug" \
--benchmark benchmarks/debug-benchmark.json \
--baseline 92
```
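The function above still needs a CLI entry point to match this usage; a minimal sketch with hand-rolled flag parsing (flag names mirror the command above):

```typescript
// Continuation of tools/validate-agentinator.ts: hypothetical CLI entry point.
function getFlag(name: string): string | undefined {
  const idx = process.argv.indexOf(`--${name}`);
  return idx !== -1 ? process.argv[idx + 1] : undefined;
}

const agentinator = getFlag('agentinator');
const requirement = getFlag('requirement');
const benchmark = getFlag('benchmark');
const baseline = Number(getFlag('baseline'));

if (!agentinator || !requirement || !benchmark || Number.isNaN(baseline)) {
  console.error(
    'Usage: npm run validate:agentinator -- --agentinator <path> --requirement <text> --benchmark <path> --baseline <score>'
  );
  process.exit(1);
}

validateAgentinator(agentinator, requirement, benchmark, baseline).catch(console.error);
```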
---
## ADVANTAGES OF LANGCHAIN + GITHUB COPILOT
| Feature | Manual (Cursor) | LangChain + Copilot |
|---------|-----------------|---------------------|
| **Automation** | ❌ Manual | ✅ Fully automated |
| **Setup** | ✅ None needed | ✅ Just authenticate |
| **Quality** | ✅ Claude Sonnet 4.5 | ✅ GPT-4 + Claude |
| **Batch Testing** | ❌ One at a time | ✅ Parallel |
| **Reproducibility** | ⚠️ Human variance | ✅ Deterministic |
| **Speed** | ⏱️ 1-2 hours/agent | ⚡ 5-10 min/agent |
| **Scoring** | ❌ Manual | ✅ LLM-as-judge |
| **CI/CD Integration** | ❌ No | ✅ Yes |
| **Cost** | ✅ Free (Cursor) | ✅ Included (Copilot) |
**Winner**: LangChain + GitHub Copilot for automated, fast, high-quality validation.
---
## Quick Reference: Expected Outputs
### Single Agent Validation
```
Validating agent: claudette-debug.md
Executing benchmark task...
✅ Task completed in 12,451 tokens
Evaluating output against rubric...
Bug Discovery: 32/35
Root Cause Analysis: 18/20
Methodology: 19/20
Process Quality: 14/15
Production Impact: 9/10
Total score: 92/100
Report saved to: validation-output/2025-10-15_claudette-debug.md
```
### Two-Hop Validation (Agentinator)
```
Hop 1: Generating agent...
✅ Agent generated: generated-agents/debug-v1.md
Hop 2: Validating generated agent...
Generated agent score: 88/100
Delta: -4 points
✅ PASS (within 10 pts of baseline)
```
---
## IMMEDIATE NEXT STEPS
### Phase 1: Setup (5 min) - Pure Node.js!
```bash
# Authenticate with GitHub Copilot
gh auth login
# Install copilot-api proxy (Pure Node.js!)
npm install -g copilot-api
copilot-api start &
# Install LangChain (TypeScript)
npm install @langchain/core @langchain/openai langchain
# Test connection
node -e "const {ChatOpenAI} = require('@langchain/openai'); const llm = new ChatOpenAI({apiKey: 'dummy-key-not-used', configuration: {baseURL: 'http://localhost:4141/v1'}}); llm.invoke('Hello!').then(r => console.log('✅', r.content));"
```
### Phase 2: Build Tool (4 hours)
```bash
# Create tool structure
mkdir -p tools/evaluators
touch tools/validate-agent.ts
touch tools/llm-client.ts
touch tools/evaluators/index.ts
touch tools/report-generator.ts
# Implement (copy code above)
# ...
# Add npm scripts to package.json
npm pkg set scripts.validate="ts-node tools/validate-agent.ts"
```
### Phase 3: Create Benchmarks (1 hour)
```bash
# Create benchmark specs
touch benchmarks/debug-benchmark.json
# Fill with task + rubric (structured JSON)
```
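A hedged sketch of what `benchmarks/debug-benchmark.json` could look like, matching the `BenchmarkTask` interface above (the task text and criteria are placeholders; only the category names and point totals mirror the sample output elsewhere in this doc):

```json
{
  "name": "debug-benchmark",
  "description": "Find and explain the bugs seeded in a small Node.js service",
  "task": "Here is a buggy Express handler (snippet omitted). Identify every bug, explain each root cause, and describe your debugging process step by step.",
  "rubric": {
    "categories": [
      { "name": "Bug Discovery", "maxPoints": 35, "criteria": ["Finds all seeded bugs", "No false positives"] },
      { "name": "Root Cause Analysis", "maxPoints": 20, "criteria": ["Explains why each bug occurs"] },
      { "name": "Methodology", "maxPoints": 20, "criteria": ["Systematic, reproducible process"] },
      { "name": "Process Quality", "maxPoints": 15, "criteria": ["Verifies fixes", "Cleans up instrumentation"] },
      { "name": "Production Impact", "maxPoints": 10, "criteria": ["Assesses severity and blast radius"] }
    ]
  }
}
```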
### Phase 4: Run First Validation (30 min)
```bash
# Test claudette-debug (baseline)
npm run validate -- docs/agents/claudette-debug.md benchmarks/debug-benchmark.json
# Review output
cat validation-output/2025-10-15_claudette-debug.md
```
### Phase 5: Test Agentinator (1 hour)
```bash
# Two-hop validation
npm run validate:agentinator -- \
--agentinator docs/agents/claudette-agentinator.md \
--requirement "Design a debug agent" \
--benchmark benchmarks/debug-benchmark.json \
--baseline 92
```
---
## WHY THIS IS THE CORRECT SOLUTION
**Your question**: "Why can't we use LangChain?"
**My answer**: **We absolutely should use LangChain!**
**Reasons**:
1. ✅ **No new API costs** - uses the existing GitHub Copilot subscription
2. ✅ **Automated** - no manual orchestration needed
3. ✅ **Industry standard** - LangChain is the de facto framework for agent testing
4. ✅ **Reproducible** - same input → same output at temperature 0
5. ✅ **Fast** - 5-10 min per agent vs. 1-2 hours manual
6. ✅ **CI/CD ready** - can run in GitHub Actions
7. ✅ **LLM-as-judge** - automated scoring against rubrics
**My initial mistake**: I dismissed LangChain too quickly without researching its orchestration and evaluation tools.
**Correct approach**: LangChain + GitHub Copilot is the right solution for automated, reproducible agent validation.
---
## EXPECTED TIMELINE
| Phase | Task | Time |
|-------|------|------|
| 1 | Setup (Pure Node.js!) | 5 min |
| 2 | Build validation tool | 4 hours |
| 3 | Create benchmarks | 1 hour |
| 4 | Run first validation | 30 min |
| 5 | Test Agentinator | 1 hour |
| **Total** | **End-to-end working system** | **6.5 hours** |
**Next step**: `gh auth login && npm install -g copilot-api && copilot-api start`