
# Kodit Benchmark: SWE-bench Implementation

## Overview

This document describes the SWE-bench benchmark implementation for evaluating Kodit's code retrieval capabilities. The benchmark uses real-world GitHub issues from the [SWE-bench](https://www.swebench.com/) dataset to measure how much Kodit's retrieval improves LLM patch generation.

---

## 1. Why SWE-bench?

SWE-bench tests repository-level issue resolution, a task where retrieval provides significant value:

| Kodit Capability | SWE-bench Requirement | Alignment |
|------------------|----------------------|-----------|
| Index Git repositories | Real GitHub repos at specific commits | ✅ Perfect |
| Hybrid search (BM25 + semantic) | Find relevant code for bug fixing | ✅ Perfect |
| AST-based snippet extraction | Locate functions/classes to modify | ✅ Perfect |
| Filter by repository | Each task targets a specific repo | ✅ Perfect |

**Why SWE-bench over RepoEval?**

| Feature | SWE-bench | RepoEval |
|---------|-----------|----------|
| Exact commit hashes | ✅ `base_commit` field | ❌ Snapshots only |
| Evaluation method | ✅ Real test execution | ⚠️ Token similarity |
| Task complexity | Real bug fixes | Function completion |
| Retrieval impact | High (large repos) | Medium |

From the [SWE-bench leaderboard](https://www.swebench.com/):

- RAG-based approaches (BM25 retrieval + LLM) achieve **4-7% on Lite**
- Agentless-Lite with embedding retrieval achieves **32% on Lite**
- This demonstrates significant headroom for better retrieval

---

## 2. Dataset

### 2.1 Data Source

The benchmark uses the official SWE-bench datasets from Hugging Face:

| Dataset | Size | Use Case |
|---------|------|----------|
| `princeton-nlp/SWE-bench_Lite` | 300 instances | Primary benchmark |
| `princeton-nlp/SWE-bench_Verified` | 500 instances | Extended benchmark |

### 2.2 Repositories

SWE-bench Lite covers 12 popular Python repositories:

| Repository | Instances | Description |
|------------|-----------|-------------|
| django/django | 114 | Web framework |
| sympy/sympy | 77 | Symbolic mathematics |
| matplotlib/matplotlib | 23 | Plotting library |
| scikit-learn/scikit-learn | 23 | Machine learning |
| pytest-dev/pytest | 17 | Testing framework |
| sphinx-doc/sphinx | 16 | Documentation generator |
| astropy/astropy | 6 | Astronomy library |
| psf/requests | 6 | HTTP library |
| pylint-dev/pylint | 6 | Code linter |
| pydata/xarray | 5 | N-D arrays |
| mwaskom/seaborn | 4 | Statistical visualization |
| pallets/flask | 3 | Web microframework |

### 2.3 Instance Format

Each instance contains:

```python
{
    "instance_id": "django__django-11049",        # Unique identifier
    "repo": "django/django",                      # GitHub repository
    "base_commit": "17455e924e24...",             # Exact commit to checkout
    "problem_statement": "...",                   # Issue description (natural language)
    "hints_text": "...",                          # Optional hints
    "patch": "diff --git a/...",                  # Ground truth fix
    "test_patch": "diff --git a/...",             # Test additions
    "FAIL_TO_PASS": ["test_invalid_string..."],   # Tests that should pass after fix
    "PASS_TO_PASS": ["test_other..."],            # Tests that should remain passing
    "version": "3.0",                             # Library version
    "environment_setup_commit": "...",            # Commit for environment setup
}
```

### 2.4 Example Task

```
Instance: django__django-11049
Commit: 17455e924e243e7a55e8a38f45966d8cbb27c273

Problem Statement:
Correct expected format in invalid DurationField error message.
The current error message says "[DD] [HH:[MM:]]ss[.uuuuuu]" but should be
"[DD] [[HH:]MM:]ss[.uuuuuu]" because seconds are mandatory.

Expected Patch:
diff --git a/django/db/models/fields/__init__.py
-    "[DD] [HH:[MM:]]ss[.uuuuuu] format.")
+    "[DD] [[HH:]MM:]ss[.uuuuuu] format.")

Tests to Fix:
["test_invalid_string (model_fields.test_durationfield.TestValidation)"]
```

---

## 3. Experimental Design

### 3.1 Conditions

| Condition | Description |
|-----------|-------------|
| **Baseline** | LLM generates patch with only the problem statement |
| **BM25** | LLM generates patch with BM25-retrieved context (SWE-bench baseline) |
| **Kodit** | LLM generates patch with Kodit-retrieved context |
| **Oracle** | LLM generates patch with gold file context (upper bound) |

### 3.2 Metrics

**Primary Metrics**:

- **Resolve Rate**: Percentage of instances where the generated patch makes all `FAIL_TO_PASS` tests pass without breaking any `PASS_TO_PASS` test
- **Resolve Rate Delta**: `Resolve(Kodit) - Resolve(BM25)`, the improvement over the BM25 RAG baseline

**Secondary Metrics**:

- **Retrieval Recall@k**: Fraction of modified files found in top-k results
- **Context Utilization**: How often retrieved context appears in generated patches

### 3.3 Evaluation

Evaluation uses the official SWE-bench harness with Docker containers:

1. Apply generated patch to repository at `base_commit`
2. Run `FAIL_TO_PASS` tests in isolated environment
3. Verify `PASS_TO_PASS` tests still pass (no regressions)
4. Instance is "resolved" only if all conditions are met

#### Running the SWE-bench Harness

Install and run the official evaluation harness:

```bash
# Install SWE-bench
pip install swebench

# Run evaluation (requires Docker with ~100GB disk space)
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path results/predictions.jsonl \
    --max_workers 8 \
    --run_id kodit_eval

# For Mac M-series (ARM), build images locally:
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path results/predictions.jsonl \
    --max_workers 8 \
    --namespace '' \
    --run_id kodit_eval
```

**Performance**: ~30 mins for Lite (300 instances) on 16 cores with `cache_level=env`.

**Cloud option**: Run on Modal to avoid local Docker setup:

```bash
pip install modal swebench[modal]
modal setup
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path results/predictions.jsonl \
    --modal true
```

---

## 4. Running the Benchmark

### 4.1 Quick Start

```bash
# Step 1: Setup - clone repos at specific commits, index with Kodit
uv run kodit benchmark setup --dataset swebench-lite

# Step 2: Run a single instance (for testing)
uv run kodit benchmark run-one django__django-11049

# Step 3: Run full benchmark
uv run kodit benchmark run --dataset swebench-lite --condition kodit

# Step 4: Evaluate predictions
uv run kodit benchmark evaluate results/predictions.jsonl
```

### 4.2 CLI Commands

```bash
# Show available instances
uv run kodit benchmark list --dataset swebench-lite --repo django/django

# Run specific instances
uv run kodit benchmark run \
    --instances django__django-11049 django__django-13447 \
    --model claude-3-5-sonnet-20241022 \
    --condition kodit

# Compare conditions
uv run kodit benchmark compare \
    results/baseline.jsonl \
    results/kodit.jsonl
```

### 4.3 Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| `--dataset` | `swebench-lite` | Dataset variant (lite, verified, full) |
| `--model` | `claude-3-5-sonnet-20241022` | LiteLLM model identifier |
| `--condition` | `kodit` | Retrieval condition (baseline, bm25, kodit, oracle) |
| `--top-k` | `5` | Number of files/snippets to retrieve |
| `--instances` | all | Specific instance IDs to run |
| `--repo` | all | Filter to specific repository |

---

## 5. Architecture

### 5.1 Directory Structure

```
benchmarks/
├── __init__.py
├── cli.py                      # CLI commands (setup, run, evaluate)
├── swebench/
│   ├── __init__.py
│   ├── instance.py             # SWEBenchInstance dataclass
│   ├── loader.py               # HuggingFace dataset loader
│   ├── repository.py           # Git clone/checkout management
│   ├── retriever.py            # Kodit retrieval wrapper
│   ├── prompt.py               # Prompt templates
│   ├── generator.py            # LLM patch generation
│   └── evaluator.py            # SWE-bench harness wrapper
├── repos/                      # Cloned repositories (gitignored)
│   └── django__django-11049/   # Instance-specific checkout
├── results/                    # Benchmark outputs
│   └── predictions.jsonl
└── cache/                      # Indexed repository cache
```

### 5.2 Setup Process

The `setup` command prepares repositories for benchmarking:

1. **Load dataset**: Fetch from `princeton-nlp/SWE-bench_Lite`
2. **Clone repositories**: For each unique `(repo, base_commit)` pair:
   ```bash
   git clone https://github.com/{repo} repos/{instance_id}
   cd repos/{instance_id}
   git checkout {base_commit}
   ```
3. **Index with Kodit**: For each cloned repository:
   - POST to `/api/v1/repositories` with `file://` URI
   - Wait for indexing to complete
   - Store mapping: `instance_id → repository_id`
4. **Cache index**: Save Kodit database for reuse

### 5.3 Benchmark Pipeline

```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Load Instance│───▶│   Retrieve   │───▶│ Build Prompt │
│   from HF    │    │  from Kodit  │    │ (issue+ctx)  │
└──────────────┘    └──────────────┘    └──────┬───────┘
                                               │
┌──────────────┐    ┌──────────────┐           │
│   Evaluate   │◀───│   Generate   │◀──────────┘
│ with Docker  │    │    Patch     │
└──────────────┘    └──────────────┘
```

---

## 6. Implementation Details

### 6.1 Retrieval Strategy

The `KoditRetriever` queries Kodit with the problem statement:

```python
class KoditRetriever:
    async def retrieve(
        self, instance: SWEBenchInstance, k: int = 5
    ) -> list[RetrievedFile]:
        # Extract key terms from problem statement
        keywords = self._extract_keywords(instance.problem_statement)

        results = await self._search_service.search(
            user_intent=instance.problem_statement,
            keywords=keywords,
            source_repo=f"github.com/{instance.repo}",
        )

        # Group snippets by file, return top-k files
        files = self._group_by_file(results)
        return files[:k]
```

### 6.2 Prompt Template

Following the SWE-bench BM25 baseline format:

```
You will be provided with a partial code base and an issue statement explaining a problem to resolve.

<issue>
{problem_statement}
</issue>

<code>
[start of {file_path_1}]
{file_content_1}
[end of {file_path_1}]
[start of {file_path_2}]
{file_content_2}
[end of {file_path_2}]
</code>

Generate a patch in unified diff format that resolves the issue.
Only output the patch, no explanations.
```

### 6.3 Prediction Format (Target Output)

Our benchmark must produce a JSONL file compatible with the SWE-bench evaluation harness:

```jsonl
{"instance_id": "django__django-11049", "model_name_or_path": "kodit-claude", "model_patch": "diff --git a/django/db/models/fields/__init__.py b/django/db/models/fields/__init__.py\nindex abc123..def456 100644\n--- a/django/db/models/fields/__init__.py\n+++ b/django/db/models/fields/__init__.py\n@@ -1000,7 +1000,7 @@\n- \"[DD] [HH:[MM:]]ss[.uuuuuu] format.\")\n+ \"[DD] [[HH:]MM:]ss[.uuuuuu] format.\")"}
{"instance_id": "django__django-13447", "model_name_or_path": "kodit-claude", "model_patch": "diff --git a/..."}
```

**Required fields:**

- `instance_id`: Must match exactly (e.g., `django__django-11049`)
- `model_name_or_path`: Identifier for tracking (e.g., `kodit-claude-sonnet`)
- `model_patch`: Unified diff format with `diff --git` header

This file is then passed to the SWE-bench harness:

```bash
python -m swebench.harness.run_evaluation \
    --predictions_path results/predictions.jsonl \
    ...
```

---

## 7. Results Format

### 7.1 Per-Instance Results

```json
{
  "instance_id": "django__django-11049",
  "condition": "kodit",
  "retrieved_files": ["django/db/models/fields/__init__.py"],
  "retrieval_recall": 1.0,
  "generated_patch": "diff --git a/...",
  "resolved": true,
  "fail_to_pass_results": {"test_invalid_string": "PASSED"},
  "latency_ms": 2340
}
```

### 7.2 Aggregate Results

```json
{
  "benchmark": "swebench-lite",
  "model": "claude-3-5-sonnet-20241022",
  "timestamp": "2024-01-15T10:30:00Z",
  "conditions": {
    "baseline": {
      "resolve_rate": 0.15,
      "retrieval_recall_5": 0.0,
      "instances_run": 300
    },
    "bm25": {
      "resolve_rate": 0.22,
      "retrieval_recall_5": 0.45,
      "instances_run": 300
    },
    "kodit": {
      "resolve_rate": 0.28,
      "retrieval_recall_5": 0.62,
      "instances_run": 300
    }
  },
  "kodit_delta_vs_baseline": "+13%",
  "kodit_delta_vs_bm25": "+6%"
}
```

---

## 8. Expected Results

Based on SWE-bench leaderboard data and CodeRAG-Bench findings:

| Condition | Expected Resolve Rate |
|-----------|----------------------|
| Baseline (no retrieval) | ~15% |
| BM25 retrieval | ~22% |
| Kodit retrieval | ~28% |
| Oracle (gold files) | ~45% |

**Key Insight**: The gap between BM25 (22%) and Oracle (45%) represents the potential improvement from better retrieval. Kodit's hybrid search should capture more of this potential than pure BM25.

---

## 9. Troubleshooting

### Repository cloning fails

- Ensure network access to GitHub
- Some repos may require authentication for private forks
- Use `--skip-clone` if repos already exist locally

### Indexing takes too long

- Large repos (django, sympy) can take 10-30 minutes
- Use `--repo` flag to test with smaller repos first (flask, requests)
- Pre-indexed caches can be shared across runs

### Docker evaluation fails

- Ensure Docker daemon is running
- SWE-bench requires significant disk space for containers
- Use `--dry-run` to test pipeline without evaluation

### API key issues

- Set appropriate API keys for your LLM provider
- `ANTHROPIC_API_KEY` for Claude models
- `OPENAI_API_KEY` for OpenAI models

---

## 10. Implementation Checklist

| # | Task | Priority | Status |
|---|------|----------|--------|
| 1 | Create `SWEBenchInstance` dataclass | High | ✅ DONE |
| 2 | Implement HuggingFace dataset loader (`download` command) | High | ✅ DONE |
| 3 | Implement `prepare-instance` command (clone repo, index with Kodit) | High | TODO |
| 4 | Implement Kodit retrieval wrapper | High | TODO |
| 5 | Implement prompt builder | High | TODO |
| 6 | Implement patch generator | High | TODO |
| 7 | Output predictions in SWE-bench JSONL format | High | TODO |
| 8 | Add result aggregation and reporting | Medium | TODO |
| 9 | Add BM25 baseline comparison | Medium | TODO |

### Completed

- **`SWEBenchInstance`** (`src/benchmark/swebench/instance.py`): Immutable dataclass with all SWE-bench fields
- **`DatasetLoader`** (`src/benchmark/swebench/loader.py`): Downloads from HuggingFace, saves/loads JSON
- **`download` command**: `uv run kodit-benchmark download --dataset lite`
- **Dataset stored at**: `benchmarks/data/swebench-lite.json` (300 instances)

### Next Step

**`prepare-instance` command**: Takes a single instance ID, clones the repo at the exact commit, starts a fresh Kodit server, and indexes the repository.

```bash
uv run kodit-benchmark prepare-instance django__django-11049
```

This will:

1. Look up the instance from the downloaded dataset
2. Clone `django/django` to `benchmarks/repos/django__django-11049/`
3. Checkout the exact `base_commit`
4. Start a fresh Kodit server (using existing `start-kodit` infrastructure)
5. Index the repository via Kodit API
6. Wait for indexing to complete

---

## 11. References

- [SWE-bench Website](https://www.swebench.com/)
- [SWE-bench GitHub](https://github.com/SWE-bench/SWE-bench)
- [SWE-bench Evaluation Guide](https://www.swebench.com/SWE-bench/guides/evaluation/)
- [SWE-bench Docker Setup](https://www.swebench.com/SWE-bench/guides/docker_setup/)
- [SWE-bench Paper](https://arxiv.org/abs/2310.06770)
- [SWE-bench Lite Dataset](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Lite)
- [Agentless Paper](https://arxiv.org/abs/2407.01489) - RAG-based approach achieving 32%
- [CodeRAG-Bench Paper](https://arxiv.org/abs/2406.14497) - Analysis of retrieval impact
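
As a worked illustration of the Retrieval Recall@k metric from Section 3.2, the sketch below computes recall for one instance. It assumes the set of modified files is read from the `diff --git` headers of the ground-truth `patch` field; the helper names `gold_files` and `recall_at_k` are hypothetical and are not part of Kodit or SWE-bench.

```python
import re

# Matches the header line that starts each file section of a git-style diff.
# Assumption: paths contain no whitespace, which holds for the example below.
_DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)


def gold_files(patch: str) -> set[str]:
    """Return the paths touched by a patch, read from its `diff --git` headers."""
    return {match.group(2) for match in _DIFF_HEADER.finditer(patch)}


def recall_at_k(retrieved: list[str], gold: set[str], k: int = 5) -> float:
    """Fraction of gold files that appear among the first k retrieved paths."""
    if not gold:
        return 0.0
    return len(gold & set(retrieved[:k])) / len(gold)


patch = (
    "diff --git a/django/db/models/fields/__init__.py "
    "b/django/db/models/fields/__init__.py\n"
    "--- a/django/db/models/fields/__init__.py\n"
    "+++ b/django/db/models/fields/__init__.py\n"
)
retrieved = ["django/forms/fields.py", "django/db/models/fields/__init__.py"]

print(recall_at_k(retrieved, gold_files(patch), k=5))  # 1.0
print(recall_at_k(retrieved, gold_files(patch), k=1))  # 0.0
```

Averaging this value over all instances of a condition would give a figure comparable to the `retrieval_recall_5` field in Section 7.2, under the same assumption about how gold files are derived.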

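The prediction format from Section 6.3 can be produced with the standard `json` module alone. The following is a minimal sketch, not the benchmark's actual writer: `Prediction` and `to_jsonl` are hypothetical names, and only the three field names come from the format described in that section.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Prediction:
    """One record of the predictions file (field names as in Section 6.3)."""

    instance_id: str
    model_name_or_path: str
    model_patch: str


def to_jsonl(predictions: list[Prediction]) -> str:
    """Serialize predictions as one JSON object per line."""
    # json.dumps escapes the newlines inside the patch, so each record
    # stays on a single physical line.
    return "".join(json.dumps(asdict(p)) + "\n" for p in predictions)


predictions = [
    Prediction(
        instance_id="django__django-11049",
        model_name_or_path="kodit-claude",
        model_patch="diff --git a/x.py b/x.py\n--- a/x.py\n+++ b/x.py\n",
    ),
]
text = to_jsonl(predictions)
print(text, end="")
```

Writing `text` to `results/predictions.jsonl` would give a file in the shape passed to `--predictions_path` in the harness commands above.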