Skip to main content
Glama
elara-labs

Code Context Engine

Official
by elara-labs

Use cases

Use case

How CCE helps

๐Ÿ’ฐ

Reduce Claude Code costs

94% fewer input tokens per session

๐Ÿ”’

Keep code private

Everything local, no cloud indexing

๐Ÿ”„

Multi-editor teams

One index across Claude Code, Cursor, VS Code, Gemini CLI

๐Ÿง 

Cross-session memory

Decisions and context survive restarts

โšก

Faster responses

Less context = faster Claude replies

๐Ÿ“Š

Track actual savings

Dollar amounts, not estimates


Quick start

uv tool install "code-context-engine[local]"    # or: pipx install "code-context-engine[local]"
cd /path/to/your/project
cce init                                        # or: cce init --agent all

That's it. Your AI coding agent now searches your index instead of reading entire files.

Already have Ollama? You can skip [local] and use uv tool install code-context-engine instead. CCE auto-detects Ollama at localhost:11434 and uses nomic-embed-text.


System requirements

  • Python 3.11+ (tested on 3.11, 3.12, 3.13)

  • A C compiler and cmake (needed to build tree-sitter grammars)

Platform

Setup

macOS

xcode-select --install (provides compiler and cmake)

Ubuntu/Debian

sudo apt install build-essential cmake

Fedora/RHEL

sudo dnf install gcc gcc-c++ cmake

Windows

Install Visual Studio Build Tools (C++ workload) and CMake

Tested on all three platforms in CI (macOS, Linux, Windows ร— Python 3.11/3.12/3.13).

Install and see savings in 60 seconds

You need an embedding backend to index code. Pick one:

Option

Install command

Size

Requires

Local (recommended)

uv tool install "code-context-engine[local]"

+60 MB

Nothing else

Ollama

uv tool install code-context-engine

Core only

Ollama running + nomic-embed-text pulled

Then:

cd /path/to/your/project
cce init                              # index, install hooks, register MCP server

Restart your editor. Done. Every question now hits the index instead of re-reading files.

cce init auto-detects your editor and writes the right config. To target a specific agent, use --agent claude, --agent codex, --agent copilot, or --agent all.

Editor

Config written

Instructions

Claude Code

.mcp.json

CLAUDE.md

VS Code / Copilot

.vscode/mcp.json

.github/copilot-instructions.md

Cursor

.cursor/mcp.json

.cursorrules

Gemini CLI

.gemini/settings.json

GEMINI.md

OpenAI Codex

~/.codex/config.toml (user-global, per-project section)

AGENTS.md

OpenCode

opencode.json

Tabnine

.tabnine/agent/settings.json

TABNINE.md

Multiple editors in the same project? All get configured in one command.

Codex note: Codex CLI reads MCP servers from ~/.codex/config.toml only โ€” it has no per-project config. cce init adds one [mcp_servers.cce-<project>-<hash>] section per project so multiple projects coexist; cce uninstall removes only the section for the current project.

  my-project ยท 38 queries

  โ› โ› โ› โ›ถ โ›ถ โ›ถ โ›ถ โ›ถ โ›ถ โ›ถ  94% tokens saved

  Without CCE   48.0k  tokens   $0.14
  With CCE       3.4k  tokens   $0.01
  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  Saved         44.6k  tokens   $0.13

  Cost estimate based on Sonnet input pricing ($3/1M tokens)

Why this matters

Input tokens are 85-95% of your Claude Code bill. CCE cuts them by 94% (benchmarked on FastAPI).

Without CCE:    Claude reads payments.py + shipping.py   = 45,000 tokens
With CCE:       context_search "payment flow"            =    800 tokens

Without CCE

With CCE

Session startup

Re-reads files every time

Queries the index

Finding a function

Read entire 800-line file

Get the 40-line function

Cross-session memory

None

Decisions + code areas persisted

Token cost (Sonnet, medium project)

~$0.14/session

~$0.04/session


Benchmark: FastAPI (reproducible)

We benchmarked CCE against FastAPI (53 source files, 180K tokens) with 20 real coding questions. No cherry-picking, no synthetic queries.

Methodology: For each query, "without CCE" means reading the full content of every file the query touches. "With CCE" means the relevant chunks after compression.

Important baseline note: The 94% number is measured against full-file reads, not against what Claude Code actually does. In practice, Claude Code already uses grep, partial file reads, and targeted tools, so the real-world savings compared to normal Claude Code behavior will be lower than 94%. We use full-file as the baseline because it's reproducible and deterministic (no agent behavior variability). The benchmark measures CCE's retrieval efficiency, not a head-to-head comparison with Claude Code's built-in exploration.

Metric

Result

Retrieval savings

94% (83,681 โ†’ 4,927 tokens/query)

Compression (additional, on retrieved chunks)

89% (4,927 โ†’ 523 tokens/query)

Recall@10 (found the right files)

0.90

Latency p50

0.4ms

Queries tested

20

Per-Layer Savings (each measured independently)

Layer

What it does

Savings

Method

Retrieval

Full files โ†’ relevant code chunks

94%

measured

Chunk Compression

Raw chunks โ†’ signatures + docstrings

89%

measured

Grammar

Drops articles/fillers from memory text

13%

measured

Output compression (reducing Claude's reply length) provides additional savings (~65% estimated) but is not included in the headline number above.

Multi-language benchmarks

Repo

Language

Files

Retrieval savings

Recall@10

FastAPI

Python

53

94%

0.90

chi

Go

94

76%

0.67

fiber

Go (monorepo)

396

93%

0.07

Go's shorter files reduce the retrieval headroom (smaller baseline). Monorepos dilute recall at top-10 (fiber). Middleware queries with one-feature-per-file hit R=1.00 consistently.

Reproduce it yourself:

pip install code-context-engine
python benchmarks/run_benchmark.py --repo https://github.com/fastapi/fastapi.git --source-dir fastapi
python benchmarks/run_benchmark.py --repo https://github.com/go-chi/chi.git --source-dir .

Full results in benchmarks/results/. Queries and methodology in benchmarks/.


What you get

9 MCP tools that Claude uses automatically:

Tool

What it does

context_search

Hybrid vector + BM25 search with graph expansion

expand_chunk

Full source for a compressed result

related_context

Find code via graph edges (calls, imports)

session_recall

Recall decisions from past sessions

record_decision

Save a decision for future sessions

record_code_area

Record which files were worked in

index_status

Check index freshness

reindex

Re-index a file or the full project

set_output_compression

Adjust response verbosity (off / lite / standard / max)

Live dashboard with donut charts, file health, and session history:

cce dashboard

CCE Dashboard

Dollar estimates fetched from live Anthropic pricing:

cce savings --all    # see savings across all projects

How is CCE different?

CCE is editor-agnostic, local-first, and gives you measurable token savings. Your code never leaves your machine. Unlike built-in indexing (Cursor, Continue), CCE works across Claude Code, VS Code, Cursor, Gemini CLI, and Codex with a single index. Unlike cloud tools (Greptile), it's free and private.

See the full comparison with alternatives for an honest look at trade-offs.


How it works (the short version)

  1. Index: Tree-sitter parses your code into semantic chunks (functions, classes, modules). Stored as vector embeddings locally.

  2. Search: Claude calls context_search. Hybrid vector + BM25 retrieval finds the right chunks. Code graph adds related files automatically.

  3. Compress: Chunks are truncated to signatures + docstrings (or LLM-summarized if Ollama is running).

  4. Remember: Decisions and code areas persist across sessions via session_recall.

  5. Track: Every query is logged. cce savings shows exactly how much you saved.

Re-indexing after edits takes under 1 second (96% embedding cache hit rate). Git hooks keep the index current automatically.


What makes CCE different

It saves where the money is

Output compression tools (like Caveman) save 20-75% on output tokens. Output is 5-15% of your bill. Net savings: ~11%.

CCE saves on input tokens (94% retrieval savings on FastAPI, reproducibly benchmarked). Input is 85-95% of your bill.

It actually understands your code

Not a text search. Tree-sitter AST parsing creates semantic chunks. Hybrid retrieval merges vector similarity with BM25 keyword matching via Reciprocal Rank Fusion. A confidence scorer blends similarity (50%), keyword match (30%), and recency (20%). Graph expansion walks CALLS/IMPORTS edges to pull in related code.

It remembers

record_decision("use JWT for auth", reason="session tokens flagged by legal") is stored in SQLite and surfaces via session_recall in the next session. No re-explaining your architecture.

It tracks real savings

Not estimates. Actual tokens served vs full-file baseline, broken down by buckets (retrieval, compression, output, memory, grammar). Dollar costs fetched from Anthropic's pricing page. Savings summary shown at every session start.

It is secure by default

Secret files (.env, *.pem, credentials.json) are never indexed. Content is scanned for AWS keys, GitHub tokens, Slack tokens, Stripe keys, JWTs, and generic credentials. PII (emails, IPs, SSNs, credit cards) is scrubbed from memory writes. All MCP file paths are validated against path traversal.


Under the hood

SHA-256 fingerprint per chunk, salted with model name. Re-index skips unchanged code. Binary float32 storage (10x smaller than JSON). Typical re-index: 96% cache hit, under 1 second.

Replaced LanceDB with sqlite-vec. Same cosine-distance quality, 99% smaller install. WAL mode + PRAGMA NORMAL for 80% write speedup. Vectors, FTS5, code graph, and compression cache all in three SQLite files.

Memory entries compressed without LLM calls. Drops articles, fillers, pronouns. Three levels (lite/full/ultra, 20-60% savings). Code, paths, URLs preserved byte-for-byte. Same input always yields same output.

5 Claude Code lifecycle hooks capture session context. Every hook runs curl ... || true, so a crashed server never blocks the user. SessionStart injects bootstrap context; others capture silently.

Dollar estimates in cce savings come from live Anthropic pricing (HTML table parsed, cached 7 days, offline fallback). No manual updates when rates change.

7 buckets track every token saved: retrieval, chunk compression, output compression, memory recall, grammar, turn summarization, progressive disclosure. Survives restarts. Powers CLI and dashboard analytics.


CLI at a glance

cce init                    # Index + install hooks + register MCP
cce                         # Status banner
cce savings                 # Token savings with dollar estimates
cce savings --all           # All projects
cce dashboard               # Web dashboard with live charts
cce search "auth flow"      # Test a query
cce status                  # Index health + config
cce services                # Ollama + dashboard + MCP status
cce commands add-rule '...' # Project rules for Claude
cce uninstall               # Clean removal of all CCE artifacts

Run cce list for the full command reference.


Configuration

Zero-config by default. Override what you need in ~/.cce/config.yaml or .context-engine.yaml:

compression:
  level: standard          # minimal | standard | full
  output: standard         # off | lite | standard | max
  ollama_url: http://localhost:11434   # point at a remote Ollama if desired

retrieval:
  top_k: 20
  confidence_threshold: 0.5

pricing:
  model: sonnet            # sonnet | opus | haiku

Remote Ollama: If you run Ollama on another machine in your network, set compression.ollama_url (e.g. http://nas.local:11434) or export CCE_OLLAMA_URL โ€” the env var wins. CCE probes the endpoint and falls back to truncation-only compression when it's unreachable, so a flaky link won't break indexing.


Output Compression

CCE also compresses Claude's responses (same concept as Caveman):

Level

Style

Savings

off

Full output

0%

lite

No filler or hedging

~30%

standard

Fragments, drop articles

~65%

max

Telegraphic

~75%

Tell Claude: "switch to max compression" or "turn off compression". Code blocks and commands are never compressed.


Disk Footprint

Component

Size

Core install (Ollama backend)

~17 MB

With [local] extra (fastembed + ONNX)

~189 MB

Embedding model (one-time download)

~60 MB (fastembed) or managed by Ollama

Index per project (small/medium/large)

5-60 MB

No GPU required. With Ollama, embeddings are handled by the Ollama server. With the [local] extra, the embedding model runs on CPU via ONNX Runtime.


Supported Languages

AST-aware chunking (tree-sitter parsed, 10 extensions):

Language

Extensions

Python

.py

JavaScript

.js, .jsx

TypeScript

.ts, .tsx

PHP

.php

Go

.go

Rust

.rs

Java

.java

Language-aware fallback chunking (40+ extensions):

Category

Languages

Web

HTML, CSS, SCSS, LESS, Vue, Svelte

Systems

C, C++, C#, Zig, Nim

Mobile

Swift, Kotlin, Dart

Functional

Haskell, Scala, Clojure, Elixir, Erlang, F#

Scripting

Ruby, Perl, Lua, R, Bash/Zsh

Data/Config

JSON, YAML, TOML, XML, SQL, GraphQL, Protobuf

DevOps

Terraform, HCL, Dockerfile

Docs

Markdown

All other text files are chunked by line range. Binary files are skipped.


Documentation

Page

Content

What is CCE? (Complete Guide)

Setup, tools, how it works, FAQ

How to Save Claude Code Tokens

Cost breakdown and savings guide

Benchmark Deep Dive

Full FastAPI benchmark methodology

Comparison with Alternatives

CCE vs Cursor, Aider, Continue, Greptile

Examples

Real conversations with Claude

How It Works

Full 9-stage pipeline

CLI Reference

Every command with output

Configuration

All config options


FAQ

Does CCE affect response quality?

No. Quality stays the same or slightly improves.

CCE replaces "dump the entire file" with "search for the relevant function." The model still gets the code it needs (0.90 Recall@10 in benchmarks). Less irrelevant context means less noise competing for attention, which can improve the model's focus on your actual question.

How does output token savings work?

CCE writes output compression rules directly into your agent's instruction files (CLAUDE.md, AGENTS.md, .cursorrules, etc.) during cce init. These rules apply to the entire session, not just CCE tool responses, so every reply from the agent follows them.

Set the level in cce.yaml:

compression:
  output: max       # off | lite | standard | max

Then re-run cce init to update instruction files. Or change at runtime:

set_output_level output_level=max

Level

Savings

What it does

off

0%

No compression

lite

~25%

Removes filler/hedging/pleasantries + diff-only for code changes

standard

~70%

Drops articles, fragments, short synonyms + diff-only for code

max

~80%

Telegraphic style + diff-only for code

Default is standard. All levels include code output rules that tell the model to show only changed lines (not full file rewrites), which is where most output tokens go in coding sessions. The max level produces very terse prose (similar to "caveman mode"). Code blocks, paths, and commands are never compressed regardless of level.

Where do the savings come from?

Most savings are input tokens (what goes into the model):

Layer

Type

Typical savings

Retrieval

Input

94% (full files โ†’ relevant chunks)

Chunk compression

Input

89% (chunks โ†’ signatures)

Grammar compression

Input

13% (article/filler removal)

Turn summarization

Input

varies (session history)

Progressive disclosure

Input

varies (tool payloads)

Output compression

Output

25-80% (depends on level)

Output tokens cost 5x more per token (e.g. Opus: $15/1M input vs $75/1M output), so even a small output reduction has outsized cost impact.


Roadmap

  • Multi-repo benchmarks (FastAPI, chi, fiber)

  • More benchmarks (Django, Express)

  • Tree-sitter support for C, C++, Ruby, Swift, Kotlin

  • Docker support for remote mode

See CHANGELOG.md for shipped features.


Contributing

Contributions welcome. See https://github.com/elara-labs/code-context-engine/blob/main/CONTRIBUTING.md for setup.


License

MIT. See LICENSE.

Authors

Acknowledgments

Claude Code ยท MCP ยท sqlite-vec ยท Tree-sitter ยท fastembed ยท Ollama


A
license - permissive license
-
quality - not tested
B
maintenance

Maintenance

โ€“Maintainers
21hResponse time
1dRelease cycle
17Releases (12mo)
Issues opened vs closed

Resources

Unclaimed servers have limited discoverability.

Looking for Admin?

If you are the server author, to access and configure the admin panel.

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/elara-labs/code-context-engine'

If you have feedback or need assistance with the MCP directory API, please join our Discord server