Provides real-time access to Python codebase structures by indexing function metadata, parameters, and docstrings to help identify existing implementations and prevent code duplication.
Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type
@followed by the MCP server name and your instructions, e.g., "@CodeWalkerSearch for any existing functions that handle CSV data loading."
That's it! The server will respond to your query, and you can continue using it as needed.
Here is a step-by-step guide with screenshots.
CodeWalker
Walk your codebase before writing new code.
CodeWalker is an MCP server that gives Claude Code real-time access to your Python codebase structure, enabling AI-assisted development that reuses existing code instead of duplicating it.
The Problem: AI Code Duplication
What Happens Without CodeWalker
When Claude Code writes code, it can't see what already exists in your codebase. This causes a cascade of problems:
Day 1: You ask Claude to add CSV loading functionality
# Claude creates: src/data_loader.py
def load_csv_file(path):
return pd.read_csv(path)Day 5: Different feature needs CSV loading
# Claude creates: src/importer.py (Claude has no memory of data_loader.py)
def load_csv_data(filepath):
df = pd.read_csv(filepath)
return dfDay 10: Another feature, another duplicate
# Claude creates: src/utils.py (Claude still doesn't know about the others)
def read_csv(file_path):
return pd.read_csv(file_path, low_memory=False) # Now with different behavior!Result after 2 weeks:
🔴 7 different CSV loading functions across your codebase
🔴 Inconsistent behavior (one uses
low_memory=False, others don't)🔴 Impossible to maintain (bug fixes need to be applied 7 times)
🔴 Unpredictable behavior (which implementation gets called depends on imports)
🔴 Code review nightmare (reviewing duplicate implementations wastes time)
The Cost of Code Duplication
This isn't just messy - it's expensive:
Impact | Cost |
Development Time | 30-40% wasted rewriting existing code |
Bug Fixes | Same bug appears in multiple places, fixed multiple times |
Code Reviews | Reviewers waste time on duplicate implementations |
Onboarding | New developers confused by inconsistent patterns |
Technical Debt | Duplicates diverge over time, creating maintenance burden |
Testing | Same logic tested multiple times (or worse, inconsistently) |
Real Example: A codebase with 800 functions had 52.7% duplication rate - 422 functions were duplicates. That's thousands of wasted lines of code.
How CodeWalker Solves This
CodeWalker indexes your codebase and lets Claude search before writing:
With CodeWalker
Day 1: You ask Claude to add CSV loading
Claude (internal): Let me check if CSV loading already exists...
> search_functions("load csv")
Found: load_csv_file() in src/data_loader.py
Claude: "I found an existing CSV loader. Let me use it instead of creating a new one."Result:
# Claude imports existing function
from src.data_loader import load_csv_file
data = load_csv_file(path)Day 5, 10, 15...: Same pattern - Claude finds and reuses existing code
Result after 2 weeks:
✅ 1 canonical CSV loading function (not 7)
✅ Consistent behavior across entire codebase
✅ Easy to maintain (fix bugs once, fixed everywhere)
✅ Predictable behavior (one implementation = one behavior)
✅ Fast code reviews (reviewers see reuse, not duplication)
Why This Problem Exists
LLMs Lack Architectural Awareness
Claude Code (and all LLMs) have a fundamental limitation:
❌ Can't see your codebase structure ❌ Can't search across files ❌ Can't remember what exists ❌ Can't detect duplicates
The technical reason: When Claude writes code, it only sees:
The current file you're editing
Recent conversation context
Maybe a few related files you showed it
What Claude DOESN'T see:
That
load_csv_file()already exists insrc/data_loader.pyThat 3 other files have similar functions
That your team has a canonical implementation
Your codebase architecture and patterns
Result: Claude invents new implementations instead of reusing existing ones.
The "10 Developers, 0 Communication" Problem
Working with AI without CodeWalker is like having 10 developers who never talk to each other:
Developer 1 (Monday): Creates load_csv_file()
Developer 2 (Tuesday): Doesn't know about it, creates load_csv_data()
Developer 3 (Wednesday): Doesn't know about either, creates read_csv()
Developer 4 (Thursday): Creates import_csv()
... and so onEach "developer" (AI session) works in isolation, creating duplicates because they can't see what others did.
CodeWalker fixes this by giving AI a "shared memory" of your entire codebase.
Real-World Impact
Case Study: Elisity Project
Before CodeWalker:
800 total functions
422 duplicates (52.7% duplication rate)
33 direct
pd.read_csv()calls (should use centralized loader)11 duplicate
print_summary()implementations3 duplicate
load_flow_data()functions with diverging behavior
With CodeWalker:
Claude finds existing implementations before writing new code
Duplication rate drops to near-zero for new code
Codebase becomes more maintainable over time
Time Saved:
Development: 30-40% less time rewriting existing code
Code Review: Reviewers focus on new logic, not duplicate detection
Bug Fixes: Fix once instead of hunting down 3-7 duplicates
How It Works
Architecture
┌─────────────────────┐
│ Your Codebase │
│ (Python files) │
└──────────┬──────────┘
│
│ AST Parser extracts
│ function metadata
▼
┌─────────────────────┐
│ SQLite Index │
│ (functions.db) │
│ │
│ • Function names │
│ • Parameters │
│ • Locations │
│ • Docstrings │
└──────────┬──────────┘
│
│ Claude queries via
│ MCP protocol
▼
┌─────────────────────┐
│ Claude Code │
│ │
│ "Does load_csv │
│ already exist?" │
│ │
│ → Yes! Use it │
└─────────────────────┘What Gets Indexed
For each function in your codebase:
Name -
load_csv_fileLocation -
src/data_loader.py:42Parameters -
(path, encoding='utf-8')Docstring - First line for quick understanding
Type - Regular function, async function, or class method
Decorators -
@staticmethod,@cached, etc.
What's NOT stored: Function bodies, comments, string literals (only structural metadata).
Search Performance
Parsing: ~100-200 files/second
Indexing: ~1000 functions/second
Search: Sub-millisecond SQLite queries
Database size: ~1 KB per function
Example: 800 functions = ~800 KB database, indexed in < 5 seconds, searched in < 1ms.
Features
🔍 Search Before Writing
Tool: search_functions(query, exact=False)
Find existing functions before Claude writes new code:
> search_functions("load csv")
Found 3 functions:
• load_csv_file(path, encoding='utf-8')
Location: src/data_loader.py:42
Docs: Load CSV file with proper encoding handling
• FlowDataLoader.load_flows(flow_path, site_label)
Location: modules/flow_loader.py:98
Docs: Load flow data from CSV with site labeling
• read_raw_csv(filepath)
Location: legacy/importer.py:156
Docs: Legacy CSV reader (deprecated)Claude sees these results and chooses to import the canonical implementation instead of creating a new one.
🔁 Detect Duplicates
Tool: find_duplicates()
Find functions with the same name in multiple files:
> find_duplicates()
⚠️ Found 3 function names with multiple implementations:
**load_flow_data** (3 implementations):
- cohesion_analyzer.py:253
- legacy/community_detector.py:440
- policy_group_clustering.py:497
**format_bytes** (2 implementations):
- utils.py:88
- helpers.py:124
💡 Recommendation: Consolidate into single canonical implementations.Use this to audit your codebase and identify consolidation opportunities.
🎯 Similar Signatures
Tool: find_similar_signatures(min_params=2)
Find functions with the same parameters (might be doing the same thing):
> find_similar_signatures(min_params=2)
Found 2 signature groups:
**Signature: (data, output_path)** - 4 functions:
• save_to_csv in exporter.py:67
• write_csv_file in writer.py:134
• export_data in utils.py:203
• save_results in analyzer.py:445
💡 These functions likely do the same thing with different names.Catches semantic duplicates - functions that do the same thing but have different names.
📂 Multi-Project Support
Work on multiple projects without reconfiguring:
# One-time setup
> register_project("project-a", "/Users/jose/Projects/project-a")
> register_project("project-b", "/Users/jose/Projects/project-b")
# Daily use - auto-detects from your current directory
cd ~/Projects/project-a
> search_functions("auth")
[Auto-detected: project-a]
Found 5 functions...
cd ~/Projects/project-b
> search_functions("auth")
[Auto-detected: project-b]
Found 3 functions...Features:
✅ Register unlimited projects
✅ Auto-detection from working directory
✅ Isolated indexes (no cross-contamination)
✅ Zero configuration switching
📊 Codebase Statistics
Tool: get_index_stats()
Understand your codebase at a glance:
> get_index_stats()
📊 CodeWalker Statistics:
Total Functions: 800
Total Files: 60
Unique Names: 765
Methods: 423
Async Functions: 67
Avg Parameters: 2.3
Duplication Rate: 4.4% (35 duplicates)
Last Indexed: 2026-03-18 10:35:00Track duplication rate over time to measure improvement.
Quick Start
1. Install
git clone https://github.com/[username]/codewalker.git
cd codewalker
pip install -r requirements.txt2. Configure Claude Code
Add to ~/.config/claude-code/mcp.json:
{
"mcpServers": {
"codewalker": {
"command": "python3",
"args": ["/absolute/path/to/codewalker/src/server.py"]
}
}
}3. Register Your Projects
Restart Claude Code, then:
> register_project("my-project", "/absolute/path/to/your/project")
🔄 Registering project: my-project
📁 Path: /absolute/path/to/your/project
⏳ Indexing project...
Found 800 functions
✅ Indexing complete!
Total Functions: 800
Total Files: 60
Unique Names: 7654. Start Using
CodeWalker now automatically prevents duplicate code:
You: "Add functionality to load CSV files"
Claude (internal):
> search_functions("load csv")
Found: load_csv_file() in src/data_loader.py
Claude: "I found an existing CSV loader at src/data_loader.py:42.
Let me use that instead of creating a new one:
from src.data_loader import load_csv_file
data = load_csv_file(path)Available Tools
Project Management
register_project(name, path)- Add a project to CodeWalkerlist_projects()- View all registered projectsunregister_project(name)- Remove a projectget_current_project()- Show which project is detected
Function Search
search_functions(query, exact)- Find functions by namefind_duplicates()- Detect duplicate function namesfind_similar_signatures(min_params)- Find functions with similar parametersget_file_functions(file_path)- List all functions in a fileget_index_stats()- View codebase statisticsreindex_repository()- Rebuild index after major changes
Use Cases
1. Prevent Duplication During Development
Before every implementation:
You: "Add user authentication"
Claude: Let me check if auth code already exists...
> search_functions("auth")
Found: authenticate_user() in src/auth.py
Claude: "I found existing auth code. Let me use it..."2. Onboard to New Codebases
Explore unfamiliar code:
> search_functions("export")
Found 12 functions with "export" in the name
> get_file_functions("src/exporter.py")
Lists all 8 functions in the file with signatures and docsQuickly understand what exists before writing new code.
3. Refactoring and Cleanup
Find consolidation opportunities:
> find_duplicates()
Found 15 duplicate function names
> find_similar_signatures()
Found 8 signature groups (functions with same params)Systematically eliminate duplication.
4. Code Review
Reviewers can verify reuse:
Reviewer: "Why didn't you use the existing loader?"
Developer: "Let me check..."
> search_functions("load")
Found 3 loaders I didn't know about!Catch missed reuse opportunities during review.
Comparison: With vs Without CodeWalker
Scenario | Without CodeWalker | With CodeWalker |
Add CSV loading | Creates 7th duplicate | Finds and reuses existing |
Authentication needed | Creates new auth from scratch | Imports existing |
Format bytes | Creates 3rd | Uses canonical implementation |
Code review | "Why is this duplicated?" | "Good reuse of existing code" |
Bug in duplicates | Fix bug in 7 different places | Fix once, fixed everywhere |
Onboarding | "Which loader should I use?" | Clear: one canonical implementation |
Duplication rate | 40-60% (typical for AI projects) | < 5% (with CodeWalker) |
Graph Theory Connection
CodeWalker treats your codebase as a graph:
Vertices - Functions, classes, modules
Edges - Imports, function calls, dependencies
Walking - Traversing the graph to discover existing code
Graph concepts:
Graph walk - Sequence of vertices (functions) and edges (calls)
Traversal - Systematic exploration of the graph structure
Random walks - Discovery algorithms (like PageRank)
Tree walks - AST traversal (what the parser does)
This isn't just a metaphor - CodeWalker literally walks your Abstract Syntax Tree (AST) to build the function graph.
Roadmap & Future Development
CodeWalker v2.0.0 solves the core AI code duplication problem for Python projects. Future versions will add deeper analysis, broader language support, and smarter automation.
🔥 High Priority
Why these matter: These features provide immediate value for existing users and are most frequently requested.
Incremental indexing - Currently, reindexing rebuilds the entire database. Incremental indexing would only update changed files, making reindexing 10-100x faster for large codebases. Impact: Seconds instead of minutes for 10k+ function codebases.
Near-duplicate detection - Functions like
load_csv,load_csv_data, andread_csv_fileare semantically duplicates but have different names. Levenshtein distance matching would catch these "near-duplicates" that current exact/partial matching misses. Impact: Catch 20-30% more duplicates.Cross-project search - Search across all registered projects simultaneously. Useful for teams with shared utilities across multiple repos or monorepo users who want to find reusable code anywhere. Impact: Prevent reinventing wheels across project boundaries.
Call graph analysis - Track what calls what to enable "blast radius" analysis ("what breaks if I change this function?") and identify unused code. Impact: Safer refactoring, dead code detection.
🎯 Medium Priority
Why these matter: These features enhance CodeWalker's intelligence and reduce manual effort.
Semantic similarity (ML-based) - Detect functions that do the same thing with completely different names and signatures using embedding-based similarity. Example:
save_to_csv(data, path)andexport_results(df, filename)might be doing the same thing. Impact: Catch duplicates current signature matching misses.Auto-reindexing on file changes - Watch filesystem and automatically reindex when Python files change. No more manual
reindex_repository()calls. Impact: Zero-maintenance index that's always current.Multi-language support - Extend beyond Python to JavaScript, TypeScript, Go, Rust, Java. Same duplication prevention for polyglot codebases. Impact: Unified duplication prevention across entire stack.
Blast radius visualization - Show dependency trees and impact analysis when considering changes. "If I modify function X, these 15 functions are affected." Impact: Confident refactoring.
💡 Lower Priority
Why these matter: Nice-to-have features that improve developer experience but aren't critical to core functionality.
Web UI - Visual interface for browsing functions, viewing call graphs, and exploring codebase structure in a browser. Alternative to CLI-only workflow. Impact: Better onboarding experience, visual learners benefit.
VS Code extension - Native VS Code integration with inline suggestions ("⚠️ Similar function exists: use
load_csv_file()instead"). Impact: Proactive duplicate prevention during typing.Import suggestions - When Claude is about to write new code, automatically suggest existing imports. "You're about to write X, but Y already exists - import it?" Impact: Even less manual searching.
GitHub Action - CI/CD integration that fails PRs introducing duplicates above a threshold. Enforce duplication standards via automation. Impact: Prevent duplicates from ever being merged.
📊 Current Capabilities
What works today:
Language Support:
✅ Python (full support for functions, methods, async functions, decorators)
🚧 JavaScript, TypeScript, Go, Rust (on roadmap)
Analysis:
✅ Function names, signatures, locations, docstrings
✅ Parameter matching and signature comparison
✅ Duplicate detection (exact name matches)
🚧 Call graph analysis (planned)
🚧 Semantic similarity (planned)
🚧 Near-duplicate detection via Levenshtein distance (planned)
Indexing:
✅ Full repository indexing (~5 seconds for 800 functions)
✅ Manual reindexing on demand
🚧 Incremental updates (only changed files - planned)
🚧 Auto-reindexing on file changes (planned)
Search:
✅ Exact and partial name matching
✅ Parameter signature matching
✅ Multi-project support with auto-detection
🚧 Semantic search by behavior (planned)
🚧 Cross-project search (planned)
FAQ
Q: Does this work with other AI assistants?
Yes! CodeWalker uses the Model Context Protocol (MCP), which is an open standard. Any AI tool that supports MCP can use CodeWalker:
Claude Code (tested)
Claude Desktop (should work)
Other MCP-compatible tools
Q: How much overhead does indexing add?
Very little:
Initial indexing: ~5 seconds for 800 functions
Reindexing: ~5 seconds (full rebuild)
Search queries: < 1ms
Memory: ~10 MB for typical projects
You barely notice it's there.
Q: What if my codebase is huge?
CodeWalker scales well:
Tested on 800 functions / 60 files
Should handle 10,000+ functions easily (SQLite scales)
For massive codebases (100k+ functions), consider:
Incremental indexing (planned feature)
Multiple project registrations (already supported)
Excluding test files or generated code
Q: Can I use this on proprietary code?
Yes! Everything is local:
✅ Index stored locally (~/.codewalker)
✅ No data sent to external services
✅ No network requests during search
✅ Your code never leaves your machine
CodeWalker is 100% private.
Q: How is this different from IDE autocomplete?
Complementary, not competing:
IDE autocomplete:
Works in single file
Shows available imports
Type-aware suggestions
Real-time as you type
CodeWalker:
Works across entire codebase
Searches by semantic intent ("load csv")
Finds duplicates proactively
Used by AI during code generation
Use both - IDE for writing, CodeWalker for AI-assisted development.
Q: What about private/internal functions?
CodeWalker indexes everything:
Public functions: ✅ Indexed
Private functions (
_private): ✅ IndexedInternal functions (
__internal): ✅ Indexed
Why? Because you might want to reuse private functions too. Claude respects Python conventions (won't use _private from other modules without good reason), but knowing they exist prevents duplication.
Contributing
Contributions welcome! See CONTRIBUTING.md for guidelines.
Areas we need help:
Multi-language support (JavaScript, TypeScript, Go)
Incremental indexing
Semantic similarity detection
Performance optimization
License
MIT License - see LICENSE for details.
Free to use in personal and commercial projects.
Credits
Built to solve a real problem: Claude Code was creating duplicate implementations across a 60-file, 800-function codebase. CodeWalker eliminated the duplication.
Inspired by: Pharaoh (commercial tool for codebase intelligence)
Built with: Claude Sonnet 4.5 (dogfooding - using AI to build tools that improve AI)
Support
Issues: GitHub Issues
Discussions: GitHub Discussions
Documentation: See guides in this repository
Summary
Problem: AI assistants can't see your codebase, causing massive code duplication.
Solution: CodeWalker indexes your codebase and lets AI search before writing.
Result: 40-60% reduction in duplicate code, faster development, cleaner codebase.
Get Started:
pip install -r requirements.txt
# Configure MCP (see Quick Start above)
> register_project("my-project", "/path/to/project")
> search_functions("whatever you're about to write")Stop duplicating code. Start walking your codebase. 🚀
This server cannot be installed
Resources
Unclaimed servers have limited discoverability.
Looking for Admin?
If you are the server author, to access and configure the admin panel.