# PyTorch Documentation Search Tool (Project Paused)

A semantic search prototype for PyTorch documentation with command-line capabilities.

## Current Status (April 19, 2025)

**⚠️ This project is currently paused for significant redesign.**

The tool provides a basic command-line search interface for PyTorch documentation but requires substantial improvements in several areas. While the core embedding and search functionality works at a basic level, both relevance quality and MCP integration require additional development.

### Example Output

```
$ python scripts/search.py "How are multi-attention heads plotted out in PyTorch?"
Found 5 results for 'How are multi-attention heads plotted out in PyTorch?':

--- Result 1 (code) ---
Title: plot_visualization_utils.py
Source: plot_visualization_utils.py
Score: 0.3714
Snippet: # models. Let's start by analyzing the output of a Mask-RCNN model. Note that...

--- Result 2 (code) ---
Title: plot_transforms_getting_started.py
Source: plot_transforms_getting_started.py
Score: 0.3571
Snippet: https://github.com/pytorch/vision/tree/main/gallery/...
```

## What Works

- ✅ **Basic Semantic Search**: Command-line interface for querying PyTorch documentation
- ✅ **Vector Database**: Functional ChromaDB integration for storing and querying embeddings
- ✅ **Content Differentiation**: Distinguishes between code and text content
- ✅ **Interactive Mode**: Option to run continuous interactive queries in a session

## What Needs Improvement

- ❌ **Relevance Quality**: Moderate similarity scores (0.35-0.37) indicate suboptimal results
- ❌ **Content Coverage**: Specialized topics may have insufficient representation in the database
- ❌ **Chunking Strategy**: Current approach breaks documentation at arbitrary points
- ❌ **Result Presentation**: Snippets are too short and lack sufficient context
- ❌ **MCP Integration**: Connection timeout issues prevent Claude Code integration

## Getting Started

### Environment Setup

Create a conda environment with all dependencies:

```bash
conda env create -f environment.yml
conda activate pytorch_docs_search
```

### API Key Setup

The tool requires an OpenAI API key for embedding generation:

```bash
export OPENAI_API_KEY=your_key_here
```

## Command-line Usage

```bash
# Search with a direct query
python scripts/search.py "your search query here"

# Run in interactive mode
python scripts/search.py --interactive

# Additional options
python scripts/search.py "query" --results 5    # Limit to 5 results
python scripts/search.py "query" --filter code  # Only code results
python scripts/search.py "query" --json         # Output in JSON format
```

## Project Architecture

- `ptsearch/core/`: Core search functionality (database, embedding, search)
- `ptsearch/config/`: Configuration management
- `ptsearch/utils/`: Utility functions and logging
- `scripts/`: Command-line tools
- `data/`: Embedded documentation and database
- `ptsearch/protocol/`: MCP protocol handling (currently unused)
- `ptsearch/transport/`: Transport implementations (STDIO, SSE) (currently unused)

## Why This Project Is Paused

After evaluating the current implementation, we've identified several challenges that require significant redesign:

1. **Data Quality Issues**: The current embedding approach doesn't capture semantic relationships between PyTorch concepts effectively enough. Relevance scores around 0.35-0.37 are too low for a quality user experience.
2. **Chunking Limitations**: Our current method divides documentation into chunks based on character count rather than conceptual boundaries, leading to fragmented results.
3. **MCP Integration Problems**: Despite multiple implementation approaches, we encountered persistent timeout issues when attempting to integrate with Claude Code:
   - STDIO integration failed at connection establishment
   - Flask server with SSE transport couldn't maintain stable connections
   - UVX deployment experienced similar timeout issues

## Future Roadmap

When development resumes, we plan to focus on:

1. **Improved Chunking Strategy**: Implement semantic chunking that preserves conceptual boundaries
2. **Enhanced Result Formatting**: Provide more context and better snippet selection
3. **Expanded Documentation Coverage**: Ensure comprehensive representation of all PyTorch topics
4. **MCP Integration Redesign**: Work with the Claude team to resolve timeout issues

## Development

### Running Tests

```bash
pytest -v tests/
```

### Format Code

```bash
black .
```

## License

MIT License
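
### Sketch: Heading-Aware Chunking

The chunking limitation described under "Why This Project Is Paused" (splitting by character count rather than conceptual boundaries) could be approached roughly as follows. This is a minimal sketch, not code from `ptsearch`: the function name `chunk_by_headings` and the `max_chars` parameter are hypothetical, and it assumes the documentation is available as markdown text. It splits at headings first, ignores `#` lines inside fenced code blocks, and only falls back to paragraph boundaries when a section is too long.

```python
import re
from typing import Dict, List

HEADING_RE = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_by_headings(markdown_text: str, max_chars: int = 1000) -> List[Dict[str, str]]:
    """Split markdown into chunks at heading boundaries (hypothetical helper).

    Sections longer than max_chars are split further at blank-line
    (paragraph) boundaries. A single paragraph longer than max_chars is
    kept whole rather than cut mid-sentence.
    """
    sections: List[Dict[str, str]] = []
    title = ""
    buf: List[str] = []
    in_fence = False

    def flush() -> None:
        # Store the buffered section only if it has non-blank content.
        text = "\n".join(buf).strip()
        if text:
            sections.append({"title": title, "text": text})

    for line in markdown_text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # track fenced code so '# comment' is not a heading
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            title = match.group(1).strip()
            buf = []
        else:
            buf.append(line)
    flush()

    chunks: List[Dict[str, str]] = []
    for section in sections:
        if len(section["text"]) <= max_chars:
            chunks.append(section)
            continue
        current = ""
        for para in re.split(r"\n\s*\n", section["text"]):
            candidate = f"{current}\n\n{para}" if current else para
            if current and len(candidate) > max_chars:
                chunks.append({"title": section["title"], "text": current})
                current = para
            else:
                current = candidate
        if current:
            chunks.append({"title": section["title"], "text": current})
    return chunks
```

Each chunk keeps its section title, which could be stored as metadata alongside the embedding so that results carry more context than a bare snippet. Whether this improves the relevance scores reported above is untested.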
