# Knowledge Graph Integration Plan for PT-MCP
> **"Where am I now?"** - Semantic context through integrated knowledge graphs
## Vision
PT-MCP will provide not just code structure, but **semantic meaning** by integrating:
- **YAGO 4.5**: Base knowledge graph with entities, relationships, and facts
- **Schema.org**: Domain-specific schemas for semantic typing
## The Paul Test Man Analogy
Just as Paul Test Man mapped signal coverage to answer "Can you hear me now?", PT-MCP maps semantic meaning to answer "Where am I now?" - giving AI assistants rich contextual understanding beyond syntax alone.
## Integration Architecture (Preliminary)
### Phase 1: Core Infrastructure
```
PT-MCP Server
    ↓
Context Analyzer (existing)
    ↓
Semantic Enricher (new)
    ├── YAGO Query Service
    │   ├── Entity Resolver
    │   ├── Relationship Mapper
    │   └── Fact Retriever
    └── Schema.org Mapper
        ├── Type Classifier
        ├── Property Extractor
        └── Vocabulary Builder
```
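The component boundaries above might translate into TypeScript interfaces roughly like the sketch below; every name and method signature here is a placeholder to be revisited after the research phase, not a committed API.
```typescript
// Hypothetical interfaces for the Phase 1 components; names and signatures
// are placeholders, not a final API.
interface YagoEntity {
  entity: string;                                   // e.g. "React"
  type: string;                                     // e.g. "SoftwareFramework"
  relationships: { predicate: string; object: string }[];
  facts: string[];
}

interface YagoQueryService {
  resolveEntity(term: string): Promise<YagoEntity | null>;                   // Entity Resolver
  mapRelationships(entity: string): Promise<YagoEntity["relationships"]>;    // Relationship Mapper
  retrieveFacts(entity: string): Promise<string[]>;                          // Fact Retriever
}

interface SchemaOrgMapper {
  classifyType(codebaseType: string): string;                                // Type Classifier
  extractProperties(meta: Record<string, unknown>): Record<string, unknown>; // Property Extractor
  buildVocabulary(types: string[]): Record<string, string>;                  // Vocabulary Builder
}

interface SemanticEnricher {
  yago: YagoQueryService;
  schema: SchemaOrgMapper;
  enrich(analysis: unknown): Promise<unknown>;                               // combines both services
}
```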
### Phase 2: Knowledge Graph Integration
#### YAGO 4.5 Integration
**Purpose**: Provide base knowledge graph segments relevant to code context
**Approach Options** (to be decided after research):
1. **Remote Query**: SPARQL endpoint queries (sketched below)
2. **Local Subset**: Download relevant domain data
3. **Hybrid**: Local cache + remote fallback
4. **Embedded Triple Store**: Run local RDF database
**Expected Capabilities**:
- Entity recognition (e.g., "React" → framework entity)
- Relationship extraction (e.g., "uses TypeScript" → hasLanguage)
- Fact retrieval (e.g., "created by Facebook in 2013")
- Concept linking (e.g., "REST API" → HTTP methods → status codes)
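To give a feel for the remote-query option (Option 1), the enricher could POST a SPARQL SELECT to a YAGO endpoint and flatten the standard JSON results into triples. The endpoint URL and the rdfs:label matching strategy below are assumptions to be verified during research.
```typescript
// Minimal remote-lookup sketch. The endpoint URL and the label-matching
// query are assumptions, not confirmed YAGO 4.5 details.
const YAGO_ENDPOINT = "https://yago-knowledge.org/sparql/query"; // assumed

async function lookupEntity(label: string): Promise<{ s: string; p: string; o: string }[]> {
  // Naive label match; real code should escape the label and scope the query.
  const query = `
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?s ?p ?o WHERE {
      ?s rdfs:label "${label}"@en .
      ?s ?p ?o .
    } LIMIT 50`;

  const res = await fetch(YAGO_ENDPOINT, {
    method: "POST",
    headers: {
      "Content-Type": "application/x-www-form-urlencoded",
      Accept: "application/sparql-results+json",
    },
    body: new URLSearchParams({ query }).toString(),
  });
  if (!res.ok) throw new Error(`SPARQL query failed: ${res.status}`);

  // Flatten the standard SPARQL JSON results format into simple triples.
  const json = await res.json();
  return json.results.bindings.map((b: any) => ({
    s: b.s.value,
    p: b.p.value,
    o: b.o.value,
  }));
}
```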
#### Schema.org Integration
**Purpose**: Provide domain-specific semantic types and vocabularies for annotating codebase context
**Domain Mapping**:
```
CodebaseType → Schema.org Type
├── Web Application → WebApplication
├── API Server → WebAPI / APIReference
├── Mobile App → MobileApplication
├── Library/Package → SoftwareLibrary
├── Database Schema → Dataset
├── Documentation → TechArticle / HowTo
└── Test Suite → SoftwareTest (if available)
```
**Property Mapping**:
- `dependencies` → schema:softwareRequirements
- `version` → schema:softwareVersion
- `authors` → schema:author
- `license` → schema:license
- `description` → schema:description
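Applying the type and property mappings above could start as a small pure function that turns package metadata into a JSON-LD document; the `PackageMeta` input shape is a hypothetical stand-in for whatever `analyze_codebase` actually emits.
```typescript
// Sketch of the Schema.org mapping; PackageMeta is a hypothetical input shape
// and TYPE_MAP mirrors the domain mapping table above.
interface PackageMeta {
  name: string;
  description?: string;
  version?: string;
  authors?: string[];
  license?: string;
  dependencies?: string[];
  codebaseType: "web" | "api" | "mobile" | "library" | "database" | "docs";
}

const TYPE_MAP: Record<PackageMeta["codebaseType"], string> = {
  web: "WebApplication",
  api: "WebAPI",
  mobile: "MobileApplication",
  library: "SoftwareLibrary",
  database: "Dataset",
  docs: "TechArticle",
};

function toSchemaOrg(meta: PackageMeta): Record<string, unknown> {
  return {
    "@context": "https://schema.org",
    "@type": TYPE_MAP[meta.codebaseType],
    name: meta.name,
    description: meta.description,                                      // schema:description
    softwareVersion: meta.version,                                      // schema:softwareVersion
    author: meta.authors?.map((name) => ({ "@type": "Person", name })), // schema:author
    license: meta.license,                                              // schema:license
    softwareRequirements: meta.dependencies,                            // schema:softwareRequirements
  };
}
```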
### Phase 3: Context Enhancement
#### New MCP Tool: `enrich_context`
```typescript
{
  path: string;
  analysis_result: any;                                        // from analyze_codebase
  enrichment_level: 'minimal' | 'standard' | 'comprehensive';
  include_yago: boolean;
  include_schema: boolean;
}
```
**Returns**:
```typescript
{
  codebase_context: {...},   // existing analysis
  knowledge_graph: {
    yago_entities: [
      {
        entity: "React",
        type: "SoftwareFramework",
        relationships: [
          { predicate: "developedBy", object: "Facebook" },
          { predicate: "writtenIn", object: "JavaScript" }
        ],
        facts: [...]
      }
    ],
    schema_annotations: {
      "@context": "https://schema.org",
      "@type": "WebApplication",
      "name": "...",
      "applicationCategory": "DeveloperApplication",
      "softwareVersion": "1.0.0",
      "programmingLanguage": ["TypeScript", "JavaScript"]
    }
  }
}
```
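Wiring the pieces together, the `enrich_context` handler might look roughly like the sketch below, reusing the hypothetical `YagoQueryService` and `SchemaOrgMapper` interfaces from Phase 1; `extractEntityNames` is likewise a placeholder helper.
```typescript
// Rough orchestration sketch; yago and schema are the hypothetical services
// from Phase 1, and extractEntityNames is a placeholder helper.
interface EnrichContextParams {
  path: string;
  analysis_result: any;
  enrichment_level: "minimal" | "standard" | "comprehensive";
  include_yago: boolean;
  include_schema: boolean;
}

// Hypothetical: pull candidate entity names ("React", "Next.js", ...) out of
// the existing analysis result.
declare function extractEntityNames(analysis: any): string[];

async function enrichContext(
  params: EnrichContextParams,
  yago: YagoQueryService,
  schema: SchemaOrgMapper
) {
  let yagoEntities: YagoEntity[] = [];
  if (params.include_yago) {
    const names = extractEntityNames(params.analysis_result);
    const resolved = await Promise.all(names.map((n) => yago.resolveEntity(n)));
    yagoEntities = resolved.filter((e): e is YagoEntity => e !== null);
  }

  const schemaAnnotations = params.include_schema
    ? schema.extractProperties(params.analysis_result)
    : undefined;

  return {
    codebase_context: params.analysis_result, // existing analysis, passed through
    knowledge_graph: {
      yago_entities: yagoEntities,
      schema_annotations: schemaAnnotations,
    },
  };
}
```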
## Technical Stack (Proposed)
### Dependencies to Add
- `rdflib` or `n3` - RDF processing
- `jsonld` - JSON-LD parsing for Schema.org
- `sparql-http-client` - SPARQL queries (if remote)
- `levelgraph` or `quadstore` - Local triple store (if embedded)
- TBD based on agent research
### Data Storage
- **Option 1**: In-memory cache with TTL
- **Option 2**: Local RDF database (LevelGraph, RDFStore)
- **Option 3**: File-based cache (JSON-LD files)
- **Option 4**: Hybrid (memory + disk)
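For Options 1 and 4, the in-memory layer can start as a Map keyed by query string with a TTL check on read; this sketch deliberately ignores size bounds and eviction, which the disk-backed variant would need to address.
```typescript
// Minimal in-memory TTL cache (Option 1). No size bound or LRU eviction yet;
// Option 4 would add a disk-backed layer behind the same interface.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  constructor(private ttlMs: number) {}

  get(key: string): V | undefined {
    const hit = this.store.get(key);
    if (!hit) return undefined;
    if (Date.now() > hit.expiresAt) {
      this.store.delete(key); // lazily drop expired entries
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}

// Usage: cache SPARQL result sets for an hour, keyed by the query string.
const sparqlCache = new TtlCache<unknown>(60 * 60 * 1000);
```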
### Query Strategy
- **Entity Linking**: Match code entities to knowledge graph entities
- **Context Window**: Retrieve relevant subgraph within N hops
- **Relevance Scoring**: Rank entities by contextual relevance
- **Caching**: Cache frequent queries to reduce latency
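Relevance scoring could begin with something as crude as lexical overlap minus a hop-distance penalty, to be replaced once the research suggests a better signal; the weights below are arbitrary placeholders.
```typescript
// Toy relevance score: lexical overlap with the code context minus a penalty
// per hop from the seed entity. The 0.5 weight is an arbitrary placeholder.
function relevanceScore(label: string, contextTerms: Set<string>, hops: number): number {
  const tokens = label.toLowerCase().split(/\W+/).filter(Boolean);
  const overlap = tokens.filter((t) => contextTerms.has(t)).length;
  return overlap - 0.5 * hops;
}

// Rank candidate entities and keep the top K for the enriched context.
function topEntities<T extends { label: string; hops: number }>(
  candidates: T[],
  contextTerms: Set<string>,
  k = 10
): T[] {
  return [...candidates]
    .sort(
      (a, b) =>
        relevanceScore(b.label, contextTerms, b.hops) -
        relevanceScore(a.label, contextTerms, a.hops)
    )
    .slice(0, k);
}
```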
## Use Cases
### Use Case 1: Framework Recognition
```
Input: analyze_codebase finds "React" and "Next.js"
Enhanced Output:
- YAGO: React (JavaScript library, created 2013, by Facebook)
- YAGO: Next.js (React framework, created by Vercel)
- Schema: WebApplication with SoftwareFramework annotations
```
### Use Case 2: API Documentation
```
Input: analyze_codebase finds REST API with Express
Enhanced Output:
- YAGO: REST (architectural style), HTTP (protocol)
- Schema: WebAPI type with APIReference properties
- Relationships: usesProtocol → HTTP, hasEndpoint → [...]
```
### Use Case 3: Database Context
```
Input: analyze_codebase finds PostgreSQL usage
Enhanced Output:
- YAGO: PostgreSQL (RDBMS, SQL dialect, ACID compliant)
- Schema: Dataset type with database properties
- Facts: Version requirements, performance characteristics
```
## Success Metrics
1. **Accuracy**: >90% correct entity linking
2. **Relevance**: >80% of returned KG segments are contextually useful
3. **Performance**: <2s latency for knowledge graph enrichment
4. **Coverage**: Support for 50+ programming languages/frameworks initially
## Open Questions (To Be Answered by Research)
1. **YAGO Access**:
- What's the current YAGO 4.5 access method?
- Do they have a programming domain subset?
- What's the query performance?
2. **Schema.org**:
- Are there software engineering extensions?
- How to validate and infer types?
- What's the best JSON-LD library?
3. **W3C Standards**:
- Which RDF format is optimal?
- SPARQL vs. GraphQL vs. REST?
- Best practices for embedded vs. remote?
4. **Serena Patterns**:
- Does Serena use any semantic web tech?
- What patterns can we reuse?
- Any performance lessons learned?
## Next Steps
1. ✅ Launch research agents (4 agents in parallel)
2. ⏳ Wait for research results
3. 📋 Create detailed technical specification
4. 📋 Implement proof-of-concept for YAGO integration
5. 📋 Implement Schema.org annotation system
6. 📋 Add `enrich_context` MCP tool
7. 📋 Write comprehensive tests
8. 📋 Optimize for performance
9. 📋 Document usage patterns
---
**Status**: Research phase in progress
**Agents Running**: 4 (YAGO, Schema.org, Serena, W3C)
**Next Update**: After agent research completes