Codebase MCP Server

codebase-mcp
specs
001-build-a-production

data-model.md•16.6 KiB

# Data Model: Production-Grade MCP Server **Feature**: Production-Grade MCP Server for Semantic Code Search **Date**: 2025-10-06 **Phase**: Phase 1 - Design & Contracts ## Entity Relationship Overview ``` Repository 1──────* CodeFile 1──────* CodeChunk 1─────1 Embedding │ │ 1 │ * ChangeEvent Task 1─────* TaskPlanningReference 1─────* TaskBranchLink 1─────* TaskCommitLink 1─────* TaskStatusHistory ``` ## Core Entities ### 1. Repository Represents a code repository being indexed. **SQLAlchemy Model**: ```python from sqlalchemy import Column, String, DateTime, Boolean from sqlalchemy.dialects.postgresql import UUID from sqlalchemy.orm import relationship from datetime import datetime import uuid class Repository(Base): __tablename__ = "repositories" id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4) path: Mapped[str] = mapped_column(String, unique=True, nullable=False, index=True) name: Mapped[str] = mapped_column(String, nullable=False) last_indexed_at: Mapped[datetime | None] = mapped_column(DateTime, nullable=True) is_active: Mapped[bool] = mapped_column(Boolean, default=True, nullable=False) created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False) # Relationships code_files: Mapped[list["CodeFile"]] = relationship(back_populates="repository", cascade="all, delete-orphan") ``` **Pydantic Schema**: ```python from pydantic import BaseModel, Field from datetime import datetime from uuid import UUID class RepositoryCreate(BaseModel): path: str = Field(..., description="Absolute path to repository") name: str = Field(..., description="Repository display name") class RepositoryResponse(BaseModel): id: UUID path: str name: str last_indexed_at: datetime | None is_active: bool created_at: datetime file_count: int = Field(..., description="Number of indexed files") model_config = {"from_attributes": True} ``` --- ### 2. CodeFile Represents a source code file in the repository. **SQLAlchemy Model**: ```python class CodeFile(Base): __tablename__ = "code_files" id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4) repository_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("repositories.id"), nullable=False) path: Mapped[str] = mapped_column(String, nullable=False, index=True) relative_path: Mapped[str] = mapped_column(String, nullable=False) content_hash: Mapped[str] = mapped_column(String(64), nullable=False) # SHA-256 size_bytes: Mapped[int] = mapped_column(Integer, nullable=False) language: Mapped[str | None] = mapped_column(String, nullable=True) modified_at: Mapped[datetime] = mapped_column(DateTime, nullable=False) indexed_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False) is_deleted: Mapped[bool] = mapped_column(Boolean, default=False, nullable=False) deleted_at: Mapped[datetime | None] = mapped_column(DateTime, nullable=True) created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False) # Relationships repository: Mapped["Repository"] = relationship(back_populates="code_files") chunks: Mapped[list["CodeChunk"]] = relationship(back_populates="code_file", cascade="all, delete-orphan") change_events: Mapped[list["ChangeEvent"]] = relationship(back_populates="code_file", cascade="all, delete-orphan") __table_args__ = ( Index("ix_code_files_repo_path", "repository_id", "relative_path", unique=True), ) ``` **Pydantic Schema**: ```python class CodeFileResponse(BaseModel): id: UUID repository_id: UUID relative_path: str content_hash: str size_bytes: int language: str | None modified_at: datetime indexed_at: datetime is_deleted: bool chunk_count: int = Field(..., description="Number of code chunks") model_config = {"from_attributes": True} ``` --- ### 3. CodeChunk Represents a semantic chunk of code (function, class, or logical block). **SQLAlchemy Model**: ```python from pgvector.sqlalchemy import Vector class CodeChunk(Base): __tablename__ = "code_chunks" id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4) code_file_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("code_files.id"), nullable=False) content: Mapped[str] = mapped_column(Text, nullable=False) start_line: Mapped[int] = mapped_column(Integer, nullable=False) end_line: Mapped[int] = mapped_column(Integer, nullable=False) chunk_type: Mapped[str] = mapped_column(String, nullable=False) # function, class, block embedding: Mapped[Vector] = mapped_column(Vector(768), nullable=True) # nomic-embed-text created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False) # Relationships code_file: Mapped["CodeFile"] = relationship(back_populates="chunks") __table_args__ = ( Index("ix_chunks_embedding_cosine", "embedding", postgresql_using="hnsw", postgresql_with={"m": 16, "ef_construction": 64}, postgresql_ops={"embedding": "vector_cosine_ops"}), ) ``` **Pydantic Schema**: ```python class CodeChunkCreate(BaseModel): code_file_id: UUID content: str start_line: int = Field(..., ge=1) end_line: int = Field(..., ge=1) chunk_type: str = Field(..., pattern="^(function|class|block)$") class CodeChunkResponse(BaseModel): id: UUID code_file_id: UUID content: str start_line: int end_line: int chunk_type: str has_embedding: bool = Field(..., description="Whether embedding is generated") similarity_score: float | None = Field(None, description="Similarity score when from search") model_config = {"from_attributes": True} ``` --- ### 4. Embedding Metadata about embedding generation (stored with CodeChunk, tracked separately for analytics). **SQLAlchemy Model**: ```python class EmbeddingMetadata(Base): __tablename__ = "embedding_metadata" id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4) model_name: Mapped[str] = mapped_column(String, nullable=False, default="nomic-embed-text") model_version: Mapped[str | None] = mapped_column(String, nullable=True) dimensions: Mapped[int] = mapped_column(Integer, nullable=False, default=768) generation_time_ms: Mapped[int] = mapped_column(Integer, nullable=False) created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False) ``` --- ### 5. Task Represents a development task with status tracking. **SQLAlchemy Model**: ```python class Task(Base): __tablename__ = "tasks" id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4) title: Mapped[str] = mapped_column(String, nullable=False) description: Mapped[str | None] = mapped_column(Text, nullable=True) notes: Mapped[str | None] = mapped_column(Text, nullable=True) status: Mapped[str] = mapped_column(String, nullable=False, default="need to be done", index=True) created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False) updated_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow, nullable=False) # Relationships planning_references: Mapped[list["TaskPlanningReference"]] = relationship(back_populates="task", cascade="all, delete-orphan") branch_links: Mapped[list["TaskBranchLink"]] = relationship(back_populates="task", cascade="all, delete-orphan") commit_links: Mapped[list["TaskCommitLink"]] = relationship(back_populates="task", cascade="all, delete-orphan") status_history: Mapped[list["TaskStatusHistory"]] = relationship(back_populates="task", cascade="all, delete-orphan") __table_args__ = ( CheckConstraint("status IN ('need to be done', 'in-progress', 'complete')", name="valid_status"), ) ``` **Pydantic Schema**: ```python class TaskCreate(BaseModel): title: str = Field(..., min_length=1, max_length=200) description: str | None = None notes: str | None = None planning_references: list[str] = Field(default_factory=list, description="Relative file paths to planning docs") class TaskUpdate(BaseModel): title: str | None = None description: str | None = None notes: str | None = None status: str | None = Field(None, pattern="^(need to be done|in-progress|complete)$") class TaskResponse(BaseModel): id: UUID title: str description: str | None notes: str | None status: str created_at: datetime updated_at: datetime planning_references: list[str] branches: list[str] commits: list[str] model_config = {"from_attributes": True} ``` --- ### 6. TaskPlanningReference Links tasks to planning documents (spec.md, plan.md, etc.). **SQLAlchemy Model**: ```python class TaskPlanningReference(Base): __tablename__ = "task_planning_references" id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4) task_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("tasks.id"), nullable=False) file_path: Mapped[str] = mapped_column(String, nullable=False) reference_type: Mapped[str] = mapped_column(String, nullable=False) # spec, plan, design, other created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False) # Relationships task: Mapped["Task"] = relationship(back_populates="planning_references") ``` --- ### 7. TaskBranchLink Associates tasks with git branches. **SQLAlchemy Model**: ```python class TaskBranchLink(Base): __tablename__ = "task_branch_links" id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4) task_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("tasks.id"), nullable=False) branch_name: Mapped[str] = mapped_column(String, nullable=False) created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False) # Relationships task: Mapped["Task"] = relationship(back_populates="branch_links") __table_args__ = ( Index("ix_branch_links_task_branch", "task_id", "branch_name", unique=True), ) ``` --- ### 8. TaskCommitLink Associates tasks with git commits. **SQLAlchemy Model**: ```python class TaskCommitLink(Base): __tablename__ = "task_commit_links" id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4) task_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("tasks.id"), nullable=False) commit_hash: Mapped[str] = mapped_column(String(40), nullable=False) commit_message: Mapped[str | None] = mapped_column(Text, nullable=True) created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False) # Relationships task: Mapped["Task"] = relationship(back_populates="commit_links") __table_args__ = ( Index("ix_commit_links_task_commit", "task_id", "commit_hash", unique=True), ) ``` --- ### 9. TaskStatusHistory Tracks task status transitions over time. **SQLAlchemy Model**: ```python class TaskStatusHistory(Base): __tablename__ = "task_status_history" id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4) task_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("tasks.id"), nullable=False) from_status: Mapped[str | None] = mapped_column(String, nullable=True) to_status: Mapped[str] = mapped_column(String, nullable=False) changed_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False, index=True) # Relationships task: Mapped["Task"] = relationship(back_populates="status_history") ``` --- ### 10. ChangeEvent Records file system changes for incremental indexing. **SQLAlchemy Model**: ```python class ChangeEvent(Base): __tablename__ = "change_events" id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4) code_file_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("code_files.id"), nullable=False) event_type: Mapped[str] = mapped_column(String, nullable=False) # added, modified, deleted detected_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False, index=True) processed: Mapped[bool] = mapped_column(Boolean, default=False, nullable=False) # Relationships code_file: Mapped["CodeFile"] = relationship(back_populates="change_events") ``` --- ### 11. SearchQuery Analytics for search queries (optional, for optimization). **SQLAlchemy Model**: ```python class SearchQuery(Base): __tablename__ = "search_queries" id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4) query_text: Mapped[str] = mapped_column(Text, nullable=False) result_count: Mapped[int] = mapped_column(Integer, nullable=False) latency_ms: Mapped[int] = mapped_column(Integer, nullable=False) filters: Mapped[dict | None] = mapped_column(JSON, nullable=True) created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow, nullable=False, index=True) ``` --- ## Validation Rules ### CodeFile - `content_hash`: SHA-256 hex string (64 chars) - `size_bytes`: Non-negative integer - `modified_at`: Must not be in future - `deleted_at`: Set only when `is_deleted=True`, trigger 90-day cleanup ### CodeChunk - `start_line <= end_line` - `chunk_type` in `['function', 'class', 'block']` - `embedding`: 768-dimensional vector or NULL ### Task - `status` in `['need to be done', 'in-progress', 'complete']` - Status transition: Create history entry on change - `updated_at`: Auto-update on modification ### TaskPlanningReference - `file_path`: Relative path, must exist in file system - `reference_type` in `['spec', 'plan', 'design', 'other']` --- ## State Transitions ### Task Status Flow ``` [Created] → need to be done ↓ in-progress ↓ complete ``` **Rules**: - Can move from any status to any other status - Each transition creates `TaskStatusHistory` entry - `updated_at` timestamp updates on transition ### File Deletion Flow ``` [Indexed] → is_deleted=True, deleted_at=NOW ↓ [Wait 90 days] ↓ [Permanent cleanup] ``` **Rules**: - Mark `is_deleted=True` when file removed from filesystem - Set `deleted_at` to current timestamp - Scheduled job removes records where `deleted_at < NOW - 90 days` - Cascades to chunks and embeddings --- ## Indexes ### Performance-Critical Indexes 1. **code_chunks.embedding** (HNSW, cosine distance) - Fast similarity search 2. **code_files.repository_id, relative_path** (BTREE, unique) - File lookup 3. **tasks.status** (BTREE) - Task filtering 4. **change_events.detected_at** (BTREE) - Incremental processing 5. **search_queries.created_at** (BTREE) - Analytics queries ### Unique Constraints 1. `repositories.path` - One entry per repository path 2. `code_files(repository_id, relative_path)` - One entry per file 3. `task_branch_links(task_id, branch_name)` - One branch per task 4. `task_commit_links(task_id, commit_hash)` - One commit per task --- ## Database Initialization ```sql -- Enable pgvector extension CREATE EXTENSION IF NOT EXISTS vector; -- Set vector distance function (cosine for normalized embeddings) -- (Handled by SQLAlchemy index definition) -- Scheduled cleanup job (90-day retention) CREATE OR REPLACE FUNCTION cleanup_deleted_files() RETURNS void AS $$ BEGIN DELETE FROM code_files WHERE is_deleted = TRUE AND deleted_at < NOW() - INTERVAL '90 days'; END; $$ LANGUAGE plpgsql; -- Run cleanup daily SELECT cron.schedule('cleanup-deleted-files', '0 2 * * *', 'SELECT cleanup_deleted_files()'); ``` --- ## Migration Strategy 1. **Alembic Setup**: Auto-generate migrations from SQLAlchemy models 2. **Initial Migration**: Create all tables, indexes, extensions 3. **Version Control**: Migration files in `migrations/versions/` 4. **Rollback Support**: Down migrations for all changes 5. **Data Preservation**: Never drop columns without migration path **Example Migration**: ```python # migrations/versions/001_initial_schema.py def upgrade(): op.execute("CREATE EXTENSION IF NOT EXISTS vector") # ... table creation via SQLAlchemy auto-generate def downgrade(): # ... table drops in reverse order op.execute("DROP EXTENSION IF EXISTS vector") ```

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Ravenight13/codebase-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

data-model.md•16.6 KiB