README.md•14.1 kB
# Maintenance Scripts
This directory contains maintenance and diagnostic scripts for the MCP Memory Service database.
## Quick Reference
| Script | Purpose | Performance | Use Case |
|--------|---------|-------------|----------|
| [`check_memory_types.py`](#check_memory_typespy-new) | Display type distribution | <1s | Quick health check, pre/post-consolidation validation |
| [`consolidate_memory_types.py`](#consolidate_memory_typespy-new) | Consolidate fragmented types | ~5s for 1000 updates | Type taxonomy cleanup, reduce fragmentation |
| [`regenerate_embeddings.py`](#regenerate_embeddingspy) | Regenerate all embeddings | ~5min for 2600 memories | After cosine migration or embedding corruption |
| [`fast_cleanup_duplicates.sh`](#fast_cleanup_duplicatessh) | Fast duplicate removal | <5s for 100+ duplicates | Bulk duplicate cleanup |
| [`find_all_duplicates.py`](#find_all_duplicatespy) | Detect near-duplicates | <2s for 2000 memories | Duplicate detection and analysis |
| [`find_duplicates.py`](#find_duplicatespy) | API-based duplicate finder | Slow (~90s/duplicate) | Detailed duplicate analysis via API |
| [`repair_sqlite_vec_embeddings.py`](#repair_sqlite_vec_embeddingspy) | Fix embedding corruption | Varies | Repair corrupted embeddings |
| [`repair_zero_embeddings.py`](#repair_zero_embeddingspy) | Fix zero-valued embeddings | Varies | Repair zero embeddings |
| [`cleanup_corrupted_encoding.py`](#cleanup_corrupted_encodingpy) | Fix encoding issues | Varies | Repair encoding corruption |
## Detailed Documentation
### `check_memory_types.py` 🆕
**Purpose**: Quick diagnostic tool to display memory type distribution in the database.
**When to Use**:
- Before running consolidation to see what needs cleanup
- After consolidation to verify results
- Regular health checks to monitor type fragmentation
- When investigating memory organization issues
**Usage**:
```bash
# Display type distribution (Windows)
python scripts/maintenance/check_memory_types.py
# On macOS/Linux, update the database path in the script first
```
**Output Example**:
```
Memory Type Distribution
============================================================
Total memories: 1,978
Unique types: 128
Memory Type Count %
------------------------------------------------------------
note 609 30.8%
session 89 4.5%
fix 67 3.4%
milestone 60 3.0%
reference 45 2.3%
...
```
**Performance**: < 1 second for any database size (read-only SQL query)
**Features**:
- Shows top 30 types by frequency
- Displays total memory count and unique type count
- Identifies NULL/empty types as "(empty/NULL)"
- Percentage calculation for easy analysis
- Zero risk (read-only operation)
**Workflow Integration**:
1. Run `check_memory_types.py` to identify fragmentation
2. If types > 150, consider running consolidation
3. Run `consolidate_memory_types.py --dry-run` to preview
4. Execute `consolidate_memory_types.py` to clean up
5. Run `check_memory_types.py` again to verify improvement
### `consolidate_memory_types.py` 🆕
**Purpose**: Consolidates fragmented memory types into a standardized 24-type taxonomy.
**When to Use**:
- Type fragmentation (e.g., `bug-fix`, `bugfix`, `technical-fix` all coexisting)
- Many types with only 1-2 memories
- Inconsistent naming across similar concepts
- After importing memories from external sources
- Monthly maintenance to prevent type proliferation
**Usage**:
```bash
# Preview changes (safe, read-only)
python scripts/maintenance/consolidate_memory_types.py --dry-run
# Execute consolidation
python scripts/maintenance/consolidate_memory_types.py
# Use custom mapping configuration
python scripts/maintenance/consolidate_memory_types.py --config custom_mappings.json
```
**Performance**: ~5 seconds for 1,000 memory updates (Nov 2025 real-world test: 1,049 updates in 5s)
**Safety Features**:
- ✅ Automatic timestamped backup before execution
- ✅ Dry-run mode shows preview without changes
- ✅ Transaction safety (atomic with rollback on error)
- ✅ Database lock detection (prevents concurrent access)
- ✅ HTTP server warning (recommends stopping before execution)
- ✅ Disk space verification (needs 2x database size)
- ✅ Backup verification (size and existence checks)
**Standard 24-Type Taxonomy**:
**Content Types:** `note`, `reference`, `document`, `guide`
**Activity Types:** `session`, `implementation`, `analysis`, `troubleshooting`, `test`
**Artifact Types:** `fix`, `feature`, `release`, `deployment`
**Progress Types:** `milestone`, `status`
**Infrastructure Types:** `configuration`, `infrastructure`, `process`, `security`, `architecture`
**Other Types:** `documentation`, `solution`, `achievement`, `technical`
**Example Consolidations**:
- NULL/empty → `note`
- `bug-fix`, `bugfix`, `technical-fix` → `fix`
- `session-summary`, `session-checkpoint` → `session`
- `project-milestone`, `development-milestone` → `milestone`
- All `technical-*` → base type (remove prefix)
- All `project-*` → base type (remove prefix)
**Typical Results** (from production database, Nov 2025):
```
Before: 342 unique types, 609 NULL/empty, fragmented naming
After: 128 unique types (63% reduction), all valid types
Updated: 1,049 memories (59% of database)
Time: ~5 seconds
```
**Configuration**:
Edit `consolidation_mappings.json` to customize behavior:
```json
{
"mappings": {
"old-type-name": "new-type-name",
"bug-fix": "fix",
"technical-solution": "solution"
}
}
```
**Prerequisites**:
```bash
# 1. Stop HTTP server
systemctl --user stop mcp-memory-http.service
# 2. Disconnect MCP clients (Claude Code: /mcp command)
# 3. Verify disk space (need 2x database size)
df -h ~/.local/share/mcp-memory/
```
**Recovery**:
```bash
# If something goes wrong, restore from automatic backup
cp ~/.local/share/mcp-memory/sqlite_vec.db.backup-TIMESTAMP ~/.local/share/mcp-memory/sqlite_vec.db
# Verify restoration
sqlite3 ~/.local/share/mcp-memory/sqlite_vec.db "SELECT COUNT(*), COUNT(DISTINCT memory_type) FROM memories;"
```
**Notes**:
- Creates timestamped backup automatically (e.g., `sqlite_vec.db.backup-20251101-202042`)
- No data loss - only type reassignment
- Safe to run multiple times (idempotent for same mappings)
- Comprehensive reporting shows before/after statistics
- See `consolidation_mappings.json` for full mapping list
**Maintenance Schedule**:
- Run `--dry-run` monthly to check fragmentation
- Execute when unique types exceed 150
- Review custom mappings quarterly
---
### `regenerate_embeddings.py`
**Purpose**: Regenerates embeddings for all memories in the database after schema migrations or corruption.
**When to Use**:
- After cosine distance migration
- When embeddings table is dropped but memories are preserved
- After embedding corruption detected
**Usage**:
```bash
/home/hkr/repositories/mcp-memory-service/venv/bin/python scripts/maintenance/regenerate_embeddings.py
```
**Performance**: ~5 minutes for 2600 memories with all-MiniLM-L6-v2 model
**Notes**:
- Uses configured storage backend (hybrid, cloudflare, or sqlite_vec)
- Creates embeddings using sentence-transformers model
- Shows progress every 100 memories
- Safe to run multiple times (idempotent)
---
### `fast_cleanup_duplicates.sh`
**Purpose**: Fast duplicate removal using direct SQL access instead of API calls.
**When to Use**:
- Bulk duplicate cleanup after detecting duplicates
- When API-based deletion is too slow (>1min per duplicate)
- Production cleanup without extended downtime
**Usage**:
```bash
bash scripts/maintenance/fast_cleanup_duplicates.sh
```
**Performance**: <5 seconds for 100+ duplicates
**How It Works**:
1. Stops HTTP server to avoid database locking
2. Uses direct SQL DELETE with timestamp normalization
3. Keeps newest copy of each duplicate group
4. Restarts HTTP server automatically
**Warnings**:
- ⚠️ Requires systemd HTTP server setup (`mcp-memory-http.service`)
- ⚠️ Brief service interruption during cleanup
- ⚠️ Direct database access bypasses Cloudflare sync (background sync handles it later)
---
### `find_all_duplicates.py`
**Purpose**: Fast duplicate detection using content normalization and hash comparison.
**When to Use**:
- Regular duplicate audits
- Before running cleanup operations
- Investigating duplicate memory issues
**Usage**:
```bash
/home/hkr/repositories/mcp-memory-service/venv/bin/python scripts/maintenance/find_all_duplicates.py
```
**Performance**: <2 seconds for 2000 memories
**Detection Method**:
- Normalizes content by removing timestamps (dates, ISO timestamps)
- Groups memories by MD5 hash of normalized content
- Reports duplicate groups with counts
**Output**:
```
Found 23 groups of duplicates
Total memories to delete: 115
Total memories after cleanup: 1601
```
---
### `find_duplicates.py`
**Purpose**: Comprehensive duplicate detection via HTTP API with detailed analysis.
**When to Use**:
- Need detailed duplicate analysis with full metadata
- API-based workflow required
- Integration with external tools
**Usage**:
```bash
/home/hkr/repositories/mcp-memory-service/venv/bin/python scripts/maintenance/find_duplicates.py
```
**Performance**: Slow (~90 seconds per duplicate deletion)
**Features**:
- Loads configuration from Claude hooks config
- Supports self-signed SSL certificates
- Pagination support for large datasets
- Detailed duplicate grouping and reporting
**Notes**:
- 15K script with comprehensive error handling
- Useful for API integration scenarios
- Slower than `find_all_duplicates.py` due to network overhead
---
### `repair_sqlite_vec_embeddings.py`
**Purpose**: Repairs corrupted embeddings in the sqlite-vec virtual table.
**When to Use**:
- Embedding corruption detected
- vec0 extension errors
- Database integrity issues
**Usage**:
```bash
/home/hkr/repositories/mcp-memory-service/venv/bin/python scripts/maintenance/repair_sqlite_vec_embeddings.py
```
**Warnings**:
- ⚠️ Requires vec0 extension to be properly installed
- ⚠️ May drop and recreate embeddings table
---
### `repair_zero_embeddings.py`
**Purpose**: Detects and fixes memories with zero-valued embeddings.
**When to Use**:
- Search results showing 0% similarity scores
- After embedding regeneration failures
- Embedding quality issues
**Usage**:
```bash
/home/hkr/repositories/mcp-memory-service/venv/bin/python scripts/maintenance/repair_zero_embeddings.py
```
---
### `cleanup_corrupted_encoding.py`
**Purpose**: Fixes encoding corruption issues in memory content.
**When to Use**:
- UTF-8 encoding errors
- Display issues with special characters
- After data migration from different encoding
**Usage**:
```bash
/home/hkr/repositories/mcp-memory-service/venv/bin/python scripts/maintenance/cleanup_corrupted_encoding.py
```
---
## Best Practices
### Before Running Maintenance Scripts
1. **Backup your database**:
```bash
cp ~/.local/share/mcp-memory/sqlite_vec.db ~/.local/share/mcp-memory/sqlite_vec.db.backup
```
2. **Check memory count**:
```bash
sqlite3 ~/.local/share/mcp-memory/sqlite_vec.db "SELECT COUNT(*) FROM memories"
```
3. **Stop HTTP server if needed** (for direct database access):
```bash
systemctl --user stop mcp-memory-http.service
```
### After Running Maintenance Scripts
1. **Verify results**:
```bash
sqlite3 ~/.local/share/mcp-memory/sqlite_vec.db "SELECT COUNT(*) FROM memories"
```
2. **Check for duplicates**:
```bash
/home/hkr/repositories/mcp-memory-service/venv/bin/python scripts/maintenance/find_all_duplicates.py
```
3. **Restart HTTP server**:
```bash
systemctl --user start mcp-memory-http.service
```
4. **Test search functionality**:
```bash
curl -s "http://127.0.0.1:8000/api/health"
```
### Performance Comparison
| Operation | API-based | Direct SQL | Speedup |
|-----------|-----------|------------|---------|
| Delete 1 duplicate | ~90 seconds | ~0.05 seconds | **1800x faster** |
| Delete 100 duplicates | ~2.5 hours | <5 seconds | **1800x faster** |
| Find duplicates | ~30 seconds | <2 seconds | **15x faster** |
**Recommendation**: Use direct SQL scripts (`fast_cleanup_duplicates.sh`, `find_all_duplicates.py`) for production maintenance. API-based scripts are useful for integration and detailed analysis.
## Troubleshooting
### "Database is locked"
**Cause**: HTTP server or MCP server has open connection
**Solution**:
```bash
# Stop HTTP server
systemctl --user stop mcp-memory-http.service
# Disconnect MCP server in Claude Code
# Type: /mcp
# Run maintenance script
bash scripts/maintenance/fast_cleanup_duplicates.sh
# Restart services
systemctl --user start mcp-memory-http.service
```
### "No such module: vec0"
**Cause**: Python sqlite3 module doesn't load vec0 extension automatically
**Solution**: Use scripts that work with the vec0-enabled environment:
- ✅ Use: `fast_cleanup_duplicates.sh` (bash wrapper with Python)
- ✅ Use: `/venv/bin/python` with proper storage backend
- ❌ Avoid: Direct `sqlite3` Python module for virtual table operations
### Slow API Performance
**Cause**: Hybrid backend syncs each operation to Cloudflare
**Solution**: Use direct SQL scripts for bulk operations:
```bash
bash scripts/maintenance/fast_cleanup_duplicates.sh # NOT Python API scripts
```
## Related Documentation
- [Database Schema](../../docs/database-schema.md) - sqlite-vec table structure
- [Storage Backends](../../CLAUDE.md#storage-backends) - Hybrid, Cloudflare, SQLite-vec
- [Troubleshooting](../../docs/troubleshooting.md) - Common issues and solutions
## Contributing
When adding new maintenance scripts:
1. Add comprehensive docstring explaining purpose and usage
2. Include progress indicators for long-running operations
3. Add error handling and validation
4. Document in this README with performance characteristics
5. Test with both sqlite_vec and hybrid backends