# Performance Optimization Guide

**Purpose**: How to optimize NornicDB for maximum performance
**Audience**: AI coding agents
**Goal**: 3-52x faster than Neo4j (maintain or improve)

---

## Performance Philosophy

### Golden Rules

1. **Measure First** - Always benchmark before optimizing
2. **Profile** - Find the actual bottleneck, don't guess
3. **Prove Improvement** - Show before/after benchmarks
4. **No Regressions** - Don't sacrifice other metrics
5. **Document** - Explain why the optimization works

### Performance Targets

**Current benchmarks (M3 Max, 64GB):**

| Operation                 | NornicDB      | Neo4j         | Target |
| ------------------------- | ------------- | ------------- | ------ |
| Message content lookup    | 6,389 ops/sec | 518 ops/sec   | ≥6,000 |
| Recent messages (friends) | 2,769 ops/sec | 108 ops/sec   | ≥2,500 |
| Avg friends per city      | 4,713 ops/sec | 91 ops/sec    | ≥4,500 |
| Tag co-occurrence         | 2,076 ops/sec | 65 ops/sec    | ≥2,000 |
| Index lookup              | 7,623 ops/sec | 2,143 ops/sec | ≥7,000 |
| Count nodes               | 5,253 ops/sec | 798 ops/sec   | ≥5,000 |
| Write: node               | 5,578 ops/sec | 1,690 ops/sec | ≥5,000 |
| Write: edge               | 6,626 ops/sec | 1,611 ops/sec | ≥6,000 |

**Memory targets:**

- Cold start: <1s (vs 10-30s for Neo4j)
- Memory footprint: 100-500 MB (vs 1-4 GB for Neo4j)
- No memory leaks
- Efficient garbage collection

---

## Optimization Process

### Step 1: Benchmark Baseline

**Create a benchmark before optimizing:**

```go
func BenchmarkQueryExecution(b *testing.B) {
	// Setup
	store := setupBenchmarkData()
	exec := NewStorageExecutor(store)
	ctx := context.Background()
	query := "MATCH (n:Person) WHERE n.age > 25 RETURN n LIMIT 100"

	b.ResetTimer()
	b.ReportAllocs()

	for i := 0; i < b.N; i++ {
		_, err := exec.Execute(ctx, query, nil)
		if err != nil {
			b.Fatal(err)
		}
	}
}
```

**Run baseline:**

```bash
go test -bench=BenchmarkQueryExecution -benchmem -count=5 > before.txt
```

**Example output:**

```
BenchmarkQueryExecution-8    4252    282456 ns/op    2456 B/op    45 allocs/op
```

### Step 2: Profile

**CPU profiling:**

```bash
go test -bench=BenchmarkQueryExecution -cpuprofile=cpu.prof
go tool pprof cpu.prof

# In pprof:
(pprof) top10
(pprof) list FunctionName
(pprof) web   # Visual graph
```

**Memory profiling:**

```bash
go test -bench=BenchmarkQueryExecution -memprofile=mem.prof
go tool pprof mem.prof

# In pprof:
(pprof) top10
(pprof) list FunctionName
```

**Trace analysis:**

```bash
go test -bench=BenchmarkQueryExecution -trace=trace.out
go tool trace trace.out
```

### Step 3: Identify Bottleneck

**Common bottlenecks:**

1. **Allocations** - Too many heap allocations
2. **Locks** - Contention on mutexes
3. **I/O** - Disk or network operations
4. **Algorithm** - O(n²) instead of O(n log n)
5. **Serialization** - JSON encoding/decoding
6. **Reflection** - Runtime type inspection

**Example profile output:**

```
Total: 2.5s
1.2s (48%) - json.Marshal
0.5s (20%) - sync.(*Mutex).Lock
0.3s (12%) - runtime.mallocgc
0.2s (8%)  - findNodes
0.3s (12%) - other
```

**Analysis:** JSON marshaling is the bottleneck (48% of time).

### Step 4: Optimize

**Apply an optimization technique:**

```go
// BEFORE: Slow JSON marshaling
func (e *Executor) formatResult(rows [][]interface{}) (*Result, error) {
	result := &Result{}
	// Marshal each row individually - slow!
	for _, row := range rows {
		data, err := json.Marshal(row)
		if err != nil {
			return nil, err
		}
		result.Data = append(result.Data, data)
	}
	return result, nil
}

// AFTER: Pre-allocate and batch
func (e *Executor) formatResult(rows [][]interface{}) (*Result, error) {
	// Pre-allocate capacity
	result := &Result{
		Data: make([][]byte, 0, len(rows)),
	}

	// Use sync.Pool for buffers
	buf := bufferPool.Get().(*bytes.Buffer)
	defer bufferPool.Put(buf)

	encoder := json.NewEncoder(buf)
	for _, row := range rows {
		buf.Reset()
		if err := encoder.Encode(row); err != nil {
			return nil, err
		}
		// Copy out of the pooled buffer: buf.Bytes() aliases memory that is
		// reused on the next Reset, so each row needs its own copy.
		result.Data = append(result.Data, append([]byte(nil), buf.Bytes()...))
	}
	return result, nil
}
```

### Step 5: Benchmark After

**Run the benchmark again:**

```bash
go test -bench=BenchmarkQueryExecution -benchmem -count=5 > after.txt
```

**Compare results:**

```bash
benchcmp before.txt after.txt
```

**Example output:**

```
benchmark                      old ns/op     new ns/op     delta
BenchmarkQueryExecution-8      282456        156234        -44.68%

benchmark                      old allocs    new allocs    delta
BenchmarkQueryExecution-8      45            23            -48.89%

benchmark                      old bytes     new bytes     delta
BenchmarkQueryExecution-8      2456          1234          -49.76%
```

**Result:** 44.68% faster, 48.89% fewer allocations! ✅
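`benchcmp` has been superseded by `benchstat` from golang.org/x/perf, which performs the same before/after comparison and adds significance testing across the `-count=5` runs. A sketch of the equivalent comparison, assuming `benchstat` is installed:

```bash
# Install once (Go 1.17+ syntax)
go install golang.org/x/perf/cmd/benchstat@latest

# Compare the two result files; benchstat reports deltas with p-values
benchstat before.txt after.txt
```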
### Step 6: Document

**Add a comment explaining the optimization:**

```go
// formatResult formats query results for Neo4j compatibility.
//
// OPTIMIZATION: Uses sync.Pool for buffer reuse and pre-allocates capacity
// to reduce allocations. Benchmarks show 44% speedup over the naive approach:
//   Before: 282,456 ns/op, 2,456 B/op, 45 allocs/op
//   After:  156,234 ns/op, 1,234 B/op, 23 allocs/op
//   Improvement: 44.68% faster, 48.89% fewer allocations
//
// See: BenchmarkQueryExecution in executor_test.go
func (e *Executor) formatResult(rows [][]interface{}) (*Result, error) {
	// Implementation with optimization
}
```

---

## Optimization Techniques

### 1. Reduce Allocations

**Problem:** Excessive heap allocations cause GC pressure

**Solution 1: Pre-allocate slices**

```go
// BEFORE: Append without capacity
func processNodes(nodes []*Node) []string {
	var names []string // ← No capacity, will reallocate
	for _, node := range nodes {
		names = append(names, node.Name)
	}
	return names
}

// AFTER: Pre-allocate capacity
func processNodes(nodes []*Node) []string {
	names := make([]string, 0, len(nodes)) // ← Pre-allocate
	for _, node := range nodes {
		names = append(names, node.Name)
	}
	return names
}
```

**Solution 2: Use sync.Pool**

```go
// Pool of reusable buffers
var bufferPool = sync.Pool{
	New: func() interface{} {
		return new(bytes.Buffer)
	},
}

// BEFORE: Allocate a new buffer each time
func formatData(data []byte) string {
	buf := new(bytes.Buffer) // ← New allocation
	buf.Write(data)
	return buf.String()
}

// AFTER: Reuse buffers from the pool
func formatData(data []byte) string {
	buf := bufferPool.Get().(*bytes.Buffer)
	defer bufferPool.Put(buf)
	buf.Reset()
	buf.Write(data)
	return buf.String()
}
```

**Solution 3: Reuse slices**

```go
// BEFORE: Allocate a new slice each iteration
func processInBatches(items []Item) {
	for i := 0; i < len(items); i += 100 {
		batch := items[i:min(i+100, len(items))]
		processBatch(batch)
	}
}

// AFTER: Reuse the slice (if safe)
func processInBatches(items []Item) {
	batch := make([]Item, 0, 100) // ← Allocate once
	for _, item := range items {
		batch = append(batch, item)
		if len(batch) == 100 {
			processBatch(batch)
			batch = batch[:0] // ← Reset, don't reallocate
		}
	}
	if len(batch) > 0 {
		processBatch(batch)
	}
}
```

### 2. Optimize Algorithms

**Problem:** An O(n²) algorithm is slow for large datasets

**Solution: Use a better algorithm**

```go
// BEFORE: O(n²) - nested loops
func findDuplicates(nodes []*Node) []*Node {
	var duplicates []*Node
	for i := 0; i < len(nodes); i++ {
		for j := i + 1; j < len(nodes); j++ {
			if nodes[i].ID == nodes[j].ID {
				duplicates = append(duplicates, nodes[j])
			}
		}
	}
	return duplicates
}

// AFTER: O(n) - use a map
func findDuplicates(nodes []*Node) []*Node {
	seen := make(map[NodeID]bool, len(nodes))
	var duplicates []*Node
	for _, node := range nodes {
		if seen[node.ID] {
			duplicates = append(duplicates, node)
		} else {
			seen[node.ID] = true
		}
	}
	return duplicates
}
```

**Benchmark:**

```
Before (O(n²)): 1,000 nodes = 500ms
After (O(n)):   1,000 nodes = 5ms
Improvement: 100x faster!
```

### 3. Use Indexes

**Problem:** A full table scan is slow

**Solution: Create and use indexes**

```go
// BEFORE: Linear scan O(n)
func (e *MemoryEngine) FindNodesByLabel(label string) []*Node {
	var results []*Node
	for _, node := range e.nodes {
		for _, l := range node.Labels {
			if l == label {
				results = append(results, node)
				break
			}
		}
	}
	return results
}

// AFTER: Use an index, O(log n) or O(1)
type MemoryEngine struct {
	nodes      map[NodeID]*Node
	labelIndex map[string][]NodeID // ← Index
}

func (e *MemoryEngine) FindNodesByLabel(label string) []*Node {
	nodeIDs := e.labelIndex[label] // ← O(1) lookup
	results := make([]*Node, 0, len(nodeIDs))
	for _, id := range nodeIDs {
		if node, ok := e.nodes[id]; ok {
			results = append(results, node)
		}
	}
	return results
}

func (e *MemoryEngine) CreateNode(node *Node) error {
	// Store node
	e.nodes[node.ID] = node

	// Update indexes
	for _, label := range node.Labels {
		e.labelIndex[label] = append(e.labelIndex[label], node.ID)
	}
	return nil
}
```

**Benchmark:**

```
Before (no index):  10,000 nodes = 50ms
After (with index): 10,000 nodes = 0.5ms
Improvement: 100x faster!
```
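An index is only as fast as it is correct: every write path that removes or relabels a node must also update `labelIndex`, or lookups will return stale IDs. A minimal sketch of the delete path, assuming the same `MemoryEngine` fields as above (`DeleteNode` is a hypothetical counterpart to `CreateNode`, not the actual NornicDB implementation):

```go
// DeleteNode removes a node and keeps labelIndex consistent.
func (e *MemoryEngine) DeleteNode(id NodeID) error {
	node, ok := e.nodes[id]
	if !ok {
		return nil // Already gone; treat as a no-op
	}

	// Remove the ID from every label bucket it appears in.
	for _, label := range node.Labels {
		ids := e.labelIndex[label]
		for i, existing := range ids {
			if existing == id {
				// Order within a bucket is not significant, so swap-remove
				// instead of shifting the tail.
				ids[i] = ids[len(ids)-1]
				e.labelIndex[label] = ids[:len(ids)-1]
				break
			}
		}
		if len(e.labelIndex[label]) == 0 {
			delete(e.labelIndex, label) // Don't leak empty buckets
		}
	}

	delete(e.nodes, id)
	return nil
}
```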
### 4. Batch Operations

**Problem:** Individual operations have overhead

**Solution: Batch multiple operations**

```go
// BEFORE: Individual creates
func importNodes(nodes []*Node) error {
	for _, node := range nodes {
		if err := engine.CreateNode(node); err != nil {
			return err
		}
	}
	return nil
}

// AFTER: Batch create
func importNodes(nodes []*Node) error {
	return engine.CreateNodesBatch(nodes)
}

func (e *BadgerEngine) CreateNodesBatch(nodes []*Node) error {
	return e.db.Update(func(txn *badger.Txn) error {
		for _, node := range nodes {
			data, err := json.Marshal(node)
			if err != nil {
				return err
			}
			key := []byte("node:" + string(node.ID))
			if err := txn.Set(key, data); err != nil {
				return err
			}
		}
		return nil
	})
}
```

**Benchmark:**

```
Before (individual): 1,000 nodes = 500ms
After (batch):       1,000 nodes = 50ms
Improvement: 10x faster!
```

### 5. Reduce Lock Contention

**Problem:** Multiple goroutines waiting on the same lock

**Solution 1: Use RWMutex**

```go
// BEFORE: Mutex blocks all access
type Cache struct {
	mu   sync.Mutex
	data map[string]interface{}
}

func (c *Cache) Get(key string) (interface{}, bool) {
	c.mu.Lock() // ← Blocks readers
	defer c.mu.Unlock()
	val, ok := c.data[key]
	return val, ok
}

// AFTER: RWMutex allows concurrent reads
type Cache struct {
	mu   sync.RWMutex // ← Read/Write mutex
	data map[string]interface{}
}

func (c *Cache) Get(key string) (interface{}, bool) {
	c.mu.RLock() // ← Multiple readers OK
	defer c.mu.RUnlock()
	val, ok := c.data[key]
	return val, ok
}

func (c *Cache) Set(key string, val interface{}) {
	c.mu.Lock() // ← Exclusive write
	defer c.mu.Unlock()
	c.data[key] = val
}
```

**Solution 2: Shard locks**

```go
// BEFORE: Single lock for all data
type Cache struct {
	mu   sync.RWMutex
	data map[string]interface{}
}

// AFTER: Multiple locks (sharding)
type Cache struct {
	shards []*cacheShard
}

type cacheShard struct {
	mu   sync.RWMutex
	data map[string]interface{}
}

func (c *Cache) getShard(key string) *cacheShard {
	hash := fnv.New32a()
	hash.Write([]byte(key))
	return c.shards[hash.Sum32()%uint32(len(c.shards))]
}

func (c *Cache) Get(key string) (interface{}, bool) {
	shard := c.getShard(key)
	shard.mu.RLock()
	defer shard.mu.RUnlock()
	val, ok := shard.data[key]
	return val, ok
}
```

**Benchmark (100 concurrent goroutines):**

```
Before (single lock): 1,000 ops/sec
After (16 shards):    12,000 ops/sec
Improvement: 12x faster!
```
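The sharded version also needs a constructor that allocates each shard's map and a `Set` that routes writes through the same hash. A minimal sketch, assuming the `Cache`/`cacheShard` types above (`NewCache` is a hypothetical helper):

```go
// NewCache builds a sharded cache. shardCount should be large enough to
// spread hot keys; 16 matches the benchmark above.
func NewCache(shardCount int) *Cache {
	c := &Cache{shards: make([]*cacheShard, shardCount)}
	for i := range c.shards {
		c.shards[i] = &cacheShard{data: make(map[string]interface{})}
	}
	return c
}

// Set locks only the shard that owns the key, so writes to different shards
// proceed in parallel.
func (c *Cache) Set(key string, val interface{}) {
	shard := c.getShard(key)
	shard.mu.Lock()
	defer shard.mu.Unlock()
	shard.data[key] = val
}
```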
### 6. Optimize Serialization

**Problem:** JSON encoding/decoding is slow

**Solution 1: Use faster serialization**

```go
// BEFORE: Standard JSON
import "encoding/json"

func serialize(node *Node) ([]byte, error) {
	return json.Marshal(node)
}

// AFTER: Use msgpack (faster binary format)
import "github.com/vmihailenco/msgpack/v5"

func serialize(node *Node) ([]byte, error) {
	return msgpack.Marshal(node)
}
```

**Solution 2: Avoid serialization**

```go
// BEFORE: Serialize for the cache
func (c *Cache) Set(key string, node *Node) {
	data, _ := json.Marshal(node)
	c.data[key] = data
}

func (c *Cache) Get(key string) (*Node, error) {
	data := c.data[key].([]byte)
	var node Node
	json.Unmarshal(data, &node)
	return &node, nil
}

// AFTER: Store the object directly
func (c *Cache) Set(key string, node *Node) {
	c.data[key] = node // ← No serialization!
}

func (c *Cache) Get(key string) (*Node, error) {
	return c.data[key].(*Node), nil
}
```

**Benchmark:**

```
Before (JSON):     10,000 ops = 500ms
After (msgpack):   10,000 ops = 200ms
After (no serial): 10,000 ops = 10ms
```

### 7. Parallelize

**Problem:** Sequential processing is slow

**Solution: Use goroutines**

```go
// BEFORE: Sequential processing
func processNodes(nodes []*Node) []Result {
	results := make([]Result, len(nodes))
	for i, node := range nodes {
		results[i] = expensiveOperation(node)
	}
	return results
}

// AFTER: Parallel processing
func processNodes(nodes []*Node) []Result {
	results := make([]Result, len(nodes))
	var wg sync.WaitGroup
	sem := make(chan struct{}, runtime.NumCPU()) // Limit concurrency

	for i, node := range nodes {
		wg.Add(1)
		go func(idx int, n *Node) {
			defer wg.Done()
			sem <- struct{}{}        // Acquire
			defer func() { <-sem }() // Release
			results[idx] = expensiveOperation(n)
		}(i, node)
	}
	wg.Wait()
	return results
}
```

**Benchmark (8 cores):**

```
Before (sequential): 1,000 nodes = 800ms
After (parallel):    1,000 nodes = 120ms
Improvement: 6.7x faster!
```
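The version above still launches one goroutine per node and merely throttles them with the semaphore. For very large inputs, a fixed worker pool keeps the goroutine count at `runtime.NumCPU()` regardless of input size. A minimal sketch under the same assumptions (`Node`, `Result`, and `expensiveOperation` as above; `processNodesPool` is a hypothetical variant):

```go
import (
	"runtime"
	"sync"
)

// processNodesPool runs a fixed number of workers that pull indexes from a
// channel, so the goroutine count stays constant no matter how many nodes
// are processed.
func processNodesPool(nodes []*Node) []Result {
	results := make([]Result, len(nodes))
	jobs := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for idx := range jobs {
				results[idx] = expensiveOperation(nodes[idx])
			}
		}()
	}

	for i := range nodes {
		jobs <- i // Each index is claimed by exactly one worker
	}
	close(jobs)
	wg.Wait()
	return results
}
```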
### 8. Cache Expensive Operations

**Problem:** Repeated expensive computations

**Solution: Cache results**

```go
// BEFORE: Compute every time
func (e *Executor) generateEmbedding(text string) ([]float32, error) {
	return expensiveMLModel.Generate(text) // ← 100ms per call
}

// AFTER: Cache embeddings
type Executor struct {
	embeddingCache *sync.Map
}

func (e *Executor) generateEmbedding(text string) ([]float32, error) {
	// Check cache
	if cached, ok := e.embeddingCache.Load(text); ok {
		return cached.([]float32), nil
	}

	// Generate
	embedding, err := expensiveMLModel.Generate(text)
	if err != nil {
		return nil, err
	}

	// Cache
	e.embeddingCache.Store(text, embedding)
	return embedding, nil
}
```

**Benchmark:**

```
Before (no cache):  100 calls = 10,000ms (100ms each)
After (with cache): 100 calls = 100ms (1ms each, 99 cache hits)
Improvement: 100x faster!
```

---

## Memory Optimization

### 1. Reduce Memory Footprint

**Use smaller types:**

```go
// BEFORE: 64-bit for small values
type Node struct {
	AccessCount int64 // ← 8 bytes, but rarely > 1 million
}

// AFTER: 32-bit is sufficient
type Node struct {
	AccessCount int32 // ← 4 bytes, max 2 billion
}
```

**Use pointers for large structs:**

```go
// BEFORE: Copy the entire struct
type Cache struct {
	data map[string]Node // ← Copies the entire Node
}

// AFTER: Store a pointer
type Cache struct {
	data map[string]*Node // ← Only 8 bytes per entry
}
```

### 2. Avoid Memory Leaks

**Close resources:**

```go
// BEFORE: Goroutine leak
func processStream(ch <-chan Data) {
	for data := range ch {
		go process(data) // ← Goroutines never stop!
	}
}

// AFTER: Use context for cancellation
func processStream(ctx context.Context, ch <-chan Data) {
	for {
		select {
		case data := <-ch:
			go func(d Data) {
				select {
				case <-ctx.Done():
					return
				default:
					process(d)
				}
			}(data)
		case <-ctx.Done():
			return
		}
	}
}
```

**Clear references:**

```go
// BEFORE: Cache grows forever
type Cache struct {
	data map[string]*Node
}

// AFTER: Implement eviction
type Cache struct {
	data     map[string]*cacheEntry
	maxSize  int
	eviction *list.List // LRU list
}

func (c *Cache) Set(key string, node *Node) {
	if len(c.data) >= c.maxSize {
		c.evictOldest() // ← Remove old entries
	}
	c.data[key] = &cacheEntry{
		node:      node,
		timestamp: time.Now(),
	}
}
```
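`evictOldest` is left undefined above, and `Set` never touches the `eviction` list. One way to complete the sketch with O(1) eviction is to record each entry's position in the `container/list` and drop the back element when the cache is full. A minimal sketch, assuming the `Cache` type above; the `key` and `elem` fields on `cacheEntry` are hypothetical additions, and keys are assumed to be inserted only once:

```go
import (
	"container/list"
	"time"
)

// cacheEntry pairs the node with its position in the eviction list so
// removal is O(1).
type cacheEntry struct {
	node      *Node
	timestamp time.Time
	key       string
	elem      *list.Element
}

// Set inserts the entry and records it at the front of the eviction list.
func (c *Cache) Set(key string, node *Node) {
	if len(c.data) >= c.maxSize {
		c.evictOldest()
	}
	entry := &cacheEntry{node: node, timestamp: time.Now(), key: key}
	entry.elem = c.eviction.PushFront(entry)
	c.data[key] = entry
}

// evictOldest drops the oldest entry from both the map and the list.
func (c *Cache) evictOldest() {
	back := c.eviction.Back()
	if back == nil {
		return
	}
	entry := back.Value.(*cacheEntry)
	c.eviction.Remove(back)
	delete(c.data, entry.key)
}
```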
---

## Benchmarking Best Practices

### 1. Realistic Data

```go
func BenchmarkQueryExecution(b *testing.B) {
	// Use a realistic data size
	store := setupBenchmarkData(
		10000, // node count
		50000, // edge count
		5,     // average degree
	)
	exec := NewStorageExecutor(store)
	ctx := context.Background()

	// Use realistic queries
	queries := []string{
		"MATCH (n:Person) WHERE n.age > 25 RETURN n LIMIT 100",
		"MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a, b",
		"MATCH (n:Person) RETURN count(n)",
	}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		query := queries[i%len(queries)]
		_, err := exec.Execute(ctx, query, nil)
		if err != nil {
			b.Fatal(err)
		}
	}
}
```

### 2. Multiple Runs

```bash
# Run the benchmark 10 times for statistical significance
go test -bench=. -benchmem -count=10 > results.txt

# Analyze with benchstat
benchstat results.txt
```

### 3. Isolate Changes

```bash
# Benchmark one change at a time
git checkout main
go test -bench=. -benchmem -count=5 > before.txt

git checkout feature-branch
go test -bench=. -benchmem -count=5 > after.txt

benchcmp before.txt after.txt
```

---

## Performance Checklist

Before claiming an optimization:

- [ ] Baseline benchmark exists
- [ ] Profiling identified the bottleneck
- [ ] Optimization targets the bottleneck
- [ ] After benchmark shows improvement
- [ ] No regression in other metrics
- [ ] Memory usage acceptable
- [ ] All tests still pass
- [ ] Documentation updated
- [ ] Commit includes benchmarks

---

## Quick Reference

### Benchmark Commands

```bash
# Run benchmarks
go test -bench=. -benchmem

# CPU profile
go test -bench=. -cpuprofile=cpu.prof
go tool pprof cpu.prof

# Memory profile
go test -bench=. -memprofile=mem.prof
go tool pprof mem.prof

# Compare benchmarks
benchcmp before.txt after.txt

# Statistical analysis
benchstat results.txt
```

### Common Optimizations

1. **Reduce allocations** - Pre-allocate, use sync.Pool
2. **Better algorithms** - O(n) instead of O(n²)
3. **Use indexes** - O(1) or O(log n) lookups
4. **Batch operations** - Reduce overhead
5. **Reduce locks** - RWMutex, sharding
6. **Faster serialization** - msgpack, protobuf
7. **Parallelize** - Use goroutines
8. **Cache** - Expensive operations

---

**Remember**: Premature optimization is the root of all evil. Always measure first, optimize second, and prove the improvement with benchmarks.
