TurboQuant Tools

by FreezeVII

Compress AI embeddings by 5–7× with near-lossless quality.

CLI + Python Library + MCP Server for extreme vector compression using Google's TurboQuant (PolarQuant + QJL), wrapped in a clean NumPy-first API.


🚀 Quick Start

pip install turboquant-tools

Compress a .npy embedding file:

turboquant compress embeddings.npy compressed.tq

Restore:

turboquant decompress compressed.tq restored.npy

Estimate savings:

turboquant estimate embeddings.npy --bits 3
# Original: 153.00 MB -> Compressed: 20.13 MB (7.60×, save 87%)
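
The figures in that output line follow from simple arithmetic. The sketch below is plain Python with made-up helper names (`float32_size_mb` and `summarize` are not part of turboquant-tools). It computes the raw size of a float32 array and re-derives the ratio and saving from the two sizes in the sample output. The 100K × 384 shape is borrowed from the Performance table further down.

```python
def float32_size_mb(n_vectors: int, dim: int) -> float:
    """Raw size of an (n_vectors, dim) float32 array in decimal megabytes."""
    return n_vectors * dim * 4 / 1e6  # 4 bytes per float32 value


def summarize(original_mb: float, compressed_mb: float) -> str:
    """Format two sizes in the style of the sample output (hypothetical helper)."""
    ratio = original_mb / compressed_mb
    saved = 1 - compressed_mb / original_mb
    return (
        f"Original: {original_mb:.2f} MB -> Compressed: {compressed_mb:.2f} MB "
        f"({ratio:.2f}×, save {saved:.0%})"
    )


print(float32_size_mb(100_000, 384))  # 153.6
print(summarize(153.00, 20.13))
```

The second call reproduces the ratio (7.60×) and saving (87%) shown in the sample output, so those two figures are fully determined by the two sizes.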

📦 What's Inside

| Command / Tool | Description |
| --- | --- |
| turboquant compress | Compress .npy embeddings → .tq binary |
| turboquant decompress | Restore .tq → .npy |
| turboquant estimate | Predict compression ratio before running |
| turboquant mcp-server | MCP stdio server (AI agent integration) |
| Python compress() | Compress numpy arrays in code |
| Python decompress() | Restore in code |


🔧 CLI Reference

compress

turboquant compress INPUT [OUTPUT] [OPTIONS]

| Option | Default | Description |
| --- | --- | --- |
| INPUT | - | .npy file with float32 embeddings (n, d) |
| OUTPUT | {stem}_tq{b}.tq | Output .tq file |
| -b, --bits | 3 | Bit width (3 or 4) |
| -o, --output | - | Alternative to positional OUTPUT |
| --no-qjl | off | Skip QJL correction (faster, lower quality) |

Examples:

# Basic 3-bit compression
turboquant compress wiki_embeddings.npy wiki.tq

# 4-bit compression (higher quality)
turboquant compress embeddings.npy -b 4

# Fast mode (no QJL)
turboquant compress big_set.npy -b 3 --no-qjl
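
When OUTPUT is omitted, the options table gives the default name as `{stem}_tq{b}.tq`. The sketch below shows one plausible reading of that template, assuming `{stem}` is the input file name without its extension and `{b}` is the bit width. `default_output_name` is a hypothetical helper, not a function exported by the package, and the directory the CLI actually writes to is not stated here.

```python
from pathlib import Path


def default_output_name(input_path: str, bits: int = 3) -> str:
    """Expand the documented template {stem}_tq{b}.tq (assumed interpretation)."""
    stem = Path(input_path).stem  # file name without directory or extension
    return f"{stem}_tq{bits}.tq"


print(default_output_name("embeddings.npy", bits=4))  # embeddings_tq4.tq
print(default_output_name("big_set.npy"))             # big_set_tq3.tq
```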

decompress

turboquant decompress INPUT [OUTPUT]

estimate

turboquant estimate INPUT [--bits N]

🐍 Python API

from turboquant_tools import compress, decompress, estimate_savings
import numpy as np

# Load or generate embeddings
vectors = np.random.randn(10000, 384).astype(np.float32)

# Compress (5–7× reduction)
compressed = compress(vectors, bits=3, use_qjl=False)
print(f"{vectors.nbytes / 1e6:.1f} MB → {compressed.nbytes / 1e6:.1f} MB ({compressed.memory.ratio:.1f}×)")

# Restore
restored = decompress(compressed)
print(f"MAE: {np.abs(restored - vectors).mean():.4f}")

# Estimate without running
est = estimate_savings(n_vectors=100000, dim=768, bits=3)
print(est)  # Original: X MB -> Compressed: Y MB (7.60×, save 87%)

CompressedVectors objects carry metadata:

compressed.n_vectors   # original count
compressed.dim         # original dimension
compressed.nbytes      # compressed size in bytes
compressed.memory      # MemoryBytes(original, compressed, ratio)
compressed.data        # raw .tq bytes (save to disk)
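
Since `compressed.data` is described as raw .tq bytes, persisting it should only need a binary write. A minimal sketch, assuming nothing beyond that description: `save_tq` is a hypothetical helper, and a stand-in payload replaces the real `compressed.data` so the snippet runs without the library installed.

```python
import tempfile
from pathlib import Path


def save_tq(data: bytes, path: str) -> int:
    """Write raw .tq bytes to disk and return the number of bytes written."""
    return Path(path).write_bytes(data)


# With the real library this would be: save_tq(compressed.data, "compressed.tq")
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.tq"
    written = save_tq(b"\x00" * 16, str(target))  # stand-in payload, not real .tq data
    size_on_disk = target.stat().st_size

print(written, size_on_disk)
```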

🤖 MCP Server (AI Agents)

TurboQuant Tools ships with a native MCP server for AI agent integration. It works with any MCP-compatible host (Hermes, Claude Desktop, etc.).

Start

turboquant mcp-server

Register in your MCP client

Hermes Agent (~/.hermes/config.yaml):

mcp_servers:
  turboquant-tools:
    command: turboquant
    args: ["mcp-server"]
    enabled: true

Claude Desktop (claude_desktop_config.json):

{
  "mcpServers": {
    "turboquant-tools": {
      "command": "turboquant",
      "args": ["mcp-server"]
    }
  }
}
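
If the client config already lists other servers, the entry has to be merged in rather than pasted over the file. The sketch below does that with the standard `json` module. `register_server` is a hypothetical helper, and the config path is passed in because the location of `claude_desktop_config.json` depends on the operating system and is not given here.

```python
import json
import tempfile
from pathlib import Path


def register_server(config_path: str) -> dict:
    """Add the turboquant-tools entry to an MCP client config, keeping other servers."""
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers["turboquant-tools"] = {"command": "turboquant", "args": ["mcp-server"]}
    path.write_text(json.dumps(config, indent=2))
    return config


# Demo against a throwaway file that already contains one (made-up) server entry.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "claude_desktop_config.json"
    cfg_path.write_text(json.dumps({"mcpServers": {"other": {"command": "other-server"}}}))
    updated = register_server(str(cfg_path))
    reloaded = json.loads(cfg_path.read_text())

print(sorted(updated["mcpServers"]))  # ['other', 'turboquant-tools']
```

Back up the real config before running something like this against it; the sketch rewrites the whole file.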

Available Tools

| Tool | Description |
| --- | --- |
| compress_embeddings | Compress vectors in-memory |
| decompress_embeddings | Restore compressed vectors |
| estimate_savings_mcp | Predict compression ratio |
| embed_and_compress | Embed texts via API + compress in one step |


📊 Performance

Measured on random float32 embeddings (CPU, no GPU needed):

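| Vectors | Dim | Mode | Original | Compressed | Ratio | MAE |
| --- | --- | --- | --- | --- | --- | --- |
| 20 | 384 | PolarQuant 3-bit | 30 KB | 10 KB | 3.0× | 2.6 |
| 20 | 384 | TurboQuant (QJL) | 30 KB | 20 KB | 1.5× | 3.3 |
| 100K | 384 | PolarQuant 3-bit | 153 MB | 20 MB | 7.6× | - |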

Use cases:

  • RAG pipelines: compress vector DB indexes

  • Edge devices: fit embeddings in limited RAM

  • Storage savings: reduce cloud costs for large vector stores

  • Memory-bound agents: compress context vectors on the fly


🧪 Development

git clone https://github.com/FreezeVII/turboquant-tools.git
cd turboquant-tools
pip install -e .
pip install pytest
pytest tests/

Run tests

pytest tests/ -v

🧱 How It Works

Two-stage compression inspired by Google's TurboQuant:

  1. PolarQuant: random Hadamard rotation + scalar quantization to 3–4 bits per dimension. Captures magnitude and direction.

  2. QJL (optional): Quantized Johnson-Lindenstrauss residual correction. Recovers high-frequency detail lost in PolarQuant.
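
To make the first stage concrete, here is a toy rotate-then-quantize round trip in pure Python. It is an illustration of the general idea only, not the library's code: the uniform quantizer, the ±3σ clipping range and all function names are assumptions, and the QJL stage is left out. The real codebook, padding and packing are not described in this README.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1 / math.sqrt(len(out))
    return [v * scale for v in out]


def bin_width(dim, bits, clip):
    """Width of one quantization bin; rotated unit vectors have std ~ 1/sqrt(dim)."""
    return 2 * clip / math.sqrt(dim) / (2 ** bits)


def quantize(vector, signs, bits=3, clip=3.0):
    """Normalize, apply random signs + Hadamard rotation, then quantize uniformly."""
    levels = 2 ** bits
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    rotated = fwht([s * v / norm for s, v in zip(signs, vector)])
    step = bin_width(len(vector), bits, clip)
    indices = [min(levels - 1, max(0, math.floor(r / step + levels / 2))) for r in rotated]
    return norm, indices


def dequantize(norm, indices, signs, bits=3, clip=3.0):
    """Map indices back to bin centers and undo the rotation and scaling."""
    levels = 2 ** bits
    step = bin_width(len(indices), bits, clip)
    rotated = [(i - levels / 2 + 0.5) * step for i in indices]
    restored = fwht(rotated)  # the orthonormal Hadamard transform is its own inverse
    return [norm * s * v for s, v in zip(signs, restored)]


random.seed(0)
dim = 64
signs = [random.choice((-1.0, 1.0)) for _ in range(dim)]
vector = [random.gauss(0.0, 1.0) for _ in range(dim)]

norm, indices = quantize(vector, signs)
restored = dequantize(norm, indices, signs)
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(vector, restored)))
print(f"relative error: {error / norm:.3f}")
```

The rotation spreads each vector's energy evenly across coordinates, which is what lets a single small set of quantization levels serve every dimension. The norm is stored separately so the magnitude can be restored after dequantization.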

Both stages run CPU-only via PyTorch; no GPU is required. The .tq binary format uses a 30-byte header with magic bytes (TQT2) plus packed indices and norms.
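
A quick sanity check on a file can be built from those two facts alone. The sketch below assumes the magic bytes sit at offset 0 of the header, which is not spelled out above. `looks_like_tq` is a hypothetical helper, and the demo files are synthetic rather than real compressor output.

```python
import tempfile
from pathlib import Path

MAGIC = b"TQT2"   # magic bytes named in the text
HEADER_SIZE = 30  # header length named in the text


def looks_like_tq(path: str) -> bool:
    """True if the file holds at least a full header that starts with the magic bytes.

    Assumes the magic bytes come first in the header (not confirmed by the text).
    """
    with open(path, "rb") as fh:
        header = fh.read(HEADER_SIZE)
    return len(header) == HEADER_SIZE and header.startswith(MAGIC)


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.tq"
    bad = Path(tmp) / "bad.tq"
    good.write_bytes(MAGIC + bytes(HEADER_SIZE - len(MAGIC)))  # synthetic header only
    bad.write_bytes(b"not a tq file at all, just some text")
    good_result = looks_like_tq(str(good))
    bad_result = looks_like_tq(str(bad))

print(good_result, bad_result)  # True False
```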

Under the hood this wraps OnlyTerp/turboquant, a reference PyTorch implementation.


📄 License

MIT. See LICENSE.


🙌 Contributing

PRs welcome! Ideas:

  • FAISS index compression (compress_faiss)

  • ONNX / NumPy-only backend (no PyTorch dep)

  • Streaming compression for billion-scale datasets

  • Pre-built wheels for faster install

