ArmBench MCP Server

by sirmos

⚡ ArmBench — Arm64 LLM Inference Benchmark Suite + MCP Server

KleidiAI-optimized LLM benchmarking and inference server for Arm64 cloud infrastructure. Built for the Arm AI Optimization Challenge 2026.


🎯 What is ArmBench?

ArmBench is a one-command benchmarking tool that:

  1. Deploys LLMs (Llama 3.2) on Arm64 cloud instances using llama.cpp + KleidiAI

  2. Measures real performance — tokens/sec, time-to-first-token, memory usage across quantization levels (Q4_K_M vs Q8_0)

  3. Serves results via an MCP-compatible FastAPI server any agent framework can call

  4. Visualizes everything in a clean real-time dashboard


🏗️ Architecture

armbench/
├── benchmark/     # llama.cpp + KleidiAI inference engine + metrics
├── mcp_server/    # FastAPI MCP-compatible LLM endpoint
├── dashboard/     # Real-time results dashboard (HTML)
├── scripts/       # One-command setup + benchmark + server scripts
└── docker/        # Arm64-optimized Docker configuration

🚀 Quick Start (Arm64 Instance)

1. Clone and setup

git clone https://github.com/sirmos/armbench.git
cd armbench
bash scripts/setup.sh

2. Run benchmark

bash scripts/run_benchmark.sh

3. Start MCP server

bash scripts/start_mcp.sh

4. Open dashboard

Navigate to http://<your-instance-ip>:8000 in your browser, replacing <your-instance-ip> with the public IP address of your Arm64 instance.


☁️ Tested Arm64 Platforms

| Platform     | Instance            | Arm CPU      |
|--------------|---------------------|--------------|
| Oracle Cloud | VM.Standard.A1.Flex | Ampere Altra |
| AWS          | c7g.large           | Graviton3    |
| GCP          | c4a-standard-4      | Axion        |


📊 What We Benchmark

| Metric              | Description                               |
|---------------------|-------------------------------------------|
| Tokens/sec          | Inference throughput                      |
| Time to First Token | Latency from prompt to first output token |
| Memory (MB)         | RAM consumed during inference             |
| Model size (GB)     | Disk footprint per quantization level     |
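As a rough illustration of how the two timing metrics relate, here is a minimal Python sketch. It assumes throughput counts only the decode phase (tokens after the first one); ArmBench's own definition may differ, for example by including prompt processing. The `RunTiming` structure and both function names are hypothetical and not taken from the repository.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Timestamps in seconds around one generation run (hypothetical structure)."""
    request_sent: float    # when the prompt was submitted
    first_token: float     # when the first output token arrived
    last_token: float      # when the final output token arrived
    tokens_generated: int  # total number of output tokens


def time_to_first_token(t: RunTiming) -> float:
    """Latency from prompt submission to the first output token."""
    return t.first_token - t.request_sent


def tokens_per_second(t: RunTiming) -> float:
    """Decode throughput: tokens after the first, divided by decode time.

    Assumption: the first token is attributed to prompt processing,
    so it is excluded from the throughput figure.
    """
    decode_time = t.last_token - t.first_token
    if t.tokens_generated < 2 or decode_time <= 0:
        return 0.0
    return (t.tokens_generated - 1) / decode_time


if __name__ == "__main__":
    run = RunTiming(request_sent=0.0, first_token=0.5, last_token=4.5,
                    tokens_generated=81)
    print(f"TTFT: {time_to_first_token(run):.2f} s")
    print(f"Throughput: {tokens_per_second(run):.1f} tokens/sec")
```

Separating the first token from the rest keeps a long prompt from dragging down the reported decode speed, which makes comparisons between quantization levels easier to read.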

Models

| Model                 | Quant  | Size   | Use case          |
|-----------------------|--------|--------|-------------------|
| Llama-3.2-3B-Instruct | Q4_K_M | 1.9 GB | Speed-optimized   |
| Llama-3.2-3B-Instruct | Q8_0   | 3.4 GB | Quality-optimized |


🔌 MCP Server API

| Endpoint   | Method | Description                  |
|------------|--------|------------------------------|
| /          | GET    | Server info                  |
| /health    | GET    | Health + platform info       |
| /models    | GET    | List available models        |
| /generate  | POST   | Run inference                |
| /benchmark | POST   | Full benchmark suite         |
| /mcp/tools | GET    | MCP-compatible tools listing |
| /docs      | GET    | Interactive API docs         |

Example: Generate

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is KleidiAI?", "model": "Llama-3.2-3B-Q4_K_M"}'
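The same request can be made from Python using only the standard library. This is a sketch rather than an official client: the payload fields mirror the curl example above, but the response schema is not documented here, so the result is returned as a plain dict. The function names and the 120-second timeout are assumptions.

```python
import json
import urllib.request


def build_generate_request(prompt: str, model: str,
                           base_url: str = "http://localhost:8000"
                           ) -> urllib.request.Request:
    """Build a POST request for /generate with the same JSON body as the curl example."""
    body = json.dumps({"prompt": prompt, "model": model}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str,
             base_url: str = "http://localhost:8000",
             timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response.

    Assumption: the server replies with a JSON object; its fields are
    not specified in this README, so no keys are accessed here.
    """
    req = build_generate_request(prompt, model, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running ArmBench server on localhost:8000.
    print(generate("What is KleidiAI?", "Llama-3.2-3B-Q4_K_M"))
```

CPU inference on small instances can take a while, so a generous timeout is used; adjust it to match your model and instance size.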

⚙️ Arm-Specific Optimizations

  • KleidiAI: Arm's optimized kernel library for ML workloads

  • llama.cpp Arm SVE: Scalable Vector Extension support enabled at build time

  • Native CPU tuning: -DLLAMA_NATIVE=ON compiles for the build host's exact CPU microarchitecture (recent llama.cpp releases name this option GGML_NATIVE)

  • Thread optimization: Automatically uses all available Arm cores
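Whether these optimizations apply depends on what the instance's CPU actually supports. The sketch below is one way to check on Linux by reading the `Features` line of `/proc/cpuinfo` and picking a thread count from the visible core count. It is not ArmBench's own detection logic; the function names and the list of features checked are assumptions chosen for illustration.

```python
import os

# Linux arm64 hwcap names as they appear on the "Features" line of /proc/cpuinfo.
# asimd = NEON, asimddp = dot product, i8mm = int8 matrix multiply, sve = SVE.
FEATURES_OF_INTEREST = ("asimd", "asimddp", "i8mm", "sve")


def parse_cpu_features(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the first 'Features' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "features":
            return set(value.split())
    return set()


def summarize(cpuinfo_text: str) -> dict[str, bool]:
    """Map each feature of interest to whether the CPU reports it."""
    present = parse_cpu_features(cpuinfo_text)
    return {name: name in present for name in FEATURES_OF_INTEREST}


def default_thread_count() -> int:
    """Use every core the OS reports, falling back to 1 if unknown."""
    return os.cpu_count() or 1


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(summarize(f.read()))
    except OSError:
        print("/proc/cpuinfo not available on this system")
    print("threads:", default_thread_count())
```

Not every platform in the table above reports the same features, so a missing `sve` entry on one instance type does not necessarily indicate a build problem.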


📄 License

MIT License — see LICENSE


Built for the Arm AI Optimization Challenge 2026
