nsys-mcp is an MCP (Model Context Protocol) server that
provides GPU profiling capabilities through NVIDIA Nsight Systems (nsys).
It lets an LLM agent profile binaries, parse reports, compute statistics, and
analyze interval trees — all via standard MCP tool calls.
## Prerequisites

- Python 3.10+
- NVIDIA Nsight Systems (`nsys`) installed and available in `PATH`. Download from the Nsight Systems page; see the Nsight Systems documentation for setup details.

## Installation

```bash
pip install -e .
```

For development (tests):

```bash
pip install -e ".[dev]"
```

## Running the Server
The server communicates over stdio (the default MCP transport):
```bash
python -m nsys_mcp.server
```

## Cursor / VS Code MCP configuration

Add to your MCP settings (e.g. `.cursor/mcp.json`):
```json
{
  "mcpServers": {
    "nsys-profiler": {
      "command": "python",
      "args": ["-m", "nsys_mcp.server"]
    }
  }
}
```

## Available Tools
The server exposes 10 tools:

| # | Tool | Description |
|---|------|-------------|
| 1 | `check_nsys` | Verify that `nsys` is available |
| 2 | `profile_binary` | Profile a binary with full CUDA, NVTX, and GPU metrics collection |
| 3 | `load_report` | Load a pre-existing `.nsys-rep` or NDJSON file without re-profiling |
| 4 | | List all cached profiling reports with metadata |
| 5 | `get_event_summary` | Breakdown of event types and counts for a report |
| 6 | `get_kernel_stats` | Aggregate GPU kernel statistics grouped by kernel name |
| 7 | `get_nvtx_stats` | Aggregate NVTX range durations grouped by annotation text |
| 8 | `get_memcpy_stats` | Aggregate memory copy statistics grouped by direction |
| 9 | `build_interval_tree` | Construct an interval tree from profiling events |
| 10 | `query_interval_tree` | Run structural queries against an interval tree |
### profile_binary

Profile a binary with full CUDA, NVTX, and GPU metrics collection. Results are cached, so repeated calls with the same arguments skip re-profiling.

| Parameter | Description |
|-----------|-------------|
| `binary` | Path to the executable |
| | Command-line arguments (optional) |
| | Extra environment variables (optional) |
| | Working directory (optional) |
| | Max profiling duration in seconds (optional) |
| | Additional nsys flags (optional) |

Returns `report_id`, `event_counts`, and `time_span_ns`.
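Conceptually, `profile_binary` wraps an `nsys profile` invocation. The sketch below shows one plausible way to assemble that command line; the helper name, the default output name, and the exact flag set are assumptions, not the server's actual implementation:

```python
import shlex

def build_nsys_command(binary, args=None, duration=None, extra_flags=None,
                       output="report"):
    """Assemble an `nsys profile` command (a sketch; the real server's
    flag set and defaults may differ)."""
    cmd = ["nsys", "profile", "--trace=cuda,nvtx", "-o", output]
    if duration is not None:
        cmd.append(f"--duration={duration}")  # cap profiling time in seconds
    if extra_flags:
        cmd.extend(extra_flags)               # pass-through nsys flags
    cmd.append(binary)
    cmd.extend(args or [])
    return cmd

# Example: a 30-second capped profile of a hypothetical solver binary
print(shlex.join(build_nsys_command("/app/solver", ["--iters", "100"], duration=30)))
```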
### load_report

Load a pre-existing `.nsys-rep` or NDJSON `.json` file without re-profiling.

| Parameter | Description |
|-----------|-------------|
| | Path to the `.nsys-rep` or `.json` file |

### get_event_summary

Get a breakdown of event types and counts for a report.

| Parameter | Description |
|-----------|-------------|
| `report_id` | ID from `profile_binary` |
### get_kernel_stats

Aggregate GPU kernel statistics grouped by kernel name. Includes duration statistics (mean, std, min, max, median, count, total) and GPU metrics (grid/block size, shared memory, registers).

| Parameter | Description |
|-----------|-------------|
| `report_id` | Report identifier |
| `top_n` | Limit to top N kernels (optional) |
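The aggregation step is a group-by over kernel events. Here is a minimal sketch, assuming a hypothetical event shape of `{"name": str, "duration_ns": int}` (the server's actual event model is Pydantic-based and richer):

```python
from collections import defaultdict
from statistics import mean, median, pstdev

def aggregate_kernel_stats(events, top_n=None):
    """Group kernel events by name and compute duration statistics."""
    by_name = defaultdict(list)
    for ev in events:
        by_name[ev["name"]].append(ev["duration_ns"])
    stats = []
    for name, durs in by_name.items():
        stats.append({
            "name": name,
            "count": len(durs),
            "total": sum(durs),
            "mean": mean(durs),
            "median": median(durs),
            "min": min(durs),
            "max": max(durs),
            "std": pstdev(durs),
        })
    # Sort by total duration so top_n returns the heaviest kernels first
    stats.sort(key=lambda s: s["total"], reverse=True)
    return stats[:top_n] if top_n else stats
```

Sorting by total (rather than mean) duration matches the typical "which kernels dominate the timeline" question.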
### get_nvtx_stats

Aggregate NVTX range durations grouped by annotation text.

| Parameter | Description |
|-----------|-------------|
| `report_id` | Report identifier |
| | Filter by NVTX domain (optional) |

### get_memcpy_stats

Aggregate memory copy statistics grouped by copy direction (HtoD, DtoH, DtoD, etc.). Includes duration stats, total bytes, and bandwidth estimates.

| Parameter | Description |
|-----------|-------------|
| `report_id` | Report identifier |
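A bandwidth estimate follows directly from the aggregated totals: bytes moved divided by cumulative copy time. A minimal sketch (the function name and unit choice are illustrative, not the server's API):

```python
def bandwidth_gib_per_s(total_bytes, total_duration_ns):
    """Estimate effective copy bandwidth in GiB/s from aggregated totals."""
    if total_duration_ns == 0:
        return 0.0  # avoid division by zero for empty aggregates
    seconds = total_duration_ns / 1e9
    return total_bytes / (1024 ** 3) / seconds

# 2 GiB copied in 1 s of cumulative copy time
print(bandwidth_gib_per_s(2 * 1024 ** 3, 1_000_000_000))  # -> 2.0
```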
### build_interval_tree

Construct an interval tree from profiling events. If multiple disjoint trees exist (a forest), they can be merged under a synthetic root.

| Parameter | Description |
|-----------|-------------|
| `report_id` | Report identifier |
| | Subset of event types to include (optional) |
| | Merge the forest into a single tree |
| | Filter by thread/stream ID (optional) |
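One plausible construction scheme is to nest time intervals by containment with a stack, attaching everything under a synthetic root. This is a sketch under assumed event shape `{"name", "start", "end"}` and the assumption that intervals are either nested or disjoint; it is not the server's actual algorithm:

```python
def build_tree(events):
    """Nest intervals by containment under a synthetic root."""
    # Sort by start ascending, then end descending, so parents precede children
    events = sorted(events, key=lambda e: (e["start"], -e["end"]))
    root = {"name": "<root>", "start": float("-inf"), "end": float("inf"),
            "children": []}
    stack = [root]
    for ev in events:
        node = {**ev, "children": []}
        # Pop intervals that cannot contain this one
        while stack[-1]["end"] < node["end"]:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root
```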
### query_interval_tree

Run structural queries against a previously built interval tree.

| Parameter | Description |
|-----------|-------------|
| `report_id` | Report identifier |
| `query_type` | One of the query types below |
| `event_name` | Event name, for queries that target a named event |
| | Scope the query to a named subtree (optional) |
| | Limit traversal depth (optional) |

Query types:

| Type | Description |
|------|-------------|
| `most_time_consuming` | Find the longest-duration event in a subtree |
| | List top-level interval names |
| `count_calls` | Count occurrences of a named event in a subtree |
| | Aggregated stats for a named subtree |
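Two of the query types reduce to simple recursive traversals. A sketch, assuming the node shape `{"name", "start", "end", "children"}` used above (not the server's actual implementation):

```python
def duration(node):
    return node["end"] - node["start"]

def most_time_consuming(node):
    """Find the longest-duration event under `node` (the root itself excluded)."""
    best = None
    for child in node.get("children", []):
        for cand in (child, most_time_consuming(child)):
            if cand is not None and (best is None or duration(cand) > duration(best)):
                best = cand
    return best

def count_calls(node, event_name):
    """Count occurrences of a named event in the subtree rooted at `node`."""
    hits = 1 if node.get("name") == event_name else 0
    return hits + sum(count_calls(c, event_name) for c in node.get("children", []))
```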
## Typical Workflow

1. `check_nsys()` — verify nsys is available
2. `profile_binary(binary="/app/solver", ...)` — profile and get a `report_id`
3. `get_kernel_stats(report_id, top_n=10)` — see the top 10 kernels
4. `get_nvtx_stats(report_id)` — see NVTX annotation timings
5. `get_memcpy_stats(report_id)` — see memory transfer stats
6. `build_interval_tree(report_id)` — build the tree
7. `query_interval_tree(report_id, query_type="most_time_consuming")` — find the bottleneck
8. `query_interval_tree(report_id, query_type="count_calls", event_name="cub::DeviceReduce")` — count specific kernel calls

## Caching
Profiling results are cached in two tiers:
- **In-memory LRU** — fast access for the current session (up to 8 reports).
- **Disk** — persists across server restarts at `~/.nsys_mcp/cache/`.
Cache keys are derived from the binary path and arguments, so identical profiling runs reuse cached results automatically.
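A minimal sketch of such a key derivation, assuming a hash over a canonical JSON encoding of the request (the server's actual key scheme may include more fields, such as the environment or working directory):

```python
import hashlib
import json

def cache_key(binary, args=None):
    """Derive a stable cache key from the profiling request (illustrative)."""
    # sort_keys gives a canonical encoding, so equal requests hash equally
    payload = json.dumps({"binary": binary, "args": args or []}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Identical requests map to the same key, so re-profiling is skipped
assert cache_key("/app/solver", ["--iters", "100"]) == \
       cache_key("/app/solver", ["--iters", "100"])
```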
## Testing

```bash
pip install -e ".[dev]"
pytest
```

## Project Structure
```text
src/nsys_mcp/
├── server.py         # FastMCP server, tool definitions, lifespan
├── nsys_runner.py    # nsys CLI wrapper (profile, export, version)
├── report_parser.py  # NDJSON streaming parser, string-table resolution
├── models.py         # Pydantic models for events, stats, configs
├── aggregator.py     # Group-by aggregation (mean, std, min, max, count)
├── interval_tree.py  # Interval tree/forest construction + queries
└── cache.py          # Two-tier cache (memory LRU + disk pickle)
```
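The "NDJSON streaming parser" mentioned above boils down to reading one JSON object per line, so large reports never need to fit in memory at once. A minimal sketch (function name is illustrative, not the module's actual API):

```python
import json
from typing import IO, Iterator

def stream_ndjson(fp: IO[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-blank line of an NDJSON stream."""
    for line in fp:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)
```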
## License

nsys-mcp is licensed under the MIT License.