nsys-mcp is an MCP (Model Context Protocol) server that
provides GPU profiling capabilities through NVIDIA Nsight Systems (nsys).
It lets an LLM agent profile binaries, parse reports, compute statistics, and
analyze interval trees — all via standard MCP tool calls.
## Prerequisites

- Python 3.10+
- NVIDIA Nsight Systems (`nsys`) installed and available in `PATH`. Download from the Nsight Systems page; see the Nsight Systems documentation for setup details.

## Installation

```bash
pip install -e .
```

For development (tests):

```bash
pip install -e ".[dev]"
```

## Running the Server
The server communicates over stdio (the default MCP transport):
```bash
python -m nsys_mcp.server
```

## Cursor / VS Code MCP configuration

Add to your MCP settings (e.g. `.cursor/mcp.json`):
```json
{
  "mcpServers": {
    "nsys-profiler": {
      "command": "python",
      "args": ["-m", "nsys_mcp.server"]
    }
  }
}
```

## Available Tools
The server exposes 10 tools:

| # | Tool | Description |
|---|------|-------------|
| 1 | `check_nsys` | Verify that `nsys` is available |
| 2 | `profile_binary` | Profile a binary with full CUDA, NVTX, and GPU metrics collection |
| 3 | `load_report` | Load a pre-existing `.nsys-rep` or NDJSON file without re-profiling |
| 4 | | List all cached profiling reports with metadata |
| 5 | `get_event_summary` | Breakdown of event types and counts for a report |
| 6 | `get_kernel_stats` | Aggregate GPU kernel statistics grouped by kernel name |
| 7 | `get_nvtx_stats` | Aggregate NVTX range durations grouped by annotation text |
| 8 | `get_memcpy_stats` | Aggregate memory copy statistics grouped by direction |
| 9 | `build_interval_tree` | Construct an interval tree from profiling events |
| 10 | `query_interval_tree` | Run structural queries against an interval tree |
### profile_binary

Profile a binary with full CUDA, NVTX, and GPU metrics collection. Results are cached, so repeated calls with the same arguments skip re-profiling.

| Parameter | Description |
|-----------|-------------|
| `binary` | Path to the executable |
| | Command-line arguments (optional) |
| | Extra environment variables (optional) |
| | Working directory (optional) |
| | Max profiling duration in seconds (optional) |
| | Additional nsys flags (optional) |

Returns `report_id`, `event_counts`, and `time_span_ns`.
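Conceptually, `profile_binary` wraps an `nsys profile` invocation. The sketch below shows one plausible way to assemble that command line; the helper name, the default output name, and the exact flag set are assumptions, not the server's actual implementation:

```python
import shlex

def build_nsys_command(binary, args=None, duration=None, extra_flags=None,
                       output="report"):
    """Assemble an `nsys profile` command (a sketch; the real server's
    flag set and defaults may differ)."""
    cmd = ["nsys", "profile", "--trace=cuda,nvtx", "-o", output]
    if duration is not None:
        cmd.append(f"--duration={duration}")  # cap profiling time in seconds
    if extra_flags:
        cmd.extend(extra_flags)               # pass-through nsys flags
    cmd.append(binary)
    cmd.extend(args or [])
    return cmd

# Example: a 30-second capped profile of a hypothetical solver binary
print(shlex.join(build_nsys_command("/app/solver", ["--iters", "100"], duration=30)))
```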
### load_report

Load a pre-existing `.nsys-rep` or NDJSON `.json` file without re-profiling.

| Parameter | Description |
|-----------|-------------|
| | Path to the `.nsys-rep` or `.json` file |

### get_event_summary

Get a breakdown of event types and counts for a report.

| Parameter | Description |
|-----------|-------------|
| `report_id` | ID from `profile_binary` |
### get_kernel_stats

Aggregate GPU kernel statistics grouped by kernel name. Includes duration statistics (mean, std, min, max, median, count, total) and GPU metrics (grid/block size, shared memory, registers).

| Parameter | Description |
|-----------|-------------|
| `report_id` | Report identifier |
| `top_n` | Limit to top N kernels (optional) |
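The aggregation step is a group-by over kernel events. Here is a minimal sketch, assuming a hypothetical event shape of `{"name": str, "duration_ns": int}` (the server's actual event model is Pydantic-based and richer):

```python
from collections import defaultdict
from statistics import mean, median, pstdev

def aggregate_kernel_stats(events, top_n=None):
    """Group kernel events by name and compute duration statistics."""
    by_name = defaultdict(list)
    for ev in events:
        by_name[ev["name"]].append(ev["duration_ns"])
    stats = []
    for name, durs in by_name.items():
        stats.append({
            "name": name,
            "count": len(durs),
            "total": sum(durs),
            "mean": mean(durs),
            "median": median(durs),
            "min": min(durs),
            "max": max(durs),
            "std": pstdev(durs),
        })
    # Sort by total duration so top_n returns the heaviest kernels first
    stats.sort(key=lambda s: s["total"], reverse=True)
    return stats[:top_n] if top_n else stats
```

Sorting by total (rather than mean) duration matches the typical "which kernels dominate the timeline" question.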
### get_nvtx_stats

Aggregate NVTX range durations grouped by annotation text.

| Parameter | Description |
|-----------|-------------|
| `report_id` | Report identifier |
| | Filter by NVTX domain (optional) |

### get_memcpy_stats

Aggregate memory copy statistics grouped by copy direction (HtoD, DtoH, DtoD, etc.). Includes duration stats, total bytes, and bandwidth estimates.

| Parameter | Description |
|-----------|-------------|
| `report_id` | Report identifier |
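A bandwidth estimate follows directly from the aggregated totals: bytes moved divided by cumulative copy time. A minimal sketch (the function name and unit choice are illustrative, not the server's API):

```python
def bandwidth_gib_per_s(total_bytes, total_duration_ns):
    """Estimate effective copy bandwidth in GiB/s from aggregated totals."""
    if total_duration_ns == 0:
        return 0.0  # avoid division by zero for empty aggregates
    seconds = total_duration_ns / 1e9
    return total_bytes / (1024 ** 3) / seconds

# 2 GiB copied in 1 s of cumulative copy time
print(bandwidth_gib_per_s(2 * 1024 ** 3, 1_000_000_000))  # -> 2.0
```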
### build_interval_tree

Construct an interval tree from profiling events. If multiple disjoint trees exist (a forest), they can be merged under a synthetic root.

| Parameter | Description |
|-----------|-------------|
| `report_id` | Report identifier |
| | Subset of event types to include (optional) |
| | Merge the forest into a single tree |
| | Filter by thread/stream ID (optional) |
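One plausible construction scheme is to nest time intervals by containment with a stack, attaching everything under a synthetic root. This is a sketch under assumed event shape `{"name", "start", "end"}` and the assumption that intervals are either nested or disjoint; it is not the server's actual algorithm:

```python
def build_tree(events):
    """Nest intervals by containment under a synthetic root."""
    # Sort by start ascending, then end descending, so parents precede children
    events = sorted(events, key=lambda e: (e["start"], -e["end"]))
    root = {"name": "<root>", "start": float("-inf"), "end": float("inf"),
            "children": []}
    stack = [root]
    for ev in events:
        node = {**ev, "children": []}
        # Pop intervals that cannot contain this one
        while stack[-1]["end"] < node["end"]:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root
```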
### query_interval_tree

Run structural queries against a previously built interval tree.

| Parameter | Description |
|-----------|-------------|
| `report_id` | Report identifier |
| `query_type` | One of the query types below |
| `event_name` | Event name, for queries that target a named event |
| | Scope the query to a named subtree (optional) |
| | Limit traversal depth (optional) |

Query types:

| Type | Description |
|------|-------------|
| `most_time_consuming` | Find the longest-duration event in a subtree |
| | List top-level interval names |
| `count_calls` | Count occurrences of a named event in a subtree |
| | Aggregated stats for a named subtree |
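Two of the query types reduce to simple recursive traversals. A sketch, assuming the node shape `{"name", "start", "end", "children"}` used above (not the server's actual implementation):

```python
def duration(node):
    return node["end"] - node["start"]

def most_time_consuming(node):
    """Find the longest-duration event under `node` (the root itself excluded)."""
    best = None
    for child in node.get("children", []):
        for cand in (child, most_time_consuming(child)):
            if cand is not None and (best is None or duration(cand) > duration(best)):
                best = cand
    return best

def count_calls(node, event_name):
    """Count occurrences of a named event in the subtree rooted at `node`."""
    hits = 1 if node.get("name") == event_name else 0
    return hits + sum(count_calls(c, event_name) for c in node.get("children", []))
```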
## Typical Workflow

1. `check_nsys()` — verify nsys is available
2. `profile_binary(binary="/app/solver", ...)` — profile and get a `report_id`
3. `get_kernel_stats(report_id, top_n=10)` — see the top 10 kernels
4. `get_nvtx_stats(report_id)` — see NVTX annotation timings
5. `get_memcpy_stats(report_id)` — see memory transfer stats
6. `build_interval_tree(report_id)` — build the tree
7. `query_interval_tree(report_id, query_type="most_time_consuming")` — find the bottleneck
8. `query_interval_tree(report_id, query_type="count_calls", event_name="cub::DeviceReduce")` — count specific kernel calls

## Caching
Profiling results are cached in two tiers:
- **In-memory LRU** — fast access for the current session (up to 8 reports).
- **Disk** — persists across server restarts at `~/.nsys_mcp/cache/`.
Cache keys are derived from the binary path and arguments, so identical profiling runs reuse cached results automatically.
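A minimal sketch of such a key derivation, assuming a hash over a canonical JSON encoding of the request (the server's actual key scheme may include more fields, such as the environment or working directory):

```python
import hashlib
import json

def cache_key(binary, args=None):
    """Derive a stable cache key from the profiling request (illustrative)."""
    # sort_keys gives a canonical encoding, so equal requests hash equally
    payload = json.dumps({"binary": binary, "args": args or []}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Identical requests map to the same key, so re-profiling is skipped
assert cache_key("/app/solver", ["--iters", "100"]) == \
       cache_key("/app/solver", ["--iters", "100"])
```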
## Testing

```bash
pip install -e ".[dev]"
pytest
```

## Project Structure
```text
src/nsys_mcp/
├── server.py         # FastMCP server, tool definitions, lifespan
├── nsys_runner.py    # nsys CLI wrapper (profile, export, version)
├── report_parser.py  # NDJSON streaming parser, string-table resolution
├── models.py         # Pydantic models for events, stats, configs
├── aggregator.py     # Group-by aggregation (mean, std, min, max, count)
├── interval_tree.py  # Interval tree/forest construction + queries
└── cache.py          # Two-tier cache (memory LRU + disk pickle)
```
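The "NDJSON streaming parser" mentioned above boils down to reading one JSON object per line, so large reports never need to fit in memory at once. A minimal sketch (function name is illustrative, not the module's actual API):

```python
import json
from typing import IO, Iterator

def stream_ndjson(fp: IO[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-blank line of an NDJSON stream."""
    for line in fp:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)
```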
## License

nsys-mcp is licensed under the MIT License.