Using AI to find bugs in open-source projects
Written by Frank Fiegel.
I will preface this article by saying that while the testing of projects and the generation of GitHub issues have been automated, the actual submission of the GitHub issues has not. I have been doing it manually to avoid accidentally submitting issues that do not provide valuable context.
Scroll down to Experiment results for a summary of the results of this experiment.
Context
Recently, I've been actively involved in the MCP ecosystem. For those who are not familiar with it, the Model Context Protocol (MCP) is a protocol proposed by members of the Anthropic team in an effort to standardize how applications provide context to LLMs. You can read more about it here.
The important thing to know about MCP is that it spurred the creation of hundreds of MCP servers. Each of these servers is essentially a standalone application (mostly written in Python or TypeScript) that LLMs can communicate with through the MCP protocol (using `stdio` or SSE transports).
At the moment, the instructions for running these servers are mostly geared towards engineers, as they usually require at least a high-level knowledge of Python or Node.js. Generally speaking, though, the instructions are as simple as running a single command with `npx` or `uvx`.
However, as you can imagine, with 300+ open-source servers appearing in just under a couple of weeks, the sheer volume meant that a lot of MCP servers were made public without enough testing to ensure that they can be set up outside of the original authors' environment. This is where the LLMs come in.
Sandboxing MCP servers using Docker and LLMs
As part of embedding MCP into Glama, I had to figure out how to virtualize these servers. I decided to use Docker, as it allowed me to run the servers in a sandboxed environment. However, it still meant that I had to write instructions for hundreds of servers, with new servers being added every hour. Therefore, I decided to also automate the process of generating the `Dockerfile` for each server.
Generating the Dockerfile
Here is how it works:
1. We pull the file tree from the GitHub repository.
2. We inspect the presence of various files, such as `pyproject.toml`, `uv.toml`, `package.json`, and `README.md`.
3. We then use LLMs to hypothesize scenarios for how to build a `Dockerfile` that would start the server.
4. Finally, we run the `Dockerfile` and attempt to connect to the MCP server.
5. If there is an error, we feed it back to step 3 and try again.
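In code, the loop looks roughly like the TypeScript sketch below. All of the helpers (`fetchFileTree`, `generateDockerfile`, `buildImage`, `startMcpServer`) are hypothetical stand-ins for the GitHub API, an LLM call, and Docker, and the retry limit is arbitrary; the real pipeline is more involved.

```typescript
// Hypothetical helpers standing in for the GitHub API, an LLM call, and Docker.
declare function fetchFileTree(repo: string): Promise<string[]>;
declare function generateDockerfile(input: {
  fileTree: string[];
  previousError?: string;
}): Promise<string>;
declare function buildImage(dockerfile: string): Promise<string>;
declare function startMcpServer(image: string): Promise<void>;

type Attempt = { dockerfile: string | null; error?: string };

export const inferDockerfile = async (repo: string): Promise<Attempt> => {
  const fileTree = await fetchFileTree(repo);

  let previousError: string | undefined;

  for (let attempt = 0; attempt < 5; attempt++) {
    // Step 3: ask the LLM for a Dockerfile, feeding back the last error, if any.
    const dockerfile = await generateDockerfile({ fileTree, previousError });

    try {
      // Step 4: build the image and attempt to connect to the MCP server.
      const image = await buildImage(dockerfile);
      await startMcpServer(image);

      return { dockerfile };
    } catch (error) {
      // Step 5: capture the failure and retry with the error as added context.
      previousError = error instanceof Error ? error.message : String(error);
    }
  }

  return { dockerfile: null, error: previousError };
};
```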
Classifying the failures
Today, out of 310 MCP servers in our directory, we have 211 servers that can be started using a `Dockerfile` inferred this way. Out of the remaining 99 servers, the failures to start are grouped into the following categories:
- `unknown` (5%) – the server is failing to start for an unknown reason.
- `unsupported` (22%) – the server requires dependencies that are currently not supported by our build environment (e.g. missing drivers).
- `requires_valid_credentials` (8%) – the server cannot boot without valid credentials (e.g. database credentials, API keys).
- `missing_configuration` (32%) – the server is failing to start due to missing `ENV` variables, missing positional arguments, or missing configuration files.
- `invalid_server` (33%) – the server cannot be started due to missing files, dependencies, or other issues that cannot be recovered from.
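For reference, these categories can be modeled as a simple union type; this is just a sketch of one possible representation, not the actual implementation.

```typescript
// One possible representation of the failure categories described above.
type StartFailureCategory =
  | 'unknown'
  | 'unsupported'
  | 'requires_valid_credentials'
  | 'missing_configuration'
  | 'invalid_server';
```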
Supporting the servers in the `unsupported` category is only a matter of time. We know that most of the servers in the `requires_valid_credentials` category are able to run, but we need to manually supply test credentials. Everything in `missing_configuration` is something that we can fix by mocking the missing configuration, so it is also only a matter of time before we can support it.
Determining which servers are invalid
I am feeding LLMs the Docker build logs and `stderr` output of the servers to determine which servers are invalid.
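As an illustrative example (the exact wording varies by npm version), a Node.js server published without an executable entry point typically fails with `stderr` output along the lines of:

```
npm ERR! could not determine executable to run
```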
If we get an error like this when trying to start the server, we can pretty confidently say that the server author has not provided a `bin` entry point in `package.json`, and therefore the server cannot be started. I have coded a dozen of these rules based on observing the build pipeline of hundreds of MCP servers, roughly along the lines of the sketch below.
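This is a minimal sketch of what such a rule table could look like; the patterns and verdicts are illustrative, not the actual rule set.

```typescript
// Illustrative stderr-matching rules for classifying startup failures.
type Verdict = {
  category: 'invalid_server' | 'missing_configuration';
  reason: string;
};

type Rule = { pattern: RegExp; verdict: Verdict };

const rules: Rule[] = [
  {
    pattern: /could not determine executable to run/i,
    verdict: {
      category: 'invalid_server',
      reason: 'package.json is missing a bin entry point',
    },
  },
  {
    pattern: /environment variable .+ is (not set|required)/i,
    verdict: {
      category: 'missing_configuration',
      reason: 'the server requires environment variables that were not provided',
    },
  },
];

export const classifyStderr = (stderr: string): Verdict | null => {
  for (const rule of rules) {
    if (rule.pattern.test(stderr)) {
      return rule.verdict;
    }
  }

  // No rule matched; fall back to letting an LLM judge the failure.
  return null;
};
```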
So we are left with a large number of `invalid_server` servers, which we cannot do anything about... they need to be fixed by the server author.
Let's try to nudge the authors of these servers to fix these issues.
Generating GitHub issues
I realized that if I cannot start the server, then others cannot either. And if I were the author of the server, I would really want to know that the server I intended to make available to others is not working. And since I already have the setup instructions (`Dockerfile`), the error (`stderr`), the error classification, and the GitHub repository URL, I can easily generate a GitHub issue.
The way it works is that I run a script to index all MCP servers several times a day (to pick up the latest GitHub repository contents). If a server is failing to start and the error is classified as `invalid_server`, I then generate a GitHub issue template URL and send it to a private Discord channel for review. There, I review every issue proposal and either go ahead and create the GitHub issue, or dismiss it.
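Here is a minimal sketch of how such an issue template URL could be assembled from the data the pipeline already has; the field names and issue wording are illustrative, not the exact template I use.

```typescript
type FailedServer = {
  repositoryUrl: string; // e.g. 'https://github.com/example/some-mcp-server'
  dockerfile: string;
  stderr: string;
};

// GitHub prefills a new issue from the `title` and `body` query parameters.
export const buildIssueUrl = (server: FailedServer): string => {
  const title = 'Unable to start the MCP server';

  const body = [
    'I tried to run this MCP server in an isolated Docker container, but it fails to start.',
    '',
    'Dockerfile used to build the server:',
    '',
    server.dockerfile,
    '',
    'stderr output observed when starting the server:',
    '',
    server.stderr,
  ].join('\n');

  const url = new URL(`${server.repositoryUrl}/issues/new`);

  url.searchParams.set('title', title);
  url.searchParams.set('body', body);

  return url.toString();
};
```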
Experiment results
I started raising issues just last week and kept the volume low to test how server authors would react to these issues.
Below is a list of the issues, and the outcome at the time of writing this article.
| Issue | Outcome |
|---|---|
| mcp-discord#2 | ✅ Author fixed the issue. |
| web-browser-mcp-server#1 | ✅ Author fixed the issue. |
| box-mcp-server#2 | ✅ Author fixed the issue. |
| moondream-mcp#1 | ✅ Author fixed the issue. |
| mcp-clickhouse#1 | ✅ Author fixed the issue. |
| mcp-snowflake-server#2 | ✅ Author fixed the issue. |
| mcp-server-bigquery#2 | Issue misclassified. Should have been `missing_configuration`. |
| mcp-pinecone#3 | Issue misclassified. Should have been `missing_configuration`. |
| mcp-otzaria-server#1 | Issue misclassified. Should have been `missing_configuration`. |
| mcp-solver#2 | Issue misclassified. Should have been `unsupported`. |
A pretty strong start: out of 10 reported issues, 6 resulted in the server being fixed and 3 identified gaps in the `README.md` setup instructions.
This started as an experiment, but I am now scaling it up to help authors of every MCP server ensure that their servers are working as expected.
A secondary insight from this experiment is that LLMs are incredibly good judges of documentation quality. If several LLMs cannot figure out how to start a server after several attempts using all the information in the GitHub repository, there is a high chance that a regular human would struggle to figure it out as well. I am exploring how to automatically document setup instructions for any MCP server in a consistent way.
Conclusion
Automating the testing and issue reporting has already improved reliability in the ecosystem. Now the goal is to scale this process and streamline server setup for everyone. If you’re interested in contributing or sharing feedback, join the MCP Discord and help shape the future of context standardization for MCP!
Written by Frank Fiegel (@punkpeye)