Using AI to find bugs in open-source projects


Tags: mcp, ai, llm

1. Context
2. Sandboxing MCP servers using Docker and LLMs
3. Generating the Dockerfile
4. Classifying the failures
5. Determining which servers are invalid
6. Generating GitHub issues
7. Experiment results
8. Conclusion

I will preface this article by saying that while testing the projects and generating the GitHub issues has been automated, actually submitting the GitHub issues has not. I have been doing that manually to avoid accidentally submitting issues that do not provide valuable context.

                Scroll down to Experiment results for a summary of the results of this experiment.

                Context

Recently, I've been actively involved in the MCP ecosystem. For those who are not familiar with it, the Model Context Protocol (MCP) is a protocol proposed by members of the Anthropic team in an effort to standardize how applications provide context to LLMs. You can read more about it here.

The important thing to know about MCP is that it spurred the creation of hundreds of MCP servers. Each of these servers is essentially a standalone application (mostly written in Python and TypeScript) that LLMs can communicate with over MCP (using stdio or SSE transports).
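To make this concrete, here is roughly what talking to one of these servers over stdio looks like from the client side. This is a minimal sketch using the official TypeScript SDK (@modelcontextprotocol/sdk); the import paths reflect my understanding of the current SDK and may differ slightly between versions.

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Launch the server as a child process and communicate over stdin/stdout.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["@tokenizin/mcp-npx-fetch"],
});

const client = new Client(
  { name: "probe-client", version: "0.0.1" },
  { capabilities: {} }
);

// If the handshake succeeds, the server is at least able to start.
await client.connect(transport);

// List the tools the server exposes to the LLM.
console.log(await client.listTools());

await client.close();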

                At the moment, the instructions for running these servers are mostly geared towards engineers, as they usually require at least high-level knowledge of Python or Node.js. However, generally speaking, instructions are as simple as:

                npx @tokenizin/mcp-npx-fetch

                or

                uvx --from git+https://github.com/adhikasp/mcp-weather.git mcp-weather

However, as you can imagine, with 300+ open-source servers appearing in just under a couple of weeks, the sheer volume meant that many MCP servers were made public without enough testing to ensure that they can be set up outside of the original authors' environment. This is where LLMs come in.

                Sandboxing MCP servers using Docker and LLMs

As part of embedding MCP into Glama, I had to figure out how to virtualize these servers. I decided to use Docker, as it allowed me to run the servers in a sandboxed environment. However, it still meant that I had to write instructions for hundreds of servers, and new servers were being added every hour. Therefore, I decided to also automate the process of generating the Dockerfile for each server.

                Generating the Dockerfile

                Here is how it works:

1. We pull the file tree from the GitHub repository.
2. We check for the presence of various files, such as pyproject.toml, uv.toml, package.json, and README.md.
3. We then use LLMs to hypothesize scenarios for how to build a Dockerfile that would start the server.
4. Finally, we run the Dockerfile and attempt to connect to the MCP server.
• If there is an error, we feed it back to step 3 and try again (a simplified sketch of this loop follows below).
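In code, the retry loop looks roughly like the sketch below. The two callbacks, generateDockerfile (the LLM call that proposes a Dockerfile from the file tree and the previous error) and tryConnect (the MCP handshake probe), are hypothetical stand-ins for the real pipeline, not an actual API.

import { execFile } from "node:child_process";
import { mkdtemp, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { promisify } from "node:util";

const exec = promisify(execFile);

async function inferDockerfile(
  repoDir: string,
  fileTree: string[],
  // Hypothetical: asks an LLM for a candidate Dockerfile, given the file tree
  // and (on retries) the error produced by the previous attempt.
  generateDockerfile: (fileTree: string[], previousError?: string) => Promise<string>,
  // Hypothetical: attempts an MCP stdio handshake against the built image.
  tryConnect: (image: string) => Promise<void>,
  maxAttempts = 5
): Promise<string | null> {
  let lastError: string | undefined;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Step 3: hypothesize a Dockerfile, feeding back the previous failure.
    const dockerfile = await generateDockerfile(fileTree, lastError);

    try {
      // Step 4: build the image and attempt to connect to the MCP server.
      const buildDir = await mkdtemp(join(tmpdir(), "mcp-build-"));
      const dockerfilePath = join(buildDir, "Dockerfile");
      await writeFile(dockerfilePath, dockerfile);
      await exec("docker", ["build", "-t", "mcp-candidate", "-f", dockerfilePath, repoDir]);
      await tryConnect("mcp-candidate");
      return dockerfile;
    } catch (error) {
      // Feed the build or connection error back into the next attempt.
      lastError = error instanceof Error ? error.message : String(error);
    }
  }

  return null;
}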

                Classifying the failures

Today, out of 310 MCP servers in our directory, 211 can be started using a Dockerfile inferred this way. For the remaining 99 servers, the failures to start fall into the following categories:

• unknown (5%) – the server fails to start for an unknown reason.
• unsupported (22%) – the server requires dependencies that are currently not supported by our build environment (e.g. missing drivers).
• requires_valid_credentials (8%) – the server cannot boot without valid credentials (e.g. database credentials, API keys).
• missing_configuration (32%) – the server fails to start due to missing ENV variables, missing positional arguments, or missing configuration files.
• invalid_server (33%) – the server cannot be started due to missing files, dependencies, or other issues that cannot be recovered from.

It is only a matter of time before we can support the servers in the unsupported category. Most of the servers in requires_valid_credentials we know would run, but we need to manually supply test credentials. Everything in missing_configuration can be fixed by mocking the missing configuration, so it is also only a matter of time before we support those.
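For reference, this taxonomy travels through the rest of the pipeline as a small, closed set of labels. A minimal TypeScript sketch (the type and field names are illustrative; the categories themselves are the ones listed above):

// The failure categories described above, as a closed union type.
type FailureCategory =
  | "unknown"
  | "unsupported"
  | "requires_valid_credentials"
  | "missing_configuration"
  | "invalid_server";

// Each failed build is stored with its classification and the evidence
// (build log and stderr) that the classification was derived from.
interface FailureReport {
  repositoryUrl: string;
  category: FailureCategory;
  buildLog: string;
  stderr: string;
}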

                Determining which servers are invalid

I feed the Docker build logs and stderr output of the servers to LLMs to determine which servers are invalid. For example, consider the following stderr output:

                Progress: resolved 1, reused 0, downloaded 0, added 0 /home/smith/.cache/pnpm/dlx/ssgcztdn5y2otnztyizayajk2e/19428a4c370-17:  ENOENT  ENOENT: no such file or directory, open '/home/smith/.cache/pnpm/dlx/Downloads/box-mcp-server-0.0.2.tgz' This error happened while installing the dependencies of box-mcp-server@0.0.3

If we get this error when trying to start the server, we can say with reasonable confidence that the server author has not provided a bin entry point in package.json, and therefore the server cannot be started. I have coded a dozen of these rules based on observing the build pipelines of hundreds of MCP servers.
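Some of these rules are nothing more than pattern matches over stderr. Here is a hedged sketch of what a rule for the error above might look like; the regular expression and function names are illustrative, not the exact rules running in the pipeline:

// Matches the pnpm dlx failure shown above, which in practice means the
// package cannot be launched as published (e.g. no bin entry point in package.json).
const MISSING_BIN_ENTRY_RULE =
  /ENOENT: no such file or directory.*This error happened while installing the dependencies of/s;

function classifyStderr(stderr: string): "invalid_server" | null {
  if (MISSING_BIN_ENTRY_RULE.test(stderr)) {
    return "invalid_server";
  }
  // No hand-written rule matched; fall back to the LLM-based classification.
  return null;
}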

                So we are left with a large number of invalid_server servers, which we cannot do anything about... they need to be fixed by the server author.

                Let's try to nudge the authors of these servers to fix these issues.

                Generating GitHub issues

                I realized that if I cannot start the server, then others cannot either. And if I was the author of the server, I would really want to know that the server that I intended to make available to others is not working. And since I already have the setup instructions (Dockerfile), the error (stderr), error classification, and the GitHub repository URL, I can easily generate a GitHub issue.

The way it works is that I run a script several times a day to index all MCP servers (to pick up the latest GitHub repository contents). If a server fails to start and the error is classified as invalid_server, I generate a GitHub issue template URL and send it to a private Discord channel for review. There, I review every issue proposal and either create the GitHub issue or dismiss it.
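Mechanically, the issue proposal is just a prefilled GitHub "new issue" URL (GitHub's /issues/new page accepts title and body query parameters) posted to a Discord channel via an incoming webhook. A rough sketch; the message format and environment variable are assumptions, and the issue body would be assembled from the Dockerfile, stderr, and classification described above:

// Build a prefilled "new issue" URL for the repository.
function buildIssueUrl(repoUrl: string, title: string, body: string): string {
  const url = new URL(`${repoUrl}/issues/new`);
  url.searchParams.set("title", title);
  url.searchParams.set("body", body);
  return url.toString();
}

// Post the proposal to a private Discord channel for manual review,
// using an incoming webhook (DISCORD_WEBHOOK_URL is assumed configuration).
async function proposeIssue(repoUrl: string, title: string, body: string): Promise<void> {
  const issueUrl = buildIssueUrl(repoUrl, title, body);

  await fetch(process.env.DISCORD_WEBHOOK_URL!, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ content: `Proposed issue: ${issueUrl}` }),
  });
}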

                Experiment results

I started raising issues just last week and kept the volume low to test how server authors react to these issues.

                Below is a list of the issues, and the outcome at the time of writing this article.

Issue | Outcome
mcp-discord#2 | ✅ Author fixed the issue.
web-browser-mcp-server#1 | ✅ Author fixed the issue.
box-mcp-server#2 | ✅ Author fixed the issue.
moondream-mcp#1 | ✅ Author fixed the issue.
mcp-clickhouse#1 | ✅ Author fixed the issue.
mcp-snowflake-server#2 | ✅ Author fixed the issue.
mcp-server-bigquery#2 | Issue misclassified. Should have been missing_configuration.
mcp-pinecone#3 | Issue misclassified. Should have been missing_configuration.
mcp-otzaria-server#1 | Issue misclassified. Should have been missing_configuration.
mcp-solver#2 | Issue misclassified. Should have been unsupported.

                A pretty strong start: Out of 10 reported issues, 6 resulted in the server being fixed and 3 identified gaps in the README.md setup instructions.

                This started as an experiment, but I am now scaling it up to help authors of every MCP server ensure that their servers are working as expected.

A secondary insight from this experiment is that LLMs are incredibly good judges of documentation quality. If several LLMs cannot figure out how to start a server after several attempts using all the information in the GitHub repository, there is a high chance that a regular human would struggle to figure it out as well. I am exploring how to automatically document setup instructions for any MCP server in a consistent way.

                Conclusion

                Automating the testing and issue reporting has already improved reliability in the ecosystem. Now the goal is to scale this process and streamline server setup for everyone. If you’re interested in contributing or sharing feedback, join the MCP Discord and help shape the future of context standardization for MCP!

                Written by Frank Fiegel (@punkpeye)