The 50MB Markdown Files That Broke Our Server
Written by punkpeye.
I am breathing a deep sigh of relief as I type this, because I've finally solved the mystery of why our service response times were fluctuating all over the place.
This could be a tweet, but I will make it a blog post. If nothing else, as a way to remind myself of the lesson learned.
The chase
In short, about a month ago, our server response time started to fluctuate a lot.
I tried correlating the issue with a number of factors, including:
recent changes to the codebase
recent dependency updates
spikes in traffic
But nothing explained the issue.
I started regularly profiling the application in production to understand where the time was being spent.
Nothing stood out as particularly unusual (or so I thought). I was surprised by how much of the time was spent by React rendering the UI, but I shrugged it off as business as usual. After all, we are serving thousands of requests across thousands of MCP server repositories.
I even picked every single function that stood out in the flamegraph, wrote benchmarks for them, and optimized them one by one.
But days passed, and I still could not figure out why we were seeing response times fluctuate from 50ms to 1000ms for seemingly no reason.
The trap
The way I finally identified the cause was by instrumenting the renderToPipeableStream function to log the time it took to render the UI and report any render that exceeded a 100ms threshold. This quickly revealed a pattern: some URLs took far longer to render than others.
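For the curious, here is a minimal sketch of that kind of instrumentation. Only renderToPipeableStream and its callbacks come from React itself; the renderWithTiming wrapper, the url argument, and the console logging are illustrative placeholders, not our exact code.

```ts
import { renderToPipeableStream } from "react-dom/server";
import type { ServerResponse } from "node:http";
import type { ReactElement } from "react";

const SLOW_RENDER_THRESHOLD_MS = 100;

function renderWithTiming(url: string, element: ReactElement, res: ServerResponse) {
  const start = performance.now();

  const { pipe } = renderToPipeableStream(element, {
    onShellReady() {
      // Start streaming HTML to the client as soon as the shell is ready.
      res.setHeader("Content-Type", "text/html");
      pipe(res);
    },
    onAllReady() {
      // Everything (including Suspense boundaries) has rendered; flag slow renders per URL.
      const elapsedMs = performance.now() - start;
      if (elapsedMs > SLOW_RENDER_THRESHOLD_MS) {
        console.warn(`slow render: ${url} took ${elapsedMs.toFixed(1)}ms`);
      }
    },
    onError(error) {
      console.error(`render error for ${url}:`, error);
    },
  });
}
```

Attributing render time to a URL, rather than to individual functions, is what made the outliers jump out.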
The culprit
Turns out, the issue was that we had accidentally indexed some very large markdown files (50MB+) in our database.
Not surprisingly, parsing 50MB+ markdown files and then converting them to React elements took a lot of time.
Ironically, what made it harder to figure out is that we wrote a custom markdown parser that's good at parsing large markdown files without blocking the main thread. As a result, looking at the flamegraph, it just looked like we were doing a lot of work across thousands of requests.
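To illustrate why that shape of work blends into a flamegraph, here is a generic sketch of chunked, non-blocking parsing (not our actual parser); feedChunk stands in for a hypothetical incremental parser that keeps its own state between slices.

```ts
import { setImmediate as yieldToEventLoop } from "node:timers/promises";

const CHUNK_SIZE = 64 * 1024; // process 64KB of markdown per slice

async function parseMarkdownInChunks(
  source: string,
  feedChunk: (chunk: string) => void, // hypothetical stateful incremental parser
): Promise<void> {
  for (let offset = 0; offset < source.length; offset += CHUNK_SIZE) {
    feedChunk(source.slice(offset, offset + CHUNK_SIZE));
    // Yield back to the event loop so other requests are not starved.
    // A 50MB file still costs the same total CPU; the cost is just spread
    // across many small slices, so no single frame dominates the flamegraph.
    await yieldToEventLoop();
  }
}
```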
Reflection on instrumentation
It is worth noting that even before this, we already had a ton of instrumentation, but it was at the function/span level rather than the route level.
It is also worth noting that while some tools did provide route-level insights (Sentry), the pattern wasn't obvious in aggregate. Once we started processing one of these large files, the slowdown showed up across all routes, not just the one responsible. With logging, it was the first route logged before the CPU started spiking that gave it away.
The same logging technique later revealed a cluster of related issues in processing and serving user-generated content, including attempts to server-side syntax-highlight some very large files and, in some edge cases, serving binary files as text.
The lesson
Never trust users' content.
We are aggregating data across thousands of MCP server repositories. I should have known better.
After this incident, I've added checks to ensure that the content we attempt to download or display is within expected parameters.
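As a rough sketch of the kind of guard I mean, assuming content arrives as a Buffer: the specific limit, the isProbablyBinary heuristic, and validateUserContent are illustrative assumptions, not the exact checks we ended up with.

```ts
const MAX_MARKDOWN_BYTES = 1 * 1024 * 1024; // e.g. cap rendered markdown at 1MB

function isProbablyBinary(buffer: Buffer): boolean {
  // A null byte in the first few KB is a strong hint this is not text.
  return buffer.subarray(0, 8192).includes(0);
}

function validateUserContent(buffer: Buffer): { ok: boolean; reason?: string } {
  if (buffer.byteLength > MAX_MARKDOWN_BYTES) {
    return {
      ok: false,
      reason: `file is ${buffer.byteLength} bytes, over the ${MAX_MARKDOWN_BYTES} byte limit`,
    };
  }
  if (isProbablyBinary(buffer)) {
    return { ok: false, reason: "content appears to be binary, not text" };
  }
  return { ok: true };
}
```

Rejecting (or truncating) oversized and binary content before it ever reaches the markdown parser or the syntax highlighter is far cheaper than discovering it in a flamegraph.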
Written by punkpeye (@punkpeye)