This is a story of intermittent 500 Internal Server errors from APIs, that ended up being caused by a hardcoded constant in Go’s
package. I’ll mostly spare you the long-winded story, and get right to the problem and what we found to be the cause.
We noticed an increased number of 500 errors from specific API endpoints, and started to troubleshoot. We initially found something odd. When we retried a failed call, it would succeed sometimes. After a few minutes, we found we could “fix” the problem pretty easily by simply CURL’ing a failed API call a couple of times. After that, it would succeed again and again.
This was puzzling for a moment, but then we realized that the call would fail again after leaving the API idle for a few moments. Meanwhile, another team member looked up the changes we’d made to that API at the time it started to fail. (We use chatops
and other techniques to help with this type of troubleshooting.)
It turned out the bad change had increased parallelism for API requests from 10 to 25. We do some of our large-scale timeseries operations in parallel and had bumped it up along with some other changes.
The explanation is related to the following facts:
- We use a separate
database/sql connection pool per customer environment to ensure resource isolation and avoid problems like starvation and noisy-neighbor impact.
- We’d increased the size of our
database/sql connection pools from 10 to 25 to enable a parallelism increase.
- Because we have lots of customers and lots of API services, each with separate pools, we had set a short connection timeout server-side, with
wait_timeout to avoid too-many-connections on MySQL’s side.
database/sql has a hardcoded retry functionality when it encounters
driver.ErrBadConn on a connection it gets from the pool. It will retry queries after removing a failed connection and fetching another, up to 10 times.
Here’s how the problem happened:
- An API request opened up to 25 connections to the database, finished, and returned them to the pool.
- The API stayed idle for a bit and all the connections timed out on the database and were closed.
- The API got another request and started trying connections from the pool.
- After 10 tries, it gave up (this is
database/sql behavior, again – not the API per se) and returned error 500.
- If you ran the request again immediately it would clean up a further 10 connections from the pool. The third request would discard the final 5, and then on the 6th retry, would open new valid connections and succeed.
Since we can’t change the hardcoded limit of 10 retries without changing
code, we opted to just reduce parallelism for now. In the end, our hardware doesn’t yet support such high parallelism; this was a preparatory step for some future changes.
We plan to make the number of retries configurable (probably by vendoring in a change to
for now, and submitting it upstream).
The morals of the story? Just the usual: Know your tools, get the right people involved with troubleshooting, etc. No earth-shaking lessons here; just something interesting I thought worth sharing in case it helps someone else.