Skip to main content

Monitoring Service Health Check Duration

Check Duration

This service can usually hits the 50ms range for health checks although it started getting much worse. The service is actually written in Tornado although has a few blocking calls that are used. Non-blocking IO should allow the health checks to be very quick to respond as in this case it returns a static response.

The root cause for the problem is that calls to MongoDB in a particular handler were taking longer than before and will hold back other handlers as it is currently a blocking operation. If the HAProxy health checks pass a threshold it will remove the nodes from the pool, a good precaution, although in our case can cause flickering if MongoDB takes longer than expected.

I did receive alerts thanks to alerting of per-service health checks with Graphite Pager.

We are using Diamond at SeatGeek which easily collects metrics from HAProxy. Check duration is (by default) stored at servers.HAPROXY-SERVER.haproxy.BACKEND.HOST-SERVER.check_duration. The metric we alert on is the moving median for each server regardless of the HAProxy server aliasByNode(movingMedian(groupByNode(servers.*.haproxy.*.*.check_duration,3,"averageSeries"),10),0).