This service can usually hits the 50ms range for health checks although it started getting much worse. The service is actually written in Tornado although has a few blocking calls that are used. Non-blocking IO should allow the health checks to be very quick to respond as in this case it returns a static response.
The root cause for the problem is that calls to MongoDB in a particular handler were taking longer than before and will hold back other handlers as it is currently a blocking operation. If the HAProxy health checks pass a threshold it will remove the nodes from the pool, a good precaution, although in our case can cause flickering if MongoDB takes longer than expected.
I did receive alerts thanks to alerting of per-service health checks with Graphite Pager.
We are using Diamond at
SeatGeek which easily collects metrics from HAProxy.
Check duration is (by default) stored at
servers.HAPROXY-SERVER.haproxy.BACKEND.HOST-SERVER.check_duration. The metric
we alert on is the moving median for each server regardless of the HAProxy