Varnish proxy goes catatonic under heavy load

Matthew Seaman matthew at
Wed Nov 5 11:49:36 UTC 2014

Dear all,

We had an unfortunate set of circumstances which resulted in several
million people all trying to download about 1.5MB worth of images from
our servers over the course of a few hours.  Or, at least, it would have
been a few hours, except that our three Varnish proxies just crumbled
under the load within 10 minutes.

Now, that's bad enough, but we could have just about coped if the
proxies stopped serving requests for a few minutes.  What actually
happened was that all three servers went catatonic on the network *and
stayed that way*: even when we shunted the traffic away from one, we
still couldn't access it via ssh or any network protocol.  And it stayed
like that for a sufficiently long time that we had no recourse other than
to get the servers rebooted.

Can anyone explain what was happening here?  Not having the servers
recover accessibility for an extended period even after the excess
traffic was stopped is unacceptable.  We're also struggling to recreate
the effect in the lab: any clues about how to do so, and any suggestions
about how to prevent the 'going catatonic' response would be greatly
appreciated.

Servers are amd64 running FreeBSD 9.1 or 9.2 and Varnish 3.0.5.



