Facebook got hit with its worst outage in years.
Facebook was down for about 2.5 hours yesterday. Facebook has now issued a detailed explanation of what happened. According to Facebook, the key flaw that made this outage so severe was the unfortunate handling of an error condition: an automated system for verifying configuration values ended up causing much more damage than it fixed.
Facebook made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.
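To make the failure mode concrete, here is a minimal Python sketch of the behavior described above. It is not Facebook's code. The shared cache, the `DatabaseCluster` class, the `is_valid` rule, and the function names are all assumptions made up for illustration. It only shows why an invalid persistent value turns every client read into a database query.

```python
from dataclasses import dataclass


@dataclass
class DatabaseCluster:
    """Hypothetical stand-in for the database cluster; it only counts queries."""
    persistent_value: str
    queries: int = 0

    def fetch_config(self) -> str:
        self.queries += 1
        return self.persistent_value


def is_valid(value: str) -> bool:
    # Hypothetical validity rule, chosen only for illustration.
    return value.isdigit()


def client_read(cache: dict, db: DatabaseCluster) -> str:
    """One client reads the config value and tries to 'fix' it if it looks invalid."""
    value = cache.get("config")
    if value is None or not is_valid(value):
        # The "fix": refetch from the persistent copy in the database.
        value = db.fetch_config()
        cache["config"] = value
    return value


def simulate(num_clients: int, persistent_value: str) -> int:
    """Return how many database queries num_clients reads generate."""
    db = DatabaseCluster(persistent_value)
    cache: dict = {}  # assumed shared cache in front of the database
    for _ in range(num_clients):
        client_read(cache, db)
    return db.queries


if __name__ == "__main__":
    print("valid persistent value:  ", simulate(1000, "42"), "queries")
    print("invalid persistent value:", simulate(1000, "not-a-number"), "queries")
```

With a valid persistent value, only the first read reaches the database. With an invalid one, the refetched value is still invalid, so every read triggers another query, and the load on the cluster grows with the number of clients.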
Facebook had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, Facebook slowly allowed more people back onto the site.
Operating a service for 500 million people is a huge challenge. Any small change can cause a dramatic meltdown, as the case above shows. Facebook has been in the news quite a bit lately: the Facebook movie is about to launch, the company is rumored to be building a Facebook phone, and Mark Zuckerberg is donating $100 million to a school.