Deep in the middle of the night, illuminated by the glow of five screens full of graphs, data, code, and live video, I sat on edge as I monitored a small army of servers. The O Music Awards, a 24-hour, live-streamed music and awards festival in New York City, was in full swing.
Sometime after 3 a.m., I saw the first warning sign of a major issue: a slight uptick in an otherwise flat graph. Over the next few seconds, it grew into a huge spike, and I alerted the team that we had a problem. Thanks to some well-configured caching, the homepage and live streams were unaffected, which meant the large majority of users didn’t even know we were having an issue. But the failures were going to cause errors during voting, and a few other pages on the site were going to crash. The situation wasn’t great, but the mission-critical things were still working properly.