- August 22, 2013
Deep in the middle of the night, illuminated by the glow of five screens full of graphs, data, code, and live video, I sat on edge, monitoring a small army of servers. The O Music Awards, a 24-hour, live-streamed music and awards festival in New York City, was in full swing.
Sometime after 3am, I saw the first warning sign of a major issue: a slight uptick in an otherwise-flat graph. Over the next few seconds, it grew into a huge spike, and I alerted the team that we had a problem. Thanks to some well-configured caching, the homepage and live streams were unaffected, so the large majority of users didn't even know we were having an issue. But the failures would cause errors during voting and crash a few other pages on the site. The situation wasn't great, but the mission-critical pieces were still working properly.
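The kind of uptick that tipped us off can also be caught automatically: compare each new sample against the recent average of an otherwise-flat metric. A minimal sketch of that idea (the metric values and thresholds here are hypothetical, not our actual monitoring setup):

```python
from collections import deque

def spike_detector(window=30, threshold=3.0):
    """Flag a sample that jumps well above the recent average.

    window: number of recent samples that form the baseline
    threshold: multiple of the baseline average that counts as a spike
    """
    history = deque(maxlen=window)

    def check(sample):
        baseline = sum(history) / len(history) if history else None
        history.append(sample)
        # A spike is a sample far above the recent, flat baseline.
        return baseline is not None and sample > baseline * threshold

    return check

# A flat error-rate graph followed by a sudden jump:
check = spike_detector(window=5, threshold=3.0)
flat = [check(x) for x in [2, 3, 2, 3, 2]]   # all quiet
spiked = check(40)                            # well above 3x the baseline
```

Real monitoring systems layer alert routing and noise suppression on top, but the core "flat graph, sudden deviation" signal is this simple.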
With a large, live-stream viewership and a ton of active fans voting for their favorite performers, our team didn’t have time to dig through data to find the root cause just yet. We had to react quickly, get a fix pushed up, and reassess after we attended to the immediate impacts.
We upped our caching on a few of the heavier pages to lighten the load on our servers and buy ourselves some cover to work under. With the caching increased, data on those pages would go stale, but we would have the time we needed to get a patch deployed.
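"Upping the caching" amounts to raising the time-to-live on cached responses so the backend is hit far less often. A minimal in-memory sketch of the idea (the page name and TTL values are hypothetical; a production site would do this at the HTTP or CDN layer):

```python
import time

def ttl_cached(ttl_seconds, clock=time.monotonic):
    """Cache a zero-argument page builder for ttl_seconds.

    Raising ttl_seconds serves stale-but-valid data for longer
    and cuts the number of expensive backend calls.
    """
    def wrap(build_page):
        state = {"value": None, "expires": 0.0}

        def cached():
            now = clock()
            if now >= state["expires"]:
                state["value"] = build_page()  # expensive backend hit
                state["expires"] = now + ttl_seconds
            return state["value"]

        return cached
    return wrap

calls = 0

@ttl_cached(ttl_seconds=300)  # bumped up from, say, 30s during the incident
def leaderboard_page():
    global calls
    calls += 1
    return f"render #{calls}"

first = leaderboard_page()
second = leaderboard_page()  # served from cache: only one backend hit
```

The trade-off is exactly the one described above: within the TTL, every visitor sees the same stale copy, and in exchange the servers breathe.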
Within 4–5 minutes of the initial problem, we had everything up and operational. The voting errors disappeared, and those terrifyingly steep peaks in the graphs subsided.
As we built the O Music Awards site, we tried to find ways to break it in order to make sure it could withstand anything the world threw at it. Nonetheless, a certain pattern of voting behavior we hadn’t seen in previous years triggered a strange series of events that crippled our servers.
No matter how much you test, plan, and prepare your systems for the worst, there will always be small things that you didn’t account for or other unknown conditions that pop up and cause problems. And you need to accept that.
The best way to plan for those unforeseeable situations is to develop a contingency plan. If possible, design your system so that the most important goal can still be met, no matter what happens. We fine-tuned our caching approach so that even with complete server failure, people could still access the live streams to watch the event.
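One way to express that design goal in code is a stale-on-error wrapper: keep the last good copy of a page and serve it whenever the backend fails. A minimal sketch with hypothetical names (our production setup achieved this with HTTP-level caching, not this exact code):

```python
def stale_on_error(build_page):
    """Serve the last successful result when the backend raises.

    The page stays reachable through a total backend failure,
    at the cost of possibly-stale content.
    """
    last_good = {"value": None}

    def resilient():
        try:
            last_good["value"] = build_page()
        except Exception:
            if last_good["value"] is None:
                raise  # nothing cached yet, so nothing to fall back to
        return last_good["value"]

    return resilient

healthy = True

@stale_on_error
def stream_page():
    if not healthy:
        raise RuntimeError("backend down")
    return "live stream embed"

ok = stream_page()        # backend healthy: fresh render
healthy = False
fallback = stream_page()  # backend down: last good copy still served
```

The most important goal, people watching the streams, survives even when everything behind it is on fire.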
When things do go wrong, focus on a temporary patch before taking your time with a permanent fix. You can't implement a sleek, well-architected fix in the nervous, rushed state an emergency inevitably brings. It doesn't matter how hacky you think your temporary solution is: if it gets the job done, push it, and give yourself the time you need to implement something better.
Above all else, in these situations, relax. Give yourself 10 seconds to be nervous about the people who are affected, money that could be lost, votes that could be missed, and on and on. After that, take a deep breath, relax, and get to work. You can't work like you need to when you have all of those thoughts in the way, so get them out of your head, and focus on the issue at hand.
No Pressure, No Diamonds
Our system served as the platform for the Awards’ all-day-and-night event, saw tons of live-stream viewers, and handled a mind-numbing 97,597,768 votes in just over 30 days. This year’s O Music Awards was an overwhelming success, and our team came through with flying colors.
With a little preparation, development under pressure is much less tense and way more fun than you would think. Trust yourself and your team, stay relaxed, and you’ll make it through those 3am panic attacks just fine.