Script for Success – What a Football Coach Can Teach Us About Engineering
It’s 3AM, and the phone rings. It’s rarely good news when someone calls you in the middle of the night. I reach for my phone and see an automated alert that quora.com is down. As inviting as the bed seems, every minute of delay means another minute that Quora users all over the world are having problems reaching the website. I roll out of bed and head to my computer. My first thought is to log into Hipchat in case some of my nocturnal teammates are available to help. My eyes still adjusting to the bright screen, I start scanning through error logs, reliability dashboards, and server graphs to try to pinpoint the source of the outage. After running through a battery of checks, it looks like our master database is hosed. It’s going to be a long night.
Pager duty rotations, where engineers take turns being on-call as the first line of defense against any and all site alerts during the week, are some of the most stressful experiences involved in growing a worldwide online product. Being on-call means that you have to keep your laptop with you wherever you go and that you might get called away at any moment, sometimes to deal with trifling issues but other times to address nerve-racking and time-sensitive outages. And yet, given that this shared responsibility is critical to keeping the product up and running, what can we do to improve this situation, or really any situation where we have to execute well under stressful and less-than-ideal circumstances?
A lesson from American football
One strategy comes from Bill Walsh, a former coach of the San Francisco 49ers. In The Score Takes Care of Itself, Walsh discusses a strategy called “scripting for success.” Walsh would make scripts, or contingency plans, for how to respond to all types of game scenarios. He had a plan for what to do if the team was behind by two or more touchdowns after the first quarter, a plan for what to do if key players got injured, a plan for what to do if the team had 25 yards to go, one play remaining, and needed a touchdown, and many more. In fact, the first 20-25 plays of every 49ers game eventually became scripted, a tree of if-then rules that codified what the team would do based on different circumstances.
What Walsh realized was that it’s tough to clear your mind and make effective decisions at critical parts of the game, when you have thousands of fans roaring at you, when fans of the opposing team are throwing hot dogs and beer cups at you, and when the timer is ticking away your precious seconds. As Walsh explains, scripting helped to move the decision-making process away from the distracting and intense emotions of the game:
“[I]t gave us a stunning tactical offensive asset that no other teams were utilizing at that time. Scripting was the most effective leadership tool in fair and foul weather. In a very calculated way, I began calling the plays for the game before the game was played. It took years for other teams to fully implement the concepts I had been developing for a long time.” 1
Walsh would eventually lead the 49ers to 3 Super Bowl victories and be named NFL Coach of the Year twice. 2
Simulate failures during peacetime
By adopting Walsh’s strategy of scripting, we shift our decision-making from high-stakes or high-pressure situations to more controlled environments. We reduce the frequency of situations where emotion clouds our judgments or where time pressure compounds our stress. As engineers, we can even programmatically script our responses and test them to ensure that they’re robust. This is particularly important in engineering organizations, where at a large scale, any infrastructure that can fail will begin to fail.
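As a rough illustration of what a programmatic script might look like, here is a minimal sketch of an on-call runbook written as an ordered list of if-then rules, in the spirit of Walsh's scripted plays. Everything in it is hypothetical: the status keys (`db_primary_up`, `error_rate`, and so on), the thresholds, and the response text are made up for the example and do not describe any real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Play:
    """One scripted response: a condition and what to do when it holds."""
    name: str
    applies: Callable[[Dict], bool]
    response: str


# Hypothetical runbook. Order matters: the first matching play wins.
RUNBOOK: List[Play] = [
    Play(
        name="primary database down",
        applies=lambda s: not s.get("db_primary_up", True),
        response="promote a replica to primary and page the database owner",
    ),
    Play(
        name="high error rate after deploy",
        applies=lambda s: s.get("error_rate", 0.0) > 0.05
        and s.get("minutes_since_deploy", 10_000) < 30,
        response="roll back the most recent deploy",
    ),
    Play(
        name="disk nearly full",
        applies=lambda s: s.get("disk_used_fraction", 0.0) > 0.9,
        response="rotate logs and expand the volume",
    ),
]


def choose_play(status: Dict, runbook: List[Play] = RUNBOOK) -> Optional[Play]:
    """Return the first play whose condition matches, or None if none do."""
    for play in runbook:
        if play.applies(status):
            return play
    return None


if __name__ == "__main__":
    status = {"db_primary_up": False, "error_rate": 0.2}
    play = choose_play(status)
    print(play.name if play else "no scripted play; escalate to a human")
```

Because the rules are plain data and functions, they can be exercised in unit tests during working hours, long before anyone has to rely on them at 3AM.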
A number of engineering organizations have adopted a strategy of simulating failures and disasters during peacetime to prepare for the unexpected:
- When I was working at Google back in 2006, Google would run annual, multi-day Disaster Recovery Testing (DiRT) events. During DiRT exercises, the company would simulate disasters like earthquakes or hurricanes and verify that teams, communications, and critical systems could continue to function despite power outages or failures of entire data centers or offices. The exercises surfaced single points of failure, unreliable failovers, outdated emergency plans, or other unexpected errors and allowed teams to deal with them under controlled settings without the panic and stress associated with an actual emergency. 3
- Netflix built a system called Chaos Monkey that randomly kills services in its own infrastructure. 4 It may seem counterintuitive for Netflix to wreak havoc on its own systems, but by configuring Chaos Monkey to kill services on weekdays during regular work hours, engineers can identify architectural weaknesses while they’re actually in the office rather than in the middle of the night when they get paged. As they note on their blog, “the best defense against major unexpected failures is to fail often.” 5
- At Dropbox, the engineering team would often run their systems under additional simulated load. If they hit some system limit that caused errors, they could easily disable the simulated load and address the issue. This created a much less stressful situation than having to firefight against the same issues caused by production traffic that they couldn’t just turn off. 6
In all of these examples, engineering organizations assume that the unexpected and the undesired will happen. They adopt the philosophy that it’s better to plan and script for those scenarios now, during peacetime, than during circumstances outside of their control.
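To make the "fail during work hours" idea concrete, here is a small sketch of a scheduler that only picks a service to terminate on weekdays during office hours. This is not Netflix's Chaos Monkey code or configuration. The function names, the 9-to-5 window, and the service names are assumptions chosen for illustration, and the sketch only selects a victim rather than actually killing anything.

```python
import random
from datetime import datetime
from typing import List, Optional


def in_office_hours(now: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """True on Monday through Friday between start_hour and end_hour (local time)."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start_hour <= now.hour < end_hour


def pick_victim(
    services: List[str],
    now: datetime,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Choose one service to terminate, or None outside office hours."""
    if not services or not in_office_hours(now):
        return None
    if rng is None:
        rng = random.Random()
    return rng.choice(services)


if __name__ == "__main__":
    # Hypothetical service names.
    services = ["search", "feed", "notifications"]
    victim = pick_victim(services, datetime.now())
    if victim is None:
        print("outside office hours; leaving everything alone")
    else:
        print(f"would terminate one instance of: {victim}")
```

Passing `now` and `rng` in as arguments, rather than reading the clock and global random state inside the function, keeps the schedule itself easy to test.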
Even if we’re not working on infrastructure, we’ll still encounter other high-stakes and high-pressure events in our professional careers. Job interviews and salary negotiations, for example, tend to also be infrequent, stressful, and yet high impact. Scripting and preparing for these situations can be equally valuable. As for me, I’m glad that automated recovery scripts and documented response plans made my on-call weeks at Quora much less stressful.
1. Bill Walsh, The Score Takes Care of Itself: My Philosophy of Leadership, p51.
2. Bill Walsh (American football coach), Wikipedia.
3. Kripa Krishnan, “Weathering the Unexpected”, ACM Queue.
4. John Ciancutti, “5 Lessons We’ve Learned Using AWS”, The Netflix Tech Blog.
5. Cory Bennett and Ariel Tseitlin, “Chaos Monkey Released into the Wild”, The Netflix Tech Blog.
6. Rajiv Eranki, “Scaling lessons learned at Dropbox, part 1”.