How We Increased Uptime and Decreased Engineering Stress
Working in the big data world can be a little chaotic. For us, that means a stable flow of 20k events per second can spike to 50k over the course of an hour. A quick scan of our logs turns up perfect examples.
Things move fast, and we need to adapt and respond immediately when our systems are under stress. When a system goes “red,” most people’s first instinct is to call the engineer who built it. Over time this creates silos of knowledge and an uneven load on engineers. Another side effect of this engineer-first thinking is that fixing the problem takes all the priority, while communication gets neglected.
In June, our Data organization began experimenting with a new process to help us improve. Our goals were:
- Faster response times
- Better communication
- Better documentation of all processes
- Any engineer can fix any system
So far we’ve seen improvements in all of these: an increase in uptime, better monitoring and, best of all, fewer weekend pager events.
The Four Major Principles
Organization: There can be only one PagerDuty schedule. No matter who is the expert on which service, the same schedule is used for all products. There is one place for tracking incidents, one place to look for runbooks and one place for the incident workflow.
Automation: Being paged can be stressful, and any rote work should be automated. For us, this meant automatically creating a tracking ticket in Jira. Another tool is a script that lets us communicate in one place but surface the message across all our customer touchpoints.
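To illustrate the ticket automation, here is a minimal sketch of auto-filing a Jira issue when a page fires. It uses Jira's standard `POST /rest/api/2/issue` endpoint, but the instance URL, project key, issue type, and labels are assumptions for the example, not our actual configuration.

```python
import json
import urllib.request

JIRA_URL = "https://example.atlassian.net"  # hypothetical instance
PROJECT_KEY = "OPS"                          # hypothetical project key


def build_incident_issue(alert_name, service, description):
    """Build the Jira issue payload for a page event.

    Kept pure (no network calls) so it can be tested in isolation.
    """
    return {
        "fields": {
            "project": {"key": PROJECT_KEY},
            "issuetype": {"name": "Incident"},
            "summary": "[PAGE] %s: %s" % (service, alert_name),
            "description": description,
            "labels": ["pager", service],
        }
    }


def create_tracking_ticket(alert_name, service, description, api_token):
    """POST the issue to Jira and return the new issue key."""
    payload = build_incident_issue(alert_name, service, description)
    req = urllib.request.Request(
        JIRA_URL + "/rest/api/2/issue",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_token,
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["key"]
```

Separating the payload builder from the HTTP call means the rote part of incident intake is covered by tests, not just exercised at 4am.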
Empathy: This is inadvertently the most brilliant part of the system. When discussing the rollout, an engineer said to me, "I don't like it, I don't want my mistakes to wake up someone else at 4am." This exposes something really interesting - people don't mind inconveniencing themselves, especially if it's something small and infrequent. But the moment they realize it could be inconveniencing someone else, it becomes their top priority.
Communication Is Equal in Priority to Fixing: When an incident occurs, two people are paged. One person is the "communicator," the other is the "fixer." This protects the fixer from being distracted by others asking for status updates, and ratchets up the importance of communication.
Two heads are better than one
Two people are on pager duty instead of one. Their roles are clearly defined and separate.
The Communicator:
- Send out emails and communicate in chat rooms
- Set user-facing error messages when applicable
- Label the ticket and update it frequently
- Handle communications with DevOps
- Document the process and update runbooks
The Fixer:
- Research and identify the problem
- Try applicable runbooks
- Problem solve
Flatten the pager duty structure entirely. Instead of each team tracking pages for its individual services, there is one schedule that listens for pages from any service. The escalation strategy works like so:
Level 1: Incident Manager, Incident Engineer
Level 2: Team Leads
Level 3: VP and CTO
Note: Escalating this steeply sucks, but as a result we've had only one issue reach Level 2 in the past three months.
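The flattened policy is easy to reason about once it's written down as data. Here's a sketch; the 15-minute acknowledgment windows and target names are hypothetical, not our actual settings:

```python
from datetime import timedelta

# Hypothetical escalation policy mirroring the three levels above.
# Each level lists who is paged and how long before escalating further.
ESCALATION_POLICY = [
    {"level": 1, "targets": ["incident_manager", "incident_engineer"],
     "timeout": timedelta(minutes=15)},
    {"level": 2, "targets": ["team_leads"],
     "timeout": timedelta(minutes=15)},
    {"level": 3, "targets": ["vp", "cto"],
     "timeout": None},  # last resort, never escalates further
]


def who_is_paged(minutes_unacknowledged):
    """Return the targets paged after an alert has gone
    unacknowledged for the given number of minutes."""
    elapsed = timedelta(minutes=minutes_unacknowledged)
    for level in ESCALATION_POLICY:
        if level["timeout"] is None or elapsed < level["timeout"]:
            return level["targets"]
        elapsed -= level["timeout"]
    return ESCALATION_POLICY[-1]["targets"]
```

Because every service routes through this one policy, there is nothing per-team to maintain: changing the on-call pair changes it for everything.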
Iterate and Improve
Every week we hold a retro. Every page event gets tracked in a spreadsheet where we can look for trends week over week. Then we discuss how the week went: How is morale? What went wrong, and did we document or automate the resolution? What could be better? This discussion is vital; we've already improved the process based on these retros.
For example, we realized there was confusion about the urgency of alerts: which ones need attention immediately, and which can wait until we're back in the office? The retro helped us identify this ambiguity and address it by the second week.
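A spreadsheet does the job for week-over-week tracking, but the same tally is only a few lines of code. A sketch, with made-up sample events:

```python
from collections import Counter
from datetime import date

# Hypothetical page-event log: (date, service, was_urgent)
events = [
    (date(2016, 7, 4), "ingest", True),
    (date(2016, 7, 6), "ingest", False),
    (date(2016, 7, 12), "api", True),
]


def pages_per_week(events):
    """Count page events per ISO week for week-over-week comparison."""
    return Counter(d.isocalendar()[1] for d, _service, _urgent in events)
```

Counting by ISO week number keeps the buckets consistent across month boundaries, which is exactly the trend line the retro looks at.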
So far, the trends have been positive and I'm confident we will continue to refine and improve the process.