Planning for official documentation/process on outage reporting
Problem we’re trying to solve:
Outside of proactive internal reporting and alerts, we don’t have a documented process that anyone can pick up and follow to call an issue out. As a result, outages sometimes take longer to identify and respond to than necessary, which lengthens their duration. We also lack a set process to train new employees on, which is another gap we can close.
What we do today:
To date, almost all of the outages I can think of were identified in one of the following ways:
- Support notices an influx of ticket volume > reports to ENG via Slack
- Someone working early in the AM notices an issue > reports to ENG via Slack
- A few employees see the same issue > report to ENG via Slack
- A customer reports a bug that we escalate, and ENG deems it an outage (usually partial)
Identification, verification, and EPD response times for outages vary depending on who is online and how many customers have reported the issue.
Ideally:
If people can correctly identify and verify outage conditions, we can build a more robust, standardized reporting process that immediately notifies those who can help. Anyone should be able to report an outage effectively and get eyes on the situation ASAP.
Those involved:
EPD
Sales
Success
Support
What it would look like (can be team-specific where necessary):
- How to verify that an outage has occurred
- How to determine outage severity
- Methods to report an outage
- Adam: Can start with something simple, like a Google Form, so that outages are logged in a spreadsheet (a rough sketch of this follows the list below)
- How to ensure the report gets attention from relevant teams
- Comms and notifying the right people, internal and external
- Outage follow-up tasks
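To make Adam’s suggestion concrete, here is a minimal sketch of the log-and-notify step if we ever script it rather than rely on the Form’s built-in spreadsheet logging. Everything named here is a placeholder assumption: a hypothetical “Outage Log” sheet shared with a service account, the gspread library, and a Slack incoming-webhook URL.

```python
import datetime
import requests
import gspread

# Hypothetical placeholder: a Slack incoming webhook for a shared outage channel.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def log_and_notify(reporter: str, summary: str, severity: str) -> None:
    """Append an outage report to the spreadsheet, then ping Slack."""
    # Assumes a service-account credential file and an "Outage Log" sheet
    # shared with that account (both hypothetical).
    gc = gspread.service_account(filename="service-account.json")
    sheet = gc.open("Outage Log").sheet1
    sheet.append_row(
        [
            datetime.datetime.now(datetime.timezone.utc).isoformat(),
            reporter,
            severity,
            summary,
        ]
    )
    # Ping a shared channel so the report gets immediate attention.
    resp = requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"[{severity}] Outage reported by {reporter}: {summary}"},
        timeout=10,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    log_and_notify("support", "Login page returning 500s for several customers", "high")
```

A Google Form writes to its linked sheet automatically, so a script like this only matters if we want reports to also ping a channel the moment they land.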
Questions:
- Is Slack currently the only way to notify EPD of outages?
- Josh: We could add all of Support to PagerDuty so they can directly page the oncall engineers (a minimal paging sketch appears after this question list)
- Do we have an internal outage response SLA?
- Josh: We have an SLA for responding to pages. Since outages can surface through various channels, our outage response SLA is more general than the paging SLA, unless we implement Support paging oncall.
- What’s the latest on getting a better understanding of which alerts fire when Gem goes down?
- It’s not always consistent. Looking at the two most recent outages (the jobrunner issue and the AWS outage): the first did not page at all, and we should probably add a new alert on slow queries. For the second, Pingdom should have pinged us that the site was down, but Pingdom itself runs on AWS and was also down.
- I think if we’re more consistent about documenting outages (which Support could help with), then we can do an outage review in a few months and see whether there are common causes we haven’t addressed.
- Do we need to consider another internal notification model (e.g., prod-down@gem.com)?
- I don’t think we need this right now if we implement the PagerDuty recommendation from the first question above.
- Are there any cons to letting anyone at the company verify and report service outages?
- This could get pretty noisy for engineers. I think only Support and ENG should have PagerDuty access to escalate to an oncall engineer. If that doesn’t prove effective, we can consider other channels.
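As a concrete starting point for the Support-pages-oncall idea above, here is a minimal sketch of triggering a page through PagerDuty’s Events API v2. The routing key, source label, and summary text are hypothetical placeholders; the real integration key would come from whichever PagerDuty service we point Support at.

```python
import requests

# PagerDuty Events API v2 endpoint (public, documented by PagerDuty).
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

# Hypothetical placeholder: the Events v2 integration key from the PagerDuty
# service we would create (or reuse) for Support-initiated escalations.
ROUTING_KEY = "YOUR_EVENTS_V2_INTEGRATION_KEY"

def page_oncall(summary: str, severity: str = "critical") -> None:
    """Trigger a PagerDuty incident so the oncall engineer is paged directly."""
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",  # "trigger" opens a new incident
        "payload": {
            "summary": summary,
            "source": "support-outage-report",  # hypothetical source label
            "severity": severity,  # must be critical, error, warning, or info
        },
    }
    resp = requests.post(PAGERDUTY_EVENTS_URL, json=event, timeout=10)
    resp.raise_for_status()

if __name__ == "__main__":
    page_oncall("Support-reported outage: multiple customers seeing login failures")
```

Wrapping something like this behind the Google Form (or a Slack workflow) could give Support a one-click escalation path without granting broad PagerDuty access.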