We’re improving our culture by having more low-severity incidents

I’ve seen some great discussions lately about moving away from the culture where “incident” is a four-letter word. Some of the most popular and best advice on the subject encourages teams to declare more incidents and to democratize who can declare them.

Dan Condomitty

Dan is the co-founder and head of engineering at FireHydrant. FireHydrant is an incident management platform that lets you integrate tools, streamline processes, and resolve incidents faster without leaving Slack. In this role, Dan will use his experience at companies such as Red Hat and CoreOS to lead teams building incident management technology used by companies such as CircleCI, Spotify, and Snyk.

While this advice seems pretty straightforward, I’ve noticed that there are often cultural barriers that make it difficult for teams to put it into practice.

Telling people to declare more incidents doesn’t take away the fear that often accompanies doing so, and it doesn’t automatically lead to a positive incident management culture.

FireHydrant’s internal engineering team builds steps into its Incident Management Program with the goal of increasing psychological safety around incident declaration and management.

One of the small, easy steps we’ve taken recently is expanding the scope of what is considered an incident and providing a safe, predictable place to investigate. It has had such a huge positive impact on the team that I wanted to share some details in the hope that it will help others.

If something seems off, call it an incident

We tend to think of an incident as a really, really bad, embarrassingly public moment: the customer is dissatisfied, the organization is losing money, everything is on fire. But many more kinds of activity can be categorized as incidents. And the more we normalize low-impact incidents, the more confidence and experience we build for when a Sev1 situation does arrive.

The first step in addressing some of our concerns about incidents was to ask the team to start thinking differently about how incidents are defined. Essentially, this was a cultural shift rather than a process change. And, as is the case with many cultural changes, team members’ thinking changed over time as they continued to see the behavior modeled.

There was no grand plan to start this; it just happened. Someone on our team declared an incident for a spike in transient failures in our test suite. While that doesn’t fit the classic definition of an incident for most teams, it turned out to be a great way to pull context together across multiple communication channels, understand the problem, implement a fix, and prioritize future work on the reliability of our test suite. During the course of the incident, we realized that much of this work was already happening; the team had just been avoiding the label.

In the end, it turned out that certain versions of Node.js had a memory leak that caused test timeouts. Over the course of several days, five people got involved, and despite being labeled an “incident,” the work didn’t derail anyone’s routine. If anything, it provided structure and space that lowered cognitive load rather than raising it.

It felt good to put a name to a recurring distraction that had been hard to prioritize, so we started talking about it. We’re a small enough company that other teams wanted to try our weird approach to incident declaration. Someone declared an incident when the marketing team was getting error notifications while trying to deploy the website. After 10 minutes of digging through Netlify, Gatsby, GitHub, and Contentful, they found an easy-to-fix permissions issue that unblocked a full day’s worth of work.

Right-size the response

We wanted to back this behavioral change with a technical foundation. How could we give people a safe way to investigate whether something was in fact an incident, without worrying about the repercussions that normally accompany declaring one, such as alerting or distracting colleagues?

We created a new severity type, “Triage,” with the simplest possible runbook step: create a Slack channel. This gives the engineer who discovers a problem a place to jot down stream-of-consciousness or play-by-play notes when something just doesn’t feel right. They can add the graphs they’ve looked at, the alerts that fired, and anything in recent run history that might have caused the problem (including red herrings) for later reference. If it becomes clear that there is a bigger issue, the information is already documented in the channel, making it easy to escalate and get the right people responding quickly.
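To make that runbook step concrete, here’s a minimal sketch of what “create a Slack channel for triage” could look like if you scripted it yourself against the Slack Web API. It assumes a bot token with the channels:manage and chat:write scopes; the function name, channel-naming scheme, and seed message are illustrative, not FireHydrant’s actual automation.

```python
# Minimal sketch of the single "Triage" runbook step: open a scratchpad Slack
# channel for a possible incident. Illustration only, not FireHydrant's code.
import re
from datetime import date

from slack_sdk import WebClient  # pip install slack_sdk


def open_triage_channel(token: str, summary: str) -> str:
    """Create a triage channel for a possible incident and return its ID."""
    client = WebClient(token=token)

    # Slack channel names must be lowercase, with no spaces or most punctuation.
    slug = re.sub(r"[^a-z0-9-]+", "-", summary.lower()).strip("-")[:40]
    name = f"triage-{date.today():%Y%m%d}-{slug}"

    channel = client.conversations_create(name=name)["channel"]

    # Seed the channel so the investigator has a place to start taking notes.
    client.chat_postMessage(
        channel=channel["id"],
        text=(
            f"Triage opened: {summary}\n"
            "Drop graphs, alerts, and recent deploys here (red herrings welcome). "
            "Escalate to a real severity if this turns out to be a bigger issue."
        ),
    )
    return channel["id"]


if __name__ == "__main__":
    open_triage_channel("xoxb-your-bot-token", "Spike in transient test failures")
```

Keeping the step this small is the point: the only cost of opening a triage is a new channel, so there’s nothing scary about being wrong.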

Even if the severity never changes, the record still provides valuable insight into system health. At my previous company, the CTO kept a notebook for working through any problem. He wrote down all his notes about an incident and would look back on them later with more context. Those notes might reveal that a serious problem actually began with a minor incident six weeks earlier, one that got pushed aside in favor of work considered higher priority at the time.

Where do we go from here?

As more “triage” severity incidents were declared and resolved, it became clear that our team’s shared definition of an incident was changing. And with that redefinition, we see evidence that incidents are becoming less scary for everyone involved.

This redefinition makes me think a lot about where we go next. An engineer’s definition of an incident can be very different from that of customer support, sales, or marketing: someone on their team being blocked from doing part of their job, or unable to do it without a lot of friction.

As we continue to evolve our definition of an incident, it makes sense to better understand how our internal and external customers are using our services.

More on this as we continue to build and refine FireHydrant’s internal incident management program. If you liked (or didn’t like) what you read here, please let me know: I’m dan@firehydrant.com.

Featured image via Pixabay
