Velocity’s Edge Podcast S1E4 - Carla Geisser & Chris Swan on Crisis Engineering

Velocity's Edge podcast guests Carla Geisser & Chris Swan

As Carla Geisser puts it: “The incidents that actually matter to how people interact with technology are not security incidents … They are things like, they can’t log into their bank account, they can’t buy their Taylor Swift tickets, they can’t get on an airplane.” And when everything’s on fire, most organizations make a critical mistake: they treat the crisis as the exception rather than the expectation. The companies that survive and thrive are those that understand a fundamental truth: if your business is growing, crises aren’t anomalies—they’re predictable outcomes of scale.

Crisis engineering - the art of troubleshooting and restoring proper service levels - isn’t about heroics. It’s about recognizing when manual intervention becomes unsustainable, and about building systems that handle failure as a normal operating condition. The question isn’t whether you’ll have incidents, but whether you’re consuming your entire organization’s capacity fighting fires that should be routine.

In this episode of Velocity’s Edge, Carla, Chris Swan, and Nick Selby explore the discipline of crisis engineering. Their conversation tackles essential questions: When does manual intervention become a resource problem that demands automation? How do you get out of incident mode as quickly as possible without leaving critical work undone? Why do organizations need to treat predictable events—Black Friday, tax day, major product launches—as declared disasters in advance? Most importantly, how do you build organizational memory and muscle for handling crises without burning out your teams?

Carla, an EPSD advisor and partner at Layer Aleph, pioneered Google’s SRE principles, led the effort to rescue Healthcare.gov during its 2014 crisis, and guided Fastly’s recovery from their global outage. Chris, an EPSD advisor and Engineer at Atsign, brings CTO experience from UBS and Credit Suisse. Nick, EPSD’s founder and Managing Partner, has led crisis response across law enforcement and technology, from the NYPD to fintech and insurance companies.

As in all our episodes, we speak in plain, executive-summary business terms, framing complex business and technology strategy challenges in context, using language that makes them more accessible and actionable.

Listen here, download it from Apple Podcasts, Spotify, or find it wherever you get your podcasts.



Episode Information
Season 1, Episode 4
Length: 27 minutes, 46 seconds
Host: Nick Selby
Guests: Carla Geisser, Chris Swan
Recorded: VOXPOD Podcast Studios, Parsons Green, London
Engineer: Dayna Ruka
Editors: Dayna Ruka, Jeet Vasani

Episode Transcript

Nick Selby: It’s Velocity’s Edge podcast. I’m Nick Selby, and I’m here today with two members of the EPSD advisory board. The first one is Carla Geisser. Carla, can you say hello and introduce yourself?

Carla Geisser: Hi, my name is Carla Geisser. I am an advisor with EPSD, and I’m also a partner at Layer Aleph, which does crisis engineering. We intervene in technology projects that are in some kind of trouble.

NS: Okay. And I’m here with Chris Swan, who’s also my podcast partner on the Tech Debt Burndown podcast. Chris, can you say hello and introduce yourself?

Chris Swan: So hi, I’m Chris Swan, and I spend a lot of my time engineering at Atsign, where my focus is on progressive delivery and getting stuff into production safely and fast.

NS: Thanks. And Carla, it should be noted, got up very, very early in the morning to talk to us because Chris and I are in London and Carla’s in New York.

Carla, we met in 2017. You had described what you did at that time as “non-security incident response.” I do like “crisis engineering” a lot better. Can you tell us what it is that you do? Because, like me, if somebody sees your face, something bad has happened.

CG: I think, to build a foundation: most of the incidents that actually matter to how people interact with technology are not security incidents. No offense to you, Nick. They are: they can’t log into their bank account, they can’t buy their Taylor Swift tickets, they can’t get on an airplane… All of those are technology problems that probably do not have an underlying security cause.

If there’s a security incident, maybe you get a letter in the mail six months later saying there’s been a breach and it had no effect on your life, but the ones that really cause a business to be in trouble immediately are often non-security incidents. They are downtime. They are inability to do the core user functions of the system.

So I have found this niche, coming from a career in site reliability engineering at Google and then working on Healthcare.gov, and whenever there’s a problem like this, I hope to be involved.

NS: I’m listening to droves of colleagues in the security space now saying, “But availability is a security problem.”

I’m actually doing some work right now where it was really great that somebody wrote a script that broke almost everything in Salesforce. You know, inserted Customer A’s records into Customer Banana’s records. And of course, they didn’t write a logging function, so nobody knew what it did, but they just knew that everything was bad and they were at p-zero for like three days. And somebody mentioned it to me casually, and I said, “Did anybody think about calling in the security team?”

They’re like, “Why? That wasn’t a security problem.”

Like, well, yeah, it was, but I don’t know…

Where do you draw that line? Because technically it’s important. But this isn’t just a question of dictionary definitions. I have found that when something goes catastrophically wrong, you never go wrong bringing in more people than you think you might need. So where do you fall on that?

CG: I mean, I am an ecumenical incident responder, I would say. We want everybody who might have an idea of what has happened or how to fix it involved in our incident response. The line between security and everything else is very blurry, and the line between technology operations and business operations is basically non-existent. So we want everybody involved.

I think part of the reason we’ve tried to focus more on non-security incident response is because the security people were very enthusiastic with their marketing over the last 15 years. And if you say “security incident response,” that means a very particular thing. That means forensics. That means data preservation for potential law enforcement action. And those are all useful things, but they are not what we do.

NS: I’ll go into my favorite quote of yours. If you open the Google SRE handbook, which I love, to the chapter on toil, up at the top of that chapter is your quote: “If a human operator needs to touch your system during normal operations, you have a bug.”

Two questions: The first one is can you explain that? And then also, can you talk about the three things that CEOs and COOs can do tomorrow to understand whether this is actually a problem for them, and what can they do internally to get their teams to start to get a handle on the right issues?

CG: So the sort of backstory of that quote was thinking about things like machine failures at Google. If you run a tiny startup, you have maybe three or four cloud instances running somewhere. They’re going to go down sometimes, but you can do any kind of failover process by hand, because three or four machines are not going to fail that often.

If you are Google, and you have hundreds of thousands or millions of computers, some of them are failing every minute, every hour. So that needs to become part of your operational practice, something that just happens automatically without a human touching it. Because if a person has to go to every computer and go, “Oh well, XYZ one-two-three failed. Let me do a manual procedure to move all the data to somewhere else,” you’re going to need hundreds of those people– or thousands of those people. It just doesn’t scale.
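A quick back-of-envelope sketch of the scaling point Carla is making here. The numbers are illustrative assumptions (a ~4% annual hardware failure rate and round fleet sizes), not figures from the episode:

```python
# Illustrative back-of-envelope: why manual failover stops scaling.
# The failure rate below is an assumed ballpark, not a measured figure.

ANNUAL_FAILURE_RATE = 0.04  # assume ~4% of machines fail in a given year

def failures_per_day(fleet_size: int) -> float:
    """Expected machine failures per day for a fleet of the given size."""
    return fleet_size * ANNUAL_FAILURE_RATE / 365

for fleet in (4, 1_000, 100_000, 1_000_000):
    print(f"{fleet:>9,} machines -> ~{failures_per_day(fleet):,.2f} failures/day")

# Approximate output:
#         4 machines -> ~0.00 failures/day   (a manual runbook is fine)
#     1,000 machines -> ~0.11 failures/day
#   100,000 machines -> ~10.96 failures/day
# 1,000,000 machines -> ~109.59 failures/day  (failover has to be automatic)
```

At small fleet sizes a manual runbook is cheap; at warehouse scale, failures arrive many times a day and have to be handled as routine, automated events.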

Part of what I was getting at is, it’s okay to do something manually if your business allows for it and it’s fairly rare.

I guess another example: if you are an e-commerce company, like a small website selling stuff, then you’re going to have your “Black Friday panic moment.” And it’s okay. That’s once a year. You can be ready for Black Friday, and you’re just one company selling your stuff. If you are Shopify, you are doing something that is like Black Friday– a sale in some country, a special promotion somewhere for some retailer– constantly. That is just always happening. So you can no longer have your “Black Friday war room.” You have to have a way to handle those spikes in load due to sales, due to unexpected promotions… You just have to be prepared for that all the time.

NS: I like that example. For a senior executive trying to understand that, to be able to self-assess it– without having to call in a bunch of people and professionals to tell them how to do it– what are some of the things that you would suggest? How do they know where they sit on that? You just gave us a nice timeline there.

CG: Look at the amount of time your people are spending on the rare event. It will be pretty obvious if you have senior leadership status meetings: if they are constantly preparing for the next rare event, or constantly working on things that should just be part of your business practice, then you’ve let the line swing too far towards “the crisis is now the normal.” The same is true if you see that a huge amount of your organization’s capacity, in terms of people, is going towards operational stuff that feels like it is just part of running the business.

There is the opposite version, which is sysadmins will often say, “Oh, this thing is going to be a disaster. We have to do it every two years,” and they really want to build up a whole project to automate that thing, so that every two years when you have to do it, it will not be a disaster. And that’s a thing to question as a leader. Maybe it’s actually fine to put all the people who are experts in a room and have them do it once every two years, whatever it is.

NS: So it really comes down to your resource planning, which is, I think, non-technical enough that people could actually understand it. “When does our manual intervention become a problem that we want to do something about?” is really what it is.

CG: And when is it consuming some fraction of your human capabilities that you don’t want to spend on that? Because at some point it will consume your entire organization.

CS: This is another thing where I think it’s useful to have heuristics, because you want to automate all the things, but actually you can’t automate something without understanding it, and you probably can’t understand it without doing it at least once manually. And you’re then into– I think there’s a great XKCD on this– mapping out the charts of effort spent on automation versus effort spent just doing the thing manually, and finding the right balance.
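A minimal sketch of the break-even heuristic Chris is gesturing at, in the spirit of the xkcd chart he mentions. The function and every number below are illustrative assumptions, not anything prescribed in the episode:

```python
# Break-even heuristic: does automating a task save more time than it costs?
# All inputs are assumptions the reader supplies.

def automation_pays_off(
    manual_minutes: float,      # time one manual run takes
    runs_per_year: float,       # how often the task happens
    automation_hours: float,    # estimated effort to automate it
    horizon_years: float = 2.0, # how long the automation stays useful
) -> bool:
    """Return True if automating saves more time than it costs over the horizon."""
    hours_saved = manual_minutes * runs_per_year * horizon_years / 60
    return hours_saved > automation_hours

# A task done by hand twice a year for a full day, vs. a month of automation work:
print(automation_pays_off(manual_minutes=8 * 60, runs_per_year=2, automation_hours=160))
# -> False: per Carla's point, it may be fine to just put the experts in a room.

# A daily 15-minute toil task, vs. a week of automation work:
print(automation_pays_off(manual_minutes=15, runs_per_year=250, automation_hours=40))
# -> True: this is the kind of toil worth engineering away.
```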

CG: Yeah. And I am a big believer in building your automation, and your incident response, and all of this stuff off of manual work first. Because the manual work is a discovery process. You will learn things that are not in the documentation. You will learn that there is a step where Chris has to hand his secret spreadsheet, full of error records, to Nick, who will then re-run it through the shell script he’s had lying around for five years, and none of that will be visible until you try and do it by hand.

NS: Yeah. And it’s funny, because I’m actually in the middle of something very related to this right now. How do you know when to stop field testing and discovering all these manual processes, and just get to the point of saying, “Okay. Here are the business priorities. And in order to get there, we’ve got to stop having nine people do this”?

I think that that’s actually a fairly straightforward line. I just don’t hear a lot of executives looking to get into that conversation. Engineering leadership has to raise it, in non-technical terms, on a regular enough basis that it’s top of mind.

CG: And engineers and system administrators are notoriously willing to do weird manual work forever.

NS: Yes. Yes. Almost as a point of pride. Like you, I tend to get dropped into situations where everything has gone wrong; there are several people in charge of different aspects of it; there’s no one in charge of other large swaths of the thing.

I was really impressed when we first met. We were in the middle of both a security incident and a crisis engineering moment. And the first thing that you said to me after we introduced each other was like, “No. The team is going to work Monday through Friday, 9 to 5 with an hour for lunch. They’re not going to start work before nine. They’re not going to work on the weekends. This is a marathon.”

And I really liked the way you approached that, looking at things for the long term. Can you talk a little bit about your process of getting dropped into hell and how you sort out what’s going on?

CG: The key for me for sorting out what’s going on is getting everybody into a room who might know something. They might not have to stay there forever, but you need anyone who might have a little bit of a clue there for the initial response. And then I typically start a process of mapping out whatever the core thing that the business does, or the core thing that is broken, from the very beginning. So wherever a record enters the system, wherever a user logs into the thing… and just using that set of people, tracing what happens and what is currently not happening with that flow of work.

Because that will reveal a huge amount of stuff we know about the system, and also a huge amount of stuff that we don’t know about the system. And then we can go find out those answers using either the people we’ve already gathered or new people who will bubble up to the surface.

NS: Yeah. I just mentioned your “five minute rule” to you. Do you want to talk a little bit about that? Like the early questions that you ask when you’re going into an organization. Or am I being too obscure?

CS: Maybe a little. I am actually kind of thinking back here to incidents I’ve been sucked into. And you’ve touched upon an important point there that I’ve often arrived at a thing just at the kind of fatigue point, where everybody’s thrashed themselves literally to not being able to stand up anymore. And at that point, it’s actually very difficult to get an understanding of the situation, because people are too tired to know which way is up. I think you almost have to sort of force a step back then and say, “Okay, pause for a moment. Let’s actually understand the landscape that we’re operating on.”

I think it’s a sort of natural tendency, as we go into crisis mode, to just go full-on, full velocity. But where’s that handoff point between “Okay, we’ve sprinted until we’ve fallen” and “now we need to understand that this is going to be a marathon”? In cases like NotPetya– that one comes to mind– it was months.

You can’t be in full action stations for months. I’m using a military analogy there, because if we think about states of readiness… that top state of readiness… you can only actually achieve that for a matter of hours. And then people start to fatigue, which is why the next stage down is six hours on and six hours off. Because you can do that indefinitely.

CG: Yeah.

NS: It’s really true.

CG: And the military and other first responder-types have learned this lesson hundreds and hundreds of times. They have it written down in their standard operating procedures that you can go really hard for maybe a day, and then you need to start doing things in shifts with regular breaks, and logistics support, and all the stuff that is necessary to keep the human part of the organization going.

When I’ve been involved in incidents, maybe you have that initial surge for about a day, but then you settle to a cadence that is, “We’re going to do two major changes a day.” And that’s still aggressive for most organizations, but it’s not full chaos.

NS: And the first thing you said to me when we came in there– because we both got dropped into utter chaos, and it was more chaotic than I had been used to– was not just “something bad has happened.” It was, “Something bad has happened because so many other bad things led up to it, and we’re not sure which of those bad things led to this one, but we’ve got to figure that out.”

These learning moments, I think, are something that a lot of organizations leave out of… I have the opportunity to look at a lot of incident response plans, incident response run books, and I see a lot of people paying attention to the very beginning, coalescing, “What’s your call tree? How do we bring all the people together into the same room? Where do we put them?”

I don’t see a lot of takeout menus. I don’t see a lot of “what do we do when we’re actually in the room?”

It’s very frequent that I’ll come into an incident and I’ll say, “So where’s the list of the criteria required to declare this incident over?”

And no one has thought of that. They’ve just been running really, really fast, but they’re not taking the time to figure it out.

And the most important thing that I see people not doing is learning. I see people writing a lot of postmortems, but I don’t see a lot of concerted effort to learn the lessons of the postmortems, especially around not just “how did the incident happen,” but “how did our response to the incident unfold, and what can we do better next time?” And that’s kind of in your direct field, right?

CG: Yeah. One of the things you mentioned was how to step down an incident. And in general, my attitude is that you actually want to get out of incident mode as fast as you reasonably can. The moment the things start to get routine, and start to get boring, and people are basically doing their jobs but they happen to be sitting in a different room, then you need to spin down the incident response and send everybody back to their desk– even if you’re not done yet.

Because keeping the incident response running is extremely expensive. It’s extremely unhealthy, both for the organization and for the individuals involved. You’re much better off spinning that down fairly quickly and turning it into your regular business project planning that now has a list of stuff that is “…and we are doing remediation here because now it’s just everybody’s job.”

So you want to get up to a pretty high cadence very quickly, and then the moment you think you might be done, you should stop. Because any organization will eventually learn that there is a secret room where meaningful work can get done really fast, and new stuff will start getting added to your incident war room that had nothing to do with the incident.

NS: That’s absolutely true. Now, going counter to that, we were talking before we started recording about some recent experience. You mentioned Healthcare.gov. That was when we first met. Can you give a recent example and throw this into the mix? There are executives who might not be intimately involved with incident response, but they’ve got criteria that they’re not going to let anybody leave that room until they’ve got them. And I’m thinking now, people like legal, compliance, right? People who are like, “Well, you can’t call this incident closed until you’ve got X, and I don’t have X on my desk, therefore, y’all go back into that room right now.”

CG: And they are not wrong, those people. But there’s a difference between the incident being closed and the war room cadence, or whatever you’re going to call it, spinning at full tilt. And it’s important to separate those two things.

One of my favorite examples, not from technology at all: I got in the habit of reading wildfire incident reports. They were doing twice-daily check-ins for this wildfire I was reading about, and they were really fun to watch. And then the last one I read said, “The fire is not out. We have a perimeter in front of the important properties and wilderness boundaries. The fire will go out when the snow starts.” The full incident response was over: even though the fire was still definitely burning and was going to burn hundreds of acres more of wilderness, they were no longer doing incident response.

NS: And I think this is actually one of the reasons why I love the way you work: you’re taking that human element into account. Because I’m sure that the people I was just talking about, who were saying, “Nobody’s leaving that room until I get X,” they’re not basing it on what’s going on now. They’re basing it on the history of people keeping promises or getting distracted or peeled off. As a matter of fact, in that first place that we were, they promised me 15 people for nine weeks, and I ended up with like eight people for three weeks. And they just kept peeling them off one at a time, and there I was, sitting alone in the room.

CG: Yes. And that’s why it’s important to spin it down fast, because it’s going to get spun down no matter what, so you might as well be aware and involved in that process. The executives are going to get distracted, because they always do. They’re going to take their people off to the next big launch for the conference that’s in six weeks. You’re better off saying, “Okay, the incident is over from our perspective. Using the criteria that we have defined, we understand there is follow up work. This can go on your Q4 schedule.”

And then it’s kind of up to the executives to keep it on their Q4 schedule. And if they don’t, well… (Laughs)

NS: Yeah, if you don’t, then it’s Tuesday, and we’re going to have the same problem next time.

This is to both of you. If what we’re talking about is building those muscles, building that capability to recognize what’s happening, bring people together quickly, get the thing to the point where now you can actually make it almost just part of sprint work and get back to normal, what are the things that senior executives have to do on their own? What can they do to cultivate that kind of organizational resiliency to incidents without having to bring in outsiders? What can they do tomorrow?

CS: So I’ll jump in with tabletop tactics. Carve out some time to think about what could possibly go wrong, and when that “what could possibly go wrong” happens, what are you going to do about it? And as part of that, maybe use the tools that you plan on using for how you’re going to talk to each other, bearing in mind as well that the tools you use today might not be working, because that could be the thing that you’re having the incident for. And I think this is where we get back to the commonality between security incidents and non-security incidents, is you probably want to be using the same approaches for those, and you probably want to be actually doing the tabletop exercises in common for both types of incidents.

CG: Yes. I love tabletop exercises. And I also love… way back at the beginning, we were talking about routine operations that are maybe not so routine. Like your Black Friday, your tax filing day, your big promotional event if you’re a retail site selling Beyoncé tickets, whatever it might be… Basically declare disaster in advance of those incidents, and get everybody you’re going to need in the room. Because, you know, no matter what, it’s going to be interesting, so you might as well treat it as if it’s always an incident.

Even if nothing bad happens. You’ve gotten all the right people in the room. You’re using all the tools that you would use to manage it if it became an incident. It’s a tabletop exercise, but it’s a more real one because it is around something that is actually happening. So you should do both.

NS: It’s a live-fire tabletop exercise is what it is. The funny thing is, I have a customer who’s actually doing exactly that. And I don’t know if they’ve ever been that explicit about the fact that that’s what they’re doing, but that is what they do. And it is around one of those things that you mentioned. And every year they get better at it, and every year they break the records that they thought that they couldn’t break the previous year.

And until you just said this, I always worried, because it felt very ad hoc, and it didn’t feel very organized. I was always saying, “Well, you keep on making mistakes and getting away with it. You’re just going to keep on pushing people farther and farther.”

But as you just said that, it made me realize, actually, this is kind of inspired, what they’re doing. Because whether they knew it or not, that’s what they ended up doing. Exactly what you’re suggesting.

CG: And eventually, if they do it, if they keep up, whatever the practice is for some number of years or quarters, it’ll start to get boring. And maybe the war room for whatever that event is will have two people in it who are like, “Why are we even here? Like the last three of these, nothing interesting happened. Our preparation was great. Our tools worked. I would like to go take a nap.”

CS: You’ve just described all of the war rooms on the 1st of January 2000.

NS: Yes, that’s exactly right.

CS: Yeah. We’ve now got this kind of conspiracy theory of “it was all overblown,” whereas the reality was that it was very overprepared.

NS: The very best-prepared nothing. Yes.

CG: And I think all of us would much rather have a war room where no one got to go to the party, and we’re ordering extra pizzas, and nothing is happening for the year 2000, versus having to fix whatever might have broken with those people who we have gathered together. That is an ideal outcome.

CS: I was just thinking about your organizational memory here. We’re now closer to 2038 and the rollover for Unix time than we are to 2000. And we knew about that then, and we knew that we needed to get ready for it, and it’s still going to be a crisis like the last time around.
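For reference, the rollover Chris mentions is the “Year 2038 problem”: signed 32-bit Unix timestamps run out on 19 January 2038. A quick check with Python’s standard library:

```python
# Signed 32-bit time_t runs out on 19 January 2038.
from datetime import datetime, timezone

MAX_INT32 = 2**31 - 1  # largest value a signed 32-bit counter can hold
print(datetime.fromtimestamp(MAX_INT32, tz=timezone.utc))
# -> 2038-01-19 03:14:07+00:00
# One second later, a 32-bit counter wraps around to a date in 1901.
```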

NS: It’s going to be. Yeah.

CG: I will add one last two cents, which is that you mentioned organizational storytelling. That is a huge part of the wrap-up for any incident that needs to happen. You need to cement a story in people’s minds about what happened.

The military is great at this– giving everybody a sticker, or a patch, or something, that says, “Hey, we were all involved in this major incident that happened. We all survived. Here is the memory we all have about what we did. And here’s the story we are going to tell to the interns next summer about what happened.”

Because otherwise, like the year 2000, it becomes this weirdo conspiracy theory where I know people who were pulled out of retirement to fix COBOL code at major banks, and nobody remembers that anymore.

NS: Well, they will when they get called in in 12 years, because it’s going to be the same folks.

I think we’ve all said it, but as a big lesson, that initial… and Eric Olson, who works with us, said this… in the early moments of something going wrong, you don’t know if it’s infra, if it’s ops, if it’s a hacker, if the building is on fire… you don’t know. So get everybody together.

And that has to include every subject matter expert. It’s not just the engineers. It’s also the people in your organization, the business leaders, and your communications people. They’ve got to be there to remember those kinds of things and be able to start taking those notes.

I know that the incident reporter role is often seen as very un-sexy. I can’t tell you how many times I’ve saved an operation just because I was there typing stuff, doing nothing, but just watching what everybody did and just typing it down.

This collective history and that storytelling is super, super important.

Carla, thank you for getting up so early. Chris, thanks for joining us. I’m Nick Selby, and this is Velocity’s Edge.