What 24/7 Public Safety Support Taught Me About Reliability
A working note from years of on-call for 911 and law enforcement systems.
The first time someone called me about a 911 dispatch system lagging by a few seconds, I learned the only lesson about reliability that really matters: seconds are not the same in every domain. In B2B SaaS, a few seconds of latency is a degraded customer experience. In a 911 center, a few seconds is the time between a dispatcher knowing where an officer is and not knowing. That difference reshapes how you think about everything.
I came up through that environment. Eleven years at a company building computer-aided dispatch (CAD), Records Management, and Mobile Computer Terminal systems used by 2,500+ public safety agencies. On-call rotations that did not take weekends off. Production support that ran 24/7 because the customers ran 24/7. Most of what I do now as a manager - ownership boundaries, on-call rotations, triage flow, observability - traces back to lessons from that work.
A few of those lessons.
The inherited mess
The team I took over had double-digit daily production issues escalating from support to the dev team. Engineers were getting interrupted constantly. Features slipped. People burned out. The first instinct, when you walk into something like that, is to triage harder. The right instinct is to ask a different question: why is every issue ending up on a dev's desk?
The answer was almost never "because it is a code bug." Most production issues in our domain turned out to be either user errors - the system worked as designed, the user did something unexpected - or data issues: records in unexpected states, often because of how an agency had configured something years earlier. Both of those are real problems. Neither of them needs a senior engineer at 2am.
Documentation enables your first line
The first move was writing down resolutions to recurring issues - not as docs for engineers, but as runbooks that the support team could actually use to resolve issues themselves. Every time dev resolved something, we captured the diagnostic path and the fix. Within a few months, support was closing categories of tickets that used to escalate every time.
This sounds obvious. It is not. Most engineering teams view runbooks as a documentation chore. They are actually a forcing function: if you cannot write down what you did, you do not fully understand why it worked. And if you can write it down, someone else can do it.
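To make the shape concrete, here is a hypothetical entry; the fields and the scenario are invented for illustration rather than taken from the original runbooks:

```
Issue:       Unit status not updating on the dispatcher's screen
Symptoms:    One or more units show a stale status; no errors in the client
Diagnose:    1. Check the unit's last status timestamp in the admin console
             2. Confirm the mobile client still has an active session
             3. Look for the status record stuck in a pending state
Resolve:     Run the approved reset tool for the affected unit; confirm the
             status refreshes within one polling cycle
Escalate if: Multiple agencies are affected, or the reset does not hold
```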
Automation for the patterns that repeat
The second move was building small internal tools to automate the fixes for issues that came up frequently - scripts the support team could run safely, with logging, that did the same thing an engineer would do manually. The bar was not "build a beautiful tool." The bar was "remove the engineer from the loop for known patterns." Each automated fix bought back a small amount of engineering capacity. Over time, the small amounts compounded.
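A minimal sketch of the shape those tools took, with every name invented for illustration (the table, the statuses, SQLite standing in for the production database): a dry-run default, explicit logging, and one narrow, well-understood fix.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: a support-runnable fix for one known recurring pattern.
The dispatch_events table, its columns, and the statuses are hypothetical."""
import argparse
import logging
import sqlite3  # stand-in for the production database driver

logging.basicConfig(
    filename="support_fix.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def clear_stuck_events(conn, dry_run):
    """Reset events stuck in 'dispatching' back to 'pending' so they re-process."""
    cur = conn.execute(
        "SELECT id FROM dispatch_events WHERE status = 'dispatching' "
        "AND updated_at < datetime('now', '-15 minutes')"
    )
    stuck = [row[0] for row in cur.fetchall()]
    logging.info("found %d stuck events: %s", len(stuck), stuck)
    if stuck and not dry_run:
        conn.executemany(
            "UPDATE dispatch_events SET status = 'pending' WHERE id = ?",
            [(event_id,) for event_id in stuck],
        )
        conn.commit()
        logging.info("reset %d events", len(stuck))
    return len(stuck)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Clear dispatch events stuck mid-flight")
    parser.add_argument("--apply", action="store_true", help="apply the fix (default is dry run)")
    args = parser.parse_args()
    with sqlite3.connect("dispatch.db") as conn:
        count = clear_stuck_events(conn, dry_run=not args.apply)
    print(f"{count} stuck events {'reset' if args.apply else 'found (dry run)'}")
```

The important properties were the defaults, not the cleverness: support could run it with no arguments and see exactly what it would do before touching anything.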
Dashboards beat heroics
The biggest production performance wins did not come from heroic late-night debugging. They came from a dashboard I built and kept iterating on: it surfaced slow database queries in production, ranked by total impact. Look at the top of the list, understand the use case, add an index or rewrite the query, ship it, watch the next-most-impactful query bubble up. Repeat.
Most reliability problems in long-lived systems are tractable like this: a small number of patterns cause most of the pain. The work is unglamorous and continuous. It is also the work that moved the needle more than anything else I did during those years.
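The underlying idea is cheap to reproduce. A sketch of the query behind such a dashboard, assuming PostgreSQL with the pg_stat_statements extension (the original system ran on a different stack; ranking by total impact is the part that transfers):

```python
"""Rank production queries by total time consumed, not by single worst case."""
import psycopg2  # assumes a PostgreSQL database with pg_stat_statements enabled

TOP_OFFENDERS = """
SELECT query,
       calls,
       total_exec_time,          -- named total_time on PostgreSQL versions before 13
       mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC    -- total impact, not worst individual run
LIMIT 10;
"""

def print_top_offenders(dsn):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(TOP_OFFENDERS)
        for query, calls, total_ms, mean_ms in cur.fetchall():
            # A fast query called constantly can outrank a slow one called rarely;
            # sorting by total time is what surfaces it.
            print(f"{total_ms:12.0f} ms total | {calls:10d} calls | "
                  f"{mean_ms:8.1f} ms avg | {query[:80]}")

if __name__ == "__main__":
    print_top_offenders("dbname=app user=readonly")
```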
Synthetic transactions beat customer reports
Once we had enough of the reactive surface cleaned up, the next move was synthetic transactions: automated flows that ran continuously and pretended to be real users, on real critical paths. The point is not catching what monitoring already catches. The point is catching the half-broken states where the system is "up" but the experience is degraded. When a synthetic transaction starts taking longer or returning unexpected results, you find the problem before the dispatcher does.
This is the move I would make first in any new environment now. Before reorganizing teams. Before redesigning anything. Before adding tooling. Just put synthetic traffic on the most important user paths and watch.
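A bare-bones version of the idea, with the endpoint, the expected payload, and the thresholds all hypothetical; a real probe would authenticate and walk a genuine dispatch workflow end to end:

```python
"""Bare-bones synthetic transaction: one critical path, exercised on a timer."""
import logging
import time
import requests

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

CRITICAL_PATH = "https://example.internal/api/units/active"  # hypothetical endpoint
LATENCY_BUDGET_S = 2.0
INTERVAL_S = 60

def probe():
    start = time.monotonic()
    try:
        resp = requests.get(CRITICAL_PATH, timeout=10)
        elapsed = time.monotonic() - start
        # "Up" is not the bar: check latency and the shape of the response too.
        if resp.status_code != 200:
            logging.error("probe failed: HTTP %s", resp.status_code)
        elif elapsed > LATENCY_BUDGET_S:
            logging.warning("probe slow: %.2fs (budget %.1fs)", elapsed, LATENCY_BUDGET_S)
        elif "units" not in resp.json():
            logging.error("probe returned unexpected payload: %s", resp.text[:200])
        else:
            logging.info("probe ok: %.2fs", elapsed)
    except requests.RequestException as exc:
        logging.error("probe error: %s", exc)

if __name__ == "__main__":
    while True:
        probe()
        time.sleep(INTERVAL_S)
```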
Proactive observability protects everyone
The deeper lesson, and the one I carry into every team I run, is that proactive observability is not just a customer-quality move. It is a people move. Engineers who get paged every weekend stop being engineers. They become firefighters. Firefighters burn out, leave, or become heroes, and the hero pattern is its own problem. The team that catches issues before they are customer-visible is the team that gets to do other engineering work.
This is part of why I think most reliability problems turn out to be operating-model problems rather than infrastructure problems. The infrastructure question is "how do we make the system more reliable?" The operating-model question is "what is the smallest change that lets the team detect issues earlier and resolve them without dev escalation?" Those usually have different answers, and the second one almost always produces a better outcome.
The hero anti-pattern
A side note worth its own paragraph. Every team I have inherited has had at least one engineer who is the hero of production support: the person everyone pages, who fixes everything, often heroically, often on weekends. It looks like a strength. It is a problem.
The hero pattern hides the underlying issue: there is knowledge in one person's head that should be in the team's documentation, and there is no system pulling that knowledge out. The hero gets credit for fixing things, which trains them to keep fixing things, which keeps the issues recurring. The hero's career plateaus. The team stays fragile.
The most useful coaching conversation I have had with engineers in this pattern goes something like this: "Your job is no longer to fix production issues. Your job is to make production issues not happen. The measure of your success is not how fast you respond; it is how rarely you need to." Moving the goalpost from reaction to prevention has, more than once, taken an engineer from permanent firefighter to next promotion.
What this still means
I work in insurance claims now. The systems are different. The stakes are different. But the operating-model lessons travel. Most outages I have seen across domains share a structure: an underinvested observability layer, an undocumented support path, a heroic engineer holding things together, and a small number of recurring patterns causing most of the pain. The work is the same:
- Get the patterns out of people's heads and into runbooks.
- Automate the resolution for things that repeat.
- Watch the system the way an attacker would: synthetic traffic on critical paths.
- Move the team's reward function from reaction to prevention.
Public safety taught me the cost of getting reliability wrong. That cost shaped how I run engineering teams now, regardless of domain.