Runbooks Are Overrated

I have a take that makes some ops people mad: runbooks are overrated.

Not because documentation is bad.

Because most runbooks get written for problems you already understand, and then used at the worst possible moment.

The runbook trap

The pattern looks like this:

A startup begins turning into a scale-up.
Incidents happen more often (or just hurt more).
Leadership feels the pain and wants something concrete.
The team gets told to write runbooks for everything.

You get a folder full of “Restart service X” and “Clear queue Y.”

It looks like operational maturity.

But it is often just maturity cosplay.

A story: the runbook-first CTO

I recently did some contract work helping a startup become a scale-up.

One of the first big moves was hiring a CIO from a large tech company to act as the CTO.

What he pushed hardest was runbooks.

Not for unknown, weird failures.

For simple issues that were happening repeatedly and could have been fixed. We ended up with a nice collection of documented processes.

It helped with ISO and SOC compliance because it proved we had procedures. But the runbooks were mostly describing the current shape of our technical debt.

And worse: they were describing it in a way that would rot.

Why runbooks fail in practice

Runbooks fail for boring reasons:

They drift. The UI changes, the button moves, the name changes, the environment changes.
They get written by the people who already know the fix.
They get read by the people who are least able to debug (new joiners, stressed on-call).
They become a substitute for fixing known issues.

When things go wrong at 2am, the runbook is usually one of two things:

a reminder of what you already planned to do
an outdated set of steps for a system that no longer looks like the screenshots

The failure mode is not “we forgot the commands.”

The failure mode is “the system changed, and the runbook didn’t.”

The uncomfortable point

If you can write a runbook for a problem, you probably understand the problem.

If you understand the problem and it is happening often, the best investment is usually:

remove the class of failure
reduce the blast radius
automate the recovery

Not “document the manual steps to do the recovery again next week.”

Runbooks are valuable as proof that a procedure exists. They do not make the system more reliable.

What to do instead

When people say “we need runbooks,” they’re usually reaching for reliability, confidence, and control.

You usually get those outcomes faster with the four practices below.

1. Eliminate and automate repeat failures

If an issue has a runbook because it happens every week, treat that runbook as a symptom.

Make a short list:

top recent pages
top user-facing incidents
top manual actions during incidents

Then spend real engineering effort deleting those problems.
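One way to build that short list is to rank recent pages by how much manual work they cost. The sketch below assumes you can export pages as (alert name, minutes of manual work) pairs; the alert names and numbers are made up for illustration.

```python
from collections import Counter

# Hypothetical export of recent pages: (alert_name, minutes_of_manual_work).
# The names and numbers are illustrative, not real data.
PAGES = [
    ("queue-y-backlog", 20),
    ("service-x-unresponsive", 15),
    ("queue-y-backlog", 25),
    ("disk-full", 40),
    ("queue-y-backlog", 30),
    ("service-x-unresponsive", 10),
]


def rank_repeat_offenders(pages, top_n=3):
    """Return (alert_name, page_count, total_minutes), costliest first."""
    counts = Counter()
    minutes = Counter()
    for name, mins in pages:
        counts[name] += 1
        minutes[name] += mins
    # Sort by total manual minutes, breaking ties by how often it paged.
    ranked = sorted(counts, key=lambda n: (minutes[n], counts[n]), reverse=True)
    return [(n, counts[n], minutes[n]) for n in ranked[:top_n]]


if __name__ == "__main__":
    for name, count, total in rank_repeat_offenders(PAGES):
        print(f"{name}: {count} pages, {total} min of manual work")
```

Whatever lands at the top of that list is a candidate for an engineering fix rather than another runbook.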

Where the fix is safe and repeatable, automate it.
The best runbook is a button:

click → wait → done

If it’s too scary to automate, that’s a signal about the system—not about automation.
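A minimal sketch of what "click, wait, done" could look like in code, assuming you can express the symptom check, the mitigation, and the health check as functions. The `remediate` helper and its parameters are hypothetical, not part of any particular tool.

```python
import time
from typing import Callable


def remediate(
    is_unhealthy: Callable[[], bool],
    mitigate: Callable[[], None],
    is_healthy: Callable[[], bool],
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Apply one mitigation, wait for recovery, and say what happened."""
    # Guardrail: do nothing if the symptom is not actually present.
    if not is_unhealthy():
        return "skipped"

    # Click: apply the mitigation exactly once.
    mitigate()

    # Wait: poll the health check until it passes or the timeout expires.
    waited = 0.0
    while waited < timeout_s:
        if is_healthy():
            return "recovered"  # Done.
        sleep(poll_s)
        waited += poll_s

    # The mitigation did not help in time, so hand off to a human.
    return "escalate"
```

The guardrail and the explicit "escalate" result are the point: the automation only acts on a confirmed symptom, and it hands off to a human instead of retrying forever.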

2. Optimize for humans under stress

During incidents, people don’t read.

They pattern-match.

Prefer short checklists over long procedures:

verify the symptom
verify the impact
identify the owner
apply the safe mitigation
decide whether to escalate

Keep it to one screen. Anything longer won’t be read when the pager is going off.

3. Treat most runbooks as training material

This is the part most teams get wrong.

Many “runbooks” aren’t incident docs at all—they’re onboarding docs:

how deployments work
where logs live
how tracing fits together
how data is structured
how rollbacks fail

The goal isn’t “follow these steps.”

It’s “build a mental model so you can debug when the steps are wrong.”

Good on-call engineers don’t memorize runbooks. They understand the system.

4. Make the system explain itself

The best on-call experience is when the system tells you what’s happening.

That means:

clear error messages
correlation IDs everywhere
dashboards that answer real questions
alerts with context and hypotheses

Runbooks help with audits.

Observability is what helps at 2am.
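As one illustration of "correlation IDs everywhere," here is a sketch using only Python's standard library: a context variable holds the ID for the current request, and a logging filter stamps it onto every log line. The logger name, the `handle_request` function, and the log messages are hypothetical.

```python
import contextvars
import io
import logging
import uuid

# Holds the correlation ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def build_logger(stream):
    logger = logging.getLogger("orders")  # hypothetical service name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(CorrelationFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s cid=%(correlation_id)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def handle_request(logger, order_id, incoming_cid=None):
    # Reuse the caller's ID if one was passed in; otherwise mint a new one.
    token = correlation_id.set(incoming_cid or uuid.uuid4().hex)
    try:
        logger.info("charging order %s", order_id)
        # A clear error message says what failed and what is safe to do next.
        logger.error(
            "payment provider timed out for order %s; retry is safe", order_id
        )
    finally:
        correlation_id.reset(token)


if __name__ == "__main__":
    buf = io.StringIO()
    handle_request(build_logger(buf), "o-1", incoming_cid="abc123")
    print(buf.getvalue(), end="")
```

With the ID on every line, the on-call engineer can search for one value and see the whole request, instead of following steps that guess where to look.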

When runbooks are worth it

Runbooks are still useful when:

the action is high-risk and you want guardrails
the procedure is rare (quarterly, yearly)
the failure mode is operational, not technical (access, rotation, vendor steps)
you have a big on-call rotation with lots of new folks

Even then, the bar is: it has to stay correct.

If the UI changes weekly, a runbook full of UI steps is a liability.

The real goal

People ask for runbooks because they want the feeling that the company is in control.

The control doesn’t come from writing down how to restart things.

It comes from:

fewer repeat incidents
faster diagnosis
safer mitigations
systems that are easier to operate

Write runbooks if you need to. But treat them as proof and training material.

If you’re writing one because the same issue keeps happening, the runbook isn’t the work.

Fixing the issue is.