2025.08.17·6 min·leadership

Notes on running on-call rotations

I've been on-call for teams of 4 and teams of 40. The difference isn't the pager — it's everything around it. Here's what I've learned about building on-call rotations that don't burn people out.

Small teams: the good and the bad

At 4-8 engineers, on-call is simple because everyone knows every system. You can diagnose anything because you wrote it. The bad part: you're on-call too often. Every third week, sometimes every other week. There's no redundancy, no one to shadow, and the bus factor is terrifying.

The fix at this size: invest aggressively in runbooks. Not wiki pages that rot — living documents checked into the repo next to the code. If a service can page you, it should have a runbook that a tired engineer at 3am can follow without thinking.

Medium teams: the awkward middle

At 15-25 engineers, you have enough people for a reasonable rotation, but knowledge is starting to silo. The backend person doesn't understand the frontend alerts. The infra person doesn't know why the API is returning 500s. Pages start bouncing between people. Mean time to resolution goes up, not down.

The fix: pair on-call. Two people per shift, one primary and one secondary. The secondary shadows for a rotation, then becomes primary. This costs an extra person per week but dramatically reduces both response time and burnout.
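
To make the handoff concrete, here is a minimal sketch of a pair schedule in which each week's secondary becomes the next week's primary. The function name `pair_rotation` and the engineer names are made up for illustration, and it assumes a simple round-robin with one-week shifts.

```python
def pair_rotation(engineers, weeks):
    """Return a (primary, secondary) pair for each week.

    The secondary in week w is the primary in week w + 1, so everyone
    shadows a shift before carrying the pager themselves.
    """
    n = len(engineers)
    if n < 2:
        raise ValueError("pair on-call needs at least two engineers")
    return [(engineers[w % n], engineers[(w + 1) % n]) for w in range(weeks)]


# Hypothetical team of four, scheduled for five weeks.
schedule = pair_rotation(["ana", "ben", "chi", "dev"], 5)
for week, (primary, secondary) in enumerate(schedule, start=1):
    print(f"week {week}: primary={primary} secondary={secondary}")
```

A real schedule would also need to handle vacations and swaps; this only shows the shadow-then-primary ordering.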

Large teams: the bureaucracy trap

At 40+ engineers, you have enough people for multiple specialty rotations — backend, frontend, infra, data. The trap is that each rotation becomes its own silo with its own escalation path, and soon you have a situation where a simple problem requires three teams to coordinate at 2am.

The fix: tiered on-call. Tier 1 is a generalist who triages and resolves the 80% of issues that don't need deep expertise. Tier 2 is the specialist who gets escalated for the hard problems. Most pages never reach Tier 2.
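
One way to picture the triage step, as a rough sketch: Tier 1 handles anything it has a runbook for and escalates the rest to the specialist for that area. The "has a runbook" rule, the `handle_page` function, and the sample alert names and paths are my assumptions for illustration, not a prescribed process.

```python
def handle_page(page, tier1_runbooks, specialists):
    """Decide who handles a page.

    Assumption: Tier 1 can resolve a page if a runbook exists for its alert;
    otherwise it goes to the Tier 2 specialist for the page's area.
    """
    runbook = tier1_runbooks.get(page["alert"])
    if runbook is not None:
        return ("tier1", runbook)
    return ("tier2", specialists[page["area"]])


# Hypothetical runbooks and specialist rotations.
runbooks = {"disk_full": "runbooks/disk_full.md"}
specialists = {"data": "data-oncall", "infra": "infra-oncall"}

print(handle_page({"alert": "disk_full", "area": "infra"}, runbooks, specialists))
print(handle_page({"alert": "replication_lag", "area": "data"}, runbooks, specialists))
```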

What actually works

Follow the sun if you can. If you have engineers in multiple time zones, rotate by time zone so the pager follows people's working hours. Waking someone up at 3am is expensive: in morale, in cognitive performance, and in the quality of the fix they ship at that hour.
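
A small sketch of the idea, assuming three sites and a 9-to-17 local working day. The site names and UTC offsets are hypothetical, and fixed offsets ignore daylight saving time, so treat this as an illustration rather than a scheduler.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sites with fixed UTC offsets in hours (daylight saving ignored).
SITES = {"dublin": 0, "singapore": 8, "seattle": -8}


def sites_in_working_hours(now_utc, sites=SITES, day_start=9, day_end=17):
    """Return the sites whose local time falls inside the working day."""
    awake = []
    for name, offset_hours in sites.items():
        local = now_utc.astimezone(timezone(timedelta(hours=offset_hours)))
        if day_start <= local.hour < day_end:
            awake.append(name)
    return awake


# 03:00 UTC is mid-morning in Singapore and the middle of the night elsewhere.
now = datetime(2025, 8, 17, 3, 0, tzinfo=timezone.utc)
print(sites_in_working_hours(now))
```

With these three offsets, every UTC hour falls inside exactly one site's working day, which is the property you want from a follow-the-sun rotation.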

Classify alerts ruthlessly. Most teams have 10x too many paging alerts. If an alert doesn't require human action within 30 minutes, it shouldn't page. It should be a Slack notification or a dashboard widget. Reserve the pager for things that actually degrade the user experience.
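
Here is one way the rule could look in code, as a sketch. The 30-minute threshold and the user-impact test come from the paragraph above; the `Alert` fields, the `route` function, and the split between Slack and dashboard are my own assumptions.

```python
from dataclasses import dataclass
from typing import Optional

PAGE_DEADLINE_MINUTES = 30  # threshold from the rule above


@dataclass
class Alert:
    name: str
    # Minutes until a human must act; None means no human action is needed.
    action_deadline_minutes: Optional[int]
    user_facing: bool


def route(alert):
    """Return where an alert should go: 'page', 'slack', or 'dashboard'."""
    needs_action = alert.action_deadline_minutes is not None
    urgent = needs_action and alert.action_deadline_minutes <= PAGE_DEADLINE_MINUTES
    if urgent and alert.user_facing:
        return "page"
    if needs_action:
        return "slack"  # assumption: actionable but not urgent goes to chat
    return "dashboard"  # assumption: purely informational stays on a dashboard


# Hypothetical alerts to show each outcome.
print(route(Alert("checkout 5xx spike", 5, True)))
print(route(Alert("disk at 70%", 24 * 60, False)))
print(route(Alert("cache hit ratio", None, False)))
```

Running every existing paging alert through a function like this is a quick way to see how many of them should not page at all.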

Post-incident reviews are mandatory. Not blame sessions — learning sessions. Every page should produce a document: what happened, why, what we're changing. If you're not writing these, you're not improving.

Budget for compensation. On-call is labor. Pay for it, grant comp time for it, or rotate it fairly. The fastest way to lose a good engineer is to treat on-call as an obligation instead of a responsibility.

The goal isn't fewer pages. It's pages that matter, handled by people who aren't exhausted.
