War rooms with
auto-attached runbooks.
Detect, page, acknowledge, resolve. On-call rotations, escalation policies, and post-mortems that write themselves.
Auto-detectOn-call rotationsEscalation policiesPost-mortem draftsRunbook attach
01 · INCIDENT LIFECYCLE
Seven phases.
One timeline.
Every incident progresses through a structured lifecycle. Each phase is timestamped, attributed, and auditable.
incident #INC-2847 · API Gateway Outage
DETECTT+00:00
Quorum failure confirmed (6/8 probes)
PAGET+00:04
On-call @maya paged via PagerDuty + SMS
ACKT+00:47
@maya acknowledged, war room opened
INVESTIGATET+02:12
Runbook rb_db_failover auto-attached
MITIGATET+08:30
Traffic rerouted to standby region
RESOLVET+14:22
Root cause patched, monitors green
POST-MORTEMT++2h
Draft auto-generated from timeline
02 · ON-CALL ROTATIONS
Always someone
awake.
Define primary, secondary, weekend, and follow-the-sun rotations. Relays pages the right person based on time, timezone, and override schedules.
on-call schedule · platform-engineering
| Rotation | Responder | Window | Channel |
|---|---|---|---|
| Primary | @maya | Mon-Fri 09:00-18:00 PST | PagerDuty + SMS |
| Secondary | @jordan | Mon-Fri 09:00-18:00 PST | Slack + Email |
| Weekend | @alex | Sat-Sun 00:00-24:00 UTC | PagerDuty + Phone |
| Follow-the-sun | @kenji (NRT) | Mon-Fri 18:00-02:00 PST | Opsgenie |
| Follow-the-sun | @liam (AMS) | Mon-Fri 02:00-09:00 PST | Opsgenie |
03 · ESCALATION POLICIES
Unacked?
Escalate.
Define multi-tier escalation policies with configurable timeouts. If the primary does not acknowledge, escalation continues until someone does.
escalation policy · critical-infra
0mPage primary on-call@mayaPagerDuty
5mNo ack -- page secondary@jordanSMS + Slack
10mNo ack -- page team lead@chenPhone call
15mNo ack -- page VP Eng@sarahPhone + Email
30mStill open -- notify CTO@davidAll channels
04 · POST-MORTEM
Auto-drafted.
Human-reviewed.
Relays generates a post-mortem draft from the incident timeline, monitor data, and resolution notes. Your team reviews and publishes -- not starts from blank.
post-mortem · INC-2847 · auto-draftDRAFT
Summary
API Gateway returned 503 for 14m 22s. Root cause: connection pool exhaustion after failed deploy of v4.2.1.
Impact
2,847 requests failed (0.3% of daily volume). 12 customers reported issues. SLO burn rate: 4.2x.
Timeline
07:12 detect -- 07:12 page -- 07:13 ack -- 07:20 mitigate -- 07:26 resolve
Root Cause
Deploy v4.2.1 introduced a connection leak in the auth middleware. Pool saturated at 500 connections after 8 minutes.
Action Items
1. Add connection pool metrics alert (P1) 2. Canary deploy for auth changes (P2) 3. Circuit breaker on pool exhaustion (P1)