You get:
- the same incident happening multiple times because no one documented the fix
- on-call engineers waking up senior devs for known issues
- incident post-mortems that no one reads
- tribal knowledge walking out the door when people leave
- MTTR (mean time to repair) staying high for repeat incidents
But great runbooks save lives at 3 AM. Each one covers:
- triggers: what alert or symptom starts this runbook
- symptoms: how to confirm this is the right runbook
- diagnosis: commands to run to understand severity
- remediation: step-by-step fix, with verification
- escalation: when to call for help
Without runbooks, every incident is a fire drill.
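The five parts listed above can be sketched as a small data structure, which makes it easy to see at a glance whether a runbook draft is complete. This is only an illustration: the `Runbook` and `Step` classes and the `missing_sections` helper are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One remediation step paired with the check that proves it worked."""
    action: str
    verification: str


@dataclass
class Runbook:
    triggers: list[str]       # alerts or symptoms that start this runbook
    symptoms: list[str]       # checks confirming this is the right runbook
    diagnosis: list[str]      # commands that reveal severity
    remediation: list[Step]   # ordered fix steps, each with verification
    escalation: str           # when (and whom) to call for help

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]


draft = Runbook(
    triggers=["Payment API 5xx > 5% for 2 minutes"],
    symptoms=[],
    diagnosis=[],
    remediation=[],
    escalation="",
)
print(draft.missing_sections())
```

A draft that still reports missing sections is probably not ready to hand to an on-call engineer.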
This prompt converts incident notes, Slack threads, and post-mortems into structured runbooks.
Assume the role of an SRE who writes runbooks for on-call engineers. Your task is to convert incident knowledge into a structured runbook.

Generate:

1. RUNBOOK METADATA
- Incident type (e.g., "Database connection pool exhaustion")
- Severity (SEV-1 / SEV-2 / SEV-3)
- Average time to fix (from historical data)

2. TRIGGERS
- Alert name or condition
- Example alert message

3. SYMPTOMS (how to confirm)
- User-visible impact
- Dashboard signals
- Log patterns to grep

4. INITIAL DIAGNOSIS (ordered commands)
- Command 1 to check health
- Command 2 to check recent changes
- Expected output vs. bad output

5. REMEDIATION STEPS (ordered, each with verification)
- Step 1: immediate mitigation
- Step 2: permanent fix
- Step 3: verify fix worked

6. ESCALATION
- Who to contact if steps fail
- What information to bring them

7. POST-FIX ACTIONS
- Data to capture for post-mortem
- Commands to run to restore full service

INPUTS:
Incident notes / Slack threads / post-mortems: [PASTE KNOWN INCIDENT DOCUMENTATION OR DESCRIBE THE PROBLEM]
Service/System affected: [E.G., "Payment API", "Auth Service", "Main Database"]
Common fix (if known): [E.G., "Restart the sidekiq workers"]
Tools available: [DATADOG / GRAFANA / CLOUDWATCH / CUSTOM DASHBOARD / OTHER]

RULES:
- Every command must be copy-paste ready
- Assume the on-call engineer is tired and stressed
- Include expected output for every diagnostic command
- Flag any destructive commands with "⚠️ DESTRUCTIVE"
- Include a "do not" section for common mistakes

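Because the prompt asks for seven named sections and a flag on destructive commands, the generated runbook can be checked mechanically before anyone relies on it. The sketch below is one possible way to do that. The `lint_runbook` function is hypothetical, the list of destructive patterns is an illustrative assumption (extend it for your own tooling), and it assumes the flag appears on the same line as the command.

```python
import re

# Section titles taken from the prompt above.
REQUIRED_SECTIONS = [
    "RUNBOOK METADATA", "TRIGGERS", "SYMPTOMS", "INITIAL DIAGNOSIS",
    "REMEDIATION STEPS", "ESCALATION", "POST-FIX ACTIONS",
]

# Assumption: a small sample of commands treated as destructive.
DESTRUCTIVE_PATTERNS = [r"\brm -rf\b", r"\bDROP TABLE\b", r"\bkubectl delete\b"]

# Match only the word, so emoji encoding differences do not matter.
FLAG = "DESTRUCTIVE"


def lint_runbook(text: str) -> list[str]:
    """Return a list of problems found in a generated runbook."""
    problems = []
    upper = text.upper()
    for section in REQUIRED_SECTIONS:
        if section not in upper:
            problems.append(f"missing section: {section}")
    for number, line in enumerate(text.splitlines(), start=1):
        risky = any(re.search(p, line, re.IGNORECASE) for p in DESTRUCTIVE_PATTERNS)
        if risky and FLAG not in line:
            problems.append(f"line {number}: destructive command not flagged")
    return problems
```

Calling `lint_runbook(generated_text)` returns an empty list when every section is present and every matched command carries the flag. It does not check that the commands themselves are correct, so a human review is still needed.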
- Write runbooks immediately after an incident — while memory is fresh.
- Run this prompt on your incident post-mortems to extract structured runbooks.
- Store runbooks where on-call engineers can find them quickly (not buried in Confluence).
- Review and update runbooks quarterly — things change.
- Run a “simulate an incident” game day to test runbooks before they’re needed.
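The quarterly review in the list above is easy to forget, so it can help to automate the reminder. A minimal sketch, assuming each runbook records the date it was last reviewed and that "quarterly" means roughly 90 days; the `stale_runbooks` function and the runbook names are hypothetical.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as 90 days.
REVIEW_INTERVAL = timedelta(days=90)


def stale_runbooks(last_reviewed: dict[str, date], today: date) -> list[str]:
    """Return the names of runbooks not reviewed within the interval."""
    return sorted(
        name
        for name, reviewed in last_reviewed.items()
        if today - reviewed > REVIEW_INTERVAL
    )


reviews = {
    "payment-api-pool-exhaustion": date(2024, 1, 15),  # hypothetical entry
    "auth-token-expiry": date(2024, 5, 1),             # hypothetical entry
}
print(stale_runbooks(reviews, today=date(2024, 6, 1)))
# ['payment-api-pool-exhaustion']
```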
Incident notes / Slack threads / post-mortems:
“We had an incident where the payment API was timing out. Turned out the database connection pool was set to 10, but we had 20 pods. Each pod took 2 connections, so 20 pods tried to take 40 connections but only 10 available. Restarting didn’t help because they’d just fight again. Fix was increasing pool to 50. Symptoms: logs showed ‘timeout waiting for connection’ and payment failures. Alert was ‘Payment API 5xx > 5% for 2 minutes.’”
Service/System affected:
Payment API
Common fix (if known):
Increase DB connection pool size
Tools available:
DataDog, AWS RDS Console
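The arithmetic in the incident notes is worth making explicit, because it explains why restarting did not help: demand exceeded the limit no matter which pods connected first. The sketch below works through the numbers from the notes. It assumes the pool size acts as a single limit shared by all pods, and the function names are illustrative.

```python
def connection_demand(pods: int, connections_per_pod: int) -> int:
    """Total connections the pods will try to open."""
    return pods * connections_per_pod


def pool_shortfall(pods: int, connections_per_pod: int, pool_size: int) -> int:
    """Connections requested beyond what the pool allows (0 if it fits)."""
    return max(0, connection_demand(pods, connections_per_pod) - pool_size)


pods, per_pod = 20, 2
print(connection_demand(pods, per_pod))     # 40 connections requested
print(pool_shortfall(pods, per_pod, 10))    # 30 short with the original pool of 10
print(pool_shortfall(pods, per_pod, 50))    # 0 short after raising the pool to 50
```

With the pool at 50, the same 20 pods leave 10 connections of headroom, which is a useful figure to record in the runbook for the next time the pod count changes.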
This framework improves outcomes by forcing you to define:
- triggers (so you pick the right runbook)
- symptoms (so you confirm before acting)
- diagnosis commands (so you don’t guess)
- remediation steps with verification (so you know when you’re done)
- escalation paths (so you don’t waste time)
Great runbooks don’t tell stories — they give commands.
