You get:
- no visibility into overall system health (is automation working?)
- executives surprised by automation failures (no dashboard)
- metrics that don’t matter (vanity metrics, not action metrics)
- no SLA tracking (can’t tell if you’re meeting promises)
- reactive management (find out about failures from user complaints)
But a dashboard can be designed around a small set of actionable metrics:
- health score: overall system health (0-100)
- throughput: workflows completed per hour/day
- latency: p95 and p99 completion times
- error rates: percentage of failed executions
- SLA attainment: percentage meeting performance targets
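As a rough illustration of how the metrics listed above could be derived from raw execution records, here is a minimal Python sketch. The `Execution` record, the `summarize` helper, and the nearest-rank percentile method are assumptions made for this example, not part of any particular monitoring tool. SLA attainment is assumed here to mean the share of runs that both succeeded and finished within the target time.

```python
import math
from dataclasses import dataclass


@dataclass
class Execution:
    """Hypothetical record of one workflow run."""
    duration_s: float  # completion time in seconds
    succeeded: bool    # False if the run failed


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(executions, window_hours, sla_seconds):
    """Compute throughput, latency percentiles, error rate, and SLA attainment for one window."""
    if not executions:
        raise ValueError("no executions in the window")
    total = len(executions)
    durations = [e.duration_s for e in executions]
    failed = sum(1 for e in executions if not e.succeeded)
    # Assumption: a run meets the SLA only if it succeeded and finished in time.
    within_sla = sum(
        1 for e in executions if e.succeeded and e.duration_s <= sla_seconds
    )
    return {
        "throughput_per_hour": total / window_hours,
        "p95_s": percentile(durations, 95),
        "p99_s": percentile(durations, 99),
        "error_rate": failed / total,
        "sla_attainment": within_sla / total,
    }


# Synthetic data: 100 runs lasting 1..100 seconds, two of which failed.
runs = [Execution(duration_s=float(i), succeeded=(i % 50 != 0)) for i in range(1, 101)]
summary = summarize(runs, window_hours=2, sla_seconds=90)
```

In practice a dashboard tool would compute these from its own metric store; the sketch only makes the definitions concrete.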
Without dashboards, you’re flying blind.
This prompt designs comprehensive system health dashboards.
Assume the role of an observability engineer who designs system health dashboards. Your task is to define metrics and visualizations for workflow system health.

Generate:

1. DASHBOARD SECTIONS

| Section | Audience | Metrics | Refresh |
|---------|----------|---------|---------|
| Executive Summary | Leadership | System health score, SLA attainment | Daily |
| Operations | On-call engineers | Error rates, queue depth, latency | Real-time |
| Capacity Planning | Ops leads | Throughput trends, resource usage | Weekly |

2. SYSTEM HEALTH SCORE

Score ranges:
- 90-100: Healthy
- 70-89: Degraded
- 0-69: Critical

3. KEY METRICS TABLE

| Metric | Formula | Target | Alert Threshold |
|--------|---------|--------|-----------------|
| System health | [from above] | >90 | <80 |
| Error rate | failed / total | <1% | >5% |
| p95 latency | 95th percentile of duration | <5s | >10s |
| Throughput | workflows / hour | 10k/hr | <5k/hr |
| Queue depth | pending workflows | <100 | >500 |

4. VISUALIZATION RECOMMENDATIONS

- Health score: Gauge chart (red/yellow/green)
- Error rate: Time-series line chart
- Latency: Heatmap by workflow
- Throughput: Area chart with forecast
- Queue depth: Bar chart with alert line

5. ALERT RULES

| Metric | Condition | Severity | Action |
|--------|-----------|----------|--------|
| Health score | <80 for 5 min | Warning | Slack |
| Health score | <60 for 2 min | Critical | PagerDuty |
| Error rate | >5% for 2 min | Critical | PagerDuty |
| Queue depth | >500 for 10 min | Warning | Investigate |

6. SLA TRACKING

| Workflow | SLA (p95) | Current Attainment | Target |
|----------|-----------|--------------------|--------|
| WF-001 | 5s | 99.5% | 99.9% |
| WF-002 | 30s | 98.2% | 99.0% |

7. EXECUTIVE REPORTING

- Weekly summary: health score trends, top issues
- Monthly review: SLA attainment, capacity trends
- Quarterly business review: ROI of automation

INPUTS:
Workflow inventory (from WS-01): [PASTE WORKFLOW LIST]
SLAs from business requirements: [E.G., "Lead routing: 5s p95, 99.9% uptime"]
Expected throughput (from business volume): [E.G., "50,000 leads per day"]
Dashboard tool preferences: [E.G., "DataDog, Grafana, Tableau"]

RULES:
- Health score should be understandable at a glance (red/yellow/green)
- Include leading indicators (queue depth) not just lagging (error rate)
- Different audiences need different views (executive vs. operations)
- Alert on symptoms (high error rate), not causes (exceptions)
- Review metrics quarterly; what matters changes over time
- Dashboard without action is decoration (include alerting)
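The score bands and the "for N min" alert conditions in the prompt can be made concrete with a small sketch. This is a hypothetical Python illustration, not how any specific alerting tool evaluates rules: it assumes one health score sample per minute and treats "for N min" as "every sample in the last N minutes breached the threshold". The names `health_band`, `Rule`, and `fired_rule` are invented for the example. The band boundaries (90, 70) and alert thresholds (80, 60) are taken from the prompt as written, even though they do not line up with each other.

```python
from dataclasses import dataclass


def health_band(score):
    """Map a 0-100 health score to the bands used in the prompt."""
    if score >= 90:
        return "Healthy"
    if score >= 70:
        return "Degraded"
    return "Critical"


@dataclass
class Rule:
    threshold: float   # alert when the score stays below this value
    for_minutes: int   # how long the breach must last
    severity: str
    action: str


# Health score rules from the prompt's alert table, most severe first.
RULES = [
    Rule(threshold=60, for_minutes=2, severity="Critical", action="PagerDuty"),
    Rule(threshold=80, for_minutes=5, severity="Warning", action="Slack"),
]


def fired_rule(samples, rules=RULES):
    """Return the first rule whose condition held for its whole window, else None.

    Assumption: `samples` holds one health score per minute, oldest first.
    """
    for rule in rules:
        window = samples[-rule.for_minutes:]
        sustained = len(window) == rule.for_minutes
        if sustained and all(score < rule.threshold for score in window):
            return rule
    return None
```

Requiring the breach to persist for the whole window is what keeps a single bad sample from paging anyone, which fits the goal of avoiding alert fatigue.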
Workflow inventory:
“WF-001 Lead Capture, WF-002 Lead Scoring, WF-003 Lead Routing, WF-004 Reporting”
SLAs from business requirements:
“Lead Capture to Routing: 30s total p95. Reporting: 5 minutes p95.”
Expected throughput:
“50,000 leads per day (peak 10,000/hour)”
Dashboard tool preferences:
“DataDog”
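Before setting throughput targets from these inputs, it can help to check the arithmetic. The sketch below is a hypothetical helper (`throughput_profile` is not from any library) that derives the average hourly rate and the peak-to-average ratio from the example figures. It assumes nothing about how traffic is actually distributed across the day.

```python
def throughput_profile(daily_volume, peak_per_hour):
    """Derive average and peak rates from a daily volume and an hourly peak."""
    avg_per_hour = daily_volume / 24
    return {
        "avg_per_hour": avg_per_hour,
        "peak_per_second": peak_per_hour / 3600,
        "peak_to_avg": peak_per_hour / avg_per_hour,
    }


# Example inputs from this section: 50,000 leads per day, peak 10,000 per hour.
profile = throughput_profile(daily_volume=50_000, peak_per_hour=10_000)

# The prompt template's placeholder low-throughput alert (<5k/hr) sits above
# the average hourly rate, so a static floor at that level would fire during
# ordinary off-peak hours.
template_floor_per_hour = 5_000
floor_exceeds_average = template_floor_per_hour > profile["avg_per_hour"]
```

With these inputs the average is roughly 2,083 leads per hour and the peak is about 4.8 times that, which suggests a throughput alert should be relative to the expected volume for the time of day rather than a single fixed number.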
This framework improves outcomes by forcing:
- dashboard sectioning (who needs to see what?)
- health score definition (single number for system status)
- key metrics specification (what to track, what targets)
- visualization design (how to see the data)
- SLA tracking (are we meeting promises?)
Failure modes this prevents:
- Executive surprise — “I didn’t know automation was failing”
- Alert fatigue — too many metrics, no clear health signal
- Dashboard decay — metrics that don’t change, no one reviews
- No action — dashboard shows problems but no alerting
This improves on: Scattered metrics and no dashboard. A unified health dashboard provides visibility for everyone.
Related to: WS-01 (Documenter) for workflow inventory; WS-04 (Optimizer) for performance improvement targets.
