Incident Response
Owner: Engineering
Last reviewed: 2026-Q2
This playbook defines how we respond when WorkOpti is degraded, unavailable, insecure, or at risk of data loss.
Severity Levels
| Severity | Definition | Examples | Response |
|---|---|---|---|
| SEV1 | Broad outage, active data exposure, data loss, or security incident | API unavailable, auth bypass, cross-user document access | Immediate mitigation, engineering lead owns comms, postmortem required |
| SEV2 | Major feature broken for many users or important environment blocked | File upload failing, board sharing broken, deployment broken | Same-day mitigation, incident notes required |
| SEV3 | Degraded feature, workaround exists, limited impact | Slow document processing, notification delay, isolated UI regression | Prioritize in normal flow with clear owner |
| SEV4 | Low-risk bug or docs/process issue | Minor display issue, stale dashboard, noisy log | Backlog or next cleanup pass |
Roles
- Incident commander: coordinates response, keeps scope clear, decides mitigation path.
- Technical lead: investigates and implements fix or rollback.
- Communications owner: posts status updates to stakeholders.
- Scribe: records timeline, decisions, links, and follow-ups.
One person can hold multiple roles for small incidents, but SEV1 incidents need a named commander and scribe.
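The staffing rule above can be checked mechanically when an incident is opened. The following is a minimal sketch, assuming a hypothetical `Incident` record. The class, its field names, and `missing_required_roles` are illustrative and not part of any existing WorkOpti tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record; field names are assumptions for illustration.
    severity: str  # one of "SEV1", "SEV2", "SEV3", "SEV4"
    commander: Optional[str] = None
    technical_lead: Optional[str] = None
    communications_owner: Optional[str] = None
    scribe: Optional[str] = None


def missing_required_roles(incident: Incident) -> list[str]:
    """Return the roles that must be named but are not.

    Only the SEV1 rule from this playbook is encoded: a SEV1 incident
    needs a named commander and a named scribe. One person may hold
    several roles, so the same name in two fields is accepted.
    """
    if incident.severity != "SEV1":
        return []
    required = {"commander": incident.commander, "scribe": incident.scribe}
    # A missing or whitespace-only name counts as unassigned.
    return [role for role, name in required.items() if not (name and name.strip())]
```

For example, `missing_required_roles(Incident(severity="SEV1", commander="Ana"))` reports that the scribe is still unassigned, while any SEV2 to SEV4 incident passes without named roles.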
First 15 Minutes
- Confirm impact: affected environment, users, feature, and start time.
- Assign severity and roles.
- Start an incident thread with current status, owner, and next update time.
- Mitigate before root-cause analysis when user impact is active.
- Check recent deploys, migrations, provider status, logs, and health endpoints.
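The incident thread opener can reuse the update fields listed under Communications. The following is a minimal sketch, assuming a hypothetical helper named `format_update`. It only builds the text; posting it to a chat or status tool is out of scope, and the sample values in the usage note are invented.

```python
STATUSES = ("investigating", "mitigating", "monitoring", "resolved")
SEVERITIES = ("SEV1", "SEV2", "SEV3", "SEV4")


def format_update(status: str, severity: str, impact: str,
                  action: str, next_update: str, owner: str) -> str:
    """Render one update using the field order from Communications."""
    # Reject values outside the vocabularies this playbook defines.
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    fields = [
        ("Status", status),
        ("Severity", severity),
        ("Impact", impact),
        ("Action", action),
        ("Next update", next_update),
        ("Owner", owner),
    ]
    return "\n".join(f"{label}: {value}" for label, value in fields)
```

Calling `format_update("mitigating", "SEV2", "File upload failing", "Rolling back image tag", "14:30 UTC", "Ana")` yields six lines in the same order as the template, starting with `Status: mitigating`.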
Mitigation Options
- Roll back to the previous container image tag.
- Disable or bypass the failing optional dependency where product behavior allows it.
- Revert a small configuration change.
- Pause a rollout or promotion.
- Temporarily block a risky endpoint only when the security or data-integrity risk is higher than the availability impact.
Verification
- API: check `/health/live`, `/health/ready`, `/health/status`, and the affected endpoint.
- Frontend: load the app through the deployed URL, log in with Clerk, and exercise the affected workflow.
- Data: verify no unauthorized access, failed migration, duplicate processing, or partial delete remains.
- External services: confirm that Azure OpenAI, Vision, Document Intelligence, Blob Storage, PostgreSQL, Clerk, and Front Door/CDN are healthy, as relevant to the incident.
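The API check in the list above can be scripted. The following is a minimal sketch using only the Python standard library. It assumes that each health path answers HTTP 200 when healthy, which this playbook does not state, and the base URL in the usage note is a placeholder. `http_status` and `check_health` are hypothetical helper names.

```python
from typing import Callable
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Paths taken from the API verification step.
HEALTH_PATHS = ("/health/live", "/health/ready", "/health/status")


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 on network failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return exc.code
    except (URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def check_health(base_url: str,
                 fetch: Callable[[str], int] = http_status) -> dict[str, bool]:
    """Map each health path to True when it answers with HTTP 200 (assumed healthy)."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == 200 for path in HEALTH_PATHS}
```

A call such as `check_health("https://api.example.test")` returns one boolean per path. The `fetch` parameter exists so the check can be exercised without network access; the affected endpoint still needs a manual or incident-specific check.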
Communications
Use concise updates:
Status: investigating | mitigating | monitoring | resolved
Severity: SEV1 | SEV2 | SEV3 | SEV4
Impact: who/what is affected
Action: what we are doing now
Next update: time
Owner: name
Postmortem Template
# Postmortem: <incident title>
Date:
Severity:
Owners:
Status:
## Summary
## Impact
## Timeline
## Root Cause
## What Worked
## What Did Not Work
## Action Items
| Action | Owner | Due | Status |
|---|---|---|---|
Follow-Up Rules
- SEV1 and SEV2 incidents require a postmortem.
- Action items must have owners and dates.
- Update the handbook, ADRs, dashboards, or tests when the incident revealed a missing guardrail.
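The first two follow-up rules are simple enough to lint before an incident is closed. The following is a minimal sketch, assuming a hypothetical `ActionItem` record that mirrors the columns of the action items table in the postmortem template. The names and the free-text date format are assumptions, not existing tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionItem:
    # Mirrors the Action | Owner | Due | Status columns of the template.
    action: str
    owner: Optional[str] = None
    due: Optional[str] = None  # free-text date; the format is an assumption
    status: str = "open"


def postmortem_required(severity: str) -> bool:
    """Apply the follow-up rule: SEV1 and SEV2 incidents require a postmortem."""
    return severity in {"SEV1", "SEV2"}


def incomplete_action_items(items: list[ActionItem]) -> list[str]:
    """Return the actions that lack an owner or a due date."""
    return [item.action for item in items if not item.owner or not item.due]
```

An incident can then be held open while `incomplete_action_items(...)` returns a non-empty list, which keeps unowned or undated follow-ups from being forgotten.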