Incident Response

Owner: Engineering
Last reviewed: 2026-Q2

This playbook defines how we respond when WorkOpti is degraded, unavailable, insecure, or at risk of data loss.

Severity Levels

| Severity | Definition | Examples | Response |
|---|---|---|---|
| SEV1 | Broad outage, active data exposure, data loss, or security incident | API unavailable, auth bypass, cross-user document access | Immediate mitigation, engineering lead owns comms, postmortem required |
| SEV2 | Major feature broken for many users or important environment blocked | File upload failing, board sharing broken, deployment broken | Same-day mitigation, incident notes required |
| SEV3 | Degraded feature, workaround exists, limited impact | Slow document processing, notification delay, isolated UI regression | Prioritize in normal flow with clear owner |
| SEV4 | Low-risk bug or docs/process issue | Minor display issue, stale dashboard, noisy log | Backlog or next cleanup pass |

Roles

  • Incident commander: coordinates response, keeps scope clear, decides mitigation path.
  • Technical lead: investigates and implements fix or rollback.
  • Communications owner: posts status updates to stakeholders.
  • Scribe: records timeline, decisions, links, and follow-ups.

One person can hold multiple roles for small incidents, but SEV1 incidents need a named commander and scribe.
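
The SEV1 staffing rule can be checked mechanically when an incident is opened. The sketch below is illustrative only: the role keys and the `missing_sev1_roles` helper are hypothetical names, not part of any existing WorkOpti tooling. It checks only that a commander and a scribe are named; whether they must be different people is not stated here, so the sketch does not enforce it.

```python
from typing import Dict, List, Optional

# Hypothetical role keys chosen for this sketch.
SEV1_REQUIRED_ROLES = ("incident_commander", "scribe")


def missing_sev1_roles(severity: str, roles: Dict[str, Optional[str]]) -> List[str]:
    """Return the SEV1-required roles that have no named person.

    For other severities the playbook allows one person to cover
    several roles, so nothing is reported.
    """
    if severity != "SEV1":
        return []
    return [
        role
        for role in SEV1_REQUIRED_ROLES
        if not (roles.get(role) or "").strip()
    ]
```

For example, `missing_sev1_roles("SEV1", {"incident_commander": "Ana"})` reports that the scribe is still unassigned.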

First 15 Minutes

  1. Confirm impact: affected environment, users, feature, and start time.
  2. Assign severity and roles.
  3. Start an incident thread with current status, owner, and next update time.
  4. Mitigate before root-cause analysis when user impact is active.
  5. Check recent deploys, migrations, provider status, logs, and health endpoints.

Mitigation Options

  • Roll back to the previous container image tag.
  • Disable or bypass the failing optional dependency where product behavior allows it.
  • Revert a small configuration change.
  • Pause a rollout or promotion.
  • Temporarily block a risky endpoint only when the security or data-integrity risk is higher than the availability impact.

Verification

  • API: check /health/live, /health/ready, /health/status, and the affected endpoint.
  • Frontend: load the app through the deployed URL, log in with Clerk, and exercise the affected workflow.
  • Data: verify no unauthorized access, failed migration, duplicate processing, or partial delete remains.
  • External services: confirm that the dependencies relevant to the incident are healthy: Azure OpenAI, Vision, Document Intelligence, Blob Storage, PostgreSQL, Clerk, and Front Door/CDN.
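
The API check can be scripted so it is repeatable during and after mitigation. This is a minimal sketch using only the Python standard library. The base URL is supplied by the caller, and treating any 2xx status as healthy is an assumption: this playbook does not specify the response format of the health endpoints.

```python
import urllib.error
import urllib.request
from typing import Dict, Iterable, Optional

# Health paths listed in the Verification section.
HEALTH_PATHS = ("/health/live", "/health/ready", "/health/status")


def check_endpoints(
    base_url: str,
    extra_paths: Iterable[str] = (),
    timeout: float = 5.0,
) -> Dict[str, Optional[int]]:
    """Return the HTTP status per path, or None if the request failed."""
    results: Dict[str, Optional[int]] = {}
    for path in (*HEALTH_PATHS, *extra_paths):
        url = base_url.rstrip("/") + path
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                results[path] = resp.status
        except urllib.error.HTTPError as err:
            # The server answered, but with a 4xx/5xx status.
            results[path] = err.code
        except (urllib.error.URLError, OSError):
            # DNS failure, refused connection, or timeout.
            results[path] = None
    return results


def all_healthy(results: Dict[str, Optional[int]]) -> bool:
    """Assumption: a 2xx status means the endpoint is healthy."""
    return all(code is not None and 200 <= code < 300 for code in results.values())
```

Pass the affected endpoint through `extra_paths` so it is checked alongside the health paths. This covers only the API bullet; the frontend, data, and external-service checks still need to be done by hand.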

Communications

Use concise updates:

Status: investigating | mitigating | monitoring | resolved
Severity: SEV1 | SEV2 | SEV3 | SEV4
Impact: who/what is affected
Action: what we are doing now
Next update: time
Owner: name
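
If updates are posted by a bot or script, the format above can be rendered from structured fields so that no line is omitted. The sketch below is one possible shape, not an existing tool; the `StatusUpdate` class and its field names are hypothetical.

```python
from dataclasses import dataclass

# Allowed values taken from the update format above.
STATUSES = ("investigating", "mitigating", "monitoring", "resolved")
SEVERITIES = ("SEV1", "SEV2", "SEV3", "SEV4")


@dataclass
class StatusUpdate:
    status: str
    severity: str
    impact: str
    action: str
    next_update: str
    owner: str

    def render(self) -> str:
        """Return the six-line update, rejecting unknown status or severity."""
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        return "\n".join(
            [
                f"Status: {self.status}",
                f"Severity: {self.severity}",
                f"Impact: {self.impact}",
                f"Action: {self.action}",
                f"Next update: {self.next_update}",
                f"Owner: {self.owner}",
            ]
        )
```

Validating the status and severity at render time keeps free-form values such as "looking into it" out of the incident thread.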

Postmortem Template

# Postmortem: <incident title>
 
Date:
Severity:
Owners:
Status:
 
## Summary
 
## Impact
 
## Timeline
 
## Root Cause
 
## What Worked
 
## What Did Not Work
 
## Action Items
 
| Action | Owner | Due | Status |
|---|---|---|---|

Follow-Up Rules

  • SEV1 and SEV2 incidents require a postmortem.
  • Action items must have owners and dates.
  • Update the handbook, ADRs, dashboards, or tests when the incident revealed a missing guardrail.
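
The first two rules are easy to check before an incident is closed. The following is a minimal sketch under assumed names (`ActionItem` and `follow_up_gaps` are hypothetical); the third rule, updating guardrails, needs human judgment and is not covered.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    action: str
    owner: Optional[str] = None
    due: Optional[date] = None


def follow_up_gaps(
    severity: str, has_postmortem: bool, items: List[ActionItem]
) -> List[str]:
    """List follow-up rules that are not yet satisfied; empty means done."""
    gaps: List[str] = []
    # SEV1 and SEV2 incidents require a postmortem.
    if severity in ("SEV1", "SEV2") and not has_postmortem:
        gaps.append(f"{severity} requires a postmortem")
    # Action items must have owners and dates.
    for item in items:
        if not item.owner:
            gaps.append(f"action item '{item.action}' has no owner")
        if item.due is None:
            gaps.append(f"action item '{item.action}' has no due date")
    return gaps
```

An empty list means the incident satisfies the first two rules and can move to closure review.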