Incident Response

This playbook defines how we respond when WorkOpti is degraded, unavailable, insecure, or at risk of data loss.

Severity Levels

Severity	Definition	Examples	Response
SEV1	Broad outage, active data exposure, data loss, or security incident	API unavailable, auth bypass, cross-user document access	Immediate mitigation, engineering lead owns comms, postmortem required
SEV2	Major feature broken for many users or an important environment blocked	File upload failing, board sharing broken, deployment broken	Same-day mitigation, incident notes and postmortem required
SEV3	Degraded feature, workaround exists, limited impact	Slow document processing, notification delay, isolated UI regression	Prioritize in normal flow with clear owner
SEV4	Low-risk bug or docs/process issue	Minor display issue, stale dashboard, noisy log	Backlog or next cleanup pass
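As a rough illustration, the table above can be read as a top-down decision: the first matching row wins. The sketch below assumes a handful of boolean signals gathered during triage; the function name and parameter names are hypothetical and not part of any WorkOpti tooling.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV1"
    SEV2 = "SEV2"
    SEV3 = "SEV3"
    SEV4 = "SEV4"


def classify(
    security_or_data_risk: bool = False,
    broad_outage: bool = False,
    major_feature_broken: bool = False,
    environment_blocked: bool = False,
    degraded_with_workaround: bool = False,
) -> Severity:
    """Return the highest severity whose table definition matches the signals.

    The signal names are illustrative assumptions; the ordering mirrors the
    severity table, checked from SEV1 down.
    """
    if security_or_data_risk or broad_outage:
        return Severity.SEV1
    if major_feature_broken or environment_blocked:
        return Severity.SEV2
    if degraded_with_workaround:
        return Severity.SEV3
    # Anything left is treated as a low-risk bug or docs/process issue.
    return Severity.SEV4
```

When signals conflict, the higher severity wins, so a slow feature that also exposes another user's documents is still a SEV1.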

Roles

Incident commander: coordinates response, keeps scope clear, decides mitigation path.
Technical lead: investigates and implements the fix or rollback.
Communications owner: posts status updates to stakeholders.
Scribe: records timeline, decisions, links, and follow-ups.

One person can hold multiple roles for small incidents, but SEV1 incidents need a named commander and scribe.
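The SEV1 staffing rule could be checked mechanically when an incident is opened. This is a minimal sketch, assuming roles are tracked as a mapping from role name to person; the function name and the role keys are hypothetical. It only checks that both roles are named, because the rule above does not say whether they must be different people.

```python
def unnamed_required_roles(severity: str, roles: dict[str, str]) -> list[str]:
    """Return the required roles that have no named person for this severity.

    Only SEV1 has a stated requirement (commander and scribe); other
    severities return an empty list.
    """
    required = ("incident commander", "scribe") if severity == "SEV1" else ()
    return [role for role in required if not roles.get(role, "").strip()]
```

For example, a SEV1 opened with only a commander assigned would return ["scribe"], which is a prompt to name one before the response continues.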

First 15 Minutes

Confirm impact: affected environment, users, feature, and start time.
Assign severity and roles.
Start an incident thread with current status, owner, and next update time.
Mitigate before root-cause analysis when user impact is active.
Check recent deploys, migrations, provider status, logs, and health endpoints.

Mitigation Options

Roll back to the previous container image tag.
Disable or bypass the failing optional dependency where product behavior allows it.
Revert a small configuration change.
Pause a rollout or promotion.
Temporarily block a risky endpoint only when the security or data-integrity risk is higher than the availability impact.

Verification

API: check /health/live, /health/ready, /health/status, and the affected endpoint.
Frontend: load the app through the deployed URL, log in with Clerk, and exercise the affected workflow.
Data: verify that no unauthorized access, failed migration, duplicate processing, or partial delete remains.
External services: confirm the health of Azure OpenAI, Vision, Document Intelligence, Blob Storage, PostgreSQL, Clerk, and Front Door/CDN, as relevant to the incident.
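The API check can be scripted so it is repeatable during and after mitigation. The sketch below uses only the Python standard library. It assumes each health endpoint answers HTTP 200 when healthy, which this playbook does not state, and the base URL is supplied by the caller. The `fetch` parameter is an illustrative hook so the check can be exercised without network access.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable

# Paths listed in the Verification section.
HEALTH_PATHS = ("/health/live", "/health/ready", "/health/status")


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if it cannot connect."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses raise HTTPError; the code is still informative.
        return err.code
    except OSError:
        # Covers URLError, timeouts, and connection failures.
        return 0


def check_api(
    base_url: str,
    extra_paths: Iterable[str] = (),
    fetch: Callable[[str], int] = http_status,
) -> dict[str, bool]:
    """Map each health path (plus any affected endpoint) to True if it returned 200.

    Assumption: 200 means healthy. Adjust if an endpoint uses another convention.
    """
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == 200 for path in (*HEALTH_PATHS, *extra_paths)}
```

Pass the affected endpoint through `extra_paths` so it is verified alongside the health checks. Any False entry means the incident should stay in mitigating or monitoring rather than move to resolved.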

Communications

Post concise updates in this format:

Status: investigating | mitigating | monitoring | resolved
Severity: SEV1 | SEV2 | SEV3 | SEV4
Impact: who/what is affected
Action: what we are doing now
Next update: time
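A small helper can keep updates in this shape and reject values outside the listed statuses and severities. This is a sketch only; the function name is hypothetical and the next-update time is passed through as free text.

```python
STATUSES = ("investigating", "mitigating", "monitoring", "resolved")
SEVERITIES = ("SEV1", "SEV2", "SEV3", "SEV4")


def format_update(status: str, severity: str, impact: str, action: str, next_update: str) -> str:
    """Render a status update using the five fields of the update format."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    return "\n".join(
        [
            f"Status: {status}",
            f"Severity: {severity}",
            f"Impact: {impact}",
            f"Action: {action}",
            f"Next update: {next_update}",
        ]
    )
```

Validating the status and severity up front keeps threads consistent, which makes it easier for the scribe to reconstruct the timeline later.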

Postmortem Template

# Postmortem: <incident title>
 
Date:
Severity:
Owners:
Status:
 
## Summary
 
## Impact
 
## Timeline
 
## Root Cause
 
## What Worked
 
## What Did Not Work
 
## Action Items
 
| Action | Owner | Due | Status |
|---|---|---|---|

Follow-Up Rules

SEV1 and SEV2 incidents require a postmortem.
Action items must have owners and dates.
Update the handbook, ADRs, dashboards, or tests when the incident reveals a missing guardrail.
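The rule that action items must have owners and dates could be enforced before a postmortem is closed. The sketch below assumes action items are held as simple records whose fields mirror the Action, Owner, Due, and Status columns of the postmortem template; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    action: str
    owner: Optional[str] = None
    due: Optional[date] = None
    status: str = "open"


def incomplete_items(items: list[ActionItem]) -> list[ActionItem]:
    """Return the action items that lack an owner or a due date."""
    return [
        item
        for item in items
        if not (item.owner and item.owner.strip()) or item.due is None
    ]
```

An empty result means every action item satisfies the rule. Anything returned should be assigned an owner and a date before the postmortem is marked complete.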

Security On-Call & Support