Activation Runbook¶
Generated: 2025-11-22 | Version: 1.0 Scope: End-to-end tenant activation flow (signup → invite → magic link activation) and supporting observability & recovery.
1. Purpose¶
Provide on-call & ops engineers a concise set of procedures to monitor, troubleshoot, and restore the activation funnel with minimal MTTR.
2. Flow Overview¶
- Anonymous visitor submits signup (US-101) → tenant & owner user provisioned.
- Owner sends invitation (US-102) → invitation token issued & email dispatched.
- Invitee clicks magic link (US-103) → token validated → account activated.
- (Optional) Additional invites; funnel metrics aggregated.
3. Components & Dependencies¶
| Component | Responsibility | Criticality |
|---|---|---|
| Auth API | Signup, token validation | High |
| Tenant Service | Provisioning & isolation metadata | High |
| Invitation Service | Create/resend invite, token issuance | High |
| Email Dispatcher | Sends invitation emails | Medium |
| Activation Page (SPA) | Token usage + profile capture | Medium |
| Event Stream | Emits activation.* events | Medium |
| Logging/Tracing Adapter | Correlation IDs | High |
4. Primary KPIs & SLOs¶
| KPI | Target | Alert Threshold |
|---|---|---|
| Signup success rate | > 98% | < 95% (5m window) |
| Invite acceptance rate (72h) | > 60% | < 40% (rolling 24h) |
| Magic link validation latency p95 | < 300ms | > 600ms (15m) |
| Tenant provisioning latency p95 | < 2000ms | > 3000ms (15m) |
| Activation funnel completion (start→activated) | > 50% | < 30% (24h) |
5. Monitoring Dashboard Checklist¶
- Graph: signup attempts vs successes (stacked).
- Latency panels: tenantProvisioningLatencyMs, magicLinkValidation.
- Conversion funnel: SignupStarted → TenantProvisioned → InvitationCreated → InvitationAccepted → ActivationCompleted counts.
- Error rate chart (Activation token invalid, Duplicate email, DB constraint failures).
6. Logging & Events (Reference)¶
Event names from events-schema.md:
- activation.SignupStarted
- activation.TenantProvisioned
- activation.InvitationCreated / Accepted
- activation.ActivationCompleted
Correlate via traceId; all logs at INFO+ must carry tenantId after provisioning.
7. Alerting Rules (Examples)¶
| Alert | Condition | Action |
|---|---|---|
| Low signup success | success/attempts < 95% 5m | Triage errors, check auth API health |
| Elevated provisioning latency | p95 > 3s 3 consecutive intervals | Inspect DB perf, provisioning traces |
| Magic link failures spike | invalid/expired count > normal baseline x3 | Validate token TTL config, clock drift |
| Email dispatch backlog | queue length > threshold (e.g., >100 pending) | Scale worker or check SMTP provider |
8. Standard Operational Procedures¶
8.1 Signup Failure Surge¶
- Check error log sample for common
error.type. - Run health check endpoint (
/health/auth,/health/tenant). - Inspect DB connectivity & recent migrations.
- Mitigation: Apply feature flag to route traffic to secondary instance or degrade (static queue message).
8.2 Invitation Emails Not Sending¶
- Verify email dispatcher queue metrics.
- Check SMTP provider status / API key validity.
- Requeue stuck jobs; resend last 5 test invites with mail sink to confirm.
- Escalate if provider outage > 15m; communicate manual activation instructions (direct token copy) to support.
8.3 Magic Link Invalid Tokens Increasing¶
- Check token TTL configuration vs expected (env var drift?).
- Validate system clock sync (NTP) across auth nodes.
- Inspect hash validation path & recent deployments affecting token parsing.
- Temporary fix: Extend TTL or bypass expiry for new invites (announce security trade-off internally).
8.4 Provisioning Latency Spike¶
- Trace sample slow requests → identify span with highest duration (likely db.query or external call).
- Check DB resource metrics (CPU, I/O, locks).
- Examine concurrent provisioning volume (load surge?).
- Apply rate limiting or queue provisioning until resources stabilized.
8.5 Partial Funnel Drop (Invite Acceptance Low)¶
- Compare email send counts vs acceptance events.
- Validate UI activation page availability (synthetic test).
- Sample user feedback for link issues (URL rewriting by email client?).
- Consider switching to shorter branded links or alternate email template.
9. Incident Severity Mapping¶
| Severity | Criteria | Response Time |
|---|---|---|
| SEV-1 | Complete signup outage | Immediate (<5m) |
| SEV-2 | >30% activation latency degradation | 15m |
| SEV-3 | Invite acceptance dip (>50% drop) | 30m |
| SEV-4 | Non-critical analytics event missing | Next business day |
10. Escalation Path¶
- On-call engineer triage.
- If infrastructure issue (DB/network) → Infra team.
- Email provider outage → Vendor support + Product notification.
- Security anomaly (token replay) → Security team immediate.
11. Rollback Strategy¶
Criteria: new deployment correlates with spike in activation errors within 10m. Steps: 1. Identify release version (tag/commit). 2. Execute automated rollback script or redeploy previous stable container image. 3. Verify health checks; run synthetic signup & invite. 4. Annotate incident log with root cause placeholder.
12. Synthetic Tests (Recommended)¶
- Hourly signup synthetic using test email domain (e.g.,
synthetic+timestamp@example.test). - Invite + activation synthetic (auto-click headless) once per hour; fails raise WARN.
13. Performance Budgets (Baseline)¶
| Operation | Budget p95 |
|---|---|
| Signup request | 1200ms |
| Tenant provisioning | 2000ms |
| Invitation creation | 300ms |
| Activation token validation | 300ms |
14. Security Considerations¶
- Token TTL & single-use enforced; logs never contain raw token.
- Failed token validations monitored for brute force pattern (rate > threshold triggers alert).
- Password hashing algorithm integrity check at startup (log INFO confirmation).
15. Data Integrity Checks¶
- Daily job: count tenants with zero owner user (should be 0).
- Orphan invite scan: invites expired > TTL + 7d purge.
- Audit mismatch: activation.Accepted without ActivationCompleted (report & backfill).
16. Tooling & Commands (Placeholder)¶
# Check auth service health
curl -s https://api.example.com/health/auth | jq
# Query recent activation errors (example structured log grep)
jq 'select(.logger=="activation" and .level=="ERROR")' /var/log/app.log | tail -20
# Synthetic signup script (pseudo)
./scripts/synthetic-signup.sh synthetic+$(date +%s)@example.test
17. Runbook Validation Checklist¶
- Dashboard panels exist & refreshed.
- Alerts configured & tested (dummy trigger).
- Synthetic tests implemented & stable.
- Escalation contacts up-to-date.
- Rollback steps verified in staging.
18. Open Questions¶
- Add separate funnel for multi-tenant user switch events in future?
- Auto-recovery for email backlog (dynamic worker scaling) implemented Sprint 2 or later?
- Need canary activation path for early detection (weighted routing)?
Document Version: 1.0