
Activation Runbook

Generated: 2025-11-22 | Version: 1.0
Scope: End-to-end tenant activation flow (signup → invite → magic link activation) and supporting observability & recovery.

1. Purpose

Provide on-call & ops engineers with a concise set of procedures to monitor, troubleshoot, and restore the activation funnel with minimal MTTR.

2. Flow Overview

  1. Anonymous visitor submits signup (US-101) → tenant & owner user provisioned.
  2. Owner sends invitation (US-102) → invitation token issued & email dispatched.
  3. Invitee clicks magic link (US-103) → token validated → account activated.
  4. (Optional) Additional invites; funnel metrics aggregated.

3. Components & Dependencies

Component | Responsibility | Criticality
Auth API | Signup, token validation | High
Tenant Service | Provisioning & isolation metadata | High
Invitation Service | Create/resend invite, token issuance | High
Email Dispatcher | Sends invitation emails | Medium
Activation Page (SPA) | Token usage + profile capture | Medium
Event Stream | Emits activation.* events | Medium
Logging/Tracing Adapter | Correlation IDs | High

4. Primary KPIs & SLOs

KPI | Target | Alert Threshold
Signup success rate | > 98% | < 95% (5m window)
Invite acceptance rate (72h) | > 60% | < 40% (rolling 24h)
Magic link validation latency p95 | < 300ms | > 600ms (15m)
Tenant provisioning latency p95 | < 2000ms | > 3000ms (15m)
Activation funnel completion (start→activated) | > 50% | < 30% (24h)
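As an illustration, the alert thresholds above could be evaluated with a small helper like the one below. This is a minimal Python sketch, not the production alerting configuration: the metric keys (such as signup_success_rate), the Kpi class, and the breaches function are hypothetical names, and rates are assumed to be expressed as fractions between 0 and 1.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Kpi:
    name: str                # hypothetical metric key
    alert_threshold: float   # alert threshold from the KPI table
    higher_is_better: bool   # True for rates, False for latencies


# Alert thresholds copied from the KPI table; the keys are illustrative.
KPIS = [
    Kpi("signup_success_rate", 0.95, True),
    Kpi("invite_acceptance_rate_72h", 0.40, True),
    Kpi("magic_link_validation_p95_ms", 600, False),
    Kpi("tenant_provisioning_p95_ms", 3000, False),
    Kpi("activation_funnel_completion", 0.30, True),
]


def breached(kpi: Kpi, observed: float) -> bool:
    """Return True when the observed value crosses the alert threshold."""
    if kpi.higher_is_better:
        return observed < kpi.alert_threshold
    return observed > kpi.alert_threshold


def breaches(observations: dict[str, float]) -> list[str]:
    """Return the names of all KPIs whose observed value should alert."""
    return [
        kpi.name
        for kpi in KPIS
        if kpi.name in observations and breached(kpi, observations[kpi.name])
    ]
```

For example, passing a signup success rate of 0.93 and a provisioning p95 of 2500 flags only signup_success_rate, since 93% is below the 95% alert threshold while 2500ms is within the 3000ms limit.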

5. Monitoring Dashboard Checklist

  • Graph: signup attempts vs successes (stacked).
  • Latency panels: tenantProvisioningLatencyMs, magicLinkValidation.
  • Conversion funnel: SignupStarted → TenantProvisioned → InvitationCreated → InvitationAccepted → ActivationCompleted counts.
  • Error rate chart (Activation token invalid, Duplicate email, DB constraint failures).
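The conversion funnel panel can be approximated offline from raw event counts. The sketch below assumes a plain dictionary of counts keyed by the event names listed in the checklist; the step_conversion function and the key format of its result are hypothetical.

```python
# Funnel steps in the order given by the dashboard checklist.
FUNNEL_STEPS = [
    "SignupStarted",
    "TenantProvisioned",
    "InvitationCreated",
    "InvitationAccepted",
    "ActivationCompleted",
]


def step_conversion(counts):
    """Return the conversion rate between each pair of adjacent funnel steps.

    A rate is None when the earlier step has no events, so an empty
    window is not reported as a 0% conversion.
    """
    rates = {}
    for prev, cur in zip(FUNNEL_STEPS, FUNNEL_STEPS[1:]):
        prev_count = counts.get(prev, 0)
        rates[f"{prev}->{cur}"] = (
            counts.get(cur, 0) / prev_count if prev_count else None
        )
    return rates
```

A sharp drop in one step's rate points at the component that owns that transition, for example the Email Dispatcher for InvitationCreated->InvitationAccepted.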

6. Logging & Events (Reference)

Event names from events-schema.md:

  • activation.SignupStarted
  • activation.TenantProvisioned
  • activation.InvitationCreated / Accepted
  • activation.ActivationCompleted

Correlate via traceId; all logs at INFO+ must carry tenantId after provisioning.
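The tenantId rule can be spot-checked against a sample of structured log records. The sketch below is one possible reading of that rule, under stated assumptions: records are dictionaries in chronological order, the field names event and level are hypothetical (only traceId and tenantId come from this runbook), and "after provisioning" is taken to mean records that follow the activation.TenantProvisioned event within the same trace.

```python
# Numeric ordering of log levels (assumed level names).
LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}


def missing_tenant_id(records):
    """Return INFO+ records that lack tenantId after their trace was provisioned."""
    provisioned = set()   # traceIds that have already seen TenantProvisioned
    violations = []
    for rec in records:   # assumed to be in chronological order
        trace = rec.get("traceId")
        if trace is None:
            continue      # cannot correlate records without a traceId
        is_info_plus = LEVELS.get(rec.get("level"), 0) >= LEVELS["INFO"]
        if trace in provisioned and is_info_plus and not rec.get("tenantId"):
            violations.append(rec)
        if rec.get("event") == "activation.TenantProvisioned":
            provisioned.add(trace)
    return violations
```

An empty result for a representative log sample suggests the Logging/Tracing Adapter is attaching tenantId as required.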

7. Alerting Rules (Examples)

Alert | Condition | Action
Low signup success | success/attempts < 95% over 5m | Triage errors, check auth API health
Elevated provisioning latency | p95 > 3s for 3 consecutive intervals | Inspect DB perf, provisioning traces
Magic link failures spike | invalid/expired count > 3x normal baseline | Validate token TTL config, clock drift
Email dispatch backlog | queue length > threshold (e.g., >100 pending) | Scale worker or check SMTP provider
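Two of these conditions, the consecutive-interval latency rule and the baseline-multiple spike rule, can be sketched as follows. This is an illustrative Python sketch rather than the syntax of any particular alerting system; the function names and the shape of the inputs (a list of per-interval p95 values, plain counts) are assumptions.

```python
def consecutive_breach(p95_ms_by_interval, threshold_ms=3000, required=3):
    """True if the most recent `required` intervals all exceed the threshold."""
    recent = p95_ms_by_interval[-required:]
    return len(recent) == required and all(v > threshold_ms for v in recent)


def failure_spike(current_count, baseline_count, factor=3.0):
    """True if the current failure count exceeds `factor` times the baseline."""
    return current_count > baseline_count * factor
```

Requiring several consecutive intervals keeps a single slow interval from paging the on-call engineer, at the cost of a slower alert.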

8. Standard Operational Procedures

8.1 Signup Failure Surge

  1. Sample the error log and identify the most common error.type.
  2. Call the health check endpoints (/health/auth, /health/tenant).
  3. Inspect DB connectivity & recent migrations.
  4. Mitigation: Apply a feature flag to route traffic to the secondary instance, or degrade gracefully (static queue message).
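The first step, finding the most common error.type in a log sample, can be sketched as below. The log layout is an assumption: each line is taken to be a JSON object carrying the type either as a flat "error.type" key or as a nested error object with a type field. The top_error_types function is a hypothetical helper.

```python
import json
from collections import Counter


def top_error_types(lines, limit=5):
    """Count error types in JSON log lines and return the most common ones."""
    counts = Counter()
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(rec, dict):
            continue
        nested = rec.get("error")
        # Accept either a flat "error.type" key or a nested {"error": {"type": ...}}.
        etype = rec.get("error.type") or (
            nested.get("type") if isinstance(nested, dict) else None
        )
        if etype:
            counts[etype] += 1
    return counts.most_common(limit)
```

It can be fed any iterable of lines, for example an open log file or the output of a log query.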

8.2 Invitation Emails Not Sending

  1. Verify email dispatcher queue metrics.
  2. Check SMTP provider status / API key validity.
  3. Requeue stuck jobs; resend the last 5 test invites to a mail sink to confirm delivery.
  4. Escalate if provider outage > 15m; communicate manual activation instructions (direct token copy) to support.

8.3 Magic Link Validation Failures

  1. Check token TTL configuration vs expected (env var drift?).
  2. Validate system clock sync (NTP) across auth nodes.
  3. Inspect hash validation path & recent deployments affecting token parsing.
  4. Temporary fix: Extend TTL or bypass expiry for new invites (announce security trade-off internally).
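When triaging the TTL and clock checks above, it helps to separate tokens that are truly expired from tokens that only look invalid because of clock drift between nodes. The following sketch assumes timezone-aware timestamps; the classify_token function, its return labels, and the 30-second skew tolerance are hypothetical, and the TTL is passed in rather than assumed.

```python
from datetime import datetime, timedelta, timezone


def classify_token(issued_at, now, ttl, max_skew=timedelta(seconds=30)):
    """Classify a token as 'valid', 'expired', or 'clock_skew'.

    'clock_skew' means the token appears to have been issued further in
    the future than the tolerated skew, which points at NTP drift between
    the issuing and validating nodes rather than at the TTL setting.
    """
    age = now - issued_at
    if age < -max_skew:
        return "clock_skew"
    if age > ttl:
        return "expired"
    return "valid"


# Example with an illustrative one-hour TTL (not the configured value).
issued = datetime(2025, 11, 22, 12, 0, tzinfo=timezone.utc)
print(classify_token(issued, issued + timedelta(hours=2), timedelta(hours=1)))
```

A rise in clock_skew results suggests checking NTP first, while a rise in expired results suggests comparing the deployed TTL against the expected value.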

8.4 Provisioning Latency Spike

  1. Trace sample slow requests → identify span with highest duration (likely db.query or external call).
  2. Check DB resource metrics (CPU, I/O, locks).
  3. Examine concurrent provisioning volume (load surge?).
  4. Apply rate limiting or queue provisioning until resources stabilize.
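For the trace inspection step, the root span always has the highest total duration, so ranking by self time (a span's duration minus its direct children) is usually more informative. The sketch below assumes spans exported as dictionaries with id, parentId, name, and durationMs fields, and that sibling spans do not overlap; these field names and the function are hypothetical.

```python
from collections import defaultdict


def slowest_by_self_time(spans):
    """Return the span with the largest self time, or None for an empty trace."""
    # Sum the durations of each span's direct children.
    child_total = defaultdict(float)
    for span in spans:
        if span.get("parentId") is not None:
            child_total[span["parentId"]] += span["durationMs"]

    def self_time(span):
        # Clamp at zero in case overlapping children exceed the parent.
        return max(span["durationMs"] - child_total[span["id"]], 0.0)

    return max(spans, key=self_time, default=None)
```

In a trace where a 2500ms root span contains a 1900ms db.query child, this returns the db.query span rather than the root.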

8.5 Partial Funnel Drop (Invite Acceptance Low)

  1. Compare email send counts vs acceptance events.
  2. Validate UI activation page availability (synthetic test).
  3. Sample user feedback for link issues (URL rewriting by email client?).
  4. Consider switching to shorter branded links or alternate email template.

9. Incident Severity Mapping

Severity | Criteria | Response Time
SEV-1 | Complete signup outage | Immediate (<5m)
SEV-2 | >30% activation latency degradation | 15m
SEV-3 | Invite acceptance dip (>50% drop) | 30m
SEV-4 | Non-critical analytics event missing | Next business day

10. Escalation Path

  1. On-call engineer triage.
  2. If infrastructure issue (DB/network) → Infra team.
  3. Email provider outage → Vendor support + Product notification.
  4. Security anomaly (token replay) → Security team immediately.

11. Rollback Strategy

Criteria: new deployment correlates with spike in activation errors within 10m.

Steps:

  1. Identify release version (tag/commit).
  2. Execute automated rollback script or redeploy previous stable container image.
  3. Verify health checks; run synthetic signup & invite.
  4. Annotate incident log with root cause placeholder.
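The rollback criterion can be expressed as a simple time-window check. This is a sketch under assumptions: the function name is hypothetical, both timestamps are timezone-aware, and "within 10m" is read as the error spike starting no more than 10 minutes after the deployment.

```python
from datetime import datetime, timedelta, timezone


def correlates_with_deploy(deployed_at, spike_started_at,
                           window=timedelta(minutes=10)):
    """True if the error spike began within `window` after the deployment."""
    delta = spike_started_at - deployed_at
    return timedelta(0) <= delta <= window


deployed = datetime(2025, 11, 22, 12, 0, tzinfo=timezone.utc)
print(correlates_with_deploy(deployed, deployed + timedelta(minutes=7)))
```

A spike that began before the deployment returns False, which argues against rollback as the first response.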

12. Synthetic Tests

  • Hourly signup synthetic using test email domain (e.g., synthetic+timestamp@example.test).
  • Invite + activation synthetic (auto-click headless) once per hour; failures raise WARN.

13. Performance Budgets (Baseline)

Operation | Budget p95
Signup request | 1200ms
Tenant provisioning | 2000ms
Invitation creation | 300ms
Activation token validation | 300ms

14. Security Considerations

  • Token TTL & single-use enforced; logs never contain raw token.
  • Failed token validations monitored for brute force pattern (rate > threshold triggers alert).
  • Password hashing algorithm integrity check at startup (log INFO confirmation).
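Two of these controls lend themselves to a short illustration: logging a token fingerprint instead of the raw token, and flagging a burst of failed validations. The sketch below is not the service's actual implementation; token_fingerprint, FailureRateMonitor, the 12-character fingerprint length, and the threshold values are all hypothetical.

```python
import hashlib
from collections import deque


def token_fingerprint(raw_token, length=12):
    """Return a short SHA-256 prefix that is safe to log in place of the token."""
    digest = hashlib.sha256(raw_token.encode("utf-8")).hexdigest()
    return digest[:length]


class FailureRateMonitor:
    """Sliding-window counter for failed token validations."""

    def __init__(self, max_failures, window_seconds):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._events = deque()  # timestamps of recent failures, oldest first

    def record_failure(self, timestamp):
        """Record a failure; return True if the windowed count exceeds the limit."""
        self._events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop failures that have aged out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self.max_failures
```

In practice one monitor per source (for example per IP address or per tenant) would keep a single noisy client from masking or triggering alerts for everyone else.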

15. Data Integrity Checks

  • Daily job: count tenants with zero owner user (should be 0).
  • Orphan invite scan: purge invites whose age exceeds TTL + 7d.
  • Audit mismatch: activation.Accepted events without a matching ActivationCompleted (report & backfill).
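The first check, tenants with zero owner users, could be written as a single query. The sketch below uses an in-memory-friendly SQLite connection purely for illustration; the schema (tables tenants and users, columns tenant_id and role, and the role value 'owner') is hypothetical and would need to be mapped to the real data model.

```python
import sqlite3

# Tenants that have no user with the owner role (hypothetical schema).
ORPHAN_TENANTS_SQL = """
SELECT t.id
FROM tenants AS t
LEFT JOIN users AS u
       ON u.tenant_id = t.id AND u.role = 'owner'
GROUP BY t.id
HAVING COUNT(u.id) = 0
ORDER BY t.id
"""


def tenants_without_owner(conn):
    """Return the ids of tenants that have no owner user (expected: empty)."""
    return [row[0] for row in conn.execute(ORPHAN_TENANTS_SQL)]
```

The daily job would alert whenever the returned list is non-empty, since the expected count is 0.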

16. Tooling & Commands (Placeholder)

# Check auth service health (pretty-print the JSON response)
curl -s https://api.example.com/health/auth | jq .

# Query recent activation errors (example structured log grep)
# -c prints one compact line per match, so tail returns the last 20 entries
jq -c 'select(.logger=="activation" and .level=="ERROR")' /var/log/app.log | tail -n 20

# Synthetic signup script (pseudo)
./scripts/synthetic-signup.sh "synthetic+$(date +%s)@example.test"

17. Runbook Validation Checklist

  • Dashboard panels exist & are refreshing.
  • Alerts configured & tested (dummy trigger).
  • Synthetic tests implemented & stable.
  • Escalation contacts up-to-date.
  • Rollback steps verified in staging.

18. Open Questions

  1. Add separate funnel for multi-tenant user switch events in future?
  2. Should auto-recovery for email backlog (dynamic worker scaling) be implemented in Sprint 2 or later?
  3. Need canary activation path for early detection (weighted routing)?
