
Capacity & Scaling Strategy

Purpose: Define quantitative assumptions, autoscaling policies, growth triggers, and failover readiness practices for the Core HR platform. Complements performance/performance-budget.md and the deployment diagram.

1. Scope

  • Runtime services: API, Workers, Document Scan Service.
  • Data layers: PostgreSQL, Redis, Kafka.
  • External dependencies (IdP, Payment Gateway, Object Storage) are mostly outside our direct scaling control; the focus here is on internal readiness.

2. Demand Model Assumptions (Year 1)

| Metric | Baseline (Month 1) | Peak (Seasonal) | End of Year Projection | Notes |
| --- | --- | --- | --- | --- |
| Auth Requests RPS (peak 5-min) | 50 | 200 | 500 | Includes password + SSO |
| Active Tenants | 10 | 50 | 200 | Growth influences per-tenant data volume |
| Leave Ops RPS | 20 | 80 | 200 | Spikes near holidays |
| Document Uploads RPS | 5 | 20 | 50 | Bursty; queue absorbs surges |
| Analytics Events RPS | 150 | 600 | 1500 | Activation + usage events |
| Concurrent WebSocket / Sessions (future) | N/A | N/A | 1000 | Placeholder for real-time features |
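
To translate the projections into a first-cut replica count, a minimal sizing sketch can combine the peak RPS above with the 2x stress tolerance from Section 3. The per-pod capacity of 100 RPS is an illustrative assumption, not a measured figure.

```python
import math

def required_replicas(peak_rps: float, per_pod_rps: float = 100.0, stress_factor: float = 2.0) -> int:
    """First-cut replica count: serve stress_factor x the projected peak, with an N+1 floor."""
    raw = (peak_rps * stress_factor) / per_pod_rps
    return max(2, math.ceil(raw))

# End-of-year auth projection from the table above (500 RPS peak):
print(required_replicas(500))  # -> 10 pods at the assumed 100 RPS per pod
```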

3. Scaling Objectives

  • Maintain p95 latencies within budget under 2x projected peak (stress tolerance).
  • Horizontal scaling preferred for stateless services (API, Workers).
  • Maintain N+1 redundancy per tier at all times, so that no tier has a single point of failure.

4. Autoscaling Policies

4.1 API Service

| Signal | Threshold (Scale Out) | Cooldown | Scale In Condition |
| --- | --- | --- | --- |
| CPU Utilization | >65% avg 3 min | 5 min | <35% avg 10 min |
| p95 Request Latency | > Budget +15% for 3 windows | 5 min | Budget maintained for 10 windows |
| Active Connections | > (Pods * 800) | 5 min | < (Pods * 400) for 10 min |

Decision: Use a composite policy (scale out only when any two triggers fire within the same evaluation window) to avoid noisy scaling; a sketch of the check follows.
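
The check below is a minimal sketch of that composite rule, assuming a hypothetical per-window metrics snapshot; thresholds mirror the table above.

```python
from dataclasses import dataclass

@dataclass
class ApiWindow:
    """One evaluation window of API metrics (hypothetical snapshot shape)."""
    cpu_avg: float            # 0.0 - 1.0
    p95_latency_ms: float
    latency_budget_ms: float
    active_connections: int
    pods: int

def should_scale_out(w: ApiWindow) -> bool:
    # Requiring any two of the three signals damps noisy single-metric spikes.
    triggers = [
        w.cpu_avg > 0.65,
        w.p95_latency_ms > w.latency_budget_ms * 1.15,
        w.active_connections > w.pods * 800,
    ]
    return sum(triggers) >= 2

# CPU and latency breached, connections still under the per-pod ceiling:
print(should_scale_out(ApiWindow(0.72, 480, 400, 3000, 4)))  # -> True
```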

4.2 Worker Service

| Signal | Threshold (Scale Out) | Cooldown | Scale In Condition |
| --- | --- | --- | --- |
| Queue Lag (domain) | p95 >2s for 3 windows | 5 min | <1s for 6 windows |
| Document Scan Queue Depth | > (#Workers * 150) | 10 min | < (#Workers * 50) for 10 min |
| CPU Utilization | >70% avg 5 min | 5 min | <40% avg 15 min |
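
Because the scan-queue thresholds scale with the current worker count, the evaluation has to recompute them each cycle. A minimal sketch using the constants from the table (the metric values are hypothetical):

```python
def scan_queue_decision(queue_depth: int, workers: int) -> str:
    """Scale decision for the document scan queue using the per-worker thresholds above."""
    if queue_depth > workers * 150:
        return "scale_out"            # then respect the 10 min cooldown
    if queue_depth < workers * 50:
        return "scale_in_candidate"   # must persist for 10 min before scaling in
    return "hold"

print(scan_queue_decision(queue_depth=950, workers=6))  # -> "scale_out" (threshold is 900)
```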

4.3 Document Scan Service

| Signal | Threshold | Action |
| --- | --- | --- |
| Average Scan Turnaround | > Budget (30s) for 5 min | Add worker replica; reassess antivirus engine throughput |
| Infection Queue Stalls | >5 pending infected items | Investigate; evaluate resource contention |

4.4 Redis (Cache)

  • Scale vertically first (memory & I/O); introduce cluster sharding only when the hit ratio drops below 90% and evictions exceed 5%/hour.
  • Alert: memory usage >75% sustained for 15 minutes.

4.5 PostgreSQL

| Trigger | Threshold | Action |
| --- | --- | --- |
| CPU Saturation | >70% sustained 10 min | Evaluate slow queries / add read replica |
| IOPS Saturation | >85% | Consider storage class upgrade |
| Connection Pool Wait | p95 >25ms | Increase pool size or add replicas |
| Table Size Growth | > Forecast by 20% | Reassess partitioning strategy |
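
When connection pool waits breach the 25ms threshold, a quick Little's-law estimate helps decide whether the pool is simply undersized before reaching for read replicas. The query rate and average service time below are illustrative numbers, not measurements.

```python
import math

def pool_size_estimate(queries_per_sec: float, avg_query_ms: float, safety: float = 1.3) -> int:
    """Little's law: concurrent connections ~= arrival rate x service time, plus headroom."""
    concurrent = queries_per_sec * (avg_query_ms / 1000.0)
    return math.ceil(concurrent * safety)

# e.g. 400 queries/s at 20 ms average service time:
print(pool_size_estimate(400, 20))  # -> 11; compare against the configured pool size
```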

4.6 Kafka (Queue)

| Trigger | Threshold | Action |
| --- | --- | --- |
| Partition Lag | > (Target Lag Budget * 3) | Increase consumers or partitions |
| Broker CPU | >70% | Add broker node |
| Storage Utilization | >75% | Expand disk / retention tuning |
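
To size the "increase consumers or partitions" action, a rough estimate of the consumers needed at peak throughput is useful; the per-consumer throughput and partition count below are assumptions for illustration.

```python
import math

def consumers_needed(peak_events_per_sec: float, per_consumer_eps: float, partitions: int) -> int:
    """Consumers required to keep up at peak, capped at one consumer per partition."""
    needed = math.ceil(peak_events_per_sec / per_consumer_eps)
    # Consumers beyond the partition count sit idle; partitions must grow first.
    return min(needed, partitions)

# End-of-year analytics projection (1500 events/s) at an assumed 400 events/s per consumer:
print(consumers_needed(1500, 400, partitions=12))  # -> 4; partition count is sufficient
```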

5. Capacity Planning Cycle

  1. Monthly: Review actual vs assumed metrics (RPS, events, storage growth).
  2. Update projections if variance >25% (see the sketch after this list).
  3. Adjust autoscale thresholds & budgets accordingly.
  4. Record changes in the change log (to be added) for audit.
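
The variance check in step 2 is easy to automate against the demand model; a minimal sketch (the observed value is hypothetical):

```python
def needs_projection_update(actual: float, assumed: float, tolerance: float = 0.25) -> bool:
    """Flag a metric whose observed value drifted more than 25% from the assumption."""
    if assumed == 0:
        return actual != 0
    return abs(actual - assumed) / assumed > tolerance

# e.g. assumed 80 leave-ops RPS at seasonal peak, observed 108:
print(needs_projection_update(actual=108, assumed=80))  # -> True (35% variance)
```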

6. Scaling Levers

| Lever | Type | Use Case | Trade-offs |
| --- | --- | --- | --- |
| Horizontal Pod Scale (API) | Reactive | Increased RPS | Cost increase; watch cold start latency |
| Worker Concurrency Increase | Reactive | Event backlog spikes | Might raise DB contention |
| Cache Prewarming | Proactive | Known seasonal load | Adds operational complexity |
| Read Replica (DB) | Proactive | Read-heavy expansion | Consistency lag; failover complexity |
| Partitioning / Sharding | Strategic | Exceeding single-node DB limits | Increased complexity & query redesign |
| Event Batching | Optimization | High event throughput | Requires higher latency tolerance for non-critical events |
| Autoscale Threshold Tuning | Reactive | Changed workload shape | Risk of thrash if too aggressive |

7. Failover & RPO/RTO Verification

| Objective | Target | Verification Method | Frequency |
| --- | --- | --- | --- |
| RPO (Recovery Point) | <5s | Simulated replica promotion & binlog diff | Quarterly |
| RTO (Recovery Time) | <15m | Full region failover drill | Quarterly |
| Queue Mirror Consistency | Lag <2s | Mirror lag metrics | Monthly |
| Cache Warm DR | Key coverage >80% post-failover | Automated warm script | Quarterly |

8. Stress & Load Testing

| Test Type | Scenario | Goal | Pass Criteria |
| --- | --- | --- | --- |
| Baseline Load | 1x projected peak | Validate initial sizing | All p95 < budget |
| Stress Load | 2x projected peak | Observe degradation pattern | Controlled latency increase (<30% over budget) |
| Spike Test | 3x RPS for 5 min | Validate autoscaling reaction | Scale out within 2 min; error rate ≤1% |
| Soak Test | 1x peak for 4h | Detect memory leaks | Memory growth <5% |
| DR Failover | Region A simulated down | Validate RTO | Region B serving in <15m |

Scripts will be stored under a future perf-tests/ directory (not yet created).
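
Even before that directory exists, the pass criteria can be expressed as checks a future script would run against its own output; the result shape below is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SpikeTestResult:
    """Hypothetical summary a future spike-test script might emit."""
    scale_out_seconds: float
    error_rate: float   # 0.0 - 1.0

def spike_test_passed(r: SpikeTestResult) -> bool:
    # Pass criteria from the table: scale out within 2 min, error rate no worse than 1%.
    return r.scale_out_seconds <= 120 and r.error_rate <= 0.01

print(spike_test_passed(SpikeTestResult(scale_out_seconds=95, error_rate=0.004)))  # -> True
```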

9. Monitoring & Alert Matrix (Draft)

| Metric | Source | Alert Threshold | Severity |
| --- | --- | --- | --- |
| API p95 Latency | Traces/Histograms | > Budget +15% for 3 windows | High |
| Worker Queue Lag p95 | Broker Metrics | >2s for 3 windows | High |
| DB Connection Wait p95 | DB Metrics | >25ms for 5 windows | High |
| Cache Hit Ratio | Redis Metrics | <90% for 10 min | Medium |
| Scan Turnaround p95 | Worker Metrics | >45s for 3 windows | Medium |
| Kafka Partition Lag | Broker Metrics | > Lag Budget * 3 | High |
| RPO Drill Failure | Ops Runbook | Any failure | Critical |

10. Instrumentation Requirements

  • Add span attributes: tenantId, region, planTier for latency segmentation (see the sketch after this list).
  • Metrics namespaces: api.request.latency, worker.queue.lag, db.pool.wait, cache.hit_ratio, scan.turnaround, kafka.partition.lag.
  • Dashboard segmentation by region & plan tier to detect localized or tier-specific issues.
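
A minimal sketch of attaching the segmentation attributes, assuming OpenTelemetry is the tracing library in use (not mandated by this document); span and attribute names follow the list above, and the values are placeholders.

```python
from opentelemetry import trace

tracer = trace.get_tracer("core-hr.api")

def handle_request(tenant_id: str, region: str, plan_tier: str) -> None:
    # Attributes enable latency segmentation per tenant, region, and plan tier.
    with tracer.start_as_current_span("api.request") as span:
        span.set_attribute("tenantId", tenant_id)
        span.set_attribute("region", region)
        span.set_attribute("planTier", plan_tier)
        # ... handle the request ...

handle_request("tenant-42", "eu-west-1", "enterprise")
```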

11. Governance & Review

  • Capacity Review Meeting in the first week of each month; outputs: updated projections and an updated risk register.
  • Autoscaling threshold changes require PR with justification referencing last 30-day metrics.
  • Failover drill action items tracked in runbook improvements section.

12. Open Questions

  1. Should we introduce predictive (trend-based) scaling alongside the current reactive approach?
  2. Should we implement cost budget alerts tied to scaling events?
  3. Is per-tenant throttling needed to head off noisy-neighbor issues before partitioning becomes necessary?

Version: 1.0 (2025-11-22)