
Capacity & Scaling Strategy

Purpose: Define quantitative assumptions, autoscaling policies, growth triggers, and failover readiness practices for the Core HR platform. Complements performance/performance-budget.md and the deployment diagram.

1. Scope

  • Runtime services: API, Workers, Document Scan Service.
  • Data layers: PostgreSQL, Redis, Kafka.
  • External dependencies (IdP, Payment Gateway, Object Storage) are mostly outside our direct scaling control; the focus here is on internal readiness.

2. Demand Model Assumptions (Year 1)

| Metric | Baseline (Month 1) | Peak (Seasonal) | End of Year Projection | Notes |
| --- | --- | --- | --- | --- |
| Auth Requests RPS (peak 5-min) | 50 | 200 | 500 | Includes password + SSO |
| Active Tenants | 10 | 50 | 200 | Growth influences per-tenant data volume |
| Leave Ops RPS | 20 | 80 | 200 | Spikes near holidays |
| Document Uploads RPS | 5 | 20 | 50 | Bursty; queue absorbs surges |
| Analytics Events RPS | 150 | 600 | 1500 | Activation + usage events |
| Concurrent WebSocket / Sessions (future) | N/A | N/A | 1000 | Placeholder for real-time features |
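
To translate the projections into a first-cut replica count, a minimal sizing sketch can combine the peak RPS above with the 2x stress tolerance from Section 3. The per-pod capacity of 100 RPS is an illustrative assumption, not a measured figure.

```python
import math

def required_replicas(peak_rps: float, per_pod_rps: float = 100.0, stress_factor: float = 2.0) -> int:
    """First-cut replica count: serve stress_factor x the projected peak, with an N+1 floor."""
    raw = (peak_rps * stress_factor) / per_pod_rps
    return max(2, math.ceil(raw))

# End-of-year auth projection from the table above (500 RPS peak):
print(required_replicas(500))  # -> 10 pods at the assumed 100 RPS per pod
```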

3. Scaling Objectives

  • Maintain p95 latencies within budget under 2x projected peak (stress tolerance).
  • Horizontal scaling preferred for stateless services (API, Workers).
  • Maintain N+1 redundancy per tier at all times, so that no tier has a single point of failure.

4. Autoscaling Policies

4.1 API Service

| Signal | Threshold (Scale Out) | Cooldown | Scale In Condition |
| --- | --- | --- | --- |
| CPU Utilization | >65% avg 3 min | 5 min | <35% avg 10 min |
| p95 Request Latency | > Budget +15% for 3 windows | 5 min | Budget maintained for 10 windows |
| Active Connections | > (Pods * 800) | 5 min | < (Pods * 400) for 10 min |

Decision: Use a composite policy (scale out only when any two triggers fire within the same evaluation window) to avoid noisy scaling; a sketch of the check follows.
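
The check below is a minimal sketch of that composite rule, assuming a hypothetical per-window metrics snapshot; thresholds mirror the table above.

```python
from dataclasses import dataclass

@dataclass
class ApiWindow:
    """One evaluation window of API metrics (hypothetical snapshot shape)."""
    cpu_avg: float            # 0.0 - 1.0
    p95_latency_ms: float
    latency_budget_ms: float
    active_connections: int
    pods: int

def should_scale_out(w: ApiWindow) -> bool:
    # Requiring any two of the three signals damps noisy single-metric spikes.
    triggers = [
        w.cpu_avg > 0.65,
        w.p95_latency_ms > w.latency_budget_ms * 1.15,
        w.active_connections > w.pods * 800,
    ]
    return sum(triggers) >= 2

# CPU and latency breached, connections still under the per-pod ceiling:
print(should_scale_out(ApiWindow(0.72, 480, 400, 3000, 4)))  # -> True
```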

4.2 Worker Service

| Signal | Threshold (Scale Out) | Cooldown | Scale In Condition |
| --- | --- | --- | --- |
| Queue Lag (domain) | p95 >2s for 3 windows | 5 min | <1s for 6 windows |
| Document Scan Queue Depth | > (#Workers * 150) | 10 min | < (#Workers * 50) for 10 min |
| CPU Utilization | >70% avg 5 min | 5 min | <40% avg 15 min |
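
Because the scan-queue thresholds scale with the current worker count, the evaluation has to recompute them each cycle. A minimal sketch using the constants from the table (the metric values are hypothetical):

```python
def scan_queue_decision(queue_depth: int, workers: int) -> str:
    """Scale decision for the document scan queue using the per-worker thresholds above."""
    if queue_depth > workers * 150:
        return "scale_out"            # then respect the 10 min cooldown
    if queue_depth < workers * 50:
        return "scale_in_candidate"   # must persist for 10 min before scaling in
    return "hold"

print(scan_queue_decision(queue_depth=950, workers=6))  # -> "scale_out" (threshold is 900)
```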

4.3 Document Scan Service

| Signal | Threshold | Action |
| --- | --- | --- |
| Average Scan Turnaround | > Budget (30s) for 5 min | Add worker replica; reassess antivirus engine throughput |
| Infection Queue Stalls | >5 pending infected items | Investigate; evaluate resource contention |

4.4 Redis (Cache)

  • Scale vertically first (memory & I/O); introduce cluster sharding only when the hit ratio drops below 90% and evictions exceed 5%/hour.
  • Alert: memory usage >75% sustained for 15 minutes.

4.5 PostgreSQL

| Trigger | Threshold | Action |
| --- | --- | --- |
| CPU Saturation | >70% sustained 10 min | Evaluate slow queries / add read replica |
| IOPS Saturation | >85% | Consider storage class upgrade |
| Connection Pool Wait | p95 >25ms | Increase pool size or add replicas |
| Table Size Growth | > Forecast by 20% | Reassess partitioning strategy |
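
When connection pool waits breach the 25ms threshold, a quick Little's-law estimate helps decide whether the pool is simply undersized before reaching for read replicas. The query rate and average service time below are illustrative numbers, not measurements.

```python
import math

def pool_size_estimate(queries_per_sec: float, avg_query_ms: float, safety: float = 1.3) -> int:
    """Little's law: concurrent connections ~= arrival rate x service time, plus headroom."""
    concurrent = queries_per_sec * (avg_query_ms / 1000.0)
    return math.ceil(concurrent * safety)

# e.g. 400 queries/s at 20 ms average service time:
print(pool_size_estimate(400, 20))  # -> 11; compare against the configured pool size
```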

4.6 Kafka (Queue)

| Trigger | Threshold | Action |
| --- | --- | --- |
| Partition Lag | > (Target Lag Budget * 3) | Increase consumers or partitions |
| Broker CPU | >70% | Add broker node |
| Storage Utilization | >75% | Expand disk / retention tuning |
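
To size the "increase consumers or partitions" action, a rough estimate of the consumers needed at peak throughput is useful; the per-consumer throughput and partition count below are assumptions for illustration.

```python
import math

def consumers_needed(peak_events_per_sec: float, per_consumer_eps: float, partitions: int) -> int:
    """Consumers required to keep up at peak, capped at one consumer per partition."""
    needed = math.ceil(peak_events_per_sec / per_consumer_eps)
    # Consumers beyond the partition count sit idle; partitions must grow first.
    return min(needed, partitions)

# End-of-year analytics projection (1500 events/s) at an assumed 400 events/s per consumer:
print(consumers_needed(1500, 400, partitions=12))  # -> 4; partition count is sufficient
```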

5. Capacity Planning Cycle

  1. Monthly: Review actual vs assumed metrics (RPS, events, storage growth).
  2. Update projections if variance >25% (see the sketch after this list).
  3. Adjust autoscale thresholds & budgets accordingly.
  4. Record changes in the change log (to be added) for audit.
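
The variance check in step 2 is easy to automate against the demand model; a minimal sketch (the observed value is hypothetical):

```python
def needs_projection_update(actual: float, assumed: float, tolerance: float = 0.25) -> bool:
    """Flag a metric whose observed value drifted more than 25% from the assumption."""
    if assumed == 0:
        return actual != 0
    return abs(actual - assumed) / assumed > tolerance

# e.g. assumed 80 leave-ops RPS at seasonal peak, observed 108:
print(needs_projection_update(actual=108, assumed=80))  # -> True (35% variance)
```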

6. Scaling Levers

| Lever | Type | Use Case | Trade-offs |
| --- | --- | --- | --- |
| Horizontal Pod Scale (API) | Reactive | Increased RPS | Cost increase; watch cold start latency |
| Worker Concurrency Increase | Reactive | Event backlog spikes | Might raise DB contention |
| Cache Prewarming | Proactive | Known seasonal load | Adds operational complexity |
| Read Replica (DB) | Proactive | Read-heavy expansion | Consistency lag; failover complexity |
| Partitioning / Sharding | Strategic | Exceeding single-node DB limits | Increased complexity & query redesign |
| Event Batching | Optimization | High event throughput | Requires higher latency tolerance for non-critical events |
| Autoscale Threshold Tuning | Reactive | Changed workload shape | Risk of thrash if too aggressive |

7. Failover & RPO/RTO Verification

| Objective | Target | Verification Method | Frequency |
| --- | --- | --- | --- |
| RPO (Recovery Point) | <5s | Simulated replica promotion & binlog diff | Quarterly |
| RTO (Recovery Time) | <15m | Full region failover drill | Quarterly |
| Queue Mirror Consistency | Lag <2s | Mirror lag metrics | Monthly |
| Cache Warm DR | Key coverage >80% post-failover | Automated warm script | Quarterly |

8. Stress & Load Testing

| Test Type | Scenario | Goal | Pass Criteria |
| --- | --- | --- | --- |
| Baseline Load | 1x projected peak | Validate initial sizing | All p95 < budget |
| Stress Load | 2x projected peak | Observe degradation pattern | Controlled latency increase (<30% over budget) |
| Spike Test | 3x RPS for 5 min | Validate autoscaling reaction | Scale out within 2 min; error rate ≤1% |
| Soak Test | 1x peak for 4h | Detect memory leaks | Memory growth <5% |
| DR Failover | Region A simulated down | Validate RTO | Region B serving in <15m |

Scripts will be stored under a future perf-tests/ directory (not yet created).
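
Even before that directory exists, the pass criteria can be expressed as checks a future script would run against its own output; the result shape below is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SpikeTestResult:
    """Hypothetical summary a future spike-test script might emit."""
    scale_out_seconds: float
    error_rate: float   # 0.0 - 1.0

def spike_test_passed(r: SpikeTestResult) -> bool:
    # Pass criteria from the table: scale out within 2 min, error rate no worse than 1%.
    return r.scale_out_seconds <= 120 and r.error_rate <= 0.01

print(spike_test_passed(SpikeTestResult(scale_out_seconds=95, error_rate=0.004)))  # -> True
```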

9. Monitoring & Alert Matrix (Draft)

| Metric | Source | Alert Threshold | Severity |
| --- | --- | --- | --- |
| API p95 Latency | Traces/Histograms | > Budget +15% for 3 windows | High |
| Worker Queue Lag p95 | Broker Metrics | >2s for 3 windows | High |
| DB Connection Wait p95 | DB Metrics | >25ms for 5 windows | High |
| Cache Hit Ratio | Redis Metrics | <90% for 10 min | Medium |
| Scan Turnaround p95 | Worker Metrics | >45s for 3 windows | Medium |
| Kafka Partition Lag | Broker Metrics | > Lag Budget * 3 | High |
| RPO Drill Failure | Ops Runbook | Any failure | Critical |

10. Instrumentation Requirements

  • Add span attributes: tenantId, region, planTier for latency segmentation (see the sketch after this list).
  • Metrics namespaces: api.request.latency, worker.queue.lag, db.pool.wait, cache.hit_ratio, scan.turnaround, kafka.partition.lag.
  • Dashboard segmentation by region & plan tier to detect localized or tier-specific issues.
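
A minimal sketch of attaching the segmentation attributes, assuming OpenTelemetry is the tracing library in use (not mandated by this document); span and attribute names follow the list above, and the values are placeholders.

```python
from opentelemetry import trace

tracer = trace.get_tracer("core-hr.api")

def handle_request(tenant_id: str, region: str, plan_tier: str) -> None:
    # Attributes enable latency segmentation per tenant, region, and plan tier.
    with tracer.start_as_current_span("api.request") as span:
        span.set_attribute("tenantId", tenant_id)
        span.set_attribute("region", region)
        span.set_attribute("planTier", plan_tier)
        # ... handle the request ...

handle_request("tenant-42", "eu-west-1", "enterprise")
```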

11. Governance & Review

  • Capacity Review Meeting in the first week of each month; outputs: updated projections and an updated risk register.
  • Autoscaling threshold changes require PR with justification referencing last 30-day metrics.
  • Failover drill action items tracked in runbook improvements section.

12. Open Questions

  1. Should we introduce predictive (trend-based) scaling alongside the current reactive approach?
  2. Should we implement cost budget alerts tied to scaling events?
  3. Is per-tenant throttling needed to head off noisy-neighbor issues before partitioning becomes necessary?

Version: 1.0 (2025-11-22)