Capacity & Scaling Strategy
Purpose: Define quantitative assumptions, autoscaling policies, growth triggers, and failover readiness practices for the Core HR platform. Complements performance/performance-budget.md and deployment diagram.
1. Scope
- Runtime services: API, Workers, Document Scan Service.
- Data layers: PostgreSQL, Redis, Kafka.
- External dependencies (IdP, Payment Gateway, Object Storage) mostly out of direct scale control; focus on internal readiness.
2. Demand Model Assumptions (Year 1)
| Metric |
Baseline (Month 1) |
Peak (Seasonal) |
End of Year Projection |
Notes |
| Auth Requests RPS (peak 5-min) |
50 |
200 |
500 |
Includes password + SSO |
| Active Tenants |
10 |
50 |
200 |
Growth influences per-tenant data volume |
| Leave Ops RPS |
20 |
80 |
200 |
Spikes near holidays |
| Document Uploads RPS |
5 |
20 |
50 |
Bursty; queue absorbs surges |
| Analytics Events RPS |
150 |
600 |
1500 |
Activation + usage events |
| Concurrent WebSocket / Sessions (future) |
N/A |
N/A |
1000 |
Placeholder for real-time features |
3. Scaling Objectives
- Maintain p95 latencies within budget under 2x projected peak (stress tolerance).
- Horizontal scaling preferred for stateless services (API, Workers).
- Aim for single node redundancy (N+1) per tier at all times; no single point of failure.
4. Autoscaling Policies
4.1 API Service
| Signal |
Threshold (Scale Out) |
Cooldown |
Scale In Condition |
| CPU Utilization |
>65% avg 3 min |
5 min |
<35% avg 10 min |
| p95 Request Latency |
> Budget +15% for 3 windows |
5 min |
Budget maintained for 10 windows |
| Active Connections |
> (Pods * 800) |
5 min |
< (Pods * 400) for 10 min |
Decision: Use composite policy (any two triggers within window) to avoid noisy scaling.
4.2 Worker Service
| Signal |
Threshold (Scale Out) |
Cooldown |
Scale In Condition |
| Queue Lag (domain) p95 |
>2s for 3 windows |
5 min |
<1s for 6 windows |
| Document Scan Queue Depth |
> (#Workers * 150) |
10 min |
< (#Workers * 50) for 10 min |
| CPU Utilization |
>70% avg 5 min |
5 min |
<40% avg 15 min |
4.3 Document Scan Service
| Signal |
Threshold |
Action |
| Average Scan Turnaround > Budget (30s) for 5m |
Add worker replica |
Reassess antivirus engine throughput |
| Infection Queue Stalls (>5 pending infected) |
Investigate |
Evaluate resource contention |
4.4 Redis (Cache)
- Vertical scaling first (memory & I/O); cluster sharding only when hit ratio drops below 90% and eviction >5%/hour.
- Alert: memory usage >75% sustained 15 minutes.
4.5 PostgreSQL
| Trigger |
Threshold |
Action |
| CPU Saturation |
>70% sustained 10 min |
Evaluate slow queries / add read replica |
| IOPS Saturation |
>85% |
Consider storage class upgrade |
| Connection Pool Wait p95 |
>25ms |
Increase pool size or add replicas |
| Table Size Growth |
> Forecast by 20% |
Reassess partitioning strategy |
4.6 Kafka (Queue)
| Trigger |
Threshold |
Action |
| Partition Lag |
> (Target Lag Budget * 3) |
Increase consumers or partitions |
| Broker CPU |
>70% |
Add broker node |
| Storage Utilization |
>75% |
Expand disk / retention tuning |
5. Capacity Planning Cycle
- Monthly: Review actual vs assumed metrics (RPS, events, storage growth).
- Update projections if variance >25%.
- Adjust autoscale thresholds & budgets accordingly.
- Record changes in change log (to be added) for audit.
6. Scaling Levers
| Lever |
Type |
Use Case |
Trade-offs |
| Horizontal Pod Scale (API) |
Reactive |
Increased RPS |
Cost increase; watch cold start latency |
| Worker Concurrency Increase |
Reactive |
Event backlog spikes |
Might raise DB contention |
| Cache Prewarming |
Proactive |
Known seasonal load |
Adds operational complexity |
| Read Replica (DB) |
Proactive |
Read-heavy expansion |
Consistency lag; failover complexity |
| Partitioning / Sharding |
Strategic |
Exceed single-node DB limits |
Increased complexity & query design |
| Event Batching |
Optimization |
High event throughput |
Increased latency tolerance for non-critical events |
| Autoscale Threshold Tuning |
Reactive |
Change workload shape |
Risk of thrash if too aggressive |
7. Failover & RPO/RTO Verification
| Objective |
Target |
Verification Method |
Frequency |
| RPO (Recovery Point) |
<5s |
Simulated replica promotion & binlog diff |
Quarterly |
| RTO (Recovery Time) |
<15m |
Full region failover drill |
Quarterly |
| Queue Mirror Consistency |
Lag <2s |
Mirror lag metrics |
Monthly |
| Cache Warm DR |
Key coverage >80% post-failover |
Automated warm script |
Quarterly |
8. Stress & Load Testing
| Test Type |
Scenario |
Goal |
Pass Criteria |
| Baseline Load |
1x projected peak |
Validate initial sizing |
All p95 < budget |
| Stress Load |
2x projected peak |
Observe degradation pattern |
Controlled latency increase (<30% over budget) |
| Spike Test |
3x RPS for 5 min |
Validate autoscaling reaction |
Scale out within 2 min; no errors >1% |
| Soak Test |
1x peak for 4h |
Detect memory leaks |
Memory growth <5% |
| DR Failover |
Region A simulated down |
Validate RTO |
Region B serving in <15m |
Scripts stored under future perf-tests/ directory (not yet created).
9. Monitoring & Alert Matrix (Draft)
| Metric |
Source |
Alert Threshold |
Severity |
| API p95 Latency |
Traces/Histograms |
> Budget +15% 3 windows |
High |
| Worker Queue Lag p95 |
Broker Metrics |
>2s 3 windows |
High |
| DB Connection Wait p95 |
DB Metrics |
>25ms 5 windows |
High |
| Cache Hit Ratio |
Redis Metrics |
<90% 10 min |
Medium |
| Scan Turnaround p95 |
Worker Metrics |
>45s 3 windows |
Medium |
| Kafka Partition Lag |
Broker Metrics |
>Lag Budget *3 |
High |
| RPO Drill Failure |
Ops Runbook |
Any failure |
Critical |
10. Instrumentation Requirements
- Add span attributes:
tenantId, region, planTier for latency segmentation.
- Metrics namespaces:
api.request.latency, worker.queue.lag, db.pool.wait, cache.hit_ratio, scan.turnaround, kafka.partition.lag.
- Dashboard segmentation by region & plan tier to detect localized or tier-specific issues.
11. Governance & Review
- Capacity Review Meeting first week each month; outputs: updated projections & risk register.
- Autoscaling threshold changes require PR with justification referencing last 30-day metrics.
- Failover drill action items tracked in runbook improvements section.
12. Open Questions
- Introduce predictive scaling (trend-based) vs current reactive approach?
- Implement cost budget alerts tied to scaling events?
- Need per-tenant throttling to prevent noisy neighbor issues earlier than partitioning?
Version: 1.0 (2025-11-22)