Logging & Tracing Guidelines
Generated: 2025-11-22 | Version: 1.0
Scope: Core platform (activation, HR, leave, recruitment); baseline observability standards.
1. Principles
- Actionable: Every log / span must help diagnose or measure behavior (no noise).
- Structured: Emit JSON objects; avoid free‑form string concatenation.
- Correlated: Propagate a single `traceId` + `spanId` across HTTP, async messages, and jobs.
- Separated Concerns: Use events (analytics) for product metrics; logs/spans for technical diagnostics.
- Safe: Never log raw PII or secrets; hash or redact early.
2. Log Levels (Server)
| Level | Use Case | Retention |
|---|---|---|
| TRACE | Deep development diagnostics; disabled in prod by default | Short (24h if enabled) |
| DEBUG | Non-critical flow details (branch decisions) | 7 days |
| INFO | High-level lifecycle milestones (provisioned tenant, accrual job start/finish) | 30 days |
| WARN | Recoverable issues (retry scheduled, invalid token attempt) | 60 days |
| ERROR | Failed operation where user impact likely | 90 days |
| FATAL | Process crash / unrecoverable state | 90 days |
Guideline: Default prod verbosity is INFO and above. Enable DEBUG only temporarily, scoped to specific loggers via dynamic config or a feature flag.
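One way to scope DEBUG without raising global verbosity is to keep the root logger at INFO and lower the level only on named loggers. The sketch below uses Python's standard `logging` module; the `LOG_DEBUG_SCOPES` variable and the logger names are assumptions for illustration, not part of the platform.

```python
import logging
import os


def configure_logging(debug_scopes: str = "") -> None:
    """Keep the root logger at INFO; enable DEBUG only for the named loggers."""
    logging.getLogger().setLevel(logging.INFO)
    for name in filter(None, (s.strip() for s in debug_scopes.split(","))):
        logging.getLogger(name).setLevel(logging.DEBUG)


# LOG_DEBUG_SCOPES is a hypothetical toggle, e.g. "tenant.provisioning,leave.accrual".
configure_logging(os.environ.get("LOG_DEBUG_SCOPES", ""))
```

Child loggers inherit the scoped level, so enabling `tenant.provisioning` also covers `tenant.provisioning.db`. This sketch never resets a logger back to INFO; a real toggle would need to handle scope removal as well.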
3. Structured Log Envelope

    {
      "timestamp": "2025-11-22T10:20:30.123Z",
      "level": "INFO",
      "logger": "tenant.provisioning",
      "message": "Tenant provisioned",
      "traceId": "trace-abc123",
      "spanId": "span-xyz456",
      "tenantId": "ten-001",
      "userId": "usr-001",
      "requestId": "req-789",
      "host": "api-1",
      "service": "core-api",
      "env": "prod",
      "durationMs": 532,
      "severityCode": 20000, // optional numeric level; 20000 is the Logback-style INFO value (the OTLP SeverityNumber for INFO is 9)
      "tags": ["provisioning", "activation"],
      "metadata": {"plan": "trial"},
      "pii": {"emailHash": "sha256:..."} // never raw
    }
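A formatter along the following lines could emit this envelope from Python's standard `logging` module. It is a minimal sketch: the class name `EnvelopeFormatter` and the `CONTEXT_FIELDS` tuple are assumptions, and context fields are expected to arrive through the logger's `extra=` argument.

```python
import json
import logging
from datetime import datetime, timezone

# Envelope fields copied from the record when the caller passes them via `extra=`.
CONTEXT_FIELDS = ("traceId", "spanId", "tenantId", "userId",
                  "requestId", "durationMs", "tags", "metadata")


class EnvelopeFormatter(logging.Formatter):
    """Serialize each record as one JSON object in the envelope shape."""

    def __init__(self, service: str, env: str, host: str) -> None:
        super().__init__()
        self.static = {"service": service, "env": env, "host": host}

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        envelope = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            **self.static,
        }
        # Only include context that is actually present on the record.
        for field in CONTEXT_FIELDS:
            if hasattr(record, field):
                envelope[field] = getattr(record, field)
        return json.dumps(envelope)
```

Usage would look like `logger.info("Tenant provisioned", extra={"traceId": "trace-abc123", "tenantId": "ten-001"})` with the formatter attached to a handler.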
4. Field Rules
- `traceId` mandatory for all request-scoped logs; generate at ingress if missing.
- `spanId` for component-level operations (DB call, external API); omit only for root log lines.
- `requestId` maps 1:1 to HTTP request; for async messages use `messageId`.
- Always include `tenantId` when context established; omit before provisioning.
- Avoid arrays of large objects in logs; summarize counts.
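The field rules can be sketched as a small helper that builds the per-request log context at ingress. The function name `request_context` and the `x-request-id` header are assumptions; `x-trace-id` is the example header name used elsewhere in this document.

```python
import secrets
from typing import Optional


def request_context(headers: dict, tenant_id: Optional[str] = None) -> dict:
    """Build the log context for one HTTP request."""
    ctx = {
        # Reuse the inbound trace id, or mint one at ingress if it is missing.
        "traceId": headers.get("x-trace-id") or secrets.token_hex(16),
        # One requestId per HTTP request (hypothetical header name).
        "requestId": headers.get("x-request-id") or secrets.token_hex(8),
    }
    if tenant_id is not None:
        # tenantId is omitted until tenant context is established.
        ctx["tenantId"] = tenant_id
    return ctx
```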
5. PII & Secrets Handling
| Data | Allowed? | Approach |
|---|---|---|
| Raw emails | No | Hash (salted SHA-256) |
| Passwords | No | Never logged |
| Access tokens | No | Truncate + hash if absolutely needed for correlation |
| Personal names | Avoid | Use userId reference |
| Document content | No | Log documentId only |
Redaction pipeline: sanitize BEFORE serialization; enforce via central logging adapter.
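A sanitizer applied inside the central logging adapter might look like the sketch below. The key names in `DROP_KEYS`, `HASH_KEYS`, and `TOKEN_KEYS` are hypothetical, and the salt is assumed to come from secret configuration rather than source code.

```python
import hashlib

DROP_KEYS = {"password", "fullName", "documentContent"}  # hypothetical field names
HASH_KEYS = {"email"}
TOKEN_KEYS = {"accessToken"}


def _digest(value, salt: bytes) -> str:
    """Salted SHA-256 of the value, hex encoded."""
    return hashlib.sha256(salt + str(value).encode("utf-8")).hexdigest()


def sanitize(fields: dict, salt: bytes) -> dict:
    """Return a copy of `fields` that is safe to serialize into a log line."""
    clean = {}
    for key, value in fields.items():
        if key in DROP_KEYS:
            continue  # never logged, not even hashed
        if key in HASH_KEYS:
            clean[key + "Hash"] = "sha256:" + _digest(value, salt)
        elif key in TOKEN_KEYS:
            # Truncated digest: enough to correlate, useless as a credential.
            clean[key + "Hash"] = "sha256:" + _digest(value, salt)[:16]
        else:
            clean[key] = value
    return clean
```

The sketch only inspects top-level keys; nested objects would need a recursive walk before serialization.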
6. Tracing Model
- OpenTelemetry for spans & context propagation.
- Root span: `http.server` or `messaging.consume`.
- Key child spans: `db.query`, `auth.validate`, `accrual.batch`, `email.send`.
- Correlate analytics events by reusing current `traceId`.
- Span attribute naming: dot-separated namespaces with snake_case segments (`tenant.id`, `leave.type_id`, `error.code`).
Mandatory Span Attributes
| Attribute | Example | Reason |
|---|---|---|
| `tenant.id` | `ten-001` | Multi-tenant filtering |
| `user.id` | `usr-002` | User action correlation |
| `http.route` | `/api/signup` | Performance segmentation |
| `db.system` | `postgresql` | Tech diagnostics |
| `error.type` | `ValidationError` | Faster grouping |
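The attribute naming convention could be enforced with a small check in tests or in the tracing wrapper. The regular expression below encodes one reading of the convention (lowercase snake_case segments joined by dots, at least two segments); that reading and the function name are assumptions.

```python
import re

# Lowercase snake_case segments joined by dots, e.g. "leave.type_id".
ATTRIBUTE_NAME = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")


def invalid_attribute_names(attributes: dict) -> list:
    """Return the attribute keys that break the naming convention, sorted."""
    return sorted(name for name in attributes if not ATTRIBUTE_NAME.match(name))
```

For example, `tenant.id` and `leave.type_id` pass, while `tenantId` and `Error.Code` are reported.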
Span Events
Use span events for discrete in-span milestones (e.g., `accrual.employee_processed`). Keep the payload small.
7. Sampling Strategy
- Default: 100% of traces for the MVP (low volume). Revisit once traffic exceeds 100 RPS.
- Future adaptive sampling: keep 100% of ERROR spans, probabilistic (10–30%) of successful high-volume endpoints.
- Provide dynamic configuration via env/remote toggle.
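The adaptive rule could be sketched as a pure decision function: keep every trace that contains an ERROR span, and sample the rest deterministically from the trace id so that all services agree. The function name and the 20% default are assumptions (the default sits inside the 10–30% range above).

```python
import hashlib


def keep_trace(trace_id: str, has_error: bool, success_rate: float = 0.2) -> bool:
    """Keep all error traces; keep roughly `success_rate` of the others."""
    if has_error:
        return True
    # Map the trace id to a stable bucket in [0, 1) so every service decides alike.
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < success_rate
```

Whether a trace contains an error is only known once its spans have finished, so this rule implies a tail-based decision (for example in a collector) rather than a head-based one at ingress.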
8. Correlation Across Async Messaging
- Inject `traceId` and `spanId` headers/properties into messages (e.g., `x-trace-id`).
- Consumer creates child span if trace present; else start new trace with `incomplete_trace=true` attribute.
- Preserve ordering metadata (`sequence`) if needed for debugging streams.
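The producer and consumer sides can be sketched with plain dictionaries standing in for message headers and spans. The `x-span-id` header and the span dictionary shape are assumptions for illustration; in practice the OpenTelemetry propagators would normally carry this context.

```python
import secrets


def inject(headers: dict, trace_id: str, span_id: str) -> dict:
    """Producer side: copy the current trace context into the message headers."""
    return {**headers, "x-trace-id": trace_id, "x-span-id": span_id}


def start_consumer_span(headers: dict) -> dict:
    """Consumer side: continue the trace if present, else start a flagged new one."""
    span = {"name": "messaging.consume", "spanId": secrets.token_hex(8), "attributes": {}}
    trace_id = headers.get("x-trace-id")
    if trace_id:
        span["traceId"] = trace_id
        span["parentSpanId"] = headers.get("x-span-id")
    else:
        span["traceId"] = secrets.token_hex(16)
        span["attributes"]["incomplete_trace"] = True
    return span
```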
9. Performance Budget Logging
- Log p95 latency summary per endpoint daily (aggregated metric, not individual logs) to reduce noise.
- Use metrics (histograms) rather than logs for high-frequency timing.
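To illustrate why a histogram is cheaper than per-request timing logs, the sketch below keeps only bucket counters and derives a p95 upper bound from them. The bucket boundaries and class name are assumptions; a metrics library would normally provide this.

```python
from bisect import bisect_left

# Upper bounds (inclusive) of the latency buckets in milliseconds; assumed values.
BOUNDS_MS = (50, 100, 250, 500, 1000, 2500)


class LatencyHistogram:
    def __init__(self) -> None:
        # One counter per bucket plus an overflow bucket.
        self.counts = [0] * (len(BOUNDS_MS) + 1)

    def observe(self, duration_ms: float) -> None:
        self.counts[bisect_left(BOUNDS_MS, duration_ms)] += 1

    def p95_upper_bound(self):
        """Smallest bucket bound covering at least 95% of observations."""
        total = sum(self.counts)
        if total == 0:
            return None
        running = 0
        for index, count in enumerate(self.counts):
            running += count
            if running * 100 >= 95 * total:
                return BOUNDS_MS[index] if index < len(BOUNDS_MS) else float("inf")
```

A daily summary line per endpoint can then report `p95_upper_bound()` instead of one log line per request.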
10. Error Logging Patterns
Example pattern:

    {
      "level": "ERROR",
      "message": "Activation token invalid",
      "error.type": "InvalidTokenError",
      "error.message": "Token expired",
      "error.stack": "(stack truncated)",
      "traceId": "trace-t1",
      "tenantId": "ten-001",
      "invitationId": "inv-77"
    }
Include the relevant domain identifier (e.g., `invitationId`) for direct trace triage.
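A helper in the logging adapter could build this record from a caught exception. The sketch below is an assumption about how that might look; `MAX_STACK_CHARS` and the function name are hypothetical.

```python
import traceback

MAX_STACK_CHARS = 2000  # assumed cap to keep error lines small


def error_log(exc: BaseException, message: str, trace_id: str, **identifiers) -> dict:
    """Build a structured ERROR record with a truncated stack and domain ids."""
    stack = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return {
        "level": "ERROR",
        "message": message,
        "error.type": type(exc).__name__,
        "error.message": str(exc),
        "error.stack": stack[:MAX_STACK_CHARS],
        "traceId": trace_id,
        **identifiers,
    }
```

A call such as `error_log(exc, "Activation token invalid", "trace-t1", tenantId="ten-001", invitationId="inv-77")` yields a record with the same fields as the example pattern.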
11. Log vs Event vs Metric Decision Table
| Need | Use |
|---|---|
| Debugging failure root cause | Log (structured) |
| End-to-end latency breakdown | Trace spans |
| Business funnel conversion | Analytics events |
| Throughput / latency statistics | Metrics (histograms, counters) |
| Rare milestone (tenant provisioned) | Log + Event |
12. Tooling & Libraries (Proposed)
- Backend: OpenTelemetry SDK, structured logging (pino / zap style depending on language), central logging adapter.
- Frontend: Lightweight tracing (W3C Trace Context) for API calls; minimal user action INFO logs for critical flows only.
- Log aggregation: ELK / OpenSearch or vendor; must support JSON ingestion & field indexing.
13. Validation & CI Checks
- Lint rule forbids `console.log` (frontend) outside approved wrapper.
- Static scan for banned patterns (e.g., `password=`).
- Schema validation for the log envelope (JSON Schema) is optional; future enhancement.
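The static scan could start as a line-based pattern check run in CI. This is a deliberately crude sketch: the `BANNED` table and `scan` function are assumptions, and it does not know about the approved wrapper, which a real lint rule would need to exempt.

```python
import re

# Pattern name -> compiled regex; extend as new banned patterns are agreed.
BANNED = {
    "console.log": re.compile(r"\bconsole\.log\s*\("),
    "password=": re.compile(r"password=", re.IGNORECASE),
}


def scan(source: str) -> list:
    """Return (line_number, pattern_name) for every banned pattern found."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in BANNED.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

A CI step would fail the build when `scan` returns a non-empty list for any changed file.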
14. Operational Playbooks (Pointers)
- High ERROR surge: query by `error.type`, examine the correlated `traceId` group, escalate if errors stay above 5% of requests.
- Latency regression: compare `http.route` span metrics vs p95 budget; diff span dependency timings (`db.query`).
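The escalation rule for an ERROR surge could be expressed as a check over recent time windows. The window granularity and the `sustained` count of three are assumptions; only the 5% threshold comes from the playbook.

```python
def should_escalate(windows: list, threshold: float = 0.05, sustained: int = 3) -> bool:
    """windows: (error_count, request_count) tuples, oldest first.

    Escalate only when the error rate exceeds the threshold in each of the
    most recent `sustained` windows.
    """
    recent = windows[-sustained:]
    if len(recent) < sustained:
        return False
    return all(total > 0 and errors / total > threshold for errors, total in recent)
```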
15. Open Questions
- Retention beyond baseline: which regulatory requirements might extend it? (HR records may imply longer audit needs.)
- Adopt semantic conventions for domain attributes (OpenTelemetry semantic conventions extension)?
- Need anonymization transforms for analytics events reused in logs?