Logging & Tracing Guidelines

Generated: 2025-11-22 | Version: 1.0
Scope: Core platform (activation, HR, leave, recruitment) – baseline observability standards.

1. Principles

Actionable: Every log / span must help diagnose or measure behavior (no noise).
Structured: Emit JSON objects; avoid free‑form string concatenation.
Correlated: Propagate a single traceId + spanId across HTTP, async messages, jobs.
Separated Concerns: Use events (analytics) for product metrics; logs/spans for technical diagnostics.
Safe: Never log raw PII or secrets; hash or redact early.

2. Log Levels (Server)

Level	Use Case	Retention
TRACE	Deep development diagnostics; disabled in prod by default	Short (24h if enabled)
DEBUG	Non-critical flow details (branch decisions)	7 days
INFO	High-level lifecycle milestones (provisioned tenant, accrual job start/finish)	30 days
WARN	Recoverable issues (retry scheduled, invalid token attempt)	60 days
ERROR	Failed operation where user impact likely	90 days
FATAL	Process crash / unrecoverable state	90 days

Guideline: Default production verbosity is INFO and above. Enable DEBUG only temporarily, scoped through dynamic config or a feature flag.
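One way to scope DEBUG output is a filter in the logging pipeline. The following is a minimal sketch using Python's standard logging module; the `DEBUG_TENANTS` set and the `tenantId` record attribute are assumptions standing in for a real dynamic config source.

```python
import logging

# Hypothetical flag store: tenants with DEBUG temporarily enabled.
# A real system would refresh this from dynamic config or a feature flag.
DEBUG_TENANTS = {"ten-001"}


class ScopedDebugFilter(logging.Filter):
    """Pass INFO and above always; pass DEBUG only for flagged tenants."""

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.INFO:
            return True
        if record.levelno == logging.DEBUG:
            return getattr(record, "tenantId", None) in DEBUG_TENANTS
        return False  # anything below DEBUG (TRACE-like) stays off


def make_record(level: int, tenant_id: str) -> logging.LogRecord:
    """Build a record by hand so the filter can be exercised in isolation."""
    record = logging.LogRecord("demo", level, "demo.py", 0, "msg", None, None)
    record.tenantId = tenant_id
    return record
```

Attaching the filter to a handler rather than a logger keeps the scoping decision in one place for all loggers that share that handler.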

3. Structured Log Envelope

{
  "timestamp": "2025-11-22T10:20:30.123Z",
  "level": "INFO",
  "logger": "tenant.provisioning",
  "message": "Tenant provisioned",
  "traceId": "trace-abc123",
  "spanId": "span-xyz456",
  "tenantId": "ten-001",
  "userId": "usr-001",
  "requestId": "req-789",
  "host": "api-1",
  "service": "core-api",
  "env": "prod",
  "durationMs": 532,
  "severityCode": 9,
  "tags": ["provisioning","activation"],
  "metadata": {"plan": "trial"},
  "pii": {"emailHash": "sha256:..."}
}

Notes: severityCode is optional and maps to the OpenTelemetry severity number (9 = INFO). The pii object carries hashed values only, never raw ones.

4. Field Rules

traceId mandatory for all request-scoped logs; generate at ingress if missing.
spanId for component-level operations (DB call, external API); omit only for root log lines.
requestId maps 1:1 to HTTP request; for async messages use messageId.
Always include tenantId when context established; omit before provisioning.
Avoid arrays of large objects in logs; summarize counts.
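The envelope and field rules could be enforced in a formatter owned by the central logging adapter. This is a sketch under assumptions, not the platform's adapter: the class name `EnvelopeFormatter` and the `CONTEXT_FIELDS` tuple are illustrative, and context values are expected to arrive as attributes on the log record (for example via `extra=`).

```python
import json
import logging
from datetime import datetime, timezone

# Context fields copied into the envelope only when the record carries them,
# so tenantId is simply absent before provisioning.
CONTEXT_FIELDS = ("traceId", "spanId", "tenantId", "userId", "requestId", "durationMs")


class EnvelopeFormatter(logging.Formatter):
    """Serialize a log record as one JSON object in the envelope shape."""

    def __init__(self, service: str, env: str, host: str) -> None:
        super().__init__()
        self.static = {"host": host, "service": service, "env": env}

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        envelope = {
            # ISO 8601 in UTC with millisecond precision and a trailing Z.
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            **self.static,
        }
        for name in CONTEXT_FIELDS:
            value = getattr(record, name, None)
            if value is not None:
                envelope[name] = value
        return json.dumps(envelope)
```

Usage would look like `logger.info("Tenant provisioned", extra={"traceId": "trace-abc123", "tenantId": "ten-001"})` with the formatter installed on the handler.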

5. PII & Secrets Handling

Data	Allowed?	Approach
Raw emails	No	Hash (salted SHA-256)
Passwords	No	Never logged
Access tokens	No	Truncate + hash if absolutely needed for correlation
Personal names	Avoid	Use userId reference
Document content	No	Log documentId only

Redaction pipeline: sanitize BEFORE serialization; enforce via central logging adapter.
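A sanitize step along these lines could run inside the adapter before serialization. It is a sketch only: the field names (`email`, `password`, `accessToken`, `documentContent`), the hard-coded salt, and the reading of "truncate + hash" as "hash the full token, keep a short digest prefix" are all assumptions.

```python
import hashlib

SALT = b"example-salt"  # assumption: a real salt would come from secret config
DROP_KEYS = {"password", "documentContent"}  # never logged in any form
TOKEN_DIGEST_CHARS = 12  # short prefix: enough to correlate, not to replay


def hash_value(value: str) -> str:
    """Salted SHA-256, prefixed so readers can tell the field is hashed."""
    digest = hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


def sanitize(fields: dict) -> dict:
    """Return a copy of fields that is safe to serialize into a log line."""
    clean = {}
    for key, value in fields.items():
        if key in DROP_KEYS:
            continue
        if key == "email":
            clean.setdefault("pii", {})["emailHash"] = hash_value(value)
        elif key == "accessToken":
            # Hash the full token, then keep only a truncated digest.
            digest = hash_value(value)[len("sha256:"):]
            clean["accessTokenHash"] = digest[:TOKEN_DIGEST_CHARS]
        else:
            clean[key] = value
    return clean
```

Because the function builds a new dict, the caller's original data is left untouched and nothing raw can leak through a later mutation.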

6. Tracing Model

OpenTelemetry for spans & context propagation.
Root span: http.server or messaging.consume.
Key child spans: db.query, auth.validate, accrual.batch, email.send.
Correlate analytics events by reusing current traceId.
Span attribute naming: dot-separated namespaces with snake_case segments (tenant.id, leave.type_id, error.code).

Mandatory Span Attributes

Attribute	Example	Reason
`tenant.id`	ten-001	Multi-tenant filtering
`user.id`	usr-002	User action correlation
`http.route`	/api/signup	Performance segmentation
`db.system`	postgresql	Tech diagnostics
`error.type`	ValidationError	Faster grouping
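The naming convention and the mandatory attribute list can be checked mechanically, for example in a test helper. The sketch below is illustrative; the regular expression is one possible encoding of the convention, and which attributes are required for which span kind is left to the caller because the table does not specify it.

```python
import re

# One or more dot-separated segments after the namespace, each lowercase
# snake_case: tenant.id, leave.type_id, error.code.
ATTRIBUTE_NAME = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")


def invalid_attribute_names(attributes: dict) -> list:
    """Return attribute names that break the naming convention, sorted."""
    return sorted(name for name in attributes if not ATTRIBUTE_NAME.match(name))


def missing_attributes(attributes: dict, required: set) -> list:
    """Return required attribute names absent from the span, sorted."""
    return sorted(required - set(attributes))
```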

Span Events

Use for discrete in-span milestones (e.g., accrual.employee_processed). Keep payload small.

7. Sampling Strategy

Default: sample 100% of traces for MVP (low volume). Revisit once traffic exceeds 100 RPS.
Future adaptive sampling: keep 100% of ERROR spans and sample successful requests on high-volume endpoints probabilistically (10–30%).
Provide dynamic configuration via env/remote toggle.
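The future adaptive rule could be sketched as a single decision function. Hashing the traceId (an implementation choice, not something this document prescribes) makes the decision deterministic, so every service reaches the same verdict for the same trace. Since the error status is only known once a span has finished, this sketch assumes the decision is taken at export time; `success_rate` stands in for the env or remote toggle.

```python
import hashlib


def should_keep_trace(trace_id: str, has_error: bool, success_rate: float = 0.2) -> bool:
    """Keep every errored trace; keep successful ones with probability success_rate."""
    if has_error:
        return True
    # Map the traceId to a stable number in [0, 1).
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_rate
```

Setting `success_rate` to 1.0 reproduces the MVP default of keeping everything.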

8. Correlation Across Async Messaging

Inject traceId and spanId headers/properties into messages (e.g., x-trace-id).
Consumer creates a child span if trace context is present; otherwise it starts a new trace with the attribute incomplete_trace=true.
Preserve ordering metadata (sequence) if needed for debugging streams.
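The producer and consumer sides of this rule can be sketched without any tracing library. The `SpanContext` type is hypothetical and exists only for illustration, and `x-span-id` is an assumed header name (only `x-trace-id` is named above). In practice the OpenTelemetry propagators would carry the context.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpanContext:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)


def new_id(nbytes: int) -> str:
    """Random hex identifier (16 bytes for traces, 8 for spans)."""
    return secrets.token_hex(nbytes)


def inject(ctx: SpanContext, headers: dict) -> dict:
    """Producer side: copy the current trace context into message headers."""
    return {**headers, "x-trace-id": ctx.trace_id, "x-span-id": ctx.span_id}


def start_consumer_span(headers: dict) -> SpanContext:
    """Consumer side: continue the trace if present, else start a flagged one."""
    trace_id = headers.get("x-trace-id")
    if trace_id:
        return SpanContext(trace_id, new_id(8), parent_span_id=headers.get("x-span-id"))
    return SpanContext(new_id(16), new_id(8), attributes={"incomplete_trace": True})
```

The `incomplete_trace` attribute makes orphaned consumer traces easy to filter when hunting for producers that fail to inject context.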

9. Performance Budget Logging

Log p95 latency summary per endpoint daily (aggregated metric, not individual logs) to reduce noise.
Use metrics (histograms) rather than logs for high-frequency timing.
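To illustrate why a histogram is enough for the daily p95 summary, here is a small fixed-bucket sketch. The class name and bucket bounds are assumptions; a metrics library would normally provide this. The quantile it returns is the upper bound of the bucket holding the requested rank, so it is an approximation whose precision depends on bucket width.

```python
import bisect
import math


class LatencyHistogram:
    """Fixed-bucket histogram: memory stays constant regardless of traffic."""

    def __init__(self, bounds_ms):
        self.bounds = sorted(bounds_ms)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot = overflow

    def observe(self, duration_ms: float) -> None:
        # Bucket i holds values <= bounds[i] and above the previous bound.
        self.counts[bisect.bisect_left(self.bounds, duration_ms)] += 1

    def quantile_upper_bound(self, q: float) -> float:
        """Upper bound of the bucket that contains the q-th quantile."""
        total = sum(self.counts)
        if total == 0:
            raise ValueError("no observations")
        rank = math.ceil(q * total)
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= rank:
                return self.bounds[i] if i < len(self.bounds) else math.inf
        return math.inf
```

A daily job would then emit one summary per endpoint from `quantile_upper_bound(0.95)` instead of one log line per request.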

10. Error Logging Patterns

Example pattern:

{
  "level": "ERROR",
  "message": "Activation token invalid",
  "error.type": "InvalidTokenError",
  "error.message": "Token expired",
  "error.stack": "(stack truncated)",
  "traceId": "trace-t1",
  "tenantId": "ten-001",
  "invitationId": "inv-77"
}

Rules:

Truncate stack to first 20 lines; avoid multi-MB payloads.
Provide domain identifiers (e.g., invitationId) for direct trace triage.
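A helper like the following could produce the error fields of the pattern above with the stack limit applied. The function names are illustrative, and the caller is assumed to add message, traceId and tenantId through the usual envelope.

```python
import traceback

MAX_STACK_LINES = 20


def truncate_stack(stack: str, max_lines: int = MAX_STACK_LINES) -> str:
    """Keep the first max_lines lines and mark the cut explicitly."""
    lines = stack.splitlines()
    if len(lines) <= max_lines:
        return stack
    return "\n".join(lines[:max_lines] + ["(stack truncated)"])


def error_log_fields(exc: BaseException, **domain_ids) -> dict:
    """Build the error.* fields plus domain identifiers such as invitationId."""
    stack = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return {
        "level": "ERROR",
        "error.type": type(exc).__name__,
        "error.message": str(exc),
        "error.stack": truncate_stack(stack),
        **domain_ids,
    }
```

Because error.type and error.message are captured separately, cutting the tail of a long stack does not lose the exception itself.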

11. Log vs Event vs Metric Decision Table

Need	Use
Debugging failure root cause	Log (structured)
End-to-end latency breakdown	Trace spans
Business funnel conversion	Analytics events
Throughput / latency statistics	Metrics (histograms, counters)
Rare milestone (tenant provisioned)	Log + Event

12. Tooling & Libraries (Proposed)

Backend: OpenTelemetry SDK, structured logging (pino / zap style depending on language), central logging adapter.
Frontend: Lightweight tracing (W3C Trace Context) for API calls; minimal user action INFO logs for critical flows only.
Log aggregation: ELK / OpenSearch or vendor; must support JSON ingestion & field indexing.

13. Validation & CI Checks

Lint rule forbids console.log (frontend) outside approved wrapper.
Static scan for banned patterns (e.g., password=).
Schema validation of the log envelope (JSON Schema) is optional for now; planned as a future enhancement.
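The static scan could start as a few regular expressions run over changed files in CI. This sketch covers the two patterns mentioned above; the labels are made up for illustration, and the approved-wrapper exemption for console.log is not modeled, so a real check would skip the wrapper's own file.

```python
import re

# Label -> pattern. Both patterns come from the checks listed above.
BANNED_PATTERNS = {
    "inline password": re.compile(r"password\s*=", re.IGNORECASE),
    "raw console.log": re.compile(r"\bconsole\.log\("),
}


def scan_source(text: str) -> list:
    """Return (line number, label) for every banned pattern occurrence."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in BANNED_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings
```

A CI step would fail the build when the returned list is non-empty and print the findings for the author.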

14. Operational Playbooks (Pointers)

High ERROR surge: query by error.type, examine the correlated traceId group, and escalate if errors persist on more than 5% of requests.
Latency regression: compare http.route span metrics vs p95 budget; diff span dependency timings (db.query).

15. Open Questions

Retention beyond baseline: which regulatory requirements might extend it? (HR records may imply longer audit needs.)
Should domain attributes adopt semantic conventions (an extension of the OpenTelemetry semantic conventions)?
Do analytics events that are reused in logs need anonymization transforms?
