
Logging & Tracing Guidelines

Generated: 2025-11-22 | Version: 1.0
Scope: Core platform (activation, HR, leave, recruitment); baseline observability standards.

1. Principles

  • Actionable: Every log / span must help diagnose or measure behavior (no noise).
  • Structured: Emit JSON objects; avoid free‑form string concatenation.
  • Correlated: Propagate a single traceId + spanId across HTTP, async messages, jobs.
  • Separated Concerns: Use events (analytics) for product metrics; logs/spans for technical diagnostics.
  • Safe: Never log raw PII or secrets; hash or redact early.

2. Log Levels (Server)

Level | Use Case | Retention
TRACE | Deep development diagnostics; disabled in prod by default | Short (24h if enabled)
DEBUG | Non-critical flow details (branch decisions) | 7 days
INFO | High-level lifecycle milestones (provisioned tenant, accrual job start/finish) | 30 days
WARN | Recoverable issues (retry scheduled, invalid token attempt) | 60 days
ERROR | Failed operation where user impact is likely | 90 days
FATAL | Process crash / unrecoverable state | 90 days

Guideline: Default prod verbosity is INFO and above. Enable DEBUG only temporarily, scoped to specific loggers via dynamic config or a feature flag.
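
A minimal sketch of the scoped DEBUG guideline, using Python's standard logging module. The override set and the function name are hypothetical; in practice the set would be fed by whatever dynamic config or feature flag mechanism the platform adopts.

```python
import logging

# Hypothetical override set: logger names temporarily raised to DEBUG.
# Assumed to come from dynamic config or a feature flag in a real system.
DEBUG_OVERRIDES = {"tenant.provisioning"}


def apply_levels(overrides=DEBUG_OVERRIDES, default_level=logging.INFO):
    """Keep the root logger at INFO and lower only the listed loggers to DEBUG."""
    logging.getLogger().setLevel(default_level)
    for name in overrides:
        logging.getLogger(name).setLevel(logging.DEBUG)


apply_levels()
```

Loggers that are not listed inherit the root level, so DEBUG output stays limited to the component under investigation.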

3. Structured Log Envelope

{
  "timestamp": "2025-11-22T10:20:30.123Z",
  "level": "INFO",
  "logger": "tenant.provisioning",
  "message": "Tenant provisioned",
  "traceId": "trace-abc123",
  "spanId": "span-xyz456",
  "tenantId": "ten-001",
  "userId": "usr-001",
  "requestId": "req-789",
  "host": "api-1",
  "service": "core-api",
  "env": "prod",
  "durationMs": 532,
  "severityCode": 9, // optional; OpenTelemetry SeverityNumber (INFO = 9)
  "tags": ["provisioning","activation"],
  "metadata": {"plan": "trial"},
  "pii": {"emailHash": "sha256:..."} // never raw
}
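
One way to produce this envelope is a custom formatter on top of the standard logging module. The sketch below is illustrative only: the class name EnvelopeFormatter and the CONTEXT_FIELDS tuple are assumptions, and it covers a subset of the envelope fields.

```python
import json
import logging
from datetime import datetime, timezone

# Context fields copied from the log record into the envelope when present.
CONTEXT_FIELDS = ("traceId", "spanId", "tenantId", "userId", "requestId", "durationMs")


class EnvelopeFormatter(logging.Formatter):
    """Hypothetical formatter that renders each record as one JSON envelope."""

    def __init__(self, service, env, host):
        super().__init__()
        self.static = {"service": service, "env": env, "host": host}

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        envelope = {
            # ISO 8601 in UTC with millisecond precision and a trailing Z.
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            **self.static,
        }
        for field in CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                envelope[field] = value
        return json.dumps(envelope)


# Build a record by hand for illustration; in normal use the context fields
# would arrive through the `extra=` argument of a logger call.
sample = logging.makeLogRecord({
    "name": "tenant.provisioning",
    "levelname": "INFO",
    "levelno": logging.INFO,
    "msg": "Tenant provisioned",
    "traceId": "trace-abc123",
    "tenantId": "ten-001",
})
line = EnvelopeFormatter("core-api", "prod", "api-1").format(sample)
```

Fields that are not set on the record are left out of the output rather than emitted as null.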

4. Field Rules

  • traceId mandatory for all request-scoped logs; generate at ingress if missing.
  • spanId for component-level operations (DB call, external API); omit only for root log lines.
  • requestId maps 1:1 to HTTP request; for async messages use messageId.
  • Always include tenantId when context established; omit before provisioning.
  • Avoid arrays of large objects in logs; summarize counts.
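
The field rules above could be applied at ingress roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the x-trace-id header is borrowed from the messaging section, and the generated ID uses 16 random bytes (32 hex characters).

```python
import secrets


def ensure_trace_id(headers):
    """Return the incoming trace id, or generate one at ingress if missing."""
    trace_id = headers.get("x-trace-id")
    return trace_id if trace_id else secrets.token_hex(16)


def build_fields(headers, tenant_id=None, request_id=None, message_id=None):
    """Assemble the correlation fields for one request or one async message."""
    fields = {"traceId": ensure_trace_id(headers)}
    if tenant_id is not None:
        # tenantId is omitted until tenant context exists (before provisioning).
        fields["tenantId"] = tenant_id
    if request_id is not None:
        fields["requestId"] = request_id
    elif message_id is not None:
        fields["messageId"] = message_id
    return fields


def summarize(items, label):
    """Replace a large list with its count so the log line stays small."""
    return {f"{label}Count": len(items)}
```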

5. PII & Secrets Handling

Data | Allowed? | Approach
Raw emails | No | Hash (salted SHA-256)
Passwords | No | Never logged
Access tokens | No | Truncate + hash if absolutely needed for correlation
Personal names | Avoid | Use userId reference
Documents content | No | Log documentId only

Redaction pipeline: sanitize BEFORE serialization; enforce via central logging adapter.
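
A minimal sketch of a sanitize step that runs before serialization. The key names, the SALT placeholder, and the choice to drop access tokens entirely (the default "never logged" case) are assumptions for illustration, not a prescribed adapter API.

```python
import hashlib

SALT = b"replace-with-secret-salt"  # placeholder; assumed to come from secret config
DROP_KEYS = {"password", "accessToken", "documentContent"}  # never logged
HASH_KEYS = {"email"}  # logged only as a salted hash


def hash_value(value, salt=SALT):
    """Salted SHA-256, prefixed like the emailHash example in the envelope."""
    digest = hashlib.sha256(salt + value.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


def sanitize(fields):
    """Return a copy of fields that is safe to serialize."""
    clean = {}
    pii = {}
    for key, value in fields.items():
        if key in DROP_KEYS:
            continue
        if key in HASH_KEYS:
            pii[f"{key}Hash"] = hash_value(value)
        else:
            clean[key] = value
    if pii:
        clean["pii"] = pii
    return clean
```

Routing every log call through one function like this keeps the redaction rules in a single place instead of relying on each call site.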

6. Tracing Model

  • OpenTelemetry for spans & context propagation.
  • Root span: http.server or messaging.consume.
  • Key child spans: db.query, auth.validate, accrual.batch, email.send.
  • Correlate analytics events by reusing current traceId.
  • Span attribute naming: dot-separated namespaces with snake_case segments (tenant.id, leave.type_id, error.code).

Mandatory Span Attributes

Attribute | Example | Reason
tenant.id | ten-001 | Multi-tenant filtering
user.id | usr-002 | User action correlation
http.route | /api/signup | Performance segmentation
db.system | postgresql | Tech diagnostics
error.type | ValidationError | Faster grouping
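
The attribute naming convention can be checked mechanically. The sketch below assumes every key has at least one dot-separated namespace and lowercase snake_case segments, as in the examples in this section; the pattern and function name are hypothetical.

```python
import re

# Lowercase snake_case segments joined by dots, e.g. leave.type_id.
ATTRIBUTE_KEY = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")


def invalid_attribute_keys(attributes):
    """Return the attribute keys that break the naming convention, sorted."""
    return sorted(key for key in attributes if not ATTRIBUTE_KEY.match(key))
```

A check like this could run in unit tests around span creation so that camelCase keys such as tenantId do not leak from the log envelope into span attributes.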

Span Events

Use for discrete in-span milestones (e.g., accrual.employee_processed). Keep payload small.

7. Sampling Strategy

  • Default: sample 100% of traces for the MVP (low volume). Revisit once traffic exceeds 100 RPS.
  • Future adaptive sampling: keep 100% of ERROR spans and sample a probabilistic share (10–30%) of successful requests on high-volume endpoints.
  • Provide dynamic configuration via env/remote toggle.
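
The sampling policy could be sketched as a single decision function. This is an illustration with hypothetical names; because error status is only known once a span has finished, keeping every ERROR span implies the decision is taken after completion rather than at the start of the trace.

```python
import hashlib


def should_sample(trace_id, is_error, ratio=1.0):
    """Keep every error trace; keep a deterministic fraction of the rest."""
    if is_error or ratio >= 1.0:
        return True
    # Hash the trace id so that every service reaches the same decision
    # for the same trace without coordination.
    bucket = int(hashlib.sha256(trace_id.encode("utf-8")).hexdigest()[:8], 16)
    return bucket < ratio * 0x100000000
```

The ratio would be read from the env or remote toggle mentioned above, starting at 1.0 for the MVP and dropping into the 0.1 to 0.3 range for high-volume endpoints later.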

8. Correlation Across Async Messaging

  • Inject traceId and spanId headers/properties into messages (e.g., x-trace-id).
  • Consumer creates child span if trace present; else start new trace with incomplete_trace=true attribute.
  • Preserve ordering metadata (sequence) if needed for debugging streams.
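
The producer and consumer sides of this rule might look like the sketch below. It models spans as plain dictionaries to stay dependency-free; x-span-id is an assumed header name alongside the x-trace-id example, and the function names are hypothetical.

```python
import secrets


def inject(headers, trace_id, span_id):
    """Producer side: copy the current trace context into message headers."""
    out = dict(headers)
    out["x-trace-id"] = trace_id
    out["x-span-id"] = span_id  # assumed header name
    return out


def start_consumer_span(headers):
    """Consumer side: continue the trace if present, else start a flagged one."""
    span = {"name": "messaging.consume", "spanId": secrets.token_hex(8), "attributes": {}}
    trace_id = headers.get("x-trace-id")
    if trace_id:
        span["traceId"] = trace_id
        span["parentSpanId"] = headers.get("x-span-id")
    else:
        # No upstream context: start a new trace and mark it as incomplete.
        span["traceId"] = secrets.token_hex(16)
        span["parentSpanId"] = None
        span["attributes"]["incomplete_trace"] = True
    return span
```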

9. Performance Budget Logging

  • Log p95 latency summary per endpoint daily (aggregated metric, not individual logs) to reduce noise.
  • Use metrics (histograms) rather than logs for high-frequency timing.
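
To make the daily summary concrete, the following sketch computes a nearest-rank p95 per endpoint from raw durations. It only illustrates the arithmetic; a production setup would more likely read the percentile from histogram metrics, and the function names are hypothetical.

```python
def p95(durations_ms):
    """Nearest-rank 95th percentile of a non-empty list of durations."""
    if not durations_ms:
        raise ValueError("p95 requires at least one sample")
    ordered = sorted(durations_ms)
    # Integer form of ceil(0.95 * n), which avoids float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def daily_summary(samples):
    """samples: iterable of (route, duration_ms); returns one row per route."""
    by_route = {}
    for route, duration in samples:
        by_route.setdefault(route, []).append(duration)
    return {
        route: {"count": len(values), "p95Ms": p95(values)}
        for route, values in by_route.items()
    }
```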

10. Error Logging Patterns

Example pattern:

{
  "level": "ERROR",
  "message": "Activation token invalid",
  "error.type": "InvalidTokenError",
  "error.message": "Token expired",
  "error.stack": "(stack truncated)",
  "traceId": "trace-t1",
  "tenantId": "ten-001",
  "invitationId": "inv-77"
}

Rules:

  • Truncate stack to first 20 lines; avoid multi-MB payloads.
  • Provide domain identifiers (e.g., invitationId) for direct trace triage.
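
A minimal sketch of building these error fields from a Python exception, including the 20-line stack limit. The helper names are hypothetical, and domain identifiers are passed as keyword arguments.

```python
import traceback

MAX_STACK_LINES = 20


def truncate_stack(stack_text, max_lines=MAX_STACK_LINES):
    """Keep the first max_lines lines and mark the cut if anything was dropped."""
    lines = stack_text.splitlines()
    if len(lines) <= max_lines:
        return "\n".join(lines)
    return "\n".join(lines[:max_lines] + ["(stack truncated)"])


def error_fields(exc, **domain_ids):
    """Build the error portion of a log line from an exception."""
    stack = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return {
        "level": "ERROR",
        "error.type": type(exc).__name__,
        "error.message": str(exc),
        "error.stack": truncate_stack(stack),
        **domain_ids,
    }
```

For example, error_fields(exc, invitationId="inv-77") yields the error fields plus the identifier needed to jump straight to the affected invitation.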

11. Log vs Event vs Metric Decision Table

Need | Use
Debugging failure root cause | Log (structured)
End-to-end latency breakdown | Trace spans
Business funnel conversion | Analytics events
Throughput / latency statistics | Metrics (histograms, counters)
Rare milestone (tenant provisioned) | Log + Event

12. Tooling & Libraries (Proposed)

  • Backend: OpenTelemetry SDK, structured logging (pino / zap style depending on language), central logging adapter.
  • Frontend: Lightweight tracing (W3C Trace Context) for API calls; minimal user action INFO logs for critical flows only.
  • Log aggregation: ELK / OpenSearch or vendor; must support JSON ingestion & field indexing.

13. Validation & CI Checks

  • Lint rule forbids console.log (frontend) outside approved wrapper.
  • Static scan for banned patterns (e.g., password=).
  • Optional JSON Schema validation of the log envelope (future enhancement).
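
The static scan could start as a small line-based checker like the one below. The rule names and regular expressions are assumptions for illustration; a real CI setup would also need an allow-list for the approved logging wrapper.

```python
import re

# Hypothetical rule set; each pattern flags one banned construct.
BANNED = {
    "password-assignment": re.compile(r"password\s*=", re.IGNORECASE),
    "raw-console-log": re.compile(r"\bconsole\.log\("),
}


def scan(source):
    """Return (line_number, rule_name) for every banned pattern found."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        for rule, pattern in BANNED.items():
            if pattern.search(line):
                findings.append((number, rule))
    return findings
```

A CI job would run scan over changed files and fail the build when the returned list is non-empty.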

14. Operational Playbooks (Pointers)

  • High ERROR surge: query by error.type, examine the correlated traceId group, and escalate if errors stay above 5% of requests.
  • Latency regression: compare http.route span metrics vs p95 budget; diff span dependency timings (db.query).

15. Open Questions

  1. Retention beyond baseline: which regulatory requirements might extend? (HR records may imply longer audit needs.)
  2. Adopt semantic conventions for domain attributes (OpenTelemetry semantic conventions extension)?
  3. Need anonymization transforms for analytics events reused in logs?
