Monitoring & Observability #201

@jmgilman

Monitoring & Observability

Overview

Implement comprehensive monitoring and observability features including Prometheus metrics, health check endpoints, structured logging, and audit logging to ensure operational visibility and system reliability.

Objective

Create a robust monitoring system that provides real-time insights into service health, performance metrics, and security events while maintaining detailed audit trails for compliance and troubleshooting.

Canonical Scope

  • This document is the canonical source for:
    • Structured logging approach and helpers
    • Request logging middleware behavior
    • Health and readiness endpoints, semantics, and status criteria
    • Metrics definitions and exposure
  • For audit storage and retention, see 05 Database Layer. For validation and error schema, see 07 Security & Validation.

Tasks

Prometheus Metrics Implementation

  • Set up Prometheus metrics in internal/monitoring/metrics.go
  • Implement request metrics:
    var (
      RequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
          Name: "certificate_api_requests_total",
          Help: "Total number of API requests",
        },
        []string{"method", "status"},
      )
    
      RequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
          Name: "certificate_api_request_duration_seconds",
          Help: "Request duration in seconds",
          Buckets: prometheus.DefBuckets,
        },
        []string{"method", "status"}, // percentiles are derived from buckets at query time, not stored as a label
      )
    )
  • Implement certificate metrics:
    ActiveCertificates = prometheus.NewGaugeVec(
      prometheus.GaugeOpts{
        Name: "certificate_api_active_certificates",
        Help: "Number of active certificates",
      },
      []string{"ca"},
    )
    
    CertificatesExpiringSoon = prometheus.NewGaugeVec(
      prometheus.GaugeOpts{
        Name: "certificate_api_certificates_expiring_soon",
        Help: "Number of certificates expiring soon",
      },
      []string{"days", "ca"},
    )
    
    CertificatesIssuedTotal = prometheus.NewCounterVec(
      prometheus.CounterOpts{
        Name: "certificate_api_certificates_issued_total",
        Help: "Total number of certificates issued",
      },
      []string{"profile", "ca"},
    )
    
    CertificatesRenewedTotal = prometheus.NewCounterVec(
      prometheus.CounterOpts{
        Name: "certificate_api_certificates_renewed_total",
        Help: "Total number of certificates renewed",
      },
      []string{"profile", "ca"},
    )
  • Implement system health metrics:
    ServiceUp = prometheus.NewGauge(
      prometheus.GaugeOpts{
        Name: "certificate_api_up",
        Help: "Service availability (1 = up, 0 = down)",
      },
    )
    
    DatabaseConnectionsActive = prometheus.NewGauge(
      prometheus.GaugeOpts{
        Name: "certificate_api_database_connections_active",
        Help: "Number of active database connections",
      },
    )
    
    PCAAPICalls = prometheus.NewCounterVec(
      prometheus.CounterOpts{
        Name: "certificate_api_pca_api_calls_total",
        Help: "Total number of AWS PCA API calls",
      },
      []string{"operation", "status"},
    )
    
    PCAAPILatency = prometheus.NewHistogramVec(
      prometheus.HistogramOpts{
        Name: "certificate_api_pca_api_latency_seconds",
        Help: "AWS PCA API call latency",
        Buckets: prometheus.DefBuckets,
      },
      []string{"operation"}, // percentiles are computed from buckets at query time
    )
  • Implement security metrics:
    AuthenticationFailures = prometheus.NewCounterVec(
      prometheus.CounterOpts{
        Name: "certificate_api_authentication_failures_total",
        Help: "Total number of authentication failures",
      },
      []string{"reason"},
    )
  • Register all metrics with Prometheus registry
  • Create metrics endpoint on port 9090 as specified
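
A minimal sketch of serving metrics on a dedicated port with a bounded shutdown, using only the standard library. The names `newMetricsServer` and `placeholderMetrics` are hypothetical, and the placeholder handler stands in for `promhttp.Handler()` from the Prometheus Go client, which is what a real service would mount at `/metrics`. The 5-second timeouts are assumptions, not values from this issue.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// placeholderMetrics is a stand-in handler (assumption). It writes one
// sample in the Prometheus text exposition format.
type placeholderMetrics struct{}

func (placeholderMetrics) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprintln(w, "certificate_api_up 1")
}

// newMetricsServer builds a server that exposes only /metrics, so the
// metrics port never serves API routes.
func newMetricsServer(addr string, metrics http.Handler) *http.Server {
	mux := http.NewServeMux()
	mux.Handle("/metrics", metrics)
	return &http.Server{
		Addr:              addr,
		Handler:           mux,
		ReadHeaderTimeout: 5 * time.Second,
	}
}

func main() {
	srv := newMetricsServer(":9090", placeholderMetrics{})

	// Serve in the background so main can wait for a shutdown signal.
	go func() {
		if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			fmt.Fprintln(os.Stderr, "metrics server:", err)
		}
	}()

	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()
	<-ctx.Done()

	// Give in-flight scrapes a bounded amount of time to finish.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		fmt.Fprintln(os.Stderr, "metrics shutdown:", err)
	}
}
```

Keeping the metrics listener separate from the API router means it can be firewalled independently and shut down after the API has drained.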

Health Check Implementation

  • Implement health check endpoint in internal/monitoring/health.go
  • Create health check handler at /health:
    type HealthCheck struct {
      Status    string                 `json:"status"`
      Checks    map[string]CheckResult `json:"checks"`
      Timestamp time.Time              `json:"timestamp"`
    }
    
    type CheckResult struct {
      Status  string `json:"status"`
      Message string `json:"message,omitempty"`
    }
  • Implement health check criteria:
    • Database connectivity: Connection pool has available connections
    • AWS PCA connectivity: Can list CAs successfully
    • JWKS cache status: Cache populated and not expired
    • Certificate expiry: No CA certificates expiring within 30 days
  • Return appropriate response codes:
    • 200: All checks healthy
    • 503: Any check unhealthy
  • Implement readiness check endpoint at /ready
  • Ensure health checks complete in <100ms as specified
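
The aggregation rule above (200 when every check passes, 503 otherwise) can be sketched with the standard library alone. The `Check` type, `runWithTimeout`, `Evaluate`, and the 80ms per-check budget are assumptions made for illustration. The probes in `main` are stubs rather than real database or AWS PCA calls.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"time"
)

type CheckResult struct {
	Status  string `json:"status"`
	Message string `json:"message,omitempty"`
}

type HealthCheck struct {
	Status    string                 `json:"status"`
	Checks    map[string]CheckResult `json:"checks"`
	Timestamp time.Time              `json:"timestamp"`
}

// Check is a hypothetical probe signature: nil means healthy.
type Check func() error

// runWithTimeout reports a probe as unhealthy if it does not return in time,
// so one slow dependency cannot stall the whole endpoint. A probe that never
// returns still leaks its goroutine; real probes should honor a context.
func runWithTimeout(check Check, timeout time.Duration) CheckResult {
	done := make(chan error, 1) // buffered so a late probe can still send
	go func() { done <- check() }()
	select {
	case err := <-done:
		if err != nil {
			return CheckResult{Status: "unhealthy", Message: err.Error()}
		}
		return CheckResult{Status: "healthy"}
	case <-time.After(timeout):
		return CheckResult{Status: "unhealthy", Message: "check timed out"}
	}
}

// Evaluate runs all probes concurrently and maps the outcome to 200 or 503.
func Evaluate(checks map[string]Check, timeout time.Duration) (HealthCheck, int) {
	type named struct {
		name   string
		result CheckResult
	}
	results := make(chan named, len(checks))
	for name, check := range checks {
		go func(name string, check Check) {
			results <- named{name, runWithTimeout(check, timeout)}
		}(name, check)
	}
	report := HealthCheck{
		Status:    "healthy",
		Checks:    map[string]CheckResult{},
		Timestamp: time.Now().UTC(),
	}
	code := http.StatusOK
	for range checks {
		r := <-results
		report.Checks[r.name] = r.result
		if r.result.Status != "healthy" {
			report.Status = "unhealthy"
			code = http.StatusServiceUnavailable
		}
	}
	return report, code
}

func main() {
	checks := map[string]Check{
		"database":   func() error { return nil },
		"jwks_cache": func() error { return errors.New("cache expired") },
	}
	report, code := Evaluate(checks, 80*time.Millisecond)
	out, _ := json.Marshal(report)
	fmt.Println(code, string(out))
}
```

Running the probes concurrently keeps the total latency close to the slowest single check, which matters for the <100ms target when one of the checks calls an external API.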

Structured Logging with slog

  • Implement structured JSON logging using standard library slog
  • Configure slog in internal/logging/logger.go:
    func NewLogger(level slog.Leveler) *slog.Logger { // accepts slog.Level or *slog.LevelVar
      opts := &slog.HandlerOptions{
        Level: level,
        AddSource: false,
      }
      handler := slog.NewJSONHandler(os.Stdout, opts)
      return slog.New(handler)
    }
  • Create context-aware logging helpers:
    func LoggerWithRequestID(logger *slog.Logger, requestID string) *slog.Logger {
      return logger.With("request_id", requestID)
    }
    
    func LoggerWithComponent(logger *slog.Logger, component string) *slog.Logger {
      return logger.With("component", component)
    }
  • Include standard attributes in log entries:
    • Timestamp (automatic with slog)
    • Level (debug, info, warn, error)
    • Request ID (via context)
    • Component/module
    • Message
  • Implement request logging middleware using slog:
    • Log request start with method, path, client IP
    • Log request completion with status, duration
    • Include request ID for correlation
  • Configure log levels via configuration (default: info):
    var logLevel = new(slog.LevelVar) // zero value is slog.LevelInfo; can be changed at runtime
    // Call from configuration code, not at package scope: logLevel.Set(slog.LevelInfo)
  • Ensure sensitive data is not logged (tokens, keys, etc.)
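
One way to keep sensitive values out of logs and change the level at runtime is sketched below, assuming a `ReplaceAttr` hook on the JSON handler. The key list and the names `isSensitiveKey`, `redact`, and `NewRedactingLogger` are hypothetical. A real list would need review against the fields this service actually logs.

```go
package main

import (
	"io"
	"log/slog"
	"os"
	"strings"
)

// sensitiveKeys is an illustrative list (assumption), not an exhaustive one.
var sensitiveKeys = map[string]bool{
	"authorization": true,
	"token":         true,
	"password":      true,
	"private_key":   true,
}

// isSensitiveKey matches attribute keys case-insensitively.
func isSensitiveKey(key string) bool {
	return sensitiveKeys[strings.ToLower(key)]
}

// redact is a slog ReplaceAttr hook that masks the value of sensitive keys
// before the handler writes the record.
func redact(_ []string, a slog.Attr) slog.Attr {
	if isSensitiveKey(a.Key) {
		return slog.String(a.Key, "[REDACTED]")
	}
	return a
}

// NewRedactingLogger takes a slog.Leveler so that a *slog.LevelVar can be
// passed in and adjusted later without rebuilding the logger.
func NewRedactingLogger(w io.Writer, level slog.Leveler) *slog.Logger {
	opts := &slog.HandlerOptions{Level: level, ReplaceAttr: redact}
	return slog.New(slog.NewJSONHandler(w, opts))
}

func main() {
	level := new(slog.LevelVar) // zero value is slog.LevelInfo
	logger := NewRedactingLogger(os.Stdout, level)

	logger.Debug("dropped while the level is info")
	logger.Info("token validated", "token", "eyJhbGciOi", "subject", "svc-a")

	level.Set(slog.LevelDebug) // takes effect for all later records
	logger.Debug("emitted after the level change")
}
```

Key-based redaction only catches values logged under a known key. It does not inspect message strings, so call sites still need to avoid interpolating secrets into messages.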

Audit Logging System

  • Implement audit logging in internal/monitoring/audit.go
  • Create audit logger that writes to database:
    type AuditLogger struct {
      repository AuditRepository
      logger     *slog.Logger
    }
  • Log the following audit events:
    • Certificate issued/renewed/requested
    • Authentication failures
    • Authorization failures
    • Admin operations
    • System errors
    • CA certificate refresh operations
    • CA certificate expiry warnings
  • Ensure audit log format includes:
    • Timestamp
    • Actor identity
    • Actor IP address
    • Resource type and identifier
    • Action/event type
    • Outcome (success/failure)
    • Additional details in JSONB
  • Implement audit log retention (1 year minimum)
  • Ensure audit logs are write-only (no updates/deletes)
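
The write-only requirement can be expressed in the type system by giving the repository interface an append method and nothing else. This is a sketch under assumptions: the `AuditEvent` field names, the `Append` and `Record` methods, and the in-memory `memoryRepo` are all hypothetical. The actual storage schema is defined in 05 Database Layer.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	"sync"
	"time"
)

// AuditEvent mirrors the fields listed above; the names are assumptions.
type AuditEvent struct {
	Timestamp    time.Time
	ActorID      string
	ActorIP      string
	ResourceType string
	ResourceID   string
	Action       string
	Outcome      string         // "success" or "failure"
	Details      map[string]any // would be stored as JSONB
}

// AuditRepository exposes Append only, so callers have no way to
// update or delete an entry through this interface.
type AuditRepository interface {
	Append(ctx context.Context, e AuditEvent) error
}

// memoryRepo is an in-memory stand-in for the database-backed repository.
type memoryRepo struct {
	mu     sync.Mutex
	events []AuditEvent
}

func (r *memoryRepo) Append(_ context.Context, e AuditEvent) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.events = append(r.events, e)
	return nil
}

type AuditLogger struct {
	repository AuditRepository
	logger     *slog.Logger
}

func NewAuditLogger(repo AuditRepository) *AuditLogger {
	return &AuditLogger{repository: repo, logger: slog.Default()}
}

// Record validates the event, stamps it if needed, and appends it.
func (l *AuditLogger) Record(ctx context.Context, e AuditEvent) error {
	if e.ActorID == "" || e.Action == "" {
		return errors.New("audit event needs an actor and an action")
	}
	if e.Outcome != "success" && e.Outcome != "failure" {
		return fmt.Errorf("invalid outcome %q", e.Outcome)
	}
	if e.Timestamp.IsZero() {
		e.Timestamp = time.Now().UTC()
	}
	if err := l.repository.Append(ctx, e); err != nil {
		// Surface storage failures in the application log as well.
		l.logger.Error("audit write failed", "action", e.Action, "error", err)
		return err
	}
	return nil
}

func main() {
	repo := &memoryRepo{}
	audit := NewAuditLogger(repo)
	err := audit.Record(context.Background(), AuditEvent{
		ActorID:      "svc-a",
		ActorIP:      "10.0.0.7",
		ResourceType: "certificate",
		ResourceID:   "cert-123",
		Action:       "certificate.issued",
		Outcome:      "success",
		Details:      map[string]any{"profile": "server"},
	})
	fmt.Println(err, len(repo.events))
}
```

An interface restriction only protects against application code. Immutability at the storage level would still need database permissions or triggers.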

Metrics Collection Jobs

  • Create background job to collect certificate metrics:
    func CollectCertificateMetrics(repo CertificateRepository) {
      // Run every 5 minutes
      // Count active certificates per CA
      // Count expiring certificates (7, 3, 1 days)
      // Update Prometheus gauges
    }
  • Monitor database connection pool metrics
  • Track AWS PCA API call patterns
  • Monitor CA certificate expiration dates
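
A rough sketch of the collection job follows, split into a pure counting step and a ticker loop that stops when its context is cancelled. `Certificate`, `Snapshot`, `Summarize`, and `Run` are hypothetical names. The sketch assumes the expiry windows are cumulative, so a certificate expiring in two days counts toward both the 3-day and 7-day windows. In the real job the snapshot values would be written to `ActiveCertificates` and `CertificatesExpiringSoon` instead of printed.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Certificate is a reduced, hypothetical view of a stored certificate.
type Certificate struct {
	CA       string
	NotAfter time.Time
}

// Snapshot holds the values the Prometheus gauges would be set to.
type Snapshot struct {
	Active   map[string]int         // CA -> unexpired certificates
	Expiring map[int]map[string]int // window in days -> CA -> count
}

// Summarize counts unexpired certificates per CA and, for each window,
// how many of them expire before now plus that many days.
func Summarize(now time.Time, certs []Certificate, windows []int) Snapshot {
	s := Snapshot{Active: map[string]int{}, Expiring: map[int]map[string]int{}}
	for _, days := range windows {
		s.Expiring[days] = map[string]int{}
	}
	for _, c := range certs {
		if !c.NotAfter.After(now) {
			continue // already expired, not active
		}
		s.Active[c.CA]++
		for _, days := range windows {
			if c.NotAfter.Before(now.AddDate(0, 0, days)) {
				s.Expiring[days][c.CA]++
			}
		}
	}
	return s
}

// Run calls collect once immediately and then on every tick until ctx ends.
func Run(ctx context.Context, interval time.Duration, collect func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	collect()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			collect()
		}
	}
}

func main() {
	now := time.Now()
	certs := []Certificate{
		{CA: "root-a", NotAfter: now.Add(36 * time.Hour)},
		{CA: "root-a", NotAfter: now.Add(90 * 24 * time.Hour)},
	}
	// A short interval and timeout keep the demo brief; the job above
	// would use 5 minutes and the service's shutdown context.
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()
	Run(ctx, 100*time.Millisecond, func() {
		s := Summarize(time.Now(), certs, []int{7, 3, 1})
		fmt.Println(s.Active, s.Expiring)
	})
}
```

Keeping `Summarize` free of I/O makes the counting logic testable with a fixed clock, independent of the database and of Prometheus.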

Performance Monitoring

  • Add request timing to all API endpoints
  • Track database query performance
  • Monitor AWS PCA API latency
  • Implement performance targets verification:
    • Certificate issuance: <2 seconds (95th percentile)
    • Certificate status lookup: <500ms (95th percentile)
    • Health check: <100ms
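
The 95th-percentile targets are checked against histogram buckets rather than raw samples. The sketch below shows how a percentile can be estimated from cumulative bucket counts by linear interpolation, which is the general idea behind PromQL's `histogram_quantile`. It is a simplified illustration: `Bucket` and `EstimateQuantile` are hypothetical names, the sample counts are made up, and `+Inf` bucket handling is omitted.

```go
package main

import (
	"fmt"
	"sort"
)

// Bucket is one cumulative histogram bucket: Count observations were
// less than or equal to UpperBound (in seconds).
type Bucket struct {
	UpperBound float64
	Count      float64
}

// EstimateQuantile finds the bucket containing the target rank and
// interpolates linearly between that bucket's lower and upper bounds.
// Buckets must be sorted by UpperBound with cumulative counts.
func EstimateQuantile(q float64, buckets []Bucket) float64 {
	if len(buckets) == 0 {
		return 0
	}
	total := buckets[len(buckets)-1].Count
	rank := q * total

	// First bucket whose cumulative count reaches the rank.
	i := sort.Search(len(buckets), func(i int) bool { return buckets[i].Count >= rank })
	if i == len(buckets) {
		i = len(buckets) - 1
	}

	lower, prev := 0.0, 0.0
	if i > 0 {
		lower, prev = buckets[i-1].UpperBound, buckets[i-1].Count
	}
	inBucket := buckets[i].Count - prev
	if inBucket == 0 {
		return buckets[i].UpperBound
	}
	return lower + (buckets[i].UpperBound-lower)*(rank-prev)/inBucket
}

func main() {
	// Made-up issuance latencies: 100 requests in total.
	buckets := []Bucket{
		{UpperBound: 0.1, Count: 50},
		{UpperBound: 0.5, Count: 90},
		{UpperBound: 1, Count: 98},
		{UpperBound: 2, Count: 100},
	}
	p95 := EstimateQuantile(0.95, buckets)
	fmt.Printf("p95 estimate: %.4fs, within 2s target: %t\n", p95, p95 < 2)
}
```

Because the estimate can only be as precise as the bucket boundaries, bucket edges near the targets (0.5s and 2s here) give more trustworthy results than the default buckets alone.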

Alerting Configuration

  • Define alert rules for Prometheus:
    • Service down (certificate_api_up == 0)
    • High error rate (>1% of requests failing)
    • CA certificate expiring soon (<30 days)
    • Database connection pool exhausted
    • Authentication failure spike
    • AWS PCA API errors
  • Document alert thresholds and escalation paths

Gin Middleware Integration

  • Create Prometheus middleware for Gin:
    func PrometheusMiddleware() gin.HandlerFunc {
      return func(c *gin.Context) {
        start := time.Now()
    
        c.Next()
    
        duration := time.Since(start)
        status := strconv.Itoa(c.Writer.Status())
    
        RequestsTotal.WithLabelValues(c.Request.Method, status).Inc()
        RequestDuration.WithLabelValues(c.Request.Method, status).Observe(duration.Seconds())
      }
    }
  • Integrate with existing Gin router
  • Ensure metrics don't impact request performance

Acceptance Criteria

  • All Prometheus metrics implemented and exposed
  • Health check endpoint returning correct status
  • Health check completes in <100ms
  • Structured JSON logging working
  • Audit events logged to database
  • Metrics endpoint accessible on port 9090
  • Request tracing with request IDs
  • No sensitive data in logs
  • Performance metrics tracking accurately
  • Background metrics collection running

Technical Considerations

  • Use official Prometheus Go client library
  • Implement efficient metrics collection (avoid blocking)
  • Use appropriate metric types (Counter, Gauge, Histogram)
  • Ensure metric cardinality is controlled
  • Use context for request-scoped logging
  • Consider log aggregation requirements
  • Implement graceful shutdown for metrics server
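
Request-scoped logging through context can be sketched as follows. This version uses `net/http` from the standard library so that it runs without dependencies. With Gin the same idea would apply to `c.Request.Context()`. `WithLogger`, `FromContext`, `RequestLogging`, and `demo` are hypothetical names, and reading the ID from an `X-Request-ID` header is an assumption.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"time"
)

type loggerKey struct{}

// WithLogger stores a request-scoped logger in the context.
func WithLogger(ctx context.Context, l *slog.Logger) context.Context {
	return context.WithValue(ctx, loggerKey{}, l)
}

// FromContext returns the request-scoped logger, or the default logger.
func FromContext(ctx context.Context) *slog.Logger {
	if l, ok := ctx.Value(loggerKey{}).(*slog.Logger); ok {
		return l
	}
	return slog.Default()
}

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// RequestLogging logs request start and completion and passes a logger
// carrying the request ID down to handlers through the context.
func RequestLogging(base *slog.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		logger := base.With("request_id", r.Header.Get("X-Request-ID"))
		// RemoteAddr includes the port and ignores proxy headers.
		logger.Info("request started", "method", r.Method, "path", r.URL.Path, "client_ip", r.RemoteAddr)

		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r.WithContext(WithLogger(r.Context(), logger)))

		logger.Info("request completed", "status", rec.status, "duration_ms", time.Since(start).Milliseconds())
	})
}

// demo sends one synthetic request through the middleware and returns
// the response status together with the captured JSON log lines.
func demo() (int, string) {
	var logs bytes.Buffer
	base := slog.New(slog.NewJSONHandler(&logs, nil))
	handler := RequestLogging(base, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		FromContext(r.Context()).Info("looking up certificate")
		w.WriteHeader(http.StatusNotFound)
	}))
	req := httptest.NewRequest(http.MethodGet, "/certificates/123", nil)
	req.Header.Set("X-Request-ID", "req-42")
	resp := httptest.NewRecorder()
	handler.ServeHTTP(resp, req)
	return resp.Code, logs.String()
}

func main() {
	status, logs := demo()
	fmt.Println("status:", status)
	fmt.Print(logs)
}
```

Every line logged during the request, including those from handlers deeper in the call chain, carries the same `request_id`, which is what makes correlation in a log aggregator possible.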

Dependencies

Testing Requirements

  • Unit tests for metrics collection
  • Unit tests for health checks
  • Unit tests for audit logging
  • Integration tests for metrics endpoint
  • Test metric accuracy under load
  • Test health check failure scenarios
  • Verify audit log completeness
  • Test log output format
  • Performance impact testing

Definition of Done

  • Code reviewed and approved
  • All tests passing
  • Metrics documented
  • Grafana dashboards created (if applicable)
  • Alert rules configured
  • Logging standards documented
  • No performance regression
  • Audit trail verified complete
