Monitoring & Observability #201

# Monitoring & Observability

## Overview
Implement comprehensive monitoring and observability features including Prometheus metrics, health check endpoints, structured logging, and audit logging to ensure operational visibility and system reliability.

## Objective
Create a robust monitoring system that provides real-time insights into service health, performance metrics, and security events while maintaining detailed audit trails for compliance and troubleshooting.

## Canonical Scope
- This document is the canonical source for:
  - Structured logging approach and helpers
  - Request logging middleware behavior
  - Health and readiness endpoints, semantics, and status criteria
  - Metrics definitions and exposure
- For audit storage and retention, see 05 Database Layer. For validation and error schema, see 07 Security & Validation.

## Tasks

### Prometheus Metrics Implementation
- [ ] Set up Prometheus metrics in `internal/monitoring/metrics.go`
- [ ] Implement request metrics:
  ```go
  var (
    RequestsTotal = prometheus.NewCounterVec(
      prometheus.CounterOpts{
        Name: "certificate_api_requests_total",
        Help: "Total number of API requests",
      },
      []string{"method", "status"},
    )

    RequestDuration = prometheus.NewHistogramVec(
      prometheus.HistogramOpts{
        Name: "certificate_api_request_duration_seconds",
        Help: "Request duration in seconds",
        Buckets: prometheus.DefBuckets,
      },
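      // Label by status, not "quantile": a histogram exports buckets, and
      // percentiles are derived at query time with histogram_quantile().
      []string{"method", "status"},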
    )
  )
  ```
- [ ] Implement certificate metrics:
  ```go
  ActiveCertificates = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
      Name: "certificate_api_active_certificates",
      Help: "Number of active certificates",
    },
    []string{"ca"},
  )

  CertificatesExpiringSoon = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
      Name: "certificate_api_certificates_expiring_soon",
      Help: "Number of certificates expiring soon",
    },
    []string{"days", "ca"},
  )

  CertificatesIssuedTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{
      Name: "certificate_api_certificates_issued_total",
      Help: "Total number of certificates issued",
    },
    []string{"profile", "ca"},
  )

  CertificatesRenewedTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{
      Name: "certificate_api_certificates_renewed_total",
      Help: "Total number of certificates renewed",
    },
    []string{"profile", "ca"},
  )
  ```
- [ ] Implement system health metrics:
  ```go
  ServiceUp = prometheus.NewGauge(
    prometheus.GaugeOpts{
      Name: "certificate_api_up",
      Help: "Service availability (1 = up, 0 = down)",
    },
  )

  DatabaseConnectionsActive = prometheus.NewGauge(
    prometheus.GaugeOpts{
      Name: "certificate_api_database_connections_active",
      Help: "Number of active database connections",
    },
  )

  PCAAPICalls = prometheus.NewCounterVec(
    prometheus.CounterOpts{
      Name: "certificate_api_pca_api_calls_total",
      Help: "Total number of AWS PCA API calls",
    },
    []string{"operation", "status"},
  )

  PCAAPILatency = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
      Name: "certificate_api_pca_api_latency_seconds",
      Help: "AWS PCA API call latency",
      Buckets: prometheus.DefBuckets,
    },
    // No "quantile" label: percentiles come from the histogram buckets at query time.
    []string{"operation"},
  )
  ```
- [ ] Implement security metrics:
  ```go
  AuthenticationFailures = prometheus.NewCounterVec(
    prometheus.CounterOpts{
      Name: "certificate_api_authentication_failures_total",
      Help: "Total number of authentication failures",
    },
    []string{"reason"},
  )
  ```
- [ ] Register all metrics with Prometheus registry
- [ ] Create metrics endpoint on port 9090 as specified
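
The last two tasks (registration and a dedicated endpoint on port 9090) could be wired up roughly as follows. This is a minimal sketch using only the standard library: `NewMetricsServer` and `Serve` are hypothetical names, and the handler is passed in so that the real service can supply `promhttp.Handler()` from the Prometheus Go client after registering the collectors (for example with `prometheus.MustRegister`). The timeouts are illustrative assumptions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// NewMetricsServer builds a dedicated HTTP server that serves only /metrics.
// In the real service the handler would typically be promhttp.Handler();
// it is injected here so the sketch compiles without third-party modules.
func NewMetricsServer(port int, metrics http.Handler) *http.Server {
	if metrics == nil {
		// Placeholder response until a real metrics handler is wired in.
		metrics = http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
			http.Error(w, "metrics handler not configured", http.StatusServiceUnavailable)
		})
	}
	mux := http.NewServeMux()
	mux.Handle("/metrics", metrics)
	return &http.Server{
		Addr:              fmt.Sprintf(":%d", port),
		Handler:           mux,
		ReadHeaderTimeout: 5 * time.Second, // assumed value
	}
}

// Serve runs srv until ctx is cancelled, then shuts it down gracefully.
func Serve(ctx context.Context, srv *http.Server) error {
	errCh := make(chan error, 1)
	go func() { errCh <- srv.ListenAndServe() }()

	select {
	case err := <-errCh:
		return err // failed to start or stopped unexpectedly
	case <-ctx.Done():
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		if err := srv.Shutdown(shutdownCtx); err != nil {
			return err
		}
		// ListenAndServe reports ErrServerClosed after a clean shutdown.
		if err := <-errCh; !errors.Is(err, http.ErrServerClosed) {
			return err
		}
		return nil
	}
}
```

Keeping the metrics listener separate from the API router means scrapes are not subject to API authentication or request middleware, and the port can be firewalled independently.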

### Health Check Implementation
- [ ] Implement health check endpoint in `internal/monitoring/health.go`
- [ ] Create health check handler at `/health`:
  ```go
  type HealthCheck struct {
    Status    string                 `json:"status"`
    Checks    map[string]CheckResult `json:"checks"`
    Timestamp time.Time              `json:"timestamp"`
  }

  type CheckResult struct {
    Status  string `json:"status"`
    Message string `json:"message,omitempty"`
  }
  ```
- [ ] Implement health check criteria:
  - Database connectivity: Connection pool has available connections
  - AWS PCA connectivity: Can list CAs successfully
  - JWKS cache status: Cache populated and not expired
  - Certificate expiry: No CA certificates expiring within 30 days
- [ ] Return appropriate response codes:
  - 200: All checks healthy
  - 503: Any check unhealthy
- [ ] Implement readiness check endpoint at `/ready`
- [ ] Ensure health checks complete in <100ms as specified
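
The criteria and response codes above can be sketched as a handler that runs every check concurrently under one shared deadline. This is an illustration, not the final implementation: `CheckFunc`, `runChecks`, `overallStatus`, and `httpStatus` are hypothetical names, and the status strings `"healthy"` and `"unhealthy"` are assumed from the response-code list. The two structs are repeated so the example is self-contained.

```go
package main

import (
	"context"
	"encoding/json"
	"net/http"
	"sync"
	"time"
)

type CheckResult struct {
	Status  string `json:"status"`
	Message string `json:"message,omitempty"`
}

type HealthCheck struct {
	Status    string                 `json:"status"`
	Checks    map[string]CheckResult `json:"checks"`
	Timestamp time.Time              `json:"timestamp"`
}

// CheckFunc probes one dependency; a nil error means healthy.
// Implementations must honour ctx so the deadline is respected.
type CheckFunc func(ctx context.Context) error

// overallStatus is "healthy" only if every individual check is healthy.
func overallStatus(checks map[string]CheckResult) string {
	for _, r := range checks {
		if r.Status != "healthy" {
			return "unhealthy"
		}
	}
	return "healthy"
}

// httpStatus maps the overall status to 200 or 503.
func httpStatus(status string) int {
	if status == "healthy" {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// runChecks executes all checks concurrently under a single timeout.
func runChecks(ctx context.Context, timeout time.Duration, checks map[string]CheckFunc) HealthCheck {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]CheckResult, len(checks))
	)
	for name, check := range checks {
		wg.Add(1)
		go func(name string, check CheckFunc) {
			defer wg.Done()
			res := CheckResult{Status: "healthy"}
			if err := check(ctx); err != nil {
				res = CheckResult{Status: "unhealthy", Message: err.Error()}
			}
			mu.Lock()
			results[name] = res
			mu.Unlock()
		}(name, check)
	}
	wg.Wait()
	return HealthCheck{Status: overallStatus(results), Checks: results, Timestamp: time.Now().UTC()}
}

// HealthHandler serves the aggregated result as JSON.
func HealthHandler(checks map[string]CheckFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		result := runChecks(r.Context(), 100*time.Millisecond, checks)
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(httpStatus(result.Status))
		_ = json.NewEncoder(w).Encode(result)
	}
}
```

Running the checks in parallel keeps total latency close to the slowest single check, which matters for the 100ms budget. A check that ignores its context can still overrun the deadline, so slow probes such as listing CAs may be better served from a cached result.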

### Structured Logging with slog
- [ ] Implement structured JSON logging using standard library `slog`
- [ ] Configure slog in `internal/logging/logger.go`:
  ```go
  // level is a slog.Leveler so either a fixed slog.Level or a *slog.LevelVar can be passed.
  func NewLogger(level slog.Leveler) *slog.Logger {
    opts := &slog.HandlerOptions{
      Level: level,
      AddSource: false,
    }
    handler := slog.NewJSONHandler(os.Stdout, opts)
    return slog.New(handler)
  }
  ```
- [ ] Create context-aware logging helpers:
  ```go
  func LoggerWithRequestID(logger *slog.Logger, requestID string) *slog.Logger {
    return logger.With("request_id", requestID)
  }

  func LoggerWithComponent(logger *slog.Logger, component string) *slog.Logger {
    return logger.With("component", component)
  }
  ```
- [ ] Include standard attributes in log entries:
  - Timestamp (automatic with slog)
  - Level (debug, info, warn, error)
  - Request ID (via context)
  - Component/module
  - Message
- [ ] Implement request logging middleware using slog:
  - Log request start with method, path, client IP
  - Log request completion with status, duration
  - Include request ID for correlation
- [ ] Configure log levels via configuration (default: info):
  ```go
  // A LevelVar can be changed at runtime; its zero value is already info.
  var logLevel = new(slog.LevelVar)

  func SetLogLevel(l slog.Level) { logLevel.Set(l) }
  ```
- [ ] Ensure sensitive data is not logged (tokens, keys, etc.)
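
Two of the tasks above, reading the level from configuration and keeping secrets out of the output, can be combined in the handler options. The sketch below is one possible approach under stated assumptions: `parseLevel`, `isSensitiveKey`, `NewRedactingLogger`, and `logOnce` are hypothetical helpers, and the key list is only an example. It relies on `slog.Level.UnmarshalText` and `slog.HandlerOptions.ReplaceAttr` from the standard library.

```go
package main

import (
	"bytes"
	"io"
	"log/slog"
	"strings"
)

// sensitiveKeys lists attribute keys whose values must never be logged.
// The exact set is an assumption; extend it to match the service.
var sensitiveKeys = map[string]bool{
	"token":         true,
	"authorization": true,
	"password":      true,
	"private_key":   true,
}

// isSensitiveKey matches keys case-insensitively.
func isSensitiveKey(key string) bool {
	return sensitiveKeys[strings.ToLower(key)]
}

// parseLevel converts a config string such as "debug" or "WARN" into a
// slog.Level, falling back to info when the value is not recognised.
func parseLevel(s string) slog.Level {
	var lvl slog.Level
	if err := lvl.UnmarshalText([]byte(s)); err != nil {
		return slog.LevelInfo
	}
	return lvl
}

// NewRedactingLogger returns a JSON logger that replaces the value of any
// sensitive attribute with a fixed marker before it is written.
func NewRedactingLogger(w io.Writer, level slog.Leveler) *slog.Logger {
	opts := &slog.HandlerOptions{
		Level: level,
		ReplaceAttr: func(_ []string, a slog.Attr) slog.Attr {
			if isSensitiveKey(a.Key) {
				return slog.String(a.Key, "[REDACTED]")
			}
			return a
		},
	}
	return slog.New(slog.NewJSONHandler(w, opts))
}

// logOnce logs a single attribute and returns the emitted JSON line.
// It exists only to demonstrate the redaction.
func logOnce(key, value string) string {
	var buf bytes.Buffer
	NewRedactingLogger(&buf, parseLevel("info")).Info("login attempt", key, value)
	return buf.String()
}
```

Key-based redaction only covers structured attributes. A secret interpolated into the message string would still be written, so call sites should pass sensitive values as attributes or not at all.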

### Audit Logging System
- [ ] Implement audit logging in `internal/monitoring/audit.go`
- [ ] Create audit logger that writes to database:
  ```go
  type AuditLogger struct {
    repository AuditRepository
    logger     *slog.Logger
  }
  ```
- [ ] Log the following audit events:
  - Certificate issued/renewed/requested
  - Authentication failures
  - Authorization failures
  - Admin operations
  - System errors
  - CA certificate refresh operations
  - CA certificate expiry warnings
- [ ] Ensure audit log format includes:
  - Timestamp
  - Actor identity
  - Actor IP address
  - Resource type and identifier
  - Action/event type
  - Outcome (success/failure)
  - Additional details in JSONB
- [ ] Implement audit log retention (1 year minimum)
- [ ] Ensure audit logs are append-only (no updates or deletes)
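
The audit format and the no-update, no-delete rule can be expressed directly in the types. The following is a sketch under assumptions: the field names, the `Validate` rules, and the single-method shape of `AuditRepository` are illustrative, since this issue names the type but not its methods. Storage and retention remain defined in 05 Database Layer.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"log/slog"
	"time"
)

// AuditEvent mirrors the audit log fields listed above.
type AuditEvent struct {
	Timestamp    time.Time       `json:"timestamp"`
	Actor        string          `json:"actor"`
	ActorIP      string          `json:"actor_ip"`
	ResourceType string          `json:"resource_type"`
	ResourceID   string          `json:"resource_id"`
	Action       string          `json:"action"`
	Outcome      string          `json:"outcome"`           // "success" or "failure"
	Details      json.RawMessage `json:"details,omitempty"` // stored as JSONB
}

// Validate rejects events that lack an action or a well-formed outcome.
// Actor is not required because failed authentications may be anonymous.
func (e AuditEvent) Validate() error {
	switch {
	case e.Action == "":
		return errors.New("audit: action is required")
	case e.Outcome != "success" && e.Outcome != "failure":
		return errors.New("audit: outcome must be success or failure")
	}
	return nil
}

// AuditRepository exposes no update or delete method, so callers can only
// add records. The method set is an assumption.
type AuditRepository interface {
	Append(ctx context.Context, e AuditEvent) error
}

type AuditLogger struct {
	repository AuditRepository
	logger     *slog.Logger
}

// Log validates, timestamps, and persists the event. A failed write is also
// reported through slog so that it is not lost silently.
func (a *AuditLogger) Log(ctx context.Context, e AuditEvent) error {
	if err := e.Validate(); err != nil {
		return err
	}
	if e.Timestamp.IsZero() {
		e.Timestamp = time.Now().UTC()
	}
	if err := a.repository.Append(ctx, e); err != nil {
		a.logger.ErrorContext(ctx, "audit write failed", "action", e.Action, "error", err)
		return err
	}
	return nil
}
```

An interface without mutation methods only protects against application code. Enforcing append-only behavior against direct database access would additionally need database-level permissions or triggers.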

### Metrics Collection Jobs
- [ ] Create background job to collect certificate metrics:
  ```go
  func CollectCertificateMetrics(repo CertificateRepository) {
    // Run every 5 minutes
    // Count active certificates per CA
    // Count expiring certificates (7, 3, 1 days)
    // Update Prometheus gauges
  }
  ```
- [ ] Monitor database connection pool metrics
- [ ] Track AWS PCA API call patterns
- [ ] Monitor CA certificate expiration dates
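
The collection job outlined above might look like the sketch below. Several names are assumptions: `ExpiryLister` is a hypothetical view of `CertificateRepository`, and `GaugeSetter` stands in for a Prometheus update such as `CertificatesExpiringSoon.WithLabelValues(days, ca).Set(v)` so that the example needs only the standard library. The windows are treated as cumulative (a certificate expiring in two days counts toward both the 3-day and 7-day gauges), which is also an assumption.

```go
package main

import (
	"context"
	"strconv"
	"time"
)

// ExpiryLister returns the NotAfter time of every active certificate,
// grouped by CA. It is a hypothetical subset of CertificateRepository.
type ExpiryLister interface {
	ActiveExpiries(ctx context.Context) (map[string][]time.Time, error)
}

// GaugeSetter stands in for a labelled Prometheus gauge update.
type GaugeSetter func(days, ca string, v float64)

// expiryWindows are the look-ahead windows, in days, from the task list.
var expiryWindows = []int{7, 3, 1}

// countWithin counts certificates that are still valid (days left > 0)
// and expire within the given window.
func countWithin(daysLeft []float64, window int) int {
	n := 0
	for _, d := range daysLeft {
		if d > 0 && d <= float64(window) {
			n++
		}
	}
	return n
}

// collectOnce reads expiries from the repository and updates one gauge
// per CA and window.
func collectOnce(ctx context.Context, repo ExpiryLister, set GaugeSetter, now time.Time) error {
	byCA, err := repo.ActiveExpiries(ctx)
	if err != nil {
		return err
	}
	for ca, expiries := range byCA {
		left := make([]float64, len(expiries))
		for i, t := range expiries {
			left[i] = t.Sub(now).Hours() / 24
		}
		for _, w := range expiryWindows {
			set(strconv.Itoa(w), ca, float64(countWithin(left, w)))
		}
	}
	return nil
}

// RunCollector collects immediately, then on every tick, until ctx is cancelled.
func RunCollector(ctx context.Context, repo ExpiryLister, set GaugeSetter, interval time.Duration, onErr func(error)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if err := collectOnce(ctx, repo, set, time.Now()); err != nil {
			onErr(err)
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}
```

Started as `go RunCollector(ctx, repo, set, 5*time.Minute, onErr)`, the job runs off the request path, so a slow query delays only the next gauge update and not API traffic.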

### Performance Monitoring
- [ ] Add request timing to all API endpoints
- [ ] Track database query performance
- [ ] Monitor AWS PCA API latency
- [ ] Implement performance targets verification:
  - Certificate issuance: <2 seconds (95th percentile)
  - Certificate status lookup: <500ms (95th percentile)
  - Health check: <100ms
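
In production the 95th percentile would normally be read from the histograms with PromQL's `histogram_quantile`. For a load test that gathers raw latencies, the targets above can be checked directly. The sketch below uses the nearest-rank method, and the function names are hypothetical.

```go
package main

import "sort"

// percentile returns the p-th percentile (1..100) of samples using the
// nearest-rank method. It returns 0 for an empty slice.
func percentile(samples []float64, p int) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...) // copy so the caller's slice is untouched
	sort.Float64s(sorted)

	rank := (p*len(sorted) + 99) / 100 // ceil(p/100 * n) in integer arithmetic
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// meetsTarget reports whether the 95th percentile of the observed
// latencies (in seconds) is below the target.
func meetsTarget(latencies []float64, targetSeconds float64) bool {
	return len(latencies) > 0 && percentile(latencies, 95) < targetSeconds
}
```

For example, `meetsTarget(issuanceLatencies, 2.0)` would check the certificate issuance target and `meetsTarget(lookupLatencies, 0.5)` the status lookup target, where both slices are latencies in seconds collected by the test.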

### Alerting Configuration
- [ ] Define alert rules for Prometheus:
  - Service down (certificate_api_up == 0)
  - High error rate (>1% of requests failing)
  - CA certificate expiring soon (<30 days)
  - Database connection pool exhausted
  - Authentication failure spike
  - AWS PCA API errors
- [ ] Document alert thresholds and escalation paths

### Gin Middleware Integration
- [ ] Create Prometheus middleware for Gin:
  ```go
  func PrometheusMiddleware() gin.HandlerFunc {
    return func(c *gin.Context) {
      start := time.Now()

      c.Next()

      duration := time.Since(start)
      status := strconv.Itoa(c.Writer.Status())

      RequestsTotal.WithLabelValues(c.Request.Method, status).Inc()
      // Observe raw durations; percentiles such as p95 are computed at query time.
      RequestDuration.WithLabelValues(c.Request.Method, status).Observe(duration.Seconds())
    }
  }
  ```
- [ ] Integrate with existing Gin router
- [ ] Ensure metrics don't impact request performance

## Acceptance Criteria
- [ ] All Prometheus metrics implemented and exposed
- [ ] Health check endpoint returning correct status
- [ ] Health check completes in <100ms
- [ ] Structured JSON logging working
- [ ] Audit events logged to database
- [ ] Metrics endpoint accessible on port 9090
- [ ] Request tracing with request IDs
- [ ] No sensitive data in logs
- [ ] Performance metrics tracking accurately
- [ ] Background metrics collection running

## Technical Considerations
- Use official Prometheus Go client library
- Implement efficient metrics collection (avoid blocking)
- Use appropriate metric types (Counter, Gauge, Histogram)
- Ensure metric cardinality is controlled
- Use context for request-scoped logging
- Consider log aggregation requirements
- Implement graceful shutdown for metrics server

## Dependencies
- Prometheus Go client library
- Gin framework (from issue #1)
- GORM repositories (from issue #5)
- Standard library `slog` for structured logging

## Testing Requirements
- [ ] Unit tests for metrics collection
- [ ] Unit tests for health checks
- [ ] Unit tests for audit logging
- [ ] Integration tests for metrics endpoint
- [ ] Test metric accuracy under load
- [ ] Test health check failure scenarios
- [ ] Verify audit log completeness
- [ ] Test log output format
- [ ] Performance impact testing

## Definition of Done
- [ ] Code reviewed and approved
- [ ] All tests passing
- [ ] Metrics documented
- [ ] Grafana dashboards created (if applicable)
- [ ] Alert rules configured
- [ ] Logging standards documented
- [ ] No performance regression
- [ ] Audit trail verified complete
