Monitoring & Observability
Overview
Implement comprehensive monitoring and observability features including Prometheus metrics, health check endpoints, structured logging, and audit logging to ensure operational visibility and system reliability.
Objective
Create a robust monitoring system that provides real-time insights into service health, performance metrics, and security events while maintaining detailed audit trails for compliance and troubleshooting.
Canonical Scope
- This document is the canonical source for:
  - Structured logging approach and helpers
  - Request logging middleware behavior
  - Health and readiness endpoints, semantics, and status criteria
  - Metrics definitions and exposure
- For audit storage and retention, see 05 Database Layer. For validation and error schema, see 07 Security & Validation.
Tasks
Prometheus Metrics Implementation
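- Set up Prometheus metrics in internal/monitoring/metrics.go
- Implement request metrics:

    import "github.com/prometheus/client_golang/prometheus"

    var (
        // RequestsTotal counts API requests by HTTP method and response status.
        RequestsTotal = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "certificate_api_requests_total",
                Help: "Total number of API requests",
            },
            []string{"method", "status"},
        )

        // RequestDuration records request latency. A histogram needs no
        // "quantile" label: percentiles are derived at query time with
        // histogram_quantile() over the exported buckets.
        RequestDuration = prometheus.NewHistogramVec(
            prometheus.HistogramOpts{
                Name:    "certificate_api_request_duration_seconds",
                Help:    "Request duration in seconds",
                Buckets: prometheus.DefBuckets,
            },
            []string{"method", "status"},
        )
    )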
- Implement certificate metrics:

    var (
        ActiveCertificates = prometheus.NewGaugeVec(
            prometheus.GaugeOpts{
                Name: "certificate_api_active_certificates",
                Help: "Number of active certificates",
            },
            []string{"ca"},
        )

        CertificatesExpiringSoon = prometheus.NewGaugeVec(
            prometheus.GaugeOpts{
                Name: "certificate_api_certificates_expiring_soon",
                Help: "Number of certificates expiring soon",
            },
            []string{"days", "ca"},
        )

        CertificatesIssuedTotal = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "certificate_api_certificates_issued_total",
                Help: "Total number of certificates issued",
            },
            []string{"profile", "ca"},
        )

        CertificatesRenewedTotal = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "certificate_api_certificates_renewed_total",
                Help: "Total number of certificates renewed",
            },
            []string{"profile", "ca"},
        )
    )
- Implement system health metrics:

    var (
        ServiceUp = prometheus.NewGauge(
            prometheus.GaugeOpts{
                Name: "certificate_api_up",
                Help: "Service availability (1 = up, 0 = down)",
            },
        )

        DatabaseConnectionsActive = prometheus.NewGauge(
            prometheus.GaugeOpts{
                Name: "certificate_api_database_connections_active",
                Help: "Number of active database connections",
            },
        )

        PCAAPICalls = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "certificate_api_pca_api_calls_total",
                Help: "Total number of AWS PCA API calls",
            },
            []string{"operation", "status"},
        )

        // PCAAPILatency is labelled by operation only. Percentiles are
        // computed at query time with histogram_quantile(), so the histogram
        // carries no "quantile" label.
        PCAAPILatency = prometheus.NewHistogramVec(
            prometheus.HistogramOpts{
                Name:    "certificate_api_pca_api_latency_seconds",
                Help:    "AWS PCA API call latency",
                Buckets: prometheus.DefBuckets,
            },
            []string{"operation"},
        )
    )
- Implement security metrics:

    var AuthenticationFailures = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "certificate_api_authentication_failures_total",
            Help: "Total number of authentication failures",
        },
        []string{"reason"},
    )
- Register all metrics with Prometheus registry
- Create metrics endpoint on port 9090 as specified
Health Check Implementation
- Implement health check endpoint in internal/monitoring/health.go
- Create health check handler at /health:

    type HealthCheck struct {
        Status    string                 `json:"status"`
        Checks    map[string]CheckResult `json:"checks"`
        Timestamp time.Time              `json:"timestamp"`
    }

    type CheckResult struct {
        Status  string `json:"status"`
        Message string `json:"message,omitempty"`
    }
- Implement health check criteria:
  - Database connectivity: Connection pool has available connections
  - AWS PCA connectivity: Can list CAs successfully
  - JWKS cache status: Cache populated and not expired
  - Certificate expiry: No CA certificates expiring within 30 days
- Return appropriate response codes:
  - 200: All checks healthy
  - 503: Any check unhealthy
- Implement readiness check endpoint at /ready
- Ensure health checks complete in <100ms as specified
Structured Logging with slog
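- Implement structured JSON logging using standard library slog
- Configure slog in internal/logging/logger.go:

    import (
        "log/slog"
        "os"
    )

    // NewLogger returns a JSON logger writing to stdout. Taking a slog.Leveler
    // accepts both a fixed slog.Level and a *slog.LevelVar that can be
    // adjusted at runtime.
    func NewLogger(level slog.Leveler) *slog.Logger {
        opts := &slog.HandlerOptions{
            Level:     level,
            AddSource: false,
        }
        handler := slog.NewJSONHandler(os.Stdout, opts)
        return slog.New(handler)
    }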
- Create context-aware logging helpers:
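
    // LoggerWithRequestID returns a child logger that tags every entry with the request ID.
    func LoggerWithRequestID(logger *slog.Logger, requestID string) *slog.Logger {
        return logger.With("request_id", requestID)
    }

    // LoggerWithComponent returns a child logger that tags every entry with the component name.
    func LoggerWithComponent(logger *slog.Logger, component string) *slog.Logger {
        return logger.With("component", component)
    }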
- Include standard attributes in log entries:
  - Timestamp (automatic with slog)
  - Level (debug, info, warn, error)
  - Request ID (via context)
  - Component/module
  - Message
- Implement request logging middleware using slog:
  - Log request start with method, path, client IP
  - Log request completion with status, duration
  - Include request ID for correlation
- Configure log levels via configuration (default: info):

    // logLevel can be changed at runtime; use it as HandlerOptions.Level.
    var logLevel = new(slog.LevelVar)

    func init() { logLevel.Set(slog.LevelInfo) } // default: info
- Ensure sensitive data is not logged (tokens, keys, etc.)
Audit Logging System
- Implement audit logging in internal/monitoring/audit.go
- Create audit logger that writes to database:

    type AuditLogger struct {
        repository AuditRepository
        logger     *slog.Logger
    }
- Log the following audit events:
  - Certificate issued/renewed/requested
  - Authentication failures
  - Authorization failures
  - Admin operations
  - System errors
  - CA certificate refresh operations
  - CA certificate expiry warnings
- Ensure audit log format includes:
  - Timestamp
  - Actor identity
  - Actor IP address
  - Resource type and identifier
  - Action/event type
  - Outcome (success/failure)
  - Additional details in JSONB
- Implement audit log retention (1 year minimum)
- Ensure audit logs are write-only (no updates/deletes)
Metrics Collection Jobs
- Create background job to collect certificate metrics:

    func CollectCertificateMetrics(repo CertificateRepository) {
        // Run every 5 minutes
        // Count active certificates per CA
        // Count expiring certificates (7, 3, 1 days)
        // Update Prometheus gauges
    }
- Monitor database connection pool metrics
- Track AWS PCA API call patterns
- Monitor CA certificate expiration dates
Performance Monitoring
- Add request timing to all API endpoints
- Track database query performance
- Monitor AWS PCA API latency
- Implement performance targets verification:
  - Certificate issuance: <2 seconds (95th percentile)
  - Certificate status lookup: <500ms (95th percentile)
  - Health check: <100ms
Alerting Configuration
- Define alert rules for Prometheus:
  - Service down (certificate_api_up == 0)
  - High error rate (>1% of requests failing)
  - CA certificate expiring soon (<30 days)
  - Database connection pool exhausted
  - Authentication failure spike
  - AWS PCA API errors
- Document alert thresholds and escalation paths
Gin Middleware Integration
- Create Prometheus middleware for Gin:
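
    func PrometheusMiddleware() gin.HandlerFunc {
        return func(c *gin.Context) {
            start := time.Now()
            c.Next() // run the remaining handlers before measuring
            duration := time.Since(start)
            status := strconv.Itoa(c.Writer.Status())

            RequestsTotal.WithLabelValues(c.Request.Method, status).Inc()
            // Observe the raw duration. Percentiles such as p95 are derived
            // at query time with histogram_quantile(), not passed as a label value.
            RequestDuration.WithLabelValues(c.Request.Method, status).Observe(duration.Seconds())
        }
    }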
- Integrate with existing Gin router
- Ensure metrics don't impact request performance
Acceptance Criteria
- All Prometheus metrics implemented and exposed
- Health check endpoint returning correct status
- Health check completes in <100ms
- Structured JSON logging working
- Audit events logged to database
- Metrics endpoint accessible on port 9090
- Request tracing with request IDs
- No sensitive data in logs
- Performance metrics tracking accurately
- Background metrics collection running
Technical Considerations
- Use official Prometheus Go client library
- Implement efficient metrics collection (avoid blocking)
- Use appropriate metric types (Counter, Gauge, Histogram)
- Ensure metric cardinality is controlled
- Use context for request-scoped logging
- Consider log aggregation requirements
- Implement graceful shutdown for metrics server
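
On controlling metric cardinality, one common approach is to map unbounded inputs onto a small fixed set of label values before they reach a metric. The helpers below are a sketch with hypothetical names (`StatusClass`, `BoundedLabel`); whether the service should label by status class or by exact status code is a design choice this issue leaves open.

```go
package main

import "strconv"

// StatusClass collapses an HTTP status code into its class ("2xx", "4xx", ...),
// which caps the label at five values plus "unknown".
func StatusClass(code int) string {
	if code < 100 || code > 599 {
		return "unknown"
	}
	return strconv.Itoa(code/100) + "xx"
}

// BoundedLabel returns value only if it is in the allowed set and otherwise
// the fallback, so client-controlled strings cannot create new time series.
func BoundedLabel(value string, allowed map[string]bool, fallback string) string {
	if allowed[value] {
		return value
	}
	return fallback
}
```

For example, `BoundedLabel(c.Request.Method, allowedMethods, "other")` would keep arbitrary method strings sent by clients from growing the `method` label.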
Dependencies
- Prometheus Go client library
- Gin framework (from issue feat: adds cuetools package #1)
- GORM repositories (from issue feat: adds top-level ci config to blueprint #5)
- Standard library slog for structured logging
Testing Requirements
- Unit tests for metrics collection
- Unit tests for health checks
- Unit tests for audit logging
- Integration tests for metrics endpoint
- Test metric accuracy under load
- Test health check failure scenarios
- Verify audit log completeness
- Test log output format
- Performance impact testing
Definition of Done
- Code reviewed and approved
- All tests passing
- Metrics documented
- Grafana dashboards created (if applicable)
- Alert rules configured
- Logging standards documented
- No performance regression
- Audit trail verified complete