This repository contains a complete production-ready platform for deploying applications on AWS using Kubernetes, built with Terraform and following GitOps principles.
- Overview
- Prerequisites
- Architecture
- Layer 1: Foundation
- Layer 2: Platform
- Layer 3: Observability and Monitoring
- Layer 4: Resilience and Backup
- Layer 5: Cost Optimization - FinOps
- Layer 6: Security - Platform Protection
- Layer 7: Application Deployment
- Traffic Flow
- Troubleshooting
- Cost Analysis
- ✨ Best Practices Implemented
- Contributing
- License
The platform uses Terraform to create and manage infrastructure, making provisioning easier and more reproducible. It's built in layers to address Terraform's dependency management challenges (DAG issues) and follows best practices for state management using external storage like S3.
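As a rough illustration of that state-management approach, each layer can point its backend at S3 with locking enabled. This is a minimal sketch; the bucket, key, and table names below are placeholders, not this repo's actual values:

```hcl
# Minimal sketch - bucket, key, and table names are placeholders.
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"        # pre-created S3 bucket for state
    key            = "layers/foundation.tfstate" # one state object per layer
    region         = "us-east-1"
    encrypt        = true                        # server-side encryption of the state file
    dynamodb_table = "terraform-locks"           # optional: DynamoDB table for state locking
  }
}
```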
To reach full production readiness, the following points need to be addressed in each of the layers below:
Foundations
- Multi-region setup for disaster recovery
- Backup strategy for EKS/RDS
- KMS encryption for EBS/secrets
- VPC Flow Logs and GuardDuty
- Network policies and security groups hardening
Platform (step02)
- Resource quotas and limits
- Pod Security Standards
- RBAC policies (least privilege)
- Image scanning and admission controllers
- HA configurations (multi-replica ArgoCD, etc.)
Platform (step03)
- WAF on load balancer
- Rate limiting on ingress
- Secret rotation automation
- DNS failover setup
- Certificate monitoring
Observability
- External storage (S3 for Loki/Tempo instead of local)
- HA setup (3+ replicas for critical components)
- Alerting rules and runbooks
- SLI/SLO monitoring
- Log retention policies
- Backup for monitoring data
Before getting started, ensure you have:
- Terraform CLI - Installation Guide
- AWS CLI - Properly configured with credentials
- Set these environment variables in your .zshrc or .bashrc file:
export TF_VAR_email="youremail@domain.com"
export TF_VAR_pixie_deploy_key="PIXIE_DEPLOY_KEY" # create and copy this from the Pixie UI
The platform is organized into layers, each containing:
- `main.tf` - Resource definitions
- `providers.tf` - Provider versions and API configurations
├── foundation/ # Core infrastructure
├── platform/ # Kubernetes platform components
│ ├── step02/ # Software installation
│ └── step03/ # Software configuration
├── observability/ # System observability and monitoring stack
└── app/ # Application deployment
Creates the core infrastructure components required for any Kubernetes cluster.
- VPC - Virtual Private Cloud for network isolation
- Subnets - Public, private, and intra subnets across availability zones
- IAM - Identity and Access Management roles and policies
- DNS - Route53 hosted zone configuration
- EKS Cluster - The managed Kubernetes control plane plus the worker nodes (the cluster itself)
- NAT Gateway - Outbound internet access for private subnets
cd foundation/
terraform init
terraform apply
The `.terraform.lock.hcl` file is Terraform's dependency lock file. It:
- Locks provider versions - Records the exact versions of providers that were downloaded
- Ensures consistency - Makes sure everyone on your team uses the same provider versions
- Prevents drift - Stops providers from auto-updating to newer versions unexpectedly
- Contains checksums - Verifies provider authenticity and integrity
Should you commit it? Yes, always commit it to version control.
When is it created? During `terraform init`, when providers are downloaded.
Example content:
provider "registry.terraform.io/hashicorp/aws" {
version = "5.0.0"
hashes = [
"h1:xyz123...",
"zh:abc456...",
]
}
It's similar to `package-lock.json` in Node.js or `Pipfile.lock` in Python - it locks your dependencies to specific versions for reproducible builds.
After running `terraform apply`, all resources should be created on AWS. Now run:
aws eks update-kubeconfig --region [YOUR_REGION] --name [CLUSTER_NAME]
This writes the cluster endpoint and certificate into your kubeconfig so that kubectl can communicate with the API server. Then we can run:
kubectl get nodes
to check that the worker nodes have joined the cluster.
Split into two steps: Step02 (installing the software with basic config) and Step03 (configuring the software).
This layer consists of the components needed for Kubernetes to be ready for production - they are defined as Helm charts:
- Gateway/Ingress (Ingress Nginx) - For routing traffic inside the cluster
- Secret Management (External Secret Operator) - For syncing secrets in a secure way
- Certificate Management (Cert Manager) - For TLS certificate issuance
- Continuous Delivery (ArgoCD) - Deploying the application with the GitOps pattern
- Cluster Autoscaling
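Each of these components is installed via a Terraform `helm_release` resource. A minimal sketch for the ingress controller (the chart values shown are illustrative, not necessarily this repo's exact configuration):

```hcl
# Illustrative only - the actual release in platform/step02 may use different values.
resource "helm_release" "ingress_nginx" {
  name             = "ingress-nginx"
  repository       = "https://kubernetes.github.io/ingress-nginx"
  chart            = "ingress-nginx"
  namespace        = "ingress-nginx"
  create_namespace = true

  # Expose the controller through an AWS load balancer.
  set {
    name  = "controller.service.type"
    value = "LoadBalancer"
  }

  # Keep plain HTTP open so cert-manager can serve HTTP challenges (see the note below).
  set {
    name  = "controller.service.enableHttp"
    value = "true"
  }
}
```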
We can handle secrets in our cluster in three different ways:
- Secrets manager [Recommended] - e.g. Vault: when the container is spun up, we query Vault for the secrets and inject them into the container.
- Sealed secrets - we encrypt the secrets, put them in our YAML manifests, and commit the encrypted values to our git repo. [No secret manager is needed here - secret rotation is handled manually]
- External Secrets Operator [Implemented] - runs inside the cluster as a resource that points to a store (Key Vault, AWS Parameter Store, ...); we commit only a reference to the secret in our YAML manifests (in the app).
Note: In the ingress resource, we set `enableHttp: true` so that cert-manager is able to serve its HTTP challenge and issue certificates.
Good read on AWS load balancer types: Network vs Application LB
ArgoCD is a GitOps tool that monitors git; once it sees a change (in code or config, depending on how it was set up), it deploys the change and updates the cluster.
cd platform/step02/
terraform init
terraform apply
In Step03 we will configure:
Create a DNS record in the hosted zone on AWS Route53, and make it point to our cluster's ingress.
- Creates `app.terraform-aws-platform.xyz` → points to your application
- Creates `grafana.terraform-aws-platform.xyz` → points to your monitoring dashboard
- Both records point to the same AWS Load Balancer (ingress controller)
This means we have a domain, and we want to map this domain to our ingress's IP.
kubectl -n ingress-nginx get svc
The EXTERNAL-IP column of this output shows the ingress controller's load balancer address.
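A hedged sketch of what that mapping looks like in Terraform - the variable holding the load balancer hostname is hypothetical; in practice you would take it from the `kubectl` output above or from a data source:

```hcl
# Sketch only - var.ingress_lb_hostname is a hypothetical variable holding the
# EXTERNAL-IP / hostname reported by `kubectl -n ingress-nginx get svc`.
data "aws_route53_zone" "this" {
  name = "terraform-aws-platform.xyz"
}

resource "aws_route53_record" "app" {
  zone_id = data.aws_route53_zone.this.zone_id
  name    = "app.terraform-aws-platform.xyz"
  type    = "CNAME"
  ttl     = 300
  records = [var.ingress_lb_hostname]
}
```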
Configure `cert-manager` to create TLS certificates (Let's Encrypt is used for issuing the certs).
Automatic SSL: When you create an Ingress with TLS, cert-manager will automatically:
- Request a certificate from Let's Encrypt
- Serve the challenge file via nginx
- Get the certificate and store it in Kubernetes
- Auto-renew before expiration
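Under the hood this relies on a Let's Encrypt issuer. A minimal `ClusterIssuer` sketch, expressed here as a `kubernetes_manifest` resource - the issuer name and solver choice are assumptions, not necessarily this repo's exact configuration:

```hcl
# Sketch: Let's Encrypt ClusterIssuer using the HTTP-01 solver via nginx.
resource "kubernetes_manifest" "letsencrypt_issuer" {
  manifest = {
    apiVersion = "cert-manager.io/v1"
    kind       = "ClusterIssuer"
    metadata   = { name = "letsencrypt-prod" }
    spec = {
      acme = {
        server              = "https://acme-v02.api.letsencrypt.org/directory"
        email               = var.email # TF_VAR_email from the prerequisites (assumed variable)
        privateKeySecretRef = { name = "letsencrypt-prod-account-key" }
        solvers = [{
          http01 = { ingress = { class = "nginx" } }
        }]
      }
    }
  }
}
```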
Service Accounts Created:
Service Account | Namespace | Purpose | AWS Role |
---|---|---|---|
secret-store | external-secrets | Access AWS Parameter Store | secret-store |
cluster-autoscaler | kube-system | Scale EC2 Instances | cluster-autoscaler |
ebs-csi-controller-sa | kube-system | Manage EBS volumes | ebs-csi-controller |
Configure the secret store so that the external-secrets operator can fetch certificates and secrets from AWS Parameter Store.
A service account named `secret-store` (assuming the role created in the foundation layer) will be created, along with a ParameterStore-backed secret store.
Automatic Secret Sync: You can now create ExternalSecret resources that:
- Reference secrets stored in AWS Parameter Store
- Automatically sync them to Kubernetes secrets
- Keep them updated when Parameter Store values change
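For illustration, an `ExternalSecret` that pulls a value out of Parameter Store could look like the sketch below - the store name, parameter path, and target secret name are placeholders:

```hcl
# Placeholder names throughout - adjust the store reference and parameter path to your setup.
resource "kubernetes_manifest" "example_external_secret" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
    kind       = "ExternalSecret"
    metadata   = { name = "example-secret", namespace = "default" }
    spec = {
      refreshInterval = "1h" # re-sync from Parameter Store hourly
      secretStoreRef  = { name = "parameter-store", kind = "ClusterSecretStore" }
      target          = { name = "example-secret" } # Kubernetes Secret to create
      data = [{
        secretKey = "password"
        remoteRef = { key = "/prod/example/password" } # Parameter Store parameter
      }]
    }
  }
}
```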
cd platform/step03/
terraform init
terraform apply
Hosted zones are AWS's way of organizing and managing all DNS records for a domain in one place.
Hosted Zone: terraform-aws-platform.xyz
├── A Record: terraform-aws-platform.xyz → 192.168.1.100
├── CNAME Record: app.terraform-aws-platform.xyz → load-balancer.aws.com
├── MX Record: mail.terraform-aws-platform.xyz → mail-server.com
├── NS Records: (nameservers that handle this zone)
└── SOA Record: (zone authority information)
This layer installs and configures the tools we use to monitor and check the cluster's overall health.
- Visualization (Grafana)
- Logging (Grafana Loki)
- Metrics (Prometheus)
- Auto-instrumented Tracing (Pixie)
- Tracing (Grafana Tempo & Open Telemetry)
graph TD
%% Metrics Flow
A[Applications] -->|scrapes| B[Prometheus]
B -->|stores| B1[(Prometheus Storage)]
C[Grafana] -->|queries| B1
C -->|displays| D[Dashboards]
%% Logs Flow
E[Containers] -->|collects| F[Promtail]
F -->|pushes| G[(Loki)]
C -->|queries| G
%% Traces Flow
H[Applications] -->|sends| I[OpenTelemetry Collector]
I -->|forwards| J[(Tempo)]
C -->|queries| J
%% OTel Self-Monitoring
I -->|exposes /metrics| B
%% Pixie Flow
K[Cluster] -->|eBPF| L[Pixie]
L -->|analysis| M[Pixie UI]
%% Additional monitoring targets
N[Kubernetes API] -->|scrapes| B
O[Node Exporters] -->|scrapes| B
P[Other Services] -->|scrapes| B
%% Styling
classDef storage fill:#e1f5fe
classDef collector fill:#f3e5f5
classDef ui fill:#e8f5e8
classDef selfmon fill:#fff3e0
class B1,G,J storage
class B,F,I,L collector
class C,D,M ui
class I selfmon
The cluster will likely need the EBS CSI driver for dynamic volume provisioning:
kubectl apply -k "github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.25"
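Once the driver is installed, persistent volumes for Prometheus, Loki, and Tempo can be provisioned dynamically through a gp3-backed StorageClass. A minimal sketch - the class name and parameters are assumptions, not necessarily what this repo ships:

```hcl
# Sketch: gp3 StorageClass backed by the EBS CSI driver.
resource "kubernetes_storage_class_v1" "gp3" {
  metadata {
    name = "gp3"
  }

  storage_provisioner    = "ebs.csi.aws.com"
  reclaim_policy         = "Delete"
  volume_binding_mode    = "WaitForFirstConsumer" # provision in the pod's availability zone
  allow_volume_expansion = true

  parameters = {
    type = "gp3"
  }
}
```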
Automatic Data Collection (No Code Changes):
- Prometheus: infrastructure metrics (CPU, memory, network)
- Loki + Promtail: all application logs, automatically
- Pixie: HTTP traffic, database queries, service maps (eBPF magic!)

Requires Instrumentation:
- Tempo + OpenTelemetry: distributed tracing (needs SDK integration)
Please head to the README file inside the `04_observability` folder for more in-depth insights about the tools and the overall architecture.
Provides automated backup and disaster recovery capabilities for the entire Kubernetes platform using Velero. All configuration and resource creation for this layer is enclosed here, because backup is a complete product in its own right; if we later decide to move to a different technology (Longhorn, Kasten, etc.), the entire solution lives in one place.
- Velero Server - Orchestrates backup and restore operations
- S3 Backup Storage - Encrypted storage with lifecycle policies for cost optimization
- EBS Snapshots - Point-in-time copies of persistent volumes
- Node Agents - File-level backup capabilities on worker nodes
- IAM Roles - Secure access via IRSA for backup operations
Schedule | Frequency | Retention | Type |
---|---|---|---|
Daily | 2 AM | 30 days | EBS snapshots + configs |
Weekly | Sunday 3 AM | 90 days | Full backup with file-level |
Monthly | 1st at 4 AM | 365 days | Complete disaster recovery |
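For reference, the daily row above corresponds roughly to a Velero `Schedule` resource like the following sketch - the actual schedules in this layer are created via Terraform and may differ in naming and scope:

```hcl
# Sketch: daily 2 AM backup with 30-day retention (720h), matching the table above.
resource "kubernetes_manifest" "daily_backup_schedule" {
  manifest = {
    apiVersion = "velero.io/v1"
    kind       = "Schedule"
    metadata   = { name = "daily-backup", namespace = "velero" }
    spec = {
      schedule = "0 2 * * *" # every day at 2 AM
      template = {
        ttl                = "720h" # keep backups for 30 days
        includedNamespaces = ["*"]  # back up all namespaces
        snapshotVolumes    = true   # take EBS snapshots of persistent volumes
      }
    }
  }
}
```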
Application Data:
- Database volumes (PostgreSQL, MySQL)
- User uploads and file storage
- Monitoring data (Prometheus, Grafana)
Cluster Configuration:
- Deployments, Services, ConfigMaps
- Secrets (encrypted)
- RBAC policies and permissions
- Custom resources (ArgoCD apps, certificates)
- Volume corruption → Restore from EBS snapshots (5-10 min)
- Accidental deletion → Selective namespace restore (2-5 min)
- Cluster failure → Full cluster rebuild and restore (15-30 min)
- Region outage → Cross-region recovery (30-60 min)
cd resilience/
terraform init
terraform apply
# Check Velero namespace and pods
kubectl get namespace velero
kubectl get pods -n velero
# List available backups
velero backup get
# Restore specific namespace
velero restore create --from-backup daily-backup-20250110 \
--include-namespaces ecommerce-app
# Manual backup before major changes
velero backup create pre-deployment-backup \
--include-namespaces production
Current Features:
- Automated scheduled backups with appropriate retention
- Encrypted storage with lifecycle management
- Cross-service backup coordination
- Monitoring integration ready
Still Needed for Production:
- Cross-region backup replication
- Automated backup validation testing
- Comprehensive alerting rules
- Disaster recovery runbooks
For detailed configuration and advanced features, see the Resilience Layer README.
Real-time Kubernetes cost monitoring and resource optimization foundation.
- OpenCost - Real-time cost monitoring and allocation
- Grafana Integration - Cost dashboards in existing monitoring stack
- OpenCost UI - Standalone web interface for detailed cost analysis
- KEDA - Event-driven autoscaling (SQS, CloudWatch, etc.)
- Karpenter - Advanced cluster autoscaling (replace cluster-autoscaler)
├── OpenCost Helm Chart
├── Grafana Dashboard ConfigMap
├── Ingress + DNS (opencost.domain.com)
├── Prometheus ServiceMonitor
└── Cost metrics collection
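A minimal sketch of how the OpenCost chart might be pulled in - the repository URL is the upstream OpenCost Helm repo, and the values wiring it to the existing Prometheus are omitted here because they depend on your release names:

```hcl
# Sketch only - point OpenCost at your existing Prometheus via chart values.
resource "helm_release" "opencost" {
  name             = "opencost"
  repository       = "https://opencost.github.io/opencost-helm-chart"
  chart            = "opencost"
  namespace        = "opencost"
  create_namespace = true

  # values = [file("${path.module}/values/opencost.yaml")] # prometheus endpoint, UI ingress, etc.
}
```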
# Navigate to FinOps layer
cd 06_finops
# Deploy cost monitoring
terraform init
terraform plan
terraform apply
- OpenCost UI: https://opencost.terraform-aws-platform.xyz
- Grafana: https://grafana.terraform-aws-platform.xyz (cost dashboard)
- Metrics: available in Prometheus
- Cost breakdown by namespace, deployment, service
- Resource efficiency tracking (CPU/memory/storage costs)
- Historical cost trends and waste identification
- Dual interface - Technical (Grafana) + Business (OpenCost UI)
- Wait 2-4 weeks for meaningful cost data before optimization
- Check Grafana first for daily cost monitoring
- Use OpenCost UI for detailed cost analysis and reports
- Monitor namespace costs to identify optimization opportunities
- Establish cost baseline over 2-4 weeks
- Identify high-cost namespaces/workloads
- Consider KEDA for event-driven workloads
- Plan Karpenter migration for 20-50% cost reduction
- Immediate: 100% cost visibility
- Short-term: 10-30% waste identification
- Long-term: 20-50% cost reduction (with full FinOps stack)
Comprehensive security layer providing policy enforcement, vulnerability scanning, cloud security assessment, and runtime threat detection.
- Kyverno - Configuration security and policy enforcement
- Trivy Operator - Image vulnerability scanning and CIS compliance
- Falco - Runtime security monitoring and threat detection
- Prowler - AWS security posture assessment (weekly scans)
- Security Dashboard - Unified monitoring in Grafana
- Policy enforcement for container security standards
- Continuous vulnerability scanning of all images
- Real-time threat detection with behavioral monitoring
- Cloud compliance checking against CIS benchmarks
- Integrated alerting through existing observability stack
├── Kyverno Policy Engine (2 replicas)
├── Trivy Operator (image + config scanning)
├── Falco Runtime Security (eBPF-based) - DISABLED FOR NOW
├── Prowler CronJob (weekly AWS assessment)
├── Security Grafana Dashboard
└── S3 bucket for security reports
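As an example of what policy enforcement looks like in practice, a Kyverno `ClusterPolicy` that blocks privileged containers might be declared like this - a sketch based on the standard Kyverno sample policy, not necessarily the exact policies shipped in this layer:

```hcl
# Sketch: reject any Pod that asks for privileged mode.
resource "kubernetes_manifest" "disallow_privileged" {
  manifest = {
    apiVersion = "kyverno.io/v1"
    kind       = "ClusterPolicy"
    metadata   = { name = "disallow-privileged-containers" }
    spec = {
      validationFailureAction = "Enforce" # block, don't just audit
      rules = [{
        name  = "privileged-containers"
        match = { any = [{ resources = { kinds = ["Pod"] } }] }
        validate = {
          message = "Privileged containers are not allowed."
          pattern = {
            spec = {
              containers = [{
                "=(securityContext)" = { "=(privileged)" = "false" }
              }]
            }
          }
        }
      }]
    }
  }
}
```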
# Navigate to security layer
cd layers/07_security/
# Deploy security stack
terraform init
terraform plan
terraform apply
- Security Dashboard: https://grafana.terraform-aws-platform.xyz (Security Overview)
- Policy Reports: `kubectl get clusterpolicies`
- Vulnerability Reports: `kubectl get vulnerabilityreports -A`
- Runtime Alerts: `kubectl logs -n falco-system -l app.kubernetes.io/name=falco`
- Cloud Reports: S3 bucket with weekly Prowler assessments
- Zero-trust policies - No privileged containers, resource limits enforced
- Continuous scanning - All images automatically scanned for vulnerabilities
- Behavioral monitoring - Real-time detection of suspicious runtime activity
- Compliance automation - Weekly AWS security posture assessment
- Unified visibility - All security metrics in existing Grafana dashboards
- Monitor policy violations in Grafana security dashboard
- Review vulnerability reports weekly for critical CVEs
- Check Falco alerts for runtime security incidents
- Access Prowler reports in S3 for compliance evidence
- Use Kyverno policies to prevent misconfigurations before deployment
- Monitor security dashboard for baseline establishment
- Tune Falco rules based on environment-specific patterns
- Create additional Kyverno policies for organization requirements
- Set up automated remediation workflows for critical vulnerabilities
- Immediate: Policy enforcement prevents misconfigurations
- Short-term: Vulnerability visibility and runtime threat detection
- Long-term: Compliance-ready platform with defense-in-depth security
In the `100_app` directory we have a `main.tf` file which is responsible for deploying Google's Microservices Demo App.

`resource "kubernetes_namespace_v1" "onlineboutique"`

Creates the `onlineboutique` namespace for the application.
resource "kubernetes_manifest" "app_chart"
- Deploys Google's Online Boutique microservices demo via ArgoCD
- Uses Helm chart from Google's container registry
- Auto-sync enabled: ArgoCD automatically applies changes and heals drift
- Frontend config: Sets
externalService: false
(uses ingress instead of LoadBalancer)
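In outline, the manifest wrapped by this resource is an ArgoCD `Application` along these lines. This is a sketch: the chart version and Helm value layout are assumptions; only the repoURL, chart, frontend setting, and sync behaviour come from this README:

```hcl
# Sketch of the ArgoCD Application wrapped by kubernetes_manifest (details simplified).
resource "kubernetes_manifest" "app_chart_sketch" {
  manifest = {
    apiVersion = "argoproj.io/v1alpha1"
    kind       = "Application"
    metadata   = { name = "onlineboutique", namespace = "argocd" }
    spec = {
      project = "default"
      source = {
        repoURL        = "us-docker.pkg.dev/online-boutique-ci/charts"
        chart          = "onlineboutique"
        targetRevision = "*" # chart version constraint (placeholder)
        helm = {
          valuesObject = {
            frontend = { externalService = false } # use the ingress instead of a LoadBalancer
          }
        }
      }
      destination = {
        server    = "https://kubernetes.default.svc"
        namespace = "onlineboutique"
      }
      syncPolicy = {
        automated = { prune = true, selfHeal = true } # auto-apply changes and heal drift
      }
    }
  }
}
```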
resource "kubernetes_ingress_v1" "frontend"
- Exposes the frontend service at
app.terraform-aws-platform.xyz
- Auto SSL: cert-manager automatically gets Let's Encrypt certificate
- Routes: All traffic to the frontend service on port 80
resource "kubernetes_manifest" "cluster_secret_store"
- Syncs secret from AWS Parameter Store (
cluster-prod-k8s-platform-tutorial-secret
) - Creates Kubernetes secret named
onlineboutique-custom-secret
- Auto-refresh: Updates every hour
Complete e-commerce demo: A fully functional microservices application with:
- Public access: https://app.terraform-aws-platform.xyz (SSL-enabled)
- GitOps: managed by ArgoCD
- Secret management: AWS Parameter Store integration
- Production-ready: ingress, TLS, namespace isolation
source:
repoURL: us-docker.pkg.dev/online-boutique-ci/charts
chart: onlineboutique
This Helm chart creates multiple services: `frontend`, `cartservice`, `paymentservice`, `productcatalogservice`, etc.
When the Online Boutique chart deploys, it creates a service named `frontend` that looks like:
apiVersion: v1
kind: Service
metadata:
name: frontend # <-- This is the actual service name
namespace: onlineboutique
spec:
ports:
- port: 80
selector:
app: frontend
backend {
service {
name = "frontend" # <-- Must match the actual service name
port {
number = 80
}
}
}
After deployment, run:
kubectl get services -n onlineboutique
You'll see services like:
- frontend
- cartservice
- paymentservice
- etc.
The ingress backend service name must match an existing Kubernetes service. Since Google's Online Boutique creates a service called `frontend`, that's what you reference in the ingress.
If you tried to use `name = "my-random-name"`, the ingress would fail because that service doesn't exist.
cd app/
terraform init
terraform apply
User (https://app.terraform-aws-platform.xyz)
↓
DNS Resolution (Route53)
↓
AWS Load Balancer
↓
Ingress Controller (nginx)
↓
Frontend Service
↓
OnlineBoutique Pods
- DNS Resolution Problems
  - Verify nameserver configuration at domain registrar
  - Check Route53 hosted zone records
- Certificate Issues
  - Ensure HTTP traffic is enabled for cert-manager validation
  - Verify domain ownership and DNS propagation
- Ingress Controller Issues
  - Check load balancer provisioning
  - Verify security group configurations
# Check cluster status
kubectl get nodes
# Check ingress controller
kubectl -n ingress-nginx get svc
# Check certificates
kubectl get certificates -A
# Check ArgoCD applications
kubectl -n argocd get applications
Layer | Primary Cost Drivers | Estimated Monthly Cost |
---|---|---|
Foundation | EKS control plane ($72), Worker nodes (~$400), NAT Gateways ($135) | ~$650 |
Core Platform | Network Load Balancer ($16), EBS storage ($50) | ~$80 |
Configuration | DNS queries, certificates (minimal) | ~$5 |
Observability | EBS storage for Prometheus/Loki ($100), compute overhead ($100) | ~$200 |
Resilience | S3 backup storage ($15), EBS snapshots ($20) | ~$35 |
FinOps | OpenCost compute overhead (minimal) | ~$10 |
Security | Trivy/Kyverno overhead ($25), Prowler S3 ($5) | ~$30 |
Total Platform Cost | | ~$1,010/month |
Note: Costs based on us-east-1 pricing with 3-4 t3.xlarge worker nodes and moderate storage usage
- Use Spot Instances: Replace on-demand worker nodes with spot instances for 60-90% compute savings
- Right-size Worker Nodes: Start with t3.large instead of t3.xlarge, scale up as needed
- Single AZ for Development: Use one NAT Gateway for dev environments (-$90/month)
- Reduce EBS Storage: Use gp3 instead of gp2, implement storage lifecycle policies
- Implement Karpenter: Replace cluster-autoscaler for better instance selection and spot integration
- Storage Optimization: Compress Loki logs, reduce Prometheus retention period
- Resource Requests/Limits: Properly size all workloads to improve node utilization
- Reserved Instances: Commit to 1-year terms for predictable workloads
- Cluster Consolidation: Run multiple environments on same cluster with namespace separation
- Managed Services: Consider EKS Fargate for specific workloads to eliminate node management
- Backup Optimization: Implement intelligent backup scheduling and retention policies
- Development Environment: Use smaller instance types and reduced replica counts for non-prod
- Set up Cost Alerts: Configure AWS Budgets for proactive cost monitoring (see the sketch after this list)
- Regular Cost Reviews: Weekly cost analysis using OpenCost dashboards
- Resource Tagging: Implement comprehensive tagging for accurate cost allocation
- Automated Cleanup: Schedule removal of unused resources and old backups
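A minimal sketch of such a budget alert - the limit and threshold are illustrative, and `var.email` is assumed to be the variable set via `TF_VAR_email` in the prerequisites:

```hcl
# Sketch: alert when actual monthly spend crosses 80% of a ~$1,100 budget.
resource "aws_budgets_budget" "platform_monthly" {
  name         = "k8s-platform-monthly"
  budget_type  = "COST"
  limit_amount = "1100"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.email]
  }
}
```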
Potential Total Savings: 30-60% reduction possible with aggressive optimization, bringing monthly costs down to $400-700/month while maintaining production capabilities.
- Infrastructure as Code: Everything defined in Terraform
- Layered Architecture: Clear separation of concerns
- Secret Management: External secret synchronization
- Certificate Management: Automatic SSL/TLS provisioning
- State Management: Remote state storage in S3
- GitOps: Automated deployments via ArgoCD
- Fork the repository
- Create a feature branch
- Make changes following the established patterns
- Test thoroughly in a development environment
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.