Fixed grammar and formatting issues in architecture.mdx #6922

Merged 2 commits on Jul 28, 2025

@@ -6,26 +6,26 @@ description: An overview of the core components and design principles that enabl

![High-level diagram of Hybrid Manager's architecture](../images/hcp_high-level_light.svg)

The outermost containers of the diagram illustrate a Kubernetes cluster running on a Cloud Service Provider's infrastructure, such as [EKS on AWS](https://aws.amazon.com/eks/) or [GKE on GCP](https://cloud.google.com/kubernetes-engine), or on-premises infrastructure. "On-premises" in this context includes deployments on physical servers (bare metal), virtual machines, or private clouds hosted within an organization’s infrastructure.
Working inward through the diagram, there are three main logical groupings to note:

1. **The Kubernetes control plane**: The Kubernetes control plane consists of the core Kubernetes components, such as `kube-apiserver`, `etcd`, `kube-scheduler`, `kube-controller-manager`, and, on cloud setups, `cloud-controller-manager`; it is implemented by two or three Kubernetes control nodes.

2. **The Control Compute Grouping**: The Control Compute Grouping of HM is a logical grouping of 70+ pods running on 2 or 3 Kubernetes worker nodes that implement the HM's core management components, such as Grafana, Loki, Thanos, Prometheus for observability, Cert Manager for certification management, Istio for networking, a trust manager for securing software releases, and many others.
The 70+ pods for the components are distributed across the worker nodes according to how the `kube-scheduler` places them, which is not as simply as pictured in the diagram.
2. **The Control Compute Grouping**: The Control Compute Grouping of HM is a logical grouping of 70+ pods, running on two or three Kubernetes worker nodes, that implements the HM's core management components, such as Grafana, Loki, Thanos, and Prometheus for observability, Cert Manager for certificate management, Istio for networking, a trust manager for securing software releases, and many others.
The 70+ pods for these components are distributed across the worker nodes according to how the `kube-scheduler` places them, which is not as simple as illustrated in the diagram (one way to inspect the actual placement with `kubectl` is sketched after this list).
If needed, more worker nodes can be added to support more database-as-a-service resources.

3. **The Data Compute Grouping**: The Data Compute Grouping of HM is a logical grouping of Postgres clusters organized by namespaces and running on a number of pods distributed across at least 3 Kubernetes worker nodes as the `kube-scheduler` places them, which may not be as simply as pictured.
The number of worker nodes can be increased from three as more databases are added using HM, but maybe require either adding the worker nodes manually or configuring auto scaling alongside your infrastructure to do this automatically.
3. **The Data Compute Grouping**: The Data Compute Grouping of HM is a logical grouping of Postgres clusters organized by namespaces and running on a number of pods distributed across at least three Kubernetes worker nodes as the `kube-scheduler` places them, which may not be as simple as illustrated in the diagram.
The number of worker nodes can be increased from three as more databases are added using HM, but this may require either adding the worker nodes manually or configuring auto-scaling alongside your infrastructure to do this automatically.
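
A quick way to see how the `kube-scheduler` has actually spread these pods is to list them alongside their assigned nodes. This is a minimal sketch using plain `kubectl` rather than HM-specific tooling; the node name in the last command is a placeholder.

```shell
# List every pod together with the worker node it was scheduled onto.
kubectl get pods --all-namespaces -o wide

# Count pods per worker node to see how evenly they are distributed.
kubectl get pods --all-namespaces -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c

# Show only the pods running on one specific worker node (placeholder name).
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=worker-node-1
```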

Both the Control Compute Grouping's worker nodes and the and the Data Compute Grouping's worker nodes are backed up nightly to either Snapshots (on-premises using a SAN, or using EBS with EKS) or S3 compatible object store using Barman ([for the cloud](https://docs.pgbarman.org/release/3.12.0/user_guide/barman_cloud.html)).
Both the Control Compute Grouping's worker nodes and the Data Compute Grouping's worker nodes are backed up nightly, either to snapshots (on-premises using a SAN, or using EBS with EKS) or to an S3-compatible object store using Barman ([for the cloud](https://docs.pgbarman.org/release/3.12.0/user_guide/barman_cloud.html)).
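
For the object-store path, Barman's cloud client can push a base backup straight to an S3-compatible bucket. The bucket, prefix, and server names below are placeholders, and the exact invocation HM uses may differ; this is only a hedged sketch of the upstream `barman-cloud-backup` tooling.

```shell
# Take a base backup and upload it to an S3-compatible bucket
# (bucket, prefix, and server name are placeholders).
barman-cloud-backup --cloud-provider aws-s3 \
  s3://example-hm-backups/postgres pg-cluster-1

# List the backups currently held in the object store.
barman-cloud-backup-list --cloud-provider aws-s3 \
  s3://example-hm-backups/postgres pg-cluster-1
```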

### Scalability

To support scalability, HM integrates with Kubernetes' auto-scaling capabilities.
This means that on platforms like AWS, worker nodes in the Data Compute Grouping can be automatically added to the cluster as needed using AWS's manged service [Auto Scaling groups](https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-groups.html).
This means that on platforms like AWS, worker nodes in the Data Compute Grouping can be automatically added to the cluster as needed using AWS's managed service [Auto Scaling groups](https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-groups.html).
However, the HM's Kubernetes auto-scaling capability can be disabled to control costs.
In addition, it's important to note that enabling auto-scaling for on-premises deployments requires additional infrastructure configuration, since there is no out-of-the-box feature like AWS's Auto Scaling groups in on-premises environments.
Monitoring the distribution and resource usage of pods and worker nodes implementing the database clusters can be achieved through the [included Grafana dashboards](../using_hybrid_manager/cluster_management/manage-clusters/trace-clusters/#grafana-hardware-utilization-dashboard) or by [using Kubernetes `kubectl` commands](../using_hybrid_manager/cluster_management/manage-clusters/trace-clusters/#using-kubectl-to-check-underlying-cluster-resources).
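
As an illustration of the AWS-managed path, the node group backing the Data Compute Grouping can be resized through its Auto Scaling group, and the resulting nodes and their usage checked with `kubectl`. The group name is a placeholder and `kubectl top` assumes a metrics server is installed; treat this as a sketch rather than HM's documented procedure.

```shell
# Widen the size limits of the Auto Scaling group backing the
# Data Compute Grouping's worker nodes (group name is a placeholder).
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name hm-data-compute-nodes \
  --min-size 3 --max-size 6 --desired-capacity 4

# Confirm that the new worker nodes have joined the cluster.
kubectl get nodes

# Check per-node CPU and memory usage (requires a metrics server).
kubectl top nodes
```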

### Backup architecture
@@ -40,9 +40,9 @@ For more robust, multi-availability zone recovery, Barman Cloud backups can be s
Alternatively, local object stores might require additional configuration to ensure backups are replicated to remote sites.
Stretch clusters mitigate these limitations by enabling data replication and redundancy across geographically dispersed locations.
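
When the local object store does not replicate on its own, one simple approach is to mirror the backup bucket to a bucket in another region on a schedule. The bucket names and regions below are placeholders, and a managed mechanism such as S3 Cross-Region Replication is usually preferable where available; this is only a sketch.

```shell
# Mirror the local backup bucket to a bucket in a second region
# (bucket names and regions are placeholders).
aws s3 sync s3://example-hm-backups s3://example-hm-backups-dr \
  --source-region us-east-1 --region us-west-2
```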

### High Availability and resilience
### High Availability and resiliency

HM leverages Kubernetes' proven technologies to ensure high availability through automated failover mechanisms.
These mechanisms function within stretch clusters, across availability zones, or between zones in the same region.
The architecture supports faster backup and recovery of large datasets via snapshot backups, though multi-site recovery capabilities depend on the underlying storage infrastructure.
For customers with advanced storage configurations, such as SANs (Storage Area Networks) spanning multiple racks or data centers, HM can leverage the infrastructure to enhance resilience and ensure rapid failover. However, these configurations may not be common among all users.
For customers with advanced storage configurations, such as SANs (Storage Area Networks) spanning multiple racks or data centers, HM can leverage that infrastructure to enhance resiliency and ensure rapid failover. However, these configurations may not be common across all environments.
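
To confirm that worker nodes, and therefore the Postgres instances on them, are actually spread across failure domains, the standard Kubernetes topology labels can be inspected. The label keys below are the well-known upstream ones and the namespace is a placeholder; a minimal sketch:

```shell
# Show each node's availability zone and region via the well-known topology labels.
kubectl get nodes -L topology.kubernetes.io/zone,topology.kubernetes.io/region

# List Postgres pods with their assigned nodes to check that instances of the
# same cluster are not co-located (namespace is a placeholder).
kubectl get pods -n edb-postgres -o wide
```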