Commit 3ac0eb9

Merge branch 'stackhpc/2023.1' into conf/INFRA-629

2 parents 9cc9eb1 + bc83165

50 files changed: +1185 additions, -310 deletions

.github/workflows/stackhpc-container-image-build.yml

Lines changed: 2 additions & 2 deletions

@@ -149,7 +149,7 @@ jobs:
       # Normally installed during host configure.
       - name: Install Docker Python SDK
         run: |
-          sudo pip install docker
+          sudo pip install docker 'requests<2.32.0'

       - name: Get Kolla tag
         id: write-kolla-tag
@@ -253,7 +253,7 @@ jobs:
           if docker push $image; then
             echo "Pushed $image"
             break
-          elif $i == 5; then
+          elif [ $i -eq 5 ] ; then
            echo "Failed to push $image"
            echo $image >> image-build-logs/push-failed-images.txt
          else
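
The second hunk fixes a shell bug: ``elif $i == 5; then`` tries to execute the value of ``$i`` as a command, so the final-attempt branch never fired. A minimal, self-contained sketch of the corrected retry pattern (``attempt_push`` is a hypothetical stand-in for ``docker push $image`` that always fails, to exercise the failure branch):

```shell
# Stand-in for `docker push $image`; always fails so the example reaches
# the final-attempt branch.
attempt_push() {
    return 1
}

for i in 1 2 3 4 5; do
    if attempt_push; then
        echo "Pushed"
        break
    elif [ "$i" -eq 5 ]; then
        # POSIX numeric test; the old `elif $i == 5; then` would instead
        # try to run "1" (the value of $i) as a command.
        echo "Failed to push after 5 attempts"
    fi
done
```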

doc/source/configuration/cloudkitty.rst

Lines changed: 141 additions & 0 deletions

@@ -0,0 +1,141 @@
==========
CloudKitty
==========

Configuring in kayobe-config
============================

By default, CloudKitty uses Gnocchi and Ceilometer as the collector and fetcher
backends. Unless the system has a specific reason not to, we recommend instead
using Prometheus as the backend for both. The following instructions explain
how to do this. Also, see the `Kolla Ansible docs on CloudKitty
<https://docs.openstack.org/kolla-ansible/latest/reference/rating/cloudkitty-guide.html>`__
for more details.

Enable CloudKitty and disable InfluxDB, as we are using OpenSearch as the
storage backend. Set the following in ``kolla.yml``:

.. code-block:: yaml

   kolla_enable_cloudkitty: true
   # Explicitly disable InfluxDB, as we are using OpenSearch as the CloudKitty backend
   kolla_enable_influxdb: false

Set Prometheus as the backend for both the collector and fetcher, and
Elasticsearch as the storage backend. Note that our fork of CloudKitty is
patched so that the CloudKitty Elasticsearch V2 storage backend will also work
with an OpenSearch cluster. Proper support for a V2 OpenSearch storage
backend is still pending in Kolla Ansible `here
<https://review.opendev.org/c/openstack/kolla-ansible/+/898555>`__. Set the
following in ``kolla/globals.yml``:

.. code-block:: yaml

   cloudkitty_collector_backend: prometheus
   cloudkitty_fetcher_backend: prometheus
   cloudkitty_storage_backend: elasticsearch

If you have TLS enabled, you will also need to set the CA file for Prometheus
and Elasticsearch. Set the following in ``kolla/globals.yml``:

.. code-block::

   {% raw %}
   cloudkitty_prometheus_cafile: "{{ openstack_cacert }}"
   cloudkitty_elasticsearch_cafile: "{{ openstack_cacert }}"
   {% endraw %}

The default collection period is one hour, which is likely too long for most
systems, as CloudKitty charges for the **entire** collection period if any
usage is seen within that timeframe, regardless of actual usage: even one
minute of usage will be charged as a full hour. It is therefore recommended
to reduce the collection interval, ``period`` (in units of seconds),
appropriately, e.g. to ten minutes. Furthermore, when using Prometheus as the
collector, you need to change the ``scope_key`` to match the metrics provided
by the Prometheus OpenStack Exporter. Both can be achieved by setting the
following in ``kolla/config/cloudkitty.conf``:

.. code-block:: ini

   [collect]
   scope_key = tenant_id
   period = 600

You will need to configure which metrics CloudKitty should track. The
following example, set in ``kolla/config/cloudkitty/metrics.yml``, will track
VM flavors and the total utilised volume:

.. code-block:: yaml

   metrics:
     openstack_nova_server_status:
       alt_name: instance
       groupby:
         - uuid
         - user_id
         - tenant_id
       metadata:
         - flavor_id
         - name
       mutate: MAP
       mutate_map:
         0.0: 1.0  # ACTIVE
         11.0: 1.0 # SHUTOFF
         12.0: 1.0 # SUSPENDED
         16.0: 1.0 # PAUSED
       unit: instance
     openstack_cinder_limits_volume_used_gb:
       alt_name: storage
       unit: GiB
       groupby:
         - tenant_id

If your system had Monasca deployed in the past, you likely have some
relabelled attributes in the Prometheus OpenStack exporter configuration. To
account for this, you should either remove the custom relabelling (in
``kolla/config/prometheus.yml``) or change your ``metrics.yml`` to use the
correct attributes.
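
For illustration, a leftover Monasca-era rule might look like the following (a hypothetical sketch: the label names are ours, not taken from any particular deployment):

```yaml
# Hypothetical leftover relabelling in kolla/config/prometheus.yml: this
# renames the exporter's tenant_id label, so metrics.yml would have to
# group by project_id instead of tenant_id.
metric_relabel_configs:
  - source_labels: [tenant_id]
    target_label: project_id
```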

Post-configuration with openstack-config
========================================

This is an example `openstack-config
<https://github.com/stackhpc/openstack-config>`__ setup to create mappings for
the metrics configured above. Note that the costs are scaled for the
ten-minute collection period, e.g. a flavor with 1 VCPU will cost 1 unit per
hour.

.. code-block:: yaml

   # Map flavors based on VCPUs
   openstack_ratings_hashmap_field_mappings:
     - service: instance
       name: flavor_id
       mappings:
         - value: '1' # tiny compute flavor (1 vcpu) with an OpenStack flavor ID of 1
           cost: 0.1666666666666666
           type: flat
         - value: '2' # small compute flavor (2 vcpus) with an OpenStack flavor ID of 2
           cost: 0.3333333333333333
           type: flat
         - value: '3' # medium compute flavor (3 vcpus) with an OpenStack flavor ID of 3
           cost: 0.5
           type: flat
         - value: '4' # large compute flavor (4 vcpus) with an OpenStack flavor ID of 4
           cost: 0.6666666666666666
           type: flat
         - value: '5' # xlarge compute flavor (8 vcpus) with an OpenStack flavor ID of 5
           cost: 1.3333333333333333
           type: flat
         - value: '6' # tiny 2 compute flavor (2 vcpus) with an OpenStack flavor ID of 6
           cost: 0.3333333333333333
           type: flat

   # Map volumes based on GB
   openstack_ratings_hashmap_service_mappings:
     - service: storage
       cost: 0.16666666666666666
       type: flat
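
As a sanity check on the cost scaling above, the arithmetic can be sketched in a few lines of Python (the helper name is ours and purely illustrative; CloudKitty performs this aggregation internally):

```python
# Illustrative arithmetic only: convert a flat per-period hashmap cost to an
# effective hourly rate for a given collection period.
def hourly_cost(cost_per_period: float, period_seconds: int = 600) -> float:
    periods_per_hour = 3600 / period_seconds
    return cost_per_period * periods_per_hour

# With period = 600, there are 6 collection periods per hour, so a 1-vCPU
# flavor rated 0.1666... per period costs ~1 unit/hour, and the 8-vCPU
# flavor rated 1.3333... costs ~8 units/hour.
print(round(hourly_cost(0.1666666666666666), 6))
print(round(hourly_cost(1.3333333333333333), 6))
```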

See the `OpenStack CloudKitty Ratings role
<https://github.com/stackhpc/ansible-collection-openstack/tree/main/roles/os_ratings>`__
for more details.

doc/source/configuration/index.rst

Lines changed: 1 addition & 0 deletions

@@ -20,3 +20,4 @@ the various features provided.
    magnum-capi
    ci-cd
    security-hardening
+   cloudkitty

doc/source/configuration/monitoring.rst

Lines changed: 81 additions & 3 deletions

@@ -126,6 +126,8 @@ depending on your configuration, you may need set the
 ``kolla_enable_prometheus_ceph_mgr_exporter`` variable to ``true`` in order to
 enable the ceph mgr exporter.

+.. _os-capacity:
+
 OpenStack Capacity
 ==================

@@ -149,9 +151,19 @@ project domain name in ``stackhpc-monitoring.yml``:

    stackhpc_os_capacity_openstack_region_name: <openstack_region_name>

 Additionally, you should ensure these credentials have the correct permissions
-for the exporter. If you are deploying in a cloud with internal TLS, you may be required
-to disable certificate verification for the OpenStack Capacity exporter
-if your certificate is not signed by a trusted CA.
+for the exporter.
+
+If you are deploying in a cloud with internal TLS, you may be required
+to provide a CA certificate for the OpenStack Capacity exporter if your
+certificate is not signed by a trusted CA. For example, to use a CA certificate
+named ``vault.crt`` that is also added to the Kolla containers:
+
+.. code-block:: yaml
+
+   stackhpc_os_capacity_openstack_cacert: "{{ kayobe_env_config_path }}/kolla/certificates/ca/vault.crt"
+
+Alternatively, to disable certificate verification for the OpenStack Capacity
+exporter:

 .. code-block:: yaml
@@ -169,3 +181,69 @@ If you notice ``HaproxyServerDown`` or ``HaproxyBackendDown`` prometheus
 alerts after deployment, it's likely the os_exporter secrets have not been
 set correctly; double-check you have entered the correct authentication
 information appropriate to your cloud and re-deploy.
+
+Friendly Network Names
+======================
+
+For operators that prefer to see descriptive or friendly interface names, the
+following play can be run. It takes the network names defined in kayobe and
+relabels the devices/interfaces in Prometheus to make use of these names.
+
+**Check the considerations and known limitations below to see if this is
+suitable in a given environment before applying.**
+
+This reuses existing fields to provide good compatibility with existing
+dashboards and alerts.
+
+To enable the change:
+
+.. code-block:: console
+
+   kayobe playbook run etc/kayobe/ansible/prometheus-network-names.yml
+   kayobe overcloud service reconfigure --kt prometheus
+
+This first generates a template based on the ``prometheus.yml.j2`` in
+``etc/kayobe/ansible/``, which is further templated for use with
+kolla-ansible. The result is then rolled out via the service reconfigure.
+
+This helps Prometheus provide insights that can be more easily understood by
+those without an intimate understanding of a given site. Prometheus Node
+Exporter and cAdvisor both provide network statistics using the
+interface/device names. This play causes Prometheus to relabel these fields
+to human-readable names based on the networks defined in kayobe, e.g.
+``bond1.1838`` may become ``storage_network``.
+
+The default labels are preserved with the prefix ``original_``:
+
+* For node_exporter, ``device`` is then used for network names, while
+  ``original_device`` is used for the interface itself.
+* For cAdvisor, ``interface`` is used for network names, and
+  ``original_interface`` is used to preserve the interface name.
+
+:Known-Limitations/Considerations/Requirements:
+
+  Before enabling this feature, the implications must be discussed with the
+  customer. The following are key considerations for that conversation:
+
+  * Only network names defined within kayobe are in scope.
+  * Tenant network interfaces, including SR-IOV, are not considered or
+    modified.
+  * Only the interface directly attributed to a network will be relabelled.
+    This may be a bond, a VLAN-tagged sub-interface, or both. The parent
+    bond, or bond members, are not relabelled unless they are captured
+    within a distinct defined network.
+  * Modified entries will be within existing labels. This may be breaking
+    for anything that expects the original structure, including custom
+    dashboards, alerting, billing, etc.
+  * After applying, there will be inconsistency in the time-series database
+    for the duration of the retention period, i.e. until previously ingested
+    entries expire. Metrics gathered prior to applying these modifications
+    will be unaltered, with all new metrics using the new structure.
+  * The interface names and their purpose must be consistent and unique
+    within the environment, i.e. if eth0 is defined as admin_interface on
+    one node, no other node can include a different network definition using
+    eth0. This does not apply when both devices are bond members, e.g. bond0
+    on a controller has eth0 and eth1 as members, and bond1 on a compute
+    node uses eth0 and eth1 as members. This is not problematic, as only the
+    bond itself is relabelled.
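
The effect of the relabelling can be illustrated with a before/after query sketch (``storage_network`` and ``bond1.1838`` are the example names from above; the metric is standard node_exporter):

```promql
# Before relabelling: per-site interface names must be known in advance
rate(node_network_receive_bytes_total{device="bond1.1838"}[5m])

# After relabelling: the kayobe network name is used, and the raw
# interface name remains available as original_device
rate(node_network_receive_bytes_total{device="storage_network"}[5m])
```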

doc/source/configuration/release-train.rst

Lines changed: 27 additions & 0 deletions
@@ -147,6 +147,33 @@ By default, HashiCorp images (Consul and Vault) are not synced from Docker Hub
 to the local Pulp. To sync these images, set ``stackhpc_sync_hashicorp_images``
 to ``true``.

+Custom container images
+-----------------------
+
+A custom list of container images can be synced to the local Pulp using the
+``stackhpc_pulp_repository_container_repos_extra`` and
+``stackhpc_pulp_distribution_container_extra`` variables:
+
+.. code-block:: yaml
+
+   # List of extra container image repositories.
+   stackhpc_pulp_repository_container_repos_extra:
+     - name: "certbot/certbot"
+       url: "https://registry-1.docker.io"
+       policy: on_demand
+       proxy_url: "{{ pulp_proxy_url }}"
+       state: present
+       include_tags: "nightly"
+       required: True
+
+   # List of extra container image distributions.
+   stackhpc_pulp_distribution_container_extra:
+     - name: certbot
+       repository: certbot/certbot
+       base_path: certbot/certbot
+       state: present
+       required: True
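
Once synced, the distribution's ``base_path`` determines the pull path on the local Pulp registry. Assuming a registry reachable at ``pulp.example.com`` (a placeholder host, not from the source), the image above could then be pulled as:

```console
docker pull pulp.example.com/certbot/certbot:nightly
```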

 Usage
 =====