diff --git a/doc/source/operations/index.rst b/doc/source/operations/index.rst
index 825384c4b..c73ed1ac9 100644
--- a/doc/source/operations/index.rst
+++ b/doc/source/operations/index.rst
@@ -10,7 +10,6 @@ This guide is for operators of the StackHPC Kayobe configuration project.
    hotfix-playbook
    nova-compute-ironic
    octavia
-   rabbitmq
    secret-rotation
    tempest
    upgrading-openstack
diff --git a/doc/source/operations/rabbitmq.rst b/doc/source/operations/rabbitmq.rst
deleted file mode 100644
index 5a82d98f4..000000000
--- a/doc/source/operations/rabbitmq.rst
+++ /dev/null
@@ -1,177 +0,0 @@
-========
-RabbitMQ
-========
-
-High Availability
-=================
-
-In order to improve the stability of RabbitMQ, some changes need to be rolled
-out. These changes are:
-
-* Update RabbitMQ to version 3.9.22, if you are running old images on Wallaby
-  or Xena, by synchronising and then pulling a new RabbitMQ container from
-  Pulp.
-* Enable the high availability setting via Kolla-Ansible
-  ``om_enable_rabbitmq_high_availability``.
-
-By default in Kolla-Ansible, two key options for the high availability of
-RabbitMQ are disabled. These are durable queues, where messages are persisted
-to disk; and classic queue mirroring, where messages are replicated across
-multiple exchanges. Without these, a deployment has a higher risk of experiencing
-issues when updating RabbitMQ, or recovering from network outages.
-Messages held in RabbitMQ nodes that are stopped will be lost, which causes
-knock-on effects to the OpenStack services which either sent or were expecting
-to receive them. The Kolla-Ansible flag
-``om_enable_rabbitmq_high_availability`` can be used to enable both of these
-options. The default will be overridden to ``true`` from Xena onwards in StackHPC Kayobe configuration.
-
-While the `RabbitMQ docs `_ do
-say "throughput and latency of a queue is not affected by whether a queue is
-durable or not in most cases", it should be mentioned that there could be a
-potential performance hit from replicating all messages to the disk within
-large deployments. These changes would therefore be a tradeoff of performance
-for stability.
-
-**NOTE:** There is guaranteed to be downtime during this procedure, as it
-requires restarting RabbitMQ and all the OpenStack services that use it. The
-state of RabbitMQ will also be reset.
-
-Instructions
-------------
-If you are planning to perform an upgrade, it is recommended to first roll out these changes.
-
-The configuration should be merged with StackHPC Kayobe configuration. If
-bringing in the latest changes is not possible for some reason, you may cherry
-pick the following changes:
-
-RabbitMQ hammer playbook (all releases):
-
-* ``3933e4520ba512b5bf095a28b791c0bac12c5dd0``
-* ``d83cceb2c41c18c2406032dac36cf90e57f37107``
-* ``097c98565dd6bd0eb16d49b87e4da7e2f2be3a5c``
-
-RabbitMQ tags (Wallaby):
-
-* ``69c245dc91a2eb4d34590624760c32064c3ac07b``
-
-RabbitMQ tags & HA flag (Xena):
-
-* ``2fd1590eb8ac739a07ad9cccbefc7725ea1a3855``
-
-RabbitMQ HA flag (Yoga):
-
-* ``31406648544372187352e129d2a3b4f48498267c``
-
-If you are currently running Wallaby, you will need to enable the HA config option in
-``etc/kayobe/kolla/globals.yml``.
-
-.. code-block:: console
-
-   om_enable_rabbitmq_high_availability: true
-
-If you are running Wallaby or Xena, synchronise the Pulp containers.
-
-.. code-block:: console
-
-   kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-sync.yml $KAYOBE_CONFIG_PATH/ansible/pulp-container-publish.yml -e stackhpc_pulp_images_kolla_filter=rabbitmq
-
-Ensure that Kolla Ansible is up to date.
-
-.. code-block:: console
-
-   kayobe control host bootstrap
-
-Generate the new config files for the overcloud services. This ensures that
-queues are created as durable.
-
-.. code-block:: console
-
-   kayobe overcloud service configuration generate --node-config-dir /etc/kolla
-
-Pull the RabbitMQ container image.
-
-.. code-block:: console
-
-   kayobe overcloud container image pull -kt rabbitmq
-
-Stop all the OpenStack services which use RabbitMQ.
-
-.. code-block:: console
-
-   kayobe overcloud host command run -b --command "systemctl -a | egrep '(barbican|blazar|ceilometer|cinder|cloudkitty|designate|heat|ironic|keystone|magnum|manila|masakari|neutron|nova|octavia)' | awk '{ print \$1 }' | xargs systemctl stop"
-
-Upgrade RabbitMQ.
-
-.. code-block:: console
-
-   kayobe overcloud service upgrade -kt rabbitmq --skip-prechecks
-
-In order to convert the queues to be durable, you will need to reset the state
-of RabbitMQ. This can be done with the RabbitMQ hammer playbook:
-
-.. code-block:: console
-
-   kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/rabbitmq-reset.yml --skip-tags restart-openstack
-
-Check to see if RabbitMQ is functioning as expected.
-
-.. code-block:: console
-
-   kayobe overcloud host command run --limit controllers --show-output --command 'docker exec rabbitmq rabbitmqctl cluster_status'
-
-The cluster status should list all controllers.
-
-Check to see if all OpenStack queues and exchanges have been removed from the RabbitMQ cluster.
-
-.. code-block:: console
-
-   kayobe overcloud host command run --limit controllers --show-output --command 'docker exec rabbitmq rabbitmqctl list_queues name'
-   kayobe overcloud host command run --limit controllers --show-output --command 'docker exec rabbitmq rabbitmqctl list_exchanges name'
-
-There should be no queues listed, and the only exchanges listed should start with `amq.`.
-
-Start the OpenStack services which use RabbitMQ. Note that this will start all
-matching services, even if they weren't running prior to starting this
-procedure.
-
-.. code-block:: console
-
-   kayobe overcloud host command run -b --command "systemctl -a | egrep '(barbican|blazar|ceilometer|cinder|cloudkitty|designate|heat|ironic|keystone|magnum|manila|masakari|neutron|nova|octavia)' | awk '{ print \$1 }' | xargs systemctl start"
-
-Check to see if the expected queues are durable.
-
-.. code-block:: console
-
-   kayobe overcloud host command run --limit controllers --show-output --command 'docker exec rabbitmq rabbitmqctl list_queues name durable'
-
-The queues listed should be durable if their names do not start with the
-following:
-
-* amq.
-* .\*\_fanout\_
-* reply\_
-
-If there are issues with the services after this, particularly during upgrades,
-you may find it useful to reuse the hammer playbook, ``rabbitmq-reset.yml``.
-
-Known issues
-------------
-
-If there are any OpenStack services running without durable queues enabled
-while the RabbitMQ cluster is reset, they are likely to create non-durable
-queues before the other OpenStack services start. This leads to an error
-such as the following when other OpenStack services start::
-
-   Unable to connect to AMQP server on :5672 after inf tries:
-   Exchange.declare: (406) PRECONDITION_FAILED - inequivalent arg 'durable'
-   for exchange 'neutron' in vhost '/': received 'true' but current is
-   'false': amqp.exceptions.PreconditionFailed: Exchange.declare: (406)
-   PRECONDITION_FAILED - inequivalent arg 'durable' for exchange 'neutron' in
-   vhost '/': received 'true' but current is 'false'
-
-This may happen if a host is not in the inventory, leading to them not being
-targeted by the ``systemctl stop`` command. If this does happen, look for the
-hostname of the offending node in the queues created after the RabbitMQ reset.
-
-Once the rogue services have been stopped, reset the RabbitMQ cluster again to
-clear the queues.
diff --git a/doc/source/operations/upgrading-openstack.rst b/doc/source/operations/upgrading-openstack.rst
index a3f511f83..ad5d4da11 100644
--- a/doc/source/operations/upgrading-openstack.rst
+++ b/doc/source/operations/upgrading-openstack.rst
@@ -35,6 +35,64 @@ Notable changes in the |current_release| Release
 
 There are many changes in the OpenStack |current_release| release described in
 the release notes for each project. Here are some notable ones.
 
+RabbitMQ SLURP upgrade
+----------------------
+
+Because this is a SLURP upgrade, RabbitMQ must be upgraded manually from 3.11
+to 3.12, then to 3.13 on Antelope before the Caracal upgrade. This upgrade
+should not cause an API outage (though it should still be considered "at
+risk").
+
+There are two prerequisites:
+
+1. Kolla-Ansible should be upgraded to the latest version:
+
+   .. code-block:: bash
+
+      cd $KOLLA_SOURCE_PATH
+      git fetch && git pull
+      $KOLLA_VENV_PATH/bin/pip install .
+
+2. The RabbitMQ container image tag must be equal to or newer than
+   ``20240823T101942``. Check the timestamps in
+   ``etc/kayobe/kolla-image-tags.yml``.
+
+Once complete, upgrade RabbitMQ:
+
+.. code-block:: bash
+
+   kayobe overcloud service configuration generate --node-config-dir /tmp/ignore -kt none
+   kayobe kolla ansible run "rabbitmq-upgrade 3.12"
+   kayobe kolla ansible run "rabbitmq-upgrade 3.13"
+
+RabbitMQ quorum queues
+----------------------
+
+In Caracal, quorum queues are enabled by default for RabbitMQ. This differs
+from Antelope, which used HA queues. Before upgrading to Caracal, it is
+strongly recommended that you migrate from HA to quorum queues. The migration
+is automated using a script.
+
+.. warning::
+   This migration will stop all services using RabbitMQ and cause an
+   extended API outage while queues are migrated. It should only be
+   performed in a pre-agreed maintenance window.
+
+Set the following variables in your Kolla globals file (i.e.
+``$KAYOBE_CONFIG_PATH/kolla/globals.yml`` or
+``$KAYOBE_CONFIG_PATH/environments/$KAYOBE_ENVIRONMENT/kolla/globals.yml``):
+
+.. code-block:: yaml
+
+   om_enable_rabbitmq_high_availability: false
+   om_enable_rabbitmq_quorum_queues: true
+
+Then execute the migration script:
+
+.. code-block:: bash
+
+   $KAYOBE_CONFIG_PATH/../../tools/rabbitmq-quorum-migration.sh
+
 Heat disabled by default
 ------------------------
@@ -54,6 +112,24 @@ using Heat, and disable the service.
 
 TODO: guide for disabling Heat
 
+Designate sink disabled by default
+----------------------------------
+
+Designate sink is an optional Designate service which listens for event
+notifications, primarily from Nova and Neutron. It is disabled by default
+(when Designate is enabled) in Caracal, and it is not required for Designate
+to function.
+
+If you still wish to use it, you should set the flag manually:
+
+.. code-block:: yaml
+   :caption: ``kolla/globals.yml``
+
+   designate_enable_notifications_sink: true
+
+If you are using Designate and do not make this change, the Antelope
+``designate-sink`` container will remain on the controllers after the upgrade.
+It must be removed manually.
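+
+The example below is a rough sketch of one way to do this with a Kayobe host
+command. It assumes the leftover Kolla container is named ``designate_sink``
+on the controllers, so check the name with ``docker ps -a`` first:
+
+.. code-block:: bash
+
+   # List any leftover designate-sink containers on the controllers.
+   kayobe overcloud host command run --limit controllers --show-output --command "docker ps -a --filter name=designate_sink"
+
+   # Remove the container (requires root, hence -b/--become).
+   kayobe overcloud host command run --limit controllers -b --command "docker rm -f designate_sink"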
+
 Grafana Volume
 --------------
 The Grafana container volume is no longer used. If you wish to automatically
@@ -85,7 +161,16 @@ configuration must change the names of those files in
 
 Known issues
 ============
 
-* None!
+* OVN breaks on Rocky 9 deployments where hostnames are FQDNs.
+  Before upgrading, you must make sure no compute or controller nodes have any
+  ``.`` characters in their hostnames. Run the command below to check:
+
+  .. code-block:: bash
+
+     kayobe overcloud host command run --command "grep -v \'\.\' /etc/hostname" --show-output
+
+  There is currently no known fix for this issue aside from reprovisioning. A
+  patch will be developed soon.
 
 Security baseline
 =================