Full shutdown document added #10


Draft: wants to merge 2 commits into master
2 changes: 1 addition & 1 deletion source/conf.py
@@ -18,7 +18,7 @@
# -- Project information -----------------------------------------------------

project = 'OpenStack Administration Guide'
copyright = '2020, StackHPC Ltd'
copyright = '2021, StackHPC Ltd'
author = 'StackHPC Ltd'


206 changes: 206 additions & 0 deletions source/full_shutdown.rst
@@ -0,0 +1,206 @@
.. include:: vars.rst

=======================
Full Shutdown Procedure
=======================

If a full shutdown of the system is required, we advise shutting down
components in the following order:

* Perform a graceful shutdown of all virtual machine instances
* Stop Ceph (if applicable)
Review comment (Member): This might be early for stopping Ceph, in case the OpenStack services are still using Ceph state (e.g. image uploads). Perhaps stop Ceph at the point where the Ceph nodes are shut down.

* Put all nodes into maintenance mode in Bifrost
* Shut down compute nodes

Review comment: nit: this lists shutting down different types of nodes separately, but the procedure only stops the services separately, then shuts down all nodes at once.

* Shut down monitoring node
* Shut down network nodes (if separate from controllers)
* Shut down controllers
* Shut down Ceph nodes (if applicable)
* Shut down seed VM
* Shut down Ansible control host

Review comment: This one isn't covered.

Review comment: We probably shouldn't make any assumptions about what or where this is. It may not be the seed hypervisor, which should also be called out explicitly.


Virtual Machines shutdown
-------------------------

Contact OpenStack users and ask them to stop their virtual machines
gracefully. If that is not possible, shut down the VMs using the
OpenStack CLI as the admin user:

.. code-block:: bash

for i in `openstack server list --all-projects -c ID -f value` ; \
do openstack server stop $i ; done

Review comment: Is this asynchronous? Should we check for success?
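
``openstack server stop`` is indeed asynchronous: it returns once the request is accepted, not when the guest has halted. One hedged way to check for success is to poll until no servers remain in ACTIVE state; the ``wait_until_empty`` helper below is purely illustrative and not part of any existing tooling:

.. code-block:: bash

    # Illustrative helper: poll until the given command prints no output,
    # sleeping between attempts and giving up after a timeout in seconds.
    wait_until_empty() {
        local timeout=$1; shift
        local elapsed=0
        while [ -n "$("$@")" ]; do
            [ "$elapsed" -ge "$timeout" ] && return 1
            sleep 5
            elapsed=$((elapsed + 5))
        done
    }

    # Example (assumes admin credentials are loaded):
    # wait_until_empty 600 openstack server list --all-projects --status ACTIVE -f value

Any instance still ACTIVE after the timeout can then be investigated or stopped forcibly.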



.. ifconfig:: deployment['ceph_managed']

Stop Ceph
---------
Procedure based on `Red Hat documentation <https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/administration_guide/understanding-process-management-for-ceph#powering-down-and-rebooting-a-red-hat-ceph-storage-cluster_admin>`__
Review comment (Member): If there's something equivalent in the community docs it would be better, but the closest I found was https://docs.ceph.com/en/latest/rados/operations/operating/ and it doesn't cover setting all the flags below.


- Stop the Ceph clients from using any Ceph resources (RBD, RADOS Gateway, CephFS)
- Check that the cluster is in a healthy state

.. code-block:: bash

Review comment: Does it need to be indented more to be part of the bullet?


ceph status
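
A healthy cluster reports ``HEALTH_OK``. As a sketch, the check can be made to fail loudly so that the shutdown is not continued on an unhealthy cluster (the error message is illustrative):

.. code-block:: bash

    # 'ceph health' prints HEALTH_OK, HEALTH_WARN or HEALTH_ERR.
    ceph health | grep -q '^HEALTH_OK' \
        || { echo "Cluster not healthy; investigate before shutdown" >&2; exit 1; }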

- Stop CephFS (if applicable)

Stop the CephFS cluster by reducing the number of ranks to 1, setting the ``cluster_down`` flag, and then failing the last rank.

Review comment: Again, indentation?


.. code-block:: bash

ceph fs set FS_NAME max_mds 1
ceph mds deactivate FS_NAME:1 # rank 2 of 2
ceph status # wait for rank 1 to finish stopping
ceph fs set FS_NAME cluster_down true
ceph mds fail FS_NAME:0

Setting the cluster_down flag prevents standbys from taking over the failed rank.

- Set the noout, norecover, norebalance, nobackfill, nodown and pause flags.

.. code-block:: bash

ceph osd set noout
ceph osd set norecover
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set nodown
ceph osd set pause
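
Equivalently, the six flags can be set in a loop, which makes it harder to miss one:

.. code-block:: bash

    for flag in noout norecover norebalance nobackfill nodown pause; do
        ceph osd set $flag
    done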

- Shut down the OSD nodes one by one:

.. code-block:: bash

systemctl stop ceph-osd.target

- Shut down the monitor/manager nodes one by one:

.. code-block:: bash

systemctl stop ceph.target

Set Bifrost maintenance mode
----------------------------

Set maintenance mode in Bifrost to prevent nodes from automatically
powering back on:

Review comment: Another option is to power off via Bifrost.


.. code-block:: bash

bifrost# for i in `openstack --os-cloud bifrost baremetal node list -c UUID -f value` ; \
do openstack --os-cloud bifrost baremetal node maintenance set --reason full-shutdown $i ; done

Shut down nodes
---------------

Shut down nodes one at a time gracefully using:

.. code-block:: bash

systemctl poweroff
Review comment (Member): There might be a serialised form of shutdown invocation using Kayobe's tools https://docs.openstack.org/kayobe/latest/administration/overcloud.html#running-commands - perhaps also with a small delay to the shutdown command so that it doesn't immediately chop off the Ansible connection.
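
The suggestion above might look as follows; this is a sketch only, and the exact subcommand and flags should be verified against the installed Kayobe version:

.. code-block:: bash

    # Schedule each host to power off in one minute, so Ansible's SSH
    # connection is not severed before the command returns.
    kayobe overcloud host command run --command "shutdown -h +1" --become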


Shut down the seed VM
---------------------

Shut down the seed VM on the Ansible control host gracefully using:

.. code-block:: bash
:substitutions:

ssh stack@|seed_name| sudo systemctl poweroff
virsh shutdown |seed_name|

.. _full-power-on:

=======================
Full Power on Procedure
=======================

Review comment: This needs to be a different heading style. Alternatively (preferably?) this section could go in another page called cold_start.rst.

Review comment (Member): Or change the page to be: "Shutdown and power on procedures".


* Start ansible control host and seed vm
* Remove nodes from maintenance mode in bifrost
* Recover MariaDB cluster
* Start Ceph (if applicable)
* Check that all docker containers are running

Review comment: nit: they haven't been started.

* Check Kibana for any messages with log level ERROR or equivalent
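
For the container check in the list above, one hedged approach is to look, on each host, for containers that are not running (an empty result means all containers are up):

.. code-block:: bash

    # List containers that exited or are stuck restarting.
    docker ps -a --filter status=exited --filter status=restarting \
        --format '{{.Names}}: {{.Status}}'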

Start Ansible Control Host
--------------------------

The Ansible control host is not enrolled in Bifrost and will have to be powered
on manually.

Start Seed VM
-------------

The seed VM (and any other service VM) should start automatically when the seed
hypervisor is powered on. If it does not, it can be started with:

.. code-block:: bash

virsh start seed-0

Unset Bifrost maintenance mode
------------------------------

Unsetting maintenance mode in Bifrost should automatically power on the nodes:

.. code-block:: bash

bifrost# for i in `openstack --os-cloud bifrost baremetal node list -c UUID -f value` ; \
do openstack --os-cloud bifrost baremetal node maintenance unset $i ; done
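
It is worth confirming that the nodes actually power back on; repeating the listing below until every node reports ``power on`` is a simple, if manual, check:

.. code-block:: bash

    bifrost# openstack --os-cloud bifrost baremetal node list -c Name -c "Power State"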

Recover MariaDB cluster
-----------------------

If all of the servers were shut down at the same time, it is necessary to run a
script to recover the database once they have all started up. This can be done
with the following command:

.. code-block:: bash

kayobe# kayobe overcloud database recover

Review comment: Wondering if it would be cleaner to stop the containers before shutdown, to avoid them starting up in a broken state.


.. ifconfig:: deployment['ceph_managed']

Start Ceph
----------
Procedure based on `Red Hat documentation <https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/administration_guide/understanding-process-management-for-ceph#powering-down-and-rebooting-a-red-hat-ceph-storage-cluster_admin>`__

- Start monitor/manager nodes:

.. code-block:: bash

systemctl start ceph.target

- Start the OSD nodes:

.. code-block:: bash

systemctl start ceph-osd.target

- Wait for all the nodes to come up

- Unset the noout, norecover, norebalance, nobackfill, nodown and pause flags

.. code-block:: bash

ceph osd unset noout
ceph osd unset norecover
ceph osd unset norebalance
ceph osd unset nobackfill
ceph osd unset nodown
ceph osd unset pause

- Start CephFS (if applicable)

The CephFS cluster must be brought back up by setting the ``cluster_down`` flag to false:

.. code-block:: bash

ceph fs set FS_NAME cluster_down false

- Verify the Ceph cluster status:

.. code-block:: bash

ceph status
1 change: 1 addition & 0 deletions source/index.rst
@@ -24,6 +24,7 @@ Contents
ceph_storage
managing_users_and_projects
operations_and_monitoring
full_shutdown
customising_deployment
gpus_in_openstack

16 changes: 2 additions & 14 deletions source/operations_and_monitoring.rst
@@ -502,22 +502,10 @@ Shutting down the seed VM
kayobe# ssh stack@|seed_name| sudo systemctl poweroff
kayobe# virsh shutdown |seed_name|

.. _full-shutdown:

Full shutdown
-------------

In case a full shutdown of the system is required, we advise to use the
following order:

* Perform a graceful shutdown of all virtual machine instances
* Shut down compute nodes
* Shut down monitoring node
* Shut down network nodes (if separate from controllers)
* Shut down controllers
* Shut down Ceph nodes (if applicable)
* Shut down seed VM
* Shut down Ansible control host
Follow the separate :doc:`full shutdown document <full_shutdown>`.

Rebooting a node
----------------
@@ -575,7 +563,7 @@ hypervisor is powered on. If it does not, it can be started with:
Full power on
-------------

Follow the order in :ref:`full-shutdown`, but in reverse order.
Follow the separate :ref:`full power on procedure <full-power-on>`.

Shutting Down / Restarting Monitoring Services
----------------------------------------------