Full shutdown document added #10
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,206 @@ | ||
.. include:: vars.rst | ||
|
||
======================= | ||
Full Shutdown Procedure | ||
======================= | ||
|
||
If a full shutdown of the system is required, we advise using the | ||
following order: | ||
|
||
* Perform a graceful shutdown of all virtual machine instances | ||
* Stop Ceph (if applicable) | ||
* Put all nodes into maintenance mode in Bifrost | ||
* Shut down compute nodes | ||
Review comment: nit: this lists shutting down different types of nodes separately, but the procedure only stops the services separately, then shuts down all nodes at once. |
||
* Shut down monitoring node | ||
* Shut down network nodes (if separate from controllers) | ||
* Shut down controllers | ||
* Shut down Ceph nodes (if applicable) | ||
* Shut down seed VM | ||
* Shut down Ansible control host | ||
Review comment: This one isn't covered. |
Review comment: We probably shouldn't make any assumptions about what or where this is. It may not be the seed hypervisor, which should also be called out explicitly. |
||
|
||
Virtual Machines shutdown | ||
------------------------- | ||
|
||
Contact OpenStack users to stop their virtual machines gracefully. | ||
If that is not possible, shut down the VMs using the OpenStack CLI as an admin user: | ||
|
||
.. code-block:: bash | ||
|
||
for i in `openstack server list --all-projects -c ID -f value` ; \ | ||
do openstack server stop $i ; done | ||
Review comment: Is this asynchronous? Should we check for success? |
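``openstack server stop`` returns once the stop request has been accepted, so the loop above can finish while instances are still running. One way to check for success is sketched below: request a stop for every instance, then poll until no instance reports ``ACTIVE``. The function names and the timeout and interval defaults are illustrative assumptions, not part of the original procedure.

```shell
#!/usr/bin/env bash
# Sketch only: function names and default timings are assumptions.

# Request a stop for every instance in every project.
stop_all_servers() {
    local id
    for id in $(openstack server list --all-projects -c ID -f value); do
        # A failure here may only mean the instance was already stopped,
        # so log it and carry on; the poll below is the real check.
        openstack server stop "$id" || echo "stop request failed for $id" >&2
    done
}

# Usage: wait_for_shutoff [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Returns 0 once no instance is ACTIVE, 1 if the timeout is reached.
# Instances in other states (for example PAUSED or ERROR) are not covered.
wait_for_shutoff() {
    local timeout=${1:-600} interval=${2:-10} waited=0 remaining
    while :; do
        remaining=$(openstack server list --all-projects --status ACTIVE -c ID -f value)
        [ -z "$remaining" ] && return 0
        if [ "$waited" -ge "$timeout" ]; then
            echo "instances still ACTIVE after ${timeout}s:" >&2
            echo "$remaining" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

It could be run as ``stop_all_servers && wait_for_shutoff 900 15``. A non-zero exit leaves the list of remaining instance IDs on stderr for manual follow-up.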
||
|
||
|
||
.. ifconfig:: deployment['ceph_managed'] | ||
|
||
Stop Ceph | ||
--------- | ||
Procedure based on `Red Hat documentation <https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/administration_guide/understanding-process-management-for-ceph#powering-down-and-rebooting-a-red-hat-ceph-storage-cluster_admin>`__ | ||
Review comment: If there's something equivalent in the community docs it would be better, but the closest I found was https://docs.ceph.com/en/latest/rados/operations/operating/ and it doesn't cover setting all the flags below. |
||
|
||
- Stop the Ceph clients from using any Ceph resources (RBD, RADOS Gateway, CephFS) | ||
- Check that the cluster is in a healthy state | ||
|
||
.. code-block:: bash | ||
Review comment: Does it need to be indented more to be part of the bullet? |
||
|
||
ceph status | ||
|
||
- Stop CephFS (if applicable) | ||
|
||
Stop the CephFS cluster by reducing the number of ranks to 1, setting the ``cluster_down`` flag, and then failing the last rank. | ||
Review comment: Again, indentation? |
||
|
||
.. code-block:: bash | ||
|
||
ceph fs set FS_NAME max_mds 1 | ||
ceph mds deactivate FS_NAME:1 # rank 2 of 2 | ||
ceph status # wait for rank 1 to finish stopping | ||
ceph fs set FS_NAME cluster_down true | ||
ceph mds fail FS_NAME:0 | ||
|
||
Setting the cluster_down flag prevents standbys from taking over the failed rank. | ||
|
||
- Set the noout, norecover, norebalance, nobackfill, nodown and pause flags. | ||
|
||
.. code-block:: bash | ||
|
||
ceph osd set noout | ||
ceph osd set norecover | ||
ceph osd set norebalance | ||
ceph osd set nobackfill | ||
ceph osd set nodown | ||
ceph osd set pause | ||
|
||
- Shut down the OSD nodes one by one: | ||
|
||
.. code-block:: bash | ||
|
||
systemctl stop ceph-osd.target | ||
|
||
- Shut down the monitor/manager nodes one by one: | ||
|
||
.. code-block:: bash | ||
|
||
systemctl stop ceph.target | ||
|
||
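The flag step above lends itself to a loop, and the same loop can clear the flags again when the cluster is powered back on. This is a minimal sketch, assuming it runs on a host where the ``ceph`` CLI has admin access. The function name ``ceph_osd_flags`` is illustrative and not part of the Red Hat procedure.

```shell
#!/usr/bin/env bash
# Flags named in the procedure, in the order they are set there.
CEPH_FLAGS="noout norecover norebalance nobackfill nodown pause"

# Usage: ceph_osd_flags set|unset
ceph_osd_flags() {
    local action=$1 flag
    case "$action" in
        set|unset) ;;
        *) echo "usage: ceph_osd_flags set|unset" >&2; return 2 ;;
    esac
    for flag in $CEPH_FLAGS; do
        # Stop at the first failure so a partially applied state is noticed.
        ceph osd "$action" "$flag" || return 1
    done
}
```

``ceph_osd_flags set`` would be used before stopping the OSDs and ``ceph_osd_flags unset`` after all nodes are back up.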
Set Bifrost maintenance mode | ||
---------------------------- | ||
|
||
Set maintenance mode in Bifrost to prevent nodes from automatically | ||
powering back on. | ||
Review comment: Other option is to power off via bifrost |
||
|
||
.. code-block:: bash | ||
|
||
|
||
bifrost# for i in `openstack --os-cloud bifrost baremetal node list -c UUID -f value` ; \ | ||
do openstack --os-cloud bifrost baremetal node maintenance set --reason full-shutdown $i ; done | ||
|
||
Shut down nodes | ||
--------------- | ||
|
||
Shut down nodes one at a time gracefully using: | ||
|
||
.. code-block:: bash | ||
|
||
systemctl poweroff | ||
Review comment: There might be serialised form of shutdown invocation using Kayobe's tools https://docs.openstack.org/kayobe/latest/administration/overcloud.html#running-commands - perhaps also with a small delay to the shutdown command so that it doesn't immediately chop off the ansible connection. |
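The reviewer's suggestion could look roughly like the sketch below, which uses ``kayobe overcloud host command run`` to schedule a delayed power-off on one inventory group at a time. The function name, the one-minute delay, and the group names in the usage note are assumptions and should be checked against the deployment's inventory.

```shell
#!/usr/bin/env bash
# Usage: shutdown_groups GROUP [GROUP ...]
# Schedules a power-off one minute ahead on each group in turn, so the
# Ansible connection can close before the host goes down.
shutdown_groups() {
    local group
    for group in "$@"; do
        kayobe overcloud host command run --become \
            --limit "$group" --command "shutdown -P +1" || return 1
    done
}
```

Following the order given at the top of this page, a call might be ``shutdown_groups compute monitoring network controllers storage``, assuming those group names exist in the Kayobe inventory.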
||
|
||
Shut down the seed VM | ||
--------------------- | ||
|
||
Gracefully shut down the seed VM on the Ansible control host using: | ||
|
||
.. code-block:: bash | ||
:substitutions: | ||
|
||
ssh stack@|seed_name| sudo systemctl poweroff | ||
virsh shutdown |seed_name| | ||
|
||
.. _full-power-on: | ||
|
||
Full Power on Procedure | ||
----------------------- | ||
Review comment: This needs to be a different heading style. Alternatively (preferably?) this section could go in another page called cold_start.rst. |
Review comment: Or change the page to be: "Shutdown and power on procedures" |
||
|
||
* Start the Ansible control host and seed VM | ||
* Remove nodes from maintenance mode in Bifrost | ||
* Recover MariaDB cluster | ||
* Start Ceph (if applicable) | ||
* Check that all docker containers are running | ||
Review comment: nit: they haven't been started |
||
* Check Kibana for any messages with log level ERROR or equivalent | ||
|
||
Start Ansible Control Host | ||
-------------------------- | ||
|
||
The Ansible control host is not enrolled in Bifrost and will have to be powered | ||
on manually. | ||
|
||
Start Seed VM | ||
------------- | ||
|
||
The seed VM (and any other service VM) should start automatically when the seed | ||
hypervisor is powered on. If it does not, it can be started with: | ||
|
||
.. code-block:: bash | ||
|
||
virsh start seed-0 | ||
|
||
Unset Bifrost maintenance mode | ||
------------------------------ | ||
|
||
Unsetting maintenance mode in Bifrost should automatically power on the nodes: | ||
|
||
.. code-block:: bash | ||
|
||
bifrost# for i in `openstack --os-cloud bifrost baremetal node list -c UUID -f value` ; \ | ||
do openstack --os-cloud bifrost baremetal node maintenance unset $i ; done | ||
|
||
Recover MariaDB cluster | ||
----------------------- | ||
|
||
If all of the servers were shut down at the same time, it is necessary to run a | ||
script to recover the database once they have all started up. This can be done | ||
with the following command: | ||
|
||
.. code-block:: bash | ||
|
||
kayobe# kayobe overcloud database recover | ||
Review comment: Wondering if it would be cleaner to stop the containers before shutdown, to avoid them starting up in a broken state. |
||
|
||
.. ifconfig:: deployment['ceph_managed'] | ||
|
||
Start Ceph | ||
---------- | ||
Procedure based on `Red Hat documentation <https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/administration_guide/understanding-process-management-for-ceph#powering-down-and-rebooting-a-red-hat-ceph-storage-cluster_admin>`__ | ||
|
||
- Start monitor/manager nodes: | ||
|
||
.. code-block:: bash | ||
|
||
systemctl start ceph.target | ||
|
||
- Start the OSD nodes: | ||
|
||
.. code-block:: bash | ||
|
||
systemctl start ceph-osd.target | ||
|
||
- Wait for all the nodes to come up | ||
|
||
- Unset the noout, norecover, norebalance, nobackfill, nodown and pause flags | ||
|
||
.. code-block:: bash | ||
|
||
ceph osd unset noout | ||
ceph osd unset norecover | ||
ceph osd unset norebalance | ||
ceph osd unset nobackfill | ||
ceph osd unset nodown | ||
ceph osd unset pause | ||
|
||
- Start CephFS (if applicable) | ||
|
||
The CephFS cluster must be brought back up by setting the ``cluster_down`` flag to ``false``: | ||
|
||
.. code-block:: bash | ||
|
||
ceph fs set FS_NAME cluster_down false | ||
|
||
- Verify Ceph cluster status | ||
|
||
.. code-block:: bash | ||
|
||
ceph status |
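After the flags are unset, the cluster may need some time to settle, so a single ``ceph status`` can catch it mid-recovery. The sketch below polls ``ceph health`` until it reports ``HEALTH_OK``. The function name and the timeout and interval defaults are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Usage: wait_for_ceph_health [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Returns 0 once `ceph health` reports HEALTH_OK, 1 on timeout.
wait_for_ceph_health() {
    local timeout=${1:-600} interval=${2:-10} waited=0 health
    while :; do
        health=$(ceph health 2>/dev/null)
        case "$health" in
            HEALTH_OK*) return 0 ;;
        esac
        if [ "$waited" -ge "$timeout" ]; then
            echo "cluster not healthy after ${timeout}s: ${health:-no response}" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

On timeout the last health string is printed to stderr, which is usually enough to tell a leftover flag from a missing OSD.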
Review comment: This might be early for stopping Ceph, in case the OpenStack services are still using Ceph state (e.g., image uploads). Perhaps stop Ceph at the point where the Ceph nodes are shut down.