Commit 8a38e6f

add release notes and upgrade guide
1 parent 87ee0e1 commit 8a38e6f

1 file changed

+249
-0
lines changed

docs/xena-antelope-upgrade.md

Lines changed: 249 additions & 0 deletions
@@ -0,0 +1,249 @@
1+
# CHI-in-a-box Antelope/2023.1 Release
2+
3+
Welcome to the Antelope/2023.1 Release of CHI-in-a-Box! This release rolls up a lot of bug fixes, quality of life improvements, and some new features.
4+
5+
## New Features
6+
7+
* Support for python3.10 and Ubuntu 22.04 as a controller node host OS.
8+
* cc-ansible now uses git submodules to pin the version of kolla-ansible. This helps ensure that the dependencies and ansible roles are at a known working version, and makes it easier to manage backwards compatibility during changes.
9+
* Dev-in-a-box deployment allows both development and CI/CD workflows to exercise the baremetal codepaths by network booting VMs instead of real nodes. In particular, this allows testing of networking-generic-switch with vlan-based networks, not just flat networks as in upstream's tenks.
10+
* The Tempest test framework can now be installed via chi-in-a-box. This allows using Tempest to run end-to-end acceptance tests on a site after deployment.
11+
12+
13+
### Some things now unblocked and on the roadmap:
14+
15+
* networking-generic-switch: support for "port groups" to enable port bonding depending on switch operating system
16+
* Neutron and networking-generic-switch: support for vlan trunk ports with baremetal nodes
17+
18+
### Selected features from Kolla-Ansible
19+
20+
#### [2023.2 backports](https://docs.openstack.org/releasenotes/kolla-ansible/2023.2.html#):
21+
* (Replaced our fork) Adds Let's Encrypt TLS certificate service integration into the OpenStack deployment. This enables a trusted TLS certificate generation option for secure communication with OpenStack HAProxy instances. Setting `letsencrypt_email` along with `kolla_internal_fqdn` and/or `kolla_external_fqdn` is required. One container runs an Apache ACME client webserver and one runs Lego for certificate retrieval and renewal. The Lego container starts a cron job which attempts to renew certificates every 12 hours.
22+
* (Replaced our fork) Adds support for deploying the ironic-prometheus-exporter, ‘a Tool to expose hardware sensor data in the Prometheus format through an HTTP endpoint’. See https://opendev.org/openstack/ironic-prometheus-exporter for more details about the exporter.
23+
24+
#### [2023.1](https://docs.openstack.org/releasenotes/kolla-ansible/2023.1.html#)
25+
* Add skyline ansible role
26+
* Adds support for container state control through systemd in kolla_docker. Every container logs only to journald and has its own unit file in /etc/systemd/system named `kolla-<container name>-container.service`. Systemd control is implemented in the new file `ansible/module_utils/kolla_systemd_worker.py`.
27+
* If credentials are updated in passwords.yml kolla-ansible is now able to update these credentials in the keystone database and in the on disk config files.
28+
The changes to passwords.yml are applied once kolla-ansible -i INVENTORY reconfigure has been run.
29+
30+
#### [Zed](https://docs.openstack.org/releasenotes/kolla-ansible/zed.html#)
31+
* Enables configuring firewalld for external API services. Extracts the required services and checks the external port, then adds the ports to a firewalld zone. Assumes that firewalld has been installed and configured beforehand. The variable `disable_firewall` is disabled by default to preserve backwards compatibility, but it's good practice to have the system firewall configured.
32+
* (Replaced our fork) Adds the possibility of including custom alert notification templates with Prometheus Alertmanager.
33+
* (Replaced our fork) Adds variables to configure whether monitoring services should be exposed externally: `enable_grafana_external`, `enable_kibana_external`, `enable_prometheus_alertmanager_external`
34+
* Adds the prometheus_scrape_interval configuration option. The default is set to 60s. This configures the default scrape interval for all jobs.
35+
* Adds support for managing resource providers via [config files](https://docs.openstack.org/nova/latest/admin/managing-resource-providers.html).
36+
* Adds support for configuring a coordination backend for Ironic Inspector via the ironic_coordination_backend variable. Possible values are redis or etcd.
37+
* Networking Generic Switch: support for `SONiC` switch operating system
38+
39+
#### [Yoga](https://docs.openstack.org/releasenotes/kolla-ansible/yoga.html#)
40+
* Adds support for deploying OpenSearch and OpenSearch dashboards. These services directly replace ElasticSearch and Kibana which are now end-of-life. Support for sending logs to a remote ElasticSearch (or OpenSearch) cluster is maintained.
41+
* Horizon themes can now be customized at deploy-time, rather than at build-time.
42+
* Healthchecks added to ironic-neutron-agent service
43+
* Support for both PXE and iPXE enabled in Ironic at the same time.
44+
* Upgrades of Ironic will now wait for nodes in wait states to change their state. This is to improve the user experience by avoiding breaking processes being waited on. This can be disabled by setting `ironic_upgrade_skip_wait_check` to `yes`.
45+
46+
## Upgrade Notes
47+
48+
### Host OS:
49+
* Interface name changes during release-upgrade: Depending on the network card driver in use, the "stable interface name" may change unexpectedly after the upgrade, potentially breaking remote access to the controller node. To avoid this, ensure each interface in netplan has a `match` stanza, like the following:
50+
```
dataplane1:
  match:
    macaddress: 04:3f:72:ff:9b:33
    driver: mlx5_core
  set-name: dataplane1
```
57+
This will ensure that even if the “stable interface name” reported by udev changes, the interface will still be configured as expected. Both macaddress AND driver must be specified, as otherwise vlan interfaces may also match the macaddress.
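
To fill in the `macaddress` and `driver` fields, both values can be read from sysfs before the upgrade. The following is a minimal sketch, not part of chi-in-a-box: the function name `list_nic_match_fields` is hypothetical, and it assumes a Linux host where physical NICs expose a `device/driver` link under `/sys/class/net` while vlan and other virtual interfaces do not.

```shell
# Print "name mac driver" for every interface backed by a real device.
# The optional argument overrides the sysfs path (an assumption made purely
# so the function can be exercised against a fake directory tree).
list_nic_match_fields() {
    local sysfs_net="${1:-/sys/class/net}"
    local dev name mac driver
    for dev in "$sysfs_net"/*; do
        # Virtual interfaces (vlans, bridges, lo) have no device/driver link.
        [ -e "$dev/device/driver" ] || continue
        name="$(basename "$dev")"
        mac="$(cat "$dev/address")"
        driver="$(basename "$(readlink -f "$dev/device/driver")")"
        printf '%s %s %s\n' "$name" "$mac" "$driver"
    done
}

# Usage on the controller node:
#   list_nic_match_fields
```

Each output line maps directly onto one netplan entry: the second column goes into `macaddress` and the third into `driver`.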
58+
59+
### Nova:
60+
* nova hypervisors with uuid!=hypervisor_hostname
61+
* Nova-compute-ironic version on upgrade
62+
63+
### Ironic:
64+
65+
* Boot mode default changed from BIOS -> UEFI: To prevent breakage, please ensure all nodes have the `boot_mode` capability set prior to upgrading.
66+
67+
Detection: list ironic nodes, excluding any that already have the `boot_mode` capability set:
68+
```
baremetal node list --fields name properties -f value | grep -v "boot_mode"
# P3-CPU-017 {'capabilities': 'cpu_arch': 'x86_64', 'vendor': 'dell inc'}
```
72+
Fix: while still on the Xena release, for each node identified above, execute:
73+
```
openstack baremetal node set --property capabilities="boot_mode:bios" $node_name_or_uuid
```
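
Ironic stores capabilities as a single comma-separated `key:value` string, so setting the property to `boot_mode:bios` replaces any other capabilities already on the node (for example `cpu_arch`). A minimal sketch of a helper that appends `boot_mode` while keeping the existing entries is shown below. The function name `add_boot_mode` is hypothetical, and the `jq` path in the usage comment is an assumption about the JSON layout of `node show`.

```shell
# Return a capabilities string that contains boot_mode.
# $1: current capabilities string (may be empty)
# $2: boot mode to add if none is present (defaults to bios)
add_boot_mode() {
    local caps="$1" mode="${2:-bios}"
    case ",$caps," in
        *,boot_mode:*) printf '%s\n' "$caps" ;;            # already set, leave untouched
        ,,)            printf 'boot_mode:%s\n' "$mode" ;;   # no capabilities at all
        *)             printf '%s,boot_mode:%s\n' "$caps" "$mode" ;;
    esac
}

# Usage (commented out, since it needs a live cloud):
#   caps="$(openstack baremetal node show "$node" -f json | jq -r '.properties.capabilities // ""')"
#   openstack baremetal node set --property capabilities="$(add_boot_mode "$caps")" "$node"
```

The helper is idempotent, so it is safe to run over every node rather than only the ones found by the detection step.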
76+
77+
### Keystone:
78+
79+
* The `admin` keystone endpoint has been deprecated. Existing users of this endpoint should use the internal endpoint instead. To remove the now unused admin endpoints, run the following:
80+
```
openstack endpoint list --interface admin -f value | \
  awk '!/keystone/ {print $1}' | xargs openstack endpoint delete
```
84+
85+
* The default user role changed from `_member_` to `member`. The `_member_` role was a transitional role during prior keystone database migrations, and has been deprecated for several OpenStack releases. It has now been fully removed, which will break any consumers that depend on it. To fix this, those consumers should use the `member` role instead.
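
Because the admin endpoint cleanup above deletes without confirmation, it can help to preview the IDs first. The sketch below wraps the same `awk` filter in a function so the list can be inspected before anything is removed. The function name `non_keystone_endpoint_ids` is hypothetical, and the `-r` flag assumes GNU xargs.

```shell
# Print the IDs of endpoints that do not belong to keystone.
# Reads `openstack endpoint list --interface admin -f value` rows on stdin;
# the ID is the first column, and any row mentioning keystone is skipped.
non_keystone_endpoint_ids() {
    awk '!/keystone/ {print $1}'
}

# Dry run (commented out, since it needs a live cloud):
#   openstack endpoint list --interface admin -f value | non_keystone_endpoint_ids
# Delete once the list looks right; -r skips the delete when the list is empty:
#   openstack endpoint list --interface admin -f value | non_keystone_endpoint_ids \
#       | xargs -r openstack endpoint delete
```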
86+
87+
### Networking-Generic-Switch:
88+
* Dell OS10 switch port names: We are now targeting the upstream release of networking-generic-switch. For compatibility, every ironic port connected to a Dell OS10-based switch needs its switchport name changed from the format `1/2/3:4` to `ethernet 1/2/3:4`.
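
The rename itself is a simple prefix, and a small helper keeps it idempotent when a site has many ports to update. This is a sketch only: the function name `os10_port_name` is hypothetical, and the usage comment assumes the switchport name lives in the `port_id` key of the port's `local_link_connection`.

```shell
# Convert a Dell OS10 switchport name to the "ethernet ..." form.
# Names that already carry the prefix are returned unchanged.
os10_port_name() {
    local port="$1"
    case "$port" in
        ethernet\ *) printf '%s\n' "$port" ;;
        *)           printf 'ethernet %s\n' "$port" ;;
    esac
}

# Usage (commented out, since it needs a live cloud):
#   openstack baremetal port set \
#       --local-link-connection port_id="$(os10_port_name "$old_name")" "$port_uuid"
```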
89+
90+
### RabbitMQ
91+
92+
* RabbitMQ migration to durable queues
93+
* RabbitMQ feature flags on upgrade
94+
95+
### Letsencrypt/Haproxy
96+
97+
We have replaced our custom letsencrypt integration with the one [that was introduced upstream, including backports of features from 2023.2](https://github.com/ChameleonCloud/kolla-ansible/commit/fee63ec239eb42a80b196b5c5676120ac1ebd715).
98+
99+
All 3 of the following configuration lines in `defaults.yml` are necessary to enable the new feature.
100+
```
kolla_enable_tls_external: true
enable_letsencrypt: true
letsencrypt_email: <email_for_letsencrypt>
```
105+
106+
If Chameleon's customized variant was enabled before the upgrade, you will need to manually shut down and remove its containers:
107+
```
docker container stop letsencrypt_certbot
docker container stop letsencrypt_acme
docker container rm letsencrypt_certbot
docker container rm letsencrypt_acme
```
113+
After deploying the new version, you will instead see containers named `letsencrypt_lego` and `letsencrypt_webserver`.
114+
115+
116+
## Deprecation Notes
117+
118+
* `neutron_networks` configuration format
119+
* usage of multiple different ssh keypairs for baremetal switches
120+
121+
## Bug Fixes
122+
123+
* LetsEncrypt: The new deployment mechanism for LetsEncrypt uses ssh to copy certs into the haproxy container, as well as to access the admin socket. This improves reliability of the certificate reload mechanism, and no longer requires the letsencrypt container to run on the same host as haproxy.
124+
* Fluentd now parses logs from ironic-neutron-agent. These were missing from the match rules, preventing such logs from appearing in centralized logging.
125+
* heat_encryption_key length is now properly set to 32 characters
126+
127+
# Detailed Upgrade Procedure
128+
129+
## Prior to beginning upgrade
130+
131+
1. Set boot_mode capability for all ironic nodes
132+
* in xena: if boot mode not set, then set it to BIOS
133+
* `baremetal node list --fields properties -f json | jq '.[] | .Properties.capabilities'`
134+
* for each node, `baremetal node set --property capabilities="boot_mode:bios" $node_name_or_uuid`
135+
1. Set cpu_arch capability for all ironic nodes (minor, avoids a warning)
136+
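1. Ensure all ironic ports for Dell OS10 switches have the switchport renamed from (1/2/3:4) to (ethernet 1/2/3:4). Use `openstack baremetal port list --fields uuid node_uuid local_link_connection` to find the affected ports, then update each one with `openstack baremetal port set --local-link-connection`.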
137+
1. Fix any nova compute hosts with uuid!=hypervisor_hostname
138+
1. find hosts affected by the issue
139+
```
use nova;
select hypervisor_hostname,uuid from compute_nodes WHERE hypervisor_type='ironic' AND isnull(deleted_at) AND uuid!=hypervisor_hostname limit 100;
```
143+
1. For each host found with the issue, set reservation=disabled. The `hypervisor_hostname` is the ironic node uuid, and so it can be used for the lease.
144+
1. Once the host is not in an active reservation:
145+
1. delete it from blazar
146+
1. delete it from ironic
147+
1. run `openstack hardware sync` to have Doni re-create the node's entries, without the conflicts this time.
148+
149+
150+
151+
152+
153+
154+
155+
1. Ensure remote access to the management node:
156+
1. IPMI serial console can be accessed
157+
1. Root password works
158+
1. Root login works EVEN IF NETWORK IS DOWN (I’m looking at you, PAM)
159+
1. Ceph RGW: update configuration to reference:
160+
1. Keystone internal API endpoint, as the admin endpoint is deprecated
161+
1. Ceph service username and password, not keystone superadmin
162+
1. Update known problematic node firmware:
163+
1. For skylake nodes with firmware: “Intel(R) Ethernet 10G 4P X710/I350 rNDC - 24:6E:96:7E:1F:BE 18.0.17”: Update firmware to version 22.xx
164+
1. Set boot mode from bios -> uefi
165+
166+
## During Upgrade
167+
168+
### Prepare:
169+
170+
1. Verify above pre-tasks
171+
1. Make a backup of:
172+
1. Deploy host: chi-in-a-box and site-config directories
173+
1. Control node: move /etc/kolla directory out of the way
174+
```
mv /etc/kolla /etc/kolla.bak
```
177+
1. Use `./cc-ansible mariadb_backup` to create a full DB backup
178+
1. Copy DB backup from mariadb_backup container to somewhere else
179+
180+
### Execute:
181+
182+
#### On target control node:
183+
1. Get to the latest stable package versions and clean up old kernels:
184+
1. `apt-get update && apt-get dist-upgrade`
185+
1. Reboot, then `apt-get autoremove`
186+
1. ensure netplan interfaces have a `match` stanza
187+
1. `do-release-upgrade` to move from 20.04 -> 22.04
188+
1. Reboot
189+
1. Check on interface definitions and any apt sources that need updating
190+
1. Delete kolla virtualenv from `/etc/ansible/venv`
191+
1. shut down containers which have had their names changed:
192+
```
docker stop letsencrypt_certbot
docker stop letsencrypt_acme
docker stop ironic_pxe
docker stop ironic_ipxe
```
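
Before moving on to the deploy host, it may be worth confirming that none of the renamed containers are still running. The sketch below filters a list of container names for the four old names from the step above. The function name `stale_containers` is hypothetical, and the `-x` whole-line match with an alternation assumes GNU grep.

```shell
# Read container names on stdin, one per line, and print any that
# exactly match one of the old container names.
stale_containers() {
    # "|| true" keeps the exit status zero when nothing matches.
    grep -E -x 'letsencrypt_certbot|letsencrypt_acme|ironic_pxe|ironic_ipxe' || true
}

# Usage on the control node (docker ps lists running containers only):
#   docker ps --format '{{.Names}}' | stale_containers
```

Empty output means all four containers have been stopped.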
198+
199+
#### On deploy host:
200+
1. Delete chi-in-a-box venv from `chi-in-a-box/venv`
201+
1. Delete any tools venv from site-config
202+
1. Delete any venv from `/etc/ansible/venv`
203+
1. Update chi-in-a-box via:
204+
```
git checkout stable/2023.1
git pull
git submodule update --init
```
209+
1. Install new tools into venv via:
210+
1. `cc-ansible install_deps` # this doesn’t need a site-config
211+
1. Source the new venv:
212+
`source venv/bin/activate`
213+
1. Update site-config/inventory/hosts by:
214+
1. `mv inventory/hosts inventory/hosts.bak`
215+
1. `cp chi-in-a-box/site-config.example/inventory/hosts site-config/inventory/hosts`
216+
1. Manually apply any customizations to the new file (primarily where your controller and kvm compute nodes are listed in the first 5 entries)
217+
1. Update site-config/passwords.yml by:
218+
1. Decrypt the file: `ansible-vault … decrypt passwords.yml`
219+
1. `mv passwords.yml passwords.yml.bak`
220+
1. Update pwds:
221+
1. `kolla-mergepwds --old site-config/passwords.yml.bak --new site-config.example/passwords.yml --final site-config/passwords.yml`
222+
1. `kolla-genpwd -p site-config/passwords.yml`
223+
1. Encrypt the file: `ansible-vault … encrypt passwords.yml`
224+
1. Apply migrations if necessary:
225+
1. If using ssh keys with NGS, only one key will be used going forward (instead of one per switch)
226+
1. Ensure `neutron_ssh_key` is added to all NGS-managed switches, or set this keypair to one that has already been added
227+
1. Regenerate heat_auth_password by setting it to the empty string. This works around an issue where it may be incorrectly set to != 32 characters in length
228+
1. `cc-ansible bootstrap-servers`
229+
1. `cc-ansible prechecks`
230+
1. `cc-ansible pull`
231+
1. Run `cc-ansible upgrade` (this will fail at nova; this is expected)
232+
1. Manually edit nova compute service “version” in the db to 61, [the oldest allowed in the upgrade check](https://github.com/openstack/nova/commit/a1731927ccd17aeb634c4eed61dce16de16fa7b3#diff-c0b6a5928be3ac40200a2078b084341bb9187a12b1f959ad862e0038c9029193L233)
233+
```
USE nova;

UPDATE services
SET version=61
WHERE services.deleted=0
AND services.topic="compute"
AND services.version < 61;
```
242+
1. Rerun `cc-ansible deploy --tags nova` to create the service user
243+
1. Rerun `cc-ansible upgrade`; this should now pass nova and continue.
244+
1. Apply cleanups for keystone admin endpoints:
245+
1. `cc-ansible deploy --tags keystone`
246+
```
openstack endpoint list --interface admin -f value | \
  awk '!/keystone/ {print $1}' | xargs openstack endpoint delete
```
