Hardware Inventory Management¶

At its lowest level, hardware inventory is managed in the Bifrost service (see Accessing the Bifrost Service).

Reconfiguring Control Plane Hardware¶

If a server’s hardware or firmware configuration is changed, it should be re-inspected in Bifrost before it is redeployed into service. A single server can be reinspected like this (for a host named comp0):

kayobe# kayobe overcloud hardware inspect --limit comp0

Enrolling New Hypervisors¶

New hypervisors can be added to the Bifrost inventory by using its discovery capabilities. Assuming that new hypervisors have IPMI enabled and are configured to network boot on the provisioning network, the following commands will instruct them to PXE boot. The nodes will boot on the Ironic Python Agent kernel and ramdisk, which is configured to extract hardware information and send it to Bifrost. Note that IPMI credentials can be found in the encrypted file located at ${KAYOBE_CONFIG_PATH}/secrets.yml.

bifrost# ipmitool -I lanplus -U admin -H comp0-ipmi chassis bootdev pxe

If node is are off, power them on:

bifrost# ipmitool -I lanplus -U admin -H comp0-ipmi power on

If nodes is on, reset them:

bifrost# ipmitool -I lanplus -U admin -H comp0-ipmi power reset

Once node have booted and have completed introspection, they should be visible in Bifrost:

bifrost# baremetal node list --provision-state enroll
+--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name                  | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+
| da0c61af-b411-41b9-8909-df2509f2059b | comp0 | None          | power off   | enroll             | False       |
+--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+

After editing ${KAYOBE_CONFIG_PATH}/overcloud.yml to add these new hosts to the correct groups, import them in Kayobe’s inventory with:

kayobe# kayobe overcloud inventory discover

We can then provision and configure them:

kayobe# kayobe overcloud provision --limit comp0
kayobe# kayobe overcloud host configure --limit comp0 --kolla-limit comp0
kayobe# kayobe overcloud service deploy --limit comp0 --kolla-limit comp0

Replacing a Failing Hypervisor¶

To replace a failing hypervisor, proceed as follows:

Disable the hypervisor to avoid scheduling any new instance on it
Evacuate all instances
Set the node to maintenance mode in Bifrost
Physically fix or replace the node
It may be necessary to reinspect the node if hardware was changed (this will require deprovisioning and reprovisioning)
If the node was replaced or reprovisioned, follow Enrolling New Hypervisors

To deprovision an existing hypervisor, run:

kayobe# kayobe overcloud deprovision --limit comp0

Warning

Always use --limit with kayobe overcloud deprovision on a production system. Running this command without a limit will deprovision all overcloud hosts.

Evacuating all instances¶

admin# nova host-evacuate-live comp0

You should now check the status of all the instances that were running on that hypervisor. They should all show the status ACTIVE. This can be verified with:

admin# openstack server show <instance uuid>

Troubleshooting¶

Servers that have been shut down¶

If there are any instances that are SHUTOFF they won’t be migrated, but you can use nova host-servers-migrate for them once the live migration is finished.

Also if a VM does heavy memory access, it may take ages to migrate (Nova tries to incrementally increase the expected downtime, but is quite conservative). You can use nova live-migration-force-complete <instance_uuid> <migration_id> to trigger the final move.

You get the migration ID via nova server-migration-list <instance_uuid>.

For more details see: http://www.danplanet.com/blog/2016/03/03/evacuate-in-nova-one-command-to-confuse-us-all/

Flavors have changed¶

If the size of the flavors has changed, some instances will also fail to migrate as the process needs manual confirmation. You can do this with:

openstack # openstack server resize confirm <instance-uuid>

The symptom to look out for is that the server is showing a status of VERIFY RESIZE as shown in this snippet of openstack server show <instance-uuid>:

| status | VERIFY_RESIZE |

Set maintenance mode on a node in Bifrost¶

For example, to put comp0 into maintenance:

seed# docker exec -it bifrost_deploy /bin/bash
(bifrost-deploy)[root@seed bifrost-base]# OS_CLOUD=bifrost baremetal node maintenance set comp0

Unset maintenance mode on a node in Bifrost¶

For example, to take comp0 out of maintenance:

seed# docker exec -it bifrost_deploy /bin/bash
(bifrost-deploy)[root@seed bifrost-base]# OS_CLOUD=bifrost baremetal node maintenance unset comp0

Detect hardware differences with cardiff¶

Hardware information captured during the Ironic introspection process can be analysed to detect hardware differences, such as mismatches in firmware versions or missing storage devices. The cardiff tool can be used for this purpose. It was developed as part of the Python hardware package, but was removed from release 0.25. The mungetout utility can be used to convert Ironic introspection data into a format that can be fed to cardiff.

The following steps are used to install cardiff and mungetout:

kayobe# virtualenv ~/kayobe-env/venvs/cardiff
kayobe# source ~/kayobe-env/venvs/cardiff/bin/activate
kayobe# pip install -U pip
kayobe# pip install git+https://github.com/stackhpc/mungetout.git@feature/kayobe-introspection-save
kayobe# pip install 'hardware==0.24'

Extract introspection data from Bifrost with Kayobe. JSON files will be created into ${KAYOBE_CONFIG_PATH}/overcloud-introspection-data:

kayobe# source ~/kayobe-env/venvs/kayobe/bin/activate
kayobe# source ~/kayobe-env/src/kayobe-config/kayobe-env
kayobe# kayobe overcloud introspection data save

The cardiff utility can only work if the extra-hardware collector was used, which populates a data key in each node JSON file. Remove any that are missing this key:

kayobe# for file in ~/kayobe-env/src/kayobe-config/overcloud-introspection-data/*; do if [[ $(jq .data $file) == 'null' ]]; then rm $file; fi; done

Cardiff identifies each unique system by its serial number. However, some high-density multi-node systems may report the same serial number for multiple systems (this has been seen on Supermicro hardware). The following script will replace the serial number used by Cardiff by the node name captured by LLDP on the first network interface. If this node name is missing, it will append a short UUID string to the end of the serial number.

import json
import sys
import uuid

with open(sys.argv[1], "r+") as f:
    node = json.loads(f.read())

    serial = node["inventory"]["system_vendor"]["serial_number"]
    try:
        new_serial = node["all_interfaces"]["eth0"]["lldp_processed"]["switch_port_description"]
    except KeyError:
        new_serial = serial + "-" + str(uuid.uuid4())[:8]

    new_data = []
    for e in node["data"]:
        if e[0] == "system" and e[1] == "product" and e[2] == "serial":
            new_data.append(["system", "product", "serial", new_serial])
        else:
            new_data.append(e)
    node["data"] = new_data

    f.seek(0)
    f.write(json.dumps(node))
    f.truncate()

Apply this Python script on all generated JSON files:

kayobe# for file in ~/src/kayobe-config/overcloud-introspection-data/*; do python update-serial.py $file; done

Convert files into the format supported by cardiff:

source ~/kayobe-env/venvs/cardiff/bin/activate
mkdir -p ~/kayobe-env/cardiff-workspace
rm -rf ~/kayobe-env/cardiff-workspace/extra*
cd ~/kayobe-env/cardiff-workspace/
m2-extract ~/kayobe-env/src/kayobe-config/overcloud-introspection-data/*.json

Note

The m2-extract utility needs to work in an empty folder. Delete the extra-hardware, extra-hardware-filtered and extra-hardware-json folders before executing it again.

We are now ready to compare node hardware. The following command will compare all known nodes, which may include multiple generations of hardware. Replace *.eval by a stricter globbing expression or by a list of files to compare a smaller group.

hardware-cardiff -I ipmi -p 'extra-hardware/*.eval'

Since the output can be verbose, it is recommended to pipe it to a terminal pager or redirect it to a file. Cardiff will display groups of identical nodes based on various hardware characteristics, such as system model, BIOS version, CPU or network interface information, or benchmark results gathered by the extra-hardware collector during the initial introspection process.