.. include:: vars.rst

=========================
Operations and Monitoring
=========================

Access to Kibana
================

OpenStack control plane logs are aggregated from all servers by Fluentd and
stored in ElasticSearch. The control plane logs can be accessed from
ElasticSearch using Kibana, which is available at the following URL:
|kibana_url|

To log in, use the ``kibana`` user. The password is auto-generated by
Kolla-Ansible and can be extracted from the encrypted passwords file
(|kolla_passwords|):

.. code-block:: console
   :substitutions:

   kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml --vault-password-file |vault_password_file_path| | grep ^kibana

Access to Grafana
=================

Control plane metrics can be visualised in Grafana dashboards. Grafana can be
found at the following address: |grafana_url|

To log in, use the |grafana_username| user. The password is auto-generated by
Kolla-Ansible and can be extracted from the encrypted passwords file
(|kolla_passwords|):

.. code-block:: console
   :substitutions:

   kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml --vault-password-file |vault_password_file_path| | grep ^grafana_admin_password

.. _prometheus-alertmanager:

Access to Prometheus Alertmanager
=================================

Control plane alerts can be visualised and managed in Alertmanager, which can
be found at the following address: |alertmanager_url|

To log in, use the ``admin`` user. The password is auto-generated by
Kolla-Ansible and can be extracted from the encrypted passwords file
(|kolla_passwords|):

.. code-block:: console
   :substitutions:

   kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml --vault-password-file |vault_password_file_path| | grep ^prometheus_alertmanager_password

Migrating virtual machines
==========================

To see where all virtual machines are running on the hypervisors:

.. code-block:: console

   admin# openstack server list --all-projects --long

To move a virtual machine with shared storage or booted from volume from one
hypervisor to another, for example to |hypervisor_hostname|:

.. code-block:: console
   :substitutions:

   admin# openstack --os-compute-api-version 2.30 server migrate --live-migration --host |hypervisor_hostname| 6a35592c-5a7e-4da3-9ab9-6765345641cb

To move a virtual machine with local disks:

.. code-block:: console
   :substitutions:

   admin# openstack --os-compute-api-version 2.30 server migrate --live-migration --block-migration --host |hypervisor_hostname| 6a35592c-5a7e-4da3-9ab9-6765345641cb

OpenStack Reconfiguration
=========================

Disabling a Service
-------------------

Ansible is oriented towards adding or reconfiguring services, but removing a
service is handled less well, because of Ansible's imperative style.

To remove a service, it is disabled in Kayobe's Kolla config, which prevents
other services from communicating with it. For example, to disable
``cinder-backup``, edit ``${KAYOBE_CONFIG_PATH}/kolla.yml``:

.. code-block:: diff

   -enable_cinder_backup: true
   +enable_cinder_backup: false

Then, reconfigure Cinder services with Kayobe:

.. code-block:: console

   kayobe# kayobe overcloud service reconfigure --kolla-tags cinder

However, the service itself, no longer in Ansible's manifest of managed state,
must be manually stopped and prevented from restarting. On each controller:

.. code-block:: console

   kayobe# docker rm -f cinder_backup

Some services may store data in a dedicated Docker volume, which can be
removed with ``docker volume rm``.

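As a minimal sketch, assuming the disabled ``cinder_backup`` service left
behind a volume of the same name (check the actual volume name first), the
volume could be removed on each controller with:

.. code-block:: console

   kayobe# docker volume ls --filter name=cinder
   kayobe# docker volume rm cinder_backup
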
Installing TLS Certificates
---------------------------

|tls_setup|

To configure TLS for the first time, we write the contents of a PEM file to
the ``secrets.yml`` file as ``secrets_kolla_external_tls_cert``. Use a command
of this form:

.. code-block:: console
   :substitutions:

   kayobe# ansible-vault edit ${KAYOBE_CONFIG_PATH}/secrets.yml --vault-password-file=|vault_password_file_path|

Concatenate the contents of the certificate and key files to create
``secrets_kolla_external_tls_cert``. The certificates should be installed in
this order:

* TLS certificate for the |project_name| OpenStack endpoint
  |public_endpoint_fqdn|
* Any intermediate certificates
* The TLS certificate private key

In ``${KAYOBE_CONFIG_PATH}/kolla.yml``, set the following:

.. code-block:: yaml

   kolla_enable_tls_external: True
   kolla_external_tls_cert: "{{ secrets_kolla_external_tls_cert }}"

To apply the TLS configuration, we need to reconfigure all services, as
endpoint URLs need to be updated in Keystone:

.. code-block:: console

   kayobe# kayobe overcloud service reconfigure

Alternative Configuration
+++++++++++++++++++++++++

As an alternative to writing the certificates as a variable in
``secrets.yml``, it is also possible to write the same data to a file,
``etc/kayobe/kolla/certificates/haproxy.pem``. The file should be
vault-encrypted in the same manner as ``secrets.yml``. In this case, the
variable ``kolla_external_tls_cert`` does not need to be defined. See the
`Kolla-Ansible TLS guide `__ for further details.

Updating TLS Certificates
-------------------------

Check the expiry date on an installed TLS certificate from a host that can
reach the |project_name| OpenStack APIs:

.. code-block:: console
   :substitutions:

   openstack# openssl s_client -connect |public_endpoint_fqdn|:443 2> /dev/null | openssl x509 -noout -dates

.. note::

   Prometheus Blackbox monitoring can check certificates automatically and
   alert when expiry is approaching.

To update an existing certificate, for example when it has reached expiry,
change the value of ``secrets_kolla_external_tls_cert``, concatenating the new
certificate, intermediate certificates and private key in the same order as
above. Then run the following command:

.. code-block:: console

   kayobe# kayobe overcloud service reconfigure --kolla-tags haproxy

.. _taking-a-hypervisor-out-of-service:

Taking a Hypervisor out of Service
----------------------------------

To take a hypervisor out of Nova scheduling, for example
|hypervisor_hostname|:

.. code-block:: console
   :substitutions:

   admin# openstack compute service set --disable \
          |hypervisor_hostname| nova-compute

Running instances on the hypervisor will not be affected, but new instances
will not be deployed on it.

A reason for disabling a hypervisor can be documented with the
``--disable-reason`` flag:

.. code-block:: console
   :substitutions:

   admin# openstack compute service set --disable \
          --disable-reason "Broken drive" |hypervisor_hostname| nova-compute

Details about all hypervisors and the reasons they are disabled can be
displayed with:

.. code-block:: console

   admin# openstack compute service list --long

To enable a hypervisor again:

.. code-block:: console
   :substitutions:

   admin# openstack compute service set --enable \
          |hypervisor_hostname| nova-compute

Managing Space in the Docker Registry
-------------------------------------

If the Docker registry becomes full, this can prevent container updates and
(depending on the storage configuration of the seed host) could lead to other
problems with services provided by the seed host.

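Before removing anything, it can be useful to check how much space Docker
data (registry, images and volumes) is consuming on the seed host. As a rough
check, assuming Docker's default data directory of ``/var/lib/docker``:

.. code-block:: console

   seed# docker system df
   seed# sudo df -h /var/lib/docker
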
To remove container images from the Docker Registry, follow this process:

* Reconfigure the registry container to allow deleting images. This can be
  done in ``docker-registry.yml`` with Kayobe:

  .. code-block:: yaml

     docker_registry_env:
       REGISTRY_STORAGE_DELETE_ENABLED: "true"

* For the change to take effect, run:

  .. code-block:: console

     kayobe# kayobe seed host configure

* A helper script is useful, such as
  https://github.com/byrnedo/docker-reg-tool (this requires ``jq``). To delete
  all images with a specific tag, use:

  .. code-block:: console

     for repo in `./docker_reg_tool http://registry-ip:4000 list`; do
         ./docker_reg_tool http://registry-ip:4000 delete $repo $tag
     done

* Deleting the tag does not actually release the space. To free it up, run
  garbage collection:

  .. code-block:: console

     seed# docker exec docker_registry bin/registry garbage-collect /etc/docker/registry/config.yml

The seed host can also accrue a lot of data from building container images.
The images stored locally on the seed host can be seen using ``docker image
ls``. Old and redundant images can be identified from their names and tags,
and removed using ``docker image rm``.

Backup of the OpenStack Control Plane
=====================================

As the backup procedure is constantly changing, it is normally best to check
the upstream documentation for an up to date procedure. Here is a high level
overview of the key things you need to back up:

Controllers
-----------

* `Back up SQL databases `__
* `Back up configuration in /etc/kolla `__

Compute
-------

The compute nodes can largely be thought of as ephemeral, but you do need to
make sure you have migrated any instances and disabled the hypervisor before
decommissioning or making any disruptive configuration change.

Monitoring
----------

* `Back up InfluxDB `__
* `Back up ElasticSearch `__
* `Back up Prometheus `__

Seed
----

* `Back up bifrost `__

Ansible control host
--------------------

* Back up service VMs such as the seed VM

Control Plane Monitoring
========================

The control plane has been configured to collect logs centrally using the EFK
stack (Elasticsearch, Fluentd and Kibana).

Telemetry monitoring of the control plane is performed by Prometheus. Metrics
are collected by Prometheus exporters, which run either on all hosts (e.g. the
node exporter) or on specific hosts (e.g. controllers for the memcached
exporter, or monitoring hosts for the OpenStack exporter). These exporters are
scraped by the Prometheus server.

Configuring Prometheus Alerts
-----------------------------

Alerts are defined in code and stored in Kayobe configuration. See ``*.rules``
files in ``${KAYOBE_CONFIG_PATH}/kolla/config/prometheus`` as a model to add
custom rules.

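For illustration, a minimal custom rules file might look like the following.
The file name, group name, alert name and threshold are all hypothetical,
while ``node_load5`` is a standard node exporter metric; follow the structure
of the existing ``*.rules`` files in the same directory:

.. code-block:: yaml

   # Hypothetical example: ${KAYOBE_CONFIG_PATH}/kolla/config/prometheus/custom.rules
   groups:
     - name: custom
       rules:
         - alert: NodeHighLoad
           # Fire when the 5-minute load average stays above 20 for 15 minutes.
           expr: node_load5 > 20
           for: 15m
           labels:
             severity: warning
           annotations:
             summary: Sustained high load on a control plane host
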
Silencing Prometheus Alerts
---------------------------

Sometimes alerts must be silenced because the root cause cannot be resolved
right away, such as when hardware is faulty. For example, an unreachable
hypervisor will produce several alerts:

* ``InstanceDown`` from Node Exporter
* ``OpenStackServiceDown`` from the OpenStack exporter, which reports the
  status of the ``nova-compute`` agent on the host
* ``PrometheusTargetMissing`` from several Prometheus exporters

Rather than silencing each alert one by one for a specific host, a silence can
apply to multiple alerts using a reduced list of labels.
:ref:`Log into Alertmanager <prometheus-alertmanager>`, click on the
``Silence`` button next to an alert and adjust the matcher list to keep only
the ``instance=`` label.

Then, create another silence to match ``hostname=`` (this is required because,
for the OpenStack exporter, the instance is the host running the monitoring
service rather than the host being monitored).

.. note::

   After creating the silence, you may get redirected to a 404 page. This is a
   `known issue `__ when running several Alertmanager instances behind
   HAProxy.

Generating Alerts from Metrics
++++++++++++++++++++++++++++++

Alerts are defined in code and stored in Kayobe configuration. See ``*.rules``
files in ``${KAYOBE_CONFIG_PATH}/kolla/config/prometheus`` as a model to add
custom rules.

Control Plane Shutdown Procedure
================================

Overview
--------

* Verify the integrity of clustered components (RabbitMQ, Galera, Keepalived).
  They should all report a healthy status.
* Put nodes into maintenance mode in Bifrost to prevent them from
  automatically powering back on
* Shut down nodes one at a time gracefully using ``systemctl poweroff``

Controllers
-----------

If you are restarting the controllers, it is best to do this one controller at
a time to avoid the clustered components losing quorum.

Checking Galera state
+++++++++++++++++++++

On each controller perform the following:

.. code-block:: console
   :substitutions:

   [stack@|controller0_hostname| ~]$ docker exec -i mariadb mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"
   Variable_name              Value
   wsrep_local_state_comment  Synced

The password can be found using:

.. code-block:: console
   :substitutions:

   kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml \
     --vault-password-file |vault_password_file_path| | grep ^database

Checking RabbitMQ
+++++++++++++++++

RabbitMQ health is determined using the command ``rabbitmqctl cluster_status``:

.. code-block:: console
   :substitutions:

   [stack@|controller0_hostname| ~]$ docker exec rabbitmq rabbitmqctl cluster_status
   Cluster status of node rabbit@|controller0_hostname| ...
   [{nodes,[{disc,['rabbit@|controller0_hostname|','rabbit@|controller1_hostname|',
                   'rabbit@|controller2_hostname|']}]},
    {running_nodes,['rabbit@|controller1_hostname|','rabbit@|controller2_hostname|',
                    'rabbit@|controller0_hostname|']},
    {cluster_name,<<"rabbit@|controller2_hostname|">>},
    {partitions,[]},
    {alarms,[{'rabbit@|controller1_hostname|',[]},
             {'rabbit@|controller2_hostname|',[]},
             {'rabbit@|controller0_hostname|',[]}]}]

Checking Keepalived
+++++++++++++++++++

On each of the (for example) three controllers:

.. code-block:: console
   :substitutions:

   [stack@|controller0_hostname| ~]$ docker logs keepalived

Two instances should show:

.. code-block:: console

   VRRP_Instance(kolla_internal_vip_51) Entering BACKUP STATE

and the other:

.. code-block:: console

   VRRP_Instance(kolla_internal_vip_51) Entering MASTER STATE

Ansible Control Host
--------------------

The Ansible control host is not enrolled in Bifrost. This node may run
services such as the seed virtual machine which will need to be gracefully
powered down.

Compute
-------

If you are shutting down a single hypervisor, to avoid downtime to tenants it
is advisable to migrate all of the instances to another machine. See
:ref:`evacuating-all-instances`.

.. ifconfig:: deployment['ceph_managed']

   Ceph
   ----

   The following guide provides a good overview:
   https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/director_installation_and_usage/sect-rebooting-ceph

Shutting down the seed VM
-------------------------

.. code-block:: console
   :substitutions:

   kayobe# virsh shutdown |seed_name|

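To confirm that the VM has stopped, list the domains on the seed hypervisor
and check that |seed_name| is reported as ``shut off``:

.. code-block:: console

   kayobe# virsh list --all
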
.. _full-shutdown:

Full shutdown
-------------

In case a full shutdown of the system is required, we advise using the
following order:

* Perform a graceful shutdown of all virtual machine instances
* Shut down compute nodes
* Shut down the monitoring node
* Shut down network nodes (if separate from controllers)
* Shut down controllers
* Shut down Ceph nodes (if applicable)
* Shut down the seed VM
* Shut down the Ansible control host

Rebooting a node
----------------

Example: reboot all compute hosts apart from |hypervisor_hostname|:

.. code-block:: console
   :substitutions:

   kayobe# kayobe overcloud host command run --limit 'compute:!|hypervisor_hostname|' -b --command "shutdown -r"

References
----------

* https://galeracluster.com/library/training/tutorials/restarting-cluster.html

Control Plane Power on Procedure
================================

Overview
--------

* Remove the node from maintenance mode in Bifrost
* Bifrost should automatically power on the node via IPMI
* Check that all Docker containers are running
* Check Kibana for any messages with log level ERROR or equivalent

Controllers
-----------

If all of the servers were shut down at the same time, it is necessary to run
a script to recover the database once they have all started up. This can be
done with the following command:

.. code-block:: console

   kayobe# kayobe overcloud database recover

Ansible Control Host
--------------------

The Ansible control host is not enrolled in Bifrost and will have to be
powered on manually.

Seed VM
-------

The seed VM (and any other service VM) should start automatically when the
seed hypervisor is powered on. If it does not, it can be started with:

.. code-block:: console

   kayobe# virsh start seed-0

Full power on
-------------

Follow the steps in :ref:`full-shutdown` in reverse order.

Shutting Down / Restarting Monitoring Services
----------------------------------------------

Shutting down
+++++++++++++

Log into the monitoring host(s):

.. code-block:: console
   :substitutions:

   kayobe# ssh stack@|monitoring_host|

Stop all Docker containers:

.. code-block:: console
   :substitutions:

   |monitoring_host|# for i in `docker ps -q`; do docker stop $i; done

Shut down the node:

.. code-block:: console
   :substitutions:

   |monitoring_host|# sudo shutdown -h

Starting up
+++++++++++

The monitoring services containers will automatically start when the
monitoring node is powered back on.

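To verify this after the node has booted, list the running containers and
check that the expected monitoring services are present (``prometheus`` and
``grafana`` are given as examples; the exact container names depend on which
services are deployed):

.. code-block:: console
   :substitutions:

   |monitoring_host|# docker ps --filter name=prometheus
   |monitoring_host|# docker ps --filter name=grafana
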
Software Updates
================

Update Packages on Control Plane
--------------------------------

OS packages can be updated with:

.. code-block:: console
   :substitutions:

   kayobe# kayobe overcloud host package update --limit |hypervisor_hostname| --packages '*'
   kayobe# kayobe seed host package update --packages '*'

See https://docs.openstack.org/kayobe/latest/administration/overcloud.html#updating-packages

Minor Upgrades to OpenStack Services
------------------------------------

* Pull latest changes from the upstream stable branch to your own ``kolla``
  fork (if applicable)
* Update ``kolla_openstack_release`` in ``etc/kayobe/kolla.yml`` (unless using
  the default)
* Update tags for the images in ``etc/kayobe/kolla/globals.yml`` to use the
  new value of ``kolla_openstack_release``
* Rebuild container images
* Pull container images to overcloud hosts
* Run ``kayobe overcloud service upgrade``

For more information, see:
https://docs.openstack.org/kayobe/latest/upgrading.html

Troubleshooting
===============

Deploying to a Specific Hypervisor
----------------------------------

To test creating an instance on a specific hypervisor, *as an admin-level
user* you can specify the hypervisor name as part of an extended availability
zone description.

To see the list of hypervisor names:

.. code-block:: console

   admin# openstack hypervisor list

To boot an instance on a specific hypervisor, for example on
|hypervisor_hostname|:

.. code-block:: console
   :substitutions:

   admin# openstack server create --flavor |flavor_name| --network |network_name| --key-name <key-name> --image CentOS8.2 --availability-zone nova::|hypervisor_hostname| vm-name

Cleanup Procedures
==================

OpenStack services can sometimes fail to remove all resources correctly. This
is the case with Magnum, which fails to clean up users in its domain after
clusters are deleted. `A patch has been submitted to stable branches `__.
Until this fix becomes available, if Magnum is in use, administrators can
perform the following cleanup procedure regularly:

.. code-block:: console

   admin# for user in $(openstack user list --domain magnum -f value -c Name | grep -v magnum_trustee_domain_admin); do
              if openstack coe cluster list -c uuid -f value | grep -q $(echo $user | sed 's/_[0-9a-f]*$//'); then
                  echo "$user still in use, not deleting"
              else
                  openstack user delete --domain magnum $user
              fi
          done

Elasticsearch index retention
=============================

To enable and alter default rotation values for Elasticsearch Curator, edit
``${KAYOBE_CONFIG_PATH}/kolla/globals.yml``:

.. code-block:: yaml

   # Allow Elasticsearch Curator to apply a retention policy to logs
   enable_elasticsearch_curator: true

   # Duration after which an index is closed
   elasticsearch_curator_soft_retention_period_days: 90

   # Duration after which an index is deleted
   elasticsearch_curator_hard_retention_period_days: 180

Reconfigure Elasticsearch with the new values:

.. code-block:: console

   kayobe# kayobe overcloud service reconfigure --kolla-tags elasticsearch

For more information see the `upstream documentation `__.

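To check that the retention policy is being applied, the list of open and
closed indices can be inspected from the Elasticsearch API. This is a sketch
only: it assumes the API is reachable at the default port 9200, and
``<internal-vip>`` is a placeholder to replace with the address used in your
deployment. Closed indices are reported with status ``close``:

.. code-block:: console

   kayobe# curl -s 'http://<internal-vip>:9200/_cat/indices?h=status,index'
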