Debugging Consul
Info
Consul is installed using upstream Helm charts.
The upstream documentation and bug reports can sometimes provide helpful troubleshooting steps for issues outside the scope of this documentation.
Warning
Consul cluster failures are most common during and after network interruptions or activities that cause HA cluster nodes to be created or destroyed, such as:
- Upgrading the HA cluster to a new Kubernetes version (which requires a new image).
- An auto-healing event replacing an HA cluster member that was determined to be in a down state.
Consul failures are generally confined to two failure modes:
Failing leader elections
This failure mode is characterised by an increase in leader election events, which may trigger Alertmanager alerts.
Additionally, log messages in the following format are generated by consul-server:
2023-04-19T14:39:46.896Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
2023-04-19T14:39:47.736Z [INFO] agent.server: New leader elected: payload=consul-server-1
2023-04-19T14:39:51.495Z [WARN] agent.server.raft: Election timeout reached, restarting election
2023-04-19T14:39:51.495Z [INFO] agent.server.raft: entering candidate state: node="Node at 172.16.189.31:8300 [Candidate]" term=598
View the consul-server logs using the following command:
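For example, to tail the logs of the server pods (a sketch; adjust the namespace and flags to match your deployment):
kubectl -n azimuth logs sts/consul-server --tail=100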
To remedy the errors, issue a rolling restart of the Consul cluster:
kubectl -n azimuth rollout restart daemonset/consul-client
kubectl -n azimuth rollout restart sts/consul-server
Check the consul-server logs for leader elections that fail:
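For example (a sketch, assuming the same StatefulSet name as above), election-related messages can be filtered with:
kubectl -n azimuth logs sts/consul-server | grep -i election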
Delete any pods that fail to become leader; in the example logs above, this is the pod whose raft agent repeatedly reports that the election timeout was reached and that it is restarting its election.
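As a sketch, replacing $PODNAME with the name of the affected consul-server pod identified from the logs:
kubectl -n azimuth delete pod $PODNAME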
Wait for all consul pods to become ready:
kubectl get -n azimuth po -l app=consul
NAME READY STATUS RESTARTS AGE
consul-server-0 1/1 Running 0 12m
consul-server-1 1/1 Running 0 11m
consul-server-2 1/1 Running 0 11m
consul-client-9x6k7 1/1 Running 0 13m
consul-client-9e87d 1/1 Running 0 12m
consul-client-f73c3 1/1 Running 0 12m
Once any misbehaving pods have been removed, consul-server should log no further [ERROR] messages related to failed leader elections, and Zenith Sync should begin to reconcile services, which can be monitored as described in the Debugging Zenith services section.
Clients registering with an existing name but new IP address
This failure mode is characterised by an increase in RPC errors reported by Consul, which may trigger Alertmanager alerts.
Additionally, log messages in the following format are generated by consul-server:
2023-08-04T09:04:45.998Z [ERROR] agent.client: RPC failed to server: method=Catalog.Register server=172.23.196.11:8300 error="rpc error making call: rpc error making call: failed inserting node: Error while renaming Node ID: \"deadbeef-dead-beef-dead-beefdeadbee4\": Node name azimuth-env1-md-0-9aacb97c-ggdnb is reserved by node deadbeef-0000-0000-0000-000000000000 with name azimuth-env1-md-0-9aacb97c-ggdnb (172.17.195.48)"
2023-08-04T09:04:45.998Z [WARN] agent: Syncing node info failed.: error="rpc error making call: rpc error making call: failed inserting node: Error while renaming Node ID: \"deadbeef-dead-beef-dead-beefdeadbee4\": Node name azimuth-env1-md-0-9aacb97c-ggdnb is reserved by node deadbeef-0000-0000-0000-000000000000 with name azimuth-env1-md-0-9aacb97c-ggdnb (172.17.195.48)"
2023-08-04T09:04:45.998Z [ERROR] agent.anti_entropy: failed to sync remote state: error="rpc error making call: rpc error making call: failed inserting node: Error while renaming Node ID: \"deadbeef-dead-beef-dead-beefdeadbee4\": Node name azimuth-env1-md-0-9aacb97c-ggdnb is reserved by node deadbeef-0000-0000-0000-000000000000 with name azimuth-env1-md-0-9aacb97c-ggdnb (172.17.195.48)"
View the consul-server logs using the following command:
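For example (a sketch; adjust to your deployment), the relevant errors can be filtered with:
kubectl -n azimuth logs sts/consul-server | grep "Error while renaming Node ID"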
To remedy the errors, remove the misbehaving client's registration from the Consul catalog, replacing $NODENAME in the following command with the node name from the log messages. The client will then reattempt registration.
kubectl -n azimuth exec consul-server-0 -- curl --request PUT --data '{"Node":"$NODENAME"}' -v http://localhost:8500/v1/catalog/deregister
For the example logs given above, the correct command would be:
kubectl -n azimuth exec consul-server-0 -- curl --request PUT --data '{"Node":"azimuth-env1-md-0-9aacb97c-ggdnb"}' -v http://localhost:8500/v1/catalog/deregister
This process may need to be repeated for each node whose name appears in the consul-server logs.
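To confirm that a node has been removed, its catalog entry can be queried directly (a sketch, reusing the node name from the example above; the request returns null once the registration is gone):
kubectl -n azimuth exec consul-server-0 -- curl -s http://localhost:8500/v1/catalog/node/azimuth-env1-md-0-9aacb97c-ggdnb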
After removing the misbehaving clients, consul-server should log no further [ERROR] messages related to client RPC calls, and Zenith Sync should begin to reconcile services, which can be monitored as described in the Debugging Zenith services section.