Debugging Consul
Info
Consul is installed using upstream Helm charts.
The upstream documentation and bug reports can sometimes provide helpful troubleshooting steps for issues outside the scope of this documentation.
Warning
Consul cluster failures are most common during and after network interruptions or activities that cause HA cluster nodes to be created or destroyed, such as:
- Upgrading the HA cluster to a new Kubernetes version (which requires a new image).
- An auto-healing event replacing an HA cluster member that was determined to be in a down state.
Consul failures are generally confined to two failure modes:
Failing leader elections
This failure mode is characterised by an increase in leader election events, which may trigger Alertmanager alerts.
Additionally, log messages in the following format are generated by consul-server:
2023-04-19T14:39:46.896Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
2023-04-19T14:39:47.736Z [INFO] agent.server: New leader elected: payload=consul-server-1
2023-04-19T14:39:51.495Z [WARN] agent.server.raft: Election timeout reached, restarting election
2023-04-19T14:39:51.495Z [INFO] agent.server.raft: entering candidate state: node="Node at 172.16.189.31:8300 [Candidate]" term=598
View the consul-server logs using the following command:
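For example, to tail the logs of the server pods (a sketch; adjust the namespace and flags to match your deployment):
kubectl -n azimuth logs sts/consul-server --tail=100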
To remedy the errors, issue a rolling restart of the Consul cluster:
kubectl -n azimuth rollout restart daemonset/consul-client
kubectl -n azimuth rollout restart sts/consul-server
Check the consul-server logs for leader elections that fail:
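For example (a sketch, assuming the same StatefulSet name as above), election-related messages can be filtered with:
kubectl -n azimuth logs sts/consul-server | grep -i election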
Delete any pods that fail to become leader; in the example logs above, this is the pod whose raft agent repeatedly reports that the election timeout was reached and that it is restarting its election.
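As a sketch, replacing $PODNAME with the name of the affected consul-server pod identified from the logs:
kubectl -n azimuth delete pod $PODNAME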
Wait for all consul pods to become ready:
kubectl get -n azimuth po -l app=consul
NAME READY STATUS RESTARTS AGE
consul-server-0 1/1 Running 0 12m
consul-server-1 1/1 Running 0 11m
consul-server-2 1/1 Running 0 11m
consul-client-9x6k7 1/1 Running 0 13m
consul-client-9e87d 1/1 Running 0 12m
consul-client-f73c3 1/1 Running 0 12m
Once any misbehaving pods have been removed, consul-server should log no further [ERROR] messages related to failed leader elections, and Zenith Sync should begin to reconcile services, which can be monitored as described in the Debugging Zenith services section.
Clients registering with an existing name but new IP address
This failure mode is characterised by an increase in RPC errors reported by Consul, which may trigger Alertmanager alerts.
Additionally, log messages in the following format are generated by consul-server:
2023-08-04T09:04:45.998Z [ERROR] agent.client: RPC failed to server: method=Catalog.Register server=172.23.196.11:8300 error="rpc error making call: rpc error making call: failed inserting node: Error while renaming Node ID: \"deadbeef-dead-beef-dead-beefdeadbee4\": Node name azimuth-env1-md-0-9aacb97c-ggdnb is reserved by node deadbeef-0000-0000-0000-000000000000 with name azimuth-env1-md-0-9aacb97c-ggdnb (172.17.195.48)"
2023-08-04T09:04:45.998Z [WARN] agent: Syncing node info failed.: error="rpc error making call: rpc error making call: failed inserting node: Error while renaming Node ID: \"deadbeef-dead-beef-dead-beefdeadbee4\": Node name azimuth-env1-md-0-9aacb97c-ggdnb is reserved by node deadbeef-0000-0000-0000-000000000000 with name azimuth-env1-md-0-9aacb97c-ggdnb (172.17.195.48)"
2023-08-04T09:04:45.998Z [ERROR] agent.anti_entropy: failed to sync remote state: error="rpc error making call: rpc error making call: failed inserting node: Error while renaming Node ID: \"deadbeef-dead-beef-dead-beefdeadbee4\": Node name azimuth-env1-md-0-9aacb97c-ggdnb is reserved by node deadbeef-0000-0000-0000-000000000000 with name azimuth-env1-md-0-9aacb97c-ggdnb (172.17.195.48)"
View the consul-server logs using the following command:
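For example (a sketch; adjust to your deployment), the relevant errors can be filtered with:
kubectl -n azimuth logs sts/consul-server | grep "Error while renaming Node ID"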
To remedy the errors, remove the misbehaving client's registration from the Consul catalog, replacing $NODENAME in the following command with the node name from the log messages. The client will then reattempt registration.
kubectl -n azimuth exec consul-server-0 -- curl --request PUT --data '{"Node":"$NODENAME"}' -v http://localhost:8500/v1/catalog/deregister
For the example logs given above, the correct command would be:
kubectl -n azimuth exec consul-server-0 -- curl --request PUT --data '{"Node":"azimuth-env1-md-0-9aacb97c-ggdnb"}' -v http://localhost:8500/v1/catalog/deregister
This process may need to be repeated for each node whose name appears in the consul-server logs.
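To confirm that a node has been removed, its catalog entry can be queried directly (a sketch, reusing the node name from the example above; the request returns null once the registration is gone):
kubectl -n azimuth exec consul-server-0 -- curl -s http://localhost:8500/v1/catalog/node/azimuth-env1-md-0-9aacb97c-ggdnb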
After removing the misbehaving clients, consul-server should log no further [ERROR] messages related to client RPC calls, and Zenith Sync should begin to reconcile services, which can be monitored as described in the Debugging Zenith services section.