# Node Maintenance on a K8s Turbine Platform Installer (TPI) Deployment
At some point, you'll need to carry out maintenance on a node (e.g., a kernel upgrade, a security patch, an operating system upgrade, hardware maintenance, or taking a snapshot), which may require a single-node shutdown or reboot. It's critical that these events are handled gracefully in a Kubernetes (K8s) Turbine Platform Installer (TPI) environment. This procedure applies to both the TPI standalone and High Availability (HA) deployment types.

## Definitions

- **Node**: a virtual machine or physical server.
- **Cluster**: a group of interconnected nodes.

**Note:** The steps below are supported for maintenance on one node at a time. Cluster-wide shutdown is currently not supported.

## Steps

1. Pick a node to be taken down for maintenance:

```bash
kubectl get nodes
```

2. Perform a health check of critical services.

MongoDB:

```bash
# The namespace used is "default" in the case of an embedded cluster deployment.
export ns=<swimlane namespace>
# The mongo prefix is either "swimlane-sw-" for SPI or empty for TPI.
export mongo_prefix=
export mongo_admin_password=<define password>

# The JSON response includes a list-type key called "members"; note down the "name"
# and "stateStr" for each MongoDB server. There should be 3 members, with "stateStr"
# showing one "PRIMARY" and two "SECONDARY".
kubectl -n $ns exec ${mongo_prefix}mongo-0 -- mongosh -u admin -p $mongo_admin_password --authenticationDatabase admin --tls --tlsAllowInvalidCertificates --tlsAllowInvalidHostnames admin --eval="rs.status()"

# Check the replication lag, specifically for the secondaries; it should show a lag
# of a few seconds at most.
kubectl -n $ns exec ${mongo_prefix}mongo-0 -- mongosh -u admin -p $mongo_admin_password --authenticationDatabase admin --tls --tlsAllowInvalidCertificates --tlsAllowInvalidHostnames admin --eval="rs.printSecondaryReplicationInfo()"
```

PostgreSQL:

```bash
# Check the current state of the PostgreSQL cluster according to the replication
# manager, Patroni.
kubectl exec -n $ns postgresql-0 -- patronictl list

# The output should return a table with all PostgreSQL cluster members:
# - The TL column should show the same value across all rows, meaning each server is
#   on the correct timeline/epoch.
# - The Lag column should show 0 across all rows and no value for the leader, meaning
#   there is no observed replication lag (lag is otherwise shown in MB).
# - The State column should not show "stopped" in any row; it should be "streaming".
```

Elasticsearch:

```bash
export elastic_password=$(kubectl -n $ns get secret turbine-es-elastic-user -o jsonpath='{.data.elastic}' | base64 -d)

# Run an API call against the Elasticsearch service to check the cluster's status.
kubectl exec -n $ns turbine-es-default-0 -c elasticsearch -- curl -k "https://elastic:${elastic_password}@localhost:9200/_cluster/health?wait_for_status=green"

# This should return a JSON response:
# - The "status" key should say "green".
# - The "relocating_shards" key should say 0.
# See https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html#cluster-health-api-response-body
#
# Example JSON response:
# {"cluster_name":"turbine","status":"green","timed_out":false,"number_of_nodes":3,"number_of_data_nodes":3,"active_primary_shards":63,"active_shards":127,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
```

If any of these checks do not return a healthy result, retry after some time and note any changes. If any of the services indicate that they are not progressing toward a healthy state, reach out to Support. To avoid checking each response by hand, these checks can also be scripted, as sketched below.
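A small script can run all three checks and exit non-zero when something looks unhealthy. This is a minimal sketch, not a supported tool: the name `health_check.sh` and the exact pass/fail criteria are assumptions, and it expects the `ns`, `mongo_prefix`, and `mongo_admin_password` variables from step 2 to already be exported.

```bash
#!/usr/bin/env bash
# health_check.sh -- hypothetical wrapper around the manual checks in step 2.
# Expects ns, mongo_prefix, and mongo_admin_password to be exported already.
set -euo pipefail

mongo_eval() {
  kubectl -n "$ns" exec "${mongo_prefix:-}mongo-0" -- mongosh -u admin -p "$mongo_admin_password" \
    --authenticationDatabase admin --tls --tlsAllowInvalidCertificates \
    --tlsAllowInvalidHostnames --quiet admin --eval="$1"
}

# MongoDB: expect exactly one PRIMARY and two SECONDARY members.
primaries=$(mongo_eval "rs.status().members.filter(m => m.stateStr == 'PRIMARY').length")
secondaries=$(mongo_eval "rs.status().members.filter(m => m.stateStr == 'SECONDARY').length")
[ "$primaries" = "1" ] && [ "$secondaries" = "2" ] \
  || { echo "MongoDB replica set is degraded (${primaries} primary / ${secondaries} secondary)" >&2; exit 1; }
echo "MongoDB: OK"

# PostgreSQL: print the Patroni table for review; a "stopped" state or a
# non-zero Lag column needs attention before continuing.
kubectl exec -n "$ns" postgresql-0 -- patronictl list

# Elasticsearch: fail unless the cluster reports "green".
elastic_password=$(kubectl -n "$ns" get secret turbine-es-elastic-user \
  -o jsonpath='{.data.elastic}' | base64 -d)
kubectl exec -n "$ns" turbine-es-default-0 -c elasticsearch -- \
  curl -ks "https://elastic:${elastic_password}@localhost:9200/_cluster/health" \
  | grep -q '"status":"green"' \
  || { echo "Elasticsearch cluster is not green" >&2; exit 1; }
echo "Elasticsearch: OK"
```

The Patroni output is printed rather than parsed, since the table is easiest to verify by eye; extend the script if you want a hard check on the TL and Lag columns.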
3. To minimize unwarranted disruption to MongoDB, check whether the node selected for maintenance is hosting the MongoDB pod with the PRIMARY role. The following command returns the hostname of the node where that pod is scheduled:

```bash
# Returns which Kubernetes node is running the MongoDB primary role.
kubectl -o jsonpath='{.spec.nodeName}' get pod -n $ns $(kubectl -n $ns exec ${mongo_prefix}mongo-0 -- mongosh -u admin -p $mongo_admin_password --authenticationDatabase admin --tls --tlsAllowInvalidCertificates --tlsAllowInvalidHostnames admin --eval="rs.isMaster().primary" | tail -n 1 | cut -d"." -f1)
```

If the node is the same as the one picked for maintenance, run these commands to trigger a re-election and switch the primary role to a different node:

```bash
export mongo_primary=$(kubectl -n $ns exec ${mongo_prefix}mongo-0 -- mongosh -u admin -p $mongo_admin_password --authenticationDatabase admin --tls --tlsAllowInvalidCertificates --tlsAllowInvalidHostnames admin --eval="rs.isMaster().primary" | tail -n 1 | cut -d"." -f1)
kubectl -n $ns exec $mongo_primary -- mongosh -u admin -p $mongo_admin_password --authenticationDatabase admin --tls --tlsAllowInvalidCertificates --tlsAllowInvalidHostnames admin --eval="rs.stepDown()"
```

Run the first command again and confirm that the primary is no longer on the node currently picked to undergo maintenance:

```bash
kubectl -o jsonpath='{.spec.nodeName}' get pod -n $ns $(kubectl -n $ns exec ${mongo_prefix}mongo-0 -- mongosh -u admin -p $mongo_admin_password --authenticationDatabase admin --tls --tlsAllowInvalidCertificates --tlsAllowInvalidHostnames admin --eval="rs.isMaster().primary" | tail -n 1 | cut -d"." -f1)
```

4. As root, use the shutdown script to delete the pods on the node:

```bash
/opt/ekco/shutdown.sh
```

If the prior command hasn't completed, you may need to drain the node of all existing workloads:

```bash
kubectl drain <nodename> --ignore-daemonsets --delete-local-data --grace-period=600 --force
```

Ensure the drain command completes and returns successfully: it should finish error-free with an exit code of 0. If the command gets stuck or takes a long time to finish, press Ctrl+C and continue to the next step.

5. Perform maintenance on the affected node and/or reboot it.

6. If the node was not rebooted, run these commands on the node to bring it back into service:

```bash
/opt/ekco/startup.sh
kubectl uncordon <nodename>
```

7. Allow the workload to redistribute across all nodes in the cluster. Set the Swimlane namespace (for embedded clusters, the namespace is "default"):

```bash
export ns=<swimlane namespace>
kubectl rollout restart deployment -n $ns
```

Then check that all services are operational (you can also run through the procedure outlined in step 2):

```bash
# TPI admin console and Swimlane components
kubectl get pods
# Remaining cluster pods
kubectl get pods --all-namespaces
```

8. Log in to the TPI admin console and the Swimlane application to confirm they are back online. The status should display a green icon and a "Ready" message.

9. Repeat these steps on the next node that needs maintenance; a wrapper covering steps 4 through 7 is sketched below.
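If you maintain nodes regularly, steps 4 through 7 can be collected into one wrapper, as referenced in step 9. This is a hypothetical sketch, not a supported tool: the script name and argument handling are assumptions, it must run as root on the node under maintenance, it assumes the MongoDB primary check from step 3 has already been done, and it only covers the no-reboot case (a reboot would interrupt the script).

```bash
#!/usr/bin/env bash
# node_maintenance.sh -- hypothetical wrapper around steps 4-7; run as root on
# the node under maintenance. Only covers the case where the node is not rebooted.
# Usage: ./node_maintenance.sh <nodename> [swimlane-namespace]
set -euo pipefail

node="$1"
ns="${2:-default}"   # embedded cluster deployments use the "default" namespace

# Step 4: delete pods on the node, then drain any remaining workloads.
/opt/ekco/shutdown.sh
kubectl drain "$node" --ignore-daemonsets --delete-local-data --grace-period=600 --force

# Step 5: pause while the actual maintenance work is carried out.
read -rp "Perform maintenance on ${node}, then press Enter to bring it back into service... "

# Step 6: bring the node back into service.
/opt/ekco/startup.sh
kubectl uncordon "$node"

# Step 7: redistribute the workload and show pod status for review.
kubectl rollout restart deployment -n "$ns"
kubectl get pods --all-namespaces
```

The drain is run unconditionally here as a belt-and-braces measure; per step 4, it can hang, in which case the manual Ctrl+C guidance applies and a wrapper like this is less appropriate.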