Monitoring Script for Remote-Agent Host Health

This article provides a script that customers can use to collect system health metrics from the server running the Turbine remote agent containers. The goal is to capture key data points (DNS resolution, CPU, memory, I/O, network performance, etc.) at regular intervals so that, in the event of a container restart or connectivity issue, system conditions at the time can be reviewed. This is particularly useful in environments where connectivity to the Turbine Cloud instance may be impacted by DNS timeouts, network quality issues, or resource exhaustion.

This solution has been tested on commonly used Linux distributions, including:

- RHEL 7 / 8 / 9
- Ubuntu 18.04 / 20.04 / 22.04

What the Script Collects

- Running containers (`docker ps`)
- Container IPs
- Uptime
- Memory usage
- CPU load (top processes)
- Disk I/O (`iostat`)
- Disk usage
- DNS resolution
- Network response times using `curl`

Prerequisites

- Docker must be installed and running.
- Optional but recommended: `iostat` via the `sysstat` package.

Install the sysstat package, together with the DNS utilities package that provides `dig`:

```bash
# RHEL/CentOS
sudo yum install sysstat bind-utils -y

# RHEL 8/9 (dnf)
sudo dnf install sysstat bind-utils

# Ubuntu/Debian
sudo apt update && sudo apt install dnsutils sysstat -y
```

Installation Steps

1. Save the script below to `/usr/local/bin/monitor-remote-agent-health.sh`:

```bash
sudo vi /usr/local/bin/monitor-remote-agent-health.sh
```

2. Paste the following content into the file. Replace the `<region>.swimlane.app` placeholder with the appropriate Turbine Cloud domain; this default is used when no FQDN is passed as the first argument.

```bash
#!/bin/bash

# Target FQDN: first argument, or the default below if none is given.
fqdn=${1:-"<region>.swimlane.app"}
timestamp=$(date "+%Y-%m-%d %H:%M:%S")
logfile="/var/log/remote-agent-health.log"

{
  echo "=== [$timestamp] ==="

  echo "[docker ps]"
  docker ps --format "table {{.Names}}\t{{.Status}}\t{{.RunningFor}}" 2>&1

  echo "[uptime]"
  uptime

  echo "[free -m]"
  free -m

  echo "[top -bn1 -o %MEM | head -n 15]"
  top -bn1 -o %MEM | head -n 15

  echo "[Disk I/O Stats]"
  # Check for iostat explicitly: the exit status of a pipeline is that of
  # its last command, so a trailing "|| echo" would never run.
  if command -v iostat >/dev/null 2>&1; then
    iostat -x 1 1 | awk 'NR==1 || NR==2 || ($1 ~ /^Device/ || $1 ~ /^sd/ || $1 ~ /^nvme/)'
  else
    echo "iostat not available"
  fi

  echo "[Disk Usage]"
  df -h

  echo "[Network Speed Test]"
  curl -s -w "DNS Lookup: %{time_namelookup}s\nConnect: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n" -o /dev/null "https://$fqdn"

  echo "[DNS Resolution to $fqdn]"
  dig "$fqdn" +stats +tries=1 +time=1 | grep "Query time"

  echo ""
} >> "$logfile"
```

3. Make the script executable:

```bash
sudo chmod +x /usr/local/bin/monitor-remote-agent-health.sh
```

4. Set up the script to run via crontab every 2 minutes. The script writes to `/var/log` and calls `docker`, so use the crontab of root or of another user with the required permissions.

```bash
crontab -e
```

Add the following line (replacing `<region>.swimlane.app` with the correct Turbine Cloud region FQDN):

```
*/2 * * * * /usr/local/bin/monitor-remote-agent-health.sh <region>.swimlane.app
```

Output Location

Logs are saved to `/var/log/remote-agent-health.log`. Review this log when investigating container restarts or remote agent connectivity issues.

Understanding and Interpreting the Output

`[docker ps]`: Shows all currently running containers with status and uptime. Useful for checking whether containers are restarting or have unexpectedly stopped. `docker ps` lists only running containers (exited ones appear only with `docker ps -a`), so if a container is missing from the list or shows a freshly reset uptime, note the timestamp for correlation.

`[uptime]`: Shows system uptime and average CPU load over 1, 5, and 15 minutes. Load values should generally be less than the number of CPU cores. Example: on a 4-core system, a 15-minute load of 3.5 is acceptable, but 8+ may indicate CPU contention.

`[free -m]`: Displays memory usage in megabytes. Available memory should ideally not drop too low. Watch for high `used` and low `free` without much in `buff/cache`.

`[top -bn1 -o %MEM | head -n 15]`: Lists the top memory-consuming processes. Helpful for identifying any app or container unexpectedly consuming large amounts of RAM.

`[Disk I/O Stats]`: Shows simplified disk performance data using `iostat`.

| Field | Meaning | Healthy range |
| --- | --- | --- |
| r/s | Reads per second | Relative to workload |
| w/s | Writes per second | Relative to workload |
| r_await | Average wait time for reads (ms) | < 20 ms is ideal, < 100 ms is OK |
| w_await | Average wait time for writes (ms) | Same as above |
| %util | How busy the disk was | < 80% preferred, 100% = bottleneck |

Focus especially on r_await, w_await, and %util for signs of disk delay or saturation.

`[Disk Usage]`: Shows available disk space. Ensure the root (`/`) and `/var` partitions are not close to full; usage below 85% is generally safe.

`[Network Speed Test]`: Breaks down how long it takes to connect to your Turbine Cloud endpoint.

- DNS Lookup: time to resolve the hostname
- Connect: time to establish a TCP connection
- TTFB: time to first byte, which indicates backend response time
- Total: full round-trip request duration

curl measures each value from the start of the request, so later timers include the earlier ones. Use this section to track spikes in network or application latency.

`[DNS Resolution to $fqdn]`: Shows how long DNS resolution took using `dig`. Ideally this is under 100 ms; longer resolution times may indicate local DNS resolver problems.
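
When the log covers many hours of two-minute snapshots, it can help to filter for the entries where DNS resolution was slow. The following is a minimal sketch and is not part of the monitoring script: the function name `slow_dns_entries` and its optional threshold argument are hypothetical, and it assumes the log contains the `=== [timestamp] ===` snapshot headers and the `Query time` lines that `dig` prints.

```shell
#!/bin/bash

# Hypothetical helper (an illustration, not part of the monitoring script):
# print the timestamp of every snapshot whose dig "Query time" exceeded a
# threshold in milliseconds (default 100).
slow_dns_entries() {
  local logfile="$1" threshold_ms="${2:-100}"
  awk -v limit="$threshold_ms" '
    # Remember the timestamp from each "=== [ ... ] ===" snapshot header.
    /^=== \[.*\] ===$/ {
      ts = $0
      gsub(/^=== \[|\] ===$/, "", ts)
      next
    }
    # dig prints a line such as ";; Query time: 23 msec";
    # the number is the second-to-last field.
    tolower($0) ~ /query time:/ {
      ms = $(NF - 1) + 0
      if (ms > limit) printf "%s dns=%dms\n", ts, ms
    }
  ' "$logfile"
}
```

For example, `slow_dns_entries /var/log/remote-agent-health.log 100` would list the snapshots where resolution took longer than 100 ms, which can then be compared against the time of a container restart.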