How-To
Troubleshooting DNS Resolution Failures on Remote-Agent (Linux)
Context

Playbook jobs run in either the $default pool or the $remote pool. Jobs in the $default pool run directly on our Swimlane-provided agent instance. Jobs in the $remote pool are executed by a remote agent installed on the customer’s internal Linux (Ubuntu or RHEL) VM.

Remote agents need reliable DNS resolution to maintain a heartbeat connection with the Turbine instance. If DNS fails to resolve the Turbine domain, the remote agent is disconnected and jobs in the $remote pool fail. This KB helps customers understand and troubleshoot DNS resolution failures on their remote agent servers.

Introduction

A DNS (Domain Name System) resolution failure occurs when your server cannot convert a domain name into its corresponding IP address. This typically happens because of a problem with your internal DNS configuration, network connectivity, or the upstream DNS server itself.

When the failure occurs on a remote agent server, the agent loses its connection to our instance, and any workflow scheduled to run in the $remote pool fails.

Symptoms

- Jobs fail in the $remote pool with connection errors.
- The remote agent appears disconnected in the UI.
- DNS resolution fails when tested with tools such as dig or ping.

What to Check When the Issue Happens

Basic DNS testing

Run the following to test DNS from your remote agent (replace the domains and IP addresses below as required):

    dig @127.0.0.53 xyz.com
    dig @1.1.1.1 example.com
    ping test-instance.com

Check the system DNS resolver

    systemctl status systemd-resolved
    journalctl -u systemd-resolved

Review the DNS configuration

Check /etc/resolv.conf to see which nameservers are configured.

Capture Logs During the Incident

Please collect the following output while the issue is occurring (replace example.com with the domain name of your Turbine instance):

    cat /etc/resolv.conf
    dig example.com
    ping example.com
    curl -v example.com
    systemctl status systemd-resolved
    journalctl -u systemd-resolved --since "10 minutes ago"

Post-Incident Checks

After the system starts working again, gather the same output for comparison (replace example.com with the domain name of your Turbine instance):

    dig example.com
    cat /etc/resolv.conf

You may also want to capture routing and network status:

    ip route
    ip addr
    nmcli device show

Monitor DNS Resolution Automatically

Set up a cron job that runs hourly and logs DNS query times.

Script (save as /usr/local/bin/dns_monitor.sh and replace example.com with the domain of your Turbine instance):

    #!/bin/bash
    # Log the DNS query time reported by the local resolver and a public resolver.
    TIMESTAMP=$(date "+%Y-%m-%d %H:%M:%S")
    LOG_FILE="/var/log/dns_monitor.log"
    # dig prints a ";; Query time: N msec" line in its statistics section.
    DIG_LOCAL=$(dig @127.0.0.53 example.com +stats | grep "Query time")
    DIG_PUBLIC=$(dig @1.1.1.1 example.com +stats | grep "Query time")
    echo "[$TIMESTAMP] Local DNS: $DIG_LOCAL" >> "$LOG_FILE"
    echo "[$TIMESTAMP] Public DNS: $DIG_PUBLIC" >> "$LOG_FILE"

Make the script executable:

    chmod +x /usr/local/bin/dns_monitor.sh

Add it to the crontab. Because the script writes to /var/log, the crontab owner needs write access to that file (for example, use root's crontab).

    crontab -e

Add this line to run the script at the start of every hour:

    0 * * * * /usr/local/bin/dns_monitor.sh

Additional Network Troubleshooting

If DNS appears fine but issues persist, try the following.

Check network interfaces:

    ip link show

Test external connectivity:

    curl -v https://example.com
    traceroute example.com

System logs:

    journalctl -xe
    dmesg | grep -i network

NetworkManager logs (if used):

    journalctl -u NetworkManager

Next Steps

If the issue recurs, please collect the logs above and share them when opening a support ticket. Your internal network team may also be able to help review DNS or firewall configurations.

If you need help analyzing the logs or confirming the root cause, feel free to reach out to our support team. We're happy to assist.
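
Once the hourly monitoring log has been collecting data for a while, it can help to scan it for slow or failed lookups before opening a ticket. The following is a minimal sketch, not part of the product: the function name `find_dns_problems`, the 200 ms default threshold, and the assumed log line format (`[timestamp] Local DNS: ;; Query time: N msec`, with nothing after the label when dig returned no query time) are all assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical helper: report slow or failed lookups in the DNS monitor log.
# Assumed log line format (adjust if your log differs):
#   [2024-01-01 10:00:00] Local DNS: ;; Query time: 12 msec
# A line with a "DNS:" label but no "Query time:" is treated as a failed lookup.

find_dns_problems() {
    local log_file="$1"
    local threshold_ms="${2:-200}"   # assumed default threshold in milliseconds

    awk -v limit="$threshold_ms" '
        /Query time:/ {
            # The number of milliseconds is the field right after "time:".
            for (i = 1; i < NF; i++) {
                if ($i == "time:" && ($(i + 1) + 0) > (limit + 0)) {
                    print "SLOW: " $0
                }
            }
            next
        }
        # Labelled line without a query time: dig produced no statistics.
        /DNS:/ { print "FAILED: " $0 }
    ' "$log_file"
}
```

For example, `find_dns_problems /var/log/dns_monitor.log 200` prints one `SLOW:` or `FAILED:` line per problem entry, assuming the log lives at that path. Comparing the timestamps of those lines with the times jobs failed in the $remote pool can help show whether the local resolver or the upstream DNS server was involved.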