Troubleshooting Velero Snapshot Backup and Restore Failures

in some cases, velero backups or restores fail without clear error messages use these steps to gather logs, diagnose issues, and understand the root cause prerequisite set the namespace for embedded cluster, the value should be "default" export ns=\<your namespace> 1\ collect velero debug logs to generate detailed logs for a specific backup or restore velero get backup velero debug backup \<backup name> velero get restore velero debug restore \<restore name> this creates a bundle \<date> tar gz file containing logs and metadata 2\ capture a support bundle gather the cluster’s state after failure kubectl support bundle interactive=false secret/${ns}/kotsadm swimlane platform supportbundle kubectl support bundle n $ns interactive=false https //raw\ githubusercontent com/replicatedhq/troubleshoot specs/main/host/default yaml 3\ look for out of memory (oom) events on each node dmesg t | grep i oom dmesg t | egrep i 'killed process' journalctl k | grep i 'killed process' from any node, check specific pods kubectl describe pod swimlane tools 0 n $ns | grep i oom kubectl get pod swimlane tools 0 n $ns o jsonpath="{ status containerstatuses\[ ] laststate terminated reason}" kubectl describe pod swimlane sw mongo 0 n $ns | grep i oom kubectl get pod swimlane sw mongo 0 n $ns o jsonpath="{ status containerstatuses\[ ] laststate terminated reason}" 4\ check node and container memory configuration on each node kubectl describe node \<node name> | grep i memory journalctl u kubelet | tail 100 > kubelet log the journalctl u kubelet command helps uncover kubelet level issues from any node, check resource limits for mongo and tools containers kubectl get sts swimlane sw mongo n $ns o jsonpath="{ spec template spec containers\[ ] resources}" && echo kubectl get sts swimlane tools n $ns o jsonpath="{ spec template spec containers\[ ] resources}" && echo 5\ check for large mongodb collections run inside mongo shell swimlane 10 x kubectl exec it n $ns swimlane sw mongo 0 mongosh u admin p authenticationdatabase admin tls tlsallowinvalidcertificates admin note for older versions, change "mongosh" to "mongo" turbine kubectl exec it n $ns mongo 0 mongosh u admin p authenticationdatabase admin tls tlsallowinvalidcertificates admin then identify top collections by data and index size using the provided javascript scripts below large mongodb collections let collections = \[]; db getmongo() getdbnames() foreach(function(dbname) { const currentdb = db getsiblingdb(dbname); currentdb getcollectioninfos() foreach(function(collinfo) { if (collinfo type === "collection" && !collinfo name startswith("system ")) { const stats = currentdb getcollection(collinfo name) stats(); collections push({ db dbname, collection collinfo name, sizebytes stats size, sizemb (stats size / (1024 1024)) tofixed(2), sizegb (stats size / (1024 1024 1024)) tofixed(2) }); } }); }); collections sort((a, b) => b sizebytes a sizebytes); print("top 10 largest collections by data size "); collections slice(0, 10) foreach(function(item, index) { print(`${index + 1} ${item db} ${item collection} ${item sizebytes} bytes | ${item sizemb} mb | ${item sizegb} gb`); }); mongodb index size let indexstats = \[]; db getmongo() getdbnames() foreach(function(dbname) { const currentdb = db getsiblingdb(dbname); currentdb getcollectioninfos() foreach(function(collinfo) { if (collinfo type === "collection" && !collinfo name startswith("system ")) { const stats = currentdb getcollection(collinfo name) stats(); indexstats push({ db dbname, collection collinfo name, indexbytes stats totalindexsize, indexmb (stats totalindexsize / (1024 1024)) tofixed(2), indexgb (stats totalindexsize / (1024 1024 1024)) tofixed(2) }); } }); }); indexstats sort((a, b) => b indexbytes a indexbytes); print("top 10 collections by index size "); indexstats slice(0, 10) foreach(function(item, index) { print(`${index + 1} ${item db} ${item collection} ${item indexbytes} bytes | ${item indexmb} mb | ${item indexgb} gb`); }); 6\ check disk space and dump folder content inside the swimlane tools pod kubectl exec n $ns $(kubectl get pod n $ns l app=swimlane tools o name) ls lh /dump on each node, check disk usage df h | head 20 7\ additional log collection to improve verbosity during troubleshooting use the log level argument the available values (in increasing verbosity) are error – logs only critical errors warn – logs warnings and errors info – default; logs general operational messages debug – logs detailed debug information trace – logs everything, including low level internal operations (very verbose) you can set this in the velero deployment like this kubectl edit deployment/velero n velero \# add argument log level debug under spec containers args spec containers \ name velero args \ server \ log level=debug this will provide deeper insight 8\ additional recommendations capture screenshots of error messages in the ui note the failure timestamp and backup or restore names for restore, ensure the backup being used is complete