## Backup script to set up on the master
File name: /usr/local/bin/kubernetes-backup-etcd.sh
Add this content:
#!/bin/bash
DIRBACKUP="/backup-etcd/snapshots/"
DEFAULTETCD="/backup-etcd/snapshots/default.etcd"
# Retention period in days
PRUNE=5
DEBUG=0
# Use the etcd v3 API
export ETCDCTL_API=3
# Create the backup directory if it does not exist yet
if [ ! -d "$DIRBACKUP" ]; then
    mkdir -p "$DIRBACKUP" || exit 1
fi
# Remove any data directory left behind by a previous restore
if [ -d "$DEFAULTETCD" ]; then
    rm -rf "$DEFAULTETCD" || exit 1
fi
# Backup
etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    --endpoints=https://127.0.0.1:2379 \
    snapshot save "$DIRBACKUP/etcd-snapshot-$(date +%Y-%m-%d_%H:%M:%S_%Z).db" > /dev/null
if [ "$?" != "0" ]; then
    # Do not prune old snapshots when the new one could not be taken
    echo "[ERR] - etcd snapshot failed"
    exit 1
fi
# Prune: only regular snapshot files older than $PRUNE days
FILESTODELETE=$(find "$DIRBACKUP" -type f -name 'etcd-snapshot-*' -mtime +"$PRUNE")
if [ "$FILESTODELETE" != "" ]; then
    if [ "$DEBUG" == "1" ]; then
        echo "-----------------"
        echo "Deleting obsolete files ($PRUNE days)"
        echo "-----------------"
    fi
    # Snapshot names contain no whitespace, so word splitting is safe here
    for F in $FILESTODELETE
    do
        if [ "$DEBUG" == "1" ]; then
            echo "Deleting $F"
        fi
        rm -f "$F"
        if [ "$?" != "0" ]; then
            echo "[ERR] - unable to delete the file $F"
            exit 1
        fi
    done
fi
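
The pruning step of the script can also be sketched as a small standalone function. This is only an illustration, assuming GNU find (for `-maxdepth` and `-delete`) and the `etcd-snapshot-*.db` naming used above; `prune_snapshots` is a hypothetical helper name, not something etcd or Kubernetes provides.

```shell
# Hypothetical helper (not part of the original script):
# delete snapshot files in DIR that find considers older than DAYS days.
prune_snapshots() {
    local dir="$1" days="$2"
    # -type f skips directories such as default.etcd;
    # -print lists each file just before -delete removes it.
    find "$dir" -maxdepth 1 -type f -name 'etcd-snapshot-*.db' \
        -mtime +"$days" -print -delete
}
```

For example, `prune_snapshots /backup-etcd/snapshots 5` would list and remove the snapshots that fall outside the five-day retention, leaving newer files untouched.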
## Job (cron task)
I run this script once a day (increase the frequency while you are testing):
vi /etc/cron.d/kubernetes-backup
# Backup etcd master
0 23 * * * root /usr/local/bin/kubernetes-backup-etcd.sh
## To restore
Before running the restore script, make sure the backup job cannot run during the restore (comment out its line in /etc/cron.d/kubernetes-backup).
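
Commenting the cron line out and back in can be scripted instead of done by hand. A minimal sketch, assuming GNU sed (`-i` without an argument) and the cron file shown earlier; `set_backup_cron` is a hypothetical helper name.

```shell
# Hypothetical helper: "off" comments out the backup job, "on" restores it.
# The second argument defaults to the cron file used in this document.
set_backup_cron() {
    local state="$1" file="${2:-/etc/cron.d/kubernetes-backup}"
    case "$state" in
        # Prefix the job line with '#' unless it is already commented
        off) sed -i -E 's|^([^#].*kubernetes-backup-etcd\.sh.*)$|#\1|' "$file" ;;
        # Strip the leading '#' from the job line only
        on)  sed -i -E 's|^#(.*kubernetes-backup-etcd\.sh.*)$|\1|' "$file" ;;
        *)   echo "usage: set_backup_cron on|off [file]" >&2; return 2 ;;
    esac
}
```

Run `set_backup_cron off` before the restore and `set_backup_cron on` once the cluster is healthy again. Ordinary comment lines in the file are not touched, because only lines containing the script name are matched.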
On the master server, run this script:
# This directory is where the snapshots are saved
# (see the backup script above: /usr/local/bin/kubernetes-backup-etcd.sh)
DIRBACKUP="/backup-etcd/snapshots/"
# Stop kubelet
systemctl stop kubelet
# Delete the current etcd data directory
rm -r /var/lib/etcd
cd "$DIRBACKUP" || exit 1
# Delete all files ending in .part (partial backups are not usable)
rm -f *.part
# Retrieve the most recent snapshot
FILETORESTORE=$(ls -t etcd-snapshot-*.db | head -1)
# Restore (double quotes, so that the variable is expanded)
ETCDCTL_API=3 etcdctl snapshot restore "./$FILETORESTORE"
mv default.etcd /var/lib/etcd
# Start kubelet
systemctl start kubelet
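
Selecting the most recent snapshot can also be done without parsing ls output. The following is a sketch only, assuming the snapshots follow the `etcd-snapshot-*.db` naming of the backup script; `latest_snapshot` is a hypothetical helper name.

```shell
# Hypothetical helper: print the newest complete snapshot in DIR.
# Returns non-zero when no snapshot is found, so the caller can abort
# before touching /var/lib/etcd.
latest_snapshot() {
    local dir="$1" newest="" f
    for f in "$dir"/etcd-snapshot-*.db; do
        # Skip the unexpanded pattern when nothing matches
        [ -f "$f" ] || continue
        # -nt compares modification times
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

A restore could then start with `FILETORESTORE=$(latest_snapshot "$DIRBACKUP") || exit 1`. Files ending in .part never match the pattern, so partial backups are ignored without being deleted first.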
Wait a few minutes, then test with, for example, the command: kubectl get nodes
The cluster is up again!
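
Rather than waiting a fixed time, the check can be retried until it succeeds. A minimal sketch; `wait_until` is a hypothetical helper, and the attempt count and delay below are arbitrary values chosen for illustration.

```shell
# Hypothetical helper: run a command up to ATTEMPTS times,
# sleeping DELAY seconds after each failure.
# Returns 0 as soon as the command succeeds, 1 if it never does.
wait_until() {
    local attempts="$1" delay="$2" i=0
    shift 2
    while [ "$i" -lt "$attempts" ]; do
        "$@" >/dev/null 2>&1 && return 0
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

For example, `wait_until 30 10 kubectl get nodes && kubectl get nodes` polls the API server every 10 seconds for up to about five minutes, then prints the node list once it responds.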