Following a system upgrade of the file server (apt-get upgrade), the Glusterfs volumes are no longer available (one or more volumes per application installed in the Kubernetes cluster). Result: everything is down.
After analysis, it turns out that it wasn’t the system update that caused the failure, but an I/O problem on the server’s SD card.
All Glusterfs data is on an encrypted SSD disk and was therefore not affected by the server update.
Don’t panic, I had already tested this procedure a long time ago. This saves me from keeping a system backup of the file server.
After a crash, you can either restore your system backup and re-establish services one by one, or start from scratch. I always opt for the “from scratch” solution, taking care to periodically back up the configuration of services and to write installation automation right from the start. The file server was running the 64-bit Raspbian distribution, so it’s time to switch to the pure Debian version.
Consequently, I took a new SD card, on which I installed the latest Debian Bullseye for Raspberry Pi image, the “glusterfs” and “cryptsetup” packages.
New SD card
Installing the Debian Bullseye for Raspberry Pi image on the SD card
Mount the Glusterfs data disk (use “cryptsetup” to open it, since all data is encrypted)
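A minimal sketch of that step, assuming the SSD shows up as /dev/sda1, the LUKS mapper name is “glusterdata” and the bricks lived under /data/glusterfs (all three names are assumptions; adapt them to your setup):

```shell
# Open the LUKS-encrypted SSD (device and mapper names are assumptions)
cryptsetup open /dev/sda1 glusterdata

# Mount the decrypted device on the brick path the old server used
mkdir -p /data/glusterfs
mount /dev/mapper/glusterdata /data/glusterfs
```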
Retrieve the names of the volumes used by the services (you’ll want to keep a list of these)… Alternatively, browse the Kubernetes manifests of your applications; the volume name appears in the “Persistent Volumes” section.
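If the list was lost, the names can also be pulled from the cluster itself; a sketch, assuming kubectl can still reach the (pre-1.27) cluster, where PersistentVolumes created by the legacy in-tree Glusterfs driver carry the volume name in spec.glusterfs.path:

```shell
# Print each PersistentVolume and the Gluster volume it points at
kubectl get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.glusterfs.path}{"\n"}{end}'
```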
Recreate the volumes one by one from the data on disk, using the “force” parameter (run the following commands on the Gluster server itself):
Create the “/repair” directory as a temporary mount point for a Glusterfs volume.
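That temporary mount point is just an empty directory:

```shell
# Temporary mount point for inspecting each recreated volume
mkdir -p /repair
```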
Create the volume from existing data:
gluster v create [volume name] $(hostname):[brick path] force
The “force” option is important here; confirm with “y”. As the author notes, this subtlety doesn’t appear in any documentation:
The process for creating a new volume from the data of an existing set of bricks from an existing volume is not documented anywhere.
Starting the volume:
gluster v start [volume name]
Mounting the volume:
mount -t glusterfs $(hostname):/[volume name] /repair
Then, as the author indicates in the “recovery” procedure:
du -hc --max-depth=1
Unmounting the temporary volume:
umount /repair
Repeat this operation for every volume, restart the Kubernetes cluster, and the applications are back on their volumes.
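The per-volume sequence above can be sketched as a loop, assuming volumes.txt holds one volume name per line and the bricks sit under /data/glusterfs (both names are assumptions):

```shell
#!/bin/sh
# Recreate, start, inspect and unmount each Gluster volume in turn.
# volumes.txt and the /data/glusterfs brick path are assumptions.
while read -r vol; do
  # "force" is required when reusing bricks from a previous volume;
  # piping "y" answers the confirmation prompt
  echo y | gluster v create "$vol" "$(hostname):/data/glusterfs/$vol" force
  gluster v start "$vol"
  mount -t glusterfs "$(hostname):/$vol" /repair
  du -hc --max-depth=1 /repair   # walk the data, as in the recovery procedure
  umount /repair
done < volumes.txt
```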
Important: speaking of Glusterfs, Kubernetes support for this type of volume has been removed as of version 1.27… To avoid reinstalling everything: Kubernetes supports NFS volumes, and Glusterfs can export its volumes via NFS… Before migrating, I’ll have to verify, on the current version of my Kubernetes cluster, that all applications work with NFS volumes served by Glusterfs, and only then update the Kubernetes cluster.
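For that future migration, a Gluster volume can be exposed over NFS either through the legacy built-in gNFS server or through NFS-Ganesha; a sketch of the gNFS route (the volume name “myvolume” is hypothetical, and gNFS only speaks NFSv3):

```shell
# Re-enable the built-in NFS server on one volume
# (it is disabled by default in recent Gluster releases)
gluster volume set myvolume nfs.disable off

# Mount it over NFSv3 from a client, e.g. a Kubernetes node
mount -t nfs -o vers=3 $(hostname):/myvolume /mnt/test
```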