Troubleshooting a broken ZFS pool

Corruption happens. It sucks, but we have to deal with it. So what do we do with a broken ZFS pool that is causing a server to kernel panic on every boot?

 

1. Blacklist ZFS module on boot

The first step is to blacklist the ZFS module to stop it from loading automatically. To do this, we need to reboot the server, then watch it like a hawk so that we reach the GRUB menu in time to make an edit.

Most Linux distros have a simple and familiar GRUB menu with an 'Advanced options' or 'Rescue boot' entry - select this.

Now we want to hit the 'e' key to make real-time edits to the boot entry. Scroll down with the arrow keys until you find the line beginning with 'linux' (or 'kernel' on older GRUB versions), move the cursor to the end of that line (the END key helps), then add the following:

module_blacklist=zfs modprobe.blacklist=zfs

We don't know which Linux distro or kernel you're using, so adding both parameters covers the common cases. Next, we want to boot with those settings, usually by pressing 'CTRL+X' or 'F10'; the on-screen instructions will tell you which key combination to use.
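
For illustration, a finished 'linux' line might look something like the example below; the kernel version, root device and existing options are just placeholders and yours will differ:

linux /vmlinuz-6.1.0-18-amd64 root=/dev/sda2 ro quiet module_blacklist=zfs modprobe.blacklist=zfs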

With a little luck, your system should boot normally without loading ZFS and should be stable.
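
A quick way to confirm the module didn't sneak in anyway is to check the loaded module list; if the command below prints nothing, ZFS is not loaded:

lsmod | grep zfs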

 

2. Delete the ZFS zpool cache file

Sometimes, destroying ZFS's zpool cache file is enough to stop ZFS from loading the problematic pool and locking up your system. Once you're booted, simply remove or empty the file '/etc/zfs/zpool.cache'.
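
Either of the following will do; renaming keeps a backup copy around in case you want to inspect it later:

mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bak
rm -f /etc/zfs/zpool.cache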

To test if this has worked, simply reboot your system normally with the 'reboot' command.

If your system fails to boot properly and you see the same symptoms, you'll need to repeat step 1 and then go to step 3. Additionally, if your OS configuration continues to reference and load the problematic ZFS pool, you might need to download and boot into a rescue ISO image (e.g. SystemRescueCD) to perform the remedial commands.
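
Note that on some distros a copy of the zpool cache is also baked into the initramfs, so deleting the file alone may not be enough. Regenerating the initramfs is distro-specific; for example, assuming a Debian/Ubuntu or RHEL/Fedora style system respectively:

update-initramfs -u
dracut --force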

 

3. Use wipefs to get rid of broken zpool/data

If your pool is very badly corrupted or keeps locking up your system, chances are that even mounting it read-only with ZFS will not allow much data recovery. Nonetheless, you could try it if needed; we won't cover it in depth here.
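
For reference, a read-only import attempt generally looks like the line below, where 'tank' is a placeholder pool name; running 'zpool import' with no arguments will list the pools ZFS can see:

zpool import -o readonly=on -N tank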

Instead, we're going to completely destroy the data and start afresh. To do this, we need to know which disks or partitions the pool lives on. The lsblk command will show us a list of disks and partitions to help.
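
Asking lsblk for filesystem information makes the ZFS members easy to spot, since they show up with an FSTYPE of 'zfs_member':

lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT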

Once we've ascertained which disks or partitions our zpool is on, we can use wipefs to erase the filesystem signatures from each one, like so:

wipefs -af /dev/nvme0n1p3

In the above case, the pool was on partition 3 of the first NVMe drive, but your device names will vary. Make sure you do this for every disk or every partition of your broken pool.
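
Two extra tips here: wipefs has a dry-run mode that only prints what it would erase, and if your pool spanned several whole disks you can loop over them (the device names below are hypothetical; substitute your own):

wipefs -n /dev/nvme0n1p3
for dev in /dev/sdb /dev/sdc; do wipefs -af "$dev"; done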

 

Use this article as a generic guide. We accept no responsibility for any loss of data, crashes or corruption as a result of any commands in this article.

