Troubleshooting a slow or failing hard disk can be difficult, especially if you have several disks in a RAID/ZFS array and you're not sure which one is causing the problem. Here are our top tools for identifying problematic disks.
1. SMART status
Step one is to check each disk's SMART status. Let's go ahead and install the smartmontools package, using apt on Debian/Ubuntu or yum on RHEL/CentOS:
apt install smartmontools
yum install smartmontools
Once it's installed, hold off on running self-tests for now: they take a long time, and we want to isolate the suspected faulty disk first. So let's just view all of the SMART information for each disk and see if there are any obviously failing values:
smartctl -a /dev/sda
Replace sda with your actual disk (run lsblk if you're not sure of the name), and repeat the status check for each disk in your system if you have more than one. If any values show as failing, it's best not to gamble with your data: just replace that disk outright. If you want to be certain first, run a long self-test against that disk:
smartctl -t long /dev/sda
Again, replace sda with your actual disk. This process can take a very long time: we've seen anywhere from 30 minutes to two days, so be patient. The test runs in the background on the drive itself, and you can check its status by running smartctl -a periodically.
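If you have a lot of disks, the checks above can be scripted. Here's a minimal sketch, not a definitive health check. It assumes ATA/SATA disks whose smartctl -A output uses the usual ten-column attribute table. The function names (smart_flag_attrs, selftest_remaining, check_all_disks) and the three attributes it watches are our own choices for illustration. NVMe and SAS drives report differently and aren't covered here.

```shell
# Read `smartctl -A` output on stdin and print any of three commonly
# watched attributes whose raw value (column 10) is above zero.
# Assumption: ATA attribute table layout; the attribute list is illustrative.
smart_flag_attrs() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ &&
        $10 + 0 > 0 { print $2, $10 }
    '
}

# Read `smartctl -c` output on stdin and print the percentage of a running
# self-test that remains. Prints nothing when no test is in progress.
selftest_remaining() {
    awk '/% of test remaining/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+%$/) { sub(/%/, "", $i); print $i; exit }
    }'
}

# Hypothetical helper: run the overall health check and the attribute scan
# against every whole disk that lsblk reports. Needs root.
check_all_disks() {
    for disk in $(lsblk -dn -o NAME,TYPE | awk '$2 == "disk" { print "/dev/" $1 }'); do
        echo "== $disk =="
        smartctl -H "$disk"
        smartctl -A "$disk" | smart_flag_attrs
    done
}
```

Run check_all_disks as root. If nothing is printed after a disk's overall health result, none of the three attributes has a non-zero raw value. While a long test is running, smartctl -c /dev/sda | selftest_remaining prints the percentage left, and smartctl -l selftest /dev/sda lists the results of completed tests.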
2. Sysstat package and iostat tool
Sometimes you'll see no SMART errors even though the disk's performance is clearly abnormal. This can mean the disk is close to failure but not quite there yet. The sysstat package includes a really handy tool called iostat, which allows us to see how many transactions per second are hitting each disk, how much read/write activity each disk is doing, and most importantly, the latency and queue size for each disk. If a disk is beginning to fail, more often than not it'll run slower and hotter than other disks of similar size and speed, so a higher latency value is a great identifier.
First let's install sysstat:
apt install sysstat
yum install sysstat
Now we'll use the iostat tool to check the latency:
iostat -x
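Run on its own like this, iostat reports averages since boot. For a realtime view, pass an interval in seconds (eg iostat -x 5): the first report still covers the time since boot, and each later report covers only the preceding interval. The columns to pay attention to are r_await and w_await, which are the average read and write latency in milliseconds respectively. %util is also worth watching: a disk that sits at a much higher %util than its peers in the same array is holding the array back, which is another telltale sign of failure.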
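To pick out the slow disks automatically, something like the following could work. It's a rough sketch: slow_devices is a hypothetical helper name, and the default cutoff of 20 ms is an assumption on our part, so tune it for your hardware. Column positions differ between sysstat versions, so the function looks up r_await and w_await by name in the header line instead of hard-coding them.

```shell
# Read `iostat -x` output on stdin and print each device whose read or
# write await exceeds a limit in milliseconds (first argument, default 20).
# Output format: device r_await w_await
slow_devices() {
    awk -v limit="${1:-20}" '
        # A blank line ends a report, so forget the column positions.
        NF == 0 { r = 0; w = 0; next }
        # Header line: find the await columns by name.
        $1 ~ /^Device/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "r_await") r = i
                if ($i == "w_await") w = i
            }
            next
        }
        # Device line: compare both await values with the limit.
        r && w && NF >= w && ($r + 0 > limit || $w + 0 > limit) {
            print $1, $r, $w
        }
    '
}
```

For example, iostat -x 5 3 | slow_devices 20 checks three reports taken five seconds apart. Bear in mind that the first report is the since-boot average, so a disk listed only there may not be slow right now.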
3. ioping utility
For a more detailed look into realtime latency, we can also use the ioping utility. First let's install it:
apt install ioping
yum install ioping
Now let's run it in "ping" latency mode against a particular disk:
ioping /dev/sda
By default this does a random 4 KiB read once per second and shows you the latency of each request. For idle hard disks, this should be under 20ms. For idle SSDs, this should be under 3ms. Busy disks and SSDs will naturally have higher latency values. If you have multiple disks in your system and you need to find the slowest one, run ioping against all your disks and record the average latency of each one.
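Recording the average for every disk by hand gets tedious, so here's one way it might be scripted. Treat it as a sketch under assumptions: ioping_avg and ioping_all_disks are hypothetical helper names, and the parsing assumes ioping's human-readable summary ends with a "min/avg/max/mdev = ..." line, which may differ between versions.

```shell
# Read ioping output on stdin and print the average latency, with its unit,
# taken from the "min/avg/max/mdev = a / b / c / d" summary line.
ioping_avg() {
    awk -F' = | / ' '/^min\/avg\/max\/mdev/ { print $3 }'
}

# Hypothetical helper: send a fixed number of requests (first argument,
# default 10) to every whole disk lsblk reports and print one line per
# disk with its average latency. Needs root to read raw devices.
ioping_all_disks() {
    for disk in $(lsblk -dn -o NAME,TYPE | awk '$2 == "disk" { print "/dev/" $1 }'); do
        printf '%s %s\n' "$disk" "$(ioping -c "${1:-10}" -q "$disk" | ioping_avg)"
    done
}
```

Running ioping_all_disks 20 sends 20 requests to each disk in turn. The -c option limits the request count and -q suppresses the per-request lines so that only the summary is parsed. Compare disks while the system is as idle as possible, otherwise the busiest disk may simply look like the slowest.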
We hope this guide has helped you identify a poorly performing or failing disk in your system!