So I’ve had some zfs raidz/mirror problems before, and once again I noticed a troublesome disk in my zfs setup.
This time I noticed a minor sector error on a disk pretty early, and since I didn’t want to take any chances I decided to replace it at once. The disk is one of those super crappy WD Green disks anyway, which I’ve found REALLY shouldn’t be used in any raid or server setup (anything other than a desktop you don’t care about).
A bit wiser after last time, my nagios nrpe script picked this up:
(da134:ciss1:2:11:0): READ(6). CDB: 8 a 9c d3 1 0
(da134:ciss1:2:11:0): CAM status: CCB request completed with an error
(da134:ciss1:2:11:0): Retrying command
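The nrpe check itself is nothing fancy; roughly something like this (the script name and log path here are just an example, not my exact setup):

#!/bin/sh
# Example nrpe check: grep the system log for CAM errors like the one above.
# Log path and match string are assumptions; adjust for your own box.
LOG=/var/log/messages

ERRORS=$(grep -c 'CAM status: CCB request completed with an error' "$LOG" 2>/dev/null)
ERRORS=${ERRORS:-0}

if [ "$ERRORS" -gt 0 ]; then
    echo "CRITICAL: $ERRORS CAM error(s) in $LOG"
    exit 2
fi
echo "OK: no CAM errors in $LOG"
exit 0

Hooked into nrpe.cfg with something along the lines of:

command[check_cam_errors]=/usr/local/libexec/nagios/check_cam_errors.sh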
Not 100% sure that message really is a sector error, but I’m not taking any chances; I had a lot of trouble with this server the last time a disk died.
So I did:
root:~# zpool offline tank da134
root:~# zpool status
[...]
  mirror-5                  DEGRADED     0     0     0
    da122                   ONLINE       0     0     0
    11771992511548113470    OFFLINE      0     0     0  was /dev/da134
[...]
root:~# zpool detach tank da134
root:~# zpool status
[...]
  mirror-4                  ONLINE       0     0     0
    da98                    ONLINE       0     0     0
    da110                   ONLINE       0     0     0
  da122                     ONLINE       0     0     0
[...]
root:~# halt -p
( replaced the drive and turned the server back on )
root:~# zpool attach tank da122 da134
root:~# zpool status
  mirror-5                  ONLINE       0     0     0
    da122                   ONLINE       0     0     0
    da134                   ONLINE       0     0     0  (resilvering)
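Side note: since the new disk came up with the same device name, I could probably have skipped the detach and just told zfs to replace the device in place. Untested on this particular box, but it would look something like:

root:~# zpool offline tank da134
( swap the drive )
root:~# zpool replace tank da134

zpool replace resilvers the new disk straight back into mirror-5 in one step.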
I never blogged about what really happened with my previous raidz rebuild, which went south (to put it mildly). The problem back then was that I was running raidz, which only survives a single disk failure, but it turned out I had block read errors on several disks, so after 4 attempts at resilvering/rebuilding, zfs still wasn’t able to rebuild onto the fresh drive, simply because there were at least 2 other pretty rotten disks in the raid that kept throwing new sector errors …
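(Tip for anyone in the same spot: zpool status -v lists exactly which files are affected by permanent errors, which makes it a lot easier to judge how rotten a pool really is.)

root:~# zpool status -v tank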
So I had to scrap the whole setup and build a fresh zfs pool. I got my boss to buy some new disks, but it turned out that 4 of them (some Samsung disks) weren’t recognized by the hardware raid controller (?), sooooo I put some WD Green disks in there …
BUT since the pool is less than 25% full, I figured I’d change the whole setup from raidz to a mirrored layout, that is 6 mirror pairs for a total of 12 disks, and on top of that I set copies=2 on the tank/backup filesystem. copies=2 doubles the space usage because every block is written to 2 separate blocks on disk, but as long as I have plenty of space I should be better protected against corrupted sectors/blocks, bit rot and whatnot. 🙂
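For reference, a layout like that boils down to roughly this (device names as in the status output further down; a sketch, not my exact command history):

root:~# zpool create tank \
    mirror da2 da14 mirror da26 da38 mirror da50 da62 \
    mirror da74 da86 mirror da98 da110 mirror da122 da134 \
    log da1
root:~# zfs create tank/backup
root:~# zfs set copies=2 tank/backup

Keep in mind that copies=2 only affects blocks written after the property is set, so it should be in place before the backups start flowing in.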
root:~# zfs get copies tank/backup
NAME         PROPERTY  VALUE  SOURCE
tank/backup  copies    2      local
root:~#
And the mirrored zfs setup now looks like this:
root:~# zpool status
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Apr 25 14:20:24 2013
        68.2G scanned out of 4.61T at 61.2M/s, 21h37m to go
        13.8G resilvered, 1.44% done
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            da2       ONLINE       0     0     0
            da14      ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            da26      ONLINE       0     0     0
            da38      ONLINE       0     0     0
          mirror-2    ONLINE       0     0     0
            da50      ONLINE       0     0     0
            da62      ONLINE       0     0     0
          mirror-3    ONLINE       0     0     0
            da74      ONLINE       0     0     0
            da86      ONLINE       0     0     0
          mirror-4    ONLINE       0     0     0
            da98      ONLINE       0     0     0
            da110     ONLINE       0     0     0
          mirror-5    ONLINE       0     0     0
            da122     ONLINE       0     0     0
            da134     ONLINE       0     0     0  (resilvering)
        logs
          da1         ONLINE       0     0     0

errors: No known data errors
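Since bad sectors are what started all of this, a regular zpool scrub is probably a good idea too (e.g. from a weekly cron job), so latent errors get caught, and with the mirrors plus copies=2 hopefully repaired, before the next disk starts dying:

root:~# zpool scrub tank
root:~# zpool status tank | grep scan    # keep an eye on progress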