So I’ve had my share of zfs raidz/mirror problems before, and now I’ve noticed yet another troublesome disk in my zfs setup.
This time I caught a minor sector error on a disk pretty early, and since I didn’t want to take any chances I decided to replace it at once. The disk is one of those super crappy WD Green disks anyway, which I’ve found REALLY shouldn’t be used in any raid or server setup (anything other than a desktop you don’t care about).
A bit wiser after last time, my nagios nrpe script picked this up:
(da134:ciss1:2:11:0): READ(6). CDB: 8 a 9c d3 1 0
(da134:ciss1:2:11:0): CAM status: CCB request completed with an error
(da134:ciss1:2:11:0): Retrying command
I’m not 100% sure that message really is a sector error, but I’m not taking any chances; I had a lot of trouble with this server the last time a disk died.
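For what it’s worth, the check itself is nothing fancy. A minimal sketch of that kind of nrpe check, assuming the kernel messages end up in /var/log/messages (the script name and paths here are placeholders, not my exact setup):
#!/bin/sh
# check_cam_errors.sh - simplified example: raise a nagios WARNING if the
# kernel has logged CAM errors in the current messages log.
COUNT=$(grep -c 'CAM status: CCB request completed with an error' /var/log/messages)
if [ "$COUNT" -gt 0 ]; then
    echo "WARNING - $COUNT CAM error(s) found in /var/log/messages"
    exit 1
fi
echo "OK - no CAM errors logged"
exit 0
Hooked into nrpe.cfg with something like command[check_cam_errors]=/usr/local/libexec/nagios/check_cam_errors.sh (again, the path is just an example).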
So I did:
root:~# zpool offline tank da134
root:~# zpool status
[...]
mirror-5 DEGRADED 0 0 0
da122 ONLINE 0 0 0
11771992511548113470 OFFLINE 0 0 0 was /dev/da134
[...]
root:~# zpool detach tank da134
root:~# zpool status
[...]
mirror-4 ONLINE 0 0 0
da98 ONLINE 0 0 0
da110 ONLINE 0 0 0
da122 ONLINE 0 0 0
[...]
root:~# halt -p
(replaced the drive and turned the server back on)
root:~# zpool attach tank da122 da134
root:~# zpool status
mirror-5 ONLINE 0 0 0
da122 ONLINE 0 0 0
da134 ONLINE 0 0 0 (resilvering)
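Side note: since the replacement disk came up with the same device name, I believe zpool replace could have done the swap in one step instead of the detach/attach dance; roughly like this (untested on this box, so take it as a sketch):
root:~# zpool offline tank da134
( swap the disk )
root:~# zpool replace tank da134
That keeps the mirror vdev in place the whole time while the new disk resilvers.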
I never blogged about what really happened with my previous raidz rebuild, which went south (to put it mildly). The problem then was that I was running raidz, which only tolerates a single disk failure, but it turned out I had several block read errors, so after 4 attempts to resilver/rebuild, zfs still wasn’t able to rebuild the fresh drive, simply because at least 2 other pretty rotten disks in the raid kept throwing new sector errors …
So I had to scrap the whole setup and set up a fresh zfs pool. I got my boss to buy some new disks, but it turned out that 4 of them (some Samsung disks) weren’t recognized by the hardware raid controller (?), sooooo I put in some WD Green disks there instead …
BUT since the pool is less than 25% full, I figured I could change the whole setup from raidz to a mirrored setup, that is 6 mirror pairs for a total of 12 disks, and on top of that I set copies=2 for the backup dataset. copies=2 doubles the space usage because every block is written to 2 blocks on the disk, but as long as I have plenty of space I should be better protected against corrupted sectors/blocks, bit rot and whatnot. 🙂
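Setting it is a one-liner; something like this (written from memory, so treat it as a sketch):
root:~# zfs set copies=2 tank/backup
And it shows up like this: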
root:~# zfs get copies tank/backup
NAME PROPERTY VALUE SOURCE
tank/backup copies 2 local
root:~#
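For the record, rebuilding the pool as mirror pairs is basically one zpool create with the disks listed two at a time, plus the log device; roughly like this (reconstructed from the layout below, so don’t quote me on the exact command):
root:~# zpool create tank \
    mirror da2 da14 mirror da26 da38 mirror da50 da62 \
    mirror da74 da86 mirror da98 da110 mirror da122 da134 \
    log da1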
And the mirrored zfs setup looks like this:
root:~# zpool status
pool: tank
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Thu Apr 25 14:20:24 2013
68.2G scanned out of 4.61T at 61.2M/s, 21h37m to go
13.8G resilvered, 1.44% done
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
da2 ONLINE 0 0 0
da14 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
da26 ONLINE 0 0 0
da38 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
da50 ONLINE 0 0 0
da62 ONLINE 0 0 0
mirror-3 ONLINE 0 0 0
da74 ONLINE 0 0 0
da86 ONLINE 0 0 0
mirror-4 ONLINE 0 0 0
da98 ONLINE 0 0 0
da110 ONLINE 0 0 0
mirror-5 ONLINE 0 0 0
da122 ONLINE 0 0 0
da134 ONLINE 0 0 0 (resilvering)
logs
da1 ONLINE 0 0 0
errors: No known data errors
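Once the resilver is done it doesn’t hurt to run a scrub to verify all the data end to end, and then check the result:
root:~# zpool scrub tank
root:~# zpool status tank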