FreeBSD

When you have to back up millions of files, use zfs if you can

I’ve been increasingly worried about one of the media servers I back up: the backup time has been crawling its way from 12 to about 14 hours as of today. It’s not that it’s that many terabytes, only 1.4 TB actually; the problem is that there are more than 2 000 000 small files, and rsync has to stat every one of them. So even when there hasn’t been much change, the backup takes longer and longer.
So yesterday I figured I should try out zfs. The media server is already on zfs, and I use zfs on the backup server. On top of that I’ve already got a zfs sync script, which I modified some months ago to use mbuffer for maximum bandwidth utilization.
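
The sync boils down to something like this (a minimal sketch, not my actual script; the host, dataset and snapshot names and the mbuffer tuning are placeholders):

# on the backup server: listen on a TCP port and feed the stream to zfs receive
mbuffer -s 128k -m 1G -I 9090 | zfs receive -F tank/backup/media

# on the media server: snapshot, then send only the increment since the last sync
zfs snapshot tank/media@today
zfs send -i tank/media@yesterday tank/media@today | mbuffer -s 128k -m 1G -O backupserver:9090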

And how did this turn out? Well, the backup time went from about 14 hours to 7 seconds … I just realised I could never move away from zfs for these kinds of setups (backing up millions of files).

lsof equivalent in FreeBSD

When I want to unmount a filesystem in Linux and the kernel says it can’t because the filesystem is busy, I usually do something like

# lsof | grep [mount point]

to figure out what files are open under the mount point.

In FreeBSD you may install lsof, or you can instead use the fstat command that comes bundled with the base system. For cases like the one mentioned above, fstat with the -f switch is really handy:

# zfs destroy tank/data
cannot unmount '/tank/data': Device busy
#
# fstat -f /tank/data
USER     CMD          PID   FD MOUNT      INUM MODE         SZ|DV R/W
www      java        1674    4 -        20595308 -rw-r--r--  19269164  r
www      java        1674    5 -        20595308 -rw-r--r--  19269164  r
# 
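
With the PID in hand, it’s just a matter of stopping the offender and retrying (1674 being the PID from the fstat output above):

# stop the process holding files open, then retry
kill 1674
zfs destroy tank/data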

building vlc for i386

Building vlc-2.0.5 with poudriere in FreeBSD kept failing for our i386 build.
I had numerous errors such as:

[...]
misc/cpu.c:129:1: error: unknown type name 'VLC_MMX_is_not_implemented_on_this_compiler'
VLC_MMX static void ThreeD_Now_test (void)
^
../include/vlc_cpu.h:46:19: note: expanded from macro 'VLC_MMX'
#  define VLC_MMX VLC_MMX_is_not_implemented_on_this_compiler
                  ^
misc/cpu.c:129:9: error: expected identifier or '('
VLC_MMX static void ThreeD_Now_test (void)
        ^
misc/cpu.c:230:35: error: use of undeclared identifier 'SSE_test'
        if (vlc_CPU_check ("SSE", SSE_test))
                                  ^
misc/cpu.c:239:56: error: use of undeclared identifier 'SSE2_test'
    if ((i_edx & 0x04000000) && vlc_CPU_check ("SSE2", SSE2_test))
                                                       ^
misc/cpu.c:246:56: error: use of undeclared identifier 'SSE3_test'
    if ((i_ecx & 0x00000001) && vlc_CPU_check ("SSE3", SSE3_test))
[...]

in the build log.

Looking at the port’s Makefile I noticed this:

# prefer clang on 9.1+
.if (${OSVERSION} >= 901000) && exists(${DESTDIR}/usr/bin/clang)
CC=     clang
CXX=    clang++
CPP=    clang-cpp
.else
.if ${ARCH} == "i386"
USE_GCC?=       4.6+ # sse/3dnow detection on i386 needs newer gcc
.endif
.endif

I’m only building packages for 9.1-RELEASE, so according to the above the build would use clang instead of gcc. clang is a pretty new compiler; it worked fine for the amd64 build, so I just hacked the Makefile so that any i386 build would use gcc 4.6+ instead:

.if ${ARCH} == "i386"
USE_GCC?=       4.6+ # sse/3dnow detection on i386 needs newer gcc
.else
.if (${OSVERSION} >= 901000) && exists(${DESTDIR}/usr/bin/clang)
CC=     clang
CXX=    clang++
CPP=    clang-cpp
.endif
.endif

And the build went through.
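
For reference, this is roughly how I re-test a single port after patching it (the jail and ports tree names below are placeholders for whatever your poudriere setup uses):

# rebuild just the vlc port in the i386 build jail
poudriere bulk -j 91i386 -p local multimedia/vlc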

I suspect the ports maintainer has fixed this in a more recent ports tree, but since I’ve frozen our ports tree at ‘sometime in January this year’ and I can’t / won’t update it yet, building with gcc is fine by me anyway.

zfs to the rescue

So I’ve had some zfs raidz/mirror problems before, and now I noticed yet another troublesome disk in my zfs setup.
This time I caught a minor sector error on a disk pretty early, and I didn’t want to take any chances, so I decided to replace it at once. The disk is one of those super crappy WD Green disks anyway, which I’ve found REALLY shouldn’t be used in any raid/server setup (anything other than a desktop you don’t care about).
A bit wiser from last time, this time my Nagios NRPE script picked this up:

(da134:ciss1:2:11:0): READ(6). CDB: 8 a 9c d3 1 0 
(da134:ciss1:2:11:0): CAM status: CCB request completed with an error
(da134:ciss1:2:11:0): Retrying command

I’m not 100% sure that message really is a sector error, but I’m not taking any chances; I had a lot of trouble with this server the last time a disk died.
So I did:

root:~# zpool offline tank da134
root:~# zpool status
[...]
	  mirror-5                DEGRADED     0     0     0
	    da122                 ONLINE       0     0     0
	    11771992511548113470  OFFLINE      0     0     0  was /dev/da134
[...]
root:~# zpool detach tank da134
root:~# zpool status
[...]
	  mirror-4  ONLINE       0     0     0
	    da98    ONLINE       0     0     0
	    da110   ONLINE       0     0     0
	  da122     ONLINE       0     0     0
[...]
root:~# halt -p
(replaced the drive and turned the server back on)
root:~# zpool attach tank da122 da134
root:~# zpool status
mirror-5  ONLINE       0     0     0
	    da122   ONLINE       0     0     0
	    da134   ONLINE       0     0     0  (resilvering)

I never blogged about what really happened with my previous raidz rebuild, which went south (to put it mildly). The problem then was that I was running raidz, which tolerates 1 disk failure, but it turned out I had several block read errors, so after 4 attempts to resilver/rebuild, zfs still wasn’t able to rebuild onto the fresh drive, simply because there were at least 2 other pretty rotten disks in the raid that kept throwing new sector errors …

So I had to scrap the whole setup and set up a fresh zfs pool. I got my boss to buy some new disks, but it turned out that 4 of those (some Samsung disks) weren’t recognized by the hardware raid controller (?), sooooo I put in some WD Green disks there …

BUT since the pool is less than 25% full, I figured I could change the whole setup from raidz to a mirrored setup, that is, 6 mirror pairs for a total of 12 disks, and on top of that I set copies=2 for the backup dataset. copies=2 doubles space usage because every block is written to 2 blocks on the disk. As long as I have plenty of space I should be better prepared for corrupted sectors/blocks, bit rot and whatnot. 🙂
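
Setting it is a one-liner, though note that copies only applies to blocks written after the property is set:

root:~# zfs set copies=2 tank/backup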

root:~# zfs get copies tank/backup
NAME         PROPERTY  VALUE   SOURCE
tank/backup  copies    2       local
root:~# 

And the mirrored zfs setup looks like this:

root:~# zpool status
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Apr 25 14:20:24 2013
        68.2G scanned out of 4.61T at 61.2M/s, 21h37m to go
        13.8G resilvered, 1.44% done
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    da2     ONLINE       0     0     0
	    da14    ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    da26    ONLINE       0     0     0
	    da38    ONLINE       0     0     0
	  mirror-2  ONLINE       0     0     0
	    da50    ONLINE       0     0     0
	    da62    ONLINE       0     0     0
	  mirror-3  ONLINE       0     0     0
	    da74    ONLINE       0     0     0
	    da86    ONLINE       0     0     0
	  mirror-4  ONLINE       0     0     0
	    da98    ONLINE       0     0     0
	    da110   ONLINE       0     0     0
	  mirror-5  ONLINE       0     0     0
	    da122   ONLINE       0     0     0
	    da134   ONLINE       0     0     0  (resilvering)
	logs
	  da1       ONLINE       0     0     0

errors: No known data errors

zfs raidz – replacing a drive

So I had a faulty drive in a zfs raidz configuration. I replaced the drive without first offlining the faulty drive or detaching it from the raidz configuration … I probably did that part wrong. After the reboot I had something like this:

root@backupmh:~# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
	the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scan: none requested
config:

	NAME                     STATE     READ WRITE CKSUM
	tank                     DEGRADED     0     0     0
	  raidz1-0               ONLINE       0     0     0
	    da1                  ONLINE       0     0     0
	    da13                 ONLINE       0     0     0
	    da25                 ONLINE       0     0     0
	    da37                 ONLINE       0     0     0
	    da49                 ONLINE       0     0     0
	  raidz1-1               DEGRADED     0     0     0
	    da73                 ONLINE       0     0     0
	    da85                 ONLINE       0     0     0
	    da97                 ONLINE       0     0     0
	    da109                ONLINE       0     0     0
	    9120273794345838000  UNAVAIL      0     0     0  was /dev/da121
	spares
	  da61                   AVAIL   
	  da133                  AVAIL   

errors: No known data errors
root@backupmh:~# ls -l /dev/da121
crw-r-----  1 root  operator    1, 100 14 mar 11:59 /dev/da121

It’s /dev/da121 that I replaced. At first I simply wanted to hot-swap the drive, but since this is a really old HP MSA-20 or something, and I’m using some super crappy cheap SATA drives, I had to reboot and do some other stuff before the raid controller would recognize the drive …
Anyway, to get zfs to rebuild (or resilver, as zfs calls it) I ended up with:

root@backupmh:~# zpool offline tank /dev/da121
root@backupmh:~# zpool online tank /dev/da121
warning: device '/dev/da121' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
root@backupmh:~# zpool replace tank /dev/da121

After that I got:

root@backupmh:~# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Thu Mar 14 12:14:51 2013
    6.46G scanned out of 1.36T at 4.60M/s, 85h38m to go
    685M resilvered, 0.46% done
config:

	NAME                       STATE     READ WRITE CKSUM
	tank                       DEGRADED     0     0     0
	  raidz1-0                 ONLINE       0     0     0
	    da1                    ONLINE       0     0     0
	    da13                   ONLINE       0     0     0
	    da25                   ONLINE       0     0     0
	    da37                   ONLINE       0     0     0
	    da49                   ONLINE       0     0     0
	  raidz1-1                 DEGRADED     0     0     0
	    da73                   ONLINE       0     0     0
	    da85                   ONLINE       0     0     0
	    da97                   ONLINE       0     0     0
	    da109                  ONLINE       0     0     0
	    replacing-4            UNAVAIL      0     0     0
	      9120273794345838000  UNAVAIL      0     0     0  was /dev/da121/old
	      da121                ONLINE       0     0     0  (resilvering)
	spares
	  da61                     AVAIL   
	  da133                    AVAIL   

errors: No known data errors

The resilver progress looks slow … but it’s getting faster and faster every time I check it (it started at an estimate of 600+ hours).
I initially set up the 2 raidz configurations with hot spares, so I wonder why it didn’t start resilvering to the corresponding hot spare after the reboot (?).
I should check that out, some day …
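
As far as I can tell, FreeBSD (at this point) has no daemon that activates hot spares automatically, so a spare has to be pulled in by hand, roughly like this (device names taken from the status output above):

# replace the missing disk with one of the listed spares
zpool replace tank 9120273794345838000 da61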

Puppet, FreeBSD and custom $PACKAGESITE

I recently noticed that a puppet-managed server had installed a package that wasn’t supposed to be available …
The thing is, we’re running Tinderbox to build our own packages from ports, which is pretty normal when using FreeBSD and its rolling-release, source-based package system (ports).
Anyway, we’ve set up all servers with a global PACKAGESITE variable pointing to our local repo, so that pkg_add and portmaster will pull packages from there.
We distribute config files and so on via puppet, and what happened is this: on its initial run, puppet pushes out the global /etc/csh.cshrc config, which contains the magic PACKAGESITE setup, but at the same time it starts installing a bunch of packages without actually sourcing /etc/csh.cshrc … So PACKAGESITE is empty the first time puppet runs, which means pkg_add falls back to the official FreeBSD package site …
This is not how I want it, because, for instance, a FreeBSD 8.3-RELEASE with an empty PACKAGESITE will pull packages from a specific ports ‘freeze’ from around the 8.2-RELEASE, and that’s a long time ago by now … The next time I install extra packages on that system, they’ll come from our internal PACKAGESITE, and now the system has a mix of our latest ports and the original RELEASE-era ports, and things start to get complicated …
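
One workaround I’m considering (a sketch only; the URL is a placeholder, and I haven’t verified that puppet’s package provider passes its environment through to pkg_add) is to export PACKAGESITE in the environment that launches the very first puppet run:

# hypothetical bootstrap wrapper for the initial puppet run
export PACKAGESITE="http://pkg.example.com/packages-8.3-amd64/Latest/"
puppet agent --test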

Ruby, bundle install --without development test might be troublesome in FreeBSD

I’m setting up a new Redmine installation on a FreeBSD server. I’m having trouble with:

# bundle install --without development test
Installing sqlite3 (1.3.7) with native extensions 
Gem::Installer::ExtensionBuildError: ERROR: Failed to build gem native extension.
...

Turns out I have to install the sqlite3 gem manually, giving the build a hint about where to find the sqlite3.h header.

# gem install sqlite3 -- --with-sqlite3-include=/usr/local/include
Building native extensions.  This could take a while...
Successfully installed sqlite3-1.3.7
1 gem installed
Installing ri documentation for sqlite3-1.3.7...
Installing RDoc documentation for sqlite3-1.3.7...
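
If I remember correctly, bundler can also be told to remember the flag itself, so that later bundle install runs pass it along (treat the exact invocation as an assumption and check bundle config’s docs):

# make bundler pass the include path whenever it builds sqlite3
bundle config build.sqlite3 --with-sqlite3-include=/usr/local/include
bundle install --without development test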

FreeBSD upgrading packages with portmaster

So we’re running our own local package repository for our FreeBSD farm. When using FreeBSD, you can stick to the binary packages for your current RELEASE (which never get patched for security holes), or you can use binary packages for the stable branch of your RELEASE (they’ll be patched for security holes, but they’ll also keep rolling into new versions; e.g. if you’ve got a 100 or so servers, you’re kinda fucked if you want consistent package versions across them), or you can, like us (and many other FreeBSD admins), run your own local repo.
With a local repo you at least get control over package versions; as a bonus you also get all kinds of freakin’ dependency problems, Tinderbox not always cooperating, and so on.
So we configure portmaster on each server to use binary packages only, pointing at our package repo server. Then you ‘only’ have to issue portmaster -a to upgrade the packages on a server … I’d say you’ve got about a 30% chance of that working out on your first run.
Here’s what I usually have to do (kinda pseudo-ish CLI):

portmaster -a
# portmaster breaks on package a ... so manually remove that package so that
# portmaster -a can finish, then reinstall the troublesome package afterwards
pkg_info | grep 'package a'
pkg_delete 'package a'
ERROR: package b depends on package a
pkg_delete 'package b'
pkg_delete 'package a'
portmaster -a
# portmaster breaks on package c ... same dance again
pkg_info | grep 'package c'
pkg_delete 'package c'
ERROR: package d depends on package c
pkg_delete 'package d'
ERROR: package e depends on package d
pkg_delete 'package e'
pkg_delete 'package d'
pkg_delete 'package c'
portmaster -a
# let's say this time portmaster manages to complete
pkg_add -r [all the packages you manually removed]
# now I suddenly get:
pkg_version: corrupted record (pkgdep line without argument), ignoring
# google it, and maybe you end up at
# http://www.cyberciti.biz/faq/pkg_version-corrupted-record-pkgdep-line-without-argument-ignoring/
# so you try:
portmaster --check-depends
# maybe reinstall any troublesome packages, and if you're lucky you're done

It’s probably because I’ve got more than 10 years of experience with Debian, and only about 6 months with FreeBSD, that I keep getting myself into these kinds of problems, but I’m still getting grumpy about it!

semget errors in FreeBSD jails

We got some semget errors when trying to start uwsgi inside a FreeBSD jail. The solution is to set some /boot/loader.conf variables:

kern.ipc.shmmni="512"
kern.ipc.semmni="512"
kern.ipc.semmnu="512"
kern.ipc.semmns="1024"
kern.ipc.semume="512"

But we still had some problems in the jail after a reboot. It turns out the FreeBSD sysctl variable security.jail.sysvipc_allowed defaults to 0, i.e. SysV IPC is disabled for jails by default. Note that issuing:

# sysctl security.jail.sysvipc_allowed=1

and restarting the jail isn’t enough, ’cause /etc/rc.d/jail will reset that variable upon jail restart; you’d also have to have this in /etc/rc.conf:

jail_sysvipc_allow="YES"

and then you’re able to start jails with IPC support.

This is explained in more detail at www.freebsddiary.org.
Wondering what all these cryptic sysctl variable names mean? Try:

# sysctl -d kern.ipc.shmmni
kern.ipc.shmmni: Number of shared memory identifiers
# 

Also note: I ran into another problem where a lot of IPC semaphores were taken but never released, so I wrote a small script to fix that:

#!/bin/sh
# remove all SysV semaphores owned by a given user,
# passed as the first argument
user="$1"

# ipcs -s lists active semaphores only; the second column is the semaphore ID
for id in $(ipcs -s | grep "$user" | awk '{print $2}'); do
	ipcrm -s "$id"
done

Run it with the username whose semaphores you want to delete as the first argument.