rsync

When you have to back up millions of files, use zfs if you can

I’ve been increasingly worried about one of the media servers I back up: the backup time has been creeping up from 12 hours to about 14 hours as of today. It’s not the amount of data, which is only 1.4 TB. The problem is that there are more than 2,000,000 small files, and rsync has to stat every one of them. So even when little has changed, the backup takes longer and longer.
So yesterday I figured I should try out zfs. The media server is already on zfs, and I use zfs on the backup server too. I also already have a zfs sync script, which I modified some months ago to use mbuffer for maximum bandwidth utilization.

And how did this turn out? Well, the backup time went from about 14 hours to 7 seconds … I just realised I could never move away from zfs for this kind of setup (backing up millions of files).
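
The sync script itself is not shown here, so the following is only a rough sketch of what an incremental zfs send piped through mbuffer can look like. The dataset names, host name and snapshot names are made-up placeholders, and the mbuffer sizes are arbitrary. The reason this kind of transfer can be so much faster than rsync is that an incremental send only ships the blocks that changed between two snapshots, so nothing has to walk and stat two million files.

```shell
#!/usr/bin/env bash
# Sketch only: every name below is a hypothetical placeholder,
# not taken from the real sync script.
SRC_DS="tank/media"          # dataset on the media server (assumed)
DST_HOST="backup.example"    # backup server (assumed)
DST_DS="backup/media"        # receiving dataset (assumed)
PREV_SNAP="backup-prev"      # newest snapshot that both sides already have
NEW_SNAP="backup-new"        # snapshot taken for this run

# Print the pipeline as text so it can be inspected without touching a pool.
build_pipeline() {
    printf 'zfs send -i %s@%s %s@%s | mbuffer -q -s 128k -m 1G | ssh %s zfs receive %s\n' \
        "$SRC_DS" "$PREV_SNAP" "$SRC_DS" "$NEW_SNAP" "$DST_HOST" "$DST_DS"
}

if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: only show what would be run.
    build_pipeline
else
    # Take the new snapshot, then send only the delta since the previous one.
    zfs snapshot "$SRC_DS@$NEW_SNAP"
    zfs send -i "$SRC_DS@$PREV_SNAP" "$SRC_DS@$NEW_SNAP" \
        | mbuffer -q -s 128k -m 1G \
        | ssh "$DST_HOST" zfs receive "$DST_DS"
fi
```

By default the sketch only prints the pipeline. Setting DRY_RUN=0 would actually run it, which assumes both snapshots' parent dataset exists on each side and that the previous snapshot is present on the backup server.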

rsync and variables

I ran into this annoying thing about rsync when I was writing this new backup script in Bash. I wanted the config files for the clients being backed up to support excluding directories. Since rsync supports --exclude='some/path', I figured I could have config files with stuff like this:

EXCLUDE="--exclude='proc/*' --exclude='dev/*'"

But while testing this I noticed rsync simply ignored the EXCLUDE variable. So I created a little test with some arbitrary proc and dev directories containing some arbitrary sub-directories. Check out this example:

joar@uranus:~/tmp$ rsync -va --delete --exclude "dev" --exclude "proc/*" rsynctest/ rsynctest2/ | grep proc
jdk1.7.0_07/jre/lib/amd64/libsaproc.so
jdk1.7.0_07/proc/
proc/
joar@uranus:~/tmp$ EXCLUDE="--exclude 'dev' --exclude 'proc/*'"
joar@uranus:~/tmp$ rsync -va --delete $EXCLUDE rsynctest/ rsynctest2/ | grep proc
jdk1.7.0_07/jre/lib/amd64/libsaproc.so
jdk1.7.0_07/proc/
jdk1.7.0_07/proc/1
jdk1.7.0_07/proc/2
jdk1.7.0_07/proc/3
jdk1.7.0_07/proc/4
jdk1.7.0_07/proc/5
jdk1.7.0_07/proc/6
proc/
proc/1/
proc/2/
proc/3/
proc/4/
proc/5/
proc/6/
proc/7/
proc/8/
joar@uranus:~/tmp$ 

Notice that when I pass the --exclude options directly in the first rsync, the proc dir is rsync’ed without its sub-dirs, but when I put the same options into the EXCLUDE variable, the sub-dirs of proc are rsync’ed too (!). I poked around at this for a while, trying double quotes vs. single quotes, and even tried expansion stuff like $(echo $EXCLUDE).
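
What seems to be going on is ordinary shell quoting rather than anything rsync-specific. Quotes typed on the command line are removed by the shell, but quotes stored inside a variable are just characters: when $EXCLUDE is expanded unquoted, it is split on whitespace and the single quotes stay in the words. rsync then receives patterns like 'proc/*' including the quote characters, which match nothing. A small sketch to make this visible, using a hypothetical helper called show_args that prints each argument it receives:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every received argument on its own line as <arg>.
show_args() { printf '<%s>\n' "$@"; }

EXCLUDE="--exclude 'dev' --exclude 'proc/*'"

echo "typed directly:"
# The shell strips the quotes before the command sees the arguments.
show_args --exclude 'dev' --exclude 'proc/*'

echo "via the variable:"
# Unquoted expansion splits on whitespace, but the quotes inside the
# value are kept as literal characters.
show_args $EXCLUDE
```

The first call prints `<dev>` and `<proc/*>`, while the second prints `<'dev'>` and `<'proc/*'>`, quotes included.
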
After a good night’s sleep I came up with this workaround:

joar@uranus:~/tmp$ EXCLUDE='proc/* dev/*'
joar@uranus:~/tmp$ for f in $EXCLUDE; do echo $f >> tmpfile; done
joar@uranus:~/tmp$ cat tmpfile 
proc/*
dev/*
joar@uranus:~/tmp$ rsync -va --delete --exclude-from=./tmpfile rsynctest/ rsynctest2/ | grep proc
jdk1.7.0_07/jre/lib/amd64/libsaproc.so
jdk1.7.0_07/proc/
proc/
joar@uranus:~/tmp$ 
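
Two things in the workaround above can bite later: the >> keeps appending to tmpfile on every run, and the unquoted $EXCLUDE would be glob-expanded if a proc or dev directory happened to exist in the current directory. A slightly more defensive sketch of the same idea, where write_exclude_file is a hypothetical helper name:

```shell
#!/usr/bin/env bash
EXCLUDE='proc/* dev/*'

# Write one pattern per line to a fresh temp file and print the file name.
write_exclude_file() {
    local file
    file=$(mktemp) || return 1
    set -f                      # disable globbing so proc/* stays literal
    printf '%s\n' $1 > "$file"  # unquoted on purpose: split on whitespace
    set +f
    printf '%s\n' "$file"
}

exclude_file=$(write_exclude_file "$EXCLUDE")
cat "$exclude_file"

# Intended use, with the test directories from the transcript:
# rsync -va --delete --exclude-from="$exclude_file" rsynctest/ rsynctest2/

rm -f "$exclude_file"
```

Like the original loop, this still assumes that no exclude pattern contains whitespace.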

So I got what I want 🙂
The reason I write the contents of the EXCLUDE variable into a tmpfile is that I want one config file per backup client. Those who set up new backups shouldn’t have to remember to set up a second config file with directory exclusions.
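
Another way to keep a single config file, without a tmpfile, is to store the patterns in a Bash array instead of a string. This is only a sketch and it changes the config format: the names EXCLUDE_PATTERNS, EXCLUDE_ARGS and build_exclude_args are made up for the example, and it assumes the config file is sourced by a Bash script.

```shell
#!/usr/bin/env bash
# Hypothetical per-client config fragment: one array element per pattern.
EXCLUDE_PATTERNS=('proc/*' 'dev/*')

# Turn the patterns into rsync options, one array element per option.
build_exclude_args() {
    EXCLUDE_ARGS=()
    local pattern
    for pattern in "${EXCLUDE_PATTERNS[@]}"; do
        EXCLUDE_ARGS+=("--exclude=$pattern")
    done
}

build_exclude_args

# Quoted "${EXCLUDE_ARGS[@]}" expands to exactly one word per element,
# with no word splitting or globbing applied to the patterns.
printf '%s\n' "${EXCLUDE_ARGS[@]}"

# Intended use, with the test directories from the transcript:
# rsync -va --delete "${EXCLUDE_ARGS[@]}" rsynctest/ rsynctest2/
```

Because each pattern is its own array element, no quote characters end up inside the patterns, and patterns containing spaces would also survive intact.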