Backup strategies in the age of large harddrives

Abstract:	Our harddrives are large. Backup solutions are not. What to do?
Intended audience:	Anybody who operates a computer with more harddrive space than they can make regular backups of. And that means backups that are not overwritten the next time you back up.
Required knowledge:	You should know what a filesystem, a snapshot, a checksum and rsync are. http://en.wikipedia.org/wiki/Snapshot_(computer_storage) http://en.wikipedia.org/wiki/Checksum http://en.wikipedia.org/wiki/Rsync The basics of RAID won't hurt either, but let's just say it means combining a couple of harddrives so that one can fail without losing any data.

Introduction

Most of us feel uneasy when we look at the masses of data that we store on our modern terabyte harddrives and think back to a time when 1-3 tapes would provide suitable backup.

The most common backup strategy today, outside of companies that can afford a tape robot, is external harddrives, usually USB or FireWire. This is a very shallow form of backup.

Having one single backup is a problem, as it covers only a part of the failure modes that can wreck your data. Also, if you are using USB in particular, be warned that the USB connection can be very flimsy. Both USB hardware and USB software (drivers) often suffer from "mouse-syndrome", meaning that they have been designed with simple devices in mind, not with massive flows of data where no bit is allowed to flip.

What do we want to guard against?

That's the first thing to do: make a catalog of the possible failure modes.

Here are the obvious ones:

Accidentally deleting data (user action). This includes overwriting good files with bad files.

Harddrives dying.

If we use any form of backup, we also have to consider what happens when the backup hardware or software breaks.

At first sight this is pretty manageable. You use RAID-1, 5 or 6 against dead harddrives. You back up to a USB drive so that you can "roll back" when you have fatfingered your files. If the external drive breaks, that's fine as long as you don't kill files on the primary computer at the same time.

Then there is the class of "silent corruption" events. This means that you damage or otherwise lose files on your primary computer but don't notice right away. By the time you notice you have probably overwritten that single external harddrive backup with the bad version, leading to a permanent loss of the data.

Reasons for silent corruption include:

Memory corruption on the primary computer. As I explain in my page about ECC memory, it is much more common than people assume that your computer just flips random bits. A recent study on Google's computers also found that random bit errors are several orders of magnitude more common than previously thought (previous assumptions were based on data provided by DRAM manufacturers and were too optimistic). The details of how this works can be found on my ECC page, but in short, every bit flipped can destroy any data on your harddrive. You can kill sections in files you didn't even touch and you might not notice for months or years. This can also kill directories and you only notice much later. You can lose the whole filesystem but at least you'll notice that one right away. The question is what happens to data you only look at again months after it has been corrupted.

User deletion of files that goes unnoticed. Let's say you move data from one place to another, but when deleting the original tree you delete something that wasn't actually copied first. The classic example is careless use of rsync on the commandline and getting lost in its options. Again, you might not notice for a long time.

The "RAID hole". Even if you use RAID on the primary computer, there is the problem known as the RAID hole. This stems from the fact that in classic RAID (RAID on a raw device) the redundancy mechanism and the filesystem do not talk to each other. If you have a power failure, or a harddrive failure with subsequent revival at the wrong moment, then the RAID subsystem might be left with multiple possible interpretations of the data available - and flip a coin deciding which one is right. It isn't very likely to happen, but it is another form of silent corruption.

And some more things to keep in mind:

You can have complicated filesystems that support snapshots, or that integrate RAID in the filesystem to avoid the RAID hole, on the primary computer. But then you have to live with more complex software, all this software is pretty new, and you get new constraints on OS and kernel versions that might clash with other requirements.

When power supplies in computers die, they more often than not kill several connected components. That is bad for RAID users. You lose 2 drives (or whatever your "loss threshold" is, it's 3 in RAID-6 and whatever you paid for in RAID-1) and you are out.

RAID re-sync. Even if they have RAID, our home systems, and many business systems, are usually made from a low number of large harddrives. That is partially good because a low count keeps the frequency of failures down. But the re-sync time can be insanely long. And remember that losing a second (or whatever your loss threshold is) drive during this re-sync will make you lose everything.

But the trick is: the sudden, long-lasting stress from the re-sync of the RAID greatly increases the chance of failure of another drive. This is a real problem. It is kind of OK on systems where you can avoid doing anything else while the re-sync is going on, which is usually the case at home but not necessarily in the office. And many people panic and try to pull the most important data off the raid while the re-sync is going on, thereby both increasing the stress some more and prolonging the duration of the re-sync.

Harddrives die on shelves. Many harddrives do not come back up after spending a year or so on a shelf. Happened to me just recently. So just putting away a USB drive every few months won't really help that much.

Proposed solution

Here is what I ended up doing:

Primary machine has simple software - Linux software RAID, ext3fs. No snapshots, no compression. This is done so that the primary machine is not constrained in kernel choices and so that I can easily read this data on a different computer in case of disaster. I don't use hardware RAID for reasons I can explain elsewhere.

Primary machine has RAID. I would prefer to not do this to keep the software even simpler, but it is just not practical. If I don't do RAID on the primary machine, I will have to recover from backup on a single disk failure. If I have -say- three harddrives with 1 TB each, then I'll have to restore a full terabyte before my primary computer is back up. That takes too long. So RAID it is.

Primary machine has ECC RAM, a deeply trusted power supply and a quality battery backup unit.

Basement has a backup server. I run it on a network-controllable power switch and only power it up when I need it.

Backup server has ECC RAM and a deeply trusted power supply.

Backup server runs a filesystem that does:

snapshots

compression

RAID-Z, which is a RAID inside the filesystem so that there is no RAID hole

The backup then goes as follows:

Exclude obvious junk hanging out on primary machine from backup.

rsync all data from primary machine to backup server.

Look for obvious junk one more time.

Make snapshot.

Repeat after appropriate amount of time.

Obviously, the snapshots then allow me to go back to the point in time of any previously issued rsync command.
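The push-then-snapshot cycle listed above can be sketched in a few lines of Python. This is only an illustration under assumptions, not my actual script: the host name "backuphost", the dataset "tank/backup", the paths and the exclude file are all made up, and the sketch only builds the two commands instead of running them.

```python
import shlex
from datetime import date

def backup_commands(source, host, dest_path, dataset, exclude_file, day):
    """Return the rsync push and the snapshot command as argument lists."""
    rsync = [
        "rsync",
        "-a",                               # archive mode: recurse, keep permissions and times
        f"--exclude-from={exclude_file}",   # the junk exclusion list, one pattern per line
        source,
        f"{host}:{dest_path}",              # a remote target, so the primary is only read
    ]
    # Name the snapshot after the day of the push (hypothetical naming scheme).
    snapshot = ["ssh", host, "zfs", "snapshot", f"{dataset}@{day.isoformat()}"]
    return rsync, snapshot

# All names below are placeholders for illustration only.
rsync_cmd, snap_cmd = backup_commands(
    "/home/", "backuphost", "/tank/backup/home/",
    "tank/backup", "/etc/backup-excludes.txt", date(2010, 1, 31),
)
print(shlex.join(rsync_cmd))
print(shlex.join(snap_cmd))
```

Printing the commands first and only running them after a look matches the "review before you push" habit. The second look for junk would happen between the two commands.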

Snapshots are very small for small changes because they work block by block. So I can take them often as long as I don't push junk.

Junk management is required here. A snapshot becomes readonly. If you got some large piece of junk into a snapshot you can only get the space back by deleting that whole snapshot. I deal with this in more detail below.

Evaluation against threats

Let's see how this solution does against the threats:

User fatfingers files - covered of course. Just get it from the backup server.

Harddrive failure. Any harddrive in here can die without impacting anything, since both primary and secondary computer have RAID.

Power supply failure kills all harddrives in the primary machine - covered via intact backup server. This also covers any other kind of double disk fault, including new disk faults during recovery of a disk.

Silent data corruption or unnoticed fatfingering of data on the primary machine, including RAID hole occurrences. That is the biggie, and that is why I really like this solution.

Not only can you go back in time to before the corruption - this setup allows you to detect corruption. You can look at the differences between snapshots (using the regular diff command). If the primary machine silently corrupted data then you will see large differences in places where there shouldn't be any. This requires that you identify a suitable set of files that you don't expect to change, and it requires that you use rsync with the "-c" option that does checksumming. You don't have to do this on every snapshot, I'd say once a month? Or whenever you clean up snapshots and want to drop some?
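Instead of plain diff, the comparison of a "should never change" set of files between two snapshots can also be done by checksum. The following is a minimal sketch, assuming both snapshots are reachable as ordinary directory trees (on ZFS they usually show up under the hidden .zfs/snapshot directory of the dataset). The function names are my own invention for this example.

```python
import hashlib
import os

def tree_checksums(root):
    """Map each regular file's path relative to root to its SHA-256 digest."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue  # skip symlinks and special files
            digest = hashlib.sha256()
            with open(full, "rb") as handle:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            sums[os.path.relpath(full, root)] = digest.hexdigest()
    return sums

def changed_files(old_root, new_root):
    """List files present in both trees whose content differs."""
    old = tree_checksums(old_root)
    new = tree_checksums(new_root)
    return sorted(path for path in old.keys() & new.keys() if old[path] != new[path])
```

Any path this reports inside a tree you never touch is a candidate for silent corruption and worth a closer look before you drop the older snapshot.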

Silent memory corruption on backup server. Well, anything can happen with the data on there. But in any case you have your primary computer's filesystem and the backup push will at least temporarily "repair" the pre-snapshot copy on there, which you can retrieve freshly fatfingered files from.

Failure of the backup software. Rsync has been around for many, many years and is in such heavy use for all these things that I feel comfortable it'll be about the last thing to fail. The bigger risk is misuse.

But no matter how dumb you play when running the rsync command, you will easily be able to avoid damaging data on your primary computer by just reviewing that you have a hostname in the target. The bigger risk is that you make a mistake when compiling the exclusion list. If you exclude files from being backed up that's that. I recommend keeping the backup commandline simple and after each change manually reviewing that the goodies you want actually arrived. If you can't manage, buy bigger harddrives for the backup server and back up everything.
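The "is there a hostname in the target" review can also be automated in a wrapper script. This is a rough sketch of my own, not part of rsync: it follows the rule that rsync treats a destination as remote when a colon appears before the first slash, or when it is an rsync:// URL.

```python
def is_remote_target(target):
    """True if an rsync destination names a remote host (host:path or rsync://)."""
    if target.startswith("rsync://"):
        return True
    head, sep, _tail = target.partition(":")
    # Text before the first colon counts as a host only if it contains no slash.
    return bool(sep) and bool(head) and "/" not in head

print(is_remote_target("backuphost:/tank/backup/"))  # hypothetical host name
print(is_remote_target("/tank/backup/"))
```

A wrapper would refuse to run the push whenever this returns False for the destination argument.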

Software failure due to complicated subsystems used. Even if ZFS or RAID-Z fail catastrophically you still have the data on the primary computer. On the primary computer you have chosen to use the most tested and simplest solutions, namely ext3fs. That should be the best of both worlds.

Harddrives die on shelves. My intention is to finally place the drives that are in the backup server in storage and start over with new drives. Obviously that opens the possibility of death by underuse, a risk I rate as high. However, keep in mind that these things are RAIDed. You can absorb as many drive failures on the shelf as you could absorb during runtime. Since we are talking a low drive count here the chance of being able to get enough drives going far into the future isn't too bad.

I could also introduce some regular "exercise" for old harddrives but so far I haven't been paranoid enough. Working on it.

Also keep in mind harddrives keep getting much larger and cheaper. By the time you retire one array you can probably acquire enough disk space to store a copy of the whole array very cheaply, either online or offline. I have whole copies of my 1994 SunOS4 installation on my primary computer, in less space than a DVD takes. I like that extra plain copy because I don't want to have to read -say- a FreeBSD ZFS splattered across several drives in 15 years when we all use megacloud bitwizzler filesystem on Iceland or whatever. It's pain enough to read a SPARC SunOS4 UFS dump on anything modern today.

Management details

I mentioned previously that "junk management" is critical here. Snapshots are readonly after they have been taken. If you ever pushed large junk from the primary computer via the backup server into a snapshot, you can only ever get the disk space back by deleting the whole snapshot.

That sounds scary, but I found that there is a way to deal with this more comfortably, and that is by comparing the sizes of the snapshots. Let's say you have a couple snapshots already. You push data and make a new snapshot. You compare the size of the snapshots, you look whether the new one is much larger. If this is your personal computer you will usually have an idea whether you created some legitimate large piece of data since the last snapshot. So you are able to spot junk, kind of. If you spot this you can hunt for the junk, exclude it from the backup script (the one that pushes from primary machine to backup server) and take a new snapshot right after the bad one. If the new snapshot meets expectations you nuke the overblown one.
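The size comparison is easy to script once you have a list of snapshot names and sizes (on ZFS something like "zfs list -t snapshot" reports a used size per snapshot). Here is a small sketch; the factor of 3 and the use of the median of the earlier snapshots as the baseline are arbitrary assumptions of mine, so tune them to your own data.

```python
from statistics import median

def suspicious_snapshots(sizes, factor=3.0):
    """Flag snapshots much larger than the median of all earlier ones.

    sizes is a list of (name, size_in_bytes) tuples, oldest first.
    """
    flagged = []
    for index in range(1, len(sizes)):
        name, size = sizes[index]
        baseline = median(s for _name, s in sizes[:index])
        if baseline > 0 and size > factor * baseline:
            flagged.append(name)
    return flagged

# Made-up numbers: the third snapshot picked up some junk.
history = [("mon", 100), ("tue", 120), ("wed", 900), ("thu", 110)]
print(suspicious_snapshots(history))
```

This only tells you where to look. Whether the extra space is junk or a legitimate large piece of data is still your call.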

Typically people will use a system of varying frequency: you keep every one of the newest snapshots, of snapshots older than 3 weeks you keep one per week, and of snapshots older than 3 months you keep one per month. That means dropping snapshots, and you can selectively drop the ones that have odd sizes - after looking at what's in that space, since it might be valuable.
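Such a thinning schedule can be written down as a small function. This is a sketch under assumptions: "3 months" is taken as 90 days, weeks are ISO calendar weeks, and the oldest snapshot in each week or month is the one that survives. None of these choices is dictated by the scheme itself.

```python
from datetime import date, timedelta

def snapshots_to_keep(snapshot_dates, today):
    """Return the snapshot dates that survive thinning, oldest first."""
    keep = set()
    seen_buckets = set()
    for day in sorted(snapshot_dates):  # oldest first, so the oldest per bucket wins
        age = today - day
        if age <= timedelta(weeks=3):
            keep.add(day)               # recent: keep every snapshot
            continue
        if age <= timedelta(days=90):
            bucket = ("week",) + tuple(day.isocalendar()[:2])  # ISO year and week
        else:
            bucket = ("month", day.year, day.month)
        if bucket not in seen_buckets:
            seen_buckets.add(bucket)
            keep.add(day)
    return sorted(keep)

dates = [date(2010, 1, 5), date(2010, 1, 20), date(2010, 6, 1),
         date(2010, 6, 2), date(2010, 6, 29)]
for day in snapshots_to_keep(dates, date(2010, 6, 30)):
    print(day.isoformat())
```

Everything not returned is a candidate for deletion, and the odd-sized ones among those candidates are the first to go after a look at what they contain.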

Implementation

My way of building this thing in practice:

When I buy a new harddrive array for the primary computer I keep the old one. The old array goes into the backup machine. Space-wise that works out in my case. Large amounts of data on the primary computer are not important enough to require full backup (remember I have RAID on the primary computer). The backup server filesystem has compression but the primary machine does not, which makes up for some of the space difference. In practice this comes out space-wise to allow me to comfortably do a full backup of all important data, have snapshots, and run out of space around the time when it's time for a new array anyway. YMMV.

The machine used for the backup server is multi-purpose. It's multi-boot and when I don't push backup it can do something else. It's all diskless boot and network-controlled power strip. Of course I picked one that has ECC RAM and is generally trustworthy.

The OS chosen is FreeBSD, because of ZFS. ZFS fulfills all the requirements of compression, snapshots and RAID integrated into the filesystem (protection against the RAID hole). Linux will have to wait for BTRFS to be ready to have this. An alternative is Solaris, which has ZFS, too. You don't have to use the OS that does your backup when you are not snapshotting your primary machine, so you don't have to like it either. You could also think about using a virtual machine, passing the raw devices for the backup disks through to the guest/domU that has the backup OS, but I didn't try that.

GbE. A little lame but what can you do?

rsync.

Alternatives

There are programs that do this kind of snapshotting in userland.
http://rsnapshot.org/
http://www.nongnu.org/rdiff-backup/

That way you wouldn't have to have an OS with snapshots on the backup server. This won't be as fine-grained: the filesystem works on blocks, these guys do it by file, AFAIK. And of course there is no integrated RAID and I don't think they do compression.

Didn't try this yet, I have no idea whether it works better or worse than a full machine with ZFS.

Weaknesses and expectations

The primary thing I don't like about this is the lack of snapshots on the primary machine.

Backup to the backup server takes hours, so I won't have -say- one snapshot every hour, a thing that would be entirely practical with snapshots on the primary machine. You would then drop these snapshots after the big rsync to the backup server runs.

Unfortunately Linux is on my primary machine and for whatever reason they are years behind the other OSes when it comes to filesystems and snapshots. LVM level snapshots are a complete joke (sorry, raw device snapshots and they get dropped on overflow, I'm not making this up). Before Linux gets BTRFS we are probably out of luck here.

I could experiment with NFS storage towards some fileserver that has a modern filesystem, but that has obvious latency issues, GbE is too slow even outside latency/turnaround and this requires one more permanently running machine - requiring ECC RAM, best power supply and battery backup. Not gonna happen. A Netapp with 10GbE or computer-to-computer SCSI would work I guess.

iSCSI doesn't help as it could only do raw-device snapshots.

I expect that some time in the future either Linux will have real snapshots or that I will be able to run FreeBSD on my primary machine again. Then I will certainly do a hierarchy of snapshots. A couple quick ones on the primary machine while the backup server is off or does something else. Then you do the main backup and save to a snapshot on the backup server and on successful completion drop the short-term snapshots on the primary machine. Or don't drop them as long as there's space.