Backup Retention with Amazon S3
I manage several web servers, and a solid backup strategy is absolutely essential. I have been using Amazon S3 (Amazon Web Services, Simple Storage Service) for quite some time to store backup files for the long term and to keep them away from my production systems (essentially "off-site"). One item I have needed to address since I started using S3 is backup retention, or more specifically, the automated purging of backup files that are no longer needed. Well, sometimes good things come to those who wait.
Generally, backups are made for three purposes: to recover from a major system failure, to recover from software or human error, and to maintain archives. To accomplish these goals, backups must be made on a regular basis, and need to be retained for some period of time. Further, processing time and costs need to be considered.
There are numerous classic strategies for managing backups effectively. I make two kinds of backup: full backups and incremental backups. A full backup includes everything I need in order to fully recover my system. Full backups are fairly large, and include all software on the system, all configuration files, all data files, and complete dumps of all databases. An incremental backup includes only files that have been created or updated since the last backup (whether that was a full backup or an incremental backup). This helps to limit the necessary disk space and processing time. In the event that backups are needed to recover a system, I can restore it using the latest full backup and all incremental backups made since that full backup.
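The restore rule described above (latest full backup, then every incremental taken after it) can be sketched in a few lines of Python. The `Backup` record and the `restore_chain` function are hypothetical names used only for illustration; they are not part of any backup tool mentioned here.

```python
from datetime import date
from typing import List, NamedTuple


class Backup(NamedTuple):
    """Hypothetical record describing one backup file."""
    taken_on: date
    kind: str  # "full" or "incremental"


def restore_chain(backups: List[Backup]) -> List[Backup]:
    """Return the backups to restore, oldest first: the latest full
    backup followed by every incremental backup taken after it."""
    ordered = sorted(backups, key=lambda b: b.taken_on)
    full_indexes = [i for i, b in enumerate(ordered) if b.kind == "full"]
    if not full_indexes:
        raise ValueError("no full backup available to restore from")
    # Everything from the latest full backup onward is needed.
    return ordered[full_indexes[-1]:]
```

Because each incremental only holds changes since the previous backup, the chain has to be restored in order, which is why the function returns it sorted by date.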
Another aspect of a backup strategy is retention, or the length of time any backup should be kept. I have, unfortunately, been managing this manually, always intending to automate the process. The goal is to delete from the storage service all backup files that are no longer needed.
For my purposes, I classify my backups into several categories based on the time at which the backup is taken: quarterly, monthly, weekly, and daily. Once per week, my systems make a full backup, and depending on the date, that full backup is classified as either quarterly, monthly, or weekly. The first full backup of January, April, July, or October is classified as a quarterly backup. The first full backup of any other month is classified as a monthly backup. All other full backups are classified as weekly. Backups taken on the other six days of the week are considered daily backups, and they are all incremental.
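As a rough sketch, the classification rule can be written as a small Python function. The post does not say which day of the week the full backup runs, so the Sunday setting below is purely an assumption for illustration, and `classify_backup` is a hypothetical name.

```python
from datetime import date

# Assumption: the weekly full backup runs on Sunday.
# date.weekday() numbers the days Monday=0 through Sunday=6.
FULL_BACKUP_WEEKDAY = 6
QUARTER_START_MONTHS = {1, 4, 7, 10}  # January, April, July, October


def classify_backup(day: date) -> str:
    """Return "quarterly", "monthly", "weekly", or "daily" for a backup date."""
    if day.weekday() != FULL_BACKUP_WEEKDAY:
        return "daily"  # incremental backup on the other six days
    # The first occurrence of any weekday in a month falls on day 1 to 7.
    if day.day <= 7:
        return "quarterly" if day.month in QUARTER_START_MONTHS else "monthly"
    return "weekly"
```

The `day.day <= 7` test is what identifies the first full backup of a month without having to look at any earlier backups.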
Of course, backup files don't need to be kept forever, and as such, my strategy includes a period of time after which backup files are to be purged. In order to accommodate system recovery, I retain daily backups for three weeks, weekly backups for six weeks, and monthly backups for one year. For archival purposes, I retain quarterly backups permanently.
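The retention periods above boil down to a lookup table and a date comparison. The following is a minimal sketch; `RETENTION` and `is_expired` are hypothetical names, and treating "one year" as 365 days is an assumption.

```python
from datetime import date, timedelta

# Retention period per category; None means "keep permanently".
RETENTION = {
    "daily": timedelta(weeks=3),
    "weekly": timedelta(weeks=6),
    "monthly": timedelta(days=365),  # assumption: one year = 365 days
    "quarterly": None,
}


def is_expired(category: str, taken_on: date, today: date) -> bool:
    """Return True if a backup is older than its category's retention period."""
    keep_for = RETENTION[category]
    if keep_for is None:
        return False  # archival backups are never purged
    return today - taken_on > keep_for
```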
As mentioned, I use Amazon S3 for long-term storage of my backups. S3 allows you to create "buckets" into which you can store "objects". Each backup file is an object in an S3 bucket. Each object in a bucket has a unique name. One can use name prefixes to organize objects, much like folders or directories in a hierarchical file system (although there are no true folders in S3). This service is cheap, with users paying based on how much data they have stored over time, and how much bandwidth they use transferring objects into and out of storage.
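To make the "prefixes instead of folders" idea concrete, here is a small sketch that works on plain strings and does not touch S3 at all. The object names are invented for illustration, and `group_by_prefix` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_prefix(keys: List[str]) -> Dict[str, List[str]]:
    """Group object names by everything up to and including the first "/".
    Names without a "/" are grouped under the empty prefix."""
    groups = defaultdict(list)
    for key in keys:
        prefix, sep, _ = key.partition("/")
        groups[prefix + sep if sep else ""].append(key)
    return dict(groups)


# Invented object names, for illustration only.
keys = ["daily/web1.tar.gz", "daily/web2.tar.gz", "weekly/web1.tar.gz", "notes.txt"]
print(group_by_prefix(keys))
```

The "folder" is nothing more than the leading part of the name, which is exactly why a rule keyed on a prefix can select a whole group of backups.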
So, the idea of purging unneeded backups is important because of the costs involved. You need to pay for everything you have stored on an ongoing basis, so it's important for me to purge the backups that have essentially "expired". This is what I have intended to automate for quite some time, but have been handling manually to this point.
This is where it has paid for me to wait. I commend Amazon on the improvements they continually make to their web services business. It seems they are always improving and innovating.
They have just recently added a feature to S3 where you can add "Lifecycle Rules". For any given bucket, you can define a time period after which objects with a certain prefix should be automatically deleted. A lifecycle rule basically includes an object name prefix and some number of days. Any object with the given prefix is automatically deleted from the bucket after the specified number of days has passed.
Now, all I have to do is make sure that I prefix my backup file object names with one of the following, as appropriate: "quarterly/", "monthly/", "weekly/", or "daily/". Then I can define lifecycle rules for the latter three to automatically delete them according to my backup retention strategy. Objects under "quarterly/" get no rule, so they are kept permanently.
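For anyone who prefers to script the rules rather than set them up by hand, the sketch below builds a lifecycle configuration as a plain Python dictionary. It assumes the third-party boto3 SDK and its `put_bucket_lifecycle_configuration` call for the final step; the post does not say how the rules were actually created, and the function names, rule IDs, and 365-day year are all illustrative assumptions.

```python
# Days to keep objects under each prefix. "quarterly/" is deliberately
# absent, so quarterly backups are never expired.
RETENTION_DAYS = {"daily/": 21, "weekly/": 42, "monthly/": 365}


def build_lifecycle_configuration(retention_days):
    """Build one expiration rule per prefix, in the dictionary shape
    that boto3 expects for a bucket lifecycle configuration."""
    rules = []
    for prefix, days in sorted(retention_days.items()):
        rules.append({
            "ID": "expire-" + prefix.rstrip("/"),  # hypothetical rule name
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Expiration": {"Days": days},
        })
    return {"Rules": rules}


def apply_lifecycle(bucket_name, configuration):
    """Send the configuration to S3. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above runs without it
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=configuration,
    )
```

Keeping the rule builder separate from the network call makes it easy to print and review the rules before applying them to a real bucket. Note that applying a configuration this way replaces whatever lifecycle rules the bucket already has.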
Yes, sometimes it pays to wait. I have been putting off automating my backup retention strategy for as long as I have been using Amazon S3. Now, Amazon has basically implemented the feature for me. Problem solved.