HPR: A private data cloud

The Hacker Public Radio logo: an old-style microphone with the letters HPR

Transcript(ish) of Hacker Public Radio Episode 544

Over the last two years I have stopped using analogue cameras for my photos and videos. As a result I no longer print out photos when the roll is full, which goes some way to explaining why my mother has no recent pictures of the kids. With living in a digital world comes the realization that we need to take a lot more care when it comes to making backups.

In the past, if my PC’s hard disk blew up, virtually everything of importance could be recreated. That simply isn’t the case any more when the only copy of your cherished photos and videos is on your computer. Add to that the fact that, in an effort to keep costs decreasing and capacity increasing, hard disks are becoming more and more unreliable (pdf).

A major hurdle to efficient backups is that the capacity of data storage is exceeding what can be practically transferred to ‘traditional’ backup media. I now have a collection of media approaching 250 GB, where backing up to DVD is no longer feasible. Even if your collection is smaller, be aware that SD cards, USB sticks, DVDs and even tapes degrade over time.

And yet hard disks are cheap. You can get a 1.5 TB disk from amazon.com for $95 or a 1 TB disk for €70 from mycom.nl. So the solution would appear to be a juggling act where you keep moving your data across a pool of disks and replace the drives as they fail. Probably the easiest solution is to get a hand-holding Drobo or a sub-$100/€100 low-power NAS solution. If you fancy doing it yourself, Linux has had support for fast software mirroring/RAID for years.
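
If you do go the do-it-yourself route, a minimal sketch of a two-disk mirror with mdadm might look like this (the device names and mount point are just examples, not my actual setup):

# create a RAID 1 (mirror) array from two spare disks
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
# put a filesystem on it and mount it where the backups will live
mkfs.ext3 /dev/md0
mount /dev/md0 /data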

Problem solved ….

NASA Image of the Earth taken by Apollo 17 over green binary data

…well not quite. What if your NAS is stolen or accidentally destroyed?

You need to consider a backup strategy that also mirrors your data to another geographic location. There are solutions out there to store data in the cloud (Ubuntu One, Dropbox, etc.). The problem is that these services are fine for ‘small’ amounts of data but get very expensive very quickly for the amount of data we’re talking about.

The solution, well my solution, is to mirror data across the Internet using rsync over ssh to my brother’s NAS, and he mirrors his data to mine. This involves a degree of trust, as you are now putting your data into someone else’s care. In my case it’s not an issue, but if you are worried about this you can take the additional step of shipping them an entire PC. This might be a low-power device running just enough of an OS to get onto the Internet; from there you can ssh in to mount an encrypted partition. When hosting content for someone else you should consider the security implications of another user having access to your network from behind your firewall. You would also need to be confident that they are not hosting anything, or doing anything, that would get you in trouble with the law.

Once you are happy to go ahead, what you need to do is start storing all your important data on the NAS in the first place. You will want to have all your PCs and other devices back up to it. It’s probably a good idea to mount the NAS on the client PCs directly using NFS, Samba, sshfs, etc. so that data is saved there directly. If you and your peering partner have enough space you can start replicating immediately, or you may need to purchase an additional disk for your remote peer to install. I suggest that you do the initial drop locally and transfer the data by sneakernet, which will be faster and avoid issues with the ISPs.
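
For the client mounts, something along these lines in /etc/fstab would do; the host name, share name and mount point here are only examples:

# NFS mount of the NAS share
nas.local:/data/AUTOSYNC  /data/AUTOSYNC  nfs   defaults  0  0
# or the same share over Samba/CIFS
//nas.local/AUTOSYNC      /data/AUTOSYNC  cifs  credentials=/etc/samba/cred,uid=user  0  0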

It’s best to mirror between drives that support the same file attributes. For instance, copying files from ext3 to FAT32 will result in a loss of user and group permissions.
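
You can see this for yourself by copying a file onto a FAT32-formatted stick and comparing the listings (the paths here are just examples):

ls -l /data/AUTOSYNC/photo.jpg      # shows the real owner, group and mode
cp /data/AUTOSYNC/photo.jpg /media/usbstick/
ls -l /media/usbstick/photo.jpg     # on FAT32 the owner, group and mode come from the mount options instead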

When testing, I usually create a test directory on the source and destination containing some files and directories that are identical, different and modified, so that I can confirm the rsync operations.
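
A quick way to set up such a test tree, with file names that are of course just examples:

mkdir -p /data/AUTOSYNC/test /media/disk/test
echo "same"      > /data/AUTOSYNC/test/identical.txt
echo "same"      > /media/disk/test/identical.txt
echo "source"    > /data/AUTOSYNC/test/modified.txt
echo "dest"      > /media/disk/test/modified.txt
echo "only here" > /data/AUTOSYNC/test/source-only.txt
echo "orphan"    > /media/disk/test/dest-only.txt    # --delete should remove this one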

To synchronize between locally mounted disks you can use the command:

rsync -vva --dry-run --delete --force /data/AUTOSYNC/ /media/disk/

/data/AUTOSYNC/ is the source and /media/disk/ is the destination. (Note the trailing slash on the source: it tells rsync to copy the contents of /data/AUTOSYNC/ rather than the directory itself.) The --dry-run option will go through the motions of copying the data but not actually do anything, which is very important when you start, so you know what’s going on. The -a option is the archive option and is equivalent to -rlptgoD. Here’s a quick run-through of the rsync options:

-n, --dry-run
    perform a trial run that doesn't make any changes
-v, --verbose
    increases the amount of information you are given during the transfer.
-r, --recursive
    copy directories recursively.
-l, --links
    recreate the symlink on the destination.
-p, --perms
    set the destination permissions to be the same as the source.
-t, --times
    set the destination modification times to be the same as the source.
-g, --group
    set the group of the destination file to be the same as the source.
-o, --owner
    set the owner of the destination file to be the same as the source.
-D
    transfer character, block device files, named sockets and fifos.
--delete
    delete extraneous files from dest dirs
--force
    force deletion of dirs even if not empty

For a complete list see the rsync web page.

Warning: Be careful when you are transferring data that you don’t accidentally delete or overwrite anything.

Once you are happy that the rsync is doing what you expect, you can drop the --dry-run and wait for the transfer to complete.
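
In other words, the same command as before without the --dry-run (I also drop one of the -v flags for quieter output):

rsync -va --delete --force /data/AUTOSYNC/ /media/disk/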

The next step might be to ship the disk off to the remote location and then set up the rsync over ssh. However, I prefer to have an additional testing step where I rsync over ssh to a PC in the home. This allows me to work out all the rsync-over-ssh issues before the disk is shipped. The steps are identical, so you can repeat this step once the disk has been shipped and installed at the remote end.

OpenBSD and OpenSSH mascot Puffy

OpenSSH

On your NAS server you will need to generate a new ssh public and private key pair that has no password associated with it. The reason for this is that you want the synchronization to occur automatically, so you will need to be able to access the remote system securely without having to enter a password. There are security concerns with this approach, so again proceed with caution. You may wish to create a separate user for this, but I’ll leave that up to you. Now you can add the public key to the remote user’s .ssh/authorized_keys file. Jeremy Mates’ site has more information on this.
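
A minimal sketch of those two steps, assuming the key file name used in the commands further down:

# generate a passphrase-less key pair for the sync user
ssh-keygen -t rsa -f /home/user/.ssh/rsync-key -N ""
# copy the public half into the remote user's authorized_keys
ssh-copy-id -i /home/user/.ssh/rsync-key.pub user@example.com
# or append it by hand:
# cat /home/user/.ssh/rsync-key.pub | ssh user@example.com 'cat >> ~/.ssh/authorized_keys'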

To confirm the keys are working, you can try to open an ssh session using the key you just set up.

ssh -i /home/user/.ssh/rsync-key user@example.com

You may need to type yes to add the keys to the .ssh/known_hosts file, so it makes sense to run that command as the user that will be doing the rsyncing. All going well you should now be logged into the other system.

Once you are happy that secure shell is working all you now need to do is add the option to tell rsync to use secure shell as the transport.

rsync -va --delete --force -e "ssh -i /home/user/.ssh/rsync-key" /data/AUTOSYNC/ user@example.com:AUTOSYNC/

All going well there should be no updates, but you may want to try adding, deleting and modifying files on both ends to make sure the process is working correctly. When you are happy, you can ship the disk to the other side. The only requirement on the other network is that ssh is allowed through the firewall to your server and that you know the public IP address of the remote network. For those poor people without a fixed IP address, most systems provide a means to register a dynamic DNS entry. Once you can ssh to your server you should also be able to rsync to it like we did before.

Of course the whole point is that the synchronization should be seamless, so you want your rsync to be running constantly. The easiest way to do this is to start a screen session and run the command above in a simple loop. This has the advantage of letting you get going quickly, but it is not very resistant to reboots, so I created a simple bash script to do the synchronization.

user@pc:~$ cat /usr/local/bin/autosync
#!/bin/bash
while true
do
  date
  rsync -va --delete --force -e "ssh -i /home/user/.ssh/rsync-key" /data/AUTOSYNC/ user@example.com:AUTOSYNC/
  date
  sleep 3600
done
user@pc:~$ chmod +x /usr/local/bin/autosync

We wrap the rsync command in an infinite while loop that outputs a time stamp before and after each run. I then pause the script for an hour after each run so that I’m not swamping either side. After making the file executable you can add it to the crontab of the user doing the rsync. See my episode on Cron for how to do that. This is a listing of the crontab file that I use.

user@pc:~$ crontab -l
MAILTO=""
0 1 * * * timeout 54000 /usr/local/bin/autosync > /tmp/autosync.log  2>&1

There are a few additions to what you might expect here. Were I to run the script directly from cron, it would spawn a new copy of the autosync script at one o’clock every morning. The script itself never terminates, so over time there would be many copies of it running simultaneously. That isn’t an issue here, because cron actually calls the timeout command, which in turn calls the autosync script and kills it after 54000 seconds (15 hours), so each day’s run is stopped around four in the afternoon. The reason for this is that my brother doesn’t want me rsyncing in the evening when he is usually online. I could have throttled the amount of bandwidth I used as well, but he said not to bother.

--bwlimit=KBPS
    This option allows you to specify a maximum transfer rate in kilobytes per second.

As the timeout command runs in its own process, its output is not redirected to the logfile. To stop the cron owner’s email account getting a mail every time the timeout occurs, I added a blank MAILTO="" line at the start of the crontab file. Thanks to UnixCraft for that tip.

Well that’s it. Once anyone on your network saves a file it will be stored on their local NAS and over time it will be automatically replicated to the remote network. There’s nothing stopping you replicating to other sites as well.

An image from screencasters.heathenx.org episode 94

screencasters.heathenx.org

This month’s recommended podcast is Screencasters at heathenx.org.
From the about page:

The goal of Screencasters.heathenx.org is to provide a means, through a simple website, of allowing new users in the Inkscape community to watch some basic and intermediate tutorials by the authors of this website.

heathenx and Richard Querin have produced a series of shows that put a lot of ‘professional tutorials’ to shame. Their instructions are clear and simple and have given me a good grounding in a complex and powerful graphics program, despite the fact that I have not yet even installed Inkscape. They even have mini tutorials on how to make your way around the interface and menus.

After watching the entire series I find myself looking at posters and advertisements knowing how that effect could be achieved in Inkscape. If you are interested in graphics you owe it to yourself to check out the series. If you know someone using Photoshop, then burn these onto DVD and install Inkscape for them. Even if you don’t have a creative bone in your body, this series would allow you to bluff your way through graphic design.

Excellent work.

