#!/usr/bin/env python # coding: utf-8 # # "Andrew At Home": separating the storage and application of data # Andrew Chen # # chenjandrew@gmail.com # ### Abstract # In this notebook we backup computer. Backup computer has two components: organization and (redundant) storage. This notebook demonstrates a method of backup computer that simplifies the file organization step, which makes the storage step trivial, using standard filesystem or operating system file copy tools. # ## Introduction # As mentioned in the abstract, simple backup computer ussually involves file organization and storage. Specifically, the user browses the filesystem, finds files that are of significance, and copies them to a storage media such as external USB hard drive or cloud storage. # This method is very error-prone and time intensive, the computational complexity of finding the relevant files is $O(n)$ on disk, the slowest storage on mondern computers, which is furthermore performed at a human's pace, making the process even slower. # There are alternatives, of course, for such an inefficient process. A user can decide to copy the entire contents of the disk. However, this process can only maintain the redundancy of a single machine-- backup and restore will only work if the entire contents of the backup are from the same machine. A user with an OSX and an Ubuntu machine will not be able to restore from the same medium, because the OSX disk data will be very different from the Ubuntu disk data. Backup of data in this format therefore scales according to $O(n)$, meaning that the data required to backup n machines is proportional to the number of machines requiring backup. Clearly, this is not optimal. # The other alternative is for the backup software to maintain a list of directories that require backup. This works and is computationally and storage efficient, but the maintinence of this list is troublesome. # One such problem with a database for the relevant files is that this database requires syncing between the various machines the owner uses. For this reason, most backup software stores settings to the cloud in order to automatically update the lists. However, any sensible user with any concern for privacy will find issue with this policy. Not only is the backup service at the whim of a company service, which may shut down without notice, but also the lists of every file, and possibly the contents of sensitive files, are being exchanged on the internet. # This work continues the theme of studying the file organization of backup computer. File organization is an important but not-popular aspect of backup computer-- the wealth of energy is exerted on storage media. # The goal of this work is to provide backup computer that works. This work is inspired by the unix philosophy: do one thing and do it well. In the rest of the notebook we detail the theory of operation and a reference implementation. # # Background # As discussed in the Introduction, most backup systems use a file sychronization database to discriminate files which need to be moved to the external storage media. # The need for a file synchronization database stored in `$HOME` or on the cloud also lacks elegance. After all, the backup media contain a list of all the files that require backup... that's stored on a ubiquitous piece of software called a filesystem. In the simplest scheme, the backup folder is *the* folder, that is exchanged freely between machines. # The key observation is that the filesystem is a database that lists the files that requires backup-- to explicitly describe these files feels redundant. In order to do this, a single directory can contain all the files that are synchronized between machines-- in this work we refer to it as "Andrew", because it is the name I use for my folder. # However, simply exchanging "Andrew" and putting it in the users's \$HOME doesn't work. (Andrew at Home, if you will). Lots of sofware requires files to be placed in specific locations-- for example, unix "dotfiles", which are strictly required to be placed in \$HOME, not \$HOME/Andrew. To remedy this we can use symbolic links, which link from "Andrew" to any location required. This is a classic technique for managing unix dotfiles, which has not been used in this style in the literature. # The primary contribution of this work is simple: an ensemble of file system symbolic links maintains the mapping from the "Andrew" directory to the local machine. # ## Theory of Operation # Andrew At Home simply proposes the use of file system links to divorce the **storage** of data from the **application** of data. Separation of these has great benefits, in particular making maintaining synchronous backup easy with standard filesystem tools. # # The **storage** of data is simple: all data is stored in a folder called "Andrew", which is stored in exactly the same way on all storage media. In the cartoon above, this is illustrated by the dark grey "Andrew"'s on all the storage media. # The **application** of data organizes it in a way that the local machine can use it. This is a per-machine setting. In the cartoon above, this is illustrated by the symbolic links from "Andrew" to the local disk. # # Reference Implementation # As detailed in the background, the separation of **storage** and **application** of data is key. # Our data **application** step simply uses the unix command "ln" to make symbolic links. # Our data **storage** step is maintained using "rsync". # ### Data Application: Making Links # Using the `ln` command we can take our "Andrew" directory and link normal filesystem locations to it: # In[107]: get_ipython().run_cell_magic('bash', '', 'cd\nls Andrew\nls Andrew/Media\nln -s Andrew/Applications\nln -s Andrew/Documents\nln -s Andrew/Projects\nln -s Andrew/Media/Music\nln -s Andrew/Media/Pictures\n') # Now, the files are listed in my $HOME directory as my local applications expect them to be, but exist in the "Andrew" directory to simplify backup computer. # ### Data Storage: rsync # Because the filesystem is now maintaining the list of files to sychronize, it is very easy to quickly save all files. # To make the backups easy to follow, they are launched from this notebook in external gnome-terminals. I have scripted the machinery to do this automatically below: # In[72]: import subprocess def run_command_in_terminal(command): return subprocess.Popen( 'gnome-terminal -x sh -c \ \'echo INPUT COMMAND={0};{0}; \ exec bash \''.format(command), shell=True) # In[74]: def rsync_1_direction(source, destination): run_command_in_terminal( 'rsync -avhWP --no-compress --stats \ {0} {1}'.format(source, destination)) # #### local disk --> SP2 (disk 1) # In[93]: rsync_1_direction('/home/ajc/Andrew/', '/media/ajc/SP2/Andrew/') # #### local disk --> Seagate3TB (disk 2) # In[92]: rsync_1_direction('/home/ajc/Andrew/', '/media/ajc/Seagate3TB//Andrew/') # # Conclusion # In this work we present backup computer which does not require external storage of file metadata. The mapping of backup to local disk files is done on the filesystem level, which enables users to use standard filesystem tools to perform backups.