Silver Bullets

Published 2021-08-07 in security
backup linux security

Backing up to the nines

I want to write about synchronising distinct backup media. Using different media is part of a backup strategy, and consistency must be maintained between them. All the major cloud providers do this. This post will not consider facilities in different counties, countries, or zones, but rather different media that may be left around the house, or in a locker, for example.

That's not to say there is anything wrong with cloud providers' redundant backup solutions. Their prices are competitive, but I prefer to spend on CapEx (capital expense) rather than OpEx (operational expense). We all must start somewhere, and for me a bird in the hand is worth two in the bush.

I'd hope we all have backup strategies for our data. I spend far too long worrying along the lines of, "I think I've got everything backed up and safe, so why change something and invalidate that DVD I spent a weekend mastering in 2016?"

Backup strategies

The "different mediums" aspect is important. As well as duplicating the duplicate. Cloud certification places great emphasis on this. While cloud providers have a product to sell they don't have a monopoly.

This article isn't a treatise on strategies. All cloud providers want your traffic, and backing up to diverse locations is how they market their availability and durability as a count of nines: 99.999% is referred to as "5 nines", for example.

Like flash drives, DVD-Rs go bad after a while. The decline of optical drives prompted my change in strategy: use flash storage, and back up more often. I'm compliant with the former at least.

I'm also taking a leaf out of most cloud providers' playbooks and using Linux now.

Synchronisation strategy

My very first plan was to dump everything off all my computers onto a couple of disks.

This is not agile. I have way too many computers and too little time. Still, considering how ingrained spinning disks are in computing history, it is better than nothing.

The immediate drawback is that nothing much should change once the backup is done. I need a working copy that can be shared between computers. Better still if that is the only copy other than the offline ones.

The strategy must be to periodically synchronise what is on the working copy with the cold storage.

Finding the latest modification to the cold storage

The following command sorts the files on the cold storage by modification time and shows the most recent dates:

find $1 -type f -print0 | xargs -0 stat --format '%Y :%y %n' | sort -nr | cut -d: -f2- | head

FAT32 doesn't update a file's access time when it is read, so replacing %Y :%y with %X :%x will show when each file was last copied to the stick. Redirecting to a file rather than piping to head is another option.
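For reference, the access-time variant is only a change to the stat format string; everything else, including the $1 path placeholder, stays the same:

# last access (%X/%x) instead of last modification (%Y/%y)
find $1 -type f -print0 | xargs -0 stat --format '%X :%x %n' | sort -nr | cut -d: -f2- | head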

We have a problem, though: this only considers the most recently modified files. If the modification dates span a wide range, we need to be sure the cold storage copy gets updated with every modified file, not just the newest ones.

The solution is to keep the working/mutating copies in a single place, contrary to the tenet of diversity in backup locality. That tenet is for the backups, not for the working copy we want to contain. The frequency of backups is key to any strategy; backup management is an overhead, so a balance between security and simplicity must be struck. Cloud providers, for example, tend to take daily backups for nearly all of their individual services.

Another solution is to automate. Before writing this post, I was working on a cross-platform backup solution in Python. I discovered so much functionality in the Linux shell that I began to wonder how much of my script re-invented the wheel. We can't all use Linux.

Ok, cool, I've found the dates when I originally mastered my cold storage: the 17th and 18th of March, 2021. I'd be happy with that precision, but it is a reminder that I need to allow for 2 days. No wonder buyers place a premium on dual-disk NAS hardware.

Finding the files that have been updated

In the previous section I wrote about keeping the new, dirty data in one place. A single source of truth.

I have 2 ways of mounting the Samba share. The usual way is to navigate to it using Nautilus, which creates a path like this:

/run/user/1000/gvfs/smb-share:server=192.168.66.10,share=usb_share_name/path_of_interest

Treating this path as if it were local would, more often than not, cause connection instability and make my tests fail.

To address this, I learned how to mount the Samba share manually. If I permit anonymous access, it is simply:

sudo mkdir /media/testsmb
sudo mount -t cifs -o username=Anonymous //192.168.66.10/usb_share_name /media/testsmb/
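Before running anything heavy over the network, a quick sanity check (my own habit, not a step from the original write-up) confirms the share is mounted and readable:

# list CIFS mounts, then peek inside the mount point
mount -t cifs
ls /media/testsmb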

The following command ran quickly on a locally inserted USB disk. It takes longer over Samba, so capturing the entire output for reference is prudent:

find $1 -type f -print0 | xargs -0 stat --format '%Y :%y %n' | sort -nr | cut -d: -f2- > ~/smb_share_mtimes.txt
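With the output captured in a file, a couple of one-liners answer the obvious questions: how many files were listed, and how many have been modified since the cold storage was mastered. The cut-off date below is my assumption, based on the March 2021 mastering dates found earlier:

# total number of files in the listing
wc -l < ~/smb_share_mtimes.txt
# files modified after the assumed mastering cut-off (the date is the first field on each line)
awk '$1 >= "2021-03-19"' ~/smb_share_mtimes.txt | wc -l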

This reads the file metadata across my whole NAS! 133 files (14%) have changed! Confusingly, the paths have all changed too. I need to ensure the important dirty data is contained. A bit of Python should help organise this into folders; well-known Linux shell solutions exist, but with this Python I'll at least have a list to help with what is still a manual process:

#!/usr/bin/env python3

import argparse
import datetime
import json
import subprocess
from pathlib import Path


def parse_for_date():
    parser = argparse.ArgumentParser(
        description='Print hierarchy of files changed since a given date. '
                    'Must be called from the hierarchy root.')
    parser.add_argument('first_date', type=datetime.date.fromisoformat,
                        help="First modification time date we are interested in.")
    return parser.parse_args().first_date


def main():
    first_date = str(parse_for_date())
    result = subprocess.run(
        # Search from the current directory; the script must be run from the hierarchy root.
        "find . -type f -print0 | xargs -0 stat --format '%Y :%y %n' | sort -nr | cut -d: -f2-",
        shell=True, capture_output=True, text=True
    )
    dir_nest = {}
    # Leverage the fixed-width columns of the stat format: the first 10 characters
    # are the ISO date and the path begins at character 36. Lines are sorted
    # newest-first, so we can stop at the first line older than first_date.
    for line in result.stdout.split("\n"):
        if line[:10] < first_date:
            break
        folders = Path(line[36:]).parts
        path_ptr = dir_nest
        for folder_name in folders[:-1]:
            path_ptr = path_ptr.setdefault(folder_name, {})
        path_ptr[folders[-1]] = line[:10]
    print(json.dumps(dir_nest, indent=2))


if __name__ == "__main__":
    main()
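Saved as changed_since.py (the filename and the cut-off date here are my own choices; the post doesn't specify them) and run from the root of the mounted share, an invocation looks like:

cd /media/testsmb
python3 ~/changed_since.py 2021-03-19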

Calling that yields something like:

{
  "return_RoyalMail_ZTTRACKINGNUMBER.pdf": "2021-07-29",
  "4GWiFi-positioning": {
    "Screenshot_2021-06-23-23-11-02.android.chrome.jpg": "2021-06-23",
    "Screenshot_2021-06-23-23-10-40.android.chrome.jpg": "2021-06-23",
    "Screenshot_2021-06-23-05-57-55.android.chrome.jpg": "2021-06-23"
  },
  "Thumbs.db": "2021-06-24",
  "IMG_20210619_100000.jpg": "2021-06-19",
  "IMG_20110102_090000.jpg": "2021-06-19",
  "IMG_20110102_080000.jpg": "2021-06-19"
}

When I'm done, I can optionally clean up like this:

sudo umount /media/testsmb
sudo rmdir /media/testsmb

Reflections in August 2022

This was an interesting look at mounting network shares locally.

I can replace all the Python comparison code above with a single Linux command that is both intuitive and concise:

diff --brief --recursive . ~/Downloads/kingston\ backup/
Only in ./media/photos: drone
Only in ./media/photos/xperia x: DSC_0402.JPG
Only in ./media/photos/xperia x: DSC_0408.JPG
Only in .: System Volume Information
Only in .: .Trash-1000

This took about 5 minutes checking two versions of 1000 items in 2.3 GB.
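To reproduce that timing, wrapping the same comparison in the shell's time keyword is enough (the backup path is the one from the listing above):

time diff --brief --recursive . ~/Downloads/kingston\ backup/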

That beats maintaining the small Python function presented earlier. Python still offers value as an abstraction layer: the script isn't Windows-ready yet, but now I only have to replace a substantially smaller piece of functionality to make it compatible, rather than focusing on replacing diff all at once.

