S02E01 Archive this!

So you need to archive data for years (if not decades), and you really don’t trust a service provider to both stay in business, and play fair, for that long a time frame.

Here’s what you can do.

A few caveats, aka: »management of expectations«

Archives contain valuable data, and long-term archival of specific data may even be mandated by law, for some. Always test new archival procedures on dummy data, using dummy accounts. Never commit to a changed archival routine before you have tested it, in-depth. Also, before a change, back up and archive your data according to your existing routines. You have been warned.
Solopreneurs and freelancers working from their home offices are high-profile users. Usually, there isn’t much hierarchical layering in their access control systems, besides lifting all restrictions for themselves, and imposing only a few on an assistant or helper who is permitted to access »almost« everything.
Many tasks may have been outsourced, but legally required safekeeping periods ranging from several years to up to a whole decade (for some data items) raise hard questions in terms of trust, costs, and vendor changes. The thought of service fees piling up, and of inescapable vendor lock-ins may lead to archiving being neglected, or ignored; or to a makeshift files-and-directories mess that only superficially qualifies as a controlled archive.

This episode deals with keeping a files-in-directories based archive under one’s own control. We consider the trade-offs of an archive based on a plain files-in-directories storage to be acceptable, and hence such an archive as a reasonable choice.

Archives aren’t backups

While it is common to mistake one for the other, they are quite different things:

  • The purpose of a backup is being able to restore a system to a previous state that is as close as possible to the state before a disaster. A backup is tightly coupled to the hardware and software that you are using right now (but maybe won’t use anymore, next year).
  • The purpose of an archive is to preserve data in an immutable format that is easy to index, search and view, for a long time to come. An archive is as independent as possible of specific hardware or software.

You should always back up your archives, but never archive your backups.

If you’re rather concerned with backups right now, read our episode S01E03 Secure backups, and a time-machine for your home office, instead.

Requirements for archives

Indexing and search

The archive formats chosen must allow for indexing, since fast, complex searches within archives are not just time-savers, but quite often also audit requirements laid down in e.g. tax regulations.

Immutability

Your local jurisdiction may require some of your data to be archived in an immutable format, to prevent both accidental modifications and tampering. In some contexts, deletions are entirely forbidden, or must be painstakingly recorded.

Legal requirements likely include the long-term, reliable reproduction of the visual appearance of documents, especially when deletion of physical copies is permitted.

Last, but not least: changes to the organisation and structure of the archive itself may require extensive documentation to guarantee the original structure can be restored, upon demand.

Time stamps

Time stamping provides evidence that a specific content already existed at the given time, and hasn’t been altered, since.

Access control

While live data and its backups may rely on similar access control structures, access to an archive must be controlled by a system that can cope with years, if not decades of constant change.

Also, granting specific forms of access to law enforcement agencies and auditors most likely differs from regular live data access control.

Lifecycle management

Depending on your legislation, some of your data may need to be archived for decades. In general, the longest time frames are most likely imposed on you by tax regulations affecting business transactions.

The opposite may be true, as well: privacy-related regulations (like the GDPR) may require that you are able to remove specific records of personally identifiable data from your storage devices within a month’s time, unless they are part of business transactions that you are obliged to retain. Job applications are another example of data that may have to be removed.

Ask yourself whether such data really needs to be archived, as opposed to just being backed up. Try to have that data automatically taper off, by ensuring your backup window does not extend beyond the respective permitted time frames.

Data formats and viewing

Choosing an appropriate archival format that can be viewed and queried for years to come may be a challenge for you.

There may be unexpected twists, e.g. your local jurisdiction may only require that an email attachment gets archived, but exempt the email it was attached to, in case the email is a pure envelope that contains no part of the data.

Boot into Linux

Boot into a fairly recent Linux distribution. For this episode, we’ve used Manjaro 21.2.0 KDE, and cross-checked our results on Ubuntu 20.04 LTS.

Open a terminal and you’re ready to go.

Tools used, and their packages

  • btrfs: package btrfs-progs
  • btrbk: we suggest you always install the most recent version, as described in its README
  • offlineimap: package offlineimap
  • okular: package okular
  • openssl: package openssl
  • siegfried: we suggest following the installation instructions on its homepage
  • unoconv: package unoconv (please note you also need OpenJDK >= 11)
  • veraPDF: from their website, or via AUR
  • wget: package wget

Use archive formats, use viewers

Backups of live data usually don’t make the best format for archival. A recent version of a program may not be able to fully render or search data that was archived a few years ago. Even worse, you may have switched to different software multiple times within a decade, accumulating archival data in a multitude of software-specific formats.

When archiving, convert your data to proven archival formats that allow for indexing, complex searches, and quick viewing.

Viewers aren’t editors. Since archives are immutable by nature, data items must not be changed after they’ve been archived. Not even by in-file tagging or annotating. Some raw data types are »instant archive items« (e.g., photos).

As a rule of thumb: access archive content via viewers, not via editors, whenever possible. Even if write access is prohibited already on the file level to prevent you from harming your archive, you’ll stay more aware of the respective content when you use different tool sets.

Plain text

If you intend to enrich plain text files with some kind of formatting, the Markdown format is the way to go. Whether your plain text files contain markup or not, Okular is a fast and capable viewer for them.

Text with layout

The most widely used archive format that preserves text layout and guarantees visually reproducible results is PDF/A. Please see our S01E02 »Can I have this as a PDF?« for how to create it, from various source formats.

Okular is a feature-rich PDF viewer. A faster, light-weight alternative available in most Linux distributions is MuPDF.

Spreadsheets

To document calculated values, and to make the results immutable, it is recommended to convert spreadsheets to PDF/A format, too.

See our episode S01E02 »Can I have this as a PDF?« for how to script this using unoconv.

Calendars and address books

iCalendar is a widely used format for calendars one can subscribe to. It is also the most generic and most suitable format for archiving calendars.

vCard is widely used for exchanging both the contact information of a single individual and whole address books. Since many vCard implementations have interoperability problems, it is very important to verify that all relevant data survives an export & import cycle.

Calendars accessible via a URL can simply be downloaded via wget. As an example, Google’s moon phases calendar for Berlin:

wget https://calendar.google.com/calendar/ical/ht3jlfaac5lfd6263ulfh4tql8%40group.calendar.google.com/public/basic.ics -O moonphases.ics
Result: ./moonphases.ics
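
A download can be cut short, so it may be worth checking the file before it goes into the archive. The following is a minimal sketch, assuming the usual iCalendar framing (a BEGIN:VCALENDAR line at the start, an END:VCALENDAR line at the end). The function name check_ics is our own invention, and the check covers the framing only, not full standard conformance.

```shell
# check_ics FILE: succeed if FILE looks like a complete iCalendar file,
# and print the number of events (VEVENT blocks) it contains.
# Hypothetical helper: it checks the framing only.
check_ics() {
    ics_file=$1
    [ -s "$ics_file" ] || { echo "empty or missing: $ics_file" >&2; return 1; }
    # iCalendar lines end in CRLF, so strip carriage returns before comparing.
    first=$(tr -d '\r' < "$ics_file" | sed -n '1p')
    last=$(tr -d '\r' < "$ics_file" | grep -v '^$' | tail -n 1)
    [ "$first" = "BEGIN:VCALENDAR" ] || { echo "no calendar header: $ics_file" >&2; return 1; }
    [ "$last" = "END:VCALENDAR" ] || { echo "truncated calendar: $ics_file" >&2; return 1; }
    # grep -c exits non-zero when it counts 0 events; that is not an error here.
    tr -d '\r' < "$ics_file" | grep -c '^BEGIN:VEVENT$' || true
}
```

Calling check_ics moonphases.ics after the download prints the event count, or fails with a message if the file is incomplete.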

For Nextcloud, we recommend calcardbackup, a script that can extract both iCalendar calendars and vCard address books.

Email

The Maildir format prescribes one text file per email, and is supported by a wide range of mail clients. For safely viewing archived email, we recommend mutt, as the canonical viewer.

OfflineIMAP allows you to pull mail folders from IMAP servers into a local Maildir-formatted archive directory. Here’s an example of how to use it for archival.

OfflineIMAP needs a configuration file. In our example, we’ll pull all folders except those whose names start with Trash, Junk, Spam, Drafts or Notes, for the account mymail, from a STARTTLS-secured IMAP server to a local Maildir directory ~/mymail/.

We’ll prevent any propagation of email deletions from the server to the local directory (sync_deletes = no). Also, local email file modification times are made to match the Date field of the respective email contained within (utime_from_header = yes).

Our sample configuration file ~/offlineimap.conf looks like this (please adapt IMAP credentials etc. to your testing environment):

[DEFAULT]
sslcacertfile = OS-DEFAULT
folderfilter = lambda foldername: not foldername.startswith(('Trash', 'Junk', 'Spam', 'Drafts', 'Notes'))

[general]
metadata = ~/.offlineimap
accounts = mymail

[Account mymail]
localrepository = Local.mymail
remoterepository = Remote.mymail

[Repository Local.mymail]
type = Maildir
localfolders = ~/mymail/
utime_from_header = yes

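[Repository Remote.mymail]
type = IMAP
remotehost = imap.example.com
# Connect on the plain IMAP port, then upgrade the connection via STARTTLS
remoteport = 143
starttls = yes
ssl = no
remoteuser = mymail@example.com
remotepass = mysecretpassword
# Do not propagate deletions on the server to the local archive
sync_deletes = no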

Now, run:

offlineimap -c ~/offlineimap.conf
Result: ~/mymail/ contains all IMAP folders pulled from the server.

Organizing a file-based archive

Access control

Do not hand over your password to assistants, or to persons performing audits. Instead, learn how to handle Access Control Lists (ACLs).

Start simple, e.g. by creating groups like private, assistants, audits. Do not give any group write access to archive directories. Restrict that to a single archive admin user who is not even a member of your ACL scheme, and after the respective retention period has ended, purge data from your archives only under that user account.

Archive files with tight access permissions in directory trees that are separated from the rest.

The more exemptions you need to make from that rule of thumb, the more likely you are to accidentally expose data in a directory branch that should have been kept private. Also, a piebald access rights assignment makes it harder to turn over isolated directories to entitled parties, should you ever have to.
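
As a starting point, read-only access for a group could be granted roughly like this. The sketch rests on several assumptions: the group (e.g. audits) already exists, the file system supports ACLs, and setfacl (usually from the acl package) is installed. The helper name grant_read_only, the DRY_RUN switch and the path in the usage note are our own inventions.

```shell
# grant_read_only GROUP DIR: give GROUP read-only access to DIR and everything below.
# With DRY_RUN=1 the commands are only printed, which helps when testing on dummy data.
grant_read_only() {
    acl_group=$1
    acl_dir=$2
    # rX: read access, plus "execute" on directories (needed to enter them)
    # and on files that are executable already.
    acl_entry="g:${acl_group}:rX"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "setfacl -R -m $acl_entry $acl_dir"
        echo "setfacl -R -d -m $acl_entry $acl_dir"
    else
        # First the existing files and directories ...
        setfacl -R -m "$acl_entry" "$acl_dir" &&
        # ... then the default ACL, inherited by items added later.
        setfacl -R -d -m "$acl_entry" "$acl_dir"
    fi
}
```

DRY_RUN=1 grant_read_only audits /srv/archive/taxes prints the two commands without changing anything. No write permission is granted to any group, so writing stays with the archive admin user.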

Separate archive metadata and data

Keep archival metadata like tags, annotations, timestamps etc. separate from archived data, even if embedding it within the data files looks much more convenient.
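
One way to do this is a second directory tree that mirrors the archive and holds one »sidecar« file per archived file. The following sketch assumes such a layout. The function names, the .meta suffix and the »tag:« line format are hypothetical choices, not an established convention.

```shell
# sidecar_path ARCHIVE_ROOT META_ROOT FILE
# Print the path of the metadata file that belongs to FILE (located below ARCHIVE_ROOT).
sidecar_path() {
    # Strip the archive root, keep the relative path.
    rel=${3#"$1"/}
    printf '%s/%s.meta\n' "$2" "$rel"
}

# add_tag ARCHIVE_ROOT META_ROOT FILE TAG
# Append TAG to the sidecar file. The archived file itself is never touched.
add_tag() {
    meta_file=$(sidecar_path "$1" "$2" "$3")
    mkdir -p "$(dirname "$meta_file")"
    printf 'tag: %s\n' "$4" >> "$meta_file"
}
```

With this layout, the archive tree can stay read-only, while the metadata tree remains writable for whoever maintains tags and annotations.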

The archiving process

Depending on your needs, archiving a file or directory may require multiple steps:

  • File verification
  • Time stamping
  • Adding error detection & correction information
  • Content indexing

It is possible to detect new arrivals in archive folders, and have that trigger the archiving process (see below for an automation example).

Detecting changes in archive directories

While it is technically feasible to monitor, in real time, a large number of directories for changes, this creates a substantial load for the respective computer.

A better alternative is to use a file system with inherent, fast snapshotting capabilities, like btrfs. Before taking a new snapshot, compare the current state of the file system to the most recent snapshot, to detect updates.

See the script archive.sh for an example of how to do that, based on the btrbk backup tool.
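
The idea of comparing the current state against the most recent snapshot can also be illustrated without btrfs. The sketch below is a tool-independent stand-in, not what archive.sh or btrbk do: it treats snapshot and current state as two plain directories and compares checksum manifests. On btrfs, comparing snapshots directly is much faster than hashing every file. The function names are our own, and file names containing newlines are not handled.

```shell
# manifest DIR: print "checksum  ./relative/path" for every file below DIR, sorted by path.
manifest() {
    ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 )
}

# changed_since SNAPSHOT_DIR CURRENT_DIR
# Print the manifest lines that differ between the two states.
# "<" marks the snapshot side, ">" marks the current side.
changed_since() {
    old_list=$(mktemp) && new_list=$(mktemp) || return 1
    manifest "$1" > "$old_list"
    manifest "$2" > "$new_list"
    # Both diff and grep exit non-zero in harmless cases (differences found,
    # or nothing to print), so their exit status is ignored.
    diff "$old_list" "$new_list" | grep '^[<>]' || true
    rm -f "$old_list" "$new_list"
}
```

An empty result means nothing has changed. A new file shows up as a single »>« line, a modified file as a »<« and »>« pair with the same path.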

Archiving process options

File verification

For verifying conformance of PDF files to the PDF/A standards, use veraPDF.

Many file formats that are widely used lack a tool that can verify strict standard conformance. It is however possible to check for some essential markers within a file. We have described here how to use the tool siegfried for that.
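
In its simplest form, checking for essential markers means looking at the first bytes of a file. The sketch below is a hand-rolled illustration of that idea and knows just three formats. It is not a substitute for siegfried, which identifies formats far more thoroughly. The function name guess_format is our own.

```shell
# guess_format FILE: print a rough format guess, based on the first bytes of FILE.
guess_format() {
    # Read the first 8 bytes as a hex string, e.g. "255044462d312e37" for "%PDF-1.7".
    magic=$(head -c 8 "$1" | od -An -tx1 | tr -d ' \n')
    case $magic in
        255044462d*)       echo pdf ;;      # "%PDF-"
        89504e470d0a1a0a)  echo png ;;      # PNG signature
        ffd8ff*)           echo jpeg ;;     # JPEG start-of-image marker
        *)                 echo unknown ;;
    esac
}
```

A typical use is to flag files whose extension and content disagree, for example a file named invoice.pdf for which guess_format does not print pdf.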

Indexing and search

Try to limit the number of indexers and related databases, so you can focus on a single frontend, with a single query language.

Recoll is a GUI application that indexes most essential archive file formats:

  • Office documents
  • Plaintext
  • Images
  • iCalendar

As of today, vCard files (address books) are unsupported.

It supports boolean searches, phrases, proximity, wildcards, filters on file types and on the directory tree, as well as filtering time spans.

Multiple indexes with different settings can be created using multiple configuration directories. Updating indexes, or rebuilding them from scratch, can be automated via the GUI, or scripted.
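
A scripted index update might look like the following sketch. It assumes Recoll’s command line indexer recollindex is installed, and that each index has its own configuration directory. The function name reindex, the DRY_RUN switch and the directory in the usage note are hypothetical.

```shell
# reindex CONFDIR [full]: update the Recoll index that belongs to CONFDIR.
# With "full" as second argument, the index is erased and rebuilt from scratch.
# With DRY_RUN=1 the command is only printed.
reindex() {
    if [ "${2:-}" = full ]; then
        # -z resets the index before indexing.
        set -- recollindex -c "$1" -z
    else
        set -- recollindex -c "$1"
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, reindex ~/.recoll-archive updates one index incrementally, while reindex ~/.recoll-archive full rebuilds it.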

Time stamping

According to RFC 3161

Time Stamping Authorities (TSAs) process signing requests that include hash values of the respective data. Besides the hash values, their response contains a digitally signed time stamp, as proof that a certain content already existed at the given point in time.

While it is easy to host a time stamping service (e.g., company-internal; or freeTSA.org), acquiring and maintaining the status of a qualified Trust Service Provider (TSP) requires considerable investments, not only financially, but also in process (e.g., security; mandatory periodical audits). Hence, e.g. the EU’s trusted lists of qualified trust service providers include mostly commercial providers, besides government agencies and academic associations.

From a user perspective, you can either pay for the timestamping service of a qualified TSP, or have your data time-stamped by multiple free services, including free services run by qualified TSPs.

See timestamp.sh as an example for a solution involving multiple, free (as in: costs) TSAs.
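
For orientation, a single request and response cycle with one TSA can be sketched with openssl and curl. This is our own minimal illustration, not the content of timestamp.sh. It assumes that curl is installed (it is not in the tool list above), that the TSA accepts requests via HTTP POST, and that you have downloaded the TSA’s CA certificate. The function names and file suffixes are hypothetical.

```shell
# make_ts_query FILE: write FILE.tsq, a time stamp request
# that contains the SHA-256 hash of FILE (not the file itself).
make_ts_query() {
    openssl ts -query -data "$1" -sha256 -cert -out "$1.tsq"
}

# request_timestamp FILE TSA_URL: send FILE.tsq to the TSA,
# and store its signed response as FILE.tsr.
request_timestamp() {
    curl -sS -H 'Content-Type: application/timestamp-query' \
         --data-binary "@$1.tsq" -o "$1.tsr" "$2"
}

# verify_timestamp FILE CA_FILE: check the response FILE.tsr against FILE,
# using the CA certificate of the TSA.
verify_timestamp() {
    openssl ts -verify -data "$1" -in "$1.tsr" -CAfile "$2"
}
```

When using several TSAs, one would keep one response file per TSA instead of overwriting FILE.tsr. Only the hash leaves your machine, so the archived content stays private.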

Via blockchains

There are projects implementing time stamping by means of a blockchain. While this doesn’t require qualified TSPs, it suffers heavily from environmental and/or ethical issues caused by the consensus mechanisms that are in use for the respective blockchain (Proof of Work, or Proof of Stake).

As of today, for that reason we don’t recommend blockchain-based timestamping.

Error detection and correction

There are several ways to monitor your archives for bit rot, and also for tampering. Not all of them auto-update, triggered by new, changed or deleted files or directories.

Solution | Prevents bit rot | Prevents tampering | Auto-update on changes
RAID, either software or hardware
Checksumming at file-system level (e.g., via btrfs)
Parity files (e.g., via par2)
Parity drives (e.g., via SnapRAID)

The longer the required retention period, the more you need to stay independent of changes in storage technology, and of specific hardware. For data with an extremely long retention period, we recommend parity files. par2 is a great option, here.
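
A minimal sketch of the three basic par2 operations, assuming the par2 command line tool is installed. The 10 % redundancy is an arbitrary example value, and the function names are our own.

```shell
# par2_protect FILE: create parity files next to FILE, with 10 % redundancy.
par2_protect() {
    par2 create -r10 "$1.par2" "$1"
}

# par2_check FILE: verify FILE against its parity data.
# A non-zero exit status indicates damage or missing data.
par2_check() {
    par2 verify "$1.par2"
}

# par2_fix FILE: attempt to repair FILE from its parity data.
par2_fix() {
    par2 repair "$1.par2"
}
```

Higher redundancy allows recovery from more damage, at the cost of more storage. Try the repair on a deliberately damaged dummy file first, to see how much damage your chosen value can absorb.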

Signing

Given the potential legal implications of an irrefutable digital signature, we strongly recommend making signing a manual task, and checking twice what you are signing.

Pruning

While it is possible to automate the removal of archives that are no longer needed, we strongly recommend making pruning a manual task, executed at the end of the retention period (or even restricted to a cleanup at the start of a new year).
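
What can be scripted safely is the preparation: listing candidates, and leaving the decision to a human. The sketch below assumes that file modification times reflect the age of the archived items, which you need to verify for your own archive. The function name prune_candidates is hypothetical.

```shell
# prune_candidates DIR YEARS: list files below DIR whose modification time
# lies more than YEARS years in the past. Nothing is deleted.
prune_candidates() {
    # Count every year as 366 days, so the list errs on the side of keeping data.
    days=$(( $2 * 366 ))
    find "$1" -type f -mtime +"$days" -print | sort
}
```

Review the list by hand before deleting anything. Retention periods often start at a date other than the file date, for example at the end of a business year, depending on your jurisdiction.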

Image Credits:
Svalbard Global Seed Vault | by Frode Ramone | Licensed under Creative Commons Attribution 2.0 (CC BY 2.0)

Licensing:
This content is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
For your attributions to us please use the word »tuxwise«, and the link https://tuxwise.net.