So you need to archive data for years (if not decades), and you really don’t trust a service provider to both stay in business, and play fair, for that long a time frame.
Here’s what you can do.
A few caveats, aka: »management of expectations«
This episode deals with keeping a files-in-directories based archive under one’s own control. We consider the trade-offs of an archive based on a plain files-in-directories storage to be acceptable, and hence such an archive as a reasonable choice.
Archives aren’t backups
While it is common to mistake one for the other, they are quite different things:
- The purpose of a backup is being able to restore a system to a previous state that is as close as possible to the state before a disaster. A backup is tightly coupled to the hardware and software that you are using right now (but maybe won’t use anymore, next year).
- The purpose of an archive is to preserve data in an immutable format that is easy to index, search and view, for a long time to come. An archive is as independent as possible of specific hardware or software.
You should always back up your archives, but never archive your backups.
If you’re rather concerned with backups right now, read our episode S01E03 Secure backups, and a time-machine for your home office, instead.
Requirements for archives
Indexing and search
The archive formats chosen must allow for indexing, since fast, complex searches within archives are not just time-savers, but quite often also audit requirements laid down in e.g. tax regulations.
Your local jurisdiction may require some of your data to be archived in an immutable format, to prevent both accidental modifications as well as tampering. In some contexts, deletions are entirely forbidden, or must be painstakingly recorded.
Legal requirements likely include the long-term, reliable reproduction of the visual appearance of documents, especially when deletion of physical copies is permitted.
Last, but not least: changes to the organisation and structure of the archive itself may require extensive documentation to guarantee the original structure can be restored, upon demand.
Time stamping provides evidence that a specific content already existed at the given time, and hasn’t been altered, since.
While live data and its backups may rely on similar access control structures, access to an archive must be controlled by a system that can cope with years, if not decades of constant change.
Also, granting specific forms of access to law enforcement agencies and auditors most likely differs from regular live data access control.
Depending on your legislation, some of your data may need to be archived for decades. In general, the longest time frames are most likely imposed on you by tax regulations affecting business transactions.
The opposite may be true, as well: privacy-related regulations (like the GDPR) may require that you are able to remove specific records of personally identifiable data from your storage devices within a month’s time, unless they are part of business transactions. Job applications are one example of data that may have to be removed this way.
Ask yourself whether such data really needs to be archived, as opposed to just being backed up. Try to have that data automatically taper off, by ensuring your backup window does not extend beyond the respective permitted time frames.
Data formats and viewing
Choosing an appropriate archival format that can be viewed and queried for years to come may be a challenge for you.
There may be unexpected twists, e.g. your local jurisdiction may only require that an email attachment gets archived, but exempt the email it was attached to, in case the email is a pure envelope that contains no part of the data.
Boot into Linux
Boot into a fairly recent Linux distribution. For this episode, we’ve used Manjaro 21.2.0 KDE, and cross-checked our results on Ubuntu 20.04 LTS.
Open a terminal and you’re ready to go.
Tools used, and their packages
- btrbk: We suggest you always install the most recent version, as described in its README
- siegfried: We suggest following the installation instructions on the homepage
- unoconv (please note you also need OpenJDK >= 11)
- veraPDF: from their website, or via
Use archive formats, use viewers
Backups of live data usually don’t make the best format for archival. A recent version of a software may not be able to fully render or search data that was archived a few years ago. Even worse, you may have switched over to a different software, multiple times within a decade, accumulating archival data in a multitude of software-specific formats.
When archiving, convert your data to proven archival formats that allow for indexing, complex searches, and quick viewing.
Viewers aren’t editors. Since archives are immutable by nature, data items must not be changed after they’ve been archived, not even by in-file tagging or annotating. Some raw data types are »instant archive items« (e.g., photos).
As a rule of thumb: access archive content via viewers, not via editors, whenever possible. Even if write access is prohibited already on the file level to prevent you from harming your archive, you’ll stay more aware of the respective content when you use different tool sets.
If you intend to enrich plain text files with some kind of formatting, the Markdown format is the way to go. Whether your plain text files contain markup or not, Okular is a fast and capable viewer for them.
Text with layout
The most widely used archive format that preserves text layout and guarantees visually reproducible results is PDF/A. Please see our S01E02 »Can I have this as a PDF?« for how to create it, from various source formats.
To document calculated values, and to make the results immutable, it is recommended to convert spreadsheets to PDF/A format, too.
See our episode S01E02 »Can I have this as a PDF?« for how to script this using
Calendars and address books
iCalendar is a widely used format for calendars one can subscribe to. It is also the most generic and most suitable format for archiving calendars.
vCard is widely used for exchanging both contact information of a single individual as well as whole address books; since many vCard implementations have interoperability problems, it is very important to verify that all relevant data survives an export & import cycle.
Calendars accessible via a URL can simply be downloaded via wget. As an example, Google’s moon phases calendar for Berlin:
wget https://calendar.google.com/calendar/ical/ht3jlfaac5lfd6263ulfh4tql8%40group.calendar.google.com/public/basic.ics -O moonphases.ics
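Before filing a downloaded calendar or an exported address book, a quick structural check can catch truncated files. The following is a minimal sketch, not a validator: the helper name count_components is our own, and it only compares the number of BEGIN and END lines of a given component type.

```shell
#!/usr/bin/env bash
# Minimal structural check for iCalendar (.ics) and vCard (.vcf) files.
# count_components is a hypothetical helper name, not part of any tool.
# Usage: count_components FILE COMPONENT     (e.g. VEVENT or VCARD)
# Prints the number of components; fails if BEGIN and END counts differ.
count_components() {
    local file=$1 component=$2 begins ends
    [ -r "$file" ] || { echo "cannot read ${file}" >&2; return 2; }
    # grep -c prints 0 but returns 1 when nothing matches, hence "|| true".
    # The patterns have no "$" anchor, because lines usually end with CRLF.
    begins=$(grep -c -i "^BEGIN:${component}" "$file" || true)
    ends=$(grep -c -i "^END:${component}" "$file" || true)
    if [ "$begins" -ne "$ends" ]; then
        echo "unbalanced ${component} in ${file}: ${begins} BEGIN, ${ends} END" >&2
        return 1
    fi
    echo "$begins"
}
```

For example, count_components moonphases.ics VEVENT prints the number of events in the downloaded calendar. For address books, comparing the VCARD count before an export and after a re-import is a cheap first test, although it says nothing about individual fields.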
OfflineIMAP needs a configuration file. In our example, we’ll pull all folders except those whose names start with Trash, Junk, Spam, Drafts or Notes, for the account mymail, from a STARTTLS-secured IMAP server to a local Maildir directory.
We’ll prevent any propagation of email deletions from the server to the local directory (sync_deletes = no). Also, local email file modification times are made to match the Date field of the respective email contained within (utime_from_header = yes).
Our sample configuration file ~/offlineimap.conf looks like this (please adapt IMAP credentials etc. to your testing environment):
[DEFAULT]
sslcacertfile = OS-DEFAULT
folderfilter = lambda foldername: not foldername.startswith(('Trash', 'Junk', 'Spam', 'Drafts', 'Notes'))

[general]
metadata = ~/.offlineimap
accounts = mymail

[Account mymail]
localrepository = Local.mymail
remoterepository = Remote.mymail

[Repository Local.mymail]
type = Maildir
localfolders = ~/mymail/
utime_from_header = yes

[Repository Remote.mymail]
remotehost = smtp.example.com
remoteuser = email@example.com
remotepass = mysecretpassword
sync_deletes = no
type = IMAP
starttls = yes
ssl = no
remoteport = 143
offlineimap -c ~/offlineimap.conf
~/mymail/ contains all IMAP folders pulled from the server.
Organizing a file-based archive
Do not hand over your password to assistants, or to persons performing audits. Instead, learn how to handle Access Control Lists (ACLs).
Start simple, e.g. by creating groups like audits. Do not give any group write access to archive directories. Restrict that to a single archive admin user who is not even a member of your ACL scheme, and after the respective retention period has ended, purge data from your archives only under that user account.
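As a starting point, read-only group access could be set up roughly like this. This is a sketch under assumptions: the acl tools (setfacl, getfacl) are installed, the file system supports POSIX ACLs, and both the helper name grant_readonly and the path /srv/archive are placeholders of ours.

```shell
#!/usr/bin/env bash
# Sketch: give one group read-only access to an archive tree via POSIX ACLs.
# grant_readonly is a hypothetical helper name.
# Usage: grant_readonly GROUP DIRECTORY
grant_readonly() {
    local group=$1 dir=$2
    # r = read; X = traverse directories, and execute only where a file
    # already is executable. No write permission is granted.
    setfacl -R -m "g:${group}:rX" "$dir" &&
    # Default ACL: items created below DIRECTORY later inherit the same rule.
    setfacl -R -d -m "g:${group}:rX" "$dir"
}

# Typical use, as root (the path is a placeholder):
#   groupadd audits
#   grant_readonly audits /srv/archive
#   getfacl /srv/archive
```

Adding or removing a person then means changing group membership only; the archive tree itself stays untouched.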
Archive files with tight access permissions in directory trees that are separated from the rest.
The more exemptions you need to make from that rule of thumb, the more likely you are to accidentally expose data in a directory branch that should better have been kept private. Also, a piebald access rights assignment makes it harder to turn over isolated directories to entitled parties, should you ever have to.
Separate archive metadata and data
Keep archival metadata like tags, annotations, timestamps etc. separate from archived data, even if embedding it within the data files looks much more convenient.
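One way to keep both apart is a metadata tree that mirrors the data tree. The layout and the helper name meta_path below are assumptions of ours, shown only to illustrate the idea:

```shell
#!/usr/bin/env bash
# Assumed layout (not prescribed by this episode):
#   ARCHIVE/data/...   archived files, never modified
#   ARCHIVE/meta/...   tags, notes, time stamps; mirrors the data tree
# meta_path is a hypothetical helper that maps a data file to its sidecar file.
# Usage: meta_path ARCHIVE FILE SUFFIX
meta_path() {
    local archive=$1 file=$2 suffix=$3
    # Strip the leading "ARCHIVE/data/" to get the path relative to the data tree.
    local rel=${file#"$archive"/data/}
    if [ "$rel" = "$file" ]; then
        echo "not below ${archive}/data: ${file}" >&2
        return 1
    fi
    printf '%s/meta/%s.%s\n' "$archive" "$rel" "$suffix"
}
```

With this, meta_path /srv/archive /srv/archive/data/2021/invoice.pdf tags yields the location of a tag file for that invoice, and the data tree can stay read-only while its metadata keeps evolving.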
The archiving process
Depending on your needs, archiving a file or directory may require multiple steps:
- File verification
- Time stamping
- Adding error detection & correction information
- Content indexing
It is possible to detect new arrivals in archive folders, and have that trigger the archiving process (see below for an automation example).
Detecting changes in archive directories
While it is technically feasible to monitor, in real time, a large number of directories for changes, this creates a substantial load on the respective computer.
A better alternative is to use a file system with inherent, fast snapshotting capabilities, like btrfs. Before taking a new snapshot, compare the current state of the file system to the most recent snapshot, to detect updates.
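The comparison can be sketched with the widely used find-new approach of the btrfs command line tool. Treat this as an illustration under assumptions: the helper name changed_since_snapshot is ours, the commands usually need root, and find-new only reports files that received new data, so deletions and pure metadata changes do not show up.

```shell
#!/usr/bin/env bash
# Sketch: list paths that received new data since a snapshot was taken.
# changed_since_snapshot is a hypothetical helper name.
# Usage: changed_since_snapshot SNAPSHOT SUBVOLUME
changed_since_snapshot() {
    local snapshot=$1 subvolume=$2 marker
    # With a generation far in the future, find-new lists no files and only
    # prints "transid marker was N", the last generation of the snapshot.
    marker=$(btrfs subvolume find-new "$snapshot" 9999999) || return 1
    marker=${marker#transid marker was }
    # List what was written to the live subvolume from that generation on,
    # drop the trailing marker line, and keep the path (field 17 onwards).
    btrfs subvolume find-new "$subvolume" "$marker" |
        sed '$d' | cut -d' ' -f17- | sort -u
}
```

If the function prints nothing, there is nothing new to archive; otherwise the listed paths can be fed into the archiving steps, followed by a new read-only snapshot (btrfs subvolume snapshot -r).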
Archiving process options
For verifying conformance of PDF files to the PDF/A standards, use veraPDF.
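A check like this can be wrapped for scripting. The sketch below rests on two assumptions that should be verified against the installed veraPDF version: the command line launcher is on the PATH as verapdf, and its text report starts each result line with PASS or FAIL. The helper name is_pdfa is ours.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if veraPDF reports the file as conforming PDF/A.
# is_pdfa is a hypothetical helper name.
# Assumption: "verapdf --format text FILE" prints a line starting with
# "PASS " for a conforming file, and "FAIL " otherwise.
is_pdfa() {
    local file=$1
    # Without an explicit flavour, veraPDF is left to pick the PDF/A flavour
    # that the file itself claims to conform to.
    verapdf --format text "$file" | grep -q '^PASS '
}

# Typical use:
#   is_pdfa invoice.pdf || echo "invoice.pdf is not valid PDF/A" >&2
```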
Many file formats that are widely used lack a tool that can verify strict standard conformance. It is however possible to check for some essential markers within a file. We have described here how to use the tool siegfried for that.
Indexing and search
Try to limit the number of indexers and related databases, so you can focus on a single frontend, with a single query language.
- Office documents
It supports boolean searches, phrases, proximity, wildcards, filters on file types and on the directory tree, as well as filtering time spans.
Multiple indexes with different settings can be created using multiple configuration directories. Updating indexes, or rebuilding them from scratch, can be automated via the GUI, or scripted.
According to RFC 3161, Time Stamping Authorities (TSAs) process signing requests that include hash values of the respective data. Besides the hash values, their response contains a digitally signed time stamp, as proof that a certain content already existed at the given point in time.
While it is easy to host a time stamping service (e.g., company-internal; or freeTSA.org), acquiring and maintaining the status of a qualified Trust Service Provider (TSP) requires considerable investments, not only financially, but also in process (e.g., security; mandatory periodic audits). Hence, e.g. the EU’s trusted lists of qualified trust service providers include mostly commercial providers, besides government agencies and academic associations.
From a user perspective, you can either pay for the time stamping service of a qualified TSP, or have your data time-stamped by multiple free services, including free services run by qualified TSPs.
See timestamp.sh as an example for a solution involving multiple, free (as in: costs) TSAs.
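For orientation, a single RFC 3161 round trip can be sketched with OpenSSL and curl. This is not the timestamp.sh script mentioned above, just a minimal illustration: the helper names make_ts_query and request_ts are ours, and the TSA URL is passed in as a parameter.

```shell
#!/usr/bin/env bash
# Sketch of an RFC 3161 round trip; helper names are hypothetical.

# Create a time stamp query that contains only the SHA-512 hash of FILE.
# Usage: make_ts_query FILE OUTBASE      (writes OUTBASE.tsq)
make_ts_query() {
    local file=$1 out=$2
    # -cert asks the TSA to include its signing certificate in the response.
    openssl ts -query -data "$file" -sha512 -cert -out "${out}.tsq"
}

# Send the query to a TSA and store the signed response.
# Usage: request_ts OUTBASE TSA_URL      (reads OUTBASE.tsq, writes OUTBASE.tsr)
request_ts() {
    local out=$1 tsa_url=$2
    curl --silent --show-error --fail \
         --header 'Content-Type: application/timestamp-query' \
         --data-binary "@${out}.tsq" \
         --output "${out}.tsr" \
         "$tsa_url"
}
```

Only the hash leaves your machine, never the document itself. A response can later be checked with openssl ts -verify, given the original file (or the query) and the certificates published by the respective TSA. Repeating request_ts with several TSA URLs gives the redundancy described above.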
There are projects implementing time stamping by means of a blockchain. While this doesn’t require qualified TSPs, it suffers heavily from environmental and/or ethical issues caused by the consensus mechanisms that are in use for the respective blockchain (Proof of Work, or Proof of Stake).
As of today, for that reason we don’t recommend blockchain-based timestamping.
Error detection and correction
There are several ways to monitor your archives for bit rot, and also for tampering. Not all of them auto-update, triggered by new, changed or deleted files or directories.
The following options differ in whether they prevent bit rot, prevent tampering, and auto-update on changes:
- RAID, either software or hardware
- Checksumming at file-system level
- Parity files
- Parity drives
The longer the required retention period, the more you need to stay independent of changes in storage technology, and of specific hardware. For data with an extremely long retention period, we recommend parity files. par2 is a great option, here.
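A minimal sketch of how parity files could be created and checked with the par2 command line tool. The helper names protect_file and check_file are ours, and the default of 10 % redundancy is an arbitrary choice, not a recommendation.

```shell
#!/usr/bin/env bash
# Sketch: create and verify PAR2 recovery data for a single archived file.
# protect_file and check_file are hypothetical helper names.

# Usage: protect_file FILE [REDUNDANCY_PERCENT]
protect_file() {
    local file=$1 redundancy=${2:-10}
    # Writes FILE.par2 plus recovery volumes next to FILE.
    par2 create -q -r"$redundancy" "${file}.par2" "$file"
}

# Usage: check_file FILE      (fails if FILE no longer matches its PAR2 data)
check_file() {
    par2 verify -q "${1}.par2"
}

# If verification fails, a repair can be attempted with:
#   par2 repair FILE.par2
```

Because the recovery data are plain files, they travel with the archive across storage media and file systems, which is what makes them attractive for very long retention periods.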
Given the potential legal implications of an irrefutable digital signature, we strongly recommend making signing a manual task, and checking twice what you are signing.
While it is possible to automate the removal of archives that are no longer needed, we strongly recommend making pruning a manual task to be executed at the end of the retention period (or even restricted to a cleanup at the start of a new year).
This content is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
For your attributions to us please use the word »tuxwise«, and the link https://tuxwise.net.