So, you’ve got a handful of files and need to turn them into PDF format. PDF files that don’t just look good on your device, but everywhere, and are suitable for long-term archiving.
Here’s what you can do.
A few caveats, aka: »management of expectations«
Background: Not all PDFs are created equal
What is commonly referred to as the »PDF standard« is actually a hodgepodge of features, some of them proprietary; the current version is PDF 1.7 (ISO 32000–1:2008). Most PDF software packages don’t even come close to supporting all of those features. A cleaned-up, newer version of the standard is PDF 2.0 (ISO 32000–2:2020), which isn’t widespread yet (and paradoxically, can’t be downloaded for free – unlike its predecessor).
With »the« PDF standard being such a grab bag, several standards with narrower focuses have been defined. Typically, they require a minimum of supported »PDF« features, but also forbid the use of several others.
A few examples: PDF/E focuses on engineering purposes, e.g. allowing for interactive 3D visualization. The PDF/X family of standards focuses on PDFs as input for commercial printing. PDF/UA focuses on accessibility, especially by supporting assistive technologies.
We’re focusing on PDF/A here.
»A« stands for »Archive«: PDF/A is a set of restrictions imposed on PDF files, to ensure they’ll still be usable 50 years from today.
Which essentially means two things: First, all that is required to display the content is contained within the PDF file, including the fonts used. Any existing or future device with a PDF viewer should do, and what is displayed should look identical across devices.
Second, anything that puts the »archive« purpose at risk is forbidden: no embedded executable code, no encryption, no external links, no multimedia, nothing else that might be causing trouble, in a distant future.
How many flavors of PDF/A are there?
As you may have guessed, the mess is starting here. PDF/A comes in several versions (1, 2 or 3), that define available features. Conformance levels (B, U, and A; U for versions 2 and 3 only) define implementation requirements with respect to features. E.g., a PDF/A‑2B file follows version 2 of the standard, and conforms to level B.
- PDF/A‑1 (ISO 19005–1:2005): the oldest, and most widely supported version. Cannot handle transparency, cannot embed JPEG2000 images, cannot have any other files embedded, as attachments. Many software packages will use workarounds to compensate for such issues, mainly by converting graphics to supported formats, when exporting to PDF/A. Which isn’t as bad as it sounds, results still look fine.
- PDF/A‑2 (ISO 19005–2:2011): a greatly modernized version that may lead to files that are incompatible with viewers capable of displaying PDF/A‑1 files only. Adds support for JPEG2000, transparency, layers, specific digital signatures, and embedded PDF/A files, as attachments.
- PDF/A‑3 (ISO 19005–3:2012): a minor update of version 2, additionally allowing for arbitrary embedded file attachments (which caused some controversy, not only among archivists).
PDF/A conformance levels:
- Level B (Basic conformance): essentially, the document is visually reproducible, it looks the same on all viewers. That doesn’t mean its internal structure reflects anything about »words«, or »sentences«, or »paragraphs« – it’s all about looks, only.
You may liken a level B PDF/A file to a blackmail letter, carefully crafted from newspaper snippets to make it look like a document; technically, it’s still just a bunch of paper patches that each have coordinates on the paper sheet, but no real relationships among themselves. Software trying to convert a level B PDF/A document into a plain text file, or into speech, is out of luck, because the file content is merely patches and their coordinates. Which means: Level B files are not accessible.
- Level U (Unicode conformance): extends level B by requiring that all text in the file can be mapped to Unicode. This resolves problems like searching for words with »foreign characters«.
- Level A (Accessible conformance): in addition to Level B, some metadata is required, like information about language, hierarchy (»heading levels«), text spans (»reading order« of the letters). Essentially, that additional metadata turns a mere collage of unrelated letters into a »document«.
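To make the version/level naming concrete, here is a minimal shell sketch that splits a flavor such as 2b or PDF/A-3U into its two parts and rejects combinations the lists above rule out (level U exists for versions 2 and 3 only). The function name pdfa_parse_flavor is our own invention for illustration; it is not part of any tool mentioned in this article.

```shell
# Split a PDF/A flavor such as "2b" or "PDF/A-3U" into version and level.
# Hypothetical helper for illustration only.
pdfa_parse_flavor() {
    # Strip an optional "PDF/A-" prefix (ASCII hyphen) and lowercase the level letter.
    flavor=$(printf '%s' "$1" | sed 's|^[Pp][Dd][Ff]/[Aa]-||' | tr 'ABU' 'abu')
    case "$flavor" in
        1a|1b|2a|2b|2u|3a|3b|3u) ;;                        # known combinations
        *) echo "unknown PDF/A flavor: $1" >&2; return 1 ;; # e.g. "1u" does not exist
    esac
    version=${flavor%?}                                    # everything but the last character
    level=$(printf '%s' "${flavor#?}" | tr 'abu' 'ABU')    # last character, uppercased
    echo "version=$version level=$level"
}
```

For example, pdfa_parse_flavor 2b prints version=2 level=B, while pdfa_parse_flavor 1u fails with an error message.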
Which flavor of PDF/A should I choose, then?
Our recommendations for creating PDF/A files:
- In general, you cannot rely on software packages to produce valid PDF/A files, under all circumstances. By all means validate your results (see below).
- Prefer compliance level A over U over B, to include as much useful metadata as possible in the resulting PDF file. We expressly deplore the current prevalence of PDF/A‑2b, because we consider it an electronic equivalent of scrapbook archiving. Level B fails to reasonably support the quick, meaningful electronic processing of PDF content.
- Go for PDF/A‑2. Avoid PDF/A‑3, because its option to embed »arbitrary« files may annihilate the purpose of a long-term PDF archive, and is rightfully frowned upon by many.
How to validate a file against the standards?
Use veraPDF – see below.
Boot into Linux
Boot into a fairly recent Linux distribution. For this episode, we’ve used Manjaro 21.2.0 KDE, and cross-checked our results (except digitally signing PDF files, see the respective section for details) on Ubuntu 21.04 (Hirsute Hippo).
Open a terminal and you’re ready to go.
Tools used, and their packages
- pdfsig from Poppler >= v21.10.0
- a Java Runtime Environment (JRE) >= v11, e.g. OpenJDK
- OCRmyPDF, see installation notes (tesseract-ocr-spa, … or tesseract-data-spa, …)
- unoconv (please note you also need OpenJDK >= 11)
- veraPDF from their website, or via
Four PDF/A creation strategies
Obviously, the order of preference is:
- Prefer office software that can export to PDF/A
- Script PDF/A file creation from your source files (if effort is acceptable)
- Try to convert PDF files exported by applications into valid PDF/A files
- Last resort: »Print to PDF« (followed by OCR, if necessary)
1. Prefer office software that can export to PDF/A
While generic PDF export capabilities are surprisingly common, only a few software packages can directly export to PDF/A.
I need PDF/A versions of my LibreOffice documents
By all means, do not print into a PDF, but export to PDF (see below for why »Print to PDF« is only a second choice). Here’s what you get in return for exporting instead of »printing to a PDF printer«:
- Selectable text everywhere.
- A »table of contents«, aka bookmarks, in the resulting PDF file.
Export as PDF..., then enable
[x] Archive (PDF/A, ISO 19005) and
[x] Export outlines. Finally, select the best available version/compliance combination, currently
… but I need to automate that, from the terminal
As a sample file, we’ll use LoremIpsum.odt
Since libreoffice does not accept all of the required export filter parameters, we’ll use unoconv, which will in turn use libreoffice, to do its work.
We’ll create a PDF/A file, version 2 (conformance level B is implied); ExportBookmarks creates bookmarks for Table-of-Contents entries; we’ll limit image resolution to max. 600 DPI and set image quality to 100.
- First, close all running instances of LibreOffice.
- Then, run:
unoconv -f pdf -e SelectPdfVersion=2 -e ExportBookmarks=true -e MaxImageResolution=600 -e ReduceImageResolution=true -e Quality=100 LoremIpsum.odt
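If you have more than one document, the call above can be wrapped in a loop. The following is only a sketch: the function name odt2pdfa and the DRY_RUN switch are our own assumptions, while the unoconv parameters are exactly the ones from the command above. With DRY_RUN set, the commands are printed instead of executed, so you can check them first.

```shell
# Convert every .odt file in a directory with the unoconv call shown above.
# odt2pdfa and DRY_RUN are hypothetical names used for this sketch.
odt2pdfa() {
    for src in "$1"/*.odt; do
        [ -e "$src" ] || continue    # no .odt files: the glob stayed unexpanded
        # With DRY_RUN set, "echo" is prepended and the command is only printed.
        ${DRY_RUN:+echo} unoconv -f pdf \
            -e SelectPdfVersion=2 -e ExportBookmarks=true \
            -e MaxImageResolution=600 -e ReduceImageResolution=true \
            -e Quality=100 "$src"
    done
}
```

Try DRY_RUN=1 odt2pdfa ./documents first, then odt2pdfa ./documents once the printed commands look right. As before, close all running instances of LibreOffice beforehand.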
2. Create PDF/A files from other files
I need a PDF/A photo documentation from a grab bag of snapshots
We’ll create a »slideshow« PDF that starts full screen by default, can be printed on A4 paper sheets, and validates against PDF/A‑2B.
Let’s start by preparing a directory holding four selected photographs that have wildly differing sizes / resolutions:
Credits: selective focus photography of shoreline during golden hour | Photo by Ishan @seefromthesky on Unsplash · brown wooden houses near body of water under blue sky during daytime | Photo by Divya Agrawal on Unsplash · people walking on sidewalk between buildings during daytime | Photo by Mohammed Ajwad on Unsplash · mountain sea of clouds | Photo by Mohammed Ajwad on Unsplash
The script used to create a PDF/A »slideshow«:
- Copy your selection of photographs into a separate directory. In our example, it’s
- Rename them to something meaningful. The names will be used as bookmarks in the table of contents (TOC). In our example, we’ll keep the original names.
- Decide upon the order and accordingly prefix the file names with numbers. Don’t use consecutive numbers (1, 2, 3, …) but leave some space (e.g., 10, 20, 30, …): This will allow for adding new photos in between, and for easy reordering, in case you change your mind. In our example, let’s assume we’ll end up with:
We’re ready to execute the script:
./pics2photodoc.sh -i ./photodoc -o ./photodocumentation
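The numbering step from the preparation list can be scripted as well. This is a sketch under our own assumptions: the function name number_photos, the underscore separator, and the zero-padded three-digit prefix (010, 020, …, chosen so that alphabetical sorting still works beyond nine photos) are not prescribed by pics2photodoc.sh.

```shell
# Prefix files with 010_, 020_, 030_, ... in the order their names arrive on stdin.
# number_photos, the "_" separator and the zero padding are assumptions of this sketch.
number_photos() {
    dir=$1
    n=0
    while IFS= read -r name; do
        [ -n "$name" ] || continue               # skip empty lines
        n=$((n + 10))                            # leave gaps for later insertions
        prefix=$(printf '%03d' "$n")
        mv -- "$dir/$name" "$dir/${prefix}_$name"
    done
}
```

Usage: list the file names in the order you want, one per line, and pipe them in, e.g. printf '%s\n' shore.jpg houses.jpg | number_photos ./photodoc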
I need PDF bookmarks, aka a »Table of Contents«
The preferred way to include bookmarks is having the document-creating software add them. LibreOffice can do that.
While it is a bit tedious, you can also add bookmarks to any existing PDF file. This is achieved by using pdfmark operators.
Let’s assume the table of contents that you have in mind looks like this:
Computing at home (page 1)
1. Intro (page 2)
2. Computers (page 5)
    2.1 History (page 7)
        2.1.1 The rise of the PC (page 8)
        2.1.2 The rise of the Notebook (page 11)
3. Smartphones (page 15)
... etc ...
First, prepare a text file that reflects this structure, and save it as pdfmarks.txt. The pdfmark standard mandates a specific syntax that is still human-readable:
[ /Title (Computing at home) /Page 1 /OUT pdfmark
[ /Title (1. Intro) /Page 2 /OUT pdfmark
[ /Title (2. Computers) /Page 5 /Count 1 /OUT pdfmark
[ /Title (2.1 History) /Page 7 /Count -2 /OUT pdfmark
[ /Title (2.1.1 The rise of the PC) /Page 8 /OUT pdfmark
[ /Title (2.1.2 The rise of the Notebook) /Page 11 /OUT pdfmark
[ /Title (3. Smartphones) /Page 15 /OUT pdfmark
- Title is followed by the bookmark title
- Page is the target page for the bookmark
- Count gives the number of child bookmarks. Putting a minus sign - in front of the number tells the PDF viewer to show it collapsed
To add the bookmarks to a file, pass pdfmarks.txt to gs as an additional input, after the PDF file:
gs -o ./document-plus-bookmarks.pdf -dPDFA=2 -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sColorConversionStrategy=UseDeviceIndependentColor ./document.pdf ./pdfmarks.txt
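Counting child bookmarks by hand gets error-prone for longer outlines. Here is a sketch that derives the Count values from a simple outline listing. The input format (one entry per line, as depth|page|title, with depth 0 for top-level entries) and the function name outline2pdfmarks are our own assumptions. The sketch always writes positive counts, i.e. expanded entries; put a minus sign in front by hand where you want an entry collapsed. Titles containing parentheses or a | character are not handled.

```shell
# Turn "depth|page|title" lines on stdin into pdfmark bookmark entries.
# Input format and function name are assumptions made for this sketch.
outline2pdfmarks() {
    awk -F'|' '
        { depth[NR] = $1 + 0; page[NR] = $2 + 0; title[NR] = $3 }
        END {
            for (i = 1; i <= NR; i++) {
                kids = 0
                # Immediate children: following entries exactly one level deeper,
                # up to the next entry on the same or a shallower level.
                for (j = i + 1; j <= NR && depth[j] > depth[i]; j++)
                    if (depth[j] == depth[i] + 1) kids++
                printf "[ /Title (%s) /Page %d", title[i], page[i]
                if (kids > 0) printf " /Count %d", kids
                print " /OUT pdfmark"
            }
        }'
}
```

Usage: outline2pdfmarks < outline.txt > pdfmarks.txt, where outline.txt is a hypothetical file holding lines such as 0|5|2. Computers and 1|7|2.1 History.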
3. Try to convert PDF files exported by applications into valid PDF/A files
Since you might have close to no control over what the »PDF export« feature of an application gives you, we’ll kind of shoehorn the original file into a new one that conforms to PDF/A.
gs is our tool of choice here.
Best guesses, and caveats:
- We’ll follow the recommendations of the PDF Association and use an RGB color space for the output intent, described by the sRGB ICC profile. The resulting PDF/A should display well on most devices.
- It’s not possible to »fix« every feature in the original PDF file that does not conform to PDF/A standards. Therefore, gs will sometimes force-remove features. Unfortunately, this may mean that we lose some metadata as well. Also, gs will give many warnings on the terminal, in such cases.
- Create a working directory, and
- Download the freely available sRGB profile sRGB2014.icc from the website of the International Color Consortium, and place it in the working directory
- Export a colorful sample page from an application into a PDF, e.g. a web site from your browser. Move it to the working directory, and rename it to
- gs comes with a sample prefix file that needs to be adapted to the chosen profile; e.g., for gs version 9.55.0, you can find the sample at /usr/share/ghostscript/9.55.0/lib/PDFA_def.ps. Copy it to your working directory as well, and modify the lines for OutputConditionIdentifier as shown (leave everything else unchanged):
You can now convert your PDF file to PDF/A:
gs -I . -o ./outputPDFA.pdf -dPDFA=2 -dPDFACompatibilityPolicy=1 -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sColorConversionStrategy=UseDeviceIndependentColor ./PDFA_def.ps ./input.pdf
Validate the result against PDF/A‑2b (see veraPDF, below). Here’s a short explanation of what some parameters do:
-I . extends the library search path of gs by the current directory, so sRGB2014.icc will be found
-dPDFA=2 requests PDF/A‑2b as output
-dPDFACompatibilityPolicy=1 permits gs to relentlessly drop all features from the original PDF file that can’t be made to validate against PDF/A.
When everything runs as expected, you can copy the profile to /usr/share/color/icc. Make sure you adapt the -I parameter accordingly, as well as the profile’s path in PDFA_def.ps. As an alternative to -I, you can study how --permit-file-read works, and use that instead.
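Since the conversion silently depends on two helper files, it can pay off to check for them before calling gs. The following wrapper is a sketch only: the function name pdf2pdfa, the optional third argument for the working directory, and the DRY_RUN switch are our own assumptions; the gs parameters are the ones from the command above.

```shell
# Run the gs conversion shown above, but only if all required files are present.
# pdf2pdfa and DRY_RUN are hypothetical names used for this sketch.
pdf2pdfa() {
    src=$1
    dst=$2
    workdir=${3:-.}    # directory holding PDFA_def.ps and sRGB2014.icc
    for needed in "$src" "$workdir/PDFA_def.ps" "$workdir/sRGB2014.icc"; do
        if [ ! -f "$needed" ]; then
            echo "missing: $needed" >&2
            return 1
        fi
    done
    # With DRY_RUN set, "echo" is prepended and the command is only printed.
    ${DRY_RUN:+echo} gs -I "$workdir" -o "$dst" -dPDFA=2 -dPDFACompatibilityPolicy=1 \
        -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
        -sColorConversionStrategy=UseDeviceIndependentColor \
        "$workdir/PDFA_def.ps" "$src"
}
```

Usage: pdf2pdfa ./input.pdf ./outputPDFA.pdf, run from within the working directory. A successful run still says nothing about conformance, so validate the output afterwards.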
4. Last resort: »Print to PDF«
Most Linux distributions support the CUPS printing system, including CUPS-PDF, a virtual printer that creates PDF files. After you’ve installed CUPS-PDF according to the instructions for your distribution, make sure that a test print actually generates a PDF file before you venture into PDF/A creation.
After the installation of CUPS-PDF, and a first test, we can reconfigure it to generate files that validate against PDF/A‑2b:
The configuration file to change is /etc/cups/cups-pdf.conf. Open it in your preferred editor.
Make sure that Out points to a reasonable output directory.
Enforce PDF/A‑2b creation: locate the GSCall line and add the following parameters.
Leave the rest of the line as-is (…):
GSCall %s -q -dPDFA=2 -dPDFACompatibilityPolicy=1 -sColorConversionStrategy=UseDeviceIndependentColor (...)
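After editing, a quick check can tell you whether the change is really active, e.g. whether the GSCall line is still commented out. This is a minimal sketch; the function name cups_pdf_enforces_pdfa is made up for illustration, and the check only looks for the -dPDFA=2 parameter on an uncommented GSCall line.

```shell
# Succeed if an active (uncommented) GSCall line contains -dPDFA=2.
# Hypothetical helper; the default path is the configuration file named above.
cups_pdf_enforces_pdfa() {
    conf=${1:-/etc/cups/cups-pdf.conf}
    grep -E '^[[:space:]]*GSCall[[:space:]]' "$conf" | grep -q -e '-dPDFA=2'
}
```

Usage: cups_pdf_enforces_pdfa && echo "PDF/A parameters are active"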
…but I REALLY need a PDF file containing text that can be copied, not just bitmaps
Let OCRmyPDF post-process a PDF file that is in the English language (-l eng), allow it to redo OCR (--redo-ocr) and discard OCR content already in that file; and produce an output file that conforms to PDF/A‑2b.
ocrmypdf -l eng --redo-ocr --output-type pdfa-2 in.pdf out.pdf
…but that still didn’t work
There is a brute-force approach that can be a last resort: have a script render all pages of a PDF file into bitmaps; compress them so the resulting PDF won’t become too large; use OCR to obtain the text; and finally re-assemble both text and bitmaps into a new PDF file.
We presented the principle in our first episode. For a refined, scripted version see unsearchable2searchablepdfa.sh.
Beyond just creating PDF/A files
I need to validate PDF files against PDF/A standards
veraPDF is a tool that validates files against the requirements of PDF/A versions 1–3, conformance levels B, U and A. The tool is written in Java and offers both a GUI and a CLI.
If you don’t want to use its GUI, or need to automate the validation, here’s how to validate on the terminal:
Let veraPDF give a verbose report in text format on the file validation against PDF/A‑2b:
./veraPDF/verapdf -v -f 2b --format text ./LoremIpsum.pdf
PASS ./LoremIpsum.pdf
If validation fails, all sections of the standard that were violated will be listed. Change the output format to mrr for more details.
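When validating many files, a short summary is handier than scrolling through the full report. The sketch below counts result lines in a text report read from stdin. It relies on the PASS line format shown above, and assumes that failed files are reported on lines starting with FAIL followed by the file name; please verify that against the output of your veraPDF version. The function name summarize_verapdf is our own.

```shell
# Summarize a veraPDF text report read from stdin.
# Assumption: one "PASS <file>" or "FAIL <file>" line per validated file.
summarize_verapdf() {
    awk '
        $1 == "PASS" { pass++ }
        $1 == "FAIL" { fail++; sub(/^FAIL[ \t]+/, ""); failed = failed "\n  " $0 }
        END {
            printf "passed: %d, failed: %d\n", pass, fail
            if (fail > 0) printf "failing files:%s\n", failed
            exit (fail > 0)    # non-zero exit status if anything failed
        }'
}
```

Usage: pipe a report into it, e.g. ./veraPDF/verapdf -f 2b --format text ./LoremIpsum.pdf | summarize_verapdf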
I need to digitally sign a PDF/A file
pdfsig, as described here, requires Poppler >= v21.10.0. If your Linux distribution is not a rolling release, finding and installing a sufficiently new version might be a challenge.
For organizations or owners of an internet domain, digitally signing PDF/A files with a private key may be desirable. Commonly, you pay a Certification Authority (CA) that will verify you’re the owner of the domain, or a legal representative of your organization. At the end of the process, it’s common to receive a PKCS#12 keystore file (extension .pfx) that contains both a private key, and the corresponding X.509 certificate.
To be ready for signing PDF/A files, you would import the keystore you received into a trusted certificate store on your server.
Setting up a dummy certificate store
For demonstration purposes, we’ll first create a dummy keystore file, and import it into a dummy certificate store. OpenSSL is the tool of choice for this.
Create a dummy keystore file for a fictitious company:
openssl req -x509 -newkey rsa:4096 -subj '/C=DE/ST=Bavaria/L=Nuremberg/CN=www.example.com/O=ACME Inc.' -keyout theprivatekey.pem -out thecertificate.crt -days 3650 -nodes
openssl pkcs12 -export -out thekeystore.p12 -inkey theprivatekey.pem -in thecertificate.crt
We’ve now got our dummy thekeystore.p12 that we’ll import into a dummy certificate store, so we can start signing PDF/A files:
Set up a dummy certificate store, and import the keystore file into it:
certutil -N -d ./dummynss
pk12util -i thekeystore.p12 -d sql:./dummynss
Digitally signing a PDF/A file, and validating the signature
We’re now ready to start signing PDF/A files:
Sign LoremIpsum.pdf (please note that accessing the certificate keystore requires having the password on the command line, after -nss-pwd):
pdfsig -nssdir ./dummynss -nss-pwd password -add-signature -nick "www.example.com - ACME Inc." ./LoremIpsum.pdf ./LoremIpsum-signed.pdf
Let’s validate the signature:
Validate the signature of
Will print an output similar to this:
Digital Signature Info of: ./LoremIpsum-signed.pdf
Signature #1:
  - Signer Certificate Common Name: www.example.com
  - Signer full Distinguished Name: O=ACME Inc.,CN=www.example.com,L=Nuremberg,ST=Bavaria,C=DE
  - Signing Time: Nov 20 2021 14:25:39
  - Signing Hash Algorithm: SHA-256
  - Signature Type: adbe.pkcs7.detached
  - Signed Ranges: [0 - 46812], [54652 - 55032]
  - Total document signed
  - Signature Validation: Signature is Valid.
  - Certificate Validation: Certificate issuer isn't Trusted.
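In a script, you’ll probably want a yes/no answer instead of reading the report yourself. The following sketch reads such a report from stdin and succeeds only if it contains at least one »Signature Validation« line, and all of them say »Signature is Valid.«, as in the sample output above. It deliberately ignores the certificate validation line, so our self-signed dummy certificate still passes. The function name signature_is_valid is our own.

```shell
# Succeed if every "Signature Validation" line in a pdfsig report on stdin
# reports a valid signature (and there is at least one such line).
# Certificate trust is NOT checked here. Hypothetical helper name.
signature_is_valid() {
    awk '
        /Signature Validation:/ { seen++; if ($0 !~ /Signature is Valid\./) bad++ }
        END { exit (seen == 0 || bad > 0) }'
}
```

Usage: pipe the output of the pdfsig validation run into it, e.g. … | signature_is_valid && echo "signature OK"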
Since our dummy certificate is self-signed only, the certificate issuer is being reported as untrusted, of course.
I need to disable copy & paste for this PDF file
Yes, some software packages offer options for that. Just forget it: Whatever can be displayed, can ultimately be copied, too. Our first episode even has a section for dealing with »stubborn« PDF files.
1 us dollar bill (modified) | Photo by Kirk Cameron on Unsplash
This content is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
For your attributions to us please use the word »tuxwise«, and the link https://tuxwise.net.