S01E02 »Can I have this as a PDF?«

So, you’ve got a handful of files and need to turn them into PDF format. PDF files that don’t just look good on your device, but everywhere, and are suitable for long-term archiving.

Here’s what you can do.


A few caveats, aka: »management of expectations«

You can’t »understand PDF« without briefly looking at what PDF standards exist. We present a bit of background, below.
Current PDF creation implementations are a mess. You’ll probably need to compromise a bit, with respect to the achievable level of standard conformance.

Background: Not all PDFs are created equal

What is commonly referred to as the »PDF standard« is actually a hodgepodge of features, some of them proprietary; the current version is PDF 1.7 (ISO 32000–1:2008). Most PDF software packages don’t even come close to supporting all of those features. A cleaned-up, newer version of the standard is PDF 2.0 (ISO 32000–2:2020), which isn’t widespread yet (and paradoxically, can’t be downloaded for free – unlike its predecessor).

With »the« PDF standard being such a grab bag, several standards with narrower focuses have been defined. Typically, they require a minimum of supported »PDF« features, but also forbid the use of several others.

A few examples: PDF/E focuses on engineering purposes, e.g. allowing for interactive 3D visualization. The PDF/X family of standards focuses on PDFs as input for commercial printing. PDF/UA focuses on accessibility, especially by supporting assistive technologies.

We’re focusing on PDF/A here.

What’s »PDF/A«?

»A« stands for »Archive«: PDF/A is a set of restrictions imposed on PDF files, to ensure they’ll still be usable 50 years from today.

Which essentially means two things: First, all that is required to display the content is contained within the PDF file, including the fonts used. Any existing or future device with a PDF viewer should do, and what is displayed should look identical across devices.

Second, anything that puts the »archive« purpose at risk is forbidden: no embedded executable code, no encryption, no external links, no multimedia, nothing else that might be causing trouble, in a distant future.

How many flavors of PDF/A are there?

As you may have guessed, the mess is starting here. PDF/A comes in several versions (1, 2 or 3), that define available features. Conformance levels (B, U, and A; U for versions 2 and 3 only) define implementation requirements with respect to features. E.g., a PDF/A‑2B file follows version 2 of the standard, and conforms to level B.

Yes, there is a version 4 of PDF/A, but as of today, it isn’t widely supported.
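
The naming rule above (version 1, 2 or 3, plus level B, U or A, with U for versions 2 and 3 only) can be sketched as a small shell check. The function name is_pdfa_flavor is our own invention for illustration, and the accepted spelling (just the digit and the letter, as in »2b«) is an assumption:

```shell
# Hypothetical helper, not part of any tool: succeeds if the argument names
# a PDF/A flavor described above (versions 1-3; level U for 2 and 3 only).
is_pdfa_flavor() {
  # Normalize to lower case so that "2B" and "2b" are treated alike.
  flavor=$(printf '%s' "$1" | tr 'ABU' 'abu')
  case "$flavor" in
    1a|1b|2a|2b|2u|3a|3b|3u) return 0 ;;
    *) return 1 ;;
  esac
}
```

With this, is_pdfa_flavor 2B succeeds, while is_pdfa_flavor 1u fails because version 1 has no level U.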

PDF/A versions:

  • PDF/A‑1 (ISO 19005–1:2005): the oldest, and most widely supported version. Cannot handle transparency, cannot embed JPEG2000 images, cannot have any other files embedded, as attachments. Many software packages will use workarounds to compensate for such issues, mainly by converting graphics to supported formats, when exporting to PDF/A. Which isn’t as bad as it sounds, results still look fine.
  • PDF/A‑2 (ISO 19005–2:2011): a greatly modernized version that may lead to files that are incompatible with viewers capable of displaying PDF/A‑1 files only. Adds support for JPEG2000, transparency, layers, specific digital signatures, and embedded PDF/A files, as attachments.
  • PDF/A‑3 (ISO 19005–3:2012): a minor update of version 2, additionally allowing for arbitrary embedded file attachments (which caused some controversy, not only among archivists).

PDF/A conformance levels:

  • Level B (Basic conformance): essentially, the document is visually reproducible, it looks the same on all viewers. That doesn’t mean its internal structure reflects anything about »words«, or »sentences«, or »paragraphs« – it’s all about looks, only.
    You may liken a level B PDF/A file to a blackmail letter, carefully crafted from newspaper snippets to make it look like a document; technically, it’s still just a bunch of paper patches that each have coordinates on the paper sheet, but no real relationships among themselves. Software trying to convert a level B PDF/A document into a plain text file, or into speech, is out of luck, because the file content is merely patches and their coordinates. Which means: Level B files are not accessible.
  • Level U (Unicode conformance): extends level B by requiring that all text in the file can be mapped to Unicode. This resolves problems like searching for words with »foreign characters«.
  • Level A (Accessible conformance): in addition to Level B, some metadata is required, like information about language, hierarchy (»heading levels«), text spans (»reading order« of the letters). Essentially, that additional metadata turns a mere collage of unrelated letters into a »document«.

Which flavor of PDF/A should I choose, then?

Our recommendations for creating PDF/A files:

  • In general, you cannot rely on software packages to produce valid PDF/A files, under all circumstances. By all means validate your results (see below).
  • Prefer compliance level A over U over B, to include as much useful metadata as possible in the resulting PDF file. We expressly deplore the current prevalence of PDF/A‑2b, because we consider it an electronic equivalent of scrapbook archiving. Level B fails to reasonably support the quick, meaningful electronic processing of PDF content.
  • Go for PDF/A‑2. Avoid PDF/A‑3, because its option to embed »arbitrary« files may annihilate the purpose of a long-term PDF archive, and is rightfully frowned upon by many.

How to validate a file against the standards?

Use veraPDF – see below.

Boot into Linux

Boot into a fairly recent Linux distribution. For this episode, we’ve used Manjaro 21.2.0 KDE, and cross-checked our results (except digitally signing PDF files, see the respective section for details) on Ubuntu 21.04 (Hirsute Hippo).

Open a terminal and you’re ready to go.

Tools used, and their packages

  • certutil, pk12util, pdfsig from Poppler >= v21.10.0 (package: nss)
  • convert (package: imagemagick)
  • CUPS-PDF (package: printer-driver-cups-pdf or cups-pdf)
  • gs (package: ghostscript)
  • a Java Runtime Environment (JRE) >= v11, e.g. OpenJDK
  • OCRmyPDF (see installation notes)
  • openssl (package: openssl)
  • optipng (package: optipng)
  • tesseract (package: tesseract-ocr or tesseract, plus language data such as tesseract-ocr-fra, tesseract-ocr-spa, … or tesseract-data-fra, tesseract-data-spa, …)
  • unoconv (package: unoconv; please note you also need OpenJDK >= 11)
  • veraPDF (from their website, or via AUR)

Four PDF/A creation strategies

Obviously, the order of preference is:

  1. Prefer office software that can export to PDF/A
  2. Script PDF/A file creation from your source files (if effort is acceptable)
  3. Try to convert PDF files exported by applications into valid PDF/A files
  4. Last resort: »Print to PDF« (followed by OCR, if necessary)

1. Prefer office software that can export to PDF/A

While generic PDF export capabilities are surprisingly common, only a few software packages can directly export to PDF/A.

I need PDF/A versions of my LibreOffice documents

By all means, do not print into a PDF, but export to PDF (see below for why »Print to PDF« is only a second choice). Here’s what you get, in return for not »printing to a PDF printer«:

  • Selectable text everywhere.
  • A »table of contents«, aka bookmarks, in the resulting PDF file.
File > Export as > Export as PDF..., then enable [x] Archive (PDF/A, ISO 19005) and [x] Export outlines. Finally, select the best available version/compliance combination, currently PDF/A-2b.


… but I need to automate that, from the terminal

As a sample file, we’ll use LoremIpsum.odt

Since libreoffice does not accept all of the required export filter parameters, we’ll use unoconv, which will in turn use libreoffice, to do its work.

We’ll create a PDF file that follows PDF/A version 2 (conformance level B is implied); ExportBookmarks creates bookmarks for the table-of-contents entries; we’ll limit image resolution to max. 600 DPI and set image quality to 100%.

  1. First, close all running instances of LibreOffice.
  2. Then, run:
unoconv -f pdf -e SelectPdfVersion=2 -e ExportBookmarks=true -e MaxImageResolution=600 -e ReduceImageResolution=true -e Quality=100 LoremIpsum.odt
Result: ./LoremIpsum.pdf
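
If you have more than one document, the same call can be wrapped in a loop. What follows is a minimal sketch, not a polished tool: the function name export_dir_to_pdfa and the UNOCONV override variable are our own assumptions, and the advice to close all running LibreOffice instances first still applies.

```shell
# Sketch: export every .odt file in a directory with the options shown above.
# UNOCONV is a hypothetical override, e.g. for a differently named binary.
export_dir_to_pdfa() (
  dir=${1:-.}
  for doc in "$dir"/*.odt; do
    [ -e "$doc" ] || continue          # no .odt files: nothing to do
    "${UNOCONV:-unoconv}" -f pdf \
      -e SelectPdfVersion=2 -e ExportBookmarks=true \
      -e MaxImageResolution=600 -e ReduceImageResolution=true \
      -e Quality=100 "$doc" || exit 1  # stop at the first failure
  done
)
```

Calling export_dir_to_pdfa ./docs would then process each .odt file in ./docs in turn, stopping as soon as one export fails.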

2. Create PDF/A files from other files

I need a PDF/A photo documentation from a grab bag of snapshots

We’ll create a »slideshow« PDF that starts full screen by default, can be printed on A4 paper sheets, and validates against PDF/A‑2B.

Let’s start by preparing a directory holding four selected photographs that have wildly differing sizes / resolutions:

Credits: selective focus photography of shoreline during golden hour | Photo by Ishan @seefromthesky on Unsplash · brown wooden houses near body of water under blue sky during daytime | Photo by Divya Agrawal on Unsplash · people walking on sidewalk between buildings during daytime | Photo by Mohammed Ajwad on Unsplash · mountain sea of clouds | Photo by Mohammed Ajwad on Unsplash

The script used to create a PDF/A »slideshow«:

  1. Copy your selection of photographs into a separate directory. In our example, it’s ./photodoc.
  2. Rename them to something meaningful. The names will be used as bookmarks in the table of contents (TOC). In our example, we’ll keep the original names.
  3. Decide upon the order and accordingly prefix the file names with numbers. Don’t use consecutive numbers (1, 2, 3, …) but leave some space (e.g., 10, 20, 30, …): This will allow for adding new photos in between, and for easy reordering, in case you change your mind. In our example, let’s assume we’ll end up with:
  • 1 mohammed-ajwad-YY_zYDn4T5g-unsplash.jpg
  • 10 fotografu-3WdUBdr9Pxw-unsplash.jpg
  • 20 ishan-seefromthesky-3eP5-K5DQ9A-unsplash.jpg
  • 100 divya-agrawal-qa8VhqvJGIo-unsplash.jpg
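
For a larger pile of photos, a first numbering pass can be scripted. The following is only a sketch of the »leave some space« idea: the function name number_photos, the zero-padded three-digit prefix and the »_« separator are our own choices for this example, not something the slideshow script is known to require. It assigns an initial order (alphabetical); you would still reorder by hand afterwards.

```shell
# Sketch: give every .jpg in a directory a spaced-out numeric prefix
# (010_, 020_, 030_, ... by default), in alphabetical order.
number_photos() (
  dir=$1
  step=${2:-10}
  n=$step
  for photo in "$dir"/*.jpg; do
    [ -e "$photo" ] || continue        # directory holds no .jpg files
    mv -- "$photo" "$dir/$(printf '%03d' "$n")_$(basename "$photo")"
    n=$((n + step))
  done
)
```

Run it once only, e.g. as number_photos ./photodoc; a second run would add a second prefix in front of the first.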

We’re ready to execute the script:

./pics2photodoc.sh -i ./photodoc -o ./photodocumentation
Result: ./photodocumentation.pdf

I need PDF bookmarks, aka a »Table of Contents«

The preferred way to include bookmarks is having the document-creating software add them. LibreOffice can do that.

While it is a bit tedious, you can also add bookmarks to any existing PDF file. This is achieved by using pdfMark operators.

Let’s assume the table of contents that you have in mind looks like this:

Computing at home (page 1)
1. Intro (page 2)
2. Computers (page 5)
2.1 History (page 7)
2.1.1 The rise of the PC (page 8)
2.1.2 The rise of the Notebook (page 11)
3. Smartphones (page 15)
... etc ...

First, prepare a text file that reflects this structure, and save it as pdfmarks.txt. The pdfMark standard mandates a specific syntax that is still human-readable:

[ /Title (Computing at home) /Page 1 /OUT pdfmark
[ /Title (1. Intro) /Page 2 /OUT pdfmark
[ /Title (2. Computers) /Page 5 /Count 1 /OUT pdfmark
[ /Title (2.1 History) /Page 7 /Count -2 /OUT pdfmark
[ /Title (2.1.1 The rise of the PC) /Page 8 /OUT pdfmark
[ /Title (2.1.2 The rise of the Notebook) /Page 11 /OUT pdfmark
[ /Title (3. Smartphones) /Page 15 /OUT pdfmark

Where

  • Title is followed by the bookmark title
  • Page is the target page for the bookmark
  • Count gives the number of child bookmarks. Putting a minus sign - in front of the number tells the PDF viewer to show it collapsed
The pdfMark standard offers way more than just bookmarks. Even for bookmarks, there are more options with respect to formatting. See the section Bookmarks (OUT) in Chapter 2 Basic Features of the pdfMark reference for details.
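
Counting child bookmarks by hand gets error-prone for longer outlines. As a sketch, the Count values can be derived from a simpler outline file. The input format used here (one line per bookmark, written as level|page|title, with 0 as the top level) and the function name outline_to_pdfmarks are purely our own assumptions. Titles are assumed to be plain ASCII without parentheses, backslashes or »|«, and all bookmarks are emitted expanded (positive Count).

```shell
# Sketch: turn lines of "level|page|title" into pdfmark bookmark entries.
# /Count is the number of immediate children, i.e. following entries that
# are exactly one level deeper, up to the next entry of the same or a
# higher level.
outline_to_pdfmarks() {
  awk -F'|' '
    { level[NR] = $1 + 0; page[NR] = $2; title[NR] = $3 }
    END {
      for (i = 1; i <= NR; i++) {
        kids = 0
        for (j = i + 1; j <= NR && level[j] > level[i]; j++)
          if (level[j] == level[i] + 1) kids++
        line = "[ /Title (" title[i] ") /Page " page[i]
        if (kids > 0) line = line " /Count " kids
        print line " /OUT pdfmark"
      }
    }' "$1"
}
```

Given an outline file, outline_to_pdfmarks outline.txt > pdfmarks.txt would produce entries in the syntax shown above; put a minus sign in front of a Count by hand if you want that branch to start collapsed.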

To add the bookmarks to a file document.pdf:

gs -o ./document-plus-bookmarks.pdf -dPDFA=2 -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sColorConversionStrategy=UseDeviceIndependentColor ./document.pdf ./pdfmarks.txt
Result: ./document-plus-bookmarks.pdf

3. Try to convert PDF files exported by applications into valid PDF/A files

Since you might have close to no control over what the »PDF export« feature of an application gives you, we’ll kind of shoehorn the original file into a new one that conforms to PDF/A. gs is our tool of choice here.

Best guesses, and caveats:

  • We’ll follow the recommendations of the PDF Association and use an RGB color space for the output intent, described by the sRGB ICC profile. The resulting PDF/A should display well on most devices.
  • It’s not possible to »fix« every feature in the original PDF file that does not conform to PDF/A standards. Therefore, gs will sometimes force-remove features. Unfortunately, this may mean that we lose some metadata as well. Also, gs will give many warnings on the terminal, in such cases.

Preparations:

  • Create a working directory, and cd into it
  • Download the freely available sRGB profile sRGB2014.icc from the website of the International Color Consortium, and place it in the working directory
  • Export a colorful sample page from an application into a PDF, e.g. a web site from your browser. Move it to the working directory, and rename it to input.pdf
  • gs comes with a sample prefix file that needs to be adapted to the chosen profile; e.g., for gs version 9.55.0, you can find the sample at /usr/share/ghostscript/9.55.0/lib/PDFA_def.ps. Copy it to your working directory as well, and modify the lines for ICCProfile and OutputConditionIdentifier as shown (leave everything else unchanged):
/ICCProfile (./sRGB2014.icc)
/OutputConditionIdentifier (sRGB2014)

You can now convert your PDF file to PDF/A:

gs -I . -o ./outputPDFA.pdf -dPDFA=2 -dPDFACompatibilityPolicy=1 -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sColorConversionStrategy=UseDeviceIndependentColor ./PDFA_def.ps ./input.pdf
Result: ./outputPDFA.pdf

Validate the result against PDF/A‑2b (see veraPDF, below). Here’s a short explanation of what some parameters do:

  • -I . extends the library search path of gs by the current directory, so sRGB2014.icc will be found
  • -dPDFA=2 requests PDF/A‑2b as output
  • -dPDFACompatibilityPolicy=1 permits gs to relentlessly drop all features from the original PDF file that can’t be made to validate against PDF/A.
When everything runs as expected, you can copy sRGB2014.icc to, e.g., /usr/share/color/icc. Make sure you adapt the -I parameter accordingly, as well as the profile’s path in PDFA_def.ps. As an alternative to -I, you can study how --permit-file-read works, and use that instead.
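
Since a missing prefix file or ICC profile is an easy mistake to make, the conversion can be wrapped in a function that checks the preparations first. This is a sketch under the assumptions of this section (working directory holds PDFA_def.ps and sRGB2014.icc); the function name pdf_to_pdfa and the GS override variable are hypothetical.

```shell
# Sketch: run the conversion shown above, but only after checking that the
# prerequisites from the preparation steps exist in the current directory.
pdf_to_pdfa() {
  input=$1
  output=$2
  for needed in ./PDFA_def.ps ./sRGB2014.icc "$input"; do
    if [ ! -f "$needed" ]; then
      echo "missing: $needed" >&2
      return 1
    fi
  done
  # Same parameters as in the command above; GS can point at another binary.
  "${GS:-gs}" -I . -o "$output" -dPDFA=2 -dPDFACompatibilityPolicy=1 \
    -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
    -sColorConversionStrategy=UseDeviceIndependentColor \
    ./PDFA_def.ps "$input"
}
```

Usage would be pdf_to_pdfa ./input.pdf ./outputPDFA.pdf, followed by a validation run, because a successful gs exit does not prove that the result is valid PDF/A.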

4. Last resort: »Print to PDF«

Most Linux distributions support the CUPS printing system, including CUPS-PDF, a virtual printer that creates PDF files. After you’ve installed CUPS-PDF according to the instructions for your distribution, make sure that a test print actually generates a PDF file before you venture into PDF/A creation.

CUPS-PDF transforms into a PDF whatever is delivered by the application. Most applications turn text into rasterized bitmaps, as soon as the visual structure gets »too complex« to be handled as mostly text stripes, positioned on a page. You end up with a PDF file containing several (or only) bitmaps of text, in many cases. CUPS-PDF isn’t to blame for that.

After the installation of CUPS-PDF, and a first test, we can reconfigure it to generate files that validate against PDF/A‑2b:

The configuration file to change is /etc/cups/cups-pdf.conf. Open it in your preferred editor.

Make sure that Out points to a reasonable output directory

E.g. to ~/PDF:

Out ${HOME}/PDF

Enforce PDF/A‑2b creation

Uncomment the GSCall line and add the following parameters:

  • -dPDFA=2
  • -dPDFACompatibilityPolicy=1
  • -sColorConversionStrategy=UseDeviceIndependentColor

Leave the rest of the line as-is (…):

GSCall %s -q -dPDFA=2 -dPDFACompatibilityPolicy=1 -sColorConversionStrategy=UseDeviceIndependentColor (...)
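
To double-check the edit without printing a test page, you can inspect the configuration file from the terminal. The following sketch assumes that the file contains a single active GSCall line; the function name cups_pdf_conf_ok is our own.

```shell
# Sketch: succeed if the active (uncommented) GSCall line of a CUPS-PDF
# configuration file contains all three parameters listed above.
cups_pdf_conf_ok() {
  conf=${1:-/etc/cups/cups-pdf.conf}
  # Commented-out lines start with "#", so they do not match here.
  line=$(grep -E '^[[:space:]]*GSCall[[:space:]]' "$conf" | tail -n 1)
  [ -n "$line" ] || return 1
  for flag in -dPDFA=2 -dPDFACompatibilityPolicy=1 \
              -sColorConversionStrategy=UseDeviceIndependentColor; do
    case " $line " in
      *" $flag "*) ;;            # parameter present as a separate word
      *) return 1 ;;
    esac
  done
  return 0
}
```

Running cups_pdf_conf_ok && echo "looks fine" only tells you that the parameters are in place; whether the printed files really validate still needs to be checked with veraPDF.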


… but I REALLY need a PDF file containing text that can be copied, not just bitmaps

Consider using OCRmyPDF for post-processing PDF files. In most cases, it produces outstanding results.

Let OCRmyPDF post-process a PDF file whose text is in English: allow it to redo the OCR, discarding any OCR content already in that file, and produce an output file that conforms to PDF/A‑2b.

ocrmypdf -l eng --redo-ocr --output-type pdfa-2 in.pdf out.pdf
Result: ./out.pdf


… but that still didn’t work

There is a brute-force approach that can be a last resort: have a script render all pages of a PDF file into bitmaps; compress them so the resulting PDF won’t become too large; use OCR to obtain the text; and finally re-assemble both text and bitmaps into a new PDF file.

We presented the principle in our first episode. For a refined, scripted version see unsearchable2searchablepdfa.sh.

Beyond just creating PDF/A files

I need to validate PDF files against PDF/A standards

veraPDF is a tool to validate files against the requirements of PDF/A versions 1–3, conformance levels B, U, and A. The tool is written in Java and offers both a GUI and a CLI.

If you don’t want to use its GUI, or need to automate the validation, here’s how to validate on the terminal:

Let veraPDF give a verbose report in text format on the file validation against flavor PDF/A-2b.

./veraPDF/verapdf -v -f 2b --format text ./LoremIpsum.pdf

Will print:

PASS ./LoremIpsum.pdf
If validation fails, all sections of the standard that were violated will be listed. Change the output format to mrr for more details.
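
To validate a whole directory in one go, the call above can be put into a loop. This sketch relies only on what is shown above, namely that the text report starts with PASS for a valid file; the function name validate_dir and the VERAPDF override variable are our own assumptions.

```shell
# Sketch: validate every PDF in a directory against one flavor and succeed
# only if all of them pass.
validate_dir() (
  dir=$1
  flavor=${2:-2b}
  failed=0
  for pdf in "$dir"/*.pdf; do
    [ -e "$pdf" ] || continue
    # The exit status is ignored on purpose: we only look at the report text.
    report=$("${VERAPDF:-./veraPDF/verapdf}" -f "$flavor" --format text "$pdf") || true
    case "$report" in
      PASS*) echo "ok:     $pdf" ;;
      *)     echo "FAILED: $pdf"; failed=$((failed + 1)) ;;
    esac
  done
  [ "$failed" -eq 0 ]
)
```

For example, validate_dir ./archive 2b prints one line per file and returns a non-zero status if any file did not pass, which makes it easy to use in other scripts.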

I need to digitally sign a PDF/A file

Adding signatures to a PDF file via pdfsig, as described here, requires Poppler >= v21.10.0. If your Linux distribution is not a rolling release, finding and installing a sufficiently new version might be a challenge.

For organizations or owners of an internet domain, digitally signing PDF/A files with a private key may be desirable. Commonly, you pay a Certification Authority (CA) that will verify you’re the owner of the domain, or a legal representative of your organization. At the end of the process, it’s common to receive a PKCS#12 keystore file (extension .p12, or .pfx) that contains both a private key, and the corresponding X.509 certificate.

To be ready for signing PDF/A files, you would import the keystore you received into a trusted certificate store on your server.

Setting up a dummy certificate store

For demonstration purposes, we’ll first create a dummy keystore file, and import it into a dummy certificate store. OpenSSL is the tool of choice for this.

You’ll be asked to set and use a variety of passwords. For this test (only!), don’t bother and keep using something like »password«, everywhere.
Create a dummy keystore file for a fictitious company:
openssl req -x509 -newkey rsa:4096 -subj '/C=DE/ST=Bavaria/L=Nuremberg/CN=www.example.com/O=ACME Inc.' -keyout theprivatekey.pem -out thecertificate.crt -days 3650 -nodes
openssl pkcs12 -export -out thekeystore.p12 -inkey theprivatekey.pem -in thecertificate.crt
Result: ./theprivatekey.pem, ./thecertificate.crt, ./thekeystore.p12

We’ve now got our dummy thekeystore.p12 that we’ll import into a dummy certificate store, so we can start signing PDF/A files:

Set up a dummy certificate store, and import the keystore file into it:
mkdir ./dummynss
certutil -N -d ./dummynss
pk12util -i thekeystore.p12 -d sql:./dummynss
Result: ./dummynss/cert9.db, ./dummynss/key4.db, ./dummynss/pkcs11.txt

Digitally signing a PDF/A file, and validating the signature

We’re now ready to start signing PDF/A files:

Sign LoremIpsum.pdf (please note that accessing the certificate store requires passing its password on the command line, after -nss-pwd):
pdfsig -nssdir ./dummynss -nss-pwd password -add-signature -nick "www.example.com - ACME Inc." ./LoremIpsum.pdf ./LoremIpsum-signed.pdf
Result: ./LoremIpsum-signed.pdf

Let’s validate the signature:

Validate the signature of LoremIpsum-signed.pdf:
pdfsig ./LoremIpsum-signed.pdf

Will print an output similar to this:

Digital Signature Info of: ./LoremIpsum-signed.pdf
Signature #1:
  - Signer Certificate Common Name: www.example.com
  - Signer full Distinguished Name: O=ACME Inc.,CN=www.example.com,L=Nuremberg,ST=Bavaria,C=DE
  - Signing Time: Nov 20 2021 14:25:39
  - Signing Hash Algorithm: SHA-256
  - Signature Type: adbe.pkcs7.detached
  - Signed Ranges: [0 - 46812], [54652 - 55032]
  - Total document signed
  - Signature Validation: Signature is Valid.
  - Certificate Validation: Certificate issuer isn't Trusted.

Since our dummy certificate is self-signed only, the certificate issuer is being reported as untrusted, of course.
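
If you want to check signatures from a script, you can look for the validation line in the report shown above. This is only a sketch: it assumes that the wording »Signature is Valid.« stays as in the sample output, it deliberately ignores the certificate validation line, and the function name signatures_valid as well as the PDFSIG override variable are hypothetical.

```shell
# Sketch: succeed only if pdfsig reports at least one signature and every
# "Signature Validation:" line reads "Signature is Valid."
# Certificate trust is a separate question and is not checked here.
signatures_valid() {
  report=$("${PDFSIG:-pdfsig}" "$1") || true
  total=$(printf '%s\n' "$report" | grep -c 'Signature Validation:' || true)
  valid=$(printf '%s\n' "$report" |
    grep -c 'Signature Validation: Signature is Valid\.' || true)
  [ "$total" -gt 0 ] && [ "$total" -eq "$valid" ]
}
```

A call like signatures_valid ./LoremIpsum-signed.pdf && echo "signature ok" would then succeed for our dummy-signed file, even though its issuer is reported as untrusted.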

After a PDF/A file has been digitally signed, no further modifications to it are allowed. Signature validation will detect any tampering with the document.
There is no visual indicator of the digital signature, although PDF viewers like Okular will report such files as being digitally signed. The reason is that there is no default »empty« space on a sheet where such an indicator could be placed, without potentially hiding any document content.
Okular can add a digital signature plus a visual indicator, in a rectangle anywhere on the sheet, that you can specify. You can even configure it to use the dummy certificate store we’ve created above (set it under Settings > Configure Backends... > PDF). Caveat: the font used for the visual indicator may not be embedded in the original PDF/A, since the original content might not have used that font at all. Hence, the end result is a PDF file with a valid digital signature, but it’s no longer a valid PDF/A file.

I need to disable copy & paste for this PDF file

Yes, some software packages are offering options for that. Just forget it: Whatever can be displayed, can ultimately be copied, too. Our first episode even has a section for dealing with »stubborn« PDF files.

Image Credits:
1 us dollar bill (modified) | Photo by Kirk Cameron on Unsplash

Licensing:
This content is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
For your attributions to us please use the word »tuxwise«, and the link https://tuxwise.net.