So, you've got a handful of files and need to turn them into PDFs: files that don't just look good on your device but everywhere, and that are suitable for long-term archiving.
Here's what you can do.
A few caveats, aka: »management of expectations«
Background: Not all PDFs are created equal
What is commonly referred to as the »PDF standard« is actually a hodgepodge of features, some of them proprietary; the current version is PDF 1.7 (ISO 32000-1:2008). Most PDF software packages don't even come close to supporting all of those features. A cleaned-up, newer version of the standard is PDF 2.0 (ISO 32000-2:2020), which isn't widespread yet (and, paradoxically, can't be downloaded for free, unlike its predecessor).
With »the« PDF standard being such a grab bag, several standards with narrower focuses have been defined. Typically, they require a minimum of supported »PDF« features, but also forbid the use of several others.
A few examples: PDF/E focuses on engineering purposes, e.g. allowing for interactive 3D visualization. The PDF/X family of standards focuses on PDFs as input for commercial printing. PDF/UA focuses on accessibility, especially by supporting assistive technologies.
We're focusing on PDF/A here.
What's »PDF/A«?
»A« stands for »Archive«: PDF/A is a set of restrictions imposed on PDF files, to ensure they'll still be usable 50 years from today.
Which essentially means two things: First, all that is required to display the content is contained within the PDF file, including the fonts used. Any existing or future device with a PDF viewer should do, and what is displayed should look identical across devices.
Second, anything that puts the »archive« purpose at risk is forbidden: no embedded executable code, no encryption, no external links, no multimedia, nothing else that might be causing trouble, in a distant future.
How many flavors of PDF/A are there?
As you may have guessed, the mess is starting here. PDF/A comes in several versions (1, 2, or 3) that define the available features. Conformance levels (B, U, and A; U for versions 2 and 3 only) define implementation requirements with respect to those features. E.g., a PDF/A-2B file follows version 2 of the standard, and conforms to level B.
PDF/A versions:
- PDF/A-1 (ISO 19005-1:2005): the oldest and most widely supported version. Cannot handle transparency, cannot embed JPEG2000 images, and cannot have any other files embedded as attachments. Many software packages work around such issues when exporting to PDF/A, mainly by converting graphics to supported formats. That isn't as bad as it sounds; results still look fine.
- PDF/A-2 (ISO 19005-2:2011): a greatly modernized version that may lead to files that are incompatible with viewers capable of displaying PDF/A-1 files only. Adds support for JPEG2000, transparency, layers, specific digital signatures, and embedded PDF/A files as attachments.
- PDF/A-3 (ISO 19005-3:2012): a minor update of version 2, additionally allowing for arbitrary embedded file attachments (which caused some controversy, not only among archivists).
PDF/A conformance levels:
- Level B (Basic conformance): essentially, the document is visually reproducible; it looks the same on all viewers. That doesn't mean its internal structure reflects anything about »words«, or »sentences«, or »paragraphs«: it's all about looks, only.
You may liken a level B PDF/A file to a blackmail letter, carefully crafted from newspaper snippets to make it look like a document; technically, it's still just a bunch of paper patches that each have coordinates on the paper sheet, but no real relationships among themselves. Software trying to convert a level B PDF/A document into a plain text file, or into speech, is out of luck, because the file content is merely patches and their coordinates. Which means: Level B files are not accessible.
- Level U (Unicode conformance): extends level B by requiring that all text in the file can be mapped to Unicode. This resolves problems like searching for words with »foreign characters«.
- Level A (Accessible conformance): in addition to Level B, some metadata is required, like information about language, hierarchy (»heading levels«), text spans (»reading order« of the letters). Essentially, that additional metadata turns a mere collage of unrelated letters into a »document«.
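The version and level rules above are easy to check mechanically. The following is only a sketch: the function name pdfa_flavor_ok is made up for illustration, and it encodes nothing beyond the combinations described in this article (versions 1 to 3; level U for versions 2 and 3 only).

```shell
# Hypothetical helper (not part of any tool): succeed if a flavor string
# such as "2b" is one of the version/level combinations described above.
pdfa_flavor_ok() {
  # Lowercase the input so that "2B" and "2b" are treated alike.
  flavor=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  version=${flavor%?}   # everything except the last character
  level=${flavor#?}     # everything except the first character
  case "$version" in
    1)   case "$level" in a|b)   return 0 ;; esac ;;
    2|3) case "$level" in a|b|u) return 0 ;; esac ;;
  esac
  return 1
}

for f in 1a 1u 2b 3u; do
  if pdfa_flavor_ok "$f"; then
    echo "PDF/A-$f: known flavor"
  else
    echo "PDF/A-$f: not a valid combination"
  fi
done
```

Such a check can be handy before passing a flavor on to an export or validation command.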
Which flavor of PDF/A should I choose, then?
Our recommendations for creating PDF/A files:
- In general, you cannot rely on software packages to produce valid PDF/A files, under all circumstances. By all means validate your results (see below).
- Prefer compliance level A over U over B, to include as much useful metadata as possible in the resulting PDF file. We expressly deplore the current prevalence of PDF/A-2b, because we consider it an electronic equivalent of scrapbook archiving. Level B fails to reasonably support the quick, meaningful electronic processing of PDF content.
- Go for PDF/A-2. Avoid PDF/A-3, because its option to embed »arbitrary« files may annihilate the purpose of a long-term PDF archive, and is rightfully frowned upon by many.
How to validate a file against the standards?
Use veraPDF (see below).
Boot into Linux
Boot into a fairly recent Linux distribution. For this episode, we've used Manjaro 21.2.0 KDE, and cross-checked our results (except digitally signing PDF files, see the respective section for details) on Ubuntu 21.04 (Hirsute Hippo).
Open a terminal and you're ready to go.
Tools used, and their packages
- certutil, pk12util, pdfsig from Poppler >= v21.10.0: nss
- convert: imagemagick
- CUPS-PDF: printer-driver-cups-pdf or cups-pdf
- gs: ghostscript
- a Java Runtime Environment (JRE) >= v11, e.g. OpenJDK
- OCRmyPDF: see installation notes
- openssl: openssl
- optipng: optipng
- tesseract: tesseract-ocr or tesseract (+ tesseract-ocr-fra, tesseract-ocr-spa, … or tesseract-data-fra, tesseract-data-spa, …)
- unoconv: unoconv (please note you also need OpenJDK >= 11)
- veraPDF: from their website, or via AUR
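Before starting, it can save time to check which of these command-line tools are already on your PATH. A minimal sketch: the function name missing_tools is made up, the Java runtime is assumed to provide a command called java, and veraPDF is left out because this article runs it from its own installation directory.

```shell
# Hypothetical helper: print the names of all given commands
# that cannot be found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Command names as used in this article; "java" is an assumption.
missing=$(missing_tools certutil pk12util pdfsig convert gs java ocrmypdf \
                        openssl optipng tesseract unoconv)
if [ -n "$missing" ]; then
  echo "Not found on PATH:" $missing
else
  echo "All command-line tools found."
fi
```

Anything reported as missing can then be installed via the package named in the list above.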
Four PDF/A creation strategies
Obviously, the order of preference is:
- Prefer office software that can export to PDF/A
- Script PDF/A file creation from your source files (if effort is acceptable)
- Try to convert PDF files exported by applications into valid PDF/A files
- Last resort: »Print to PDF« (followed by OCR, if necessary)
1. Prefer office software that can export to PDF/A
While generic PDF export capabilities are surprisingly common, only a few software packages can export directly to PDF/A.
I need PDF/A versions of my LibreOffice documents
By all means, do not print into a PDF, but export to PDF (see below for why »Print to PDF« is only a second choice). Here's what you get in return for exporting instead of »printing to a PDF printer«:
- Selectable text everywhere.
- A »table of contents«, aka bookmarks, in the resulting PDF file.
File > Export as > Export as PDF..., then enable [x] Archive (PDF/A, ISO 19005) and [x] Export outlines. Finally, select the best available version/compliance combination, currently PDF/A-2b.
… but I need to automate that, from the terminal
As a sample file, we'll use LoremIpsum.odt
Since libreoffice does not accept all of the required export filter parameters, we'll use unoconv, which will in turn use libreoffice, to do its work.
We'll create a PDF/A file of version 2 (conformance level B is implied); ExportBookmarks creates bookmarks for table-of-contents entries; we'll limit image resolution to max. 600 DPI and set image quality to 100 %.
- First, close all running instances of LibreOffice.
- Then, run:
unoconv -f pdf -e SelectPdfVersion=2 -e ExportBookmarks=true -e MaxImageResolution=600 -e ReduceImageResolution=true -e Quality=100 LoremIpsum.odt
Result: ./LoremIpsum.pdf
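If you have more than one document, the same command can be wrapped in a loop. This is a sketch under assumptions: the helper names run and odt_dir_to_pdfa and the DRY_RUN switch are made up; only the unoconv options are taken from the command above.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, run it otherwise.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Convert every .odt file in the given directory, using the export
# options from the command above.
odt_dir_to_pdfa() {
  for doc in "$1"/*.odt; do
    [ -e "$doc" ] || continue   # no .odt files: the glob stays unexpanded
    run unoconv -f pdf \
      -e SelectPdfVersion=2 -e ExportBookmarks=true \
      -e MaxImageResolution=600 -e ReduceImageResolution=true \
      -e Quality=100 "$doc" || echo "Conversion failed: $doc" >&2
  done
}

# Dry run on the current directory; set DRY_RUN=0 to actually convert.
DRY_RUN=1
odt_dir_to_pdfa .
```

As with the single-file command, close all running instances of LibreOffice first, and validate every resulting file.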
2. Create PDF/A files from other files
I need a PDF/A photo documentation from a grab bag of snapshots
We'll create a »slideshow« PDF that starts full screen by default, can be printed on A4 paper sheets, and validates against PDF/A-2B.
Let's start by preparing a directory holding four selected photographs that have wildly differing sizes / resolutions:
»Shoreline during golden hour«
»Brown wooden houses near body of water under blue sky«
»Cobblestone street with reddish old buildings«
»Mountain sea of clouds«
Credits: selective focus photography of shoreline during golden hour | Photo by Ishan @seefromthesky on Unsplash · brown wooden houses near body of water under blue sky during daytime | Photo by Divya Agrawal on Unsplash · people walking on sidewalk between buildings during daytime | Photo by Mohammed Ajwad on Unsplash · mountain sea of clouds | Photo by Mohammed Ajwad on Unsplash
The script used to create a PDF/A »slideshow«:
- Copy your selection of photographs into a separate directory. In our example, it's ./photodoc.
- Rename them to something meaningful. The names will be used as bookmarks in the table of contents (TOC). In our example, we'll keep the original names.
- Decide upon the order and accordingly prefix the file names with numbers. Don't use consecutive numbers (1, 2, 3, …) but leave some space (e.g., 10, 20, 30, …): This will allow for adding new photos in between, and for easy reordering, in case you change your mind. In our example, let's assume we'll end up with:
1 mohammed-ajwad-YY_zYDn4T5g-unsplash.jpg
10 fotografu-3WdUBdr9Pxw-unsplash.jpg
20 ishan-seefromthesky-3eP5-K5DQ9A-unsplash.jpg
100 divya-agrawal-qa8VhqvJGIo-unsplash.jpg
We're ready to execute the script:
./pics2photodoc.sh -i ./photodoc -o ./photodocumentation
Result: ./photodocumentation.pdf
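The numbering step from the preparation list can be scripted as well. A minimal sketch, assuming you pass the photos in the order you want them to appear; the function name number_in_steps is made up, and the "number, space, name" pattern merely mimics the file names shown above.

```shell
# Hypothetical helper: prefix the given files, in the order given,
# with 10, 20, 30, ... so new photos can be slotted in between later.
number_in_steps() {
  n=10
  for photo in "$@"; do
    dir=$(dirname "$photo")
    base=$(basename "$photo")
    mv -- "$photo" "$dir/$n $base"
    n=$((n + 10))
  done
}

# Example (renames files, so it is left commented out here):
# number_in_steps ./photodoc/mountains.jpg ./photodoc/shoreline.jpg
```

The file names in the example call are placeholders; the function only renames, it does not touch the image content.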
I need PDF bookmarks, aka a »Table of Contents«
The preferred way to include bookmarks is having the document-creating software add them. LibreOffice can do that.
While it is a bit tedious, you can also add bookmarks to any existing PDF file. This is achieved by using pdfMark operators.
Let's assume the table of contents that you have in mind looks like this:
Computing at home (page 1)
1. Intro (page 2)
2. Computers (page 5)
  2.1 History (page 7)
    2.1.1 The rise of the PC (page 8)
    2.1.2 The rise of the Notebook (page 11)
3. Smartphones (page 15)
... etc ...
First, prepare a text file that reflects this structure, and save it as pdfmarks.txt. The pdfMark standard mandates a specific syntax that is still human-readable:
[ /Title (Computing at home) /Page 1 /OUT pdfmark
[ /Title (1. Intro) /Page 2 /OUT pdfmark
[ /Title (2. Computers) /Page 5 /Count 1 /OUT pdfmark
[ /Title (2.1 History) /Page 7 /Count -2 /OUT pdfmark
[ /Title (2.1.1 The rise of the PC) /Page 8 /OUT pdfmark
[ /Title (2.1.2 The rise of the Notebook) /Page 11 /OUT pdfmark
[ /Title (3. Smartphones) /Page 15 /OUT pdfmark
Where
- Title is followed by the bookmark title
- Page is the target page for the bookmark
- Count gives the number of child bookmarks. Putting a minus sign - in front of the number tells the PDF viewer to show it collapsed
The pdfMark standard offers way more than just bookmarks. Even for bookmarks, there are more options with respect to formatting. See the section Bookmarks (OUT) in Chapter 2 Basic Features of the pdfMark reference for details.
To add the bookmarks to a file document.pdf, pass pdfmarks.txt to gs as an additional input after the PDF:
gs -o ./document-plus-bookmarks.pdf -dPDFA=2 -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sColorConversionStrategy=UseDeviceIndependentColor ./document.pdf ./pdfmarks.txt
Result: ./document-plus-bookmarks.pdf
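Counting child bookmarks by hand gets error-prone for longer outlines. The sketch below generates the entries from a simple tab-separated list; the input format (depth, page, title) and the function name outline_to_pdfmarks are assumptions made for this example, not part of the pdfMark standard.

```shell
# Hypothetical helper: read "depth<TAB>page<TAB>title" lines on stdin and
# print pdfmark bookmark entries. /Count is the number of direct children
# (positive, i.e. shown expanded). Titles are assumed to contain no
# parentheses or backslashes; those would need escaping.
outline_to_pdfmarks() {
  awk -F '\t' '
    { depth[NR] = $1 + 0; page[NR] = $2 + 0; title[NR] = $3 }
    END {
      for (i = 1; i <= NR; i++) {
        kids = 0
        # Walk the entries nested below entry i, counting direct children only.
        for (j = i + 1; j <= NR && depth[j] > depth[i]; j++)
          if (depth[j] == depth[i] + 1) kids++
        printf "[ /Title (%s) /Page %d", title[i], page[i]
        if (kids > 0) printf " /Count %d", kids
        print " /OUT pdfmark"
      }
    }'
}

# A shortened version of the outline used above.
printf '0\t1\tComputing at home\n0\t5\t2. Computers\n1\t7\t2.1 History\n' |
  outline_to_pdfmarks
```

Redirect the output into pdfmarks.txt, and change a count to a negative number by hand wherever a branch should start collapsed.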
3. Try to convert PDF files exported by applications into valid PDF/A files
Since you might have close to no control over what the »PDF export« feature of an application gives you, we'll kind of shoehorn the original file into a new one that conforms to PDF/A. gs is our tool of choice here.
Best guesses, and caveats:
- We'll follow the recommendations of the PDF Association and use an RGB color space for the output intent, described by the sRGB ICC profile. The resulting PDF/A should display well on most devices.
- It's not possible to »fix« every feature in the original PDF file that does not conform to PDF/A standards. Therefore, gs will sometimes force-remove features. Unfortunately, this may mean that we lose some metadata as well. Also, gs will give many warnings on the terminal in such cases.
Preparations:
- Create a working directory, and cd into it
- Download the freely available sRGB profile sRGB2014.icc from the website of the International Color Consortium, and place it in the working directory
- Export a colorful sample page from an application into a PDF, e.g. a web site from your browser. Move it to the working directory, and rename it to input.pdf
- gs comes with a sample prefix file that needs to be adapted to the chosen profile; e.g., for gs version 9.55.0, you can find the sample at /usr/share/ghostscript/9.55.0/lib/PDFA_def.ps. Copy it to your working directory as well, and modify the lines for ICCProfile and OutputConditionIdentifier as shown (leave everything else unchanged):
/ICCProfile (./sRGB2014.icc)
/OutputConditionIdentifier (sRGB2014)
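The last preparation step can also be done with sed instead of an editor. This is a sketch only: the function name adapt_pdfa_def is made up, and it assumes that the two lines in your copy of PDFA_def.ps have the shape "/ICCProfile (...)" and "/OutputConditionIdentifier (...)", which may differ between gs versions.

```shell
# Hypothetical helper: print a copy of the given prefix file with the
# profile path and output condition identifier set as shown above.
# Assumption: both entries are written as "/Name (value)" on one line.
adapt_pdfa_def() {
  sed -e 's|/ICCProfile ([^)]*)|/ICCProfile (./sRGB2014.icc)|' \
      -e 's|/OutputConditionIdentifier ([^)]*)|/OutputConditionIdentifier (sRGB2014)|' \
      "$1"
}

# Example (path as given for gs 9.55.0 above):
# adapt_pdfa_def /usr/share/ghostscript/9.55.0/lib/PDFA_def.ps > ./PDFA_def.ps
```

Compare the result with the original (e.g. with diff) to confirm that only those two lines changed.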
You can now convert your PDF file to PDF/A:
gs -I . -o ./outputPDFA.pdf -dPDFA=2 -dPDFACompatibilityPolicy=1 -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sColorConversionStrategy=UseDeviceIndependentColor ./PDFA_def.ps ./input.pdf
Result: ./outputPDFA.pdf
Validate the result against PDF/A-2b (see veraPDF, below). Here's a short explanation of what some parameters do:
- -I . extends the library search path of gs by the current directory, so sRGB2014.icc will be found
- -dPDFA=2 requests PDF/A-2b as output
- -dPDFACompatibilityPolicy=1 permits gs to relentlessly drop all features from the original PDF file that can't be made to validate against PDF/A.
When everything runs as expected, you can copy sRGB2014.icc to, e.g., /usr/share/color/icc. Make sure you adapt the -I parameter accordingly, as well as the profile's path in PDFA_def.ps. As an alternative to -I, you can study how --permit-file-read works, and use that instead.
4. Last resort: »Print to PDF«
Most Linux distributions support the CUPS printing system, including CUPS-PDF, a virtual printer that creates PDF files. After you've installed CUPS-PDF according to the instructions for your distribution, make sure that a test print actually generates a PDF file before you venture into PDF/A creation.
After the installation of CUPS-PDF, and a first test, we can reconfigure it to generate files that validate against PDF/A-2b:
The configuration file to change is /etc/cups/cups-pdf.conf. Open it in your preferred editor.
- Make sure that Out points to a reasonable output directory, e.g.:
Out ${HOME}/PDF
- Enforce PDF/A-2b creation: uncomment the GSCall line and add the following parameters:
-dPDFA=2
-dPDFACompatibilityPolicy=1
-sColorConversionStrategy=UseDeviceIndependentColor
Leave the rest of the line as-is (…):
GSCall %s -q -dPDFA=2 -dPDFACompatibilityPolicy=1 -sColorConversionStrategy=UseDeviceIndependentColor (...)
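If you prefer to script the GSCall change, a sed one-liner can do it. A sketch under assumptions: the function name enable_pdfa_gscall is made up, and the commented-out stock line is assumed to start with "#GSCall %s -q", which may not hold for every CUPS-PDF version.

```shell
# Hypothetical helper: print a copy of the given cups-pdf.conf with the
# GSCall line uncommented and the PDF/A parameters added right after "-q".
# Assumption: the stock line starts with "#GSCall %s -q".
enable_pdfa_gscall() {
  sed -e 's|^#[[:space:]]*GSCall %s -q|GSCall %s -q -dPDFA=2 -dPDFACompatibilityPolicy=1 -sColorConversionStrategy=UseDeviceIndependentColor|' "$1"
}

# Example: write the modified copy next to you, review it, then install it
# with root privileges.
# enable_pdfa_gscall /etc/cups/cups-pdf.conf > ./cups-pdf.conf.new
```

The function deliberately writes to standard output, so the system file is never modified in place.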
…but I REALLY need a PDF file containing text that can be copied, not just bitmaps
Let OCRmyPDF post-process a PDF file that is in the English language (-l eng), allow it to redo the OCR (--redo-ocr) and discard OCR content already in that file, and produce an output file that conforms to PDF/A-2b:
ocrmypdf -l eng --redo-ocr --output-type pdfa-2 in.pdf out.pdf
Result: ./out.pdf
…but that still didn't work
There is a brute-force approach that can be a last resort: have a script render all pages of a PDF file into bitmaps; compress them so the resulting PDF won't become too large; use OCR to obtain the text; and finally re-assemble both text and bitmaps into a new PDF file.
We presented the principle in our first episode. For a refined, scripted version see unsearchable2searchablepdfa.sh.
Beyond just creating PDF/A files
I need to validate PDF files against PDF/A standards
veraPDF is a tool to validate files against the requirements of PDF/A versions 1 to 3, compliance levels B, A and U. The tool is written in Java and offers both a GUI and a CLI.
If you don't want to use its GUI, or need to automate the validation, here's how to validate on the terminal:
Let veraPDF give a verbose (-v) report in text format on the file validation against flavor (-f) PDF/A-2b:
./veraPDF/verapdf -v -f 2b --format text ./LoremIpsum.pdf
Will print:
PASS ./LoremIpsum.pdf
If validation fails, all sections of the standard that were violated will be listed. Change the output format to mrr for more details.
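For automation, the text report can be turned into an exit status. A minimal sketch, assuming the report begins with a "PASS <file>" line as shown above; the function name verapdf_passed is made up, and anything other than a leading PASS is treated as a failure.

```shell
# Hypothetical helper: succeed only if a veraPDF text report (read from
# stdin) starts with the word PASS. Assumes the line shape shown above.
verapdf_passed() {
  verdict=
  read -r verdict _rest || true
  [ "$verdict" = "PASS" ]
}

# Demonstration with the sample line from above; in practice, pipe the
# output of the verapdf command into the function instead.
if echo "PASS ./LoremIpsum.pdf" | verapdf_passed; then
  echo "report says: valid"
fi
```

This lets a conversion script stop as soon as one of its output files fails validation.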
I need to digitally sign a PDF/A file
pdfsig, as described here, requires Poppler >= v21.10.0. If your Linux distribution is not a rolling release, finding and installing a sufficiently new version might be a challenge.
For organizations or owners of an internet domain, digitally signing PDF/A files with a private key may be desirable. Commonly, you pay a Certification Authority (CA) that will verify you're the owner of the domain, or a legal representative of your organization. At the end of the process, it's common to receive a PKCS#12 keystore file (extension .p12, or .pfx) that contains both a private key, and the corresponding X.509 certificate.
To be ready for signing PDF/A files, you would import the keystore you received into a trusted certificate store on your server.
Setting up a dummy certificate store
For demonstration purposes, we'll first create a dummy keystore file, and import it into a dummy certificate store. OpenSSL is the tool of choice for this.
Create a dummy keystore file for a fictitious company:
openssl req -x509 -newkey rsa:4096 -subj '/C=DE/ST=Bavaria/L=Nuremberg/CN=www.example.com/O=ACME Inc.' -keyout theprivatekey.pem -out thecertificate.crt -days 3650 -nodes
openssl pkcs12 -export -out thekeystore.p12 -inkey theprivatekey.pem -in thecertificate.crt
Result: ./theprivatekey.pem, thecertificate.crt, thekeystore.p12
We've now got our dummy thekeystore.p12 that we'll import into a dummy certificate store, so we can start signing PDF/A files:
Set up a dummy certificate store, and import the keystore file into it:
mkdir ./dummynss
certutil -N -d ./dummynss
pk12util -i thekeystore.p12 -d sql:./dummynss
Result: ./dummynss/cert9.db, ./dummynss/key4.db, ./dummynss/pkcs11.txt
Digitally signing a PDF/A file, and validating the signature
We're now ready to start signing PDF/A files:
Sign LoremIpsum.pdf (please note that accessing the certificate keystore requires passing the password on the command line, after -nss-pwd):
pdfsig -nssdir ./dummynss -nss-pwd password -add-signature -nick "www.example.com - ACME Inc." ./LoremIpsum.pdf ./LoremIpsum-signed.pdf
Result: ./LoremIpsum-signed.pdf
Let's validate the signature:
Validate the signature of LoremIpsum-signed.pdf:
pdfsig ./LoremIpsum-signed.pdf
Will print an output similar to this:
Digital Signature Info of: ./LoremIpsum-signed.pdf
Signature #1:
  - Signer Certificate Common Name: www.example.com
  - Signer full Distinguished Name: O=ACME Inc.,CN=www.example.com,L=Nuremberg,ST=Bavaria,C=DE
  - Signing Time: Nov 20 2021 14:25:39
  - Signing Hash Algorithm: SHA-256
  - Signature Type: adbe.pkcs7.detached
  - Signed Ranges: [0 - 46812], [54652 - 55032]
  - Total document signed
  - Signature Validation: Signature is Valid.
  - Certificate Validation: Certificate issuer isn't Trusted.
Since our dummy certificate is self-signed only, the certificate issuer is being reported as untrusted, of course.
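In a script, you will probably want a yes/no answer rather than a report to read. The sketch below scans pdfsig output for the »Signature Validation« lines; the function name signatures_valid is made up, and it relies on the output wording shown above, which may change between Poppler versions. It deliberately ignores the certificate trust line.

```shell
# Hypothetical helper: succeed if pdfsig output (read from stdin) contains
# at least one "Signature Validation" line and every such line reports
# "Signature is Valid." Certificate trust is not checked here.
signatures_valid() {
  awk '
    /- Signature Validation:/ {
      seen++
      if ($0 !~ /Signature is Valid\./) bad++
    }
    END { exit (seen > 0 && bad == 0) ? 0 : 1 }
  '
}

# Demonstration with two lines from the sample output above; in practice,
# pipe the output of pdfsig into the function instead.
printf '%s\n' \
  "  - Signature Validation: Signature is Valid." \
  "  - Certificate Validation: Certificate issuer isn't Trusted." |
  signatures_valid && echo "all signatures valid"
```

With the self-signed dummy certificate, this reports success even though the issuer is untrusted, which matches what we want to test here.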
Settings > Configure Backends... > PDF). Caveat: the font used for the visual indicator may not be embedded in the original PDF/A, since the original content might not have used that font at all. Hence, the end result is a PDF file with a valid digital signature, but it's no longer a valid PDF/A file.
I need to disable copy & paste for this PDF file
Yes, some software packages are offering options for that. Just forget it: Whatever can be displayed, can ultimately be copied, too. Our first episode even has a section for dealing with »stubborn« PDF files.
Image Credits:
1 us dollar bill (modified) | Photo by Kirk Cameron on Unsplash
Licensing:
This content is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
For your attributions to us please use the word »tuxwise«, and the link https://tuxwise.net.