So, all you’ve got is a PDF file full of information – but you need (some of) its content in a different format.
Here’s what you can do.
A few caveats, aka: »management of expectations«
Humans can make sense of the visual layout of a PDF page, but a simple copy & paste of the text will often lead to disappointing results.
While there is no perfect solution for this, some tools can make a pretty good guess on how the text flows in a PDF. Be ready for some post-processing, though.
Boot into Linux
Boot into a fairly recent Linux distribution. For this episode, we’ve used a plain vanilla, mainstream Kubuntu 20.04 LTS, and cross-checked our results on Manjaro 21.0.7 XFCE.
Open a terminal and you’re ready to go.
Tools used, and their packages
convert, import, montage: imagemagick
gs: ghostscript
inkscape: inkscape
pandoc: pandoc
pdf2svg: pdf2svg
pdfcrop: texlive-extra-utils or texlive-core
pdftk: pdftk
pdftoppm, pdftotext, pdfimages: poppler-utils
tesseract: tesseract-ocr or tesseract (+ tesseract-ocr-fra, tesseract-ocr-spa, … or tesseract-data-fra, tesseract-data-spa, …)
xdotool: xdotool
Sample files used
One-liners. Or two.
I need the text only
As a sample file to start with, we’ll be using udhr.pdf:
Complete PDF, as plain text
pdftotext udhr.pdf
udhr.txt
Tip:
- Add -nopgbrk to your command if you don’t want the original page breaks in the resulting text file.
… but I need only parts of the PDF
Only pages 2 (first) to 5 (last), as plain text
pdftotext -f 2 -l 5 udhr.pdf udhr-p2-5.txt
udhr-p2-5.txt
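If you extract page ranges often, a small wrapper can save some typing. The following is only a sketch: the function names range_txt_name and pdf_range_to_text are our own invention, not part of poppler-utils, and the output name simply follows the udhr-p2-5.txt pattern used above.

```shell
# Hypothetical helper: derive the output file name for a page range,
# e.g. udhr.pdf, pages 2 to 5 -> udhr-p2-5.txt (written to the current directory).
range_txt_name() {
    printf '%s-p%s-%s.txt\n' "$(basename "$1" .pdf)" "$2" "$3"
}

# Hypothetical wrapper around the pdftotext call shown above.
# Arguments: PDF file, first page, last page.
pdf_range_to_text() {
    pdftotext -f "$2" -l "$3" "$1" "$(range_txt_name "$1" "$2" "$3")"
}

# Usage (needs poppler-utils and the sample file):
#   pdf_range_to_text udhr.pdf 2 5
```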
Shorten the PDF to a few ranges of pages, before the conversion
To keep only p. 2-3, 5, and 7-8:
pdftk udhr.pdf cat 2-3 5 7-8 output udhr-shortened.pdf
udhr-shortened.pdf
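Before running pdftk, you may want to double-check which pages a range list actually covers. Here is a minimal sketch for that; expand_ranges is a hypothetical helper of ours, and it only understands plain page numbers and simple first-last ranges, not the other range keywords that pdftk accepts.

```shell
# Hypothetical helper: expand simple page specs like "2-3 5 7-8"
# into the individual page numbers they cover.
expand_ranges() {
    out=""
    for spec in "$@"; do
        case "$spec" in
            *-*) first=${spec%-*}; last=${spec#*-} ;;   # range: first-last
            *)   first=$spec; last=$spec ;;             # single page
        esac
        out="$out $(seq "$first" "$last")"
    done
    # Unquoted on purpose: collapses newlines and spaces into single spaces.
    echo $out
}

# Usage:
#   expand_ranges 2-3 5 7-8
```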
Crop away undesirable footers, headers, left, or right margins, before the conversion
With your own PDF files, please experiment with margin sizes until the cropped PDF pages no longer contain undesirable text. Keep a blank margin of 10 pt on all sides that don’t carry undesired content.
Our sample file GG.pdf contains header and footer text that we want to get rid of, before the conversion.
In our case, we’ll cut both header and footer by -20 pt, but leave left and right margins of 10 pt each.
pdfcrop --margins '10 -20 10 -20' --hires --clip GG.pdf
GG-crop.pdf
… but I need real paragraphs, and soft line breaks
Introduce soft line breaks and paragraphs in the result
Let pandoc have a try at that.
Please note that while this improves things a lot, you’ll most likely have to manually insert a few hard line breaks.
We’ll convert UN-Introduction.pdf, a sample file that shows a mixed layout of one and two columns:
pdftotext UN-Introduction.pdf
UN-Introduction.txt
pandoc -t plain --wrap=none -o UN-Introduction-softbreaks.txt UN-Introduction.txt
UN-Introduction-softbreaks.txt
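To get a feeling for what »soft line breaks« means here, the following sketch joins all lines of a paragraph into one long line, treating blank lines as paragraph separators. It is not how pandoc works internally (pandoc parses the text as a document and may interpret special characters), and unwrap_paragraphs is just a name we made up for illustration.

```shell
# Hypothetical helper: read text from stdin, join the lines of each
# paragraph into a single line, keep one blank line between paragraphs.
unwrap_paragraphs() {
    # RS = "" switches awk to paragraph mode: records are separated by blank lines.
    awk 'BEGIN { RS = ""; ORS = "\n\n" } { gsub(/\n/, " "); print }'
}

# Usage:
#   unwrap_paragraphs < UN-Introduction.txt > UN-Introduction-unwrapped.txt
```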
… but I need a wordprocessor file
Convert the text of the PDF into .odt. Sort of.
Please note that you’ll most likely have to manually insert hard line breaks. Formatting headings is also manual work. Images can be pulled out of the PDF as well (see below), and re-inserted.
Current versions of pandoc may produce buggy .odt files that cannot be opened, so we’re converting to .docx here, instead. To test this yourself, change docx to odt in the pandoc command below.
We’ll convert UN-Introduction.pdf, again:
pdftotext UN-Introduction.pdf
UN-Introduction.txt
pandoc -t docx --wrap=none -o UN-Introduction.docx UN-Introduction.txt
UN-Introduction.docx
I need high-quality page screenshots
As a sample file we’ll use convention-rights-child-text-child-friendly-version.pdf:
For the convert tool used here, the first physical page number within a PDF file is 0 – unlike with other tools that consider it to be page 1.
Screenshot of single page
We’ll take p. 0 here, the title page, and will create a PNG, resized to max. 2048x2048 pixels, 600 DPI, antialiased
convert -density 600 -antialias -background white -alpha remove convention-rights-child-text-child-friendly-version.pdf[0] -resize 2048x2048 screenshot-%04d.png
./screenshot-0000.png
Screenshots of a range of pages
We’ll take p. 2-3 here (counting from 0), and will create PNGs, max. 2048x2048 pixels each, 600 DPI, antialiased; stored in subdirectory ./images, file name prefixed with screenshot-[pagenumber]
mkdir -p ./images
convert -density 600 -antialias -background white -alpha remove convention-rights-child-text-child-friendly-version.pdf[2-3] -resize 2048x2048 ./images/screenshot-%04d.png
./images/screenshot-0002.png, ./images/screenshot-0003.png
Tip:
- You can shorten the PDF to a few selected pages first, if a single page range is not what you want. See above, under … but I need only parts of the PDF
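Since convert counts pages from 0 while most viewers count from 1, it is easy to be off by one. The following sketch translates page numbers as your viewer shows them into the bracket selector that convert expects. The name convert_page_selector is our own, purely for illustration.

```shell
# Hypothetical helper: turn 1-based page numbers (first, last) into
# the 0-based selector for convert, e.g. pages 3 to 4 -> [2-3].
convert_page_selector() {
    first=$(($1 - 1))
    last=$(($2 - 1))
    if [ "$first" -eq "$last" ]; then
        printf '[%d]\n' "$first"
    else
        printf '[%d-%d]\n' "$first" "$last"
    fi
}

# Usage (needs imagemagick and an existing ./images directory):
#   convert -density 600 "some.pdf$(convert_page_selector 3 4)" ./images/screenshot-%04d.png
```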
I need all the pictures / diagrams / figures
We’ll use 20120000033.pdf as a sample file, because it contains many »retro« graphics and figures:
Extract all pictures
They’ll be stored in subdirectory ./images, file names prefixed with page-[pagenumber]-[imagenumber]
mkdir -p ./images
pdfimages -all -p 20120000033.pdf ./images/page
./images/page-001-000, … (original file types as used when the PDF was created – may be a mix of .png, .jpg, …)
Extract pictures only from pages 10 (first) to 50 (last)
They’ll be stored in subdirectory ./images, file names prefixed with page-[pagenumber]-[imagenumber]
pdfimages -all -p -f 10 -l 50 20120000033.pdf ./images/page
./images/page-022-000, … (original file types, may be a mix of .png, .jpg, …)
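With hundreds of extracted files, you’ll probably want to look at the images of one specific page only. Here is a small sketch that builds the matching file name pattern. It assumes the three-digit page numbers seen in the results above (page-001-000, page-022-000), and page_image_glob is a made-up name.

```shell
# Hypothetical helper: build the glob pattern matching all images that
# were extracted from one page. Pass the page number without leading zeros.
page_image_glob() {
    printf './images/page-%03d-*\n' "$1"
}

# Usage (leave the substitution unquoted, so the shell expands the pattern):
#   ls $(page_image_glob 88)
```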
… but I need the diagram without that border
Shave away a few pixels from every border
In our example, 5.
convert ./images/page-180-057.png -shave 5x5 +repage ./shaved.png
./shaved.png
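Keep in mind that shaving removes pixels on both sides: 5 pixels left and right, 5 pixels top and bottom. A quick sketch of the arithmetic, with shaved_size as a hypothetical helper name and made-up example dimensions:

```shell
# Hypothetical helper: image size after shaving N pixels from every border.
# Arguments: width, height, pixels to shave.
shaved_size() {
    echo "$(($1 - 2 * $3))x$(($2 - 2 * $3))"
}

# Usage (example dimensions only, not taken from the sample file):
#   shaved_size 640 480 5
```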
… but I need a different background for this diagram – I don’t like pastel green
Replace specific background color by another
In our example, we’ll replace pastel green #d6f6da by white #ffffff. Allow for a fuzziness of 5% when matching the green tone to be replaced.
convert ./images/page-180-057.png -fuzz 5% -fill "#ffffff" -opaque "#d6f6da" ./white-bg.png
./white-bg.png
Replace specific background color by transparency
In our example, we’ll replace pastel green #d6f6da with transparency. Allow for a fuzziness of 5% when matching the green tone to be removed.
convert ./images/page-180-057.png -fuzz 5% -transparent "#d6f6da" ./transparent-bg.png
./transparent-bg.png
… but I want to remove the »outside« of this diagram
Isolate a diagram (make outside part transparent)
To achieve this, we’ll simply »flood« the outside with transparency.
The diagram needs a closed outline for this to work, so the transparency does not spill into the inside. Also, the outside part must be contiguous, so the transparency can »flow« everywhere.
The background color to be flooded is simply chosen from the pixel at coordinates 0,0. Allow for a fuzziness of 5% when matching that color.
convert ./images/page-243-083.png -matte -fill none -fuzz 5% -draw 'alpha 0,0 floodfill' ./isolated.png
./isolated.png
… but I need whole diagrams – some are split into several images, and are even incomplete!
When using pdfimages as explained above, all bitmaps are extracted from your PDF. What you perceive as a picture / diagram / figure when viewing that PDF, however, may actually be a composite of such bitmaps and a variety of other graphical objects, all layered like on a presentation slide. In short: there is no single picture / diagram / figure.
You can see this effect, e.g., when looking at the figure from page 88 of 20120000033.pdf, after exporting it according to the example above: the extraction results in two bitmaps for the diagram, each with a black background.
To solve this, you could of course take screenshots of all pages, and then carefully cut out the shapes of what you perceive as the pictures / diagrams / figures. If you want to take that road, see above under I need high-quality page screenshots.
A better alternative might be to convert the PDF into a series of SVG vector graphics first: quite often, this preserves the grouping of the visual elements of a picture / diagram / figure, as specified in the PDF file. Which, in turn, allows for opening the converted pages, using a vector graphics program, and quickly exporting the groups of objects that constitute the pictures / diagrams / figures you’re interested in.
Export diagrams as bitmaps, via SVG
1 Export all PDF pages to vector images
mkdir ./svg
pdf2svg 20120000033.pdf ./svg/page-%04d.svg all
./svg/page-0001.svg, … , ./svg/page-0256.svg
2 Open the desired page in Inkscape. Try to select the group of objects that constitutes the picture / diagram / figure (in case there are nested groups, you may have to ungroup some of the outer groups first). Export the selection to a PNG bitmap.
Tip:
- You can shorten the PDF to a few selected pages first, if a single page range is not what you want. See above, under … but I need only parts of the PDF
I need a smaller version, it’s way too huge
Many PDF files don’t contain much content that could be »compressed«, since much of it is already in vectorized format (e.g., embedded fonts, and many of the diagrams).
Usually, the highest reduction in size can be achieved for bitmap content like scans. Any substantial compression will show in jaggy bitmaps, whereas the text looks as hi-res as before, because the vectorized fonts haven’t been changed in the process.
… but I need good quality, and a minimum resolution (DPI)
We’ll use 20120000033.pdf as a sample file, because it contains many »retro« graphics and figures, and exceeds 200 pages in size.
Compress to a specific resolution
To compress to a resolution of 300 DPI, preserving bitmap quality as well as possible by using the best downsampling algorithms:
gs -r300 -sDEVICE=pdfwrite -dPDFSETTINGS=/screen -dCompatibilityLevel=1.5 -dColorImageResolution=300 -dColorConversionStrategy=/LeaveColorUnchanged -dEmbedAllFonts=true -dSubsetFonts=true -dGrayImageResolution=300 -dMonoImageResolution=300 -dColorImageDownsampleType=/Bicubic -dGrayImageDownsampleType=/Bicubic -dMonoImageDownsampleType=/Subsample -dNOPAUSE -dQUIET -dBATCH -dPrinted=false -sOutputFile=NASA-Handbook-300.pdf 20120000033.pdf
./NASA-Handbook-300.pdf
Replace every occurrence of 300 by your desired resolution. Other common values are 150 and 72. When using our sample file and experimenting with different parameters, you may wish to check tables on p. 165 or p. 243 in the resulting files, to assess compression artefacts.
300 DPI gives good results on printers, provided the input file was of high quality. »Upscaling« poor quality files won’t work.
Some of the parameters used above are already implied by -dPDFSETTINGS=/screen. We have added them to allow for isolated parameter modifications, for testing.
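Replacing the resolution by hand in four places is error-prone. As a sketch, the resolution-dependent parameters from the command above can be generated from a single value. The function name gs_resolution_args is hypothetical; the parameters themselves are exactly those used above.

```shell
# Hypothetical helper: print the resolution-dependent Ghostscript
# parameters from the command above, one per line, for a given DPI value.
gs_resolution_args() {
    dpi=$1
    printf '%s\n' \
        "-r$dpi" \
        "-dColorImageResolution=$dpi" \
        "-dGrayImageResolution=$dpi" \
        "-dMonoImageResolution=$dpi"
}

# Usage (needs ghostscript; add the remaining parameters from the command above):
#   gs $(gs_resolution_args 150) -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH \
#      -sOutputFile=NASA-Handbook-150.pdf 20120000033.pdf
```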
… but I need it as small as possible, at all costs.
Compress to 72 DPI, accept lower quality
To compress to a resolution of 72 DPI, compromising on bitmap quality by crude downsampling:
gs -r72 -sDEVICE=pdfwrite -dPDFSETTINGS=/screen -dCompatibilityLevel=1.5 -dColorImageResolution=72 -dColorConversionStrategy=/LeaveColorUnchanged -dEmbedAllFonts=true -dSubsetFonts=true -dAutoRotatePages=/None -dGrayImageResolution=72 -dMonoImageResolution=72 -dColorImageDownsampleType=/Average -dGrayImageDownsampleType=/Average -dMonoImageDownsampleType=/Subsample -dNOPAUSE -dQUIET -dBATCH -dPrinted=false -sOutputFile=NASA-Handbook-minimal.pdf 20120000033.pdf
./NASA-Handbook-minimal.pdf
Tricky Tasks Tackled
I need to copy & paste a paragraph, or two…
There may be various reasons for not being able to just copy and paste a paragraph from a PDF. Let’s see what we can do about them.
… but I have a no-text, bitmaps-only PDF
We’ll turn it into a PDF with overlaid text, by using freely available tools for OCR (optical character recognition).
Create a PDF with text from a bitmap-only PDF
We’ll follow these steps:
- Create JPEG bitmaps from (all, or a range of) the pages of the PDF
- Create a list of the page bitmaps
- Let an OCR software analyze that list of bitmaps and create a PDF, with overlaid text that is searchable / copyable
We’ll use StrandMagazine_133.pdf as a non-trivial sample file, because it’s a scan of a 1902 magazine copy, set in two columns – and it contains several images:
1 Create JPEG images from PDF pages: pages 4 to 16; 95% quality JPGs, 600 DPI; stored in subdirectory ./pagebitmaps, file name prefixed with page-[pagenumber]
mkdir ./pagebitmaps
pdftoppm StrandMagazine_133.pdf -f 4 -l 16 ./pagebitmaps/page -jpeg -jpegopt quality=95 -r 600
./pagebitmaps/page-004.jpg, … ./pagebitmaps/page-016.jpg
2 Create list of JPEG images: find all .jpg that were generated, print paths, sort (so page numbers will be in ascending order)
find ./pagebitmaps/ -iname '*.jpg' -type f -printf "./pagebitmaps/%f\n" | sort > pagelist.txt
pagelist.txt
3 Create PDF with overlaid text from image file list: assume English language (eng); let tesseract work out the page layout by itself (--psm 1); assume input files have 600 DPI
tesseract -l eng --psm 1 --dpi 600 pagelist.txt Baskervilles-searchable pdf
Baskervilles-searchable.pdf
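The page list from step 2 is worth a quick check before you hand it to tesseract, since OCR at 600 DPI takes a while. The sketch below is a slightly simplified variant of the find command above: it prints the paths exactly as find reports them, instead of rebuilding them with -printf. The function name make_pagelist is our own.

```shell
# Hypothetical helper: list all JPEG files below a directory, sorted by name.
# Zero-padded page numbers (page-004, page-016) sort correctly this way.
make_pagelist() {
    find "$1" -type f -iname '*.jpg' | sort
}

# Usage:
#   make_pagelist ./pagebitmaps > pagelist.txt
#   wc -l < pagelist.txt     # should match the number of pages you exported
```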
… but I’ve tried everything and nothing worked
Ok, there is one more thing we can try, under these conditions:
- You can open the »stubborn« file in a viewer application, on your desktop.
- You’re not using the Wayland display server protocol.
- The viewer supports key shortcuts to make it »go fullscreen«, and to move »one page down« in the content, respectively.
We’ll try this:
- Open the file in the viewer that can display it, and move to the first page that we want
- Let a script screenshot all desired pages
- Manually crop one of the screenshots, to indicate what region of the screenshots contains the »page«, in our opinion
- Let another script crop all screenshots, and use OCR to create a PDF with text that you can finally copy & paste
In our example, we’ll use StrandMagazine_133.pdf. It’s not really a »stubborn« file, though.
The scripts used: viewer2grabs.sh and grabs2pdf.sh.
These scripts use the convert, import, tesseract, and xdotool tools.
Turning stubborn files into PDFs with text that you can copy & paste
1 Open StrandMagazine_133.pdf in a viewer of your choice (we’ll use Evince here). Test whether the keys to go fullscreen and to move one page down work as expected (with Evince, these are F11 and Page_Down, respectively).
Then, turn off fullscreen again and select page 4, the start of »The Hound of the Baskervilles«.
2 Open a terminal window and cd to an empty directory that you’ll use as a working directory. We’ll let the script viewer2grabs.sh remote-control the viewer application, i.e.: make it go fullscreen by sending F11, and take a series of 13 screenshots (i.e., p. 4–16); after each screenshot, the script will send a Page_Down to the viewer. Finally, the script will tell the viewer to end fullscreen mode, by sending F11 again.
Don’t interfere with the script, just watch it do its work. After you have started the script in the terminal, you have 10 seconds to bring the viewer window to the front, before the script starts taking screenshots. The starting page (4) must already be selected.
viewer2grabs.sh -n 13 -p Page_Down -f F11
./screens/grabbed-0001.png, … , ./screens/grabbed-0005.png, and ./screens/please_crop_me.png
3 Manually crop a sample screenshot, as an example. Later, the second script will learn from your example where the »page« region of the screenshots is.
Open ./screens/please_crop_me.png and crop it, so only the »page« region is visible. Save it again, overwriting the original file.
4 Go back to your terminal window, back to the working directory. We’ll let the script grabs2pdf.sh crop all the screenshots of the viewer window, so only the »page« region survives. The script then uses OCR to retrieve the text from the cropped images (assuming it’s English). Finally, it generates a PDF file combining the cropped images with the text.
grabs2pdf.sh -l eng -o ./baskervilles
./screens/grabbed-0001.png, … , ./screens/grabbed-0005.png, and ./screens/please_crop_me.png
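The number of screenshots passed via -n in step 2 has to cover both the first and the last page, which is easy to get wrong by one. A tiny sketch of that calculation, with screenshot_count as a hypothetical helper name:

```shell
# Hypothetical helper: number of screenshots needed to cover
# the pages from first to last, both included.
screenshot_count() {
    echo "$(($2 - $1 + 1))"
}

# Usage: pages 4 to 16 need 13 screenshots, as in step 2 above.
#   viewer2grabs.sh -n "$(screenshot_count 4 16)" -p Page_Down -f F11
```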
I need a visual overview of all pages
- We’ll take tiny screenshots of all pages…
- …and then arrange them in a grid (with variants a – c)
Tiny page screenshot tiles, arranged in a grid
1 Take tiny screenshot tiles of all PDF pages: 80% quality JPGs, max. 100×100 pixels, 72 DPI; stored in subdirectory ./images, file name prefixed with screenshot-[pagenumber]
For the convert tool used here, the first physical page number within a PDF file is 0 – unlike with other tools that consider it to be page 1.
mkdir -p ./images
convert -density 72 -alpha remove -background white -colorspace sRGB -quality 80 20120000033.pdf -resize 100x100 ./images/screenshot-%04d.jpg
./images/screenshot-0000.jpg, …
2a Montage screenshots into the desired grid, fixed number of columns (20 here); set background to gray, separate screenshots by 1 pixel to make background shine through
montage ./images/screenshot-*.jpg -background gray -geometry +1+1 -tile 20x ./grid.jpg
grid.jpg
2b Montage screenshots into the desired grid, fixed number of rows (7 here); set background to gray, separate screenshots by 1 pixel to make background shine through
montage ./images/screenshot-*.jpg -background gray -geometry +1+1 -tile x7 ./grid.jpg
grid.jpg
2c Montage screenshots into the desired grids, each with a fixed number of columns (10 here) and of rows (5 here); set background to gray, separate screenshots by 1 pixel to make background shine through
montage ./images/screenshot-*.jpg -background gray -geometry +1+1 -tile 10x5 ./grid-%04d.jpg
grid-0000.jpg, …
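To estimate beforehand how large the grid will get, or how many grid files variant 2c will produce, a bit of rounding-up division is enough. The following is a sketch with helper names of our own choosing; the page count of 256 in the usage example is taken from the SVG export results further above.

```shell
# Hypothetical helpers for planning the grid layout.

# Integer division, rounded up.
ceil_div() {
    echo "$(( ($1 + $2 - 1) / $2 ))"
}

# Variant 2a: rows needed for a given number of pages and columns.
grid_rows() {
    ceil_div "$1" "$2"
}

# Variant 2c: grid files needed for pages, columns, and rows per grid.
grid_count() {
    ceil_div "$1" "$(($2 * $3))"
}

# Usage, for a PDF of 256 pages:
#   grid_rows 256 20      # rows in a 20-column grid
#   grid_count 256 10 5   # number of 10x5 grid files
```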
Image Credits:
photography of standing man starring on assorted photos during daytime photo (modified) | Photo by Magdalena Smolnicka on Unsplash
Licensing:
This content is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
For your attributions to us please use the word »tuxwise«, and the link https://tuxwise.net.