unsearchable2searchablepdfa.sh

A script that creates a PDF/A file with searchable, copyable text from a source PDF file that mostly or entirely lacks such text.

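Requires (command, followed by the package that provides it):

  • convert: imagemagick
  • gs: ghostscript
  • optipng: optipng
  • tesseract: tesseract-ocr or tesseract, plus the language data packages you need (such as tesseract-ocr-fra, tesseract-ocr-spa, or tesseract-data-fra, tesseract-data-spa)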
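
Before the first run, it can help to confirm that all four commands are available on the PATH. The following is a minimal sketch, not part of the original script; the helper name missing_tools is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): print every
# command from the argument list that cannot be found on the PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing=$(missing_tools convert gs optipng tesseract)
if [ -n "$missing" ]; then
  # Unquoted on purpose: joins the names onto a single line.
  echo "Missing commands:" $missing >&2
else
  echo "All required commands found."
fi
# Installed OCR languages can be listed with: tesseract --list-langs
```

This only checks the commands themselves; whether the language data for your -l value is installed is a separate check.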

The script creates page bitmaps from the original PDF file, applies OCR to them, and recombines the bitmaps and text into a searchable PDF/A file.

How to call it (both options are optional and default to eng and 300 DPI):

unsearchable2searchablepdfa.sh -l language -d targetDPI pdfFile

E.g.:

unsearchable2searchablepdfa.sh -l eng -d 300 scannedpages.pdf
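
To process a whole directory of scans, the call above can be wrapped in a loop. This is a rough sketch under a few assumptions: the script is on the PATH, the source files are not symlinks (the script resolves them before choosing the output directory), and the helper names searchable_name and convert_all are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: the output path the script is expected to write for a
# given source PDF (same directory, "-searchable.pdf" suffix).
searchable_name() {
  local dir base
  dir=$(dirname -- "$1")
  base=$(basename -- "$1")
  echo "$dir/${base%.*}-searchable.pdf"
}

# Hypothetical batch wrapper: convert every PDF in a directory that has no
# searchable counterpart yet.
# Assumes unsearchable2searchablepdfa.sh is on the PATH.
convert_all() {
  local dir=$1 lang=${2:-eng} dpi=${3:-300} pdf
  for pdf in "$dir"/*.pdf; do
    # Skip the unexpanded pattern when the directory contains no PDFs.
    if [ ! -f "$pdf" ]; then continue; fi
    # Skip results of earlier runs.
    case $pdf in *-searchable.pdf) continue ;; esac
    # Skip sources that already have a searchable counterpart.
    if [ -f "$(searchable_name "$pdf")" ]; then continue; fi
    unsearchable2searchablepdfa.sh -l "$lang" -d "$dpi" "$pdf"
  done
}

# Example call: convert_all ~/scans eng 300
```

Skipping files that already have a counterpart makes the wrapper safe to re-run after an interrupted batch.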

You’ll most likely want to apply this to PDF files that consist only of scanned page bitmaps.

#!/bin/bash

target_language=eng  # Remember to install required tesseract language package: eng;spa;fra;deu;...
dpi=300

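usage() {
  echo "Usage:"
  echo "$(basename "$0") [-d dpi] [-l language] pdfFile"
  echo
  echo "Examples:"
  echo "$(basename "$0") scannedpages.pdf"
  echo "$(basename "$0") -d $dpi -l $target_language scannedpages.pdf"
}

# The leading ":" selects silent error reporting, so invalid options ("?")
# and missing option arguments (":") are both handled here.
while getopts ":d:l:" opt; do
  case $opt in
    d) dpi="$OPTARG"
    ;;
    l) target_language="$OPTARG"
    ;;
    \?) echo "Invalid option -$OPTARG" >&2
    usage >&2
    exit 1
    ;;
    :) echo "Option -$OPTARG requires an argument" >&2
    usage >&2
    exit 1
    ;;
  esac
done
shift $((OPTIND-1))

# A source PDF file is required.
if [ $# -lt 1 ] || [ ! -f "$1" ]; then
  echo "PDF file not found: ${1:-(none given)}" >&2
  usage >&2
  exit 1
fi

echo "Executing: $(basename "$0") -d $dpi -l $target_language $1"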

TMP_DIR=$(mktemp -d /tmp/tmpdir.XXXXXXXXXXXX)
SOURCE_FILE=$(readlink -f "$1")

echo
echo "Converting $SOURCE_FILE to $dpi DPI bitmaps in $TMP_DIR ..."
convert -density "$dpi" -background white -alpha remove "$SOURCE_FILE" "$TMP_DIR"/page-%04d.png


echo
echo Losslessly compressing bitmaps...
optipng -o2 "$TMP_DIR"/*.png 2> >(grep Processing >&2)

TARGET_DIR=$(dirname "$SOURCE_FILE")
BASE_NAME=$(basename -- "$SOURCE_FILE")
TARGET_NAME="${BASE_NAME%.*}"
TARGET_FILE="$TARGET_DIR/$TARGET_NAME-searchable.pdf"

echo
echo "OCR: Converting bitmaps in $TMP_DIR to searchable PDF ..."

# tesseract reads the sorted list of page images and writes one multi-page PDF.
find "$TMP_DIR"/ -iname '*.png' -type f -printf "$TMP_DIR/%f\n" | sort > "$TMP_DIR/files.txt"
tesseract -l "$target_language" --dpi "$dpi" --psm 1 "$TMP_DIR/files.txt" "$TMP_DIR/raw" pdf 2> >(grep Page >&2)

echo
echo Making searchable PDF validate against PDF/A ...
gs -o "$TARGET_FILE" -dPDFA=2 -dPDFACompatibilityPolicy=1 -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sColorConversionStrategy=UseDeviceIndependentColor "$TMP_DIR/raw.pdf" 2> >(grep Page >&2)

rm -rf "$TMP_DIR"

echo
echo "Finished, see $TARGET_FILE"
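
The script removes its temporary directory only in its final step, so a run that is aborted earlier leaves the page bitmaps behind in /tmp. One possible way to handle this is an EXIT trap, sketched below; the helper name with_tmp_dir and the example function are hypothetical and not part of the script above.

```shell
#!/bin/bash
# Hypothetical helper: run a command with a private temporary directory as its
# last argument, and remove that directory when the command has finished.
with_tmp_dir() {
  (
    tmp_dir=$(mktemp -d "${TMPDIR:-/tmp}/tmpdir.XXXXXXXXXXXX") || exit 1
    # Runs when this subshell exits, whether the command succeeded or failed.
    trap 'rm -rf "$tmp_dir"' EXIT
    "$@" "$tmp_dir"
  )
}

# Example: the work function fails, yet its temporary directory is removed.
failing_step() {
  echo "working in $1"
  return 3
}
with_tmp_dir failing_step || echo "step failed with status $?"
```

Running the work inside a subshell keeps the trap from interfering with any traps set by the calling shell.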

Image Credits:
Papirus icon for Terminal (modified) | GNU General Public License, version 3

Licensing:
This content is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
For your attributions to us please use the word »tuxwise«, and the link https://tuxwise.net.