A script that creates a PDF/A file with searchable and copyable text from a source PDF file that (mostly or entirely) lacks it.
Requires (command, then the package that provides it):
convert (imagemagick)
gs (ghostscript)
optipng (optipng)
tesseract (tesseract-ocr or tesseract, plus language data: tesseract-ocr-fra, tesseract-ocr-spa, … or tesseract-data-fra, tesseract-data-spa, …)
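Before running the script, it can help to confirm that the four commands listed above are actually on your PATH. The following is a minimal sketch, not part of the original script; the function name `missing_commands` is a hypothetical helper introduced here for illustration.

```bash
#!/bin/bash
# Hypothetical helper: print every command from the argument list that
# cannot be found on PATH, one per line. Returns 1 if any are missing.
missing_commands() {
  local cmd status=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      status=1
    fi
  done
  return $status
}

# Check the four tools the script relies on.
if ! missing=$(missing_commands convert gs optipng tesseract); then
  # $missing is left unquoted on purpose so the names are joined on one line.
  echo "Missing required tools:" $missing >&2
fi
```

To see which language packages tesseract can use, `tesseract --list-langs` prints the installed ones.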
The script creates page bitmaps from the original PDF file, applies OCR to them, and recombines the bitmaps and text into a searchable PDF/A file.
How to call it:
unsearchable2searchablepdfa.sh -l language -d targetDPI pdfFile
E.g.:
unsearchable2searchablepdfa.sh -l eng -d 300 scannedpages.pdf
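To process a whole directory of scans, the call above can be wrapped in a loop. This is a rough sketch under a few assumptions: the script is on your PATH under the name shown above, and `ocr_all` and `OCR_SCRIPT` are hypothetical names introduced only for this example. Files ending in `-searchable.pdf` are skipped, because that is the suffix the script gives its own output.

```bash
#!/bin/bash
# Assumption: the script is installed on PATH under this name.
OCR_SCRIPT=${OCR_SCRIPT:-unsearchable2searchablepdfa.sh}

# Hypothetical helper: run the OCR script on every PDF in a directory.
# Usage: ocr_all directory language dpi
ocr_all() {
  local dir=$1 lang=$2 dpi=$3 pdf
  for pdf in "$dir"/*.pdf; do
    [ -e "$pdf" ] || continue            # directory contains no PDF files
    case $pdf in
      *-searchable.pdf) continue ;;      # output of an earlier run
    esac
    "$OCR_SCRIPT" -l "$lang" -d "$dpi" "$pdf" || echo "Failed: $pdf" >&2
  done
}

# Example call (commented out so that sourcing this file has no side effects):
# ocr_all ~/scans eng 300
```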
You'll most likely want to apply this to PDF files that consist solely of scanned page bitmaps.
#!/bin/bash
# Turn a PDF of scanned pages into a searchable PDF/A file.

target_language=eng # Remember to install required tesseract language package: eng;spa;fra;deu;...
dpi=300

# Print usage information to stderr.
usage() {
  echo "Usage:" >&2
  echo "$(basename "$0") [-d dpi] [-l language] pdfFile" >&2
  echo >&2
  echo "Examples:" >&2
  echo "$(basename "$0") scannedpages.pdf" >&2
  echo "$(basename "$0") -d $dpi -l $target_language scannedpages.pdf" >&2
}

while getopts ":d:l:" opt; do
  case $opt in
    d) dpi="$OPTARG" ;;
    l) target_language="$OPTARG" ;;
    :) echo "Option -$OPTARG requires an argument" >&2; usage; exit 1 ;;
    \?) echo "Invalid option -$OPTARG" >&2; usage; exit 1 ;;
  esac
done
shift $((OPTIND-1))

# The source PDF is mandatory.
if [ ! -f "$1" ]; then
  echo "Missing or unreadable PDF file: $1" >&2
  usage
  exit 1
fi

echo "Executing: $(basename "$0") -d $dpi -l $target_language $1"

TMP_DIR=$(mktemp -d /tmp/tmpdir.XXXXXXXXXXXX)
SOURCE_FILE=$(readlink -f "$1")

# Step 1: render every page as a PNG bitmap at the requested resolution.
echo
echo "Converting $SOURCE_FILE to $dpi DPI bitmaps in $TMP_DIR ..."
convert -density "$dpi" -background white -alpha remove "$SOURCE_FILE" "$TMP_DIR"/page-%04d.png

# Step 2: shrink the bitmaps without losing quality.
echo
echo "Losslessly compressing bitmaps..."
optipng -o2 "$TMP_DIR"/*.png 2> >(grep Processing >&2)

# The result is written next to the source file, with a "-searchable" suffix.
TARGET_DIR=$(dirname "$SOURCE_FILE")
BASE_NAME=$(basename -- "$SOURCE_FILE")
TARGET_NAME="${BASE_NAME%.*}"
TARGET_FILE="$TARGET_DIR/$TARGET_NAME-searchable.pdf"

# Step 3: OCR all pages, in page order, into a single PDF with a text layer.
echo
echo "OCR: Converting bitmaps in $TMP_DIR to searchable PDF ..."
find "$TMP_DIR"/ -iname '*.png' -type f -printf "$TMP_DIR/%f\n" | sort > "$TMP_DIR/files.txt"
tesseract -l "$target_language" --dpi "$dpi" --psm 1 "$TMP_DIR/files.txt" "$TMP_DIR/raw" pdf 2> >(grep Page >&2)

# Step 4: let Ghostscript rewrite the OCR result as PDF/A.
echo
echo "Making searchable PDF validate against PDF/A ..."
gs -o "$TARGET_FILE" -dPDFA=2 -dPDFACompatibilityPolicy=1 -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sColorConversionStrategy=UseDeviceIndependentColor "$TMP_DIR/raw.pdf" 2> >(grep Page >&2)

rm -rf "$TMP_DIR"

echo
echo "Finished, see $TARGET_FILE"
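The script deletes its temporary directory only in its final step, so a run that is aborted partway leaves the page bitmaps behind in /tmp. One common bash pattern for this is an EXIT trap. The sketch below is an illustration rather than part of the original script; `with_tmp_dir` is a hypothetical helper that creates a directory, passes its path as the last argument to a command, and removes it when the subshell exits, regardless of the command's exit status.

```bash
#!/bin/bash
# Hypothetical helper: run "$@" with a fresh temporary directory appended
# as its last argument, and always remove that directory afterwards.
with_tmp_dir() {
  (
    tmp_dir=$(mktemp -d /tmp/tmpdir.XXXXXXXXXXXX) || exit 1
    # The trap fires when this subshell exits, on success or failure.
    trap 'rm -rf "$tmp_dir"' EXIT
    "$@" "$tmp_dir"
    status=$?
    exit "$status"
  )
}

# Example: list the (empty) directory while it exists.
# with_tmp_dir ls -la
```

In the script itself, the same effect could be had by placing a `trap 'rm -rf "$TMP_DIR"' EXIT` line directly after the `mktemp` call.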
Licensing:
This content is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
For your attributions to us please use the word »tuxwise«, and the link https://tuxwise.net.