So that we can benefit from at least some of your work, could you please extract the bookmarks (table of contents) you made on your scan? I could then load them into my scan. If you don't know how to do that, I recommend downloading jpdftweak and using it like this:
Code:
java -jar -Xmx1536M ~/jpdftweak/jpdftweak.jar vol1renumber.pdf -savebookmarks vol1renumber.csv /dev/null
Also, FYI, it is possible to OCR a PDF with tesseract without remaking the PDF from images. My approach still has to split the PDF into pages, which can break links in the PDF, but at least it doesn't affect the file size. It still extracts the images, but only uses them for OCR and deletes them afterwards, merging the OCR layer on top of each existing PDF page. This is a shell script I have written to do that, which also depends on pdftoppm, pdfseparate, and pdftk (pdftk is a bit tricky to install on modern versions of Ubuntu, but it's still possible). You need to know the resolution of the original images in the original PDF as an input; I haven't found a 100% reliable way to determine this automatically.
Code:
fullpath=/home/eric/input.pdf # Path to the input PDF
RESOLUTION=300
echo "============================================================================="
filename="${fullpath##*/}" # Strip longest match of */ from start
echo "========== Starting new file $filename"
dir="${fullpath:0:${#fullpath} - ${#filename}}" # Substring from 0 thru pos of filename
base="${filename%.[^.]*}" # Strip shortest match of . plus at least one non-dot char from end
ext="${filename:${#base} + 1}" # Substring from len of base thru end
if [[ -z "$base" && -n "$ext" ]]; then # If we have an extension and no base, it's really the base
    base=".$ext"
    ext=""
fi
mkdir ocrtemp
echo "===== Extracting images for $filename"
pdftoppm -r $RESOLUTION -png "$fullpath" ocrtemp/
echo "===== Separating PDF pages of $filename"
pdfseparate "$fullpath" ocrtemp/-%03d-image.pdf
echo "===== Generating OCR for $filename"
for f in ocrtemp/*.png
do
    # rename files to have 3 digits with leading zeroes
    FILENUM=`echo "$f" | sed 's/[^0-9]//g' | sed 's/^0*//'`
    FILENUM3=`echo "$FILENUM" | sed -e :a -e 's/^.\{1,2\}$/0&/;ta'`
    NEWFILE=ocrtemp/-$FILENUM3.png
    # skip the rename if pdftoppm already produced a 3-digit name
    [ "$f" = "$NEWFILE" ] || mv "$f" "$NEWFILE"
    tesseract "$NEWFILE" "${NEWFILE%.png}-text" -c textonly_pdf=1 pdf
done
echo "===== Merging text with images for $filename"
for f in ocrtemp/-*-image.pdf
do
    # pdftoppm starts with 1 but needs leading zeroes added to match
    PAGE=`echo "$f" | sed 's/[^0-9]//g' | sed 's/^0*//'`
    # add leading zeroes
    PAGE2=`echo "$PAGE" | sed -e :a -e 's/^.\{1,2\}$/0&/;ta'`
    pdftk "ocrtemp/-$PAGE2-text.pdf" background "$f" output "ocrtemp/merged-$PAGE2.pdf"
    # clean all old metadata
    pdftk "ocrtemp/merged-$PAGE2.pdf" dump_data | sed -e 's/\(InfoValue:\)\s.*/\1\ /g' | pdftk "ocrtemp/merged-$PAGE2.pdf" update_info - output "ocrtemp/clean-$PAGE2.pdf"
done
echo "===== Combining pages for $filename"
pdftk ocrtemp/clean*.pdf cat output "$dir$base-new.pdf"
rm -rf ocrtemp
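For the RESOLUTION value, one rough heuristic (not a 100% reliable one) is to look at the x-ppi column that pdfimages -list prints for each embedded image and take the most common value. The sketch below assumes the usual pdfimages -list layout, with two header lines and x-ppi in the 13th column; guess_ppi is just a hypothetical helper name. It will mislead you on PDFs that mix resolutions or scale their images oddly, so treat the result as a starting point.

```shell
# Hypothetical helper: read `pdfimages -list` output on stdin and print
# the most frequently occurring x-ppi value (assumed to be column 13).
guess_ppi() {
    awk 'NR > 2 && NF >= 14 { count[$13]++ }
         END {
             best = ""
             for (v in count)
                 if (best == "" || count[v] > count[best]) best = v
             if (best != "") print best
         }'
}

# Usage (assumes pdfimages from Poppler is installed):
#   RESOLUTION=$(pdfimages -list "$fullpath" | guess_ppi)
```

If two resolutions are equally common, which one gets printed is arbitrary, so it is worth checking the full pdfimages -list output by eye when in doubt.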
Also, in case anyone reading this was wondering, you can extract the original images from a PDF using pdfimages, which is also part of the Poppler utilities linked above. The syntax is like this:
Code:
pdfimages -all input.pdf outputpath/outputfileprefix
If you don't want it to produce CCITT or JPEG2000 files, and would rather get only PNG and JPEG files, which are more usable (at the cost of some increase in file size), use the following:
Code:
pdfimages -png -j input.pdf outputpath/outputfileprefix
I've found that most PDFs use JPEG (when lossy compressed) or PNG (when losslessly compressed), so CCITT and JPEG2000 formats are rare and the above command will give you the original, unchanged images most of the time. The exceptions: CCITT is most common on old black-and-white scans (it's a lossless bilevel compression scheme used by fax machines), and some PDF generation programs output true-color images in JPEG2000 format.
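To check which case a given PDF falls into before extracting anything, you can tally the encodings that pdfimages -list reports. This is only a sketch: it assumes the encoding name is in the 9th column (enc) after two header lines, and count_encodings is a hypothetical helper name.

```shell
# Hypothetical helper: read `pdfimages -list` output on stdin and print
# how many images use each encoding (assumed to be column 9), most common first.
count_encodings() {
    awk 'NR > 2 && NF >= 14 { n[$9]++ }
         END { for (e in n) print n[e], e }' | sort -rn
}

# Usage (assumes pdfimages from Poppler is installed):
#   pdfimages -list input.pdf | count_encodings
```

If everything listed is jpeg or plain image data, the -png -j command should give you the embedded images as they are; if ccitt or a JPEG2000 entry shows up, -all keeps those in their native formats instead.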
Once you have PNG files, I have found the best way to minimize their size for inclusion in a PDF is to run them through optipng first. I've found an optimization level of 7 is pretty hard to beat, but it's a bit slow (try 6 to run a fair bit faster, with slightly larger output, perhaps 1% more on average).
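As a minimal sketch of that step, assuming the PNG files sit together in one directory, something like the following runs optipng over all of them in place. The optimize_pngs name and its arguments are hypothetical; the -o level and -quiet options are standard optipng flags.

```shell
# Hypothetical helper: optimize every PNG in a directory in place.
#   $1 = directory containing the PNG files
#   $2 = optipng optimization level (defaults to 7)
optimize_pngs() {
    for png in "$1"/*.png; do
        # If the glob matched nothing, skip the literal pattern.
        [ -e "$png" ] || continue
        optipng -o"${2:-7}" -quiet "$png"
    done
}

# Usage:
#   optimize_pngs outputpath      # level 7, slow but thorough
#   optimize_pngs outputpath 6    # a fair bit faster
```

optipng overwrites each file with the smaller version, so keep a copy of the originals if you want to compare sizes afterwards.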
I've found a lot of PDF software doesn't optimize PNG images this much so this can also improve file sizes.