
Kaliuzhkin · 12-31-2019, 11:02 PM

(12-31-2019 07:14 PM)EugeneNine Wrote: There are several errors, some documented here:
https://www.hpcalc.org/hp48/docs/misc/sxmanualerr.txt

I've looked at a few of the early listed errors but for many of them, what's in my copy differs from "NOW READS:". Why is this? For example, the errata has:

Page : # 95
Location: Example of Input to the Command SIZE
NOW READS: 1: 92 x 64
SHOULD READ: 1: Graphic 92 x 64.

My copy has "GRAPHIC 6x12".

Another example: The errata has
Page : # 106
Location: Second and third lines from the top
NOW READS: variable you create along with reserved variables created by the
be empty.)
SHOULD READ: variable you create along with any reserved variables created
by the HP 48 (Your VAR menu may not be empty).

My copy is the same as what's in SHOULD READ.

My copy is the same as the one on the Museum USB, listed as Edition 4, July 1990. Is the Errata based on an earlier edition?

(12-31-2019 07:14 PM)EugeneNine Wrote: There are several errors, some documented here:
https://www.hpcalc.org/hp48/docs/misc/sxmanualerr.txt

One of my long term projects was to OCR the scans and apply all the fixes.

Can't you edit the PDF files on the Museum USB?

As to the error I found, could it be added to the errata file?

EugeneNine · 12-31-2019, 11:27 PM

It doesn't say if the error is in the two volume or combined single manual. I wonder if HP fixed some for the second version. I don't know who made the errata, whether it was from HP or compiled by users like us, or which manual set the page numbers correspond to; I didn't dig that deep yet.

When I scanned the PDFs on the museum, all I had access to was Adobe products, and the OCR kept crashing. That's what actually drove me from Win9x to WinNT: under 9x Adobe would crash and I'd have to reboot. Switching to NT, at least the OS wouldn't crash, so I'd just have to restart Adobe. But it wouldn't OCR very well. That was in the 90s; every so often I'll search out and try some OCR package but still can't get a very good OCR. I'm thinking I might have to re-scan to get better quality or a raw file format, but I don't have access to the fancy auto scanner I had back then.

Kaliuzhkin · 01-01-2020, 12:04 AM

(12-31-2019 11:27 PM)EugeneNine Wrote: It doesn't say if the error is in the two volume or combined single manual. I wonder if HP fixed some for the second version. I don't know who made the errata, whether it was from HP or compiled by users like us, or which manual set the page numbers correspond to; I didn't dig that deep yet.

The errors are in the two-volume manual. Proof: the page numbers are decimal numbers from 95 to 814. My copy of the two-volume manual has pages numbered from 3 to 852. The one-volume manual has pages numbered by chapter, so there is no page 95 in that manual; the identified text is on page 4-17 instead. Furthermore, the identified text closely matches what's in my book.

The errata was not from HP. One of the items is tagged:
Received: by ucbvax.Berkeley.EDU (5.63/1.41)
id AA18033; Sat, 16 Jun 90 21:46:25 -0700
Received: from USENET by ucbvax.Berkeley.EDU with netnews
for handhelds@csl.sri.com (handhelds@csl.sri.com)
(contact usenet@ucbvax.Berkeley.EDU if you have questions)
Date: 17 Jun 90 02:21:17 GMT
From: jpser@cup.portal.com (John Paul Serafin)
Organization: The Portal System (TM)
Subject: More HP48 manual corrections
Message-Id: <30863@cup.portal.com>
Sender: handhelds-request@csl.sri.com
To: handhelds@csl.sri.com

(12-31-2019 11:27 PM)EugeneNine Wrote: When I scanned the PDFs on the museum, all I had access to was Adobe products, and the OCR kept crashing. That's what actually drove me from Win9x to WinNT: under 9x Adobe would crash and I'd have to reboot. Switching to NT, at least the OS wouldn't crash, so I'd just have to restart Adobe. But it wouldn't OCR very well. That was in the 90s; every so often I'll search out and try some OCR package but still can't get a very good OCR. I'm thinking I might have to re-scan to get better quality or a raw file format, but I don't have access to the fancy auto scanner I had back then.

What I was suggesting was using a PDF editor. AFAIK, they are readily available, but not free.

EugeneNine · 01-01-2020, 12:25 AM

OIC, edit the images in the PDF?

I want to OCR them to editable text and then be able to make corrections.

EugeneNine · 01-03-2020, 02:24 PM

Well, I gave it another run. https://github.com/tesseract-ocr/ seems to have done a decent job at OCR'ing. My thought was to put it all into an (Open/Libre) document and put that on GitHub, where we can then apply any of the corrections and they will get tracked as to who/how/why.

ijabbott · 01-03-2020, 04:33 PM

(01-03-2020 02:24 PM)EugeneNine Wrote: Well, I gave it another run. https://github.com/tesseract-ocr/ seems to have done a decent job at OCR'ing. My thought was to put it all into an (Open/Libre) document and put that on GitHub, where we can then apply any of the corrections and they will get tracked as to who/how/why.

Can GitHub show differences between commits to ODT files? Or will it just say something like "binary files differ"?

EugeneNine · 01-03-2020, 04:41 PM

hmm, not sure. I'll have to test.

Or maybe use LaTeX or something.

SammysHP · 01-03-2020, 06:33 PM

I OCR all manuals/documents that I scan with ocrmypdf (uses tesseract). It keeps the original image in the background, but adds invisible text so that the text search works in the document. Most text is recognized without errors, but symbols, graphics and special fonts are a problem. I highly doubt that a full conversion will work.

EugeneNine · 01-03-2020, 07:57 PM

(01-03-2020 06:33 PM)SammysHP Wrote: I OCR all manuals/documents that I scan with ocrmypdf (uses tesseract).

"Uses Tesseract OCR engine to recognize more than 100 languages"

So the same back end, I used the gimagereader front end.

edryer · 01-09-2020, 09:47 PM

I have the three 28S manuals as well as the new 48G manual.
Always felt the 28S manuals were written in that old HP style where much care was taken... the 48G manual, on the other hand, is the complete opposite in quality, both materially (thick vs thin paper, spiral vs non-spiral) and in content.

Six years likely separated this transition, with the 48S/SX in the middle. The move from spiral to cheaper binding in the 48S manual sets may well have been down to cost.

Will never forget the SX had a model plaque whereas the S was printed (as were all later models).

The SX feels weighty and the G feels lightweight; granted, this could just be new technology ))

The 28S keyboard was the last decent HP keyboard as well.

But around 1991/92 things changed.

Eric Rechlin · 01-09-2020, 10:42 PM

It's pretty easy to build a searchable PDF from scanned images with open-source tools. From the Linux command line, assuming your images are PNGs and named in order of the pages:

Code:

for f in *.png; do
   tesseract "$f" "${f%.png}" pdf
done
pdfunite *.pdf output.pdf
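
Since `pdfunite *.pdf` takes the pages in the shell's lexical glob order, a name like page-10.png sorts before page-2.png unless the numbers are zero-padded. Here is a minimal sketch of padding a trailing page number to three digits, assuming each file name ends in a number just before its extension. The helper name `pad_page_name` and the sample names are made up for illustration.

```shell
# Hypothetical helper: print NAME with its trailing page number padded to 3 digits.
# Written as a subshell function so its variables stay local in plain POSIX sh.
pad_page_name() (
  file=$1
  ext=${file##*.}                 # text after the last dot
  stem=${file%.*}                 # everything before the last dot
  digits=${stem##*[!0-9]}         # trailing run of digits, if any
  if [ -z "$digits" ]; then
    printf '%s\n' "$file"         # no page number: leave the name alone
    exit 0
  fi
  prefix=${stem%"$digits"}
  # Strip leading zeroes so that "08" is not rejected as an octal number.
  number=$(printf '%s' "$digits" | sed 's/^0*//')
  printf '%s%03d.%s\n' "$prefix" "${number:-0}" "$ext"
)

# Example use (prints the rename commands instead of running them):
# for f in *.png; do echo mv "$f" "$(pad_page_name "$f")"; done
```

Printing the `mv` commands first makes it easy to check the new names before renaming anything.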

If you don't have pdfunite or tesseract, you can install them with the following commands (assuming your distro uses apt):

Code:

sudo apt install poppler-utils

sudo apt install tesseract-ocr

I did all this with Windows 10's built-in Linux services, using Ubuntu 16.04.

EugeneNine · (This post was last modified: 01-10-2020 01:16 PM by EugeneNine.)

I had a .pdf source (the HP digital sender I used creates PDFs in its firmware); the GUI front end just extracts each page, OCRs it, then puts it back together for me. That part was easy, but the result still had a number of errors due to things like a not-clean scan and lots of math symbols. So I'm slowly going through and cleaning it up.
I would just upload it here but it exceeds the file size limit.

EugeneNine · 05-03-2020, 11:01 PM

FWIW to anyone still interested.
tesseract won't OCR a .pdf, so I have to split it out into .png's and OCR those, but then reassembling those .png's brings the .pdf up to 300M. It seems the HP digital sender must have done something smart to compress it down.

Any compression techniques I try seem to reduce the quality too much.

I was however able to take the original .pdf and insert all the missing pages, crop it, add a table of contents from the OCR'ed table of contents and renumber the cover page so the page numbers in the contents align with the .pdf page numbers.

The file size doubled and is now around 30M but that's still plenty usable.
I think I'll clean up Volume 2 a bit and re-submit to the museum.

Eric Rechlin · 05-03-2020, 11:15 PM

It might have used lossy compression (like JPEG) instead of lossless. Maybe keep the original images for those pages not changed, and use corrected images for the pages you did change, and the file size should be more reasonable. If the original image was a JPEG, saving as a PNG will be quite large unless you do some filtering of the scan first.

EugeneNine · 05-03-2020, 11:24 PM

problem is I don't have original images, the digital sender made pdf's.

SammysHP · 05-04-2020, 04:27 AM

Toolchains like ocrmypdf can preserve most of the original content (by disabling PDF/A compatibility).
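
For reference, a minimal sketch of that approach, assuming ocrmypdf is installed. `--output-type pdf` is the ocrmypdf option that skips PDF/A conversion; the wrapper name `ocr_keep_images` and the file names are placeholders.

```shell
# Hypothetical wrapper: add an OCR text layer without converting to PDF/A.
# Usage: ocr_keep_images input.pdf output.pdf
ocr_keep_images() {
  ocrmypdf --output-type pdf "$1" "$2"
}
```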

Eric Rechlin · 05-04-2020, 01:52 PM

(05-03-2020 11:24 PM)EugeneNine Wrote: problem is I don't have original images, the digital sender made pdf's.

That's fine. It's still possible to extract the original images from a PDF.

EugeneNine · 05-04-2020, 02:11 PM

(05-04-2020 01:52 PM)Eric Rechlin Wrote:
(05-03-2020 11:24 PM)EugeneNine Wrote: problem is I don't have original images, the digital sender made pdf's.

That's fine. It's still possible to extract the original images from a PDF.

That's what I'm doing, but when reassembling back into a PDF it's blowing up to 10x in file size.

EugeneNine · 11-25-2020, 03:13 AM

FWIW the other site beat me to it
https://literature.hpcalc.org/

much better scan than my old ones.

Eric Rechlin · 11-25-2020, 04:45 PM

So we can benefit from at least some of your work, if you could please extract the bookmarks (table of contents) you made on your scan, I could load that into my scan. If you don't know how to do that, I recommend downloading jpdftweak and using it like this:

Code:

java -jar -Xmx1536M ~/jpdftweak/jpdftweak.jar vol1renumber.pdf -savebookmarks vol1renumber.csv /dev/null

Also, FYI, it is possible to OCR a PDF with tesseract without remaking the PDF from images, though my approach still has to split the PDF into pages which can cause links in the PDF to be broken, but at least it doesn't affect the file size. It still extracts the images, but then only uses them for OCR and then deletes them afterwards, merging the OCR layer on top of each existing PDF page. This is a shell script I have written to do that, which is also dependent on pdftoppm and pdfseparate and pdftk (pdftk is a bit tricky to install on modern versions of Ubuntu but it's still possible). You need to know the resolution of the original images in the original PDF as an input; I haven't found a 100% reliable way to determine this automatically.

Code:

#!/bin/bash
# Requires bash (uses [[ ]] and substring expansion).

fullpath=/home/eric/input.pdf   # the rest of the script reads $fullpath (lowercase)
RESOLUTION=300

echo "============================================================================="
filename="${fullpath##*/}"                      # Strip longest match of */ from start
echo "========== Starting new file $filename"
dir="${fullpath:0:${#fullpath} - ${#filename}}" # Substring from 0 thru pos of filename
base="${filename%.[^.]*}"                       # Strip shortest match of . plus at least one non-dot char from end
ext="${filename:${#base} + 1}"                  # Substring from len of base thru end
if [[ -z "$base" && -n "$ext" ]]; then          # If we have an extension and no base, it's really the base
  base=".$ext"
  ext=""
fi

mkdir ocrtemp

echo "===== Extracting images for $filename"
pdftoppm -r "$RESOLUTION" -png "$fullpath" ocrtemp/

echo "===== Separating PDF pages of $filename"
pdfseparate "$fullpath" ocrtemp/-%03d-image.pdf

echo "===== Generating OCR for $filename"
for f in ocrtemp/*.png; do
  # rename files to have 3 digits with leading zeroes
  FILENUM=`echo $f | sed 's/[^0-9]//g' | sed 's/^0*//'`
  FILENUM3=`echo $FILENUM | sed -e :a -e 's/^.\{1,2\}$/0&/;ta'`
  NEWFILE=ocrtemp/-$FILENUM3.png
  # skip the rename when the name already has 3 digits
  [ "$f" = "$NEWFILE" ] || mv "$f" "$NEWFILE"
  # textonly_pdf=1 writes only the invisible text layer, without the image
  tesseract "$NEWFILE" "${NEWFILE%.png}-text" -c textonly_pdf=1 pdf
done

echo "===== Merging text with images for $filename"
for f in ocrtemp/-*-image.pdf; do
  # pdftoppm starts with 1 but needs leading zeroes added to match
  PAGE=`echo $f | sed 's/[^0-9]//g' | sed 's/^0*//'`
  # add leading zeroes
  PAGE2=`echo $PAGE | sed -e :a -e 's/^.\{1,2\}$/0&/;ta'`
  pdftk "ocrtemp/-$PAGE2-text.pdf" background "$f" output "ocrtemp/merged-$PAGE2.pdf"
  # clean all old metadata
  pdftk "ocrtemp/merged-$PAGE2.pdf" dump_data | sed -e 's/\(InfoValue:\)\s.*/\1\ /g' | pdftk "ocrtemp/merged-$PAGE2.pdf" update_info - output "ocrtemp/clean-$PAGE2.pdf"
done

echo "===== Combining pages for $filename"
pdftk ocrtemp/clean*.pdf cat output "$dir$base-new.pdf"

rm -rf ocrtemp
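
One partial aid for picking RESOLUTION: `pdfimages -list` (also from poppler-utils) prints one row per embedded image, including x-ppi and y-ppi columns. The sketch below is not part of the script above. It reads that listing on standard input and prints the most frequent x-ppi value, assuming the usual layout of two header lines with x-ppi in the 13th field. The function name `guess_ppi` is made up, and a PDF with mixed resolutions will still need a manual check.

```shell
# Hypothetical helper: read `pdfimages -list` output on stdin and print the
# x-ppi value that occurs most often (ties are resolved arbitrarily).
guess_ppi() {
  awk 'NR > 2 && NF >= 14 { seen[$13]++ }   # skip the two header lines
       END {
         best = ""; top = 0
         for (ppi in seen) if (seen[ppi] > top) { top = seen[ppi]; best = ppi }
         print best
       }'
}

# Example use:
# pdfimages -list input.pdf | guess_ppi
```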

Also, in case anyone reading this was wondering, you can extract the original images from a PDF using pdfimages, which is also part of Poppler (the poppler-utils package mentioned above). Syntax is like this:

Code:

pdfimages -all input.pdf outputpath/outputfileprefix

If you don't want it to make CCITT or JPEG2000 files, falling back to only PNG and JPEG files which are more usable (but with some increase in file size), use the following:

Code:

pdfimages -png -j input.pdf outputpath/outputfileprefix

I've found that most PDFs use JPEG (when lossy compressed) or PNG (when losslessly compressed), so CCITT and JPEG2000 formats are rare and the above command will give you the original unchanged images most of the time. CCITT is most common on old black and white scans (it's a lossless bilevel compression scheme used by fax machines), and some PDF generation programs output true color images in JPEG2000 format.
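
To see which case a given PDF falls into before extracting anything, the enc column of `pdfimages -list` names each image's stored encoding (jpeg, jp2, ccitt, jbig2, or image for plain raster data). A minimal sketch that tallies that column, assuming the usual layout of two header lines with enc in the 9th field; the function name `tally_encodings` is made up for illustration.

```shell
# Hypothetical helper: read `pdfimages -list` output on stdin and print one
# "encoding count" line per encoding, sorted by encoding name.
tally_encodings() {
  awk 'NR > 2 && NF >= 9 { n[$9]++ }        # skip the two header lines
       END { for (e in n) print e, n[e] }' | sort
}

# Example use:
# pdfimages -list input.pdf | tally_encodings
```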

Once you have PNG files, I have found the best way to minimize their size for including in a PDF is to run them through optipng first. I've found an optimization level of 7 is pretty hard to beat, but it's a bit slow (try 6 to run a fair bit faster with slightly larger, perhaps 1% more on average, output):

Code:

optipng -o7 *.png

I've found a lot of PDF software doesn't optimize PNG images this much so this can also improve file sizes.
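
To check how much a given optimization level actually saves on your own scans, you can total the file sizes before and after. A minimal sketch using only `wc`; the helper name `total_bytes` is made up for illustration.

```shell
# Hypothetical helper: print the combined size in bytes of the files given.
# Written as a subshell function so its variables do not leak into the caller.
total_bytes() (
  total=0
  for f in "$@"; do
    size=$(wc -c < "$f")
    total=$((total + size))
  done
  echo "$total"
)

# Example use:
# before=$(total_bytes *.png)
# optipng -o7 *.png
# after=$(total_bytes *.png)
# echo "saved $((before - after)) bytes"
```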