I have a PDF file containing the SCANNED pages from a 1980 manual… So there is no “text” per se in the file, just “pictures”.
Does anyone know of any way to use some kind of OCR program to create a text file that I can then clean up?
The file is 66 MB.
You could try a trial version of PDFPen. I’ve used it to OCR a PDF and then selected and pasted into BBEdit or any other word processor or editor for cleanup.
Thanks… I used that, then the PDF2GO website to extract the text (PDFPen required a paid subscription to extract the text)… The OCR isn’t great… so I’m not sure how much work the cleanup will be.
I’ve never found OCR to be really good. After using PDFPen to OCR a document, I’ve been able to just select the text (dragging the mouse over it or using control-a), then copy and paste. My documents are only a few pages long, however, and cleanup, especially of line endings, is quite a chore.
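For the line-ending part of that chore, a small script can do a first pass before hand editing. This is only a rough sketch in Python, not anything PDFPen or the other tools mentioned here provide: the function name `unwrap_ocr_text` is made up for illustration, and it assumes paragraphs in the pasted text are separated by blank lines. It also rejoins words split by a hyphen at the end of a line, which will wrongly merge real hyphenated compounds that happened to break there, so the result still needs proofreading.

```python
import re


def unwrap_ocr_text(text: str) -> str:
    """Join hard-wrapped OCR lines into paragraphs (hypothetical helper)."""
    # Normalize Windows (\r\n) and classic Mac (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words hyphenated across a line break: "docu-\nment" -> "document".
    # Assumption: every such hyphen is a line-wrap hyphen, not a real one.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Assumption: one or more blank lines mark a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for para in paragraphs:
        # Collapse remaining single newlines and runs of spaces to one space.
        joined = " ".join(para.split())
        if joined:
            cleaned.append(joined)
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "This is a docu-\r\nment that was\r\nscanned.\r\n\r\nSecond  para-\ngraph here.\n"
    print(unwrap_ocr_text(sample))
```

Paste the extracted text into a file, run it through the function, and then do the remaining cleanup in BBEdit or Word as before.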
This one is 206 pages. The original scanned PDF was 60+ MB, the OCR’d one dropped to 27 MB, and so far the Word doc version is 260 KB, roughly 230 times smaller than the original!
of course the original had hundreds of actual images (screen shots etc)… that the OCR didn’t translate…
Curious if NitroPDF could give you the results you are looking for.
Perhaps… but I’m already 100 pages into the cleanup using Word.