
I used to use Adobe Acrobat Professional, but I got tired of the cost and licensing hassles. Now I use PhantomPDF Business Edition, which costs less and works well.

I recently scanned an antiquarian document as PDF. This resulted in a PDF wrapper around some images. I used PhantomPDF for OCR and it was OK, but there were lots of suspects (words the OCR engine flagged as uncertain). I don’t think that Adobe Acrobat would have done much better.

Then I simply printed the PDF document to OneNote and right-clicked on the image to extract the text. It did an awesome job and it is free.

Thoughts on OCR and AI

There were still a few errors, which suggests that Microsoft have not applied much language-aware AI in the solution. My document contained words in Victorian and Indian English like “dawk” and “waggon”. AI could have determined the nature of the document and improved the recognition by comparing it against a corpus of similar texts. This seems like an easy win, and a quick search does show some people pursuing this approach.
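
To make the corpus idea a bit more concrete, here is a minimal sketch in Python of post-correcting OCR output against a period word list. It is only an illustration under stated assumptions: the tiny `PERIOD_LEXICON`, the `correct_token` and `correct_text` helpers, and the 0.7 similarity cutoff are all made up for this example, and this is not how OneNote or PhantomPDF work. A real version would build the lexicon from a large corpus of Victorian and Anglo-Indian texts.

```python
import difflib

# Hypothetical period lexicon. In practice this would be extracted from a
# corpus of texts similar to the scanned document.
PERIOD_LEXICON = {"dawk", "waggon", "the", "arrived", "by", "at", "station"}


def correct_token(token, lexicon, cutoff=0.7):
    """Return the closest lexicon word, or the token unchanged if none is close."""
    lower = token.lower()
    if lower in lexicon:
        # Already a known word for this kind of document, so leave it alone.
        return token
    # Sort the lexicon so the result is deterministic when scores tie.
    matches = difflib.get_close_matches(lower, sorted(lexicon), n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text, lexicon):
    """Apply correct_token to each whitespace-separated word."""
    return " ".join(correct_token(word, lexicon) for word in text.split())


if __name__ == "__main__":
    print(correct_text("The wagg0n arrived by dawk", PERIOD_LEXICON))
```

The point of the period lexicon is that “dawk” and “waggon” are treated as valid words and left alone, whereas a modern dictionary would likely flag them or “fix” them to something else. Note that this simple version lowercases any word it replaces and does not handle punctuation.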