I earlier recommended using the built-in OCR (Optical Character Recognition) engine of Google Web Search to convert scanned PDFs into text. You had to upload the scanned documents to a website and then wait for Google bots to index them.
Now assuming that you know how to extract text from scanned PDF images via Google OCR, the next important question is how good (and reliable) is Google’s text recognition technology vis-a-vis other commercial OCR software like Abbyy FineReader or Adobe Acrobat Professional.
For comparison sake, I chose this scanned PDF* as it contains a mix of tables, images and text of different sizes. The resolution of the scanned paper document is fairly poor as you can easily make it out from the document snapshot:
*The PDF document was initially available on the Hindu website from where Google crawlers picked up the document and converted it into a HTML version.
This is the digitized version of the scanned PDF created using Google OCR.
Google’s software (or rather web search engine) could successfully recognize most of the text and tables in the scanned image though, as expected, it did skip the images in the PDF document. There were a couple of junk characters included in the extracted version but I think that’s more due to the poor scan resolution.
OCR in Adobe Acrobat
Acrobat could recognize pages in the PDF document that had images and exported these pages as such to Microsoft Word. In some cases, it even recognized the text captions beneath the images and exported them as searchable text but overall, the results were too disappointing. The formatting was not preserved on most pages and there were just too many junk characters added to the extracted version.
Abbyy FineReader OCR
After Acrobat, I used Abbyy FineReader to digitize the scanned PDF and here’s the result. Abbyy, being a commercial OCR software, delivered the best performance – it retained the layout on almost every page, removed unnecessary line breaks and added minimal number of junk characters to just a few pages.
There’s however one area where Google OCR software definitely scored above Abbyy FineReader – recognizing image captions. One of the pages in the scanned PDF had around six images with text captions – FineReader recognized the whole page as one image while Google OCR could extract all these individual captions as text. And when compared with Adobe Acrobat, Google OCR was definitely a better choice.
Google’s online OCR is both free and requires no installation. If you have access to a public web server and can afford to wait for a couple of days for Google to convert your scanned PDF files, there’s really no need to hunt for free OCR alternatives anymore.
Also see: Software Tools for a Paperless Office