I earlier recommended using the built-in OCR (Optical Character Recognition) engine of Google Web Search to convert scanned PDFs into text. You had to upload the scanned documents to a website and then wait for Google bots to index them.
Now assuming that you know how to extract text from scanned PDF images via Google OCR, the next important question is how good (and reliable) is Google’s text recognition technology vis-a-vis other commercial OCR software like Abbyy FineReader or Adobe Acrobat Professional.
For comparison sake, I chose this scanned PDF* as it contains a mix of tables, images and text of different sizes. The resolution of the scanned paper document is fairly poor as you can easily make it out from the document snapshot:

*The PDF document was initially available on the Hindu website from where Google crawlers picked up the document and converted it into a HTML version.
Google OCR
This is the digitized version of the scanned PDF created using Google OCR.
Google’s software (or rather web search engine) could successfully recognize most of the text and tables in the scanned image though, as expected, it did skip the images in the PDF document. There were a couple of junk characters included in the extracted version but I think that’s more due to the poor scan resolution.
OCR in Adobe Acrobat
I then tried using the OCR feature of Adobe Acrobat to extract text from the scanned PDF and here’s the resulting Word document.
Acrobat could recognize pages in the PDF document that had images and exported these pages as such to Microsoft Word. In some cases, it even recognized the text captions beneath the images and exported them as searchable text but overall, the results were too disappointing. The formatting was not preserved on most pages and there were just too many junk characters added to the extracted version.
Abbyy FineReader OCR
After Acrobat, I used Abbyy FineReader to digitize the scanned PDF and here’s the result. Abbyy, being a commercial OCR software, delivered the best performance – it retained the layout on almost every page, removed unnecessary line breaks and added minimal number of junk characters to just a few pages.
There’s however one area where Google OCR software definitely scored above Abbyy FineReader – recognizing image captions. One of the pages in the scanned PDF had around six images with text captions – FineReader recognized the whole page as one image while Google OCR could extract all these individual captions as text. And when compared with Adobe Acrobat, Google OCR was definitely a better choice.
Google’s online OCR is both free and requires no installation. If you have access to a public web server and can afford to wait for a couple of days for Google to convert your scanned PDF files, there’s really no need to hunt for free OCR alternatives anymore.
Also see: Software Tools for a Paperless Office
Find this article at: http://www.labnol.org/software/google-ocr-software-review/7208/
Tags: abbyy, adobe acrobat, convert, feature, ocr, pdf, Software

Reader Comments
I dont understand why people still use Acrobat, I think its all the money adobe puts in marketing it, I consider Foxit reader the best of all, and the text is more easy to read copy or snap compared to acrobat
Written by seer on 02.09.09
I was using gmail with the ‘view in html’ option for PDFs after your last post, but seems this option is no longer available.
Written by Azzam on 02.09.09
@seer
I completely agree with you. 1 more advantage with Foxit is that it consumes much lesser memory. I stopped using Acrobat ever since I came across Foxit Reader.
Written by suryadeep on 02.09.09
What is Foxit or Acrobat Reader got to do with OCR?
Written by Amit on 02.10.09
I don’t think that OCR with Google Search is 100% accurate.
In case of PDF all are characters, but in image case it’s difficult for a bot to recognise.
Written by Mohit on 02.10.09
As far as free OCR goes, Google’s offering is certainly interesting. I have to wonder what they’re using for OCR though on their servers. Their Tesseract engine is pretty poor compared to commercial offerings, unless they’re better at getting it to deliver than I am (entirely probable).
I think the reason they haven’t sewn up the OCR market is for the reasons you mention – it’s not instant, it’s not fantastic quality, and if you store important documents online (in a free account) there’s no guarantee that you won’t lose the account, and your documents with it.
Written by Tim at Home Document Manager on 03.19.09