Perform OCR with Google Docs – Turn Images Into Editable Documents

Google Docs can now perform OCR on digital images. You can upload an image containing typewritten or printed text (like a fax document or a scanned newspaper clipping) to your Google Docs account and it will turn that image into editable text.

In the following example, Google Docs successfully extracted all the text from a scanned book page and converted it into an editable document.

[Image: Google Docs OCR]

The OCR feature is not yet part of the standard Google Docs UI, but you can use this sample form to upload scanned images to your Google Account. The server will then try to extract text from those images, provided the image resolution is good and the text inside them is written in a Latin character set.

The OCR feature can also extract text from noisy images (like this WSJ clipping), though the recognized text is not very accurate and the document formatting is lost (see conversion results).

If you are a developer, you can add the ocr=true parameter to your upload request and Google Docs will automatically scan that image for text. You can also upload images to Google Docs without the OCR parameter, but in that case the image will be converted into a new Word document without any OCR.
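As a rough illustration of the developer route, here is a minimal Python sketch that prepares an upload request with the ocr=true parameter attached. The endpoint URL, the function name and the content type are placeholders invented for this example and are not taken from Google's documentation. Authentication is left out entirely, and the request is only built, never sent.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import Request

# Hypothetical endpoint, used only for illustration. Substitute the real
# upload URL from the API documentation you are working with.
UPLOAD_URL = "https://docs.example.com/feeds/upload"


def build_upload_request(image_bytes, content_type="image/png",
                         ocr=True, base_url=UPLOAD_URL):
    """Prepare (but do not send) a POST request that uploads an image.

    When ocr is True, the ocr=true query parameter is added to the URL.
    """
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))  # keep any existing parameters
    if ocr:
        query["ocr"] = "true"
    url = urlunsplit(parts._replace(query=urlencode(query)))
    return Request(url, data=image_bytes,
                   headers={"Content-Type": content_type}, method="POST")


# Build a request with OCR enabled and show the resulting URL.
request = build_upload_request(b"placeholder image bytes")
print(request.full_url)
```

Passing ocr=False leaves the URL unchanged, which corresponds to the plain upload without text recognition described above.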

Like Google Docs, Google Search also includes OCR features. The difference is that Google Docs can extract text from images, while the OCR in Google Search works only with scanned PDF files.

Find this article at: http://www.labnol.org/internet/perform-ocr-with-google-docs/10059/

Tags: Internet

Reader Comments

Thanks for the information. Seems Google Docs is getting more versatile…

Man, you just rock… How do you get such good ideas… love you and this blog….

Coolsome.

Anybody know which ocr engine they use?

It would seem your initial statement, while technically correct, is deceptive. You say “Google Docs successfully extracted all the text from a scanned book page”, while the images clearly show that the OCR had significant trouble with the underlined numbers. You might want to note that in your post.

@David Legg – they use Tesseract (code.google.com/p/tesseract-ocr/) – the open source OCR engine. I had some problems with low res images. OCR Terminal (www.ocrterminal.com) does a pretty good job with online OCR – think Labnol has reviewed them before.

Wow ! Thank you again my friend

It’s not perfect. But does a good job.

Although, it converts “this specification defines” to “this specification deƱnes”.

Great article btw.

Microsoft Word could so use this feature.

Thanks for this wonderful tip. Any chance it would ever work with (non-searchable, of course) pdfs?

