Convert Scanned PDF Documents to Text with Google OCR

There are two types of PDF documents: those created by sending Office files, images, etc. to an Acrobat-like PDF printer, and those created by scanning physical paper such as pages of a book or legal documents.


Google has always been able to index PDF documents created by conversion, but it now also uses OCR to recognize text in PDFs that were generated by scanning paper documents.

This is a scanned document and this is the HTML text view of that same document as converted by Google.

Since scanned PDFs are nothing but images, don’t be surprised if Google adds a "search by text" function to its Image Search engine, similar to OneNote or Evernote. That will surely be huge.

Convert Scanned PDFs to Text

Now if you have a bunch of scanned PDF files on your hard drive and no OCR software, here’s what you can do to convert them into recognizable text.

Create a folder on your website (say abc.com/pdf) and upload all the scanned PDFs to that folder. Now create a public web page that links to all the PDF files. Wait for the Google bots to spider your stuff.

Once done, type the query "site:abc.com/pdf filetype:pdf" to see the PDF documents as HTML.
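The steps above can be sketched in a few lines of Python. This is only a rough illustration, not part of the original tip: the function names build_index and site_query are made up for this example, and the folder "pdf" and the address http://abc.com/pdf simply reuse the placeholder site from the text. It builds the public page that links to every PDF in a folder and prints the search query to try once Google has crawled it.

```python
from html import escape
from pathlib import Path
from urllib.parse import quote


def build_index(pdf_dir, base_url):
    """Return HTML for a page that links to every .pdf file in pdf_dir.

    base_url is the public address of the folder (an assumption here;
    use whatever URL your own host gives you).
    """
    base = base_url.rstrip("/")
    items = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # Percent-encode the file name so spaces etc. survive in the link.
        href = f"{base}/{quote(pdf.name)}"
        items.append(
            f'<li><a href="{escape(href)}">{escape(pdf.name)}</a></li>'
        )
    return (
        "<!DOCTYPE html>\n"
        "<html><head><title>Scanned PDFs</title></head>\n"
        "<body><ul>\n" + "\n".join(items) + "\n</ul></body></html>\n"
    )


def site_query(site_path):
    """Return the search query from the article for a given site path."""
    return f"site:{site_path} filetype:pdf"


if __name__ == "__main__":
    folder = Path("pdf")  # hypothetical local copy of the upload folder
    if folder.is_dir():
        page = build_index(folder, "http://abc.com/pdf")
        (folder / "index.html").write_text(page, encoding="utf-8")
    print(site_query("abc.com/pdf"))
```

Upload the generated index.html along with the PDFs. Remember that anything linked from a public page can end up in Google's index and cache, so use this only for documents you are happy to make public.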

Find this article at: http://www.labnol.org/software/convert-scanned-pdf-images-to-text-with-google-ocr/5158/

web: http://www.labnol.org/ email: amit@labnol.org


Reader Comments

Amit,

Will it work if I upload my scanned PDFs to Google Pages?

Regards,
Sharad

I don’t recommend using Google Pages for hosting PDF files since Google is replacing it with Google Sites.

You may, however, use services like GeoCities or tripod.com that allow public documents.

Until now I only knew I could edit scanned documents through Office XP and OCR software, and this one is a good one. No OCR software required makes it much easier.

Will this suggestion not cause all your private scanned files to become public and in Google’s cache? Please warn users about that.

You can attach the scanned PDFs to a Gmail message and send it only to yourself. Then open your inbox and the mail you sent; you will have an option to View as HTML. That will solve the hosting problem.

Incorrect article title… should read…

“Add Your Private PDF Documents to Google’s Database to be Used However Google Sees Fit”

mpradeep,

I tried your solution but Gmail will not display scanned items in HTML.

This would have been a great solution.

-SE

Better, we could upload them to Scribd and download RTF, DOC, text or even MP3.

It’s funny people are worried about privacy. I think they have no idea that private documents would not be published to a website and be downloadable by all users. If they are private, they would probably be in a password-protected area or a protected PDF file. How naive :D :D :D

@bootcat:

Trusting Google, or any other company, with any of your information is what is really “naive”.

Why go through all this trouble….
Use “OCRopus”; it’s an open source OCR maintained by Google..
Google is probably using only this for image-to-text conversion…

I’m confused. Why would you use this solution when Adobe Acrobat has OCR capabilities built into it already?
