There are two types of PDF documents: those created by sending Office files, images, etc. to an Acrobat-like PDF printer, and those created by scanning physical paper such as pages of a book, legal documents, etc.
Google could always index PDF documents created by conversion, but now it also uses OCR to recognize text in PDFs that are generated by scanning paper documents.
This is a scanned document and this is the HTML text view of that same document converted by Google.
Since scanned PDFs are nothing but images, don’t be surprised if Google adds a "search by text" function to its Image Search engine, similar to OneNote or Evernote. That will surely be huge.
Convert Scanned PDFs to Text
Now if you have a bunch of scanned PDF files on your hard drive and no OCR software, here’s what you can do to convert them into recognizable text.
Create a folder on your website (say abc.com/pdf) and upload all the scanned PDFs to that folder. Now create a public web page that links to all the PDF files. Wait for the Google bots to spider your stuff.
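If the folder holds many files, writing the page of links by hand gets tedious. Here is a minimal sketch in Python that builds such a page. The function name `build_index`, the folder path and the `abc.com/pdf` base URL are placeholders for illustration, not part of any Google tool.

```python
from html import escape
from pathlib import Path
from urllib.parse import quote


def build_index(pdf_dir, base_url):
    """Return an HTML page with one link per PDF file found in pdf_dir.

    pdf_dir  -- local folder holding the scanned PDFs (hypothetical path)
    base_url -- public URL of the folder they were uploaded to (hypothetical)
    """
    base = base_url.rstrip("/")
    # Collect only .pdf files, in a stable alphabetical order.
    names = sorted(
        p.name
        for p in Path(pdf_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    items = []
    for name in names:
        href = base + "/" + quote(name)  # percent-encode spaces etc. in the URL
        items.append('<li><a href="' + href + '">' + escape(name) + "</a></li>")
    return "<html><body>\n<ul>\n" + "\n".join(items) + "\n</ul>\n</body></html>\n"
```

To use it, save the result next to the PDFs, for example `Path("index.html").write_text(build_index("scans", "http://abc.com/pdf"))`, and upload that page along with the files.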
Once the files are indexed, type the query "site:abc.com/pdf filetype:pdf" into Google to see the PDF documents as HTML.
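If you check several folders, the same query can be put together with a few lines of Python. This is only a sketch: the helper name `site_pdf_query` is made up, and the `https://www.google.com/search?q=` URL form is assumed here rather than taken from any official documentation.

```python
from urllib.parse import urlencode


def site_pdf_query(site_path):
    """Return the query text and a search URL for PDFs under site_path."""
    query = "site:" + site_path + " filetype:pdf"
    # Assumption: the search page accepts the query in a "q" parameter.
    url = "https://www.google.com/search?" + urlencode({"q": query})
    return query, url
```

Calling `site_pdf_query("abc.com/pdf")` gives back the query from the step above, plus a URL you can paste into the browser.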
Find this article at: http://www.labnol.org/software/convert-scanned-pdf-images-to-text-with-google-ocr/5158/
web: http://www.labnol.org/ email: amit@labnol.org


Reader Comments
Amit,
Will it work if I upload my scanned PDFs to Google Pages?
Regards,
Sharad
Written by Sharad on 10.31.08
I don’t recommend using Google Pages for hosting PDF files since Google is replacing it with Google Sites.
You may, however, use services like GeoCities or tripod.com that allow public documents.
Written by Amit on 10.31.08
Until now I only knew that I could edit scanned documents through Office XP and OCR software, and this one is a good one. Not needing OCR software makes it much easier.
Written by venkat on 10.31.08
Will this suggestion not cause all your private scanned files to become public and in Google’s cache? Please warn users about that.
Written by Manish on 10.31.08
You can upload the scanned PDFs to Gmail and send them to yourself only. Then open your inbox and the mail you sent; you will have an option to View as HTML. That will solve the hosting problem.
Written by mpradeep on 10.31.08
Incorrect article title… should read…
“Add Your Private PDF Documents to Google’s Database to be Used However Google Sees Fit”
Written by 350Zed on 10.31.08
mpradeep,
I tried your solution but Gmail will not display scanned items in HTML.
This would have been a great solution.
-SE
Written by SE on 10.31.08
Better, we could upload to Scribd and download RTF, DOC, text or even MP3.
Written by bootcat on 10.31.08
It’s funny that people are worried about privacy. I think they have no idea that private documents would not be published to a website and be downloadable by all users. If they are private, they would probably be in a password-protected area or a protected PDF file. How naive :D :D :D
Written by Phalgun on 10.31.08
@bootcat:
Trusting Google, or any other company, with any of your information is what is really “naive”.
Written by 350Zed on 11.01.08
Why go through all this trouble…
Use “OCRopus”; it’s an open source OCR maintained by Google.
Google is probably using this for image to text conversion…
Written by chirag on 11.02.08
I’m confused. Why would you use this solution when Adobe Acrobat has OCR capabilities built into it already?
Written by mbear on 11.03.08