Convert Scanned PDF Documents to Text with Google OCR

There are two types of PDF documents: those created by sending Office files, images, etc. to an Acrobat-like PDF printer, and those created by scanning physical paper such as pages of a book, legal documents, etc.

Google could always index PDF documents created by conversion, but now it also uses OCR to recognize text in PDFs that are generated by scanning paper documents.

This is a scanned document and this is the html text view of that same document converted by Google.

Since scanned PDFs are nothing but images, don’t be surprised if Google adds a "search by text" function to its Image Search engine, similar to OneNote or Evernote. That will surely be huge.

Convert Scanned PDFs to Text

Now if you have a bunch of scanned PDF files on your hard drive and no OCR software, here’s what you can do to convert them into recognizable text.

Create a folder on your website (say abc.com/pdf) and upload all the scanned PDF files to that folder. Now create a public web page that links to all the PDF files. Wait for the Google bots to spider your stuff.
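If the folder holds many files, the public page of links can be generated rather than typed by hand. Here is a minimal sketch in Python, assuming the PDFs sit in a local folder that mirrors abc.com/pdf; the function names `build_index_html` and `write_index` and the output file name `index.html` are made up for illustration.

```python
from html import escape
from pathlib import Path
from urllib.parse import quote


def build_index_html(pdf_names, base_url):
    """Return a minimal HTML page with one link per PDF file name."""
    base = base_url.rstrip("/")
    items = "\n".join(
        # quote() makes the file name safe inside a URL,
        # escape() makes it safe as visible HTML text.
        f'<li><a href="{base}/{quote(name)}">{escape(name)}</a></li>'
        for name in sorted(pdf_names)
    )
    return (
        "<!DOCTYPE html>\n<html><head><title>Scanned PDFs</title></head>\n"
        f"<body><ul>\n{items}\n</ul></body></html>\n"
    )


def write_index(folder, base_url):
    """Write index.html into `folder`, linking every *.pdf found there."""
    folder = Path(folder)
    names = [p.name for p in folder.glob("*.pdf")]
    page = build_index_html(names, base_url)
    (folder / "index.html").write_text(page, encoding="utf-8")
    return names
```

Calling `write_index("pdf", "http://abc.com/pdf")` would write the page next to the PDFs, ready to upload along with them.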

Once done, type the query "site:abc.com/pdf filetype:pdf" to see the PDF documents as HTML.
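The same query can also be turned into a search link to bookmark and re-check while waiting for the crawl. This is only a sketch: the helper names are hypothetical, and the `https://www.google.com/search?q=` URL form is assumed here rather than taken from the steps above.

```python
from urllib.parse import urlencode


def google_site_query(site_path, filetype="pdf"):
    """Build the query text, e.g. 'site:abc.com/pdf filetype:pdf'."""
    return f"site:{site_path} filetype:{filetype}"


def google_search_url(query):
    """Wrap a query in a search URL (assumed URL form, see note above)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    print(google_search_url(google_site_query("abc.com/pdf")))
```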

Find this article at: http://www.labnol.org/software/convert-scanned-pdf-images-to-text-with-google-ocr/5158/

Tags: Software

Reader Comments

Amit,

Will it work if I upload my scanned PDFs to Google Pages?

Regards,
Sharad

I don’t recommend using Google Pages for hosting PDF files since Google is replacing it with Google Sites.

You may, however, use services like GeoCities or tripod.com that allow public documents.

Until now I only knew I could edit scanned documents through Office XP and OCR software, and this is a good one. Needing no OCR software makes it much easier.

Will this suggestion not cause all your private scanned files to become public and in Google’s cache? Please warn users about that.

You can upload the scanned PDFs to Gmail and send them only to yourself. Then open your Inbox and the mail you sent; you will have an option to View as HTML. That will solve the hosting problem.

Incorrect article title… should read…

“Add Your Private PDF Documents to Google’s Database to be Used However Google Sees Fit”

mpradeep,

I tried your solution but Gmail will not display scanned items in HTML.

This would have been a great solution.

-SE

Better, we could upload to Scribd and download RTF, DOC, text or even MP3.

It’s funny people are worried about privacy. I think they have no idea that private documents would not be published to a website and be downloadable by all users. If they are private, they would probably be in a password-protected area or a protected PDF file. How naive :D :D :D

@bootcat:

Trusting Google, or any other company, with any of your information is what is really “naive”.

Why go through all this trouble…
Use “OCRopus”, it’s an open source OCR system maintained by Google.
Google is probably using only this for image-to-text conversion…

I’m confused. Why would you use this solution when Adobe Acrobat has OCR capabilities built into it already?

“I’m confused. Why would you use this solution when Adobe Acrobat has OCR capabilities built into it already?”

Perhaps because anyone can scan into PDF, but nobody wants to buy a whole Acrobat suite for hundreds of dollars from this monopoly called Adobe when Google is offering its tricks for free…?

“@bootcat:
Trusting Google, or any other company, with any of your information is what is really “naive”.”

It’s not the point – there’s a *huge* difference between trusting Google with your docs or not and publishing your docs open to the entire world.

“I’m confused. Why would you use this solution when Adobe Acrobat has OCR capabilities built into it already?”

Because Acrobat’s OCR doesn’t work very well?

Agree with “flip”: Acrobat Reader already has built-in OCR capability. What’s new there?

Google Docs is another option. On the Google home page, click the “more” tab at the top of the screen, then click “Documents”.

Probably won’t work very well for conspiracy theorists, the seriously paranoid, or those with dark secrets…