Ubuntu – Optical Character Recognition software recommendations


I have seen some ebooks and papers that were apparently scanned from their paper versions, yet their text can still be selected and copied. I suppose the scanned versions must have been processed by some Optical Character Recognition (OCR) software.

So I would like to know which Optical Character Recognition software is recommended, especially software that runs on Ubuntu or is free. If the Windows options are far superior, please let me know as well.

I am particularly interested in OCR tools that accept a scanned PDF file as input and produce another PDF file as output that looks the same as the input but has copyable text.

Best Answer

Tesseract OCR

The original engine was developed at HP between the mid-1980s and the mid-1990s, and it has proven to be one of the best optical character recognition tools I've used. It was later released as open source and has since received many updates to the engine, making it one of the most comprehensive OCR tools available. In my experience it outscores most other OCR tools (matching text somewhere in the high 90s percent range), and it can easily convert standard printed typefaces to text.

The following is an example:

tesseract ScannedDocument.png out

This produces a file called out.txt containing the recognized text.
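
The command above writes plain text only. For the scanned-PDF-to-searchable-PDF case in the question, a rough sketch follows. Tesseract does not read PDF files directly, so the pages are first rendered to images. This assumes pdftoppm (from poppler-utils) is installed and that your Tesseract build supports the pdf output config and accepts a text file listing input images; the function name ocr_pdf and all file names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: ocr_pdf is a hypothetical helper, not part of Tesseract.
# Assumes pdftoppm (poppler-utils) and a Tesseract build with PDF output.
ocr_pdf() {
    input=$1      # scanned PDF to read
    outbase=$2    # output name without extension (Tesseract appends .pdf)
    workdir=$3    # scratch directory for the rendered page images

    mkdir -p "$workdir" || return 1

    # Render every page to a 300 dpi PNG named page-<number>.png
    pdftoppm -r 300 -png "$input" "$workdir/page" || return 1

    # Write the page images, one per line, to a list file
    printf '%s\n' "$workdir"/page-*.png > "$workdir/pages.txt" || return 1

    # OCR all listed pages; the trailing "pdf" asks for a PDF that keeps
    # the page image and adds a hidden, copyable text layer
    tesseract "$workdir/pages.txt" "$outbase" pdf
}
```

Calling ocr_pdf ScannedDocument.pdf out /tmp/ocr-pages should then leave the result in out.pdf. 300 dpi is a common starting point for scans, not a requirement.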