Practical OCR solution for converting a large book to a digital format


I was over at my grandparents' place this past weekend. My grandmother pulled out this giant (~1400 page) book of her family history going back to 1630 or so. Giant nerd that I am, I thought it would be slick to have all the information stored in a database and available from the web. I can handle all the web programming and regular expressions and whatnot, but what I don't know is the best way to get the text from book to computer.

I know some kind of OCR will be necessary. From the little research I've done, it seems like my options are:

  1. take a picture of every page with a camera then process the pictures with OCR software
  2. use a scanner to scan each page, then process with OCR software
  3. use some kind of hand held device, like this.

Does anyone have any ideas about the best way to tackle this problem? I don't want to destroy the book, because as far as I know, it can't be replaced. This is probably the only time I'm ever going to scan a large book, so I don't think I want to spend more than $250 on any kind of device. I don't mind some manual effort here (I realize this will most likely take months), but I'd like to find the most efficient method possible.

Note about the book: It's only about 20 years old, so it's in pretty good shape. It's monochrome and the pages haven't begun to yellow. Since it is so large though, I worry about possible shadows when the text gets down close to the binding.

Best Answer

I came across this on Lifehacker quite some time back, and it has been one of my top DIY projects ever since.

enter image description here

Replace the iPhone with any camera or imaging device, and you get a stack of nice high-res JPEGs ready for you to OCR with any software, even (urks!) MS Office... ;)

Cheap. Effective. DIY. You can't beat an idea like this.
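For the "OCR with any software" step, here is a minimal sketch of batch-processing the photos. It assumes the open-source Tesseract OCR engine is installed and its `tesseract` command is on your PATH; that is my choice for illustration, not something the Lifehacker setup requires. The function names and the `photos` / `text` folder names are made up for the example.

```python
import re
import subprocess
from pathlib import Path


def natural_key(path):
    """Sort key so page2.jpg comes before page10.jpg."""
    # re.split with a capture group alternates text (even index) and digits (odd index)
    parts = re.split(r"(\d+)", path.name)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def tesseract_command(image, out_dir, lang="eng"):
    """Build the CLI call: tesseract <image> <output base> -l <language>."""
    out_base = Path(out_dir) / Path(image).stem  # tesseract appends ".txt" itself
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_folder(photo_dir, out_dir, lang="eng"):
    """OCR every .jpg in photo_dir, writing one text file per page, in page order."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pages = sorted(Path(photo_dir).glob("*.jpg"), key=natural_key)
    for image in pages:
        # check=True stops the run if tesseract fails on a page
        subprocess.run(tesseract_command(image, out_dir, lang), check=True)
    return pages
```

Calling `ocr_folder("photos", "text")` would leave one `.txt` file per photographed page in `text`, which you can then feed to your regular expressions. With ~1400 pages, expect to proofread the output by hand; no OCR engine will get every name and date right.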

EDIT: Comments raised some points about shadows, page curling, etc. These are quite easily resolved, as anyone who has photocopied library texts will know.

Add multiple light sources to illuminate the book and eliminate the shadows.

Open the book at 90 degrees so the pages don't curl towards the binding in the middle. This also preserves the binding.

I'll see if I can give an example and set one up myself.

EDIT 2: Uploaded a sample of how you should hold the book. Also notice the light source from the left.

enter image description here

