Linux – Batch-OCR many PDFs

adobe-acrobatlinuxocrpdfwindows

This has been discussed a year ago here:

Batch OCR for many PDF files (not already OCRed)?

Is there any way to batch OCR PDFs that haven't been already OCRed? This is, I think, the current state of things dealing with two issues:

Batch OCR PDFs

Windows

  • Acrobat – This is the most straightfoward ocr engine that will batch OCR. The only problem seems to be 1) it wont skip files that have already been OCRed 2) try throwing a bunch of PDFs at it (some old) and watch it crash. It is a little buggy. It will warn you at each error it runs into (though you can tell the software to not notify. But again, it dies horribly on certain types of PDFs so your mileage may vary.

  • ABBYY FineReader (Batch/Scansnap), Omnipage – These have got to be some of the worst programmed pieces of software known to man. If you can find out how to fully automate (no prompting) batch OCR of PDFs saving with the same name then please post here. It seems the only solutions I could find failed somewhere–renaming, not fully automated, etc. etc. At best, there is a way to do it, but the documentation and programming is so horrible that you'll never find out.

  • ABBYY FineReader Engine, ABBYY Recognition Server – These really are more enterprise solutions, you probably would be better off just getting acrobat to run over a folder and try and weed out pdfs that give you errors/crash the program than going through the hassle of trying to install evaluation software (assuming you are a simple end-user). Doesn't seem cost competitive for the small user.

  • ** Autobahn DX workstation ** the cost of this product is so prohibitive, you probably could buy 6 copies of acrobat. Not really an end-user solution. If you're an enterprise setup, this may be worth it for you.

Linux

  • WatchOCR – no longer developed, and basically impossible to run on modern Ubuntu distros
  • pdfsandwich – no longer developed, basically impossible to run on modern Ubuntu distros
  • ** ABBY LINUX OCR ** – this should be scriptable, and seems to have some good results:

http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison

However, like a lot of these other ABBYY products they charge by the page, again, you might be better off trying to get Acrobat Batch OCR to work.

  • **Ocrad, GOCR, OCRopus, tesseract, ** – these may work but there are a few problems:

    1. OCR results are not as great as, say, acrobat for some of these (see above link).
    2. None of the programs take in a PDF file and output a PDF file. You have to create a script and break apart the PDF first and run the programs over each and then reassemble the file as a pdf
    3. Once you do, you may find, like I did, that (tesseract) creates an OCR layer that is shifted over. So if you search for the word 'the', you'll get a highlight of the part of the word next to it.
  • Batch DjVu → Convert to PDF – haven't looked into it, but seems like a horrible round-a-bout solution.

Online

  • PDFcubed.com – come on, not really a batch solution.
  • ABBYY Cloud OCR – not sure if this is really a batch solution, either way, you have to pay by the page and this could get quite pricey.

Identifying non-OCRed PDFs

This is a slightly easier problem, that can be solved easily in Linux and much less so in Windows. I was able to code a perl script using pdffont to identify whether fonts are embedded to determine which files are not-OCRed.


Current "solutions"

  1. Use a script to identify non-OCRed pdfs (so you don't rerun over thousands of OCRed PDFs) and copy these to a temporary directory (retaining the correct directory tree) and then use Acrobat on Windows to run over these hoping that the smaller batches won't crash.

  2. use the same script but get one of the linux ocr tools to properly work, risking ocr quality.

I think I'm going to try #1, I'm just worried too much about the results of the Linux OCR tools (I don't suppose anyone has done a comparison) and breaking the files apart and stitching them together again seems to be unnecessary coding if Adobe can actually batch OCR a directory without choking.

If you want a completely free solution, you'll have to use a script to identify the non-OCRed pdfs (or just rerun over OCRed ones), and then use one of the linux tools to try and OCR them. Teseract seems to have the best results, but again, some of these tools are not supported well in modern versions of Ubuntu, though if you can set it up and fix the problem I had where the image layer not matching the text-matching layer (with tesseract) then you would have a pretty workable solution and once again Linux > Windows.


Do you have a working solution to fully automate, batch OCR PDFs, skipping already OCRed files keeping the same name, with high quality? If so, I would really appreciate the input.


Perl script to move non-OCRed files to a temp directory. Can't guarantee this works and probably need to be rewritten, but if someone makes it work (assuming it doesn't work) or work better, let me know and I'll post a better version here.


#!/usr/bin/perl

# move non-ocred files to a directory
# change variables below, you need a base dir (like /home/joe/), and a sourcedirectory and output
# direcotry (e.g books and tempdir)
# move all your pdfs to the sourcedirectory

use warnings;
use strict;

# need to install these modules with CPAN or your distros installer (e.g. apt-get)
use CAM::PDF;
use File::Find;
use File::Basename;
use File::Copy;

#use PDF::OCR2;
#$PDF::OCR2::CHECK_PDF   = 1;
#$PDF::OCR2::REPAIR_XREF = 1;

my $basedir = '/your/base/directory';
my $sourcedirectory  = $basedir.'/books/';
my @exts       = qw(.pdf);
my $count      = 0;
my $outputroot = $basedir.'/tempdir/';
open( WRITE, >>$basedir.'/errors.txt' );

#check file
#my $pdf = PDF::OCR2->new($basedir.'/tempfile.pdf');
#print $pdf->page(10)->text;



find(
    {
        wanted => \&process_file,

        #       no_chdir => 1
    },
    $sourcedirectory
);
close(WRITE);

sub process_file {
    #must be a file
    if ( -f $_ ) {
        my $file = $_;
        #must be a pdf
        my ( $dir, $name, $ext ) = fileparse( $_, @exts );
        if ( $ext eq '.pdf' ) {
            #check if pdf is ocred
            my $command = "pdffonts \'$file\'";
            my $output  = `$command`;
            if ( !( $output =~ /yes/ || $output =~ /no/ ) ) {
                #print "$file - Not OCRed\n";
                my $currentdir = $File::Find::dir;
                if ( $currentdir =~ /$sourcedirectory(.+)/ ) {
                    #if directory doesn't exist, create
                    unless(-d $outputroot.$1){
                    system("mkdir -p $outputroot$1");
                    }
                    #copy over file
                    my $fromfile = "$currentdir/$file";
                    my $tofile = "$outputroot$1/$file";
                    print "copy from: $fromfile\n";
                    print "copy to: $tofile\n";
                    copy($fromfile, $tofile) or die "Copy failed: $!";
#                       `touch $outputroot$1/\'$file\'`;
                }
            }

        }

    }
}

Best Answer

I too have looked for a way to batch-OCR many PDFs in an automated manner, without much luck. In the end I have come up with a workable solution similar to yours, using Acrobat with a script as follows:

  1. Copy all relevant PDFs to a specific directory.

  2. Remove PDFs already containing text (assuming they are already OCRd or already text - not ideal I know, but good enough for now).

  3. Use AutoHotKey to automatically run Acrobat, select the specific directory, and OCR all documents, appending "-ocr" to their filename.

  4. Move the OCRd PDFs back to their original location, using the presence of a "-ocr.pdf" file to determine whether it was successful.

It is a bit Heath Robinson, but actually works pretty well.