Use Ghostscript, but tell it to not reprocess images

Tags: ghostscript, pdf, pdftk

I have a PDF whose images are already compressed and somewhat artifact-y, and I'm using Ghostscript to prepend a title page to that PDF.

However, I cannot find any way to tell GS to just use the existing images as-is without reprocessing them, and I'm starting to suspect that this is inherent in how GS works, i.e. that you can't recompile/link a PDF without reprocessing its images. Is that true?

I can raise the DPI setting in GS, but it'll go from 5MB to 60MB while still looking worse.

Is there any better alternative to GS that'll do what I need (preferably that will compile on OS X)?

Best Answer

If you just want to concatenate two PDF files without any reprocessing of their content, pdftk is for you. (On Mac OS X it should be available via MacPorts or Fink; on Linux there are native packages for all major distributions; for Windows, a separate download is available.) Try this:

 pdftk title.pdf content.pdf cat output book.pdf

This will prepend the title.pdf to the content.pdf and write the result into book.pdf.

pdftk is a "dumb" but very fast way to concatenate two (or more) PDF files. "Dumb" in the sense that pdftk does not interpret the PDF data stream in any way; it just makes sure that the internal object numbers are reshuffled as needed and appear in the PDF xref structure (which is basically a sort of PDF ToC for objects).

Ghostscript:

If you want to use Ghostscript, the basic command to concatenate the same two files would be:

 gs \
  -o book.pdf \
  -sDEVICE=pdfwrite \
   title.pdf \
   content.pdf

However, as you experienced, this simple command line may degrade your image quality. The reason is that Ghostscript is not "dumb" when it processes PDFs: it fully interprets them when reading them in and creates a completely new file when writing the result. In doing so, it automatically applies default settings to many details of the processing. These defaults apply wherever the command line did not instruct Ghostscript otherwise.

So Ghostscript's method of creating the new book.pdf is much more "intelligent" (but also much slower) than pdftk's. (This is also why Ghostscript can in many cases, within limits, "repair" broken PDF files, embed fonts that are missing from the input PDFs, or replace duplicate images with mere references, and overall create smaller, better optimized files from bloated input PDFs.)

The solution is to override Ghostscript's defaults by adding custom parameters to the command line.

What does it mean "Ghostscript 'interprets' its PDF input"?

The whole file and its contents (objects, streams, fonts, images, ...) are read in, checked and held in Ghostscript's own internal representation before the resulting PDF is written out again with new PDF objects. When writing out, however, Ghostscript applies its internal default settings for the hundreds of parameters [*] that are available.

Unfortunately, this is what causes the "reprocessing" of your images according to these default settings, which can only be avoided or overridden by adding your own (desired) commandline parameters.

Your image problems could be caused by Ghostscript's need (due to licensing issues) to re-encode JPEG2000 images to JPEG encoding. If you want to avoid this, add the following to your commandline:

-dAutoFilterColorImages=false \
-dAutoFilterGrayImages=false \
-dColorImageFilter=/FlateEncode \
-dGrayImageFilter=/FlateEncode \

Note that /FlateEncode means every JPEG stream in your input PDF is decoded and stored again with lossless Flate (zlib) compression rather than as JPEG. No further image quality is lost this way, but the generated PDF file will be massively larger.

Other image-related commandline options to consider for including are:

-dColorConversionStrategy=/LeaveColorUnchanged \
-dDownsampleMonoImages=false \
-dDownsampleGrayImages=false \
-dDownsampleColorImages=false \

So the complete Ghostscript commandline that could make you happy should read:

 gs \
  -o book.pdf \
  -sDEVICE=pdfwrite \
  -dColorConversionStrategy=/LeaveColorUnchanged \
  -dDownsampleMonoImages=false \
  -dDownsampleGrayImages=false \
  -dDownsampleColorImages=false \
  -dAutoFilterColorImages=false \
  -dAutoFilterGrayImages=false \
  -dColorImageFilter=/FlateEncode \
  -dGrayImageFilter=/FlateEncode \
   title.pdf \
   content.pdf
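
One way to check whether a given command line really left the image encodings alone is to compare the "enc" column of `pdfimages -list` for the input and the output file. This is only a sketch: pdfimages comes from poppler-utils (not from Ghostscript or pdftk), the helper name `count_encodings` is made up for illustration, and the code assumes the tool's usual table layout with two header lines and the encoding in the ninth column.

```shell
#!/bin/sh
# Hypothetical helper: tally image encodings from `pdfimages -list` output on stdin.
# Assumption: two header lines, encoding ("enc") in column 9.
count_encodings() {
    awk 'NR > 2 && NF >= 9 { n[$9]++ } END { for (e in n) print e, n[e] }' | LC_ALL=C sort
}

# Usage (requires poppler-utils):
#   pdfimages -list content.pdf | count_encodings
#   pdfimages -list book.pdf    | count_encodings
```

If the two tallies differ, Ghostscript has re-encoded at least some of the images.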

You could also tell Ghostscript NOT to compress images at all in the output PDF, by using this commandline:

 gs \
  -o book.pdf \
  -sDEVICE=pdfwrite \
  -dColorConversionStrategy=/LeaveColorUnchanged \
  -dEncodeColorImages=false \
  -dEncodeGrayImages=false \
  -dEncodeMonoImages=false \
   title.pdf \
   content.pdf
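
Because changing or disabling compression can inflate the result considerably, it is worth comparing file sizes after trying each variant. A minimal sketch; `report_sizes` is a hypothetical helper name, and the file names in the usage line are simply the ones used in this answer.

```shell
#!/bin/sh
# Hypothetical helper: print "<bytes><TAB><file>" for each file given.
report_sizes() {
    for f in "$@"; do
        # Reading from stdin keeps wc from printing the file name;
        # tr strips the padding some wc implementations add.
        printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done
}

# Usage: report_sizes content.pdf book.pdf
```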



[*]:
If you want to see the complete list of default settings used by Ghostscript's pdfwrite device, run the following command:

 gs \
   -sDEVICE=pdfwrite \
   -o /dev/null \
   -c "currentpagedevice { exch ==only ( ) print == } forall"
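
The dump is long, so filtering it for image-related keys can help. The following is a sketch only: it assumes the one-key-per-line "/Name value" output that the command above produces, the pattern is just a guess at which key names matter for this question, and `image_params` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: keep only lines whose key mentions images,
# downsampling or colour conversion, sorted by key name.
image_params() {
    grep -E '^/[A-Za-z]*(Image|Downsample|ColorConversion)' | LC_ALL=C sort
}

# Usage (-q suppresses Ghostscript's startup banner):
#   gs -q -sDEVICE=pdfwrite -o /dev/null \
#      -c "currentpagedevice { exch ==only ( ) print == } forall" | image_params
```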

For explanations of what all these parameters mean, read the Adobe documentation on "Distiller Parameters". Ghostscript tries very hard to mimic them.
