Cannot copy non-Latin characters from PDF document

character encoding, pdf, unicode

I have a PDF file which contains some non-Latin European characters. If I select some text with the highlight tool and paste it into another program (Word, Notepad), the 'special' characters do not transfer correctly: I get other odd characters in their place.

I have tried copying the text from both Acrobat Reader and Foxit.

Is there anything I can do to copy this text correctly?


Best Answer

Normal PDF documents containing Unicode text do not store the text as characters, but as references to the glyphs (letter shapes) in the fonts used. When fonts are embedded in a PDF document, Acrobat also often splits a Unicode font into several smaller fonts. So even if you use only one font, these references may point to glyphs in several smaller fonts rather than to glyphs in the original font.
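To make the point above concrete, here is a toy model in Python. It is not real PDF syntax; the font names `F1` and `F2`, the glyph codes, and the helper functions are all made up for illustration. It only shows why reading the stored codes as if they were characters gives odd results.

```python
# Toy model only: two hypothetical subset fonts, each mapping a glyph code
# to a glyph name. In a real PDF the codes index into embedded font programs.
SUBSET_FONTS = {
    "F1": {33: "a", 34: "j"},
    "F2": {33: "ccaron"},
}

# The word "čaj" as it might be stored: (font, glyph code) pairs, not characters.
PAGE_RUN = [("F2", 33), ("F1", 33), ("F1", 34)]


def naive_copy(run):
    """Treat each glyph code as a character code, ignoring the font."""
    return "".join(chr(code) for _font, code in run)


def glyph_names(run):
    """Look up the glyph name each (font, code) pair refers to."""
    return [SUBSET_FONTS[font][code] for font, code in run]


print(naive_copy(PAGE_RUN))   # !!"
print(glyph_names(PAGE_RUN))  # ['ccaron', 'a', 'j']
```

The glyph names are the only link back to the original characters in this model, which is why the naming of the glyphs matters so much for copying.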

When copying and pasting Unicode text from Acrobat to another application, Acrobat needs enough information to reconstruct the Unicode characters from the letter shapes. If the font used has glyphs named according to the Adobe Glyph Naming Convention, then Acrobat can parse these names (which are also stored in the PDF document) and reconstruct the Unicode text. Unfortunately, there are many Unicode fonts, including the standard Windows fonts, which do not follow this convention, so this may not be possible.
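As a rough illustration of what parsing these names involves, here is a minimal Python sketch of the name-to-Unicode rules as I understand them from Adobe's glyph naming specification. It is not Acrobat's actual code. `AGL_EXCERPT` is a tiny hand-picked excerpt of the Adobe Glyph List (the real list has thousands of entries), and `glyph_name_to_text` is a name invented for this example.

```python
import re

# Assumption: a four-entry excerpt of the Adobe Glyph List, for illustration.
AGL_EXCERPT = {
    "A": "\u0041",
    "Aacute": "\u00C1",
    "ccaron": "\u010D",
    "afii10017": "\u0410",  # Cyrillic capital A
}

# "uni" followed by one or more groups of four uppercase hex digits.
_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)")
# "u" followed by four to six uppercase hex digits.
_U = re.compile(r"u([0-9A-F]{4,6})")


def _is_scalar(cp):
    """True for valid Unicode scalar values (surrogates excluded)."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def glyph_name_to_text(name):
    """Map a glyph name to text; unrecognised components map to ''."""
    base = name.split(".", 1)[0]       # drop any suffix such as ".sc"
    out = []
    for comp in base.split("_"):       # ligatures join components with "_"
        if comp in AGL_EXCERPT:
            out.append(AGL_EXCERPT[comp])
            continue
        m = _UNI.fullmatch(comp)
        if m:
            digits = m.group(1)
            cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
            if all(_is_scalar(cp) for cp in cps):
                out.append("".join(chr(cp) for cp in cps))
            continue
        m = _U.fullmatch(comp)
        if m:
            cp = int(m.group(1), 16)
            if _is_scalar(cp):
                out.append(chr(cp))
        # Anything else contributes nothing to the result.
    return "".join(out)


print(glyph_name_to_text("ccaron"))        # č
print(glyph_name_to_text("A_uni010D.sc"))  # Ač
print(repr(glyph_name_to_text("g123")))    # ''
```

The last call shows the failure case: a font whose glyphs carry arbitrary names such as `g123` gives the viewer nothing to map back to Unicode.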

Tagged PDF files also ensure reliable translation of text into Unicode, so you should be able to copy and paste Unicode text from a Tagged PDF file.

So, if you want to prevent this problem in future, when creating a PDF from a document containing non-Latin Unicode text, always generate the PDF file as a Tagged PDF and try to use only fonts whose glyphs are named according to the Adobe Glyph Naming Convention. Doing this will ensure that your Unicode PDF documents are searchable and that text can be reliably copied and pasted from them.