Cutting & Pasting Vietnamese characters from a PDF

Tags: character encoding, notepad, pdf

I'm trying to copy and paste a block of Vietnamese text from a PDF document into Notepad++ (or any other editor; nothing works). The pasted text differs from the source text. What is the best way to fix this?

For example:

Source Text: (See screenshot for source text)

Pasted Text: Papaya Salad ~ GÕi ñu ñû Tôm

Thanks so much.

Edit: It appears that if the source is a Word document it copies & pastes as expected. PDF is the issue here.

Best Answer

  • This happens because the encoding used in the PDF is arbitrary.

    (Screenshot: Acrobat file properties, from a Vietnamese PDF I found on the internet)

    "Encoding: Custom" probably means a seemingly random encoding that the program producing this PDF made up for its own convenience.

    "Embedded Subset" means the program didn't need many characters from this font, so it embedded only the few it needed and arranged them in a seemingly arbitrary order (perhaps the order in which it encountered them in the text). The newly invented encoding is based on that ordering.

    These aren't really "characters" any more. The PDF no longer holds any universally meaningful information about which character each one is. It just has an indexed set of shapes and a list of positions and sizes at which to draw those indexed shapes.


    Wikipedia says

    CID-keyed fonts may be made without reference to a character collection by using an "identity" encoding, such as Identity-H (for horizontal writing) or Identity-V (for vertical). Such fonts may each have a unique character set, and in such cases the CID number of a glyph is not informative; generally the Unicode encoding is used instead, potentially with supplemental information.

    So you might check whether the text makes sense when read as, say, UTF-16 BE.
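To make the explanation above concrete, here is a minimal Python sketch of a toy "embedded subset" encoding. It is only a model of the idea, not how any particular PDF producer works: the function names, the numbering from 1, and the sample phrase are all assumptions made up for illustration, and the sample is not taken from the PDF in the question. It also shows the UTF-16 BE guess, which can only succeed if the producer happened to use Unicode code points as its two-byte codes.

```python
from typing import Dict, Optional

# Illustrative sample only ("Gỏi đu đủ"), written with escapes so the code
# points are unambiguous. It is not taken from the asker's PDF.
SAMPLE = "G\u1ecfi \u0111u \u0111\u1ee7"


def build_subset_encoding(text: str) -> Dict[str, int]:
    """Number each distinct character in order of first appearance."""
    encoding: Dict[str, int] = {}
    for ch in text:
        if ch not in encoding:
            # Hypothetical choice: start at 1 and leave code 0 unused.
            encoding[ch] = len(encoding) + 1
    return encoding


def encode(text: str, encoding: Dict[str, int]) -> bytes:
    """Turn text into the one-byte codes a page might store for this font."""
    return bytes(encoding[ch] for ch in text)


def decode_with_table(codes: bytes, encoding: Dict[str, int]) -> str:
    """Recover the text, which is only possible with the reverse table."""
    reverse = {code: ch for ch, code in encoding.items()}
    return "".join(reverse[code] for code in codes)


def try_utf16be(raw: bytes) -> Optional[str]:
    """Read raw codes as UTF-16 BE; return None if they cannot be decoded."""
    try:
        return raw.decode("utf-16-be")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    table = build_subset_encoding(SAMPLE)
    codes = encode(SAMPLE, table)
    print(list(codes))                      # the made-up codes
    print(repr(codes.decode("latin-1")))    # what a naive paste would see
    print(decode_with_table(codes, table))  # needs the producer's table
    # Only works if the codes happen to be Unicode code points:
    print(try_utf16be(SAMPLE.encode("utf-16-be")))
```

The point of the sketch is that the stored codes carry no meaning on their own. Without the table the producer used, any decoding is a guess, and a guess such as UTF-16 BE pays off only when the codes happen to coincide with Unicode.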