Windows – Why do text editors think this file is UTF-8

text-editors, unicode, windows 7

I have two text files, which I'm giving download links to rather than a pastebin to preserve their contents precisely:

Both of these text files consist only of spaces, carriage returns, newlines, and the letter X, and they should be ASCII encoded. The only difference between the two files is that the second has its leading and trailing blank lines removed, along with some leading and trailing spaces on each line.

The first file is not causing any problems. For some reason, my text editors are detecting the second file as UTF-8:

  • Notepad, when opened by double-clicking the text file, displays corrupt text.

  • Notepad, when using File → Open, works fine as long as I explicitly choose "ANSI".

  • Notepad++, while displaying the file fine, believes it is encoded as "UTF-8 (No BOM)".

  • In Notepad++, even if I select "convert to ANSI" and save the file, the saved file is byte-for-byte identical to the original, and both editors still detect it as UTF-8!

  • Both editors have no issues with the first file and correctly recognize it as ASCII (or ANSI).

I looked at the second text file in a hex editor. Indeed, it does not start with a BOM. The first few bytes of the file are 20 20 20 20 20 20 20 20, as they should be, since it starts with spaces.

My question is: Why, then, do both Notepad and Notepad++ detect the second file as UTF-8? Given that the file has no BOM header, why is this happening, and what is unique about the second file compared to the first file that is causing this? I can't figure out what's going on.

Best Answer

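Both of those files are valid ASCII and valid UTF-8, as they include only code points ≤ 0x7F (to put it differently, no single byte has a value greater than 127). For such files the two encodings are byte-for-byte identical, so an editor cannot tell them apart from the content alone.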
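To illustrate the point, here is a minimal Python sketch that checks whether a byte string decodes as ASCII and as UTF-8. The function name `classify` and the sample bytes are made up for this example; they only mimic the kind of content described in the question (spaces, CR/LF, and the letter X).

```python
def classify(data: bytes) -> dict:
    """Report whether `data` decodes cleanly as ASCII and as UTF-8."""
    result = {}
    for enc in ("ascii", "utf-8"):
        try:
            data.decode(enc)          # strict decoding; raises on invalid bytes
            result[enc] = True
        except UnicodeDecodeError:
            result[enc] = False
    # Largest byte value present (None for empty input)
    result["max_byte"] = max(data) if data else None
    return result


# Hypothetical sample: only spaces, 'X', carriage returns and newlines
sample = b"   X  X\r\n  XX \r\n"
print(classify(sample))
```

Any input whose largest byte is 127 or less passes both checks, which is why "convert to ANSI" in Notepad++ cannot change a single byte of such a file.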

My guess is that Notepad++ and Notepad apply different heuristics when more than one encoding is valid:

Notepad++ simply prefers UTF-8.

Notepad (the Windows utility) seems to look at the file length. If it is even (as with your second file, which is 72 320 bytes), then it treats the file as UTF-16, the native Windows encoding, which mostly uses 2 bytes per character (not always, but the behavior was probably carried over from the earlier UCS-2, which was always two bytes). If it is odd (as with your first file, which is 78 045 bytes), it treats the file as ASCII (single byte).

You can test this by adding a single space (or any other valid ASCII character) to the end of your first file to make its length even. If you then open it in Notepad, it will assume the file is Unicode and display garbage.
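The effect can be reproduced outside Notepad. The following Python sketch is only a toy model of the guessed even/odd rule, not Notepad's actual detection code; `could_be_utf16` and `as_utf16` are hypothetical helper names, and the byte strings are invented samples.

```python
def could_be_utf16(data: bytes) -> bool:
    """Toy version of the guessed rule: UTF-16 needs an even byte count."""
    return len(data) % 2 == 0


def as_utf16(data: bytes) -> str:
    """Interpret the bytes as little-endian UTF-16, pairing them two by two."""
    return data.decode("utf-16-le")


even = b"  X X \r\n"   # 8 bytes of plain ASCII
odd = b"  X X\r\n"     # 7 bytes of plain ASCII

print(could_be_utf16(even), could_be_utf16(odd))

# Each pair of ASCII bytes becomes one unrelated code point, e.g. the two
# spaces 0x20 0x20 turn into U+2020.
garbage = as_utf16(even)
print([hex(ord(ch)) for ch in garbage])

# An odd-length input cannot be split into 2-byte units at all.
try:
    as_utf16(odd)
except UnicodeDecodeError as exc:
    print("odd length:", exc.reason)
```

Decoding the even-length sample pairs up the ASCII bytes into code points that have nothing to do with the original text, which matches the corrupt display described in the question.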

By the way, both files are recognized as UTF-8 in Notepad++ on my PC.