This seems like a very narrow case with very low risk. You would have to have UTF-16 source code of unknown provenance that you decide to load up and convert to UTF-8, and that source code would have to have some hidden exploit in it. How likely is this scenario? I would say close to zero.
You can't fix all of the bugs, nor should you try. You have to balance bug fixing with feature development.
There are multiple issues reported; only the first one appears to be UTF-16 related, and all of them are triggered simply by opening a malicious text file. The referenced conversion presumably happens eagerly so the editor can operate on UTF-8 in memory.
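To make the failure mode concrete, here's a minimal sketch of the kind of UTF-16 → UTF-8 conversion such an editor needs; it's my own illustration, not Notepad++'s actual code. The point is the sizing: a BMP character that takes two bytes in UTF-16 can need three bytes in UTF-8, so an output buffer sized as "same byte count as the input" can be overrun, and unpaired surrogates have to be handled rather than trusted.

    #include <cstddef>
    #include <cstdint>
    #include <string>

    // Illustrative sketch only -- not Notepad++'s actual conversion code.
    // A growable std::string sidesteps the fixed-buffer problem; the comment on
    // reserve() shows the sizing rule a fixed buffer would need.
    std::string utf16_to_utf8(const std::u16string& in) {
        std::string out;
        out.reserve(in.size() * 3);  // worst case: 3 UTF-8 bytes per UTF-16 code unit
        for (std::size_t i = 0; i < in.size(); ++i) {
            std::uint32_t cp = in[i];
            if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate
                if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                    cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                    ++i;  // consume the paired low surrogate
                } else {
                    cp = 0xFFFD;  // unpaired high surrogate: substitute U+FFFD
                }
            } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
                cp = 0xFFFD;      // stray low surrogate: substitute U+FFFD
            }
            if (cp < 0x80) {
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }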
I think that's more severe than you suggest; it means that someone could craft malware, and all it would take to run an exploit is getting someone to view the file in Notepad++.
According to TFA, just opening the file is sufficient to trigger the buffer overflow: “Open the file in Notepad++ to hit out of bounds access with ASAN.”
Say, a *.txt file attached to an email, the opening of which in a text editor is usually considered benign.
Ideally… yes. Consider that Windows APIs use UTF-16, not UTF-8, for wide characters. Microsoft's extensions to ISO 9660 used (big endian!) UCS-2 (the precursor to UTF-16) for long filenames, and NTFS uses proper UTF-16, I believe.
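For reference, the usual way to cross that boundary on Windows is the two-call pattern with WideCharToMultiByte; this is a generic sketch of that pattern, not anything taken from Notepad++. The first call only reports the required buffer size, and WC_ERR_INVALID_CHARS makes the conversion fail on unpaired surrogates instead of substituting silently.

    #include <windows.h>
    #include <string>

    // Generic two-call Win32 pattern for UTF-16 -> UTF-8 (sketch).
    std::string to_utf8(const std::wstring& wide) {
        if (wide.empty()) return std::string();
        // First call: ask how many UTF-8 bytes are needed.
        int needed = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                                         wide.data(), static_cast<int>(wide.size()),
                                         nullptr, 0, nullptr, nullptr);
        if (needed <= 0) return std::string();  // ill-formed input or other failure
        std::string out(static_cast<size_t>(needed), '\0');
        // Second call: convert into a buffer of exactly that size.
        WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                            wide.data(), static_cast<int>(wide.size()),
                            &out[0], needed, nullptr, nullptr);
        return out;
    }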
Surrogate pairs are interpreted by client software, so it’s UTF-16 in that sense. The file system just doesn’t ensure that there won’t be unpaired surrogates, or noncharacters. This is similar to strings in .NET, Java, and JavaScript.
Nor should you. Even a well-formed sequence of UTF-16 code units can be utter nonsense; there's approximately no level of abstraction between "sequence of fixed-width code units" and "run it through a full-blown font rendering stack" where it makes sense to assume your input is "well-formed".
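To be clear about how thin that property is: "well-formed" for UTF-16 only means the surrogates pair up. A check like the following (my own sketch, not from any particular codebase) can pass and the text can still be meaningless.

    #include <cstddef>
    #include <string>

    // True iff every high surrogate is immediately followed by a low surrogate
    // and no low surrogate appears on its own. Says nothing about meaning.
    bool is_well_formed_utf16(const std::u16string& s) {
        for (std::size_t i = 0; i < s.size(); ++i) {
            char16_t u = s[i];
            if (u >= 0xD800 && u <= 0xDBFF) {        // high surrogate
                if (i + 1 >= s.size() || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                    return false;                    // unpaired high surrogate
                ++i;                                 // skip the paired low surrogate
            } else if (u >= 0xDC00 && u <= 0xDFFF) {
                return false;                        // stray low surrogate
            }
        }
        return true;
    }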
You are right: NTFS exclusively runs on UCS-2, which is now effectively "UTF-16 without validation". Back in the 1980s, "Unicode" referred to a way to represent the ISO standard "Universal Character Set", i.e. UCS-2, instead of being synonymous with the entire character set as it is today. The wisdom of i18n back then was to use the "Unicode" text encoding because it covered the known universe of writing. Qt made a similar choice.