One thing not mentioned here is the history of the "Byte Order Mark" (BOM) in unicode.
(I'm not an expert here, but this is my understanding from having lived through it.)
You see, given UCS-2, there are 2 ways to encode any codepoint -- either big endian or little endian. The idea was then to create a codepoint (U+FEFF) that you could put at the start of a text stream that would signify what order the file was encoded in.
Wikipedia page: https://en.m.wikipedia.org/wiki/Byte_order_mark
This then got overloaded. When loading a legacy text format, there is often the difficulty of figuring out which code page to use. When applied to HTML, there are a bunch of ways to do it and they don't always work. There are things like charset meta tags (but you have to parse enough HTML to find one and then restart the decode/parse). But often even that was wrong. Browsers used to (and still do?) have an "autodetect" mode where they would try to divine the code page based on content. This is all in the name of "be liberal in what you accept".
Enter UTF-8. How can you tell if a doc is US ASCII or UTF-8 if there are no other indications in the content? How does this apply to regular old text files? Well, the answer is to use the BOM. Encode it in UTF-8 and put it at the start of the text file.
But often people want to treat simple UTF-8 as ASCII, and then you end up with a high-value codepoint in what would otherwise be an ASCII document. And everyone curses it.
Having the BOM littered everywhere doesn't seem to be as much of a problem now as it used to be. I think a lot of programs stopped putting it in, and a lot of other programs speak UTF-8 and deal with it silently. Still something to be aware of, though.
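For the curious, here's a minimal sketch (Python, using only the standard codecs BOM constants) of the kind of BOM sniffing described above; the function name and fallback behavior are just illustrative, and a real sniffer would also check the UTF-32 BOMs, which share a prefix with the UTF-16 ones:

    import codecs

    def sniff_bom(data: bytes):
        """Guess an encoding from a leading BOM, if present (illustrative only)."""
        if data.startswith(codecs.BOM_UTF8):        # EF BB BF
            return "utf-8", data[len(codecs.BOM_UTF8):]
        if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
            return "utf-16-le", data[2:]
        if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
            return "utf-16-be", data[2:]
        return None, data                           # no BOM: fall back to other heuristics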
My most spectacular fail was a program that read UTF-8 or Latin-1 and wrote UTF-16, preserving but not displaying null characters. I believe this was the default behavior of HyperStudio. Every round trip would double the size of the file by inserting null bytes every other character. Soon there were giant stretches of null characters between each displayed character, but the displayed text never appeared to change, even though the disk requirements doubled with each launch. That's how I learned about UTF-16!
Speaking of Win32/COM... is there a "tcpdump for COM"? I've got a legacy app that uses COM for IPC and I've been instrumenting each call for lack of one.
I would guess, though, that there is some pretty helpful code for this in the apitrace program that could probably be lifted out and reused, since DirectX APIs tend to involve a lot of COM. I haven't tried, though.
BTW, OLE/COM is something that didn't click at all for me when I first started encountering it in the '90s. I'm kind of bummed that it seems to have been left behind because it's still a useful technology.
If the document contains no codepoints above 0x7F, it is both US-ASCII and UTF-8 at the same time. If the document decodes as valid UTF-8, it’s more likely to be UTF-8 than whatever national encoding it might be (latin1? windows-1252? equivalents for other countries?)
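That heuristic is easy to apply in practice; roughly something like this sketch, where the windows-1252 fallback is just an assumption about the likely legacy code page:

    def guess_decode(data: bytes) -> str:
        # Pure ASCII decodes identically either way, so this covers that case too.
        try:
            return data.decode("utf-8")   # strict: raises on any invalid sequence
        except UnicodeDecodeError:
            # Not valid UTF-8; assume some single-byte legacy encoding instead.
            return data.decode("windows-1252", errors="replace")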
With the BOM, you get more serious problems. Everyone on the way needs to know about it, and that it needs to be ignored when reading, but probably not when writing. I remember the olden days of modifying .php files from CMSes with notepad.exe, which happily, silently added a BOM to the file, and now suddenly your website displays three weird characters at the very top of the page.
They did consult not just experienced East Asian programmers, but Chinese, Korean, Japanese and Vietnamese professors of Linguistics and the relevant local authorities, and had them form the Ideographic Rapporteur Group, which advised them that all the East Asian languages can be encoded in 20940 code points.
Unicode is not centrally run like so many people seem to expect. Rather, the Unicode Consortium acts as the standards body that mates local standards together into a single complete whole. The Consortium makes no decisions about how Chinese is represented, other than allocating it code points; all the relevant decisions about any given language are made by local experts. The problem was that the early 90's were an era of boundless optimism and futurism, which in East Asia resulted in the Han Unification project, with the idea of cutting down the number of symbols in use and unifying the representation of all the languages that used a Han-derived script. That... did not go so well.
To my eyes, the letter "A" in English (Latin), Greek, and Russian (Cyrillic) looks identical. Why does it need three separate code points? Maybe it gets tricky when the relationship between uppercase and lowercase diverges. In fact, that does appear to be one of the arguments in this technical note:
>Even more significantly, from the point of view of the problem of character encoding for digital textual representation in information technology, the preexisting identification of Latin, Greek, and Cyrillic as distinct scripts was carried over into character encoding, from the very earliest instances of such encodings. Once ASCII and EBCDIC were expanded to start incorporating Greek or Cyrillic letters, all significant instances of such encodings included a basic Latin (ASCII or otherwise) set and a full set of letters for Greek or a full set of letters for Cyrillic. Precedent for the purposes of character encoding was clearly established by those early 8-bit charsets.
That's true, although it reminded me of a cool charset: the Russian KOI8-R encoding. This encoding was created so that if you stripped the high bit (presumably on a system unable to handle Russian properly), you ended up with semi-legible latinized Russian. Quoting Wikipedia:
>For instance, "Русский Текст" in KOI8-R becomes rUSSKIJ tEKST ("Russian Text") if the 8th bit is stripped; attempting to interpret the ASCII string rUSSKIJ tEKST as KOI7 yields "РУССКИЙ ТЕКСТ". KOI8 was based on Russian Morse code, which was created from Latin Morse code based on sound similarities, and which has the same connection to the Latin Morse codes for A-Z as KOI8 has with ASCII.
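You can reproduce the bit-stripping trick directly; a quick sketch in Python:

    raw = "Русский Текст".encode("koi8_r")
    stripped = bytes(b & 0x7F for b in raw).decode("ascii")
    print(stripped)   # rUSSKIJ tEKST - the case-swapped Latin transliteration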
This legacy support is why there are precomposed characters.
It’s not quite the same for Han unification (people using those scripts were already used to complex gyrations anyway). In that case the linguists, historians and other scholars were in agreement, but they neglected politics. In Latin unification nobody cares that J sounds different in every language, but because Hanzi embody semantics, Japan may have a loan character because it once sounded the same as a word in use at the time OR because it had the same meaning as a word in use at the time, so the sense of “unified” can be contentious, even though they are simply code points.
Sorting complex charsets may require some sort of collation, of course, but in simple cases the status quo is highly usable without one.
1. Let's start with a simple character (indeed the character for "character") like 字. The top stroke of this character is generally written straight in Japanese (and I believe Korean), but it has a very distinct diagonal angle (top left to bottom right) in Chinese, both Traditional and Simplified. Should it be counted as the same character? I believe most linguists would say yes. If your answer is no, then Han Unification is a lost cause - you'd basically have to create completely separate character sets for Simplified Chinese, Traditional Chinese, Japanese, Korean and Vietnamese, and add a few extra characters to the Traditional Chinese set that are different over in Hong Kong.
2. Let's go further on with the grass radical (艹) which has 4 strokes in Traditional Chinese vs. 3 in other writing systems. Should all the characters with this radical be separate?
3. What about characters like 骨? The inner right angle stroke in the top of this character is on the left side in Simplified Chinese, and on the right side otherwise. Similar minor differences can be found for characters like 刃 and 化. These are not really the fault of simplification, but rather come from the fact that different countries chose a different Han character variant to standardize on.
A good example from another non-Western language that went through a partial unification ignoring stylistic differences in writing would be Urdu. It shares the same Unicode range as Arabic, even though the typical styling of "the same characters" in the two languages can be quite different. Here's a signboard with both Urdu and Arabic from Wikipedia - the difference is quite apparent:
But the unification could never be perfect.
Simplified characters often merge more than one traditional character. E.g. 发 can mean both hair (髮) and discharge (發). Japanese simplified the latter character to 発. Unicode assigns a different codepoint to each of them, and it would obviously be quite bad for Traditional Chinese if it didn't. It also assigns different code points where consistent stylistic simplification of radicals is involved, so all characters with the radical 見 ('see') have different codepoints from those with the simplified radical 见.
Even ignoring political tensions, Han unification was naive from the start, especially considering they knew they wouldn't be able to unify everything. They could have left a way to designate regional styling within the encoding.
The reasoning is not totally crazy; the Table of General Standard Chinese Characters even today only has about 8000 characters.
You can't leave the people who need to encode rare languages on computers underserved, so you'd have to support all the scholars recording Egyptian and Luwian hieroglyphs, Sumerian and Akkadian cuneiform, and the Tangut script. You'd probably have to eventually support as-of-yet unencoded scripts such as Classical Yi, Jurchen, Aztec hieroglyphs and so on.
You'd have to add alphabets and syllabaries for small minority languages such as Lisu, Coptic, Cherokee and Vai. And you'd have to support the rather widely used writing systems of India, each with its own 128 characters.
Combine this with the over 28,000 CJK characters Unicode did end up allocating and the 11,184 discrete codepoints it did end up adding for Korean Hangul, and you've already used about 40,000 characters. Add all the long-tail scripts and you're past 65,535 well before you even reach emoji or the extremely rare CJKV characters from Mojikyo.
"In other words, given that the limitation to 65,536 character codes genuinely does satisfy all the world's modern communication needs with a safety factor of about four, then one can decide up-front that preserving a pure 16-bit architecture has a higher design priority than publicly encoding every extinct or obscure character form. Then the sufficiency of 16 bits for the writing technology of the future becomes a matter of our active intention, rather than passive victimization by writing systems of the past."
The engineers behind Unicode made the classic software engineering mistake of looking at their needs today and taking a factor of safety that's way too small. It's the same thing that happened with the IPv4 address range. A simple back-of-the-envelope calculation would have told us it's not going to be enough, but it was easy to forget back in the 1980s that one day children in Africa would also have personal computers connected to the grand Internet, that would need an IP address.
The fact that by version 1.0.1 (the first version of Unicode to include the CJK characters) they had already used almost half of their space, when they were planning to use less than a quarter, should have already prepared them for the future. No, the reasoning behind 16 bits was always crazy. You can't design future-proof standards with a low safety factor and a condescending estimation of usage (Yi syllabary? Who would ever use that?).
But we should probably be happy about this. A 32-bit encoding would certainly have been rejected by just about everyone as too wasteful, and software engineers back in the 80's and 90's were still unable to stop conflating code points with glyphs, so a multi-byte encoding (which is what we ended up with eventually, in UTF-8 and UTF-16) wasn't a popular idea either.
Fortunately, Unicode is still much better than the mess that preceded it, and the fix for the 16-bit limitation, as painful as it was, has proven far easier than IPv6.
 a.k.a The "Korean Mess":
They are not 16 bits wide, but mostly 14 bits wide, separated from the single-byte portion by having the most significant bits of the bytes set. As far as I can tell there are "only" about 33,000 code points in the Unihan database that are known to be officially encoded in some encoding somewhere in East Asia. This does not include unofficial but de facto encodings like the ETEN extensions to Big5, but you get the idea. This, and Han unification, made it pretty reasonable to expect that the modern characters could be encoded in 16 bits.
 I've counted distinct code points in the "other mappings" table from the Unihan database.
That's not what Unicode does, though. Unicode encodes many duplicate characters (because they are duplicated in some national encoding standard and there is a round-trip compatibility requirement). Unicode also includes every possible variant that is of cultural or academic interest, including dead languages and so-called "ghost characters" that are actually printing errors in old paper dictionaries or census data but are now allocated a code point. And this is before we say anything about emoji...
So from today's perspective 16 bit encoding space is laughably small. But it is true that basically any message meant to be consumed by the general population, including CJKV languages but notably not emoji, can be fully represented with characters found on the BMP, which can be faithfully encoded in UCS-2.
But the insistence on avoiding C0 and C1 controls would have led to some odd codepoint numbers, and definite incompatibility with ASCII and Unicode; for instance, 'A', instead of 0x41, would have been 0x20202041! Whereas Unicode of course made sure to copy ASCII (indeed, to copy ISO-8859-1) for its first block.
Of course, this didn't come to pass because, well, people were already supporting 16-bit Unicode and didn't want to switch again. And I'm kind of glad it didn't -- while it would have meant not having to deal with the compromise that is UTF-16, imagine living in a world where 'A' is not 0x41, but rather 0x20202041! (Or imagine opening up a text file and finding three spaces between each character...)
On the other hand, often "obviously broken" is better than "subtly broken". There's a ton of undetected encoding problems in the wild right now.
So the idea of ever having a fixed-length encoding for Unicode is basically impossible now. Best to just use UTF-8 for everything, plus logic to group it into code points, grapheme clusters, or whatever other granularity is needed.
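In Python, for example, iterating a decoded string already gives you code points, while grapheme clusters need a segmentation library - this sketch assumes the third-party regex package, which supports \X for extended grapheme clusters:

    import regex  # third-party package; \X matches one extended grapheme cluster

    s = "e\u0301 \U0001F1EB\U0001F1F7"   # 'e' + combining acute, space, French flag (two regional indicators)
    print(len(s.encode("utf-8")))         # 12 bytes on the wire
    print(len(s))                         # 5 code points
    print(regex.findall(r"\X", s))        # 3 grapheme clusters: the accented e, the space, the flag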
I've heard more than one person tell me they don't need to worry about text shaping since they are using UTF-8. (That statement doesn't make any sense.) There is a lot of confusion about the Unicode text rendering stack.
We do it all the time. Splitting a string at U+002C or U+003B or U+000A, looking for U+007B or U+007D, etc. In my experience, it's really common.
That means that what Unicode today considers "one character" can be unbounded in size.
Typing in many Latin languages generates diacritics using ‘dead keys’, which are prefix operators. (Originally, on a physical typewriter, these were keys that simply struck the paper without triggering the mechanism to advance the carriage.) If Unicode had taken this hint, life would be easier.
(ISO/IEC 6937 had prefix diacritics, but it was too late.)
Process whole strings; the meaning of a partial string may not be what you hope, even without fancy writing systems and Unicode encodings.
"Give the money to Steph" - OK, will do
"...enson" - Crap, OK, somebody chase Steph and get our money back, meanwhile here's James Stephenson's money
"... once he gives you the key" - Aargh. Somebody chase down James and get the key off him.
Either way there's already a recommendation in this area. I'm not sure if it strictly fits this particular scenario but the "UAX15-D3. Stream-Safe Text Format" suggests that you should support sequences at least 31 code points long.
Well, never mind that then; how do you like U+0041 U+0300, a.k.a. U+00C0 LATIN CAPITAL LETTER A WITH GRAVE? Not just in the BMP, but in the part of the BMP that's only 2 bytes of UTF-8.
And there's a whole bunch of similar combining sequences that don't have precomposed NFC forms.
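A quick illustration with Python's unicodedata (the "q with acute" sequence is just one example that happens to have no precomposed form):

    import unicodedata

    print(unicodedata.normalize("NFC", "A\u0300"))        # 'À' (U+00C0): composes to one code point
    print(len(unicodedata.normalize("NFC", "q\u0301")))   # 2: there is no precomposed 'q with acute'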
    bytes  payload bits  pattern
    1      7             0xxxxxxx
    2      11            110xxxxx 10xxxxxx
    3      16            1110xxxx 10xxxxxx 10xxxxxx
    4      21            11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    5      26            111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    6      31            1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    7      36            11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    8      42            11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    9      53            11111111 110xxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    10     58            11111111 1110xxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
That loses one very useful property of UTF-8: you always know, by looking at a single byte, whether it's the first byte of a code point or not. It also loses the property that no valid UTF-8 encoding is a substring of another valid UTF-8 encoding. It's better to stop at 36 bits (which keeps another useful property: you'll never find a 0xFF byte within a UTF-8 string, and you'll only find a 0x00 byte when it's encoding the U+0000 code point).
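That first property is exactly what makes trivial decoders possible: the lead byte alone tells you how many bytes follow. A sketch for standard (RFC 3629, max 4-byte) UTF-8:

    def utf8_seq_len(lead: int) -> int:
        """Sequence length implied by a lead byte; 0 means 'not a lead byte / invalid'."""
        if lead < 0x80: return 1    # 0xxxxxxx - ASCII
        if lead < 0xC0: return 0    # 10xxxxxx - continuation byte, never starts a code point
        if lead < 0xE0: return 2    # 110xxxxx
        if lead < 0xF0: return 3    # 1110xxxx
        if lead < 0xF8: return 4    # 11110xxx
        return 0                    # 0xF8-0xFF: invalid in RFC 3629 UTF-8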
Naah.. there's UTF-9 and UTF-18 for that..
Only ~137K have been assigned so far, or in other words a little over 17 bits. Given that each additional bit doubles the range, even in the case that characters continue to be assigned at a constant rate, reaching that high is so far into the future that it's not worth thinking about. Unicode is around 30 years old; at the same rate, another 30 years would be needed to get to 18 bits, then another 60, 120, ...
On the other hand, it is extremely wasteful to burden all UTF-8 processing code with handling values that will literally never be encountered outside of error cases.
If you've ever noticed that entering an emoji or other non-ASCII  character seems to dramatically increase the size of your SMS, that's because the message is switching from 1-byte characters to 2-byte characters.
 It's actually GSM-7, not ASCII, but the principle is the same: https://www.twilio.com/docs/glossary/what-is-gsm-7-character...
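A very rough sketch of the segment math, treating plain ASCII as a stand-in for the GSM-7 alphabet (the real GSM-7 set includes some non-ASCII characters and pushes a few ASCII ones like backslash and braces into an escape table), with the usual 160/153 and 70/67 limits:

    def sms_segments(text: str) -> int:
        # Rough heuristic only: the key point is the switch from 7-bit GSM-7
        # to 2-byte UCS-2, which is what makes emoji messages so much "bigger".
        if text.isascii():
            return 1 if len(text) <= 160 else -(-len(text) // 153)   # ceiling division
        return 1 if len(text) <= 70 else -(-len(text) // 67)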
Every flip phone I owned did not play well with non-ASCII text. Receiving a message with an emoji would render the entire message, or if I was lucky just the portion following the emoji, unrenderable. Most phones I had just fell back to rectangles, which I assume is their version of mojibake.
This was not a causal factor in my switch to a smartphone. It was annoying to have to explain character encodings to friends who didn't understand why they couldn't send me emoji-containing messages.
Practically speaking, I had fun learning this while writing an Angular app to read SMS backups. It's frustrating and cool at the same time.
UTF-16 can represent all Unicode code points, through surrogate pairs when needed.
Systems that were originally designed for UCS-2 have often been upgraded to UTF-16, e.g. Windows and JS. But then it is not UCS-2 anymore. In some cases the documentation and/or function names, etc., may incorrectly still talk about UCS-2 (while they are really UTF-16 now) for historical reasons, adding to the confusion.
>UCS-2 differs from UTF-16 by being a constant length encoding and only capable of encoding characters of BMP.
There does seem to be some disagreement on this topic though. I wonder if it could be described as WTF-16.
So for your narrow encoding you need one that maps 1:1 (in a fixed-length way) to UTF-16. That rules out multi-byte UTF-8.
You could of course just restrict it to UTF-8 chars that fit in a single byte – but that's just ASCII! You waste a bit per character.
So you might as well use a full 8-bit encoding, and Latin-1 is convenient as it covers many more of the characters likely to be encountered in text than single-byte UTF-8 does in the first place. It is also convenient in that Latin-1 byte values and Unicode code points are identical, so conversion to and from UTF-16 can be very efficient (in fact there are SIMD instructions which do exactly this).
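The "identical code points" part is easy to check: every Latin-1 byte value is the Unicode code point, so widening to UTF-16 is just zero-extension. A quick check in Python:

    data = bytes(range(256))
    s = data.decode("latin-1")
    assert [ord(c) for c in s] == list(range(256))   # byte value == code point for 0-255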
In any case, even in the scenario you could somehow use UTF-8 efficiently, it is likely to be worse in this particular scenario:
- They both use 1 byte for code points 0-127
- Latin-1 uses 1 byte for 128-255, while UTF-8 uses 2
- Above 255, UTF-8 uses at least two bytes, but the Java hybrid string will be using UTF-16 at that point, which is 2 bytes in the common case.
So the only place UTF-8 wins is for the unusual case of characters outside the BMP which are 3 bytes in UTF-8 but 4 bytes in UTF-16. Those aren't common at all, and in many scenarios where you'd have a lot of them UTF-16 would still win because it represents many characters in the BMP in 2 bytes that take 3 in UTF-8.
 One might think there is an exception illustrated by the few methods that mention "code point", like `codePointAt`. Yes, these deal in code points, but their indexes are still indexes into a UTF-16 string. So they are kind of a hybrid API. They let you count N UTF-16 code units (which will represent <= N code points) into a string, and then deal with the code point at that location. That leads to weird cases like asking for the code point starting at the second half of a surrogate pair, which gives you the low surrogate alone.
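For reference, the surrogate-pair arithmetic those code-unit indexes are built on looks like this (a sketch, using U+1F600 as the example code point):

    cp = 0x1F600                              # outside the BMP, so UTF-16 needs two code units
    hi = 0xD800 + ((cp - 0x10000) >> 10)      # high (lead) surrogate
    lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)    # low (trail) surrogate
    print(hex(hi), hex(lo))                   # 0xd83d 0xde00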
In fact this means though that it isn't just the mistake of UCS-2 that results in people wanting "transparent Latin1 under the hood to save space".
Even if we had gone right to UTF-8, there would still be roughly the same cost savings there.
I can't think of any way to deal with unicode that wouldn't have similar cost savings from "transparently doing non-unicode when possible".
Well, I suppose, unless you took whatever Java (et al.) are doing to mark which strings or portions of strings are actually stored as Latin-1 and made that a Unicode encoding. That would of course have its own... challenges. If you think "transparently storing as Latin-1 when possible" is an ugly hack, you'd probably not be happier with it being made a standard Unicode encoding.
It turns out that being able to represent all written human communication is really hard. Unicode actually does a pretty amazing job of it, balancing lots of trade-offs. (And the fact that it has as widespread adoption as it does is in part a testament to this; just because you make a standard doesn't mean anyone has to use it, or is going to. Some of the trade-offs Unicode balanced were about making the adoption curve as easy as possible for existing software. Making ASCII valid UTF-8 was, I think, proven by history to be the right decision, although it involved trade-offs... such as Latin-1 not fitting in one byte :) ). Overall, Unicode is a really successful standard, technically and in terms of adoption.
But UCS-2 was a mistake that, had it been avoided, would have spared some headaches. Though, I think, not the headache that motivates Java to "store some things transparently as Latin-1 where possible."
Under such a design, storing the string data as anything other than UTF-8 is unnecessary. In fact, it’s impossible: if you store string data as Latin-1, you can’t index into it with a UTF-8 byte index.
That does mean you can’t take advantage of the space savings of Latin-1 over UTF-8, but those are extremely minimal, since both encodings are 1 byte for all ASCII characters.
I am not sure the space savings really are extremely minimal for Latin-1 vs. UTF-8. You are correct that both encodings are one byte for all ASCII characters, but they are choosing to store as Latin-1, NOT as ASCII, to get 1 byte for all Latin-1 characters beyond ASCII. If they didn't care about space savings for anything but ASCII, they could already have chosen UTF-8 instead of Latin-1. Which makes me think they care about the space savings for non-ASCII Latin-1.
It seems to me an orthogonal question to whether you provide functions like `charAt`, and with what big-O performance complexity.
But indexing by code units only really made sense when Unicode was limited to 16-bit codepoints. Back then, “code unit” and “codepoint” were the same thing, so every code unit had some semantic meaning of its own, rather than just being an artifact of the encoding. Thus, programmers could keep treating strings as “arrays of characters” like they were used to, except that “character” now meant a 16-bit integer instead of an 8-bit one. In truth, Unicode code points were never a perfect representation of what a human might think of as a “character”, due to the existence of combining characters like accent marks, but you could still arbitrarily chop them up and combine them and get something roughly sensible. Overall, they seemed like a reasonable compromise that preserved the notion of “array of characters” to as great an extent as possible.
But these days, a UTF-16 “code unit” can be half of a surrogate pair: an artifact of the encoding with no semantic meaning of its own, the exact thing we were trying to avoid. It now has the downsides of a variable-width encoding like UTF-8, without the space efficiency or the backwards compatibility. So if Java’s String were being designed from scratch – or if, hypothetically, Unicode had never been intended to be limited to 16-bit codepoints – there’s no way it would provide indexing by 16-bit code units.
It might instead opt to provide O(1) access to 32-bit Unicode codepoints. This is what Python 3 does, and so it has to represent strings as UCS-4 in the worst case, but it can store them as UCS-2 or Latin-1 if the string has no codepoints over 65,535 or 255, respectively. In that case, yes, you still end up with Latin-1 as an optimization. But Python has always prioritized ease of use over performance, and it had backwards compatibility constraints as well. For new designs, most people just see the overhead of UCS-4 as a bridge too far. Compared to that, it seems more attractive to bite the bullet and accept the downsides of variable-width encodings. And at that point you may as well pick UTF-8.
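You can see CPython's flexible storage (PEP 393) at work with sys.getsizeof; the exact numbers vary by version and build, but the per-character width is what matters:

    import sys

    for ch in ("a", "\u00e9", "\u732b", "\U0001F600"):   # ASCII, Latin-1, BMP, astral
        s = ch * 1000
        print(repr(ch), sys.getsizeof(s))   # roughly 1, 1, 2, 4 bytes per character plus a fixed header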
I guess perhaps it wouldn't ever have claimed to provide charAt in O(1), okay. It may have been a mistake to think this mattered; it was just offered because it was easy. In which case I suspect there are actually few programs which would be negatively affected by changing charAt to the most efficient thing you can do with UTF-8.
UTF-32 of course has the O(1) property, but nobody wants that memory usage.
* Gecko is huge and it uses TwoByte strings in most places. Converting all of Gecko to use UTF8 strings is a much bigger project and has its own risks. As described below, we currently inflate Latin1 strings to TwoByte Gecko strings and that was also a potential performance risk, but inflating Latin1 is much faster than inflating UTF8.
* Linear-time indexing: operations like charAt require character indexing to be fast. We discussed solving this by adding a special flag to indicate all characters in the string are ASCII, so that we can still use O(1) indexing in this case. This scheme will only work for ASCII strings, though, so it’s a potential performance risk. An alternative is to have such operations inflate the string from UTF8 to TwoByte, but that’s also not ideal.
* Converting SpiderMonkey’s own string algorithms to work on UTF8 would require a lot more work. This includes changing the irregexp regular expression engine we imported from V8 a few months ago (it already had code to handle Latin1 strings).
This is great for ASCII when you know there's no such thing as combining characters/etc, but I would like to remind everyone reading this that there's no such thing as "linear indexing" of user-perceived characters in Unicode.
User-perceived characters need processing in order to index due to Grapheme clusters potentially using many code points together. (https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundarie...)
For instance, 👨🏾‍🏫 (on my machine, a dark-skinned male teacher) is a combination of these characters:
- U+1F468 Man
- U+1F3FE Medium-Dark Skin Tone
- U+200D Zero Width Joiner
- U+1F3EB School
And knowing the byte index of the start/end of that character in a string cannot be done by just multiplying an offset by some fixed number of bytes per character.
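Concretely (a quick check in Python, spelling out that exact sequence):

    teacher = "\U0001F468\U0001F3FE\u200D\U0001F3EB"   # man + skin tone + ZWJ + school
    print(len(teacher))                   # 4 code points
    print(len(teacher.encode("utf-8")))   # 15 bytes (4 + 4 + 3 + 4)
    # ...and yet it should display as a single user-perceived character.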
There is if you dynamically assign virtual code points to grapheme clusters.
> adding a special flag to indicate all characters in the string are ASCII, so that we can still use O(1) indexing in this case
But I got confused and addressed the linear indexing part.
But yeah, ASCII is constant time, but you can’t assume even UTF-32 can be constant time due to variable length grapheme clusters.
>Gecko is huge and it uses TwoByte strings in most places. Converting all of Gecko to use UTF8 strings is a much bigger project and has its own risks.
But I guess they mean converting it all to support UTF-8 under the described circumstances - not just converting it all to UTF-8.
Edit: I.e. it would be awkward to say you're implementing "UTF-8, but only for these codepoints". That would be equivalent to implementing Latin-1.
Edit: I'm thinking of standard ASCII rather than Latin-1. Above the first 128 code points, UTF-8 switches to two bytes while Latin-1 remains with one byte, so it is much simpler.
Also see: http://utf8everywhere.org/
But it wasn't clear at all whether to use UTF-8, -16 or -32. Windows and Java used UTF-16. So D became agnostic about that, supporting all three equally.
Over time it became clear that you're right, the best approach was to use UTF-8 everywhere, and last-minute convert to/from other formats.
"With insider build 17035 and the April 2018 update (nominal build 17134) for Windows 10, a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox appeared for setting the locale code page to UTF-8.[a] This allows for calling "narrow" functions, including fopen and SetWindowTextA, with UTF-8 strings."
Another popular work-around is to convert the name to its 8.3 filename equivalent; this is necessary if the fopen is inside a library function that takes a string filename and cannot be altered.
8.3 filenames for the win!
"It is not possible given both the current architecture (which must work in both user and kernel mode) and also the inherent assumption in several subsystems (like USER) that the ACPs maximum number of bytes per character is 2."
And the #define would have been MB_LEN_MAX.
I like the way they handle it in Python 3.6. They use their own internal encoding, then bypass the regular console byte API when they need to output a string, allowing full Unicode.
That said, UTF-8 + gzip on the wire isn’t a bad solution.
Приве́т नमस्ते שָׁלוֹם
On Chrome - there's a tricky usability solution to allow part selection of the Hebrew.
My rough guesses at answers are 0.1%, 90%, and no. Besides, UTF-32 plus gzip isn't a bad solution, and completely ends all discussion about transfer formats... at the cost of requiring a compressor.
So it often makes more sense to simply use UTF-16 for your own strings and assume UTF-16 for strings returned by the WinAPI (but handle invalid sequences appropriately).
The way this can be stretched to make sense is by thinking about how much these different kinds of data have grown over the last 20 years.
Media (music, video, images) grows staggeringly, because of 200p going to 4K, 128 kbit/s MP3 going to FLAC (or whatever), and there simply being more of it all as everything has gone digital.
Even code size continues to grow quite significantly.
But text? Even at Wikipedia's massive size, only a fraction of it is text content.
Then, as memory capacity also grows, it's really a back-and-forth battle between computing resources and all other kinds of data - except text.
Hard to think of many places 4 byte chars would matter nowadays, maybe the tiniest devices for a few apps?
Furthermore, 4-byte chars also have the endianness problem of UTF-16.
I think the rule of thumb of "use UTF-8 unless you really have a good reason not to" still holds true today.
One thing that seems different is that you also hear concern from developers about coding complexity and the desire for things to always be as easy to reason about as possible. Yes, these are things a library should just handle for you, but it is sometimes useful to know what's happening fundamentally, and it can sometimes lead to fewer surprises.
Nevertheless, if one of the cases you mentioned became significant, I'm not sure that many would say coding complexity should take precedence over your considerations.
It's unclear if UTF-32 beats UTF-8 there however. UTF-8's encoding is a bit trickier to deal with of course, but it's really not that bad. On top of that you have a few advantages such as not having to worry about endianess as mentioned previously but also not having to worry too much about alignment or corrupted input (you can re-synchronize UTF-8 at every codepoint since the leading byte has a unique bit pattern). And of course ASCII/C-string compatibility means that it's unlikely to be mangled by non-unicode-aware applications.
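The re-synchronization trick is about two lines of code: skip continuation bytes (10xxxxxx) until you hit the next lead byte. A sketch:

    def next_boundary(buf: bytes, i: int) -> int:
        """Given any byte offset i, return the offset of the next code point boundary."""
        i += 1
        while i < len(buf) and (buf[i] & 0xC0) == 0x80:   # 10xxxxxx: continuation byte
            i += 1
        return i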
In any case, the complexity probably won't be in the encoding, be it UTF-8, 16, 32 or anything else. UTF-32 may be constant-width per code point, but in general you don't really care about code points, since they can be combined in various ways to create characters - things like country flags or complex characters with diacritics.
I always wondered how this affected the simplicity and efficiency of different text editor implementation strategies. I'm mainly thinking about the variable byte lengths and how they affect anything related to character indices. I'm guessing that if it seriously hinders a preferred strategy, one would just convert to plain code points of fixed word size internally at the cost of greater memory usage... which actually is not that bad if you only want to support the BMP without any fancy encoding, since that only takes up 2 bytes per character internally.
In fact, I've done a fair amount of ARM coding, so I understand what you are trying to say.
Nevertheless, your comment seems to point out that smartphones may be less powerful, but it sheds no light on what the delta might be for boot performance or how you can be confident it's, say, more than a few hundred milliseconds.
Maybe it is, I haven't isolated and investigated that single difference enough to form a reasonable guess. Have you?
However, there is some appeal in defining the format at a fixed low fidelity, too, since it shapes the authoring process around conveying the content through the limitations - "the medium is the message." But such formats are more likely to be the source format than the transmission format, in the same way that old video games designed for 8 kB of working memory now end up with terabytes of video footage streamed to online audiences.
I find it strange to start in 1989, and not in 1980, when Xerox had a precursor to Unicode, or 1988, when employees from Xerox and Apple coined the name “Unicode” (https://en.wikipedia.org/wiki/Unicode#History)
It certainly is a bit unfair to not mention Xerox or Apple at all.
”that the JVM is gaining the ability to transparently represent strings in memory as Latin-1, a legacy codepage, if possible.”
“Compact strings” (https://openjdk.java.net/jeps/254) shipped with JDK 9 in September 2017 (https://docs.oracle.com/javase/9/whatsnew/toc.htm#JSNEW-GUID...)
I mention in another post (Correction: In a draft of a post.) I'm still stuck on JDK 8 for various reasons out of my control, and am generally unaware of changes made in 9 or later. I'll amend the post to reflect that the feature shipped.
Windows 10 recently added the ability to use UTF-8 for the “ANSI” legacy codepage, which is awesome! I accomplished something similar previously using API hooking, but a Microsoft solution can simply do a much better job across the board. Now the only reason to set the encoding to anything else is to avoid mojibake when using non-English software. (Dear Japanese software developers: please consider using wide win32 APIs :( )
Linux goes for a byte-stream encoding (e.g. UTF-8) because it isn't (/wasn't) doing any string processing. So encoding efficiency beats the algorithmic complexity of string operations. If you need an algorithmically more efficient encoding, that's up to your application, and UCS-4 was the way to do it.
”The character codes may be 1-byte or 2-byte codes, but there is nothing else in the text stream except for those codes. Using font information that your program stores separately, Script Manager routines can help you determine whether a character is 1 or 2 bytes, and other managers allow you to work with either character size.”
More cumbersome to use than modern Unicode API’s? Definitely, but it was superior to what Windows provided for years.
Look, ma, I'm on Hacker News!
Absolute madness. I think they will fix it eventually though. The madness is too great to ignore.
love the word.
Does that mean there are processor branch-prediction failures and stalls every few bytes when processing UTF-8 strings with many multibyte characters in them?
I keep hearing people refer to this operation (mostly in HN threads), I have yet to see an appropriate use of it. So far the only uses I've seen were (for Unicode) misguided attempts at character iteration, which doesn't need this at all.
Of course these files are so old they predate proper encoding and are implicitly Windows-1252 or similar.
So being able to iterate over characters is important, but constant time access to a specific character seems very rare.
I think the only case where I've wanted a specific character was when I wrote games that loaded maps from text files and those were ASCII regardless.
Though if one had to decode the line from UTF-8, you're iterating anyway so...
I imagine the main reason people would use indexing is to scan through strings in lockstep, e.g. to implement comparison, but iterators can cover that case as well, especially if you can copy them.
The branch misses are not that high or costly, as most characters don't need the subsequent branches for 2-4 bytes, but cache misses cost ~50x more than a branch miss. And there are enough parallel pipelines now to avoid most utf-8 branch misses.
The new massive parallel SIMD ways to process UTF-8 are so fast because the width is so high, 256 bytes per loop, which is enough for most strings. But avoiding branching has some constant overhead to prepare the loops, for small strings it rarely pays off.
Selection (extracting the i-th element) is actually quite fast, but there's some overhead in grabbing a whole chunk of characters and extracting all their payload bits at once.
Mostly no. Or at least, not if the implementation was written by somebody who knows what they're doing. A lot of multi-byte encodings do have the problem you're worried about but UTF-8 does not, and the definition is even careful to make this true if you're processing malformed UTF-8 which may be invalid.
The trick is that the bit pattern alone tells you whether this is a trailing byte. Nothing you could be looking for could _begin_ with a trailing byte, they only make sense in trailing position. Whether you are looking for a colon, or an omega, or the dog emoji, any trailing byte won't match.
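Which is why plain byte-level search is safe on UTF-8; a small demonstration (the strings are arbitrary):

    hay = "café: 🐕".encode("utf-8")
    print(hay.find(b":"))                    # matches the real colon, never a continuation byte
    print(hay.find("🐕".encode("utf-8")))    # multi-byte needles always match on a code point boundary too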
It's not very fast, but it makes flawless Unicode support easy. :P
Wrong: that's a feature of C, which all these systems are based on.
wchar_t is 32 bits on Linux, and, though also just a number, it commonly represents a Unicode code point.
On Windows NT a filename is specified as a sequence of 16-bit wchar_ts and is supposed to be valid NFC UTF-16. OS X also expects filenames to be valid sequences of Unicode codepoints, in this case NFD.
I would emphasise that the "supposed to be" is a convention only (similar to the Linux UTF-8 convention). The only filesystem enforced rules are that file/directory names can't include \/:*?"<>| or wchar_t(s) less than 0x20 (aka ascii control characters).
A Win32 path can also have additional rules (such as a file cannot be called "COM") but these aren't always enforced, depending on how the API is called.