0600..0605   ; Prepend # Cf ARABIC NUMBER SIGN..ARABIC NUMBER MARK ABOVE
06DD         ; Prepend # Cf ARABIC END OF AYAH
070F         ; Prepend # Cf SYRIAC ABBREVIATION MARK
08E2         ; Prepend # Cf ARABIC DISPUTED END OF AYAH
0D4E         ; Prepend # Lo MALAYALAM LETTER DOT REPH
110BD        ; Prepend # Cf KAITHI NUMBER SIGN
111C2..111C3 ; Prepend # Lo SHARADA SIGN JIHVAMULIYA..SHARADA SIGN UPADHMANIYA
I suspect the actual answer to your question is "historical accident". But the dead key argument doesn't quite hold: I can equally ask why keyboard dead keys aren't typed after the character.
> I can equally ask why keyboard dead keys aren't typed after the character.
So the answer probably lies in what the Unicode code point lexing algorithm must do to disregard "dependent" combining characters when it hits a corrupted base character.
I love that Unicode defines all of this so precisely and strictly, but god damn is it a complicated beast.
On top of that, the Unicode rules do a very poor job with things like URLs and email addresses. The Android text stack has its own layer (WordBreaker in the minikin lib) on top of that which recognizes those and implements its own rules.
But TR14 is a good start, for sure.
The smallest allocatable block of Unicode is a 16-code-point block of characters. In the BMP, there are just 8 such blocks remaining: 1 of them is scheduled for use in Unicode 10, and 3 of them are tentatively reserved for a current proposal. Note, however, that there are ~10k unassigned code points within the BMP.
After the BMP, the next most-full plane is the SIP (plane 2), which is basically entirely reserved for "random rare characters in Chinese/Japanese" (although only about 85% or so of it is considered assigned as of Unicode 10). Plane 3, the TIP, is more or less reserved for SIP overflow and historical Chinese scripts, although only about 25% of it has tentative reservations.
Around half of the SMP is already tentatively reserved for scripts, although I'm not sure whether everything on the Script Encoding Initiative's list of scripts to encode (http://linguistics.berkeley.edu/sei/scripts-not-encoded.html) is on the Unicode roadmap pages. There are about 200 scripts left to encode, although some of them may be consolidated in Unicode's script terminology (for example, Unifon is proposed as an extension of Latin, in Latin Extended-D).
I think the set of remaining historical scripts to encode is considered more or less complete, although several scripts definitely need a lot more research to actually encode (Mayan hieroglyphics is probably the hardest script left, since it requires rather complex layout constraints).
A Practical Guide to Character Sets and Encodings
or: What’s all this about ASCII, Unicode and UTF-8?
This is a very ambiguous definition, and it can be very confusing. I'm sure there are many people who have read Joel Spolsky's Unicode intro and came away confused.
Using ASCII as an example is confusing because an ASCII character maps onto several different Unicode concepts:
1. code point
2. encoded character
3. grapheme cluster
4. abstract character
5. user-perceived character
The mapping from user-perceived characters to abstract characters is not total, injective, or surjective. Some abstract characters need more than one code point to express them. You can't split a sequence of Unicode code points arbitrarily at code point boundaries; you must split at grapheme cluster boundaries instead.
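To make that concrete, here's a minimal sketch in Python. It assumes the third-party regex module (pip install regex), whose \X pattern matches one extended grapheme cluster:

```python
# Minimal sketch, assuming the third-party "regex" module;
# its \X pattern matches one extended grapheme cluster.
import regex

s = "e\u0301"  # "é" spelled as LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
print(len(s))                        # 2 -- code points
print(len(regex.findall(r"\X", s)))  # 1 -- grapheme clusters (what a user counts)
```

Splitting s at index 1 would orphan the combining accent from its base letter, which is exactly why arbitrary code point boundaries are unsafe.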
Actually 'code unit', which may have any number of bits, depending on the encoding. Otherwise spot on.
You'll notice I didn't cover collation. Why? Because explaining that would dilute the process of understanding UTF-x and Unicode.
When people start with an introduction like this, they end up thinking they have learned more than they actually have.
I point this out because I was one of those misled by several previous articles that explained Unicode encoding the same wrong way you did. When I ran into trouble and asked for help, everybody around me was misguided in the same way. I didn't know I had to dig into the manuals, because everybody explained that this is how it is. Then I had to teach everyone else that we had all learned it wrong.
Many people deal only with ASCII or other easy Western alphabets, and they can work with Unicode for years before they run into trouble.
Which is usually a symptom of a deeper problem: using the same ADT to represent both "known-valid decoded text" and "a slice of bytes that would maybe decode to text in some unspecified encoding", such that the methods manipulating that ADT are completely incoherent.
Honestly, I'm surprised we don't see more programming languages like Objective-C, that have very clear distinctions between their "Data" type and "String" type, where encoded text is an NSData (a buffer of bytes) while decoded, valid text is an NSString, and all the methods on NSStrings operate on the grapheme clusters that decoded text is composed of, rather than on the bytes or codepoints or "characters" that are only relevant to encoded text.
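Python 3 is one mainstream language that does draw a similar line. A quick sketch of the distinction:

```python
# Sketch of Python 3's bytes/str split, analogous to NSData/NSString:
# raw bytes must be explicitly decoded before they behave like text.
raw = b"caf\xc3\xa9"         # encoded bytes (UTF-8 on the wire)
text = raw.decode("utf-8")   # decoded, known-valid text: "café"
print(len(raw))              # 5 -- bytes
print(len(text))             # 4 -- code points
# Mixing them fails loudly rather than producing silent nonsense:
# raw + text  ->  TypeError: can't concat str to bytes
```

(Python's str still exposes code points rather than grapheme clusters, so it only goes halfway compared to NSString.)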
Source: the same Wikipedia article you linked.
There is no space for expansion without reassigning private use areas or changing the encoding mechanism of surrogates, which is currently completely specified (each surrogate pair produces a valid code point).
That's a property of the ISO-8859 encodings, and it turns out to be pretty terrible: you have no way of distinguishing utter garbage/nonsense from actual text.
> I use B85 or B64 when needed but was wondering if there is a Unicode encoding that will do the job.
Base64 and Base85 are pretty much the opposite of Unicode encodings. Encodings turn text (abstract Unicode code points) into actual bytes you can store; Base64 and Base85 turn bytes into ASCII or ASCII-compatible text.
I know, and that's the purpose I use it for: to smuggle in binary data as text. But there's no reason why an encoding scheme should not be dual-purpose. Thanks for the tip about ISO-8859.
You take the first six bits of your binary data and convert them to some ASCII char mapping, then the next six bits, and so on.
You can't do that with Unicode, and it wouldn't make sense for any other encoding standard.
They are completely different things.
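For the curious, that six-bits-at-a-time mapping is exactly what the standard library's Base64 does. A small sketch:

```python
# Small sketch of the 6-bits-at-a-time mapping, standard library only.
import base64

data = bytes([0b10110101])        # one byte of arbitrary binary data
print(base64.b64encode(data))     # b'tQ==' -- ASCII-safe text
print(base64.b64decode(b"tQ=="))  # b'\xb5' -- round-trips exactly
```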
It's a really silly, stupid situation I need this for: exchanging data with Python, where I can only use Unicode strings.
Sure, kinda: https://github.com/pfrazee/base-emoji (it's a base256 using emoji), but then you still need to encode that text, which is going to require 4 bytes per symbol, so I'm not sure you're going to get any actual gain over B64/B85 in the end. There's also the option of using a subset of the U+0100~U+07FF range, as it encodes to 2 bytes in both UTF-8 and UTF-16 (though there are diacritics in these blocks, which may not be ideal, and some of the code points are reserved but not allocated, so…).
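A quick Python sketch to sanity-check the 2-bytes-in-both-encodings claim:

```python
# Quick check: code points in U+0100..U+07FF take two bytes in both
# UTF-8 and UTF-16 (LE variant used so no BOM is counted).
for cp in (0x0100, 0x07FF):
    ch = chr(cp)
    assert len(ch.encode("utf-8")) == 2
    assert len(ch.encode("utf-16-le")) == 2
print("both ends of the range encode to 2 bytes each way")
```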
So with most systems being UTF-8, this means that the encoding is "UTF-8, except with bytes that would be invalid UTF-8 mapping to this range of surrogates". So you've got an encoding with no holes, which is compatible with all valid UTF-8 (since those surrogates are invalid UTF-8).
It's not common, so it doesn't match that criterion of yours, but I think it's a great solution and I hope it catches on. We need a solution for "text, except with graceful fallback when someone decided to put arbitrary bytes in there".
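This is exactly what Python's surrogateescape error handler (PEP 383) does. A minimal demonstration:

```python
# Minimal sketch of Python's surrogateescape error handler (PEP 383):
# undecodable bytes map onto lone surrogates, so arbitrary bytes
# round-trip through str without loss.
raw = b"valid \xff invalid"                  # \xff is not valid UTF-8
text = raw.decode("utf-8", errors="surrogateescape")
print(repr(text))                            # ...'\udcff'... in place of \xff
back = text.encode("utf-8", errors="surrogateescape")
assert back == raw                           # lossless round trip
```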
Since Python happened to be common between our two scenarios, I wanted to caution other HN'ers not to put the blame on Python. In my case Python was indeed doing the right thing.
It's not a very good property to have. UTF-8's invalid sequences mean that it's very easy to detect when something is not UTF-8: iconv's charset detector will conclude UTF-8 with certainty if it sees just 3 valid UTF-8 multibyte sequences and no invalid sequences, and that's pretty much the only reliable detection there is.
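In Python, that kind of detection is a one-liner sketch:

```python
# Sketch: UTF-8's strict structure makes non-UTF-8 data easy to reject.
def looks_like_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8(b"caf\xc3\xa9"))  # True  -- valid UTF-8
print(looks_like_utf8(b"caf\xe9"))      # False -- Latin-1 "café"
```

An ISO-8859 encoding gives you no equivalent test, because every byte sequence decodes "successfully".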
"Below are the guidelines that were used in defining the UCS
6) It should be possible to find the start of a character efficiently starting from an arbitrary location in a byte stream."
If they had used "10" as a marker for "this is the start of a two-byte sequence", it could not also have been used for "this is a byte in a multi-byte sequence, but not the first one".
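That self-synchronization property is easy to see in a sketch: from any byte offset, scanning backwards past continuation bytes finds the start of the current character.

```python
# Sketch of UTF-8 self-synchronization: continuation bytes are always
# 0b10xxxxxx, so the start of a character is the nearest byte backwards
# that doesn't match that pattern.
def char_start(buf: bytes, i: int) -> int:
    while i > 0 and (buf[i] & 0b11000000) == 0b10000000:
        i -= 1
    return i

buf = "héllo".encode("utf-8")  # b'h\xc3\xa9llo' -- é is 0xC3 0xA9
print(char_start(buf, 2))      # 1 -- offset 2 is a continuation byte
```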
This article is much better at telling you relevant things (with nice data-driven visualizations!) about how Unicode is used now.
Who's seen this before: You're getting up to speed on your new company's legacy code base. Q: "Hey, senior developer, this API takes a string as an argument and we need to display it, fine. What encoding are we using?" A: ¯\_(ツ)_/¯
I wrote a console-based mail client, and I struggled with UTF for months before it was all done, and even then there were niggles caused by libraries. For example, passing a C++ string to a Lua script and then using the `string:len()` Lua function would return the number of bytes rather than the length of the rendered string, which was something I had to work around.
Do Unicode strings really have a rendered length?
> So the “length” of ... characters depends ... on the display font
I know that eev.ee points out that wcwidth() is inconsistent (emoji screwed it up; then Unicode 9 gave emoji a consistent width, but most libraries aren't on Unicode 9, so the situation is currently more inconsistent than before). But the situation is a lot better than just saying "I don't know, blame fonts".
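For reference, a rough wcwidth()-style approximation can be sketched from the standard library's East Asian Width data:

```python
# Rough sketch of a wcwidth()-style column count using the standard
# library's East Asian Width property. Real wcwidth() handles more
# cases (e.g. zero-width combining marks), so this is an approximation.
import unicodedata

def cell_width(s: str) -> int:
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in s)

print(len("日本語"))         # 3 code points
print(cell_width("日本語"))  # 6 terminal columns
```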
Putting a link to that Spolsky article in the comments on HN, reddit, etc. whenever some other article on Unicode is posted has become an in-joke. You'd find it impossible to "retire" it, even honorably. Though andrewl didn't follow the apparent protocol of referring to that article with some adjective like "excellent" or "brilliant".
>In Unicode, a letter maps to something called a code point which is still just a theoretical concept.
This is incorrect. A letter or character (a user-perceived character) can be multiple code points.
Code points don't have intrinsic meaning across languages. If you want to count the number of letters in a text or to edit text, you must work at the grapheme cluster level. Even an absolute-minimum introduction to Unicode must at least mention grapheme clusters and user-perceived characters.
For those that don't know, Joel Spolsky used to be a PM on Excel back in the early 90s.