It was such a perfect abbreviation, but now I probably shouldn't use it, as it would be confused with Simon Sapin's WTF-8, which people would actually use on purpose.
Sorry for hijacking it!
That is an amazing example.
It's not even "double UTF-8", it's UTF-8 six times (including the one to get it on the Web), it's been decoded as Latin-1 twice and Windows-1252 three times, and at the end there's a non-breaking space that's been converted to a space. All to represent what originated as a single non-breaking space anyway.
Which makes me happy that my module solves it.
>>> from ftfy.fixes import fix_encoding_and_explain
>>> fix_encoding_and_explain("ÃƒÆ’Ã‚Æ’ÃƒÂ¢Ã‚â‚¬Ã‚Å¡ÃƒÆ’Ã‚â€šÃƒâ€šÃ‚Â the future of publishing at W3C")
('\xa0the future of publishing at W3C',
 [('encode', 'sloppy-windows-1252', 0),
  ('transcode', 'restore_byte_a0', 2),
  ('decode', 'utf-8-variants', 0),
  ('encode', 'sloppy-windows-1252', 0),
  ('decode', 'utf-8', 0),
  ('encode', 'latin-1', 0),
  ('decode', 'utf-8', 0),
  ('encode', 'sloppy-windows-1252', 0),
  ('decode', 'utf-8', 0),
  ('encode', 'latin-1', 0),
  ('decode', 'utf-8', 0)])
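For reference, the core of how such mojibake compounds can be reproduced by hand; this is a sketch using a non-breaking space as in the example above (two rounds shown, not the full six-layer stack):

```python
# Take a non-breaking space, encode it as UTF-8, then wrongly decode
# the bytes as Latin-1, and repeat. Each round trip doubles the garbage.
s = "\xa0"  # U+00A0 NO-BREAK SPACE
for _ in range(2):
    s = s.encode("utf-8").decode("latin-1")
print(repr(s))  # 'Ã\x82Â\xa0'
```

Mixing in Windows-1252 decoding instead of Latin-1 on some rounds is what produces the curly-quote debris like â€š in the string above.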
The key words "WHAT", "DAMNIT", "GOOD GRIEF", "FOR HEAVEN'S SAKE",
"RIDICULOUS", "BLOODY HELL", and "DIE IN A GREAT BIG CHEMICAL FIRE"
in this memo are to be interpreted as described in [RFC2119].
Converting between UTF-8 and UTF-16 is wasteful, though often necessary.
> wide characters are a hugely flawed idea [parent post]
I know. Back in the early nineties they thought otherwise, and in hindsight were proud that they had used them. But nowadays UTF-8 is usually the better choice (except maybe for some Asian and other later-added scripts that may require more space with UTF-8). I'm not saying UTF-16 would be a better choice then; there are certain other encodings for special cases.
UTF-32/UCS-4 is quite simple, though obviously it imposes a 4x penalty on bytes used. I don't know anything that uses it in practice, though surely something does.
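The 4x penalty is easy to see with Python's built-in codecs (using the little-endian variant so no BOM is added):

```python
s = "wide chars"  # plain ASCII text
assert len(s.encode("utf-8")) == 10       # 1 byte per ASCII character
assert len(s.encode("utf-32-le")) == 40   # 4 bytes per code point, always
```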
Again: wide characters are a hugely flawed idea.
Namely it won't save you from the following problems:
* Precomposed vs multi-codepoint diacritics (Do you write á with
one 32 bit char or with two? If it's Unicode the answer is both)
* Variation selectors (see also Han unification)
* Bidi, RTL and LTR embedding chars
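The first bullet is easy to demonstrate with Python's stdlib; both spellings are valid Unicode for the same user-perceived character:

```python
import unicodedata

precomposed = "\u00e1"   # á as a single code point
decomposed = "a\u0301"   # 'a' followed by U+0301 COMBINING ACUTE ACCENT

assert precomposed != decomposed   # unequal as code point sequences
assert len(precomposed) == 1
assert len(decomposed) == 2

# NFC normalization maps both spellings to the same form
assert unicodedata.normalize("NFC", decomposed) == precomposed
```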
I almost like that utf-16 and more so utf-8 break the "1 character, 1 glyph" rule, because it gets you in the mindset that this is bogus. Because in Unicode it is most decidedly bogus, even if you switch to UCS-4 in a vain attempt to avoid such problems. Unicode just isn't simple any way you slice it, so you might as well shove the complexity in everybody's face and have them confront it early.
Perl6 calls this NFG.
^ link currently broken, the plain-text version is at https://raw.githubusercontent.com/perl6/specs/master/S15-uni...
> The mapping between negative numbers and graphemes in this form is not guaranteed constant, even between strings in the same process.
Sure, more recently, Go and Rust have decided to go with UTF-8, but that's far from common, and it does have some drawbacks compared to the Perl6 (NFG) or Python3 (latin-1, UCS-2, UCS-4 as appropriate) model if you have to do actual processing instead of just passing opaque strings around.
Also note that you have to go through a normalization step anyway if you don't want to be tripped up by having multiple ways to represent a single grapheme.
The overhead is entirely wasted on code that does no character level operations.
For code that does do some character level operations, avoiding quadratic behavior may pay off handsomely.
I agree it's a flawed idea, though. Four billion characters seems like enough for now, but I'd guess UTF-32 will need extending to 64 bits too... and actually, how about decoupling the size from the data entirely? It works well enough in the general case of /every type of data we know about/, so I'm pretty sure this specialised use case is not very special.
That's not remotely comparable to the situation in Windows, where file names are stored on disk in a 16 bit not-quite-wide-character encoding, etc... And it's leaked into firmware. GPT partition names and UEFI variables are 16 bit despite never once being used to store anything but ASCII, etc... All that software is, broadly, incompatible and buggy (and of questionable security) when faced with new code points.
But we don't seem to be running out -- Planes 3-13 are completely unassigned so far, covering 30000-DFFFF. That's nearly 65% of the Unicode range completely untouched, and planes 1, 2, and 14 still have big gaps too.
The issue isn't the quantity of unassigned codepoints, it's how many private use ones are available, only 137,000 of them. Publicly available private use schemes such as ConScript are fast filling up this space, mainly by encoding block characters in the same way Unicode encodes Korean Hangul, i.e. by using a formula over a small set of base components to generate all the block characters.
My own surrogate scheme, UTF-88, implemented in Go at https://github.com/gavingroovygrover/utf88 , expands the number of UTF-8 codepoints to 2 billion as originally specified by using the top 75% of the private use codepoints as 2nd tier surrogates. This scheme can easily be fitted on top of UTF-16 instead. I've taken the liberty in this scheme of making 16 planes (0x10 to 0x1F) available as private use; the rest are unassigned.
I created this scheme to help in using a formulaic method to generate a commonly used subset of the CJK characters, perhaps in the codepoints which would be 6 bytes under UTF-8. It would be more difficult than the Hangul scheme because CJK characters are built recursively. If successful, I'd look at pitching the UTF-88 surrogation scheme for UTF-16 and having UTF-8 and UTF-32 officially extended to 2 billion characters.
With typing the interest here would be more clear, of course, since it would be more apparent that nil inhabits every type.
The HTML5 spec formally defines consistent handling for many errors. That's OK, there's a spec. Stop there. Don't try to outguess new kinds of errors.
As to draconian error handling, that’s what XHTML is about and why it failed. Just define a somewhat sensible behavior for every input, no matter how ugly.
> I have been told multiple times now that my point of view is wrong and I don't understand beginners, or that the “text model” has been changed and my request makes no sense.
"The text model has changed" is a perfectly legitimate reason to turn down ideas consistent with the previous text model and inconsistent with the current model. Keeping a coherent, consistent model of your text is a pretty important part of curating a language. One of Python's greatest strengths is that they don't just pile on random features, and keeping old crufty features from previous versions would amount to the same thing. To dismiss this reasoning is extremely shortsighted.
Maybe this has been your experience, but it hasn't been mine. Using Python 3 was the single best decision I've made in developing a multilingual website (we support English/German/Spanish). There's not a ton of local IO, but I've upgraded all my personal projects to Python 3.
Your complaint, and the complaint of the OP, seems to be basically, "It's different and I have to change my code, therefore it's bad."
Now we have a Python 3 that's incompatible with Python 2 but provides almost no significant benefit, solves none of the large well-known problems, and introduces quite a few new ones.
Python 2 handling of paths is not good because there is no good abstraction over different operating systems, treating them as byte strings is a sane lowest common denominator though.
Python 3 pretends that paths can be represented as unicode strings on all OSes, that's not true. That is held up with a very leaky abstraction and means that Python code that treats paths as unicode strings and not as paths-that-happen-to-be-unicode-but-really-arent is broken. Most people aren't aware of that at all and it's definitely surprising.
On top of that, implicit coercions have been replaced with implicit, broken guessing of encodings, for example when opening files.
On the guessing of encodings when opening files: that's not really a problem. Ideally, the caller should specify the encoding. If you don't know the encoding of the file, how can you decode it? You can still open it as raw bytes if required.
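Concretely, both options look like this (a sketch; the file name and contents are made up for illustration):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt")  # hypothetical file

# Write with a known encoding, then read it back explicitly rather than
# trusting the locale default.
with open(path, "w", encoding="utf-8") as f:
    f.write("caf\u00e9")

with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# If the encoding is genuinely unknown, fall back to raw bytes.
with open(path, "rb") as f:
    raw = f.read()

assert text == "caf\u00e9"
assert raw == b"caf\xc3\xa9"
```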
Slicing or indexing into unicode strings is a problem because it's not clear what unicode strings are strings of. You can look at unicode strings from different perspectives and see a sequence of codepoints or a sequence of characters, both can be reasonable depending on what you want to do. Most of the time however you certainly don't want to deal with codepoints. Python however only gives you a codepoint-level perspective.
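A quick illustration of where the codepoint-level perspective bites:

```python
s = "cafe\u0301"           # "café" spelled with a combining acute accent
assert len(s) == 5         # five code points, four user-perceived characters
assert s[:4] == "cafe"     # slicing by code point drops the accent
```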
Guessing encodings when opening files is a problem precisely because - as you mentioned - the caller should specify the encoding, not just sometimes but always. Guessing an encoding based on the locale or the content of the file should be the exception and something the caller does explicitly.
Fortunately it's not something I deal with often but thanks for the info, will stop me getting caught out later.
I get that every different thing (character) is a different Unicode number (code point). To store / transmit these you need some standard (an encoding) for writing them down as a sequence of bytes (code units; depending on the encoding, each code unit is a different number of bytes, and a code point may take more than one code unit).
How is any of that in conflict with my original points? Or is some of my above understanding incorrect.
I know you have a policy of not replying to people, so maybe someone else could step in and clear up my confusion.
As the user of unicode I don't really care about that. If I slice characters I expect a slice of characters. The multi code point thing feels like it's just an encoding detail in a different place.
I guess you need some operations to get to those details if you need. Man, what was the drive behind adding that extra complexity to life?!
Thanks for explaining. That was the piece I was missing.
And I mean, I can't really think of any cross-locale requirements fulfilled by unicode.upper (maybe case-insensitive matching, but then you also want to do lots of other filtering).
According to the Unicode Technical Report #26 that defines CESU-8, CESU-8 is a Compatibility Encoding Scheme for UTF-16 ("CESU"). In fact, the way the encoding is defined, the source data must be represented in UTF-16 prior to converting to CESU-8. Since UTF-16 cannot represent unpaired surrogates, I think it's safe to say that CESU-8 cannot represent them either.
>UTF-16 is designed to represent any Unicode text, but it can not represent a surrogate code point pair since the corresponding surrogate 16-bit code unit pairs would instead represent a supplementary code point. Therefore, the concept of Unicode scalar value was introduced and Unicode text was restricted to not contain any surrogate code point. (This was presumably deemed simpler than only restricting pairs.)
This is all gibberish to me. Can someone explain this in laymans terms?
Characters outside the Basic Multilingual Plane (BMP) are encoded as a pair of 16-bit code units. The numeric value of these code units denote codepoints that lie themselves within the BMP. While these values can be represented in UTF-8 and UTF-32, they cannot be represented in UTF-16. Because we want our encoding schemes to be equivalent, the Unicode code space contains a hole where these so-called surrogates lie.
Because not everyone gets Unicode right, real-world data may contain unpaired surrogates, and WTF-8 is an extension of UTF-8 that handles such data gracefully.
If I was to make a first attempt at a variable-length but well-defined, backwards-compatible encoding scheme, I would use something like the number of bits up to (and including) the first 0 bit as defining the number of bytes used for this character. So,
> 0xxxxxxx, 1 byte
> 10xxxxxx, 2 bytes
> 110xxxxx, 3 bytes.
We would never run out of codepoints, and legacy applications can simply ignore codepoints they don't understand. We would only waste 1 bit per byte, which seems reasonable given just how many problems encodings usually present. Why wouldn't this work, apart from already existing applications that do not know how to do this?
As to running out of code points, we're limited by UTF-16 (up to U+10FFFF). UTF-32 could cover a full 32 bits, and UTF-8 as originally defined went up to 31.
But inserting a codepoint with your approach would require all downstream bits to be shifted within and across bytes, something that would be a much bigger computational burden. It's unlikely that anyone would consider saddling themselves with that for a mere 25% space savings over the dead-simple and memcpy-able UTF-32.
In addition, there's a 95% chance you're not dealing with enough text for UTF-32 to hurt. If you're in the other 5%, then a packing scheme that's 1/3 more efficient is still going to hurt. There's no good use case.
Coding for variable-width takes more effort, but it gives you a better result. You can divide strings appropriate to the use. Sometimes that's code points, but more often it's probably characters or bytes.
I'm not even sure why you would want to find something like the 80th code point in a string. It's rare enough to not be a top priority.
WTF8 extends UTF8 with unpaired surrogates (and unpaired surrogates only, paired surrogates from valid UTF16 are decoded and re-encoded to a proper UTF8-valid codepoint) which allows interaction with legacy UCS2 systems.
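Python's `surrogatepass` error handler makes this concrete: strict UTF-8 rejects a lone surrogate, while `surrogatepass` emits the WTF-8-style byte sequence for it (a sketch; Python has no WTF-8 codec as such):

```python
lone = "\ud800"                      # an unpaired high surrogate
try:
    lone.encode("utf-8")             # strict UTF-8 refuses it
except UnicodeEncodeError:
    pass
else:
    raise AssertionError("expected strict UTF-8 to reject a lone surrogate")

# 'surrogatepass' produces the generalized-UTF-8 bytes instead
wtf8 = lone.encode("utf-8", "surrogatepass")
assert wtf8 == b"\xed\xa0\x80"
assert wtf8.decode("utf-8", "surrogatepass") == lone   # round-trips
```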
Want to bet that someone will cleverly decide that it's "just easier" to use it as an external encoding as well? This kind of cat always gets out of the bag eventually.
Note the WTF-8 entry has only been there for a few minutes; I just added it. It might be removed for non-notability.
And because of this global confusion, everyone important ends up implementing something that somehow does something moronic - so then everyone else has yet another problem they didn't know existed and they all fall into a self-harming spiral of depravity.
[glyphs] <-> [grapheme clusters / characters] <-> [codepoints] <-> [code units] <-> [bytes]
Then, it's possible to make mistakes when converting between representations, eg getting endianness wrong.
Some issues are more subtle: in principle, the decision of what should be considered a single character may depend on the language, never mind the debate about Han unification - but as far as I'm concerned, that's a WONTFIX.
Below is all the background I had to learn about to understand the motivation/details.
UCS-2 was designed as a 16-bit fixed-width encoding. When it became clear that 64k code points wasn’t enough for Unicode, UTF-16 was invented to deal with the fact that UCS-2 was assumed to be fixed-width, but no longer could be.
The solution they settled on is weird, but has some useful properties. Basically they took a couple code point ranges that hadn’t been assigned yet and allocated them to a “Unicode within Unicode” coding scheme. This scheme encodes (1 big code point) -> (2 small code points). The small code points will fit in UTF-16 “code units” (this is our name for each two-byte unit in UTF-16). And for some more terminology, “big code points” are called “supplementary code points”, and “small code points” are called “BMP code points.”
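The arithmetic of the scheme is simple enough to sketch (the helper name is made up; Python's `utf-16-be` codec confirms the result):

```python
def to_surrogate_pair(cp):
    """Split a supplementary code point (>= 0x10000) into a UTF-16 pair."""
    assert cp >= 0x10000
    v = cp - 0x10000             # 20 bits remain after the offset
    high = 0xD800 + (v >> 10)    # top 10 bits -> high (leading) surrogate
    low = 0xDC00 + (v & 0x3FF)   # low 10 bits -> low (trailing) surrogate
    return high, low

# U+1F600, a "big" code point, becomes two BMP-range code units
assert to_surrogate_pair(0x1F600) == (0xD83D, 0xDE00)
assert "\U0001F600".encode("utf-16-be") == b"\xd8\x3d\xde\x00"
```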
The weird thing about this scheme is that we bothered to make the “2 small code points” (known as a “surrogate” pair) into real Unicode code points. A more normal thing would be to say that UTF-16 code units are totally separate from Unicode code points, and that UTF-16 code units have no meaning outside of UTF-16. A number like 0xd801 could have a code unit meaning as part of a UTF-16 surrogate pair, and also be a totally unrelated Unicode code point.
But the one nice property of the way they did this is that they didn’t break existing software. Existing software assumed that every UCS-2 character was also a code point. These systems could be updated to UTF-16 while preserving this assumption.
Unfortunately it made everything else more complicated. Because now:
- UTF-16 can be ill-formed if it has any surrogate code units that don’t pair properly.
- we have to figure out what to do when these surrogate code points — code points whose only purpose is to help UTF-16 break out of its 64k limit — occur outside of UTF-16.
This becomes particularly complicated when converting UTF-16 -> UTF-8. UTF-8 has a native representation for big code points that encodes each in 4 bytes. But since surrogate code points are real code points, you could imagine an alternative UTF-8 encoding for big code points: make a UTF-16 surrogate pair, then UTF-8 encode the two code points of the surrogate pair (hey, they are real code points!) into UTF-8. But UTF-8 disallows this and only allows the canonical, 4-byte encoding.
If you feel this is unjust and UTF-8 should be allowed to encode surrogate code points if it feels like it, then you might like Generalized UTF-8, which is exactly like UTF-8 except this is allowed. It’s easier to convert from UTF-16, because you don’t need any specialized logic to recognize and handle surrogate pairs. You still need this logic to go in the other direction though (GUTF-8 -> UTF-16), since GUTF-8 can have big code points that you’d need to encode into surrogate pairs for UTF-16.
If you like Generalized UTF-8, except that you always want to use surrogate pairs for big code points, and you want to totally disallow the UTF-8-native 4-byte sequence for them, you might like CESU-8, which does this. This makes both directions of CESU-8 <-> UTF-16 easy, because neither conversion requires special handling of surrogate pairs.
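The two encodings of a big code point are easy to compare in Python (a sketch; Python has no CESU-8 codec, so the surrogate pair is recovered via UTF-16 and each unit is UTF-8-encoded with `surrogatepass`):

```python
s = "\U0001F600"

# Recover the surrogate pair via UTF-16, then UTF-8-encode each unit
# separately, CESU-8 style.
units = s.encode("utf-16-be")
hi = int.from_bytes(units[:2], "big")
lo = int.from_bytes(units[2:], "big")
cesu8 = (chr(hi).encode("utf-8", "surrogatepass")
         + chr(lo).encode("utf-8", "surrogatepass"))

assert cesu8 == b"\xed\xa0\xbd\xed\xb8\x80"      # six bytes, CESU-8 style
assert s.encode("utf-8") == b"\xf0\x9f\x98\x80"  # canonical four-byte UTF-8
```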
A nice property of GUTF-8 is that it can round-trip any UTF-16 sequence, even if it’s ill-formed (has unpaired surrogate code points). It’s pretty easy to get ill-formed UTF-16, because many UTF-16-based APIs don’t enforce wellformedness.
But both GUTF-8 and CESU-8 have the drawback that they are not UTF-8 compatible. UTF-8-based software isn’t generally expected to decode surrogate pairs — surrogates are supposed to be a UTF-16-only peculiarity. Most UTF-8-based software expects that once it performs UTF-8 decoding, the resulting code points are real code points (“Unicode scalar values”, which make up “Unicode text”), not surrogate code points.
So basically what WTF-8 says is: encode all code points as their real code point, never as a surrogate pair (like UTF-8, unlike GUTF-8 and CESU-8). However, if the input UTF-16 was ill-formed and contained an unpaired surrogate code point, then you may encode that code point directly with UTF-8 (like GUTF-8, not allowed in UTF-8).
So WTF-8 is identical to UTF-8 for all valid UTF-16 input, but it can also round-trip invalid UTF-16. That is the ultimate goal.
> If, on the other hand, the input contains a surrogate code point pair, the conversion will be incorrect and the resulting sequence will not represent the original code points.
It might be clearer to say: "the resulting sequence will not represent the surrogate code points." It might be that, by some fluke, the user actually intends the surrogate sequence in the input to be interpreted as UTF-16. And this isn't really lossy, since (AFAIK) the surrogate code points exist for the sole purpose of representing surrogate pairs.
The more interesting case here, which isn't mentioned at all, is that the input contains unpaired surrogate code points. That is the case where the UTF-16 will actually end up being ill-formed.
I thought it was a distinct encoding and all related problems were largely imaginary provided you /just/ handle things right...
Sadly systems which had previously opted for fixed-width UCS2 and exposed that detail as part of a binary layer and wouldn't break compatibility couldn't keep their internal storage to 16 bit code units and move the external API to 32.
What they did instead was keep their API exposing 16 bits code units and declare it was UTF16, except most of them didn't bother validating anything so they're really exposing UCS2-with-surrogates (not even surrogate pairs since they don't validate the data). And that's how you find lone surrogates traveling through the stars without their mate and shit's all fucked up.
> UTF-16 was redefined to be ill-formed if it contains unpaired surrogate 16-bit code units.
This is incorrect. UTF-16 did not exist until Unicode 2.0, which was the version of the standard that introduced surrogate code points. UCS-2 was the 16-bit encoding that predated it, and UTF-16 was designed as a replacement for UCS-2 in order to handle supplementary characters properly.
> UTF-8 was similarly redefined to be ill-formed if it contains surrogate byte sequences.
Not really true either. UTF-8 became part of the Unicode standard with Unicode 2.0, and so incorporated surrogate code point handling. UTF-8 was originally created in 1992, long before Unicode 2.0, and at the time was based on UCS. I'm not really sure it's relevant to talk about UTF-8 prior to its inclusion in the Unicode standard, but even then, encoding the code point range D800-DFFF was not allowed, for the same reason it was actually not allowed in UCS-2: this code point range was unallocated (it was in fact part of the Special Zone, which I am unable to find an actual definition for in the scanned dead-tree Unicode 1.0 book, but I haven't read it cover-to-cover).

The distinction is that it was not considered "ill-formed" to encode those code points, so it was perfectly legal to receive UCS-2 that encoded those values, process it, and re-transmit it (just as it's legal to process and retransmit text streams that represent characters unknown to the process; the assumption is that the process that originally encoded them understood the characters). So technically yes, UTF-8 changed from its original definition based on UCS to one that explicitly considers encoding D800-DFFF as ill-formed, but UTF-8 as it has existed in the Unicode standard has always considered it ill-formed.
> Unicode text was restricted to not contain any surrogate code point. (This was presumably deemed simpler than only restricting pairs.)
This is a bit of an odd parenthetical. Regardless of encoding, it's never legal to emit a text stream that contains surrogate code points, as these points have been explicitly reserved for the use of UTF-16. The UTF-8 and UTF-32 encodings explicitly consider attempts to encode these code points as ill-formed, but there's no reason to ever allow it in the first place as it's a violation of the Unicode conformance rules to do so. Because there is no process that can possibly have encoded those code points in the first place while conforming to the Unicode standard, there is no reason for any process to attempt to interpret those code points when consuming a Unicode encoding. Allowing them would just be a potential security hazard (which is the same rationale for treating non-shortest-form UTF-8 encodings as ill-formed). It has nothing to do with simplicity.