
The string type is broken - DmitryNovikov
http://mortoray.com/2013/11/27/the-string-type-is-broken/
======
tomp
The problem with text (one that Unicode solves only partially) is that text
representation, being a representation of human thought, is inherently
ambiguous and imprecise.

Some examples:

(1) A == A but A != Α. The last letter is not uppercase "a", but uppercase
"α". Most of the time, the difference is important, but sometimes humans want
to ignore it (imagine you can't find an entry in a database because it
contains Α, which looks just like A). Google gives different autocomplete
suggestions for A and Α. Is this outcome expected? Is it desired?

(2) The Turkish alphabet is mostly the same as the Latin alphabet, except for
the letter "i", which exists in two variants: dotless ı and dotted i (as in
Latin). For the sake of consistency, this distinction is kept in the upper
case as well: dotless I (as in Latin) and dotted İ. We can see that not even
the uppercase <==> lowercase transformation is defined for text independently
of language.
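
A quick illustration of the locale dependence, in Python 3 (whose str.upper()
applies the default, locale-independent Unicode case mapping):

    >>> 'istanbul'.upper()   # default Unicode mapping: i -> I
    'ISTANBUL'               # a Turkish reader expects 'İSTANBUL'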

These are just two examples of problems with text processing that arise even
before all the problems with Unicode (combining characters, ligatures, double-
width characters, ...) and without considering all the conventions and
exceptions that exist in richer (mostly Asian) alphabets.

~~~
zokier
I think (2) is an issue with Unicode specifically. They should have specified
Turkish alphabet to use ı and a diacritic to make the dotted one. That would
have made (in this case) capitalization locale-independent.

~~~
jheriko
Isn't this solving the wrong side of the problem? How about not having to
think about such things at all, and just accepting that uppercase/lowercase
conversion is never going to be language-agnostic?

That's future-proof and powerful, rather than extra thinking and work...

~~~
zokier
Most likely case changes need to be locale-aware, that is true. But I still
think minimizing the number of locale-specific rules is a reasonable goal,
and in that light I dislike the common use of the Turkish i as an example,
because it is such an obviously fixable flaw in Unicode (if legacy
compatibility weren't a concern) rather than a fundamental issue.

~~~
jheriko
You are right, everything should be as easy as possible. This is a good
philosophy for design in general...

------
pilif
A nitpick from the article

 _> This spells trouble for languages using UTF-16 encodings (Java, C#,
JavaScript)._

If they were using UTF-16, this wouldn't be a problem, as UTF-16 can
perfectly well encode code points outside of the BMP (at the cost of losing
O(1) access to specific code points, of course: if you need to know what the
n-th code point is, you have to scan the string until the n-th position).

They are, however, using UCS-2, which can't. If you use a library that only
knows about UCS-2 to work on strings encoded in UTF-16, you will get broken
characters, your counts will be off, and case transformations might fail.

Most languages that claim Unicode support still only have UCS-2 libraries
(Python 3 is a notable exception).
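
To make the difference concrete, here is Python 3.3+ (which stores real code
points) handling a character outside the BMP; a UCS-2 library would see two
code units where there is one code point:

    >>> s = '\U0001F638'                   # emoji cat face, outside the BMP
    >>> len(s)                             # one code point
    1
    >>> len(s.encode('utf-16-le')) // 2    # but two UTF-16 code units
    2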

~~~
jeltz
> Most languages that claim Unicode support still only have UCS-2 libraries
> (Python 3 is a notable exception)

Most non-JVM languages[1] actually use UTF-8 as the internal encoding, so
they should not suffer from this. Python 3 does not use UTF-16 either; it
selects an encoding based on the contents of the string.

[http://www.python.org/dev/peps/pep-0393/](http://www.python.org/dev/peps/pep-0393/)

1. I think .NET uses UCS-2 or UTF-16 too, but I am not a Windows developer.

~~~
berdario
Python <3.3 uses UCS2 or UCS4, depending on the build

Ruby >1.8 lets you choose the encoding

.NET UCS2/UTF-16 (I know the difference, imho if the stdlib has a .size,
.length or .count that works on code units instead of code points it's
broken... thus I'll mention only UCS2 from now on)

Java UCS2

Clojure UCS2

Scala UCS2

Qt UCS2

Haskell String UCS4

Haskell Data.Text UTF-16 (yes, not a naive UCS-2)

Rust UCS4 (last time I checked)

JavaScript UCS2

Dart UCS2

PHP Unicode-oblivious

Vala UCS4

Go UTF-8 (but it lets you call len() on strings, and it doesn't return the
length of the string, but its size in bytes)

I can't really think of another language that uses UTF-8 internally; are you
sure?

~~~
pitterpatter
> Rust UCS4 (last time I checked)

Rust chars are 32-bit Unicode code points, but strings themselves are UTF-8.
That is, the string type ~str is basically just ~[u8], a vector of bytes,
not ~[char].

`.len()` [O(1)] gives you the byte length, while `.char_len()` [O(n)] gives
you the number of code points.

So strings in Rust are just vectors of bytes with the invariant that the
contents are valid UTF-8.

~~~
berdario
Thanks, I didn't know that

------
pilif
Python 3 gets so much of this right. It's one of the things I really loved
about Python 3, as it allows for correct string handling in most cases (see
below).

Note that this is only really true with Python 3.3 and later: in earlier
versions, things would start breaking for characters outside of the BMP
(which is where JS is still stuck, btw) unless you had a wide build, which
used a lot of memory for strings (4 bytes per character).

In general, internally using Unicode and converting to and from bytes when
doing I/O is the right way to go.
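
In Python 3 terms, that boundary looks something like this:

    >>> raw = 'noël'.encode('utf-8')   # text -> bytes on the way out
    >>> raw
    b'no\xc3\xabl'
    >>> raw.decode('utf-8')            # bytes -> text on the way in
    'noël'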

But: due to
[http://en.wikipedia.org/wiki/Han_unification](http://en.wikipedia.org/wiki/Han_unification)
being locked into Unicode with a language might not be feasible for
everybody. Especially in Asian regions, Unicode isn't yet as widespread, and
you still need to deal with regional encodings, mainly because even with the
huge character set of Unicode, we still can't reliably write in every
language.

Ruby 1.9 and later helps here by having many, many string types (one per
encoding it knows), which can't be assigned to each other without conversion.

This allows you to still have an internal character set for your application
and do encoding/decoding at I/O time, but you're not stuck with Unicode if
that's not feasible for your use case.

People hate this, though, because it seems to interfere with their otherwise
perfectly fine workflow ("why can't I assign this "string" I got from a user
to this string variable here??"), but it's actually preventing data
corruption (once strings of multiple encodings are mixed up, it's often
impossible to un-mix them if they have the same character width).

I don't know how good the library support for the various Unicode encodings
is in Ruby, though. According to the article, there is still trouble with
correctly doing case transformations and reversing them.

Which brings me to another point: some of the stuff you do with strings
depends not just on the encoding, but also on the locale.

Uppercasing rules, for example, depend on the locale, so you need to take
that into account too. And, of course, you have to deal with cases where you
don't know the locale the string was written in (encoding is hard enough and
mostly undetectable, but locale is next to impossible).

I laugh at people who constantly tell me that this isn't hard and that "it's
just strings".

~~~
lelf
> _Python 3 gets so much of this right_

What does it get right? It's as broken as nearly everything else!

It's sad that 99% of the comments here are "oh see, I can run _some_
examples from the page just fine. So everything's all right, I've got full
Unicode!"

The reality is that there are one or two languages trying to make this
correct from the beginning (Perl 6, I'm looking at you). It's 2013, and if a
language can compose bytes into code points, everyone declares a win, sticks
a "full unicode support" label on it and continues to _str[5:9]_.

"But I've got UnicodeUtils!" - it won't help. People just don't want to, or
cannot, write it correctly. A word is not [a-z]. Not [[:alpha:]] either. And
not [insert regex here]. You cannot reverse a string by reversing its code
point list. And you cannot reverse it by reversing its grapheme list. And
string length is hard to compute, and then it doesn't mean anything. And
indexing into a string doesn't make any sense, and it's far away from O(1).

~~~
wbond
Can you provide some examples of Python 3 getting strings wrong?

Between strings being sequences of Unicode code points natively (you have to
encode to bytes to get UTF-8) and unicodedata for normalization and
decomposition
([http://docs.python.org/3.3/library/unicodedata.html](http://docs.python.org/3.3/library/unicodedata.html)),
I've found Python 3 pretty robust. Python 3.3 also uses appropriate Unicode
data for regular expressions, as mentioned in
[http://docs.python.org/3.3/howto/regex.html](http://docs.python.org/3.3/howto/regex.html).

If you want to compare strings, you should really normalize them first,
which is where unicodedata comes in. In my programming situations it would
be wrong to conflate different decompositions of the same Unicode string.
Why is this? Because other software you interact with uses encodings, and
the UTF-8 encodings of two different decompositions are different. I've run
into this with UTF-8 filenames on OS X when working with Subversion.
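
For example, with the decomposed and precomposed spellings of "noël":

    >>> import unicodedata
    >>> a, b = 'noe\u0308l', 'no\u00ebl'   # decomposed vs. precomposed
    >>> a == b
    False
    >>> unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)
    True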

~~~
lelf
Did you read the comment you're replying to at all? You can start at "It's
sad that 99% of the comments".

PS:

    
    
      Python 3.3.2 (default, Nov 27 2013, 20:04:48)
      [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      >>> 'öo̧'[1:]
      '̈o̧'
    
    

And sorry, those new regexes don't even support \X (grapheme matching)

Edit: python version

~~~
wbond
Yes, I did, and you did not provide a single example. You just said "oh see,
I can run some examples from the page just fine. So everything's all right,
I've got full Unicode!"

Taking the time to actually prove your point is useful. However, your recent
example seems to run fine on Python 3.3. You did not include any version
info in your example output.

    
    
        Python 3.3.0 (default, Mar 11 2013, 00:32:12) 
        [GCC 4.7.2] on linux
        Type "help", "copyright", "credits" or "license" for more information.
        >>> "öo̧"[1:]
        'o̧'
        >>>
    

I haven't run across any situations where Python 3.3 does the wrong thing,
which is why I am asking for some examples.

~~~
lelf
3.3.2. No, it is not. Use 'o\u0308o\u0327'

~~~
wbond
Oh, I see the issue here. You are expecting the string class to operate on
graphemes rather than code points. It should be possible to implement
grapheme support on top of code point support, but I imagine the reverse is
not true.

A little googling turned this up:
[https://mail.python.org/pipermail/python-ideas/2013-July/021916.html](https://mail.python.org/pipermail/python-ideas/2013-July/021916.html)

~~~
tekacs
TL;DR of parent comment here for those skimming:

    
    
      x = 'o\u0308o\u0327'
      len(x) == 4
      x == "öo̧"
      x[:2] == "ö"
      x[2:] == "o̧"
      x[0] == x[2] == "o"

------
this_user
The string type isn't broken. If anything, these "X is broken" posts are
broken. Taking one special case, finding problems with that case, and
deducing that the whole concept must therefore be discarded is just silly.
Strings work fine for the vast majority of use cases. No technology is free
of flaws, and engineering decisions are almost always based on weighing the
pros and cons and choosing a solution that, on balance, works best. Strings
are a useful feature, and Unicode is a notoriously hard problem. Proposing
to go back to arrays of characters makes things worse for most people in
most cases and is therefore not a practical solution.

~~~
mercurial
In my experience, the world is full of software that "works fine for the
majority of use cases" right up to the point where you take the wrong code
path and things go south.

~~~
jodrellblank
Much like human brains, and business processes.

------
judofyr
In many languages it's difficult to fix the string type without breaking
existing code. In Ruby: String#upcase only handles ASCII (by spec), #length
counts codepoints, and #reverse reverses codepoints.

You can use UnicodeUtils if you need "full" Unicode support:

    
    
        >> UnicodeUtils.upcase("baﬄe")
        => "BAFFLE"
        >> graphemes = UnicodeUtils.each_grapheme("noe\u0308l").to_a
        >> graphemes.reverse.join
        => "lëon"
        >> graphemes.size
        => 4
        >> graphemes[0, 3].join
        => "noë"

~~~
lelf
> _String#upcase only handles ASCII (by spec)_

Bad for Ruby

> _You can use UnicodeUtils if you need "full" Unicode support:_

Oh, sure

    
    
      Betty:~ lelf$ ruby -r unicode_utils/u -e 'puts UnicodeUtils.each_grapheme("A‮͜CB‬D").to_a.reverse.join'
      D‬BC͜‮A
    

So, "full" (it's not) Unicode support won't help you if you have little idea
about what you're doing (like indexing
stringه҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿s)

~~~
jhull
what are these characters printing here?

~~~
preek
It's an awesome little gadget: it looks like one character, but is really a
big messy bunch of bytes:

"\xD9\x87\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\
x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD
2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\
xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xB
F\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\
xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xC
C\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\
xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x8
8\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\
x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD
2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\
xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xB
F\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\
xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xC C\xBF\n"

~~~
jhull
Funny that an ad comes up when you search for that character on Google:
[http://imgur.com/Y5vkbHF](http://imgur.com/Y5vkbHF)

------
jbert
Perl seems to pass nearly all the tests (including uppercasing baffle):

    
    
      $ perl -E 'use utf8; binmode STDOUT, ":utf8"; say uc("baﬄe");'
      BAFFLE

The only failure I can see is that it treats "no<combining diaeresis>el" as
5 characters (so it reports the length as 5, and reversing places the accent
on the wrong character). That's documented here:
[http://perldoc.perl.org/perluniintro.html#Handling-Unicode](http://perldoc.perl.org/perluniintro.html#Handling-Unicode):
"Note that Perl considers grapheme clusters to be separate characters"

All else seems to work, though (including precomposed/decomposed string
equality, etc.). The docs also say that Perl's regex engine will Do The
Right Thing and match an entire grapheme cluster as a single char.

~~~
llimllib
Python 3 gets that one right, but Python 2.7 doesn't:

    
    
        $ python3 -c 'print("baﬄe".upper())'
        BAFFLE
        $ python -c 'print "baﬄe".upper()'
        BAﬄE
        $ python -c 'print u"baﬄe".upper()'
        BAÏ¬E

~~~
deathanatos
It's interesting that you get BAFFLE for the first one. I get the same
result in both 3 and 2.

Note first that the reason you get "BAÏ¬E" is a bit of garbage-in,
garbage-out. Strangely, the interpreter isn't rejecting that with the
typical "SyntaxError: Non-ASCII character <char> in file" error; instead, it
appears to be assuming ISO-8859-1 and then performing .upper(). You can fix
that:

    
    
        python2 -c '# coding: utf-8
        print u"baﬄe".upper()'
    

(Note, of course, that the # coding declaration needs to match your
terminal's encoding, which is likely UTF-8, but that isn't guaranteed.)

That, for me, prints "BAﬄE" in both Python 2 and 3 (adjusting for 3 by
adding parens around print and removing the u prefix from the literal). I'm
on Python 3.2, so perhaps 3.3 does better. (I'm behind on updates, but last
I updated, Gentoo stable was still on 3.2.)

~~~
llimllib
Interesting.

I am on Python 3.3, but I don't know if it's the updated interpreter that
fixes the bug.

------
edent
˙ƃuᴉuɐǝɯ ⅋ 'spɹoʍ 'sɥdʎlƃ 'sɹǝʇɔɐɹɐɥɔ uǝǝʍʇǝq 'ɹǝʌǝʍoɥ 'ǝɔuǝɹǝɟɟᴉp ɐ sᴉ ǝɹǝɥ┴
˙ʇxǝʇ ɥʇᴉʍ punoɹɐ ƃuᴉsɹɐ oʇ sǝɯoɔ ʇᴉ uǝɥʍ sǝᴉʇᴉlᴉqᴉssod ƃuᴉʇsǝɹǝʇuᴉ ǝɯos
sɹǝɟɟo ǝpoɔᴉu∩

~~~
antocv
Awesome way to exercise the brain.

~~~
afandian
Interesting. I had no problem reading that (except for 'arsing' which I
thought I'd misread). The ability of the brain to pattern-match upside down is
amazing.

~~~
aroman
Fascinatingly, I read your comment first, then tried to read the upside down
post — the only word I had trouble with was arsing.

Sight reading is really fascinating.

------
lmm
I think the mistake here is seeing a string as an extension of an array or
vector. What I would prefer is a string type that didn't support all the
operations of vectors. The length of a string is not inherently a meaningful
question (and for the cases where it is, what you want is something like a
vector of grapheme clusters - which is a useful type to have, but not so
useful that every string in your program should incur the overhead of creating
such a thing); likewise reversing and splitting are operations that simply
shouldn't be allowed for your "fast path, undecoded string" type.

~~~
rlpb
I'm with you here; but in that case, I'd like an ascii_string type, which
most languages don't provide specifically. This type _would_ support string
reversal, substring slices, and so on, but would be limited to 7-bit ASCII
only. I think there are many use cases that are purely internal and don't
need i18n, where it's handy to be able to do such operations on strings:
filename handling where you control the filename, the "keys" in languages
which use strings for a dictionary type, and so forth.

~~~
jeltz
I think this might just confuse new programmers and the filename thing is
especially dangerous since at some point you might want to support i18n there.
I think it would be better to have two types of string: 1) unicode strings and
2) arrays of 8 byte data with some string like functions (essentially C
strings). The second case is essentially binary data strings.

------
baddox
A big problem here is a lack of clear definitions for various concepts like
"character," "reversed string," "upper case," etc. The author briefly
recognizes this, but brushes it off with statements like "I generally expect
that..." and "I assume most people would not be happy with the current
result."

I don't think this hand-waving is helpful. Short of extensive surveying,
which is bound to be controversial no matter the result, talking about
"general expectations" is purely subjective, and not a good way to evaluate
the actions of cold, soulless silicon that is just following orders.

Like the author, I also consider myself a mostly reasonable person, yet I
might come up with very different expectations. If I saw that "ffl"
ligature, how would I know it's a ligature and not some single unrelated
character in another language? You might respond "but it's clearly part of
the word 'baffle' and should be capitalized thusly." But would you suggest
that string libraries ship with word lists and perform contextual analysis
to determine how to perform string operations? Surely that's a fool's
errand, not to mention that it would inevitably produce _unexpected_
results.

~~~
itsybitsycoder
"If I saw that "ffl" ligature, how would I know it's a ligature and not some
single unrelated character in another language?"

Because the name of that character is "Latin Small Ligature ffl". Knowing to
capitalize ﬄ as FFL doesn't require a word list any more than knowing to
capitalize "ffl" does.

------
masklinn
I'm not sure I agree with the title, although I do agree with just about all
of the content:

* a string type is probably a good idea to bundle the subtleties of unicode, a plain array or list (whether it's of bytes or of codepoints) won't cut it: standard array operations are incorrect/invalid on unicode streams

* the _vast_ majority of string types are broken anyway, as even in the best case they're codepoint arrays (possibly with a smart implementation). The bad cases are just code unit arrays, which break before you even reach fine points of unicode manipulation

And then, you've got the issue that a lot of unicode manipulation is locale-
dependent, which most languages either ignore completely or fuck up (or half
and half, for extra fun)

------
b-johansson
If you are actually manipulating strings, rather than just storing and
pushing them around, I would suggest looking at ICU. Handling Unicode is
difficult, and it's easy to confuse encodings, code points and glyphs, or to
make assumptions based on your own culture and language.

ICU has support for a lot of the basic operations you would want to perform on
strings as well as conversion to whatever format is suitable for your platform
and environment.
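
For instance, ICU's case mapping is locale-aware. A minimal sketch via the
PyICU bindings (assuming PyICU is installed; the underlying C++ API is
UnicodeString::toUpper(Locale)):

    from icu import Locale, UnicodeString  # PyICU, assumed installed

    s = UnicodeString('istanbul')
    # Turkish locale maps dotted i to İ, unlike the root locale
    print(s.toUpper(Locale('tr')))  # expected: İSTANBUL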

------
billpg
Do people really need to reverse strings in the real world?

I don't think I've ever written code to do that outside of homework
assignments and interviews.

~~~
abcd_f
Maybe not reversing, but trimming a Unicode string to a certain character
count is a close relative, and it is a very common operation.

~~~
jeltz
What do you use it for? Unless you have a monospaced font, the number of
characters doesn't mean much. So unless you are implementing command line
tools or text editors, it should not be that common.

~~~
jerf
I have a database field limited to 100 "characters" [1]. The user sent me a
form submission with 150. I need to do something to resolve that. This is
_incredibly_ common. Truncation to a defined size is routine.

[1]: I'm leaving "characters" undefined here, because no matter what Unicode-
aware definition you apply here, you've got trouble.
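
If you do truncate, one defensible choice is to cut on grapheme cluster
boundaries, so you never split an accent off its base. A sketch with the
third-party regex module (the stdlib re has no \X):

    import regex  # third-party: pip install regex

    def truncate(s, n):
        # keep at most n grapheme clusters, one reading of "n characters"
        return ''.join(regex.findall(r'\X', s)[:n])

    print(truncate('noe\u0308l', 3))  # 'noë', accent kept with its e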

~~~
al2o3cr
"I have a database field limited to 100 "characters"."

Well there's your problem right there...

"The user sent me a form submission with 150. I need to do something to
resolve that."

Any software that defines "do something" here as "silently discard 1/3rd of
the user input" is software I'm going to throw in the trash. If you must have
fixed-length fields, surely telling the user "much characters, wow overflow"
is better than just chopping the input.

~~~
jerf
Since this seems to be confusing people, I'm providing a small hypothetical
example here.

"Any software that defines "do something" here as "silently discard 1/3rd of
the user input" is software I'm going to throw in the trash."

You are reading far more in than I put in. I merely said _somehow_ you need to
resolve this; you put a particular resolution in my mouth, then attacked.

I did choose the web for one reason, which is that you can't avoid this
case: you can try to limit the UI to generating 100 characters only (and I
still haven't defined "characters"...), but it takes a user 15 seconds to
pull open Firebug and smash 150 characters into your form submission anyhow.
_Somehow_, you had better resolve this; and as quickly as you mounted the
high horse when faced with the prospect of mere truncation, throwing the
_entire_ request out would cause somebody else to mount an equally high
horse....

------
rayiner
This is why the U.S. dominates the software world. Back when everyone was
figuring out how to express their languages, we had the option to punt on
complexity and just use ASCII.

~~~
adamtj
ASCII doesn't make the U.S. special. ASCII is special because it's from the
U.S.

Lots of people speak languages that trivially fit in 8 bits with no real
"figuring out" to do. Before Unicode, we all had our different codepages or
encodings. Including the U.S.

The U.S. is pretty central to computing. Because of that, and because ASCII
only uses 7 bits, some other 8-bit cultures use it as a subset for their
native 8-bit encodings. Even in the U.S., we use extensions to ASCII so we can
represent text in languages that are close cousins to English. I doubt you
actually use ASCII much. You've probably been using either ISO 8859-1 (aka
Latin-1), which is a superset of ASCII, or Windows-1252, which is a superset
of Latin-1.

[http://msdn.microsoft.com/en-
us/library/cc194884.aspx](http://msdn.microsoft.com/en-
us/library/cc194884.aspx)

This mess of incompatible codepages and culture specific encodings is one of
the main problems that Unicode was invented to solve. It also happens to help
languages which need more than 8 bits.

~~~
rayiner
Many languages fit into 8 bits, but English is particularly simple in its
alphabet. Even many of the European languages that can fit in 8 bits have
things like accented characters that complicate things somewhat.

Of course this isn't to say English is simple overall. Just that its
complexities lie elsewhere, and its simplicities lie in an area that made it
particularly easy for early computer systems to process.

~~~
meepmorp
> Even many of the European languages that can fit in 8 bits have things like
> accented characters that complicates things somewhat.

I don't see your point here, with respect to English orthography making
computer implementation easier. How exactly does not needing representations
for accented characters make anything easier?

~~~
ianbicking
If it were just some additional characters like ñ (which is considered a
letter of its own, not an accented n) then it wouldn't be a big deal – but e
and é are the same letter with different accents, which adds some subtlety
that English simply doesn't have. Given a small enough number of accented
characters you can punt on that and call each of them a character, but
English is objectively simpler, since the only real distinction it has
between letters is caps or not-caps. (I was just watching the Mother of All
Demos, though, and everything was in caps, with an overline over capital
letters. So even normal English lettering was too complicated for a while.)

------
Pitarou
Hat tip to Guido van Rossum for passing (nearly) all the tests in Python 3.

Is the "ffl-ligature to uppercase" test really relevant? Isn't that fixed by
appropriate use of string normalisation?

~~~
unwind
I also doubt the validity of the upper-casing test; it feels like, in an
internationalization/localization context, converting a string to all upper
case is not a valid thing to be doing.

Not all languages (or even characters) _have_ a well-defined upper-case
version of their glyphs.

Even if they all did, I would expect the interpretation by a (human) reader to
vary culturally.

~~~
XorNot
The usual goal is to apply a consistent transform, though, to smooth out
interpretation differences - i.e., when looking for command input, I either
lowercase or uppercase things to smooth over the fact that "yes", "YES" and
"Yes" are all completely valid ways of saying the same thing with those
characters.

If there's only one way of expressing the thing - i.e., a single Chinese
character - then it would be valid to do nothing. It's just that in English
"y" and "Y" might change context, but as far as computer input is generally
concerned they are the same thing.

~~~
TorKlingberg
For that use case it is better to compare case insensitively with "yes"
instead of converting the input to lower case first.

~~~
zokier
How do you do case-insensitive comparison without normalizing the case of the
operands?

~~~
gnaritas
Most string compare routines in the library offer a case insensitive compare
option already, you don't have to normalize it.
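
Where the library doesn't offer one, Unicode case folding is the right
primitive to build it from; e.g. in Python 3.3+:

    >>> 'YES'.casefold() == 'yes'.casefold()
    True
    >>> 'Straße'.casefold() == 'STRASSE'.casefold()   # folds ß to ss
    True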

------
panzi
While JavaScript (in browsers) has no way to normalize precomposed/decomposed
strings, it has standard methods to correctly compare them:
[https://developer.mozilla.org/en-
US/docs/Web/JavaScript/Refe...](https://developer.mozilla.org/en-
US/docs/Web/JavaScript/Reference/Global_Objects/String/localeCompare)
[https://developer.mozilla.org/en-
US/docs/Web/JavaScript/Refe...](https://developer.mozilla.org/en-
US/docs/Web/JavaScript/Reference/Global_Objects/Collator)

E.g.:

    
    
        var decomp="noël";
        var precomp="noël";
        console.log(decomp.split(""));
        console.log(precomp.split(""));
        console.log(decomp.localeCompare(precomp));
    

Prints:

    
    
        ["n", "o", "e", "̈", "l"]
        ["n", "o", "ë", "l"]
        0
    

Browser support for this varies. The Intl.Collator interface is currently
only supported in Chrome (maybe also in Opera? I don't know).

Note: in Chrome, when comparing (e.g. sorting) a lot of strings,
String.prototype.localeCompare is much slower than using a pre-constructed
Intl.Collator instance (because internally localeCompare creates a new
collator for each call). Using Intl.Collator reduced the startup time of my
[http://greattuneplayer.jit.su/](http://greattuneplayer.jit.su/) immensely.
node.js currently has no support for Intl.*; it will probably be a
compile-time option for 0.12.

------
brihat
This article is mostly written from a European-language perspective. For
Indian scripts, storing combining characters as separate code points is the
right thing to do.

For example, कि (ki) is composed of क and ि. When I'm writing this in an
editor, say I typed ku (कु) instead of ki (कि): when I press backspace, I
indeed want to see क rather than having the whole कि deleted.

~~~
frenchy
Only sometimes, I figure, because if you want to make the first letter
green, you'd want that to apply to the whole कि.
------
gcr
For the record, Racket gets the "baﬄe" example right:

    
    
        racket@> (string-upcase "baﬄe")
        "BAFFLE"
    

It also passes all of the author's other tests (except for the ones
involving combining diacritics, but Racket includes built-in functions for
normalizing such strings so you can work with them).

------
revelation
I seem to have rather little use for the cases the author presents here. If
I'm working with strings, they are either of the debug or internal variety,
where even basic ASCII would suffice, or I get them from somewhere and don't
touch them at all, just pass them around.

But what I absolutely need in a language is a very, very clear separation
between _strings_ and _byte arrays_, or raw data, and ideally a way to
transform between the two. C# gets this right with its byte and string
types; the framework uses them correctly, and there is the wonderful
Encoding namespace for converting between the two. Python 2.7 is the
absolute worst: it's apparently impossible to get anything done with raw
data and _not_ run into some obscure "ASCII codec can't handle octet 128"
exception (which reminds you why we have strict typing: magic is fucking
annoying).
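
For reference, the classic Python 2 failure mode: mixing bytes and text
makes the interpreter implicitly decode the bytes with the ASCII codec,
which blows up on anything non-ASCII:

    >>> b'caf\xc3\xa9' + u'!'   # Python 2
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)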

------
duncan_bayne
I'd have hoped Common Lisp would fare well here, but SBCL (1.1.11 on 64-bit
Linux Mint 15) is pretty broken. My results:

string: noël, reversed: l̈eon, first 3 chars: noe, length: 5

string: 😸😾, reversed: 😾😸, first 1 char: 😸, length: 2

string: baﬄe, upcase: BAﬄE

string: noël, equals precomposed: NIL

 _Edited_ : GNU CLISP 2.49 produces identical results.

~~~
aerique
I was somewhat disappointed as well. I wrote some tests here:
[http://paste.lisp.org/display/140280](http://paste.lisp.org/display/140280)

Perhaps playing around with different internal representations, as pointed
out by sedachv
([https://news.ycombinator.com/item?id=6811407](https://news.ycombinator.com/item?id=6811407)),
would work, but the initial, naive string usage doesn't. I had expected the
default usage to work correctly in Common Lisp.

------
implicit
What we really should be doing is doing away with broken nomenclature.

What does the "length" of a string even mean? A database will tell you it has
to do with storage. A nontechnical person will say it's the number of symbols.
A visual designer might say that it has to do with onscreen width when
rasterized in a particular way. None of these people are obviously right or
wrong.

It's very useful to be able to count the number of glyphs in a string, or the
number of unicode codepoints, or bytes, or pixels when rasterized in a
particular way, but "length" isn't clear enough to unambiguously refer to any
of them. Any meaning you try to ascribe to the "length" operation is going to
be wrong to someone.

------
NkVczPkybiXICG
All of these examples work in Haskell's canonical text library, 'text'! It's
the only implementation I know of that works.

~~~
FreeFull
The reversal of the decomposed noël doesn't produce the right result.
Converting baﬄe to uppercase does do the right thing though, and the rest
works as expected.

------
AndyKelley
I don't think the solution to this problem is to make our string classes
more complicated. I think it's to make our languages and character sets less
complicated. I can't believe that multiple code points being used to
generate a single glyph made it into the Unicode spec. That breaks a bunch
of extremely useful abstractions. I think it is reasonable to expect human
languages to be made up of distinct glyphs that do not interfere with each
other. Any language that does not is too complicated to be worth supporting.
Let it die.

------
delinka
Now let's take the lower case of "BAFFLE" - should we get "baffle", or
should the string class/function/whatever attempt to recognize that a
ligature can replace "ffl" and return "baﬄe" to us? More generally, should
the string library ever attempt to replace letters with ligatures? Should
this be yet another option?

And as I type this, another issue manifests: the spelling correction can't
even recognize baﬄe as a properly spelled word; it highlights the 'ba' and
ignores the rest.

~~~
ygra
Uppercasing and lowercasing are inherently lossy. E.g. the German ß becomes
SS when uppercased, yet there is no way to know whether SS should be
lowercased to ss or ß again. That's a reason why those transformations
should be used, if at all, only for display. The same goes for ligatures,
but those actually shouldn't be applied automatically anyway, depending on
the language. E.g. in German, ligatures cannot span syllables, and few
layout engines can detect that.
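
Python 3 shows the round-trip loss:

    >>> 'Maße'.upper()    # dimensions
    'MASSE'
    >>> 'MASSE'.lower()   # the ß is gone for good
    'masse'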

~~~
zokier
I feel like I should learn German only so that I would be able to comment on
the ß issue every time a Unicode thread pops up. From my uninformed point of
view it is not really clear if ß should really be handled as a separate
character/grapheme, or just as a ligature in rendering phase and stored as
'ss'. Or even if current-day orthography should be held at such a sacrosanct
position that it shouldn't be changed to save significant amount of collective
effort.

~~~
TillE
> or just as a ligature in rendering phase and stored as 'ss'.

Probably.

> to save significant amount of collective effort

I've seen this kind of suggestion a number of times on HN, and I find it
highly amusing. When confronted with a difficult challenge in representing the
world on a computer, apparently the answer is to instead change the world.

OK, but then how are you going to handle hundreds of years of legacy texts?

~~~
rolux
In German, 'ß' is definitely not just a ligature of 'ss'.

Consider 'Masse' (mass) vs. 'Maße' (dimensions).

Uppercasing these words will necessarily produce ambiguity.

It would be equally tempting -- and wrong -- to treat the German characters
'ä', 'ö' and 'ü' as ligatures of 'ae', 'oe' and 'ue'. They're pronounced the
same, and the latter forms commonly occur as substitutions in informal
writing, but they also occur in proper names, where it would be incorrect to
substitute them with the former. However, if you want to sort German strings,
'ä', 'ö' and 'ü' sort as 'ae', 'oe' and 'ue'.

------
al2o3cr
If anybody hasn't seen it, Glitchr's Twitter feed is a fantastic example of
how bizarro things can get with "140 characters".
[https://twitter.com/glitchr_](https://twitter.com/glitchr_)

Note: may freak out browsers with a flaky Unicode implementation. For
instance, scrolling that stream on the iOS Twitter client can get very laggy.

------
JulianMorrison
For Go: the for-range loop iterates 5 times, reversing (manually, using the
resulting runes) gives l̈eon, and utf8.RuneCount is 5. The blog has recently
been talking about text normalization[1] via a library, but it isn't built
into the core.

[1]
[http://blog.golang.org/normalization](http://blog.golang.org/normalization)

------
brihat
The author intentionally chooses the decomposed form. With the precomposed
form, all of these work in Python 3. Here:

    
    
        Python 3.3.2+ (default, Oct  9 2013, 14:50:09) 
        [GCC 4.8.1] on linux
        Type "help", "copyright", "credits" or "license" for more information.
        >>> noel="noël"
        >>> noel[::-1]       # reverse
        'lëon'
        >>> noel[0:3]        # first three characters 
        'noë'
        >>> len(noel)        # length
        4
    

The point is, defining what a character is based on how it is displayed is
flawed. Just precompose the string if you want and carry on. Like I said in
my other comment, automatic conversion of decomposed -> precomposed wreaks
havoc with Indian languages.
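
Precomposing is one call with unicodedata:

    >>> import unicodedata
    >>> noel = 'noe\u0308l'                       # decomposed, 5 code points
    >>> len(unicodedata.normalize('NFC', noel))   # precomposed, 4 code points
    4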

~~~
jfim
Works as expected too in Scala, although it might be because the terminal does
normalization.

    
    
        scala> val noel = "Noël"
        noel: String = Noël
        
        scala> noel.reverse
        res0: String = lëoN
        
        scala> noel.take(3)
        res1: String = Noë
        
        scala> noel.length
        res2: Int = 4
        
        scala> import java.text.Normalizer
        import java.text.Normalizer
        
        scala> val nfdNoel = Normalizer.normalize(noel, Normalizer.Form.NFD)
        nfdNoel: String = Noël
        
        scala> nfdNoel.length
        res3: Int = 5
        
        scala> nfdNoel.reverse
        res4: String = l̈eoN
        
        scala> nfdNoel.take(3)
        res5: String = Noe

The problem with an array of characters, as he mentions, is that it doesn't
work properly in many use cases. If your array of characters stores 16-bit
code points, it breaks with 32-bit code points (Java got bitten hard by
that: a char used to be a character prior to the introduction of surrogate
pairs in Unicode); if it stores 32-bit code points, then it's pretty
wasteful in most cases. That is exactly why you'd want a string type that
handles the storage of sequences of characters in an optimal fashion.

------
klrr
I hope Haskell Prime solves this. In Haskell, String is literally a list of
characters. This causes some overhead and leads to bad performance. Of
course we've got Text, and for binary data you can use ByteString, but it's
a bit of a pain compared to having a real string type by default.

------
jeorgun
I think the specific case of ligatures isn't a failure of strings per se,
but a failure of Unicode, in that it includes them in the first place. What
"ﬁ".upper() (or whatever) should do is kind of ambiguous. The following
doesn't really seem appropriate:

    
    
      "ﬁ".upper().lower() #=> "fi"
    

But obviously nor does

    
    
       "ﬁ".upper() #=> "ﬁ"
    

In Turkish (which distinguishes between dotted and dotless 'i'), this issue
exists already:

    
    
       "ı".upper().lower() #=> "i"
    

This case _couldn't_ (so far as I know) be fixed by any string library
without breaking Unicode compatibility, so it seems slightly disingenuous to
call it an issue with strings.

------
wazoox
Tom Christiansen (of Perl fame) made a much, much more thorough analysis of
Unicode problems in his OSCON 2011 presentation:
[http://www.oscon.com/oscon2012/public/schedule/detail/24252](http://www.oscon.com/oscon2012/public/schedule/detail/24252)

Here are the slides:
[http://training.perl.com/OSCON2011/gbu/gbu.pdf](http://training.perl.com/OSCON2011/gbu/gbu.pdf)

The site seems down ATM, but Internet Archive has it:
[https://web.archive.org/web/20121224081332/http://98.245.80....](https://web.archive.org/web/20121224081332/http://98.245.80.27/tcpc/OSCON2011/gbu.html)

------
mathias
The article briefly mentions JavaScript, which uses something similar to
UTF-16/UCS-2:
[http://mathiasbynens.be/notes/javascript-encoding](http://mathiasbynens.be/notes/javascript-encoding)

Here’s a slightly more in-depth blog post on the many issues this causes, and
how to avoid them in JavaScript: [http://mathiasbynens.be/notes/javascript-
unicode](http://mathiasbynens.be/notes/javascript-unicode) Some of these
problems are briefly mentioned in the above post, too.

------
ddebernardy
This is misinformation. OP's strings are just wrong...

    
    
        >> "\u0308"
        => "̈"
        >> "\u00eb"
        => "ë"
        >> "noe\u0308l"
        => "noël"
        >> "no\u00ebl"
        => "noël"
    

His noël examples work just fine if you don't copy/paste the string he
posts, and instead type them in like I just did.

If anything, languages are reporting correct reverses and lengths, since
he's really manipulating 5 characters rather than 4.

~~~
enko
Congratulations, you've discovered unicode composition!

    
    
      2.0.0p247 :045 > Unicode::compose("e\u0308").unpack('U').first.to_s(16)
       => "eb" 
      2.0.0p247 :046 > Unicode::compose("\u00eb").unpack('U').first.to_s(16)
       => "eb" 
      2.0.0p247 :047 > Unicode::decompose("e\u0308").unpack('U').first.to_s(16)
       => "65" 
      2.0.0p247 :048 > Unicode::decompose("\u00eb").unpack('U').first.to_s(16)
       => "65"
    

I presume the ones you pasted in were changed by the browser. His examples
are not wrong at all; indeed, how can a string be "wrong"?

~~~
ddebernardy
His example is "wrong" in the sense that you cannot reasonably complain that
"noe¨l" gets reversed to "l¨eon" and put the "¨" part on top of the "l" when
it does — which seems entirely correct. Or for that matter, that the string's
length is 5 when there are indeed 5 characters.

As for being changed by the browser, the latter (or rather the OS) copied what
there was, and the OS pasted it verbatim insofar as I can tell.

------
vfclists
Honestly, I think 'you' computer programmers love useless challenges too
much. Why can't you adopt lessons from Q?

If it isn't easy to get some languages working with Unicode properly, then
fix the languages and leave Unicode alone. Remove all the language
characteristics that make working with Unicode difficult. If Unicode will
not go to the language, then the language must go to Unicode, or opt out of
the computer era, or die!!

KISS!!

------
mbq
There is a one more issue -- the easier it is to manipulate strings in some
language the greater chance that they will be used as an internal data
structure for things that certainly aren't texts. And this almost always
causes substantial performance loss and awful bugs that are either untraceable
due to a dependence on subtle configuration details or form security holes. Or
both.

------
ademarre
I think a lot of programmers don't properly understand character encoding
simply because their programming languages don't give them the proper
treatment. We need more APIs that force developers to acknowledge character
encodings, probably in the type system.

------
jheriko
this hits on one of my biggest problems with native android and ios
development. the wcs/wchar functions are largely broken or unusable... it
caused me a real headache from not knowing upfront.

the idea of the string type is just fine though (or a character array) broken
implementations don't invalidate it, they just invalidate the myth of '3rd
party libraries must be good because hundreds of programmers worked on them
for years' \- which is exactly a myth. it doesn't just apply to strings but
everything. (not brokeness, just that you shouldn't expect them to work beyond
what you can measure, and certainly shouldn't expect that they are flawless or
even good implementations)

------
rverghes
Out of curiosity, why have only one string type? We don't do the same for
numbers. Many languages don't have just "number"; they have int, float,
long, etc.

Instead of just String, maybe we should have ASCIIString, UTF8String, and
UTF16String.

------
monkeyninja
I don't understand the reason for using a C++ char array to store Unicode
text...

------
drdaeman
That's because in a truly sane language there would be a distinction between
a data type and its implementation.

Then it would not be the "string" type that's broken, but an implementation
of the "string" type.

~~~
agravier
I agree, it seems like a much saner thing to do. Now that you make me think
of it, I don't know many instances of this. I can only think of
[https://github.com/clojure-numerics/core.matrix](https://github.com/clojure-numerics/core.matrix),
upon which I stumbled recently. Do you have other examples of efforts to
separate a type from its implementations?

~~~
riffraff
doesn't every statically typed imperative language do this, and recommend it?

~~~
drdaeman
Not to my knowledge.

C++ and Java are statically typed, and as far as I know they don't have a
distinction between a string interface and a string implementation, just a
standard string type. You can't make your own string implementation and have
others transparently accept it in place of the language's standard string
implementation (given that, if a standard string interface existed, they
would code against it).

Even Haskell (with the standard Prelude) doesn't have a readily available
and widely accepted typeclass for strings. As String is just an alias for
[Char], if a library writer used that, their code won't accept, say,
Data.Text (I know it's a somewhat distinct thing, but...).
~~~
riffraff
I was referring to "efforts to separate a type from its implementations", not
String specifically, and thinking of containers & co.

Although, for example, even java has CharSequence which only gives you access
to codepoints in a char sequence, you can inherit from that and create your
own.

------
daGrevis
Logically equivalent doesn't mean equivalent for computers. Until you can
define why the reverse of "noël" is "lëon" through a set of rules that a
computer can follow, the computer just can't know.

~~~
zokier
Umm, for that case you definitely can define a valid reversing algorithm.
The key is using grapheme clusters as the indivisible base unit. Sure, there
are probably some unusual languages that will not reverse properly with such
an algorithm, but it would still be a significant improvement over the
current situation.
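
A sketch of such an algorithm in Python, using the third-party regex module
(the stdlib re has no \X grapheme matcher):

    import regex  # third-party: pip install regex

    def reverse_text(s):
        # reverse grapheme clusters, not code points
        return ''.join(reversed(regex.findall(r'\X', s)))

    print(reverse_text('noe\u0308l'))  # 'lëon': the diaeresis stays on the e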

------
falsedan
Mostly, this article says that most languages choose NFC as their default
normalization form, and don't attempt to detect & convert strings to that
form automatically.

------
jszumski
Objective-C handles the "baﬄe" case just fine.

------
michaelfeathers
This is now my favorite example of a leaky abstraction.

------
Aardwolf
So what do you propose? Not have a string type, and let everyone handle all
these cases manually instead? That will not end well...

------
shioyama
Happy to see that ruby (2.0 at least) passes all the tests except the "baffle"
one.

Edit: sadly, it doesn't.

~~~
judofyr

        "noe\u0308l".size # => 5
    

ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-darwin13.0.0]

~~~
monkeyninja
"noe\u{0308}".size => 4

You code is wrong

~~~
judofyr
You forgot the "l":

    
    
        "noe\u{0308}" # => "noë"

------
jwmerrill
It looks like Factor passes all of these tests. Hat tip to Daniel Ehrenberg.

------
cafard
Upvotes for the OP who posts "Disrupt 'Is Broken'".

------
Dewie
What do you guys think about String in Haskell, where it is a list of Char?
Should it have some other default implementation, or should it have been
more, um, decoupled from its implementation (I don't know the correct
terminology)?

------
AsymetricCom
Now, look over here! When I substitute this context with that, ka-pow! Now
it's an array of characters!

Big deal. I don't understand the point of this article when it shows the
shortcomings of half a dozen different string implementations in random
languages. Yes, if you don't understand the language, then your assumptions
about how it works may be wrong. Big surprise; that doesn't mean every
string implementation needs to conform to your expectations...

~~~
kbenson
Unicode is a standard. It says how to act in these circumstances. Calling
out incorrect Unicode implementations is useful. You shouldn't have to worry
about inconsistent behavior between different languages that purport to
support Unicode strings. That's the point of a standard.

~~~
AsymetricCom
So who is the authority on correct Unicode implementations, exactly? And how
do different languages with different use cases and power conform to such a
standard? Why doesn't this authority extend over language implementations?
Because they know what they are doing and understand the domain, unlike the
author of this article.

Look, I'm all for open standards, but saying that standards are required to
be adhered to at the programming-language level is just ignorance of the
real world. The point of a standard isn't to dictate how data is architected
internally; it's to facilitate interoperability of systems at their
endpoints. If you want interoperability of programmers, then make your own
conforming language and get programmers to adopt it the right way, by
competing in the market of ideas.

There is no idea here other than the writer's unjustified expectation that
he should just know how every language handles Unicode, because... because
Unicode is a standard? No, that doesn't make sense at all. Mixing contexts
to make the point here means there is no ground for his argument to stand
on.

~~~
kbenson
_So who is the authority on correct Unicode implementations, exactly?_

The Unicode Consortium[1] publishes standards. If a language advertises
Unicode support, I expect it to follow that standard.

_Look, I'm all for open standards, but saying that standards are required to
be adhered to at the programming-language level is just ignorance of the
real world_

I'm not saying a language has to do anything, but if it's advertising support
for a well defined feature, and does not deliver correctly on that, I will
call them out on it, and support anyone else who does as well. Should we all
just throw our hands up and say "Well, it's done now, no point in making a big
deal of it?" I would rather apply pressure to get things fixed, or at least
make it well known enough that future language designers give it the care and
attention it's due.

_There is no idea here other than the writer's unjustified expectation that
he should just know how every language handles Unicode, because... because
Unicode is a standard? No..._

Are you under the impression that what the author is attempting is not well
defined? The unicode standard has conformance clauses about how to interpret
unicode strings[2]. That means that if a language advertises it has/supports
unicode strings, and fails the tests we've just seen, it's not conformant with
the unicode standard. That would make this useful because it's pointing out
bugs. If a language does _not_ advertise unicode support, but supports _some_
unicode features, then this is useful because it's making sure people are
aware of the limits of their language. All too often people refer to the
native string implementation in their language as supporting unicode, when
clearly there are problems.

1: [http://www.unicode.org/](http://www.unicode.org/)

2:
[http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf](http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf)
(see section 3.2)

------
DannoHung
Is it possible that Unicode is actually a bunch of horseshit? If literally
nobody gets the spec right, then maybe the spec is wrong.

~~~
gsg
Unfortunately, general purpose text is not a clean simple thing that you can
model nicely. Unicode is a mess because the problem it tries to solve is
messy.

Even if you could somehow come up with something obviously better, getting any
new standard adopted widely enough to be useful would be a formidable, if not
insurmountable, challenge. It's less pain to keep using Unicode and try to
deal with the worst of the damage.

------
davidhalter
I would argue that it's a Unicode problem. `U+0308` shouldn't exist as a
Unicode character in the first place; that's why we have `U+00EB` ('LATIN
SMALL LETTER E WITH DIAERESIS'), etc.

~~~
masklinn
Not all combinations of base and combining characters exist in a precomposed
form, since a base character can have an infinite number of combining
characters tacked onto it.

If anything should not exist, it's U+00EB, which is a convenience,
compatibility and (space) optimisation codepoint.
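
For instance, e with both a diaeresis and an acute has no fully precomposed
form, so even NFC has to keep a combining mark around (Python 3):

    >>> import unicodedata
    >>> s = 'e\u0308\u0301'   # e + combining diaeresis + combining acute
    >>> [hex(ord(c)) for c in unicodedata.normalize('NFC', s)]
    ['0xeb', '0x301']         # U+00EB plus a leftover combining acute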

