The string type is broken (mortoray.com)
302 points by DmitryNovikov 1398 days ago | 222 comments

The problem with text (that Unicode solves only partially) is that text representation, being a representation of human thought, is inherently ambiguous and imprecise.

Some examples:

(1) A == A but A != Α. The last letter is not uppercase "a", but uppercase "α". Most of the time, the difference is important, but sometimes humans want to ignore it (imagine you can't find an entry in a database since it contains Α that looks just like A). Google gives different autocomplete suggestions for A and Α. Is this outcome expected? is it desired?
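A quick way to see point (1) from code, as a minimal Python sketch (the variable names are mine, not from the comment): the two letters are distinct code points, and even compatibility normalization does not merge them.

```python
import unicodedata

latin_a = "\u0041"   # LATIN CAPITAL LETTER A
greek_a = "\u0391"   # GREEK CAPITAL LETTER ALPHA

print(latin_a == greek_a)            # False: different code points
print(unicodedata.name(latin_a))
print(unicodedata.name(greek_a))

# NFKC folds many "compatibility" variants, but not look-alikes across scripts.
nfkc_greek = unicodedata.normalize("NFKC", greek_a)
print(nfkc_greek == latin_a)         # False
```

Anything that wants to treat the two as "the same" has to bring its own confusables table; normalization alone will not do it.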

(2) The Turkish alphabet is mostly the same as the Latin alphabet, except for the letter "i", which exists in two variants: dotless ı and dotted i (as in Latin). For the sake of consistency, this distinction is kept in the upper case as well: dotless I (as in Latin) and dotted İ. We can see that not even the uppercase <==> lowercase transformation is defined for text independently of language.

These are just two examples of problems with text processing that arise even before all the problems with Unicode (combining characters, ligatures, double-width characters, ...) and without considering all the conventions and exceptions that exist in richer (mostly Asian) alphabets.

I think your first assertion can be strengthened even further. It isn't like this is unique to letters that look the same. That is, sometimes WORD != WORD. Consider a few common words. Time? As in Time of day? As in how long you have? An interesting combination of the two? Day? As in a marker on the calendar? Just the time when the sun is out? Then we get into names. Imagine the joy of having to find someone named "Brad" that isn't famous. From a city named Atlanta, but not the one in GA. (If you really want some fun, consider the joy that is abbreviations. Dr?)

Except these are all well outside the ambit of what programmers usually think of as text processing, so they won't try to solve them using the same tools.

More to the point, they sound hard, so people won't be so quick to claim they've solved them.

On the other hand, case-insensitive string matching sounds easy, even if it's actually somewhat difficult due to the language dependencies mentioned above, so people will claim to have a general solution that fails the first time it's faced with i up-casing to İ instead of I, or the fact the German 'ß' up-cases to 'SS' as opposed to any single character. (Unicode does contain 'ẞ', a single-character capital 'ß', which occurs in the real world but is vanishingly rare. As far as modern German speakers are concerned, the capital form of 'ß' is 'SS'.)
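For the curious, here is roughly what the locale-independent default mappings do with these letters, sketched in Python 3.3+ (the variable names are illustrative). Python's `str.upper()` and `str.lower()` know nothing about Turkish, which is exactly the failure mode described above.

```python
# Python applies Unicode's default (locale-independent) case mappings.
sharp_s_upper = "\u00df".upper()            # 'ß' upper-cases to two letters
capital_sharp_s_lower = "\u1e9e".lower()    # 'ẞ' lower-cases to 'ß'
dotted_i_upper = "i".upper()                # plain 'I', never the Turkish 'İ'
dotless_i_upper = "\u0131".upper()          # 'ı' also becomes plain 'I'
capital_dotted_i_lower = "\u0130".lower()   # 'İ' becomes 'i' + COMBINING DOT ABOVE

# Upper-casing is not reversible: 'ß' -> 'SS' -> 'ss'.
round_trip = "\u00df".upper().lower()

print(sharp_s_upper, len(sharp_s_upper))
print(len(capital_dotted_i_lower))
print(round_trip)
```

Note that both 'i' and 'ı' collapse to 'I' here, so a Turkish-aware round trip needs a locale-aware library rather than these built-ins.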





Right, I do not disagree. I just feel better treating them the same. That is, both are actually easy and reliable so long as you realize you have to make some gross simplifications. And most of the time your life will be much easier if you start with the gross simplifications and try to expand beyond them only when necessary. (This is also why I'm loath to try programming in unicode...)

I think (2) is an issue with Unicode specifically. They should have specified Turkish alphabet to use ı and a diacritic to make the dotted one. That would have made (in this case) capitalization locale-independent.

While that's a problem with Unicode, it's a really big problem with Unicode. As the name alludes to, Unicode preserved as much as possible of existing regional encodings, which is why (among other reasons) there's a pre-composed version of basically every accented Latin letter.

isn't this solving the wrong side of the problem? how about not having to think about such things at all and just accepting that uppercase/lowercase conversion is never going to be language agnostic.

that's futureproof and powerful, rather than extra thinking and work...

Most likely case-changes need to be locale-aware, that is true. But still I think minimizing the number of locale-specifics is a reasonable goal, and in that light I dislike the common usage of the Turkish i as an example, because it is such an obviously fixable (if legacy stuff wasn't a concern) flaw in Unicode rather than a fundamental issue.

You are right, everything should be as easy as possible. This is a good philosophy for design in general...

Homoglyphs vary sometimes with text styles, though. So Α doesn't always have to look like A. Or, more to the point, while T and т might look alike, T and т often do not (the latter of which often looks like m). So even as humans we need to keep track of the script at times.

The funny thing is that, according to "the rules" (the Real Academia de la Lengua Española), in Spanish we should be always using \u0130, but of course no one does...

A nitpick from the article

>This spells trouble for languages using UTF-16 encodings (Java, C#, JavaScript).

if they were using UTF-16, this wouldn't be a problem as UTF-16 can be used to perfectly well encode code points outside of the BMP (at the cost of losing ability for O(1) access to specific code points of course. If you need to know what the n-th code point is, you have to scan the string until the n-th position).

They are, however, using UCS-2 which can't. If you use a library that knows about UCS-2 to work on strings encoded in UTF-16, then you will get broken characters, your counts will be off and case transformations might fail.

Most languages that claim Unicode support still only have UCS-2 libraries (Python 3 is a notable exception)

Exactly, UTF-16 perfectly defines surrogate pairs for code points that do not fit into the 16 bit plane. A perfect implementation of UTF-16 should have no problem, unfortunately, most are broken when it comes to surrogate pairs.
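To make the surrogate-pair mechanism concrete, here is a small Python sketch of the arithmetic UTF-16 uses for code points above U+FFFF. The helper `to_surrogate_pair` is a hypothetical name of mine; the built-in codec is only used to cross-check it.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

emoji = "\U0001F600"                       # a code point outside the BMP
high, low = to_surrogate_pair(ord(emoji))
encoded = emoji.encode("utf-16-be")        # big-endian, no BOM

print(hex(high), hex(low))
print(len(encoded) // 2)                   # number of 16-bit code units
```

A library that counts or indexes by 16-bit unit sees two "characters" here, which is the UCS-2-style behaviour being complained about.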

Many languages pre-date the introduction of UTF-16 and implemented 16 bit string encoding as UCS-2, and still do.

Then there are oddities like VBA using UTF-16 internally, but converting all strings going through the Win32 API as 8-bit (relying on the current code page for character translation!)...

.NET uses 16-bit characters, but you can use the System.Globalization.StringInfo class to iterate through a string one Unicode character at a time, index into strings by Unicode character, etc. The API's a bit awkward, but it works.

One unicode character at a time or one unicode codepoint at a time? (see character composition)

Codepoint. Java 5 also added new string APIs for this.

IIRC, Cocoa is one of the very few frameworks/languages/whatever which provides APIs for manipulating and iterating on grapheme clusters out of the box. And provides a page explaining some of the unicode concepts and how they map to NSString: https://developer.apple.com/library/mac/documentation/Cocoa/...

This is incorrect - see the definition of a text element in the remarks of http://msdn.microsoft.com/en-us/library/system.globalization....

Thanks for the info.

The .NET StringInfo class provides methods to work at the grapheme level, not code points.

Good point - I meant one codepoint at a time.

But I was wrong and it's actually by grapheme, as danbruc correctly notes.

> Most languages that claim Unicode support still only have UCS-2 libraries (Python 3 is a notable exception)

Most non-JVM languages[1] actually use UTF-8 as the internal encoding so they should not suffer from this. Python 3 does not use UTF-16 either, it selects an encoding based on the contents of the string.


1. I think .NET too uses UCS-2 or UTF-16, but I am not a Windows developer.

.NET uses UCS-2 because the Windows API uses UCS-2 (so when you use Visual Studio out of the box, you will get UCS-2). ECMAScript (JS) uses UCS-2 because that's all there was when the spec was written.

Other scripting languages I know for certain are

- PHP doesn't care and treats strings as arrays of bytes. All the str functions operate on these byte arrays and thus happily destroy your strings if they are encoded as anything but the old 8-bit encodings. If you need to support utf-8, you have to use different library functions (mb_*) and a special syntax in their regex support (/u modifier).

- Python < 3 treats strings as byte arrays or UCS-2 depending on whether you use the byte type or the Unicode type. As such, it has all the same issues as all other UCS-2 libraries

- Ruby < 1.9 treats strings as byte arrays. There is some limited UTF-8 support, but it's in additional libraries. The internal API is treating strings as byte arrays. Ruby >= 1.9 lets you choose your internal encoding. Most people use utf-8, but you don't have to.

- Perl I don't know enough about, but I hear it has a UTF-8 mode that is actually well-supported by the language itself and gets almost everything right.

These are the more common scripting languages.

Of the compiled languages, I know for certain about Go (utf-8; good library support), C (OS dependent, but the standard string API treats strings as byte arrays), C++ (ditto) and Delphi (UCS-2 since 2010, byte arrays before that)

I would say that there are so many exceptions to the UTF-8 rule that I wouldn't say "most" languages are using UTF-8.
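The "byte array" failure mode described for PHP and Ruby < 1.9 above is easy to reproduce in any language by slicing encoded bytes. A minimal sketch using Python's `bytes` type as a stand-in (this is an analogy, not PHP's actual behaviour):

```python
text = "no\u00ebl"               # 'noël' with a precomposed ë
raw = text.encode("utf-8")       # the ë occupies two bytes

truncated = raw[:3]              # byte-oriented "first 3 characters"
try:
    truncated.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False           # the slice cut the ë in half

print(len(text), len(raw), decoded_ok)
```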

> - Python < 3 treats strings as byte arrays or UCS-2 depending on whether you use the byte type or the Unicode type. As such, it has all the same issues as all other UCS-2 libraries

It's Python < 3.3 (the Flexible String Representation was introduced in 3.3), there's a byte array type (str in P2, bytes in P3) and a string type (unicode/str), which may be UCS2 ("narrow" builds, the default) or UCS4 ("wide" builds, set by many linux distros)
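If you want to check which situation an interpreter is in, `sys.maxunicode` tells you. A small sketch, assuming Python 3.3 or later for the printed values (on an old narrow build `sys.maxunicode` would be 0xFFFF and the length would come out as 2):

```python
import sys

astral = "\U0001F600"            # one code point outside the BMP

is_narrow_build = sys.maxunicode == 0xFFFF
print(hex(sys.maxunicode))
print(len(astral))               # 1 when strings are indexed by code point
print(is_narrow_build)
```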

Python <3.3 uses UCS2 or UCS4, depending on the build

Ruby >1.8 lets you choose the encoding

.NET UCS2/UTF-16 (I know the difference, imho if the stdlib has a .size, .length or .count that works on code units instead of code points it's broken... thus I'll mention only UCS2 from now on)

Java UCS2

Clojure UCS2

Scala UCS2


Haskell String UCS4

Haskell Data.Text UTF-16 (yes, not a naive UCS-2)

Rust UCS4 (last time I checked)

Javascript UCS2

Dart UCS2

PHP Unicode-oblivious

Vala UCS4

Go UTF-8 (but it lets you call len() on strings, and it doesn't return the length of the string, but its size in bytes)

I can't really think of another language that uses UTF-8 internally, are you sure?

> Rust UCS4 (last time I checked)

Rust chars are 32bit Unicode codepoints. But strings themselves are utf-8. That is the string type, ~str, is basically just ~[u8], a vector of bytes and not ~[char].

`.len()` [O(1)] gives you byte length while `.char_len()` [O(n)] gives you the number of codepoints.

So strings in rust are just vectors of bytes with the invariant that it's valid utf-8.
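The "which length do you mean?" question applies to every language in the list above. A Python sketch that computes three of the candidate answers for one string; `lengths` is a hypothetical helper of mine:

```python
def lengths(text):
    """Return the size of `text` in three different units."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }

# One code point each from the 1-, 2-, 3- and 4-byte UTF-8 ranges.
sample = "a\u00eb\u65e5\U0001F600"
sizes = lengths(sample)
print(sizes)
```

None of these is the number of user-perceived characters once combining marks are involved; that needs grapheme segmentation on top.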

Thanks, I didn't know that

Common Lisp comes with two character types, base-char and character, the former being allowed to be a subset of the latter. Clozure Common Lisp uses UTF-32 for all characters and strings internally. SBCL uses base-char and simple-base-string types for ASCII and character and (simple-array character) types for UTF-32 internally. IMO having this option for two types of characters that are compatible but may have different internal representations is a really good part of the Common Lisp standard.

Perl, Rust, Go and Vala. I take back the "most" part though. It seems like there are many popular solutions.

Erlang uses Unicode code points and also binaries.

Python 3 gets so much of this right. It's one of the things I really loved about python 3 as it allows for correct string handling in most cases (see below).

Note that this is only really true with Python 3.3 and later as in earlier versions stuff would start breaking for characters outside of the BMP (which is where JS is still stuck at, btw) unless you had a wide build which was using a lot of memory for strings (4 bytes per character)

In general, internally using unicode and converting to and from bytes when doing i/o is the right way to go.

But: Due to http://en.wikipedia.org/wiki/Han_unification being locked into Unicode with a language might not be feasible for everybody - especially in Asian regions, Unicode isn't yet as widely spread and you still need to deal with regional encodings, mainly because even with the huge character set of Unicode, we still can't reliably write in every language.

Ruby 1.9 and later helps here by having many, many string types (as many as it knows encodings), which can't be assigned to each other without conversion.

This allows you to still have an internal character set for your application and doing encoding/decoding at i/o time, but you're not stuck with unicode if that's not feasible for your use-case.

People hate this though because it seems to interfere with their otherwise perfectly fine workflow ("why can't I assign this "string" I got from a user to this string variable here??"), but it's actually preventing data corruption (once strings of multiple encodings are mixed up, it's often impossible to un-mix them, if they have the same character width).

I don't know how good the library support for the various Unicode encodings is in Ruby though. According to the article, there still is trouble with correctly doing case transformations and reversing them.

Which brings me to another point: Some of the stuff you do with strings isn't just dependent on string encoding, but also locale.

Uppercasing rules for example depend on locale, so you need to take that into account too. And, of course, deal with cases when you don't know the locale the string was in (encoding is hard enough and in most cases undetectable, but locales are next to impossible).

I laugh at people who constantly tell me that this isn't hard and that "it's just strings".

> Python 3 gets so much of this right

What does it get right????? It's all broken, like nearly everything else!

It's sad 99% comments there are “oh see, I can run some examples from page just fine. So everything's all right, I've got full Unicode!”

The reality is there's 1-2 languages that are trying to make it correct from the beginning (perl6, I'm looking at you). It's 2013 and if language can compose bytes to code points everyone declares a win, sticks "full unicode support" label to it and continues to str[5:9].

“But I've got UnicodeUtils!” It won't help. People just don't want to or cannot write it correctly. A word is not [a-z]. Not [[:alpha:]] either. And not [insert regex here]. You cannot reverse by reversing the codepoint list. And you cannot reverse by reversing the grapheme list. And string length is hard to compute, and even then it doesn't make any sense. And indexing a string doesn't make any sense, and it's far from O(1).

Can you provide some examples of Python 3 getting strings wrong?

Between strings being native unicode code points (you have to encode to bytes to get UTF-8) and unicodedata for normalization and decomposition (http://docs.python.org/3.3/library/unicodedata.html) I've found Python 3 pretty robust. Python 3.3 also uses appropriate Unicode data for regular expressions, as mentioned on http://docs.python.org/3.3/howto/regex.html.

If you want to compare strings you should really normalize them first, which is where unicodedata comes in. In my programming situations it would be wrong to conflate different decompositions of the same unicode string. Why is this? Because other software you interact with uses encodings, and the UTF-8 encoding of two different decompositions is different. I've run into this with UTF-8 filenames on OS X when working with Subversion.
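A minimal sketch of that normalization point using only the standard library (variable names are mine): the composed and decomposed spellings compare unequal and encode to different bytes until you normalize both to the same form.

```python
import unicodedata

composed = "\u00f6"        # ö as one code point
decomposed = "o\u0308"     # o followed by COMBINING DIAERESIS

print(composed == decomposed)                                  # False
print(composed.encode("utf-8"), decomposed.encode("utf-8"))    # different bytes

# Normalizing both sides to the same form makes them comparable.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
nfd_of_composed = unicodedata.normalize("NFD", composed)
print(nfc_equal)
```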

Did you read the comment you're replying to at all? You can start at “It's sad 99% comments”.


  Python 3.3.2 (default, Nov 27 2013, 20:04:48)
  [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
  Type "help", "copyright", "credits" or "license" for more information.
  >>> 'öo̧'[1:]

And sorry, those new regexes don't even support \X (grapheme matching)

Edit: python version

Yes, I did, and you did not provide a single example. You just said "“oh see, I can run some examples from page just fine. So everything's all right, I've got full Unicode!".

Taking the time to actually prove your point is useful. However, your recent example seems to be running fine on Python 3.3. You did not include any version info in your example output.

    Python 3.3.0 (default, Mar 11 2013, 00:32:12) 
    [GCC 4.7.2] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> "öo̧"[1:]
I haven't run across any situations where Python 3.3 is doing wrong, which is why I am asking for some examples.

3.3.2. No, it is not. Use 'o\u0308o\u0327'

Oh, I see the issue here. You are expecting the string class to function via graphemes rather than characters. It should be possible to implement grapheme support since character support is there, but I imagine the reverse is not true.

A little googling turned this up. https://mail.python.org/pipermail/python-ideas/2013-July/021...

TL;DR of parent comment here for those skimming:

  x = 'o\u0308o\u0327'
  len(x) == 4
  x == "öo̧"
  x[:2] == "ö"
  x[2:] == "o̧"
  x[0] == x[2] == "o"

You got me curious about grapheme matching in Python with regex. It looks like it is not in the stdlib yet with 3.3. However, you can install https://pypi.python.org/pypi/regex and then replace:

    import re

with:

    import regex as re

Then if you want to get into grapheme slicing, you could use something like:

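    import regex as re  # third-party module from PyPI, not the stdlib re
    decomposed_str = 'o\u0308o\u0327'
    # \X matches one grapheme cluster at a time
    graphemes = re.findall(r'\X', decomposed_str)
    sub_graphemes = graphemes[1:]  # drop the first grapheme
    decomposed_substr = ''.join(sub_graphemes)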

But what is that sequence (I know the unicode sequence is listed below, but is it some weird edge-case)? If I manually compose/type those (and a few other characters) everything seems to work fine:

    [edit: Python 3.2.3]
    [edit: [GCC 4.7.2] on linux2]

    >>> 'öo̧'[1:] #copy-paste
    >>> 'öo̧'[::-1] # "reverse" also breaks
    #But for Japanese:
    >>> '日本語'[1:]
    >>> '日本語'[:-1]
    >>> '日本語'[-1:]
    >>> '日本語'[::-1]
    # And Norwegian
    >>> 'æåø'[::-1]
    # And a few "French" characters (in this case
    # manually typed as alt+~+e, etc
    >>> 'ẽêèe'[::-1]
    # And crucially for your example, typed as
    # alt+"+o
    >>> 'öo'[::-1]
So is your initial example some kind of unicode-without-bom(b) or something?

[edit2: I gather, that working with "pre-composed" characters work, and working with "de-composed" ones break. Which, while expected, is a little sad, I agree.]
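The precomposed-versus-decomposed difference can be shown without relying on how a terminal renders things. A stdlib-only Python sketch (names are mine): naive reversal moves the combining marks onto the wrong base letters, and NFC only rescues the ö, because as far as I know there is no precomposed o-with-cedilla to compose to.

```python
import unicodedata

decomposed = "o\u0308o\u0327"          # o + diaeresis, o + cedilla

# Reversing code points puts each mark in front of the wrong base letter.
reversed_naive = decomposed[::-1]

# NFC composes o + diaeresis into one code point, but leaves o + cedilla alone.
nfc = unicodedata.normalize("NFC", decomposed)
reversed_nfc = nfc[::-1]

print([hex(ord(c)) for c in reversed_naive])
print([hex(ord(c)) for c in nfc])
```

So normalizing first makes the common cases look fine while the rest still break, which matches the behaviour observed in the session above.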

> Python 3 gets so much of this right. It's one of the things I really loved about python 3 as it allows for correct string handling in most cases (see below).

One of the biggest things that I feel Python gets right with the string type is that strings are immutable. It makes a lot of things easier.

It really makes sense to have a good string type for small strings, stored in unicode. Immutability makes everything simpler.

The string type is not a good fit for handling large amounts of text. There are trade offs for efficiency that have to be made to create a handy string type. It really makes sense to have a separate "bytes" type or some kind of StringBuffer for doing big text operations.

Isn't the string type immutable in many (most?) other languages as well? In Objective-C the default is an immutable string (though optionally one can create mutable strings as well). Lua also uses immutable strings. In Java and C# I think the situation is the same, since if you want to use high performance string manipulation, you'll generally resort to some form of StringBuilder helper class.

Correct, C# and .Net have an immutable string class and a mutable StringBuilder helper class.

I believe strings are mutable in Ruby.

They are...

    s = "hello"
    s << " world"
    s # => "hello world"

Is that allocating a new buffer, leaving the "hello" string to be collected by the GC?

No, it's operating in place:

    def append_world(str)
        str << " world"
    end

    a = "hello"
    append_world(a)         # mutates the caller's string in place
    a                       #=> "hello world"

No, it expands the existing buffer. (leaving " world" to be collected). Note that the following is different and more like what you're thinking.

    a = a + " world"

>In general, internally using unicode and converting to and from bytes when doing i/o is the right way to go.

I'm not sure what "internally using unicode" means. Python's internal representation of strings has changed a lot. It hasn't even been stable in Python 3. Now they are apparently using an internal representation that varies depending on the "widest" character stored.

The only solution that isn't driving me insane is to use UTF-8 everywhere. The Python 3 unicode situation is actually the main reason why I'm not using Python much these days.

In Python 3, you don't care about what they use internally. You don't need to.

If you want to work with strings, you work with strings. If you want to work with bytes, you work with bytes. If you want to convert bytes into strings (maybe because it's user input that you want to work with), then you tell Python what encoding these bytes are in and you have it create a string for you. You don't care what Python uses internally, because their string API is correct and correctly works on characters.

That noël example of the original article consists of 4 characters in Python 3 which is exactly what you want.
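The decode-at-the-boundary workflow described above looks roughly like this in Python 3. It is a sketch with made-up input bytes; the second decode shows what happens when you tell Python the wrong encoding.

```python
# Bytes as they might arrive from a socket or file (UTF-8, precomposed ë).
incoming = b"no\xc3\xabl"

decoded = incoming.decode("utf-8")       # correct encoding: 4 characters
mojibake = incoming.decode("latin-1")    # wrong encoding: no error, just garbage

print(len(incoming), len(decoded), len(mojibake))

# Encoding again at output time round-trips the original bytes.
round_trip = decoded.encode("utf-8")
```

The count of 4 holds for the precomposed spelling; the decomposed form (e followed by a combining diaeresis) would be 5 code points.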

I know that just using UTF-8 everywhere would be cool, but that's not how the world works for various reasons. One is that UTF-8 is a variable length encoding which has some performance issues for some operations (like getting the length of the string. Or finding the n-th character).

UTF-8 also isn't widely used by current operating systems (Mac OS and Windows use UCS-2). It's also not what's used by way too many legacy systems still around.

So as long as the data you work with likely isn't in UTF-8, the encoding and decoding steps will be needed if you want to be correct. Otherwise, you risk mixing strings in different encodings together which is an irrecoverable error (aside of using heuristics based on the language of the content).

>In Python 3, you don't care about what they use internally. You don't need to.

I do need to know and I always care. My requirements may be different than those of most others because I write text analysis code and I need to optimize the hell out of every single step. I shiver at the thought that any representation could be chosen for me automatically.

Of course, nothing is stopping me from simply using the bytes type instead of str, but clearly the Python community has decided to go down a road I feel is entirely wrong so I'm not coming along.

>I know that just using UTF-8 everywhere would be cool, but that's not how the world works for various reasons. One is that UTF-8 is a variable length encoding which has some performance issues for some operations (like getting the length of the string. Or finding the n-th character).

I'm bound to live in a variable length character world unless I decide to use 4 byte code points everywhere, which is prohibitive in terms of memory usage. Memory usage is absolutely critical. Iterating over a few characters now and then to count them is almost always negligible.

The need to index into a string to find the nth character only comes up when I know what I'm looking for. Things like parsing syntax or protocol headers come to mind, and they are always ASCII. I don't remember a situation where I needed to know the nth character of some arbitrary piece of text and repeat that operation in a tight loop.

If one day I find myself in such a situation I will have to convert to an array of code points anyway.

So in your one, specific, performance-limited situation, Python 3's implementation of unicode doesn't work for you. Mostly because you are trying to optimize based on implementation details.

I don't see how this equates to a general purpose language failing at strings, especially when the language isn't particularly focused on performance and optimization. And if memory usage is of concern, I would certainly think anything like Python and Ruby would be out of the running?

>I don't see how this equates to a general purpose language failing at strings

And I don't see where I said it did.

I used to favor a dual Python/C++ strategy, but Python's multithreading limitations and the decisions around unicode have convinced me to move on. It's not like anything has gotten worse in Python 3, it's just that there has been a major change and the opportunity to do the right thing was missed.

I happen to think that UTF-8 everywhere is the right way to go, not just for my particular requirements, but for all applications, because it reduces overall complexity.

I strongly disagree

and I'd like to know what do you think the "right thing" would be

I agree that only using UTF-8 would be the right thing, but only if you don't want to have "array of codepoints"... the problem is: every language, and every developer expect to be able to have random access to codepoints in their strings...

there're some weird exceptions, like Haskell Data.Text (I think that's due to haskell laziness)

would you prefer to have O(n) indexing and slicing of strings... or you'd prefer to get rid of these operations altogether?

if the latter, what'd you prefer to do? force the developers to use .find() and handle such things manually... or create some compatibility string type restricted to non composable codepoints?

Getting an implementation out to see it used in the wild might be an interesting endeavor... probably it'd be easier to do in a language that allows you to customize its reader/parser... like some lisp... clojure

>I agree that only using UTF-8 would be the right thing, but only if you don't want to have "array of codepoints"

Then we agree entirely. I want all strings to be UTF-8. Period. What I said about an array of codepoints was that I would create one separately from the string if I ever had a requirement to access individual code point positions repeatedly in a tight loop.

>the problem is: every language, and every developer expect to be able to have random access to codepoints in their strings

If by random access you mean constant time access then those developers would be very disappointed to learn that they cannot do that in Java, C#, C++, JavaScript or Python, unless they happen to know that their string cannot possibly contain any characters outside the ASCII or BMP range.

>would you prefer to have O(n) indexing and slicing of strings

I would leave indexing/slicing operators in place and make sure everyone knows that it works with bytes not codepoints. In addition to that I would provide an O(n) function to access the nth codepoint as part of the standard library.
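For illustration, the O(n) accessor proposed above could look something like this over raw UTF-8 bytes. This is only a sketch: `nth_codepoint` is a name I made up, and it assumes the input is already valid UTF-8.

```python
def nth_codepoint(data: bytes, n: int) -> str:
    """Return the n-th (0-based) code point of valid UTF-8 `data` by linear scan."""
    if n < 0:
        raise IndexError("negative index not supported")
    count = -1
    start = None
    for i, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; anything else starts a code point.
        if (byte & 0xC0) != 0x80:
            if start is not None:
                return data[start:i].decode("utf-8")
            count += 1
            if count == n:
                start = i
    if start is not None:                # the n-th code point was the last one
        return data[start:].decode("utf-8")
    raise IndexError("code point index out of range")

utf8 = "no\u00ebl\U0001F600".encode("utf-8")
third = nth_codepoint(utf8, 2)
last = nth_codepoint(utf8, 4)
print(len(utf8), third, last)
```

Byte slicing stays O(1) and this stays O(n), which is the trade-off being argued for.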

> If by random access you mean constant time access then those developers would be very disappointed to learn that they cannot do that in Java, C#, C++, JavaScript or Python, unless they happen to know that their string cannot possibly contain any characters outside the ASCII or BMP range.

Actually, you can in Python... and obviously most developers ignore such issues [citation needed]

My point is that most developers don't know these details, a lot of idioms are ingrained... get them to work with string types properly won't be easy (but a good stdlib would obviously help immensely in this regard)

> I would leave indexing/slicing operators in place and make sure everyone knows that it works with bytes not codepoints. In addition to that I would provide an O(n) function to access the nth codepoint as part of the standard library.

Ok, so with your proposal a hypothetical slicing method on a String class in a Java-like language would have this signature?

    byte[] slice(int start, int end);

I've been fancying the idea of writing a custom String type/protocol for clojure that deals with the shortcoming of Java's strings... I'll probably have a try with your idea as well :)

> Actually, you can in Python...

No, you can only get random access on codepoints which will break text as soon as combining characters are involved. Even if you normalize everything beforehand (which most people don't do) as not all possible combinations have precomposed forms.

Unicode makes random access useless at anything other than destroying text.

> but a good stdlib would obviously help immensely in this regard

Which is extremely rare, and which Python does not have.

>Actually, you can in Python

You are right (apart from combining characters as masklinn explained), but as I said, that's only possible if an array of 32 bit ints is used to hold string data or if it can be guaranteed that there are no characters from outside ASCII or BMP. If I understand PEP 393 correctly, what Python 3.3 does is to use 32 bit ints to hold the entire string if even one such code point occurs. So if you load a (possibly large) text file into a string and one such code point exists then the file's size is going to quadruple in memory. All of that is done just to implement one very rare operation efficiently. http://www.python.org/dev/peps/pep-0393/#new-api
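You can observe the effect described here directly with `sys.getsizeof`. A sketch, assuming CPython 3.3 or later (other implementations may store strings differently, and the exact byte counts are an implementation detail):

```python
import sys

ascii_text = "a" * 100000
mixed_text = "a" * 99999 + "\U0001F600"   # same length, one astral code point

ascii_size = sys.getsizeof(ascii_text)    # about 1 byte per character
mixed_size = sys.getsizeof(mixed_text)    # about 4 bytes per character
utf8_size = len(mixed_text.encode("utf-8"))

print(ascii_size, mixed_size, utf8_size)
```

For comparison, the UTF-8 encoding of the second string is only three bytes longer than the first.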

Sounds like you want to use Go. Feels like Python, but technically correct implementations of concepts.

Mac OS and Windows use UCS-2

Which parts of Mac OS? You'd have a lot of problems with Emoji support if that were true. To the best of my knowledge, it's UTF-16 everywhere.

Or do you actually mean Mac OS as in Mac OS 9, and not OS X?

Agreed with most of it except:

"because their string API is correct"

Apparently they have a bug in their UTF-7 parser that can lead to invalid unicode strings. Don't know if it's already fixed.

It was a bug in the decoding: it raised an unexpected exception, nothing that couldn't be worked around with a check (afaik it didn't crash the interpreter)

and it was fixed more than a month ago, just 2 days after it was reported


Let's avoid spreading fud, shall we? :)

That would be an implementation flaw, not an API issue.


> stuff would start breaking for characters outside of the BMP (which is where JS is still stuck at, btw)

ECMAScript 6 fixes that, mostly. See http://mathiasbynens.be/notes/javascript-unicode for details.

The string type isn't broken. If anything these "X is broken" posts are broken. Taking one special case, finding problems with that case and deducing that the whole concept must therefore be discarded is just silly. Strings work fine for the vast majority of use cases. No technology is free of flaws and engineering decisions are almost always based on weighing the pros and cons and choosing a solution that on balance works best. Strings are a useful feature and Unicode is a notoriously hard problem. Proposing to go back to arrays of characters makes things worse for most people in most cases and therefore is not a practical solution.

Vast majority of use cases in the English-speaking world.

In other countries like China, Japan, India, ... those edge cases are common enough to represent a significant portion of use cases and make X truly broken.

The article is maybe a bit provocative, but you know what, that's exactly what is needed to raise awareness of mainly US-centric developers who would completely ignore the technical issues until they face a clone in China whose only innovative feature is not breaking on Chinese text.

The point is not to reduce the number of options (everyone going back to arrays of characters) but to put the spotlight on some problems where going a level lower could help a lot.

> Strings work fine for the vast majority of use cases

In the CJK space (a third of the population?) strings are "broken" in the vast majority of use cases (really, things like what format you should accept for a telephone number). It gets exponentially dirtier as you try more complex manipulations, and I think these are interesting problems. It helps to discuss them from time to time.

Conflating the responsibilities of "character list" with "byte array" is always going to go badly.

That's OK, sounds like he is writing a new language, so screwing up on the strings implementation is par for the course. Languages and databases typically don't get string handling right until many years after they are born, if ever. Supporting all the Unicode and other character set insanity takes years of work. Asking someone writing a language to get strings right is like asking a five-year-old to obtain a driver's license.

In my experience, the world is full of software which "work fine for the majority of use cases" until the point where you take the wrong code path and things go south.

Much like human brains, and business processes.

Exactly. Engineering =/= Maths.

In many languages it's difficult fixing the string type without breaking existing code. In Ruby: String#upcase only handles ASCII (by spec), #length counts codepoints, #reverse reverses codepoints.

You can use UnicodeUtils if you need "full" Unicode support:

    >> UnicodeUtils.upcase("baffle")
    => "BAFFLE"
    >> graphemes = UnicodeUtils.each_grapheme("noe\u0308l").to_a
    >> graphemes.reverse.join
    => "lëon"
    >> graphemes.size
    => 4
    >> graphemes[0, 3].join
    => "noë"

> String#upcase only handles ASCII (by spec)

Bad for Ruby

> You can use UnicodeUtils if you need "full" Unicode support:

Oh, sure

  Betty:~ lelf$ ruby -r unicode_utils/u -e 'puts UnicodeUtils.each_grapheme("A‮͜CB‬D").to_a.reverse.join'
So, "full" (it's not) Unicode support won't help you if you have little idea about what you're doing (like indexing stringه҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿s)

Strange. Chrome indents those characters into a >-shaped "flock formation", but Firefox renders them as a vertical column.

what are these characters printing here?

It's an awesome little gadget - looks like one character, but is a really big messy bunch of bytes:

"\xD9\x87\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\ x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD 2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\ xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xB F\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\ xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xC C\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\ xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x8 8\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\ x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD 2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\ xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xB F\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\ xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xC C\xBF\n"

funny that an ad comes up when you search that character on google. http://imgur.com/Y5vkbHF

Ugh your post is totally breaking the page layout!

Perl seems to pass nearly all the tests (including uppercasing baffle):

  $ perl -E 'use utf8; binmode STDOUT, ":utf8"; say uc("baffle");'

The only failure I can see is that it treats "no<combining diaeresis>el" as 5 characters (so reports length as 5 and reversing places the accent on the wrong character). That's documented here: http://perldoc.perl.org/perluniintro.html#Handling-Unicode "Note that Perl considers grapheme clusters to be separate characters"

All else seems to work though (including precomposed/decomposed string equality etc). The docco also says that Perl's regex engine will Do The Right Thing with matching the entire grapheme cluster as a single char.

Perl is actually very good with Unicode. Note that a character is "The smallest component of written language that has semantic value" according to the Unicode glossary - I'd say Perl respects that meaning. As noted in the docs, graphemes can be handled with \X in regular expressions (although admittedly that's not pretty):

    my $length = 0; $length++ while $dec =~ /\X/g;

Note that a grapheme is defined as "A minimally distinctive unit of writing in the context of a particular writing system" - i.e. context is required to determine what a grapheme actually is. A few others have pointed that out... Given the definitions from Unicode, Perl does a pretty good job (esp. when using Unicode::Normalize to normalize input).


Python 3 gets that one, but Python 2.7 doesn't:

    $ python3 -c 'print("baffle".upper())'
    $ python -c 'print "baffle".upper()'
    $ python -c 'print u"baffle".upper()'

It's interesting that you get BAFFLE for the first one. I get the same result in both 3 and 2.

Note first that the reason you get "BAϬ„E" is a bit of garbage-in garbage-out. Strangely, the interpreter isn't rejecting that with the typical "SyntaxError: Non-ASCII character <char> in file" error; instead, it appears to be assuming ISO-8859-1, and then performing .upper(). You can fix that:

    python2 -c '# coding: utf-8
    print u"baffle".upper()'

(Note, of course, that the #coding needs to match your terminal's encoding, which is likely UTF-8, but it isn't guaranteed.)

That, for me, prints "BAfflE" in both Python 2 and 3 (adjusting for 3 by adding parens around print, and removing the u prefix on the literal.) I'm on Python 3.2, so perhaps 3.3 does better. (I'm behind on updates, but last I did update, Gentoo stable was still on 3.2.)


I am on python 3.3, but I don't know if it's the updated interpreter that fixes the bug.

Also Cocoa's NSString:

  [@"baffle" uppercaseString]; // @"BAFFLE"

˙ƃuᴉuɐǝɯ ⅋ 'spɹoʍ 'sɥdʎlƃ 'sɹǝʇɔɐɹɐɥɔ uǝǝʍʇǝq 'ɹǝʌǝʍoɥ 'ǝɔuǝɹǝɟɟᴉp ɐ sᴉ ǝɹǝɥ┴ ˙ʇxǝʇ ɥʇᴉʍ punoɹɐ ƃuᴉsɹɐ oʇ sǝɯoɔ ʇᴉ uǝɥʍ sǝᴉʇᴉlᴉqᴉssod ƃuᴉʇsǝɹǝʇuᴉ ǝɯos sɹǝɟɟo ǝpoɔᴉu∩

Which looks cleaner, "丄" or "┴" ? I.e. "ǝɹǝɥ┴" or "ǝɹǝɥ丄" ?

Awesome way to exercise the brain.

Interesting. I had no problem reading that (except for 'arsing' which I thought I'd misread). The ability of the brain to pattern-match upside down is amazing.

Fascinatingly, I read your comment first, then tried to read the upside down post — the only word I had trouble with was arsing.

Sight reading is really fascinating.

Varies by person. I'm nearly 100% incapable of reading upside down text.

I think the mistake here is seeing a string as an extension of an array or vector. What I would prefer is a string type that didn't support all the operations of vectors. The length of a string is not inherently a meaningful question (and for the cases where it is, what you want is something like a vector of grapheme clusters - which is a useful type to have, but not so useful that every string in your program should incur the overhead of creating such a thing); likewise reversing and splitting are operations that simply shouldn't be allowed for your "fast path, undecoded string" type.

I'm with you here; but in that case, I'd like an ascii_string type, which most languages don't provide specifically. This type _would_ support string reversal, substring slices, and so on, but be limited to 7-bit ASCII only. I think there are many use cases that are purely internal, and don't need i18n. It's handy to be able to do things, including operations on strings, for internal things. Filename handling where you control the filename, the "keys" in languages which use strings for a dictionary type, and so forth.

I think this might just confuse new programmers and the filename thing is especially dangerous since at some point you might want to support i18n there. I think it would be better to have two types of string: 1) Unicode strings and 2) arrays of 8-bit data with some string-like functions (essentially C strings). The second case is essentially binary data strings.

A big problem here is a lack of clear definitions for various concepts like "character," "reversed string," "upper case," etc. The author briefly recognizes this, but brushes it off with statements like "I generally expect that..." and "I assume most people would not be happy with the current result."

I think these hand-wavings aren't helpful. Short of extensive surveying, which is bound to be controversial no matter what the result, talking about "general expectations" is a purely subjective notion, and not a good way to evaluate the actions of cold, soulless silicon that is just following orders.

Like the author, I also consider myself a mostly reasonable person, yet I might come up with very different expectations. If I saw that "ffl" ligature, how would I know it's a ligature and not some single unrelated character in another language? You might respond "but it's clearly part of the word 'baffle' and should be capitalized thusly." But would you suggest that string libraries ship with word lists and perform contextual analysis to determine how to perform string operations? Surely that's a fool's errand, not to mention that it would inevitably produce unexpected results.

"If I saw that "ffl" ligature, how would I know it's a ligature and not some single unrelated character in another language?"

Because the name of that character is "Latin Small Ligature ffl". Knowing to capitalize ﬄ as FFL doesn't require a word list any more than knowing to capitalize "ffl" does.

I'm not sure I agree with the title, although I do agree with just about all of the content:

* a string type is probably a good idea to bundle the subtleties of unicode, a plain array or list (whether it's of bytes or of codepoints) won't cut it: standard array operations are incorrect/invalid on unicode streams

* the vast majority of string types are broken anyway, as even in the best case they're codepoint arrays (possibly with a smart implementation). The bad cases are just code unit arrays, which break before you even reach fine points of unicode manipulation

And then, you've got the issue that a lot of unicode manipulation is locale-dependent, which most languages either ignore completely or fuck up (or half and half, for extra fun)

If you are actually manipulating strings rather than just storing and pushing them around I would suggest looking at ICU. Handling Unicode is difficult and it's easy to confuse encodings, code points and glyphs or make assumptions based on your own culture and language.

ICU has support for a lot of the basic operations you would want to perform on strings as well as conversion to whatever format is suitable for your platform and environment.

Do people really need to reverse strings in the real world?

I don't think I've ever written code to do that outside of homework assignments and interviews.

Substrings exhibit similar problems and those are used quite often. It's just that in this case the effect of seeing it fail is a little more dramatic (i.e., l̈ – which doesn't even seem to render properly here).

"l̈" renders just fine for me, maybe your font does not include it.

Verdana doesn't seem to properly support U+0308, apparently. It's wrong (with that font) in Chrome, IE 10, Firefox and Word 2010. Other operating systems might substitute a different font that works better, perhaps.

Yes, I am running Debian without having installed the Microsoft core fonts so DejaVu Sans is substituted for Verdana.

Maybe not reversing, but trimming a Unicode string to a certain character count is a close relative and it is a very common operation.

Right, but what's the count you want there? It's either a byte count or a grapheme cluster count. The .count() on most current languages' string types doesn't correspond to either of those, so isn't really useful.
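To make the candidates concrete, here is a minimal Python 3 sketch that reports several "lengths" for the same text. The function name `count_units` is made up, and the grapheme figure is only an approximation: it counts every code point whose general category does not start with M as the start of a cluster, which is not the full UAX #29 algorithm.

```python
import unicodedata

def count_units(text: str) -> dict:
    """Return several competing notions of 'length' for the same text."""
    # Approximation: a combining mark (category Mn, Mc or Me) never starts a cluster.
    clusters = sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))
    return {
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "code_points": len(text),
        "approx_graphemes": clusters,
    }

decomposed = "noe\u0308l"  # "noël" with a combining diaeresis
print(count_units(decomposed))
# {'utf8_bytes': 6, 'utf16_code_units': 5, 'code_points': 5, 'approx_graphemes': 4}
```

The byte count is what storage cares about and the cluster count is roughly what a reader sees; the code point count in the middle is the one most string types hand you.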

What do you use it for? Unless you have a monospaced font the number of characters does not mean much. So unless you are implementing command line tools or text editors it should not be that common.

Truncating with ellipsis in the GUI in a desktop app. I can measure rendered length on a desktop, so I can truncate down to the desired number of pixels, round down to the nearest char, and then tack on "...". I would hate to see a semantically-important accent mark lost this way.

I have a database field limited to 100 "characters" [1]. The user sent me a form submission with 150. I need to do something to resolve that. This is incredibly common. Truncation to a defined size is routine.

[1]: I'm leaving "characters" undefined here, because no matter what Unicode-aware definition you apply here, you've got trouble.

This is a good real-world example and the response is an armchair programmer informing you that you are doing it wrong. The internet is rife with know-it-alls. "Just do X." Well, I cannot because I am contractually obligated to write the software as specified and not cowboy up and do whatever I like.

Maybe someone decided 100 characters was a reasonable cutoff and that field is not important enough to reject (read: increase bounce rate) on if someone manages to send too much.

Maybe the 100 characters is a short string generated from an unrestricted long string and cached on a separate server.

"I have a database field limited to 100 "characters"."

Well there's your problem right there...

"The user sent me a form submission with 150. I need to do something to resolve that."

Any software that defines "do something" here as "silently discard 1/3rd of the user input" is software I'm going to throw in the trash. If you must have fixed-length fields, surely telling the user "much characters, wow overflow" is better than just chopping the input.

Since this seems to be confusing people, I'm providing a small hypothetical example here.

"Any software that defines "do something" here as "silently discard 1/3rd of the user input" is software I'm going to throw in the trash."

You are reading far more in than I put in. I merely said somehow you need to resolve this; you put a particular resolution in my mouth, then attacked.

I did choose the web for one reason, which is that you can't avoid this case; you can try to limit the UI to generating 100 characters only (and I still haven't defined "characters"...), but it's 15 seconds for a user to pull open Firebug and smash 150 characters into your form submission anyhow. Somehow, you better resolve this, and as quickly as you mounted the high horse when faced with the prospect of mere truncation, throwing the entire request out for that will cause somebody else to mount an equally high horse....

What if it's a batch ETL process where there is no "user" to tell that it went wrong?

The point that when you're worrying about string length, it's often an indicator of a separate problem is a good one. But some things really do need the ability to measure/truncate strings and not every situation allows just throwing the software in the trash as an option.

Have you checked how your database counts? Does it count code points or does it try to count graphemes? I assume the former, but I guess you would still have to cut the input at a grapheme border when truncating the input.

Ellipsisising text when it does not fit into a label, for example. And if you just remove code points from the end (instead of graphemes) until the string (including ellipsis) fits then you might just drop a diacritic.
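A rough sketch of cutting at a cluster border instead of a code point border, in Python 3. The name `truncate` and its parameters are made up for the example, and the boundary test (general category starting with M means "belongs to the previous character") is a simplification of the real segmentation rules, so treat it as an illustration rather than something to ship.

```python
import unicodedata

def is_mark(ch: str) -> bool:
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def truncate(text: str, limit: int, ellipsis: str = "...") -> str:
    """Keep at most `limit` approximate clusters; append the ellipsis if anything was cut."""
    clusters = 0
    for index, ch in enumerate(text):
        if is_mark(ch):
            continue  # marks stay attached to the preceding base character
        if clusters == limit:
            return text[:index] + ellipsis
        clusters += 1
    return text

decomposed = "noe\u0308l"
print(decomposed[:3] == "noe")                       # True: naive slicing drops the diaeresis
print(truncate(decomposed, 3) == "noe\u0308...")     # True: the accent survives
```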

You have a search query and you want to remove stopwords and normalize the query.

I've been waiting for someone to ask me to reverse a string in an interview, so I can tell them why the code I just wrote for them (using the XOR trick, which is what they're usually expecting), is wrong.

When I've asked people to reverse a char* in the past, it's just been to see if they understand the basics of pointers. The XOR "trick" hasn't been impressive since high school. :)

Had such a case a few months back. Strings of single-byte characters are Endian-agnostic but multi-byte character encoding is affected by Endianness. To cope with it I read the sequence as single byte, then reversed, then changed the encoding to proper encoding and reversed again. The data came from a binary dump where I only needed a section that contained a few strings.

I admit it's dirty but it was throwaway code for an isolated case.

Edit: eh, guys, as I stated the string came from a binary dump. I didn't get to choose the encoding, it came from ROM in an embedded system with a different Endianness. I had to figure out a way to make it human readable.

Use UTF8, no endian issues. That's yet another reason why UTF16 and UTF32 are broken.

Languages will not store Unicode strings internally as UTF-8. Yes, we use it for input and output, but in memory UTF-8 is terrible for random access to characters. Endianness is (normally) only an issue for input and output, not really for internal storage, especially since with UTF-16 and UTF-32 you know exactly the size of the items.

UTF-16 is just as bad as UTF-8 regarding variable-width code points. The only thing you always have (unless using compression schemes like SCSU) is random access to code units. Only UTF-32 also allows random access to code points. However, that's still of questionable value because when dealing with text you often want to handle graphemes, not code points, code units or bytes.

You cannot do random access at all in Unicode, not even UTF-32 (and absolutely not UTF-16), due to combining characters.

UTF16 is variable length just like UTF8 is (so don't assume 2 bytes == 1 character).
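The widths are easy to check. A small Python 3 sketch (the helper name `encoded_widths` is made up) that encodes single code points and reports how many bytes each encoding needs:

```python
def encoded_widths(ch: str) -> dict:
    """Bytes needed for one code point in each encoding (no byte order mark)."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

print(encoded_widths("A"))           # {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_widths("\u00e9"))      # {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_widths("\U0001F638"))  # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

Anything outside the BMP, like the cat emoji on the last line, takes two 16-bit units (a surrogate pair) in UTF-16.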

For this you do not need to reverse the string in a unicode aware way. You need to operate on the raw bytes.
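A sketch of what the raw-byte version could look like in Python 3, assuming the dump holds 16-bit code units in the "wrong" byte order. `swap_byte_pairs` is a hypothetical helper, and the sample data is invented for the example.

```python
def swap_byte_pairs(data: bytes) -> bytes:
    """Swap each pair of adjacent bytes, flipping the byte order of 16-bit units."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    swapped = bytearray(len(data))
    swapped[0::2] = data[1::2]
    swapped[1::2] = data[0::2]
    return bytes(swapped)

# Pretend this came out of a big-endian ROM dump.
dump = "no\u00ebl".encode("utf-16-be")
text = swap_byte_pairs(dump).decode("utf-16-le")
print(text == "no\u00ebl")  # True
```

In practice `dump.decode("utf-16-be")` gets the same result without the manual swap; the helper only shows that the fix is a byte-level operation and needs no knowledge of characters.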

The same mechanism that is used for reversing a string can be very useful though.

Think along the lines of Python's:

>>> 'abcd'[:-1]


This is why the U.S. dominates the software world. Back when everyone was figuring out how to express their languages, we had the option to punt on complexity and just use ASCII.

ASCII doesn't make the U.S. special. ASCII is special because it's from the U.S.

Lots of people speak languages that trivially fit in 8 bits with no real "figuring out" to do. Before Unicode, we all had our different codepages or encodings. Including the U.S.

The U.S. is pretty central to computing. Because of that, and because ASCII only uses 7 bits, some other 8-bit cultures use it as a subset for their native 8-bit encodings. Even in the U.S, we use extensions to ASCII so we can represent text in languages that are close cousins to English. I doubt you actually use ASCII much. You've probably been using either ISO 8859-1 (aka Latin-1), which is a superset of ASCII, or Windows-1252, which is a superset of Latin-1.


This mess of incompatible codepages and culture specific encodings is one of the main problems that Unicode was invented to solve. It also happens to help languages which need more than 8 bits.

Many languages fit into 8 bits, but English is particularly simple in its alphabet. Even many of the European languages that can fit in 8 bits have things like accented characters that complicates things somewhat.

Of course this isn't to say English is simple overall. Just that its complexities lie elsewhere, and its simplicities lie in an area that made it particularly simple for early computer systems to process.

> Even many of the European languages that can fit in 8 bits have things like accented characters that complicates things somewhat.

I don't see your point here, with respect to English orthography making computer implementation easier. How exactly does not needing representations for accented characters make anything easier?

If it was just some additional characters like ñ (which is considered a letter of its own, not an accented n) then it wouldn't be a big deal – but e and é are the same letter with different accents, which adds some subtlety that English simply doesn't have. Given a small enough number of accented characters you can punt on that, call them each a character, but English is objectively simpler since the only real distinction it has between letters is caps or not-caps. (I was just watching the Mother Of All Demos, though, and everything was in caps but they put an overline over capital letters. So even normal English lettering was too complicated for a while.)

It has fewer characters (don't need one for each accent, possibly exceeding 8 bits otherwise) and/or no variable width characters. Also capitalization rules are trivial.

Not that I'm claiming English is unique here, just convenient, and many languages can't claim that.

Meanwhile, many of those languages with accented characters have no use for letters like z.

It's not really worth mentioning the alphabet when talking about unique features of English.

It seems rather that it is the other way around - the US dominated (and still does) the computer industry, and so ASCII, the English-centered character set, became the standard. ASCII is good enough (you might lose some accents on certain characters in certain words and such, but nothing much) for English but has no consideration for any other characters that might be used in other languages.

If Turkey was the dominant country in IT, I don't see why they wouldn't do the same thing only for their own alphabet; include all the characters of their alphabet (latin alphabet plus a few more), plus some more common characters used in math etc.

The OP probably needed to clarify. English having a simple alphabet gave the US a leg up on personal computing compared to the CJK countries. It's only one contributing factor, though.

Hat tip to Guido van Rossum for passing (nearly) all the tests in Python 3.

Is the "ffl-ligature to uppercase" test really relevant? Isn't that fixed by appropriate use of string normalisation?

The ffl ligature passes

  $ python3
  Python 3.3.2+ (default, Oct  9 2013, 14:50:09) 
  [GCC 4.8.1] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> "baffle".upper()

Strange that the article claims that no language passes it. It seems from another post that Perl passes it too.
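For anyone who wants to check both claims (uppercasing and normalisation) without relying on how the ligature survives copy and paste, here is a minimal sketch that spells the ligature as an escape. It assumes Python 3.3 or later, where str.upper() applies the full Unicode case mappings; the variable names are just for illustration.

```python
import unicodedata

# "baffle" written with U+FB04 LATIN SMALL LIGATURE FFL instead of f, f, l.
word = "ba\uFB04e"

# Full case mapping expands the single ligature code point into three letters.
upper = word.upper()

# Compatibility normalisation (NFKC) replaces the ligature with plain letters.
normalized = unicodedata.normalize("NFKC", word)

# The character's own name is what marks it as a ligature.
name = unicodedata.name("\uFB04")

print(len(word), upper, len(upper))  # 4 BAFFLE 6
print(normalized)                    # baffle
print(name)                          # LATIN SMALL LIGATURE FFL
```

So normalising first does sidestep the problem, but only with the compatibility forms (NFKC/NFKD); NFC and NFD leave the ligature alone.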

Embarrassing. I should have checked for myself.

And it is strange. Maybe the author needs to check his locale settings?

I also doubt the validity of the upper-casing, it feels like in an internationalization/localization context, converting a string to all upper case is not a valid thing to be doing.

Not all languages (or even characters) have well-defined upper-case versions of their glyphs.

Even if they all did, I would expect the interpretation by a (human) reader to vary culturally.

The Unicode standard includes uppercase rules. If you're already representing strings using Unicode codepoints, why not follow the whole Unicode standard?

EDIT: And yes, different languages can have different uppercasing rules. There's still a standard: https://github.com/lang/unicode_utils/blob/master/data/CaseF...

Thanks, I was not aware of that.

I guess "uppercase this string" goes from being a tiny loop to a big ... thing based on a lot of hardcoded knowledge, which in turn might indicate that it's not a very simple operation any more.

The usual goal is to apply a consistent transform though, to smooth out interpretation differences - i.e. when looking for command input I either lowercase or uppercase things to smooth over the fact that "yes" "YES" "Yes" are all completely valid ways of saying the same thing with those characters.

If there's only one way of expressing the thing - i.e. a single Chinese character - then it would be valid to do nothing. It's just in English "y" and "Y" might change context, but as far as computer input is generally concerned they are the same thing.

To compare strings case-insensitively, you want case folding rather than lowercasing or uppercasing. Unicode defines case folding for exactly this purpose. Case conversion has enough complexities, like characters that have no other case or that have multiple mappings, that it can't be used correctly for comparison.
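A minimal sketch of what that looks like in Python 3.3+, where str.casefold() exposes Unicode case folding. The helper name `caseless_equal` is made up for this example; the NFD(casefold(NFD(x))) recipe is the canonical caseless match described in the Unicode standard, and it ignores locale-specific rules such as the Turkish ones.

```python
import unicodedata

def caseless_key(text: str) -> str:
    """Build a comparison key: NFD, then case folding, then NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())

def caseless_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings ignoring case and composition."""
    return caseless_key(a) == caseless_key(b)

# Lowercasing is not enough: the sharp s stays a sharp s.
print("Stra\u00dfe".lower() == "STRASSE".lower())   # False
# Case folding maps the sharp s to "ss", so the comparison succeeds.
print(caseless_equal("Stra\u00dfe", "STRASSE"))      # True
# Precomposed and decomposed spellings also compare equal.
print(caseless_equal("NOE\u0308L", "no\u00ebl"))     # True
```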

For that use case it is better to compare case insensitively with "yes" instead of converting the input to lower case first.

If you can compare case-insensitively then you (or the library you call into) must be aware of case and you face the exact same problems. It's a pretty good thought exercise to attempt writing your own Unicode-aware case insensitive string compare. A lot of people call into libraries for this stuff without realizing how complex the problem gets.

How do you do case-insensitive comparison without normalizing the case of the operands?

Most string compare routines in the library offer a case insensitive compare option already, you don't have to normalize it.

Here are the Unicode rules, which do consider localization: ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt
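As a small illustration of why the locale matters, here is a sketch using Python's built-in string methods, which are locale-independent: they apply only the unconditional mappings and never the Turkish or Azeri ones. The variable names are just for the example.

```python
import unicodedata

# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE (the Turkish dotted capital I).
dotted_capital = "\u0130"

# Without a locale, it lowercases to "i" plus a combining dot above,
# i.e. two code points rather than a plain "i".
lowered = dotted_capital.lower()
names = [unicodedata.name(ch) for ch in lowered]
print(names)  # ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

# Plain I/i always pair with each other here; a Turkish-aware mapping
# would instead pair "I" with dotless U+0131 and "i" with U+0130.
print("I".lower() == "i", "i".upper() == "I")  # True True
```

Getting the Turkish behaviour needs a locale-aware library (ICU, for example); nothing in the built-in str methods takes a language argument.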

It would be interesting to know how often these rules are actually used though...

Many designers use

To make acronyms like HTML and CSS look better on the page. To support i18n, HTML allows setting the language on a per-document or even per-element basis. That way the upper- or lower-casing can be done following the rules of the language.

[edit:] ah, ok, so on python 3 it depends how it's constructed.

[originally i had a post here saying i couldn't get the noel to work on 3.3.2]

Seems to work fine on 3.3.3 (Linux)

  Python 3.3.3 (default, Nov 23 2013, 09:49:26)
  [GCC 4.8.2] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> a = "noël"
  >>> len(a)
  >>> a[::-1]

This is the test case: https://eval.in/73766


Ah, I see, the decomposed case indeed doesn't work as well.

The test case is decomposed.
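To make the difference visible regardless of how the text was pasted, here is a minimal Python 3 sketch that writes both spellings with escapes. Variable names are only for illustration.

```python
import unicodedata

decomposed = "noe\u0308l"   # e followed by U+0308 COMBINING DIAERESIS
precomposed = "no\u00ebl"   # single code point U+00EB

print(len(decomposed), len(precomposed))   # 5 4
print(decomposed == precomposed)           # False

# Reversing code points moves the diaeresis onto the "l".
print(decomposed[::-1] == "l\u0308eon")    # True

# Composing first (NFC) makes length, equality and reversal behave.
composed = unicodedata.normalize("NFC", decomposed)
print(composed == precomposed)             # True
print(composed[::-1] == "l\u00ebon")       # True
```

Normalising only rescues this particular case because a precomposed ë exists; a base letter plus a mark with no precomposed form stays as several code points even after NFC.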

While JavaScript (in browsers) has no way to normalize precomposed/decomposed strings, it has standard methods to correctly compare them: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe... https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


    var decomp="noe\u0308l";  // e followed by U+0308 COMBINING DIAERESIS
    var precomp="no\u00EBl";  // precomposed U+00EB

    ["n", "o", "e", "̈", "l"]
    ["n", "o", "ë", "l"]

Browser support for this varies. The Intl.Collator interface is currently only supported in Chrome (maybe also in Opera? Idk).

Note: In Chrome when comparing (e.g. sorting) a lot of strings String.prototype.localeCompare is much slower than using a pre-composed Intl.Collator instance (because internally localeCompare creates a new collator for each call). Using Intl.Collator reduced startup time of my http://greattuneplayer.jit.su/ immensely. node.js currently has no support for Intl.*. It probably will be a compile time option for 0.12.

This article is mostly written from a European language perspective. For Indian scripts, storing combining characters as separate code points is the right thing to do.

For example, कि (ki) is composed of क and ि. When I'm writing this in an editor, say, I typed ku (कु) instead of ki (कि) and I press backspace, I indeed want to see क rather than deleting the whole "कि".

Only sometimes, I figure, because if you want to make the first letter green, you'd want that to apply to the whole कि.
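Both behaviours can be sketched in a few lines of Python 3. This is only an approximation: `first_cluster` is a made-up helper that glues any following code points with a general category starting with M onto the base letter, which is much cruder than the real grapheme segmentation rules in UAX #29.

```python
import unicodedata

def first_cluster(text: str) -> str:
    """Return the first base character plus any marks that follow it (approximation)."""
    end = 1
    while end < len(text) and unicodedata.category(text[end]).startswith("M"):
        end += 1
    return text[:end]

ki = "\u0915\u093f"  # DEVANAGARI LETTER KA + DEVANAGARI VOWEL SIGN I

# Backspace as "drop the last code point" leaves the bare consonant.
print(len(ki), ki[:-1] == "\u0915")           # 2 True

# The vowel sign is a spacing combining mark, so it belongs to the cluster.
print(unicodedata.category("\u093f"))         # Mc

# Styling "the first letter" wants the whole cluster, not one code point.
word = "\u0915\u093f\u0924\u093e\u092c"
print(first_cluster(word) == ki)              # True
```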

For the record, Racket gets the "baffle" example right:

    racket@> (string-upcase "baffle")

It also passes all of the author's other tests (except for the ones involving combining diacritics, but Racket includes built-in functions for normalizing such strings so you can work with them).

I seem to have rather little use for the cases the author presents here. If I'm working with strings, they are either of the debug or internal variant, where even basic ASCII would suffice, or I get them from somewhere and don't touch them at all, just pass them around.

But what I absolutely need in a language is to have a very very clear separation between strings and byte arrays, or raw data, and ideally a way to transform between the two. C# gets this right with its byte and string types, the framework uses them correctly, and there is the wonderful Encoding class to interchange the two. Python 2.7 is the absolute worst, it's apparently impossible to get anything done with raw data and not run into some obscure 'ASCII codec can't handle octet 128' whatever exception (reminds you why we have strict typing: magic is fucking annoying).
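For comparison, Python 3 draws the same line between str and bytes. A minimal sketch, with a made-up helper `decode_or_none`, showing that the conversion is always explicit and that mixing the two types is an error rather than an implicit ASCII guess:

```python
text = "no\u00ebl"            # a str: a sequence of code points
raw = text.encode("utf-8")     # a bytes object: b'no\xc3\xabl'

def decode_or_none(data: bytes, encoding: str):
    """Hypothetical helper: decode bytes, returning None if they don't fit the encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

print(decode_or_none(raw, "utf-8") == text)  # True
print(decode_or_none(raw, "ascii"))          # None

# Concatenating str and bytes is refused outright.
try:
    text + raw
except TypeError as exc:
    print("cannot mix str and bytes:", exc)
```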

I'd have hoped Common Lisp would fare well here, but SBCL (1.1.11 on 64-bit Linux Mint 15) is pretty broken. My results:

string: noël, reversed: l̈eon, first 3 chars: noe, length: 5

string: 😸😾, reversed: 😾😸, first 1 char: 😸, length: 2

string: baffle, upcase: BAfflE

string: noël, equals precomposed: NIL

Edited: GNU CLISP 2.49 produces identical results.

I was somewhat disappointed as well. I wrote some tests here: http://paste.lisp.org/display/140280

Perhaps playing around with different internal representations as pointed out by sedachv (https://news.ycombinator.com/item?id=6811407) would work but the initial, naive string usage doesn't work.

I had expected the default usage to work correctly in Common Lisp, though.

What we really should be doing is doing away with broken nomenclature.

What does the "length" of a string even mean? A database will tell you it has to do with storage. A nontechnical person will say it's the number of symbols. A visual designer might say that it has to do with onscreen width when rasterized in a particular way. None of these people are obviously right or wrong.

It's very useful to be able to count the number of glyphs in a string, or the number of unicode codepoints, or bytes, or pixels when rasterized in a particular way, but "length" isn't clear enough to unambiguously refer to any of them. Any meaning you try to ascribe to the "length" operation is going to be wrong to someone.

All of these examples work in Haskell's canonical text library, 'text'! It's the only language I know of that works.

The reversal of the decomposed noël doesn't produce the right result. Converting baffle to uppercase does do the right thing though, and the rest works as expected.

I don't think the solution to this problem is to make our string classes more complicated. I think it's to make our languages and character sets less complicated. I can't believe that multiple codepoints being used to generate a single glyph made it into the Unicode spec. That breaks a bunch of extremely useful abstractions. I think it is reasonable to expect human languages to be made up of distinct glyphs that do not interfere with each other. Any language that does not is too complicated to be worth supporting. Let it die.

Now let's take the lower case of "BAFFLE" - should we get "baffle" or should the string class/function/wtfe attempt to recognize that a ligature can replace "ffl" and return to us "baﬄe"? More generally, should the string library ever attempt to replace letters with ligatures? Should this be yet another option?

And as I type this, another issue manifests: the spelling correction can't even recognize baﬄe as a properly spelled word; it highlights the 'ba' and ignores the rest.

Uppercasing and lowercasing is inherently lossy. E.g. the German ß becomes SS when uppercased, yet there is no way to know whether SS should be lowercased to ss or ß again. That's a reason why those things should be used, if at all, only as display transformations. Same goes for ligatures, but even those actually shouldn't be applied automatically, depending on the language. E.g. in German ligatures cannot span syllables and few layout engines can detect that.

I feel like I should learn German only so that I would be able to comment on the ß issue every time a Unicode thread pops up. From my uninformed point of view it is not really clear if ß should really be handled as a separate character/grapheme, or just as a ligature in rendering phase and stored as 'ss'. Or even if current-day orthography should be held at such a sacrosanct position that it shouldn't be changed to save significant amount of collective effort.

> or just as a ligature in rendering phase and stored as 'ss'.


> to save significant amount of collective effort

I've seen this kind of suggestion a number of times on HN, and I find it highly amusing. When confronted with a difficult challenge in representing the world on a computer, apparently the answer is to instead change the world.

OK, but then how are you going to handle hundreds of years of legacy texts?

In German, 'ß' is definitely not just a ligature of 'ss'.

Consider 'Masse' (mass) vs. 'Maße' (dimensions).

Uppercasing these words will necessarily produce ambiguity.

It would be equally tempting -- and wrong -- to treat the German characters 'ä', 'ö' and 'ü' as ligatures of 'ae', 'oe' and 'ue'. They're pronounced the same, and the latter forms commonly occur as substitutions in informal writing, but they also occur in proper names, where it would be incorrect to substitute them with the former. However, if you want to sort German strings, 'ä', 'ö' and 'ü' sort as 'ae', 'oe' and 'ue'.
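A rough Python 3 sketch of that sorting rule, assuming the phone-book convention (DIN 5007-2, where ä sorts as ae). `phonebook_key` and the word list are hypothetical names made up for this example; a real application would more likely reach for a proper collation library.

```python
import unicodedata

# Hypothetical helper table: expand umlauts for comparison only.
_EXPAND = str.maketrans({"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue"})

def phonebook_key(word):
    # NFC first, so a decomposed "a" + U+0308 is treated like U+00E4.
    composed = unicodedata.normalize("NFC", word)
    # casefold() lowercases and also turns ß into ss.
    return composed.casefold().translate(_EXPAND)

words = ["Zebra", "\u00c4pfel", "Apfel", "Affe"]
by_code_point = sorted(words)                   # Ä lands after Z
by_phonebook = sorted(words, key=phonebook_key)

print(by_code_point)
print(by_phonebook)
```

The key is used only for ordering; the stored strings keep their umlauts, so no information is thrown away.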

The point is, while it may have started out as a ligature (of either ſs or ſz, no one really knows for sure), it has long become a letter in its own right. You cannot treat it like a display-only ligature without throwing away information, e.g. the difference between Maße (measurements) and Masse (mass). People in Switzerland made a conscious decision not to use ß anymore, but that's not the case in other countries where the language is used.

As "ß" vs. "ss" changes pronunciation of preceding vowels, I can't see how it could be anything other than its own letter.

* "Fuß" ("foot") roughly rhymes with "loose."

* "Fluss" ("river") roughly rhymes with… um, nothing I can think of. It has the vowel sound of "look" and "book," at least as pronounced in the American Northeast.

Since the orthographic reform of 1996, this has become a big deal.

If anybody hasn't seen it, Glitchr's twitter is a fantastic example of how bizarro things can get with "140 characters".


Note: may freak out browsers with a flaky Unicode implementation. For instance, scrolling that stream on the iOS Twitter client can get very laggy.

For Go: the for-range loop iterates 5 times, reversed (manually, using the resulting runes) is l̈eon, utf8.RuneCount is 5. The blog has just recently been talking about text normalization[1] via a library, but it isn't built into the core.

[1] http://blog.golang.org/normalization

The author intentionally chooses decomposed form. Indeed all of them work with Python 3. Here:

    Python 3.3.2+ (default, Oct  9 2013, 14:50:09) 
    [GCC 4.8.1] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> noel="noël"
    >>> noel[::-1]       # reverse
    >>> noel[0:3]        # first three characters 
    >>> len(noel)        # length
The point is, defining what is a character based on how it is displayed is flawed. Just precompose the string if you want and carry on. Like I said in my other comment, making automatic conversion of decomposed -> precomposed wreaks havoc with Indian languages.
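For anyone who wants to try the "just precompose it" route, a minimal sketch using the standard `unicodedata` module. Escapes are used instead of pasted text so the form of the input is unambiguous.

```python
import unicodedata

decomposed = "noe\u0308l"                             # e + COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)   # e-diaeresis as U+00EB

print(len(decomposed), len(composed))                 # 5 4
print(composed[::-1] == "l\u00ebon")                  # True
print(composed[:3] == "no\u00eb")                     # True
```

This only helps when a precomposed code point exists for the combination in question.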

Works as expected too in Scala, although it might be because the terminal does normalization.

    scala> val noel = "Noël"
    noel: String = Noël

    scala> noel.reverse
    res0: String = lëoN

    scala> noel.take(3)
    res1: String = Noë

    scala> noel.length
    res2: Int = 4

    scala> import java.text.Normalizer
    import java.text.Normalizer

    scala> val nfdNoel = Normalizer.normalize(noel, Normalizer.Form.NFD)
    nfdNoel: String = Noël

    scala> nfdNoel.length
    res3: Int = 5

    scala> nfdNoel.reverse
    res4: String = l̈eoN

    scala> nfdNoel.take(3)
    res5: String = Noe

The problem with an array of characters, as he mentions, is that it doesn't work properly in many use cases. If your array of characters stores 16 bit codepoints, it breaks with the 32 bit codepoints (Java got bit hard by that, where a char used to be a character prior to the introduction of surrogate pairs in Unicode); if it stores 32 bit codepoints, then it's pretty wasteful in most cases, which is exactly why you'd want a string type that handles storage of series of characters in an optimal fashion.
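A small Python 3 illustration of both halves of that trade-off. The emoji is just a convenient code point outside the Basic Multilingual Plane; the counts come from encoding the string, not from any particular language's internal storage.

```python
cat = "\U0001F638"                                  # one code point outside the BMP

utf16_units = len(cat.encode("utf-16-le")) // 2     # 16-bit code units
utf32_units = len(cat.encode("utf-32-le")) // 4     # 32-bit code units
print(len(cat), utf16_units, utf32_units)           # 1 2 1

# The storage cost of fixed 32-bit units for plain ASCII text:
ascii_text = "hello"
utf8_bytes = len(ascii_text.encode("utf-8"))
utf32_bytes = len(ascii_text.encode("utf-32-le"))
print(utf8_bytes, utf32_bytes)                      # 5 20
```

An array of 16-bit units sees two "characters" in the emoji (a surrogate pair), while 32-bit units spend four times the space on ASCII.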

I hope Haskell Prime solves this. In Haskell String is literally a list of characters. This causes some overhead and leads to bad performance. Of course we got Text and for binary data you can use ByteString, but it's a bit of a pain compared to having a real string type by default.

I think the specific case of ligatures isn't a failure in strings per se, but a failure in Unicode in that it includes them in the first place. What "ﬁ".upper() (or whatever) should do is kind of ambiguous. The following doesn't really seem appropriate:

  "ﬁ".upper().lower() #=> "fi"
But obviously nor does

   "ﬁ".upper() #=> "ﬁ"
In Turkish (which distinguishes between dotted and dotless 'i'), this issue exists already:

   "ı".upper().lower() #=> "i"
This case couldn't (so far as I know) be fixed by any string library without breaking Unicode compatibility, so it seems slightly disingenuous to call it an issue with strings.
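For reference, this is what Python 3 does with the fi ligature (U+FB01) and the Turkish dotless ı (U+0131). It is one implementation's locale-independent behaviour, not a claim about what every string library does.

```python
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI
dotless = "\u0131"       # LATIN SMALL LETTER DOTLESS I

print(ligature.upper())                  # FI  (two separate letters)
print(ligature.upper().lower())          # fi  (the ligature does not come back)
print(dotless.upper())                   # I
print(dotless.upper().lower() == "i")    # True: dotless came back dotted
```

Getting Turkish right needs locale-aware case mapping, which plain `str.upper()` and `str.lower()` do not offer.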

Tom Christiansen (of Perl fame) made a much, much more thorough analysis of Unicode problems in his OSCON 2011 presentation: http://www.oscon.com/oscon2012/public/schedule/detail/24252

Here are the slides: http://training.perl.com/OSCON2011/gbu/gbu.pdf

The site seems down ATM, but Internet Archive has it: https://web.archive.org/web/20121224081332/http://98.245.80....

The article briefly mentions JavaScript, which uses something similar to UTF-16/UCS-2: http://mathiasbynens.be/notes/javascript-encoding

Here’s a slightly more in-depth blog post on the many issues this causes, and how to avoid them in JavaScript: http://mathiasbynens.be/notes/javascript-unicode Some of these problems are briefly mentioned in the above post, too.

This is misinformation. OP's strings are just wrong...

    >> "\u0308"
    => "̈"
    >> "\u00eb"
    => "ë"
    >> "noe\u0308l"
    => "noël"
    >> "no\u00ebl"
    => "noël"
His noël examples work just fine if you don't copy/paste the string he posts, and instead type them in like I just did.

If anything, languages are reporting correct reverses and length, since he's really manipulating 5 characters rather than four.

Congratulations, you've discovered unicode composition!

  2.0.0p247 :045 > Unicode::compose("e\u0308").unpack('U').first.to_s(16)
   => "eb" 
  2.0.0p247 :046 > Unicode::compose("\u00eb").unpack('U').first.to_s(16)
   => "eb" 
  2.0.0p247 :047 > Unicode::decompose("e\u0308").unpack('U').first.to_s(16)
   => "65" 
  2.0.0p247 :048 > Unicode::decompose("\u00eb").unpack('U').first.to_s(16)
   => "65"
I presume the ones you pasted in were changed by the browser. His examples are not wrong at all, indeed how can a string be "wrong"?

His example is "wrong" in the sense that you cannot reasonably complain that "noe¨l" gets reversed to "l¨eon" and put the "¨" part on top of the "l" when it does — which seems entirely correct. Or for that matter, that the string's length is 5 when there are indeed 5 characters.

As for being changed by the browser, the latter (or rather the OS) copied what there was, and the OS pasted it verbatim insofar as I can tell.

Honestly I think 'you' computer programmers love useless challenges too much. Why can't you adopt lessons from Q?

If it isn't easy to get some languages working with Unicode properly then fix the languages and leave Unicode alone. Remove all the language characteristics that make working with Unicode difficult. If Unicode will not go to the language then the language must go to Unicode, or opt out of the computer era, or die!!


There is one more issue -- the easier it is to manipulate strings in some language, the greater the chance that they will be used as an internal data structure for things that certainly aren't texts. And this almost always causes substantial performance loss and awful bugs that are either untraceable due to a dependence on subtle configuration details or form security holes. Or both.

Is it possible that Unicode is actually a bunch of horseshit? If literally nobody gets the spec right, then maybe the spec is wrong.

Unfortunately, general purpose text is not a clean simple thing that you can model nicely. Unicode is a mess because the problem it tries to solve is messy.

Even if you could somehow come up with something obviously better, getting any new standard adopted widely enough to be useful would be a formidable, if not insurmountable, challenge. It's less pain to keep using Unicode and try to deal with the worst of the damage.


Try the correct test input: noe\u0308l

I think a lot of programmers don't properly understand character encoding simply because their programming languages don't give them the proper treatment. We need more APIs that force developers to acknowledge character encodings, probably in the type system.
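Python 3's split between `bytes` and `str` is one example of the kind of API I mean. A minimal sketch (the sample word is arbitrary):

```python
raw = "no\u00ebl".encode("utf-8")    # bytes: 6E 6F C3 AB 6C

# Encoded bytes and text are different types and do not mix silently.
try:
    raw + "!"
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

text = raw.decode("utf-8")           # decoding is an explicit step
print(mixing_allowed)                # False
print(len(raw), len(text))           # 5 4
```

It does not solve normalization or graphemes, but it does make the byte/text boundary hard to ignore.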

this hits on one of my biggest problems with native android and ios development. the wcs/wchar functions are largely broken or unusable... it caused me a real headache from not knowing upfront.

the idea of the string type is just fine though (or a character array) broken implementations don't invalidate it, they just invalidate the myth of '3rd party libraries must be good because hundreds of programmers worked on them for years' - which is exactly a myth. it doesn't just apply to strings but everything. (not brokeness, just that you shouldn't expect them to work beyond what you can measure, and certainly shouldn't expect that they are flawless or even good implementations)

Out of curiosity, why only have one string type? We don't do the same for numbers. Many languages don't have "number", they have int, float, long, etc.

Instead of just String, maybe we should have ASCIIString, UTF8String, and UTF16String.

I don't understand the reason of using C++ char array to store unicode text....

That's because in a truly sane language there should be a distinction between a data type and its implementation.

Then it would not be the "string" type that's broken, but an implementation of the "string" type.

I agree, it seems like a much saner thing to do. Now that you make me think of that, I do not know many instances of this. I could only think of https://github.com/clojure-numerics/core.matrix, which I stumbled upon recently. Do you have other examples of efforts to separate a type from its implementations?

Most collection libraries (e.g. the Java one) work like this - you have List as an interface and can use LinkedList or ArrayList or so on. I particularly like scala's approach to factory-like methods combined with this; Seq is an interface, as is List, with implementations like LinkedList. But you can do any of LinkedList(1, 2, 3), List(1, 2, 3), or Seq(1, 2, 3) - and get back a LinkedList, a List (which will be an implementation-selected implementation, possibly LinkedList), or a Seq (which again will be an implementation-selected implementation, possibly LinkedList).

doesn't every statically typed imperative language do this, and recommend it?

Not to my knowledge.

C++ and Java are statically typed and they, as far as I know, don't have distinction between string interface and implementation, just a standard string type. You can't make your own string implementation and make others (given that - would it exist - they use standard string interface) transparently accept them instead of language's standard string implementation.

Even Haskell (with standard Prelude) doesn't have a readily available and widely accepted typeclass for strings. As String is just an alias for [Char], if a library writer used that, the library won't accept, say, Data.Text (I know, it's a bit distinct thing, but...)

I was referring to "efforts to separate a type from its implementations", not String specifically, and thinking of containers & co.

Although, for example, even java has CharSequence which only gives you access to codepoints in a char sequence, you can inherit from that and create your own.

You are right. It's interesting that I didn't think of it, probably because switching implementations in compiled languages is often less trivial, and I don't remember doing it. Actually, are there many alternative implementations of, say, the C++ STL?

Logically equivalent doesn't mean equivalent for computers. Unless you can define, as a set of rules a computer can follow, why the reverse of “noël” is “lëon”, the computer just can't know.

Umm. For that case you definitely can define a valid reversing algorithm. The key is using grapheme clusters as the indivisible base unit. Sure, there are probably some weird languages that will not reverse properly with such an algorithm, but it would still be a significant improvement over the current situation.
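A rough Python 3 sketch of the idea. Big caveat: this is a simplified approximation that only glues combining marks onto the preceding base character. It is not the full grapheme cluster algorithm from Unicode's text segmentation rules (no Hangul jamo, no emoji sequences, and so on). `clusters` and `reverse_clusters` are names invented for this example.

```python
import unicodedata

def clusters(text):
    """Split text into base characters, each with its trailing combining marks."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch      # attach the mark to the previous base character
        else:
            out.append(ch)     # start a new cluster
    return out

def reverse_clusters(text):
    """Reverse the order of clusters, keeping each cluster intact."""
    return "".join(reversed(clusters(text)))

noel = "noe\u0308l"
print(len(clusters(noel)))                       # 4
print(reverse_clusters(noel) == "le\u0308on")    # True
```

The diaeresis stays on the e instead of hopping onto the l, and no normalization is needed.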

That's why the article says, it's better to have a bare-bytes data structure, than a broken string type.

Mostly this article says that most languages choose NFC for their default normalization form, and don't attempt to detect & convert strings to that form automatically.

Objective-C handles the "baffle" case just fine.

This is now my favorite example of a leaky abstraction.

So what do you propose? Not have a string type, and let everyone handle all these cases manually instead? That will not end well...

Happy to see that ruby (2.0 at least) passes all the tests except the "baffle" one.

Edit: sadly, it doesn't.

If you're using Rails, the ActiveSupport::Multibyte::Chars library (#mb_chars on any string) passes all these tests except the upcase of baffle:

  irb(main):001:0> RUBY_VERSION
  => "1.9.3"
  irb(main):002:0> Rails.version
  => "3.2.14"
  irb(main):003:0> # example 1
  irb(main):004:0* example1 = "noe\u0308l".mb_chars
  => noël
  irb(main):005:0> example1.reverse
  => lëon
  irb(main):006:0> example1.compose.slice(0,3)
  => noë
  irb(main):007:0> example1.g_length #grapheme_length
  => 4
  irb(main):008:0> example1.compose.length
  => 4
  irb(main):009:0> # example 2
  irb(main):010:0* example2 = "😸😾".mb_chars
  => 😸😾
  irb(main):011:0> example2.length
  => 2
  irb(main):012:0> example2.slice(1,1)
  => 😾
  irb(main):013:0> example2.reverse
  => 😾😸
  irb(main):014:0> # example 3
  irb(main):015:0* example3 = "baﬄe".mb_chars
  => baﬄe
  irb(main):016:0> example3.upcase
  => BAﬄE
  irb(main):017:0> # example 4
  irb(main):018:0* example4 = "noël".mb_chars
  => noël
  irb(main):019:0> example4 == example1
  => false
  irb(main):020:0> example4 == example1.compose
  => true

    "noe\u0308l".size # => 5
ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-darwin13.0.0]

Ah, oops. I read the post too quickly, now I see the problem.

"noe\u{0308}".size => 4

Your code is wrong

You forgot the "l":

    "noe\u{0308}" # => "noë"

It looks like Factor passes all of these tests. Hat tip to Daniel Ehrenberg.

Upvotes for the OP who posts "Disrupt 'Is Broken'".

What do you guys think about String in Haskell, where it is a list of char? Should it have some other default implementation, or should it have been more, um, decoupled from its implementation (don't know the correct terminology)?

Now, look over here! When I substitute this context with that, ka-pow! now it's an array of characters!

Big deal. I don't understand what the point of this article is when it shows the shortcomings of half a dozen different string implementations in random languages. Yes, if you don't understand the language, then your assumptions about how it works may be wrong. Big surprise, that doesn't mean every string implementation needs to conform to your expectations...

Unicode is a standard. It says how to act in these circumstances. Calling out incorrect unicode implementations is useful. You shouldn't have to worry about inconsistent behavior between different languages that purport to support unicode strings. That's the point of a standard.

So who is the authority about correct unicode implementations, exactly? And how do different languages with different use cases and power conform to such a standard? Why doesn't this authority extend over language implementations? Because they know what they are doing, and understand the domain, unlike the author of this article.

Look, I'm all for open standards, but saying that standards are required to be adhered to at the programming language level is just ignorance of the real world. The point of a standard isn't to dictate how data is architectured internally, it's to facilitate interoperability of systems at their endpoints. If you want interoperability of programmers, then make your own conforming language and get programmers to adopt it the right way, by competing in the market of ideas.

There is no idea here other than the writer's unjustified expectation that he should just know how every language handles Unicode because??? Because Unicode is a standard? No.. that doesn't make sense at all. Mixing contexts to make the point here means there is no ground for his argument to stand on.

> So who is the authority about correct unicode implementations, exactly?

The Unicode Consortium[1] publishes standards. If a language advertises unicode support, I expect it to follow that standard.

> Look, I'm all for open standards, but saying that standards are required to be adhered to at the programming language level is just ignorance of the real world

I'm not saying a language has to do anything, but if it's advertising support for a well defined feature, and does not deliver correctly on that, I will call them out on it, and support anyone else who does as well. Should we all just throw our hands up and say "Well, it's done now, no point in making a big deal of it?" I would rather apply pressure to get things fixed, or at least make it well known enough that future language designers give it the care and attention it's due.

> There is no idea here other than the writer's unjustified expectation that he should just know how every language handles Unicode because??? Because Unicode is a standard? No..

Are you under the impression that what the author is attempting is not well defined? The unicode standard has conformance clauses about how to interpret unicode strings[2]. That means that if a language advertises it has/supports unicode strings, and fails the tests we've just seen, it's not conformant with the unicode standard. That would make this useful because it's pointing out bugs. If a language does not advertise unicode support, but supports some unicode features, then this is useful because it's making sure people are aware of the limits of their language. All too often people refer to the native string implementation in their language as supporting unicode, when clearly there are problems.

1: http://www.unicode.org/

2: http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf (see section 3.2)

I would argue that it's a unicode problem. `U+0308` shouldn't exist in the first place as a unicode character. That's why we have `U+00EB` ('LATIN SMALL LETTER E WITH DIAERESIS'), etc.

Not all combinations of base and combining characters exist in a precomposed form, since a base character can have an infinite number of combining characters tacked onto it.
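You can check this with NFC normalization in Python 3. The l + diaeresis pair is the one from the reversed "noël" example; as far as I know Unicode has no precomposed code point for it.

```python
import unicodedata

e_diaeresis = unicodedata.normalize("NFC", "e\u0308")
l_diaeresis = unicodedata.normalize("NFC", "l\u0308")

print(len(e_diaeresis))   # 1: composes to U+00EB
print(len(l_diaeresis))   # 2: stays as base letter + combining mark
```

So code that assumes "normalize to NFC and every visible character is one code point" still breaks on combinations like this.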

If anything should not exist, it's U+00EB, which is a convenience, compatibility and (space) optimisation codepoint.

Uhm, nope. Definitely not. All the precomposed letters only exist because of compatibility with legacy character sets. There are also some languages that routinely use more than one stacked diacritic on letters and encoding every possible precomposed variant would be at least a little bit silly.

I would argue the opposite. Combining characters are a general (and thus preferable) solution to diacritics, so precombined codepoints should not have been included in Unicode.
