UTF-8 Everywhere (utf8everywhere.org)
275 points by angersock on Jan 16, 2014 | 149 comments

This resonates so much with me, in VLC.

VLC has a very large number of users on Windows (80% of our users), yet almost none of the devs use Windows to code. Therefore, we use UTF-8 char* everywhere, notably in the core. We use UTF-16 conversions only in the necessary Windows modules that use Windows APIs. Being sure we were UTF-8 everywhere took a lot of time, tbh...

But the worst are formats like ASF (WMV) or MMS that use UTF-16/UCS-2 (without correctly specifying which) and that we need to support on all other platforms, like OSX or Linux...

I am also convinced that wide chars should be avoided and are one of the bad things about Java. It is very nice to see that more and more people agree and that important projects like VLC already apply this principle. Will 2014 become the year we kill wide chars?

Wow, I've learned even more about VLC (and programming in general) and I'm even more impressed. Thanks for this little insight!

This may be tangential, but I think that computer languages should have a different type (and literal notation) for human text (strings that may be read by a human, may be translated, won't affect program semantics) and for computer strings (strings that are strictly defined, not to be translated, and may affect program semantics).

Then we could put all the human language problems into human text type, and leave the simpler computer string type with easier semantics.

In Python, although there are no tools for that, I typically use the following convention: single quotes for computer text and double quotes for human text. I guess you could use byte arrays for computer text as well, but it would be more painful.

Haskell has the different semantics though they share the literal type. In Haskell there's Text for Unicode strings (UTF-16 encoded) and ByteString for, well, byte-strings (fixed vectors of Word8s). The semantics make it hard to confuse them, though if you turn on the OverloadedStrings extension then a bare string literal is ambiguous. If you don't like that you can just not turn on that extension and then there is no literal syntax for either type and you must convert literal strings (really just lists of characters, [Char]) to either type manually

    Data.Text.pack "foo" :: Text
    Data.ByteString.Char8.pack "foo" :: ByteString

Finally, you must use manual encoding functions to interconvert them

    Data.Text.Encoding.encodeUtf8 :: Text -> ByteString

Python has separate bytes and unicode types as well, I think he meant there should be a "black box" type that the program shouldn't try to interpret.

I don't think you can really play the same game in Python. The idea of separation of types that occurs in Haskell will really require static typing. It means that entire processing pipelines in your code are completely specialized to one type of string or the other (or generalized to both) with no uncertainty.

There is no reason "computer text" couldn't also span the whole unicode character set. I don't think your comment is tangential. But I think the notion that some characters are special or more basic than other characters is a trap and leads to illogical thought. For example, many internet protocols are defined in that way. Ascii only for all commands and responses and then for i18n there are some crazy encoding schemes used (quoted printable, utf7, base64, etc) to encode unicode as ascii.

All that goes away if your protocol is standardized on utf8. Then text is text and bytes is bytes.

Lots of operations people want to do on 'strings' don't make sense on unicode text. There is no 'length'. There is the size in bytes of various encodings and there is the number of code points and there is the number of grapheme clusters. The latter is often what people really want, but to get that you need to know about the fonts being used to render the text.
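
A minimal Python 3 sketch of those different notions of length. The sample string is an arbitrary assumption (an "a" followed by a combining diaeresis, then the astral character U+1D49C), and `length_report` is a hypothetical helper name. The standard library has no grapheme cluster counter, so that number is left out; a reader would see two characters here.

```python
def length_report(text):
    """Return several different 'lengths' of the same text."""
    return {
        "utf8_bytes": len(text.encode("utf-8")),
        # UTF-16 code units are two bytes each.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # In Python 3, len() on a str counts code points.
        "code_points": len(text),
    }

sample = "a\u0308\U0001D49C"
report = length_report(sample)
print(report)  # {'utf8_bytes': 7, 'utf16_code_units': 4, 'code_points': 3}
```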

Similarly people want to iterate over a string character by character or take substrings by range but with unicode text that becomes iteration over code points and ranges of code points (unless you go all the way and use a text rendering system to give you grapheme clusters). Code points can be decomposed diacritic marks etc so you can't just blindly insert or change code points at a certain index or take arbitrary substrings without risking breaking the string (you can end up with accents on characters that you didn't intend, or stranded at the end of a string and probably plenty of other types of breakage that I can't even think of). Functionality exists to deal with all this but it's pretty burdensome (e.g. NSString has -rangeOfComposedCharacterSequencesForRange:).

That all adds up to a pretty hefty performance penalty as well as potential layering violations (needing to consider fonts and rendering when parsing some protocol if you really are going to treat strings as a sequence of grapheme clusters).

It certainly is possible to split text into grapheme clusters without involving any font rendering, see: http://www.unicode.org/reports/tr29/ Most text manipulation isn't performance critical, and when it is, you can always implement a fast-path for mostly ascii text.

I actually would prefer the computer string type to be array of bytes. Many people mentioned this type distinction already exists in many languages (Python, Haskell..).

Though I think it would be useful to think about it as a sort of subtype of human string, with default encoding in UTF-8. So substitution or concatenation of human and computer string would yield a human string (which is where these languages usually fall short, because you need explicit conversion, it doesn't work like e.g. integers and floats).

In traditional Windows development, the way that was handled was by having human-strings in a resource file (which you could replace with a file in a different language), while the computer-strings were hardcoded constants.

Traditional? The same is recommended for .net/WinRT development.

I'm not familiar enough with newer Windows development to comment on it, so that's why I qualified my statement.

In Go, you can name your variables in Chinese† if you want.

In Go, strings are immutable UTF-8 byte arrays,†† and the language provides facilities for iterating over them either byte by byte or rune by rune (a rune is an int32, wide enough to hold any unicode character).



Does Go provide functions for iterating grapheme by grapheme?

That's being worked on†. It hasn't made it into the standard package library as far as I know.

E.g., there are two ways to write Cañyon City. You can write the ñ as U+00F1 or as an ascii lower-case n followed by a combining tilde (U+0303). The first case results in a single rune, and the second in two runes. Example††. You need additional logic in order to normalize to a canonical representation and realize that the two strings are actually the same.
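
The "additional logic" can be sketched in Python 3 with the standard `unicodedata` module (the thread is about Go here; Python is used only for illustration, and `same_text` is a hypothetical helper name). Both spellings are normalized to NFC before comparing:

```python
import unicodedata

composed = "Ca\u00f1yon City"     # ñ as the single code point U+00F1
decomposed = "Can\u0303yon City"  # plain n followed by combining tilde U+0303

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)           # False: different code point sequences
print(len(composed), len(decomposed))   # 11 12
print(same_text(composed, decomposed))  # True
```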

Also, if you are displaying the string, you need to account for the fact that, although the two strings have different byte and rune lengths, they take up exactly the same number of pixels on your display medium.



>E.g., there are two ways to write Cañyon City. You can write the ñ as U+00F1 or as an ascii lower-case n followed by a combining tilde (U+0303). The first case results in a single rune, and the second in two runes. Example††. You need additional logic in order to normalize to a canonical representation and realize that the two strings are actually the same.

Who thought that having two ways to go about this was a good idea in the first place?

That's the purpose of the Rune type, I believe. When you iterate over text, it gives you one character at a time, not a fixed number of bytes.

That's iteration by code point, not by grapheme. For example, the grapheme 'ä' may be represented as two code points: an 'a' and a combining diaeresis.
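
A rough Python 3 illustration of the difference. The grouping below is a deliberate simplification, not the UAX #29 algorithm: it only attaches combining marks to the preceding character and ignores Hangul jamo, emoji sequences and the other rules. `rough_clusters` is a hypothetical name.

```python
import unicodedata

def rough_clusters(text):
    """Group each character with the combining marks that follow it.

    Simplification: this is NOT full grapheme cluster segmentation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the previous group
        else:
            clusters.append(ch)
    return clusters

word = "a\u0308b"  # 'a' + combining diaeresis, then 'b'
print(len(list(word)))            # 3 code points
print(len(rough_clusters(word)))  # 2 groups
```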



A Go string is an immutable opaque sequence of bytes. It is conventionally UTF-8, but it does not have to be. The only place in the language that cares is if you range over a string, but even there non-UTF-8 strings are supported.

Can you have spaces or other reserved characters in variable names? Because in Lisp you can. In addition to being able to use characters from any encoding of course.

I thought that was the purpose of having binary strings and unicode strings in Python. That's how I've been using them anyhow. Unicode for anything whose text property is relevant (u"this is text") and binary as a simple container of bytes ("bf9@cf0*1!09v$j#x0j").

Yes, I think that's a good start, although the literal notation for the computer strings b".." is a bit unwieldy. They also don't coerce when used together, see my comment above.
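
A small sketch of the non-coercion, assuming Python 3 (in Python 2 the behaviour differs). Mixing the two types raises a `TypeError`, so the conversion has to be spelled out; `join_as_text` is a hypothetical helper and UTF-8 is an assumed default.

```python
human = "caf\u00e9"  # text meant for people
machine = b"GET"     # protocol token kept as raw bytes

def join_as_text(machine_bytes, human_text, encoding="utf-8"):
    """Decode the byte string explicitly, then concatenate as text."""
    return machine_bytes.decode(encoding) + " " + human_text

try:
    machine + human  # bytes + str is not allowed in Python 3
    mixing_error = None
except TypeError as exc:
    mixing_error = exc

print(type(mixing_error).__name__)
print(join_as_text(machine, human))
```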

Does Ruby's Symbol type acceptably accomplish this?

Symbols are yet another use case; "I want something as efficient as a small number" (probably 4 bytes nowadays) "but I want to type a nice human name for it, and I don't want to even have to think about the translation". It's neither "text" nor a "binary", it's actually a well-disguised int (or similar type).

I love symbols in Ruby. I wish every language had something similar, rather than having me define constants, importing the constants, etc. Just let me create symbols I can pass around easily.

Debatable. I haven't been doing Ruby for a few years, but when I did, it was mostly used as dictionary keys. So, sure, you didn't need to define your constant somewhere, but on the other hand, you end up a typo away from an invalid lookup. Getting a NameError is a much easier way to locate the origin of the problem.

Yes, that is a valid point, and definitely a downside.

But note that ruby's symbols are interned, so if you create lots of them programmatically they never get GC'd.

For some use cases, symbol types are useful (note that they originally come from Lisp). But not all. For example, in many internet protocols, you need computer strings which are not symbols.

Now, all the advice in the Windows section - don't do this, don't do that, only and always do third - is lovely, but if you happen to care about app's performance, you will have to carry wstrings around.

Take a simple example of an app that generates a bunch of logs that need to be displayed to the user. If you are to follow article's recommendations, you'd have these logs generated and stored in UTF8. Then, only when they are about to be displayed on the screen you'd convert them to UTF16. Now, say, you have a custom control that renders log entries. Furthermore, let's imagine a user who sits there and hits PgUp, PgDown, PgUp, PgDown repeatedly.

On every keypress the app will run a bunch of strings through MultiByteToWideChar() to do the conversion (and whatever other fluff comes with any boost/stl wrappers), feed the result to DrawText() and then discard the wstrings, triggering a bunch of heap operations along the way. And you'd better hope the latter doesn't cause heap wobble across a defrag threshold.

Is your code as sublime as it gets? Check. Does it look like it's written by over-enlightened purists? You bet. Just look at this "advice" from the page -

  ::SetWindowTextW(widen("string literal").c_str())

This marvel passes a constant string to widen() to get another constant string to pass to an API call. Just because the code is more kosher without that goddamn awful L prefix. Extra CPU cycles? Bah. A couple of KB added to the .exe due to inlining? Who cares. But would you just look at how zen the code is.

tl;dr - keeping as much text as possible in UTF8 in a Windows app is a good idea, but just make sure not to take it to the extremes.

"if you happen to care about app's performance, you will have to carry wstrings around"

If those strings are for the user to read, he's reading a million times slower than you handle the most ornate reencoding. Sounds like a premature optimization.

Not only that, but the time required to convert from UTF-8 to UTF-16 is negligible in relation to the time required to lay out the glyphs and draw them on screen. Premature optimisation indeed.

It's not a premature optimization. It's a manifestation of a different set of coding ethics which is just ... err ... less wasteful and generally more thoughtful.

Yup. I wish this ethic were more popular. I can understand that we "waste" countless cycles in order to support abstraction layers that help us code faster and with fewer bugs. But I think that our programs could still be an order of magnitude faster (and/or burn less coal) if people thought a little bit more and coded a little bit slower. The disregard people have for writing fast code is terrifying.

Or maybe it's just me who is weird. I grew up on gamedev, so I feel bad when writing something obviously slow, that could be sped up if one spent 15 minutes more of thinking/coding on it.

Yeah, I'll have to disagree with both of you. The "coding ethics" that wants to optimize for speed everywhere is the wasteful and thoughtless one.

Computers are fast, you don't have to coddle them. Never do any kind of optimization that reduces readability without concrete proof that it will actually make a difference.

15 minutes spent optimizing code that takes up 0.1% of a program's time are 15 wasted minutes that probably made your program worse.

Additionally: "Even good programmers are very good at constructing performance arguments that end up being wrong, so the best programmers prefer profilers and test cases to speculation."(Martin Fowler)

> Computers are fast, you don't have to coddle them

This mentality is exactly why Windows feels sluggish in comparison to Linux on the same hardware. Being careless with the code and unceremoniously relying on spare (and frequently assumed) hardware capacity is certainly a way to do things. I'm sure it makes a lot of business sense, but is it good engineering? It's not.

Neither is optimization for its own sake, it's just a different (and worse) form of carelessness and bad engineering.

Making code efficient is not a virtue in its own right. If you want performance, set measurable goals and optimize the parts of the code that actually help you achieve those goals. Compulsively optimizing everything will just waste a lot of time, lead to unmaintainable code and quite often not actually yield good performance, because bottlenecks can (and often do) hide in places where bytes-and-cycles OCD overlooks them.

I think we are talking about different optimizations here. I'm referring to the "think and use qsort over bubblesort" kind of thing while you seem to be referring to hand-tuned inline assembly optimizations.

My point is that the "hardware can handle it" mantra is a tell-tale sign of a developer who is more concerned with his own comforts than anything else. It's someone who's content with not pushing himself and that's just wrong.

(edit) While I'm here, do you know how to get an uptime on Linux?

  cat /proc/uptime

Do you know how to get uptime on Windows? WMI. That's just absolutely f#cking insane that I need to initialize COM, instantiate an object, grant it required privileges, set up a proxy impersonation only to allow me to send an RPC request to a system service (that may or may not be running, in which case it will take 3-5 seconds to start) that would on my behalf talk to something else in Windows guts and then reply with a COM variant containing an answer. So that's several megs of memory, 3-4 non-trivial external dependencies and a second of run-time to get the uptime.

Can you guess why I bring this up?

Because that's exactly a kind of mess that spawns from "oh, it's not a big overhead" assumption. Little by little crap accumulates, solidifies and you end up with this massive pile of shitty negligent code that is impossible to improve or refactor. All because of that one little assumption.

You make WMI sound long-winded, but do you think 'cat /proc/uptime' is free? There's a lot involved in opening a file.

On the process side, it's a few system calls, and the operating system always has this code at hand and does not need to load anything (that's what's slow).

I agree that optimization for its own sake is not a good thing (though tempting one for some, including me), but there's a difference between prematurely optimizing and just careless cowboy-coding. Sometimes two minutes of thinking and few different characters are enough to speed code up an order of magnitude (e.g. by choosing the proper type or data structure).

Also, being aware of different ways code can be slow (from things dependent on programming language of choice to low-level stuff like page faults and cache misses) can make you produce faster code by default, because the optimized code is the intuitive one for you.

Still, I think there's a gap between "fast enough and doesn't suck" and "customers angry enough to warrant optimization". It's especially visible in the smartphone market, where the cheaper ones can't sometimes even handle their operating system, not to mention the bloated apps. For me it's one of the problems with businesses. There's no good way to incentivize them to stop producing barely-good-enough-crap and deliver something with decent quality.

For display purposes, UTF-8 vs. UTF-16 is going to be such a minuscule difference that it's not worth the potential portability bugs to try to optimize for speed. You're talking about at most 30000 characters of text on screen at once. If that's entirely stored in UTF-8, and entirely rendered in UTF-16, and the conversion takes an insane 100 cycles per character on average, you're still using less than 0.1% of a single core of a modern desktop CPU.

If you got into the 1%+ range, I could see justifying some attention to speed, but otherwise...
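
The estimate can be reproduced with a few lines of Python. The 30000 characters and 100 cycles per character come from the comment above; the 3 GHz clock and the redraw rate are assumptions added here, and `core_fraction` is a hypothetical helper name.

```python
def core_fraction(chars, cycles_per_char, redraws_per_second, clock_hz):
    """Fraction of one core spent converting text for display."""
    cycles_per_second = chars * cycles_per_char * redraws_per_second
    return cycles_per_second / clock_hz

# Assumed: a 3 GHz core and one full-screen redraw per second.
once_per_second = core_fraction(30_000, 100, 1, 3_000_000_000)
# Assumed: a user mashing PgUp/PgDown ten times per second.
ten_per_second = core_fraction(30_000, 100, 10, 3_000_000_000)

print(f"{once_per_second:.3%}")
print(f"{ten_per_second:.3%}")
```

Under these assumptions one redraw per second costs about 0.1% of a core, and the figure scales linearly with the redraw rate, so ten redraws per second is about 1%.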

Less wasteful of computer time, but more wasteful of developer time. And, given that the comment is advocating a more complex strategy for using strings with different encodings rather than the simple one given in the story, probably more error-prone too.

The advocated strategy is simpler: UTF8 strings are a lot easier to handle than UCS2. What is complex is that the Windows API is inconsistent and more oriented toward UCS2.

That was an unfortunate example. The widen() in this case is absolutely unnecessary. The author even recommends using the L prefix for UTF-16 string literals inside of Windows API calls (but not on other platforms, where wchar_t isn't UTF-16):

> Do not use _T("") or L"" literals in any place other than parameters to APIs accepting UTF-16.

Except for that, you do make a good point. It's probably better to store some strings in memory according to the platform requirements, if the program can be shown to exhibit delays caused by string conversions.

You do realise that drawing the characters to the screen is an order of magnitude (at least) slower than grabbing some heap memory, doing a string conversion and freeing the memory, right? You don't even need to allocate memory - you can have a constant small thread-local buffer of, say, 1kb that you reuse for these conversions.

I was horrified to discover that Microsoft SQL Server's text import/export tools don't even support UTF-8. Like, at all. You can either use their bastardized wrongendian pseudo-UTF-16, or just pick a code-page and go pure 8-bit.

I'm not sure what modules or tools you are talking about, but if you use SQL Server Integration Services (formerly SQL Server Data Transformation Services), you basically have a data processing pipeline which supports everything and all transformations on the planet.

And obviously it supports arbitrary text-encodings, although sometimes you will need to be explicit about it.

If you used the simplified wizards, all the options may not have been there, but you should have been given the option to export/save the job as a package, and then you can open, modify, test and debug that before running the job for real.

Seriously. SQL Server has some immensely kick-ass and über-capable tooling compared to pretty much every other database out there.

To even suggest it doesn't support UTF8 is ludicrous.

He's probably talking about "bcp" which indeed doesn't support utf-8.

So why would someone even use bcp instead of SSIS? SSIS might be nice for performing repeated imports of data that has a fixed format, but for quick and dirty exports/imports it's really frustrating to use. It's not even smart enough to scan an entire data file and suggest appropriate field lengths and formats. Every single time I try to import a .csv file it craps out and doesn't even show where the error occurred - that's after clicking through a bunch of steps in a GUI. At least with BCP you can easily rerun the import/export from the command line.

SQL and SQL Management studio are generally great but I would not include SSIS when lauding them.

If SQL server supports UTF8, Microsoft manages to hide that fact well. http://technet.microsoft.com/en-us/library/ms176089.aspx:

char [ ( n ) ] Fixed-length, non-Unicode string data.


Character data types that are either fixed-length, nchar, or variable-length, nvarchar, Unicode data and use the UNICODE UCS-2 character set.

So, (var)char is "non-Unicode", and n(var)char is UCS-2 only.

That is in agreement with http://blogs.msdn.com/b/qingsongyao/archive/2009/04/10/sql-s..., which claims the glass is half full ("In summary, SQL Server DOES support storing all Unicode characters; although it has its own limitation.")

On the other hand, we have http://msdn.microsoft.com/en-us/library/ms143726.aspx that seems to state that SQL Server 2012 has proper unicode collations. UTF8 still is nowhere to be found, though.

To be fair, the format in which data is stored in the DB and the format used for importing data are two entirely different things.

If you want to treat data as a stream of bytes hardcore UTF8 & PHPesque style (this function is "binary safe" woo) with no regard to the actual text involved, feel free to store it as bytes. SQL Server supports that.

If you want to store it as unicode text feel free to use the ntext and nvarchar types. I'm pretty sure that's what you intend to do anyway, even though you insist on calling it UTF8.

I'm not the original complainer about UTF8 support, but "If you want to store it as unicode text feel free to use the ntext and nvarchar types." comes at a price: for the oh-so-common almost-ASCII text collections, it blows up your disk usage and I/O bandwidth for actual data by a factor of almost 2. For shortish fields, the difference probably isn't that big, but if you store, say, web pages or blog posts, it can add up.
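
The "factor of almost 2" is easy to check in Python 3. The sample text below is made up for illustration (mostly ASCII markup with one accented letter per repetition), and `encoded_sizes` is a hypothetical helper name.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for the same text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Assumed sample: almost-ASCII content, as in a typical web page.
sample = "<p>Mostly ASCII markup with one accent: caf\u00e9</p>" * 100

utf8_size, utf16_size = encoded_sizes(sample)
ratio = utf16_size / utf8_size
print(utf8_size, utf16_size, round(ratio, 2))
```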

The various command-line and text-friendly bulk/bcp tools are the pain-point here.

SSIS is insanely powerful and performant, but it's also insanely cumbersome and script-unfriendly. Microsoft has finally started embracing the power of plain text and scriptable tools in their web-stack and .NET, but SSIS represents a holdover from their heavyweight GUI-and-wizard days.

Fair enough. I don't use those tools too often, so I tend to forget they're around as well.

That said, if you are doing repeatable jobs (and not just one-off imports) you can still create a SSIS package, and then run the package from your script using the package-runner and appropriate config-data.

I don't like the way UTF-8 was clipped to only 1 million codepoints in 2003 to match the UTF-16 limit. The original 2.1 billion codepoint capacity of the original 1993 UTF-8 proposal would've been far better. Go Lang uses \Uffffffff as syntax to represent runes, giving the same upper limit as the original UTF-8 proposal, so I wonder if it supports, or one day will support, the extended 5- and 6-byte sequences.
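
For reference, the sequence lengths of the original scheme (the ranges later written down in RFC 2279, before RFC 3629 cut them back in 2003) can be sketched in Python. `original_utf8_length` is a hypothetical helper; it only computes lengths and does not encode anything.

```python
def original_utf8_length(code_point):
    """Bytes needed for a code point under the pre-2003 UTF-8 ranges."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    # Upper bound (exclusive) of each range, paired with its byte length.
    limits = [
        (0x80, 1),
        (0x800, 2),
        (0x10000, 3),
        (0x200000, 4),
        (0x4000000, 5),
        (0x80000000, 6),
    ]
    for limit, length in limits:
        if code_point < limit:
            return length
    raise ValueError("beyond the 31-bit range of the original scheme")

print(original_utf8_length(0x10FFFF))    # today's maximum code point
print(original_utf8_length(0x7FFFFFFF))  # the original 31-bit maximum
```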

In fact, UTF-16 doesn't really have the 1 million character limit: by using the two private-use planes (F and 10) as 2nd-tier surrogates, we can encode all 4-byte sequences of UCS-4, and all those in the original UTF-8 proposal.

I suspect the reason is more political than technical. unicode.org (http://www.unicode.org/faq/utf_bom.html#utf16-6) says "Both Unicode and ISO 10646 have policies in place that formally limit future code assignment to the integer range that can be expressed with current UTF-16 (0 to 1,114,111). Even if other encoding forms (i.e. other UTFs) can represent larger integers, these policies mean that all encoding forms will always represent the same set of characters. Over a million possible codes is far more than enough for the goal of Unicode of encoding characters, not glyphs. Unicode is not designed to encode arbitrary data."
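
Where the 1,114,111 figure comes from can be worked out in a few lines of Python. This is only arithmetic on the standard surrogate ranges; the constant names are made up for the sketch.

```python
BMP_CODE_POINTS = 0x10000              # 65,536 values that fit in one 16-bit unit
HIGH_SURROGATES = 0xDBFF - 0xD800 + 1  # 1,024 possible first halves
LOW_SURROGATES = 0xDFFF - 0xDC00 + 1   # 1,024 possible second halves

# Every (high, low) pair addresses one code point above the BMP.
supplementary = HIGH_SURROGATES * LOW_SURROGATES
total = BMP_CODE_POINTS + supplementary
max_code_point = total - 1

print(supplementary, total, max_code_point, hex(max_code_point))
```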

What use would it have to have so much extra codepoint space?

2 planes (130,000) of private-use codepoints aren't enough, and because the top 2 planes of Unicode are designated private use, UTF-16 gives developers the option of extending them to 2.1 billion if they need it. I've wanted extra private-use space for generating Unihan characters by formula in the same way the 10,000 Korean Hangul ones are generated from 24 Jamo. I'm sure many other developers come across other scenarios where 130,000 isn't enough for private use.

I'm simply saying that UTF-8 shouldn't be crippled in the Unicode/ISO spec to 21 bits, but be extended to 31 bits as originally designed because the technical reason given (i.e. because UTF-16 is only 21 bits) isn't actually true. The extra space should be assigned as more private use characters. (Except of course the last two codepoints in each extra plane would be nonchars as at present, and probably also the entire last 2 planes if the 2nd-tier "high surrogates" finish at the end of a plane.)

Part of the reason this is a problem is because someone probably said "Who could need more than 16 bits' worth of codepoints?", so I'd err on the side of extra codepoint space.

We constantly have to deal with Win32 as a build platform and we write our apps natively for that platform using wchar. I think the main difficulty is that most developers hate adding another library to their stack, and to make matters worse, displaying this text in Windows GUI would require conversion to wchar. That's why I think they are up for a lot of resistance, at least in the Windows world. If the Windows APIs were friendlier to UTF-8, there might be hope. But as it stands right now, using UTF-8 requires the CA2W/CW2A macros, which is just a lot of dancing to keep your strings in UTF-8 which ultimately must be rendered in wchar/UTF-16.

Maybe there might be a shot in getting developers to switch if Windows GUIs/native API would render Unicode text presented in UTF-8. But right now, it's back to encoding/decoding.

"This is what made UTF-8 the favorite choice in the Web world, where English HTML/XML tags are intermixed with any-language text."

Except that Javascript is UTF-16, so no luck with 4 byte chars there.

> Javascript is UTF-16

No it isn't. Javascript is no different from any other text. It can be encoded in any encoding. Where did you get the idea that JS is UTF-16?

EDIT: I misunderstood the intent of the comment I was responding to. JS uses (unbeknownst to me) UTF-16 as its internal representation of strings.

JavaScript source can be encoded in any way that the browser can handle, yes.

Within the JS language, strings are represented as sort-of-UCS-2-sort-of-UTF-16 [0]. This is one of the few problems with JS that I think merits a backwards-compatibility-breaking change.

[0] http://mathiasbynens.be/notes/javascript-encoding

GP means string literals. To quote from the spec: "4.3.16 String value: primitive value that is a finite ordered sequence of zero or more 16-bit unsigned integer... Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text."

The "usually" there turns out to be important.

Javascript "strings" are, as the spec says, just arrays of 16 bit integers internally. Since Unicode introduced characters outside the Basic Multilingual Plane (BMP), i.e. those with codepoints greater than 0xFFFF, it has no longer been possible to store all characters as a single 16 bit integer. But it turns out that you can store a non-BMP character using a pair of 16 bit integers. In a UTF-16 implementation it would be impossible to store one half of a surrogate pair without the other, indexing characters would no longer be O(1) and the length of a string would not necessarily be equal to the number of 16 bit integers, since it would have to account for the possibility of a four byte sequence representing a single character. In javascript none of these things are true.

This turns out to be quite a significant difference. For example it is impossible in general to represent a javascript "string" using a conforming UTF-8 implementation, since that will choke on lone surrogates. If you are building an application that is supposed to interact with javascript — for example a web browser — this prevents you from using UTF-8 internally for the encoding, at least for those parts that are accessible from javascript.
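
The "choke on lone surrogates" behaviour can be seen with Python 3's strict UTF-8 codec, used here as one example of a conforming implementation (the comment above is about browsers, not Python). `can_encode_utf8` is a hypothetical helper name.

```python
def can_encode_utf8(text):
    """Return True if strict UTF-8 encoding of the text succeeds."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

lone_surrogate = "\ud800"         # half of a pair, legal in a JS string
astral_character = "\U0001D49C"   # a real code point above U+FFFF

print(can_encode_utf8(lone_surrogate))    # False
print(can_encode_utf8(astral_character))  # True
# In UTF-16 the astral character becomes the pair D835 DC9C.
print(astral_character.encode("utf-16-le").hex())  # 35d89cdc
```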

The idea is from ECMA 262, sections 2(conformance), 4.3.16 (String value), 6 (Source text), 8.4 (String type)... That's basically THE reason, why all js engines are UTF-16 internally.

We really ought to suggest that be changed in ES7.

Have a look at this stack overflow question[1]. Javascript/ECMAScript strings are supposed to be UTF-16. That said UTF-16 encodes 4 byte codepoints just as easily as UTF-8.


Concrete example of where JS has trouble:

String.fromCharCode(0x010004).charCodeAt(0); => 4
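
What happens there, sketched in Python for illustration: the argument is cut down to its low 16 bits, so 0x010004 becomes 4. Both function names below are hypothetical; `from_char_code_unit` only mimics the truncation, and `surrogate_pair` shows the two code units that would be needed to represent U+10004 properly.

```python
def from_char_code_unit(value):
    """Mimic the truncation: keep only the low 16 bits of the argument."""
    return value & 0xFFFF

def surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

print(from_char_code_unit(0x010004))                   # 4
print([hex(unit) for unit in surrogate_pair(0x10004)])  # ['0xd800', '0xdc04']
```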

And software developers, don't forget to implement the 4 byte characters too please. Utter nightmare dealing with MySQL. I believe 4 byte characters still even break github comments.
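
A hedged sketch of a pre-flight check, in Python 3: it lists the characters that need four bytes in UTF-8, which are the ones that trip up stores limited to three bytes per character (MySQL's legacy `utf8` charset is the usual example; `utf8mb4` lifts that limit). `astral_characters` is a hypothetical helper name.

```python
def astral_characters(text):
    """Return the characters above U+FFFF, i.e. the 4-byte UTF-8 ones."""
    return [ch for ch in text if ord(ch) > 0xFFFF]

comment = "\U0001D49Cwesome"  # starts with U+1D49C, a script capital A
offenders = astral_characters(comment)

print(len(offenders))
print([len(ch.encode("utf-8")) for ch in offenders])
```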

MySQL was a Norwegian company at the beginning, but you wouldn't believe it. It's one of the worst products when it comes to I18N and especially to Unicode. They still kind of store unicode text as a chain of encoded utf8 characters, if I'm not wrong. Their stupid command line utility still defaults to LATIN1 for input and output despite my locale clearly saying .UTF-8. People in their IRC channel still refuse to admit any of this is a problem.

The horrible unicode support in MySQL fits well with its general careless attitude toward data integrity.

Minor correction: MySQL was Swedish and InnoDB Finnish.

About 2 weeks ago I tried to file a bug with our backend guys about 4-byte characters wreaking havoc on our API.

My example broke the bug tracker's (bugzilla) comment system as well. I chuckled.

Yeah. We've noticed, to our own amusement, that Jira (we're on an older version) can't handle non-ASCII. Makes entering tickets involving other languages fun.

I assume Bugzilla broke due to being backed by MySQL. Bugzilla itself is written in Perl so should have no problem.

At my last job I checked in a test case for our astral character handling. And broke the build server.

I ran into that issue with MySQL's utf-8 handling. It was 𝒜wesome: http://geoff.greer.fm/2012/08/12/character-encoding-bugs-are...

I can only imagine what kind of frustration drove someone to make this site.

The same frustration made me write a custom string library for the Playground SDK, years ago.

std::string is missing a lot of functionality one tends to need when dealing with strings (such as iterating over UTF-8 characters or fast conversion to UTF-16, but also things like search-and-replace). And it makes me sad that I can't use that string library any more (legally) because of the license PlayFirst insisted on using (no redistribution).

As far as I'm concerned, though, there IS no good string library available for use anywhere. I've looked at all of the ones I could find, and they're all broken in some fundamental way. I guess solving the "string problem" isn't sexy enough for someone to release a library that actually hits all the pain points.

You might like ogonek[1]. It’s currently still not implementing regular expressions and thus strongly limited in its capabilities (and C++11 regex are so badly designed that they cannot be extended meaningfully to handle this case), but it has hands down the best API for working with text in C++. It makes using a wrong encoding a compile-time error and offers effortless ways of dealing with actual Unicode entities (code points, grapheme clusters) rather than bytes.

[1] https://github.com/rmartinho/ogonek

You've sparked my curiosity: what's wrong with C++11's regex?

I've used them over the summer but nothing felt broken beyond the general C++ verbosity. Granted most of my prior regex work was in Perl.

Thanks for the link! I'll take a look.

I actually extremely rarely need full reg-ex support. Almost never, really. What WOULD be awesome is limited pattern support, at the level of Lua patterns[1], especially if they were UTF-8 character aware.

[1] http://www.lua.org/manual/5.1/manual.html#5.4.1 -- Lua patterns are NOT regular expressions, even though they look similar; there's no "expression" possible, just character class repeats.

Well, looks like that is what it is. Shoot me an email with your pain points, and maybe something will come of it. :)

I'm more than imagining it right now. >:(

UTF-8 is usually good enough on disk.

I would like to have at least two options in memory: utf-8 and vector of displayed characters (there's many combinations in use in existing modern languages with no single-character representations in UTF-<anything>).

Do you need a vector of displayed characters?

Usually all you care about is the rendered size, which your rendering engine should be able to tell you. No need to be able to pick out those characters in most situations.

Yes. If I want to work with language and do some stringology, that's what I want. I might want to swap some characters, find length of words etc. To have vector of characters (as what humans consider characters) is valuable.

Yes. A really good string type, that actually modelled a sequence of characters, not bytes, codepoints, interspersed glyphs and modifiers, or what have you, would be very useful at times.

The acid test for this sort of 'humane string' type would be whether you could splice together any two substrings from any two input strings and get something that could be validly displayed. UTF-8 bytes fail because you can get fractional codepoints. Codepoints in >=21-bit integers fail because you can get modifier characters which don't attach to anything.

A similar test would be whether you can reverse any input string by reversing the sequence of units. For example, reversing "amm͊z" should yield "zm͊ma", which it doesn't in unicode, because "m͊" is made with a combining mark, and doesn't have a composed form.

For extra fun, i suspect that reversing the string "œ" should yield "eo".

It should also be simple to do things like search for particular characters in a modifier-insensitive way. For example, i should be able to count that "sš" contains two copies of the letter 's' without having to do any deciphering. I suspect i should also be able to count that the string "ß" contains two copies of the letter 's', but i'm not nearly as sure about that.

I think i essentially want a string that looks like:

List<Pair<Character, List<Modifier>>>

But i'm not sure. And i'm even less sure about how i'd encode it efficiently.
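A rough Python sketch of that `List<Pair<Character, List<Modifier>>>` idea, for the reversal and counting tests described above. It only glues combining marks to the preceding base code point using the standard `unicodedata` module, so it is a simplification and not full grapheme cluster segmentation; the function names are invented for this example.

```python
import unicodedata

def clusters(text):
    """Split text into (base, [marks]) pairs, gluing combining marks to the base before them."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1][1].append(ch)
        else:
            out.append((ch, []))
    return out

def reverse_clusters(text):
    """Reverse by cluster, so each mark stays on its own base letter."""
    return "".join(base + "".join(marks) for base, marks in reversed(clusters(text)))

def count_letter(text, letter):
    """Count a base letter while ignoring modifiers, after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for base, _ in clusters(decomposed) if base == letter)

s = "amm\u034az"               # assumes the mark in the example is U+034A, which has no precomposed form
naive = s[::-1]                # reverses code points: the mark lands on the "z"
aware = reverse_clusters(s)    # keeps "m" + U+034A together

print(count_letter("s\u0161", "s"))          # 2
print(count_letter("ß".casefold(), "s"))     # 2
```

The "ß" case only works because of the explicit `casefold()`, which maps it to "ss"; whether that is the right answer is as debatable in code as it is in the comment above.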

> To have vector of characters (as what humans consider characters) is valuable.

That might be an awful can of worms. Are Arabic vowels characters? "ij" letter in Dutch? Would you separate Korean text into letters or treat each block of letters as a character?

I can answer the question on Korean. Treat each block of letters as a character. Never ever separate for human uses.

Can you really automate that problem? You could provide a "split at glyphs" function, but I doubt that would actually be useful without tons of caveats. Even English doesn't do split at glyphs well given the existence of ligatures. `flat -> fl a t`.

Not to mention you would need to make any such function language aware since different languages could theoretically have different mapping rules for the same sequence of characters.
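To make the ligature point concrete, here is a small Python sketch. It assumes the ligature in question is U+FB02 (LATIN SMALL LIGATURE FL) and uses compatibility normalization to expand it; this is just one possible mapping, and it knows nothing about language-specific rules.

```python
import unicodedata

ligature_word = "\ufb02at"     # U+FB02 followed by "at": renders like "flat"
expanded = unicodedata.normalize("NFKC", ligature_word)

print(len(ligature_word), len(expanded))   # 3 4
print(expanded == "flat")                  # True
```

So the same visible word can be three or four code points, and a naive "split at glyphs" would disagree with a naive "split at letters".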

Most of the post talks about how Windows made a poor design decision in choosing 16bit characters.

No debate there.

However, advocating "just make windows use UTF8" ignores the monumental engineering challenge and legacy back-compat issues.

In Windows most APIs have FunctionA and FunctionW versions, with FunctionA meaning legacy ASCII/ANSI and FunctionW meaning Unicode. You couldn't really fix this without adding a 3rd version that was truly UTF-8 without breaking lots of apps in subtle ways.

Likely it would also only be available to Windows 9 compatible apps if such a feature shipped.

No dev wanting to make money is going to ship software that only targets Windows 9, so the entire ask is tough to sell.

Still no debate on the theoretical merits of UTF-8 though.

Nothing worth doing is easy.

Anyways, the FunctionA/FunctionW is usually hidden behind a macro anyways (for better or worse). This could simply be yet another compiler option.

We "solved" (worked around? hacked?) this by creating a set of FunctionU macros and in some cases stubs that wrap all of the Windows entry points we use with incoming and outgoing converters. It's ugly under the hood and a bit slower than it needs to be, but the payoff of the app being able to consistently "think" in UTF-8 has been worth it.

Of course, we had to ditch resource-based string storage anyway for other cross-platform reasons, and were never particularly invested in the "Windows way" of doing things, so it wasn't a big shock to our developers when we made this change.
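The shape of such a wrapper layer can be sketched in a few lines of Python. This is purely illustrative and not the commenter's code: `make_u_wrapper` and `fake_function_w` are hypothetical names, and the "W" function here is a stand-in that just echoes its argument instead of calling Windows.

```python
def utf8_to_wide(data: bytes) -> bytes:
    """Convert UTF-8 bytes to a NUL-terminated UTF-16LE buffer."""
    return data.decode("utf-8").encode("utf-16-le") + b"\x00\x00"

def wide_to_utf8(buf: bytes) -> bytes:
    """Convert a NUL-terminated UTF-16LE buffer back to UTF-8 bytes."""
    text = buf.decode("utf-16-le")
    return text.split("\x00", 1)[0].encode("utf-8")

def make_u_wrapper(wide_func):
    """Wrap a function that takes and returns wide buffers so callers only see UTF-8."""
    def wrapper(utf8_arg: bytes) -> bytes:
        return wide_to_utf8(wide_func(utf8_to_wide(utf8_arg)))
    return wrapper

# Hypothetical stand-in for a real 'W' entry point; it just echoes its argument.
def fake_function_w(wide_buf: bytes) -> bytes:
    return wide_buf

function_u = make_u_wrapper(fake_function_w)
print(function_u("héllo".encode("utf-8")) == "héllo".encode("utf-8"))   # True
```

The cost mentioned above is visible here: every call pays for two conversions, in exchange for the rest of the app never touching UTF-16.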

Would be lovely if MS Office could export CSV to UTF-8, but nope.

Yes, I had issues with that in Excel 2010 too.

The only well known Microsoft application that can handle UTF-8 is notepad.exe (Win 7).

Export in UTF-16, open in Notepad and re-save as UTF-8.
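If you would rather script the re-save step than click through Notepad, a minimal Python sketch could look like this. It assumes the export is UTF-16 with a BOM; the file names and the helper `resave_as_utf8` are made up for the demo.

```python
import os
import tempfile

def resave_as_utf8(src_path, dst_path):
    """Read a UTF-16 file (the BOM decides byte order) and write it back out as UTF-8."""
    with open(src_path, "r", encoding="utf-16", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)

# Round-trip demo with a temporary file standing in for the export.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "export.txt")
    dst = os.path.join(tmp, "export-utf8.txt")
    with open(src, "w", encoding="utf-16", newline="") as f:
        f.write("name\tcity\r\nZoë\tMalmö\r\n")
    resave_as_utf8(src, dst)
    with open(dst, "rb") as f:
        converted = f.read()

print(converted.decode("utf-8").splitlines()[1])   # Zoë	Malmö
```

If the consumer expects a UTF-8 BOM, Python's `utf-8-sig` codec can be used for the output instead of `utf-8`.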

I admire and appreciate your concern for something that is misunderstood and ignored. However, this webpage took way too long to say what is so great about UTF-8.

Honestly, I think it is platform politics. *nix systems seem to prefer UTF-8, while UTF-16 is the default on Windows. Space and memory are cheap, so either encoding seems fine.

The bottom line is that UTF-8 is awkward to use on Windows, while UTF-16/wchar_t is awkward to use on Linux, simply because the core APIs make them so (there is no _wfopen function in glibc).

The other problem with UTF-16 is that it's much easier to pretend that 1 element = 1 character than with UTF-8.

It's not really politics. Microsoft made the choice for fixed-sized chars back when it was thought that 16 bits was enough for everyone. MS was at the forefront of internationalizing things, and probably still are. (Multilanguage support in Windows and Office is quite top class.)

Unfortunately, we need more than 16 bits of codepoints, so 16-bit chars is a waste and a bad decision with that insight. It seems unlikely that a fresh platform with no legacy requirements would choose a 16-bit encoding. Think of all the XML in Java and .NET - all of it nearly always ASCII, using up double the RAM for zero benefit. It sucks.

Was UTF-8 even around when Microsoft decided on 16-bit widechar?

Other platforms seem to have lucked out by not worrying as much about standardizing on a single charset, and then UTF-8 came in and solved the problems.

"Was UTF-8 even around when Microsoft decided on 16-bit widechar?"

No, Thompson's placemat is from September 1992 and NT 3.1 from July 1993, but development on NT started in November 1989 (http://en.wikipedia.org/wiki/Windows_NT#Development)

This summary is excellent, and concise:


I think it's unfortunate that it doesn't have more concrete examples. I think having more of those would really help strengthen their case, clarify their points, and make their arguments tangible and understandable to a much wider audience.

One instance where I really wish for examples: they mention characters, code points, code units, grapheme clusters, user-perceived characters, fonts, encoding schemes, multi-byte patterns, BE vs LE, BOM, .... while I kind of get some of these, I certainly don't understand all of them in detail, and so there's no way that I'll grasp the subtleties of their complicated interactions. Examples, even of simple things such as what actually gets saved to disk when I write out a string using UTF-8 encoding vs. UTF-16 -- especially when using higher codepoints, would be hugely beneficial for me.
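As one concrete example of the kind asked for here, this Python sketch prints the bytes that end up on disk for four sample characters in UTF-8 and in both byte orders of UTF-16. The sample characters are my own choice, not taken from the manifesto.

```python
samples = [
    "A",            # U+0041, ASCII
    "\u00e9",       # U+00E9 (é), Latin-1 range
    "\u20ac",       # U+20AC (€), still inside the BMP
    "\U0001d11e",   # U+1D11E (𝄞), outside the BMP
]

def hexdump(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        "| utf-8:", hexdump(ch.encode("utf-8")),
        "| utf-16-le:", hexdump(ch.encode("utf-16-le")),
        "| utf-16-be:", hexdump(ch.encode("utf-16-be")),
    )

# A BOM is just U+FEFF encoded at the start of the data.
bom_utf8 = "\ufeff".encode("utf-8")           # EF BB BF
bom_utf16_le = "\ufeff".encode("utf-16-le")   # FF FE
bom_utf16_be = "\ufeff".encode("utf-16-be")   # FE FF
```

The last sample is the interesting one: UTF-8 uses four bytes (F0 9D 84 9E), and UTF-16 also uses four bytes, as two 16-bit code units (D834 DD1E in big-endian order), so "one element = one character" fails in both.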

Windows is a horrible environment for UTF8 unless MS provides a special locale for it.

As things stand, you can choose to use UTF-8 internally in your app, but when you need to cooperate with other programs (over sockets or files), it's going to be confusing. Some will be sending you ANSI bytes and you will take them as UTF-8.
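A small Python sketch of that confusion, assuming "ANSI" means Windows-1252 (it could be any other legacy code page on a given machine):

```python
text = "é"
utf8_bytes = text.encode("utf-8")      # two bytes: C3 A9
ansi_bytes = text.encode("cp1252")     # one byte: E9

# UTF-8 bytes read as ANSI: decodes "successfully" into mojibake.
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)                        # Ã©

# ANSI bytes read as UTF-8: the lone 0xE9 is not valid UTF-8.
try:
    ansi_bytes.decode("utf-8")
    ansi_is_valid_utf8 = True
except UnicodeDecodeError:
    ansi_is_valid_utf8 = False
print(ansi_is_valid_utf8)              # False
```

The second direction at least fails loudly; the first one silently produces garbage, which is the more dangerous case when the sender's encoding is unknown.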

"Even though one can argue that source codes of programs, web pages and XML files, OS file names and other computer-to-computer text interfaces should never have existed, as long as they do exist, text is not only for human readers."

I'm a little confused by this statement. Can someone clarify?

I think the author wants to say that a) computers should use appropriate binary formats for communication between them (e.g. from a web server to a browser), b) they don’t and use plain text in many cases (e.g. HTML) and hence c) text is not only read by humans but also by computers.

I am not entirely sure whether that makes any sense, though.

Interesting. I came to the same conclusion myself a few years ago when converting a Windows app to Unicode. I store all strings as UTF-8, which enabled me to continue using strncpy, char[] etc. I convert to wchar_t only when I need to pass the string to Win32. I can even change from narrow to widechar dynamically. I use a global switch which tells me whether I am running in Unicode or not, and call the 'A' or 'W' version of the Win32 function, after converting to wchar_t if necessary.

In Javascript, it's UTF16. Also Java.

Can't speak for other of the top.

What is currently the best way of dealing with UTF-8 strings in a cross-platform manner? It sounds like widechars and std::string just won't cut it.

Yeah, you really need some specialized interface that encapsulates all the things you may need from your string processing, but can spit out a representation of that string in various encodings.

In line with the spirit of this article, that interface should use UTF-8 storage internally as well, but this should be transparent to the programmer anyway. Dealing with encoded strings directly is a recipe for heartache unless you're actually writing such a library.

A higher level language, honestly. Perl, Ruby, Python3 and Haskell have excellent crossplatform utf8 support, and I'd be amazed if OCaml didn't. But if you want to write C++ code that works on windows, you're in for some pain.

> there is a silent agreement that UTF-8 is the most correct encoding for Unicode on the planet Earth

But what about other planets? Is there a Unicode Astral Plane which may encode poorly in the future?

There are 13 Unicode astral planes (though only two of them have characters assigned so far), and they do indeed encode poorly in some environments: the planes other than plane 0, the basic multilingual plane, are informally known as "the astral planes".

There's 16 astral planes (U+1xxxx to U+10xxxx), of which 3 have characters assigned...

* plane 1 is the supplementary multilingual plane

* plane 2 is the supplementary ideographic plane

* plane E is the supplementary special-purpose plane

* planes F and 10 are private-use planes

Perhaps you wrote from memory.

Thanks for providing the correct details. I was indeed writing from memory.

Thank god for emoji.

They're very useful if you want a test case that requires multiple bytes in UTF-8 and multiple words in UTF-16.

Seriously, I wasn't being sarcastic! Emoji have been the single largest driving factor in proper unicode adoption I've seen—they're the first non-BMP characters to see wide-spread use.

IT is so Anglophile that programs can become slower if you deviate from ASCII...

But of course being so incredibly anglocentric is not an issue, at least that seems to be the consensus of the participants when I read discussions on the Web where all the people who are discussing it write English with such a proficiency that I can't tell who are and aren't native speakers of the language.

I'm Chinese American, and I don't agree with your statement that string libraries are Anglophile. How would you encapsulate the 10,000 commonly used Chinese characters? It's just the reality of having a lot of characters in a language. Not much else you can do to speed up processing. How would you design string storage to be faster for a language like Chinese?

English happens to be the lingua franca of Engineering. It's not about brown nosing English-speaking countries, but about getting the widest range of audience.

Let me ask you, with 10k commonly used characters doesn't that lead to shorter texts? Kind of like how higher base numbers can encode larger numbers with fewer digits, in that case the longer encoding of UTF-8 could be made up for by using fewer characters. Or am I wrong about this assumption?

As an example, suppose that there is one character that denotes the word 'house'; if that single character is encoded using five bytes, it takes the same amount of space as the English encoding.

That seems more than plausible to me. While the character 象 is two bytes longer than the character "f", it is five bytes shorter than "elephant".
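The arithmetic is easy to check in Python (a quick sketch, nothing more):

```python
words = ["f", "象", "elephant"]
utf8_sizes = {w: len(w.encode("utf-8")) for w in words}
utf16_sizes = {w: len(w.encode("utf-16-le")) for w in words}

print(utf8_sizes)    # {'f': 1, '象': 3, 'elephant': 8}
print(utf16_sizes)   # {'f': 2, '象': 2, 'elephant': 16}
```

So in UTF-8 the single character costs 3 bytes against 8 for the English word, and in UTF-16 it is 2 against 16.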

IIRC the average word length in English is around 5 characters.

I won't pretend to have a solution. I guess you would like to have some compression scheme, since I'm guessing it would save space over having 2-4 bytes (however many there are in the Chinese language) for every character. You won't gain much compared to having an English-centric encoding scheme, I guess, in the case of a language with a large amount of characters.

But it's still funny to me how even the computer who speaks in 1's and 0's favours English-centered notation.

> English happens to be the lingua franca of Engineering. It's not about brown nosing English-speaking countries, but about getting the widest range of audience.

Pragmatism ũber alles, chants the American. I guess I'm not impressed by the support of non-English languages in IT.

I have for that matter met engineering students who don't seem to speak a lick of English, maybe even people studying CS/CE.

> I won't pretend to have a solution.

See, here is the thing. At the end of the day, hypotheticals are worthless, concrete solutions are all that matters. It isn't anglocentricism that we picked the solution that is actually fleshed out and works over the vague hypothetical solution. It's "get-shit-done"-ism

> I guess you would like to have some compression scheme, since I'm guessing it would save space over having 2-4 bytes (however many there are in the Chinese language) for every character.

You are forgetting the pigeonhole principle: http://en.wikipedia.org/wiki/Pigeonhole_principle

You can of course compress a text[1] after encoding it, but that really is an unrelated topic. You can't get 10k possible characters into 8 bits, you need to go multi-byte.

> Pragmatism ũber alles, chants the American.

"Un bon mot ne prouve rien." -Voltaire

[1] _most_ text that you will see in practice. No lossless compression algorithm can compress _any_ possible text.

> See, here is the thing. At the end of the day, hypotheticals are worthless, concrete solutions are all that matters. It isn't anglocentricism that we picked the solution that is actually fleshed out and works over the vague hypothetical solution. It's "get-shit-done"-ism

"If you don't know of a solution yourself, shut up." Similar to "if you can't play guitar as well as <a player>, you don't get to have an opinion".

Admittedly in this context I might as well have thought I had something better to offer, given my original post. But as I've said, I don't. It was more of a historical note. I don't see how, given an alternative history, computers wouldn't favour for example the Russian alphabet.

And while we're at it, you might lecture me on how text/ASCII-centered protocols are superior to a binary format. Because I honestly don't know.

And the fact that IT is Anglo centric goes way beyond Shannon entropy.

> You are forgetting the pigeonhole principle: http://en.wikipedia.org/wiki/Pigeonhole_principle

Compressing as in something like Huffman encoding. Maybe I was misusing the names.

"If you don't know of a solution yourself, shut up." Similar to "if you can't play guitar as well as <a player>, you don't get to have an opinion".

I'm not telling you to shut up. I am telling you to not act offended that a tangible working solution was chosen over a hypothetical solution. In other words, don't act like the universe is unfair because Paul McCartney is famous for songwriting while you are not, even though you totally could have hypothetically written better songs.

> "I don't see how, given an alternative history, computers wouldn't favour for example the Russian alphabet."

In an alternative universe where CP1251 was picked as the basis of the first block in Unicode instead of ASCII, it would have been for the same reasons that ASCII was picked in this universe.

In that universe, you'd just be complaining that Unicode was Russo-centric.

What reason, in this universe, would there have been to go that route?

> Compressing as in something like Huffman encoding. Maybe I was misusing the names.


Huffman encoding is a method used for lossless compression of particular texts. It does not let you put more than 256 characters into a single byte in a character encoding.

The guys that made JIS X 0212 were not missing something when they made JIS X 0208, a two byte encoding, prior to Unicode.

> And the fact that IT is Anglo centric goes way beyond Shannon entropy.

Okay. Complain about instances where it actually exists, and in discussions where it is actually relevant.

> In that universe, you'd just be complaining that Unicode was Russo-centric.

Yes, obviously.

> It does not let you put more than 256 characters into a single byte in a character encoding.

Which I have never claimed. (EDIT: I think we're talking past each other: my point was that things like Huffman encoding encode the most frequent data with the fewest bits. I don't know how UTF-8 is implemented, but it seems conceptually similar. There is a reason that I didn't want to get anywhere near the nitty-gritty of this.)

> Okay. Complain about instances where it actually exists, and in discussions where it is actually relevant.

Oh yes, I will complain.

A character coding has an equal distribution of each code point. Each code point is represented once.

"For a set of symbols with a uniform probability distribution and a number of members which is a power of two, Huffman coding is equivalent to simple binary block encoding, e.g., ASCII coding."

Huffman encoding something written in Japanese is useful. It is not useful for creating a Japanese character set.

Get it?

If you don't buy it, then try it on pen and paper. Imagine a hypothetical 10-character alphabet, and try to devise an encoding that will let you fit it into a two-bit word, without going multi-word. Use prefix codes or whatever you want.

It's not going to happen. You also aren't going to get 10k characters into an 8-bit/word single-word character set.
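The pen-and-paper exercise boils down to counting, which a few lines of Python can do. The helper names `min_bits` and `fits` are invented for this sketch.

```python
def min_bits(alphabet_size: int) -> int:
    """Smallest fixed word width, in bits, that gives every symbol a distinct code."""
    return max(1, (alphabet_size - 1).bit_length())

def fits(alphabet_size: int, bits: int) -> bool:
    """Pigeonhole check: there are only 2**bits distinct words of that width."""
    return alphabet_size <= 2 ** bits

print(fits(10, 2), min_bits(10))          # False 4
print(fits(10_000, 8), min_bits(10_000))  # False 14
```

A two-bit word has only 4 distinct values, so 10 symbols cannot each get their own; likewise 10k characters need at least 14 bits, which in byte-oriented storage means going multi-byte.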

> IT is so Anglophile that programs can become slower if you deviate from ASCII...

Well, switching from a simple, 7-bit character set where all you need to do is parse and display byte by byte to a potentially multi-byte processing and a display lookup table that's several orders of magnitude larger... it's pretty easy to see why things can become slower.

Looking at the .NET parts of the manifesto, I just have to roll my eyes:

> Both C# and Java offer a 16 bit char type, which is less than a Unicode character, congratulations. The .NET indexer str[i] works in units of the internal representation, hence a leaky abstraction once again. Substring methods will happily return an invalid string, cutting a non-BMP character in parts.

While theoretically true, for most practical purposes, this reeks of a USA/American/English bias and lack of real world experience.

You know what? I want to know that the text "ØÆÅ" is three characters long. I don't want to know that it's a 6-byte array once encoded to UTF8. Anywhere in my code telling me this is 6 characters is lying, not to mention a violation of numerous business-requirements.

When I work with text I want to work with text and never the byte-stream it will eventually be encoded to. I want to work on top of an abstraction which lets me treat text as text.

Yes, there are cases where the abstraction will leak. But those cases are few and far between. And in all cases where it doesn't, it offers me numerous advantages over the PHPesque, amateurish and incorrect approach of treating everything as a dumb byte-array.

It's not. It's text in my program. It's text rendered on your screen. It's just a byte-array when we send it over the wire, so stop trying to pretend text isn't text.

This manifesto is wildly misguided.

I don't think you understood the statement you are quoting.

The manifesto is not in any way advocating that you use byte length as a substitute for string length. (It does argue that string length is not very commonly necessary, and has unclear semantics because of the multiple different definitions for "character", but these arguments are unrelated to the statement you quoted.)

Here's the point the author was making, which you missed. Take a non-BMP code point like '𝄞', which is U+1D11E. In UTF-16 this is represented by code units D834 and DD1E. If you try to use a "substring" operation to take the first "character" of this with a C#/.NET substring operation, you will get an invalid string, since D834 by itself is not a valid UTF-16 string.
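The same effect can be reproduced outside .NET. This Python sketch emulates a one-unit "substring" by slicing the UTF-16 code units by hand; it does not call any C# API, it only shows that the first code unit alone is not decodable.

```python
clef = "\U0001D11E"                   # the '𝄞' from the comment above
units = clef.encode("utf-16-be")      # two 16-bit code units: D834 DD1E
first_unit = units[:2]                # what a one-unit "substring" would keep

try:
    first_unit.decode("utf-16-be")
    valid = True
except UnicodeDecodeError:
    valid = False                     # a lone high surrogate is not valid UTF-16

print(units.hex(), valid)             # d834dd1e False
```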

> I want to work on top of an abstraction which lets me treat text as text.

If you think that UTF-16 will let you say string[i] and always get the i'th character, you are mistaken. That is one of the main points of the essay.

> Yes, their are cases where the abstraction will leak. But those cases are very far and few in between.

If you write your apps this way, then you don't really support Unicode, you just support the BMP, without combining characters.

One of the advantages of working in UTF-8 is that if your code is broken, you find out about it right away - as soon as someone enters an accented character. If your UTF-16 code is broken you might not find it in testing.

Be careful when flinging bias accusations around; the author is not actually from the US/UK [0].

In addition, while .NET methods will work correctly for "ØÆÅ", they have incorrect behavior when it comes to any characters that lie outside the BMP, which includes many CJK characters. So, .NET only meets your requirements if you never interact with such languages.

[0] http://www.utf8everywhere.org/#faq.anglophile

I fundamentally agree, but I don't think even you understand the implications of what you're saying. Though you say "characters", you're probably still thinking about Unicode codepoints here, not characters. A real character-oriented API would require an arbitrary number of bytes to store each character (a character being an arbitrary vector of codepoints), and would require a font to be associated to each region of codepoints--because it's the font's decision of how to render a sequence of codepoints, so it's the font that determines how many characters you get. (For an edge-case example: http://symbolset.com/)

...and saying that, it'd still be a good idea.

Yes. I've got a toy language that I've been working on that uses a rope to store characters internally.

Everything works on "logical characters" - arbitrary vectors of codepoints, as you say. There's still a number of edge cases I have yet to work out as to what exactly is considered a character, though. (I just added support for a single code point encoding multiple characters, for example.)

I'm not so sure that making things reliant on a font would be the best way to solve that, though. I'd intuitively say that there should be less coupling between rendering choices and internal encoding than that.

The twist is that each node in the rope can only store characters of the same physical length in bytes (and same number of logical characters per physical character). This means that in the typical case (most characters require the same number of bytes to encode) it doesn't add too much overhead. Still not something I would consider as the base String type for a lower-level language, though.

There are a few simple optimizations that I have yet to do, such as encoding smaller characters as what would ordinarily be invalid longer encodings where it makes sense (a single one-byte character in the middle of a bunch of two-byte characters, for example).

It seems to work fairly well, so far. Or at least it tends to give "common-sensical" results, and avoids a large chunk of worst-case behavior that standard "prettified character array" strings have.
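To illustrate the "same physical length per node" idea, here is a toy Python sketch. It is not the commenter's implementation: it uses a flat list of runs instead of a real rope, works on code points instead of "logical characters", and assumes UTF-8 as the physical encoding. All names are hypothetical.

```python
from itertools import groupby

def width(ch: str) -> int:
    """Number of bytes this code point takes in UTF-8."""
    return len(ch.encode("utf-8"))

def equal_width_runs(text: str):
    """Split text into (width, run) pairs where every code point in a run has the same UTF-8 width."""
    return [(w, "".join(group)) for w, group in groupby(text, key=width)]

def char_at(runs, index: int) -> str:
    """Find the index-th code point by skipping whole runs."""
    for _, run in runs:
        if index < len(run):
            return run[index]
        index -= len(run)
    raise IndexError("index out of range")

runs = equal_width_runs("aé€€b")
print(runs)               # [(1, 'a'), (2, 'é'), (3, '€€'), (1, 'b')]
print(char_at(runs, 3))   # €
```

Within one run, the byte offset of the n-th character is simply n times the run's width, which is where the cheap indexing would come from.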

When I learned of the difficulty mapping between code points and characters, I realized that Unicode is a standard nobody will ever (knowingly) implement correctly. Even if everyone has access to a font API, there'll probably be bugs in the fonts for all eternity.

(I half-considered adding this to my comment above, but it didn't quite fit.)

If we were serious about a character-oriented API, we definitely wouldn't want to introduce character rendering rules into places like the kernel. But I don't think we'd necessarily have to.

The best solution, I think, would be to decompose fonts into two pieces:

1. a character map (a mapping from parameterized codepoint sequences to single entities known to the font),

and 2. a graphemes file (the way to actually draw each character.)

The graphemes file would be what people would continue to think of as "the font." And the graphemes file would specify the character map it uses, in much the same way an XML/SGML document specifies a DTD.

As with DTDs, the text library built into the OS would have a well-known core set of character maps built in, and allow others to be retrieved and cached when referenced by URL. The core set would become treated something like root CAs or timezone data are now: bundled artifacts that get updated pretty frequently by the package manager.

They're not arguing that the length [of the bytestream] is what the user cares most about. I thought they covered this quite well in the "Myths" section on "counting coded characters or codepoints is important."[0]

From that section they explain that for text manipulation [e.g: cursor position & manipulation of text under the cursor] the programmer should be counting grapheme clusters; whereas for storage [memory & disk] concerns the programmer _should_ care about the number of codepoints.

They go on to say that counting the number of characters is up to the rendering engine and is completely unrelated to the number of codepoints.

I don't think the _manifesto itself_ is wildly misguided. In that section they were merely pointing out _how .NET and Java currently report string length._

At least as I read the manifesto: they seem to believe that counting codepoints is useful _but orthogonal_ to counting characters.

[0]: http://www.utf8everywhere.org/#myth.strlen

Quite the contrary: if you are treating text as anything but a dumb byte-array, you are probably doing it wrong. Unicode elements are not characters, but code points. Your example text "ØÆÅ" could be legitimately represented as either three or four code points, depending on whether the Å character is precomposed (U+00C5) or decomposed (U+0041 U+030A).
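The three-versus-four point is easy to see with Python's standard `unicodedata` module (a quick sketch):

```python
import unicodedata

text = "ØÆÅ"
nfc = unicodedata.normalize("NFC", text)   # Å as U+00C5
nfd = unicodedata.normalize("NFD", text)   # Å as U+0041 U+030A

print(len(nfc), len(nfc.encode("utf-8")))  # 3 6
print(len(nfd), len(nfd.encode("utf-8")))  # 4 7
print(nfc == nfd)                          # False
```

Both forms render identically, yet they differ in code point count, byte count, and naive equality, so "length" has to be defined before it can be measured.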

In what scenario, exactly, do you want to know that the text "ØÆÅ" is three characters long? That's almost never a useful question to ask or a useful answer to have.

Lexicographic ordering and sorting; edit distance; spelling correction (really, anything involving prefixes/tries)...

According to the article, UTF-8 sorts lexicographically the same as UTF-32, implying that there is no need to know the length or character boundaries for ordering/sorting.
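A quick Python check of that claim, with UTF-16 added for contrast. The sample strings are my own; Python compares `str` values by code point, which stands in for UTF-32 order here.

```python
words = ["\uffe5", "\U00010000", "z", "é"]

by_code_point = sorted(words)                                   # code point order
by_utf8 = sorted(words, key=lambda w: w.encode("utf-8"))        # raw UTF-8 byte order
by_utf16 = sorted(words, key=lambda w: w.encode("utf-16-be"))   # raw UTF-16 code unit order

print(by_utf8 == by_code_point)    # True
print(by_utf16 == by_code_point)   # False
```

UTF-16 disagrees because U+10000 is stored as surrogates starting with D800, which sorts before U+FFE5. Note that this is plain binary ordering; it says nothing about language-aware collation.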

The others would be good reasons though.

All of those need pretty deep knowledge of Unicode in general and the language you're working with in particular. Counting "characters" (really, code points) is going to be the least of your worries.

When writing wc?

Well, he doesn't say that one should use byte count instead of code point count as string length. That's a strawman and not in the manifesto. What he is saying though is that one should prefer to count grapheme clusters instead of characters since the definition of the latter is incredibly vague.

I don't understand how what you said contradicts the points in the manifesto. C# strings _try_ to look like they show you the characters and not bytes, but fail in that regard.

How many characters is "é́́"? To our eyes, one. But it's actually four code points (the e and three acute modifiers). This is the case whether you're using UTF-8 or UTF-16.

Regardless of the encoding you choose, you need to understand how it represents characters.

Also FYI, most programs don't display text on the screen. To most programs, text is a sequence of bytes to be shuffled around, and nothing more. Display and text manipulation is the minority case.
