VLC has a very large number of users on Windows (80% of our users), yet almost none of the devs use Windows to code. Therefore, we use UTF-8 char* everywhere, notably in the core. We use UTF-16 conversions only in the Windows-specific modules that use Windows APIs. Being sure we were UTF-8 everywhere took a lot of time, tbh...
But the worst are formats like ASF (WMV) or MMS that use UTF-16/UCS-2 (without correctly specifying them) and that we need to support on all the other platforms, like OS X or Linux...
Then we could put all the human language problems into a human text type, and leave the simpler computer string type with easier semantics.
In Python, although there are no tools for that, I typically use the following convention: single quotes for computer text and double quotes for human text. I guess you could use byte arrays for computer text as well, but it would be more painful.
Data.Text.pack "foo" :: Text
Data.ByteString.pack "foo" :: ByteString
Data.Text.Encoding.encodeUtf8 :: Text -> ByteString
All that goes away if your protocol is standardized on utf8. Then text is text and bytes is bytes.
Similarly, people want to iterate over a string character by character, or take substrings by range, but with Unicode text that becomes iteration over code points and ranges of code points (unless you go all the way and use a text rendering system to give you grapheme clusters). Code points can be decomposed diacritic marks, etc., so you can't just blindly insert or change code points at a certain index, or take arbitrary substrings, without risking breaking the string (you can end up with accents on characters you didn't intend, or stranded at the end of a string, and probably plenty of other kinds of breakage I can't even think of). Functionality exists to deal with all this, but it's pretty burdensome (e.g. NSString has -rangeOfComposedCharacterSequencesForRange:).
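To make that concrete, here's a minimal Python sketch (nothing grapheme-aware, just naive code-point slicing) of the kind of breakage I mean:

    s = "Cafe\u0301"   # "Café" written with a combining acute accent (U+0301)
    print(len(s))      # 5 code points, though a reader sees 4 characters
    print(s[:4])       # "Cafe" -- slicing by code point silently drops the accent
    print(s[4:])       # just the stranded combining mark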
That all adds up to a pretty hefty performance penalty as well as potential layering violations (needing to consider fonts and rendering when parsing some protocol if you really are going to treat strings as a sequence of grapheme clusters).
Though I think it would be useful to think about it as a sort of subtype of human string, with a default encoding of UTF-8. So substitution or concatenation of human and computer strings would yield a human string (which is where these languages usually fall short, because you need explicit conversion; it doesn't work like e.g. integers and floats).
In Go, strings are immutable UTF-8 byte arrays, and the language provides facilities for iterating over them either byte by byte or rune by rune (a rune is an int32, wide enough to hold any Unicode character).
E.g., there are two ways to write Cañyon City. You can write the ñ as U+00F1, or as an ASCII lower-case n followed by a combining tilde (U+0303). The first case results in a single rune, and the second in two runes. You need additional logic in order to normalize to a canonical representation and realize that the two strings are actually the same.
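A quick Python sketch of that, using the standard unicodedata module (Go has a normalization package that does the equivalent):

    import unicodedata

    a = "Ca\u00f1yon City"    # precomposed ñ (U+00F1)
    b = "Can\u0303yon City"   # 'n' followed by a combining tilde (U+0303)
    print(a == b)             # False: different code point sequences
    print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))   # True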
Also, if you are displaying the string, you need to account for the fact that, although the two strings have different byte and rune lengths, they take up exactly the same number of pixels on your display medium.
Who thought that having two ways to go about this was a good idea in the first place?
Take a simple example of an app that generates a bunch of logs that need to be displayed to the user. If you were to follow the article's recommendations, you'd have these logs generated and stored in UTF-8. Then, only when they are about to be displayed on the screen, would you convert them to UTF-16. Now, say you have a custom control that renders log entries. Furthermore, let's imagine a user who sits there and hits PgUp, PgDown, PgUp, PgDown repeatedly.
On every keypress the app will run a bunch of strings through MultiByteToWideChar() to do the conversion (and whatever other fluff comes with any boost/STL wrappers), feed the result to DrawText() and then discard the wstrings, triggering a bunch of heap operations along the way. And you'd better hope the latter doesn't cause heap wobble across a defrag threshold.
Is your code as sublime as it gets? Check. Does it look like it's written by over-enlightened purists? You bet. Just look at this "advice" from the page -
If those strings are for the user to read, he's reading a million times slower than you can handle even the most ornate re-encoding. Sounds like a premature optimization.
Or maybe it's just me who is weird. I grew up on gamedev, so I feel bad when writing something obviously slow, that could be sped up if one spent 15 minutes more of thinking/coding on it.
Computers are fast, you don't have to coddle them. Never do any kind of optimization that reduces readability without concrete proof that it will actually make a difference.
15 minutes spent optimizing code that takes up 0.1% of a program's time are 15 wasted minutes that probably made your program worse.
Additionally: "Even good programmers are very good at constructing performance arguments that end up being wrong, so the best programmers prefer profilers and test cases to speculation."(Martin Fowler)
This mentality is exactly why Windows feels sluggish in comparison to Linux on the same hardware. Being careless with the code and unceremoniously relying on spare (and frequently assumed) hardware capacity is certainly a way to do things. I'm sure it makes a lot of business sense, but is it good engineering? It's not.
Making code efficient is not a virtue in its own right. If you want performance, set measurable goals and optimize the parts of the code that actually help you achieve those goals. Compulsively optimizing everything will just waste a lot of time, lead to unmaintainable code and quite often not actually yield good performance, because bottlenecks can (and often do) hide in places where bytes-and-cycles OCD overlooks them.
My point is that the "hardware can handle it" mantra is a tell-tale sign of a developer who is more concerned with his own comfort than anything else. It's someone who's content with not pushing himself, and that's just wrong.
Can you guess why I bring this up?
Because that's exactly the kind of mess that spawns from the "oh, it's not a big overhead" assumption. Little by little, crap accumulates and solidifies, and you end up with this massive pile of shitty, negligent code that is impossible to improve or refactor. All because of that one little assumption.
Also, being aware of different ways code can be slow (from things dependent on programming language of choice to low-level stuff like page faults and cache misses) can make you produce faster code by default, because the optimized code is the intuitive one for you.
Still, I think there's a gap between "fast enough and doesn't suck" and "customers angry enough to warrant optimization". It's especially visible in the smartphone market, where the cheaper ones sometimes can't even handle their own operating system, not to mention the bloated apps. For me it's one of the problems with businesses: there's no good way to incentivize them to stop producing barely-good-enough crap and deliver something with decent quality.
If you got into the 1%+ range, I could see justifying some attention to speed, but otherwise...
> Do not use _T("") or L"" literals in any place other than parameters to APIs accepting UTF-16.
Except for that, you do make a good point. It's probably better to store some strings in memory according to the platform requirements, if the program can be shown to exhibit delays caused by string conversions.
And obviously it supports arbitrary text-encodings, although sometimes you will need to be explicit about it.
If you used the simplified wizards, all the options may not have been there, but you should have been given the option to export/save the job as a package, and then you can open, modify, test and debug that before running the job for real.
Seriously. SQL Server has some immensely kick-ass and über-capable tooling compared to pretty much every other database out there.
To even suggest it doesn't support UTF8 is ludicrous.
So why would someone even use bcp instead of SSIS? SSIS might be nice for performing repeated imports of data that has a fixed format, but for quick and dirty exports/imports it's really frustrating to use. It's not even smart enough to scan an entire data file and suggest appropriate field lengths and formats. Every single time I try to import a .csv file it craps out and doesn't even show where the error occurred - and that's after clicking through a bunch of steps in a GUI. At least with bcp you can easily rerun the import/export from the command line.
SQL Server and SQL Server Management Studio are generally great, but I would not include SSIS when lauding them.
char [ ( n ) ]
Fixed-length, non-Unicode string data.
Character data types that are either fixed-length, nchar, or variable-length, nvarchar, Unicode data and use the UNICODE UCS-2 character set.
So, (var)char is "non-Unicode", and n(var)char is UCS-2 only.
That is in agreement with http://blogs.msdn.com/b/qingsongyao/archive/2009/04/10/sql-s..., which claims the glass is half full ("In summary, SQL Server DOES support storing all Unicode characters; although it has its own limitation.")
On the other hand, we have http://msdn.microsoft.com/en-us/library/ms143726.aspx, which seems to state that SQL Server 2012 has proper Unicode collations. UTF-8 is still nowhere to be found, though.
If you want to treat data as a stream of bytes, hardcore UTF8-and-PHPesque style (this function is "binary safe", woo), with no regard to the actual text involved, feel free to store it as bytes. SQL Server supports that.
If you want to store it as unicode text feel free to use the ntext and nvarchar types. I'm pretty sure that's what you intend to do anyway, even though you insist on calling it UTF8.
SSIS is insanely powerful and performant, but it's also insanely cumbersome and script-unfriendly. Microsoft has finally started embracing the power of plain text and scriptable tools in their web-stack and .NET, but SSIS represents a holdover from their heavyweight GUI-and-wizard days.
That said, if you are doing repeatable jobs (and not just one-off imports) you can still create an SSIS package, and then run the package from your script using the package runner and appropriate config data.
In fact, UTF-16 doesn't really have the 1 million character limit: by using the two private-use planes (F and 10) as 2nd-tier surrogates, we can encode all 4-byte sequences of UCS-4, and all those in the original UTF-8 proposal.
I suspect the reason is more political than technical. unicode.org (http://www.unicode.org/faq/utf_bom.html#utf16-6) says "Both Unicode and ISO 10646 have policies in place that formally limit future code assignment to the integer range that can be expressed with current UTF-16 (0 to 1,114,111). Even if other encoding forms (i.e. other UTFs) can represent larger integers, these policies mean that all encoding forms will always represent the same set of characters. Over a million possible codes is far more than enough for the goal of Unicode of encoding characters, not glyphs. Unicode is not designed to encode arbitrary data."
I'm simply saying that UTF-8 shouldn't be crippled in the Unicode/ISO spec to 21 bits, but be extended to 31 bits as originally designed because the technical reason given (i.e. because UTF-16 is only 21 bits) isn't actually true. The extra space should be assigned as more private use characters. (Except of course the last two codepoints in each extra plane would be nonchars as at present, and probably also the entire last 2 planes if the 2nd-tier "high surrogates" finish at the end of a plane.)
Maybe there'd be a shot at getting developers to switch if the Windows GUIs/native APIs would render Unicode text presented in UTF-8. But right now, it's back to encoding/decoding.
EDIT: I misunderstood the intent of the comment I was responding to. JS uses (unbeknownst to me) UTF-16 as its internal representation of strings.
Within the JS language, strings are represented as sort-of-UCS-2, sort-of-UTF-16. This is one of the few problems with JS that I think merits a backwards-compatibility-breaking change.
Minor correction: MySQL was Swedish and InnoDB Finnish.
My example broke the bug tracker's (bugzilla) comment system as well. I chuckled.
std::string is missing a lot of functionality one tends to need when dealing with strings (such as iterating over UTF-8 characters or fast conversion to UTF-16, but also things like search-and-replace). And it makes me sad that I can't use that string library any more (legally) because of the license PlayFirst insisted on using (no redistribution).
As far as I'm concerned, though, there IS no good string library available for use anywhere. I've looked at all of the ones I could find, and they're all broken in some fundamental way. I guess solving the "string problem" isn't sexy enough for someone to release a library that actually hits all the pain points.
I've used them over the summer but nothing felt broken beyond the general C++ verbosity. Granted most of my prior regex work was in Perl.
I actually extremely rarely need full reg-ex support. Almost never, really. What WOULD be awesome is limited pattern support, at the level of Lua patterns, especially if they were UTF-8 character aware.
 http://www.lua.org/manual/5.1/manual.html#5.4.1 -- Lua patterns are NOT regular expressions, even though they look similar; there's no "expression" possible, just character class repeats.
I would like to have at least two options in memory: UTF-8 and a vector of displayed characters (there are many combinations in use in existing modern languages with no single-character representation in UTF-<anything>).
Usually all you care about is the rendered size, which your rendering engine should be able to tell you. No need to be able to pick out those characters in most situations.
The acid test for this sort of 'humane string' type would be whether you could splice together any two substrings from any two input strings and get something that could be validly displayed. UTF-8 bytes fail because you can get fractional codepoints. Codepoints in >=20-bit integers fail because you can get modifier characters which don't attach to anything.
A similar test would be whether you can reverse any input string by reversing the sequence of units. For example, reversing "amm͊z" should yield "zm͊ma", which it doesn't in unicode, because "m͊" is made with a combining mark, and doesn't have a composed form.
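A naive code-point reversal shows the breakage directly (Python sketch):

    s = "am" + "m\u034a" + "z"   # "amm͊z": the second m carries a combining mark (U+034A)
    print(s[::-1])               # reversed by code point, the mark now sits on the z: "z͊mma"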
For extra fun, i suspect that reversing the string "œ" should yield "eo".
It should also be simple to do things like search for particular characters in a modifier-insensitive way. For example, i should be able to count that "sš" contains two copies of the letter 's' without having to do any deciphering. I suspect i should also be able to count that the string "ß" contains two copies of the letter 's', but i'm not nearly as sure about that.
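The s/š case falls out of canonical decomposition today; the ß case is handled by case folding rather than decomposition. A Python sketch of both, for what it's worth:

    import unicodedata

    s = "s\u0161"                                        # "sš"
    print(s.count("s"))                                  # 1: š is its own code point
    print(unicodedata.normalize("NFD", s).count("s"))    # 2: š decomposes to s + combining caron
    print("\u00df".casefold())                           # "ss": ß case-folds to two letters s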
I think i essentially want a string that looks like:
But i'm not sure. And i'm even less sure about how i'd encode it efficiently.
That might be an awful can of worms. Are Arabic vowels characters? "ij" letter in Dutch? Would you separate Korean text into letters or treat each block of letters as a character?
Not to mention you would need to make any such function language aware since different languages could theoretically have different mapping rules for the same sequence of characters.
No debate there.
However, advocating "just make windows use UTF8" ignores the monumental engineering challenge and legacy back-compat issues.
In Windows most APIs have FunctionA and FunctionW versions, with FunctionA meaning legacy ASCII/ANSI and FunctionW meaning Unicode. You couldn't really fix this without adding a 3rd version that was truly UTF-8 without breaking lots of apps in subtle ways.
Likely it would also only be available to Windows 9 compatible apps if such a feature shipped.
No dev wanting to make money is going to ship software that only targets Windows 9, so the entire ask is tough to sell.
Still no debate on the theoretical merits of UTF-8 though.
Anyway, the FunctionA/FunctionW choice is usually hidden behind a macro (for better or worse). This could simply be yet another compiler option.
Of course, we had to ditch resource-based string storage anyway for other cross-platform reasons, and were never particularly invested in the "Windows way" of doing things, so it wasn't a big shock to our developers when we made this change.
The only well known Microsoft application that can handle UTF-8 is notepad.exe (Win 7).
The bottom line is that UTF-8 is awkward to use on Windows, while UTF-16/wchar_t is awkward to use on Linux, simply because the core APIs make them so (there is no _wfopen function in glibc).
Unfortunately, we need more than 16 bits of code points, so 16-bit chars are a waste and, in hindsight, a bad decision. It seems unlikely that a fresh platform with no legacy requirements would choose a 16-bit encoding. Think of all the XML in Java and .NET - nearly all of it pure ASCII, using up double the RAM for zero benefit. It sucks.
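A quick way to see that overhead (Python sketch):

    xml = '<config><option name="verbose">true</option></config>'   # typical ASCII-only markup
    print(len(xml.encode("utf-8")))      # one byte per character
    print(len(xml.encode("utf-16-le")))  # exactly twice as many bytes, for zero benefit here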
Was UTF-8 even around when Microsoft decided on 16-bit widechar?
Other platforms seem to have lucked out by not worrying as much about standardizing on a single charset, and then UTF-8 came along and solved the problem.
No, Thompson's placemat is from September 1992 and NT 3.1 from July 1993, but development on NT started in November 1989 (http://en.wikipedia.org/wiki/Windows_NT#Development)
One instance where I really wish for examples: they mention characters, code points, code units, grapheme clusters, user-perceived characters, fonts, encoding schemes, multi-byte patterns, BE vs LE, BOM, .... while I kind of get some of these, I certainly don't understand all of them in detail, and so there's no way that I'll grasp the subtleties of their complicated interactions. Examples, even of simple things such as what actually gets saved to disk when I write out a string using UTF-8 encoding vs. UTF-16 -- especially when using higher codepoints, would be hugely beneficial for me.
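Agreed, examples would help a lot. For the "what actually ends up on disk" part, here's a small Python sketch with one BMP character and one higher code point:

    for ch in ("é", "𝄞"):                      # U+00E9, and U+1D11E (outside the BMP)
        print(hex(ord(ch)),
              ch.encode("utf-8").hex(),        # c3a9 and f09d849e
              ch.encode("utf-16-be").hex())    # 00e9 and d834dd1e (a surrogate pair)
    # A file saved as UTF-16 usually also starts with a BOM: FF FE (little-endian) or FE FF (big-endian).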
As things stand, you can choose to use UTF-8 internally in your app, but when you need to cooperate with other programs (over sockets or files), it's going to be confusing. Some will be sending you ANSI bytes and you'll take them as UTF-8.
I'm a little confused by this statement. Can someone clarify?
I am not entirely sure whether that makes any sense, though.
Can't speak for the others off the top of my head.
In line with the spirit of this article, that interface should use UTF-8 storage internally as well, but this should be transparent to the programmer anyway. Dealing with encoded strings directly is a recipe for heartache unless you're actually writing such a library.
But what about other planets? Is there a Unicode Astral Plane which may encode poorly in the future?
* plane 1 is the supplementary multilingual plane
* plane 2 is the supplementary ideographic plane
* plane E is the supplementary special-purpose plane
* planes F and 10 are private-use planes
Perhaps you wrote from memory.
But of course being so incredibly anglocentric is not an issue, at least that seems to be the consensus of the participants when I read discussions on the Web where all the people who are discussing it write English with such a proficiency that I can't tell who are and aren't native speakers of the language.
English happens to be the lingua franca of Engineering. It's not about brown nosing English-speaking countries, but about getting the widest range of audience.
As an example, suppose that there is one character that denotes the word 'house'; if that single character is encoded using five bytes, it takes the same amount of space as the English encoding.
IIRC the average word length in English is around 5 characters.
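For a rough real-world data point (Python; note that UTF-8 never uses more than 4 bytes per code point, so the five-byte case is hypothetical):

    print(len("house".encode("utf-8")))    # 5 bytes
    print(len("家".encode("utf-8")))       # 3 bytes for the CJK character meaning house/family
    print(len("家".encode("utf-16-le")))   # 2 bytes in UTF-16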
But it's still funny to me how even the computer, which speaks in 1s and 0s, favours English-centered notation.
> English happens to be the lingua franca of Engineering. It's not about brown nosing English-speaking countries, but about getting the widest range of audience.
Pragmatism ũber alles, chants the American. I guess I'm not impressed by the support of non-English languages in IT.
I have for that matter met engineering students who don't seem to speak a lick of English, maybe even people studying CS/CE.
See, here is the thing. At the end of the day, hypotheticals are worthless; concrete solutions are all that matters. It isn't anglocentrism that made us pick the solution that is actually fleshed out and works over the vague hypothetical one. It's "get-shit-done"-ism.
> I guess you would like to have some compression scheme, since I'm guessing it would save space over having 2-4 bytes (however many there are in the Chinese language) for every character.
You are forgetting the pigeonhole principle: http://en.wikipedia.org/wiki/Pigeonhole_principle
You can of course compress a text after encoding it, but that really is an unrelated topic. You can't get 10k possible characters into 8 bits, you need to go multi-byte.
> Pragmatism ũber alles, chants the American.
"Un bon mot ne prouve rien." -Voltaire
 _most_ text that you will see in practice. No lossless compression algorithm can compress _any_ possible text.
"If you don't know of a solution yourself, shut up." Similar to "if you can't play guitar as well as <a player>, you don't get to have an opinion".
Admittedly in this context I might as well have thought I had something better to offer, given my original post. But as I've said, I don't. It was more of a historical note. I don't see how, given an alternative history, computers wouldn't favour for example the Russian alphabet.
And while we're at it, you might lecture me on how text/ASCII-centered protocols are superior to a binary format. Because I honestly don't know.
And the fact that IT is Anglo centric goes way beyond Shannon entropy.
> You are forgetting the pigeonhole principle: http://en.wikipedia.org/wiki/Pigeonhole_principle
Compressing as in something like Huffman encoding. Maybe I was misusing the names.
I'm not telling you to shut up. I am telling you to not act offended that a tangible working solution was chosen over a hypothetical solution. In other words, don't act like the universe is unfair because Paul McCartney is famous for songwriting while you are not, even though you totally could have hypothetically written better songs.
> "I don't see how, given an alternative history, computers wouldn't favour for example the Russian alphabet."
In an alternative universe where CP1251 was picked as the basis of the first block in Unicode instead of ASCII, it would have been for the same reasons that ASCII was picked in this universe.
In that universe, you'd just be complaining that Unicode was Russo-centric.
What reason, in this universe, would there have been to go that route?
> Compressing as in something like Huffman encoding. Maybe I was misusing the names.
Huffman encoding is a method used for lossless compression of particular texts. It does not let you put more than 256 characters into a single byte in a character encoding.
The guys that made JIS X 0212 were not missing something when they made JIS X 0208, a two byte encoding, prior to Unicode.
> And the fact that IT is Anglo centric goes way beyond Shannon entropy.
Okay. Complain about instances where it actually exists, and in discussions where it is actually relevant.
> It does not let you put more than 256 characters into a single byte in a character encoding.
Which I have never claimed. (EDIT: I think we're talking past each other: my point was that things like Huffman encoding encode the most frequent data with the fewest bits. I don't know how UTF-8 is implemented, but it seems conceptually similar. There is a reason that I didn't want to get anywhere near the nitty-gritty of this.)
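For what it's worth, the similarity only goes so far: UTF-8's width is determined by the code point's numeric value (ASCII gets 1 byte, higher code points get 2-4), not by measured frequencies the way Huffman codes are. A quick Python illustration:

    for ch in ("a", "é", "中", "𝄞"):
        print(ch, len(ch.encode("utf-8")), "byte(s)")   # 1, 2, 3 and 4 bytes respectively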
> Okay. Complain about instances where it actually exists, and in discussions where it is actually relevant.
Oh yes, I will complain.
"For a set of symbols with a uniform probability distribution and a number of members which is a power of two, Huffman coding is equivalent to simple binary block encoding, e.g., ASCII coding."
Huffman encoding something written in Japanese is useful. It is not useful for creating a Japanese character set.
If you don't buy it, then try it on pen and paper. Imagine a hypothetical 10-character alphabet, and try to devise an encoding that will let you fit it into a two-bit word, without going multi-word. Use prefix codes or whatever you want.
It's not going to happen. You also aren't going to get 10k characters into an 8-bit/word single-word character set.
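The pen-and-paper exercise, in a couple of lines of Python:

    from itertools import product

    print(["".join(bits) for bits in product("01", repeat=2)])
    # ['00', '01', '10', '11']: only 4 distinct 2-bit words, so 10 characters can't each get their own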
Well, switching from a simple, 7-bit character set where all you need to do is parse and display byte by byte to a potentially multi-byte processing and a display lookup table that's several orders of magnitude larger... it's pretty easy to see why things can become slower.
Both C# and Java offer a 16-bit char type, which is less than a Unicode character, congratulations. The .NET indexer str[i] works in units of the internal representation, hence a leaky abstraction once again. Substring methods will happily return an invalid string, cutting a non-BMP character in parts.
While theoretically true, for most practical purposes, this reeks of a USA/American/English bias and lack of real world experience.
You know what? I want to know that the text "ØÆÅ" is three characters long. I don't want to know that it's a 6-byte array once encoded to UTF-8. Anything in my code telling me this is 6 characters is lying, not to mention a violation of numerous business requirements.
When I work with text I want to work with text and never the byte-stream it will eventually be encoded to. I want to work on top of an abstraction which lets me treat text as text.
Yes, there are cases where the abstraction will leak. But those cases are very few and far between. And in all cases where it doesn't, it offers me numerous advantages over the PHPesque, amateurish and incorrect approach of treating everything as a dumb byte array.
It's not. It's text in my program. It's text rendered on your screen. It's just a byte-array when we send it over the wire, so stop trying to pretend text isn't text.
This manifesto is wildly misguided.
The manifesto is not in any way advocating that you use byte length as a substitute for string length. (It does argue that string length is not very commonly necessary, and has unclear semantics because of the multiple different definitions for "character", but these arguments are unrelated to the statement you quoted.)
Here's the point the author was making, which you missed. Take a non-BMP code point like '𝄞', which is U+1D11E. In UTF-16 this is represented by the code units D834 and DD1E. If you use a C#/.NET substring operation to take the first "character" of this, you will get an invalid string, since D834 by itself is not valid UTF-16.
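You can see the invalid result directly if you operate on the UTF-16 code units (a Python sketch of the same substring operation):

    s = "\U0001D11E"                       # '𝄞', U+1D11E
    units = s.encode("utf-16-be")          # d8 34 dd 1e: two code units, i.e. a surrogate pair
    try:
        units[:2].decode("utf-16-be")      # take the "first character" by code-unit count
    except UnicodeDecodeError as err:
        print("not valid UTF-16:", err)    # an unpaired high surrogate is not a valid string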
> I want to work on top of an abstraction which lets me treat text as text.
If you think that UTF-16 will let you say string[i] and always get the i'th character, you are mistaken. That is one of the main points of the essay.
> Yes, there are cases where the abstraction will leak. But those cases are very few and far between.
If you write your apps this way, then you don't really support Unicode, you just support the BMP, without combining characters.
In addition, while .NET methods will work correctly for "ØÆÅ", they have incorrect behavior when it comes to any characters that lie outside the BMP, which includes many CJK characters. So, .NET only meets your requirements if you never interact with such languages.
...and saying that, it'd still be a good idea.
Everything works on "logical characters" - arbitrary vectors of codepoints, as you say. There's still a number of edge cases I have yet to work out as to what exactly is considered a character, though. (I just added support for a single code point encoding multiple characters, for example.)
I'm not so sure that making things reliant on a font would be the best way to solve that, though. I'd intuitively say that there should be less coupling between rendering choices and internal encoding than that.
The twist is that each node in the rope can only store characters of the same physical length in bytes (and same number of logical characters per physical character). This means that in the typical case (most characters require the same number of bytes to encode) it doesn't add too too much overhead. Still not something I would consider as the base String type for a lower-level language, though.
There are a few simple optimizations that I have yet to do (encode smaller characters as what would ordinarily be invalid longer encodings, if it makes sense (a single one-byte character in the middle of a bunch of two-byte characters, for example), that sort of thing).
It seems to work fairly well, so far. Or at least it tends to give "common-sensical" results, and avoids a large chunk of worst-case behavior that standard "prettified character array" strings have.
If we were serious about a character-oriented API, we definitely wouldn't want to introduce character rendering rules into places like the kernel. But I don't think we'd necessarily have to.
The best solution, I think, would be to decompose fonts into two pieces:
1. a character map (a mapping from parameterized codepoint sequences to single entities known to the font),
and 2. a graphemes file (the way to actually draw each character.)
The graphemes file would be what people would continue to think of as "the font." And the graphemes file would specify the character map it uses, in much the same way an XML/SGML document specifies a DTD.
As with DTDs, the text library built into the OS would have a well-known core set of character maps built in, and allow others to be retrieved and cached when referenced by URL. The core set would become treated something like root CAs or timezone data are now: bundled artifacts that get updated pretty frequently by the package manager.
From that section they explain that for text manipulation [e.g: cursor position & manipulation of text under the cursor] the programmer should be counting grapheme clusters; whereas for storage [memory & disk] concerns the programmer _should_ care about the number of codepoints.
They go on to say that counting the number of characters is up to the rendering engine and is completely unrelated to the number of codepoints.
I don't think the _manifesto itself_ is wildly misguided. In that section they were merely pointing out _how .NET and Java currently report string length._
At least as I read the manifesto: they seem to believe that counting codepoints is useful _but orthogonal_ to counting characters.
The others would be good reasons though.
Regardless of the encoding you choose, you need to understand how it represents characters.
Also FYI, most programs don't display text on the screen. To most programs, text is a sequence of bytes to be shuffled around, and nothing more. Display and text manipulation is the minority case.