Pragmatic Unicode (nedbatchelder.com)
42 points by olefoo 1785 days ago | 19 comments

I am so happy to have found this. People often ask me how to properly handle Unicode in Python at my $dayjob. I now have an "official"-sounding name for the "decode early, unicode everywhere, encode late" mantra I always tried to explain: "Unicode sandwich". I like it, and it is solid advice.
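
A minimal sketch of the sandwich in Python 3, for anyone who wants to see it rather than hear the mantra. The helper name `shout` and the sample input are made up for illustration; the only real APIs used are `bytes.decode` and `str.encode`.

```python
def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: bytes at the edges, text in the middle."""
    text = raw.decode(encoding)        # decode early, at the input boundary
    processed = text.upper()           # work on str (Unicode) everywhere inside
    return processed.encode(encoding)  # encode late, at the output boundary

incoming = "café\n".encode("utf-8")    # pretend this arrived from a socket or file
outgoing = shout(incoming)
print(outgoing)
```

The point of the shape is that `upper()` only ever sees text, so it never has to care which encoding the bytes were in.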

I don't quite understand why there should be a difference between encode and decode? Surely something will always be in some encoding so you are just re-encoding?

Or does decode convert things into some sort of byte value?

I'm not a python programmer, so a little confused.

The internal representation of Unicode objects is an implementation detail (configured at compile-time, it's generally UCS-2).

So, for purposes of user code, no, you don't always have something in an encoding. You have some text (which is somehow stored in a way not exposed to you) or you have some encoded bytes.

In the most recent version of Python, the implementation representation is a run-time detail, not compile-time.

If a string is pure ASCII, it is stored with one byte per character. Most non-ASCII Unicode strings are stored with two bytes per character, and if anything beyond the Basic Multilingual Plane is needed, Python switches to four bytes per character.
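
If you want to see that for yourself, here is a rough sketch. It assumes CPython 3.3 or later (other implementations may store strings differently), and `bytes_per_char` is a made-up helper that compares the sizes of two strings so the fixed object header cancels out.

```python
import sys

def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate per-character storage in CPython by differencing two sizes."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

samples = [
    ("ASCII", "a"),
    ("BMP, non-Latin-1", "\u0416"),       # CYRILLIC CAPITAL LETTER ZHE
    ("outside the BMP", "\U0001F600"),    # an emoji
]
for label, ch in samples:
    print(label, bytes_per_char(ch))
```

None of this is visible to ordinary code: indexing and `len()` behave the same regardless of the width chosen.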

So, if I have this right decode() says "take this string which I believe is encoded in <encoding> and convert it into whatever the internal string representation is"?

Yep. "And then allow me to treat it as text", semantically.
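
A small Python 3 illustration of that, as a sketch only (the sample word is arbitrary): `decode` turns bytes into text, `encode` goes the other way, and guessing the wrong encoding still "works" but gives you garbage text.

```python
raw = b"na\xc3\xafve"              # the UTF-8 bytes for "naïve"
text = raw.decode("utf-8")         # bytes -> str, using the encoding we believe is right

print(type(raw).__name__, len(raw))    # counts bytes
print(type(text).__name__, len(text))  # counts code points

round_trip = text.encode("utf-8")  # str -> bytes again
wrong = raw.decode("latin-1")      # wrong guess: no error, just mojibake
print(repr(wrong))
```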

This all makes me very happy I program in Go day to day.

Could you elaborate why the article is not applicable to Go?

  Does Go allow you to work with binary data (bytes)?
  Does it allow you to work with text (Unicode)?
  Does it allow you to convert one to the other using a character encoding?
  Does it allow you to accept data/text input from stdin?
  Does it allow you to output data/text to stdout?
  Does it allow you to read/write data/text from/to a file or subprocess?
  Does it allow you to specify filenames as Unicode/bytes?
  Does it allow you to use regexes on bytes/Unicode?
Do you need to write every single operation twice: once for bytes, once for Unicode?

I think the way Go handles Unicode is very sane. Of course, two of the three primary creators of Go, Ken Thompson and Rob Pike, are also the two co-creators of UTF-8, so that is probably unsurprising.

Take a look at the Go spec: https://golang.org/ref/spec And std libs: https://golang.org/pkg/

tl;dr It's nearly 2013, if you're still stuck at the byte/Unicode level you're doing it wrong: nowadays people are using programming languages which have abstraction for strings and characters.

I can't answer for the parent about Go specifically, but I may have the beginning of an answer.

Do you realize all your questions are really low-level details that are entirely taken care of by default APIs in quite some languages?

The whole byte/Unicode you keep insisting on is a strawman: at the right level of abstraction you're dealing with neither. You're dealing with "strings" made of "characters".

I don't remember the last time I had to manually read bytes / data and "convert" them using whatever character encoding.

Your questions just feel weird. As if the early nineties called and wanted their byte/UTF-8 encoder developers back or something.

I realize this may be a very real concern when working with languages that didn't deal correctly with the string/characters abstraction and which are still stuck in the byte/Unicode(byte or bytes) mindset, but you have to realize that several languages and techniques allow you to simply dodge these matters altogether.

For example, Java doesn't specify the encoding of Java source files: great, we mandate (and enforce with scripts/hooks) that all .java files contain only ASCII characters. We use properties files (whose encoding is defined) to contain our UTF-8 characters.

Wanna transmit text? Well, use a format which lets you specify the encoding of your text. It's really that simple and, on the other end of the wire, it automagically becomes "strings" and "characters" again.

Another "detail": Go is one of the only languages to specify that source files have to be encoded in UTF-8. This is a major win too. (If you've never seen an ISO-Latin-1 / UTF-8 / MacRoman / etc. mismatch in source files, then you've never worked in a diversified enough team / environment ; )

"low-level details that are entirely taken care of by default APIs in quite some languages"

They are not "taken care of"; it's only that the sharp edges are smoothed down a bit.

Here's a hard one. In Go, if you read a directory to get a list of filenames, do you get a list of (Unicode) strings, or a list of byte sequences? Does it depend on the filesystem? What if the filesystem is nominally UTF-8 encoded, or UTF-16 encoded, but the underlying filesystem API does no validation, so that illegal UTF-8/-16 sequences are allowed? Do you have a way to get the filenames as bytes for that case? How do you show those strings in a GUI? Can someone enter a path name for a byte sequence which can't be represented in Unicode?

The Python developers worked hard to get a reasonably good solution, but there are still hard details to worry about. Can you point out the Go solution? Or is this one of the corner cases you've not had to deal with?
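
For reference, the Python 3 approach to undecodable file names is the "surrogateescape" error handler (PEP 383). Here is a minimal sketch of the mechanism using an explicit `decode` call so it runs the same everywhere; the file name is made up, and how the OS-level APIs apply this on a given platform is not shown.

```python
raw_name = b"report-\xff.txt"   # hypothetical file name; 0xFF is never valid in UTF-8

# Undecodable bytes are smuggled into the str as lone surrogates (U+DC80..U+DCFF).
name = raw_name.decode("utf-8", errors="surrogateescape")
print(repr(name))

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", errors="surrogateescape")

# A strict encode refuses, which is what a GUI or log writer would run into.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(restored == raw_name, strict_ok)
```

So the bytes survive a round trip, but the "how do you show it in a GUI" question is still yours to answer.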

Suppose you've read three strings, one with "Å" (LATIN CAPITAL LETTER A WITH RING ABOVE), the second with "Å" (ANGSTROM SIGN), and the third using the code point sequence "U+0041 U+030A" (LATIN LETTER A and COMBINING RING ABOVE ("°")). Are these the same strings? Is there a way to normalize the Unicode string to make them the same?

A toolkit can implement the normalizations, but can't know when to do them. In your code, have you personally dealt with Unicode normalizations?
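
To make the three-string case concrete, here is a short Python sketch using the standard `unicodedata` module. The variable names are mine; the code points are the ones described above.

```python
import unicodedata

precomposed = "\u00c5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212b"       # ANGSTROM SIGN
combining = "A\u030a"     # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
variants = [precomposed, angstrom, combining]

# As raw code point sequences, all three are different strings.
print(len(set(variants)))

# After normalization they collapse to a single form.
nfc = {unicodedata.normalize("NFC", s) for s in variants}
nfd = {unicodedata.normalize("NFD", s) for s in variants}
print(nfc, nfd)
```

The library call is the easy part; deciding where in your program to normalize (on input, before comparison, before storage) is the part no toolkit can do for you.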

And now I'm curious - can you use all three Å representations as different variables in Go? ... Ahh, here it is: "The text is not canonicalized, so a single accented code point is distinct from the same character constructed from combining an accent and a letter; those are treated as two code points." This means your Go code COULD have problems distinguishing different representations of the same character, rather like the problems you saw with ISO-Latin-1 / UTF-8 / MacRoman. Perhaps you just haven't worked with a diversified enough team / environment in your Go code? ;)
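
For contrast, Python 3 goes the other way: PEP 3131 says identifiers are converted to NFKC while parsing. A quick sketch, assuming CPython 3 and using `exec` only so the three spellings can be written as escapes:

```python
# Three assignments whose targets are the three spellings of "Å".
source = "\u00c5 = 1\n\u212b = 2\nA\u030a = 3\n"

namespace = {}
exec(source, namespace)

# Ignore the __builtins__ entry that exec adds to the namespace.
names = [key for key in namespace if not key.startswith("__")]
print(names, namespace["\u00c5"])
```

All three lines end up assigning to one and the same variable, so the last value wins.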

The fact that you think Unicode is a "string" of "characters" suggests that you don't know much about Unicode and mostly have experience using the simple Latin-like subset of Unicode.

If that's the case, then you should have written "This all makes me very happy I work with Latin-like languages day to day" rather than "..I program in Go..."

No, martinced, it doesn't work like that. Encodings are data formats, and with the Internet connecting everything to everything, you have data generated by every programming language ever written and every protocol ever used swirling around you all the time. What a delight it would be if all text data were UTF-8, but it's not. So, you have to take what you get as input, do whatever processing you do with your part in the system, and output what your receiver demands. In the general case, you can't control the whole system, regardless of what tools you choose to use.

I specialize in software that deals with non-Western scripts (writing systems), and my life would be easier (and less profitable) if everyone used a Unicode encoding for everything (I don't care WHICH Unicode encoding), but the legacy stuff will be around for years. In the meantime, it doesn't matter whether you use Go or Java or Python 3, you'll still have to deal with the fact that others out there use Python 2 or C or a MacRoman Web app (yes, I had to deal with one) or SQLServer's broken "UTF-8" or whatever to generate encoding mess you have to deal with. The fact that your language does things well internally is a great feature but won't insulate you from trouble if you have to deal with the outside world, as most of us do.
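
As a tiny example of the kind of boundary work I mean, here is a Python 3 sketch. The `transcode` helper and the sample bytes are made up; `mac_roman` is a codec name in Python's standard library.

```python
def transcode(raw: bytes, source_encoding: str, target_encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: re-encode legacy bytes for a receiver that wants UTF-8."""
    return raw.decode(source_encoding).encode(target_encoding)

mac_bytes = b"caf\x8e"                        # "café" as a MacRoman system would send it
as_utf8 = transcode(mac_bytes, "mac_roman")   # what a UTF-8 consumer expects
print(as_utf8)

# Guessing Latin-1 instead raises no error; it silently yields a control character.
wrong_guess = mac_bytes.decode("latin-1")
print(repr(wrong_guess))
```

Nothing in the bytes tells you they are MacRoman; that knowledge has to come from outside, which is exactly the problem.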

This depends very much on the sort of work one does, I think.

If there are only strings made of characters, it can be difficult to form certain sequences of bytes. For example, (uint8_t[]){0x80,0x80} isn't valid UTF-8; one could create the string "\x80\x80", but this is actually four bytes in size. Not great for efficiency, no good for interoperability.
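
The same point in Python 3 terms, as a quick sketch (variable names are mine):

```python
invalid = bytes([0x80, 0x80])      # two continuation bytes with no lead byte

try:
    invalid.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False             # not valid UTF-8, so it cannot become text this way

as_text = "\x80\x80"               # a str holding U+0080 twice
encoded = as_text.encode("utf-8")  # each U+0080 becomes the two bytes C2 80
print(decoded_ok, len(invalid), len(encoded), encoded)
```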

To create every possible array of bytes, arrays of bytes must be a data type too. And these arrays of bytes must be able to turn into strings, and vice versa, because the data read from and written to sockets and files and streams and what-have-you is most naturally of array-of-byte type rather than string. Perhaps mandating UTF-8 as the string encoding, leaving alternative encodings up to the programmer, would be reasonable these days, I suppose - if a little rude.

Another advantage to having arrays of bytes in addition to strings of chars is that arrays of bytes support efficient random access.

(Javascript looks like an example of this playing itself out, from the little I've used it so far. Websockets has alternative text/binary frames, which appear to be a later addition, and there's this ArrayBuffer thing which I believe was prompted by WebGL's requirement for efficient arrays of unadorned bytes.)

> Perhaps mandating UTF-8 as the string encoding, leaving alternative encodings up to the programmer, would be reasonable these days, I suppose - if a little rude.

It really works quite well, especially "these days" (maybe not so much in the '90s), and has the—huge—advantage that it's very simple and lightweight, and avoids the very tricky issue of re-encoding (Guaranteed To Make Your Life Hell).

The python3/java/microsoft/whatever solution is to add an enormously heavyweight and complex layer that attempts to interpret and re-encode the contents of strings, with often dubious benefits.

My experience suggests that the latter method is in many cases a classic case of over-engineering, where for a wide swathe of applications, the downsides (complexity and cost) exceed the upsides. But because these languages (or systems, in the case of MS) require this at a fundamental level, all applications pay the costs, even those that don't gain any benefit.

[Given the encoding jungle of the '80s/'90s, it's to some degree understandable that MS/Java went the way they did (though UTF-8 was around back then, and would have at least made a much better canonical representation). That python3 dove into that swamp is a bit more mystifying given the rather clear signs that the future will essentially be Unicode encoded with UTF-8...]

> Do you realize all your questions are really low-level details that are entirely taken care of by default APIs in quite some languages?

Somebody's got to write those APIs. d0mine's questions are valid.

> We use properties files (whose encoding is defined) to contain our UTF-8 characters.

Yup...but it's defined as ISO-8859-1, not UTF-8 ( http://docs.oracle.com/javase/6/docs/api/java/util/Propertie... ). You need to escape everything else. Also the escapes are only for BMP characters, so supplementary characters need multiple unicode escapes to represent a single 'character'.
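
To show what "multiple unicode escapes" means in practice, here is a rough Python sketch of the escaping. The function name is made up, it escapes everything outside ASCII (which is safe but stricter than ISO-8859-1 requires), and it ignores the other characters a real properties writer must escape, such as backslashes.

```python
def to_properties_escapes(text: str) -> str:
    """Hypothetical helper: write non-ASCII characters as \\uXXXX escapes."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                    # plain ASCII passes through
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")        # BMP: a single escape
        else:
            offset = cp - 0x10000             # supplementary: UTF-16 surrogate pair
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            out.append(f"\\u{high:04X}\\u{low:04X}")
    return "".join(out)

print(to_properties_escapes("A\u00e9\U0001F600"))
```

One user-visible character, two escapes: that is the BMP limitation leaking through.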

There are too many rough edges in interoperability. You do still need to know this stuff if you work outside of Latin-1.

For everyone hoping for a reply or dialog, it appears based on history that martinced is a fly-by commenter who doesn't do follow-ups.

No matter what language you program in, you still need to have an understanding of character encodings if you ever intend to write software that interfaces with the real world.

Racket does this too.

In Racket,

    "this is a string, which is a sequence of characters"
    #"this is a bytestring, a sequence of numbers between 0 and 255"
You can convert back and forth with string->bytes/utf-8 or bytes->string/utf-8, substituting utf-8 for whichever character encoding you want.

You can use functions like string-ref, which returns a Unicode character, and bytes-ref, which returns an integer. Functions like string-normalize-nfc (or -nfd, -nfkc, etc.) will normalize combining characters, and you can change the case of strings according to locale, etc. Reading from an input port works with read-bytes and read-string, and writing to an output port works similarly.

Racket does a great job of keeping these things as explicit as they should be.
