
Pragmatic Unicode - olefoo
http://nedbatchelder.com/text/unipain.html
======
davvid
I am so happy to have found this. People often ask me about how to properly
handle Unicode in python at my $dayjob. I now have an "official"-sounding name
for the "decode early, unicode everywhere, encode late" mantra I always tried
to explain: "Unicode sandwich". I like it, and it is solid advice.
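
For anyone who hasn't seen the pattern spelled out, a minimal Python sketch of
the sandwich might look like this (the filenames and the UTF-8 encoding are
just assumptions for the example):

    # Bottom slice: decode bytes to text as soon as data enters the program.
    with open('input.txt', 'rb') as f:
        raw = f.read()                      # bytes
    text = raw.decode('utf-8')              # unicode in py2, str in py3

    # Filling: all processing happens on text, never on raw bytes.
    shouted = text.upper()

    # Top slice: encode back to bytes only at the output boundary.
    with open('output.txt', 'wb') as f:
        f.write(shouted.encode('utf-8'))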

------
jiggy2011
I don't quite understand why there should be a difference between encode and
decode? Surely something will always be in _some_ encoding so you are just re-
encoding?

Or does decode convert things into some sort of byte value?

I'm not a python programmer, so a little confused.

~~~
JulianWasTaken
The internal representation of Unicode objects is an implementation detail
(configured at compile-time, it's generally UCS-2).

So, for purposes of user code, no, you don't always have something in an
encoding. You have some text (which is somehow stored in a way not exposed to
you) or you have some encoded bytes.
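
To make that concrete in Python 2 terms (just a sketch; the narrow/wide build
distinction is the compile-time choice mentioned above):

    import sys

    type(b'caf\xc3\xa9')    # <type 'str'>     -- encoded bytes
    type(u'caf\xe9')        # <type 'unicode'> -- text; storage not exposed

    # The text type's internal storage is a build-time choice:
    # sys.maxunicode is 65535 on "narrow" (UCS-2) builds,
    # 1114111 on "wide" (UCS-4) builds.
    sys.maxunicode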

~~~
jiggy2011
So, if I have this right, decode() says "take this string which I believe is
encoded in <encoding> and convert it into whatever the internal string
representation is"?

~~~
JulianWasTaken
Yep. "And then allow me to treat it as text", semantically.
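
Roughly, in Python (assuming the incoming bytes happen to be UTF-8, purely for
the example):

    raw = b'caf\xc3\xa9'             # encoded bytes, e.g. from a file or socket

    # decode: "these bytes are UTF-8; give me the text they represent"
    text = raw.decode('utf-8')
    len(raw)                         # 5 (bytes)
    len(text)                        # 4 (characters)

    # encode goes the other way: text back to bytes in a chosen encoding.
    text.encode('utf-8') == raw      # True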

------
voidlogic
This all makes me very happy I program in Go day to day.

~~~
d0mine
Could you elaborate why the article is not applicable to Go?

    
    
      Does Go allow you to work with binary data (bytes)?
      Does it allow you to work with text (Unicode)?
      Does it allow you to convert one to the other using a character encoding?
      Does it allow you to accept data/text input from stdin?
      Does it allow you to output data/text to stdout?
      Does it allow you to read/write data/text from/to a file/subprocess?
      Does it allow you to specify filenames as Unicode/bytes?
      Does it allow you to use regexes on bytes/Unicode?
      ...
    

Do you need to write every single operation twice: once for bytes, once for
Unicode?

~~~
martinced
tl;dr It's nearly 2013; if you're still stuck at the byte/Unicode level you're
doing it wrong: nowadays people are using programming languages which have
abstractions for strings and characters.

I can't answer for the parent about Go specifically, but I may have the
beginning of an answer.

Do you realize all your questions are _really_ low-level details that are
entirely taken care of by the default APIs in quite a few languages?

The byte/Unicode distinction you keep insisting on is a strawman: at the right
level of abstraction you're dealing with neither. You're dealing with "strings"
made of "characters".

I don't remember the last time I had to manually read bytes/data and "convert"
them using some character encoding.

Your questions just feel weird. As if the early nineties called and wanted
their byte/UTF-8 encoder developers back or something.

I realize this _may_ be a very real concern when working with languages that
didn't handle the string/character abstraction correctly and which are still
stuck in the byte/Unicode mindset, but you have to realize that several
languages and techniques let you simply dodge these matters altogether.

For example, Java doesn't specify the encoding of Java source files: great, so
we mandate (and enforce with scripts/hooks) that all .java files contain only
ASCII characters. We use properties files (whose encoding is defined) to hold
our non-ASCII characters.

Wanna transmit text? Well, use a format which lets you specify the encoding of
your text. It's really that simple and, on the other end of the wire, it
automagically becomes "strings" and "characters" again.

Another "detail": Go is one of the only languages to specify that source files
have to be encoded in UTF-8. This is a major win too. (If you've never seen an
ISO-Latin-1 / UTF-8 / MacRoman / etc. mismatch in source files then you've
never worked in a diversified enough team / environment. ;)

~~~
to3m
This depends very much on the sort of work one does, I think.

If there are only strings made of characters, it can be difficult to form
certain sequences of bytes. For example, (uint8_t[]){0x80,0x80} isn't valid
UTF-8; one could create the string "\x80\x80", but this is actually four bytes
in size. Not great for efficiency, no good for interoperability.
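
In Python terms, for instance (just a sketch of the same point):

    raw = b'\x80\x80'                # two raw bytes; not a valid UTF-8 sequence
    len(raw)                         # 2

    # The *text* string u'\x80\x80' is two characters (U+0080 U+0080);
    # encoded as UTF-8 it takes four bytes -- the inflation mentioned above.
    len(u'\x80\x80'.encode('utf-8')) # 4

    # And the raw byte pair can't be interpreted as UTF-8 text at all:
    raw.decode('utf-8')              # raises UnicodeDecodeError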

To create every possible array of bytes, arrays of bytes must be a data type
too. And these arrays of bytes must be able to turn into strings, and vice
versa, because the data read from and written to sockets and files and streams
and what-have-you is most naturally of array-of-byte type rather than string.
Perhaps mandating UTF-8 as the string encoding, leaving alternative encodings
up to the programmer, would be reasonable these days, I suppose - if a little
rude.

Another advantage to having arrays of bytes in addition to strings of chars is
that arrays of bytes support efficient random access.

(Javascript looks like an example of this playing itself out, from the little
I've used it so far. Websockets has alternative text/binary frames, which
appear to be a later addition, and there's this ArrayBuffer thing which I
believe was prompted by WebGL's requirement for efficient arrays of unadorned
bytes.)

~~~
snogglethorpe
> _Perhaps mandating UTF-8 as the string encoding, leaving alternative
> encodings up to the programmer, would be reasonable these days, I suppose -
> if a little rude._

It really works quite well, especially "these days" (maybe not so much in the
'90s), and has the _huge_ advantage that it's very simple and lightweight, and
avoids the very tricky issue of re-encoding (Guaranteed To Make Your Life
Hell).

The python3/java/microsoft/whatever solution is to add an enormously
heavyweight and complex layer that attempts to interpret and re-encode the
contents of strings, with often dubious benefits.

My experience suggests that the latter method is in many cases a classic case
of over-engineering, where for a wide swathe of applications the downsides
(complexity and cost) exceed the upsides. But because these languages (or
systems, in the case of MS) require this at a fundamental level, _all_
applications pay the costs, even those that don't gain any benefit.

[Given the encoding jungle of the '80s/'90s, it's to some degree
understandable that MS/Java went the way they did (though UTF-8 _was_ around
back then, and would have at least made a much better canonical
representation). That python3 dove into that swamp is a bit more mystifying
given the rather clear signs that the future will essentially be Unicode
encoded with UTF-8...]

