
Strings, bytes, runes and characters in Go - kisielk
http://blog.golang.org/strings
======
mseepgood
Some people here seem to think that indexing, measuring and slicing operations
based on runes (code points) instead of bytes (UTF-8 code units) by default
would be a good idea. It's not; you get the worst of both worlds:
indexing is not a constant-time operation, and a code point is _still_ not a
user-perceived character, because combining character sequences consist of
multiple code points; even normalization doesn't help in general.

Other languages like C# seem to be different on the surface, but in fact they
index and measure by code units as well (2 byte UTF-16 code units), not by
code points.

~~~
derefr
> and a code point is still not a user-perceived character

How about indexing, measuring and slicing operations based on user-perceived
characters, then?

~~~
iv_08
I think the number of displayed characters is even font-dependent.

------
lmm
No distinction between string and byte array? I foresee all the fun of Python
3 in Go's future. Those of us programming in the real world need to deal with
legacy character sets in strings obtained from elsewhere, and it's no fun at
all to discover that what you thought was a string is actually an array of
Shift JIS or ISO 8859-1 bytes.

~~~
mediocregopher
It sounds like you've decided to dislike go without ever having actually used
it for anything in the "real world".

There is a difference between a string and a byte array. A string is a string;
a byte array is a []byte (byte slice). You have to explicitly convert from one
to the other. Neither is inherently UTF-8. A string is represented by a byte
array under the hood, and string literals in your source are read as UTF-8
encoded. Strings themselves are not necessarily UTF-8 encoded, and if you need
a different encoding there are libraries for that (unless you're using
something really esoteric).

~~~
lmm
So what do you get when you read a file, or when a file is uploaded to your
web server? What happens if you write a function that accepts a string as a
parameter, but haven't noticed that you're implicitly assuming the string is
utf8? (e.g. a function that formats one string using another - if the
encodings are different you'll end up with a string that's invalid for either
encoding, no?)

The distinction between a string with one encoding and a string with another
is subtle but vitally important - exactly the sort of thing a type system
should take care of.

~~~
kisielk
There is a library in one of the Go subrepositories that handles
transformation from other encodings to UTF-8:
[https://code.google.com/p/go/source/browse?repo=text#hg%2Fen...](https://code.google.com/p/go/source/browse?repo=text#hg%2Fencoding)

If you're expecting to get data in other encodings you could put together some
detection and transformation at the point of ingress and convert to UTF-8
encoded text for the rest of your application.
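
The x/text encoding package linked above is the general-purpose route, but for ISO 8859-1 specifically the conversion is trivial, since its 256 byte values map one-to-one onto the first 256 Unicode code points. A stdlib-only sketch of that ingress step:

```go
package main

import "fmt"

// latin1ToUTF8 converts ISO 8859-1 bytes to a UTF-8 encoded string.
// Each Latin-1 byte value equals its Unicode code point, so converting
// each byte to a rune and building a string re-encodes it as UTF-8.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	latin1 := []byte{0x63, 0x61, 0x66, 0xe9} // "café" in ISO 8859-1
	fmt.Println(latin1ToUTF8(latin1))        // café
}
```

Other legacy encodings (Shift JIS, etc.) have no shortcut like this, which is where the x/text library comes in.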

------
pygy_
Julia uses the same strategy, but indexing a UTF-8 string returns the rune
rather than the byte. If you try to get a byte in the middle of a rune's
representation, it raises an error.

The `next(string, index)` function used for the iteration protocol works like
the `utf8.DecodeRuneInString()` shown in the example, but it returns the next
valid index rather than the character width.

~~~
dbaupp
Rust has a similar approach (it raises an error when you attempt to do
something not on a rune boundary), although `string[index]` still returns a
byte rather than a character; strong static typing means that isn't a huge
problem.

------
SigmundA
Coming from C# it seems odd that a string would index on bytes and not
chars (runes), and that it is essentially a read-only byte array. If you
wanted a byte array, why wouldn't you use a byte array? Why have both strings
and byte arrays?

In C# you can encode/decode strings to byte arrays based on your desired
encoding, but a string is composed of characters; its in-memory representation
is abstracted.

Is this a performance or zero copy thing? Not having to encode/decode to get
to the bytes?

~~~
bazzargh
As someone's already pointed out, C# strings are composed of UTF-16 code
units, not characters. This means that if you have a character outside the
Basic Multilingual Plane it'll be represented as two code units using a
surrogate pair, and the character count of the C# string will be wrong (the
same is true of Java and JS, for example).

That's a hard problem, and avoiding it in every situation would require
scanning the strings for surrogates beforehand, when you might never need to
know that information. Go makes it explicit that knowing the exact character
position and string length in characters comes at a cost.

There's a good discussion of this on Tim Bray's blog:
[http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF](http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF)

~~~
twotwotwo
Just for fun, here's Go handling a char outside the BMP (😃, U+1F603):

[http://play.golang.org/p/qg7POYAAOL](http://play.golang.org/p/qg7POYAAOL)

~~~
samatman
[https://github.com/mnemnion/emojure/](https://github.com/mnemnion/emojure/)

You can even export them without Capital letters ;-)

~~~
twotwotwo
I did run into at least one eminently reasonable use of how Go source is
defined to be in UTF-8. Comments in the crypto libs just use math symbols
where they're handy, like this in crypto/rsa[1]:

    // Check that de ≡ 1 mod p-1, for each prime.
    // This implies that e is coprime to each p-1 as e has a multiplicative
    // inverse. Therefore e is coprime to lcm(p-1,q-1,r-1,...) =
    // exponent(ℤ/nℤ). It also implies that a^de ≡ a mod p as a^(p-1) ≡ 1
    // mod p. Thus a^de ≡ a mod n for all a coprime to n, as required.

Sadly, the spec requires identifiers to be just Unicode letters and digits, so
we will never experience the power and glory of emoji function names in Go.

[1]
[http://golang.org/src/pkg/crypto/rsa/rsa.go](http://golang.org/src/pkg/crypto/rsa/rsa.go)

------
frou_dh
That blog post is an example of good technical writing.

~~~
stevvooe
After following the development of Go for a while, I've come to idolize Rob
Pike's terse, accurate communication style.

~~~
4ad
Check out his books too: The Unix Programming Environment and The Practice of
Programming. Both co-authored by Brian Kernighan.

