The Minimum Every Software Developer Absolutely Must Know About Unicode (2003) (joelonsoftware.com)
57 points by federicoponzi 217 days ago | hide | past | web | 11 comments | favorite



This needs the "(2003)" label suffix added. While awesome, and I love Joel, this is a VERY old article indeed.


Yes. The big problems since then are because there are now popular characters that won't fit in two bytes. Back in Joel's day, if you only handled 16 bit Unicode characters, you just gave up Cretan Linear B and similar alphabets you weren't likely to encounter. Today, most of the emoji, of which there are now 2628 [1], are up there beyond 16 bits, in what used to be called the "astral planes".

Unfortunately, Java and Windows both adopted 16-bit Unicode as their internal representation. So did the Python of the time. This led to a hokey scheme where two 16-bit code units (a "surrogate pair") are used to represent one longer character. Current language status is that Go and Rust are both 100% Unicode, Java still has the 16-bit hack, Python 3 is 100% Unicode but Python 2.7 has 16-bit or 32-bit Unicode builds, and C and C++ still require a lot of explicit work by the application programmer. PHP has some Unicode support, but doesn't really understand string encodings.
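
To make the two-unit scheme concrete, here is a minimal Go sketch using the standard unicode/utf16 package. The helper name utf16Units is made up for illustration; it only shows how many 16-bit units a single code point needs.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16Units is a hypothetical helper (not from the thread): it returns the
// UTF-16 code units needed to represent one code point.
func utf16Units(r rune) []uint16 {
	return utf16.Encode([]rune{r})
}

func main() {
	// 'A' is in the 16-bit range, so one unit is enough.
	fmt.Printf("%04X\n", utf16Units('A'))
	// U+1F600 is beyond 16 bits, so it becomes a surrogate pair.
	fmt.Printf("%04X\n", utf16Units(0x1F600))
}
```

Anything that counts "characters" by counting 16-bit units will report 2 for that one emoji.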

On the web, HTML is now almost entirely UTF-8. At least that's been straightened out. Mostly. There's still too much Latin-1 (ISO-8859-1) around. Browsers still mostly support the old encodings, although Microsoft Edge apparently doesn't do so very well. Note that HTTP is not Unicode-aware; URLs, domain names, and HTTP data often have strings with other encodings.

The major databases understand Unicode, but it may be necessary to specify an encoding. MySQL went through a period where its "utf8" character set stored at most three bytes per character, which only covers the 16-bit code points, but it now supports the full range. This requires specifying "utf8mb4" as the encoding. For new work, use that for everything. Postgres uses "utf-8" for 32-bit Unicode. MongoDB has no idea what character set it's using; it only knows raw bytes, and the application has to handle encoding and decoding.

The current big question is whether programs should use UTF-8 encoding internally, and whether the encoding should be visible to the programmer. Rust uses UTF-8 internally, it's totally visible to the programmer, and you have to use special libraries to iterate over a string properly. Go tries to hide it a little; if you iterate over a Go string using "range", you get 32-bit "runes". Python hides the internal representation completely; if you iterate over a Python string, you get strings which contain one code point, and strings are subscriptable by code point.
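
As a small illustration of what "visible to the programmer" means in Go, here is a sketch that compares byte length with code point count. The function name counts and the sample string are my own choices, not anything from the article.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper: it returns the number of UTF-8 bytes and
// the number of code points in s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// Escapes are used so the source file's own encoding can't change the result.
	s := "na\u00efve \U0001F600"
	b, r := counts(s)
	fmt.Println("bytes:", b, "runes:", r)

	// range decodes UTF-8 as it goes: i is a byte offset, c is a rune.
	for i, c := range s {
		fmt.Printf("offset %d: %U\n", i, c)
	}
}
```

len and indexing work on bytes, while range and utf8.RuneCountInString work on code points, so the two numbers differ as soon as the string leaves ASCII.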

[1] http://unicode.org/emoji/charts-beta/full-emoji-list.html


> if you iterate over a Go string using "range", you get 32-bit "runes"

It's often easier to use that for-loop with a break to look at the first character of such a string, as in...

  package main

  import "fmt"

  func main() {
    str := "我是一个人"
    // range decodes UTF-8, so r is the whole first character, not a byte
    for _, r := range str {
      fmt.Printf("%c\n", r)
      break
    }
  }

Indexing with str[0] gives the first UTF-8 byte of the string (0xE6, which prints as "æ") instead of the whole character. The name of the official function is difficult to remember, and you have to put an import at the top...

  import "unicode/utf8"
  ...
  r, _ := utf8.DecodeRuneInString(str)
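
If you'd rather not remember it every time, one option is to wrap it once. This is only a sketch: firstRune is a made-up name, and the ok flag relies on utf8.DecodeRuneInString returning utf8.RuneError with size 0 for an empty string and size 1 for an invalid encoding.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune is a hypothetical wrapper: it returns the first code point of s,
// its width in bytes, and whether decoding succeeded.
func firstRune(s string) (r rune, size int, ok bool) {
	r, size = utf8.DecodeRuneInString(s)
	// RuneError with size 0 means empty input, size 1 means invalid UTF-8.
	// A real U+FFFD in the string decodes with size 3, so it still counts as ok.
	if r == utf8.RuneError && size <= 1 {
		return r, size, false
	}
	return r, size, true
}

func main() {
	str := "我是一个人"
	if r, size, ok := firstRune(str); ok {
		fmt.Printf("%c (%d bytes)\n", r, size)
	}
}
```

The returned size is also what you'd slice by (str[size:]) to get the rest of the string.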


> Postgres uses "utf-8" for 32-bit Unicode

What's the reason for the quotes around utf-8? To my knowledge it's bog standard utf-8, with the exception that NULL is disallowed in many types.


Very old, but still very applicable, IMO.


See also: http://kunststube.net/encoding/ - which I personally prefer / is more thorough in my opinion.



This reminds me of D. Goldberg's very old "What Every Computer Scientist Should Know About Floating-Point Arithmetic" [1].

[1] http://www.lsi.upc.edu/~robert/teaching/master/material/p5-g...


There was a talk at DroidCon given by Jesse Wilson that discusses this more in depth (and with a little more clarity).

https://youtu.be/T_p22jMZSrk


For another good article for programmers on Unicode, see this brilliant and excellent article by Joel Spolsky at https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...


That's the same link as the submission you are commenting on...?



