Java is pretty good at character processing and has been since the inception of the language. Adopting Unicode from the start helped enormously, along with clearly separating String from byte[] in the type system. Finally, static typing makes it a lot easier to avoid 'what the heck do I have here' problems with bytes vs. str that still pop up even in Python 3.

That said, Python 3 is vastly better than Python 2. Basic operations like reading and writing files and serializing to JSON for the most part just work, without having to worry about encodings or manipulate anything other than str objects. I'm sure there are lots of cases where that's not true, but for my work at least, string handling is no longer a major issue in writing correct programs. The defaults largely seem to work.

Java's string handling is also broken by default in a few ways. It historically used UCS-2 internally, and a char is still a 16-bit UTF-16 code unit, so index-based operations like substring() and charAt() can still split a surrogate pair and leave you with a malformed Unicode string.
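To make the surrogate-pair point concrete, here is a minimal sketch. The class and method names (SurrogateDemo, truncateByChars, truncateByCodePoints, hasLoneSurrogate) are made up for illustration; only the String and Character calls are real JDK API. It compares truncating by char index with truncating by code point.

```java
class SurrogateDemo {
    // Naive truncation counts UTF-16 code units (char), so it can cut a surrogate pair in half.
    static String truncateByChars(String s, int maxChars) {
        return s.length() <= maxChars ? s : s.substring(0, maxChars);
    }

    // Code-point-aware truncation: offsetByCodePoints steps over whole pairs,
    // so the cut never lands between a high and a low surrogate.
    static String truncateByCodePoints(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        return s.substring(0, s.offsetByCodePoints(0, maxCodePoints));
    }

    // Returns true if the string contains a surrogate that is not part of a valid pair.
    static boolean hasLoneSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a well-formed pair
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (encoded as the pair D83D DE00), 'b': 4 chars but 3 code points
        String s = "a\uD83D\uDE00b";
        System.out.println(hasLoneSurrogate(truncateByChars(s, 2)));      // true
        System.out.println(hasLoneSurrogate(truncateByCodePoints(s, 2))); // false
    }
}
```

Neither version knows anything about grapheme clusters (combining marks, emoji sequences), so this only addresses the surrogate-splitting problem, not "user-perceived characters" in general.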

I have not personally encountered this problem but it's definitely there. The other historical problem is that Java didn't require clients to specify an encoding when moving between strings and bytes, so conversions silently fell back to the platform default. That's been cleaned up quite a bit in recent releases of the JDK.
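A rough sketch of what being explicit about encodings looks like, assuming you want UTF-8 everywhere. CharsetDemo and its helper methods are hypothetical names; StandardCharsets, String.getBytes(Charset) and new String(byte[], Charset) are the actual JDK calls.

```java
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    // Explicit charset: the result does not depend on the platform default encoding.
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decoding with the wrong charset does not throw, it just produces mojibake.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "caf\u00E9"; // "café"; é is U+00E9
        byte[] utf8 = encodeUtf8(original);

        System.out.println(utf8.length);                          // 5 (é takes two bytes in UTF-8)
        System.out.println(decodeUtf8(utf8).equals(original));    // true
        System.out.println(decodeLatin1(utf8).equals(original));  // false
    }
}
```

The no-argument s.getBytes() and new String(bytes) overloads are the ones that used to bite people, since they pick up whatever the default charset happens to be on the machine running the code.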

All things considered, Java character handling was an enormous improvement over the languages that preceded it and is still better than the implementations in many other languages. (I wish the same could be said of date handling.)