
This is the first time I've ever seen someone write with an understanding of combining characters, glyphs, codepoints vs encoding of said codepoints - and yet arrive at this conclusion.

What's the largest codebase you've tried a Unicode-ification project on? It's a nightmare unless you keep decoding/encoding as close to the I/O operations as possible.

I can't understand how you've ever found it just as easy to do "does string x contain substring y" on bytes as on Unicode strings. Any case-insensitive test will fail miserably unless you only ever see ASCII input. Then there's sorting and tokenization. Oh god, the sorting bugs...
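
A minimal Python sketch of that failure mode (Python 3; the strings are made up and the input is assumed to be UTF-8):

    # Case-insensitive "does x contain y?" on raw bytes vs decoded strings.
    haystack_bytes = "Straße liegt in Köln".encode("utf-8")
    needle = "STRASSE"

    # Byte-level attempt: bytes.lower() only folds ASCII A-Z, so ß and ö
    # are left alone and the match silently fails.
    print(needle.lower().encode("utf-8") in haystack_bytes.lower())   # False

    # Decoded-string attempt: str.casefold() applies full Unicode case
    # folding (ß -> "ss"), so the match succeeds.
    haystack = haystack_bytes.decode("utf-8")
    print(needle.casefold() in haystack.casefold())                   # True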

Even measuring the length of a string is a miserable fail. And blindly splicing UTF-8 bytes can split a multi-byte sequence in half, horribly mangling the output and causing mysterious segfaults or silent corruption.
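
To make that concrete, another small sketch (again Python 3, made-up string):

    s = "café 👍"

    print(len(s.encode("utf-8")))   # 10 -- bytes (é is 2 bytes, 👍 is 4)
    print(len(s))                   # 6  -- code points

    # The decomposed spelling "e" + COMBINING ACUTE ACCENT is two code
    # points but one user-perceived character:
    print(len("cafe\u0301 👍"))     # 7  -- still 6 visible characters

    # And slicing the raw bytes at an arbitrary offset cuts é in half:
    print(s.encode("utf-8")[:4].decode("utf-8", errors="replace"))   # caf�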

On a large codebase, programmers can't keep track of which encoding is in use in which parts of the code. E.g. let's allow users to specify the input file encoding! But our OS does filenames in UTF-16LE. And the web API is UTF-8... nasty stuff. It's far saner to use character strings everywhere except immediately after/before I/O operations.
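
A sketch of that boundary discipline (Python 3; the file name, encodings and "web API" function are hypothetical):

    # Decode once at the input boundary, work with str everywhere inside,
    # encode once at the output boundary.

    def read_user_file(path, encoding):        # encoding chosen by the user
        with open(path, "rb") as f:
            return f.read().decode(encoding)   # bytes -> str, right here

    def send_to_web_api(text):
        payload = text.encode("utf-8")         # str -> bytes, right here
        # ... hand `payload` to the HTTP client ...
        return payload

    # Everything in between operates on plain str and never needs to know
    # whether the data arrived as Latin-1, UTF-16LE or UTF-8.
    doc = read_user_file("notes.txt", "latin-1")
    send_to_web_api(doc.upper())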


