Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

why are they so hard?



There's two aspects of complexity with strings:

1) 40 years of crazy encodings and languages.

2) human languages are wildly diverse and basically any assumption you wish to apply is broken.

For 1, any system that wants to deal with the outside world needs to deal with: operating system encodings (arbitrary bytes on unix, malformed UCS2 on windows), C representation (null-terminated strings), systems that only work with ASCII, systems that only work with utf8, systems that work with arbitrary encodings/languages (HTML). This is arguably unnecessary complexity that exists because of short-sighted decisions in the past.

2 is the necessary complexity; the fact that languages are really complicated.

There are thousands of symbols in writing. Do you try to encode these symbols in a monolithic manner, or in a compositional way? For historical reasons, you can often do both! ë can be a single character, or e with an accent modifier. How do you handle string searching in such a model? Do you match `noel` with `noël`? What's the length of noël? 4 characters? 5 characters? bytes? graphemes? codepoints? Can you correctly reverse noël (do it wrong and you can get leön)?

Different letters which have similar/identical representations but different semantics/origins! Is Ε "capital e" or "capital ε"? How do you upper-case or lower-case these letters? Do you expect to_upper(to_lower(char)) to roundtrip (it won't)? Do you expect capitalization to be doable in-place (it's not)? Do you expect capitalization to be region-specific (it is)?

Are any of these operations even coherent in a language like Japanese? Why are you trying to do them?

God help you if you want to display this text. Are you ready to handle right-to-left text? Are you assuming that your font is monospace (hey there terminal and text editors)? C̢̫a̘̺̯n ̘̜̦̹y̷̫̼̘̩o̶͉u̗̩̻̞ ̻ẹ͡v̴̤͎̹e̶̫̠̤̭̺̤̞n̛̞̹̣̩̲͉̮ ̜͖̪͔̖d̤e̘̯ͅa̺l̟̀ ͚̗̣w̭i̸͇̠̥̣̜̥t̸h̸̻̮̼̙̹ ̗̺̱̣̰̱̙z̟a̺͜l̠̦̖̟̰͍g҉̜͖͓̫ơ̩̹̰͕?̹̳̼̯̘̺̟


I think my favorite part of this article is how my Zalgo example doesn't render even remotely correctly in any browser I've tried it in.


Because we spent a lot of time optimizing every assumption for the ascii-only case, and letting go of that simplification is distressing, I guess.


Yes. And many languages had ASCII-only stuff for a long time, so "real" APIs seem much, much more complex. Of course, that complexity was always there, but us English speakers could mostly just ignore it...




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: