
> There's no reason to have both ä and a followed by the umlaut as separate characters.

ä is part of ISO/IEC 8859-1, among others. Unicode requires that characters from existing character sets be available in precomposed form, because otherwise the transition from a legacy system (where everything is seemingly one "character") to Unicode (where some characters may be composed of multiple code points) would be harder. This concern is not hypothetical; Unicode as it is today wouldn't have been possible if an ASCII-compatible Unicode transformation format (UTF-8) didn't exist.
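To make the two spellings concrete, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `canonically_equal` is made up for illustration; it only shows that the precomposed and decomposed forms differ as code point sequences, compare equal after normalization, and that only the precomposed form maps to a single Latin-1 byte.

```python
import unicodedata

precomposed = "\u00e4"   # ä as one code point, the form ISO/IEC 8859-1 also has
decomposed = "a\u0308"   # a followed by COMBINING DIAERESIS

def canonically_equal(left: str, right: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)

print(precomposed == decomposed)                    # False
print(canonically_equal(precomposed, decomposed))   # True
print(precomposed.encode("latin-1"))                # b'\xe4'
```

A legacy converter can map byte 0xE4 to U+00E4 one to one, which is the round-trip property the precomposed character buys.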

> I know nothing about Hangul.

It doesn't matter that you know nothing about Hangul, because you are suggesting removing features that Hangul depends on! Prove that Hangul doesn't need those features, or accept that your complaint about Unicode, based on your misunderstanding of internationalization, is unsubstantiated.

> Tell me about the Fraktur example. Two fonts in Unicode for the same letters, and that's only the beginning. Unicode has boldface and italic encodings, too, but only for some letters.

These characters are in the (amusingly self-explanatory) Mathematical Alphanumeric Symbols block, where the same character with different styles can have different semantics. The block was originally proposed by the STIX project in 1998 (L2/98-405) in the form of variant tags, and was later changed to separate characters (L2/99-105) because variant tags cannot be limited to a set of desirable characters. Nowadays they are more frequently used as "formatting" in plain text, but Unicode isn't responsible for any such (incorrect) uses beyond mathematics.
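As a rough illustration (a sketch with Python's `unicodedata`, not anything from the L2 documents), the snippet below looks up a few of these code points. It also shows why the block appears to cover "only some letters": a few styled letters already existed in the Letterlike Symbols block, so the corresponding slots in the mathematical block are left unassigned.

```python
import unicodedata

fraktur_a = "\U0001D504"   # in the Mathematical Alphanumeric Symbols block
fraktur_c = "\u212D"       # Fraktur C lives in Letterlike Symbols instead
reserved = "\U0001D506"    # the slot where a Fraktur C would otherwise sit

print(unicodedata.name(fraktur_a))        # MATHEMATICAL FRAKTUR CAPITAL A
print(unicodedata.name(fraktur_c))        # BLACK-LETTER CAPITAL C
print(unicodedata.name(reserved, None))   # None, the code point is unassigned

# Compatibility normalization folds the styled letters back to plain ones,
# which discards the mathematical distinction.
print(unicodedata.normalize("NFKC", fraktur_a + fraktur_c))   # AC
```

Whether folding with NFKC is appropriate depends on the application; in mathematical text it destroys exactly the distinction the block was added to preserve.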

Oh, and all of this information can easily be found in the L2 documents linked from that Wikipedia page. If you had done minimal research on the topic before complaining, I wouldn't be this sarcastic.

> But people are familiar with that, after all, even diff programs have an option to ignore whitespace differences.

Ideally, diff should be aware of the structure of its input files so that it can filter out non-semantic changes. Ignoring whitespace is a cheap approximation that doesn't always work. But that's another topic.

> Have fun writing a 100% correct program that can diff two Unicode files and not show embedded but invisible differences.

Why should I do that? I do care about trailing whitespace and unused control characters in a plain text file, and I want to see changes to them.
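For what it's worth, surfacing such characters is not hard. Below is a minimal Python sketch; the function name `reveal` and the choice of general categories (Cc, Cf, Zs) are my own assumptions about what counts as "invisible", not a complete definition. A diff tool could run both inputs through something like this before comparing them.

```python
import unicodedata

def reveal(text: str) -> str:
    """Hypothetical helper: spell out invisible code points as <U+XXXX>."""
    out = []
    for ch in text:
        # Cc = control, Cf = format (zero-width characters etc.), Zs = space separators.
        # Plain spaces and newlines are left alone so the text stays readable.
        if unicodedata.category(ch) in {"Cc", "Cf", "Zs"} and ch not in " \n":
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)

old = "value = 1"
new = "value\u200b = 1\u00a0"   # zero width space and a trailing no-break space

print(reveal(old))   # value = 1
print(reveal(new))   # value<U+200B> = 1<U+00A0>
```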

I guess you have made this problem up again as a hypothetical example (please let me know the context if not), but if you really need the visual difference between texts, the correct answer is that you should consult the rendering system, not Unicode, to list the series of glyphs and compare those, because the renderer has the final say about the actual visual output. It's like Unicode string indexing [1]; yes, you can do that under certain circumstances, but why would you do that in the first place? You are almost surely solving the wrong problem.
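A tiny Python sketch of the indexing point, under the assumption that strings are indexed by code point (as Python's `str` is): the index addresses code points, not what a reader would call characters, so slicing can silently split a letter from its combining mark.

```python
text = "a\u0308bc"   # displays as "äbc" but holds four code points

print(len(text))              # 4
print(text[0])                # a, the diaeresis is left behind
print(text[:1] + text[2:])    # abc, the mark was dropped without any error
```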

[1] https://news.ycombinator.com/item?id=20050326




> Why should I do that?

If you're writing a spell checker, for example.

> hypothetical example

Spell checkers, string search, sorting strings, etc., are all real programming tasks.

> the same character with different styles can have different semantics

Yes, I know that. Unicode exceeded its charter by adding invisible semantic content like that, for the simple reason that it cannot succeed at that task. The semantic meaning of text is imputed from the context, not the glyph.



