
I don't recognize that one, but there are in fact characters that render as whitespace but aren't, for example U+2800 (BRAILLE PATTERN BLANK).



This is particularly annoying when trying to sanitize input in a language like Java, where String.trim() only removes leading and trailing characters at or below U+0020 (space, tab, and the other ASCII control characters), and the predefined regex character classes in Pattern cover some of the whitespace characters, but not all.

Horizontal whitespace characters (\h): [ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]

Vertical whitespace characters (\v): [\n\x0B\f\r\x85\u2028\u2029]

Whitespace characters (\s): [ \t\n\x0B\f\r]
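
To make the gap concrete, here is a minimal sketch of a stricter trim. The class name, the `isBlankLike` helper, and the decision to special-case U+2800 are my own assumptions for illustration, not anything the JDK provides, and the set of "blank-looking" characters is deliberately not exhaustive.

```java
import java.util.regex.Pattern;

final class WhitespaceSanitizer {
    // Hypothetical helper: true for code points we choose to treat as blank.
    static boolean isBlankLike(int cp) {
        return Character.isWhitespace(cp)  // Java's definition; excludes the no-break spaces
            || Character.isSpaceChar(cp)   // Unicode categories Zs, Zl, Zp; includes U+00A0
            || cp == 0x2800;               // BRAILLE PATTERN BLANK, added by hand
    }

    // Strips blank-looking code points from both ends of the string.
    static String stripBlankLike(String s) {
        int start = 0;
        int end = s.length();
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!isBlankLike(cp)) break;
            start += Character.charCount(cp);
        }
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!isBlankLike(cp)) break;
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String input = "\u2800\u00A0 hello\u3000\u2800";
        // trim() leaves the input untouched: it starts and ends with U+2800.
        System.out.println("[" + input.trim() + "]");
        System.out.println("[" + stripBlankLike(input) + "]");
        // Neither predefined class matches the braille blank.
        System.out.println(Pattern.matches("\\s", "\u2800"));
        System.out.println(Pattern.matches("\\h", "\u2800"));
    }
}
```

Walking code points rather than chars keeps this safe if the allow-list is ever extended beyond the Basic Multilingual Plane.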


Which is particularly egregious, since in actual braille, that is a whitespace character (specifically, a space with no dots in it represents the character... space! 0x20).


Individual glyphs are considered to carry semantics. So IIRC there is a Unicode minus sign and a Unicode dash, and even if they are visually indistinguishable you're not supposed to mix them. I'm no Unicode expert though. I doubt the Unicode Consortium is composed of fools and nincompoops. Best assume they know what they're doing.


I'm not claiming that BRAILLE PATTERN BLANK is the same character as SPACE (I don't particularly disagree, but that's not my point); I'm claiming that BRAILLE PATTERN BLANK is a whitespace character, just like SPACE or NEWLINE or U+2003 EM SPACE (pretty much any of the U+200xs, really), regardless of whether it's the same character.


I exactly get your point. I don't know why either, but I'm assuming there's a reason.


Fair enough, but I'm assuming the reason, whatever it is, is at least as stupid as the reason why U+01F1 exists and is not encoded as 44 5A.


It's a backwards-compatibility issue with an older character set, where the symbol is used in the Latin transcription of Macedonian.

https://en.wikipedia.org/wiki/Dz_(digraph)#Unicode


Yes, and I'm assuming the reason BRAILLE PATTERN BLANK is not correctly classified is at least as stupid as that, and can therefore be safely ignored for the purposes of having legitimate reasons for things.


Finally you've got to the point: that braille whitespace is not classified as whitespace (which on checking is true). Why didn't you say that at the start?
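
For anyone who wants to repeat the check, a rough sketch in Java follows. The class and method names are made up for this example; the `Character` calls are standard JDK methods.

```java
final class BrailleBlankCheck {
    // Hypothetical helper: true if either of Java's two whitespace tests accepts cp.
    static boolean isClassifiedAsWhitespace(int cp) {
        return Character.isWhitespace(cp) || Character.isSpaceChar(cp);
    }

    // One line per code point: name, general category constant, and both tests.
    static String describe(int cp) {
        return String.format("U+%04X %s type=%d whitespace=%b spaceChar=%b",
            cp, Character.getName(cp), Character.getType(cp),
            Character.isWhitespace(cp), Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x0020)); // SPACE
        System.out.println(describe(0x2003)); // EM SPACE
        System.out.println(describe(0x2800)); // BRAILLE PATTERN BLANK
        // U+2800 is reported as Character.OTHER_SYMBOL (category So),
        // not as any of the separator categories.
        System.out.println(Character.getType(0x2800) == Character.OTHER_SYMBOL);
    }
}
```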


> that braille whitespace is not classified [correctly]

I did! That was the first thing I said.

> > characters that render as whitespace but aren't, for example U+2800 (BRAILLE PATTERN BLANK).

> in actual braille, that is a whitespace character


Indeed you did. My apologies.


There are a lot of those in Unicode: characters that are visually indistinguishable from others but exist to provide round-trip capability with some older character set.


Why on earth would Unicode U+01F1, with its obvious hex encoding of 01F1, be encoded as hex value 44 5A (a Han character https://www.fileformat.info/info/unicode/char/445a/index.htm)?


He meant U+0044 U+005A, the ASCII letters D and Z.


Then in that case it's because some apparent pairs of letters are in fact a single letter in some languages. Per the DZ wiki page someone else has given, DZ is a distinct single letter in Hungarian and Slovak at least.

Double letters being a single letter isn't rare, such as ll and dd in Welsh. In a strange sense it arguably occurs in English in th and th, which I could argue are double-letter representations of single letters, which were originally eth and thorn.

https://en.wikipedia.org/wiki/Eth

https://en.wikipedia.org/wiki/Thorn_%28letter%29


Yes, exactly; "Dz" and "ll" (and "fi" at U+FB01) are single characters to exactly the same extent as "th" is, i.e., not.
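
For what it's worth, Unicode itself records that relationship: U+01F1 and U+FB01 carry compatibility decompositions, so NFKC normalization folds them back into plain letter pairs. A small sketch in Java, using the standard `java.text.Normalizer` (the class and helper names are just for this example):

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class CompatDigraphs {
    // Applies compatibility normalization (NFKC), which replaces
    // compatibility characters with their ordinary equivalents.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // UTF-8 bytes of a string, to separate "code point" from "encoding".
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(fold("\u01F1"));          // the digraph becomes the two letters D and Z
        System.out.println(fold("\uFB01"));          // the ligature becomes the two letters f and i
        System.out.println(fold("\u01F1").length()); // 2 chars after folding
        // The code point U+01F1 is two bytes in UTF-8: C7 B1.
        for (byte b : utf8("\u01F1")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Canonical normalization (NFC or NFD) leaves both characters alone; only the compatibility forms treat them as letter pairs.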



