I don't recognize that one, but there are in fact characters that render as whit...

LgWoodenBadger · on June 16, 2020

This is particularly annoying when trying to sanitize input in a language like Java, where String.trim() only removes characters at or below U+0020 (space, tab, and the other ASCII control characters), and the predefined regex classes in Pattern cover some of the whitespace characters, but not all.

\h  Horizontal whitespace characters: [ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]

\v  Vertical whitespace characters: [\n\x0B\f\r\x85\u2028\u2029]

\s  Whitespace characters: [ \t\n\x0B\f\r]
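As a rough illustration of the gap, here is a minimal Java sketch. In java.util.regex.Pattern the horizontal and vertical classes quoted above are spelled \h and \v. The class name WhitespaceDemo and the helper trimWide are made up for this example, and the pattern is only one possible way to combine the two classes.

```java
import java.util.regex.Pattern;

final class WhitespaceDemo {
    // Leading or trailing runs of horizontal (\h) or vertical (\v) whitespace.
    private static final Pattern EDGE_WHITESPACE =
            Pattern.compile("^[\\h\\v]+|[\\h\\v]+\\z");

    /** Hypothetical helper: strips everything \h or \v matches from both ends. */
    static String trimWide(String input) {
        return EDGE_WHITESPACE.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // NO-BREAK SPACE and EM SPACE in front, IDEOGRAPHIC SPACE behind.
        String padded = "\u00A0\u2003hello\u3000";

        // trim() only drops code points at or below U+0020, so nothing is removed here.
        System.out.println("[" + padded.trim() + "]");

        // The combined class removes all three padding characters.
        System.out.println("[" + trimWide(padded) + "]");
    }
}
```

Note that neither class includes U+2800, so a leading BRAILLE PATTERN BLANK would survive trimWide as well.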

a1369209993 · on June 15, 2020

Which is particularly egregious, since in actual braille, that is a whitespace character (specifically, a space with no dots in it represents the character... space! 0x20).

throwaway_pdp09 · on June 15, 2020

Individual glyphs are considered to carry semantics. So IIRC there is a Unicode minus sign and a Unicode dash, and even if they are visibly indistinguishable you're not supposed to mix them. I'm no Unicode expert though. I doubt the Unicode Consortium is composed of fools and nincompoops. Best assume they know what they're doing.

a1369209993 · on June 15, 2020

I'm not claiming that BRAILLE PATTERN BLANK is the same character as SPACE (I don't particularly disagree, but that's not my point); I'm claiming that BRAILLE PATTERN BLANK is a whitespace character, just like SPACE or NEWLINE or U+2003 EM SPACE (pretty much any of the U+200xs, really), regardless of whether it's the same character.

throwaway_pdp09 · on June 15, 2020

I exactly get your point. I don't know why either, but I'm assuming there's a reason.

a1369209993 · on June 15, 2020

Fair enough, but I'm assuming the reason, whatever it is, is at least as stupid as the reason why U+01F1 exists and is not encoded as 44 5A.

dhosek · on June 15, 2020

It's a backwards compatibility issue with an older character set where the symbol is used in Latin transcription of Macedonian.

https://en.wikipedia.org/wiki/Dz_(digraph)#Unicode

a1369209993 · on June 16, 2020

Yes, and I'm assuming the reason BRAILLE PATTERN BLANK is not correctly classified is at least as stupid as that, and can therefore be safely ignored for the purposes of having legitimate reasons for things.

throwaway_pdp09 · on June 16, 2020

Finally you've got to the point: that braille whitespace is not classified as whitespace (which on checking is true). Why didn't you say that at the start?
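For anyone who wants to repeat that check, a minimal Java sketch follows. It relies on Java's own Character tables, which track the Unicode general categories; the class name BrailleBlankCheck and the helper isAnySpace are invented for this example.

```java
final class BrailleBlankCheck {
    static final int BRAILLE_PATTERN_BLANK = 0x2800;
    static final int EM_SPACE = 0x2003;

    /** Hypothetical helper: true if Java treats the code point as any kind of space. */
    static boolean isAnySpace(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    public static void main(String[] args) {
        // EM SPACE is a space separator, so this reports true.
        System.out.println(isAnySpace(EM_SPACE));

        // BRAILLE PATTERN BLANK is not, so this reports false.
        System.out.println(isAnySpace(BRAILLE_PATTERN_BLANK));

        // Its general category is "Symbol, other" rather than a separator.
        System.out.println(
                Character.getType(BRAILLE_PATTERN_BLANK) == Character.OTHER_SYMBOL);
    }
}
```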

a1369209993 · on June 16, 2020

> that braille whitespace is not classified [correctly]

I did! That was the first thing I said.

> > characters that render as whitespace but aren't, for example U+2800 (BRAILLE PATTERN BLANK).

> in actual braille, that is a whitespace character

throwaway_pdp09 · on June 16, 2020

Indeed you did. My apologies.

mark-r · on June 15, 2020

There are a lot of those in Unicode: characters that are visually indistinguishable but exist to provide round-trip capability to some older character set.

throwaway_pdp09 · on June 15, 2020

Why on earth would Unicode U+01F1, with its obvious hex encoding of 01F1, be encoded as hex value 44 5A (a Han character https://www.fileformat.info/info/unicode/char/445a/index.htm)?

wnoise · on June 15, 2020

He meant U+0044 U+005A: the ASCII letters D and Z.
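To make the two readings concrete, here is a small Java sketch that prints the UTF-8 bytes of each string. UTF-8 is an assumption here, since the thread never says which encoding was meant, and the class name DzBytes and the helper utf8Hex are invented for the example.

```java
import java.nio.charset.StandardCharsets;

final class DzBytes {
    /** Hypothetical helper: UTF-8 bytes of a string as space-separated uppercase hex. */
    static String utf8Hex(String s) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // The single code point U+01F1 (LATIN CAPITAL LETTER DZ).
        System.out.println(utf8Hex("\u01F1")); // C7 B1

        // The two ASCII letters D and Z.
        System.out.println(utf8Hex("DZ"));     // 44 5A
    }
}
```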

throwaway_pdp09 · on June 15, 2020

Then in that case because some apparent pairs of letters are in fact a single letter in some languages. Per the DZ wiki page someone else has given, DZ is a distinct single letter in Hungarian and Slovak at least.

Double letters being a single letter isn't rare, such as ll and dd in Welsh. In a strange sense it arguably occurs in English in th and th, which I could argue are double-letter representations of single letters, which were originally eth and thorn.

https://en.wikipedia.org/wiki/Eth

https://en.wikipedia.org/wiki/Thorn_%28letter%29

a1369209993 · on June 16, 2020

Yes, exactly; "Dz" and "ll" (and "fi" at U+FB01) are single characters to exactly the same extent as "th" is, i.e. not.
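For what it's worth, Unicode's compatibility normalization treats these precomposed characters the same way. A minimal Java sketch, assuming java.text.Normalizer and using an invented class name CompatFold with invented helpers compat and canonical:

```java
import java.text.Normalizer;

final class CompatFold {
    /** Hypothetical helper: compatibility normalization (NFKC). */
    static String compat(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /** Hypothetical helper: canonical normalization (NFC). */
    static String canonical(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String dz = "\u01F1"; // LATIN CAPITAL LETTER DZ
        String fi = "\uFB01"; // LATIN SMALL LIGATURE FI

        // NFKC folds each one to its plain two-letter spelling.
        System.out.println(compat(dz).equals("DZ"));
        System.out.println(compat(fi).equals("fi"));

        // NFC leaves them alone, which is what keeps round-tripping possible.
        System.out.println(canonical(dz).equals(dz));
        System.out.println(canonical(fi).equals(fi));
    }
}
```

Whether to fold them is an application decision: NFKC is lossy, so it suits search and comparison better than storage.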