
Just scripted something to find them all:

    U+01C5: Dž (lower dž, upper DŽ)
    U+01C8: Lj (lower lj, upper LJ)
    U+01CB: Nj (lower nj, upper NJ)
    U+01F2: Dz (lower dz, upper DZ)
    U+1F88: ᾈ (lower ᾀ, upper ἈΙ)
    U+1F89: ᾉ (lower ᾁ, upper ἉΙ)
    U+1F8A: ᾊ (lower ᾂ, upper ἊΙ)
    U+1F8B: ᾋ (lower ᾃ, upper ἋΙ)
    U+1F8C: ᾌ (lower ᾄ, upper ἌΙ)
    U+1F8D: ᾍ (lower ᾅ, upper ἍΙ)
    U+1F8E: ᾎ (lower ᾆ, upper ἎΙ)
    U+1F8F: ᾏ (lower ᾇ, upper ἏΙ)
    U+1F98: ᾘ (lower ᾐ, upper ἨΙ)
    U+1F99: ᾙ (lower ᾑ, upper ἩΙ)
    U+1F9A: ᾚ (lower ᾒ, upper ἪΙ)
    U+1F9B: ᾛ (lower ᾓ, upper ἫΙ)
    U+1F9C: ᾜ (lower ᾔ, upper ἬΙ)
    U+1F9D: ᾝ (lower ᾕ, upper ἭΙ)
    U+1F9E: ᾞ (lower ᾖ, upper ἮΙ)
    U+1F9F: ᾟ (lower ᾗ, upper ἯΙ)
    U+1FA8: ᾨ (lower ᾠ, upper ὨΙ)
    U+1FA9: ᾩ (lower ᾡ, upper ὩΙ)
    U+1FAA: ᾪ (lower ᾢ, upper ὪΙ)
    U+1FAB: ᾫ (lower ᾣ, upper ὫΙ)
    U+1FAC: ᾬ (lower ᾤ, upper ὬΙ)
    U+1FAD: ᾭ (lower ᾥ, upper ὭΙ)
    U+1FAE: ᾮ (lower ᾦ, upper ὮΙ)
    U+1FAF: ᾯ (lower ᾧ, upper ὯΙ)
    U+1FBC: ᾼ (lower ᾳ, upper ΑΙ)
    U+1FCC: ῌ (lower ῃ, upper ΗΙ)
    U+1FFC: ῼ (lower ῳ, upper ΩΙ)
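
The script behind the list above isn't shown. A minimal Python sketch of one way to produce a similar list follows; it assumes Python's built-in `str.lower()` and `str.upper()`, whose results depend on the Unicode version bundled with the interpreter. The function name `changes_both_ways` is made up for illustration.

```python
import sys

def changes_both_ways():
    """Return characters that differ from both their lower() and upper() forms."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # Keep only characters that are neither fully lower nor fully upper.
        if ch.lower() != ch and ch.upper() != ch:
            found.append(ch)
    return found

for ch in changes_both_ways():
    print(f"U+{ord(ch):04X}: {ch} (lower {ch.lower()}, upper {ch.upper()})")
```

Because `str.upper()` applies the full (multi-character) case mappings, the Greek entries come out as two code points in upper case, as in the list.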





You can find them all with this UnicodeSet query (though the query alone naturally won’t show you the lower and upper forms):

  [[:Changes_When_Lowercased:]&[:Changes_When_Uppercased:]]
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%...

It’s a handy way of finding all kinds of things along these lines. Look at the properties of some characters you care about, and see how you can add, subtract and intersect them.


TIL:

Polytonic orthography (from Ancient Greek πολύς (polýs) 'much, many' and τόνος (tónos) 'accent') is the standard system for Ancient Greek and Medieval Greek and includes:

- acute accent (´)

- circumflex accent (ˆ)

- grave accent (`); these 3 accents indicate different kinds of pitch accent

- rough breathing (῾) indicates the presence of the /h/ sound before a letter

- smooth breathing (᾿) indicates the absence of /h/.

Since in Modern Greek the pitch accent has been replaced by a dynamic accent (stress), and /h/ was lost, most polytonic diacritics have no phonetic significance, and merely reveal the underlying Ancient Greek etymology.

(https://en.wikipedia.org/wiki/Greek_diacritics)


This seems to be missing the iota subscript (aka ypogegrammeni), which is the source of the weirdness in what happens when casing, e.g., ῳ. (This is another diacritic that modern Greek has abandoned, since its impact on pronunciation was already being lost in the classical era. When I took Attic Greek in college, pronunciation wasn’t a critical thing, but we treated all the accents as simply a stress accent, ignored the iota subscript and pronounced the rough breathing as h.)

In upper case, ῳ can be written as ῼ, as Ω with the subscript, or as ΩΙ, with the distinction between the first two often made as a matter of font design (in fact, the appearance of ῼ differs depending on whether it’s in the edit box or in text on this site).
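
To see how one implementation treats the three forms, here is a small Python sketch. It assumes Python's built-in case methods and only shows what that runtime does; it is not a statement about how Greek text ought to be typeset.

```python
small = "\u1ff3"  # ῳ GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI

upper = small.upper()  # full uppercase mapping: two code points, Ω followed by Ι
title = small.title()  # titlecase mapping: the single code point ῼ (U+1FFC)

print([f"U+{ord(c):04X}" for c in upper])
print([f"U+{ord(c):04X}" for c in title])
print(title.lower() == small)  # lowercasing the titlecase form gives ῳ back
```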


One of the features of finl is the ability to have automatic substitutions of character inputs to, e.g., enable the TeX standard for inputting characters like “, ” and —.

Playing with this, I was thinking that I could enable use of Silvio Levy’s old 7-bit ASCII input for Greek and realized that you would need different mappings of characters depending on where the character mapping happened relative to case folding. Text is messier than most people realize.


Reminds me of Vietnamese and its use of diacritics to mark tones. Vietnamese also uses diacritical markings to differentiate some vowels.

https://en.wikipedia.org/wiki/Vietnamese_phonology#Tone?wpro...


There is speculation that the polytonic accents in Greek (which were a late addition to the alphabet, incidentally) were originally tone markers: ΄ represented a rising tone, ` a falling tone and ῀ a rising then falling tone.

The other day I posted similar tables/scripts for related character properties and there was some good discussion: https://news.ycombinator.com/item?id=42014045

- Unicode codepoints that expand or contract when case is changed in UTF-8: https://gist.github.com/rendello/d37552507a389656e248f3255a6...

- Unicode roundtrip-unsafe characters: https://gist.github.com/rendello/4d8266b7c52bf0e98eab2073b38...
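
The gists are the real source for these tables. As a rough sketch of the first idea (UTF-8 length changing under case conversion), assuming Python's built-in `str.lower()`/`str.upper()` and with the hypothetical helper names `utf8_len` and `length_changers`:

```python
import sys

def utf8_len(s):
    """Number of bytes in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))

def length_changers():
    """Return (char, direction, mapped) triples whose UTF-8 length changes."""
    out = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # lone surrogates cannot be encoded as UTF-8
        ch = chr(cp)
        for direction, mapped in (("lower", ch.lower()), ("upper", ch.upper())):
            if utf8_len(mapped) != utf8_len(ch):
                out.append((ch, direction, mapped))
    return out

results = length_changers()
print(len(results), "length-changing mappings in this Python's Unicode data")
```

For example, the Kelvin sign (U+212A) is three bytes in UTF-8 while its lowercase `k` is one, so a buffer sized from the input length can be wrong in either direction.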

For example, if we do uppercase→lower→upper, some characters don't survive the roundtrip:

Ω ω Ω

İ i̇ İ

K k K

Å å Å

ẞ ß SS

ϴ θ Θ
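
A minimal sketch of that roundtrip check, assuming Python's built-in case methods (the linked gist may do it differently; `roundtrip` and `roundtrip_unsafe` are illustrative names):

```python
import sys

def roundtrip(ch):
    """Lowercase, then uppercase again."""
    return ch.lower().upper()

def roundtrip_unsafe():
    """Characters that lowercase to something else and do not come back."""
    unsafe = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ch.lower() != ch and roundtrip(ch) != ch:
            unsafe.append(ch)
    return unsafe

for ch in roundtrip_unsafe():
    print(ch, ch.lower(), roundtrip(ch))
```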

I'm using the scripts to build out a little automated-testing generator library, something like "Tricky Unicode/UTF-8 case-change characters". Any other weird case quirks anyone can think of to put in the generators?


Note that semantic meaning for the second case is preserved - whether you use a precomposed symbol for capital I with overdot, or a combining character for the latter, it's supposed to be the same thing.

The others are much worse in this regard, since they actually lose meaningful information.


Seems like a lot of these would be taken care of by normalization, though? Pre-composed characters are a bit of a mess.

I do feel it is an error that unit/math symbols get changed; imho they should stay as-is through case conversions.
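
A quick way to check which of the roundtrip examples above normalization affects, sketched with Python's `unicodedata` module (NFC is chosen here as an assumption; other normalization forms behave differently):

```python
import unicodedata

def nfc(s):
    """Normalize s to NFC (canonical composition)."""
    return unicodedata.normalize("NFC", s)

samples = {
    "OHM SIGN": "\u2126",
    "KELVIN SIGN": "\u212a",
    "ANGSTROM SIGN": "\u212b",
    "I + COMBINING DOT ABOVE": "I\u0307",
    "CAPITAL SHARP S": "\u1e9e",
}

for label, text in samples.items():
    before = " ".join(f"U+{ord(c):04X}" for c in text)
    after = " ".join(f"U+{ord(c):04X}" for c in nfc(text))
    print(f"{label}: {before} -> {after}")
```

In this sketch the three unit signs normalize to ordinary letters and I plus the combining dot composes to İ, while capital sharp s has no canonical decomposition and is left alone, so normalization does not help with the ẞ → ß → SS case.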


These lists (and the future library) were made to test normalization and break software that made bad assumptions. I initially generated the list because I knew that some of the assumptions in the parser I was writing were not solid, and sure enough the tests broke it.

Someone pointed out the canonical source, which I'll have to look at more closely:

https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt
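
For anyone wanting to read that file programmatically, here is a rough sketch of a parser. It assumes the `code; status; mapping; # name` line layout the file uses and runs on three sample lines embedded in the code rather than on a download; `parse_case_folding` is a hypothetical helper, and keeping only the C and F statuses (full case folding) is a choice made for this example.

```python
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
212A; C; 006B; # KELVIN SIGN
"""

def parse_case_folding(text, statuses=("C", "F")):
    """Map each source character to its folded string for the given statuses."""
    folds = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        code, status, mapping = [field.strip() for field in line.split(";")[:3]]
        if status in statuses:
            folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
            folds[chr(int(code, 16))] = folded
    return folds

folds = parse_case_folding(SAMPLE)
for src, dst in folds.items():
    # Compare against Python's own full case folding.
    print(f"U+{ord(src):04X} -> {dst!r} (str.casefold gives {src.casefold()!r})")
```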


The Unicode names of these 31 chars,

  LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON

  LATIN CAPITAL LETTER {_} WITH SMALL LETTER {_}
    L,J
    N,J
    D,Z

  GREEK CAPITAL LETTER {ALPHA,ETA,OMEGA} WITH PROSGEGRAMMENI

  GREEK CAPITAL LETTER {ALPHA,ETA,OMEGA} WITH {PSILI,DASIA} AND {_}
    PROSGEGRAMMENI
    VARIA AND PROSGEGRAMMENI
    OXIA AND PROSGEGRAMMENI
    PERISPOMENI AND PROSGEGRAMMENI
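
To list the full names rather than the patterns, a short Python sketch using `unicodedata`. It assumes the 31 characters are exactly those with General_Category `Lt` (Titlecase_Letter), which matches the code points listed at the top of the thread.

```python
import sys
import unicodedata

# Collect every code point whose general category is Lt (titlecase letter).
titlecase = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Lt"
]

for ch in titlecase:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```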

What's the difference with letter Ch [0]? When it's capitalized at the beginning of the word, it also looks like uppercase C and lowercase h.

[0]https://en.wikipedia.org/wiki/Ch_(digraph)


There is no single Unicode character representing "Ch".

Here's a list of Unicode digraphs: DZ, Dz, dz, DŽ, Dž, dž, IJ, ij, LJ, Lj, lj, NJ, Nj, nj, ᵺ

https://en.wikipedia.org/wiki/Digraph_(orthography)#In_Unico...


Yeah, but why does Unicode have those and not ch?

According to [1], these particular ones exist because of legacy encodings of Serbo-Croatian,

    Digraphs ⟨dž⟩, ⟨lj⟩ and ⟨nj⟩ in their upper case, title case and lower case forms have dedicated Unicode code points as shown in the table below. However, these are included chiefly for backwards compatibility with legacy encodings which kept a one-to-one correspondence with Cyrillic; modern texts use a sequence of characters.

[1] https://en.wikipedia.org/wiki/Gaj%27s_Latin_alphabet#Computi...

Ch may be a digraph in many languages, but is it implemented in Unicode as a single character?


