It’s a handy way of finding all kinds of things along these lines. Look at the properties of some characters you care about, and see how you can add, subtract and intersect them.
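As a rough stand-in for that kind of property algebra, here is a minimal Python sketch using plain sets built from the standard `unicodedata` module. The helpers `chars_where` and `name_contains`, and the choice of properties, are my own illustration and not part of any particular tool.

```python
import sys
import unicodedata

def chars_where(predicate):
    """Return the set of all characters (as 1-char strings) matching predicate."""
    return {chr(cp) for cp in range(sys.maxunicode + 1) if predicate(chr(cp))}

def name_contains(word):
    """Predicate factory: does the character's Unicode name contain this word?"""
    return lambda ch: word in unicodedata.name(ch, "").split()

# Three "properties" expressed as sets.
lowercase = chars_where(lambda ch: unicodedata.category(ch) == "Ll")
titlecase = chars_where(lambda ch: unicodedata.category(ch) == "Lt")
iota_sub = (chars_where(name_contains("YPOGEGRAMMENI"))
            | chars_where(name_contains("PROSGEGRAMMENI")))  # union ("add")

lower_with_iota = iota_sub & lowercase        # intersect
title_with_iota = iota_sub & titlecase        # intersect
neither = iota_sub - lowercase - titlecase    # subtract

# Print code points and names only, so the output stays ASCII.
for ch in sorted(neither):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Each call to `chars_where` scans the whole code space, so this is slow but simple; the results depend on the Unicode version bundled with your Python.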
Polytonic orthography (from Ancient Greek πολύς (polýs) 'much, many' and τόνος (tónos) 'accent') is the standard system for Ancient Greek and Medieval Greek and includes:
- acute accent (´)
- circumflex accent (ˆ)
- grave accent (`); these 3 accents indicate different kinds of pitch accent
- rough breathing (῾) indicates the presence of the /h/ sound before a letter
- smooth breathing (᾿) indicates the absence of /h/.
Since in Modern Greek the pitch accent has been replaced by a dynamic accent (stress), and /h/ was lost, most polytonic diacritics have no phonetic significance, and merely reveal the underlying Ancient Greek etymology.
This seems to be missing the iota subscript (aka ypogegrammeni), which is the source of the weirdness of what happens when casing, e.g., ῳ. This is another diacritic that modern Greek has abandoned, since its impact on pronunciation was already being lost in the classical era. (When I took Attic Greek in college, pronunciation wasn’t a critical thing, but we treated all the accents as simply a stress accent, ignored iota subscript and pronounced the rough breathing as h.)
In upper case, ῳ can be written as ῼ, Ω with the subscript, or ΩΙ, with the distinction between the first two often made as a matter of font design (in fact, the appearance of ῼ differs depending on whether it’s in the edit box or in text on this site).
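To see how one runtime handles this, here is a small sketch using Python's built-in string methods, which apply the full Unicode case mappings of whatever Unicode version the interpreter ships. The variable names are just for illustration.

```python
import unicodedata

omega_iota = "\u1ff3"  # GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI

upper = omega_iota.upper()  # full uppercase mapping (may be more than one character)
title = omega_iota.title()  # titlecase mapping

# Print code points and names rather than glyphs, since fonts render these differently.
for label, s in (("lower", omega_iota), ("upper", upper), ("title", title)):
    described = ", ".join(f"U+{ord(c):04X} {unicodedata.name(c)}" for c in s)
    print(f"{label}: {described}")
```

In CPython 3 the uppercase form comes out as the two-character sequence Ω + Ι, while the titlecase form is the single character ῼ. Lowercasing the two-character form gives ωι, not ῳ, so the subscript is gone for good once you have uppercased.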
One of the features of finl is the ability to have automatic substitutions of character inputs to, e.g., enable the TeX standard for inputting characters like “, ” and —.
Playing with this, I was thinking that I could enable use of Silvio Levy’s old 7-bit ASCII input for Greek, and realized that you would need different mappings of characters depending on where the character mapping happened relative to case folding. Text is messier than most people realize.
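A toy illustration of the ordering problem, in Python. The two-entry table `TOY_MAP` and the `transliterate` helper are hypothetical and only meant to show the effect; they are not Levy's actual scheme and not how finl does it.

```python
# Hypothetical ASCII-to-Greek table with lowercase keys only (an assumption for this sketch).
TOY_MAP = {
    "w|": "\u1ff3",  # omega with iota subscript
    "w": "\u03c9",   # plain omega
}

def transliterate(text, table):
    """Greedy longest-match substitution; unmatched characters pass through unchanged."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

source = "w|"
map_then_upper = transliterate(source, TOY_MAP).upper()  # map first, then change case
upper_then_map = transliterate(source.upper(), TOY_MAP)  # change case first, then map

print(ascii(map_then_upper), ascii(upper_then_map))
```

Mapping first and then uppercasing yields Ω + Ι, while uppercasing first leaves the raw ASCII `W|` untouched because the table has no uppercase keys. So the table you need really does depend on which side of the case change it runs on.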
There is speculation that the polytonic accents in Greek (which were a late addition to the alphabet, incidentally), originally were tone markers. ΄ represented a rising tone, ` a falling tone and ῀ a rising then falling tone.
For example, if we do uppercase→lower→upper, some characters don't survive the roundtrip:
Ω ω Ω
İ i̇ İ
K k K
Å å Å
ẞ ß SS
ϴ θ Θ
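A minimal sketch of how a list like this could be generated, assuming Python's built-in `str.upper` and `str.lower` (which use the full case mappings of the interpreter's bundled Unicode version, so results may vary between versions). The function names are mine.

```python
import sys

def roundtrip(ch):
    """Return (lowered, re-uppered) for a single character."""
    low = ch.lower()
    return low, low.upper()

def failed_roundtrips():
    """Yield (original, lowered, re-uppered) where upper -> lower -> upper changes the text."""
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ch.upper() != ch:
            continue  # only start from characters that are already their own uppercase
        low, back = roundtrip(ch)
        if low != ch and back != ch:
            yield ch, low, back

if __name__ == "__main__":
    # Code points only, since several of these look identical in most fonts.
    for ch, low, back in failed_roundtrips():
        fmt = lambda s: " ".join(f"U+{ord(c):04X}" for c in s)
        print(f"{fmt(ch)} -> {fmt(low)} -> {fmt(back)}")
```

Printing code points matters here: the ohm sign (U+2126) and capital omega (U+03A9), or the kelvin sign (U+212A) and plain K, are usually indistinguishable on screen.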
I'm using the scripts to build out a little automated-testing generator library, something like "Tricky Unicode/UTF-8 case-change characters". Any other weird case quirks anyone can think of to put in the generators?
Note that semantic meaning for the second case is preserved - whether you use a precomposed symbol for capital I with overdot, or a combining character for the latter, it's supposed to be the same thing.
The others are much worse in this regard, since they actually lose meaningful information.
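One way to check the "same thing" claim is to compare under Unicode normalization. A small sketch, assuming NFC is the equivalence you care about; `canonically_equal` is just a helper name I made up.

```python
import unicodedata

def canonically_equal(a, b):
    """True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

dotted_i = "\u0130"                       # LATIN CAPITAL LETTER I WITH DOT ABOVE
roundtripped = dotted_i.lower().upper()   # "I" followed by U+0307 COMBINING DOT ABOVE

print(roundtripped == dotted_i)                    # raw code point comparison
print(canonically_equal(roundtripped, dotted_i))   # comparison after normalization

sharp_s = "\u1e9e"                        # LATIN CAPITAL LETTER SHARP S
print(canonically_equal(sharp_s.lower().upper(), sharp_s))
```

The dotted I differs code point by code point after the roundtrip but compares equal once normalized. The capital sharp s comes back as "SS", and no normalization form will turn that back into ẞ.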
These lists (and the future library) were made to test normalization and break software that made bad assumptions. I initially generated the list because I knew that some of the assumptions in the parser I was writing were not solid, and sure enough the tests broke it.
Someone pointed out the canonical source, which I'll have to look at more closely:
LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
LATIN CAPITAL LETTER {_} WITH SMALL LETTER {_}
L,J
N,J
D,Z
GREEK CAPITAL LETTER {ALPHA,ETA,OMEGA} WITH PROSGEGRAMMENI
GREEK CAPITAL LETTER {ALPHA,ETA,OMEGA} WITH {PSILI,DASIA} AND {_}
PROSGEGRAMMENI
VARIA AND PROSGEGRAMMENI
OXIA AND PROSGEGRAMMENI
PERISPOMENI AND PROSGEGRAMMENI
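The names above look like the characters in general category Lt (titlecase letter). Assuming that's what the list is, a quick way to enumerate them yourself with Python's `unicodedata` (the function name is mine, and the exact set depends on the bundled Unicode version):

```python
import sys
import unicodedata

def titlecase_letters():
    """Return (code point label, name) for every character in general category Lt."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Lt":
            found.append((f"U+{cp:04X}", unicodedata.name(ch)))
    return found

if __name__ == "__main__":
    for label, name in titlecase_letters():
        print(label, name)
```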
According to [1], these particular ones exist because of legacy encodings of Serbo-Croatian:
Digraphs ⟨dž⟩, ⟨lj⟩ and ⟨nj⟩ in their upper case, title case and lower case forms have dedicated Unicode code points as shown in the table below. However, these are included chiefly for backwards compatibility with legacy encodings which kept a one-to-one correspondence with Cyrillic; modern texts use a sequence of characters.
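For the curious, the three case forms of one of these digraphs can be pulled out with Python's string methods. This is only a sketch using the built-in case mappings and `unicodedata`; the variable names are illustrative.

```python
import unicodedata

dz_lower = "\u01c6"  # LATIN SMALL LETTER DZ WITH CARON

forms = {
    "lower": dz_lower,
    "upper": dz_lower.upper(),
    "title": dz_lower.title(),
}

for label, ch in forms.items():
    print(f"{label}: U+{ord(ch):04X} {unicodedata.name(ch)}")

# Compatibility normalization splits the digraph into the two-letter
# sequence that modern texts would use instead.
as_sequence = unicodedata.normalize("NFKC", forms["title"])
print(" ".join(f"U+{ord(c):04X}" for c in as_sequence))
```

The titlecase form is a third, separate code point (U+01C5), distinct from both the all-caps U+01C4 and the lowercase U+01C6.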