CSS text-transform is language-dependant: the Dutch case explained

dasil003 · on April 1, 2012

I might be biased because I recently spent a couple of days wrestling with it, but I think the Turkish case is more interesting. Turkish has two separate vowels that both use forms of the letter I:

İ/i and I/ı

So the familiar lowercase i and uppercase I are essentially split into two separate letters, and each gets a new counterpart created by growing or shrinking the original: dotted i gains a dotted capital İ, and dotless I gains a dotless lowercase ı. I'm sure it seemed quite an elegant solution in the 1920s when they were working on the Latinized Turkish alphabet.

The implication of this is that unlike the Dutch case, this affects all text-transform actions (uppercase, lowercase, and capitalize).
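To make the mapping concrete, here is a minimal Python sketch of Turkish-aware casing. It is only an illustration, not the replacement method from this comment: the helper names (turkish_upper, turkish_lower, turkish_capitalize) are made up, and it assumes the whole string is Turkish. Python's built-in str.upper() and str.lower() are not language-aware, so the two I pairs are swapped by hand before the default mapping handles everything else.

```python
def turkish_upper(text: str) -> str:
    # Dotted i -> dotted capital İ, dotless ı -> plain capital I,
    # then the default mapping takes care of every other letter.
    return text.replace("i", "İ").replace("ı", "I").upper()


def turkish_lower(text: str) -> str:
    # Dotted capital İ -> plain i, plain capital I -> dotless ı.
    return text.replace("İ", "i").replace("I", "ı").lower()


def turkish_capitalize(word: str) -> str:
    # Uppercase the first letter only, lowercase the rest.
    return turkish_upper(word[:1]) + turkish_lower(word[1:])


print(turkish_upper("ısırgan iki"))    # ISIRGAN İKİ
print(turkish_lower("İSTANBUL ILIK"))  # istanbul ılık
print(turkish_capitalize("istanbul"))  # İstanbul
```

For comparison, the default mapping gives "i".upper() == "I" and "I".lower() == "i", which is exactly what breaks Turkish text.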

When we were doing our Turkish localized site I started digging into this, and I was horrified that no browser actually supports proper Turkish capitalization rules. Firefox in particular had a bug open since 2004 (now finally fixed as well). Asking around with Turkish web developers I heard of some crazy hacks (custom fonts!), but I got the feeling that Turkish web designers just avoid text-transform. This was not an option for us as we rely heavily on text-transform in our design (http://tr.mubi.com).

In the end I was able to piece together a surprisingly robust JavaScript replacement method with some help from Stack Overflow:

http://stackoverflow.com/a/8743095/8376

lillycat · on April 1, 2012

The Turkic bug was fixed a week ago and the fix will be in Firefox 14. As far as I know, WebKit still has this bug. I don't know about IE or Opera; if somebody knows...

pennig · on April 1, 2012

The code path for that case must be delightful.

robin_reala · on April 1, 2012

Here's the diff:

https://hg.mozilla.org/mozilla-central/rev/bb53aec4a302

Doesn't look too bad.

masklinn · on April 1, 2012

It's really weird that they just add special cases like that. Though I expect it's just because they don't have enough special cases yet (went from one — for the Turkish I — to two).

I'd have expected something like a generic Unicode-aware/y text management layer, and CSS text transforms would just go through that layer.

underwater · on April 1, 2012

Eek. Not only are they hardcoding the logic but they mix their CSS-specific code into the function. I understand that they are handling a limited number of cases now but if I came across that kind of code in my work I'd be very sceptical.

ars · on April 1, 2012

That's called not over-engineering something.

Making something more complicated doesn't make it better. Make it more complicated when you need to, not before.

mcpherrinm · on April 2, 2012

The problem is that Unicode doesn't know about language. Unicode is just characters.

Language-aware bits are more gross, but then language often is. It's not nicely structured like most of the other things we encounter when transforming data.

masklinn · on April 3, 2012

> The problem is that Unicode doesn't know about language. Unicode is just characters.

I won't blame you for this, it is a common mistake, but Unicode goes far beyond merely mapping characters to integers. The Standard Annexes, Technical Reports and Technical Specifications cover pretty much all things localization from line breaking [UAX14] to regular expressions [UTS18] through date and time formatting [UTS35] or sorting [UTS10].

And as it turns out, both uppercasing and titlecasing are covered by [UAX44] as part of the SpecialCasing.txt file, which provides lower-, upper- and title-casing (along with optional conditions) for characters with non-trivial mappings. Trivial 1:1 mappings are covered in the base UnicodeData.txt file.

[UAX14] http://www.unicode.org/reports/tr14/

[UTS18] http://www.unicode.org/reports/tr18/

[UTS35] http://www.unicode.org/reports/tr35/

[UTS10] http://www.unicode.org/reports/tr10/

[UAX44] http://www.unicode.org/reports/tr44/
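As a rough illustration of what "along with optional conditions" means, here is a small Python sketch that parses two lines written in the SpecialCasing.txt field layout (code; lower; title; upper; optional conditions) and applies the uppercase mappings. The function names are hypothetical, the two sample lines are reproduced from memory rather than read from the real file, and every condition is treated as a language tag, which ignores context conditions such as Final_Sigma.

```python
# Sample lines in the SpecialCasing.txt layout (an assumption, not the full file):
# <code>; <lower>; <title>; <upper>; (<conditions>;)? # <comment>
SAMPLE = """\
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
"""


def decode(field):
    # "0053 0053" -> "SS"
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_special_casing(text):
    rules = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.rstrip(";").split(";")]
        code, lower, title, upper = fields[:4]
        conditions = fields[4].split() if len(fields) > 4 else []
        rules.append({
            "char": decode(code),
            "lower": decode(lower),
            "title": decode(title),
            "upper": decode(upper),
            "conditions": conditions,
        })
    return rules


def upper_with_rules(text, rules, lang=None):
    # Unconditional rules always apply; conditional ones only for a matching language.
    table = {
        r["char"]: r["upper"]
        for r in rules
        if not r["conditions"] or lang in r["conditions"]
    }
    return "".join(table.get(ch, ch.upper()) for ch in text)


rules = parse_special_casing(SAMPLE)
print(upper_with_rules("straße", rules))               # STRASSE
print(upper_with_rules("istanbul", rules))             # ISTANBUL
print(upper_with_rules("istanbul", rules, lang="tr"))  # İSTANBUL
```

The point of the design is that the language only selects which rows of the table are active; the transform itself stays generic.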

darklajid · on April 1, 2012

I wonder how the German ß is handled. Having no clue about the implementation of these transforms, wouldn't that be a similar case?

lillycat · on April 1, 2012

Yes, in Firefox the 'eszett' is transformed into SS when uppercased, but this has been done for a long time. The dotted and dotless Turkic I and the Dutch IJ are new in Firefox 14 (which is the first browser to support them, AFAIK).

There are also some specific cases with accented Greek diphthongs, where the diacritic position changes between upper and lower case, but Mozilla is working on a fix.
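For anyone wondering what the Dutch IJ rule amounts to, here is a minimal Python sketch, assuming a capitalize that only uppercases the first letter and leaves the rest alone. The function name and the lang parameter are made up for illustration; this is not how Firefox implements it.

```python
def capitalize_word(word: str, lang: str = "") -> str:
    # In Dutch, a word-initial "ij" is capitalized as a unit: "IJ", not "Ij".
    if lang == "nl" and word[:2].lower() == "ij":
        return "IJ" + word[2:]
    # Default: uppercase the first character only.
    return word[:1].upper() + word[1:]


print(capitalize_word("ijsselmeer", lang="nl"))  # IJsselmeer
print(capitalize_word("ijsselmeer"))             # Ijsselmeer
print(capitalize_word("straat", lang="nl"))      # Straat
```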

Michiel · on April 1, 2012

Of course, editors should be using the ligature (Unicode character LATIN SMALL LIGATURE IJ). But having said that: I'm Dutch and a) I have never used that character and b) I have no idea how to type it on a keyboard.
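A quick Python sketch of how that character behaves, using only the standard unicodedata module. The ligature is a single code point (U+0133) with its own uppercase form (U+0132), so it capitalizes correctly without any language-specific rule, and compatibility normalization (NFKC) folds it back into the plain two-letter sequence. The variable names are just for illustration.

```python
import unicodedata

SMALL_IJ = "\u0133"    # ĳ
CAPITAL_IJ = "\u0132"  # Ĳ

print(unicodedata.name(SMALL_IJ))    # LATIN SMALL LIGATURE IJ
print(unicodedata.name(CAPITAL_IJ))  # LATIN CAPITAL LIGATURE IJ

# The default case mapping already pairs the two ligatures.
print(SMALL_IJ.upper() == CAPITAL_IJ)  # True

# NFKC replaces the ligature with the ordinary letters i and j.
word = SMALL_IJ + "sselmeer"
folded = unicodedata.normalize("NFKC", word)
print(folded, len(word), len(folded))  # ijsselmeer 9 10
```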

Nvn · on April 1, 2012

That's actually discussed in the bug report[1] and apparently its use is discouraged by Unicode.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=740477#c2

nemoniac · on April 1, 2012

Unicode has a gazillion code points and it's discouraging us from writing our own language? Really?

Someone · on April 1, 2012

http://unicode.org/faq/ligature_digraph.html:

A: The existing ligatures exist basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged. No more will be encoded in any circumstances.

Navarr · on April 1, 2012

Now we only need "text-transform: katakana" and "text-transform: hiragana" for emphasizing Japanese text.

Of course, this can't possibly work with Kanji without some special workaround.
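Just as a sketch of why the kana half would be easy: the hiragana and katakana blocks are laid out in parallel in Unicode, 0x60 code points apart, so a hypothetical transform is a simple offset. The function names are made up, and the ranges cover only the basic syllables (U+3041 to U+3096 and U+30A1 to U+30F6). Kanji pass through untouched, which is exactly the limitation mentioned above.

```python
KANA_OFFSET = 0x60
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096
KATAKANA_START, KATAKANA_END = 0x30A1, 0x30F6


def hiragana_to_katakana(text: str) -> str:
    # Shift each hiragana code point up into the katakana block.
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END else ch
        for ch in text
    )


def katakana_to_hiragana(text: str) -> str:
    # Shift each katakana code point down into the hiragana block.
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_START <= ord(ch) <= KATAKANA_END else ch
        for ch in text
    )


print(hiragana_to_katakana("ひらがな"))  # ヒラガナ
print(katakana_to_hiragana("カタカナ"))  # かたかな
print(hiragana_to_katakana("日本語の"))  # 日本語ノ (kanji unchanged)
```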

iamgilesbowkett · on April 1, 2012

there is no such word as "dependant." it's "dependent." really sorry to be that guy but it would hugely brighten my day if you could fix the spelling in the title.

dbuxton · on April 1, 2012

Well, there is such a word, certainly in the UK.

A more constructive comment might be, "the spelling variant you have used is normally a noun meaning 'a person who depends on another for their upkeep or care', where here you want the more normal adjectival spelling 'dependent'."

Although fwiw it seems that even as an adjective "dependant" might fly: http://www.wordnik.com/words/dependant