
In the Turkish locale, "INFO".lower() != "info" - duckerude
https://github.com/python/cpython/blob/fff3c28052e6b0750d6218e00acacd2fded4991a/Lib/logging/handlers.py#L802
======
kentonv
Has anyone here _ever_ had a use case for toLower() where they actually wanted
localization to apply?

It seems to me that in practice, it's extremely rare to want to change case of
real, natural-language text. When I have natural-language text, it's just a
blob to me, and I don't want to touch it.

The only time I ever want to lower-case or capitalize something, I'm working
with identifiers meant for computer -- not human -- consumption. Usually,
specifically, I'm dealing with identifiers that have annoyingly been defined
to be case-insensitive even though the only humans that ever see them are
programmers and programmers hate case-insensitivity. HTTP headers are a common
example.

I mostly write C++, and I end up writing code like:

    
    
        for (char& c: str) {
          if ('A' <= c && c <= 'Z') c = c - 'A' + 'a';
        }
    

Later on, some well-meaning developer on my team will come along and say "Ugh
what is this NIH syndrome?" and then they "clean it up" as:

    
    
        #include <ctype.h>
    
        for (char& c: str) {
          c = tolower(c);
        }
    

And then I have to say NOOOOOOO DON'T DO THAT YOU HAVE NO IDEA WHAT tolower()
REALLY DOES!

I struggle to imagine any real use case where you'd actually want locale-
dependent tolower() other than, maybe, a word processor -- but if you're
writing a word processor, you're probably not going to be depending on the
language's built-in string APIs to do your text manipulation.

~~~
rkangel
This is a classic case of a 'why' code comment being needed. It's obvious what
you're doing, but without a 2 line explanation, it's not clear _why_.

~~~
kentonv
Yeah I probably wrote that comment the first few times I did this but it's
hard to write it the 50th time.

Maybe I should have my own tolower() function that I can call so I only have
to write the comment once but it just feels ridiculous somehow.

~~~
random314
Why does it feel ridiculous?

~~~
kentonv
Because I've already rewritten more of the standard library than is healthy.

I mean, it's clearly the right thing to do here but I can predict the
conversation that will inevitably result... "You wrote your own tolower()
function? Why?" "The standard one is horribly broken." "How could a function
that lower-cases a letter be broken??? Jesus Kenton your NIH syndrome is out
of control." "Sigh..."

(Slightly more seriously, any particular time I need to lower-case something,
it takes 10 seconds to write out the code, but would take 10 minutes to find a
good place to define a reusable function and exactly what its API should be,
and so it never seems worth the effort in the moment. Just like how most messy
code comes to be.)

~~~
random314
This conversation can simply be avoided by copy-pasting your original
Hacker News comment into the library function header.

I have noticed some coworkers have their ego gratified by being right while
everyone else is wrong. Instead of simply explaining what they are doing when
they are doing it, they will do something that looks wrong in a very
noticeable way and wait for the backlash. The backlash gives them an
opportunity to show everyone else how they were right while everyone else was
wrong and also an opportunity to play victim. However, in SW development - it
is not just the technical details - your behavior also matters in a big way.

In this particular case, the correct approach is to create your own library
function with appropriate comments. This is why the concept of a library
function was invented. It is its entire raison-d'etre. However, you are doing
everything but that. Including providing justifications in hacker news
comments instead of your source code.

Now inevitably, someone will change your inline code to use to_lower. This
will give you an opportunity to scream bloody murder, show how other engineers
don't really understand technical details, correct them and also play victim.
Create a library utility with comments and link it in - End of story.

~~~
kentonv
Speaking of people wanting to gratify their ego by being right: Everyone on
this thread trying to lecture me on software engineering? ¯\\_(ツ)_/¯

~~~
random314
At least, you get to play victim :)

------
bayindirh
Welcome to the Turkish language, where we have ı, i, I and İ. In our language
the conversion is as follows:

\- i <-> İ

\- ı <-> I

We love our dots and preserve them. For a more detailed read, please see:

[https://blog.codinghorror.com/whats-wrong-with-
turkey/](https://blog.codinghorror.com/whats-wrong-with-turkey/)
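
For anyone who wants to see the mapping above in code, here is a minimal Python sketch. Python 3's own `str.lower()` and `str.upper()` ignore the locale, so the Turkish rules have to be applied by hand. The helper names `turkish_lower` and `turkish_upper` are made up for this illustration and are not part of any library.

```python
# Hypothetical helpers for illustration only; not a standard library API.
# The dotted/dotless pairs are remapped first, then the generic case
# conversion handles every other letter.
_TR_LOWER = {ord("I"): "ı", ord("İ"): "i"}
_TR_UPPER = {ord("i"): "İ", ord("ı"): "I"}


def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish i/ı rules."""
    return text.translate(_TR_LOWER).lower()


def turkish_upper(text: str) -> str:
    """Uppercase text using the Turkish i/ı rules."""
    return text.translate(_TR_UPPER).upper()


print(turkish_lower("INFO"))  # ınfo
print(turkish_upper("info"))  # İNFO
```

The remapping happens before the generic conversion because Python lowercases "İ" to "i" plus a combining dot, which is not what a Turkish reader expects.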

~~~
Natsu
As I understand it, Turkish is one of the more important locales to test with
because of things like this.

~~~
bayindirh
Turkish is the only language which has the ı & I pair. Similarly, AFAIK,
Turkish is again the only language with ğ and ş letters. So, by testing for
Turkish, you test for a lot of European languages at once. Moreover we share
some modified letters (ç, ü) with other Central European languages.

If your program can pass “The Turkish Test”, you pass a lot of others too.

~~~
anticensor
Azerbaijani too. Moreover, Azerbaijani has an additional letter ə, which
sounds like /æ/.

~~~
therein
I love the feeling of camaraderie arising from that partial mutual
intelligibility of Turkish and Azerbaijani.

That connection through language goes a long way.

müqəddəs bacı millət :)

------
tryauuum
Unrelated story about Russian language.

The first letter of the Russian alphabet is А, the last one is Я. So it's natural
to try to match Russian words with '[А-Яа-я]+'. But this is a recipe for
disaster: this regexp doesn't match words with 'Ё' in them, like "Артём".

This is because regexp ranges work on code point values. All letters of
the Russian alphabet have neatly ordered code points, except for Ё.
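
A small Python sketch of the problem, assuming the standard `re` module, where character ranges compare Unicode code points. А through я cover U+0410 to U+044F, while Ё is U+0401 and ё is U+0451, so both fall outside the range. The pattern names `naive` and `fixed` are just for this example.

```python
import re

# The naive range misses Ё (U+0401) and ё (U+0451).
naive = re.compile(r"[А-Яа-я]+")
# Listing the two stragglers explicitly is one possible fix.
fixed = re.compile(r"[А-Яа-яЁё]+")

word = "Арт\u0451м"  # "Артём", written with the precomposed ё

print(naive.fullmatch(word))               # None
print(fixed.fullmatch(word) is not None)   # True
```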

~~~
Sharlin
English is probably the only commonly spoken language where naïve char range
matching _kind of sort of_ works. I say ”kind of sort of” because [a-zA-Z]
trivially fails to match all words in many English texts that haven’t been
lossily compressed to ASCII, including this comment.

It is practically always wrong to match on [a-z] unless you’re parsing a
computer language whose spec guarantees that it works.

~~~
tryauuum
I always wanted to know, how easy is it to type naïve on a common western
keyboard?

Do you have to press some obscure keyboard shortcut?

~~~
reaperducer
_how easy is it to type naïve on a common western keyboard?_

In macOS, you can either use Command-u (for "umlaut") followed by i, or hold
down the i key for a second and press 2 to select the ï from the pop-up menu.

~~~
masklinn
> Command-u

option-u (aka alt-u).

Generally speaking, command is for application-level or os-level commands,
control is for text editing, and alt is for alternate characters (all can be
shifted and command "overrides" the rest).

~~~
reaperducer
You're right, it's Option-u. Most of the key labels on my MacBook have long
since been scratched away.

This has happened with every single Apple keyboard I've ever used. I suspect
it's my fault, since I'm a key pounder, having learned to type on an IBM
Selectric typewriter.

------
FrontAid
Changes to the casing might also change the value's length. E.g. uppercasing
the German ß will transform it to SS. Example using JavaScript:

'ß'.toUpperCase(); // returns 'SS'

[https://en.wikipedia.org/wiki/%C3%9F](https://en.wikipedia.org/wiki/%C3%9F)
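
The same thing in Python, as a quick sketch: `str.upper()` applies the full Unicode case mapping, so the string grows by one character, and `str.casefold()` maps ß to "ss" for comparisons.

```python
word = "straße"
shouted = word.upper()

# The uppercase form is one character longer than the input.
print(shouted, len(word), len(shouted))  # STRASSE 6 7

# Case folding also expands ß, which matters for caseless comparison.
print("ß".casefold())  # ss
```

Any code that assumes case conversion preserves length (fixed-size buffers, precomputed offsets) can break on input like this.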

~~~
schoen
There is apparently a multi-decade controversy about that:

[https://en.wikipedia.org/wiki/Capital_%E1%BA%9E](https://en.wikipedia.org/wiki/Capital_%E1%BA%9E)

(with German language authorities recently endorsing the idea that ß can have
a distinctive uppercase form "ẞ")

------
scrollaway
I've long thought programming languages need a "localizable string" (aka
user-facing string) type, different from regular utf8 strings. Something like what
gettext and other i18n libraries fake for you, but native to the language.

Behaviour like this is definitely a good reason why: sorting, changing case,
etc should be consistent when dealing with strings used as constants and
identifiers, but Python's .lower() behaviour makes sense in a localizable
string context.

~~~
lazulicurio
Along similar lines, I've thought that it would be useful if Unicode included
language marks (i.e. codepoints to identify blocks of text as being written in
a specific language). It would be strictly more useful than the barebones
left-to-right/right-to-left marks (U+200E/U+200F) when deciding how to process
and display text. And it would be a step towards correcting the mess that was
Han unification.

~~~
Ericson2314
What this gets right down to is that Unicode is a flawed idea: the
meaning/behavior/whatever of characters is insanely dependent on their
context.

The problem was never gazillions of code pages, but our inability to write C
to deal with that amount of complexity circa 1990.

With modern machines, and good programming languages with good type systems, I
absolutely think we could store a language per string, and concatenate into a
polylinguistic rope if needed.

This would hopefully push us away from stringly-typed crap in general.

~~~
throwaway_pdp09
> the meaning/behavior/whatever of characters is insanely dependent on their
> context

I wish you would give an example instead of just proclaiming crapness. You
know, so we n00bs can learn something.

~~~
throwaway_pdp09
@toast0, @lazulicurio, both of your points seem to illustrate the complexities
of the languages, not "...that Unicode is a flawed idea" as the original
poster said. AFAIKS this is intrinsic complexity showing itself and does not
make any indication of how it should be done correctly, or better.

~~~
lazulicurio
> both of your points seem to illustrate the complexities of the languages,
> not "...that Unicode is a flawed idea"

The flaw in Unicode is that it punts on the intrinsic complexity---pretending
that codepoints have language-independent, plain-text, semantic meaning.

A couple of threads that have molded my views over time:

_I can't write my name in Unicode_
[https://news.ycombinator.com/item?id=9219162](https://news.ycombinator.com/item?id=9219162)
(Specifically these two comments
[https://news.ycombinator.com/item?id=9220530](https://news.ycombinator.com/item?id=9220530)
and
[https://news.ycombinator.com/item?id=9220970](https://news.ycombinator.com/item?id=9220970))

_Why isn't the external link symbol in Unicode?_
[https://news.ycombinator.com/item?id=23016832](https://news.ycombinator.com/item?id=23016832)

~~~
Ericson2314
> The flaw in Unicode is that it punts on the intrinsic complexity---
> pretending that codepoints have language-independent, plain-text, semantic
> meaning.

> Pretending "plain text" isn't an oxymoron

FTFY :)

------
chippy
[https://garygregory.wordpress.com/2015/11/03/java-
lowercase-...](https://garygregory.wordpress.com/2015/11/03/java-lowercase-
conversion-turkey/)

In the Turkish locale, the Unicode LATIN CAPITAL LETTER I becomes a LATIN
SMALL LETTER DOTLESS I. That’s not a lowercase “i”.

------
beeforpork
My genius idea was once to use toupper() to normalise paths on Windows, which
are case-insensitive. One day, a customer from Azerbaijan reported that my
application failed to access a file in C:\WİNDOWS\\...

~~~
tryauuum
i feel your pain

------
Macha
07/04/2008 -> April 7th seems about as reasonable a result as July 4th,
especially when you've explicitly opted in to a Turkish locale. I don't agree
with the article's assertion that the format being interpreted according to
the user's locale is wrong here, the one wrong part is a US centric
programmer's expectation that PP-QQ-YYYY is an unambiguous format. Use
YYYY-mm-dd when you need a format that's not ambiguous.

~~~
frabert
YYYY-mm-dd also plays nice with lexicographic ordering, which is why I always
use it when I need to put dates in e.g. filenames
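
A quick Python illustration of the ordering point, using made-up log filenames: a plain string sort puts YYYY-mm-dd names in chronological order, while mm-dd-YYYY names end up grouped by month.

```python
# Hypothetical filenames, used only to show the sort order.
iso_names = ["2020-08-07.log", "2019-12-31.log", "2020-01-15.log"]
us_names = ["08-07-2020.log", "12-31-2019.log", "01-15-2020.log"]

# Plain lexicographic sort, no date parsing involved.
print(sorted(iso_names))
# ['2019-12-31.log', '2020-01-15.log', '2020-08-07.log']

print(sorted(us_names))
# ['01-15-2020.log', '08-07-2020.log', '12-31-2019.log']
```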

~~~
Macha
I'm a European working primarily with Americans. My home country uses
dd/mm/YYYY (or dd/mm for short) and the US uses mm/dd/YYYY (or mm/dd for
short). I've switched to YYYY-mm-dd simply for my own sanity and if I omit the
year I write the month in text format, such as "5 June".

~~~
withinboredom
The US military uses almost the same convention (dd-mmm-yyyy) so 07-aug-2020.

~~~
dgellow
That’s dd-mmm-yyyy

~~~
withinboredom
Thanks!

------
alkonaut
Repeat after me: don’t do string operations without explicit locale. Don’t do
string operations without explicit locale.

I don’t know why so many languages have string functions that should take a
locale but provide an overload that doesn’t and which uses the _system_ locale
as the default. It can’t be what many developers actually want, yet it has
become the norm. Worse, code using a default locale _appears_ to work on the
developer's machine and in production, until someone parses a number in France
or lowercases a string in Turkey, which is a late and expensive discovery of
the bug.

The default shouldn’t be the system locale, it should be an invariant locale.
And I’ll go so far as arguing this invariant locale should be invariant across
systems (meaning it can’t just defer to a system C library either).

~~~
madeofpalk
I ran into this with C#/.NET on Windows - I tried to convert a string "1.3" to
the float 1.3, and it failed on languages that use comma as their decimal
separator.

That was a learning experience.

~~~
alkonaut
Indeed. As a person from a comma country, I find these mistakes in most code
bases I look at. It makes it frustrating to contribute to open source, for
example.

Perhaps it’ll make you feel better about your parsing bug that even the C#
compiler (Roslyn) code base had several of these issues.

------
stevoski
For a similar reason, Java on Mac and Linux was briefly broken for anyone
using it in the Turkish locale. It was because in the Turkish locale,
!"POSIX".toLowerCase().equals("posix").

Relevant bug report here:
[https://bugs.openjdk.java.net/browse/JDK-8047340](https://bugs.openjdk.java.net/browse/JDK-8047340)

------
maweki
As it isn't yet mentioned: for these cases the Python standard library
explicitly has
[https://docs.python.org/3.8/library/stdtypes.html#str.casefo...](https://docs.python.org/3.8/library/stdtypes.html#str.casefold)
(str.casefold), which aggressively lowercase-normalizes strings with an
algorithm from the unicode standard. Every case comparison using lower()
instead of casefold() can be considered a bug.

~~~
Alex3917
> Every case comparison using lower() instead of casefold() can be considered
> a bug.

If you just casefold two strings and compare them, it's still a bug. You need
to normalize them to NFKC first.

~~~
pas
Is NFKC necessary, isn't NFKD enough? (As in you have to normalize and
decompose both strings, but at that point you can check them for equality, and
doing the canonical composition isn't needed, right?)

~~~
Alex3917
I think that would work if you're just checking for equality and want to
minimize processing. I guess as a web developer I always just assume people
are going to be storing strings in a database after normalizing them, so would
want to minimize string length.
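
To make the normalize-then-casefold point concrete, here is a rough Python sketch using the standard `unicodedata` module. The helper name `caseless_key` is an assumption for this example, and the normalize, fold, normalize order is one reasonable reading of the Unicode caseless-matching rules rather than a definitive recipe.

```python
import unicodedata


def caseless_key(text: str) -> str:
    """Hypothetical helper: decompose, case-fold, then decompose again."""
    decomposed = unicodedata.normalize("NFKD", text)
    return unicodedata.normalize("NFKD", decomposed.casefold())


a = "Caf\u00e9"    # precomposed é
b = "CAFE\u0301"   # E followed by a combining acute accent

# casefold() alone does not reconcile the two encodings of é.
print(a.casefold() == b.casefold())        # False
print(caseless_key(a) == caseless_key(b))  # True
```

For an equality check the decomposed form is enough; recomposing (NFKC) only matters if the key is also stored or displayed.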

------
anticensor
Correct: you would get "ınfo", "warnıng" and "crıtıcal" in Turkish and in
Azerbaijani.

~~~
mapgrep
Further context:

[https://en.m.wikipedia.org/wiki/Dotted_and_dotless_I](https://en.m.wikipedia.org/wiki/Dotted_and_dotless_I)

Did not know Istanbul is actually İstanbul.

~~~
gvx
Me neither. I did know it's not Constantinople, though.

~~~
anticensor
Constantinople (Fatih) is the capital town of Eistipolis (Istanbul).

------
decafbad
Please stop doing this. Don't bind lower() upper() functions to environment
variables or anything else system related. Sun did this in Java and doesn't
even bother to mention the issue in documents. It caused huge problems for
more than a decade.

You can just make string lowercase() uppercase() function work the same
everywhere, regardless of locale settings. Provide a special case function
lowercaseTR() or so. This works very well in Go.

By the way, Azerbaijan has the same problem because they accepted help from
wrong guys when they switched to Latin.

~~~
price
You'll be glad to hear that Python did stop doing this: Python 3 has never
behaved this way, and its `lower` and `upper` methods have always been
independent of your locale or anything else from your system.

The workaround in the OP was added in 2006 (note the reference to an issue on
"SF", i.e. SourceForge -- another era!), and is now long obsolete.

~~~
decafbad
Very much so. Thanks.

------
geofft
In C (POSIX.1-2008, specifically), there's tolower_l() and the rest of the _l
functions for this use case, which take a locale as an argument. That lets
you ask for the English (or even "C locale") lowercase versions of these
English words, even when your process's current locale is Turkish.

[https://www.man7.org/linux/man-
pages/man3/tolower_l.3.html](https://www.man7.org/linux/man-
pages/man3/tolower_l.3.html)

~~~
adamjb
The mention of _l functions reminded me of this gloriously over the top git
message/rant.

"Those not comfortable with toxic language should pretend this is a religious
text."

[https://github.com/mpv-
player/mpv/commit/1e70e82baa9193f6f02...](https://github.com/mpv-
player/mpv/commit/1e70e82baa9193f6f027338b0fab0f5078971fbe)

------
jwilk
Looks like it's no longer the case in Python 3:

    
    
       Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
       [GCC 8.3.0] on linux
       Type "help", "copyright", "credits" or "license" for more information.
       >>> from locale import *
       >>> setlocale(LC_ALL, 'tr_TR.UTF-8')
       'tr_TR.UTF-8'
       >>> 'INFO'.lower()
       'info'

~~~
anderskaseorg
Oddly, it also wasn’t the case for Python 2 Unicode strings (u'INFO'), only
for Python 2 byte strings ('INFO'). So it’s possible that Python 3 lost this
behavior by accident.

~~~
price
On some more digging through history, it looks like the change in behavior for
byte strings was intentional:
[https://github.com/python/cpython/commit/6ccd3f2dbcb98b33a71...](https://github.com/python/cpython/commit/6ccd3f2dbcb98b33a71ffa6eae949deae797c09c)

Author: Guido van Rossum <guido@python.org>

Date: Tue Oct 9 03:46:30 2007 +0000

    
    
        Replace all (locale-dependent) uses of isupper(), tolower(), etc., by
        locally-defined macros that assume ASCII and only consider ASCII letters.

------
chihuahua
I remember running into problems with SQL stored procedures where column and
table names were case-insensitive, so you don't know if you've properly typed
all the column and table names. Until a customer in Turkey eventually installs
it and you find out you've missed the proper capitalization of an identifier
containing the letter "I", and the stored procedure fails.

~~~
heavenlyblue
This is what I usually think about whenever people say yay to Unicode in
language identifiers.

~~~
formerly_proven
"I" is in ASCII.

~~~
a1369209993
"İ" and "ı" are not.

------
sedatk
Note to the next language designer: don't use strings as a substitute for
enums.

~~~
teddyh
It might be OK if strings are immutable and therefore internable.

~~~
sedatk
It doesn’t prevent someone from calling your function with “INFO” instead of
“info”, does it?

------
cazim
[http://www.moserware.com/2008/02/does-your-code-pass-
turkey-...](http://www.moserware.com/2008/02/does-your-code-pass-turkey-
test.html)

This is old but still valid reading...

------
formerly_proven
ITT calling setlocale or std::locale::global(...) is ALMOST ALWAYS a heinously
bad idea and should rarely be done, because it breaks tons of code (notably
everything that uses printf/scanf and everything using stringstream).

------
maple3142
I think things like these should be explicit. Even if it is convenient to have a
default, it should be what most people would expect.

For example, instead of .lower(), we can have .lower_ascii(), .lower_turkish()
or .lower(locale) . But I know it would be tedious to use if you need to
specify it every time, so it makes sense to have a
.lower(locale=DEFAULT_LOWER_LOCALE) . As for what DEFAULT_LOWER_LOCALE should
be, it is worth debating, but I think it shouldn't introduce unexpected
behavior.
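
As a rough idea of what the explicit ASCII variant could look like in Python (written here as a free function `lower_ascii`, which is a hypothetical name and not an existing `str` method):

```python
import string

# Translation table that touches only the 26 ASCII capitals.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def lower_ascii(text: str) -> str:
    """Hypothetical helper: lowercase A-Z and leave everything else alone."""
    return text.translate(_ASCII_LOWER)


print(lower_ascii("Content-Type"))  # content-type
print(lower_ascii("İNFO"))          # İnfo
```

The second call shows the trade-off: non-ASCII letters pass through untouched, which is the desired behavior for identifiers such as header names but not for natural-language text.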

------
60secz
Stringly typed: Play stupid games, win stupid prizes.

------
TazeTSchnitzel
The PHP interpreter has an internal reimplementation of string case conversion
that's ASCII-only in order to avoid this problem.

~~~
asddubs
Doesn't PHP have this exact problem with their case-insensitive (hate that
btw) function/method names and Turkish localization? Or did they actually fix
it at some point?

~~~
dhosek
I'm guessing that they might have "fixed" it by implementing the ascii-only
tolower function, but yes, PHP used to not work properly with Turkish
localization.

------
crazygringo
Serious question.

Why on earth would you hard-code these, instead of simply call a lowercase
function in the en-US locale?

These are English words. Naively lowercasing them according to whatever locale
the server or user has set seems like a terrible programming practice. Any
call to a lowercase function should be explicitly including an argument that
specifies it's English, no?

In the same way we've all learned to never store times without an explicit
timezone (even if it's UTC), or locate a string offset without knowing your
encoding... you should never perform language transformations (case changes,
accent removal, etc.) without a locale.

Hardcoding these things is just patching over the symptoms without addressing
the cause, no?

------
mbostleman
Hence toUpper/toLower is not a strategy that passes the Turkey Test for case
insensitivity.

------
garydgregory
See also [https://garygregory.wordpress.com/2015/11/03/java-
lowercase-...](https://garygregory.wordpress.com/2015/11/03/java-lowercase-
conversion-turkey/)

------
TwoBit
This particular case seems odd to me because INFO is an English word, and ınfo
is not.

~~~
wongarsu
You could make a case that Unicode should have different "i" characters for
different languages. Then you could do all transformations unambiguously. On
the other hand almost everyone abuses the minus sign as a dash, and treats the
apostrophe and the prime sign (signifying feet or minutes) as interchangeable,
so in all likelihood they would constantly use the wrong i too.

~~~
heavenlyblue
Pretty sure that’s not true. When you switch your keyboard you will have a
proper i character in another language unless your keymap is broken. How do
you think Chinese, Russians or Greek type their characters?

~~~
tzot
The grandparent obviously meant “latin i”; none of the three languages you
mention have any latin letters, but at least Russian and Greek have some
lowercase and some more uppercase letters with the same glyph/shape as latin
ones.

~~~
heavenlyblue
Yeah, and those similar glyphs are not available on their own language
keyboard.

------
paledot
I'm going to be a bit controversial here and say that that mapping logic
should _always_ exist even if toLower() were reliable across all locales.
You're mapping between different use cases here, eg. internal to logfile to
API to database to method name to whatever, and inserting magic
transformations in your constant values rather than treating them as different
tokens for different use cases constrains you and introduces unnecessary
amounts of "magic".
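
A minimal Python sketch of that idea, with made-up names (`Level`, `SYSLOG_NAME`, `syslog_name`): each use case gets its own explicit table, so no case conversion happens at runtime at all.

```python
from enum import Enum


class Level(Enum):
    """Hypothetical internal log levels."""
    INFO = "INFO"
    WARNING = "WARNING"
    CRITICAL = "CRITICAL"


# Explicit token table for one target format; nothing is derived
# from the enum name by calling lower().
SYSLOG_NAME = {
    Level.INFO: "info",
    Level.WARNING: "warning",
    Level.CRITICAL: "critical",
}


def syslog_name(level: Level) -> str:
    """Look up the syslog token for an internal level."""
    return SYSLOG_NAME[level]


print(syslog_name(Level.WARNING))  # warning
```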

------
ramses0
ObTurkeyTest: [http://www.moserware.com/2008/02/does-your-code-pass-
turkey-...](http://www.moserware.com/2008/02/does-your-code-pass-turkey-
test.html)

------
jaclaz
Only for the record, there is something very similar that may happen when
creating CD/DVD's (please read when using mkisofs and similar), with the
"dash" that when "capital" becomes underscore (but not only ) depending on the
reference ISO 9660/Joliet/RockRidge convention in use.

[https://web.archive.org/web/20151007005513/http://www.911cd....](https://web.archive.org/web/20151007005513/http://www.911cd.net/forums//index.php?showtopic=25612)

------
lxe
The practice of converting enum-like keys into their string representation by
using toString, toLower, etc seems convenient but gets very contrived very
fast. How do you deal with underscores? What about using the message in a
sentence? I say, use the enum in your code as a conditional or something but
always explicitly write out the messages intended for the user.

------
tantalor
[https://bugs.python.org/issue1524081](https://bugs.python.org/issue1524081)

> KeyError: 'Info'

------
shagmin
I learned about this in javascript when I discovered Angular has its own
lowercase method. Apparently it's internal only now.

[https://github.com/angular/angular.js/commit/1daa4f2231a89ee...](https://github.com/angular/angular.js/commit/1daa4f2231a89ee88345689f001805ffffa9e7de)

------
seqizz
Yeah, there were some weird bugs about that. I remember one in a media player.
Also "info".upper() would be İNFO probably.

------
dependenttypes
Ah yes, locales. Everyone loves them [https://github.com/mpv-
player/mpv/commit/1e70e82baa9193f6f02...](https://github.com/mpv-
player/mpv/commit/1e70e82baa9193f6f027338b0fab0f5078971fbe)

------
dusted
I think we should have stopped at ASCII, I don't care that my language has
letters not in there, it'd be neater if we just did now like back then: "This
is a computer, so everything is in English" :) Or adapt the alphabet to use
ASCII.

------
CodesInChaos
In the Danish locale "aa" doesn't start with "a".

------
mapgrep
Dumb question, if you _really_ need the exact string “info” in a given
context, why not hard code it? What does .lower() or even a map like the
linked one actually buy you?

~~~
simion314
Maybe the input is case-insensitive. For example, if you work with HTML you
might see "DIV" or "div", and who knows, some crazy dev or tool might generate
"DIv" or "dIv", so it is simpler to lowercase the input and then work on it.

------
iforgotpassword
Wouldn't converting to nfkd/c first solve this issue too? My understanding of
those forms was that they're made exactly for this case.

~~~
jwilk
No, these are ASCII strings, so they are already normalized.

~~~
iforgotpassword
Oh, I haven't used Python much, but I thought it's all Unicode? If this were
ASCII it would work out of the box since there is no dotless lowercase i in
ASCII.

~~~
estebank
There is no code point for TURKISH LOWERCASE DOTTED I nor for TURKISH
UPPERCASE DOTLESS I, which means that the text doesn't carry enough
information for roundtrip preservation.

I believe this has proven to be a mistake but I'm not an expert. I don't know
_why_ it wasn't done.

------
baybal2
The "İ" strikes again

------
mro_name
what a gorgeous source-comment. Makes the non-obvious crystal-clear.

