
Truths programmers should know about case (2018) - lamby
https://www.b-list.org/weblog/2018/nov/26/case/
======
carapace
I don't want to trot out my whole "Unicode Rant" here, but I'll say that each
human language and its digital representation should be an object of study in
its own right.

Having a standard mapping from numbers to little graphical ideas is _great_.
Trying to digitalize all the world's languages from a conceptual basis of
"they're like English but with quirks" seems to me to be a dodge.

In other words, this article describes the kind of thing that I think should
be what the Unicode people do: map out and specify datastructures and
algorithms, libraries, for each language.

~~~
kbenson
We're at the point that it might make sense to map languages to a digital
format rather than map an extensible digital format to myriad different
languages. I mean, it will never happen, but imagine a much simpler format
than Unicode that just supported a fixed number of characters, with no
combining characters, no accents (and perhaps even no case), and that let each
language decide how it wanted to represent itself within that medium (are
accents important enough to require a separate character, or is leaving them
off or just using two characters next to each other sufficient?).

That isn't to say we wouldn't want accurate representations of languages (as
Unicode makes an attempt to do), but there's a difference between trying to
represent a _language_ , and trying to represent the _data within a language_.
In some cases, the former is important, in others, the latter vastly outweighs
the former, and making that easily accessible, categorized, compared, etc has
value.

I think it's only even conceivable at this point because of how much
communication is done online, and in abridged form. We've already adopted
pidgin languages for ad-hoc communication (as in IMs, SMS, and tweets), so I
don't think people would find this that hard to grapple with at this point.

~~~
carapace
Well the deeper problem may be that both mapping directions are impossible. No
one has proved that natural languages can be fully represented in a Turing
machine, eh? In other words, ASCII is _not_ an encoding of English. It's a
fluke (or perhaps a prerequisite) that a file of bytes can represent as much
of English as it does.

------
g_sch
I enjoyed this article because it contained concrete examples of each of its
assertions. It's a lot more engaging to read than many of the other
"falsehoods programmers believe about X" listicles.

------
RcouF1uZ4gsC
Every time I read about Unicode quirks, I end up thinking that the relative
simplicity of the English alphabet played a large role in making the USA the
computing superpower that it is today.

* You can easily represent the common symbols in a small number of bits (7 bits in the case of ASCII).

* It is relatively easy to do a case-insensitive comparison.

* With the exception of numerical strings, sorting and ordering is relatively straightforward.

This allowed the early software/hardware developers to create simple systems
that were good enough to be useful to consumers.
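
To make the second and third points concrete, here is a minimal sketch of how
cheap this is when the input is pure ASCII: upper- and lowercase letters differ
only in bit 0x20. The helper names `ascii_fold` and `ascii_equal_ignore_case`
are made up for illustration, and nothing here works beyond ASCII.

```python
def ascii_fold(byte: int) -> int:
    """Map an ASCII uppercase letter to lowercase by setting bit 0x20."""
    if 0x41 <= byte <= 0x5A:  # 'A'..'Z'
        return byte | 0x20
    return byte  # digits, punctuation, and lowercase letters are unchanged


def ascii_equal_ignore_case(a: bytes, b: bytes) -> bool:
    """Case-insensitive equality, valid only for ASCII input."""
    return len(a) == len(b) and all(
        ascii_fold(x) == ascii_fold(y) for x, y in zip(a, b)
    )


def ascii_sort_key(s: bytes) -> bytes:
    """Sort key that ignores case by folding every byte first."""
    return bytes(ascii_fold(x) for x in s)


print(ascii_equal_ignore_case(b"Hello", b"hELLO"))
print(sorted([b"pear", b"Apple", b"banana"], key=ascii_sort_key))
```

None of these shortcuts survive contact with the cases the article describes,
where one character can fold to two or where the mapping depends on the locale.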

~~~
silvestrov
No. Those 3 points are not that important. Western Europe works using
ISO-Latin-1. Many European countries even had a national 7-bit version. We
also had our own computer companies, but they died out because they couldn't
grow.

The 2 main reasons are:

1) You have _one_ Sillicon Valley in the USA, so all the competent people were
in one location. If you wanted to start Intel/Apple/Google in Denmark, you'd
run out of Wozniaks before you even got going, because they are spread out all
over the map.

2) All of the USA speaks the same language, so you can easily do complex
communication with companies in other states (e.g. Intel/Gateway in Texas).
Getting a French person to speak good English in the '60s and '70s was a,
ahem, challenge.

There is no Sillicon Valley in Europe precisely because there is no single
place, and therefore Europe misses out on the "everybody important in the same
place" effects.

China's government pointed to one place on the map when they wanted a SV.
That's why China got Shenzhen. Europe/EU still can't do that politically.

~~~
pklausler
Few of the pioneering computer companies in the 60's were in Sillicon (sic)
Valley. And their engineers weren't all in one other single place in the US,
either. (How the Twin Cities lost out to Silicon Valley in the high-
performance computing race is a fascinating story, but that involves just a
few of the many computer companies that were around in those times.)

~~~
mlinksva
Agreed, there are several places in the US that could've been (or remained
more of a peer to) Silicon Valley. I was not aware of the depth of the Twin
Cities story.
[http://www.cbi.umn.edu/resources/MHHC/](http://www.cbi.umn.edu/resources/MHHC/)
looks interesting. If there's other reading on that you'd recommend, I'd love
a pointer.

~~~
pklausler
A fun read is
[https://www.amazon.com/Supermen-Seymour-Technical-Wizards-Su...](https://www.amazon.com/Supermen-Seymour-Technical-Wizards-Supercomputer/dp/0471048852)

------
maxxxxx
I love these articles. They show how even the most innocent looking things can
be really hard when you take a closer look. But I would also like to see
solutions that don't require weeks of study and weird workarounds. How do you
develop an address database that works world wide? How do you deal with dates?
How do you deal with character sets?

------
ianamartin
I think there is quite a bit of value in the "falsehoods programmers
believe . . ." trope just in the sense that they can be a shock to you. A
quick reminder that some topic you haven't had much programming experience
with is way more complicated than you ever thought it was. They are a great
jumping off point.

On the other hand, I've met people who read one of those and just stopped
there and concocted some absolutely bizarro ways of "handling" the issue by
coming up with patterns that attempt to simply avoid everything in the list.

I really do like this approach and think that it's great to have the jumping
off point and some deep exploration all in one place. Most of us who are going
to get the positive effect from "falsehoods . . ." memes probably already
have, and we could do with a break from those and more articles like this one.

------
herodotus
Very good article, but I am left with this question: is there any way to
compare two strings representing names that is complete and correct,
regardless of casing in the two original strings?

~~~
sametmax
The part "Case-insensitive comparison requires case folding" says there is an
algo for that, but it's up to the language to implement it.

E.g., in Python 3, it's recommended to use str.casefold() instead of
str.lower() for this:

    >>> "Straße".upper().lower()
    'strasse'
    >>> "Straße".lower()
    'straße'
    >>> "Straße".casefold()
    'strasse'

Depending on the context, a little str.strip() and/or a str.split() +
str.join() may also help. Users usually don't want to consider blanks when
entering data (e.g. an address), while it's important for machines (e.g. a
password).
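
Putting the two ideas together, a rough sketch might look like the following.
The names `comparison_key` and `loosely_equal` are hypothetical, and whether
collapsing whitespace is appropriate is an assumption that depends entirely on
the field being compared.

```python
def comparison_key(text: str) -> str:
    """Build a key for loose, case-insensitive comparison of user input.

    str.split() with no arguments drops leading and trailing whitespace and
    splits on runs of whitespace, so split + join collapses every run into a
    single space. str.casefold() then applies Unicode case folding.
    """
    collapsed = " ".join(text.split())
    return collapsed.casefold()


def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring case and incidental whitespace."""
    return comparison_key(a) == comparison_key(b)


print(loosely_equal("  Hauptstraße   5 ", "HAUPTSTRASSE 5"))
print(loosely_equal("pass word", "password"))
```

This is fine for something like an address field, and exactly wrong for a
password, where every character (blanks included) has to be compared as typed.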

Now, like @lmm said, this only normalizes the code points. It is, of course,
impossible to be sure about meaning that way.

~~~
dukoid
The example is outdated:
[https://en.wikipedia.org/wiki/Capital_%E1%BA%9E](https://en.wikipedia.org/wiki/Capital_%E1%BA%9E)
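
For what it's worth, a quick check of how Python 3 treats the capital sharp s
(U+1E9E), as far as I know. This reflects the default Unicode case mappings
that Python's str methods follow, not any particular spelling rule, and the
constant names are only there for readability.

```python
SHARP_S = "\u00df"          # ß, LATIN SMALL LETTER SHARP S
CAPITAL_SHARP_S = "\u1e9e"  # ẞ, LATIN CAPITAL LETTER SHARP S

# Uppercasing ß still expands to two letters rather than producing ẞ.
print(SHARP_S.upper())

# Going the other way, ẞ lowercases to ß.
print(CAPITAL_SHARP_S.lower() == SHARP_S)

# Case folding maps both forms to "ss", so all spellings compare equal.
print("STRA\u1e9eE".casefold() == "Straße".casefold() == "STRASSE".casefold())
```

So the capital letter exists and round-trips to ß, but the default uppercase
of ß is still "SS", and casefold() remains the thing to use for comparison.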

------
thanatropism
In Spanish (but not Portuguese!) letters lose their accent in uppercase. E.g.

Opinión => OPINION

~~~
rbonvall
While it was customary to omit the accent in uppercase letters due to
technical limitations, this is not (and never has been) accepted by the Royal
Academy:

[http://www.rae.es/consultas/tilde-en-las-mayusculas](http://www.rae.es/consultas/tilde-en-las-mayusculas)
(link in Spanish).

Personally (I'm a nobody that doesn't always agree with the Academy) I don't
like omitting the accents at all in any circumstances.

~~~
slx26
this might be one of the most common misconceptions spanish speakers have
about their own language. most people really believe that they _have to_
remove accents while writing in uppercase, and not due to technical
limitations, they actually think it's a rule...

