
>Rob probably sees things like LANG and LC_ALL as bugs. His fix was UTF-8 everywhere, always

The problems solved by LANG or LC_ALL are not solved by UTF-8 alone. Even if you use UTF-8 for all your input and output, there is still the question of how to format numbers and dates for the user and how to collate strings.

These things depend on country and language, sometimes even varying between different places in a single country (in Switzerland, the German-speaking parts use "." as the decimal separator, while the French-speaking parts prefer ",").

These things are entirely independent of the encoding of your strings, and they still need to be defined. Also, because it's a very common thing that basically needs to happen in every application, it is also something the user very likely prefers to set only once, in one place.

Environment variables don't feel like too bad a place for that.
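
For what it's worth, here's a minimal C sketch of the point (assuming the de_CH.UTF-8 and fr_CH.UTF-8 locales happen to be installed; the %' grouping flag is a POSIX/glibc extension): both locales are UTF-8, yet the same number and date come out formatted differently.

    #include <locale.h>
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        const char *locs[] = { "de_CH.UTF-8", "fr_CH.UTF-8" };
        char buf[64];
        time_t now = time(NULL);

        for (int i = 0; i < 2; i++) {
            if (!setlocale(LC_ALL, locs[i])) {
                fprintf(stderr, "%s not installed\n", locs[i]);
                continue;
            }
            /* Same encoding everywhere, different formatting rules. */
            strftime(buf, sizeof buf, "%x", localtime(&now));
            printf("%-12s %'12.2f  %s\n", locs[i], 123456.78, buf);
        }
        return 0;
    }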




Here in the ex-USSR we have those problems too. Why not standardize decimal separators altogether, worldwide? We're not dealing with freaking paper handwriting! If a number is printed on a computer display, it must look like 123456.78, not like "123 456,78"! Same goes for datetime representation.

This localization BS has spawned an entire race of nonsense, where, for example, CSV files are not actually CSV in some regions, because their values are not COMMA-separated (as the name implies), but semicolon-separated. And we, programmers, have to deal with it somehow, not to mention some obsolete Faildows encodings like CP1251 still widely used here in lots of tech-slowpoke organizations.

So: one encoding, one datetime format, one numeric format for the world and for the win. Heil UTF-8!


>not to mention some obsolete Faildows encodings like CP1251 still widely used here in lots of tech-slowpoke organizations.

as we're talking encodings: The worst file I ever had to deal with combined, within one file, UTF-8, cp437 and cp850.

I guess they had DOS and Unix machines but no Windows boxes touching that file.

This is a problem that won't go away. Many developers are not aware of how character encoding, let alone Unicode, actually works and, what's worst about this mess, they can often get away without knowing.
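
For anyone stuck cleaning up a file like that: deciding which byte ranges are in which encoding is the genuinely hard part; transcoding them afterwards is the easy bit. A rough sketch with iconv(3), assuming a chunk you've identified as cp437:

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char in[] = "B\x84ume";   /* "Baeume": 0x84 is 'ä' in cp437 (and cp850) */
        char out[64];
        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof out - 1;

        iconv_t cd = iconv_open("UTF-8", "CP437");
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
            perror("iconv");
        *outp = '\0';
        printf("%s\n", out);      /* now valid UTF-8: Bäume */

        iconv_close(cd);
        return 0;
    }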


> If a number is printed on a computer display, it must look like 123456.78, not like "123 456,78"!

Humans find thousands separators useful. You're asking humans to give up useful things because they're hard to program.

That said, I idly wonder whether they could be implemented with font kerning. The bytes could be 123456.78, but the font could render it with extra space, as 123 456.78.

I don't know if it's possible with current font technology, and there are probably all sorts of problems with it even if it is, but it might be vaguely useful.
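
Here's roughly what that would look like if done in code rather than in the font (an ordinary space standing in for the kerned gap): the stored bytes stay "123456.78" and the grouping only appears at display time.

    #include <stdio.h>
    #include <string.h>

    /* Display "123456.78" as "123 456.78" without touching the stored bytes. */
    static void print_grouped(const char *num) {
        const char *dot = strchr(num, '.');
        size_t intlen = dot ? (size_t)(dot - num) : strlen(num);

        for (size_t i = 0; i < intlen; i++) {
            if (i > 0 && (intlen - i) % 3 == 0)
                putchar(' ');          /* a thin/kerned space in a real renderer */
            putchar(num[i]);
        }
        if (dot)
            fputs(dot, stdout);        /* fractional part passes through untouched */
        putchar('\n');
    }

    int main(void) {
        print_grouped("123456.78");    /* prints: 123 456.78 */
        return 0;
    }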


Humans should agree on the decimal and separator symbols, the same way that they agreed on the Indo-Arabic numerals, and symbols like + (plus) and - (minus).


Like we all agreed to the metric system already?


I get neither how a thousands separator is useful, nor which "genius" came up with the idea to make the comma a decimal separator in computers. I have nothing against either of these things in handwriting (though I personally never separate thousands), but in computing?..

I agree though that this can (and should) be solved at font-rendering level, not at an application level.


Given that the world hasn't yet agreed on whether a line ends with carriage return or carriage return-line feed, I would not hold out much hope on this front (although with the death of just line feed, some progress on this front has been made).

See also paper sizes and electrical power outlets.


> Given that the world hasn't yet agreed on whether a line ends with carriage return or carriage return-line feed, I would not hold out much hope on this front (although with the death of just line feed, some progress on this front has been made).

Your point's correct, but line feed hasn't died: it's still the line ending on Unixes. Old Macs used carriage return; Windows uses carriage return plus line feed; Unix uses line feed. I don't know what Mac OS X uses, because I stopped using Macs before it came out.


I also don't get why you are still using those miles and pounds when the rest of the world agreed on kilometres and kilograms.


I live in Canada. Before that I grew up in a metric country. Though Canada is metric, I use imperial measures here and there.

I use miles for the sport of running. This is because 1609 meters is close to 1600. Four laps around a standard 400 meter track is about a mile and everything follows from that. All my training is based on miles. I think of paces per mile. If I'm traveling abroad and some hotel treadmill is in kilometers and km/h, it annoys the heck out of me.

However, paradoxically, road signs and car speedometers in miles and miles/hour also annoy the heck out of me; though since I use miles for running, at least I'm no stranger to the damn things.

For laying out circuit boards, I use mils, which are thousandths of an inch: they are a subdivision which gives a metric air to an imperial measure. This is not just personal choice: they are a standard in the electronics industry. The pins of a DIP (the old-school large one) are spaced exactly 100 mils (0.1") apart, and the rows are 300 mils apart. So you generally want a grid in mil divisions. (The finer-grained DIPs are 0.05" -- 50 mils.).

There is something nice about a mil in that when you're working with small things on that scale, it's just about right. A millimeter is huge. The metric system has no nice unit which corresponds to one mil. A micron is quite small: a mil is 25.4 of them. (How about ten of them and calling it a decamicron? Ha.)

Inches themselves are also a nice size, so I tend to use them for measuring household things: widths of cabinets and shelves and the like. Last time I designed a closet shelf, I used Sketchup and everything in inches.

Centimeters are too small. Common objects that have two-digit inch measurements blow up to three digits in centimeters.

Centimeters don't have a good, concise way to express the precision of a measurement (other than the ridiculous formality of adding a +/- tolerance). In inches, I can quote something as being 8 1/16 inch long. This tells us not only the absolute length, but also the granularity: the fact that I didn't say 8 2/32 or 8 4/64 tells you something: that I care only about sixteenth precision. The 8 1/16 measurement is probably an approximation of something that lies between 8 1/32 and 8 3/32, expressed concisely.

In centimeters, a measurement like 29 cm may be somewhat crude. But 29.3 cm might be ridiculously precise. It makes 29.4 look wrong, even though it may be the case that anything in the 29.1-29.5 range is acceptable. The 10X jump in scale between centimeters and millimeters is just too darn large. The binary divisions in the imperial system give you 3.3 geometric steps inside one order of magnitude, which is useful. For a particular project, you can choose that it's going to be snapped to a 1/4" grid, or 1/8", or 1/16", based on the required precision.
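
(The 3.3 figure is just log2(10) ≈ 3.32: roughly three-and-a-third successive halvings, 1/2", 1/4", 1/8" and a bit more, span a full factor of ten.)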

So for these reasons, I have gravitated toward inches, even though I was raised metric, and came to a country that turned metric before I got here. (And of course, the easy availability of rulers and tape measures marked in inches, plus support in software applications, and the enduring use of these measures in trade: e.g. you can go to a hardware store in Canada and find 3/4" wood.)


Some things are traditionally measured in inches even worldwide, like screen diagonals or pipe diameters or, as you have noticed, mil grids. But in any other case, seeing those feet, yards, miles and pounds on some internet resource presumably made for an _international_ audience annoys the heck out of me. In our country (tip: Ukraine), hardly any ruler or tape measure even has inch marks; they are optional here, while centimetres are a must. But as soon as I see a video about something "that's 36.5 feet tall", I have to run a conversion to find out what it is in metres. Pretty much the same as the case of some foreign, non-universal character encoding (when everything I see is just garbled letters and/or squares).

P.S. And yes, my ruler is made from aluminium, not aluminum.


Aluminium is an English word used in the UK.

Both the words "aluminium" and "aluminum" are British inventions. Both derive from "alumina", a name given in the 1700's to aluminum oxide. That word comes from the Latin "alumen", from which the word "alum" is also derived.

"Aluminum" was coined first, by English chemist Sir Humphry Davy, in 1808. He first called it "alumium", simply by adding "-ium" to "alum" (as in, the elemental base of alum, just like "sodium" is the elemental base of soda), and then added "n" to make "aluminum". In 1812, British editors replaced Davy's new word with "aluminium", keeping Davy's "n", but restoring the "-ium" suffix which coordinated with the other elements like potassium.

North Americans stuck with Davy's original "aluminum".

In Slovakia, we have a nice word for it: hliník, derived from hlina (clay).


LC_* is 1980's ISO C design that is unaware of things like, oh, threads. What if I want one thread to collate strings one way, and another to do it another way? Could easily happen: e.g. concurrent server fielding requests from clients in different countries.

Also, how on earth is it a good idea to make the core string routines in the library be influenced by this cruft? What if I have some locale set up, but I want part of my program to just have the good old non-localized strcmp?
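
(To be fair, POSIX.1-2008 eventually bolted per-thread locale objects onto this: newlocale/uselocale plus the *_l variants such as strcoll_l. That at least lets one piece of code collate with an explicit locale while plain strcmp stays byte-wise. A minimal sketch, assuming the sv_SE.UTF-8 locale is installed:)

    #define _GNU_SOURCE
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        locale_t sv = newlocale(LC_COLLATE_MASK, "sv_SE.UTF-8", (locale_t)0);
        if (sv == (locale_t)0) { perror("newlocale"); return 1; }

        /* Collation with an explicit locale object (could be made the
           calling thread's locale via uselocale(sv))... */
        printf("strcoll_l: %d\n", strcoll_l("z", "\xc3\xb6", sv));  /* "ö" in UTF-8 */
        /* ...while strcmp stays the good old non-localized byte comparison. */
        printf("strcmp:    %d\n", strcmp("z", "\xc3\xb6"));

        freelocale(sv);
        return 0;
    }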

The C localization stuff is founded on wrong assumptions such as: programs can be written ignorant of locale and then just localized magically by externally manipulating the behavior of character-handling library routines.

Even if that is true of some programs, it's only a transitional assumption. The hacks you develop for the sake of supporting a transition to locale-aware programming become obsolete once people write programs for localization from the start, yet they live on because they have been enshrined in standards.


I still don't understand how encodings find their way into localization. I understand that date/time/number formatting is localizable, but I don't understand why "LC_TIME=en_GB.UTF-8" would be a different option from just "en_GB".

Can I really expect it to work if I set

"LC_TIME=en_GB.encA" and "LC_MONETARY=en_GB.encB"

How would the two encodings be used? How would they be used in a message consisting of both monetary and datetime?

Should the setting not be one for encoding (selected from a range of encodings), then settings for formatting and messages (selected from ranges of locales), then finally a setting for collation which is both a locale and an encoding? Or is the linux locale system simply using these as keys, so in reality there is no difference in LC_TIME whether you use encA or encB, it will only use the locale prefix en_GB?


>How would the two encodings be used? How would they be used in a message consisting of both monetary and datetime?

Full month names would be encoded in encA. Currency symbols in encB. Is it a good idea? No.

>Should the setting not be one for encoding (selected from a range of encodings), then settings for formatting and messages (selected from ranges of locales), then finally a setting for collation which is both a locale and an encoding?

I would argue an encoding setting should not be there to begin with, or should at most be application-specific, because that really doesn't depend on the system locale (as long as the characters used by the system locale can be represented in the encoding used by the application).

I was just explaining why LC_* should exist even on a strictly UTF-8-everywhere system. I never said storing the encoding in the locale was a good idea (nor is it part of the official locale specification; it's a POSIX-ism).
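
A small sketch of just how independent the categories are (locale names vary by system; this assumes en_GB.UTF-8 and fr_FR.ISO-8859-1 are both generated). nl_langinfo(CODESET) reflects LC_CTYPE only, so strftime can happily hand you month names in a different encoding than the one the rest of the program thinks it is using:

    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        /* Deliberately mismatched categories, purely to illustrate the problem. */
        setlocale(LC_ALL,  "en_GB.UTF-8");
        setlocale(LC_TIME, "fr_FR.ISO-8859-1");

        printf("LC_CTYPE codeset: %s\n", nl_langinfo(CODESET));  /* reports UTF-8 */

        char buf[64];
        time_t t = time(NULL);
        strftime(buf, sizeof buf, "%B %Y", localtime(&t));
        /* Month names like "décembre" now arrive as Latin-1 bytes,
           which a UTF-8 terminal will render as mojibake. */
        printf("LC_TIME output:   %s\n", buf);
        return 0;
    }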


What I hate is that the locales assume that date & number preferences are specific to one's physical location. I live in America, but I prefer German (9. March 2016 or 09.03.16) or British (9 March 2016 or 9/3/16) dates.

It's even worse when things assume that my date preferences reflect my unit preferences. I prefer standard units (feet, pounds, knots &c.) and British/Continental dates: I don't want to use French units, nor do I want to use American dates. And yet so much software assumes that it's all or nothing.



