Life Before Unicode (baturin.org)
118 points by sidcool on Aug 24, 2021 | 171 comments



Pre-Unicode issues still haunt us today, kept alive by various file formats that rely on system encoding.

Under the Apple "Mac-Roman" encoding [1], the standard Mac OS encoding before OS X switched to Unicode, byte 0xBD currently maps to capital omega (U+03A9 Ω). However, in the original 1994 release of the character set, it was erroneously mapped to the ohm sign (U+2126 Ω). Apple eventually fixed this in 1997, as noted in the changelog:

    #       n04  1997-Dec-01    Update to match internal utom<n3>, ufrm<n22>:
    #                           Change standard mapping for 0xBD from U+2126
    #                           to its canonical decomposition, U+03A9.

However, in 1996, Microsoft copied the Mac encoding over to CP10000 using the incorrect character [2]. Unfortunately, the code page was not corrected when Apple realized their mistake.

This discrepancy leads to a huge number of strange issues with various versions of Excel for Mac (BOM-less CSV, SYLK and other plaintext formats default to the system encoding) and other software that uses Microsoft's interpretation of Apple's Mac-Roman encoding rather than Apple's official character set mapping.
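
For what it's worth, here is a rough Python sketch of why the mismatch bites in string comparisons (the two strings are the two historical mappings of byte 0xBD, hard-coded rather than taken from any particular decoder):

    import unicodedata

    per_apple_1997 = "\u03A9"   # GREEK CAPITAL LETTER OMEGA, the current ROMAN.TXT mapping
    per_cp10000    = "\u2126"   # OHM SIGN, the pre-1997 mapping that CP10000 kept

    print(per_apple_1997 == per_cp10000)   # False: visually identical, different code points
    print(unicodedata.normalize("NFC", per_cp10000) == per_apple_1997)
    # True: U+2126 canonically decomposes to U+03A9, so normalization reconciles the two

So text that crosses the two mappings compares unequal unless something normalizes it first.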

[1] http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.T...

[2] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/MAC/RO...


Dealing with multiple code pages was terrible, ugly, impossible, and simply awful.

Unicode was a great salvation. Until the Unicode spec wandered off into nutburgerland:

1. embedding invisible semantic information into the encodings

2. multiple encodings that mean the same thing

3. moved beyond standardizing existing alphabets, and wandered into a never-ending stream of people inventing all sorts of glyphs and campaigning to get them into Unicode

4. set language back 2000 years by deciding that hieroglyphs were better than phonetics

5. conflating 𝖋𝖔𝖓𝖙𝖘 with alphabets

6. the crazy concept of "levels" of Unicode compatibility

all resulting in it simply being impossible to correctly support Unicode in your programs, or design a font to display Unicode.


Unicode is just an accumulation of what already existed in one form or another, not an entirely new mistake.

> 1. embedding invisible semantic information into the encodings

> 2. multiple encodings that mean the same thing

Sure, ISO/IEC 8859 back then had no invisible characters nor composed characters, did it? [1]

> 3. moved beyond standardizing existing alphabets, and wandered into a never-ending stream of people inventing all sorts of glyphs and campaigning to get them into Unicode

> 4. set language back 2000 years by deciding that hieroglyphs were better than phonetics

Emoji business is surely ugly (and I have complained about this a lot), but the very reason that Unicode has emoji is that Japanese telcos exposed their proprietary encodings to email in the wild, and both Apple and Google had to implement them for compatibility. Blame them, not Unicode.

> 5. conflating 𝖋𝖔𝖓𝖙𝖘 with alphabets

Yeah, you will be surprised to hear that Unicode encodes both the Latin letter A and Cyrillic letter A separately. (I think you had already said that they should be unified in the past, but I couldn't find that reply.)

> 6. the crazy concept of "levels" of Unicode compatibility

Most Unicode conformance rules are optional (past discussion: [2]).

Also, if you have to display Unicode text in your program and you are not using a stock library, it is up to you to decide what to support, because these features exist for a reason. No BiDi support? You have now disregarded right-to-left scripts (or left-to-right scripts if your program is for RTL users). No combining character support? Expect to lose a significant portion of potential users around the world. Color font support is probably fine to ignore, but that doesn't mean most other things can be ignored without cost.

[1] https://en.wikipedia.org/wiki/ISO/IEC_8859-8#Code_page_layou...

[2] https://news.ycombinator.com/item?id=26904739


> ISO/IEC 8859 back then had no invisible characters nor composed characters, did it?

No reason to repeat mistakes. The whole point of Unicode was to fix it all.

> No combining character support?

There is no reason to support both combining chars and unique code points for the same glyph. Unicode should pick one and dump the other.

> Most Unicode conformance rules are optional

I know. This is what makes it crazy and nobody implements it right, or even knows what "right" is.

> you will be surprised to hear that Unicode encodes both the Latin letter A and Cyrillic letter A separately

They do more than that with visually indistinguishable code points. People use this to embed secret messages in innocuous looking text files that print identically. There's no round trip Unicode->paper->Unicode.


>No reason to repeat mistakes. The whole point of Unicode was to fix it all.

The whole point of Unicode was to be The One True Encoding that served as a superset of all the others, so that anything expressible in any other character set could be expressed identically in it (after mapping code points, of course).


> The whole point of Unicode was to fix it all.

The whole point of Unicode was to provide a vendor-approved universal character set and encoding. Something ISO/IEC 2022 and the original draft of ISO/IEC 10646 (which was subsequently harmonized with Unicode) couldn't do.

As I'll reiterate below, it is impossible to fix the "mistakes" of the past because human text is complex and these "mistakes" are a part of that. Unicode instead had the more manageable (and eventually successful) goal of maintaining compatibility with existing character sets so that it could overtake them.

> There is no reason to support both combining chars and unique code points for the same glyph. Unicode should pick one and dump the other.

Your second claim that combining characters can replace unique code points doesn't make sense at all. If you have "combining" characters then you also need base characters to be combined. If these base characters can be used on their own, then it is no different from the current Unicode. If these base characters can't be used on their own, then some characters might have two code points assigned, one combinable and one not. That would be more complex than today and could cause issues in multiple scripts.

I presume that you didn't think about that second claim and only thought about the first claim, that unique code points can replace combining characters. This might be possible if you disregard several important scripts, and my native Korean is one of them.

Pre-Unicode Korean and Hangul support in computers was always lousy, and there have been three major approaches. The first is to PRECOMPOSE every common syllable. The second is to algorithmically COMPOSE consonants and vowels into a syllable block, with byte patterns calculated from the jamo themselves. The third is similar to the second but RECOGNIZEs a run of composable jamo, so the byte length of each character is variable. All three approaches have been used multiple times in history.

By your logic Hangul should always be encoded using PRECOMPOSE or COMPOSE. After all, that's how Han characters (aka "CJK" "ideographs") are encoded, so it should be workable, right? And yet we are still adding a slew of additional Han characters to Unicode every two or three years. Unicode 1.0 indeed used PRECOMPOSE, but they had to switch to COMPOSE by 2.0 because additional characters kept being added. They are characters people do use, just less frequently.

The adoption of COMPOSE made Hangul one of the biggest scripts in Unicode, encompassing 11,172 characters (and in part triggering the introduction of UTF-16). So we should be fine as is, right? No, because there is another beast called archaic Hangul, used before the 1933 standardization. Those characters are about as frequent as the least used modern Hangul characters, but some of them are definitely in use. And here's the catch: if you implemented archaic Hangul in Unicode using COMPOSE, there would be at least 1,638,750 characters [1]. I'm sure you are definitely not okay with that.
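
To make the COMPOSE arithmetic concrete, here is a minimal Python sketch of the standard algorithmic mapping for modern syllables (the constants come from the Unicode Hangul syllable composition algorithm):

    # Algorithmic COMPOSE for modern Hangul: 19 leading consonants x 21 vowels x 28 trailing options.
    S_BASE, L_COUNT, V_COUNT, T_COUNT = 0xAC00, 19, 21, 28

    def compose(l, v, t=0):
        """Map leading/vowel/trailing jamo indices to the precomposed syllable."""
        return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)

    print(L_COUNT * V_COUNT * T_COUNT)   # 11172 modern syllables, as noted above
    print(compose(0, 0))                 # U+AC00, the first syllable (ㄱ + ㅏ)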

Back then the most popular Korean word processor, nowadays called Hangul Office, supported archaic Hangul natively. It had its own 16-bit encoding which used COMPOSE throughout modern Hangul, but its archaic Hangul encoding was very much mixed [2]. It was a combination of all three approaches above: some frequent archaic jamo are implemented just like modern Hangul using COMPOSE, the remaining frequent archaic syllables got their own codes a la PRECOMPOSE, and other archaic jamo are RECOGNIZEd as a character when consecutive. All of this mess because of the limited size of their encoding. And yet it was the best possible before Unicode.

The modern Unicode encoding uses both COMPOSE (modern syllables only) and RECOGNIZE, and sequences in both approaches are considered equivalent. And that equivalence is not something made up just for Korean: the same approach is valid and used for many other scripts, so if you implement Unicode normalization you've got most things right for archaic Korean.
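
A quick Python sketch of that equivalence, using the conjoining jamo for ㄱ and ㅏ (U+1100 and U+1161):

    import unicodedata

    jamo_sequence = "\u1100\u1161"   # RECOGNIZE-style: a run of conjoining jamo
    precomposed   = "\uAC00"         # COMPOSE-style: HANGUL SYLLABLE GA

    print(jamo_sequence == precomposed)                                 # False as raw code points
    print(unicodedata.normalize("NFC", jamo_sequence) == precomposed)   # True: canonically equivalent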

> This is what makes it crazy and nobody implements it right, or even knows what "right" is.

You may want to believe that there is the "right" way, but there isn't. Unicode gives you a standardized set of algorithms and data for what it can (equivalence, normalization, collation, ...). Unicode is very fine for what it supports, but it can't help you if something is out of its scope, including fonts (you were conflating this a lot, weren't you). For those things you don't have a single universal answer, and your answer may change according to which languages, scripts or locales you support and how much your users can tolerate, among other things.

> People use this to embed secret messages in innocuous looking text files that print identically. There's no round trip Unicode->paper->Unicode.

There is no round trip between any character encoding and paper. Assume that your text is monospaced and you see something like this:

    This line definitely doesn't have a space after this:
Are you sure that there is no space after the colon? Even if we restrict ourselves to visible characters, you can do plenty of steganography without Unicode (font changes, whitespace counts, keming, intentional typos, alternative expressions...). You are making your own problem up.

[1] https://charset.fandom.com/ko/wiki/%EC%9C%A0%EB%8B%88%EC%BD%...

[2] https://charset.fandom.com/ko/wiki/%ED%95%9C%EC%BB%B4_2%EB%B...


> Your second claim that combining characters can replace unique code points doesn't make sense at all.

There's no reason to have both ä and a followed by the umlaut as separate characters.

> That would be more complex than today

Explain that for the case above.

> Hangul

I know nothing about Hangul. I don't see what that has to do with ä, or even the multiple 'a' encodings https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symb...

> You may want to believe that there is the "right" way, but there isn't.

There being multiple encodings for the same effect is exactly my complaint.

> you were conflating this a lot, weren't you

Tell me about the Fraktur example. Two fonts in Unicode for the same letters, and that's only the beginning. Unicode has boldface and italic encodings, too, but only for some letters.

Without the snark, please.

> Are you sure that there is no space after the colon?

Yes, you're right that tabs and spaces can make two different files print identically. But people are familiar with that, after all, even diff programs have an option to ignore whitespace differences. Does any program have an option to ignore all the Unicode same-only-different differences? Why make it 100 times worse?

> you can do plenty of steganography without Unicode

Yes, you can. But why make it 100 times worse?

> You are making your own problem up.

Have fun writing a 100% correct program that can diff two Unicode files and not show embedded but invisible differences.


> There's no reason to have both ä and a followed by the umlaut as separate characters.

ä is part of ISO/IEC 8859-1, among others. Unicode has a requirement that characters in existing character sets should have precomposed forms, because otherwise the transition from legacy systems (where everything is seemingly one "character") to Unicode (where some characters may be composed of multiple code points) would be harder. This concern is not hypothetical; Unicode as we know it today wouldn't have been possible if an ASCII-compatible Unicode transformation format (UTF-8) didn't exist.
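
To illustrate, a small Python sketch of the two spellings of ä and how normalization relates them:

    import unicodedata

    precomposed = "\u00E4"    # ä as a single code point, as in ISO/IEC 8859-1
    combining   = "a\u0308"   # a followed by COMBINING DIAERESIS

    print(precomposed == combining)                                 # False: different code point sequences
    print(unicodedata.normalize("NFC", combining) == precomposed)   # True: NFC recomposes
    print(unicodedata.normalize("NFD", precomposed) == combining)   # True: NFD decomposes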

> I know nothing about Hangul.

It doesn't matter that you know nothing about Hangul, because you are suggesting removing features that Hangul depends on! Prove that Hangul doesn't need those features, or accept that your complaint about Unicode, based on your misunderstanding of internationalization, is unsubstantiated.

> Tell me about the Fraktur example. Two fonts in Unicode for the same letters, and that's only the beginning. Unicode has boldface and italic encodings, too, but only for some letters.

These characters are in the (amusingly, self-explanatorily named) Mathematical Alphanumeric Symbols block, where the same character with different styles can have different semantics. It was originally proposed by the STIX project in 1998 (L2/98-405) in the form of variant tags and later changed to separate characters (L2/99-105) because variant tags cannot be limited to a set of desirable characters. Nowadays they are more frequently used as "formatting" in plain text, but Unicode isn't responsible for any such (incorrect) uses beyond mathematics.
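
Incidentally, these code points carry compatibility decompositions back to the plain Latin letters, so compatibility normalization folds the styling away; a quick Python sketch:

    import unicodedata

    styled = "𝖋𝖔𝖓𝖙𝖘"   # the Mathematical Alphanumeric Symbols from the comment above
    print(unicodedata.normalize("NFKC", styled))   # prints 'fonts': the <font> decomposition drops the styling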

Oh, and all of this information can easily be found in the L2 documents linked from that Wikipedia page. If you had done minimal research on the topic before complaining, I wouldn't have been this sarcastic.

> But people are familiar with that, after all, even diff programs have an option to ignore whitespace differences.

Ideally diff should be aware of the structure of input files so that they can filter out non-semantic changes. Whitespace removal is a cheap approximation that doesn't always work. But that's another topic.

> Have fun writing a 100% correct program that can diff two Unicode files and not show embedded but invisible differences.

Why should I do that? I do care about trailing whitespace and unused control characters in a plain text file, and I want to see changes to them.

I guess you have made this problem up again as a hypothetical example (please let me know the context if not), but if you really need the visual difference of texts, the correct answer is that you should consult the rendering system---not Unicode---to list a series of glyphs and compare against them, because it has the final say about the actual visual output. It's like Unicode string indexing [1]; yes, you can do it under certain circumstances, but why would you do it in the first place? You are almost surely solving the wrong problem.

[1] https://news.ycombinator.com/item?id=20050326


> Why should I do that?

If you're writing a spell checker, for example.

> hypothetical example

Spell checkers, string search, sorting strings, etc., are all real programming tasks.

> the same character with different styles can have different semantics

Yes, I know that. Unicode exceeded its charter by adding invisible semantic content like that, for the simple reason that it cannot succeed at that task. The semantic meaning of text is imputed from the context, not the glyph.


> Emoji business is surely ugly (and I have complained about this a lot), but the very reason that Unicode has emoji is that Japanese telcos exposed their proprietary encodings to email in the wild, and both Apple and Google had to implement them for compatibility. Blame them, not Unicode.

The initial set of emoji was brought in for compatibility purposes. The huge swaths of new emoji covering every conceivable animal, vegetable, and mineral created after that point are where the problems started (IMHO).


The very first version of emoji was already not just a compatibility encoding. If it were solely for compatibility, every character could have just been named EMOJI COMPATIBILITY-HHHHH. Unicode (correctly) determined that the presence of the initial emoji would trigger further needs and prepared basic encoding principles: no trademarked characters, unify with existing characters whenever possible, country flags use generic regional indicators (there was already a specific request from national bodies at that time), and so on.
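
For example, a flag is not a dedicated code point at all but a pair of regional indicator symbols, which a renderer may show as a flag or as the bare letters; a small Python sketch:

    # A "flag emoji" is just two REGIONAL INDICATOR SYMBOL LETTER code points.
    flag = "\U0001F1EB\U0001F1EE"          # letters F + I; many renderers display the Finnish flag
    print(len(flag))                        # 2 code points, no dedicated "flag of Finland" character
    print([hex(ord(c)) for c in flag])      # ['0x1f1eb', '0x1f1ee']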


RE point 4, I believe that this was initially a descriptivist rather than prescriptivist step.

People were using emoticons, showing that emoji might be desirable. Some cell phone manufacturers/carriers added their own emoji [1]. They were very popular. Unicode added emoji to the standard, since lots of people were using them.

Not adding emoji would have been _more_ prescriptivist, and since it was desired anyway would have just caused fragmentation.

[1] https://blog.emojipedia.org/correcting-the-record-on-the-fir...


Emoticons are simply interpreting :-) as a smiley face, no Unicode invention is necessary.


Yes, and emoji came from emoticons, meeting a need to express more complex ideas. Unicode adopted emoji after they were popularised.


> meeting a need to express more complex ideas

Still not necessary to have special Unicode code points for them.

    :poop:
should do just fine. No need for more than that.


:poop: was never made into stuffed animals or pillows either. Branding and imagery mean something to some people, even if not to you.


What does branding have to do with Unicode? Is there a Unicode code point for the coca-cola trademark or the Amazon smile?


Probably by Unicode 16 there will be. :p


Not in the Unicode guidelines as they currently stand ':).

They are very explicit that this is not permitted [1].

Also, RE :poop: 1. Huffman encoding philosophy - encode frequently used symbols using fewer bytes. 2. Standards are helpful, :poop: is very English-centric, whereas U+1F4A9 is much more interpretable.
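
On the byte-count point, a quick sketch of the UTF-8 lengths (Python):

    print(len(":poop:".encode("utf-8")))       # 6 bytes of ASCII
    print(len("\U0001F4A9".encode("utf-8")))   # 4 bytes for U+1F4A9 PILE OF POO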

[1] https://unicode.org/emoji/proposals.html#Selection_Factors_I...


>Not in the Unicode guidelines as they currently stand ':).

I know, I'm just being humorously facetious. (Or at least I'd like to think that I am.)


What about this one:

  :couple-with-heart-woman-man-light-skin-tone-medium-light-skin-tone:


Nothing wrong with that. The renderer can do what it likes with that.

The thing is, once you open the door to every-picture-must-have-a-unique-code-point, where do you stop? How many trillions of code points are needed?

Hey, Unicode should take the next logical step, and allow embedded Javascript! What could possibly go wrong with that?


Hey, it degrades gracefully when displayed on a system that doesn't support that glyph. Sounds like a nice feature to me.


Did you know that eons ago, people would draw a picture of a bull to mean a bull? Then they got lazy, and would just draw the head? Then they decided to just have the bull's head represent the first sound of their word for bull. Then they simplified the bull's head to 3 straight lines (people get lazy). Then for some reason decided it was easier to write if they rotated it 180 degrees.

    A
Yup. A bull's head. But now that's no good anymore, and we're back to a picture of a bull, and need about a million more pictures to flesh out our new and improved alphabet. (There are a million words in the English language.)


Considering the explosion of emoji alongside photo and video now that they're more accessible to the masses, one suspects that most people prize being able to communicate in more ways, even informationally inefficient ones, as opposed to what you seem to prize more: being able to communicate through text alone.


To embed pictures and video, we have HTML. No need to put it in Unicode.


Genuine question: can you explain how phonetics can be used in a character encoding for text?


I don't understand your question. `a` is for the sound `aaaahhhh`.


But 'j' in Spanish sounds completely different from 'j' in English. Which of these languages should give up their j and add a new letter to their alphabet? At least in Spanish letters are usually pronounced the same, so presumably Spanish is closer to your ideal. Contrary to English, which seems to be designed specifically to torment its speakers.

Here's a famous poem: https://ncf.idallen.com/english.html Perhaps you could give an indication how this poem would be written in English in a hypothetical phonetic spelling? Or would all the languages just convert to the International Phonetic Alphabet, and we'd forbid all the existing alphabets? :)


Thanks, you beat me to it with that poem :-)

There have been attempts for a phonemic "respelling" of English: https://en.wikipedia.org/wiki/Pronunciation_respelling_for_E...

However, apart from edge cases such as a writer wanting to convey the pronunciation of a word the audience might not be familiar with (e.g., ar-kuhn-saw for Arkansas or rice-lip for Ruislip ;-) ), other than for dictionaries, they are of little practical relevance.

Personally, I sincerely dislike the predominant use of these phonemic spelling variants in lieu of the IPA in many dictionaries these days.

In an understandable - and laudable - attempt to make their content more accessible to users who are not familiar with the IPA they, however, often got rid of the corresponding IPA representations entirely, which in turn makes the content far less accessible to users who are familiar with such representations.

That's a classic example of "knowledge encoded in the world" (phonemic representations) vs "knowledge encoded in the head" (IPA) (from the "The Design of Everyday Things" by Don Norman).


> Which of these languages should give up their j and add a new letter to their alphabet?

Neither. They both use 'j'. Problem solved. No, I'm not being facetious. Spanish books are written in Spanish, and English books are written in English. Both use 'j'. Somehow, it works. No need to invent problems.


No response to the poem, then?


I thought my response was obvious. Should I cut&paste it from the article and post it here?


It's not obvious at all. But let's cut&paste some of your stuff:

> `a` is for the sound `aaaahhhh`.

That's great for words like 'father', 'fall', and 'dark'. But not for words like 'dad' and 'fat' and 'frank'. Or 'same' and 'lame'. So no, in English, 'a' is not the sound for 'aaaahhhh'. 'aaaahhhh' is just one of the various sounds for 'a'. In a lot of ways English is like those hieroglyphics you seem to hate so much, because you have to know the mapping between the words and how they are written. It probably doesn't feel like that to you, because you're used to it by now.

If you really want phonetic spelling, you'll have to change the spelling of most English words. That is fine, but I think most people would find that much less appealing than using unicode. Your suggestion would then basically boil down to:

"Hey, let's change the spelling of most English words you know, and also ban emojis!"

I would prepare for lots of little upside-down hands with the thumb pointing down.


So you want to turn Unicode into a semantic language, too?


In most languages there's no immediate, bijective mapping between graphemes and phonemes.

Some Indo-European languages such as French, Spanish, Italian, or Dutch come quite close in that there is a one-to-one (in some cases context-sensitive) mapping between graphemes or grapheme clusters and the sounds they signify.

For English, famously, there's no such one-to-one mapping and for many cases pronunciation has to be memorised: https://ncf.idallen.com/english.html


Somehow, English printed books do just fine without icons.


Sure, they do. That has little to do with phonetics, though.


> That has little to do with phonetics, though.

I learned to read by using phonetics. It has a lot to do with the success of English text. English works well as a phonetic language. Just like the text I write here can be read by any second grader.


The ‘u’ in ‘ruby’ and the ‘u’ in ‘run’ sound different. English writing is not phonetic. Arabic, I think, pretty much is though.


> English writing is not phonetic.

It's phonetic enough to work well as one. Teaching kids to read by sounding out the words works far better than any other method (like look-say).

Have fun punishing students by making them learn 10,000 hieroglyphs instead.


I’m a bit lost as you were talking about character encoding with phonetics originally and I couldn’t picture how that would work.

With kids yes in would agree sounding out (and pointing out exceptions to the rule) works well enough. But I’m no teacher!


There's a bit of a misunderstanding here. I was not talking about encoding phonetic sounds into Unicode. Unicode should not imply sounds or meaning.

Unicode is about visual representation.

The argument about phonetic alphabets is they are superior to hieroglyphs.


‘A’ as in ‘car’, or ‘A’ as in ‘care’?

Yes, I might be being facetious, but letters are divorced from pronunciation, and the great vowel shift made that significantly worse amongst European languages.


My acorn disagrees with you.


Unicode is supposed to be about the visual representation, not the aural.


Then that would have been a much better answer to GP's question "can you explain how phonetics can be used in a character encoding for text?"


7. Han unification


Yeah, in my opinion this is by far the biggest one. The rest of the list are either non-issues, or they're annoying but you can deal with them (and there are often arguments on both sides).

Han unification, on the other hand, is a disaster for Unicode. The fact that Unicode can't comfortably mix Chinese and Japanese language characters is a disaster for an encoding like Unicode that's supposed to cover the whole world.


3. Yes, sadly.

4. Can you expand on that?


> expand

Sure. I don't know what most of the buttons do on my car's dashboard, because of the "iconitis" disease that has infected user interface design. Unicode has blessed and encouraged this crap. They don't just standardize, they invent icons intending them to replace the phonetic words.

(Apple is kinda schizophrenic on this. Sometimes you'll see "DELETE" as a button, sometimes a trash can.)

I don't know what the wash icons on my clothes mean without going to google and looking them up.

You can't look up an icon in a dictionary, making it a fundamentally wretched replacement for a phonetic alphabet. Worse, people persistently try to invent icons for verbs and adjectives, which never ever work. Even for nouns, is that scribble a bird, a duck, or a chicken?

My car has a button with something that looks like a snowflake on it. Does it turn the heater on? The defroster? The air conditioner? Turn on traction control? Who the hell knows?


> Who the hell knows?

The manual.

I see what you’re saying, Gmail is painful to use for this exact reason. I’d argue it’s fine on cars though for a few reasons:

- there probably aren’t that many icons to be aware of in a car

- cars are generally owned for a long time, so learning 2 new icons every 5-10 years is not much to ask

- cost-cutting on I18n: the snowflake could be AC in English but “airflugencoldensteiner” in some other language. It doesn’t scale and it can’t be sold across borders without looking out of place.


> The manual.

Right, while hurtling down the freeway I can always dig the 500 page manual out and dig through it. Note that I can't use the index to find those icons, because icons have no alphabetic order. Like I said, going backwards 2000 years.

> airflugencoldensteiner

So just use AC. You can even look it up in the index.

> It doesn’t scale and

Yah, show it to an illiterate desert dweller, and ask them what they make of it.

> it can’t be sold across borders without looking out of place.

Of course it can be. Whatcha gonna do, translate that entire 500 page manual into hieroglyphs?

Isn't it hilarious that now an entire generation of adults has never used a phone that looks like the typical phone icon? The phone icon on my iphone looks like a phone from the 1960's. However, "PHONE" has remained constant.


>Right, while hurtling down the freeway I can always dig the 500 page manual out and dig through it. Note that I can't use the index to find those icons, because icons have no alphabetic order. Like I said, going backwards 2000 years.

You also can't look at an index to decipher what it means to see "red light bulb that is 'on' over two light bulbs that are 'off'" or "human figure walking over piano keyboard inside a white triangle which is inside a blue square" while you are driving; you are supposed to get familiar with those symbols before you get in the car. A typical car has barely a dozen or two pictograms, and any car is a machine that can easily kill and maim. It would be advisable to look at the manual before one starts driving. Of course, it would be advisable if those symbols were also standardized like traffic signs.

I actually agree with you about Unicode mission creep, but it is better not to go overboard with pictograms that have nothing to do with Unicode.


  > Of course, it would be advisable if those symbols were
  > also standardized like traffic signs.
The warning lights and symbols necessary for operation of the vehicle are standardized. The only two icons that have different variants across vehicle manufacturers, that I am aware of, are the Parking Brake Engaged icon (which could have an exclamation point or a P) and the Airbag Disabled icon (which has varying levels of body thickness for the depicted human).


Traffic lights have many advantages over icons or text. They are far more reliable than electro-mechanical moving signs, can be seen at night, and can be seen at a much farther distance. The vertical configuration came about because of color blind people. I'm old enough to remember when they were horizontal.

Warning lights are used to draw attention.

It's not an argument for icons.


My old car has controls labeled "defrost", "heat", "cold". I never even had a manual for it.

US roads still have plenty of signs in English, like "Speed Limit", "Stop", "Yield", "Road Construction", "Detour", "No Turn On Red", etc. Nobody has any trouble understanding them.


> US roads still have plenty of signs in English, like "Speed Limit", "Stop", "Yield", "Road Construction", "Detour", "No Turn On Red", etc. Nobody has any trouble understanding them.

I strongly suspect that "Nobody has any trouble understanding them" is not true. While I agree with you that iconitis can go overboard, I think pretending that words are universally clearer—in the sense of every place, but also for everyone, not just you—is an exaggeration.


> I strongly suspect that "Nobody has any trouble understanding them" is not true.

This sort of comment is inevitable. I could say "everyone likes ice cream" and I'd get a comment saying they know someone who doesn't like ice cream. I could say "people have two legs" and there'd be a comment about someone who was born with one leg.


> This sort of comment is inevitable. I could say "everyone likes ice cream" and I'd get a comment saying they know someone who doesn't like ice cream. I could say "people have two legs" and there'd be a comment about someone who was born with one leg.

It seems to me that you prove a point against your own argument here. If you said "people have two legs" in an anatomy class, then that would be the wrong place to object. If you said "people have two legs" as an argument against installing wheelchair ramps, then I think that would be the right place to object that some people don't.

Given your other comments in this thread, your statement that "Nobody has any trouble understanding them" seemed to be a statement to be taken seriously if perhaps not completely literally, rather than a casual statement clearly understood to be loaded down with qualifiers like "on average, in predominantly English-speaking parts of the US …". I took your argument to be against using icons, even in situations where they might allow non-English speakers equal access to crucial information, and it seems to me worth pointing out that universally quantified statements in that setting are not only untrue in the most technical sense, but are even likely to be harmful if used to guide policy and decisions.


> Nobody has any trouble understanding them.

People who don’t speak English perhaps.

Traffic signs are something the Europeans did well and are just part of taking a driving license. Crossing a country border won’t mean you suddenly can’t read traffic signs.

Meanwhile in Thailand I find the occasional full-Thai sign I have no idea of what it means. Maybe it means turn left on red allowed, maybe the opposite. Good luck, everyone else.


> People who don’t speak English perhaps.

There's a huge difference between being competent in English and knowing a handful of English words. For example, if I travel in France and stay at a hotel with a french name, am I going to have to learn French to associate the name of the hotel with the hotel? Of course not.

> Maybe it means turn left

The thing is, if the Europeans had standardized on the word "STOP" for stop signs, people would have learned that just as easily as the icon for STOP. Oh wait, they did just that!

(I'm old enough to remember when the Germans used "HALT" signs.)


The USA should become a signatory to the Vienna convention on road signs and signals like its neighbour country. That way, people will have even less trouble. More people comprehend the common, global (pictographic) language than English.


> So just use AC. You can even look it up in the index.

No, you “can” not, you “must” look it up.

I know what a sun is, I don’t know what HO means, I will have to look it up.

You’re looking at the problem as if there was no language other than english.

Growing up in a non anglophone country I had to deal with plenty of english-only interfaces that didn’t need to be, like ON/OFF.

I agree that bird icons are hard to decipher, but usually that’s not the case for common situations, like for the play, on, off buttons.


> plenty of english-only interfaces that didn’t need to be, like ON/OFF.

Is it really faster to learn what "|" means than "ON"?

> I had to deal with

And now you're fluent in English, while learning "|" was useful only for a switch.

Again, nobody has to learn English in order to know what a handful of words like "ON" mean. You don't even need to learn the alphabet.


> I don't know what the wash icons on my clothes mean without going to google and looking them up.

But odds are these icons have been there (unchanged) since before the invention of unicode, so that's hardly an example of the "iconitis" you mention, right?


I don't recall them before Unicode. I do recall "dry clean only" on the tags, etc.

Iconitis escaped from the laboratory in 1983, when Apple decided that English was obsolete and we should revert to the language of the Pharaohs. (Well, except that even the Pharaohs of ancient Egypt discovered that phonetics was better, and turned their hieroglyphs into a phonetic alphabet. The Mayans did, too.)

An Apple evangelist came to my workplace in 1983 and gleefully tried to sell the engineers on the idea that a drawing of a Kleenex box was better than the word "print". The trouble is "print" is a verb, and nobody has yet come up with an icon for it in 40 years of trying.


> I don't recall them before Unicode. I do recall "dry clean only" on the tags, etc.

These icons are far older than Unicode, and in fact are not in Unicode at the moment. There's a proposal to add them, but I don't believe it has passed yet. The standardized icons themselves date from the 1970s, though non-standard laundry symbols are much older than that.

Design use of icons over text has nothing to do with Unicode; it's a natural part of human expression. Icons are everywhere, from traffic signs to laundry symbols to buttons on computers to rambling signs. I think your reading of the typographical history here is simply wrong; neither Apple nor Unicode is responsible for the "iconization" of communication. We've always used icons for communication, and we always will.

If your issue is with emoji, then you should make that case instead. Personally, I think it's great that if I send an emoji from an iPhone to an Android phone (or from/to essentially any device), you get a consistent symbol, instead of being locked in to whatever your manufacturer was using (the fact that this kind of lock-in was occurring was the original motivation for adding emoji to Unicode).

Btw, laundry symbols are a perfect demonstration for another wonderful feature of using icons: they're cross language. You can put one tag on a piece of clothing and sell it anywhere in the world. Not everyone speaks English.


> These icons are far older than Unicode, and in fact are not in Unicode at the moment. There's a proposal to add them, but I don't believe it has passed yet.

And yet we get bagels both with and without cream cheese. Says something about Unicode's priorities doesn't it?


> You can put one tag on a piece of clothing and sell it anywhere in the world. Not everyone speaks English.

This argument sounds attractive and inclusive, but it doesn't make any practical sense.

If a cheat sheet is needed to determine what icons mean, one column for the icon and the other with the meaning in the user's native language, what is the essential difference between that and replacing the icons with English words?

Why is it easier to learn what a strange symbol means than the word "print" ? At least if someone is faced with "print" they can figure out what it means by typing it into google or a dictionary. Not so with an unfamiliar icon.


The laundry washing pictogram symbols are a European (French) thing from the 1960s.

edit. Source: https://web.archive.org/web/20201001165119/https://www.ginet...


Try this with Hebrew... Because of the RTL nature of the language, we had ISO-8859-8 and ISO-8859-8-I, which as a child I always thought of as "Inverted". The characters would render the same, but going backwards. When entering a website you never knew if the author wrote the text backwards so you could present it as-is, or wrote it normally so you needed to flip it. And I can still recall some websites using CP852, back from the DOS era. Entering a website really did start with about a minute of fiddling with the encoding.
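
A rough Python sketch of the ambiguity: the same Hebrew word stored in logical (reading) order versus visual (left-to-right display) order encodes to equally valid ISO-8859-8 bytes, and nothing in the bytes tells you which convention the author used:

    logical = "\u05e9\u05dc\u05d5\u05dd"   # shalom in reading order
    visual  = logical[::-1]                # the same letters stored backwards

    print(logical.encode("iso-8859-8"))    # both encode fine...
    print(visual.encode("iso-8859-8"))     # ...so the reader has to guess, as described above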


The note about IRC reminds me of troubles that faced Finnish IRC users when UTF-8 got more popular. Since IRC networks generally didn't (and don't) have actual support for different encodings and mostly just deal with byte streams, this brought a lot of issues with Ä, Ö, Å, €, and some other more esoteric characters. Naturally a lot of ä and � abounded.

An interesting consequence was that channels that had non-ASCII characters in their names were split into two, since to the network they were different characters. I remember taking over a couple of channels by creating the UTF-8 versions of them and waiting for people to slowly migrate over.

Getting all of this to work correctly was quite difficult. With Irssi and similar terminal clients, you'd have to correctly configure not just the client but also screen, $TERM, locale, and your terminal. Even if you had a correctly configured UTF-8 client, you could still have problems. Since you can't tell 8-bit single-byte encodings apart other than by heuristics, typically you would have your client attempt to treat a message as UTF-8 and if it's invalid, use a fallback option like ISO-8859-15 (latin-9). But here's the fun thing about that: since IRC networks only deal with byte streams, they may truncate an UTF-8 message in the middle of a multibyte character. This would fail to be detected as valid UTF-8 and would use the fallback option, leading to mojibake for seemingly no reason.
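
A minimal Python sketch of that failure mode (illustrative only, not any particular client's code):

    # UTF-8 message truncated by the network mid-character, then mis-decoded via the fallback.
    message = "tärkeä viesti".encode("utf-8")   # each ä is two bytes in UTF-8
    truncated = message[:7]                     # the cut lands inside the second ä

    try:
        text = truncated.decode("utf-8")        # fails: incomplete multibyte sequence
    except UnicodeDecodeError:
        text = truncated.decode("iso-8859-15")  # fallback "succeeds" but produces mojibake

    print(text)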

All of this led to quite some resistance to UTF-8 on the channels I was on. It was deemed the problems were bigger than the new possibilities. I mean, we could speak Finnish/English just fine and there was usually no need for any other languages. Eventually UTF-8 won, especially when mIRC released a version without an option to change the encoding.


> It was deemed the problems were bigger than the new possibilities. I mean, we could speak Finnish/English just fine and there was usually no need for any other languages.

It also excluded a large part of the world from participating on IRC. Maybe if it was more proactive, IRC would have a larger role to play than it does now?

Another example is punycode. Unicode URLs are super unreadable, so no wonder the rest of the earth's population doesn't care enough about the open web, or about the importance of URLs vs apps.


> It also excluded a large part of the world from participating on IRC. Maybe if it was more proactive, IRC would have a larger role to play than it does now?

I have to clarify that this was based on my experience in Finnish IRC channels. We spoke Finnish (and some English because English is everywhere) and we did not wish to bring any other languages into our channels as we wouldn't be able to speak Finnish then anymore. The character set issues were really just about our own special characters and ISO-8859-1 vs ISO-8859-15 vs UTF-8 users.

Obviously character set issues probably did discourage some people from using IRC but I doubt it had anything to do with Finnish channels specifically. In the end, I think IRC was just overtaken by more convenient and featureful technologies (not to say that IRC is dead, but it's far from what it used to be).

Punycode is another one of those unfortunate things we have to have to deal with due to things like homograph attacks (apple.com vs аpple.com).
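
A small sketch of the homograph problem and the punycode form that IDNA produces (Python's built-in idna codec implements the older IDNA 2003 rules, but it's enough to show the idea):

    # Two visually identical hostnames: Latin 'a' vs Cyrillic 'а' (U+0430) in the first label.
    latin   = "apple.com"
    spoofed = "\u0430pple.com"

    print(latin == spoofed)          # False: same glyphs, different code points
    print(spoofed.encode("idna"))    # the ASCII-compatible xn--... form that ends up in URLs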


I remember this as well on IRCNet on a few German channels.

Basically mostly it was decided that everyone had to configure ISO-8859-1 (or ascii) or be kicked, and at some point the majority of ops then decided that now UTF-8 was correct.

It didn't feel great a lot of the time, but to be fair those were communities governed by some people and the barrier to entry was configuring your client correctly. I might be misremembering but most bigger channels kinda agreed on the timeline, so they all switched to UTF-8 around the same time, so it kinda worked.


The very same thing happened in French-speaking channels.


Our still-alive IRC channel never migrated. You still need to enter \xE4 in the join command to get the correct ISO-8859-1 channel. I think the stream of unintentional visitors stopped around the UTF-8 migration, though that could also have been an effect of the decline of IRC use among the general public.


>ISO-8859-15

Ditto in Spanish; I still have issues at IRC-Hispano with some users, as I have UTF-8 for everything.


Yes, I moved from France to Japan in 2001, and at the time it was almost impossible to have French accents and Japanese characters on the same OS. The web was where things worked best (because web browsers know to switch encodings per page), but desktop apps, which were still the norm, were a shitshow.

Even on Linux, to have support for the Japanese language you had to install a Japanese distribution (TurboLinux or the RedHat Japanese version, for example), and it was a real pain to get accents to work.

Worse, Japanese had three encodings!! One for Windows (SJIS), one for Unix/Linux (EUC-JP), and ISO-2022-JP. And of course, Japan being Japan, Japanese companies were really slow to switch to Unicode and stuck with their shitty encodings for a long time.
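
A quick Python sketch of how differently the same text comes out under those encodings (plus UTF-8), which is why guessing wrong produces instant mojibake:

    text = "日本語"
    for enc in ("shift_jis", "euc_jp", "iso2022_jp", "utf-8"):
        print(enc, text.encode(enc))   # four different byte sequences for the same three characters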


> And of course, Japan being Japan, Japanese companies were really slow to switch to Unicode and stuck with their shitty encodings for a long time.

Not were, are. I'm currently living in Japan, and most of the emails I receive are encoded in Shift-JIS or ISO-2022-JP. Many websites too. Thankfully modern software is quite robust and displays everything properly, but I still get the random mojibake from time to time.


Encodings are the easy part. To render stuff, you need fonts, and sometimes they are built into hardware (especially with dot-matrix printers, for instance, but even laser printers embed a couple of fonts, and likely custom Japanese fonts too), and software expects fonts with a particular encoding too. All of these were issues even for "simpler" scripts like Cyrillic (Serbian).

Some of the problems with fonts remain even with Unicode (due to Han unification, different CJK regions will prefer different fonts; there might be OpenType locl-aware fonts these days, but they'd be huge).


The only remaining task is to get rid of UTF-16 usage [1].

[1] http://utf8everywhere.org/


One thing they got wrong is the fact that upcase(downcase('İ')) is not reliably 'İ' and downcase(upcase('ı')) is not reliably 'ı' without extra assumptions.

That is, Unicode is missing

(1) Unicode Character 'LATIN CAPITAL LETTER DOTLESS I' (????)

and

(2) Unicode Character 'LATIN SMALL LETTER I WITH DOT ABOVE' (????)

or some such so that upcase('ı') could be mapped unambiguously to (1) and downcase('İ') could be mapped unambiguously to (2). Of course, they would have identical glyphs with 'I' and 'i', respectively.

I haven't investigated if there was a rationale to save two codepoints, but handling documents that contain an unknown mix of Turkish and other languages becomes rather weird. For example, "sık sık" becomes rather inappropriate after going through downcase∘upcase and "kilim" maps to "kılım" after going through the same mapping.

Most software relies on the current locale to make these decisions. That is, of course, not feasible in systems that process documents received from arbitrary sources.
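
A small illustration with Python's built-in case mappings, which follow Unicode's locale-independent defaults rather than the Turkish rules:

    print("kilim".upper())            # 'KILIM'
    print("KILIM".lower())            # 'kilim' -- fine for English, wrong for Turkish 'kılım'
    print("\u0131".upper().lower())   # 'i', not 'ı': lowercase(uppercase('ı')) doesn't round-trip
    print("\u0130".lower())           # 'i' + U+0307 COMBINING DOT ABOVE, so 'İ' can round-trip up to normalization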


> I haven't investigated if there was a rationale to save two codepoints, but handling documents that contain an unknown mix of Turkish and other languages becomes rather weird.

The main reason is probably that ISO-8859-9 / Windows-1254 already only had a single I and i so it would be impossible to know which Unicode character to convert them to. Still might be worth it to add your suggested codepoints though so that uppercase/lowercase that is not locale-aware can at least work correctly for new text.


> The main reason is probably that ISO-8859-9 / Windows-1254 already only had a single I and i so it would be impossible to know which Unicode character to convert them to.

The alternative codepoints I proposed could coexist with the previous ones. After all, it is not like Unicode does not offer different ways of representing what humans consider to be the "same".

For example, 'İ' ≡ uc( lc 'İ' ) can be preserved by representing the result of lc 'İ' as "i + COMBINING DOT ABOVE" which I found to be rather clever[1].

However, I am not aware of a similar trick enabling me to preserve 'ı' ≡ lc( uc 'ı' ).

For reference, I have been dealing with Turkish character issues etc since the 80s when I was typesetting texts containing math, Turkish, and several European languages using WordStar during which time I had to hack my own keyboard drivers because there was zero support for Turkish. Then, there was a period where every computer had its own convention.

A lot of people used my hacks for a long time without even realizing it. So, things have improved, but I just can't figure out why not reserve a couple of extra codepoints for logically distinct concepts. It would make nothing worse and would improve things in situations where one can take advantage of them.

[1]: https://www.nu42.com/2017/02/for-your-eyes-only.html


Other languages have similar problems, although not as striking. It does not make sense to special-case Azeri, Crimean Tatar, Kurdish, Turkish, Tatar as you propose.

The real problem is giving in to the notion that having an unknown mix of languages in a text is an acceptable state of affairs. It's not; we should recognise the problem for what it is and work towards a future where it occurs less. Documents carry metadata about their language content, e.g. in HTML and OpenDocument, precisely so that algorithms which depend on a language, like case-folding or word matching, work correctly.

The situation is analogous to having a document with an unknown/undeclared text encoding: rely on the metadata where supplied, otherwise make a guess, perhaps informed by statistical properties of the text (e.g. uchardet) rather than system locale. It's not ideal, but works more often than not.


Even for Latin-script-based languages pre-unicode times were hard. Polish had about a dozen encodings: https://pl.wikipedia.org/wiki/Kodowanie_polskich_znak%C3%B3w

The first five (except the Mac one) were still in common use in the 90s, and the first three were still used when Unicode appeared.

Most programmers in Poland in the 90s remembered what encoding errors looked like between the most popular encoding pairs, because it was a constant struggle.


I remember making vocabulary lists for Russian and Greek in LaTeX with Emacs on Linux, before Unicode; it wasn't easy. I definitely spent more time getting it to work than actually using them.


The Italian alphabet is very simple. It only adds some accented letters to the base English alphabet (àèéìòù and uppercase variants). Still, today I see a lot of mixed Latin-1 / UTF-8 errors. That's the reason why it is very common to see "E'" in official documents instead of "È", and so on.


No, the reason for E’ is that it doesn’t appear on the keyboard and there’s no obvious way to make it work.

You’ll also occasionally see PERCHè for the same reason.

Also non-digital natives don’t really know the difference between É and E’ on paper, they all look the same.


This reminded me of CDAC's GIST, for enabling Indian languages on computers; I worked on it two decades back. https://www.cdac.in/index.aspx?id=mlc_gist_about


>Did they add seamless transcoding for old files from DOS? Of course not! A Russian version of Windows effectively used two different encodings: native Windows parts used Windows-1251, while the DOS subsystem still used CP866.

To be fair, for those of us in the West, it was similar, if not quite as bad. CP437 and Windows-1252 were nearly as different. Woe to you if you opened a DOS text file in Notepad, especially if it had ASCII art or box-drawing characters, because it'd look all messed up, and vice versa.
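
A minimal sketch of what that looked like, using Python's codecs for the two code pages:

    box = "╔══╗"                  # box art drawn with CP437 box-drawing characters
    raw = box.encode("cp437")     # the bytes actually stored in the DOS text file
    print(raw.decode("cp1252"))   # 'ÉÍÍ»' -- roughly what an ANSI-codepage Notepad would show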


An informative and entertaining journey through the world of Russian character encodings, of which I previously knew nothing. Thanks for sharing!


> That encoding is, to put it politely, mildly insane: it was designed so that stripping the 8th bit from it leaves you with a somewhat readable ASCII transliteration of the Russian alphabet, so Russian letters don’t come in their usual order.

No, it is not insane. It simply continues the traditional way of creating the national Morse alphabets — almost all of them are transliterations of the international Morse code. Historical continuity is quite a harsh mistress.
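
The bit-stripping property the article describes is easy to demonstrate with KOI8-R, the most common later member of that family (a Python sketch):

    # Stripping the 8th bit from KOI8-R bytes yields a rough Latin transliteration
    # (with the case flipped, since uppercase Cyrillic sits over lowercase Latin).
    data = "Привет".encode("koi8-r")
    stripped = bytes(b & 0x7F for b in data)
    print(stripped.decode("ascii"))   # roughly 'pRIWET'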


When I was in the first year of university (1996), my friend programmed a utility for converting between six different encodings used in the Czech language space: CP852, KeybCS2, CP1250, ISO-8859-2, KOI-8čs, and ASCII.

Just look at this page for an impression of how many times the wheel was reinvented in the Czech and Slovak language space...

http://vorisekd.wz.cz/seznam3.htm


> When the USSR broke down, the market was quickly captured by Western hardware and software because no one was making Soviet hardware anymore anyway.

Is that still the case? Are there any large Soviet hardware manufacturers?


The second note at the bottom of the page says: "That hardware was too outdated to keep producing in any case, and couldn’t compete in a free market."

So I'd say no, there aren't any large Soviet hardware manufacturers anymore.


Seems like Russia has their own "Silicon Valley":

> Zelenograd was founded in 1958 as a new town in the Soviet Union, developed as a center of electronics, microelectronics and the computer industry known as "Soviet/Russian Silicon Valley", and remains an important center of electronics in modern Russia.

https://en.wikipedia.org/wiki/Zelenograd


> people who use languages with ASCII alphabets exclusively may think it's unjustified

And what languages would that be? (given that even some borrowed words in English have the letter é for instance)


And that's why the ' character is in ASCII.

Print a lowercase e, backspace, and an acute accent ', and you have printed é.

Underscore, ^, and other characters are there specifically for overstriking.

Not many terminals do that any more, though...


This post is based on the author's experience -- Unicode issues and encoding problems continued to be a thing well into the 2010s, despite Unicode gradually becoming the web's de facto standard (in the West anyway) as web 2.0 ran its course. I haven't worked on many international-language sites in the last 4 or 5 years, but so many of those other regions still have all of the encoding issues and challenges.


When it comes to Unicode, Joel Spolsky's screed is still a classic (note that it was done before UTF-8 became the de facto standard): https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...


An old story: someone from France wanted to send a Harry Potter book to someone in Russia. They got the address over email, but in the wrong encoding: https://unicodebook.readthedocs.io/definitions.html#mojibake
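
The nice thing about that kind of mojibake is that, as long as no bytes were lost, it is mechanically reversible. A Python sketch assuming (hypothetically) CP1251 text that was mis-read as Latin-1:

    address = "Москва, ул. Тверская"                        # made-up address for illustration
    garbled = address.encode("cp1251").decode("latin-1")    # what the mis-decoded printout shows
    print(garbled)                                          # mojibake
    print(garbled.encode("latin-1").decode("cp1251"))       # undoing the mis-decode recovers the original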


By comparison, recently I bought some small homemade electronics thing from Germany. The sender had written Ålesund on the package. Deutsche Post, or possibly whatever system Posten bought from the lowest bidder, declared that Å isn't a real character, so they dropped it. Unfortunately Lesund is also a place.

After it had been stuck in the sorting facility for three weeks, I called them up and eventually reached a guy who could explain that oh yes, the system does that and the package gets stuck in a loop, but eventually a human will look at it.

It did in fact arrive eventually.


Anecdotally, I had this kind of experience mostly with services in big countries: France, Germany, Italy.

Small countries' services (Slovakia, Bulgaria) need to serve many more customers abroad, so they tend to be more friendly towards foreign characters not found in their national alphabets.


From my experience the French post deals with these just fine. They have to, this issue is quite common. I have regularly received mail with a mangled address because whoever sent it did not bother to check encoding and some Unicode got printed as something else.

Also, if you write the country at the bottom (as you should), they will just send it to wherever that country’s postal services will pick it up without bothering to match whatever’s on top to a local place.


> Also, if you write the country at the bottom (as you should), they will just send it to wherever that country’s postal services will pick it up without bothering to match whatever’s on top to a local place.

If one is able to design a postal addressing scheme (many just evolve in an ad hoc fashion over time), being able to fail gracefully is a handy feature to have.

In Canada, our postal code system was set up in the 1970s, and is reasonably rational:

* https://en.wikipedia.org/wiki/Postal_codes_in_Canada

If an international shipper can (a) get the package to Canada Post (by writing "Canada" on it), and (b) has the postal code on it, then it can be narrowed down to a fairly small geographic area:

* https://www.google.com/maps/place/A1A+1A1

* https://www.google.com/maps/place/M4A+1A1


That is my experience in the U.K. as well, where it sounds similar to your description. And it works very well, it is a very sensible solution.


I think the worst mangled address I ever saw was on a package ordered from Italy. The second worst was USA.

Unfortunately, my address has a lot of accented characters (Ř, Á, Č, Ž).


I have seen a fair bit of those from the US as well (mine had both é and à).


It’d be kind of ridiculous for a national service not to recognize its own characters by default. It’s more likely that the general-use OCR failed in your instance.


Å isn't a German character, it's Norwegian. GP was ordering a package from Germany to Norway.


But then they mentioned Posten, so whatever the problem was, it wasn’t Deutsche Post’s fault. Also, they should not try to match the address if it is for another country.


If the Norwegian postal service receives an item addressed to LESUND, it's hardly their fault to take it there instead of ÅLESUND (town names are often capitalized).

A post code should have made the error obvious to the postal system.


Right. I might be dense but I understood that the address was written correctly on the parcel.

So whatever the German post thinks the address is, the label should have been scanned within Norway, and the issue is likely that the Norwegian post’s contractor is incompetent, rather than “lol Germans don’t do foreign names with funny letters” (which is not what the OP wrote, but is the vibe in a lot of replies).

And indeed, a post code should render this very difficult in the first place.


The OP wrote that Deutsche Post (probably) or Posten (possibly) messed it up.

The Germans messing it up seems much more likely. There are plenty of Norwegian addresses containing Å/Æ/Ø.


Yeah, but that does not make any sense. Deutsche Post would not have changed the label, so the address would still be correct.


I have no idea why you think it's inconceivable that Deutsche Post would corrupt the label, but that Posten (Norwegian Post), who actually use the letter Å, would have.


I'm confident OCR fails on a regular basis, which is why they still have humans in place to interpret addresses.


I just find it funny that western Europeans can't tell the difference between Cyrillic writing and random gibberish. By which I genuinely don't mean anything pejorative, in either direction.


Do you mean printed, or handwritten? Because Sütterlin looks like gibberish to most Europeans, and it is just the standard Latin alphabet, give or take. It looks like letters of some kind, but they could be from the Voynich manuscript and they would be just as understandable.

In print, it’s partly because several letters are deceptive and don’t sound anything like the closest-looking ones in the Latin alphabet. Some others are just plain weird-looking if you are not familiar with the alphabet already.


> Sütterlin

I figured that the way to keep one's personal communications private is to handwrite them in Sütterlin.


It doesn't help that the cursive Cyrillic T looks like a Latin M.


The most amazing bit is that the postal employees reversed the bad encoding and delivered the package correctly!


Yet it’s sad that they were able to do this, because it suggests that they’ve encountered it enough to develop a procedure for fixing it. (That seems more likely than it being a one-off thing, since I would expect very few people to be capable of figuring out how to convert it, even if they encounter it commonly.)


I'm supposing the procedure is the same as in software development: go ask old Yuri, he knows about this sort of thing.


And the fact that it was done successfully doesn't mean it was done quickly. It heads off to the Russian dead letter office, someone eventually picks it up for manual review and thinks "this gibberish is too hard for right before lunch" and puts it back. The next day someone else picks it up and thinks "oh, old Yuri knows what to do with this". Or maybe not, maybe it's done quickly. Who knows.


The Slovenian and Slovakian embassies meet each month to exchange wrongly addressed mail.

A package sent to me, living in Austria, Europe, got sent through Australia, Australia.

Mail errors are common.


It seems strange to me that this is done through diplomatic channels rather than the regular international mail routes between the two countries. Unless you mean the mail for the Slovenian and Slovakian embassies in third countries.

But as another example, Korea Post, the South Korean postal service, recommends including "Seoul" in all addresses from outside the country, even if it's going to a different city, to increase the chances of it going to the right Korea.


Mail for the embassies, AFAIK. And because it is a lot of pretty important mail, they deem it safer to exchange it personally.


Ah that makes sense. Thanks for answering. I was thinking the Slovenian embassy in Slovakia takes all the misaddressed/misrouted mail for the whole country and vice versa.


>That seems more likely than it being a one-off thing, since I would expect very few people to be capable of figuring out how to convert it, even if they encounter it commonly.

I don't know, I'd expect a somewhat tech savvy person in the 90s to figure this out.

>open Word

>type Cyrillic alphabet in upper and lowercase

>try out encodings until the characters look like on the package

>print and use as a translation table
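
The same trick is easy to script today. A rough sketch with POSIX iconv (the encoding names are what glibc uses and may differ elsewhere, and Latin-1 on the receiving side is just one plausible combination): run part of the Cyrillic alphabet through a few candidate legacy encodings, mis-decode the bytes as Latin-1, and see which row matches the gibberish on the package.

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    /* Convert a buffer between two encodings; returns output length, 0 on error. */
    static size_t convert(const char *to, const char *from,
                          const char *in, size_t inlen, char *out, size_t outsize) {
        iconv_t cd = iconv_open(to, from);
        if (cd == (iconv_t)-1) return 0;
        char *inp = (char *)in, *outp = out;
        size_t inleft = inlen, outleft = outsize;
        size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
        iconv_close(cd);
        return rc == (size_t)-1 ? 0 : outsize - outleft;
    }

    int main(void) {
        const char *alphabet = "АБВГДЕЖЗИКЛМНОПРСТУ";   /* UTF-8 source text */
        const char *candidates[] = { "KOI8-R", "CP1251", "CP866" };

        for (size_t i = 0; i < sizeof candidates / sizeof *candidates; i++) {
            char legacy[64], garbled[256];
            /* 1. The bytes the sender's mail client actually put on the wire. */
            size_t n = convert(candidates[i], "UTF-8",
                               alphabet, strlen(alphabet), legacy, sizeof legacy);
            /* 2. What those bytes look like when read as Latin-1. */
            size_t m = convert("UTF-8", "ISO-8859-1",
                               legacy, n, garbled, sizeof garbled - 1);
            garbled[m] = '\0';
            printf("%-7s %s\n", candidates[i], garbled);
        }
        return 0;
    }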


As long as the postal code is legible, they can forward it to the post office for that area and let someone local figure it out. The rough shape and numbers might be enough to tell without a decoding table.


I guess I am getting old... In the "old" days, computers and software came with manuals.

A common appendix in those manuals was an encoding table: every column was an encoding and every row a codepoint.

Sometimes you had to enter relatively common characters by pressing a special key and then typing the number of the codepoint for the character you wanted to insert.
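
That kind of appendix is easy to regenerate today. A sketch using POSIX iconv (glibc encoding names assumed), printing one row per byte value and one column per encoding, here for two of the classic DOS codepages:

    #include <iconv.h>
    #include <stdio.h>

    /* Decode a single byte from cd's source encoding into UTF-8. */
    static const char *one_byte(iconv_t cd, unsigned char b, char *buf, size_t bufsize) {
        char in = (char)b;
        char *inp = &in, *outp = buf;
        size_t inleft = 1, outleft = bufsize - 1;
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
            return "?";
        *outp = '\0';
        return buf;
    }

    int main(void) {
        iconv_t cp437 = iconv_open("UTF-8", "CP437");
        iconv_t cp866 = iconv_open("UTF-8", "CP866");
        if (cp437 == (iconv_t)-1 || cp866 == (iconv_t)-1) { perror("iconv_open"); return 1; }

        printf("byte  CP437 CP866\n");
        for (int b = 0x80; b <= 0xAF; b++) {       /* the interesting upper range */
            char a[8], c[8];
            printf("0x%02X  %-5s %-5s\n", b,
                   one_byte(cp437, (unsigned char)b, a, sizeof a),
                   one_byte(cp866, (unsigned char)b, c, sizeof c));
        }
        iconv_close(cp437);
        iconv_close(cp866);
        return 0;
    }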


Back in the day I learned cp437 and its mapping to cp866, and was able to read Cyrillic DOS programs without altering the character generator in the graphics card.

This was useful on a computer with a sort-of Hercules graphics card which didn't have that capability in text mode and behaved like an IBM MDA.

Needless to say, my knowledge got outdated fairly quickly.


> The address was deciphered by the postal employees and delivered successfully.

Wow, I want to know more about how the postal worker deciphered it! On a computer, sure, just convert it. But on paper, how did they work it out?


I used to have popular books about computers which had character maps of popular encodings in the appendices. Every book about programming or about using DOS/Windows 9x had them.

Each book for system users (not even power users!) had a chapter about different character sets and common problems like this. This was important.

I think the post office worker was enough of a geek or they took the package to the computer guy.

A still more interesting thing to me is that the sender got the address in the wrong encoding and, instead of asking via email, must have just thought that Russians have their little ways and dutifully copied all the gibberish (Russians even have a specific term for misencoded characters: krakozyabras).


The Polish term for such characters is "krzaczki" (little bushes), presumably because of the visual resemblance between them and the misencoded diacritics. It still shows up occasionally.


It's pretty obvious from the letters which are superscripted/skipped.

First they knew it was Russia, Moscow from the postcode, so they superscripted РОССИЯ МОСКВА, and then they knew which garbled letters corresponded to those.

After that they were able to decipher ПР Р.А.СКО.О,

after which it becomes obvious (with the help of the index) that the address is Проспект Вернадского, 37. I don't think they used any tables other than this implicit one.

Which is a student dormitory of Moscow State University, btw; I looked it up.


And at no point did the sender stop and think "this doesn't look like Russian". I mean, it's relatively easy for me, who speaks none of the following languages, to tell apart Chinese, Japanese and Korean.

I guess this is consoling for Americans who get blamed for assuming the rest of the world is like their country (postal codes, address formats, date formats, person naming conventions, business customs).


The address there is

Russia, Moscow, (postcode), (maybe a street name), (some number), (untranslated) Svetlane

So it seems to be backwards: The personal name is last, and the country/city is first. Does someone here perhaps know if that is the customary order of address in Russia, or if it was reversed too?


In the whole of the USSR (and after its dissolution too), the order of addresses was big-endian: from the larger administrative unit to the smaller (which seems quite logical — first you enter the country, then find the district, then the town, then the street, then the house, then the person therein, so the address replicates the travel path of the package).

Some countries (doubtless those that entered the EU, also Ukraine, can't say for the rest) officially changed the order to be little-endian, but many older folks, who still use snail mail and write letters and postcards, don't bother.


I was aware such an ordering was common in China, but thanks for that info about the USSR! It is a more consistent and logical ordering. The other ordering is so confusing that America and Europe have both adopted different mixed-endian schemes, which just makes things worse. Consistently increasing-endian schemes are quite rare, but I think you can find one in Australia.


I think you're mixing up endianness. The way I understand it, your first paragraph actually describes little-endian encoding. I understand it as "the little value (i.e., the least significant value) at the end".

EDIT: In fact maybe I'm wrong, if you consider the lowest byte address to be the first line of an address label... now I'm confused, haha.


From Wikipedia:

>A big-endian system stores the most significant byte of a word at the smallest memory address and the least significant byte at the largest. A little-endian system, in contrast, stores the least-significant byte at the smallest address.

So if you count the address "fields" from top to bottom in ascending order, putting the country first and street address last would be big-endian. If you put the street address first and the country last, that'd be little-endian.

Yeah it doesn't make that much sense to me either.
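
If it helps, the memory version is easy to see directly. A tiny C demo (the values shown in the comments assume an x86-style little-endian machine):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t word = 0x0A0B0C0D;
        const unsigned char *bytes = (const unsigned char *)&word;

        /* Print the bytes starting from the lowest address. */
        for (size_t i = 0; i < sizeof word; i++)
            printf("%02X ", bytes[i]);
        printf("\n");
        /* Little-endian (x86): 0D 0C 0B 0A - least significant byte first,
           like starting the address with the street.
           Big-endian:          0A 0B 0C 0D - most significant byte first,
           like starting the address with the country. */
        return 0;
    }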


Welcome to the fun that is address field localization. Some countries put house numbers first, and some put them at the end.

742 Evergreen Terrace vs. Platz der Republik 1.

Google's Contacts app just has a text field and lets Google Maps take care of the parsing.


I mean yes, I'm aware different countries do addresses differently, and that one should not prematurely predetermine the fields of an address in international software.

I had asked a specific question, which your comment does not address.


I hate and love that. Someone must have seen those characters enough times to have become good at deciphering them.

What a waste of time for the postal system though; I’m surprised they didn’t just return it to sender.


Non-ASCII is a mistake; the only reason for it is humans who don't want to learn English.

English IS humanity's final language, so I'm going back to ASCII so I can use a simple char* for all my strings.


On the fragile assumption that this isn’t a lazy troll, the words naïve, façade, fiancée, and coöperative are all in English. And that’s ignoring extremely common English symbols like £ and — that aren’t part of ASCII, and come to think of it the “’” in “aren’t”.

And, hell, I guess that even after English becomes the world language, you’ll still want to spell people and place names correctly?


> And, hell, I guess that even after English becomes the world language, you’ll still want to spell people and place names correctly?

I suspect that anyone who believes that English is the final language (how can it be anything other than a lazy troll?) probably believes that dropping diacritics and otherwise mangling the glyphs of other alphabets is spelling them 'correctly'.


> coöperative

Not that I agree with going back to ASCII, but cooperative is already the dominant spelling by a wide margin to the point where using coöperative comes off as pretentious.


Coming off as pretentious does not make it incorrect, and it is not the business of people writing software to decide what is correct or not.


Yeah, to be fair I was scraping to find another much-used example of a diaeresis. But it’s house style for the New Yorker at least.


Like Latin and then French was? Chinese could very well become the next Lingua Franca.


English is as much an eternal Lingua Franca for human-readable communications as C is for low-level high-performance programming: meaning that it isn't one at all. Both can be unseated with some effort.

C is already being slowly unseated by Rust, and there are more competitors like Zig on the way. English may or may not be unseated by Mandarin, given that American hegemony is waning, and (unfortunately) Chinese dominance on the global stage is rising.


Good news is that you can use char* for UTF-8, which is an ASCII superset, and many operations such as splitting on an ASCII character work just fine with UTF-8 strings without any code changes.
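
A quick sketch of why that works: every byte of a multi-byte UTF-8 sequence has the high bit set, so an ASCII delimiter byte can never appear inside one, and plain char* code keeps working.

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* UTF-8 text, assuming this source file is saved as UTF-8. */
        char line[] = "Ålesund,Москва,東京";

        /* Splitting on the ASCII comma never lands inside a multi-byte
           character, because continuation bytes are all >= 0x80. */
        for (char *field = strtok(line, ","); field != NULL; field = strtok(NULL, ","))
            printf("field: %s\n", field);
        return 0;
    }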


Por que? [Why?] Ne comprends pas. [I don't understand.] Sprache ist überall. [Language is everywhere.] Yīngyǔ jīhū méiyǒu zhǔdǎo. [English hardly dominates.]


There's an option of making char 64-bit, in which case you can always use char*

So that sizeof(int) = sizeof(float) = sizeof(char) = 1


1. sizeof(int) and sizeof(float) are both 32-bit on most current systems

2. Unicode is (currently) only 21-bit.

3. UCS-4 (aka 32-bit char) rarely helps because in most cases you either a) don't really care about characters (most software) and can use substrings as well, or b) code points are an insufficient grouping and you need variable-length units anyway for grapheme clusters (text rendering and editing).
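
For the record, a quick check of point 1 (exact values are implementation-defined; the comments reflect the usual 64-bit desktop ABIs):

    #include <stdio.h>
    #include <wchar.h>

    int main(void) {
        printf("sizeof(char)    = %zu\n", sizeof(char));    /* 1 by definition      */
        printf("sizeof(int)     = %zu\n", sizeof(int));     /* typically 4 (32-bit) */
        printf("sizeof(float)   = %zu\n", sizeof(float));   /* typically 4 (32-bit) */
        printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t)); /* 4 on Linux, 2 on Windows */
        return 0;
    }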


1. They will also be one 64-bit memory word. 2. So we have plenty of room. 3. So it does not hurt.


I think you have some large misconceptions about these things. sizeof(float) is going to return 4 for 4 bytes on any common computer these days. There is no "64-bit memory word" anywhere to be found, whatever that means. Nothing in either of your comments here is true or makes sense.


"Common computer these days" will eventually have to go.

And it probably will be much sooner than trans-Earth English language dominion.


I expected more of a "I guess I misunderstood how C works" sort of reply.


I began coding in 1994. I'm still basic in C, but not that basic.

8-bit byte will eventually have to go.


Will that be before or after the trans-Earth English language dominion?


That's how we ended up with UTF-16, the abomination from Microsoft. UTF-8 or bust.


Microsoft did not invent UTF-16 (which in fact is just a retcon of UCS-2); they were just early adopters of Unicode, and UTF-8 didn't exist yet.


They might not have invented it, but they popularised it. UTF-16 would be dead and a footnote in history if it hadn't been for Microsoft's adoption of it.


UTF-8 appeared a bit late relative to when Unicode was initially being adopted, so Windows NT and other projects like Java adopted UCS-2.



