
How the Unicode Committee Broke the Apostrophe - Ovid
https://tedclancy.wordpress.com/2015/06/03/which-unicode-character-should-represent-the-english-apostrophe-and-why-the-unicode-committee-is-very-wrong/
======
thaumasiotes
Makes a real effort to completely gloss over a very common English use of
apostrophes. From the article:

> Consider any English word with an apostrophe, e.g. _“don’t”_. The word
> “don’t” is a single word. It is not the word “don” juxtaposed against the
> word “t”. The apostrophe is part of the word, which, in Unicode-speak, means
> it’s a _modifier letter_, not a _punctuation mark_, regardless of what
> colloquial English calls it.

> According to the Unicode character database, U+2019 is a punctuation mark
> (General Category = Pf), while U+02BC is a modifier letter (General Category
> = Lm). Since English apostrophes are part of the words they’re in, they are
> modifier letters, and hence should be represented by U+02BC, not U+2019.

> (It would be different if we were talking about French. In French, I think
> it makes more sense to consider «L’Homme» as two words, or «jusqu’ici» as
> two words. But that’s a conversation for another time. Right now I’m talking
> about English.)

OK, I've considered an English word: "man's". In the sentence "That man's
pants are on fire!", this is usually considered a single word, the genitive
case of "man" (personally, I'm not a huge fan of that approach, since the
"genitive" 's attaches to phrases, not to words, but it is the mainstream
position in linguistics).

In the sentence "That man's about to jump", on the other hand, the "word"
"man's" is two words joined by an apostrophe, exactly as in French "l'homme".
These clitics aren't exactly rare in English. The author shows some linguistic
training in the comments to his piece, but never once mentions clitics, and
fails to address them when another commenter brings them up.

Use U+0027 for English apostrophes. ;p
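(For reference, the general categories being argued about are easy to check with Python's `unicodedata` module; a quick sketch:)

```python
import unicodedata

# General categories of the three apostrophe candidates:
for ch in ["\u0027", "\u2019", "\u02bc"]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}")

# U+0027 APOSTROPHE: Po (other punctuation)
# U+2019 RIGHT SINGLE QUOTATION MARK: Pf (final punctuation)
# U+02BC MODIFIER LETTER APOSTROPHE: Lm (modifier letter)
```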

~~~
greggyb
Is a clitic distinct from a normal contraction?

I am not up on linguistics, so I'm just curious, but I'd class "don't" and
"man's" as in "the man's about to jump" the same as just contractions.

~~~
thaumasiotes
What's a "normal contraction"?

The mainstream view is that English "don't" is just a word, the negative form
of "do". The apostrophe is a historical accident. The author of this piece
gives one argument for this view in the comments, pointing out that "don't"
can appear in contexts where "do not" is ungrammatical.

A "clitic" in linguistics refers to an item which is (1) a word in the sense
of having its own dictionary entry (which I might call "at the lexical
level"), but (2) _not_ a word at the phonological level -- clitics depend for
their pronunciation on the word(s) (usually just one word) next to them. So,
for example, the 's of "the man's about to jump" is lexically a form of the
verb _is_, but it's been reduced down to zero syllables. The indefinite
article (a/an) is another English clitic, and you can observe its
pronunciation changing according to the word that follows it in the sentence
pair:

1. A cow trampled me.

2. An elephant trampled me.

Traditionally, the definite article ("the") is also clitic, with one
pronunciation before consonants and a different pronunciation before vowels.
The before-consonants pronunciation is in the process of becoming universal.

Languages differ in whether clitics are written together with the words they
attach to phonologically or not. Ancient Greek clitics are traditionally
separated with orthographic spaces (we know they're still clitics because they
affect the placement of word accents). Latin clitics are written as part of
the same word: "felis canisque" (="the cat and the dog", where -que means
"and"). English uses both approaches.

SUMMARY, about the specific example you chose: "don't" and the "man's" of "the
man's about to jump" are not in the same class, because "don't" is just a word
with no internal structure, and that "man's" is two words which are realized
in speech as a single syllable. That "man's" might be thought of as a "normal
contraction", a term with no meaning that I know of, but linguistically it is
a full word ("man") with a clitic ("'s") attached. However, clitics in general
are not necessarily zero syllables long.

------
TazeTSchnitzel
In an _ideal_ world, yes, this is how it would work. But in practice, it isn't.
The vast, vast majority of documents using non-straight quotes use ’
(U+2019[1], the Windows-1252 \x92 right curly quote that Microsoft Word <3s)
for apostrophes. There's not much that can be done about that.

Unicode has to strike a balance between what's most "correct" and how the real
world actually uses it.

[1] I was looking at that codepoint and thought it must be wrong. It's too big
a number for a Latin-1 codepoint. Aren't the first 256 characters of Unicode
just Latin-1? Well, exactly. They're Latin-1, _rather than Windows-1252_,
which is where the now-infamous curly “smart quotes” come from. The two
encodings are easily confused, because they're mostly the same. The difference
is Microsoft replaced the extra control codes in the high byte (who needs
those, really? ASCII had too many already) with more useful new printable
characters.
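The confusion is easy to reproduce: the same byte decodes to different things under the two encodings (a quick Python sketch):

```python
# The byte 0x92 is the Word-style curly apostrophe in Windows-1252,
# but an unused C1 control code in Latin-1 (ISO 8859-1).
b = b"\x92"
print(b.decode("cp1252"))   # '\u2019' RIGHT SINGLE QUOTATION MARK
print(b.decode("latin-1"))  # '\x92', a C1 control character, not printable
```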

~~~
mcguire
" _Unicode has to strike a balance between what 's most 'correct' and how the
real world actually uses it._"

_That_ train left Unicode station a very long time ago. They have chosen
correctness over convenience too many times to switch tactics now.

------
shiggerino
Sadly, Unicode is a clusterfuck. But can anything be done about it? Or should
we just be happy that, for once, we have managed to get decent adoption of
something interoperable?

~~~
lambda
Unicode is not a clusterfuck. Overall, it is a very comprehensive,
well-thought-out standard that has improved the situation for interoperable
internationalization dramatically over the hundreds of separate encodings that
preceded it.

The clusterfuck is mostly in the essential complexity of the problem; the
world's languages and writing systems are quite complex and varied, and all of
the writing systems and punctuation conventions were designed for reading,
hand-writing and hand-typesetting by people who understood the language in
question, not automatic typesetting and processing by a general purpose
computer.

The complexity of the problem was increased by the necessity of providing
migration paths from legacy encodings to Unicode and back again; without such
a guarantee, bootstrapping the world into using Unicode would have been a much
more difficult proposition, but that constraint also means that many oddities
of legacy encodings have had to be preserved in Unicode in order to be able to
preserve that round-trip mapping.

Unicode, and sister projects like the Common Locale Data Repository, are doing
an admirable job of navigating and standardizing this complex problem.

There are definitely aspects of the Unicode process where they have gotten it
wrong; UCS-2/UTF-16 is one of them, and in hindsight it is apparent that UTF-8
is superior in pretty much every way. Having a variable-width encoding in
which most of the world's text happens to fit in a fixed width, and which has
endianness issues that necessitate a non-textual byte order mark to
disambiguate, has caused a number of problems and incompatibilities. There may
be a few other points of legitimate criticism, like some aspects of Han
unification.
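For what it's worth, the BOM asymmetry is easy to see from Python:

```python
import codecs

text = "hi"
utf16 = text.encode("utf-16")        # prepends a BOM in native byte order
assert utf16.startswith(codecs.BOM)  # e.g. b'\xff\xfe' on little-endian machines

# The explicit-endianness variants carry no BOM, and neither does UTF-8:
print(text.encode("utf-16-le"))  # b'h\x00i\x00'
print(text.encode("utf-8"))      # b'hi': no BOM, no endianness to disambiguate
```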

But on the whole, outside of a few problems like those, the "clusterfuck" is
caused not by Unicode, but simply by the essential complexity of the problem
involved. Language and text are simply difficult things to model in a
computer.

~~~
zokier
> The complexity of the problem was increased by the necessity of providing
> migration paths from legacy encodings to Unicode and back again; without
> such a guarantee, bootstrapping the world into using Unicode would have been
> a much more difficult proposition, but that constraint also means that many
> oddities of legacy encodings have had to be preserved in Unicode in order to
> be able to preserve that round-trip mapping.

While I agree that preserving round-trip integrity was essential for Unicode's
success, I'm not sure the approach taken to achieve it was the best one. I
would have preferred shifting the complexity tradeoff onto the software that
converts between Unicode and legacy encodings, by using more complex mapping
tables in exchange for a cleaner code point space.

I also think that Unicode Consortium should have been more aggressive in
segregating (and discouraging the general use of) legacy compatibility
features/codepoints and the stuff that is actually supposed to be used. My
personal pet peeve is precomposed characters.
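(The precomposed-character duplication shows up directly in normalization: the same rendered text has two distinct encodings, sketched here in Python.)

```python
import unicodedata

precomposed = "\u00e9"   # é as one legacy codepoint, LATIN SMALL LETTER E WITH ACUTE
decomposed  = "e\u0301"  # é as e + COMBINING ACUTE ACCENT

print(precomposed == decomposed)  # False: distinct codepoint sequences
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```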

On a more general note, I sometimes wonder whether it would have been
beneficial to have separate layers in Unicode, focused more on providing
generic primitives. As a simple example, it is mighty convenient that I can
type 2³ = 8
in plain text, but arguably it would be even nicer if instead of special
'SUPERSCRIPT THREE' codepoint there would be generic superscript modifier
codepoint that could be combined with any character.

Speaking of superscripts, they demonstrate well one aspect that I dislike in
Unicode, the way they have absorbed legacy encodings verbatim. The numeric
superscripts (e.g. ⁰ ¹ ² ³ ⁴ ⁵ ⁶) happen to have inconsistent look on my
machine because the superscripts for 1, 2, and 3 come from Latin-1 while the
rest are in their own block.
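(The split is visible in the code points themselves: ¹ ² ³ sit in the Latin-1 Supplement range, while the rest live in the Superscripts and Subscripts block. A Python check:)

```python
import unicodedata

for ch in "¹²³⁴⁵":
    print(f"{ch} U+{ord(ch):04X} {unicodedata.name(ch)}")

# ¹ U+00B9, ² U+00B2, ³ U+00B3  -> Latin-1 Supplement (inherited from Latin-1)
# ⁴ U+2074, ⁵ U+2075            -> Superscripts and Subscripts block
```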

------
harshreality
Iʼm sold, I think... (U+02BC isn't really intended for such use, but until
there's a proper alternative, other than U+2019 or U+0027, I'm using it)[1].

    
    
        # in ~/.XCompose
        include "%L"
        <Multi_key> <apostrophe> <minus>                : "ʼ"   U02BC   # MODIFIER LETTER APOSTROPHE
    

[1] A potential problem with U+0027 is that the low-ASCII ' and " are used to
demarcate things (most popularly, attribute values in HTML), so if you're
editing anything that uses ' for markup, you can't search and replace based on
' anymore.
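The hazard is concrete: a naive replace can't tell prose apostrophes from markup delimiters. A small illustration on a made-up HTML snippet:

```python
html = "<a href='x.html'>the cat's page</a>"

# Naively replacing every ASCII apostrophe with a curly one
# clobbers the attribute-value delimiters, producing invalid markup:
broken = html.replace("'", "\u2019")
print(broken)  # <a href=’x.html’>the cat’s page</a>
```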

------
jmount
"Using U+2019 is inconsistent with the rest of the standard" I agree with the
article, just my negativity is such I would say the correct statement is more
like: "Using U+2019 is inconsistent with good use, making it consistent with
the rest of the mess that is the standard."

------
sctb
[https://news.ycombinator.com/item?id=9655387](https://news.ycombinator.com/item?id=9655387)

------
lambda
I think there are a few things wrong with this argument. I'm unconvinced by
the argument that this should actually be considered a modifier letter.

    
    
      Consider any English word with an apostrophe, e.g. 
      “don’t”. The word “don’t” is a single word. It is not the 
      word “don” juxtaposed against the word “t”. The 
      apostrophe is part of the word, which, in Unicode-speak, 
      means it’s a modifier letter, not a punctuation mark, 
      regardless of what colloquial English calls it.
    

The definition of a modifier letter is
([http://www.unicode.org/versions/Unicode7.0.0/ch07.pdf#G15832](http://www.unicode.org/versions/Unicode7.0.0/ch07.pdf#G15832)):

    
    
      Modifier letters, in the sense used in the Unicode 
      Standard, are letters or symbols that are typically 
      written adjacent to other letters and which modify their 
      usage in some way. They are not formally combining marks 
      (gc=Mn or gc=Mc) and do not graphically combine with the 
      base letter that they modify. They are base characters in 
      their own right. The sense in which they modify other 
      letters is more a matter of their semantics in usage; they 
      often tend to function as if they were diacritics,
      indicating a change in pronunciation of a letter, or
      otherwise distinguishing a letter’s use. Typically this 
      diacritic modification applies to the character preceding 
      the modifier letter, but modifier letters may sometimes 
      modify a following character. Occasionally a modifier 
      letter may simply stand alone representing its own sound.
    

Punctuation, on the other hand, is
([http://www.unicode.org/faq/punctuation_symbols.html](http://www.unicode.org/faq/punctuation_symbols.html)):

    
    
      Punctuation marks are standardized marks or signs used to 
      clarify the meaning and separate structural units of text.
    

Based on these definitions, the apostrophe seen in contractions and
possessives is definitely punctuation, not a modifier letter. Modifier letters
indicate some effect on sound or pronunciation, either modifying an adjacent
letter or having a sound on their own. U+02BC (MODIFIER LETTER APOSTROPHE) is
such an example, being used to indicate a glottal stop.

Apostrophes used in contractions and possessives, however, have no effect on
pronunciation; instead, just as in the definition of punctuation, they are
used to "clarify the meaning and separate structural units of text."

    
    
      But we shouldn’t be perpetuating this problem. When a 
      programmer is writing a regex that can match text in 
      Chinese, Arabic, or any other human language supported by 
      Unicode, they shouldn’t have to add an exception for 
      English.
    

Thinking that it's possible to do text processing in a language or writing
system neutral way is a fallacy. Unicode simply provides an encoding that
allows all of these writing systems in a single document, plus a number of
algorithms that are designed to be fairly reasonable across the entire
encoding, but which cannot be correct for all languages and writing systems
without specific tailoring.

Many writing systems do not use spaces between words. Any form of word
segmentation for these writing systems will necessarily be language specific,
generally involving dictionaries. Using a regex like \w+ on Chinese or Thai
text is fairly meaningless, as it will generally match an entire sentence at a
time, rather than actually matching a single word.
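For example, in Python, where Han characters all count as word characters and there are no spaces to split on:

```python
import re

# "I love Tiananmen, Beijing" -- several words, no spaces between them,
# so \w+ swallows the whole phrase as one "word":
print(re.findall(r"\w+", "我爱北京天安门"))  # ['我爱北京天安门']
```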

    
    
      For godsake, apostrophes are not closing quotation marks!
    

No, they are not. However, they also aren't modifier letters. If you wanted to
provide a distinction for the purposes mentioned here, you would probably need
to add a new, distinct punctuation character "curly apostrophe" or something
of the sort (since the ASCII range apostrophe can't be reused due to its
overloaded meaning). However, even if you did that, you would still need to
deal with all of the legacy documents which use ASCII apostrophe and closing
quotation marks; you wouldn't actually be able to simplify the implementation
by making the assumptions that a closing quotation mark was always actually
closing a quotation.

Now having three different characters that looked identical (the modifier
letter apostrophe, the closing quotation mark, and the punctuation apostrophe)
would additionally add to confusion.

Even if you didn't introduce a new character, and instead used the modifier
letter apostrophe as a punctuation apostrophe, you would still have all of the
problems with legacy documents; it would take years for this change to make
its way through all of the various word processing programs and text editors,
and even after it had, there would still be existing documents using the old
conventions, etc.

In short, text processing is hard, because text conventions were designed for
human readers who know the language, not computers trying to process text in a
language-independent way, and they were designed either through handwriting or
manual typesetting, not keyboard entry. You are never going to achieve a
perfect text processing model that can handle all of the world's languages
simply by using particular global Unicode properties of characters and
applying a simple algorithm or categorization on them. A lot of text
processing will need to be contextual, language (and locale) specific, and
involve dictionaries.

I don't think that switching from the punctuation closing quote character to
the modifier letter apostrophe for the punctuation apostrophe is likely to
help much; and the confusion caused by nearly 20 years of documents that
follow the existing conventions, and the resulting need to support both
conventions, is likely to make the situation much worse, not better.

