> it's an input problem before anything else I see where we have our wires cross...

crazygringo · 2024-10-10T18:04:27 1728583467

I totally appreciate the desire for more semantic encoding. I mean, it would be a dream if every sentence was semantically delimited, if every word was annotated with which hyphenation pattern it should follow for splitting across lines (when there are multiple), and whether the capital letter at the start of a sentence should remain capital even when converted to lowercase, because it's a capitalized proper noun. I could go on.

But that's not what Unicode is for. The apostrophe situation is just one of 100 things I could think of off the top of my head. Unicode encodes characters, not semantics. And this is by design, because people don't input, or want to input, semantics -- they just want to type something that looks right. Something other people can read, not something computers can semantically parse.

So we have a bunch of heuristic and AI and manual tools we use to try to annotate things semantically, and we put that information at the level of something like XML, not Unicode. Which is infinitely more flexible, because you can define and use whatever semantics you want, not limited to whatever the Unicode body decided.

If KeenQuotes gets apostrophes right 99.9% of the time, then just use that to automatically analyze all your input text and then store and process it in some kind of XML notation, like "Peter<apos>’</apos>s" or "<possessive>Peter’s</possessive>" or "<word>Peter’s</word>" or something. Unicode is the wrong level of abstraction.

thangalin · 2024-10-10T23:45:36 1728603936

> process it in some kind of XML notation

The output from KeenQuotes is used by KeenWrite. KeenWrite can generate text, HTML, XHTML, and PDF documents. Those output document formats lack correct the semantics because of UNICODE. As much as rolling my own XML notation would be fun, it won't work in practice---nobody would be able to publish their exported documents for viewing or general consumption. We'll have to agree to disagree on this one: I think UNICODE dropped the ball on English apostrophes where it didn't have to. Having one more character for curled apostrophes would have kept open the possibility of encoding unambiguous HTML documents (with respect to apostrophes/right single quotes for quotations; your point about other characters I quite appreciate).