
There's a ton of recorded history of decentralized character encodings. Some people long for the days when a character was a byte and a byte was a character, but that model doesn't fit the world's languages.

The Unicode Consortium has historically been inclusive enough to avoid alienating enough people to cause a revolt.

They have defined ranges where they will not assign code points, the private use areas, and you can use those for whatever symbols you need. Of course, there's no mechanism to resolve disputes over use, you would need your own method to determine which font is appropriate, and it's likely to be challenging to use outside of applications you control.
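As a rough illustration of what "using the private use areas" means in practice, here is a minimal Python sketch that detects private use code points in a string. The three ranges are the ones Unicode reserves for private use; the function names are made up for this example, and nothing here says what any given code point means, which is exactly the coordination problem described above.

```python
import unicodedata

# The three private use ranges reserved by Unicode (inclusive bounds).
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area (Basic Multilingual Plane)
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)


def is_private_use(ch: str) -> bool:
    """Return True if the single character ch is a private use code point."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def private_use_chars(text: str) -> list[str]:
    """List the private use code points found in text, in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text if is_private_use(ch)]


# The standard library reports the same thing via the general category "Co".
sample = "a\uf8ffb"
found = private_use_chars(sample)
categories = [unicodedata.category(ch) for ch in sample]
```

An application could run a check like this to decide when it has to fall back to its own font selection, since no standard font is guaranteed to cover these code points.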



I think the available extension mechanisms in Unicode are quite limiting and in fact close to useless because you can run into trouble when you use them.

What I want is to refer to new symbols through some unique ID, with a decentralized way to serve those symbols, including all relevant information (such as what those symbols mean, what they are based on, what glyphs are preferred/available, how the use of these symbols evolved over time, references to related symbols, etc.).

If I want to invent a symbol for a Coke can, a link to an external website [1], a new mathematical operator, or even Covid19, I want to be able to do it now, not wait for Unicode to stamp the proposal.

[1] https://news.ycombinator.com/item?id=23016832


Use images or SVGs then. Or define your own custom SVG font. I don't see how a decentralized way of defining whatever could even remotely work, since all of this absolutely must work while offline and while users are inputting arbitrary text.


Actually, ever since the first Unicode Emoji TR (https://www.unicode.org/reports/tr51/tr51-1-archive.html#Lon...) there has been a mention of a "longer-term goal" to support not only embedded graphics, but also "a reference to an image on a server".

I'm not sure where it comes from and what's come of it; seeing as the next sentence is "server could track usage", I have the feeling it's from the editor from Google... ;) (Mark Davis)


I don't see how being offline would be a problem, since the same problem exists when you want to type text using certain fonts which you'd need to download first.


https://www.unicode.org/review/pri408/

This is a current proposal to support pretty much exactly that: a generic Wikidata identifier (for emojis).

I'm not sure how I feel about that. On one hand, I would like to have a widespread system to refer to stuff in an ontology. On the other hand, it would probably make anything using plain text download stuff from the internet, and maybe even require a connection in simplistic implementations...

Also, I kinda hate emojis, so... :)


Well, my own software tends not to use Unicode. Sometimes, it only uses ASCII. Sometimes, it doesn't care about character coding, as long as it is compatible with the principle of extended ASCII. Sometimes, it is a combination of the two (such as ASCII only for commands, while comments can include any characters that are compatible with the principle of extended ASCII). In one case (VGMCK), it does parse UTF-8 in some contexts (such as #TITLE commands), but the only thing it does with the decoded data is convert it to UTF-16, since the output format (VGM) contains UTF-16 text. A UTF-8 byte order mark at the beginning of the file is not acceptable, though.
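To make the "decode UTF-8 only to re-encode as UTF-16" step concrete, here is a small Python sketch. It is not VGMCK's actual code: the function name is invented, and the choice of little-endian UTF-16 without a byte order mark is an assumption made for the example.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def utf8_to_utf16(raw: bytes) -> bytes:
    """Convert UTF-8 bytes to UTF-16 (little-endian, no BOM; an assumption).

    Input that starts with a UTF-8 byte order mark is rejected rather than
    silently stripped, mirroring the "not acceptable" rule described above.
    """
    if raw.startswith(UTF8_BOM):
        raise ValueError("UTF-8 byte order mark is not accepted")
    # Strict decoding: invalid UTF-8 raises UnicodeDecodeError.
    text = raw.decode("utf-8")
    # Characters outside the BMP become surrogate pairs automatically.
    return text.encode("utf-16-le")


title_bytes = utf8_to_utf16("A\u00e9".encode("utf-8"))
clef_bytes = utf8_to_utf16("\U0001D11E".encode("utf-8"))
```

Everything outside that one function can keep treating the input as opaque extended-ASCII bytes, which is the point of the approach.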

But I have thought of some mechanisms to declare character sets and character mappings, to determine which font is appropriate, etc. Unicode is a valid choice, but even if you select Unicode as your character set, you must specify the language code and the Unicode version. You can also specify more than one mapping, including custom mappings. For example, if you are using CSUR to write in Klingon and English, then you can declare both Unicode and CSUR, together with the relevant version numbers.

I thought of a similar mechanism for declaring character sets and character mappings for TAVERN32 (which doesn't exist yet, but it is meant to be an improved text adventure game VM having the advantages of both TAVERN and Glulx). One lump declares the character sets and character mappings, and can include fallbacks if wanted, so that if ASCII mappings are declared, then it can work with any implementation even if it does not know about that character set. The story file might also (optionally) include other lumps with bitmap fonts, which also allows it to work even if the character set is unknown (as long as the implementation can display graphics, which some may be incapable of). However, in this case, the character mapping is relevant for compression too, and not only for determining character sets. There is then the possibility that multiple codes will be mapped to the same ASCII character (or sequence of characters in ASCII or any other character set), avoiding the problem of compatibility.
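The ASCII fallback idea could look roughly like the following Python sketch. Since TAVERN32 does not exist yet, everything here is hypothetical: the code values, the mapping table, and the function name are invented purely to show several codes collapsing to the same ASCII character or sequence.

```python
# Hypothetical fallback table from story-specific codes (0x80 and up) to ASCII.
# Two different codes map to "e", and one code maps to a two-character sequence.
ASCII_FALLBACK = {
    0x80: "e",   # e.g. an accented e in the story's own character set
    0x81: "e",   # a different accented e, same fallback
    0x82: "ae",  # a ligature spelled out as a sequence
}


def render_with_fallback(codes, ascii_fallback, placeholder="?"):
    """Render a sequence of character codes using only ASCII.

    Codes below 0x80 are taken to be plain ASCII. Higher codes are looked up
    in the declared fallback table; undeclared codes become the placeholder.
    """
    out = []
    for code in codes:
        if code < 0x80:
            out.append(chr(code))
        else:
            out.append(ascii_fallback.get(code, placeholder))
    return "".join(out)


word = render_with_fallback([0x63, 0x61, 0x66, 0x80], ASCII_FALLBACK)
```

An implementation that does know the declared character set would skip the table and render the real glyphs; the table only matters for implementations that don't.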



