Perhaps two decades ago I had an idea: what if ASCII/Unicode were eventually replaced with a system where the codepoints actually encoded the basic "strokes" of a character, much like this, plus additional context about the character (e.g. render as Latin uppercase, as Arabic, as a geometric symbol)?
And then even if you didn't have the necessary font installed, there would still be enough information to render something close enough to be meaningful, even if it wouldn't necessarily be pretty. And that the bits and bytes in a text document wouldn't be arbitrary codepoints, but inherently meaningful.
And that it would solve, especially, the case of all these old Chinese characters that still aren't in Unicode.
I never decided to pursue it, and these days universal fonts already exist... but I love so much to see someone trying it, and the idea of associating it with a neural network to actually render in each style is really clever. It makes me wonder how feasible it would be to design a system using NN's that could take, say, any Latin-1 font and automatically produce an equivalent style in any other script, based on how existing semi-universal fonts have done the same.
They're similar, but I'd argue that IDC is better at capturing the compositionality, while CDL is infinitely better at generating things "that actually look good". Using IDC, 犬 is ⿺大㇔ and 夲 is ⿱大木. However, for typesetting, that (just like RRPL) is too crude, so CDL is really where you want to be: a more complex descriptor system that places components based on a grid location paired with scaling information, so that 大 and ㇔ can be positioned in a way that looks right. Especially when you hit things like 器, where the aesthetics of relative placement make the difference between "looks right" and "were you drunk when you wrote this?", CDL is a more appealing choice than RRPL or IDC.
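The compositionality that IDC/IDS gives you can be sketched in a few lines: components can themselves be decomposed, so an IDS table naturally expands recursively. This is a minimal illustration with a tiny hand-made table (the entry for 器 is illustrative only; real decomposition databases vary), not a real IDS database:

```python
# Minimal sketch of recursive IDS (Ideographic Description Sequence) expansion.
# ⿺ and ⿱ are Unicode Ideographic Description Characters acting as operators.
IDS = {
    "犬": "⿺大㇔",            # "dog" = 大 with a dot tucked in
    "夲": "⿱大木",            # 大 stacked over 木
    "器": "⿱⿲口口犬⿰口口",   # illustrative decomposition only
}

def expand(char, depth=1):
    """Replace components with their own IDS, up to `depth` levels deep."""
    if depth == 0 or char not in IDS:
        return char
    return "".join(expand(c, depth - 1) for c in IDS[char])

print(expand("夲"))      # ⿱大木
print(expand("器", 2))   # 犬 inside 器 expands one level further
```

The flip side, as noted above, is that this tree says nothing about grid placement or scaling, which is exactly the information CDL adds.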
A particular "feature" of this early system is that if you send random lowercase words to the character generator, it will attempt to construct Chinese characters according to the Cangjie decomposition rules, sometimes causing strange, unknown characters to appear.
>Potential fields for usage include font design and machine learning.
Am I correct that this scheme doesn't capture stroke info? I think that would be needed for fonts / handwriting recognition. Fonts have serifs / varying brush widths depending on stroke order and direction. Hand-drawn characters can also look quite different from the printed form (angles of strokes, two consecutive strokes being drawn without lifting the pen even though the printed form has two separate lines, etc).
Edit: As an example, 人 is defined as 257, which is certainly how it appears in print. However, it is usually hand-written as something closer to 357, with 37 being the first stroke and 5 the second. https://www.tanoshiijapanese.com/dictionary/stroke_order_det...
If the author is reading, or if I just missed it in the docs: were these 5000+ descriptions generated automatically somehow, or by hand? Either way, it's impressive.
The standard way to describe Hanzi is IDC codes. Those are used in Unicode proposals, and in the Unihan database.
I added IDC lookup to Pingtype's keyboard, which normally searches the radicals in a decomposition database. It's crude, but lets me input a traditional character when I can only recognise half of it.
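That kind of "I can only recognise half of it" lookup reduces to an inverted search over a decomposition table: given a component, return every character that contains it. A minimal sketch with a hypothetical three-entry table (not Pingtype's actual database):

```python
# Toy decomposition table: character -> list of components.
DECOMP = {
    "好": ["女", "子"],
    "媽": ["女", "馬"],
    "嗎": ["口", "馬"],
}

def lookup(component):
    """Return all characters whose decomposition contains `component`."""
    return [char for char, parts in DECOMP.items() if component in parts]

print(lookup("馬"))  # ['媽', '嗎']
```

A real implementation would index components up front rather than scan, but the principle is the same.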
On the other hand, people are working on it, so Mayan will probably at some point end up included in Unicode.
I don't think that something like that has been done before, but I think it's within the reach of the current state of the art for character recognition and image captioning.
The system is at heart logographic (each squiggle means roughly one word, like "house" or "person"), but you can't really do that at scale: it's too unwieldy (consider how many English words you use, or are somewhat confident you would recognise if others used them). So in modern Chinese writing, all but the most common words end up written with two or three squiggles.
The individual logograms are often related to a pictogram, but that's also true of the Latin characters you used to write this. The letter A is probably distantly related to a drawing of the head of an ox.
Han logograms also incorporate entirely abstract (i.e., they were never attempting to be a picture "of" anything) indications of how they should sound or what the general topic is, as hints to a reader. These would not translate to your narrative description.
I'm aware, of course, of the pictographic character of the originally Phoenician alphabet. The thing is, I remember adults using visual metaphors to help me when I was learning to write: "now draw a little cane" (for the Greek letter iota), "now a little ring", etc. This was not always possible; for example, "A" was never described to me as a "plow" (I think that's what it actually stands for).
But, visual metaphor doesn't have to pre-exist. A new, arbitrary one can be invented. I mean, a certain brush stroke doesn't have to have a concrete and universally agreed-upon pictographic meaning. If one can be found that happens to work, then that'll do.
But that may be well beyond the state of the art for statistical machine learning algorithms, currently.