Describing Chinese Characters with Recursive Radical Packing Language (github.com)
125 points by polm23 26 days ago | 26 comments



This is so cool!

Perhaps two decades ago I had the idea: what if ASCII/Unicode were eventually replaced with a system where the codepoints actually encoded the basic "strokes" of a character, much like this, along with additional context about the character (e.g. render as Latin uppercase, as Arabic, as a geometric symbol)?

And then even if you didn't have the necessary font installed, there would still be enough information to render something close enough to be meaningful, even if it wouldn't necessarily be pretty. And that the bits and bytes in a text document wouldn't be arbitrary codepoints, but inherently meaningful.

And that it would solve, especially, the case of all these old Chinese characters that still aren't in Unicode.

I never decided to pursue it, and these days universal fonts already exist... but I love seeing someone try it, and the idea of pairing it with a neural network to actually render each style is really clever. It makes me wonder how feasible it would be to design a system using NNs that could take, say, any Latin-1 font and automatically produce an equivalent style in any other script, based on how existing semi-universal fonts have done the same.


Have a look at the Unicode "Ideographic Description Characters" approach (https://en.wikipedia.org/wiki/Ideographic_Description_Charac...), or WenLin's Character Description Language (https://wenlin.com/cdl).

They're similar, but I'd argue that IDC is better at capturing the compositionality, and CDL is infinitely better at generating things "that actually look good". Using IDC, 犬 is ⿺大㇔ and 夲 is ⿱大十. However, for typesetting, that (just like RRPL) is too crude, so CDL is really where you want to be: a more complex descriptor system that places components based on a grid location paired with scaling information, so that 大 and ㇔ can be positioned in a way that looks right. Especially when you hit things like 器, where the aesthetics of the relative placement make the difference between "looks right" and "were you drunk when you wrote this?", CDL is a more appealing choice than RRPL or IDC.
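
To make the compositionality point concrete, here is a minimal sketch of how an IDC sequence can be read as a tree. It assumes only the twelve original description characters (U+2FF0 to U+2FFB) in prefix notation; the names `IDC_ARITY` and `parse_ids` are made up for illustration and do not come from any library. Note that the tree carries no placement or scaling data, which is exactly the gap CDL fills.

```python
# Arity of the twelve original Ideographic Description Characters
# (U+2FF0..U+2FFB). Two of them take three operands, the rest take two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY[chr(0x2FF2)] = 3  # left / middle / right
IDC_ARITY[chr(0x2FF3)] = 3  # top / middle / bottom


def parse_ids(ids):
    """Parse a prefix-notation description sequence into a nested tuple.

    A plain component is returned as a one-character string; a composed
    form is returned as (operator, child, child[, child]).
    """
    def parse(pos):
        if pos >= len(ids):
            raise ValueError("sequence ended before all operands were read")
        ch = ids[pos]
        arity = IDC_ARITY.get(ch)
        if arity is None:
            # Not an operator, so it is a leaf component.
            return ch, pos + 1
        pos += 1
        children = []
        for _ in range(arity):
            child, pos = parse(pos)
            children.append(child)
        return (ch, *children), pos

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete sequence")
    return tree


print(parse_ids("⿺大㇔"))
```

Nested sequences work the same way, since every operand can itself start with an operator.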


It's similar to this:

https://en.wikipedia.org/wiki/Cangjie_input_method#Early_Can...

A particular "feature" of this early system is that if you send random lowercase words to the character generator, it will attempt to construct Chinese characters according to the Cangjie decomposition rules, sometimes causing strange, unknown characters to appear.


It is really impressive. I had some fun with Chu's original DOS implementation [1]; for example, what happens if you repeat the same letter up to five times? Even if you don't know Chinese characters, you can easily see that some subcomponents are specialized while others remain fully general [2]. For example 骨 (BBB) is made of three 月s (B), but it gets reused in BBBBB (presumably ⿰骨骨). Careful decomposition and specialization made Cangjie extremely concise and still usable.

[1] http://www.cbflabs.com/down/show.php?id=62

[2] https://p.mearie.org/gOHG.png


This is neat.

>Potential fields for usage include font design and machine learning.

Am I correct that this scheme doesn't capture stroke info? I think that would be needed for fonts / handwriting recognition. Fonts have serifs / varying brush widths depending on stroke order and direction. Hand-drawn characters can also look quite different from the printed form (angles of strokes, two consecutive strokes being drawn without lifting the pen even though the printed form has two separate lines, etc).

Edit: As an example, 人 is defined as 257 which is certainly how it appears in print. However it is usually hand-written as something closer to 357, with 37 being the first stroke and 5 being the second. https://www.tanoshiijapanese.com/dictionary/stroke_order_det...


For stroke-level information, you'll probably want something like https://github.com/skishore/makemeahanzi (for Chinese) or https://github.com/kanjivg/kanjivg (for Japanese).


The "Inkstone Chinese" writing app by this same author seems useful for when I am riding someplace and can't pull out my notebook to practice on paper. https://www.skishore.me/inkstone/


Isn't stroke order a set of rules you learn once (e.g. top to bottom, and left to right)? If so, you could programmatically determine the correct stroke order for each character.


There are exceptions.


Stroke order differs for Japanese and Chinese, as well.


This is very useful. I already have some ideas how this can help me out. One thing to keep in mind is that it is 'traditional' Chinese, not 'simplified'.


This is cool! I like how it makes a natural link between how characters are composed of semantic parts (radicals), and how the non-semantic RRPL maps / abstracts that, maybe exposing the semantic structure with a language that doesn't have to be aware of semantics. Very cool.

If the author is reading, or if I just missed it in the docs: were these 5000+ descriptions generated automatically somehow, or by hand? Either way it's impressive.


In issue #1 the author says they did it by hand.


I tried drawing a Koch-like fractal by starting with

    48-83-14-48
     ^  ^ ^  ^
And then replacing the marked numbers with the whole string to get

    4(48-83-14-48)-8(48-83-14-48)-(48-83-14-48)4-(48-83-14-48)8
and iterating from there. Unfortunately the renderer doesn't seem to like it, and in any case a proper Koch curve has lines at all angles, whereas here they must point in one of the eight directions.


Perhaps try a Hilbert curve instead?


Nice work; I like how it can all be brought back to numbers. I wonder if a new keyboard input method could use those codes.

The standard way to describe Hanzi is IDC codes. Those are used in Unicode proposals, and in the Unihan database.

https://en.wikipedia.org/wiki/Ideographic_Description_Charac...

I added IDC lookup to Pingtype's keyboard, which normally searches the radicals in a decomposition database. It's crude, but lets me input a traditional character when I can only recognise half of it.

https://pingtype.github.io


My immediate thought is this could help out Mayan orthography, which is notably “computer proof”, even though it’s a mixed logography. (Mayan would need a 3rd marker for depth order, and marker(s?) for occlusion info.)


Based on this presentation I found [1] it appears that even just encoding the character composition would be more difficult than for Chinese, and the level of artistic freedom demonstrated in the examples on slide 9 makes me doubtful that it could be replicated by just a few simple rules.

On the other hand, people are working on it, so Mayan will probably at some point end up included in Unicode.

[1] http://www.linguistics.berkeley.edu/sei/assets/unlocking-the...


I'd distinguish between glyph and ligature variants, i.e., leopard- vs man- headed characters, U-selection etc., and simply representing the entire word, itself.


That's really nice and has potential practical applications, but I was hoping for more of a narrative description, like "the little man in the big house with the sea in front of it" or something similar (very unfortunately I don't know Chinese, or its writing system, so I just made that one up).

I don't think that something like that has been done before, but I think it's within the reach of the current state of the art for character recognition and image captioning.


The Han system isn't mostly pictures of things (ideograms), so your narrative descriptions aren't really practical.

The system is at heart logographic (so each squiggle means roughly one word, like "house" or "person") but you can't really do that at scale, it's too unwieldy (consider how many English words you use or are somewhat confident you would recognise if others used them), so you end up using two or three squiggles to make all but the most common words in modern Chinese writing.

The individual logograms are often related to a pictogram, but that's also true of the Latin characters you used to write this. The letter A is probably distantly related to a drawing of the head of an ox.

Han logograms also incorporate entirely abstract (i.e. they were never attempting to be a picture "of" anything) indications of how they should sound or what the general topic is, as hints to a reader. These would not translate to your narrative description.


Thank you for the clarification. Like I said, I don't know anything about Chinese writing.

I'm aware of course of the pictographic character of the originally Phoenician alphabet. The thing is, I remember when I was learning to write, adults using visual metaphors to help me learn: "now draw a little cane" (for the Greek letter ιώτα), "now a little ring" etc. This was not always possible - for example, "A" was never described to me as a "plow" (I think that's what it actually stands for).

But, visual metaphor doesn't have to pre-exist. A new, arbitrary one can be invented. I mean, a certain brush stroke doesn't have to have a concrete and universally agreed-upon pictographic meaning. If one can be found that happens to work, then that'll do.

But that may be well beyond the state of the art for statistical machine learning algorithms, currently.


Everything tialaramex said is correct, when people write Chinese they are not thinking of "drawing" anything at all. But if you are interested in learning it in a pictographic way, there is a company called Chineasy that tries to make it easier with pictures; check it out, it's a pretty interesting product.


That's interesting, thank you. And it might even work.


I don't know why I did this, but you're welcome. 'Murica.

(((1357-1357-1357-1357-1357-1357-1357-1357-1357-1357)|(1357-1357-1357-1357-1357-1357-1357-1357-1357-1357)|(1357-1357-1357-1357-1357-1357-1357-1357-1357-1357)|(1357-1357-1357-1357-1357-1357-1357-1357-1357-1357)|(1357-1357-1357-1357-1357-1357-1357-1357-1357-1357))-(48|48|48))|(48|48|48)
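
For anyone who would rather not type that out, the expression has a regular shape and can be assembled as plain text. This is only a sketch of the string structure; it treats `-`, `|` and the parentheses as opaque symbols and assumes nothing about how the RRPL renderer interprets them. The variable names are made up.

```python
# Build the expression above purely as text.
star = "1357"
row = "-".join([star] * 10)                       # ten stars per row
canton = "|".join(f"({row})" for _ in range(5))   # five parenthesised rows
stripes = "|".join(["48"] * 3)                    # the 48|48|48 group

flag = f"(({canton})-({stripes}))|({stripes})"
print(flag)
```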


This is beautiful.



