
Describing Chinese Characters with Recursive Radical Packing Language - polm23
https://github.com/LingDong-/rrpl
======
crazygringo
This is so cool!

Perhaps two decades ago I had the idea: what if ASCII/Unicode were
eventually replaced with a system where the codepoints actually encoded the
basic "strokes" of a character, much like this, together with additional
context about the character (e.g. render as Latin uppercase, as Arabic, as a
geometric symbol)?

And then even if you didn't have the necessary font installed, there would
still be enough information to render something close enough to be meaningful,
even if it wouldn't necessarily be pretty. And that the bits and bytes in a
text document wouldn't be arbitrary codepoints, but inherently meaningful.

And that it would solve, especially, the case of all these old Chinese
characters that still aren't in Unicode.

I never pursued it, and these days universal fonts already exist...
but I love seeing someone try it, and the idea of pairing it
with a neural network to actually render in each style is _really_ clever. It
makes me wonder how feasible it would be to design a system using NNs that
could take, say, any Latin-1 font and automatically produce an equivalent
style in any other script, based on how existing semi-universal fonts have
done the same.

~~~
TheRealPomax
Have a look at the Unicode "Ideographic Description Characters" approach
([https://en.wikipedia.org/wiki/Ideographic_Description_Charac...](https://en.wikipedia.org/wiki/Ideographic_Description_Characters_\(Unicode_block\))),
or WenLin's Character Description Language
([https://wenlin.com/cdl](https://wenlin.com/cdl)).

They're similar, but I'd argue that IDC is better at capturing the
compositionality, and CDL is infinitely better at generating things "that
actually look good". Using IDC, 犬 is ⿺大㇔ and 夲 is ⿱大木. However, for
typesetting that (just like RRPL) is too crude, so CDL is really where you
want to be: a more complex descriptor system that places components based on a
grid location paired with scaling information, so that 大 and ㇔ can be
positioned in a way that looks right. Especially when you hit things like 器,
where the aesthetics of the relative placement make the difference between
"looks right" and "were you drunk when you wrote this?", CDL is the more
appealing choice over RRPL or IDC.
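For reference, the IDC operators discussed above are ordinary Unicode codepoints (U+2FF0..U+2FFB), so an IDC sequence like ⿱大木 is just a plain string in prefix notation. A minimal Python sketch:

```python
import unicodedata

# The Ideographic Description Characters block spans U+2FF0..U+2FFB.
# Each codepoint names a layout relation between the components that follow it.
for cp in range(0x2FF0, 0x2FFC):
    print(f"U+{cp:04X} {chr(cp)} {unicodedata.name(chr(cp))}")

# An IDC sequence is prefix notation: operator first, then its components.
# 夲 as "above-to-below: 大 over 木", per the example above:
ben = "⿱大木"
assert ben[0] == "\u2FF1"  # IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW
assert len(ben) == 3       # the whole description is just three codepoints
```

Note that nothing here renders anything; as the comment says, the sequence only captures the composition, not the relative sizing and placement that CDL adds.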

------
userbinator
It's similar to this:

[https://en.wikipedia.org/wiki/Cangjie_input_method#Early_Can...](https://en.wikipedia.org/wiki/Cangjie_input_method#Early_Cangjie_system)

 _A particular "feature" of this early system is that if you send random
lowercase words to the character generator, it will attempt to construct
Chinese characters according to the Cangjie decomposition rules, sometimes
causing strange, unknown characters to appear._

~~~
lifthrasiir
It is really impressive. I had some fun with Chu's original DOS implementation
[1]; for example, what happens if you repeat the same letter up to five
times? Even if you don't know Chinese characters, you can easily see that some
subcomponents are specialized while others remain fully general [2]. For
example, 骨 (BBB) is made of three 月s (B), but it gets reused in BBBBB
(presumably ⿰骨骨). Careful decomposition and specialization made Cangjie
extremely concise and still usable.

[1]
[http://www.cbflabs.com/down/show.php?id=62](http://www.cbflabs.com/down/show.php?id=62)

[2] [https://p.mearie.org/gOHG.png](https://p.mearie.org/gOHG.png)

------
Arnavion
This is neat.

>Potential fields for usage include font design and machine learning.

Am I correct that this scheme doesn't capture stroke info? I think that would
be needed for fonts / handwriting recognition. Fonts have serifs / varying
brush widths depending on stroke order and direction. Hand-drawn characters
can also look quite different from the printed form (angles of strokes, two
consecutive strokes being drawn without lifting the pen even though the
printed form has two separate lines, etc).

Edit: As an example, 人 is defined as 257, which is certainly how it appears in
print. However, it is usually hand-written as something closer to 357, with 37
being the first stroke and 5 being the second.
[https://www.tanoshiijapanese.com/dictionary/stroke_order_det...](https://www.tanoshiijapanese.com/dictionary/stroke_order_details.cfm?entry_id=56003)

~~~
yorwba
For stroke-level information, you'll probably want something like
[https://github.com/skishore/makemeahanzi](https://github.com/skishore/makemeahanzi)
(for Chinese) or
[https://github.com/kanjivg/kanjivg](https://github.com/kanjivg/kanjivg) (for
Japanese).

~~~
jason_slack
The "Inkstone Chinese" writing app by this same author seems useful for when I
am riding someplace and can't pull out my notebook to practice on paper.
[https://www.skishore.me/inkstone/](https://www.skishore.me/inkstone/)

------
jason_slack
This is very useful. I already have some ideas how this can help me out. One
thing to keep in mind is that it is 'traditional' Chinese, not 'simplified'.

------
deepstream
This is cool! I like how it makes a natural link between how characters are
composed of semantic parts (radicals) and how the non-semantic RRPL maps /
abstracts that, maybe exposing the semantic structure with a language that
doesn't have to be aware of semantics. Very cool.

If the author is reading, or if I just missed it in the docs: were these
5000+ descriptions generated automatically somehow, or by hand? Either way,
it's impressive.

~~~
Arnavion
In issue #1 the author says they did it by hand.

------
Y_Y
I tried drawing a Koch-like fractal by starting with

    
    
        48-83-14-48
         ^  ^ ^  ^
    

And then replacing the marked numbers with the whole string to get

    
    
        4(48-83-14-48)-8(48-83-14-48)-(48-83-14-48)4-(48-83-14-48)8
    

and iterating from there. Unfortunately the renderer doesn't seem to like it,
and in any case a proper Koch curve has lines at all angles, whereas here they
must point in one of the eight directions.
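The substitution step described above can be sketched mechanically. This is a sketch under the assumption that the rewrite is exactly as stated: each flagged digit position is replaced by the parenthesized axiom, everything else is kept.

```python
def rewrite(s, marked):
    """Replace each marked index of s with the whole string, parenthesized."""
    return "".join("(" + s + ")" if i in marked else ch
                   for i, ch in enumerate(s))

axiom = "48-83-14-48"
marked = {1, 4, 6, 9}  # the positions flagged with ^ above
step1 = rewrite(axiom, marked)
print(step1)
# → 4(48-83-14-48)-8(48-83-14-48)-(48-83-14-48)4-(48-83-14-48)8
```

Iterating further requires deciding which positions count as "marked" in the new string, which is where this departs from a classic L-system (where every symbol is rewritten by a fixed rule each generation); and, as noted, RRPL's eight directions can't express the 60° turns of a true Koch curve anyway.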

~~~
messe
Perhaps try a Hilbert curve instead?

------
peterburkimsher
Nice work; I like how it can all be brought back to numbers. I wonder if a new
keyboard input method could use those codes.

The standard way to describe Hanzi is IDC codes. Those are used in Unicode
proposals, and in the Unihan database.

[https://en.wikipedia.org/wiki/Ideographic_Description_Charac...](https://en.wikipedia.org/wiki/Ideographic_Description_Characters_\(Unicode_block\))

I added IDC lookup to Pingtype's keyboard, which normally searches the
radicals in a decomposition database. It's crude, but lets me input a
traditional character when I can only recognise half of it.

[https://pingtype.github.io](https://pingtype.github.io)

------
thechao
My immediate thought is this could help out Mayan orthography, which is
notably “computer proof”, even though it’s a mixed logography. (Mayan would
need a 3rd marker for depth order, and marker(s?) for occlusion info.)

~~~
yorwba
Based on this presentation I found [1], it appears that even just encoding the
character composition would be more difficult than for Chinese, and the level
of artistic freedom demonstrated in the examples on slide 9 makes me doubtful
that it could be replicated by just a few simple rules.

On the other hand, people are working on it, so Mayan will probably at some
point end up included in Unicode.

[1] [http://www.linguistics.berkeley.edu/sei/assets/unlocking-the...](http://www.linguistics.berkeley.edu/sei/assets/unlocking-the-mayan-rev6-latest-sept2016.pdf)

~~~
thechao
I'd distinguish between glyph and ligature variants (i.e., leopard- vs. man-
headed characters, U-selection, etc.) and simply representing the entire word
itself.

------
YeGoblynQueenne
That's really nice and has potential practical applications, but I was hoping
for more of a narrative description, like "the little man in the big house
with the sea in front of it" or something similar (very unfortunately I don't
know Chinese, or its writing system, so I just made that one up).

I don't think that something like that has been done before, but I think it's
within the reach of the current state of the art for character recognition and
image captioning.

~~~
tialaramex
The Han system isn't mostly pictures of things (ideograms), so your narrative
descriptions aren't really practical.

The system is at heart logographic (so each squiggle means roughly one word,
like "house" or "person") but you can't really do that at scale, it's too
unwieldy (consider how many English words you use or are somewhat confident
you would recognise if others used them), so you end up using two or three
squiggles to make all but the most common words in modern Chinese writing.

The individual logograms are often related to a pictogram, but that's also
true of the Latin characters you used to write this. The letter A is probably
distantly related to a drawing of the head of an ox.

Han logograms also incorporate entirely abstract (ie they were never
attempting to be a picture "of" anything) indications of how they should sound
or what the general topic is, as hints to a reader. These would not translate
to your narrative description.

~~~
YeGoblynQueenne
Thank you for the clarification. Like I say, I don't know anything about
Chinese writing.

I'm aware, of course, of the pictographic character of the originally
Phoenician alphabet. The thing is, I remember when I was learning to write,
adults using visual metaphors to help me learn: "now draw a little cane" (for
the Greek letter iota), "now a little ring", etc. This was not always possible;
for example, "A" was never described to me as a "plow" (I think that's what it
actually stands for).

But, visual metaphor doesn't have to pre-exist. A new, arbitrary one can be
invented. I mean, a certain brush stroke doesn't have to have a concrete and
universally agreed-upon pictographic meaning. If one can be found that happens
to work, then that'll do.

But that _may_ be well beyond the state of the art for statistical machine
learning algorithms, currently.

------
quirkot
I don't know why I did this, but you're welcome. 'Murica.

(((1357-1357-1357-1357-1357-1357-1357-1357-1357-1357)|(1357-1357-1357-1357-1357-1357-1357-1357-1357-1357)|(1357-1357-1357-1357-1357-1357-1357-1357-1357-1357)|(1357-1357-1357-1357-1357-1357-1357-1357-1357-1357)|(1357-1357-1357-1357-1357-1357-1357-1357-1357-1357))-(48|48|48))|(48|48|48)
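For anyone curious how that string is put together, it can be reconstructed programmatically. This is a sketch that simply mirrors the grouping of the literal above, assuming (from its structure) that `-` and `|` are RRPL's two composition operators:

```python
row = "-".join(["1357"] * 10)              # one row of ten star-ish units
canton = "|".join(["(" + row + ")"] * 5)   # five such rows stacked
stripes = "|".join(["48"] * 3)             # three horizontal bars
flag = "((" + canton + ")-(" + stripes + "))|(" + stripes + ")"
print(flag)
```

The printed string matches the one above character for character.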

------
rendall
This is beautiful.

