Unicode data file compression: achieving 40-70% reduction over gzip alone (hexops.com)
25 points by boyter on July 4, 2021 | 6 comments



I wanted to try my hand at this problem for fun, but I realized the problem description is a little bit incomplete. Which fields of the mentioned files should be encoded and which are irrelevant?


Yes, sorry about that - I omitted a bit of that information for brevity.

If you want to play with allkeys.txt (which is far more sequential, simpler data than UnicodeData.txt), then you only need to remove the non-NFD strings, since the Unicode Collation Algorithm's first step requires decomposing the string's code points to canonical NFD form; that removes ~2,000 entries.

The full file parser code, which strips those out along with other unneeded information like comments and version metadata, can be found here: https://github.com/jecolon/ziglyph/blob/main/src/collator/Al...
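
For anyone who wants to experiment without reading the Zig source, here's a rough sketch of that parsing in Python (my own illustration, not the Ziglyph code; filtering out the non-NFD entries is omitted here, since deciding NFD-ness needs the normalization data):

    import re

    # One collation element looks like [.36FC.0020.0002]
    # (or [*....] for variable-weighted elements).
    CE_RE = re.compile(r"\[[.*]([0-9A-F]{4,6})\.([0-9A-F]{4})\.([0-9A-F]{4})\]")

    def parse_allkeys(path):
        table = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.split("#", 1)[0].strip()  # drop trailing comments
                if not line or line.startswith("@"):
                    continue  # skip blanks and @version/@implicitweights metadata
                codepoints, _, elements = line.partition(";")
                key = tuple(int(cp, 16) for cp in codepoints.split())
                table[key] = tuple(
                    tuple(int(w, 16) for w in m.groups())
                    for m in CE_RE.finditer(elements)
                )
        return table

    # table[(0x19B5, 0x199E)] == ((0x36FC, 0x20, 0x2), (0x370F, 0x20, 0x2))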

If you want to play around with UnicodeData.txt (which is less sequential, more complex data), then only two fields are used (the code point and the decomposition field), and only records where the decomposition field is not empty. The full decomposition type name in angle brackets is not needed; only whether it is present matters.

The full parser code for that file can be found here: https://github.com/jecolon/ziglyph/blob/main/src/normalizer/...
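
Similarly, a rough illustrative sketch (again Python, not the actual Ziglyph parser) of pulling just those two fields out:

    def parse_unicode_data(path):
        decomps = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                fields = line.rstrip("\n").split(";")
                decomp = fields[5]  # the decomposition mapping field
                if not decomp:
                    continue  # most records have no decomposition; skip them
                is_compat = decomp.startswith("<")
                if is_compat:
                    # Drop the "<tag>" name; only its presence matters.
                    decomp = decomp.split(">", 1)[1]
                cp = int(fields[0], 16)
                decomps[cp] = (is_compat,
                               tuple(int(c, 16) for c in decomp.split()))
        return decomps

    # decomps[0x00C5] == (False, (0x0041, 0x030A))         # canonical: A + combining ring
    # decomps[0x00BD] == (True, (0x0031, 0x2044, 0x0032))  # compatibility: <fraction> 1/2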

Hope that helps!


I'm not very familiar with either Unicode collation or Zig, but for this line:

    19B5 199E ; [.36FC.0020.0002][.370F.0020.0002] # <NEW TAI LUE VOWEL SIGN E, NEW TAI LUE LETTER LOW VA>
You want to store a mapping from (19B5, 199E) to ((36FC, 0020, 0002), (370F, 0020, 0002)). Is that right?


Yep, that's basically correct!


What's the CPU performance before/after?


It wasn't measured empirically, as we're not very worried about that aspect. Still, the author of Ziglyph, jecolon, did some rudimentary comparisons and found performance was equal, and sometimes better by a few milliseconds.

I suspect the size of the file being much smaller means in practice it now often fits in L1 cache during decompression, so that's probably counteracting the time spent decompressing.

It also helps that the algorithm is able to rely on the files already being well sorted, making it a pretty linear process to compress.
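
The exact on-disk encoding isn't spelled out in this thread, but as a purely hypothetical illustration of why pre-sorted input allows a single linear pass (the article's real scheme may differ): delta-encoding consecutive code points turns them into mostly tiny numbers, which a general-purpose compressor or a varint encoding then shrinks much further.

    def delta_encode(sorted_codepoints):
        # One linear pass: store differences between consecutive code points.
        out, prev = [], 0
        for cp in sorted_codepoints:
            out.append(cp - prev)  # small (often 1) when input is sorted and dense
            prev = cp
        return out

    def delta_decode(deltas):
        cps, prev = [], 0
        for d in deltas:
            prev += d
            cps.append(prev)
        return cps

    assert delta_decode(delta_encode([0x19B5, 0x19B6, 0x19B8, 0x19C1])) == \
           [0x19B5, 0x19B6, 0x19B8, 0x19C1]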



