Fast character case conversion, or how to really compress sparse arrays

apankrat · on Feb 22, 2021

Spotted the packing technique in the Wine's code and killed few days to see how much further it can be taken. Pretty much one of those rabbit holes that you'd occasionally run into in a course of regular programming and end up exploring instead of, you know, working. A good diversion, I suspect quite a few people can relate :)

superjan · on Feb 23, 2021

I thought it was a cool technique. I used lookup tables for image decompression, I’m curious if this technique would apply there. In the start of the article, you exclude conversions that don’t round-trip. But I would guess this technique will work there even better, because for almost all characters the offsets are identical, with a few exceptions that get their own table blocks.

lifthrasiir · on Feb 22, 2021

Ha, that's exactly what I've done for rust-encoding [1]! (Now obsolete, go use encoding_rs instead.) For TSP I resorted to a simple greedy algorithm, which was good enough for me.

To be honest though this problem of reducing a large table in the random-accessible manner is hardly new (e.g. perfect hashing) and pretty much everything can be automated only with some expert guidance. I'd like to see a neat solution that can interactively search remappings and emit a neat code in various languages.

(As a fellow Unicode commentator, I should also mention that the OP only concerns about simple case mappings as called by Unicode. But they are pretty much the only mapping we like to optimize for their prevalence anyway.)

[1] https://github.com/lifthrasiir/rust-encoding/blob/master/src...

infogulch · on Feb 22, 2021

You might consider mentioning encoding_rs in the repo description if that's your recommendation.

apankrat · on Feb 23, 2021

Ha, indeed! Should've compressed the index too though :-P

canucker2016 · on Feb 22, 2021

Based on the title, I thought you were Daniel Lemire with another writeup for benchmarking character conversion of some sort - for examples see https://hn.algolia.com/?q=lemire.me.

Great work on reducing the size of the character conversion tables, but I don't see any evidence for the title of the web page.

"Compact character case conversion" is more appropriate currently based on the data given in the readme.

Data can be greatly compressed at the cost of access speed or the compressed data could speed up data access - with x86 CPU complexity these days, I wouldn't bet on determining the speed based on a quick glance of the code.

You state the case conversion is fast, yet I don't see any benchmarks backing up the assertion.

I'd expect a few different types of input to be benchmarked - ASCII upper/lower case alphabet chars (typical A-Z,a-z & space/punctuation thrown in to make it look like typical written language text inputs), Unicode upper/lower case et al., and random characters.

Not gonna leave people in the lurch though - Daniel Lemire has a recent post which can be used as a scaffold for benchmarking.

Code: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...

Post: https://lemire.me/blog/2021/02/09/on-the-cost-of-converting-...

If the code is to be timed on Windows, the code will need a Windows version of "clock_gettime(CLOCK_REALTIME, ...)" - see https://stackoverflow.com/questions/5404277/porting-clock-ge...

Someone · on Feb 22, 2021

I would guess the better cache locality of a much smaller lookup table could make the smaller table with more shifts and lookups beat the larger one in speed in real-world code (where other code also competes for cache lines)

If you go down that rabbit hole, however, you might want to check whether foregoing using the shifts and lookups for 7-bit ASCII is even faster, if (as is often the case), you expect those characters to be the vast majority.

vardump · on Feb 23, 2021

You could test whether all of the characters in a SIMD (256 or 512 bits, so 32 or 64 UTF-8 chars) register are in 7-bit range and just use simple logical operations for lower or upper casing them in a few cycles.

If not, then use a slow path with lookups.

lifthrasiir · on Feb 22, 2021

Anything necessarily requiring lookup tables would be (comparably) slow. Unicode lookup tables are large, so it would be inherently slow for some inputs. For the performance you should assume the input distribution (e.g. ASCII is frequent even in Unicode texts, so we detect a run of ASCII and go SIMD). You can also combine adjacent operations: instead of working with Unicode scalar values, you can directly accept UTF-8 string and optimize tables for them (Oniguruma regexp engine of Ruby fame does this).

apankrat · on Feb 22, 2021

You want a benchmark that shows that one >>, one &, two [] and a + are going to be fast?

The test loop that wraps the lookup code will likely take as much CPU as what's being benchmarked. Even if you are to use RDPMC and even if you are to use it very sparingly.

yorwba · on Feb 23, 2021

The naive solution uses just one [], so it's reasonable to ask whether the compressed solution is faster. (Especially for the higher compression levels with even more memory accesses.)

And the relevant workload to benchmark wouldn't be just case-converting a single character, but a whole text, as would be done e.g. for constructing a case-insensitive search index.

Because most texts contain only a limited range of characters, it's entirely possible that a large lookup table works just fine because only a tiny portion of it needs to fit into the cache.

Asooka · on Feb 23, 2021

I'm fairly confident that since the lookup table fits comfortably in L1 cache, both algorithms will be about equally fast. You may see a difference if you have to case-fold the entire Library of Congress several times per user operation. The other case where there may be meaningful difference in performance would be embedded devices with small caches and slow memory.

vardump · on Feb 23, 2021

Indirect memory lookups are slow, even if the data is in the L1 cache.

rurban · on Feb 22, 2021

I did a much better optimization for musl/freebsd, about 10x smaller, with binary search into a single table of ranges and single exceptions, for both cases, lower and upper. A few ifs are better than wasting lots of cache.

https://github.com/rurban/musl/commit/7bcdc7e6e1ed2c4424d706...

But I'm also trying the automatic data optimization route, because Unicode I'd changing every year. Case mapping is easy, but the changing normalization tables cause minor data structure hickups every year. Working in that here: https://github.com/rurban/optdata

huhtenberg · on Feb 22, 2021

From the first glance, yours is going to be orders of magnitude slower.

legulere · on Feb 22, 2021

I have the feeling that immutable data (with which I don’t mean copy on write) has an extreme potential for optimization but there doesn’t seem to be a good central collection of it.

Some tips from me: mmap works really well and easy with immutable data as the operating system will manage the memory pretty well. If you store strings ordered in a string table you can compare the pointers for blazing my fast string comparison.

kccqzy · on Feb 22, 2021

I think if you have a large immutable lookup table, the compiler will put it in the rodata section. Which means the kernel can evict that mapping if it's unused and repopulate via a page fault when it's needed.

rurban · on Feb 22, 2021

mmap'ng constant tables is nice, but still wasting your caches. Which makes it much slower than a few tricks to save data and cpu. Such as using proper algorithms and data structures. I'm working on an optimizing data compiler for such oneshot lookups in static const data. Range or interval queries would be different and easier (btree's only), but oneshot needs to try much more.

legulere · on Feb 23, 2021

You can use proper data structures in mmaped files. The problem is that you will need offset-based pointers (Like position independent code, but data) and fixed strict layouts, which has very little support in programming languages.

kccqzy · on Feb 22, 2021

> The casemap_lower comprises two distinct parts. It starts with 256 index entries, followed by 3866 deltas, to the grand total of 4122.

This sounds like a code smell of using one variable for two distinct purposes. Why not separate them into two tables? After all the first lookup results in a base index to be added, and the second lookup results in a delta. They are different things.

lifthrasiir · on Feb 22, 2021

Normally that's correct, but in this case you would like to put two tables as close as possible as you already know the second table would be accessed right after the first. As a side effect two tables would share the same base address, which can be beneficial for some platforms. Technically it also allows for the possibility to overlap two tables (I think it's unlikely, but still).

cma · on Feb 23, 2021

You maybe still want it to run fast with unoptimizable debug builds to help with development (less register usage if both accesses are offset from same baseline address). It's common for interactive applications to still do some manual optimization that could otherwise be done by an optimized build for that reason.

bear8642 · on Feb 22, 2021

Nice! Bit confused on link to TSP but cool!

jcranmer · on Feb 22, 2021

I can see why you might be confused. Here's my attempt at an explanation:

The goal is to find the permutation of blocks that minimizes the total length of the table. Each block contributes an amount of that length--the amount that did not overlap with the previous block. This amount is only dependent on the interaction between the block and its previous block. So we can make a graph where each node is a block to be encoded, each node has an edge to every other node, and the weight of that edge is the amount of overlap [1]. Compute an ordering of the nodes to visit that maximizes overlap, and you've a) done a TSP traversal and b) computed the optimal permutation.

[1] Equivalently, non-overlap, gets you a minimum-weight problem, as in classical TSP. But the notes suggest it's framed as a maximal problem, finding the path that maximizes overlap--which is certainly easier to code.

apankrat · on Feb 22, 2021

Say, we have our N blocks. We make these into the nodes in a graph. Every node is connected to all other nodes.

An edge leading from A to B is a case when block A precedes block B. The weight of that edge is the AB overlap. Note that the same edge may have a different weight in the opposite direction, because it will be that of the BA overlap.

So what we are after is the heaviest path through this graph that passes through all nodes exactly once. This can be reduced to finding the heaviest cyclic path for a specific starting point (and excluding the cheapest edge from the solution and then iterating through all points). This is pretty much TSP with an inverted goal.

lostmsu · on Feb 23, 2021

So how much faster/smaller is this than a uint16->uint16 hashtable, considering this approach uses at least two levels of indirection?