To be honest though, this problem of reducing a large table while keeping it randomly accessible is hardly new (e.g. perfect hashing), and pretty much everything can be automated with only some expert guidance. I'd like to see a neat solution that can interactively search for remappings and emit clean code in various languages.
(As a fellow Unicode commentator, I should also mention that the OP is only concerned with what Unicode calls simple case mappings. But those are pretty much the only mappings we'd want to optimize anyway, given their prevalence.)
Great work on reducing the size of the character conversion tables, but I don't see any evidence for the title of the web page.
"Compact character case conversion" is more appropriate currently based on the data given in the readme.
Data can be greatly compressed at the cost of access speed, or the compressed data could speed up access - with x86 CPU complexity these days, I wouldn't bet on determining the speed from a quick glance at the code.
You state the case conversion is fast, yet I don't see any benchmarks backing up the assertion.
I'd expect a few different types of input to be benchmarked - ASCII upper/lower case alphabet chars (typical A-Z, a-z with space/punctuation thrown in to look like typical written-language text), Unicode upper/lower case et al., and random characters.
Not gonna leave people in the lurch though - Daniel Lemire has a recent post which can be used as a scaffold for benchmarking.
If the code is to be timed on Windows, it will need a Windows version of "clock_gettime(CLOCK_REALTIME, ...)" - see https://stackoverflow.com/questions/5404277/porting-clock-ge...
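A rough shim along these lines should do - just a sketch based on QueryPerformanceCounter, the usual substitute; error handling and the clock-id argument are omitted:

    #ifdef _WIN32
    #include <windows.h>
    #include <time.h>

    /* Rough Windows substitute for clock_gettime(), based on
     * QueryPerformanceCounter. Sketch only; no error handling. */
    static int clock_gettime_win(struct timespec *ts)
    {
        LARGE_INTEGER freq, count;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&count);
        ts->tv_sec  = (time_t)(count.QuadPart / freq.QuadPart);
        ts->tv_nsec = (long)((count.QuadPart % freq.QuadPart)
                             * 1000000000LL / freq.QuadPart);
        return 0;
    }
    #endif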
If you go down that rabbit hole, however, you might want to check whether skipping the shifts and lookups entirely for 7-bit ASCII is even faster, if (as is often the case) you expect those characters to be the vast majority.
If not, then use a slow path with lookups.
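Something like this rough sketch is what I have in mind; to_lower_full() is a made-up name standing in for whatever shift-and-lookup path is under test:

    #include <stdint.h>

    /* Hypothetical ASCII fast path; to_lower_full() stands in for the
     * table-based implementation being benchmarked. */
    extern uint32_t to_lower_full(uint32_t cp);

    static inline uint32_t to_lower(uint32_t cp)
    {
        if (cp < 0x80) {                      /* 7-bit ASCII: no table access */
            if (cp >= 'A' && cp <= 'Z')
                return cp + ('a' - 'A');
            return cp;
        }
        return to_lower_full(cp);             /* slow path: shifts and lookups */
    }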
The test loop that wraps the lookup code will likely take as much CPU as what's being benchmarked, even if you use RDPMC, and even if you use it very sparingly.
And the relevant workload to benchmark wouldn't be just case-converting a single character, but a whole text, as would be done e.g. for constructing a case-insensitive search index.
Because most texts contain only a limited range of characters, it's entirely possible that a large lookup table works just fine because only a tiny portion of it needs to fit into the cache.
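So a more representative harness would time a whole buffer, something like this sketch (assuming the text has already been decoded to an array of code points; a real harness would decode UTF-8 and repeat the measurement):

    #include <stdint.h>
    #include <stddef.h>
    #include <time.h>

    extern uint32_t to_lower(uint32_t cp);   /* whichever implementation is under test */

    /* Times lowercasing a whole buffer; the accumulator keeps the loop
     * from being optimized away. Returns elapsed seconds. */
    static double bench_lower(const uint32_t *text, size_t n, uint64_t *sink)
    {
        struct timespec t0, t1;
        uint64_t acc = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < n; i++)
            acc += to_lower(text[i]);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        *sink = acc;
        return (double)(t1.tv_sec - t0.tv_sec)
             + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }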
But I'm also trying the automatic data optimization route, because Unicode is changing every year. Case mapping is easy, but the changing normalization tables cause minor data structure hiccups every year. Working on that here: https://github.com/rurban/optdata
Some tips from me: mmap works really well and is easy to use with immutable data, as the operating system will manage the memory pretty well. If you store strings ordered in a string table, you can compare the pointers for blazingly fast string comparison.
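For the mmap part, roughly this (POSIX-only sketch with minimal error handling; the file is whatever you serialize the immutable tables as):

    #include <stddef.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Maps an immutable, prebuilt table file read-only; the kernel pages
     * it in on demand and can share it between processes. */
    static const void *map_table(const char *path, size_t *len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return NULL; }
        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                            /* the mapping stays valid after close */
        if (p == MAP_FAILED)
            return NULL;
        *len = (size_t)st.st_size;
        return p;
    }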
This sounds like a code smell of using one variable for two distinct purposes. Why not separate them into two tables? After all, the first lookup results in a base index to be added, and the second lookup results in a delta. They are different things.
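i.e. something with the shape of this sketch, where the block size and table names are made up:

    #include <stdint.h>

    /* Hypothetical two-table layout: the first lookup yields a base index
     * into the second table, the second lookup yields a signed delta. */
    extern const uint16_t stage1_base[];     /* one base index per 128-char block */
    extern const int16_t  stage2_delta[];    /* per-character deltas              */

    static inline uint32_t to_lower_split(uint32_t cp)
    {
        uint32_t block  = cp >> 7;
        uint32_t offset = cp & 0x7F;
        int32_t  delta  = stage2_delta[stage1_base[block] + offset];
        return (uint32_t)((int32_t)cp + delta);
    }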
The goal is to find the permutation of blocks that minimizes the total length of the table. Each block contributes an amount to that length: the part that did not overlap with the previous block. This amount depends only on the interaction between the block and its predecessor. So we can make a graph where each node is a block to be encoded, each node has an edge to every other node, and the weight of that edge is the amount of overlap. Compute an ordering of the nodes to visit that maximizes overlap, and you've a) done a TSP traversal and b) computed the optimal permutation.
Equivalently, weighting edges by non-overlap gets you a minimum-weight problem, as in classical TSP. But the notes suggest it's framed as a maximization problem, finding the path that maximizes overlap, which is certainly easier to code.
An edge leading from A to B represents the case where block A immediately precedes block B. The weight of that edge is the A-to-B overlap. Note that the same pair of nodes may have a different weight in the opposite direction, because that would be the B-to-A overlap.
So what we are after is the heaviest path through this graph that passes through every node exactly once. This can be reduced to finding the heaviest cycle for a specific starting point (then excluding the cheapest edge from the solution, and iterating over all starting points). This is pretty much TSP with an inverted goal.
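Solving it exactly is expensive, but even a greedy nearest-neighbour walk over the overlap graph is a cheap approximation worth trying first. A rough sketch, where the block size and the block-count limit are arbitrary:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    #define BLOCK 64   /* assumed block size, in table entries */

    /* Longest suffix of a[] that equals a prefix of b[]. */
    static size_t overlap(const int16_t *a, const int16_t *b)
    {
        for (size_t k = BLOCK; k > 0; k--) {
            bool match = true;
            for (size_t i = 0; i < k; i++)
                if (a[BLOCK - k + i] != b[i]) { match = false; break; }
            if (match)
                return k;
        }
        return 0;
    }

    /* Greedy ordering (not an exact TSP solver): repeatedly append the
     * unused block with the largest overlap with the current tail.
     * Fills order[] and returns the total number of entries saved. */
    static size_t greedy_order(const int16_t blocks[][BLOCK], size_t n, size_t *order)
    {
        bool used[1024] = { false };          /* assumes n <= 1024 */
        size_t saved = 0;
        order[0] = 0;                         /* arbitrary start block */
        used[0] = true;
        for (size_t pos = 1; pos < n; pos++) {
            size_t best = 0, best_ov = 0;
            bool found = false;
            for (size_t j = 0; j < n; j++) {
                if (used[j]) continue;
                size_t ov = overlap(blocks[order[pos - 1]], blocks[j]);
                if (!found || ov > best_ov) { best = j; best_ov = ov; found = true; }
            }
            order[pos] = best;
            used[best] = true;
            saved += best_ov;
        }
        return saved;
    }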