
When to use Dictionary Compression
https://fastcompression.blogspot.com/2018/02/when-to-use-dictionary-compression.html?r=1
======
faragon
For real-time dictionary compression, with infinite-size dictionary (LZ77,
using a hash table for O(1) pattern search), you can try:

git clone
[https://github.com/faragon/libsrt](https://github.com/faragon/libsrt)

cd libsrt

make

./enc -ezh <hugefile >hugefile.z

./enc -dz <hugefile.z >hugefile.again

For 1MB dictionary (faster, because of higher cache hit ratio, and using a
much smaller buffer):

./enc -ez <hugefile >hugefile.z

./enc -dz <hugefile.z >hugefile.again

For the source code, check the compression (senc_lz_aux()) and decompression
(sdec_lz()) functions; they are much easier to understand than the Zstandard
code, because they have fewer features:

[https://github.com/faragon/libsrt/blob/master/src/saux/senc.c](https://github.com/faragon/libsrt/blob/master/src/saux/senc.c)
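The hash-table LZ77 idea described above can be sketched in a few lines of
Python. This is a toy illustration only: the token format and function names
are invented here, not libsrt's actual implementation.

```python
# A dict maps each 4-byte prefix to its most recent position, giving the
# O(1) candidate lookup faragon mentions.
MIN_MATCH = 4

def lz_compress(data: bytes):
    table = {}   # 4-byte prefix -> most recent position seen
    out = []     # tokens: ('lit', byte) or ('match', offset, length)
    i = 0
    while i < len(data):
        key = data[i:i + MIN_MATCH]
        pos = table.get(key)
        if len(key) == MIN_MATCH:
            table[key] = i
        if pos is not None:
            # Extend the match as far as it goes (overlap with i is allowed).
            length = MIN_MATCH
            while i + length < len(data) and data[pos + length] == data[i + length]:
                length += 1
            out.append(('match', i - pos, length))
            i += length
        else:
            out.append(('lit', data[i]))
            i += 1
    return out

def lz_decompress(tokens) -> bytes:
    buf = bytearray()
    for tok in tokens:
        if tok[0] == 'lit':
            buf.append(tok[1])
        else:
            _, offset, length = tok
            start = len(buf) - offset
            for k in range(length):   # byte-by-byte copy handles overlapping matches
                buf.append(buf[start + k])
    return bytes(buf)

data = b"hello hello hello, hello world"
assert lz_decompress(lz_compress(data)) == data
```

A real implementation would pack the tokens into a bit-level format and bound
the dictionary window, but the hash-table lookup is the core of the speed.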

------
hamilyon2
The article does not say what dictionary compression is or how it works, or
have I missed it? It would be interesting to know, since dictionary
compression seems to be the topic of the article.

~~~
pronoiac
Usually, when you compress, it's with, say, gzip, rather than something custom
for the data. The compression dictionary is kinda implicitly built and encoded
along with the data.

If you have, like, some standard data (handwave), you can build a dictionary
of frequently-used terms and refer to it when you compress and decompress, and
you can get better compression for being able to point at that pre-built
dictionary.
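Python's stdlib zlib supports exactly this through a preset dictionary, which
makes for a quick sketch. The dictionary bytes and sample record below are
made up for illustration:

```python
import zlib

# A pre-built dictionary both sides agree on: substrings expected in the data.
zdict = b'"name":"status":"active"'
sample = b'{"name":"alice","status":"active"}'

plain = zlib.compress(sample)                     # no shared dictionary

comp = zlib.compressobj(zdict=zdict)
with_dict = comp.compress(sample) + comp.flush()  # compressed against the dictionary

# The decompressor must be handed the same dictionary.
decomp = zlib.decompressobj(zdict=zdict)
assert decomp.decompress(with_dict) == sample

print(len(plain), len(with_dict))   # the dictionary-assisted output is smaller
```

The win is biggest on many small, similar payloads, where a normal compressor
has no history to refer back to.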

I might be totally off base here, as I think database normalization would
remove much of the redundancy that a dictionary would target.

~~~
smueller1234
Database normalization has little to do with this. It's about structurally
removing redundancy from your schema. Like not having the same column and data
twice in two tables. It's not going to do anything if you have two rows with
common data such as two people with the same first name. In the context of
compression, dictionaries also don't generally correspond to a set of words or
records per se but rather arbitrary (sub)strings or blobs.

~~~
mbreese
I think they were trying to say that if you had a normalized database where,
instead of repeated varchar fields, you had a foreign key — that would act
like a dictionary.

    
    
        CREATE TABLE customer(
        id SERIAL,
        name VARCHAR(1024),
        status VARCHAR(1024)
        )
    

Vs

    
    
        CREATE TABLE customer(
        id SERIAL,
        name VARCHAR(1024),
        status INT REFERENCES status(id)
        )
    
        CREATE TABLE status(
        id SERIAL,
        name VARCHAR(1024)
        )
     

It wasn’t very clear from the parent above, but I think this is where they
were trying to go. The normalized version is smaller because you’ve swapped a
repeated varchar ‘status’ for an int — so that’s effectively a dictionary.
It’s not entirely correct, but it’s not that far off.
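The size effect of that normalization can be illustrated with a toy
dictionary-encoding sketch in Python (hypothetical data; the dict plays the
role of the status table, the int codes play the role of the foreign keys):

```python
# Replace repeated status strings with small integer codes plus a lookup table.
rows = ["active", "pending", "active", "closed", "active", "pending"] * 100

codes = {}   # status string -> small int (the "status" table)
encoded = [codes.setdefault(s, len(codes)) for s in rows]   # the "foreign keys"

raw_bytes = sum(len(s) for s in rows)                    # storing the varchar every row
dict_bytes = len(encoded) + sum(len(s) for s in codes)   # ~1 byte per code + the table

print(raw_bytes, dict_bytes)

# Decoding is just a lookup, like the join back to the status table.
inverse = {v: k for k, v in codes.items()}
assert [inverse[c] for c in encoded] == rows
```

Unlike a compression dictionary, though, this only captures whole-value
repeats, not arbitrary repeated substrings.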

~~~
smueller1234
That makes sense, thanks for pointing out my misreading of the parent comment.

------
mianos
I know when not to: when the decompression needs to be done in an interrupt
routine. I just did a "run or delta" system for G-code. Decoding in a
microcontroller interrupt routine, I could not get the dictionary-based
routines to run quite fast enough, or in a small enough amount of RAM.

~~~
rasz
Was this your first embedded project? Never EVER do actual work in an
interrupt routine: no compression, no floating point (even if it's native and
not emulated), no computation, no lookups. Just pass the marker along and exit.

~~~
Lukas_Skywalker
Not the OP, and I have never done embedded programming, but you got me hooked
with the phrase "pass the marker along", and my search engine skills sadly
failed me. Do you mean the preferred way is to do the actual work in the main
routine, and to only set a marker in the interrupt routine indicating that the
work should be done?

~~~
nothrabannosir
Ideally, an IR should just collect whatever information is necessary for the
main thread to actually handle the event. Compare this to raising and
handling an event in e.g. JavaScript: the event payload is handled by the main
thread, and the IR would just be responsible for creating that event payload
object. (This is a horrible comparison, to be honest.)

The problem with IRs is they put the processor in a very fragile state.
Different chips handle e.g. stacked interrupts (an interrupt which occurs
while another interrupt was being handled) differently, but no matter how they
do it, you won't like it. Sometimes it turns off certain types of memory
access protection, or other important (hardware) peripherals which might be
ignored (because they, themselves, operate using those same interrupts under
the hood); it's all madness.

Hence the interrupt just puts together some event data and sets a flag, and
the main event loop can handle it when the time comes.

------
memco
Completely OT, but this site is horrible on mobile. Trying to zoom in changes
articles. Feels like any interaction changes pages.

~~~
ravenstine
Ironic how Blogger pages have become much worse for mobile in the face of
Google promoting AMP.

~~~
slig
It certainly feels like Google wants Blogger to die already. Same with Google
Groups.

