
Smaz - short strings compression library - antirez
http://github.com/antirez/smaz/tree/master
======
tptacek
This code made me smile, so don't take this the wrong way.

Don't run smaz() on untrusted input. Two things you need to fix, and a general
word of advice:

(1) In the 0xFE case, you check for headroom in the output buffer but not the
input buffer. If 0xFE is the last character in the input, you run right off
the end of it, dumping the contents of memory to the output string.

(2) In the 0xFF case, same thing; if 0xFF 0xFE are the last characters in the
buffer, I get at least 254 bytes of surrounding memory in my output.

How big a deal this is depends on what's in memory (and where the buffers are
being kept). Generally, these are going to be on the stack, thus, a big deal.

General advice: not in love with using signed integers for byte counts; also,
I don't think you have this problem because you're using full width integers,
but be aware that your length bytes can theoretically wrap back to 0 if you're
ever not careful in the future.

DON'T DATE ROBOTS.

~~~
antirez
Hello tptacek, in the normal use case it is not very common to decompress
things that you didn't compress yourself, but I agree with you that it's
better to add these checks at the cost of some speed. Btw note that this is
common in compression code that tries to be fast: in things like FastLZ and
similar libs this is a compile-time define, and I may do the same. If
SAFE_DECOMPRESSOR is defined it will do the checks, otherwise it will just
skip the checking code.

Also note that the access is read only, so basically it's not possible to use
such a thing to run arbitrary code. Also, in modern operating systems pages
are zeroed before being handed to a different process, so it is not possible
to read memory belonging to a different process.

Thanks for the hint.

~~~
tptacek
Three observations:

(1) It is extremely common for network apps to decompress blobs they didn't
generate; even in cases where those blobs aren't passed client/server, but are
instead roundtripped through a database or shared persistence layer.

(2) Whatever fastlz may do, the industry standard (in Win32 and Unix and in
embedded devices) is zlib --- at least from what we've found on app pentests
--- and zlib doesn't have this problem (now).

(3) Zlib is itself a minefield of vulnerabilities, which supports the point
that your compression codec is a security hotspot for an application.

But my point isn't that your code sucks; it's nice and tight.

~~~
antirez
1) ACK, you are right. Actually I'm probably going to use this lib for Redis
(<http://code.google.com/p/redis>) in the future, and it is important that an
untrusted DB loads correctly without memory corruption or invalid accesses.

2) Right, but a define to skip these checks is useful for environments where
data can be trusted but speed is very important.

3) Good point. I'll fix the code. Btw what is nice about this kind of thing
is that testing against random data helps _a lot_.

Zlib should do that too. For example, do you remember the old security bug in
old BIND implementations, where a 'ptr' opcode (used for compression) in DNS
packets could be used to create an infinite loop and mount a DoS? Years ago I
found that this bug was trivial to spot just by processing random data. This
is true for a lot of applications.

If you know how the application works you can "tune" this random data with
biases that will exercise the code a lot more. This is a worthy testing tool
that is underused, IMHO.
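
For example something like this (just a sketch; I'm assuming the decompress
entry point takes input, input length, output buffer and output size as
described in the README, and that you run it under valgrind or a similar tool
to catch out-of-bounds accesses):

    #include <stdlib.h>
    #include "smaz.h"

    /* Sketch: random buffers, biased so that the 0xFE/0xFF escape bytes
     * show up far more often than they would by chance, thrown at the
     * decompressor millions of times. */
    int main(void) {
        char in[256], out[4096];
        srand(1234);
        for (long i = 0; i < 10000000; i++) {
            int inlen = 1 + rand() % (int)sizeof(in);
            for (int j = 0; j < inlen; j++) {
                int r = rand() % 10;
                in[j] = (r == 0) ? (char)0xFE :
                        (r == 1) ? (char)0xFF : (char)(rand() & 0xFF);
            }
            smaz_decompress(in, inlen, out, sizeof(out));
        }
        return 0;
    }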

Thanks for your comment.

~~~
tptacek
The testing you're describing is called "fuzzing", and it's the dominant way
researchers find vulnerabilities today (I didn't fuzz your code; I just read
it, and then wrote a little test doohickey).

David Molnar at Berkeley has done a ton of work on smart fuzzing:
<http://www.cs.berkeley.edu/~dmolnar/>

Ilja van Sprundel wrote what has got to be the most popular "dumb" fuzzer,
mangle.c: <http://www.digitaldwarf.be/products/mangle.c>

Mangle routinely finds horrible, horrible things in modern code; Ilja deserves
some kind of anti-Nobel Prize for it.

------
antirez
Hello, I wrote this lib as an initial test on short-string compression. It
seems like it can be useful for things like memcached, Redis, and the like,
but it may also be a good way to make your MySQL DB smaller (chats, logs,
user messages, news titles, URLs can all be compressed with this lib). In the
README there is some info about the lib; the usage is brain-dead simple.

It's only a few hours old, but it should be pretty stable already; with this
kind of programming what's nice is that you can compress/decompress millions
of strings from /dev/urandom to spot obvious bugs in a very short time.
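
The test itself is nothing fancy, something like this (a sketch; I'm assuming
the compress/decompress prototypes take input, input length, output buffer,
output size and return the number of bytes written when the output fits;
check smaz.h for the exact prototypes):

    #include <stdio.h>
    #include <string.h>
    #include "smaz.h"

    /* Sketch of the round-trip test: random strings in, compress,
     * decompress, compare. Any mismatch (or crash) is an obvious bug. */
    int main(void) {
        FILE *fp = fopen("/dev/urandom", "rb");
        char in[64], comp[256], back[256];
        if (!fp) return 1;
        for (long i = 0; i < 1000000; i++) {
            int inlen = 1 + (int)(i % sizeof(in));
            if (fread(in, 1, inlen, fp) != (size_t)inlen) break;
            int clen = smaz_compress(in, inlen, comp, sizeof(comp));
            int dlen = smaz_decompress(comp, clen, back, sizeof(back));
            if (dlen != inlen || memcmp(in, back, inlen) != 0) {
                printf("round-trip bug at iteration %ld\n", i);
                return 1;
            }
        }
        fclose(fp);
        return 0;
    }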

I wonder if this lib can be useful to somebody else.

Some examples:

        'This is a small string' compressed by 50%
        'foobar' compressed by 34%
        'the end' compressed by 58%
        'not-a-g00d-Exampl333' enlarged by 15%
        'Smaz is a simple compression library' compressed by 39%
        'Nothing is more difficult, and therefore more precious, than to be able to decide' compressed by 49%
        'this is an example of what works very well with smaz' compressed by 49%
        '1000 numbers 2000 will 10 20 30 compress very little' compressed by 10%

~~~
jcl
It looks like your encoding routine could get better compression if it weighed
alternative encodings more carefully. For example, "not-a-g00d-Exampl333"
could be better compressed as a single run of literal characters rather than
as several runs of literal characters interspersed with codebook entries. (In
this sense, the single letter codebook entries might not be helping your
compression ratio.)

~~~
antirez
jcl, yes, this is a compromise. If you check smaz_compress it does this kind
of check, but only in order to glue together consecutive verbatim outputs
into a single one (up to 256 bytes); it does not check back whether a
single-byte codebook interruption is worth it.
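
To give a rough idea of when such a back-check pays off (assuming the framing
discussed above, i.e. a two-byte 0xFF+length header per verbatim run): two
verbatim runs of a and b literal bytes separated by one codebook byte that
decodes to k characters cost (2+a) + 1 + (2+b) = a+b+5 bytes as emitted now,
while merged into a single verbatim run they would cost 2+a+k+b bytes. So the
merge only wins when k < 3, that is, when the interrupting codebook entry
covers just one or two characters (and the merged run still fits in the
256-byte limit).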

I think this could be an option, done as a second pass when a given #define
is active. Thanks for the useful comment.

------
gojomo
Seems to me that gzip or lzma with a constant, language-optimized dictionary
would both...

(1) compress better; and

(2) be more robust/self-documenting.

EG: "Feed this dictionary, optimized over 1 billion [English|Western
language|etc] documents, to zlib deflateSetDictionary() before compressing."

(The zlib/gzip dictionary is only 32KiB, but the lzma dictionary can be up to
32MiB.)
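
Roughly like this (a sketch against zlib's real deflateSetDictionary() API;
the dictionary bytes would come from the corpus-derived dictionary, not the
placeholder arguments used here):

    #include <string.h>
    #include <zlib.h>

    /* Sketch: deflate a short string against a preset dictionary. The
     * decompressor must call inflateSetDictionary() with the same bytes
     * when inflate() returns Z_NEED_DICT. */
    static int compress_with_dict(const char *in, unsigned inlen,
                                  unsigned char *out, unsigned outlen,
                                  const unsigned char *dict, unsigned dictlen) {
        z_stream s;
        int produced = -1;
        memset(&s, 0, sizeof(s));
        if (deflateInit(&s, Z_BEST_COMPRESSION) != Z_OK) return -1;
        if (deflateSetDictionary(&s, dict, dictlen) != Z_OK) {
            deflateEnd(&s);
            return -1;
        }
        s.next_in  = (Bytef *)in; s.avail_in  = inlen;
        s.next_out = out;         s.avail_out = outlen;
        if (deflate(&s, Z_FINISH) == Z_STREAM_END)
            produced = (int)(outlen - s.avail_out);
        deflateEnd(&s);
        return produced;
    }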

Building a small library of standard dictionaries, keyed by hash (so they can
be locally cached or fetched when needed) would offer good coverage of many
string domains.

See for example the Google 'Shared Dictionary Compression over HTTP (SDCH)'
proposal, apparently a feature of Google Toolbar:

[http://sdch.googlegroups.com/web/Shared_Dictionary_Compressi...](http://sdch.googlegroups.com/web/Shared_Dictionary_Compression_over_HTTP.pdf)

<http://groups.google.com/group/SDCH>

~~~
antirez
Sure, you can do this; what you get is something that is also able to adapt
dynamically after the first 100 bytes or so. My requirement was to have
something smaller and simpler, able to run on low-memory systems, and
especially with much faster decompression than zlib.

Basically I'm not using arithmetic coding, and the dictionary is
(intentionally) small. Add arithmetic coding to this lib and you get close to
a zlib with an optimized dictionary, but at the cost of a lot of speed.

------
jcl
It looks like you put together your codebook by hand (containing phrases like
"http://" and ".com"), right? Wouldn't it be better to automatically extract
it from a sample text corpus? Otherwise, you'd be introducing your own
biases.

~~~
antirez
Hello, no, I built it with a Ruby script. But I inserted two codes by hand:
"http://" and ".com". All the rest is based on a probability/length-based
weight, and on test data consisting of English books in .txt format and
various .html pages from Wikipedia.

