
Show HN: Efficiently searching compressed text files - mattgodbolt
https://github.com/mattgodbolt/zindex
======
jules
The article's method is nice: it allows you to decompress portions of the
file. I thought this would be about the FM-index. The FM-index allows you to
search what is essentially a bzip'd archive for a substring far more
efficiently than you could search the uncompressed string for that
substring. Pretty mind-blowing if you think about it.

It works because the Burrows-Wheeler transform used in compression is
closely related to the suffix array, a data structure used as a substring
index. By adding a small amount of additional metadata to the compressed
Burrows-Wheeler-transformed string, you can search for any substring. Its
practical performance is slower than a suffix array's, but the memory usage
can be 100x lower.
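
A minimal Python sketch of that connection (illustrative only; a real
FM-index adds rank/select metadata on top of the BWT rather than storing
suffixes):

    def suffix_array(s):
        """Indices of the suffixes of s, sorted lexicographically."""
        return sorted(range(len(s)), key=lambda i: s[i:])

    def bwt(s):
        """The Burrows-Wheeler transform is the character preceding each
        sorted suffix, i.e. the last column of the sorted-rotations matrix."""
        s += "\0"  # unique terminator that sorts before everything else
        return "".join(s[i - 1] for i in suffix_array(s))

    print(repr(bwt("banana")))  # 'annb\x00aa' -- repeats cluster together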

------
xaa
I have been wanting this for a long time, although my primary use case would
be indexing a particular field.

It is nice that you were able to make it work with ordinary gzip files. In
bioinformatics, one approach to this problem has been to create a gzip-
backwards-compatible format (i.e., it can be read by zcat but not written by
gzip) called BGZF [1], which acts like ordinary gzip except that each
compressed block holds at most 64 KiB of uncompressed data and records its
own compressed size in a gzip extra field, which eases indexing and partial
decompression.

I would be interested to know how you solved the problem of finding the right
block and offset when each compressed block can be an arbitrary size as it is
in ordinary gzip format.

[1] [http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html](http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html)
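
For reference, BGZF random access hangs off a 64-bit "virtual offset" that
packs the compressed block's file offset together with the position inside
that block's decompressed data. A sketch per the SAM/BAM spec:

    def make_virtual_offset(coffset, uoffset):
        """BGZF virtual offset: 48 bits of compressed-block file offset,
        16 bits of position within the decompressed (<= 64 KiB) block."""
        assert 0 <= uoffset < (1 << 16)
        return (coffset << 16) | uoffset

    def split_virtual_offset(voffset):
        return voffset >> 16, voffset & 0xFFFF

    # To read a record: seek to coffset, inflate that one block, then
    # skip uoffset bytes into the decompressed data.
    assert split_virtual_offset(make_virtual_offset(12345, 678)) == (12345, 678)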

~~~
mattgodbolt
Fantastic! I had hoped others would have the same problem as me. I didn't have
the luxury of being able to change my many input gzip files (unlike BGZF).

The trick is to store the gzip decompressor's internal state as a checkpoint
every few hundred MB. Then, to decode data at a particular offset, you find
the last checkpoint before your offset, reinitialize the decompression state
from it, and scan forward until you reach the offset you need.

The zindex file that's generated contains both these checkpoints and the
index that lets you seek to them.
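
Roughly, the idea in Python (an in-memory sketch only, using zlib
decompressor copies as the checkpoints; zindex's real on-disk format, which
persists the inflate window and bit offset, differs):

    import gzip
    import zlib

    CHUNK = 64 * 1024

    def index_with_checkpoints(raw, every=1 << 20):
        """One sequential pass over the gzip data, snapshotting the
        decompressor roughly every `every` uncompressed bytes."""
        d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # gzip wrapper
        checkpoints = []  # (uncompressed offset, compressed offset, state)
        out_pos = in_pos = next_mark = 0
        while in_pos < len(raw):
            if out_pos >= next_mark:
                checkpoints.append((out_pos, in_pos, d.copy()))
                next_mark += every
            out_pos += len(d.decompress(raw[in_pos:in_pos + CHUNK]))
            in_pos += CHUNK
        return checkpoints

    def read_at(raw, checkpoints, offset, length):
        """Resume from the last checkpoint at or before `offset`, then
        scan forward to the bytes we actually want."""
        out_pos, in_pos, state = max(
            (c for c in checkpoints if c[0] <= offset), key=lambda c: c[0])
        d, out = state.copy(), b""  # copy again so the checkpoint stays reusable
        while len(out) < (offset - out_pos) + length and in_pos < len(raw):
            out += d.decompress(raw[in_pos:in_pos + CHUNK])
            in_pos += CHUNK
        return out[offset - out_pos:offset - out_pos + length]

    blob = gzip.compress(b"line\n" * 1_000_000)
    cps = index_with_checkpoints(blob)
    print(read_at(blob, cps, 4_999_995, 5))  # b'line\n'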

There's an open issue I'd like to resolve to add support for indexing a
particular field: would arguments like those of the UNIX "cut" utility work
for your use case?

~~~
xaa
Smart. I will see for myself in a minute how much space such an approach
takes; I would imagine the stored compressor state gets nontrivially large
on big files.

Well, my exact use case, which I imagine I'm not alone in, is indexing a tab-
delimited file with a single-line header containing the column names. So
indexing by column index would be adequate, but it would present two very
minor problems: 1) I'd have to figure out a way to exclude the header row
from the output (or alternatively, to always output it), and 2) it would be
slightly more convenient to refer to the columns to be indexed by name rather
than position, as some of these files have thousands of columns.

Can you have multiple indexes on the same GZIP file? And if so, does it create
multiple .zindex files or re-use the same one?

I don't know if SQLite supports custom data-retrieval backends the way
Postgres does, but what would be REALLY interesting is to write one such that
the indexing information lives in SQLite and you can use SQL to query row
data directly from the GZIP index (I know this is beyond the scope of your
project; I'm just thinking out loud now).

~~~
mattgodbolt
Short reply as I'm on a phone, but thanks! I can add an option to pick the nth
field, and to skip the first X lines of the input; I think this'd cover your
use case!

The code and file format support multiple indices in the same zindex file but
I haven't got a decent way to pass command line arguments to define them yet!

Watch this space: also feel free to ping me on email for more comments!

------
cjbprime
Yep, this is a neat idea. I first saw it as part of Patrick Collison's
wikipedia-iphone project (2007-ish), which could pull Wikipedia articles out
of a compressed bz2 archive by indexing from article name to bzip2 block and
then decompressing only the blocks needed for the article you asked for.

We adapted his code at One Laptop Per Child to preload Wikipedia snapshots on
every laptop:

[http://blog.printf.net/articles/2008/06/02/wikipedia-on-xo/](http://blog.printf.net/articles/2008/06/02/wikipedia-on-xo/)

[http://blog.printf.net/articles/2010/05/09/peru-olpc-and-wikipedia/](http://blog.printf.net/articles/2010/05/09/peru-olpc-and-wikipedia/)

------
PaulHoule
Something people don't know about gzip is that a file can be written as
multiple compressed members, concatenated, and gzip will decompress one
member and then move on to the next. Thus you get the same restartability
benefit as bzip2, but it is optional and flexible, so you can break only on
record boundaries.
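
For example, in Python (each record as its own gzip member; standard tools
read the concatenation as one stream):

    import gzip

    records = [b"record one\n", b"record two\n", b"record three\n"]

    # Compress each record as its own gzip member and concatenate.
    members = [gzip.compress(r) for r in records]
    blob = b"".join(members)

    # Standard tools see a single continuous stream...
    assert gzip.decompress(blob) == b"".join(records)

    # ...but you can also jump straight to any member boundary and
    # decompress from there, with no need for the earlier bytes.
    offset = len(members[0]) + len(members[1])
    print(gzip.decompress(blob[offset:]))  # b'record three\n'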

~~~
gojomo
Indeed! Restarting at record boundaries is one option. Then an external index
can point to all the ragged restart offsets, or some applications may be able
to just scan for the distinctive byte patterns of likely restart offsets.

Restarting at every N-byte boundary – for example, every 128K – is another
option, though that requires a little more thought about padding.

Another way to skip the need for an external index with any of these
techniques, while still knowing the raw-data line/byte offset when jumping to
a deep position, is to encode those tallies (either chunk-by-chunk or from
the start) as GZIP header "extra field" extensions. (Here's an example
extension that just records per-chunk compressed and uncompressed byte
lengths: [http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html#anchor25](http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html#anchor25))
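
A sketch of the scan-for-byte-patterns option: gzip members begin with the
magic bytes 1f 8b 08, so you can collect candidate offsets and verify each
by attempting to inflate there (false positives are possible, hence the
probe):

    import zlib

    GZIP_MAGIC = b"\x1f\x8b\x08"  # ID1, ID2, deflate method

    def likely_restart_offsets(blob):
        """Find byte offsets that parse as the start of a gzip member."""
        offsets, pos = [], blob.find(GZIP_MAGIC)
        while pos != -1:
            probe = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
            try:
                probe.decompress(blob[pos:pos + 1024])  # verify the header
                offsets.append(pos)
            except zlib.error:
                pass  # the magic bytes occurred inside compressed data
            pos = blob.find(GZIP_MAGIC, pos + 1)
        return offsets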

------
covi
Take a look at "Succinct: Enabling Queries on Compressed Data", NSDI 2015 [1].

[1] [https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/agarwal](https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/agarwal)

------
101914
Reminds me of dictzip.

[http://sourceforge.net/projects/dict/files/latest/download](http://sourceforge.net/projects/dict/files/latest/download)

------
dpkendal
Why not store the log files in an FM-index?
([https://en.wikipedia.org/wiki/FM-index](https://en.wikipedia.org/wiki/FM-index))

~~~
mattgodbolt
Interesting point, thanks for the link!

For my use case I wanted to put something together without writing too much
novel code; zindex uses a SQLite file to store indices and checkpoints. I'm
hoping to add JSON support to zq/zindex, which means a general text search
isn't that useful for me.

Additionally, the original log files I run against in my use case are
generated upstream of me and amount to many multi-gigabyte gzipped files. I
didn't want to add too much extra storage on top of my logs: the "key" I
index on is a tiny part of each log line, so the storage space for the index
is considerably less than the original text (even accounting for the gzip
checkpoints).

------
hackmiester
I only wish that the matching could be 'fuzzy' in easily defined ways.

My use case is MAC addresses. So, if I search for 'dead.beef.cafe', this
would be perfect if it could know that I also wanted to see entries indexed
by 'DE:AD:BE:EF:CA:FE'.

Is that possible? This tool is so amazing. If it did this, it would be able to
hugely improve my daily workflow.

edit: Actually, it is going to be able to hugely improve my workflow anyway,
because I often want to see data about IP version 4 addresses, too, and it
handles those fine. But I'd still like to be able to use it for MACs.
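
One workaround in the meantime: normalize every notation to a canonical key
at index time and again at query time. A sketch (hypothetical helper, not
part of zindex):

    import re

    def canonical_mac(s):
        """Map dead.beef.cafe, DE:AD:BE:EF:CA:FE, de-ad-be-ef-ca-fe, etc.
        to one canonical colon-separated, lowercase form."""
        digits = re.sub(r"[^0-9a-fA-F]", "", s).lower()
        if len(digits) != 12:
            raise ValueError("not a MAC address: %r" % s)
        return ":".join(digits[i:i + 2] for i in range(0, 12, 2))

    assert canonical_mac("dead.beef.cafe") == canonical_mac("DE:AD:BE:EF:CA:FE")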

~~~
mattgodbolt
Great idea. zq uses the SQLite index built by zindex, so in theory I can
modify it to use a 'WHERE index LIKE xxx' instead of 'WHERE index = xxx' --
then SQL-ish wildcards would be supported.

I've raised an issue to capture this use case; feel free to chime in there! :)
[https://github.com/mattgodbolt/zindex/issues/12](https://github.com/mattgodbolt/zindex/issues/12)
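
Something like this (with a made-up table layout for illustration; the real
zindex schema may differ):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE idx (key TEXT, offset INTEGER)")
    conn.executemany("INSERT INTO idx VALUES (?, ?)",
                     [("de:ad:be:ef:ca:fe", 1024), ("12:34:56:78:9a:bc", 2048)])

    # Exact match, as today...
    print(conn.execute("SELECT offset FROM idx WHERE key = ?",
                       ("de:ad:be:ef:ca:fe",)).fetchall())

    # ...versus SQL wildcards: translate a shell-style '*' into SQL's '%'.
    pattern = "de:ad:*".replace("*", "%")
    print(conn.execute("SELECT offset FROM idx WHERE key LIKE ?",
                       (pattern,)).fetchall())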

------
nattaylor
Roughly how long does it take to create the index, compared to, say, a
zgrep? Could it be parallelized with something like `parallel --pipe`?

I haven't tried yet, but this seems like a great tool.

~~~
mattgodbolt
Thanks! For my logfiles it runs through them at around 3MB/sec (that's 3MB of
compressed file per second). Most of that is probably I/O; my giant logs are
on NFS.

I run the indexer in parallel on multiple log files. In general there's no
way to parallelize the indexing of a single file, as decompressing a gzipped
file at a given point requires all the state built up from the bytes before
it. Unless you already have a gzip checkpoint around somewhere, you have to
scan sequentially: creating those checkpoints is part of what zindex does to
allow quick random access.

Thanks for the feedback :)

------
dangirsh
See "succinct data structures" and "wavelet trees" for a similar / more
general idea.

[https://news.ycombinator.com/item?id=7079427](https://news.ycombinator.com/item?id=7079427)

I also must thank Edward Kmett for introducing me to these.

------
est
It suddenly struck me that with a compressed file you don't have to
decompress and then grep: you could grep the Huffman table and know exactly
where the occurrences are.

~~~
jimktrains2
The table doesn't contain locations in the file, only a mapping from bit
sequences to original sequences. LZW likewise doesn't record where sequences
occur.

However, you could construct the compressed version of your search string and
search for that (or, in the case of LZW, generate it as you go, decompressing
only small portions as needed).

Either way, you still have to go through the file at least once.

EDIT: Actually, if your search string falls inside a compression block, it
may be more difficult without decompressing (and then discarding), since the
bit encoding varies from block to block.

------
listic
I understand that the log files are generated upstream of you, but how much
does compressing them actually win?

I guess for most of us it's better to just search plain text files, if
possible?

~~~
sp332
As a rule of thumb, English text in ASCII compresses about 8:1. I would guess
that log files compress even better because they tend to be repetitive.

Edit: well, the best case is 8:1 if you put a lot of effort into it. Gzip
with default settings is more like 4:1, which is still significant.

~~~
listic
How can you optimize for better compression ratios with gzip?

~~~
sp332
gzip's default is -6, but -9 gives better compression. If you can sort your
data or otherwise organize it so that similar data is grouped together, gzip
has a better chance of finding the redundancy. Some implementations also get
better compression than the GNU one; the best I know of is from the 7-Zip
project.

Other compression formats can give better ratios, if that's an option for
your project: bzip2, xz, 7z.
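
If you want to see the level effect for yourself (repetitive, log-like
sample data; exact ratios will vary with your input):

    import zlib

    # Repetitive, log-like sample text.
    sample = b"".join(
        b"2015-10-07 12:%02d:%02d INFO request ok user=%d\n" % (m, s, m)
        for m in range(60) for s in range(60)
    )

    for level in (1, 6, 9):
        ratio = len(sample) / len(zlib.compress(sample, level))
        print("level %d: %.1f:1" % (level, ratio))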

------
mianos
Columnar databases all do this. Handy to have a library to play with.

------
sultanofsaltin
Isn't there a company doing this called Pied Piper...

~~~
frik
Check out the technology section at
[http://www.piedpiper.com](http://www.piedpiper.com)

Btw. welcome to _Silicon Valley_.

------
frik
Strigi desktop search (2006) had a very powerful implementation
([http://en.wikipedia.org/wiki/Strigi](http://en.wikipedia.org/wiki/Strigi)).
It was possible to search through a compressed archive inside a compressed
archive, in a memory- and space-efficient way ("streaming"). Strigi uses the
Jstream C++ library, which allows for deep indexing of files; see these
slides from 2006:
[http://www.vandenoever.info/software/strigi/akademy2006.pdf](http://www.vandenoever.info/software/strigi/akademy2006.pdf).
CLucene (a C++ port of Java Lucene) also uses the Jstream library
([http://sourceforge.net/projects/clucene/](http://sourceforge.net/projects/clucene/)).
Jstream works similarly to Java's "InputStream".

Given the discontinued status of Strigi and CLucene, it would be great if
someone would maintain the Jstream C++ library, or take its ideas and
implement them in zindex.

