
Space-Efficient Construction of Compressed Indexes in Deterministic Linear Time - luu
https://arxiv.org/abs/1607.04346
======
zitterbewegung
This looks really awesome. According to Google Scholar it went through peer
review. Improving compression (linear-time LZ77) alone would be cool, but it
does more than that: it also computes search-index data structures.
[https://scholar.google.com/scholar?hl=en&q=Space-
Efficient+C...](https://scholar.google.com/scholar?hl=en&q=Space-
Efficient+Construction+of+Compressed+Indexes+in+Deterministic+Linear+Time&btnG=&as_sdt=1%2C14&as_sdtp=)

~~~
faragon
LZ77 parsing was already linear time using hash tables for the match search
(e.g. lzop [1], LZ4 [2], LZF [3], and other LZ77 compressors using the same
technique); a rough sketch of that approach follows the links below.

[1] [https://www.lzop.org/](https://www.lzop.org/)

[2] [https://github.com/lz4/lz4](https://github.com/lz4/lz4)

[3]
[http://oldhome.schmorp.de/marc/liblzf.html](http://oldhome.schmorp.de/marc/liblzf.html)
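
For readers who haven't seen the trick, here is a rough sketch of that
fixed-size hash-table matcher (constants are illustrative; none of this is
taken from the lzop/LZ4/LZF sources). Each position's 4-byte prefix is hashed
into a small table that remembers the most recent position with the same
hash; a hit is verified and greedily extended, so both the time per byte and
the table size stay bounded.

    // Sketch of greedy LZ77 parsing with a fixed-size hash table, in the
    // spirit of lzop/LZ4/LZF. Constants are illustrative, not taken from
    // any of the linked implementations.
    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <string>
    #include <vector>

    static const int HASH_BITS = 16;    // 2^16 entries: fixed, i.e. O(1) extra space
    static const size_t MIN_MATCH = 4;

    static uint32_t hash4(const uint8_t* p) {
        uint32_t v;
        std::memcpy(&v, p, 4);
        return (v * 2654435761u) >> (32 - HASH_BITS);   // multiplicative hash
    }

    void lz77_parse(const std::vector<uint8_t>& in) {
        std::vector<int32_t> table(1 << HASH_BITS, -1); // hash -> last position seen
        size_t i = 0, n = in.size();
        while (i + MIN_MATCH <= n) {
            uint32_t h = hash4(&in[i]);
            int32_t cand = table[h];
            table[h] = static_cast<int32_t>(i);
            if (cand >= 0 && std::memcmp(&in[cand], &in[i], MIN_MATCH) == 0) {
                size_t len = MIN_MATCH;
                while (i + len < n && in[cand + len] == in[i + len]) ++len;
                std::printf("match  offset=%zu length=%zu\n", i - cand, len);
                i += len;                               // each byte is visited O(1) times
            } else {
                std::printf("literal '%c'\n", in[i]);
                ++i;
            }
        }
        for (; i < n; ++i) std::printf("literal '%c'\n", in[i]);
    }

    int main() {
        std::string s = "abcabcabcabcxyz";
        lz77_parse(std::vector<uint8_t>(s.begin(), s.end()));
    }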

~~~
zitterbewegung
Sorry, I misread the paper; I meant linear-space LZ77.

~~~
faragon
Those fast LZ77 compressors I pointed to before use linear time (O(n)) and a
constant-size search state (fixed-size hash tables), which is O(1) space,
even better than O(n) (linear) space complexity.

~~~
wolf550e
To get optimal LZ parsing, you need a lot of space to store back-references,
and when you increase the window size that space grows a lot. I don't know
what zstd does, but I would love to see it improved by algorithmic magic.

------
gnavarro69
Hi, a note from one of the coauthors of the paper ;-)

Yes, it is a theoretical paper, and so the presentation focuses on what is
needed in such a paper: establishing that the big-Oh complexity can be
reached. Some theoretical papers will never become practical, but we believe
this one can. However, this will require a fair amount of algorithm
engineering: taking the important theoretical ideas and implementing them
with common sense, replacing theoretically appealing but practically useless
modules with others that work better, even if they do not reach the desired
worst-case complexity. A promising aspect of this algorithm is that it is
easily parallelizable (the batched queries). A postdoc of mine who has
experience in multithreading and compact data structures is already working
on this. I won't dare to say how competitive the resulting LZ-parsing
algorithm will be, though.

We have had some experience with good theoretical results whose
implementation looked insurmountable but which turned out to work very well
after some algorithm engineering. An example is our work on top-k document
retrieval:
[http://epubs.siam.org/doi/abs/10.1137/140998949](http://epubs.siam.org/doi/abs/10.1137/140998949)
[http://dl.acm.org/citation.cfm?id=3043958](http://dl.acm.org/citation.cfm?id=3043958)

Another well-known case is the first paper on the FM-index, with constants
like sigma^sigma multiplying the space, and its current realizations, e.g.
Burrows-Wheeler Aligner.

In this case, I think it will not be that hard, but still it will not be a
matter of just translating the algorithm into code.

Best, Gonzalo

------
pzh
Can somebody who read the paper in depth comment on whether the result is
practical or not? Since it's a TCS paper, it's hard to gauge whether the
constant factors are palatable or not, and there are many theoretical CS
papers that give asymptotically optimal or significantly improved algorithms
that have astronomical constant factors (e.g. matrix multiplication, etc.)

~~~
eggie
The compressed suffix tree (CST) supports a number of search algorithms that
run in time on the order of the query length, irrespective of the size of the
corpus being searched. That the suffix tree is compressed allows it to use
space that is often not much larger than the input sequence, and this effect
improves for inputs with internal repetitions, which seem to be the norm in
naturally-arising data (i.e. almost anywhere that Zipf's law holds).

As the authors point out, existing algorithms for the construction of the
compressed suffix tree require space that is proportional to the alphabet
size. When we work with limited alphabets, like DNA, this factor is small and
we tend to not worry about it. Many real-world data sets do have larger
alphabet sizes, and so there is a need to efficiently generate the CST for
them. Removing the alphabet-size bound on the space required for construction
is a big deal. Existing methods require huge amounts of space to construct the
CST or CSA (compressed suffix array) for large alphabets, and are barely
practical to use.

In practice, results of this kind have preceded usable implementations by a
year or more. Libraries like sdsl-lite have accelerated the rate at which
implementations get into the hands of compressed data structure researchers,
in much the same way that R did for statistical models, so hopefully we will
see the benefits of this result sooner.

Unfortunately, it will take a deeper reading of the paper by someone more
intimately involved in this research to say exactly why this result may or
may not be practical. A quick read does not suggest any obvious problems; the
authors have a good track record and are very positive about many follow-on
results (see the conclusions).

~~~
lorenzhs
SDSL-lite is a fantastic library; Simon (the main author) is one of my
colleagues and spends a lot of time making it easy to use:
[https://github.com/simongog/sdsl-lite](https://github.com/simongog/sdsl-lite)
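
For a flavour of how it is used, here is a minimal sketch based on the
library's documented construct_im/count/locate interface (a sketch only;
check the docs for the exact headers and types):

    // Build an FM-index-style compressed suffix array in memory and query it.
    #include <sdsl/suffix_arrays.hpp>
    #include <iostream>
    #include <string>

    int main() {
        sdsl::csa_wt<> csa;                      // compressed suffix array over a wavelet tree
        sdsl::construct_im(csa, std::string("abracadabra"), 1);  // 1 = byte alphabet
        std::string q = "bra";
        std::cout << sdsl::count(csa, q) << " occurrences\n";    // expect 2
        for (auto pos : sdsl::locate(csa, q.begin(), q.end()))
            std::cout << pos << " ";             // positions 1 and 8 (order may vary)
        std::cout << "\n";
    }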

Without having read the paper in detail, SODA (the conference where it was
presented) papers do have a reputation for being very theoretical, though, so
it may not be implemented any time soon.

~~~
cletus
That does look fantastic. Great docs too. Sadly GOL :(. This can severely
restrict where it can be used.

~~~
lorenzhs
I'm assuming you meant GPL? I've got good news for you then, the next version
is going to be under a BSD license: [https://github.com/xxsds/sdsl-
lite](https://github.com/xxsds/sdsl-lite) (currently unstable!)

------
visarga
I used to play with suffix arrays a long time ago. I wanted to accelerate
grep on a gigabyte text file. The tool was called "sary" (short for suffix
array) and still exists on a forgotten SourceForge page. It was a good tool,
able to find any substring in a huge file instantly.

~~~
tmzt
How might your method compare to the tech in ripgrep or The Silver Searcher?
Are there cases where it might be faster?

~~~
nialo
They solve different problems. In particular, ripgrep and friends are designed
to search arbitrary and potentially large directories with no pre-computation.
They run in time ~linear in the size of the files to be searched.

The paper under discussion here is about a new way to create an index, which
also takes time ~linear in the size of the files to be searched, although
presumably with a higher constant factor than just searching those files.
Once you have the index, you can search in time linear in the length of the
query rather than in the size of the files. This is much faster, but it
requires storing an index that is roughly comparable in size to the original
file set (and several times larger for uncompressed suffix arrays), and
keeping it up to date as things change, etc.
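
To make the trade-off concrete, here is a toy (uncompressed, naively built)
suffix array sketch: construction touches the whole text, but afterwards a
substring query costs O(m log n) for a pattern of length m, independent of
how large the text is. This is only an illustration; the paper's structures
are compressed and built far more cleverly.

    // Naive suffix array: slow to build (fine for a toy example; real
    // constructions are linear time), but queries depend only on the
    // pattern length times log n.
    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    std::vector<int> build_suffix_array(const std::string& s) {
        std::vector<int> sa(s.size());
        for (size_t i = 0; i < s.size(); ++i) sa[i] = static_cast<int>(i);
        std::sort(sa.begin(), sa.end(), [&](int a, int b) {
            return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
        });
        return sa;
    }

    // Returns one position where `pat` occurs, or -1.
    int find(const std::string& s, const std::vector<int>& sa, const std::string& pat) {
        auto it = std::lower_bound(sa.begin(), sa.end(), pat,
            [&](int pos, const std::string& p) {
                return s.compare(pos, p.size(), p) < 0;  // suffix prefix < pattern?
            });
        if (it != sa.end() && s.compare(*it, pat.size(), pat) == 0) return *it;
        return -1;
    }

    int main() {
        std::string text = "the quick brown fox jumps over the lazy dog";
        std::vector<int> sa = build_suffix_array(text);
        std::cout << find(text, sa, "brown") << "\n";  // 10
        std::cout << find(text, sa, "cat") << "\n";    // -1
    }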

~~~
nightcracker
Can you retrieve the original file from the index? If so, it might be
interesting for some databases to store files that way by default and only
reconstruct the original file when needed.

~~~
nialo
I haven't read this paper, and haven't worked with the compressed variety of
suffix trees/arrays. That said, I'm confident it's possible to retrieve the
original file from a normal suffix tree, although it would be pretty slow. I
imagine it must be possible to retrieve the file from the compressed version
as well.

If nothing else, I think it is possible to retrieve a complete file from any
index that lets you search for substrings and get a full string and position
in the file back. Just search for all length-1 substrings, get a map from
position -> character, then reconstruct the file from that (toy sketch
below).
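
A toy illustration of that argument, with made-up names (and obviously not
how you would do it in practice): query each alphabet symbol, collect the
(position, character) hits, and write them back out in position order.

    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Stand-in for "the index": here it just scans a hidden copy of the text.
    std::vector<std::pair<size_t, char>> query_positions(const std::string& text, char c) {
        std::vector<std::pair<size_t, char>> hits;
        for (size_t i = 0; i < text.size(); ++i)
            if (text[i] == c) hits.push_back({i, c});
        return hits;
    }

    std::string reconstruct(const std::string& text, const std::string& alphabet) {
        std::map<size_t, char> by_pos;                   // position -> character
        for (char c : alphabet)
            for (auto& hit : query_positions(text, c))
                by_pos[hit.first] = hit.second;
        std::string out;
        for (auto& kv : by_pos) out += kv.second;        // keys come back sorted
        return out;
    }

    int main() {
        std::cout << reconstruct("banana", "abn") << "\n";   // prints "banana"
    }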

I doubt it's worth storing files this way, because turning the index back
into the file sounds very slow. I'd rather just store the file and the index;
the time vs. space tradeoff seems like a good one if you really care about
search performance. That said, I use The Silver Searcher, and it's fast
enough with no index that I don't think any of this stuff is worth the effort
for searching text files on a file system.

------
wolf550e
It will take me a while to read the paper. I saw there is no code in it. Do
they have code that improves LZ parsing as used in general-purpose
compressors? If they don't have code, can their algorithm be implemented to
improve the performance of zstd or lzma?

~~~
vecter
I skimmed the abstract and the paper and didn't see any, but that doesn't
surprise me either. This is a theoretical computer science paper. It doesn't
need code to show correctness, as long as they prove their claims. I'm curious
why you're seeking such code?

edit: why the downvotes? Theoretical CS academics basically do math and very
few write any actual code to supplement their papers.

~~~
rckclmbr
I think you are downvoted because he asked a question and you didn't attempt
to answer it. You're not wrong, just off topic for what was expected.

~~~
vecter
Fair enough. I answered the question implicitly but perhaps I should have been
more explicit.

------
wfunction
Anybody going to implement this and share it open-source? :)

~~~
lifepillar
You may ask this guy:
[https://github.com/nicolaprezza](https://github.com/nicolaprezza)

He already has some interesting code :)

------
burntrelish1273
Does this imply that zlib and gzip could be patched to reduce memory usage
and runtime?

~~~
wolf550e
If you can, you should upgrade from DEFLATE (the compression format in zlib,
gzip, zip files, PNG, office docs, etc.) to something more modern like zstd.

If you must preserve the ability to decompress with zlib, there are different
implementations of a DEFLATE compressor that are better than zlib (either
faster or with a better compression ratio).

A DEFLATE compressor needs to keep a data structure that can answer the query
"where in the last 32,768 bytes of input are there occurrences of the exact 3
bytes I'm looking at? If possible, check whether the 4th byte matches too,
and find the longest match. Among equally long matches, prefer the closer
ones." For a 32 KB window, a data structure that always finds the best match
is not large, unless you're on a microcontroller. The size of the data
structure becomes a problem when the window is large.
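
For the curious, here is a hedged sketch of the kind of structure a
zlib-style compressor keeps for that query: a head table plus hash chains
over the last 32 KB. It mirrors the general idea, not zlib's actual code or
tuning constants.

    // Hash-chain match finder over a 32 KB window, DEFLATE-style.
    // Constants and names are illustrative, not taken from zlib.
    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    static const size_t WINDOW = 32768;   // DEFLATE's maximum back-reference distance
    static const int HASH_BITS = 15;
    static const size_t MIN_MATCH = 3;

    static uint32_t hash3(const uint8_t* p) {
        return ((p[0] << 10) ^ (p[1] << 5) ^ p[2]) & ((1u << HASH_BITS) - 1);
    }

    struct MatchFinder {
        std::vector<int32_t> head;   // hash -> most recent position with that hash
        std::vector<int32_t> prev;   // position -> previous position with the same hash
        const std::vector<uint8_t>& data;

        explicit MatchFinder(const std::vector<uint8_t>& d)
            : head(1 << HASH_BITS, -1), prev(d.size(), -1), data(d) {}

        // Record position i so later positions can find it.
        void insert(size_t i) {
            uint32_t h = hash3(&data[i]);
            prev[i] = head[h];
            head[h] = static_cast<int32_t>(i);
        }

        // Best (offset, length) match for position i within the window, or (0, 0).
        std::pair<size_t, size_t> best_match(size_t i) const {
            size_t best_off = 0, best_len = 0;
            // Walk the chain newest-first, so on equal lengths the closer match wins.
            for (int32_t c = head[hash3(&data[i])]; c >= 0 && i - c <= WINDOW; c = prev[c]) {
                size_t len = 0;
                while (i + len < data.size() && data[c + len] == data[i + len]) ++len;
                if (len > best_len) { best_len = len; best_off = i - c; }
            }
            if (best_len < MIN_MATCH) return {0, 0};
            return {best_off, best_len};
        }
    };

    int main() {
        std::string s = "abcabcabcabc";
        std::vector<uint8_t> data(s.begin(), s.end());
        MatchFinder mf(data);
        // A real compressor would skip ahead after emitting a match;
        // here we just report the best match available at every position.
        for (size_t i = 0; i + MIN_MATCH <= data.size(); ++i) {
            auto m = mf.best_match(i);
            if (m.second) std::printf("pos %zu: offset=%zu length=%zu\n", i, m.first, m.second);
            mf.insert(i);
        }
    }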

