
Succinct Data Structures: Cramming 80,000 words into a Javascript file - bumbledraven
http://stevehanov.ca/blog/index.php?id=120
======
ot
What a wonderful hack! I thought that succinct data structures were not really
feasible in JS because of the lack of unsigned integer arithmetic, but here
they are implemented doing arithmetic directly on base64 6 bit "bytes",
without passing through JS numbers. Very clever, I wonder how fast this
implementation performs compared to jeresig's implementations (I'd say that
the bottleneck is the binary search in `select`)

BTW, if anyone is interested in succinct data structures, there are very good
libraries for Java (<http://sux4j.dsi.unimi.it/>) and C++ (<http://www.uni-
ulm.de/in/theo/research/sdsl>).

PS. the job listing at the bottom of the page is awesome, compared to the
usual BS:

> Some benefits: You'll probably get a next-generation Blackberry that you
> have to hide under the table when in restaurants. You get to know what's
> going on when a call drops or a web page isn't loading. Plus, it's a small
> team so you get to own a large chunk of the project.

Makes me almost want to apply :)

------
matt4711
Succinct data structures are very interesting.

I recently implemented an FM-Index[1] which supports the following operations:

\- count() the number of occurrences of a pattern P of length m in O(m) time.

\- locate() determine the locations of all occurrences of a pattern P.

\- extract() extract any substring from the Text

\- recover the original text

The most interesting part is that the size constructed index depends on the
compressibility of the text itself. For English text roughly half the size of
the original text.

Construction time and memory requirements, however are not that great.

[1] <http://portal.acm.org/citation.cfm?id=1082039>

~~~
ot
Did you compare your implementation with others? There has been a lot of
algorithm engineering effort on FM-indexes and CSAs (in general, full text
self-indexes), and there is a common testbed to compare the implementations:

<http://pizzachili.dcc.uchile.cl/>

~~~
matt4711
My implementation is very similar to SSA_RRR on the pizza&chilli website. The
index implementations on the website are not easy to understand (research code
+ 8 years old) so I'm trying to re-implement a few of them (especially CSA).

I used some of the test data on the website to test my implementation [1] but
didn't really compare to their other indexes.

Also make sure to always use the chilean mirror as the italian one is outdated
(2005).

[1] <https://github.com/mpetri/FM-Index>

~~~
ot
> My implementation is very similar to SSA_RRR on the pizza&chilli website.

Thanks for the code, I've seen you are using libcds's wavelet tree. I was
wondering if you had implemented your own, since apart from libcds and sdsl
(and some unreadable research code) I couldn't find any other implementation
of the WT so I'm always curious to see if there are any developments in the
engineering of wavelet trees, the current implementations are still too slow
to be really useful, I think.

~~~
matt4711
I'm trying to get myself to implement a wavelet tree. There is however a lot
of low level stuff that I'm still not sure how to implement. Especially RRR.

This paper [1] is the most recent practical wavelet tree paper I know of.

[1] <http://portal.acm.org/citation.cfm?id=1483966>

~~~
ot
There is also this one: <http://portal.acm.org/citation.cfm?id=1198521>

Instead of RRR it uses RLE and gamma codes. Oddly, I have found no comparisons
between this approach and RRR. I would expect RLE to compress better, and it
is also conceptually simpler, but I don't know how fast it is.

------
robryan
Would be interesting to see how this method compares to the benchmarks John
did in his post.

~~~
mryall
Yeah, in particular it would be great to see how the load time compares. John
found that even loading literal JSON in a JavaScript interpreter is _much_
slower than loading a comparable length string. Presumably the latter is just
a buffer copy operation.

~~~
netghost
There are a couple of things to balance. I'm building a word game myself right
now, and ran into these same issues. I found that loading a string for each
letter of the alphabet worked adequately. It is not clever in the least, and
far from optimal, but it ended up with a reasonable size / load time /
processing trade off in the browser.

