
Show HN: TurboHist: Turbo Histogram Construction - powturbo
https://github.com/powturbo/TurboHist
======
ddorian43
1\. What's the license of TurboPFor, TurboRLE?

2\. Have you tried comparing TurboPFor with roaring bitmaps?

3\. Have you checked whether it makes sense to put it in Lucene (on the
postings lists)?

4\. What forums would you suggest for low-level stuff on compression,
indexes, search engines, db engines, mechanical sympathy?

5\. What do you work on in real life? Or is it licensing the libs?

6\. Is it possible to make Lucene 10x+ faster and what would need to be done
(at a high level)? (Like what ScyllaDB did to Cassandra, where the gains came
from the architecture, explained here:
[http://www.scylladb.com/technology/architecture/](http://www.scylladb.com/technology/architecture/))

7\. What do you think of Bing/BitFunnel using Bloom filters instead of
posting lists?
[http://bitfunnel.org/strangeloop/](http://bitfunnel.org/strangeloop/)

8\. And what about
[http://research.microsoft.com/en-us/um/people/trishulc/papers/Maguro.pdf](http://research.microsoft.com/en-us/um/people/trishulc/papers/Maguro.pdf)
? Distributing documents by 'term' instead of 'id'?

~~~
powturbo
1\. What's the license of TurboPFor, TurboRLE?

GPL (see license in the source files).

2\. Have you tried comparing TurboPFor with roaring bitmaps?

No, but because of their size overhead, roaring bitmaps can only be used for
a few integer arrays, for example when collecting intermediate results for
queries. TurboPFor functions, in contrast, have no such size limits (ex.
millions of tiny or big arrays).
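
For a feel of the API, here is a minimal sketch of round-tripping one small
array. The `p4nenc32`/`p4ndec32` names and the buffer slack are taken from
the TurboPFor README as I remember it; check vp4.h for the exact
declarations.

    /* Minimal sketch: compress and decompress one small integer array.
       Assumes the p4nenc32/p4ndec32 entry points from the TurboPFor
       README; check vp4.h for the exact declarations. */
    #include <stdint.h>
    #include <stdio.h>
    #include "vp4.h"                 /* TurboPFor PFor/PForDelta header */

    int main(void) {
        enum { N = 8 };              /* tiny arrays work too */
        uint32_t in[N]  = {3, 7, 11, 50, 51, 52, 900, 1000};
        uint32_t out[N];
        unsigned char buf[N * 4 + 1024];     /* output + generous slack */

        size_t csize = p4nenc32(in, N, buf); /* compress */
        p4ndec32(buf, N, out);               /* decompress */
        printf("%u ints -> %zu bytes\n", (unsigned)N, csize);
        return 0;
    }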

3\. Have you checked whether it makes sense to put it in Lucene (on the
postings lists)?

Some experiments with SIMD posting lists in Lucene report ~20% gains; see
[http://blog-archive.griddynamics.com/2015/06/lucene-simd-codec-benchmark-and-future.html](http://blog-archive.griddynamics.com/2015/06/lucene-simd-codec-benchmark-and-future.html).
Posting lists are only one part of a search engine. The whole architecture
must be taken into account: parallel indexing, parallel query processing,
and particularly intersections.
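
To make the intersection cost concrete, here is a generic linear-merge
intersection of two sorted doc-id lists (my own illustration, not TurboPFor
code); for very unequal list lengths, a galloping search wins:

    #include <stddef.h>
    #include <stdint.h>

    /* Intersect two sorted doc-id lists; returns the number of matches.
       Linear merge: O(m+n) comparisons. */
    size_t intersect(const uint32_t *a, size_t na,
                     const uint32_t *b, size_t nb, uint32_t *out) {
        size_t i = 0, j = 0, k = 0;
        while (i < na && j < nb) {
            if (a[i] < b[j])      i++;
            else if (a[i] > b[j]) j++;
            else { out[k++] = a[i]; i++; j++; }
        }
        return k;
    }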

4\. What forums would you suggest for low-level stuff on compression,
indexes, search engines, db engines, mechanical sympathy?

I'm not aware of dedicated forums for search engines, but there are excellent
academic papers on this subject. Some pointers:
[https://github.com/lemire](https://github.com/lemire),
[https://twitter.com/powturbo](https://twitter.com/powturbo), arxiv.org, and
papers about in-memory databases such as SAP HANA.

5\. What do you work on in real life? Or is it licensing the libs?

?

6\. Is it possible to make Lucene 10x+ faster and what would need to be done
(at a high level)?

A multicore, parallel, shared-nothing architecture like the one in the
TurboPFor inverted index app, and a RAM-resident inverted index.

7\. What do you think of Bing/BitFunnel using Bloom filters instead of
posting lists?
[http://bitfunnel.org/strangeloop/](http://bitfunnel.org/strangeloop/)

Bloom filters are lossy by definition, and here again posting lists are only
one part of a search engine. Currently with TurboPFor, the most frequent
terms take on average less than one bit per posting for document ids. Term
positions, which are required for phrase queries and take up the largest part
of an inverted index, are difficult to compress.
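
The sub-one-bit figure is consistent with the information-theoretic bound: a
term present in a fraction p of all documents costs at least H(p)/p bits per
posting, which drops below one bit for very common terms. A quick check (my
own illustration, not TurboPFor code):

    #include <math.h>
    #include <stdio.h>

    /* Entropy bound for doc-id postings: a term present in a fraction p
       of all documents costs at least H(p)/p bits per posting. */
    static double bits_per_posting(double p) {
        double h = -(p * log2(p) + (1.0 - p) * log2(1.0 - p));
        return h / p;
    }

    int main(void) {
        const double ps[] = {0.25, 0.50, 0.90};
        for (int i = 0; i < 3; i++)  /* p=0.90 -> ~0.52 bits/posting */
            printf("p=%.2f -> %.2f bits/posting\n",
                   ps[i], bits_per_posting(ps[i]));
        return 0;
    }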

8\. And what about
[http://research.microsoft.com/en-us/um/people/trishulc/papers/Maguro.pdf](http://research.microsoft.com/en-us/um/people/trishulc/papers/Maguro.pdf)
? Distributing documents by 'term' instead of 'id'?

It is always possible to partition by both term and doc ids. For large-scale
search engines there is no alternative to a RAM-resident (memory-mapped)
inverted index if you want maximum performance. See, for example, the
inverted index app in TurboPFor.

~~~
ddorian43
Thanks, I really appreciate it. I already follow Lemire. I think I've also
seen you on encode.ru, though that's only for compression. There's also
[https://groups.google.com/forum/#!forum/mechanical-sympathy](https://groups.google.com/forum/#!forum/mechanical-sympathy).
Regarding the answers:

3\. About "parallel query processing". You mean using multiple cores for one
query? Since the data is in RAM, I thought a single thread would be more
efficient? Or do you mean parallel over the segments (like Lucene does)?

5\. I meant, what do you do to put food on the table? Like Lemire is a
professor. Are you employed, a contractor, or a business? (Just curious.)

9\. What about a "global terms dictionary"? Or a global
'document-id-->numeric-id' dictionary? Like a service where you send a
term/doc-id (say a URL) and get back the numeric id? Or would those make the
posting lists not compressible? (Just thinking of ways to lower the overhead
on each shard so it can take more documents.)

10\. What about GPU/FPGA/ASIC? Do you know of any paper/way they can be used
for a search engine and provide enough value compared to just adding more
CPU/RAM? Or are they limited too much by low memory? If yes, what about
having an SSD (NVMe?) on the GPU like AMD intends to? Or using NVMe as a RAM
extension, since it 'acts' like slower, persistent RAM?

~~~
powturbo
Thanks for your nice feedback.

3\. About "parallel query processing"...

I mean, each thread (index server) is responsible for only one
shard/segment/partition. You need a local broker (one per PC/node) to read
client queries and dispatch them to all index server threads. Each thread
processes the query on its own partition and sends the results back to the
broker. The broker collects and merges the results and sends the answer to
the client. The client can be another, global broker controlling several
multicore PCs. Most of this is implemented in the TurboPFor inverted index
demo.
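
As a toy sketch of that per-node layout (my own illustration, not the
TurboPFor demo code): one worker thread per shard, and the broker fans a
query out and merges the results.

    #include <pthread.h>
    #include <stdio.h>

    /* Toy shared-nothing layout: one worker thread per shard; the broker
       fans the query out and merges the per-shard hit counts. */
    #define NSHARDS 4

    typedef struct { int shard_id; const char *query; long hits; } task_t;

    static void *index_server(void *arg) {
        task_t *t = (task_t *)arg;
        /* Real code would run t->query against shard t->shard_id. */
        t->hits = 100 + t->shard_id;          /* placeholder result */
        return NULL;
    }

    int main(void) {
        pthread_t th[NSHARDS];
        task_t tasks[NSHARDS];
        long total = 0;
        for (int i = 0; i < NSHARDS; i++) {   /* broker: dispatch */
            tasks[i] = (task_t){ i, "foo AND bar", 0 };
            pthread_create(&th[i], NULL, index_server, &tasks[i]);
        }
        for (int i = 0; i < NSHARDS; i++) {   /* broker: collect + merge */
            pthread_join(th[i], NULL);
            total += tasks[i].hits;
        }
        printf("merged hits: %ld\n", total);
        return 0;
    }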

9\. What about a "global terms dictionary"? Or a global
'document-id-->numeric-id' dictionary?

I think term dictionaries are not a problem; you can have one lexicon per
PC/node (common to all local partitions) accessed by all index server
threads.
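
For instance, a read-only lexicon can be shared by all threads without
locking. A toy sketch (my own illustration; a real lexicon would use a hash
table or a front-coded term array):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Toy read-only lexicon shared by all index-server threads:
       never mutated after startup, so no locking is needed. */
    static const char *lexicon[] = {"apple", "banana", "cherry"}; /* sorted */
    #define NTERMS (sizeof lexicon / sizeof lexicon[0])

    static int cmp(const void *key, const void *elem) {
        return strcmp((const char *)key, *(const char *const *)elem);
    }

    /* Returns the term id (index into the lexicon), or -1 if absent. */
    static long term_id(const char *term) {
        const char **p = bsearch(term, lexicon, NTERMS, sizeof *lexicon, cmp);
        return p ? (long)(p - lexicon) : -1;
    }

    int main(void) {
        printf("banana -> %ld\n", term_id("banana"));   /* prints 1 */
        return 0;
    }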

10\. What about GPU/FPGA/ASIC? Do you know of any paper/way they can be used
for a search engine and provide enough value compared to just adding more
CPU/RAM?

For maximum performance, you must have the whole index in the GPU's own
memory, otherwise the memory transfer will be the bottleneck. All query
processing, like decoding and intersections, must be done on the GPU.
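
A rough back-of-the-envelope shows why; the bandwidth numbers are assumptions
(~16 GB/s for PCIe 3.0 x16, ~500 GB/s for on-device GPU memory):

    #include <stdio.h>

    /* Back-of-the-envelope: moving a hypothetical 10 GB index over
       PCIe 3.0 x16 (~16 GB/s, assumed) vs reading it from GPU memory
       (~500 GB/s, assumed). */
    int main(void) {
        const double index_gb = 10.0;
        printf("PCIe transfer: %.2f s\n", index_gb / 16.0);
        printf("GPU memory:    %.3f s\n", index_gb / 500.0);
        return 0;
    }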

~~~
ddorian43
3\. I was thinking more of mmapping the index; assuming it's smaller than
RAM, it will always stay in RAM. This way all cores have access to the same
index and can serve all queries (no separate data structures per core, plus
one less merge). Or do you mean it's faster this way (how? is local memory
faster than mmapped in-memory data?)?

~~~
powturbo
Yes, you can simply mmap the index file; the index servers don't have to read
the index at startup. The goal of query processing is to decrease query
latency, which is why partitioning is better/necessary: each query can be
processed simultaneously by all threads. Intersections are very costly for
long posting lists.
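
A minimal POSIX sketch of that setup (generic illustration; the file name is
hypothetical): the mapping is shared, so every index server thread reads the
same physical pages.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map an index file read-only; pages stay in the OS page cache, so
       all threads share one copy and there is no load step at startup. */
    int main(void) {
        int fd = open("index.bin", O_RDONLY);  /* hypothetical file name */
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        fstat(fd, &st);
        const unsigned char *idx =
            mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (idx == MAP_FAILED) { perror("mmap"); return 1; }
        /* ... all index-server threads read idx[] directly ... */
        munmap((void *)idx, st.st_size);
        close(fd);
        return 0;
    }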

