
Bleve: full-text search and indexing for Go - porjo
http://www.blevesearch.com/
======
mschoch
Well since someone submitted us here with no apparent reason or context, allow
me to provide something of interest. (primary contributor of bleve here)

Just recently we merged support for a new experimental index scheme called
'scorch'. This new index scheme is designed from the ground up to reduce index
size and improve performance. It features:

- a segment-based approach, much like Lucene

- a vellum FST for the term dictionary -
[https://github.com/couchbase/vellum](https://github.com/couchbase/vellum)

- roaring bitmaps for the postings lists -
[https://github.com/RoaringBitmap/roaring](https://github.com/RoaringBitmap/roaring)

- and compressed, chunked integer storage for all the posting details

It's still experimental at this point, but it shows considerable indexing
speedup and index size reduction, with query performance similar to the
current index format.

The code for this new index scheme can be found here:
[https://github.com/blevesearch/bleve/tree/master/index/scorc...](https://github.com/blevesearch/bleve/tree/master/index/scorch)

~~~
wejick
Are there any plans to implement a sharding mechanism in blevesearch? IMHO it
should be at the application level, but having it in blevesearch would be
good.

Regarding the segment-based approach like Lucene's: it means we will need to
do segment merging, which in my experience is quite resource intensive. I
don't actually know how blevesearch handled this before the segment-based
approach.

~~~
mschoch
Bleve has support for querying across multiple indexes (shards) but does not
prescribe any mechanism to split the data. So, it's up to the application to
divide the data how it sees fit, but you can use Bleve functionality to
execute the same query across multiple indexes and merge the results.
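
Conceptually, that fan-out-and-merge looks like the following stdlib-only sketch (the `hit` type and shard functions are hypothetical stand-ins, not bleve's API):

```go
package main

import (
	"fmt"
	"sort"
)

// hit is a hypothetical search result: a document ID plus relevance score.
type hit struct {
	ID    string
	Score float64
}

// searchShards runs the same query against every shard and merges the
// results by descending score -- conceptually what querying across
// multiple bleve indexes does.
func searchShards(shards []func(query string) []hit, query string) []hit {
	var merged []hit
	for _, search := range shards {
		merged = append(merged, search(query)...)
	}
	sort.Slice(merged, func(i, j int) bool { return merged[i].Score > merged[j].Score })
	return merged
}

func main() {
	// two fake shards, each owning a disjoint slice of the documents
	shard1 := func(q string) []hit { return []hit{{"doc1", 0.9}, {"doc4", 0.2}} }
	shard2 := func(q string) []hit { return []hit{{"doc7", 0.5}} }
	for _, h := range searchShards([]func(string) []hit{shard1, shard2}, "beer") {
		fmt.Println(h.ID, h.Score)
	}
}
```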

Merging is required and is indeed somewhat resource intensive. Bleve's current
indexing approach has no segments, instead all index data is serialized into a
key/value store. This approach allowed us to experiment and plug in a variety
of implementations. Unfortunately, the key/value abstraction limits the way
you interact with data, so there are a number of drawbacks. One key gain we
get from the segmented approach vs the key/value store approach is that we no
longer need to maintain a backindex to handle updates/deletes.

~~~
wejick
Yes, and if I'm not mistaken that's what the alias in bleve is for.

I haven't looked into scorch yet. So scorch would replace other storage
engines like RocksDB and LevelDB?

~~~
mschoch
Yes, when you create a new index, you can choose the index implementation. The
previous one, called 'upsidedown', uses key/value stores for the actual
storage, so you would also specify BoltDB/RocksDB/LevelDB/Moss/etc. The new
index implementation is called 'scorch', and it writes directly to disk
instead of going through a key/value store.

------
shabbyrobe
I ran into some significant performance issues with Bleve during a weekend
hackathon a few months ago. I was trying to index the stackoverflow data dump
for fun and I couldn't get it to complete successfully. I'm guessing I was
running into some boltdb-related limitations, but I didn't have the time to
dig deeper before the party ended and I had to get back to the day job.

SQLite's FTS5 allowed me to load the entire data set without breaking a sweat,
but query performance was unexpectedly poor for some combinations of terms,
with no apparent consistency. It ultimately became unusable due to a temp
table I couldn't work out how to avoid when attempting to search in
descending primary key order.

It was many months ago now and I have forgotten some of the particulars, but I
thought it might make an interesting jumping-off point for discussion about
any ideas anyone might have for indexing 60GB of data into a flat-file setup
like Bleve or SQLite uses.

Would the new scorch experiment perform better with an index of that size than
the boltdb backend?

~~~
mschoch
Yeah, the old index format had many factors contributing to it taking up
considerable space. As a single data point, we have a beer-search sample app
that includes a data directory with 29MB of JSON files. In the old index
format, with a mapping that used a realistic configuration (indexing many
fields and storing some of them), the bleve index size was over 200MB.

With scorch the index size is 22MB. Query performance is comparable (and we
haven't even gotten to really tuning this yet).

~~~
shabbyrobe
Nice one! I'll keep my eyes peeled for a release and give it another crack. I
think all those creaky old scripts still work.

------
wejick
I am using elasticsearch in production, with golang as both the indexer and
the search service. Last holiday I played with blevesearch to make it work
like Elasticsearch, but the work is far, far from complete.
[https://github.com/wejick/balasticsearch](https://github.com/wejick/balasticsearch)

~~~
bogomipz
>"I am using elasticsearch on production with golang as the both indexer and
search service."

Can you describe how this golang plus elasticsearch hybrid works? Is this
distinct from the link you provided?

~~~
wejick
Basically my applications consume the elasticsearch REST API. One acts as the
indexer; the other uses the search API. I'm using this amazing package
[https://github.com/olivere/elastic](https://github.com/olivere/elastic)

------
est
Some related work:

[https://github.com/go-ego/riot](https://github.com/go-ego/riot)

[https://github.com/huichen/wukong](https://github.com/huichen/wukong)

------
st3fan
What kind of index size does Bleve work with well? Megabytes? Gigabytes?

I’d love to understand at what scale people are using this engine.

~~~
mschoch
With the current index scheme, tens of gigabytes is routine (which is
embarrassingly small in our opinion). We haven't done large-scale testing with
scorch yet, but in some smaller configurations the new index is 1/10th the
size of the previous one, so we hope to have moved the bar considerably on the
data sizes that work well.

------
lima
I'm new to Go - but it looks like the example code completely ignores the
error codes.

Is that a common thing?

~~~
mschoch
You are correct. I think when I originally designed the homepage I wanted to
show how you index and search in just a few lines. I thought omitting the
error handling boilerplate was acceptable, but properly handling these errors
is also key for a good initial experience. So I'm persuaded to at least
revisit the decision.

[https://github.com/blevesearch/blevesearch.github.io-hugo/issues/12](https://github.com/blevesearch/blevesearch.github.io-hugo/issues/12)
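
To make the tradeoff concrete, here is the difference in Go with a hypothetical `openIndex` constructor (not bleve's actual API): discarding the error silently masks the failure, while checking it surfaces the cause immediately:

```go
package main

import (
	"errors"
	"fmt"
)

// openIndex is a hypothetical stand-in for any constructor that can fail,
// e.g. opening an index at a bad path.
func openIndex(path string) (string, error) {
	if path == "" {
		return "", errors.New("openIndex: empty path")
	}
	return "index@" + path, nil
}

func main() {
	// Example-style code often discards the error, hiding the failure:
	idx, _ := openIndex("")
	fmt.Printf("silently got %q\n", idx) // empty result, no explanation

	// Idiomatic Go checks every returned error:
	idx, err := openIndex("")
	if err != nil {
		fmt.Println("handled:", err)
		return
	}
	fmt.Println("opened", idx)
}
```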

------
billconan
is there a c++ library like this?

~~~
hroman
Xapian [https://xapian.org/](https://xapian.org/)

