Bleve: full-text search and indexing for Go (blevesearch.com)
141 points by porjo on Jan 6, 2018 | 32 comments



Well, since someone submitted us here with no apparent reason or context, allow me to provide something of interest. (Primary contributor of bleve here.)

Just recently we merged support for a new experimental index scheme called 'scorch'. This new index scheme is designed from the ground up to reduce index size and improve performance. It features:

- a segment based approach, much like Lucene

- a vellum FST for the term dictionary - https://github.com/couchbase/vellum

- roaring bitmaps for the postings lists (see the sketch below) - https://github.com/RoaringBitmap/roaring

- and compressed chunked integer storage for all the posting details

It's still experimental at this point, but it shows a considerable indexing speedup, a significant index size reduction, and query performance similar to the index format in use today.

The code for this new index scheme can be found here: https://github.com/blevesearch/bleve/tree/master/index/scorc...
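To give a concrete feel for the postings-list piece, here is a small standalone sketch using the roaring library directly. This is not bleve's internal code, and the document numbers are invented for illustration; it just shows why bitmap intersection is a natural fit for conjunctive queries:

    package main

    import (
        "fmt"

        "github.com/RoaringBitmap/roaring"
    )

    func main() {
        // Hypothetical postings lists: the set of document numbers
        // that contain each term.
        beer := roaring.BitmapOf(1, 2, 4, 8, 100000)
        stout := roaring.BitmapOf(2, 8, 512, 100000)

        // An AND query over two terms reduces to a bitmap intersection.
        both := roaring.And(beer, stout)
        fmt.Println(both.ToArray()) // [2 8 100000]
    }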


Are there any plans to implement a sharding mechanism in blevesearch? IMHO it should be at the application level, but having it in blevesearch would be good.

Regarding the segment-based approach like Lucene's: it means we will need to do segment merging, which in my experience is quite resource intensive. I don't actually know how blevesearch handled this prior to the segment-based approach.


Bleve has support for querying across multiple indexes (shards) but does not prescribe any mechanism for splitting the data. So it's up to the application to divide the data however it sees fit, but you can use Bleve functionality to execute the same query across multiple indexes and merge the results.
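A minimal sketch of that pattern, assuming two shard indexes already exist on disk (the paths and query term here are hypothetical):

    package main

    import (
        "fmt"
        "log"

        "github.com/blevesearch/bleve"
    )

    func main() {
        // Open two existing shard indexes.
        shard0, err := bleve.Open("shard0.bleve")
        if err != nil {
            log.Fatal(err)
        }
        shard1, err := bleve.Open("shard1.bleve")
        if err != nil {
            log.Fatal(err)
        }

        // The alias fans the query out to every shard and merges results.
        alias := bleve.NewIndexAlias(shard0, shard1)
        query := bleve.NewMatchQuery("stout")
        results, err := alias.Search(bleve.NewSearchRequest(query))
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(results)
    }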

Merging is required and is indeed somewhat resource intensive. Bleve's current indexing approach has no segments; instead, all index data is serialized into a key/value store. This approach allowed us to experiment and plug in a variety of implementations. Unfortunately, the key/value abstraction limits the way you interact with data, so there are a number of drawbacks. One key gain we get from the segmented approach vs. the key/value store approach is that we no longer need to maintain a backindex to handle updates/deletes.


Yes, and if I'm not mistaken that's what the index alias in bleve is for.

I haven't looked into scorch yet. So scorch would replace other storage engines like RocksDB and LevelDB?


Yes, when you create a new index, you can choose the index implementation. The previous one was called 'upsidedown', and this one used key/value stores for the actual storage. So, you would also specify BoltDB/RocksDB/LevelDB/Moss/etc. The new index implementation is called 'scorch' and it writes directly to disk instead of going through a key/value store.
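If I'm reading the current API correctly (a sketch only; the exact names may shift while scorch is experimental), choosing between the two looks something like this:

    package main

    import (
        "log"

        "github.com/blevesearch/bleve"
        "github.com/blevesearch/bleve/index/scorch"
    )

    func main() {
        mapping := bleve.NewIndexMapping()

        // Default: the 'upsidedown' implementation over a key/value store.
        kv, err := bleve.New("upsidedown.bleve", mapping)
        if err != nil {
            log.Fatal(err)
        }
        defer kv.Close()

        // Experimental: 'scorch', writing segment files directly to disk.
        sc, err := bleve.NewUsing("scorch.bleve", mapping, scorch.Name, scorch.Name, nil)
        if err != nil {
            log.Fatal(err)
        }
        defer sc.Close()
    }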


Not directly sharding, but I've been working on a fork of Bleve that uses Cassandra as the backing store, which allows for horizontal scaling of a single index: https://github.com/wrble/flock

Still very much a work in progress but the core is functional.


See Elassandra for something similar.


Since I didn't see it with a quick look, why call it bleve? Given the logo it's clearly a reference to Boiling Liquid Expanding Vapor Explosion, but that seems an odd choice of name with no relation to the project. Do you just like fire?


I was watching one of those engineering disaster shows and thought it would make a good name for a project. I didn't find any other software projects using the name, and it seemed like it would have decent googleability. The relationship to fire/explosions has given good themes for logos and sub-project names (like scorch).


The reason is most likely that they saw this on the front page: https://news.ycombinator.com/item?id=16085873. I've noticed that people like to submit items that are tangentially related to those that pop up on the front page.


Hi,

There's a question I can't find an answer to by looking at the homepage and documentation: does it handle concurrent queries? (I wonder, since I see the store is a single file.)

That is, is it something akin to SQLite, meant to be used as an embedded engine for standalone applications, or is it fit to be used by a centralized API?


Concurrent queries are supported (not sure what you mean by the store being a single file; it is a directory of many files).

Concurrent indexing is also possible, so long as you arrange not to put duplicate document IDs into batches executing concurrently.
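A rough sketch of that pattern (the document type, ID scheme, and index path are invented for illustration):

    package main

    import (
        "fmt"
        "log"
        "sync"

        "github.com/blevesearch/bleve"
    )

    type doc struct {
        Body string
    }

    func main() {
        index, err := bleve.New("concurrent.bleve", bleve.NewIndexMapping())
        if err != nil {
            log.Fatal(err)
        }
        defer index.Close()

        // Two batches executing concurrently; the "a-"/"b-" prefixes
        // guarantee no document ID appears in both.
        var wg sync.WaitGroup
        for _, prefix := range []string{"a", "b"} {
            wg.Add(1)
            go func(prefix string) {
                defer wg.Done()
                batch := index.NewBatch()
                for i := 0; i < 1000; i++ {
                    id := fmt.Sprintf("%s-%d", prefix, i)
                    if err := batch.Index(id, doc{Body: "some text"}); err != nil {
                        log.Println(err)
                    }
                }
                if err := index.Batch(batch); err != nil {
                    log.Println(err)
                }
            }(prefix)
        }
        wg.Wait()
    }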

As for usage, it is just a library, so it is typically embedded in a single process (though this can serve multiple clients concurrently).

Distributing the index across multiple nodes is done at the application level. At Couchbase we do this with bleve in a separate project called 'cbft'.

https://github.com/couchbase/cbft


> (not sure what you mean by the store being a single file; it is a directory of many files)

I see, my bad. I saw one binary file named "store" in the directory created by the example code, and I thought that was it.

Thanks for the explanation!


I ran into some significant performance issues with Bleve during a weekend hackathon a few months ago. I was trying to index the Stack Overflow data dump for fun and I couldn't get it to complete successfully. I'm guessing I was running into some boltdb-related limitations, but I didn't have time to dig deeper before the party ended and I had to get back to the day job.

SQLite's FTS5 allowed me to load the entire data set without breaking a sweat, but query performance was unexpectedly poor for some combinations of terms, with no apparent consistency. It became unusable due to a temp table I couldn't work out how to avoid when attempting to search in descending primary key order.

It was many months ago now and I have forgotten some of the particulars, but I thought it might make an interesting jumping-off point for a discussion of any ideas anyone might have for indexing 60GB of data into a flat-file setup like Bleve or SQLite uses.

Would the new scorch experiment perform better with an index of that size than the boltdb backend?


Yeah, the old index format had many factors contributing to it taking up considerable space. As a single data point, we have a beer-search sample app that includes a data directory with 29MB of JSON files. In the old index format, with a realistic mapping configuration that indexed many fields and stored some of them, the bleve index size was over 200MB.

With scorch the index size is 22MB. Query performance is comparable (and we haven't even gotten to really tuning this yet).


Nice one! I'll keep my eyes peeled for a release and give it another crack. I think all those creaky old scripts still work.


I am using elasticsearch in production with golang as both the indexer and search service. Last holiday I played with blevesearch to make it work like Elasticsearch's search, but the work is far, far from complete. https://github.com/wejick/balasticsearch


>"I am using elasticsearch in production with golang as both the indexer and search service."

Can you describe how this golang plus elasticsearch hybrid works? Is this distinct from the link you provided?


Basically my applications consume the elasticsearch REST API. One acts as the indexer; the other one uses the search API. I'm using this amazing package: https://github.com/olivere/elastic


Yes, it is different from balasticsearch.



What kind of index size does Bleve work with well? Megabytes? Gigabytes?

I’d love to understand at what scale people are using this engine.


With the current index scheme we regularly handle tens of gigabytes (which is embarrassingly small in our opinion). We haven't done large-scale testing with scorch yet, but in some smaller configurations the new index is 1/10th the size of the previous one, so we hope to have moved the bar considerably on the data sizes that work.


I'm new to Go, but it looks like the example code completely ignores the returned errors.

Is that a common thing?


You are correct. I think when I originally designed the homepage I wanted to show how you index and search in just a few lines. I thought omitting the error handling boilerplate was acceptable, but properly handling these errors is also key for a good initial experience. So I'm persuaded to at least revisit the decision.

https://github.com/blevesearch/blevesearch.github.io-hugo/is...


It isn't common, and the code example wouldn't even compile because "err" is assigned but never used. I imagine that they included the "err" variables to show that they're available, but didn't want to clutter the example with error checks.

You can ignore errors in Go by assigning them to the blank identifier "_", but outside of very short toy examples it is a huge warning sign of terrible code. It's certainly not common to ignore errors in Go.


For examples like this it is much better to explicitly ignore the error:

_ = index.Index(identifier, your_data)

It will also pass the https://github.com/kisielk/errcheck check.

Or we can just panic on the error:

if err := index.Index(identifier, your_data); err != nil { panic(err) }
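And for completeness, a version of the homepage-style example with every error actually handled (the document struct and text are placeholders):

    package main

    import (
        "fmt"
        "log"

        "github.com/blevesearch/bleve"
    )

    type message struct {
        Body string
    }

    func main() {
        index, err := bleve.New("example.bleve", bleve.NewIndexMapping())
        if err != nil {
            log.Fatal(err)
        }
        defer index.Close()

        if err := index.Index("id1", message{Body: "bleve indexing is easy"}); err != nil {
            log.Fatal(err)
        }

        query := bleve.NewMatchQuery("easy")
        results, err := index.Search(bleve.NewSearchRequest(query))
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(results)
    }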


Is there a C++ library like this?



There is a C library: https://lucy.apache.org/



If all you care about are relatively simple queries in English, CLucene or Lucy could probably work for you.

AFAIK, there is no native C/C++ library that comes anywhere near modern Lucene. For me the killer feature is that you get analyzers for many different languages out of the box with current Lucene/Elasticsearch.



