Well, since someone submitted us here with no apparent reason or context, allow me to provide something of interest. (primary contributor of bleve here)
Just recently we merged support for a new experimental index scheme called 'scorch'. This new index scheme is designed from the ground up to reduce index size and improve performance. It features:
- a segment-based approach, much like Lucene
- vellum FST for the term dictionary - https://github.com/couchbase/vellum
- roaring bitmaps for the postings lists (see the sketch below) - https://github.com/RoaringBitmap/roaring
- and compressed chunked integer storage for all the posting details
It's still experimental at this point, but it shows considerable indexing speedup, index size reduction, and query performance similar to that of the old index format used today.
The code for this new index scheme can be found here: https://github.com/blevesearch/bleve/tree/master/index/scorc...
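To give a feel for why roaring bitmaps fit postings lists, here is a minimal sketch (not bleve's internals; the doc numbers are made up) using the same RoaringBitmap/roaring package to intersect two postings lists, which is essentially what an AND query does:

```go
package main

import (
	"fmt"

	"github.com/RoaringBitmap/roaring"
)

func main() {
	// Hypothetical postings lists: the doc numbers containing each term.
	beer := roaring.BitmapOf(1, 5, 9, 12, 100000)
	stout := roaring.BitmapOf(5, 9, 50, 100000)

	// An AND query intersects the two compressed bitmaps directly,
	// without expanding them into plain integer slices first.
	both := roaring.And(beer, stout)

	fmt.Println(both.ToArray())        // [5 9 100000]
	fmt.Println(both.GetCardinality()) // 3
}
```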
Are there any plans to implement a sharding mechanism in blevesearch?
IMHO it should be at the application level, but having it in blevesearch would be good.
Regarding a segment-based approach like Lucene's: it means we will need to do segment merging, which in my experience is quite resource intensive. I don't actually know how blevesearch handled this prior to the segment-based approach.
Bleve has support for querying across multiple indexes (shards) but does not prescribe any mechanism to split the data. So, it's up to the application to divide the data how it sees fit, but you can use Bleve functionality to execute the same query across multiple indexes and merge the results.
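For example, a minimal sketch (the shard paths and query term are made up) using bleve's IndexAlias, which fans a search out to its member indexes and merges the results:

```go
package main

import (
	"fmt"
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	// Open two shards the application has already split the data into.
	shard0, err := bleve.Open("shard0.bleve")
	if err != nil {
		log.Fatal(err)
	}
	shard1, err := bleve.Open("shard1.bleve")
	if err != nil {
		log.Fatal(err)
	}

	// The alias executes the same query against every member index
	// and merges the hits into a single result set.
	alias := bleve.NewIndexAlias(shard0, shard1)

	req := bleve.NewSearchRequest(bleve.NewMatchQuery("stout"))
	res, err := alias.Search(req)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}
```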
Merging is required and is indeed somewhat resource intensive. Bleve's current indexing approach has no segments; instead, all index data is serialized into a key/value store. This approach allowed us to experiment and plug in a variety of implementations. Unfortunately, the key/value abstraction limits the way you interact with the data, so there are a number of drawbacks. One key gain we get from the segmented approach vs. the key/value store approach is that we no longer need to maintain a backindex to handle updates/deletes.
Yes, when you create a new index, you can choose the index implementation. The previous one, called 'upsidedown', used key/value stores for the actual storage, so you would also specify BoltDB/RocksDB/LevelDB/Moss/etc. The new index implementation, called 'scorch', writes directly to disk instead of going through a key/value store.
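Roughly, that choice is made at index-creation time via NewUsing; a sketch (the paths are made up, and the scorch pairing assumes the current experimental wiring):

```go
package main

import (
	"log"

	"github.com/blevesearch/bleve"
	"github.com/blevesearch/bleve/index/scorch"
	"github.com/blevesearch/bleve/index/upsidedown"
)

func main() {
	mapping := bleve.NewIndexMapping()

	// upsidedown serializes everything through a key/value store
	// (BoltDB by default; RocksDB/LevelDB/Moss are alternatives).
	udIdx, err := bleve.NewUsing("old.bleve", mapping,
		upsidedown.Name, bleve.Config.DefaultKVStore, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer udIdx.Close()

	// scorch writes its segments directly to disk, so no KV store
	// is configured.
	scIdx, err := bleve.NewUsing("new.bleve", mapping,
		scorch.Name, scorch.Name, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer scIdx.Close()
}
```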
Not directly sharding, but I've been working on a fork of Bleve that uses Cassandra as the backing store, which allows for horizontal scaling of a single index: https://github.com/wrble/flock
It's still very much a work in progress, but the core is functional.
Since I didn't find an answer with a quick look: why call it bleve? Given the logo, it's clearly a reference to Boiling Liquid Expanding Vapor Explosion, but that seems an odd choice of name with no relation to the project. Do you just like fire?
I was watching one of those engineering disaster shows and thought it would make a good name for a project. I didn't find any other software projects using the name, and it seemed like it would have decent googleability. The relationship to fire/explosions has given good themes for logos and sub-project names (like scorch).
The reason is most likely that they saw this on the front page: https://news.ycombinator.com/item?id=16085873. I've noticed that people like to submit items that are tangentially related to those that pop up on the front page.
There's a question I can't find an answer to by looking at the homepage and documentation: does it handle concurrent queries? (I wonder, since I see the store is a single file.)
That is, is it something akin to SQLite, meant to be used as an embedded engine for standalone applications, or is it fit to be used by a centralized API?
I ran into some significant performance issues with Bleve during a weekend hackathon a few months ago. I was trying to index the Stack Overflow data dump for fun and couldn't get it to complete successfully. I'm guessing I was running into some BoltDB-related limitations, but I didn't have time to dig deeper before the party ended and I had to get back to the day job.
SQLite's FTS5 let me load the entire data set without breaking a sweat, but query performance was unexpectedly poor for some combinations of terms, with no apparent consistency, and it became unusable due to a temp table I couldn't work out how to avoid when attempting to search in descending primary key order.
It was many months ago now and I have forgotten some of the particulars, but I thought it might make an interesting jumping-off point for discussion about any ideas anyone might have for indexing 60 GB of data into a flat-file setup like Bleve or SQLite uses.
Would the new scorch experiment perform better with an index of that size than the boltdb backend?
Yeah, the old index format had many factors contributing to it taking up considerable space. As a single data point, we have a beer-search sample app that includes a data directory with 29 MB of JSON files. In the old index format, with a mapping that used a realistic configuration (indexing many fields and storing some of them), the bleve index size was over 200 MB.
With scorch, the index size is 22 MB. Query performance is comparable (and we haven't even gotten to really tuning this yet).
I am using Elasticsearch in production, with Go as both the indexer and the search service. Last holiday I played with blevesearch to make it work like Elasticsearch, but the work is far, far, far from complete.
https://github.com/wejick/balasticsearch
Basically, my applications consume the Elasticsearch REST API. One acts as the indexer; the other one uses the search API.
I'm using this amazing package: https://github.com/olivere/elastic
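For reference, the indexer/search split with that package looks roughly like this; a sketch (the index name, document shape, and field names are made up, assuming the v6-era context-based API):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/olivere/elastic"
)

func main() {
	ctx := context.Background()
	client, err := elastic.NewClient(elastic.SetURL("http://localhost:9200"))
	if err != nil {
		log.Fatal(err)
	}

	// Indexer service: push a document through the REST API.
	doc := map[string]interface{}{"title": "hello", "body": "indexed via REST"}
	if _, err := client.Index().Index("articles").Type("article").
		Id("1").BodyJson(doc).Do(ctx); err != nil {
		log.Fatal(err)
	}

	// Search service: run a match query against the same index.
	res, err := client.Search().Index("articles").
		Query(elastic.NewMatchQuery("body", "indexed")).Do(ctx)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("hits:", res.TotalHits())
}
```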
With the current index scheme, we would regularly do tens of gigabytes (which is embarrassingly small, in our opinion). We haven't done large-scale testing with scorch yet, but in some smaller configurations the new index is 1/10th the size of the previous one, so we hope to have moved the bar considerably on the data sizes that work.
You are correct. I think when I originally designed the homepage I wanted to show how you index and search in just a few lines, and I thought omitting the error-handling boilerplate was acceptable. But properly handling these errors is also key to a good initial experience, so I'm persuaded to at least revisit the decision.
It isn't common, and the code example wouldn't even compile, because "err" is declared but never used. I imagine they included the "err" variables to show that they're available but didn't want to clutter the example with error checks.
You can ignore error values in Go by assigning them to the blank identifier "_", but outside of very short toy examples it is a huge warning sign of terrible code. It's certainly not common to ignore errors in Go.
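For comparison, here is what the few-lines homepage-style example looks like with the error handling put back; a sketch (the document struct and index path are made up):

```go
package main

import (
	"fmt"
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	mapping := bleve.NewIndexMapping()
	index, err := bleve.New("example.bleve", mapping)
	if err != nil {
		log.Fatal(err) // e.g. the index path already exists
	}
	defer index.Close()

	data := struct{ Name string }{Name: "text to be indexed"}
	if err := index.Index("id1", data); err != nil {
		log.Fatal(err)
	}

	req := bleve.NewSearchRequest(bleve.NewMatchQuery("text"))
	res, err := index.Search(req)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}
```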
If all you care about are relatively simple queries in English, CLucene or Lucy could probably work for you.
AFAIK, there is no native C/C++ library that comes anywhere near modern Lucene. For me, the killer feature is that you get analyzers for many different languages out of the box with current Lucene/Elasticsearch.