
Riot – Full-text search engine in Go - veni0
https://github.com/go-ego/riot
======
wolfgarbe
It seems the whole index is kept in RAM, so the index size is limited by the
amount of RAM available. This explains the impressive indexing and search
performance (1M blog 500M data 28 seconds index finished, 1.65 ms search
response time, 19K search QPS). The persistent storage data is written to the
hard disk solely when the program closes, and is restored from the hard disk
when the program restarts (
[https://github.com/go-ego/riot/blob/master/docs/zh/persisten...](https://github.com/go-ego/riot/blob/master/docs/zh/persistent_storage.md)
). This is a limited approach compared to Lucene/Solr/Elasticsearch, which
handle high-volume inserts to their indexes with a log-structured merge-tree
(LSM) and whose index size is limited only by the available hard disk space.
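
To make the trade-off concrete, here is a minimal sketch of the "everything in RAM, written to disk only on close" pattern described above. It is not Riot's code: the `TinyIndex` class, the JSON file format, and the whitespace tokenizer are all assumptions made up for illustration.

```python
import json
import os
import tempfile
from collections import defaultdict


class TinyIndex:
    """Toy in-memory inverted index that touches disk only on open and close."""

    def __init__(self, path):
        self.path = path
        self.postings = defaultdict(set)  # term -> set of document ids
        if os.path.exists(path):
            # Restore the whole index into RAM at startup.
            with open(path, encoding="utf-8") as f:
                for term, ids in json.load(f).items():
                    self.postings[term] = set(ids)

    def add(self, doc_id, text):
        # Hypothetical tokenizer: lowercase and split on whitespace.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result

    def close(self):
        # The only write to disk: documents added since open are lost on a crash.
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump({t: sorted(ids) for t, ids in self.postings.items()}, f)


path = os.path.join(tempfile.mkdtemp(), "index.json")
index = TinyIndex(path)
index.add(1, "full text search in Go")
index.add(2, "search engines keep postings in memory")
index.close()

reopened = TinyIndex(path)  # simulates a program restart
print(sorted(reopened.search("search")))
print(sorted(reopened.search("search go")))
```

Searches never touch the disk, which is why they are fast, but the index can never be larger than RAM and a crash before `close()` loses every document added in that session.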

~~~
markpapadakis
1.65ms for what kind of queries? Also, is that 1M blog posts, all weighing
500 MB in total size (characters)?

~~~
ethanwillis
I wonder if they're using succinct data structures.

I'm in bioinformatics, and the first time I implemented a wavelet tree to
reduce the size of genomes in memory, it was just breathtaking.

~~~
ausjke
Very interesting, can you elaborate on it a little more?

I need quick fuzzy search on a low-end embedded device that has limited
storage (both RAM and HDD). I was thinking about putting the index on a server
with plenty of RAM and then querying it over websocket or RPC.

~~~
ethanwillis
There's a very good blog post for the implementation details here:
[http://alexbowe.com/wavelet-trees/](http://alexbowe.com/wavelet-trees/) I had
a decent implementation in Python, but it's on my old macbook that I would
need to dig up. If you're interested you can add me on telegram: @rightcheek.

Now to go with Wavelet trees you may or may not need to know about suffix
arrays and optimal suffix array construction. Take a look at this:
[https://en.wikipedia.org/wiki/Suffix_array](https://en.wikipedia.org/wiki/Suffix_array)
This is what's going to give you space efficiency in combination with a
wavelet tree. And the wavelet tree also gives you good rank/select efficiency.
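
For anyone who wants to see the rank operation before reading the blog post, a rough sketch follows. It is a pointer-based wavelet tree whose bitmaps are plain Python lists with prefix counts, so it shows the structure but none of the space savings a real succinct bitvector would give. The `WaveletTree` class and its layout are assumptions for illustration, not the implementation mentioned above.

```python
class WaveletTree:
    """Pointer-based wavelet tree; each node stores prefix counts of its bitmap."""

    def __init__(self, text, alphabet=None):
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        self.left = self.right = None
        self.prefix_ones = [0]  # prefix_ones[i] = number of 1-bits in the first i bits
        if len(self.alphabet) <= 1:
            return  # leaf: every symbol that reaches this node is the same
        mid = len(self.alphabet) // 2
        self.low = set(self.alphabet[:mid])  # symbols routed to the left child (bit 0)
        for ch in text:
            bit = 0 if ch in self.low else 1
            self.prefix_ones.append(self.prefix_ones[-1] + bit)
        self.left = WaveletTree(
            [c for c in text if c in self.low], self.alphabet[:mid]
        )
        self.right = WaveletTree(
            [c for c in text if c not in self.low], self.alphabet[mid:]
        )

    def rank(self, ch, i):
        """Number of occurrences of ch among the first i symbols."""
        if len(self.alphabet) <= 1:
            return i if self.alphabet and self.alphabet[0] == ch else 0
        ones = self.prefix_ones[i]
        if ch in self.low:
            return self.left.rank(ch, i - ones)  # zeros seen so far map to the left
        return self.right.rank(ch, ones)  # ones seen so far map to the right


tree = WaveletTree("GATTACA")
print(tree.rank("A", 7))
print(tree.rank("T", 4))
```

Each rank query walks one root-to-leaf path, so it costs a number of steps proportional to the log of the alphabet size, independent of the text length.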

Edit: Here's a suffix array construction algorithm implementation I did (not
sure if it's fully correct)
[https://github.com/ethanwillis/comp7295_final/blob/master/sa...](https://github.com/ethanwillis/comp7295_final/blob/master/saca.py)
It is based on this paper:
[https://local.ugene.unipro.ru/tracker/secure/attachment/1214...](https://local.ugene.unipro.ru/tracker/secure/attachment/12144/Linear+Suffix+Array+Construction+by+Almost+Pure+Induced-Sorting.pdf)
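
If you only want the idea and not the linear-time induced-sorting algorithm from the paper, a naive sketch is below. It builds the suffix array by sorting the suffixes directly, which is far slower than induced sorting on large inputs, and then finds a pattern by binary search. The function names are made up for illustration and are not taken from the linked repository.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern, via binary search over sa."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)
print(find_occurrences(text, sa, "ana"))
```

All suffixes that start with the pattern sit next to each other in the sorted order, which is why two binary searches are enough to find every occurrence.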

------
wiremine
Sidebar: I wish open source authors would think a bit harder about naming
their projects. Here's some other projects already named riot:

- [http://riotjs.com/](http://riotjs.com/)

- [https://riot-os.org/](https://riot-os.org/)

- [https://github.com/vector-im](https://github.com/vector-im)

~~~
KitDuncan
And Riot Games. Basically the worst possible name for SEO.

~~~
KGIII
Then again, the language is 'Go.' I am not sure that was the best naming
choice and I've heard complaints that it was initially difficult to search for
it. My understanding is that 'golang' has helped as a query. So, I guess the
situation has improved but I understand it was problematic at first.

To be clear, I've never used the language. I actually dislike programming,
though I've decided to get back into it because I have a couple of projects I
want to poke at. I'm now deciding between Java and Python.

Maybe I should do an 'ask HN' submission.

~~~
chungy
"Ask HN" sounds like a great idea.

I think Python is the much better choice to start out, but Java is pretty
great too. It's hard to find languages that are actually bad choices... maybe
COBOL.

~~~
KGIII
I have done some COBOL and even some PASCAL. Yeah, I'm old.

I hired professionals in 1995. I was done doing any of the coding by 2000. I
sold and retired in 2007.

I took only one course in C. Everything else was learned on my own,
informally.

~~~
terminalcommand
I think Python would be better to start with. It is interpreted so you will
get sane error messages. The community is omnipresent. If you'd like instant
answers, you could just pop in to #python on irc.freenode.net.

Most of the time, all your questions will be answered by a simple Google
search, as there are mountains of good questions and answers on Stack Overflow.

IMHO for you, getting back into programming will be as simple as opening the
interpreter and starting to type.

I don't think programming has changed, the basic mentality is still the same.
And nothing beats good old experience.

The new niche "trends" such as asynchronous programming, actor-based
programming etc. could be easily learned by lingering on HN for a while :).

Best of luck getting back on the keyboard!

------
Jemaclus
Also, for consideration: Bleve
([https://github.com/blevesearch/bleve](https://github.com/blevesearch/bleve))

I'm in the process of building my own search engine (as a learning exercise,
but also because it's related to my day job). I've learned that it's one thing
to write a full-text search engine, like this one, and it's quite another to
do field-specific searches with faceting support and so on, like Algolia and
Lucene-based search engines do.

That said, this is clean and simple. I like it. I can definitely learn from
this.

~~~
markpapadakis
Supporting faceted search and other functionality requiring access to
per-document field values is just an extension over the core IR functionality.

Tracking (document, field) values can be used for query by range or by
geolocation primitives (that's what Lucene does, where it will index that data
into a special tree-like structure, and for each query, it will build a custom
'iterator' and use it along with other iterators to match documents), and for
static ranking of matched documents.
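
A toy sketch of that idea, for illustration only: a term lookup and a range filter each yield a sorted list of document ids, and matching is just intersecting those lists. Lucene's real data structures and iterator API are far more involved. Every name here (`postings`, `prices`, `categories`, the helper functions) is hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def term_docs(postings, term):
    """Sorted doc ids that contain the term."""
    return postings.get(term, [])


def range_docs(field_values, low, high):
    """Sorted doc ids whose field value lies in [low, high].

    field_values is a list of (value, doc_id) pairs sorted by value;
    doc ids are assumed to be non-negative integers.
    """
    start = bisect_left(field_values, (low, -1))
    end = bisect_right(field_values, (high, float("inf")))
    return sorted(doc_id for _, doc_id in field_values[start:end])


def intersect(a, b):
    """Walk two sorted doc-id lists in step, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def facet_counts(doc_ids, field):
    """Count matched documents per field value (field maps doc_id -> value)."""
    return Counter(field[d] for d in doc_ids)


postings = {"search": [1, 2, 4], "go": [1, 3, 4]}
prices = sorted([(9.5, 1), (20.0, 2), (15.0, 3), (12.0, 4)])
categories = {1: "library", 2: "server", 3: "library", 4: "server"}

matched = intersect(term_docs(postings, "search"), term_docs(postings, "go"))
in_range = intersect(matched, range_docs(prices, 10, 16))
print(matched, in_range, facet_counts(matched, categories))
```

The same per-document field values serve three purposes here: filtering by range, counting facets over the matched set, and (not shown) static ranking.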

BTW, Lucene and Algolia are vastly different in terms of the underlying
architecture.

~~~
lobster_johnson
What is Algolia's underlying architecture like? Are there any papers or code?

~~~
markpapadakis
See links about Algolia arch (and other related material) here:
[https://github.com/phaistos-networks/Trinity/wiki/IR-Search-...](https://github.com/phaistos-networks/Trinity/wiki/IR-Search-Links)

~~~
gdillon
Thanks for the links, Mark! (Disclaimer, I work for Algolia.)

As Mark mentions on his summary page, the best place for that kind of
information is our CTO's "Inside the Engine" series (8 parts).

[https://blog.algolia.com/inside-the-algolia-engine-part-1-in...](https://blog.algolia.com/inside-the-algolia-engine-part-1-indexing-vs-search/)

------
unicornporn
Kind of taken [https://about.riot.im/](https://about.riot.im/)

~~~
lylejohnson
Not to mention [http://riotjs.com/](http://riotjs.com/).

~~~
speps
What about [https://www.riotgames.com](https://www.riotgames.com)? More so
because they have an engineering blog, which is very interesting:
[https://engineering.riotgames.com/](https://engineering.riotgames.com/)

~~~
literallycancer
Mentioning Pendragon et al. in a discussion about originality? Rich.

~~~
whyever
I think the company is a bit bigger than one person by now.

------
hardwaresofton
A little late but if you didn't know you could do reasonably fast Full Text
Search with SQLite... Now you know:

[https://sqlite.org/fts3.html](https://sqlite.org/fts3.html)

[https://sqlite.org/fts5.html](https://sqlite.org/fts5.html)
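
A minimal sketch of what that looks like from Python's built-in `sqlite3` module, assuming the SQLite library it is linked against was compiled with FTS5 (common, but not guaranteed). The `posts` table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# An FTS5 virtual table indexes every listed column for full-text queries.
conn.execute("CREATE VIRTUAL TABLE posts USING fts5(title, body)")
conn.executemany(
    "INSERT INTO posts (title, body) VALUES (?, ?)",
    [
        ("Riot", "full-text search engine written in Go"),
        ("Bleve", "text indexing library for Go"),
        ("Scout", "RESTful search server powered by SQLite"),
    ],
)

# MATCH runs the full-text query; ORDER BY rank sorts best matches first.
rows = conn.execute(
    "SELECT title FROM posts WHERE posts MATCH ? ORDER BY rank",
    ("search",),
).fetchall()
titles = [row[0] for row in rows]
print(titles)
```

By default the `rank` column in FTS5 is the built-in `bm25()` score, so no extra ranking code is needed for a basic relevance ordering.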

~~~
burntsushi
Does it suffer from the same limitation as PostgreSQL's full-text search?
I.e., it doesn't use corpus frequencies in its ranking function. (I skimmed
the docs but couldn't immediately find my answer.)
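
To illustrate what "corpus frequencies" buys you, here is a small sketch comparing a ranking based on raw term counts with textbook tf-idf. It is not what PostgreSQL or SQLite actually compute; the documents and function names are invented for the example.

```python
import math
from collections import Counter

docs = {
    1: "the go search engine",
    2: "the the the the index",
    3: "the riot search engine in go",
}
tokens = {doc_id: text.split() for doc_id, text in docs.items()}

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for words in tokens.values() for term in set(words))


def tf_score(query, doc_id):
    """Rank by raw term counts only: no corpus statistics involved."""
    counts = Counter(tokens[doc_id])
    return sum(counts[t] for t in query.split())


def tf_idf_score(query, doc_id):
    """Weight each term count by how rare the term is across the corpus."""
    counts = Counter(tokens[doc_id])
    n = len(tokens)
    return sum(
        counts[t] * math.log(n / doc_freq[t])
        for t in query.split()
        if doc_freq[t]  # skip terms that appear in no document
    )


def best(score, query):
    """Id of the highest-scoring document for the query."""
    return max(tokens, key=lambda d: score(query, d))


print(best(tf_score, "the riot"), best(tf_idf_score, "the riot"))
```

Without corpus frequencies, the document that merely repeats "the" wins the query "the riot"; with the inverse document frequency weight, "the" contributes nothing because it appears everywhere, and the document that actually mentions "riot" comes out on top.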

~~~
hardwaresofton
I'm not sure -- but I'm going to guess yes? I tried to look around and
couldn't find an answer either...

------
dpcx
Not that I'm against people building tools in their language of choice, but
how does this compare to Sphinx
([http://sphinxsearch.com/](http://sphinxsearch.com/))?

~~~
didip
Last time I used Sphinx, it was tied to MySQL. Is that still true?

If so, that could be a deal breaker for some.

~~~
mrweasel
It hasn't been tied to MySQL in the last 10 years, so I'm wondering if it ever
was.

You can connect to Sphinx using a MySQL client, use it as a MySQL storage
engine, or use MySQL as a data source. But it's not specifically tied to
MySQL.

------
sAbakumoff
I mean, we have bleve[0]. What else do you need, really?

[0]
[https://github.com/blevesearch/bleve](https://github.com/blevesearch/bleve)

~~~
veni0
Multi-language and distributed support, simple.

~~~
sAbakumoff
bleve search does support multiple languages.

~~~
veni0
But languages like Chinese require plug-ins.

------
conmarap
Ah, this is awesome! So far I've had to rely on Lucene++, and it could get
complicated at times. Using Go is something I had been wishing for.

~~~
veni0
Yep, Go makes for simple deployment.

------
tenken
Also for consideration:
[https://github.com/coleifer/scout](https://github.com/coleifer/scout)

> RESTful search server written in Python, powered by SQLite.

------
dest
A comparison with Yacy would be interesting IMHO

------
maxpert
I wish I could understand Mandarin; such nice projects need a good
translation.

~~~
veni0
Thanks for understanding, the documentation is improving.

------
ct520
Does anyone know anything like this that will search pdf text, or tiff?

~~~
mikey_p
[http://tika.apache.org](http://tika.apache.org)

I'm pretty sure it can be embedded in Solr, and it has plugins for
Elasticsearch and others.

