
Ask HN: How to build full text search at scale? - ratpik
Data Store - Elasticsearch
Scale - 10 million writes/day (500 GB/day), about 100K search queries per day

Trying to figure out

1 - How to control access to data (multi-tenancy where there are ~100K tenants)

2 - Database design - Indexes and Shards and best practices around mixing different types of documents in a single index.
======
itronitron
I recommend writing down what exactly scale means for your needs. Number of
users? Number of queries? Number of sources? Number of 'result-sets'? Number
of documents? Number of text fields?

Elasticsearch is built on top of Lucene, a Java library that you can embed
in pretty much any JVM application. If you already have a system that can
search the MySQL clusters, then I would recommend hooking Lucene into that
system instead of standing up another one.

~~~
ratpik
Added more details

~~~
itronitron
>> How to control access to data (multi-tenancy where there are ~100K tenants)

So basically, you need to run a web server that serves a search page in which
your users can create and submit queries. The web server receives the query
and then routes it to the appropriate search handler. The web server should
handle access control, and there are several standard approaches for this.

Solr and Elasticsearch can both be used in this manner.
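
One common way to do the access-control part is to have the web server wrap
every user query in a filter on the tenant before it reaches the search
engine. A minimal sketch, assuming each document carries a `tenant_id` keyword
field and a `content` text field (both names are made up for illustration),
and that the tenant ID comes from the authenticated session rather than from
the request body:

```python
def build_tenant_query(tenant_id: str, user_query: str) -> dict:
    """Build an Elasticsearch query body restricted to a single tenant.

    The field names "tenant_id" and "content" are assumptions for this
    sketch; substitute whatever the real mapping uses.
    """
    if not tenant_id:
        # Refuse to build an unscoped query rather than search every tenant.
        raise ValueError("tenant_id is required")
    return {
        "query": {
            "bool": {
                # Scored full-text match on the user's input.
                "must": [{"match": {"content": user_query}}],
                # Non-scoring filter that limits results to this tenant.
                "filter": [{"term": {"tenant_id": tenant_id}}],
            }
        }
    }
```

The resulting dict is what the server would send as the search request body.
The user never gets to touch the `filter` clause, so they cannot widen the
search to another tenant's documents.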

>> Database design - Indexes and Shards and best practices around mixing
different types of documents in a single index.

Depends a lot on what your users want to get in their search results. A first
step would be to identify the primary text fields they want to search in each
document type, then create a standard text field in the schema into which each
document's primary content gets indexed. You can get fancy by running
different document types through different analyzer/tokenizer chains (for
example if they were in different languages) and you can do a lot of
'cheating/preprocessing' here so that the primary search text field has good
information in it.
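
As a rough illustration of the preprocessing idea, here is a sketch that
flattens different document types into one shared search field before
indexing. The document types (`email`, `ticket`), their field names, and the
`search_text` target field are all hypothetical; the point is only that each
type declares its primary text fields and they all end up in the same place.

```python
# Hypothetical document types and the text fields users mainly search in.
PRIMARY_FIELDS = {
    "email": ["subject", "body"],
    "ticket": ["title", "description"],
}


def to_indexable(doc: dict) -> dict:
    """Reduce a typed document to the common shape that gets indexed."""
    fields = PRIMARY_FIELDS[doc["doc_type"]]
    # Skip missing or empty fields and trim stray whitespace.
    parts = [str(doc[name]).strip() for name in fields if doc.get(name)]
    return {
        "doc_type": doc["doc_type"],
        "tenant_id": doc["tenant_id"],
        # Single text field that every query can target, whatever the type.
        "search_text": "\n".join(parts),
    }
```

Keeping `doc_type` on the indexed document means you can still filter or
boost by type at query time, even though the text lives in one field.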

------
bufferoverflow
I doubt you will get a useful answer in a comment. Each of your questions is
very broad. And you haven't even selected a search engine yet. You haven't
specified the scale you're dealing with. You haven't specified the number of
reads/writes per second that you expect.

Choose one system and learn it well.

~~~
ratpik
Scale - Added details to the post.

Search Engine - Elasticsearch
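
For the reads/writes per second question, the daily figures in the post can be
converted to averages with simple arithmetic. A quick sketch, assuming load is
spread evenly over 24 hours and 1 GB = 10^9 bytes (the post does not say
anything about peaks, so real bursts will be higher):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

writes_per_day = 10_000_000
queries_per_day = 100_000
bytes_per_day = 500 * 10**9  # 500 GB/day, decimal gigabytes assumed

# Average rates only; peak traffic is not stated in the post.
writes_per_sec = writes_per_day / SECONDS_PER_DAY    # about 116 writes/s
queries_per_sec = queries_per_day / SECONDS_PER_DAY  # about 1.2 queries/s
avg_doc_bytes = bytes_per_day / writes_per_day       # 50,000 bytes, i.e. 50 KB

print(f"{writes_per_sec:.1f} writes/s, "
      f"{queries_per_sec:.2f} queries/s, "
      f"{avg_doc_bytes / 1000:.0f} KB per write")
```

So this is a write-heavy workload: roughly a hundred writes for every search
query, with fairly large documents.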

