
Web search 30M rows sub-second, cheaply - jgord
https://quantblog.wordpress.com/2015/02/10/web-search-30-million-rows-sub-second-cheaply/
======
arafalov
Did you try looking at Solr or Elasticsearch? Inverted index (Lucene) is what
they are based on. You even said "search" in your own article....

~~~
thedarkproject
Would recommend looking into ElasticSearch too. Recently had my first hands-on
experiences and I am pretty amazed.

~~~
endymi0n
He didn't really define what he was going to "search" for, so yeah, if he
means full-text search, Elastic is probably the way to go. For exact lookups,
Aerospike will actually make use of his SSDs; for slice & dice on wide rows,
BigQuery or Redshift will fit the bill - unfortunately he didn't say much
about his exact use case. But inventing your own data store is probably one of
the worst ideas ever, except for gaining knowledge / as a rite of passage.

~~~
isoos
ElasticSearch (and Solr and Lucene) can be used to index (and look up) exact
expressions, e.g. string keys/enums, or numbers. They can combine full-text
search with exact search terms. I've had great results using these for
general purpose search and database lookups.
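
For example, a combined query body might look something like this - a sketch
of an Elasticsearch bool query mixing an analyzed full-text match with an
exact term filter (the field names "title" and "tag" are made up):

```python
# Sketch of an Elasticsearch-style query body combining analyzed full-text
# search with an exact term filter. Field names are hypothetical.
def combined_query(text, tag):
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],  # scored full-text match
                "filter": [{"term": {"tag": tag}}],    # exact, unscored lookup
            }
        }
    }

q = combined_query("ssd benchmarks", "postgres")
```

The `filter` clause skips scoring entirely, so exact lookups stay cheap even
when combined with a text match.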

------
pgaddict
Sadly the post is very light on important details. For example, which version
of PostgreSQL and which kind of indexes? Because GIN indexes do pretty much
exactly what he describes as 'tag->id' (but in a form that is compressed and
very efficient for processing).
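
For the curious, a toy sketch of that 'tag->id' posting-list idea - one sorted
list of row ids per tag, intersected for multi-tag lookups. This is just the
concept, not GIN's actual (compressed) on-disk layout:

```python
from array import array
from bisect import bisect_left

# Toy 'tag -> ids' index: one sorted posting list of row ids per tag.
index = {}

def add(tag, row_id):
    # assumes row ids are appended in increasing order
    index.setdefault(tag, array("I")).append(row_id)

def contains(postings, row_id):
    i = bisect_left(postings, row_id)
    return i < len(postings) and postings[i] == row_id

def lookup(*tags):
    """Row ids carrying all the given tags: intersect the posting lists."""
    lists = sorted((index[t] for t in tags), key=len)  # smallest list first
    return [i for i in lists[0] if all(contains(l, i) for l in lists[1:])]
```

Scanning the smallest posting list and probing the others keeps the
intersection cheap, which is roughly why these indexes stay fast as the number
of tags per query grows.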

~~~
jgord
You and others make a good case - my post was to summarize what I ended up
doing, as an interesting hack / datapoint - not to slam postgres [ or Redis,
or Lucene/solr, for that matter ]

The main takeaway is really how fast SSDs are and how much bang for your buck
you can get on a single commodity host instance [ thru some creative wrangling
of data structures ]

I will endeavor to follow up with a fair comparison using postgres...

Can you suggest the best approach / config to keep the top-index working set
in RAM?

~~~
pgaddict
I think the form of the blog post (pretty much just short bullet points) makes
it rather difficult to get an idea of what's the intended takeaway.

I do share your conclusion that SSDs are very good value for workloads that
need to do a lot of random I/O.

Regarding the best config for PostgreSQL, it really depends on the hardware.
If you really want to use boxes with just 2GB of RAM, use smaller shared
buffers and let the page cache do the rest.
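
For a 2GB box that might look something like this (illustrative starting
values, not a tuned recommendation):

```
# postgresql.conf - illustrative starting point for a ~2GB box
shared_buffers = 256MB         # keep PostgreSQL's own cache modest
effective_cache_size = 1536MB  # hint that the OS page cache holds the rest
```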

If you need more help with tuning PostgreSQL for the benchmark later, either
ping me at tomas(at)pgaddict.com or join pgsql-performance@postgresql.org (I
do occasionally lurk there, but others will surely help too).

~~~
jgord
yeah.. I started with bullets as a draft, intending to expand into prose, but
it seemed borderline readable, so I thought I'd leave it as an exercise in
blogging.

awesome, thanks Tomas - I'll check out your blog too.

------
joshu
This is one of those "computer science matters" things.

Learn your data structures. Learn what the storage engines use. Use the right
structure for the right problem.

~~~
jgord
..you're right.

The spirit of my hack was really just that - to apply the well known data
structures and algos to solve a problem.

It's unusual to want fast, large and cheap all together - if you really want to
squeeze things you might look at how modern SSD performance can be exploited -
my hack is just one instance of that.

------
lukaslalinsky
Most comments here are about using an existing search solution, but sometimes
inventing your own solution has a lot of benefits.

I needed something similar when I was building
[https://acoustid.org/](https://acoustid.org/) and for the first iteration I
used PostgreSQL's GIN indexes, which are nice and easy to work with, but once
the index grew beyond some point, it was getting very slow. I spent some time
designing a minimalistic inverted index that does just what I need, nothing
else, while compacting the data as much as possible, to keep it in RAM, and
the results were just excellent. It was super fast compared to the GIN indexes.
Now I have 10x more data and it's still as fast as it was before. If you are
interested, here is the code - [https://bitbucket.org/acoustid/acoustid-index](https://bitbucket.org/acoustid/acoustid-index)
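
One trick that helps keep a hand-rolled inverted index small enough for RAM is
delta + varint encoding of the sorted posting lists - store the gap between
consecutive ids in variable-length bytes. A sketch of the general technique
(not acoustid-index's actual format):

```python
# Delta + varint encoding for a sorted list of ids: small gaps between
# consecutive ids compress to a single byte each.

def encode(ids):
    out, prev = bytearray(), 0
    for i in ids:
        delta, prev = i - prev, i
        while delta >= 0x80:              # 7 bits per byte, high bit = "more"
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)

def decode(data):
    ids, cur, acc, shift = [], 0, 0, 0
    for b in data:
        acc |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:                  # last byte of this varint
            cur += acc
            ids.append(cur)
            acc = shift = 0
    return ids
```

Dense posting lists (many small gaps) end up close to one byte per id instead
of four or eight, which is what makes "keep the whole index in RAM" feasible.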

~~~
jgord
Really interesting to see you had a similar use-case...

[ you explain the approach much better than OP :]

------
fizx
So, it seems like someone doesn't realize that postgres supports a
non-cobbled-together full text search?

[http://www.postgresql.org/docs/8.3/static/textsearch-indexes.html](http://www.postgresql.org/docs/8.3/static/textsearch-indexes.html)

~~~
jgord
I was aware, but definitely worth reiterating.

My post should have a disclaimer - "unless you have several years using many
sql and nosql datastores and some background in algorithms and data-structures
- DONT DO THIS"

My reason for _not_ using postgres and its GiST style indexes and full search
primitives ( which I like and use often, btw ) was that I could find no way to
configure postgres to hold the full top level index in RAM. I could run thru a
query over all data to load it into the working set, but ideally I would like
to mark it 'preload' - do you know a way to do this? I did consider putting
the inverted index in its own postgres DB on a RAM disk, but it seemed
non-recommended.

The spirit of the post was : in most cases yes you can find something good off
the open source shelf.. [ particularly redis which was very close to being
what I needed ] but sometimes rolling a custom data structure that fits the
problem can be a surprising win.

Particularly now that SSDs give a really different performance profile from what
we are used to. I think DB implementations will adapt to fit SSD
characteristics better over the next couple years.. just as they have made the
desktop much more snappy.

New DB engines such as rethink and aerospike are very likely tuned more
towards SSDs out of the box .. I haven't spent much time with those yet.

There's a class of problems that can be solved by scaling vertically, if you
approach it by creatively applying the basics of algorithms and data
structures. This is why many people use redis, it's a superb bag of useful
data-structures you can co-opt to your needs.

~~~
spyder
With a quick google search I found this stackexchange answer on preloading
data into cache with the pg_prewarm extension (from Postgres 9.4):

[http://dba.stackexchange.com/a/116536](http://dba.stackexchange.com/a/116536)

But this doesn't guarantee that the data will stay in the cache, because:
"Prewarmed data also enjoys no special protection from cache evictions, so it
is possible that other system activity may evict the newly prewarmed blocks..."

In MySQL you can create tables with the MEMORY storage engine:
[https://dev.mysql.com/doc/refman/5.5/en/memory-storage-engine.html](https://dev.mysql.com/doc/refman/5.5/en/memory-storage-engine.html)

~~~
jgord
prewarm extension sounds interesting, I'll take a look.

------
nreece
I prefer, and suggest, Sphinx over Elasticsearch. It has better performance,
easier ETL and handles large datasets better.

~~~
boyter
Agreed. Elasticsearch is fantastic, but for simple search that needs to be
fast, Sphinx is hard to beat. I have been using it for years on searchcode.com
and have never hit an issue with it that was not due to my own mistakes.

A fantastic piece of software.

~~~
STRML
Wow. What a great plug. Loving searchcode.com. Thank you for running such a
great service! I have endless frustration trying to search Google or DDG for
these things.

~~~
boyter
Thank you!

------
jtchang
I recommend elasticsearch as well. Just make sure to firewall it off and be
careful opening it up. Elasticsearch a few versions back had a remote exploit.

~~~
GlennS
In particular, ElasticSearch lets you include snippets of scripts in a number
of languages (e.g. Groovy) in your search. If you do so, the file-access
capabilities of those languages remain available.

I'd be surprised if it's an issue these days, because:

1. It's disabled by default, and has been for a while.

2. If you do turn it on, the config options let you be more specific about
where those scripts are allowed to come from.

------
fleitz
Snatching defeat from the jaws of victory, Postgres right choice, missed
tsearch2

Also 1TB of data fits easily in RAM

~~~
illumen
1TB of RAM still costs more than 2GB of RAM + SSD

