

Ask YC: Very large database scanning - ca98am79

I have a database with 10 million+ rows that needs to be scanned on several columns. For example, we need to select from the database where column A>10 and column B<5.34 and column A<100 and column C=3, etc... on 40 or so columns. All of the columns are int or float. We want it to be possible to select on these columns in any order and with any values.

It is important that it be very fast. I care more about speed than I do about how much disk space it takes. I've tried indexing and partitioning, but I haven't come up with a solution that works quickly and in the way we want.

Could anyone offer any advice on the best way to go about this? Or advice on what I should read? Anything is greatly appreciated!
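
To make it concrete, a typical query looks something like this (table and column names made up):

    SELECT *
    FROM measurements        -- 10M+ rows, ~40 int/float columns
    WHERE a > 10
      AND a < 100
      AND b < 5.34
      AND c = 3;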
======
davidu
You haven't provided enough information.

What is quickly? 10ms? 100ms? 1 second?

What is your database?

Suffice it to say, this isn't a hard problem. Why not stick all 10mm+ rows in
memory?

~~~
ca98am79
10ms would be great, but anything less than 1 second would work. The database
is MySQL.

------
aristus
You're not going to get very much faster than a full table scan with this
setup.

If an int column has values from 1-10, indexes don't help, for the obvious
reason that there will only be 10 index buckets to put things into. I will bet
more than one of your 40 (40?!) columns has a restricted set of values.

You could try playing with the order of the where clause, and/or try sticking
the entire database into RAM.
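
A rough sketch of the RAM option (this assumes the columns really are all int/float, since the MEMORY engine can't hold BLOB/TEXT, and that the table fits under max_heap_table_size; table names made up):

    -- raise the per-table MEMORY limit for this session (2 GB here)
    SET max_heap_table_size = 2 * 1024 * 1024 * 1024;

    -- make an in-memory copy of the table and query that instead
    CREATE TABLE t_mem ENGINE=MEMORY AS SELECT * FROM t;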

~~~
xirium
For MySQL Server 4 and below, simple indexes and compound indexes with three
or fewer distinct values are ignored. For MySQL Server 5 and above, the
heuristic is vaguer. Anyhow, if one or more columns have a combined total of
three or more distinct values, then an index could reduce a full table scan to
a range scan.
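
You can check what the optimizer actually picks with EXPLAIN (table and column names hypothetical):

    EXPLAIN SELECT * FROM t WHERE c = 3 AND d = 4;
    -- in the output, type = "ref" or "range" means an index is used;
    -- type = "ALL" means a full table scan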

------
bayareaguy
Sounds like what you want is a bitmap index.
<http://en.wikipedia.org/wiki/Bitmap_index>

~~~
ca98am79
Actually, I think this could work well, but MySQL doesn't seem to support it.
Is there a way to manually create them? Thank you.

~~~
bayareaguy
Perhaps you could use FastBit ( <http://sdm.lbl.gov/fastbit> ) alongside
MySQL. This paper may be of interest:

Analyzing Enron Data: Bitmap Indexing Outperforms MySQL Queries by Several
Orders of Magnitude. Kurt Stockinger, Doron Rotem, Arie Shoshani, Kesheng Wu

Abstract: _FastBit [1] is an efficient, compressed bitmap indexing technology
that was developed in our group. In this report we evaluate the performance of
MySQL and FastBit for analyzing the email traffic of the Enron dataset. The
first finding shows that materializing the join results of several tables
significantly improves the query performance. The second finding shows that
FastBit outperforms MySQL by several orders of magnitude._

<http://www.osti.gov/bridge/servlets/purl/881612-KQlHWx/881612.PDF>

~~~
ca98am79
Great, thank you very much.

------
xirium
Which database server are you using? Do most queries include conditions of the
form C=3 on a limited number of columns? Are any of the columns large?

~~~
ca98am79
MySQL.

Yes, most queries are on a limited number of columns, and none of the columns
are large.

~~~
xirium
Ignore the constraints of the form A>10. Only consider the constraints of the
form C=3. If many of your queries are of the form C=3, and many are also of
the form C=3 AND D=4, and some are also of the form C=3 AND D=4 AND E=5, then
create a compound index of the form (C,D,E).

One index of this form will reduce some of the full table scans to more
specific range scans. Create permutations and variations of this type of index
to cover all the common cases.
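
For example (table and column names hypothetical):

    -- one compound index serves all three query shapes,
    -- via the leftmost-prefix rule
    CREATE INDEX idx_cde ON t (c, d, e);

    SELECT * FROM t WHERE c = 3;                      -- uses prefix (c)
    SELECT * FROM t WHERE c = 3 AND d = 4;            -- uses prefix (c,d)
    SELECT * FROM t WHERE c = 3 AND d = 4 AND e = 5;  -- uses (c,d,e)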

Next, what storage engines are you using? Do your queries have any ORDER BY
clauses?

~~~
ca98am79
So I loaded it into memory, but SELECTs that match 99% of the table (10
million+ rows) still take 1 min 7.95 sec, which is way too long.

I guess the best way to do this is just to create many tables that are subsets
of the data for queries that return large result sets, and then come up with
an algorithm that picks the right subset based on the query? For example, if
a>1 returns 99% of the data, create and store a table with this selection, and
if someone selects where a>1 AND b>5, query the subset.
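
Something like this, I mean (just a sketch, names made up):

    -- materialize the big subset once
    CREATE TABLE t_a_gt_1 AS SELECT * FROM t WHERE a > 1;

    -- then route any query whose conditions imply a>1 to it
    SELECT * FROM t_a_gt_1 WHERE b > 5;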

I don't know. If I do that, I will end up with GBs and GBs of tables. I wish
there were some magic algorithm to make things 10x faster.

~~~
xirium
If you're using MyISAM and you don't care about disk space, then you can use
an R-tree index to search on any two inequalities. For example, an R-tree
index can be used for queries of the form A>10 AND A<100 AND B>5. Also note
that R-tree indexes can be appended to my previously suggested indexes. So, an
index of the form (C,D,E,R) would efficiently locate rows for a query of the
form C=3 AND D=4 AND E=5 AND A>10 AND A<100 AND B>5.
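
A rough sketch of the plain R-tree part (MySQL exposes R-trees as SPATIAL indexes on MyISAM geometry columns, so this assumes you pack A and B into a POINT; names are made up, and 1000000000 stands in for "no upper bound on B"):

    -- pack the two numeric columns into one geometry column
    ALTER TABLE t ADD COLUMN ab POINT NOT NULL;
    UPDATE t SET ab = POINT(a, b);
    CREATE SPATIAL INDEX idx_ab ON t (ab);

    -- A>10 AND A<100 AND B>5 becomes a bounding-box query
    -- (MBR comparisons are inclusive, so re-check strict bounds
    -- afterwards if the boundary rows matter)
    SELECT * FROM t
    WHERE MBRContains(
      GeomFromText('POLYGON((10 5, 100 5, 100 1000000000,
                             10 1000000000, 10 5))'),
      ab);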

