

R17 on spinning disk faster than PostgreSQL on SSD - matthewnourse
http://www.rseventeen.com/blog/bakeoff_part_2_mysql_postgresql_r17.html

======
justinsb
This benchmark is probably accurate, but not necessarily fair. Here's why:

If all the systems have to do a table scan (a complete scan of the data), then
you're (almost always) I/O bound. Because the workload relies on sequential
I/O, a good hard disk is as good as a consumer-grade SSD. PostgreSQL doesn't
do compression (if rows fit on a single page), and MySQL doesn't by default
either (though you can enable InnoDB page compression). So if you're using
compression with r17, you get more rows per MB transferred, and you win.

However, this only works if you're doing table scans on highly compressible
data. The minute you compete against an index, I suspect you're going to lose
heavily (e.g. selecting the number of hits per second over a one-minute
interval should be very fast on the relational databases, as they should be
able to answer it from the index).
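
As an illustration of the kind of query an index answers cheaply (the `hits`
table, its timestamp column, and the date literal are hypothetical, not taken
from the benchmark):

```sql
-- With an index on ts, the engine can count hits per second over a
-- one-minute window from the index alone, without scanning the table.
SELECT ts, COUNT(*) AS num_hits
FROM hits
WHERE ts >= '2011-02-08 12:00:00'
  AND ts <  '2011-02-08 12:01:00'
GROUP BY ts;
```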

MySQL and PostgreSQL are really OLTP databases, but you're running an OLAP
workload here (multi-minute queries and table scans are essentially the
definition of OLAP). As you say, a comparison against Hadoop is probably more
relevant (as long as you compress the data). For this workload, an OLAP
database is the tool to beat, and Hadoop is probably the leading contender in
the open-source OLAP space.

Though I'd normally bet on PostgreSQL, I think the easiest fix is to turn on
InnoDB compression in MySQL: you should see query times that are much more
comparable. You'd probably still win (because MySQL is doing a lot more), but
the margin should be much smaller.
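
As a sketch of what turning that on looks like (the table name and block size
are illustrative, not from the benchmark; assumes a MySQL version with the
InnoDB plugin's Barracuda file format):

```sql
-- Prerequisites for compressed InnoDB tables (MySQL 5.1 plugin / 5.5):
SET GLOBAL innodb_file_per_table = 1;
SET GLOBAL innodb_file_format = Barracuda;

-- Rebuild the (hypothetical) benchmark table compressed; KEY_BLOCK_SIZE=8
-- means 16K pages are compressed into 8K blocks.
ALTER TABLE hits ROW_FORMAT = COMPRESSED, KEY_BLOCK_SIZE = 8;
```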

I think it's also possible to denormalize your tables and improve your
indexing so as to avoid table scans, and then PostgreSQL/MySQL should be much
faster, but I don't think that's really the point of your benchmark.

Edit: Just spotted a (much bigger) problem: if your data is 54GB raw, then
compressed it should fit within the memory of your machine (12GB), but if
MySQL and PostgreSQL aren't compressing, their data won't fit into memory
(particularly with all those unused indexes). So in fact you're comparing RAM
speeds to SSD speeds, which really isn't apples to apples.

~~~
matthewnourse
Thanks for taking the time to look over the tests in such detail!

MySQL's EXPLAIN output says that it's using the indexes. PostgreSQL's EXPLAIN
output says that it's _not_ (and from what I can tell so far, this is a
deliberate design decision by the PostgreSQL developers). r17 "loses" to MySQL
on one query and "wins" on the other. So I agree that r17 has a much harder
job competing against indexed data... but with r17 you didn't have to wait to
create the index. That's not such a big deal for OLTP, but a much bigger deal
for OLAP.

I would like to redo the bakeoff(s) with compressed InnoDB tables, but as they
can take several days to run and life is short, I should first focus on
Hadoop, as we both agree that's more relevant.

Re the data size issue: the 54GB raw data set compresses to 28GB. The data
generator I use creates data that's more random than the "real world", so it
doesn't compress very well... which makes life harder for r17, which is what I
want. The smaller data set for the SSD test compresses to about 13GB, which I
agree isn't ideal... I should have bought a larger SSD; I didn't think that
MySQL would make so many large temporary files :). To mitigate this issue I
ensured that the data set was _not_ cached before I ran the r17 script.

I am very keen to find out the truth about r17's usefulness, thanks for your
part in that. For the next bakeoff I'll provide more details about methods and
machine behavior.

------
rbranson
This fails the most basic benchmark rules. Do you really think PostgreSQL
is 20-40x slower than alternative implementations? Do you think that is
reasonable?

I'm going to go ahead and assume (as with most benchmarks) that the PostgreSQL
instance was not configured properly and was running a stock configuration.

~~~
matthewnourse
[re fail] my main goal is to find out if r17 is "generally useful" for other
people doing data mining, _not_ to pull the wool over any eyes. What would you
like to see done differently? I am very happy to redo the comparison...or
better yet, help someone else to redo it and satisfy themselves (or not) of
r17's usefulness.

[re reasonableness of PostgreSQL being 20-40x slower] Yes, but only at some
things. In the first bakeoff
(<http://www.rseventeen.com/blog/bakeoff_part_1_mysql_postgresql_r17.html>)
PostgreSQL was clearly "better" than MySQL. And r17 can't do transactions and
is useless for anything "on line", which is where PostgreSQL really shines.
(As commented below) next up I'll be baking off against Hadoop, which is a
more apples-to-apples comparison.

[re misconfiguration] is there a specific configuration that you would prefer
that I use? I'm also happy to show the EXPLAIN output if that would help.

~~~
molecule
With the default configurations for the mysql and postgresql daemons, these
benchmarks do nothing to make full use of the hardware cited.

Here are guides for the respective db configs.

<http://www.revsys.com/writings/postgresql-performance.html>

<http://www.mysqlperformanceblog.com/2006/09/29/what-to-tune-in-mysql-server-after-installation/>
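
As a rough illustration of what those guides suggest (values are generic
starting points for a 12GB machine, not a tuned recommendation for this
particular benchmark):

```
# postgresql.conf -- illustrative starting points, per the revsys guide
shared_buffers = 3GB          # default is tiny; ~25% of RAM is typical
effective_cache_size = 8GB    # planner hint about the OS page cache
work_mem = 64MB               # memory per sort/hash operation
checkpoint_segments = 32      # fewer checkpoint stalls during bulk loads
```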

------
matthewnourse
R17 is a data mining language that's a cross between SQL and Bash. For example
this

    SELECT username, COUNT(1) AS num FROM users GROUP BY username ORDER BY num;

is roughly equivalent to

    io.file.read('users') | rel.select(username) | rel.group(count) | rel.order_by(_count);

The most interesting difference is that each r17 clause executes concurrently
:).
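
A rough shell analogue of the query above (assuming a hypothetical `users`
file with one username per line) shows the same shape:

```shell
# Each stage is a separate process running concurrently, streaming data
# through the pipe -- the same pipeline shape as the r17 query above.
# 'users' is a hypothetical file with one username per line.
sort users | uniq -c | sort -n
```

As with r17, adding another stage is just appending to the pipeline.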

Download link is here: <http://www.rseventeen.com/#download>

~~~
skimbrel
At first glance it looks like doing a query involves streaming the entire
dataset into memory while selecting and projecting on the fly. If that's true,
what happens when you have truly massive rows (i.e., things containing
MEDIUMTEXTs or worse)?

Okay, reading further down you only get very basic data types. Still, nothing
in the spec appears to prohibit very long rows, and I'd imagine performance
starts to fall off once you're throwing around tens of kilobytes per row. Any
plans to support pushing the projection operation into the read phase so you
can work with massive individual records?

And where's the source? I want to see exactly how much this differs from a
modern SQL engine.

~~~
matthewnourse
It streams the dataset into memory 256K at a time (more for longer rows). It
doesn't load the whole dataset into RAM unless it must, e.g. for a join, sort
or grouping.

I don't currently have plans to push projection into the read phase, but the
phases are all pretty close together :) so maybe it wouldn't be required. How
massive is "massive" for you? 10s of K? Megs?

R17 is not currently open source, but I haven't ruled it out.

------
kemiller
I think it's disingenuous to measure load+index+query time and then declare
that you're faster. Not saying there aren't use cases where that's valuable
(analytics/warehousing db for one) but it's not what most people are looking
for. Is this usable as a transactional store? What do those numbers look like?

The pipeline model is very interesting, though, and I'll take a look for our
warehouse, which at some point became too expensive to keep up with.

~~~
matthewnourse
I agree that it would be disingenuous to base "faster" on load+index+query.
I'm basing it on the query times alone. I would like very much to base it on
load+index+query 'cause then I could have said "faster than MySQL and
PostgreSQL" :) [and probably many systems, since r17 is built specifically for
zero indexing overhead].

R17 is _only_ for analytics & warehousing, it doesn't do transactions at all.
So in that sense this comparison is unfair, which is why next up I want to do
the same comparison with Hadoop.

Thanks for taking a closer look for your warehouse! Please contact me directly
or comment somewhere if you'd like any help from me.

------
ck2
Boosting speed, fantastic - replacing the completely easy and logical SQL
language, not so much:

    io.file.read('users') | rel.select(username) | rel.group(count) | rel.order_by(_count);

is

    SELECT username, COUNT(1) AS num FROM users GROUP BY username ORDER BY num;

seriously? why?

~~~
matthewnourse
I hear you, and I would have preferred to use the SQL (or any other well-
understood) syntax. Some of the reasons are:

1) I want the query writer to be in complete control over what happens first,
rather than a query optimizer. The r17 syntax makes this explicit.

2) Similarly, I want it to be clear which things will be executed in parallel
and which won't.

3) I agree that SQL is "completely easy and logical" if your query is small,
but for large SQL queries I find that I need to understand the query as one
big mass. With r17 you can take it one clause at a time. It's a bit like a
stack-based language in this respect: you can keep adding clauses at the end
of a pipeline, e.g. a WHERE (or even multiple WHEREs) can come after a GROUP
BY, ORDER BY etc. Contrast this with SQL's need for a special HAVING clause.
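
To make the HAVING point concrete with the query from earlier (the threshold
is arbitrary, purely for illustration): filtering on the aggregate in SQL
needs the dedicated clause, because WHERE runs before GROUP BY and cannot see
the grouped count.

```sql
-- WHERE num > 10 would be an error here: WHERE runs before GROUP BY.
-- SQL needs the special HAVING clause to filter on the aggregate.
SELECT username, COUNT(1) AS num
FROM users
GROUP BY username
HAVING COUNT(1) > 10
ORDER BY num;
```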

------
molecule
what configurations were used for the mysql and postgres daemons?

------
fleitz
That's great but does it beat SSAS? Does it beat LINQ?

It seems that the primary competitors to r17 would be K/Q, LINQ, PowerShell,
and possibly SSAS (unsure how much statistical power is in r17).

~~~
matthewnourse
r17 doesn't run on Windows, and I don't have any plans to make it work on
Windows at this time.

Apart from that niggle, I agree...more bakeoffs needed. There's not much
statistical power in r17, it's more a brute force thing. If you want more
finesse the idea is to hand off to a fancier (and most likely slower) language
like R.

