
A Bunch of Money on AWS and Some Benchmark Results - nikita
https://www.memsql.com/blog/memsql-tpc-benchmarks/
======
georgewfraser
As someone who's spent a lot of time working with TPC-DS [1] and talking to
people about it [2], I see a couple areas that could be improved in this
benchmark:

1\. Total run time is not an appropriate way to summarize the performance
across queries, because some queries take 100x longer than others. The
appropriate way to summarize this kind of data is to use the geomean [3].
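To make the point concrete, here's a minimal sketch (with made-up runtimes for two hypothetical systems A and B) showing how total runtime and geomean can disagree when one query dominates:

```python
import math

# Hypothetical per-query runtimes (seconds) for two systems; the third
# query dominates total runtime on both.
runtimes_a = [1.0, 2.0, 400.0]
runtimes_b = [4.0, 8.0, 300.0]

def geomean(xs):
    """Geometric mean: exp of the mean of the logs."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

total_a, total_b = sum(runtimes_a), sum(runtimes_b)    # 403.0 vs 312.0
gm_a, gm_b = geomean(runtimes_a), geomean(runtimes_b)  # ~9.3 vs ~21.3

# By total runtime, B looks faster; by geomean, A is clearly faster,
# because the total is swamped by the single slow query.
```

System B's one slow query hides that it's 4x slower on everything else; the geomean weights each query equally instead of letting the slowest one dominate.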

2\. The official TPC-DS queries make heavy use of grouping sets, which are a
rarely-used SQL feature. I think TPC-DS is better if you rewrite the queries
to eliminate grouping sets.
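For readers unfamiliar with the feature: `GROUPING SETS ((region), (product), ())` is equivalent to a UNION ALL of plain GROUP BY queries, which is the rewrite being suggested. A toy Python sketch (hypothetical data) of that equivalence:

```python
from collections import defaultdict

rows = [
    {"region": "east", "product": "a", "amount": 10},
    {"region": "east", "product": "b", "amount": 20},
    {"region": "west", "product": "a", "amount": 5},
]

def group_sum(keys):
    """SUM(amount) grouped by the given key columns."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[k] for k in keys)] += r["amount"]
    return dict(totals)

# GROUPING SETS ((region), (product), ()) produces the union of
# three separate GROUP BY aggregations:
by_region = group_sum(["region"])    # {('east',): 30, ('west',): 5}
by_product = group_sum(["product"])  # {('a',): 15, ('b',): 20}
grand_total = group_sum([])          # {(): 35}
```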

3\. You used the exact same queries to "warm up" the data warehouse, and to
test the performance. Some data warehouses (notably Redshift) aggressively
cache intermediate compilation results, so they are much faster the second
time they see a query or even a _fragment_ of a query. To model a real user
submitting queries interactively, you should use warmup queries that are
similar to but not the same as the ones you use to measure performance.
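One hypothetical way to build such warmup queries is to template the measured statement and randomize its literals, so a cache keyed on exact statement text (or fragments of it) can't simply replay prior work. The query shape and column names below are made up for illustration:

```python
import random

# Template of the measured query, with the date literal parameterized.
TEMPLATE = (
    "SELECT store, SUM(amount) FROM sales "
    "WHERE sale_date >= '{d}' GROUP BY store"
)

def warmup_variant(rng):
    """Same query shape as the measured run, different literal."""
    return TEMPLATE.format(d="2019-01-{:02d}".format(rng.randint(1, 28)))

measured = TEMPLATE.format(d="2019-02-01")
rng = random.Random(42)
warmups = [warmup_variant(rng) for _ in range(5)]
# Every warmup exercises the same plan shape but differs textually
# from the measured statement.
```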

4\. You can solve the "vendor benchmarking their own product" problem by
submitting a PR to our repo [4], which currently tests Redshift, Snowflake,
BigQuery, Azure SQL DW, and Presto. We'd be happy to review it and endorse the
timing if it meets our standards for fairness!

[1] [https://fivetran.com/blog/warehouse-benchmark](https://fivetran.com/blog/warehouse-benchmark)

[2] [https://www.youtube.com/watch?v=XpaN-PqSczM](https://www.youtube.com/watch?v=XpaN-PqSczM)

[3]
[https://en.wikipedia.org/wiki/Geometric_mean](https://en.wikipedia.org/wiki/Geometric_mean)

[4]
[https://github.com/fivetran/benchmark](https://github.com/fivetran/benchmark)

~~~
sroussey
Curious to see TiDB in there

~~~
AdamProut
You need columnstore storage and a reasonably mature query optimizer to get
good results on TPC-DS. TiDB is lacking both right now.

~~~
dongxu
TiDB developer here. We're working on columnstore for TiDB and will release it in the near future. On the query optimizer side, I think that as of TiDB 3.0 the optimizer produces the best execution plan for most of the TPC-H queries (at TPC-H 50G), but we haven't tested on TPC-DS yet.

------
arjunnarayan
This benchmark is pretty ridiculous for the following reasons:

1\. Their database is run in asynchronous durability mode.

2\. They specifically do the one thing that TPC-C says you shouldn't do, which
is get really high throughput on a small dataset. TPC-C requires that you
scale the data you store with the query throughput. CockroachDB maxes out at
~12.86 tpmC/warehouse because _it's waiting at the legal maximum throughput_, as
opposed to running up the numbers in a way that's against the rules (and
spirit) of the benchmark.

3\. They make all the TPC-DS mistakes that georgewfraser points out elsewhere
in this thread.

4\. They run in read committed mode (they don't support anything higher),
CockroachDB runs in serializable mode.

I ended up ranting about this on Twitter, so rather than reproducing
everything here, I'm going to link to my rant there. Apologies for the cross-
posting across fora:
[https://twitter.com/narayanarjun/status/1128393193941274624](https://twitter.com/narayanarjun/status/1128393193941274624)

~~~
AdamProut
(MemSQL CTO here)

1\. MemSQL is running with synchronous replication in all these benchmarks.
All data is stored on a 2nd machine before any transaction is acknowledged as
committed. You’re right this is not as strong as running with both synchronous
writes to disk and over the network. MemSQL supports that as well, and it results
in about a 40 to 50% performance hit depending on the disk speed. Very few of
our customers run in this configuration, so we didn’t include it (the edge case
of multiple machines losing power is not worth the performance hit for them).

2\. Can you point me to what you’re describing in the TPC-C specification? I
have never heard of what you’re claiming. TPC-C has maximum allowed latency
requirements for the 5 transaction types it runs and also requirements around
the mix of those transactions in the workload. The goal of the benchmark is
still to run as many "New Order" transactions per minute as possible while maintaining the
latency requirements of the other unmeasured transactions running in the
background (this is what tpmC stands for). We used the Percona TPC-C driver
for MySQL to handle this (with a few small bug fixes).

3\. The main thing we wanted to show is that our performance on TPC-DS is
similar (better at some scale factors, slower at others) to data warehouses
that specialize in running these types of queries. We likely should have
provided more details (per-query breakdowns and so on).

4\. We used the Percona MySQL TPC-C driver with some changes to make the
initial data loading faster. That driver uses the “FOR UPDATE” clause in MySQL
instead of running in serializable isolation level.

I know you did a lot of work on CockroachDB. The point of the blog post was
not to attack cockroach (I personally didn’t want to mention it at all), but
to show how MemSQL is different. We are one of the few distributed SQL
databases with competitive results on all 3 major TPC benchmarks.

~~~
arjunnarayan
1\. Comparing numbers from one system (Cockroach) that adheres to strict
durability requirements to another that does not (MemSQL) is apples to
oranges, especially since, as you point out, you see a roughly 2x performance
hit when you impose that requirement.

2\. What you're looking for is the 'Think Time' mentioned in the TPC-C spec[1]
(table in 5.2.5.7). From 5.2.5.2, I quote:

> for each transaction type, the Keying Time is constant and must be a minimum
> of 18 seconds for New-Order, 3 seconds for Payment, and 2 seconds each for
> Order-Status, Delivery, and Stock-Level.

Chapter 4 is pretty thorough on elaborating on this. The comment under section
4.1.3 explicitly states:

> Comment: The maximum throughput is achieved with infinitely fast
> transactions resulting in a null response time and minimum required wait
> times. The intent of this clause is to prevent reporting a throughput that
> exceeds this maximum, which is computed to be 12.86 tpmC per warehouse.

Again, CockroachDB numbers are right up against this limit - because the
database is waiting, as required! It's within ~99% of the maximum allowed. No
bar is allowed to go more than 1% higher! So stacking a bar chart next to it
that goes 10x higher is pretty misleading.
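For the curious, the 12.86 tpmC/warehouse ceiling can be reproduced from the spec's minimum keying and think times and the minimum transaction mix. The figures below are my reading of the spec's tables (clauses 5.2.4 and 5.2.5.7), so treat them as assumptions to verify against the PDF:

```python
# (share of mix, min keying time in s, min mean think time in s)
MIX = {
    "New-Order":    (0.45, 18, 12),
    "Payment":      (0.43,  3, 12),
    "Order-Status": (0.04,  2, 10),
    "Delivery":     (0.04,  2,  5),
    "Stock-Level":  (0.04,  2,  5),
}
TERMINALS_PER_WAREHOUSE = 10

# With infinitely fast transactions (zero response time), each terminal's
# cycle is just keying + think time, averaged over the mix.
avg_cycle_s = sum(w * (key + think) for w, key, think in MIX.values())

# New-Order transactions per minute, per terminal, times 10 terminals.
tpmc_per_warehouse = (
    60.0 / avg_cycle_s * MIX["New-Order"][0] * TERMINALS_PER_WAREHOUSE
)
print(round(tpmc_per_warehouse, 2))  # 12.86
```

That is, even a database with zero-latency transactions cannot legally report more than ~12.86 tpmC per warehouse; higher numbers mean the driver skipped the mandated wait times.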

3\. I'm genuinely impressed that you can run all the TPC-DS queries. But
performance-wise, there isn't enough fleshed out here, and given that the
TPC-DS authors explicitly disavow the single metric that you use (power test
numbers), it's simply too little to claim parity with existing databases. That
said, in this conversation I'm an OLTP guy; I'll let others more experienced
with data warehouse benchmarking take this up, e.g. [3]

4\. This one I'll concede: you are doing the appropriate thing per the spec
(SELECT FOR UPDATE ensures serializability here), but it's the one part of the
spec that hasn't held up over time - the paper "Making Snapshot Isolation
Serializable"[2] is a great explanation of just what lengths you have to go to
in order to prove that a set of transactions only produces serializable
histories when run in a degraded isolation mode. That said, fair enough: no
anomalies will be present, thanks to Alan Fekete's proof. But do note that
CockroachDB is doing _a lot of extra work_ (work that MemSQL can elide, since
it's simply not checking for isolation anomalies) to ensure that histories are
always serializable[4].

5\. While I don't work there, I did a lot of work specifically on benchmarking
CockroachDB, and would like to politely request that you take down those bars
for CockroachDB, since you're taking numbers that are shackled to the THINK
TIME maximum and comparing them to a system that is not.

[1]: [http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-c_v5.11.0.pdf](http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-c_v5.11.0.pdf)

[2]:
[https://dl.acm.org/citation.cfm?id=1071615](https://dl.acm.org/citation.cfm?id=1071615)

[3]:
[https://twitter.com/gregrahn/status/1128448156180422656](https://twitter.com/gregrahn/status/1128448156180422656)

[4]: I'll shamelessly plug my blog post on this for the reader interested in
more about transaction isolation levels:
[https://ristret.com/s/f643zk/history_transaction_histories](https://ristret.com/s/f643zk/history_transaction_histories)

~~~
dkhenry
That "Think Time" you are referring to is supposed to emulate users running
transactions on the database. So it's not the database waiting, it's the
driver waiting. While I do understand the reason for putting that in, you know
very well that violating that limit doesn't artificially give CockroachDB or
MemSQL an advantage when you are talking about 100,000 warehouses and a random
distribution of transactions.

If CockroachDB is concerned about THINK TIME enough to ask for the numbers to
be removed, this would be a great opportunity for them to remove that limit
and see exactly how much they could push the benchmark.

~~~
knz42
You should read up on why this think time exists. It has nothing to do with
"emulating slow clients" and everything to do with not claiming "I have a fast
database" by running a gazillion txn/sec on just 1MB of data in RAM.

TPC-C requires that you increase the amount of "live data" if you want to
display/advertise more performance. That's the benchmark's rule.

If you want to benchmark something else, that's fine, but then

1) don't call it "TPC-C"

2) don't compare with databases that play by the rules.

~~~
dkhenry
I know why it exists, and my point is they are not trying to show a gazillion
txn/sec on 1MB of data. You are looking at a dataset that is several TBs.
They have far surpassed the point where a vendor could cheat by putting
all the data in L1 cache and claiming to be fast.

------
tyingq
Might be advisable to note that you don't support foreign keys if you're going
to show how much better your performance is versus a database that does.

~~~
AdamProut
(MemSQL CTO here)

You're right, MemSQL doesn't support foreign keys as of yet, but none of these
benchmarks require foreign key support. Two of them (TPC-H and TPC-DS) are a
set of complex SELECT queries where foreign keys are not relevant at all.
TPC-C is a write heavy benchmark, but the specification doesn't require
foreign keys to be maintained (the data model does indicate the foreign key
relationships though)[1].

These are unofficial benchmark results (not independently verified by TPC), so
our interpretation of the specs may not be 100% correct, but I think we got it
right as far as foreign keys are concerned.

[1] [http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-c_v5.11.0.pdf](http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-c_v5.11.0.pdf)

~~~
tyingq
Just seems like you would want that noted if you're going to compare with
CockroachDB and say things like _"Our results show that we can do both
transaction processing and data warehousing well"_.

------
mc110
The great thing about AWS and other cloud platforms is that you can put the
data in e.g. S3, then use the tool of choice for your workloads.

I'm assuming that MemSQL works fine with that sort of configuration, rather
than requiring you to lock your data up in some proprietary format.

Also, unlike the bad old days of on-premise platforms, you can try things out
to see how they work. You could even do that with a public dataset first, to
see how it works (see
[https://registry.opendata.aws/](https://registry.opendata.aws/) for a list of
these).

For example, there is an Amazon Customer Review dataset of over 160 million
customer reviews - you could use that and try MemSQL for various use cases,
then look at alternatives.

Disclaimer - clearly as a Kognitio employee I'd suggest you look at us for
analytics use cases, and you can see an example of sentiment analysis at scale
using the Amazon Customer Review data set at
[https://kognitio.com/blog/sentiment-analysis-amazon-reviews-pt1/](https://kognitio.com/blog/sentiment-analysis-amazon-reviews-pt1/).
There are also a couple of articles on LinkedIn:
[https://www.linkedin.com/pulse/100-shades-grey-other-amazon-review-discoveries-mark-chopping/](https://www.linkedin.com/pulse/100-shades-grey-other-amazon-review-discoveries-mark-chopping/)
for another piece of work on that same data, and
[https://www.linkedin.com/pulse/media-brexit-story-so-far-may-corbyn-less-shambles-than-mark-chopping/](https://www.linkedin.com/pulse/media-brexit-story-so-far-may-corbyn-less-shambles-than-mark-chopping/)
for a view on global media coverage of Brexit over time.

------
diminoten
Regardless of the minor technical nitpicks here, I respect the hell out of the
person/people who wrote this article, if only because I know how insanely hard
it is to express a technical thought in great detail while maintaining a
certain amount of levity to try and keep people interested.

------
peterwwillis
Similar to the phrase _"Lies, damned lies, and Benchmarks"_, there should be
the phrase _"Lies, damned lies, and AWS costs"_.

~~~
wmf
Because?

~~~
peterwwillis
Because all a benchmark tells you is the result of one use case. It's not an
estimate of what _your_ use will result in. It's basically a sales pitch. And
sadly, upper management falls for it all the time, often without an evaluation
period or even spitballing it for a week with the teams that would be using
it. Tons of factors will change the result for a particular use case,
regardless of how "normalized" the test tries to be (especially considering
most are designed for high-load high-performance high-scale scenarios, which
isn't the average real-world use case; most people just have unoptimized
queries, or a shit legacy stack, or they expect to _eventually_ have a big use
case and were told to find either the fastest or cheapest solution, which may
not even be in AWS to begin with, and so on...). The AWS cost is just another
benchmark, so it's just as much a lie.

~~~
diminoten
The whole point of benchmarks is that their "use case" correlates roughly with
_every_ use case related to the technology being benchmarked.

Is the correlation 1:1? No. Is it still relevant when making a decision about
what technology to use? Absolutely.

~~~
dragonwriter
> The whole point of benchmarks is that their "use case" correlates roughly
> with every use case related to the technology being benchmarked.

That's the _idea_ of benchmarks, but it's not actually true of real
benchmarks, because invariably some factors which improve a particular
benchmark bear an inverse relation to performance for some other use cases of
the technology.

~~~
diminoten
The problem you're having is that everybody already _knows_ that, and plenty
of folks can still extract value out of a benchmark anyway.

~~~
dragonwriter
> The problem you're having

I'm not having a problem.

> is that everybody already knows that,

Certainly, some people act like they don't, at least in the context of
specific benchmarks.

And some people outright claim the opposite of what you say everyone
understands, e.g., by claiming that benchmarks inherently correlate with every
possible use of a technology.

> and plenty of folks can still extract value out of a benchmark anyway.

Understanding both the general issue and, ideally, the specific areas of
potential concern in relation to particular benchmarks and your intended use
is a big part of being able to effectively extract value from a benchmark.
And, yes, lots of people do recognize those facts and extract value from
benchmarks.

Others _don't_, and still _apply_ benchmarks in decision-making, but it's
less clear that they are extracting value. Confusing the measurement most
readily available with the measurement most relevant to need is a common
problem (and not just with benchmarks.)

~~~
diminoten
The problem you're having is you can't figure out how to use benchmarks to
understand how a system works.

You're throwing up your hands and saying, "TOO DIFFERENT FROM REALITY!"

Other folks don't do that, and while not 1:1, they _are_ able to correlate the
performance of a benchmark with the performance of their own use case.

Try not to get wrapped around the axle on "everyone", by the way, it's not
literal.

~~~
dragonwriter
> The problem you're having

Again, I'm not having a problem.

> is you can't figure out how to use benchmarks to understand how a system
> works

No, I have no particular problem evaluating whether a benchmark is useful to a
decision and, if so, how.

Nor have I said anything indicating any such problem.

> You're throwing up your hands and saying, "TOO DIFFERENT FROM REALITY!"

No, I'm saying the naive statement upthread that benchmark performance
correlates with every possible use of a technology is nonsense.

That's it.

> Other folks don't do that

Actually, some do, but that's neither here nor there.

> and while not 1:1, they are able to correlate the performance of a benchmark
> with the performance of their own use case.

Once again, yes, lots of people have the skill to figure out whether and in
what way particular benchmarks have utility for their usecases. I explicitly
said that in the post you are responding to.

That's very different from your claim that I reacted _against_, which is that
any benchmark inherently correlates with every use case, which
is—again—complete nonsense.

~~~
diminoten
> any benchmark inherently correlates with every use case

Oh. That's what you thought I said? Right. I didn't mean to say that, and if
somehow I did say that I withdraw. Benchmarks are valuable if you know how to
use them, but of course I agree with you, a benchmark isn't always relevant to
every use case.

Side note, quoting many small portions of a person's comment and replying
exclusively to the quoted bits is inferior to replying to them with full
sentences/paragraphs. I've only ever seen what you're doing done by folks who
are _super_ interested in arguing and _completely_ uninterested in having a
conversation.

It definitely comes across like you _do_ have a problem, which you have
repeatedly stated you do not (and I believe you)! It just muddies the water,
what you're doing here.

