
Breaking the trillion-rows-per-second barrier with MemSQL - navinsylvester
http://blog.memsql.com/shattering-the-trillion-rows-per-second-barrier-with-memsql
======
JVerstry
I have doubts about the "We’re actually spending some time on every row"
claim. Storing data in a columnstore often comes with metadata saved at the
page level, such as min, max and count values for the data in that page. These
values are used to filter the data processed for a query, i.e. to skip whole
pages. If you run a 'count' query 10 times, it's very unlikely the DB will
count rows 10 times. It will rely on the page's existing metadata when
available (i.e., already computed). The tests described in the post are
misleading IMHO.

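For context, the page-level skipping described above (often called segment
elimination) works roughly like the following minimal C sketch. This is
purely illustrative; the struct and field names are hypothetical, not
MemSQL's or any particular engine's actual metadata layout.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        int32_t        min;        /* smallest value stored in this segment */
        int32_t        max;        /* largest value stored in this segment  */
        size_t         row_count;  /* number of rows in this segment        */
        const int32_t *rows;       /* decoded values, read only when needed */
    } segment;

    /* Count rows equal to `target`, skipping segments whose [min, max]
       range cannot contain it. */
    size_t count_equal(const segment *segs, size_t n_segs, int32_t target) {
        size_t total = 0;
        for (size_t s = 0; s < n_segs; s++) {
            if (target < segs[s].min || target > segs[s].max)
                continue;                         /* whole segment skipped */
            for (size_t i = 0; i < segs[s].row_count; i++)
                if (segs[s].rows[i] == target)
                    total++;                      /* per-row work only here */
        }
        return total;
    }

    /* For an unfiltered COUNT(*), the stored row counts alone suffice;
       no row is ever touched. */
    size_t count_all(const segment *segs, size_t n_segs) {
        size_t total = 0;
        for (size_t s = 0; s < n_segs; s++)
            total += segs[s].row_count;
        return total;
    }
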
EDIT: This comes on top of the fact that DBs can store query results too.
Moreover, the post does not say whether they have clustered or filtered
indexes on the columns in question, nor does it explain how the data was
partitioned. All of this has a big impact on execution time.

~~~
nikita
The query in the example is not “count” but a count + group by query. While it
is possible to precompute group by results for every possible column per page,
we don’t do that, and this query does touch every row. The article explains
how this is possible: operations on encoded data, AVX2, and not using a hash
table for the intermediate result of the group by. And we certainly don’t fake
it by storing query results.

We guarantee that the result is legit.
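
For readers wondering what "not using a hash table for the intermediate
result of the group by" can look like, here is a minimal, hypothetical C
sketch of a COUNT(*) + GROUP BY over a dictionary-encoded column. The names
and layout are illustrative only, not MemSQL's generated code.

    #include <stddef.h>
    #include <stdint.h>

    /* Each row stores a small integer dictionary code; the aggregate is a
       flat array indexed by that code, so no hash table is needed for the
       intermediate GROUP BY result. */
    void group_by_count(const uint16_t *codes,   /* one code per row        */
                        size_t          n_rows,
                        uint64_t       *counts,  /* one slot per dict entry */
                        size_t          dict_size) {
        for (size_t g = 0; g < dict_size; g++)
            counts[g] = 0;
        /* Every row is touched, but the per-row work is a single indexed
           increment, which is cheap and friendly to SIMD/pipelining. */
        for (size_t i = 0; i < n_rows; i++)
            counts[codes[i]]++;
    }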

~~~
HenryR
What does the columnar format look like? In particular, is the group by column
compressed with RLE? That's effectively a pre-computed group-by + count, which
would make this kind of query very, very fast :)

~~~
nikita
This is an astute observation. Here is the accurate answer from the author of
the feature: “We do have special handling for RLE for filters and for the
group by column, but in both cases there would be a per-row component
(updating the selection vector for a filter on RLE, updating aggregates for a
group by on RLE), and so far I have seen RLE encoding usually be slower than
dictionary encoding for both of these cases.”
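
To illustrate the observation above, here is a minimal, hypothetical C sketch
of why an RLE-encoded group by column resembles a pre-computed aggregate: a
COUNT(*) can add a whole run's length in one step. As the quote notes, real
filters and other aggregates still involve per-row work, so this is only the
best case.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint16_t value;   /* group key (e.g. a dictionary code) for the run */
        uint32_t length;  /* number of consecutive rows sharing that value  */
    } rle_run;

    void group_by_count_rle(const rle_run *runs, size_t n_runs,
                            uint64_t *counts, size_t n_groups) {
        for (size_t g = 0; g < n_groups; g++)
            counts[g] = 0;
        for (size_t r = 0; r < n_runs; r++)
            counts[runs[r].value] += runs[r].length;  /* one update per run */
    }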

------
nmstoker
Impressive, although reading that they used 448 cores on super-expensive Intel
lab machines takes the edge off a little.

~~~
AmVess
1,280,625,752,550 rows per second is pretty impressive, even if it was done on
a million dollars' worth of CPUs.

~~~
StreamBright
And do you have a business use case that requires that?

~~~
AndyNemmity
I've worked on a ton of business cases that require it. This is what I worked
on for years.

It's generally a large-company, large-issue sort of problem. You can throw
millions in hardware/software at a problem that saves you hundreds of
millions.

~~~
nikita
We can solve it quite efficiently. You can define an index on a columnstore
table and we will maintain order on a per-node basis.

------
ddorian43
So, some questions:

1. Aren't (3) vectorization and (4) SIMD the same thing?

2. I don't see the data size before and after compression; what was it?

3. How much RAM does each server have?

4. How do all the cores work on all queries? Is the data sharded by core on
each machine, or can each core work on whatever data?

5. What's a comparable open-source tool? The only one I can think of is
SnappyData.

~~~
plamb
SnappyData employee here. In general this is called the "HTAP" industry
(Gartner's phrase: Hybrid Transactional/Analytical Processing).

SnappyData: [https://www.snappydata.io](https://www.snappydata.io), MemSQL:
[https://www.memsql.com/](https://www.memsql.com/), Splice Machine:
[https://www.splicemachine.com/](https://www.splicemachine.com/), SAP Hana:
[https://www.sap.com/products/hana.html](https://www.sap.com/products/hana.html),
GridGain: [https://www.gridgain.com/](https://www.gridgain.com/)

are some of the technologies within it.

------
danbruc
Meh. They used 448 cores to count the frequency of bit patterns of some small
length in a probably more or less contiguous block of memory. They had
57,756,221,440 total rows, which is 128,920,138 rows per core. If the data set
contained 256 or fewer distinct stock symbols, then the task boils down to
finding the byte histogram of a 123 MiB block of memory. My several-years-old
laptop does this with the most straightforward C# implementation in 170 ms.
That is less than a factor of 4 away from their 45.1 ms, and given that
AVX-512 can probably process 64 bytes at a time, we should have quite a bit of
room to spare for all the other steps involved in processing the query.
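
For reference, the straightforward byte-histogram loop being described is
essentially the following; this is a C sketch of the idea, not the
commenter's actual C# code.

    #include <stddef.h>
    #include <stdint.h>

    /* Count how often each of the 256 possible byte values occurs in a
       block of memory. */
    void byte_histogram(const uint8_t *data, size_t len,
                        uint64_t counts[256]) {
        for (int v = 0; v < 256; v++)
            counts[v] = 0;
        for (size_t i = 0; i < len; i++)
            counts[data[i]]++;          /* one increment per input byte */
    }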

Don't get me wrong, in some sense it is really impressive that we have reached
that level of processing power and that this database engine can optimize that
query down to counting bytes and generate highly performant code to do so, but
as an indicator that this database can process trillions of rows per second it
is just a publicity stunt. Sure, it can do it with this setup and this query,
but don't be too surprised if you don't get anywhere near that with other
queries.

~~~
defen
> My several-years-old laptop does this with the most straightforward C#
> implementation in 170 ms.

Sure, but writing a custom C# program for each query, on data that has been
pre-formatted in the optimal manner for that program, is not really comparable
to writing a bog-standard SQL query.

~~~
danbruc
I am not sure I understand your point, but I assume you mean that what I did
was kind of cheating because I wrote code tailored to that problem, working on
data in a format that makes the problem rather easy? Correct me if I am wrong.

But they did exactly the same thing, except that the query planner, optimizer,
and compiler generated the code from the SQL query. They still picked a data
set, a physical layout of that data set, and a query that would result in
maximum throughput. That was not just any random SQL query; it was a carefully
picked one, chosen because it would be able to take advantage of all possible
optimizations.

------
jnordwick
And of course, how does it compare to kdb? It seems less expensive, but it
also lacks the advanced query language.

The last tests I saw for kdb were the 1.1 billion taxi rides benchmark.

[http://tech.marksblogg.com/billion-nyc-taxi-kdb.html](http://tech.marksblogg.com/billion-nyc-taxi-kdb.html)

Where it basically outperformed every other CPU-based system on slightly more
complex queries.

Any comparisons planned?

~~~
manigandham
It'll be faster than Redshift and about the same as ClickHouse, +/- depending
on hardware and setup.

It's a great system; we used it for 2 years, and it's one of the most polished
databases out there, with a simple MySQL interface. It's more general-purpose
than kdb, with a nice rowstore + columnstore architecture. I believe they're
adding full-text search indexes in the latest version too.

If you need the query language, the advanced/asof joins, or the tightly
integrated query/process environment, then there's no match for kdb, though.

------
thinkMOAR
"When you deliver response time that drops down to about a quarter of a
second, results seem to be instantaneous to users."

I don't think everybody agrees with this statement.

~~~
WorkLifeBalance
The threshold for most people is slightly lower than a quarter second, but
it's certainly the case that above a quarter-second users will notice the
delay[0].

However, people may be primed to perceive longer delays as still instant. E.g.
smartphones for many years had a built-in 300 ms delay on any click event, and
even without that delay they typically still lag on many 'instant' actions.

So while the delay will be registered as present, it may not be registered as
"this site is slow" but as "it's just a natural delay".

[0] [https://psychology.stackexchange.com/questions/1664/what-is-...](https://psychology.stackexchange.com/questions/1664/what-is-the-threshold-where-actions-are-perceived-as-instant)

~~~
spyder
The threshold for noticeable touch "rubberbanding" is 1 ms:

[https://www.youtube.com/watch?time_continue=52&v=vOvQCPLkPt4](https://www.youtube.com/watch?time_continue=52&v=vOvQCPLkPt4)

------
n0tme
If all this data just fits in memory then what is surprising about the speed?

~~~
bufferoverflow
Well, you can try running MySQL or Postgres off a RAM drive, but I suspect you
won't get close to this result. And then your data is not on a reliable
storage medium.

~~~
n0tme
Can't talk about MySQL or PostgreSQL, but with Oracle I am pretty sure you can
get very close to this result with cached data, and that is without using the
in-memory option. However, talking about it without any experiments to prove
the point is just speculation. Memory vs. disk changes a lot performance-wise.

~~~
exikyut
It was expressly forbidden in the software contract to benchmark Oracle back
in the 80s.

[https://danluu.com/anon-benchmark/](https://danluu.com/anon-benchmark/)

~~~
captain_perl
It's forbidden to publish the benchmarks, not to make them. And it's because
most users are rubes who can't tune 2 databases at the same time.

------
paulsutter
How fast is data import? Loading into RAM? (For example, booting up a cluster
for an existing imported database on AWS.)

Working with some datasets with hundreds of billions of short rows, so I'm
curious to give it a try.

~~~
nikita
We have seen 1 Gb/s streaming ingest into MemSQL. Also, feel free to join
[https://memsql-public.slack.com](https://memsql-public.slack.com), where
people can help with the developer edition.

~~~
paulsutter
With how many nodes? Does it scale linearly with nodes? (1 Gb/s by itself
doesn't help me estimate the scale of the project.) Thanks!

~~~
nikita
Data ingest scales linearly assuming the source is scalable (S3, HDFS, or
Kafka). There are other things that matter: how wide is the table, what data
types, etc. We achieve 1GB/s on a 16 nodes cluster for some combination of the
above. What is your target?

------
tabtab
The speed of light is a hard limit. I don't believe there is any free
lunch[1], only trade-offs to manage. I'm skeptical of any claim that implies
free or easy speed without potentially significant trade-offs.

If you can live with somewhat out-of-date and/or out-of-sync data, you can
throw mass parallelism at big read-only queries to get speed. The trade-offs
often are best tuned from a domain perspective such that it's not really a
technology problem, although technology may make certain tunings/tradeoffs
easier to manage.

[1] (Faster hardware may give us incremental improvements, but the speed of
light probably prevents any tradeoff-free breakthroughs.)

~~~
heavenlyblue
What are you on about? The author is counting rows in a database and checking
how much time it takes the database to process the query. Counting rows is
easily parallelisable.

So are many problems, and we are still at a point where we are creating
languages and frameworks that allow us to work with data at a higher level:
e.g. Theano, NumPy, or TensorFlow.

The speed of light is only an issue for latency-sensitive operations: e.g., if
every sub-operation in a given transaction is non-commutative, then each would
be expected to complete before the next.

~~~
tabtab
Re: _Speed of light is only an issue for latency-sensitive operations_

That was my point: if you don't care about data being "fresh" and consistent
(per transaction coordination) you CAN throw mass parallelism at the problem.
If you do care, then the speed of light is the bottleneck.

~~~
tabtab
Clarification: "mass parallelism" may also involve distributed and duplicated
data sets.

------
amelius
How fast can it sort those rows?

~~~
nikita
Can you share the whole use case as well as data size?

------
EGreg
Who from HN would need this and why?

Serious question. I would like to know different real use cases from people on
HN, given our backgrounds.

~~~
tgtweak
We actually use memsql for almost exactly this use case. Realtime arbitrary
grouping and filtering on multiple (30+) columns on a billion+ rows with
substantial concurrency (reporting tools and api's of them). Previously with
pgsql this required a lot of fine tuning of indexes and lots of etl scripting
to break things into separate dimension tables, and then large join queries to
pull them back. This is expensive at the development and operational level,
and data was batched in which involves delays. It was also extremely resource
intensive to query and moderate levels of concurrency required a master and
several slaves, with response times for anything more than a week of data
requiring multi-second waits, with the worse cases approaching the minute
mark.

The hardware in this example is overkill and impractical for most use cases,
to say the least. For our setup, MemSQL does this for us on a single node with
256 GB of RAM, 40 cores (1 aggregator and 4 leaves), and a modest enterprise
NVMe SSD. The machine cost $4,500 over a year ago. Adding more machines to
MemSQL is pretty trivial should we ever need to partition this across
machines, though that hasn't been necessary.

There are some gotchas, and it should not be considered a drop-in replacement
for MySQL.

~~~
Fiahil
So, are you still shipping data between your primary datastore and MemSQL, or
have you switched entirely to MemSQL?

~~~
tgtweak
Originally it was just a port, but now the inserts go straight into MemSQL.
This used to be a big no-no on MySQL (with InnoDB and MyISAM anyway), as it
would invalidate a lot of the query cache on every insert. Here you can
refresh the query every second and see the counts go up.

