
How to Analyze Billions of Records per Second on a Single Desktop PC - cwinter
https://clemenswinter.com/2018/07/09/how-to-analyze-billions-of-records-per-second-on-a-single-desktop-pc/
======
nkurz
_I have encountered multiple claims that in-memory analytics databases are
often constrained by memory bandwidth, and I myself held that misconception
for longer than I want to admit._

So if not memory-bandwidth, what is the constraining factor? Or specifically,
what's the limiting factor that causes the observed benchmark speeds for
LocustDB?

My guess would be that a well designed in-memory database should be limited by
memory-bandwidth, but that real-world memory-bandwidth shouldn't be thought of
as a single number. Instead, it depends on the size and pattern of requests,
the latency of different cache hierarchies, and (crucially but often
forgotten) on the number of requests in flight.

Is there a better terminology that distinguishes these two uses of memory-
bandwidth?

~~~
cwinter
The constraining factor in this case is CPU throughput. The CPU cannot execute
instructions fast enough to keep up with the data being read in from memory.

Of course you are right that this is not the whole picture; it is possible to
be bottlenecked on neither memory bandwidth nor CPU throughput.

With respect to memory, two additional considerations are _read amplification_
and _latency_.

Read amplification: this basically means that you are reading data but not
actually using it. Suppose we have a table stored in row format with four
columns C1, C2, C3, C4, which store 8-byte integers. Row format means that the
values for the columns are interleaved, so the memory layout would look like
this, where cn refers to some value for column Cn:

    
    
      0x0000 c1 c2 c3 c4 c1 c2 c3 c4
      0x0040 c1 c2 c3 c4 c1 c2 c3 c4
      ...
    

One property of conventional memory is that we can only read it in blocks of
64 bytes (the size of a cache line). So if we have a loop that sums all of the
values in C1, we will actually also load all of the values for columns C2, C3
and C4. This means we have a read amplification factor of 4, and our usable
bandwidth will be a quarter of the theoretical maximum.
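
To make this concrete, here is a minimal Rust sketch (hypothetical Row struct,
not LocustDB's actual code) of the row layout above: summing a single column
still pulls every cache line of the table through memory, because the other
three columns share those cache lines.

    /// Row format: the four 8-byte columns are interleaved, so every
    /// 64-byte cache line holds two complete rows (2 x 4 x 8 bytes).
    struct Row {
        c1: i64,
        c2: i64,
        c3: i64,
        c4: i64,
    }

    /// Summing C1 uses only 8 of every 32 bytes per row, yet every cache
    /// line of the table must be loaded: ~4x read amplification.
    fn sum_c1_row_format(rows: &[Row]) -> i64 {
        rows.iter().map(|r| r.c1).sum()
    }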

Latency: Suppose we don't have any read amplification and make use of all
values in each loaded cache line. Then, after we finish processing a cache
line and load the next one, we might have to wait a long time (hundreds of
cycles) before the new cache line actually arrives! So we would be stalled on
memory latency during parts of program execution.
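
The classic way to see latency dominate completely is a dependent (pointer-
chasing) access pattern, where each load must finish before the next address
is even known; a sequential scan avoids the worst of this but can still stall
at the margins. A rough Rust sketch, illustrative only:

    /// Each load depends on the previous one, so the CPU cannot overlap
    /// requests and pays the full memory latency on every cache miss if
    /// `next` is a random permutation of a large array.
    fn chase(next: &[usize], steps: usize) -> usize {
        let mut i = 0;
        for _ in 0..steps {
            i = next[i]; // dependent load: no memory-level parallelism
        }
        i
    }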

One of the big advantages of using columnar layout (which stores all values
for a column sequentially) is that it all but eliminates read amplification.
Similarly, by accessing data sequentially we make it possible for the CPU to
predict what data will be accessed in the future, and have it loaded into the
cache before we even request it.
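
For contrast with the row-format sketch above, a minimal columnar version
(again a simplified sketch, not LocustDB's actual representation) turns the
same query into a pure sequential scan over one contiguous array:

    /// Columnar format: each column is stored contiguously.
    struct Table {
        c1: Vec<i64>,
        c2: Vec<i64>,
        c3: Vec<i64>,
        c4: Vec<i64>,
    }

    /// Every byte of every loaded cache line is used (no read
    /// amplification), and the sequential access pattern lets the
    /// hardware prefetcher fetch lines before the loop needs them.
    fn sum_c1_columnar(table: &Table) -> i64 {
        table.c1.iter().sum()
    }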

Of course there are a number of reasons why things might not work out quite so
perfectly in practice, and those may be why performance in this benchmark
starts to degrade before the (usable) memory bandwidth reaches the theoretical
maximum. But as you can see it is still possible to get very close, so it's
safe to assume that queries which read much less data than the maximum
bandwidth are constrained by CPU.

~~~
nkurz
Excellent explanation of the advantages of a columnar layout! I'm still a
little doubtful about the exact diagnosis, though.

 _The constraining factor in this case is CPU throughput._

When you looked at it with VTune, did you find clear evidence of this? That
is, were you consistently executing 4 instructions per cycle, or were you
maxed out at 100 percent utilization for a particular execution port? While
it's possible to be at these limits (say if you were hashing each input) in
most cases I think it's rare to get to that point when reading from RAM.

 _Similarly, by accessing data sequentially we make it possible for the CPU to
predict what data will be accessed in the future, and have it loaded into the
cache before we even request it._

One important caveat to this (at least on Intel) is that the hardware
prefetcher does not cross 4KB page boundaries. So if you are doing an 8B read
every cycle, and if you are able to do your other processing in the other 3
µops available that cycle, it's taking you ~500 cycles to process each page.
And then you hit a ~200 cycle penalty when you finish each page, which means
you are taking a ~25% performance hit because of memory latency, and probably
have an IPC closer to 3 than 4! So if you (or your compiler) haven't already
done so, you might see a useful performance boost by adding explicit prefetch
statements (one per cache line) for the next page ahead.
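
For what it's worth, here is a rough Rust sketch of that suggestion (assuming
x86_64 and the stable _mm_prefetch intrinsic; the prefetch distance and stride
are guesses that would need tuning on real hardware):

    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

    const PAGE_ELEMS: usize = 4096 / 8; // 8-byte elements per 4KB page
    const LINE_ELEMS: usize = 64 / 8;   // 8-byte elements per cache line

    /// Sum a column, issuing one software prefetch per cache line for data
    /// one page ahead, so the stall at each 4KB boundary (which the
    /// hardware prefetcher will not cross) is hidden behind useful work.
    fn sum_with_prefetch(col: &[i64]) -> i64 {
        let mut total: i64 = 0;
        for (i, &v) in col.iter().enumerate() {
            if i % LINE_ELEMS == 0 && i + PAGE_ELEMS < col.len() {
                unsafe {
                    _mm_prefetch::<_MM_HINT_T0>(col.as_ptr().add(i + PAGE_ELEMS) as *const i8);
                }
            }
            total = total.wrapping_add(v);
        }
        total
    }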

Anyway, thanks for your response. I like your approach, and I wish you luck in
making it even faster!

~~~
cwinter
Yes, excellent points. No doubt there is still much room for optimization.

------
minimaxir
Note: there are multiple pages to the post, which have the benchmarks.

~~~
Aissen
Thanks, I missed this at first. I've always found pagination to be
disrespectful of the reader; I wonder what the motivation was here.

------
paulsutter
Correct title is "How to Analyze Billions of Records in 20 Minutes and One
Second"

All of these "fast" databases have a fatal flaw - it takes forever to load the
data in the first place. In this case loading takes >1000x longer than the
query so loading is all that matters.

~~~
t0mbstone
Loading data from cold storage (HDD/SSD, etc) into RAM only has to be done
once on startup, though. After that single slow startup, all of the additional
queries happen super fast.

Your complaint only makes sense if you only intend to perform one single query
against a dataset.

~~~
skgoa
That assumes that we have enough RAM to hold all of the data we could ever
want to analyse and that we have this data at startup. Neither of which has
ever been the case as far as I have personally experienced.

~~~
xgbi
My friend works for $financial_subcontractor and he told me they have an Xmx
of 1.46 TB on their servers so that they can load the data needed for market
analysis.

The response time is quite fantastic (less than a minute), but there are
drawbacks:

- GC takes on the order of minutes to complete

- loading the data takes 90 min from spinning rust

So yeah, it's possible, but your bottleneck becomes the storage drive.

~~~
neuro
I've seen that setup in a number of hedge funds.

------
ducreux
The author had no idea what they were doing with kdb.

They even admit that they couldn't be bothered to modify their ingestion
scripts to not partition their data.

~~~
gricardo99
Yup.

>The pickup_ntaname column is stored as varchars by ClickHouse and kdb+, and
as dictionary encoded single byte values by LocustDB.

But it would be trivial to convert to enum/sym type in kdb+. It's silly to
query and group by strings.

~~~
inteleng
This is one of the frustrating parts of database software. Where can
information about the ways to optimize these parameters be found (outside of
random posts scattered around StackExchange)?

~~~
gricardo99
So true. Either you have lots of experience working with specific databases to
set up and optimize queries, and you know what works best through personal
blood, sweat and tears, and/or you have intimate knowledge of the inner
workings of the database architecture/implementation and know the
theoretically best approach to structure your schema/queries.

But even then, hardware/networking performance and tuning can throw a wrench
in the most seasoned/knowledgeable approaches. Users can further bring
otherwise solid setups to a grinding halt with unanticipated use-cases.

The only hope when you hit these inevitable road-blocks is that you're working
for someone that appreciates the difficulty of the problem.

------
emmelaich
It'd be interesting to compare this against traildb.

[http://traildb.io/](http://traildb.io/)

~~~
ryanworl
This is a very interesting project! Thank you for posting it.

Having read both this post and the documentation for TrailDB, I don’t think
they are comparable past very simple use cases. This post seems to be covering
a complete database system with a query parser, planner, and (previously?) a
storage engine. TrailDB is more like a storage engine you can push certain
kinds of filters down into, but performs neither query parsing nor planning
for you.

~~~
emmelaich
True. But the latest blog post for TrailDB mentions two new query interfaces,
_trck_ and _reel_.

I also wouldn't be surprised if someone has done a sqlite backend for it.

------
bcaa7f3a8bbc
Past:

How to Analyze Billions of Records per Second on a Supercomputer

Present:

How to Analyze Billions of Records per Second on a Single Desktop PC

Future:

How to Analyze 5 Records per Second on Thousands of Machines Distributed
Across the Globe Over the Blockchain.

~~~
taneq
I can't find it now but there was a great blog post about parallel processing
of large data sets just using Linux streams from the command line. They ended
up comparing a single laptop (iirc) against a Hadoop cluster and coming out
ahead.

~~~
wcrichton
[https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html](https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html)

~~~
potatoyogurt
In other words, "I used a chainsaw to cut an apple and it SUCKED at it."

If you're processing an amount of data that comfortably fits in memory on a
single machine, then obviously Hadoop is going to perform poorly in
comparison. The costs of scheduling a job onto N mappers/reducers,
transferring code to each node, waiting for the slowest mapper/reducer to
finish, transferring data from mappers -> reducers, replicating output on
HDFS, etc. are well-understood. It's true that many people try to use Hadoop
when they'd be better served with simpler solutions, but that does not justify
the amount of shade that the author throws at it.

~~~
mmt
> are well-understood. It's true that many people try to use Hadoop when
> they'd be better served with simpler solutions

I posit that these two assertions are contradictory.

My own understanding of the term "well understood" is that it is synonymous
with "widely understood". If many people are still making the mistake of using
Hadoop when those costs outweigh the benefits, it seems that understanding
isn't quite wide enough.

That said, although I have a grasp of when the tradeoff is so lopsided as to
be obvious, I don't know where to go (or where to point other people to go)
for a better understanding of where the boundary is.

Where should we go to better learn that understanding of those costs?

~~~
potatoyogurt
They're well understood by anyone who has used these technologies
professionally. I probably should have been more precise with my language,
though. It's really a grey area as to where the boundary is and it depends a
lot on your specific application. But as a general rule of thumb, my feeling
is that for tens to hundreds of GB, you should consider it. And for TBs or
more, you almost certainly want to be doing something distributed. Hadoop
isn't necessarily the best option then, but it's a powerful tool. I don't know
if there's any resource out there that really goes deep into the tradeoffs
involved though. There probably is, given how popular the subject is, but I'm
not aware of one.

The problem with the article is that if it's for a general audience that
doesn't understand the tradeoffs of a system like Hadoop, it really paints a
picture that it is just a bad, slow tool. It barely acknowledges just how
rigged the comparison is at all, aside from mentioning that you might need
something like Hadoop for really big data in the conclusion, while it is
peppered with unnecessarily snide comments about Hadoop that will probably be
more memorable. I think it is liable to leave readers more confused about the
tradeoffs involved after reading than before.

~~~
TeMPOraL
> _They're well understood by anyone who has used these technologies
> professionally._

We need to substitute the word "professionally" with more precise terms when
talking about our industry. Because if one was to read "professionally" as "at
work", then your statement is absolutely false - both the lower bound and
average amounts of critical thinking and caring in this industry are
_extremely_ low. Even ignoring people who obviously have no clue, I can still
imagine Hadoop and other big data stacks being sanctioned by "professionals"
in management for buzzword-generating reasons, and implemented by
"professional" "engineers" for CV padding reason.

~~~
potatoyogurt
You're probably right. I'm thinking of specific coworkers when I think of the
level of knowledge that should be expected, and generally the standard I would
hold coworkers to would include understanding this. But it is probably not as
widely understood as I would hope.

------
golanggeek
cwinter, does LocustDB also support scaling across machines? What is the
timeline for it to be production ready?

