
Performance in Big Data Land: Every CPU cycle matters - ptothek2
http://eng.localytics.com/performance-in-big-data-land-every-cpu-cycle-matters-part-1/
======
zmmmmm
So this person says every CPU cycle matters and then immediately takes a
single CPU cycle and multiplies it by billions, the scale of their data.

So no, it's not "every" CPU cycle; it's the ones that scale with the highest
dimension of your data that matter. Which is the same old story we have always
had: save your energy for optimising the parts that matter, because the ones
that matter probably matter orders of magnitude more than the ones that
don't.

------
GeneralMayhem
> If AUTOCOMMIT = ON (jdbc driver default), each statement is treated as a
> complete transaction. When a statement completes, changes are automatically
> committed to the database. When AUTOCOMMIT = OFF, the transaction continues
> until COMMIT or ROLLBACK is run manually. Locks are kept on objects for the
> duration of the transaction.

This made me cringe. Whether a series of operations takes place in one
transaction or many isn't something you can just turn on and off depending on
what looks more expensive!

The article ended up suggesting more transactionality, which is generally good
(although the reason given is not the important one, namely "you're less
likely to have all your data completely ruined"), but if you make the process
distributed and aren't careful about sharding you may end up trading average-
case cost in network load for much worse worst-case cost due to lock
contention and transaction failures.

Optimizing database access patterns at scale is _hard_, and blithely making
major changes to things that impact correctness is not the way to do it.
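The trade-off being debated can be sketched with Python's stdlib sqlite3 (not Vertica or JDBC, but the autocommit semantics are analogous; the table and data here are made up for illustration):

```python
import sqlite3

# isolation_level=None puts the sqlite3 connection in autocommit mode,
# roughly analogous to AUTOCOMMIT = ON in a JDBC driver.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

# "AUTOCOMMIT ON": each INSERT is its own transaction, so commit
# overhead (and lock acquire/release) is paid per statement.
for i in range(3):
    conn.execute("INSERT INTO events VALUES (?, ?)", (i, "a"))

# "AUTOCOMMIT OFF": group statements into one explicit transaction.
# Commit overhead is paid once, but locks are held for the whole batch.
conn.execute("BEGIN")
for i in range(3, 6):
    conn.execute("INSERT INTO events VALUES (?, ?)", (i, "b"))
conn.execute("COMMIT")

print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 6
```

Both paths insert the same rows; the difference is how long locks are held and how many commits are paid for, which is exactly the worst-case contention concern raised above.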

~~~
tobz
There should probably have been a little more explanation in the post,
but: 99% of the connections that touch Vertica, for us, are read-only.
In fact, we have no system with mixed reads and writes to Vertica: a system
either reads from it or writes to it. That made it very easy for us
to figure out where to turn off autocommit, and to do it without losing any of
the aforementioned correctness.

------
jrbancel
Is 100 billion rows (on the order of a few TB) Big Data?

In my experience, CPU is rarely the big issue when dealing with a lot of data
(I am talking about tens of PB per day). IO is the main problem and designing
systems that move the least amount of data is the real challenge.

~~~
hvidgaard
The major takeaway I had from my courses in data-intensive applications was
that IO is all that matters. It is the limiting factor to such an extent that
you don't really care about algorithmic efficiency with regard to CPU
calculations or memory.

You analyse algorithms in terms of IO accesses, and specifically the access
pattern. If you cannot structure the algorithm in a scanning fashion, you're
in for a bad time.
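That style of analysis can be sketched as a toy cost model in the external-memory (I/O) model, where cost is counted in block transfers rather than CPU operations; the block size and row count below are illustrative assumptions, not measurements:

```python
import math

# In the I/O model, a scan of N items costs ceil(N / B) block transfers,
# where B is how many items fit in one block. N random accesses cost up
# to N transfers (worst case: every access touches a cold block).
def scan_ios(n_items, block_size):
    return math.ceil(n_items / block_size)

def random_ios(n_items):
    return n_items

N = 100_000_000_000   # 100B records, the scale from the article
B = 1_000_000         # records per block (assumed for illustration)
print(scan_ios(N, B))   # 100000 transfers for a full scan
print(random_ios(N))    # 100000000000 transfers for random access
```

The six-orders-of-magnitude gap between the two numbers is why a scanning access pattern dominates everything else at this scale.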

~~~
bdarfler
There is always a balance here between CPU and IO. For a long time, databases
and big data platforms were pretty terrible with IO. However, as the computer
engineering community has had time to work on these problems, we have gotten
considerably better at understanding how to store data via sorted and
compressed columnar formats and how to exploit data locality via segmentation
and partitioning. As such, most well-constructed big data products are CPU
bound at this point. For instance, check out the NSDI '15 paper on Spark
performance that found it was CPU bound. Vertica is also generally CPU bound.

[https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/ousterhout](https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/ousterhout)

~~~
hvidgaard
After skimming the paper, I'm fairly confident it's not the same at all. We
only covered the theoretical side of a scenario with multiple-TB hard drives
on multiple machines. Any efficient algorithm would work in a scanning manner
and not seek backwards beyond what could be kept in RAM. We did simulate
this, and the result was quite clear: IO matters.

From the paper, the following three quotes highlight exactly why they were
CPU bound:

> We found that if we instead ran queries on uncompressed data, most queries
> became I/O bound

> is an artifact of the decision to write Spark in Scala, which is based on
> Java: after being read from disk, data must be deserialized from a byte
> buffer to a Java object

> for some queries, as much as half of the CPU time is spent deserializing and
> decompressing data

------
gtrubetskoy
CPU is probably not the best example, but the point is very valid: at 100B
scale, anything is large.

We humans are not very good at appreciating orders of magnitude. I usually
explain it this way: if it takes you 1 hour to process 1M records, then 10M
will take 10 hours, 100M will take 4.2 days, and 10B will take over a
year.
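The arithmetic above, spelled out; processing time simply grows linearly with record count, assuming the stated 1 hour per 1M records:

```python
# Linear scaling: hours needed to process a given number of records,
# at an assumed rate of 1 hour per million records.
def hours_to_process(records, hours_per_million=1.0):
    return records / 1e6 * hours_per_million

for n in (1e6, 1e7, 1e8, 1e10):
    h = hours_to_process(n)
    print(f"{n:>16,.0f} records -> {h:>8,.0f} hours ({h / 24:,.1f} days)")
# 100M -> 100 hours (about 4.2 days); 10B -> 10,000 hours (over a year)
```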

------
brendangregg
I hope later posts in this series explore Linux perf_events or flame graphs,
which is the origin of the (unattributed) background image
([http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html](http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html)).
:)

~~~
tobz
Heyo!

Sorry about the attribution. I'm trying to find who controls the blog as we
speak so I can have them add it. (I work at Localytics, but I'm not the
author.)

We've gingerly explored flame graphs to understand Vertica behavior under
load, and we still have a lot that we want to try and use it for. I'm not sure
if it will make an appearance in a further post, but we've definitely used
your perf_event/ftrace-based tooling. :)

~~~
ptothek2
@brendangregg It was my bad, and you're totally right. I should not have let
that fall by the wayside here. We're adding it ASAP. I'm sorry about that.

------
syed99
"Different data types will force Vertica to use a different number of CPU
cycles to process a data point." At the end of the day, that performance bump
comes down to the data point itself; sometimes the reduction in CPU cycles
won't be as significant as expected.

Would love to see whether the performance bump is as significant on a much
larger and more complex data set.
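For a rough sense of why data-type choices matter at all at this scale: any per-value saving is multiplied by the row count. A sketch with assumed value sizes (not Vertica's actual on-disk layout, which is compressed):

```python
ROWS = 100_000_000_000  # 100B rows, the scale from the article

# Raw bytes occupied by one column, given an assumed fixed value width.
def column_bytes(rows, bytes_per_value):
    return rows * bytes_per_value

big = column_bytes(ROWS, 8)    # e.g. a 64-bit integer column
small = column_bytes(ROWS, 2)  # e.g. a 16-bit integer column
print((big - small) / 1e9)     # 600.0 -> roughly 600 GB less data to touch
```

Whether that translates into CPU-cycle savings or IO savings depends on where the query is bottlenecked, which is exactly the question raised above.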

------
martin_
Am I misunderstanding something? If one CPU cycle per row accounts for 27
seconds, then the savings of 10 seconds suggest we saved half a CPU cycle per
iteration? Or do the queries not touch every row?

Optimizing data types and minimizing locks seem like general optimization
tips; I was hoping for more advanced techniques for 100B rows.

~~~
joefkelley
One CPU cycle per row saves 27 seconds of one CPU's time. The 10 seconds was
saved on every CPU in the cluster, so if there were 50 CPUs, that's 500
seconds, or ~20 cycles per record by the original calculation.

In reality, the change in data type probably optimized disk access more than
it did number of CPU cycles. That can often be more of a bottleneck.
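As a sanity check on that arithmetic, assuming the ~3.7 GHz clock implied by "1 cycle per row over 100B rows is about 27 seconds":

```python
ROWS = 100e9        # 100B rows
CLOCK_HZ = 3.7e9    # assumed clock rate consistent with the 27 s figure

# One extra cycle per row costs this many seconds of one CPU's time.
seconds_per_cycle_per_row = ROWS / CLOCK_HZ
print(round(seconds_per_cycle_per_row))  # 27

# 10 s saved on each of 50 CPUs, converted back to cycles per record.
saved_seconds_total = 10 * 50
cycles_per_row = saved_seconds_total * CLOCK_HZ / ROWS
print(round(cycles_per_row, 1))  # 18.5, i.e. roughly 20 cycles per record
```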

------
shulu
Totally agree with "every CPU cycle matters". It might be easier to
save CPU cycles by saving I/O, exploiting data locality (within datacenter
racks), or using better serialization (binary, columnar, or indexed).

Reducing locking and using shorter data types seem inadequate for the "Big
Data" scene.

~~~
bdarfler
You are exactly right; however, Vertica already handles data locality,
columnar storage, and compression for us. Vertica is so good at its job that
we are CPU bound on most queries, and these kinds of strategies around
reducing locking and using shorter data types make a difference.

------
andmarios
Then why is big data land dominated by JVM-based frameworks?

~~~
corysama
Because a couple decades ago Java convinced Enterprise Land that they can't
hire millions of C++ jockeys and expect them to work effectively in huge
projects that plan to evolve into the next decades' (aka: the present's)
legacy mudball. Instead, they decided it would be easier to hire millions of
Java jockeys and have them build enormous kiln-fired mudballs using the same
architectural strategy as the Egyptian pyramids. They convinced academia to
raise an entire generation of Java jockeys, hired them all right out of
school, and set them immediately to piling up enormous mud bricks forever.

So, now they have a few million Java jockeys churning away and a few million
person-decades of work put into their mud piles. When starting any new
project, there isn't much question about how to build it: More Mud!

~~~
nascentmind
This is the problem here.

As an embedded developer, where every cycle counts, I have the same question
as the poster above: why bother with such languages? If a switch can process
packets at line rate with the use of ASICs, why not have some similar
development in the world of big data?

------
sargun
I might suggest a new definition for "Big Data": data whose size is greater
than what fits in one machine's memory.

~~~
bbrazil
The definition I like is that it's when the size of the data becomes a
significant challenge to solving your problem.

For example, 1TB of data won't fit in memory, but if all you need to do is a
sequential read in under a day, then it's not a problem.
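A quick check on that claim; reading 1 TB sequentially within a day requires only modest throughput:

```python
# Throughput needed to stream 1 TB in 24 hours.
TB = 1e12                      # bytes
seconds_per_day = 24 * 3600
required_mb_per_s = TB / seconds_per_day / 1e6
print(round(required_mb_per_s, 1))  # 11.6 MB/s, trivial for any modern disk
```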

~~~
toast0
1TB /will/ fit in memory; you can get an EC2 instance with 2TB. But your
point stands.

------
yummyfajitas
Is this really about CPU rather than disk? I don't see anywhere that he
attempted to control for disk IO by padding the integers.

In fact, since Vertica is column-oriented, I don't think you can pad things
easily.

------
mobiuscog
I'm struggling to understand why they wouldn't just use non-locking selects
instead of turning auto-commit off.

Does auto-commit add _additional_ lock overhead for some reason?

