
1.1B Taxi Rides on Kdb+/q and 4 Xeon Phi CPUs - hhyndman
http://tech.marksblogg.com/billion-nyc-taxi-kdb.html
======
buckie
In my experience, kdb+'s k and q (q being, broadly speaking, a legibility
wrapper around k, which in turn is, broadly speaking, APL without the
unicode) are phenomenally fast for dense time series datasets. They can
struggle (relatively; they're still pretty fast) with sparse data, but that's
not really what they are built for. They were built for high-performance
trading systems, and trading data is dense.

If you like writing dense, clever regexes (which I do) then you'll love k & q.
The amount that you can get done with just a few characters is unparalleled.

Which leads to, IMHO, their main drawback: k/q (like clever regexes) are often
write-only code. Picking up someone else's codebase, or even your own after
some time has passed, can be very hard or impossible because of how
mindbendingly dense with logic the code is. Even if they were the best choice
for a given domain, I'd try to steer clear of using them for anything other
than exploratory work that doesn't need to be maintained.

~~~
de_Selby
That's more on you than it is the language though. You can write obtuse write-
only code in any language.

I will concede that there is a culture of trying to be a bit _too_ clever on
the k4 mailing list but it's perfectly possible to write maintainable code in
kdb+

~~~
Cyph0n
But from what I recall, the syntax is terse _by design_ - this is not
inherently bad, though. In other mainstream languages, you have to go out of
your way to write obtuse code (e.g., code golf).

I'm guessing the best way to address this issue is through liberal use of
explanatory comments.

~~~
de_Selby
Terse syntax isn't an issue, you just need to get used to reading it. Using
one-character variable names is another matter though.

There is no reason not to use camel-cased variable names and to indent
functions, if/else blocks etc., and when written this way the code can be
perfectly legible even to non-q programmers.

Something else that leads people into the write-only trap is that the usual
way of working with the language is in the REPL, where you tend to do
multiple things on one line. It's just laziness not to reformat and clean up
the code afterwards though.

~~~
geocar
> There is no reason not to use camel cased variable names and indent
> functions,

There is: it makes the program bigger. Program source code length is
significant, and if you have more lines, you have more opportunities for bugs.

~~~
nl
Is this supposed to be a joke? Or maybe related to some weird coding
environment?

Long variable names may take more characters, but there is no way they
increase bugs.

Indenting(!) doesn't even take many more characters if you use tabs...

~~~
ams6110
Mathematical formulas don't use long variable names. They use x, y, and
various greek letters.

~~~
nl
This is generally (amongst mathematicians) considered to be a problem, not a
good thing (source: I work with university math departments).

Edit: see [http://mathoverflow.net/questions/8295/origins-of-
mathematic...](http://mathoverflow.net/questions/8295/origins-of-mathematical-
symbols-names) for the confusion this causes.

------
mtanski
I've built columnar OLAP databases and database engines in C++ for work. Now
I'm doing it in my free time. Based on my experience, the Phi and its
architecture are very exciting for OLAP database workloads.

Reasons:

- Even in an OLAP database you end up with quite a few places that have very
branchy code. Research on GPU-friendly algorithms for things like (complex)
JOINs and GROUP BY is pretty new. Additionally, complex queries will involve
functions and operations that you might not have a good GPU implementation
for (like regex matching).

- Compression. You can use input data compressed in any way that there is an
x86_64 library for. So you can now use LZ4, ZHUFF, GZIP, XZ. You can have
70+ independent threads decompressing input data (it's OLAP so it's pre-
partitioned anyway). (Technically branching, again)

- Indexing techniques that cannot be efficiently implemented on the GPU can
be used again. (Again, branching)

- If you handle your own processing scheduling well, you will end up with a
near-optimal IO/memory pattern (make sure to schedule the work on the core
with local memory) and you are not bound by the PCIe speed of the GPU. With
enough PCIe lanes and lots of SSD drives you can process at near-memory
speeds (esp. when we'll have XPoint memory).

So the bottom line is: if you can intelligently farm out work in correctly
sized chunks (it's OLAP so it's probably partitioned anyway), then the Phi is
a fantastic processor.

I'm primarily talking about the bootable package with the Omni-Path
interconnect (for multiple nodes).
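The per-partition decompression pattern above can be sketched like this (a
toy Python illustration of the idea, not any real engine; zlib stands in for
LZ4/ZHUFF/etc., and zlib.decompress releases the GIL so the threads genuinely
overlap):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Pre-partitioned column chunks, each compressed independently, so many
# threads can decompress and scan their own chunk in parallel.
chunks = [zlib.compress(bytes(range(256)) * 1000) for _ in range(8)]

def decompress_and_scan(blob):
    data = zlib.decompress(blob)      # branchy, CPU-friendly work
    return sum(data)                  # stand-in for a real aggregate

with ThreadPoolExecutor(max_workers=8) as pool:
    partials = list(pool.map(decompress_and_scan, chunks))
total = sum(partials)                 # combine per-partition results
```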

------
anonu
I've been using kdb+/q for a long time (7 years now) - and can attest to its
speed. Objects are placed in memory with the intention that you will run
vectorized operations over them. So both the memory-model and the language are
designed to work together.

Lots of people complain about the conciseness of the language and that it is
"write-only" code. I tend to disagree. While it might take a while to
understand code you didn't write (or even code you wrote a while ago),
focusing on writing in q rather than the terser k can improve readability
tremendously.
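The columnar/vectorized idea can be illustrated with a NumPy sketch (this is
Python for illustration, not q itself, and the column names are made up):

```python
import numpy as np

# Columnar layout: each column is one contiguous array, so an aggregate
# is a single vectorized pass over memory rather than row-by-row work.
rng = np.random.default_rng(0)
n = 1_000_000
passenger_count = rng.integers(1, 7, n)    # hypothetical column, values 1-6
fare_amount = rng.uniform(2.5, 60.0, n)    # hypothetical column

# Roughly "select avg fare_amount by passenger_count from trips":
averages = {p: fare_amount[passenger_count == p].mean()
            for p in np.unique(passenger_count)}
```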

My only wish is that someone would write a free/open-source 64-bit interpreter
for q - with similar performance and speed to the closed version. Kona (for k)
gets close
[https://github.com/kevinlawler/kona](https://github.com/kevinlawler/kona)

~~~
srpeck
Some other k-inspired languages to have a look at:

- [https://github.com/johnearnest/ok](https://github.com/johnearnest/ok)

- [https://github.com/zholos/kuc](https://github.com/zholos/kuc)

- [https://github.com/ngn/k](https://github.com/ngn/k)

- [http://t3x.org/klong/](http://t3x.org/klong/)

- [https://github.com/tlack/xxl](https://github.com/tlack/xxl)

------
svan99
Nice writeup. If you are interested in learning KDB/Q, please take a look at
this book:
[http://code.kx.com/mkdocs/qformortals3/](http://code.kx.com/mkdocs/qformortals3/)

~~~
picodoc
or for a really quick high level overview you can use this:
[https://learnxinyminutes.com/docs/kdb+/](https://learnxinyminutes.com/docs/kdb+/)

it's a really beautiful little language once you get into it :-)

------
nnx
I absolutely love this blog series. Can't wait to read what's next :)

First time I noticed (mention of) recap at
[http://tech.marksblogg.com/benchmarks.html](http://tech.marksblogg.com/benchmarks.html)

~~~
qume
If he made the layout a bit uglier, and made the language more esoteric and
generally difficult to understand, this would make a fantastic academic paper.

But seriously, what a wonderful world it would be if all papers were this well
written.

------
chiph
Under a second to do an avg across 1.1 billion rows spread over four machines.
That's pretty amazing.

~~~
jxy
For a columnar database, that's a contiguous chunk of memory. Assuming q
defaults to 32-bit ints, 1.1 billion integers across four machines means each
64-core (4 threads/core) KNL chip is averaging over 275M elements of an int
array, or about 1.1M 32-bit int operations per thread. Now think again about
whether that's amazing or not.
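The back-of-the-envelope arithmetic here works out as follows (a quick sketch
of the numbers in the comment above):

```python
total_ints = 1_100_000_000            # 1.1 billion rows, one 32-bit int each
machines = 4
threads_per_chip = 64 * 4             # 64 cores x 4 hardware threads

ints_per_chip = total_ints // machines              # 275M elements per KNL
ints_per_thread = ints_per_chip / threads_per_chip  # ~1.07M per thread
bytes_per_chip = ints_per_chip * 4                  # ~1.1 GB scanned per chip
```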

~~~
shaklee3
You're not accounting for the memory bandwidth at all. Yes, that's still
amazing. Try doing that in opentsdb.

~~~
throwawayish
~4.4 GB in 150 ms is just about 30 GB/s.
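For reference, the arithmetic behind that figure (assuming 4-byte ints and
the ~150 ms query time from the article):

```python
total_bytes = 1.1e9 * 4           # 1.1B 32-bit ints ~= 4.4 GB
seconds = 0.150                   # approximate query time
gb_per_second = total_bytes / 1e9 / seconds
# ~29.3 GB/s aggregate scan rate across the cluster
```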

------
mmcclellan
This was a good idea for a test. I'll definitely check out the author's other
stuff. Commenting briefly on cost: while the article mentions the free 32 bit
version early on, the actual benchmarks were done using the commercial
version. I've had the impression the commercial version was cost prohibitive
for us poor folks. For those interested in experimenting with Xeon Phi though,
it looks like you can get started for ~$5k:
[http://dap.xeonphi.com/](http://dap.xeonphi.com/)

~~~
Twirrim
If you just want to meddle with a Xeon Phi, you can get some as cheap as $300:
[https://www.amazon.com/Intel-
BC31S1P-Xeon-31S1P-Coprocessor/...](https://www.amazon.com/Intel-
BC31S1P-Xeon-31S1P-Coprocessor/dp/B00OMCB4JI/ref=sr_1_1?ie=UTF8&qid=1485379661&sr=8-1&keywords=xeon+phi)
, though that's from the 3100 rather than 7200 family, and so won't perform as
fast. That kind of money puts it more into the hobby territory, though.

~~~
protomyth
Is there a difference in how you program the different Xeon Phi families?

~~~
mtanski
With the new generation there is now a difference. This is because this
generation has Phis both as an add-in card (PCIe) and as a bootable CPU
package (where you run Linux or Windows or whatever on the Phi itself).

Generally, with the PCIe one you're running something like OpenCL, and with
the CPU package you run threads and processes like you normally would.

Technically you could run software directly on the old add-in cards, since
they boot to Linux, but you had to handle the distribution, running and
communication of your software with the host. (You could run any x86_64
binary.)

~~~
VodkaHaze
Do you still need the costly intel compiler suite to run some C++ code on it?

Honestly they make it pretty hard to hack with as a device. It could succeed
as a device if they let devs easily create cool applications with it.

~~~
mtanski
Both gcc and llvm can compile binaries targeting the Phi. Since it runs Linux
(as an add-in PCIe card or as the host CPU), it can run any kind of x86_64
ELF binary. They also support generating AVX-512 instructions for the Phi.

------
WhitneyLand
I don't see how these results provide much useful information in terms of
being able to say x is faster than y.

The hardware doesn't seem consistent across the different benchmarks. He says
it's fast for a "cpu system", but for practical purposes the Phi competes
more with GPGPUs.

Would this be just as fast with one redis system with 512GB of RAM? I don't
know; there are too many apples-to-oranges comparisons here.

~~~
sologub
The author isn't testing just the software but combinations of various
software/hardware systems, including for example PostgreSQL on an i5 CPU, 16GB
RAM and 850 SSD hardware: [http://tech.marksblogg.com/billion-nyc-taxi-rides-
postgresql...](http://tech.marksblogg.com/billion-nyc-taxi-rides-
postgresql.html)

~~~
WhitneyLand
Great, and btw I think it's cool and I enjoyed reading it.

But what useful conclusions can be drawn from it?

~~~
LeifCarrotson
All conclusions are only valid for similar workloads, but each of MapD and
GPUs, Q/kdb+ and Xeon Phi, Redshift, Athena, Big Query, Presto, and
Elasticsearch claim to be fast, inexpensive, easy to work with, and otherwise
great for Big Data. Which ones really are fast? How fast is fast? How much is
this going to cost? Do I need 5 nodes or 50?

A few examples of some useful conclusions:

- Just because a relatively well-optimized PostgreSQL database on a regular
workstation takes 5 minutes to run a query doesn't mean you can't get special
hardware to run that query faster than you can type.

- Spark + S3 + Amazon Elastic Map Reduce look like an ideal tool for working
with large data, but they're pretty slow compared to better tools, and even
compared to plain PostgreSQL.

- HDFS really is a lot faster than S3.

- Performance of a Xeon Phi 64-core CPU is within an order of magnitude of an
NVidia Titan X.

- Loading 104 GB of compressed data into Q/kdb+ expands to 125 GB and takes
about 30 minutes, but on Redshift it expands to 2 TB and takes many hours to
upload on a normal connection, plus 4 hours to actually import!

- It might cost $5000 to custom-build a GPU-based supercomputer that can do
these queries in under a second, but you can run similar queries, if you're
willing to wait 5 minutes each, by spinning up instances for a few dollars an
hour plus a few more dollars an hour for storage, or by just running
PostgreSQL on your workstation.

Also, not a conclusion, but it's incredibly useful to have a simple example
of exactly how to configure the tool and import some CSV data.

~~~
WhitneyLand
These conclusions don't seem very useful because either they are already well
established or are not valid. Some examples:

Just because a relatively well-optimized PostgreSQL database on a regular
workstation takes 5 minutes to run a query doesn't mean you can't get special
hardware to run that query faster than you can type.

 _Already well established for years with systems like redis, and more
recently with gpu databases, and other techniques posted on HN regularly._

Spark + S3 + Amazon Elastic Map Reduce...is pretty slow compared to better
tools, and even compared to plain PostgreSQL.

 _Not valid because it doesn't generalize. It depends so much on the type of
work being done, system architecture, etc., that you can only say it may or
may not be true._

HDFS really is a lot faster than S3.

 _This is already well established; Amazon states as much right in the
docs:[http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-
pl...](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-
systems.html)_

Performance of an Xeon Phi 64-core CPU is within an order of magnitude to an
NVidia Titan X.

_Not precise enough to matter, because getting within a 10x difference is not
close to being competitive._

Loading 104 GB of compressed data into Q/kdb+ expands to 125 GB with and takes
about 30 minutes, but on Redshift expands to 2 TB and takes many hours to
upload on a normal connection, plus 4 hours to actually import!

 _I don't see how it's possible for 104GB of CSV text data to decompress into
only 125GB. For CSV to compress only ~20%... doesn't make sense._

It might cost $5000 to custom-build a GPU-based supercomputer that can do
these queries in under a second

 _No, two problems here. The hardware in question could have used 1 cheap CPU
instead of two expensive Xeons and been much less expensive. Bigger problem:
The MapD software itself will be $50,000._

~~~
LeifCarrotson
The speed comparisons may be well known to you, but as someone only really
using trivial desktop app SQLite databases, they weren't known to me. Thanks
for pointing out my errors!

> I don't see how it's possible for 104GB of CSV text data to decompress into
> only 125GB. For CSV to compress only ~20%... doesn't make sense.

The CSV file itself is around 500 GB. The internal representation, which might
use binary formats for numbers, or compress text, uses 125 GB. Redshift
expands it to 2TB for all the indexing and mapping.
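The size difference is easy to see with a single cell (a Python sketch; the
value here is made up):

```python
import struct

# A CSV cell costs one byte per character; a binary column stores the same
# value in a fixed-width slot.
text_cell = "40.7589"                      # hypothetical latitude as CSV text
binary_cell = struct.pack("<f", 40.7589)   # same value as a 4-byte float

text_size = len(text_cell.encode())        # 7 bytes (plus a delimiter in CSV)
binary_size = len(binary_cell)             # 4 bytes
```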

> Bigger problem: The MapD software itself will be $50,000.

Ouch. That's a rather large oversight. Is the author affiliated with MapD,
perhaps?

------
mattnewton
This is the comparison to the Titan X / MapD article I was looking for. Still
looks like the GPU is very competitive.

Sort of meta, but Mark's job seems awesome. Gets all these toys and writes
about configuring them. (The actual configuring is probably a pain but still)

------
1024core

    % cat startmaster.q

    k).Q.p:{$[~#.Q.D;.Q.p2[x;`:.]':y;(,/(,/.Q.p2[x]'/':)':(#.z.pd;0N)#.Q.P[i](;)'y)@<,/

Looks like line noise... :D

------
gravypod
At what point will CPUs out-parallelize GPUs, and will we be able to move
video rendering back onto the CPU?

I see that as being something I'd very much like.

------
dunkelheit
Pretty wide array of technologies covered in these benchmarks. I wonder how
ClickHouse will fare; it should be very competitive.

------
wyldfire
How does Phi's MCDRAM compare to GDDR5 (wrt throughput)?

~~~
loeg
According to Wikipedia, the fastest GDDR5 can do 256 Gbit/s per chip[0]. I
don't know how many chips are typically used. The MCDRAM in the article does
400 GB/s, or 3,200 Gbit/s. That would require 12.5 of those GDDR5 chips,
assuming they scale linearly.

[0]:
[https://en.wikipedia.org/wiki/GDDR5_SDRAM#Commercial_impleme...](https://en.wikipedia.org/wiki/GDDR5_SDRAM#Commercial_implementation)
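The unit conversion behind that comparison (a quick sketch; note the per-chip
GDDR5 figure is in gigabits per second, the MCDRAM figure in gigabytes per
second):

```python
gddr5_chip_gbit_s = 256            # Gbit/s per chip, per the cited figure
mcdram_gb_s = 400                  # GB/s from the article
mcdram_gbit_s = mcdram_gb_s * 8    # 3,200 Gbit/s
chips_needed = mcdram_gbit_s / gddr5_chip_gbit_s  # 12.5 chips to match
```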

~~~
shaklee3
GDDR5 can typically do 240 GB/s on a typical GPU, and there are multiple
chips on many cards (Tesla K80). The newer cards use HBM2 and can do 732 GB/s
([http://www.nvidia.com/object/tesla-p100.html](http://www.nvidia.com/object/tesla-p100.html)).

~~~
AlphaSite
HBM2 is 1024 GB/s (256 GB/s per stack).

~~~
shaklee3
Kind of. Nvidia lowered the voltage on their P100 so it does not hit those
rates. Theoretically it can go that high, but the power draw was too large.
Next gen we'll likely see that.

------
pvitz
Does anybody here have experience with Jd who could comment on its status or
performance? Thanks!

~~~
eggy
The interpreted Jd is fast, but you need the compiled, commercially licensed
Jd (or you used to) for the speed test.

I love J compared with K, but that is because I found it first; the
differences between J and K are minimal, but different enough to keep me
using J.

~~~
scottlocklin
I think Jd is mostly plain old interpreted J. There are a few shared
libraries that add functionality to core J, but it's mostly just J. You can
probably get a non-commercial license if you ask nicely.

~~~
sndean
> You can probably get a non-commercial license if you ask nicely.

This was my experience. I got a nice email from Eric Iverson in response along
with the activation key.

------
gbrown_
Nice to see some KNL usage outside the traditional large HPC centers :D

~~~
nextos
Yes, I remain a bit skeptical, but it seems to be taking off after the latest
iteration.

------
stuntprogrammer
If the combination of such languages, high-performance hardware, and
large-scale compute problems is interesting, the startup I work for in
Mountain View is hiring...

~~~
hpcjoe
:D Hope things are well by you!

~~~
stuntprogrammer
And you too sir!

------
andrewstuart2
I get the distinct feeling this is not the usual price for a Xeon Phi. Still,
I might keep an eye out to see if it comes back into stock.

[https://www.walmart.com/ip/INTEL-SERVER-CPU-SC7120P-XEON-
PHI...](https://www.walmart.com/ip/INTEL-SERVER-CPU-SC7120P-XEON-PHI-
COPROCESSOR-7120P-1-2G/147170233)

------
tmostak
It looks like year is extracted from pickup_datetime at ETL time, and hence
it's not a fair comparison against the other databases that do this at
runtime in Q3 and Q4. In something like MapD, Q3 would be nearly as fast as
Q1 (~20ms) without the extract function, which involves relatively
complicated math.
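The ETL-time vs query-time distinction can be sketched like this (an
illustrative Python example; the column names and values are made up):

```python
from datetime import datetime, timedelta

# A toy timestamp column.
pickup_datetime = [datetime(2013, 1, 1) + timedelta(hours=i)
                   for i in range(1000)]

# ETL-time extraction: materialize a plain int column once, at load.
pickup_year = [d.year for d in pickup_datetime]

# Query-time extraction: every query re-pays the per-row extraction cost.
runtime_count = sum(1 for d in pickup_datetime if d.year == 2013)

# With the precomputed column, the query is a cheap scan over ints.
etl_count = sum(1 for y in pickup_year if y == 2013)
```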

------
thedarkknight0
Wow, those are some incredibly impressive numbers. Great write-up. HT to Mark.

------
smulh76
Unbelievably fast! Interesting blog tbh.

------
mrcactu5
For reference, the dataset is New York City taxi rides from 2009-2015, and
there's about 500GB of data.

[http://tech.marksblogg.com/billion-nyc-taxi-rides-
redshift.h...](http://tech.marksblogg.com/billion-nyc-taxi-rides-
redshift.html)

------
Smca
Astonishing speed given the scale...

------
vegabook
All lots of fun, but kdb has an eye-watering cost of 200k dollars per year
per server.

Here's hoping some combo of Apache Arrow (also cache-aware, much more
language stack flexibility), Aerospike (Lua built in), Impala, and others can
finally take on this overpriced product, which has had a lack of serious
competitors for 20 years, owing to its (price-inelastic) finance client base.

~~~
kpierre
> Apache Arrow (also cache aware, much more cross platform)

kdb+ is available for raspberry pi, is that cross platform enough?

[https://kx.com/2016/06/08/kx-releases-raspberry-pi-build-
wit...](https://kx.com/2016/06/08/kx-releases-raspberry-pi-build-with-
libraries/)

~~~
vegabook
32-bit. Please be serious. You know full well that for all non-toy work kdb
is exorbitantly expensive.

