
1.1B Taxi Rides Using OmniSciDB and a MacBook Pro - tmostak
https://tech.marksblogg.com/omnisci-macos-macbookpro-mbp.html
======
twoodfin
Whenever I look at large aggregation benchmarks like this, I try to estimate
cycles/value or better cycles/byte.

Take this query:

    
    
        SELECT cab_type,
               count(*)
        FROM trips
        GROUP BY cab_type;
    

This is just counting occurrences of distinct values from a bag of total
values sized @ 1.1B.

He's got 8 cores @ 2.7GHz, which presumably can clock up for short bursts at
least a bit even when they're all running all out. Let's say 3B
cycles/core/second. So in .134 seconds (the best measured time) he's burning
~3.2B cycles to aggregate 1.1B values, or about 3 cycles/value.
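The arithmetic above can be sanity-checked in a few lines (the 3 GHz all-core clock is the commenter's assumption, not a measured number):

```python
# Back-of-envelope check of the cycles-per-value estimate above.
cores = 8
cycles_per_core_per_sec = 3e9   # assumed all-core boost clock
seconds = 0.134                 # best measured query time
values = 1.1e9                  # rows aggregated

total_cycles = cores * cycles_per_core_per_sec * seconds
print(round(total_cycles / 1e9, 2))     # ~3.22 billion cycles
print(round(total_cycles / values, 2))  # ~2.92 cycles per value
```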

While that's ridiculously efficient for a traditional row-oriented database,
for a columnar scheme as I'm sure OmniSciDB is using, it's less efficient than
I might have expected.

Presumably the # of distinct cab types is relatively small, and you could
dictionary-encode all possible values in a byte at worst. I'd expect
opportunities both for computationally friendly compact encoding ("yellow" is
presumably a dominant outlier and could make RLE quite profitable) and SIMD
data-parallel approaches that should let you roll through 4, 8, or 16 values
in a cycle or two.
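To make the RLE point concrete, here's a toy sketch (pure Python, function names mine) of why a dominant value makes run-length encoding profitable for COUNT: the aggregate does one addition per run instead of one per row.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse a sequence into (value, run_length) pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]

def count_by_value(runs):
    """GROUP BY ... COUNT(*) over RLE data: one add per run, not per row."""
    counts = {}
    for value, length in runs:
        counts[value] = counts.get(value, 0) + length
    return counts

# A made-up cab_type column where "yellow" dominates
rides = ["yellow"] * 6 + ["green"] * 2 + ["yellow"] * 4
runs = rle_encode(rides)      # 3 runs instead of 12 rows
print(count_by_value(runs))   # {'yellow': 10, 'green': 2}
```

With 1.1B real rows dominated by "yellow", the run count (and hence the work) would be a tiny fraction of the row count.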

Even adding LZ4 should only cost you about a cycle a byte.

That's not to denigrate OmniSciDB: they're already several orders of magnitude
better than traditional database solutions, and plumbing all the way down from
high-level SQL to bit-twiddling SIMD is no small feat. It's more that there's
substantial headroom to make systems like this even faster, at least until you
hit the memory bandwidth wall.

~~~
alexbaden
(disclaimer: I work for OmniSci)

I think this is a good point. On GPUs, SIMT is effectively automatic
vectorization, so our focus has been on the memory bandwidth wall (we make use
of CUDA shared memory in Nvidia GPU mode for aggregates like the above query).
Non-random-access compression on GPUs has also been a nonstarter, at least
historically. With more recent GPUs and more recent versions of CUDA, perhaps
this is changing. But on CPUs, we have started looking into vectorization.
There is a tradeoff, though: the LLVM vectorization passes do add time to
the compilation phase, and at subsecond query speeds that time isn't always
worth it.

There are also a few other tricks to get closer to roofline performance. If
you sort the input data on the key you're grouping by you can see small
performance improvements, mostly from better cache locality. But, part of the
"magic" of OmniSciDB is that you can group on any key and get good performance
without ingesting, reindexing, etc.

~~~
bluestreak
I wonder what this query compiles to, in terms of a C-code equivalent?

    
    
       SELECT cab_type, count(*)
       FROM trips
       GROUP BY cab_type;
    

From the execution time, it seems to me that this is a straight sum() of
32-bit integers. "cab_type" has two distinct values, and if the stored 32-bit
value for "green" is 0 and "yellow" is 1, a straight sum of these integers
will produce the desired outcome and explain the performance. That said, the
same performance will not extend to a key that has three or more distinct
values.

~~~
nikita
In a naive case it compiles to a loop over all the elements and a hash table
probe for each element. Now the magic comes from a few observations:

- cab_type has very few distinct values, so you can encode those values from
1 .. N and use an array of size N instead of the hash table

- you can build a "parallel scan": split the rows evenly across many threads
and each thread processes its own chunk

- the operation per row is very basic: you need to look up in the array and
increment a value, so you can use SIMD to perform operations on multiple rows
at the same time

- using some bit manipulation magic you can do the above on "encoded values":
you never need to convert the cab_type bit representation to an integer from 1..N
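A minimal sketch of the first two observations above (an array indexed by the dictionary code instead of a hash table, plus a chunked parallel scan); the encoding and names here are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def count_chunk(codes, n_groups):
    """Per-thread pass: index a plain array by the dictionary code, no hashing."""
    counts = [0] * n_groups
    for c in codes:
        counts[c] += 1
    return counts

def grouped_count(codes, n_groups, n_threads=4):
    """Split the rows evenly across threads, then merge the per-thread arrays."""
    chunk = (len(codes) + n_threads - 1) // n_threads
    parts = [codes[i:i + chunk] for i in range(0, len(codes), chunk)]
    with ThreadPoolExecutor(max_workers=n_threads) as ex:
        partials = ex.map(count_chunk, parts, [n_groups] * len(parts))
    # Element-wise sum of the partial count arrays
    return [sum(col) for col in zip(*partials)]

# Hypothetical dictionary encoding: 0 = green, 1 = yellow
codes = [1, 1, 0, 1, 0, 1, 1, 1]
print(grouped_count(codes, n_groups=2))  # [2, 6]
```

A real engine would do the per-chunk pass in SIMD over the packed codes; the structure (partial arrays merged at the end) is the same.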

~~~
bluestreak
Thanks for the explanation, Nikita. Any branching, hashing, etc. will increase
execution time. The benchmark ran on a laptop, which has 2 memory channels;
0.13s is the absolute maximum this laptop can muster as far as memory
throughput goes. Usually 4 threads are enough to saturate memory.

I have written this piece of code:
[https://github.com/questdb/questdb/blob/master/core/src/main...](https://github.com/questdb/questdb/blob/master/core/src/main/c/share/vec_agg.cpp#L147)

This sums 64-bit values, and using AVX2 it will sum 1Bn in 0.26s. Incrementing
conditionally will not be as fast, and will throw vectorization out of the
window too.

~~~
nikita
This sounds about right. The query is using count and not sum, which I believe
can be a bit faster.

Yes. Branching will absolutely hurt. The good old X100 paper teaches how to
avoid branching:
[http://cidrdb.org/cidr2005/papers/P19.pdf](http://cidrdb.org/cidr2005/papers/P19.pdf).

And of course there is no branching in MemSQL for this use case. And also no
hashing, b/c the number of groups is small and you can use an array and not a
hashtable.
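The branch-avoidance idea from the X100 paper can be illustrated in miniature: replace the data-dependent branch with arithmetic on the comparison result. (Python won't actually vectorize this, of course; the pattern is what matters, since a compiler can turn the branchless form into SIMD.)

```python
def count_matches_branchy(keys, target):
    """Data-dependent branch per element: mispredictions hurt on real hardware."""
    n = 0
    for k in keys:
        if k == target:
            n += 1
    return n

def count_matches_branchless(keys, target):
    """Predication: turn the comparison into 0/1 and always add it."""
    n = 0
    for k in keys:
        n += (k == target)
    return n

keys = [0, 1, 1, 0, 1]
print(count_matches_branchy(keys, 1), count_matches_branchless(keys, 1))  # 3 3
```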

Finally, if you compress the data rather than doing the sum on an uncompressed
array, you will have a much more compact data representation, which would let
you avoid hitting the memory bandwidth ceiling this quickly (4 threads).

~~~
tmostak
Hi @nikita, good to reconnect.

When you say an array and not a hash table, do you just mean a simple perfect
hash table indexed by the offset of the dictionary id? We use this fairly
extensively for inputs of bounded domain (i.e. dictionary-encoded strings,
moderately-sized integer ranges, even binned values, numeric or timestamp),
but call it perfect hashing. I assume we're talking about the same thing, but
wanted to clarify.

~~~
nikita
Yes, that’s it.

I’m still of the opinion that it’s important to demonstrate performance on more
complex queries with joins, subqueries, subselects, and clustered data
movements. The count(*) / GROUP BY query is a very, very simple case.

------
tmostak
For those wanting to try it for themselves, we recently released a preview of
our full stack for Mac (containing both OmniSciDB as well as our Immerse
frontend for interactive visual analytics), available for free here:
[https://www.omnisci.com/mac-preview](https://www.omnisci.com/mac-preview).
This is a bit of an experiment for us, so we'd love your feedback! Note that
the Mac preview doesn't yet have the scalable rendering capabilities our
platform is known for, but stay tuned.

You can also install the open source version of OmniSciDB, either via
tar/deb/rpm/Docker for Linux
([https://www.omnisci.com/platform/downloads/open-
source](https://www.omnisci.com/platform/downloads/open-source)) or by
following the build instructions for Mac in our git repo:
[https://github.com/omnisci/omniscidb](https://github.com/omnisci/omniscidb)
(hopefully will have standalone builds for Mac up soon). You can also run a
Dockerized version on your Mac, but as a disclaimer the performance,
particularly around storage access, lags a bare metal install.

~~~
ra
There's an old issue in your Github repo where you suggest that you may
support OpenCL for GPU.

Is this on your roadmap?

~~~
tmostak
Unfortunately it's probably not in the cards in the near term, due to other
priorities and insufficient demand (plus alternatives like HIP for AMD). I
will say a lot of us here at OmniSci would kill to leverage the latent GPUs in
our Macs and other places, so we'd welcome any community help toward this end
(it's not a trivial thing to add, but not particularly difficult either; just
work).

We'll plan to update that issue with the above.

------
hodgesrm
It would be great to understand why OmniSciDB does so well on this benchmark
but seems to do far less well on others.

The ClickHouse team was (obviously!) very interested in Mark's result and
tried out OmniSciDB on the standard analytics benchmark that CH uses to check
performance. Results are here:
[https://presentations.clickhouse.tech/original_website/bench...](https://presentations.clickhouse.tech/original_website/benchmark.html#\[%2210000000%22,\[%22ClickHouse%22,%22OmniSci%22\],\[%220%22,%221%22,%222%22\])

Anyway, really intriguing results from Mark. Looking forward to learning more
about the source of the differences.

Disclaimer: I work at Altinity, which supports ClickHouse.

Edit: Fixed bad link

~~~
chicagobuss
Agreed. I'm also a bit frustrated that he never tuned Vertica in his test of
it; he just loaded the data into the default superprojection and queried it
like that. Nearly all of Vertica's power comes from its concept of projections,
just like normal DBs benefit from indexes. I'd really like to see how it
performs relative to these other systems once it's been tuned properly, because
in my experience it's always been the fastest DB I've ever worked with.

[https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/Gl...](https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/Glossary/Superprojection.htm)

[https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/Ad...](https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Projections/WorkingWithProjections.htm)

------
skavi
I’m almost entirely sure Litwintschik is misinformed in regards to the GPUs in
his laptop.

Yes, he does have the Intel GPU he mentioned, but if he paid $200 to upgrade
the GPU as he claims, he would also have a dedicated AMD Radeon Pro 5500M 8GB.

~~~
moogleii
Actually, I don't think it's possible to configure a 16" MBP without a
discrete GPU - they all come with AMDs. Only the 13" MBPs can be configured
without one, but he says he's using a 16".

~~~
vkrishnamurthy
His other benchmarks of OmniSciDB ran on systems with Nvidia GPUs, but I
think he's pointing out that in this case, the AMD GPU wasn't used by the
DB engine even though it was part of the system config.

~~~
skavi
Yeah, I just checked and it appears to be a CUDA project, so neither the Intel
nor AMD GPUs help here.

Still, that could have been clarified in the article.

~~~
tmostak
Just to clarify, most of the query engine is built around LLVM-based JIT
compilation, and CUDA is not really used per se, except for GPU-specific
operators like atomic aggregates and thread synchronization; of course we also
use the driver API to manage the GPUs, allocate memory, etc. Supporting AMD
GPUs or the upcoming Intel Xe GPUs (or frankly anything that has an LLVM
backend) would not be particularly hard; it would just require adding similar
supporting infra.

------
xiaodai
I once contacted the author to check out my open source package and benchmark
it and he mentioned that he actually charges for the benchmarking exercise. So
yeah.

~~~
wenc
I can't speak to your specific situation, but I grew up around consultants.

Consultants have a rule: "Never say No. Always say: this is how much it will
cost".

It's a "no" but with a threshold.

~~~
EsotericAlgo
The implication is that this article was paid for by OmniSciDB or someone with
an interest. This changes the presumed context and should be noted upfront by
the author if that is the case.

~~~
wenc
I don't know if I'd necessarily draw that implication -- I have no data; my
broader point was more around the fact that consultants have a different way
of saying "no" than most people.

People can elect to work on things voluntarily for personal reasons, but if
someone asks them to fix or do something for them, then instead of saying no,
they might say "how much will you pay me?" It's their time and they're under
no obligation to offer it to you for free (though they can voluntarily choose
to if they want).

~~~
EsotericAlgo
That's a valid point. It's possible everything else was on his own time and it
was just his way of saying no. I end up reading the "how much will you pay
me?" as suggesting other things posted may be paid. That's not a fair read.

------
angryyellowman
>The GPU won't be used by OmniSciDB in this benchmark but for the record it's
an Intel UHD Graphics 630 with 1,536 MB of GPU RAM. This GPU was a $200
upgrade over the stock GPU Apple ships with this notebook. Nonetheless, it
won't have a material impact on this benchmark.

He lost me here...I get that it doesn't matter, but come on, if you don't know
that your computer has a GPU other than the integrated graphics (that you
admit you paid more to upgrade) then what are you really doing...

~~~
jeroenhd
Another comment suggested that he got this information from the "About This
Mac" popup, which only shows the dedicated GPU when connected to an external
display or when using an application that uses the dedicated GPU.

If he ran his benchmarks and then checked his hardware while writing the
article, he might've been confused by that.

------
danso

        COPY trips
        FROM '/Users/mark/taxi_csv/*.gz'
        WITH (HEADER='false');
    

> _The above managed to complete in 31 minutes and 40 seconds. The resulting
> import produced 294 GB of data in OmniSciDB's internal format._

I’m really curious how a simple import (no indexes or data typing) into SQLite
would compare. But I don’t have 700GB of free SSD space to spare.

~~~
tmostak
Mark did a benchmark of SQLite using its internal file format a few years ago
([https://tech.marksblogg.com/billion-nyc-taxi-rides-sqlite-
pa...](https://tech.marksblogg.com/billion-nyc-taxi-rides-sqlite-parquet-
hdfs.html#comparing-to-sqlite-s-internal-format)), clocking the import at 5.5
hours. It looks like this was done on a spinning disk, though, so given a
proper SSD and a newer version of SQLite it might be much faster.

------
wenc
For a list of benchmarks by the same author:

[https://tech.marksblogg.com/benchmarks.html](https://tech.marksblogg.com/benchmarks.html)

Caveat: these benchmarks only test the simplest of operations like aggregation
(GROUP BY, COUNT, AVG) and sorts (ORDER BY). No JOINs or window operations are
performed. Even basic filtering (WHERE) doesn't seem to have been tested.
YMMV.

~~~
x87678r
Wow, ClickHouse is right up there.

~~~
nwsm
They're all on different hardware. Are they really that comparable?

~~~
bdcravens
No, but I think each piece of software is put in a reasonable context, matching
what most people would commonly use for that particular use case. For example,
the ClickHouse benchmarks are run against typical modest cloud instances.

~~~
tmostak
To be fair, the c5d.9xlarge instances are $1.728 each per hour, or $5.18 for
the 3-server cluster (looks to be about $3.06/hr for reserved 1-year pricing).
Even with reserved pricing, that's $26,806 a year, or 6.5X more than a $4K
laptop that likely will last for years and would be bought anyway (or at least
a cheaper variant, which would also run these queries nearly as quickly). Of
course that's very apples-to-oranges, so another way to look at this is that
OmniSci would probably see significantly better performance on a single
c5d.9xlarge than what we saw on this Mac (would need to benchmark, but
informally I can say that OmniSci was 2-3X faster running on CPU on my Linux
workstation compared to my Mac).

Disclaimer: No disrespect to ClickHouse here, it's an amazing system that I'm
sure beats out OmniSci for certain workflows.

~~~
shaklee3
The biggest loss for omnisci was the in-memory limits. The highest end GPUs
have 32GB, while you can find CPU servers with multiple TB. As soon as you
spill out of that you take a big performance hit.

------
rexreed
Side note: you can see a cool demo of OmniSci at
[https://www.aidemoshowcase.com/2020/07/08/omnisci/](https://www.aidemoshowcase.com/2020/07/08/omnisci/)

------
pvtmert
I wonder why nobody uses the -C and --strip options of tar. They handle that
stuff automatically.

    
    
        mkdir -p application
        # --strip=1 drops the top-level directory from each member's path;
        # -C extracts into application/ instead of the current directory
        tar --strip=1 -C application -xf archive.tar
    

For example, installing NodeJS from an archive when combined with curl:

    
    
        OS=$(uname -s)-x64
        VER=12.16.1
        # ${OS,,} lowercases "Darwin"/"Linux" (bash 4+) to match node's
        # darwin-x64 / linux-x64 release names
        curl -#L "https://nodejs.org/dist/v${VER}/node-v${VER}-${OS,,}.tar.gz" \
        | sudo tar --strip=1 -xzC /usr/local
    

IMHO that's probably the reason why most apps have proprietary installers :)

------
zX41ZdbW
Follow-up:
[https://news.ycombinator.com/item?id=23990844](https://news.ycombinator.com/item?id=23990844)

------
coolgeek
> [https://tech.marksblogg.com/omnisci-macos-macbookpro-
> mbp.htm...](https://tech.marksblogg.com/omnisci-macos-macbookpro-mbp.html)

Is this what a keyword-stuffed URL looks like? This is terrible! This does
nothing at all to communicate semantic meaning about the post's content.

------
popotamonga
The data does not fit in RAM, so I guess in the end it's about file formats
and minimizing disk access. That's why some of the competition's benchmarks
are terrible, no?

------
antb123
I wonder how it compares with a similar pandas instance?

------
innocenat
Someone should put 'analytics' in the title. It makes no sense to me right now.

