Maybe the future of frameworks like this will be on the GPU. I'm just not seeing any evidence of it yet. Right now, Spark fills the space where you can throw globs of memory at TB- to PB-scale problems. I could very well be wrong, but I don't see how this is going to be cost-effective on GPUs given the current cost of memory there.
2. The dataset is trivially small because this is a new engine built for the RAPIDS ecosystem, and it is limited for the time being to a single node. We are releasing our distributed version for GTC (mid-March) and will then be able to give you more reasonable comparisons. This follows a similar path of development to our pre-RAPIDS engine, which went from single node to distributed in about a month, because we built this engine to be distributed. Right now we are finishing up UCX integration, which is the layer we will use to communicate between all the nodes.
3. You can always try it out. It's on Docker Hub (see links in this post), and if you want to run distributed workloads right now you can manage that process using dask by handling the splitting up of the job yourself. In a few weeks the job will be split up for you automatically, without the user needing to be aware of the size of the cluster or how to distribute data across it.
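Splitting the job yourself boils down to partitioning the input, running each partition, and combining partials. A minimal sketch of that pattern, using Python's standard-library executor as a stand-in for dask workers (all names here are illustrative, not BlazingSQL's API):

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(rows):
    # Stand-in for the per-node work (e.g. a partial aggregation on one partition).
    return sum(rows)

def run_split_job(rows, n_workers=4):
    # Manually split the job into one chunk per worker, run them, combine partials.
    chunks = [rows[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        partials = list(ex.map(process_chunk, chunks))
    return sum(partials)

print(run_split_job(list(range(100))))  # 4950
```

The promised automatic mode would take over exactly this partition/combine bookkeeping so the user never sees it.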
With GPUs, the software progression is single GPU -> multi-GPU -> multi-node multi-GPU. By far, the hardest step is single GPU. They're showing that.
Also - Spark is very slow compared to analytic distributed DBMSes.
Seems fishy that the benchmark is small enough to fit into GPU RAM as well
We are more interested in comparing to tools that can provide workflows like:
A. I have a ton of data in S3, HDFS, cloud storage solutions, centralized NFS
B. I want to use the GPU to perform learning, graph analytics, classification, etc.
I want to get from A to B as fast as possible, utilizing the same resources for A that I will need to complete step B.
We were able to get a query that took 16 hours down to 4 seconds. This took a long time and included changing how the data was stored, etc.
Anyway, fast Spark performance is possible but it is very sensitive to tuning (this was in ~2016, so maybe things have improved).
The one I did the previous year is nicely linked from http://hadoopsummit.org/melbourne/agenda/
Let's try ImageNet training. Intel's best time is 3h25m on 128 nodes. ImageNet is ~150 GB.
(150 GB)/(3.42 hr * 3600 sec/hr * 128 nodes) = less than one megabyte per second per node! Caffe is slow! CPUs are slow!
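A back-of-envelope check of that claim, using only the numbers quoted above (150 GB, 3h25m, 128 nodes):

```python
# Back-of-envelope check of the ImageNet throughput claim.
dataset_mb = 150 * 1000          # ~150 GB of ImageNet data, in MB
seconds = (3 + 25 / 60) * 3600   # 3h25m
nodes = 128

mb_per_sec_per_node = dataset_mb / (seconds * nodes)
print(round(mb_per_sec_per_node, 3))  # 0.095
```

So roughly a tenth of a megabyte per second per node, which is the point being made.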
Or bs metrics are bad?
You can see the workflow for yourself and see how the task is quite parallelizable.
ETL can be very extensive. For example, we first built this when we had to take data from 15 different database systems that represented individuals and their pension contributions, and join across those systems. It was a largish join, roughly 4 tables from each system, so around a 60-table join.
That was "JUST ETL". The job was preparing the data for training. ETL is oftentimes a large part of people's workflows. Looking for needle-in-a-haystack situations can be JUST ETL. If you believe there is a more apt word for extracting data from a system, performing unspeakable transformations on it, then making that information available to another process, please tell me. Being of Peruvian stock myself, I take great license with my language and grammar.
I’m betting that the bottleneck is running ML training. Otherwise I can’t see why an ETL job on that amount of data would take several minutes.
Not sure how it compares to this tool, but it uses a GPU for high-density analytics in two ways: either by loading data straight from disk to the GPU with DMA (avoiding the CPU entirely) and doing the work there, or by using the GPU as a prefilter to remove lots of candidate rows and then letting the CPU run the core query on far fewer rows.
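The prefilter pattern described above can be sketched in plain Python: lists stand in for GPU columns, and the predicates are made up for illustration; the point is that the expensive predicate only touches the survivors of the cheap one.

```python
import math

# Toy table: 10,000 rows with a numeric column and a text column.
rows = [{"amount": float(i % 500), "note": f"row {i}"} for i in range(10_000)]

# Stage 1 ("GPU" prefilter): a cheap columnar predicate discards most rows.
candidates = [r for r in rows if r["amount"] > 490]

# Stage 2 ("CPU" core query): the expensive predicate runs only on survivors.
result = [r for r in candidates if math.sin(r["amount"]) > 0 and "9" in r["note"]]

print(len(candidates))  # 180 survivors out of 10,000
```

In the real system stage 1 would be a wide, branch-free scan the GPU is good at, and stage 2 whatever complex logic the CPU query engine handles best.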
The old-school way (run this query, get my results, read those results using something like JDBC) is too slow. You can literally take out the "run this query" portion and it will still be slower than what we believe needs to exist to power GPU-accelerated workloads.
We are trying to keep up with the pace of hardware improvements that NVIDIA has been churning out. Moving a result set from Postgres ==> "wherever I need to be" by performing copies and making allocations for that data is a cost we want to save people from.
Horizontal scaling matters as well, which is not shown here. Furthermore, with less than 16 GB of data... why would you use these frameworks in the first place?
It is common knowledge that a grep and sort on 100 GB is faster than Hadoop; however, 100 TB is a different story.
Apache Spark is an easy target to beat. I would be much more interested in a comparison to ClickHouse :)
After importing, they spend 34 more minutes converting the data into a columnar representation. All right, so 89 minutes in and we still haven't run queries.
Oh, but it's not distributed yet. Darn, I have to run some non-standard SQL commands like:
CREATE TABLE trips_mergetree_x3
ENGINE = Distributed(perftest_3shards,
OK, can I query my data yet? No, you have to move it into this distributed representation, and that takes 15 more minutes. Oh, OK...
And now? Yes you can run your queries but they aren't really very fast.
SELECT cab_type, count(*)
FROM trips_mergetree_x3
GROUP BY cab_type;
can take 2.5 seconds on a 108-CPU-core cluster for only 1.1BN rows? That's not fast. It's particularly slow given that it requires you to ingest and optimize your data first.
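For context, the per-core scan rate those numbers imply (taking the 2.5 s, 108 cores, and 1.1 billion rows above at face value):

```python
rows = 1.1e9      # 1.1 billion rows
seconds = 2.5
cores = 108

rows_per_sec_per_core = rows / (seconds * cores)
print(f"{rows_per_sec_per_core:,.0f} rows/s per core")  # about 4 million
```

Whether ~4 million rows/s per core is "slow" depends on what you compare it to, which is exactly the disagreement in this thread.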
Maybe you want to show us an example of some simple tests you have run with Blazing and ClickHouse. As I read it now, it's not worth our time to look into because it's so very different from what we are trying to offer, which is:
Connect to your files wherever you have them
Train / Classify
I was hoping to see some serious consideration given to these kinds of benchmarks, considering ClickHouse is one of the most cost-effective tools I've used in the real world and occasionally outperforms things like MapD.
I was expecting your solution to outperform ClickHouse at least in some aspects, and a benchmark showing where it wins. Instead you reveal ignorance of ClickHouse and even the benchmarks you linked.
Your comment comes off as incredibly arrogant and at the same time incredibly misinformed. Disappointing to see this attitude from the team.
If it doesn't take input from Arrow or cuDF, and it doesn't produce output that is Arrow, cuDF, or one of the file formats we are decompressing on the GPU, then we don't care unless one of our users asks us for it.
We are 16 people, and a year ago we were 5. We can't test everything out, just the tools our users need to replace in their stacks. I apologize if I came off as arrogant. I have Tourette's syndrome and a few other things that make it difficult for me to communicate, particularly when discussing technical matters. If I have offended you, I do apologize, but not a single one of our users has said to me "I am using ClickHouse and want to speed up my GPU workloads." Maybe it's so fast they don't mind paying a serialization cost going from ClickHouse to a GPU workload, and if so that's great for them!
I do suggest you seriously benchmark against ClickHouse, because where single-node performance is concerned, it is the tool to beat outside arcane proprietary stuff like kdb+ and BrytlytDB. I have used single-node ClickHouse and seen interactive query times where a >10-node Spark cluster was recommended by supposed experts.
ClickHouse is not a mainstream tool (and I have discussed its limitations in other threads), but it is certainly rising in popularity, and in my view it comes pretty close to first place for general-purpose performance short of Google-scale datasets.
The more complex the query, the less you can rely on being clever and the more the guts need to be performant, and that is what matters more to us right now.
We use the DOT airline on-time performance dataset (https://www.transtats.bts.gov/tables.asp?DB_ID=120) and Yellow Taxi trip data from NYC Open Data (https://data.cityofnewyork.us/browse?q=yellow%20taxi%20data&...) for benchmarking real-time query performance on ClickHouse. I'm working on publishing both datasets in a form that makes it easy to load them quickly. Queries are an exercise for the reader, but see Mark Litwintschik's blog for good examples: https://tech.marksblogg.com/billion-nyc-taxi-clickhouse.html.
We've also done head-to-head comparisons on time series using the TSBS benchmark developed by the Timescale team. See https://www.altinity.com/blog/clickhouse-timeseries-scalabil... for a description of our tests as well as a link to the TSBS Github project.
It would be helpful if you could qualify some of these statements.
Our open source code (the compute kernels, GPU DataFrame, etc.) is part of RAPIDS AI. You can see source and license here: https://github.com/rapidsai/cudf
So is BlazingSQL open source? If it is where is the source code & what license is it?
Also where is the source code for RAPIDS? And what license is it?
We have to start somewhere and if you think we should compare against something else please tell us what you think we should compare ourselves to! We are already looking into some of the other tools that have been mentioned here in the comments.
Did Spark have a 2x slowdown between the tests? It seems pretty misleading.
In our original demo, we ran a V100 GPU against 8 CPU nodes. The new T4 GPUs cut our cost in half, which meant we reduced the Apache Spark cluster to 4 CPU nodes in order to maintain price parity. Even with the reduced GPU memory, the whole workload ran significantly faster.
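The price-parity reasoning above can be made explicit. Only the 2x cost ratio stated in the comment is assumed; the unit costs are relative placeholders:

```python
# Relative setup costs; only the 2x ratio from the comment is assumed.
v100_setup_cost = 2.0
t4_setup_cost = v100_setup_cost / 2   # "cut our cost in half"

original_cpu_nodes = 8
# To keep price parity: the GPU side got 2x cheaper, so halve the CPU cluster.
parity_cpu_nodes = round(original_cpu_nodes * t4_setup_cost / v100_setup_cost)
print(parity_cpu_nodes)  # 4
```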
... on the other hand, that's more difficult to measure, and TDP figures don't reflect what really happens.
I also see a completely different use case. We don't want to ingest data into Blazing; that is NOT our cup of tea. We will make tools to help you persist data quickly (working on it). But we want to read data the way YOU store it, without asking you to duplicate it into our system or requiring significant effort from the user.
You have files? Great, we can read them, query them, do all kinds of acrobatics on them, and we don't need you to store redundant copies of what you already had.
Your files are stored in S3, HDFS, or other file systems? Great, we have built connectors to interface with those file systems, and you can query directly from them.
A user should be able to get Blazing's Docker container, register a filesystem with the engine, and start querying their files within minutes of having launched the container. When we don't deliver on this, we fail. It does not seem to me that AresDB is trying to do this, but I could be completely wrong, and if you think so please tell me why!
BlazingDB is meant to be connected to other tools in the RAPIDS ecosystem. You interact with it through Python. You can zero-copy IPC the data to any other tool in the RAPIDS ecosystem. Want to power a dashboard? Sure, but this isn't the use case we are optimizing for. We don't have a JDBC connection. We don't connect to Tableau.
Blazing focuses on interpreting rather than JITing. Instead of trying to produce the most performant compiled code possible by JITing exactly what we want, we pay a processing cost that reduces I/O. We convert logical plans into physical execution plans suitable for a SIMD architecture, with optimization centered on minimizing materializations and global memory accesses. We tried JIT, but couldn't find a way to compile plans in less than a few hundred milliseconds, at which point the overhead of compiling exceeded that of interpreting operations in a kernel.
We aren't targeting enterprise SQL users specifically. We are targeting people using GPUs who want to accelerate the process of making information available to those GPUs, by leveraging the GPUs for parsing (CSV), decompressing (Parquet), and then transforming data to make it suitable for machine learning, classification, etc.
Conda works too, but it's still a bit painful, and we are working on making it more user-friendly.
We initially bet on OpenCL... but had to write almost everything ourselves. We then switched to GOAI, despite having to write much of Apache Arrow[JS], among other things. I love questions like that now that more and more pieces are falling into place with best-of-class tech. That's a big deal for our users, and it helps our small team punch above its weight in helping them. 2019 will be fun!
SELECT colA + colB * 10, sin(colA) - cos(colD) FROM tableA
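A toy version of the interpreted approach on a query like this: instead of JIT-compiling the projection, walk a small expression tree and apply one kernel per operator over whole columns. Plain Python lists stand in for GPU columns here, and the tree encoding is made up for illustration; only the column names come from the sample query.

```python
import math

# Toy columnar "table".
table = {
    "colA": [0.0, 1.0, 2.0],
    "colB": [10.0, 20.0, 30.0],
    "colD": [0.0, 0.5, 1.0],
}

# Minimal interpreter: each node is (op, *args); "col" reads a column,
# "lit" broadcasts a scalar, everything else applies one kernel per operator.
def evaluate(expr, table):
    op, *args = expr
    if op == "col":
        return table[args[0]]
    if op == "lit":
        return [args[0]] * len(next(iter(table.values())))
    vals = [evaluate(a, table) for a in args]
    kernels = {
        "add": lambda x, y: [a + b for a, b in zip(x, y)],
        "sub": lambda x, y: [a - b for a, b in zip(x, y)],
        "mul": lambda x, y: [a * b for a, b in zip(x, y)],
        "sin": lambda x: [math.sin(a) for a in x],
        "cos": lambda x: [math.cos(a) for a in x],
    }
    return kernels[op](*vals)

# SELECT colA + colB * 10, sin(colA) - cos(colD) FROM tableA
proj1 = ("add", ("col", "colA"), ("mul", ("col", "colB"), ("lit", 10.0)))
proj2 = ("sub", ("sin", ("col", "colA")), ("cos", ("col", "colD")))

print(evaluate(proj1, table))  # [100.0, 201.0, 302.0]
```

Note that each operator materializes a full intermediate column, which is precisely the cost the physical planning described above works to minimize.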