My naïve impression working at smaller scales is that in a reasonably balanced system, storage is often the bottleneck, and if it can deliver data fast enough the CPU probably won't break a sweat counting or summing or so on. The amount of interest in GPU databases, though, suggests that's not the case for some interesting workloads.
Classical "join" patterns, like merge-join, hash-join, and other such algorithms, have been run on GPUs with outstanding efficiency. Taking advantage of the parallelism.
A GPU merge-join is a strange beast though. See here for details: https://moderngpu.github.io/join.html . So it's a weird algorithm, but it is clearly far more efficient than anything a traditional CPU could hope to accomplish.
In any case, it is clear that the high memory bandwidth of GPUs, coupled with their parallel processors, makes the GPU a superior choice over CPUs for the relational merge-join algorithm.
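For anyone who hasn't clicked through: here's a minimal single-threaded Python sketch of the merge-join being discussed. What moderngpu does is parallelize exactly this scan across thousands of partitions; the function and names below are just mine for illustration.

```python
def merge_join(a, b):
    """Sorted merge-join on keys: emit every (i, j) with a[i] == b[j].

    Both inputs must be sorted. A GPU version (like moderngpu's) splits
    this one sequential scan into many independent partitions.
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            # Equal keys: emit the cross product of the two duplicate runs.
            i_end, j_end = i, j
            while i_end < len(a) and a[i_end] == a[i]:
                i_end += 1
            while j_end < len(b) and b[j_end] == a[i]:
                j_end += 1
            for ii in range(i, i_end):
                for jj in range(j, j_end):
                    out.append((ii, jj))
            i, j = i_end, j_end
    return out

print(merge_join([1, 2, 2, 5], [2, 2, 3, 5]))
# [(1, 0), (1, 1), (2, 0), (2, 1), (3, 3)]
```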
GPUs probably can't be used effectively in all database operations. But they can at LEAST be used efficiently for the equijoin, one of the most fundamental and common operations of a modern database. (How many times have you seen "JOIN Table2 ON Table1.id = Table2.id"??) As such, I expect GPUs to eventually be a common accelerator for SQL databases. The only problem now is building the software to make this possible.
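And the hash-join flavor of that equijoin is only a few lines: build a hash table over one side, probe with the other. A toy sketch (all names here are mine, not from any real engine):

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Equijoin (left.key == right.key) via a build/probe hash join."""
    index = defaultdict(list)
    for row in left:
        index[row[key]].append(row)        # build phase over one side
    return [
        {**l, **r}                         # merge each matching pair of rows
        for r in right
        for l in index.get(r[key], [])     # probe phase with the other side
    ]

users  = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
orders = [{"id": 1, "total": 9.5}, {"id": 1, "total": 3.0}]
print(hash_join(users, orders, "id"))
```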
We augment the GPU algorithms with external CPU algorithms when the GPU implementations aren't ideal or can't run due to memory constraints.
We recently talked about this at a session at CMU's Database Group, if you're interested - https://www.youtube.com/watch?v=bSYFwd50EcU
That said, I've heard anecdotes from people I trust that heavily optimized use of CPU vector instructions is competitive with GPUs for database use cases.
Probably true for current software. But there are numerous algorithms that let groups of nodes work together on database joins, even when the data doesn't fit in one node.
Consider Table A (1000 rows) and Table B (1,000,000 rows). Let's say you want to compute A join B, but B doesn't fit in your memory (let's say you only have room for 5000 rows). Well, you can split Table B into 250 pieces, each with 4000 rows.
TableA (1000 rows) + TableB (4000 rows) is 5000 rows, which fits in memory. :-)
You then compute A join B[0:4000], then A join B[4000:8000], and so on. In fact, all 250 of these joins can be done in parallel.
As such, it's theoretically possible to perform database joins on parallel systems, even if the data doesn't fit into any particular node's RAM.
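A rough sketch of that chunked scheme, assuming A is the small table we keep resident while B streams through in 4000-row chunks (in a real out-of-core system B would be read from disk, and each chunk could go to a different node or GPU; here it's just a list for the demo):

```python
def partitioned_join(table_a, table_b, key, chunk_rows=4000):
    """Join A against B in chunks so A plus one chunk of B fits in memory.

    Mirrors the example above: A (1000 rows) stays resident, B (1,000,000
    rows) streams through in 250 chunks of 4000. Each chunk's join is
    independent, so the chunks could run in parallel.
    """
    index = {}
    for row in table_a:                   # A is small: index it once
        index.setdefault(row[key], []).append(row)
    for start in range(0, len(table_b), chunk_rows):
        for b_row in table_b[start:start + chunk_rows]:
            for a_row in index.get(b_row[key], []):
                yield {**a_row, **b_row}

A = [{"id": i} for i in range(1000)]
B = [{"id": i % 1000, "val": i} for i in range(1_000_000)]
print(sum(1 for _ in partitioned_join(A, B, "id")))  # 1000000 matches
```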
Assuming you have 16 GB of RAM on the GPU, theoretically ~1 second is all it would take to load the GPU with that amount of data. And for huge data sets, you'd be able to saturate that link with 5 M.2 SSDs each running at 3200 MB/s, assuming Disk -> RAM -> GPU DMA transfers.
Those would also require at least 5 PCIe 2.0 x8 ports on a pretty performant setup. RAM bandwidth is assumed to be around ~40-60 GB/s, so hopefully no bottlenecks there.
This is assuming your GPU could swizzle through 16 GB of data in a second. GPUs have a theoretical memory bandwidth of 450-970 GB/s, so that's plausible.
Now realistically, per vendor marketing materials, the fastest DB I've seen can ingest at 3 TB/hour ≈ 1 GB/second.
So there must be more to it than the theoretical 16 GB/second business. PCIe 4.0 x16 doubles the speed to 32 GB/s, but at this time that looks pointless to me.
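Back-of-envelope with the numbers in this thread (the constants below are the figures quoted above, not measurements):

```python
# All figures are the ones claimed upthread.
GPU_RAM_GB     = 16
PCIE3_X16_GBPS = 16.0        # ~16 GB/s theoretical for PCIe 3.0 x16
SSD_GBPS       = 3.2         # one M.2 SSD at 3200 MB/s
N_SSDS         = 5

disk_gbps = SSD_GBPS * N_SSDS                        # 16.0 GB/s: matches PCIe
load_secs = GPU_RAM_GB / min(disk_gbps, PCIE3_X16_GBPS)
print(f"fill 16 GB of GPU RAM: ~{load_secs:.1f} s")  # ~1.0 s

vendor_ingest_gbps = 3000 / 3600                     # "3 TB/hour"
print(f"claimed ingest: ~{vendor_ingest_gbps:.2f} GB/s")  # ~0.83 GB/s
```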
Not all queries or database operations are bottlenecked by PCIe. Some are compute bound, others are network bound, etc.
Edit: Unfortunately, each run is on different hardware, but it at least gives you an idea of what's possible.
CMU has an interesting lecture if you want to learn more about Brytlyt.
A couple of interesting things to note:
- They attain some of their speed by requiring that data be pre-sorted
- They've built their database on Postgres for query planning, but for any query that doesn't match what they've accelerated on GPU, they have no ability to fall back to running it through Postgres on the CPU.
- Data is brought into GPU memory at table CREATE time, so the cost of transferring data from disk->host RAM->GPU RAM is not reflected. Probably wouldn't work if you want to shuffle data in/out of GPU RAM across changing query workloads. https://youtu.be/oL0IIMQjFrs?t=1310
- The blog had to use the DATE data type instead of DATETIME for Brytlyt, since it doesn't support the latter. DATETIME, which is a heavier computation, was used for the other DBs. https://tech.marksblogg.com/billion-nyc-taxi-rides-p2-16xlar...
So, all in all, it seems like a more carefully constructed, hardware-balanced comparison would be needed to see which would be quickest.
Also note that the dataset is 600 GB, so it won't fit on a single GPU, not even close.
According to the benchmark, the fastest 8-GPU node takes about 0.5 seconds. The cost of that node on AWS is about $24/hour. The 21-node Spark cluster takes 6 seconds, but it only costs $4/hour.
An additional benefit of Spark is that it can be used for a much wider variety of operations than a GPU.
This cost disadvantage restricts GPU processing to niche use cases.
Using your numbers, the GPU solution has half the cost per query, at 12x the speed? How does that not make sense?
> This cost disadvantage restricts GPU processing to niche use cases.
All GPU compute applications are niche use cases.
This assumes AWS pricing. If you build a farm of GPUs and buy in bulk, you get a much better cost basis. GPU farms are becoming more and more of a thing now, and definitely less 'niche'.
Presumably once you have all your data in memory then the CPU becomes a bottleneck again, and if you can ship out the number crunching to GPUs in an efficient manner (i.e. you don't waste loads of time shuffling data between RAM and GPU) then you'll see performance gains.
In fact, that's why sorting on a GPU is almost always going to be worthwhile. Sorting is an O(n log n) operation, but the transfer is only O(n).
A table join is O(n^2) per join, if I remember correctly. (If you have 5 tables to join, it's an ^2 factor for each one.) That means shipping the data (an O(n) operation) is almost always going to be negligible.
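Quick arithmetic on that claim, taking the O(n log n) comparison count for sort and a naive O(n^2) nested-loop join at face value; the point is only how fast the compute-to-transfer ratio grows with n:

```python
import math

# Rough ops per element transferred: once transfer is O(n), any
# super-linear amount of compute amortizes it as n grows.
for n in (10**6, 10**8, 10**10):
    transfer = n                      # O(n) elements moved
    sort_ops = n * math.log2(n)       # O(n log n) comparisons
    join_ops = n * n                  # naive O(n^2) nested-loop join
    print(f"n={n:.0e}: sort/transfer={sort_ops / transfer:.0f}x, "
          f"join/transfer={join_ops / transfer:.0e}x")
```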
So I'd expect both join and sort to be pushed towards GPUs, especially because both joins and sorts have well known parallel algorithms.
Asymptotic algorithmic complexity for a serial processor is meaningless here; it gives no indication of how a parallel machine (a PRAM, say, or the GPU's unique SIMD model) will perform on the problem.
The arithmetic intensity per byte of memory loaded or stored for sort or join is low. You can exploit memory reuse when data is loaded in the register file for sorting (for a larger working set), but you can do that for sorting on SIMD CPUs in any case with swizzle instructions (e.g., bitonic sorting networks). The GPU is only worth it to exploit the higher memory bandwidth to global memory here, if you're comparing a single CPU to a single GPU.
It's not meaningless. It's a legitimate cap. Bitonic sort, for example, is O(n log^2 n) comparisons, which is more comparisons than the O(n log n) of a classical mergesort.
Let me use a more obvious example.
Binary searching a sorted array should almost NEVER be moved to the GPU. Why? Because it is an O(log n) operation, while the memory transfer is an O(n) operation. In effect: it is more expensive to transfer the array than to perform a binary search on it.
Only if you plan to search the array many, many, many times (such as in a Relational Join operator), will it make sense to transfer the data to the GPU.
In effect, I'm using asymptotic complexity to demonstrate that almost all algorithms of complexity O(n) or faster are simply not worth it on a GPU. The data transfer is an O(n) step. Overall work complexity greater than O(n) seems to benefit GPUs, in my experience.
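To put a number on the binary-search case, assuming transfer cost scales with n and each search costs about log2(n) steps, the break-even is roughly n / log2(n) queries:

```python
import math

# When does shipping a sorted array of n elements to the GPU pay off?
# Transfer ~n; each binary search ~log2(n). Break-even is the query
# count q where q * log2(n) exceeds n.
n = 10**9
breakeven_queries = n / math.log2(n)
print(f"n={n:.0e}: need roughly {breakeven_queries:.2e} searches "
      f"before the O(n) transfer is amortized")  # ~3.3e7 searches
```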
> The GPU is only worth it to exploit the higher memory bandwidth to global memory here, if you're comparing a single CPU to a single GPU.
GPUs have both higher memory bandwidth and far more numerous arithmetic units than a CPU.
The $400 Vega 64 has 2 stacks of HBM2, for 500 GB/s to VRAM. It has 4096 shaders, which provide over 10 TFLOPS of compute.
An $800 Threadripper 2950X has 16 cores / 32 threads. It provides 4 DDR4 memory controllers for 100 GB/s of bandwidth to RAM, and 0.5 TFLOPS of compute.
Arithmetic intensity favors GPUs. Memory intensity favors GPUs. The roofline model says GPUs are better on all aspects. Aka: it's broken and wrong to use this model :-)
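To make that concrete, here's the roofline "ridge point" (peak FLOPS divided by memory bandwidth) from the rough numbers above; everything below the ridge is bandwidth-bound, which is where sorts and joins live:

```python
# Ridge point: the FLOP/byte ratio where a chip stops being
# memory-bound and becomes compute-bound. Numbers from this thread.
chips = {
    "Vega 64":           (10e12, 500e9),   # (peak FLOP/s, bytes/s)
    "Threadripper 2950X": (0.5e12, 100e9),
}
for name, (flops, bw) in chips.items():
    print(f"{name}: ridge at {flops / bw:.0f} FLOP/byte")
# Sorts and joins do only a few ops per byte, far below either ridge,
# so the 5x bandwidth edge matters more than the 20x FLOPS edge.
```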
GPUs are bad at thread divergence. If there's an algorithm with high divergence (i.e., a minimax chess algorithm), it won't be ported to a GPU easily. But if you have a regular problem (sorting, searching, matrix multiplication, table joins, etc.), they tend to work very well on GPUs.
Many, many algorithms have not been ported to GPUs yet, however. That's the main downside of using them. But it seems like a great many algorithms can, in fact, be accelerated with GPUs. I just read a paper that accelerated linked-list traversals on GPUs, for example (!!!). It was a prefix-sum over linked lists.
Searching sorted arrays is actually very common in these workloads. Why? Analytics workloads typically operate on timestamped data stored in a sorted fashion where you have perfect or near perfect temporal and spatial locality. Thus even joins tend to be cheap.
With Skylake and AMD Epyc closing in on 300 GB/s, and much better cost efficiency per GB of memory vs. GPU memory, the case for GPUs in this application seems dubious.
I will grant you that GPUs have a place in more complex operations like sorts and joins with table scans. They also blow past CPUs when it comes to expensive computations on a dataset (where prefetching can mask latencies nicely).
A good example of a dense sort + join GPU workload would be looking for "Cliques" of Twitter or Facebook users. A Clique of 3 would be three users, A, B, and C, where A follows B, B follows C, and C follows A.
You'd perform this analysis with two joins: joining the follower-followee table against itself, with three instances of the table in play.
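A toy version of that triangle search, treating the two joins as hash lookups (the edge data and names are made up for illustration):

```python
def triangles(follows):
    """Find 3-cliques in a set of (follower, followee) edges.

    This is the two-join plan from above: join follows with itself
    (A follows B, B follows C), then close the cycle (C follows A).
    """
    out_edges = {}
    for a, b in follows:
        out_edges.setdefault(a, set()).add(b)
    found = set()
    for a, b in follows:                            # instance 1: A -> B
        for c in out_edges.get(b, ()):              # join 1:     B -> C
            if c != a and a in out_edges.get(c, ()):  # join 2:   C -> A
                found.add(frozenset((a, b, c)))
    return found

print(triangles({("A", "B"), ("B", "C"), ("C", "A")}))
# {frozenset({'A', 'B', 'C'})}
```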
So it really depends on your workload. But I can imagine that someone who is analyzing these kinds of tougher join operations would enjoy GPUs to accelerate the task.
Alternatively, you can saturate PCIe lanes with GPUs and load data from RAM.
You can load compressed data onto the GPU, decompress it there, and run very complicated mathematical functions. This can be very beneficial when you run a JOIN operation.
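The intuition in one line of arithmetic: if decompression happens on the GPU, effective PCIe bandwidth scales with the compression ratio (the 16 GB/s and the ratios here are my assumptions, not SQream figures):

```python
# Shipping compressed columns multiplies effective transfer bandwidth,
# since the PCIe link carries compressed bytes and the GPU decompresses.
PCIE_GBPS = 16.0          # assumed PCIe 3.0 x16
for ratio in (1, 3, 10):  # e.g., dictionary/RLE-encoded columns
    print(f"{ratio}x compression -> ~{PCIE_GBPS * ratio:.0f} GB/s effective")
```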
See a comparison here, with SQream DB (a GPU accelerated data warehouse) vs. some unnamed data warehousing solutions - https://sqream.com/why-im-not-afraid-of-sql-joins/
In many ways, this paper is easier to read: http://www.csee.usf.edu/~tuy/pub/SSDBM17.pdf
But I'm pretty sure the "moderngpu" version of merge-join is closer to optimal.
While I originally didn't get the case for GPU-accelerated databases, it makes more and more sense: bandwidth between GPU and CPU is steadily increasing, and the latency of GPU<->CPU syncs is shrinking, which makes the GPU an increasingly attractive option.
It does make coding on the system a bit more difficult, but we found that the performance gains were worth it.
Also, I would add that this system seems, at least for now, geared toward solving a specific set of problems Uber is grappling with around time-series data, whereas OmniSci is trying to tackle general-purpose OLAP/analytics problems. Not that solving specific use cases isn't great/valid, or that they don't plan to expand the functionality of AresDB over time to encompass broader use cases.
That said, there is interesting academic work around GPU sort that achieves even higher performance than Thrust in many scenarios, and we are looking at the feasibility of incorporating a framework we have found particularly promising.
We were considering doing that some years ago, but opted for writing our own (GPU accelerated) database too.
We talked about it recently at a CMU Database Group talk (relevant parts start at around 17:00 - https://youtu.be/bSYFwd50EcU?t=1028)
Can you translate raytracing to DB relations?
You have a raytracing engine, and many proven approaches to be able to determine what is visible from any given point.
What if the "point" was, in fact, "the query" and from that point - you apply the idea of raytracing to see what pieces of information are the relevant fields in a space of a DB that pulls in all those fields - and the shader translates to the "lens" or "answer" you want to have from that points perspective?
Is this where a GPU would help? Why? Why not?
In practice, does it make sense? Not really, since each ray is not the same amount of work. One ray can go straight to the camera. Another could bounce many, many times, passing through many objects before hitting the camera or dying out altogether. GPUs are terrible at handling workloads with high thread divergence (some threads run much slower than others).
I like how they highlight the different existing things they tried, and the shortcomings that led to building this - although I'm not sure where I could use this to solve a problem.
ClickHouse scan performance is through the roof, but it also seems fairly difficult to operate compared to the alternatives. For example, the concept of "deep storage" in Druid and Pinot makes rebalancing and replication trivial (at the expense of additional storage space). ClickHouse doesn't have that and requires more babysitting. And that's without even going into something like BigQuery, which is on a completely different level regarding operational simplicity if your use cases support it.
Also, if queries are heavily filtered by arbitrary dimensions, then ClickHouse starts to lose its edge compared to fully-indexed systems.
This makes ClickHouse fairly niche IMO, despite being exceptional at what it does.
I wouldn't call GPUs perfect for analytics functions. It can be painful to force your data processing algorithms through shaders. What GPUs are is cheap, in terms of dollar cost per FLOP and power consumption.
One peeve: not loving the idea of having to use AQL, and of not having a command-line analytical interface.
Does AQL support windowing functions, CTEs, or subqueries?
Why does any of that require real-time analysis? What's the business need for this being instantaneous? Feels very NIH...
In short, Uber has a ton more infrastructure because they need to do more in real time. Calculating the price of a trip from point A to point B needs to be done in real time. Then there are other factors like Pool, surge, and various rider preferences that all need to be accounted for.
Uber is sort of unique in how much of their stack necessitates real-time analytics.
- Build dashboards to monitor our business metrics
- Make automated decisions (such as trip pricing and fraud detection) based on aggregated metrics that we collect
- Make ad hoc queries to diagnose and troubleshoot business operations issues
None of those require real-time information, because there isn't real-time decision-making required from the information presented.
With the exception of the "automated decisions", I don't see the need.
Frankly, it seems like AMD is going to be more successful with HCC and its HIP (CUDA compatibility layer). It's a distant 2nd place vs NVIDIA, but the CUDA-like single-source environment is a superior development platform.
OpenCL 2.2 seems to be only supported on Intel platforms. Either Intel GPUs, Intel CPUs (to AVX512), or Intel Altera FPGAs.
If I were to write OpenCL for AMD GPUs, I'd stick with OpenCL 1.2 unless there was an absolute need for OpenCL 2.0 features. The OpenCL 1.2 stuff just seems more mature and better tested on AMD GPU platforms. Hopefully ROCm changes that, but that's my opinion on the current state of affairs.
> Khronos has talked about the convergence between OpenCL and Vulkan - a little clumsily as it turns out - as the message has been often misunderstood - our bad. To clarify:
> a. OpenCL is not going away or being absorbed into Vulkan - OpenCL Next development is active
> b. In parallel with a. - it is good for developer choice if Vulkan also grows its compute capabilities
> c. OpenCL can help b. through projects like clspv that test and push the bounds of what OpenCL kernels can be run effectively over a Vulkan run-time - and maybe inspire Vulkan compute capabilities to be expanded.
Related article: https://www.phoronix.com/scan.php?page=news_item&px=Vulkan-O...
* As a datapoint, Pinot/Druid/ClickHouse can do 1B timeseries on one server. AresDB sounds like it's in the same ballpark here.
* Pinot/Druid don't do cross-table joins, whereas AresDB can. My understanding is these are at (or near?) sub-second, which would be a very distinguishing feature. I'm not sure how this will translate when distributed mode is built out, as shuffling would become the bottleneck. Maybe there would be some partitioning strategy that allows arbitrary joining within a partition, or something?
* ClickHouse can do cross-table joins, but they aren't going to be sub-second.
* AresDB supports event deduping. I think this can easily be handled by the upstream systems (Samza, Spark, Flink, ...) in a lambda architecture.
* Reliance on fact/dimension tables.
- This design/encoding is probably to help overcome transfer from memory to GPU, which in my limited experience with Thrust was always the bottleneck.
- High-cardinality columns would make dimension tables grow very large and could become unmanageable (unless they are somehow trimmable?)
And don't forget that rolling out commercial products in an enterprise is often expensive and time-consuming, e.g. involving legal, procurement, architecture, etc.