One of the VPs of engineering called it “toil vs talent”. People who “toiled” at work, meaning doing good maintenance work, would be rewarded with good bonuses, but those with “talent” would be rewarded with promotions. Of course this drove people to come up with fake new services so that they could demonstrate “talent”. It also led to an explosion of new services that overlapped or did nothing useful. Instead of working with existing service owners, groups would spin up new services because they needed to justify writing one. It was sickeningly transparent.
This project was one of those projects. It has no real use case, because why the fuck would we want to use GPUs except to look cool on your resume? The sad thing is that the project overstates how well it’s being used internally. Internally, people use Pinot instead of this.
For all you future CTOs, consider your incentive schemes carefully and don’t be so far removed from the action that you can’t see when your org is rotting. This is what the CTO did, and like I said, it was one of his biggest failures because it gutted the engineering org. Instead of working together, every team was looking to get promotions at the expense of the company, and it showed.
Early on, I made a harmless comment in Slack about how one person’s pet project wasn’t a good fit for our needs, so our team would be using the older, more proven solution. Later that evening the person pulled me aside, almost in tears, begging me to never say anything critical about his project in a Slack channel again. He explained that at his previous role, success or failure depended entirely on the perception of one’s personal projects, and that seemingly innocent comments could tank someone’s promotion chances for years.
I felt bad for him because he had clearly come out of a toxic situation. However, one of his teammates later warned me that he was keeping a journal of potentially incriminating things that I had said in Slack and a detailed log of every issue that he could find with our team’s project in case he “had to use it against me later”.
I could never tell if this was a unique experience or the norm at some companies like Uber.
The engineer you describe sounds like they have mental health issues. There may be some teams with terrible managers, but every company has those, and I’ve seen similar or worse situations at Amazon.
Most engineers I worked with were great but there were many engineers that “played the game” in order to get a promotion and more money. It was sickening but if that’s the way the CTO sets the incentive scheme, who can blame an engineer for following it? It’s more on the CTO for setting the terrible culture than the engineers.
Pretty much every company out there has the concept of a 'promotion packet'; it's basically building a case for one's promotion. Of course, in a company the budgets are fixed, and so are promotion cycles (yearly in most places). If you miss a turn, you could lose a year, or even risk losing two. In that case it's fairly common for managers to build a list of accomplishments (a file/packet), and for rival managers to build an anti-case/defence against the same. Stack ranking is eventually all about a combination of merit + advocacy + lobbying + counter-lobbying at so many levels that I'd say the engineer who cried wasn't wrong at all.
This is the case in nearly every company. We just wish to delude ourselves that politics is absent at some places.
This sort of power play comes with the territory in a large people structure.
And these people are louder and more common now because it's easy to hide criticism and claim accomplishments with all the politics, buzzwords and general sensitivity these days.
Because he discourages critical feedback, it leads to two major problems:
1) He does not improve [because of lack of feedback].
2) He discourages team around him from improving [because of lack of feedback].
Why not explain that to him (and then if he does not understand - fire him)?
That is not cool.
I think he was operating under the assumption that he could build rapport quickly and then hire a team underneath him to get the work done. That might work at a hyper-growth startup that values growth over profit, but we were a mature and profitable company looking to keep headcount reasonable. Most of his plans were so over-engineered that they would have taken 5-10x the engineers to actually finish on time, so they ended up being half-finished projects that required constant on-call attention.
That was the tipping point for a lot of us in the old guard. There was an exodus around that time, including myself. It’s not worth fighting those political battles day in and day out while walking on eggshells in every Slack channel.
Good on you for exiting.
By the way, Uber already had two real-time analytics systems before AresDB. One is the aforementioned ES-based service, and the other is Pinot, which was owned by a Pinot contributor. I was in one of those so-called alignment meetings about using AresDB. Engineers from both the ES service and Pinot were there. It was a disaster. The engineers simply asked what AresDB was supposed to solve, and presented chart after chart to show that computation or the lack of a join operator was never a problem (because for analytics, data streams can be pre-joined), while efficient IO was. The AresDB team simply repeated that join was important and parallel computation was critical.
I left the company soon after, so I'm not sure how many critical use cases AresDB has been serving since then. Hopefully they do find some sweet spot to justify the cost of developing such a system.
In combination with what you listed above, and the rigid, narrow pay bands, it's no wonder everyone is fishing for constant promotions.
I was thinking about this a bit after reading your comment. Now, I say this having recently negotiated pretty hard for a "Staff" title, and having previously had a "Team Lead" title. I think in certain situations it makes sense to have the title authority to shoot obviously bad engineering proposals down, but this has been the exception, rather than the rule, in most of my 12 years of being a software engineer.
So your comment makes me wonder: would it make sense to have a system where everyone would simply be called "Engineer", but engineers could vote secretly, after having worked with another engineer, on various aspects of that colleague's technical expertise? Engineering Managers or perhaps HR would be aware of the engineering votes, but engineers wouldn't, which would remove much of the implicit bias in engineering meetings. Rather, pay grades would be determined by votes, but no one would ever just "Leave it to Yakaaccount, she's the principal engineer". Everyone would have responsibility to be a solid engineer.
It so happens that at that company everyone's title was simply 'Software Engineer'. There was no ego bs, so it was a great place to work. I think the fact that how helpful you were to others was part of the review is another reason for that. There was a way to see some people's 'rank' by looking at the org chart if you really wanted to. In general, swe and senior swe would be under an eng manager; if someone was directly under a director, that told you something, and if someone was directly under a vp, that also told you something.
I worked at a place that had a system like this. Started great, but turned toxic after a hiring spree.
> Everyone would have responsibility to be a solid engineer.
You would think so, but people with other intentions try to find ways to manipulate the system to achieve various strange goals. They're often successful.
Since mid-2019 or so, the company realized this mistake and the pendulum swung back towards TK’s more Ayn Randian compensation philosophy. But the damage was done. FWIW I never saw much of a power dynamic around title. But it is definitely the most important task of an engineering manager to secure promotion-worthy projects and hand them out intelligently. Since promotion is based on impact, it’s almost entirely determined by the project charter. Difficulty or skill deployed in execution is a tertiary concern.
Also given the precipitous drop in equity value, promotions with hefty raises are required just to keep people’s TC in moderate decline instead of freefall.
Well, there are reasons, but it's true that GPUs in DBMSes is still something that's very immature.
> Internally people use Pinot instead of this.
Can you elaborate a bit regarding the choice of Pinot over other analytic DBMSes?
Uh. What does the JVM have to do with the data model’s ability to handle joins?
1. There are very few analytic DBMSes which are actually fast (and compare against reasonable baselines). Most claims of speed are bogus. Or rather, they might be better than what's otherwise available to use, but are still slow.
2. Designing an analytic DBMS to properly utilize a massively-parallel processing device is a monumental task, and I would claim that it has not yet been undertaken. Existing research and production systems graft such use onto a system whose fundamental design dates back to the 1980s in many ways.
3. CPU-utilizing analytic DBMSes are typically faster than GPU-based ones, to a great extent due to the above - but also since we've had decades of work on optimizing them.
4. GPUs are artificially handicapped on Intel-architecture systems, because they are placed "far" from main memory relative to the CPU. More literally - the bandwidth you get between your GPU and main memory is typically 0.25x the bandwidth a CPU socket has with main memory. This is critical for analytic query processing (as opposed to neural network simulation, which is more computation-heavy and can tolerate this handicap much better).
PS - Always glad to discuss this further with whoever is interested.
There are also other considerations such as: The desire to combine analytics and transactions; performance-per-Watt rather than per-processor; performance-per-cubic-meter; existing deployed cluster hardware; vendor lock-in risk; etc.
A GPU, on the other hand, is slightly more complex. It behaves like lots of small CPUs, each with its own local memory. They can access the full swath of memory, but only in parts, and copying between the two is much more expensive. If the problem can be boiled down to a map on the GPU followed by a reduce, the GPU excels. If the problem is serial, or can be parallelized with SIMD instructions, the CPU will run circles around the GPU.
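To make the map/reduce vs. serial contrast concrete, here's a minimal Python sketch. It assumes CuPy is installed and a CUDA-capable GPU is present; the array sizes and the 1.08 / 0.9 constants are just placeholders for illustration:

    # Illustrative only, not from the article; assumes CuPy + a CUDA GPU.
    import numpy as np
    import cupy as cp

    n = 50_000_000
    prices = np.random.rand(n).astype(np.float32)

    # GPU-friendly: an elementwise "map" followed by a "reduce".
    prices_gpu = cp.asarray(prices)            # one bulk host -> device copy
    total = float(cp.sum(prices_gpu * 1.08))   # thousands of threads cooperate

    # GPU-hostile: a serial recurrence; each step depends on the previous one,
    # so it cannot be spread across the GPU's many small cores.
    running = 0.0
    for p in prices[:1_000_000]:
        running = 0.9 * running + float(p)

The first pattern pays the host-to-device copy once and then keeps every GPU core busy; the second forces one dependent step at a time, which is exactly where a CPU (especially with SIMD on any independent parts) wins.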
SIMD has come pretty far.
Now, it might be a lot less than 10x if your use of the GPU is suboptimal in any way, which means you don't really have a lot of leeway for non-optimality in your design. Plus, there's the handicap I mentioned in main-memory bandwidth - while the GPU's on-board memory is much smaller.
On the other hand - whoever said that the system you're measuring against is making optimal use of CPU resources? It may very well not, in which case the computation and memory throughput ratios are not upper bounds at all.
Just remember that if someone tells you "I got a 50x speedup by porting this to a GPU!" - then more likely than not, their baseline was a massively sub-optimal system. Which is not to say their work is without merit: Improving the performance of a real-life system does make actual work go faster, today, rather than pursuing a dream of future optimality.
I realize it's FOSS, but - I can't just go read their sources.
My intuition says that they're probably doing a decent-but-not-optimal job for their own use-case and are not planning on developing it into something more general. Specifically, the fact that they only accept their own query language is fishy.
I also note that the public repository on GitHub has not been updated for about half a year.
Caveat: Ares might be brilliantly designed and implemented, I can't really fault them for anything with any certainty. Just speculating here.
Maybe my question is more around: what business decision would be impacted by not having real-time, instantly refreshed dashboards?
Honest question here, not trolling.
On the other hand, having enough people or money to throw at something was never a problem, so…
It was some amazing tech, but it falls into the category of "when all you have is a hammer, everything looks like a nail". Sometimes you really need a company culture that rewards people for creating value instead of producing deliverables for promotion.
(I work on ClickHouse and enjoyed this article when it came out.)
> That said, I've heard anecdotes from people I trust that heavily optimized use of CPU vector instructions is competitive with GPUs for database use cases.
This comment is important, IMO. It's also relevant to applied ML inference in applications... the memory needs can grow quite a bit, and the data transfer cost, plus the GPU memory size limitations vs RAM, becomes very real very fast.
Not sure I understand the scale of the use case, or where it's mentioned, especially in comparison to the big data tools mentioned.
Absolutely, but as last year’s discussion highlights, a bunch of GPUs connected via NVLINK kind of gives you the aggregate memory of the set for some of these database applications (large-scale ML training has also gone this way).
That’s why our A100 system design is 16 A100s in a single host. 16x40 GB gives you 640 GB of aggregate memory, which is pretty attractive for many applications.
The question as always is cost vs benefit. If there’s something that a GPU backed < noun > can do that you “couldn’t” with a large Intel/AMD cpu box, or is actually a large integer multiple cheaper, it’s probably worth the development effort.
We'll be on premium support soon ... hoping we can get access to folks like yourself for some of this.
Do you have models >16 GB that you’re trying to do real-time inference against?
Feel free to send me an email regardless! (In my profile)
Edit: https://news.ycombinator.com/item?id=23800049 was my writeup for a recent Ask HN about cost efficient inference.
* CPUs: medium data, and queries that are small / slow / irregular
* GPUs: general analytics over data that is small (in-memory) or large / streaming data (replace Spark):
-- small data (100MB - 512GB): all in GPU memory, so the question is whether the queries are boring ("select username from django_table" - better in psql) or compute-heavy ("select price where ..." - better in GPU SQL)
-- medium data: data sits in CPU RAM / SSD with compute nodes, and in a preorganized / static fashion, e.g., a time series DB. Too much data for GPU RAM, yet enough for local SSD, so the PCI bus is the bottleneck (8-32 GB/s)
-- large data (ex: 10TB spread through S3 buckets) + streaming (ex: 10GB/s netflow): you'll be waiting on network bandwidth anyways, so network link of 10GB/s -> PCI of 10GB/s -> GPU wins out over CPU equiv anyways. Good chance, instead of the pricey multi-GPU V100/A100s, you'll want a fleet of wimpy T4 GPUs.
As network/disk<>GPU high-bandwidth hw rolls out and libs automate their use, the current medium data sweet spot of CPU analytics systems goes away.
-- The category of 'irregular' (non-vectorizable) compute has been steadily shrinking for the last ~30 years as it's an important + fun topic for CS people. Even CPU systems now try to generally optimize for bulk-fetches (cacheline, ...) & SIMD-compute (e.g., SIMD over columns), and that inherently can only go so far until it's effectively a GPU alg on worse hw.
I see other areas in practice like crazy-RAM CPU boxes and FPGA/ASIC systems that I'm intentionally skipping as these end up pretty tailored, while my breakdown above is increasingly common for 'commodity HPC'.
First, this is essentially limiting the scope of "analytics" to selection/aggregation centric operations which are memory bandwidth bound. Many types of high-value analytic workloads and data models don't look like that. Even when 90% of your workload is optimal for GPUs, I've often seen the pattern that the last 10% is poor enough that it largely offsets the benefit.
Also, GPUs have better memory bandwidth than CPUs but people overlook that CPUs can use their limited memory bandwidth more efficiently for the same abstract workload, so the performance gap is smaller than memory-bandwidth numbers alone would suggest.
Second, 10TB is tiny; this is around the top end of what we consider "small data" at most companies where I've worked. For example, in the very broad domain of sensor and spatiotemporal analytics, we tend to use 10 petabytes as the point where data becomes "large" currently, and data sets this size are ubiquitous. This data is stored with the compute when at all possible for obvious reasons -- it ends up looking more like your "medium" case in practice, albeit across a small-ish number of machines. The cost of processing tens of petabytes of data on GPUs would be prohibitive.
Lastly, a growing percentage of analytics at every scale is operational real-time, so new data needs to be integrated into the analytical data model approximately instantly. GPUs are not good at this type of architecture.
GPUs have their use cases, but their Achilles' heel is that their performance sweet spot is too narrow for many (most?) real-world analytic workloads, and for some workload patterns the performance can be much worse than CPUs. CPUs provide much more consistent and predictable performance across diverse and changing workload requirements, which is a valuable property even if it means worse performance for some workloads. Databases give considerable priority to minimizing performance variability because users do.
- RE:Gap, sort of. Ultimately, it's still typically there, though, in three important ways. The set of interesting compute that actually needs that CPU architecture is increasingly small relative to workloads ("super-speculative thread execution on highly branchy..."). Multi-core CPU vs. single GPU is more like 2-10X for most tuned code: most 100X claims are apples/oranges b/c of that. When you get beyond those workloads, 100X becomes real again for multi-GPU / multi-node b/c of the bandwidth. Yeah, your real-time font library might still win out on CPU SIMD, but you have to dig for stuff like that, while the more data/compute you have, the more this stuff matters & the easier it gets.
- RE:scale, storing 10PB in CPU RAM is also expensive, so we're back to streaming... and thus back to where GPUs increasingly win. Even if you could afford that in CPU RAM, you can probably afford making it accessible to the GPUs too, and then save not just on the hw, but the power (which becomes the dominant cost). Your example of large-scale & real-time spatiotemporal data seems to lean very much towards GPU, all the way from ETL to analytics to ML. It's still hard to write that GPU code as the frameworks are all nascent, so I wouldn't fault anyone for doing CPU on production systems here for another few years.
-- RE:real-time: writing is on the wall, mostly around (again) getting the unnecessary CPU bandwidth bottleneck out of the way in HW, and (harder), the efforts to use that in SW.
No one stores 10PB in RAM that I know of. A good CPU database kernel will run out of PCIe lanes driving large gangs of NVMe devices at theoretical bandwidth without much effort. The performance for most workloads is indistinguishable from in-memory, but at a fraction of the cost. It would be slower to insert GPUs anywhere in this setup. (In modern database kernels generally, "in-memory" offers few performance benefits because storage has so much bandwidth that a state-of-the-art scheduler can exploit.) An interesting open research question is the extent to which we can radically reduce cache memory, since state-of-the-art schedulers can keep query execution fed off disk for the most part, even in mixed workloads. Write sparsity still recommends a decent amount of cache for mixed workloads, but probably much less than Bélády's optimality algorithm superficially implies.
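A rough back-of-envelope sketch of why that is; the figures below are assumptions for illustration, not measurements of any particular box:

    # Illustrative, assumed figures only; real hardware varies widely.
    nvme_seq_read_gbps = 7.0      # one PCIe 4.0 x4 NVMe drive, sequential read
    drives = 16                   # a "gang" of NVMe devices on dedicated lanes
    storage_gbps = nvme_seq_read_gbps * drives   # ~112 GB/s streaming off disk

    cpu_dram_gbps = 200.0         # one modern server socket <-> main memory
    host_gpu_gbps = 32.0          # PCIe 4.0 x16 link between host and a GPU

    print(f"storage -> CPU : ~{storage_gbps:.0f} GB/s")
    print(f"CPU <-> DRAM   : ~{cpu_dram_gbps:.0f} GB/s")
    print(f"host <-> GPU   : ~{host_gpu_gbps:.0f} GB/s (bottleneck if data must cross it)")

With a scheduler that keeps scans streaming off those drives, the disk path is already within a small factor of DRAM bandwidth, while anything that has to be shipped across the host-GPU link is capped well below both.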
Almost nothing is CPU-bound in databases these days in reasonable designs, not even highly compressive data model representations, parsing, or computational geometry. Which is great! A lot of analytics is join-intensive, but that is more about latency-hiding than computation. I would argue that the biggest bottleneck at the frontier right now is network handling, and GPUs don't help with that, though FPGAs/ASICs might.
I'm not sure how a GPU would help with operational real-time. Is it even possible to parse, process, and index tens of millions of new complex records per second over the wire, concurrent with running multiple ad hoc queries, on a GPU? I've done this many times on a CPU, but I've never seen a GPU database that came within an order of magnitude of that in practice, and I've used a few different GPU databases plus some custom bits. GPUs work better in a batch world.
I use GPUs, just not for analytical databases. I am biased in that GPU databases have consistently failed to deliver credible workloads across many scenarios in my experience and I understand at a technical level why they didn't live up to their marketing. Every time one gets stood up in a lab, and I see many of them, they fail to distinguish themselves versus a state-of-the-art CPU-based architecture. Most of them actually underperform in absolute terms. Almost everyone I know that has designed and delivered a production GPU database kernel eventually abandoned it because CPUs were consistently better in real-world environments.
GPU capabilities are improving, but I have seen limited progress in directions that address the underlying issues. They just aren't built to be used that way, and there are other applications for which they are exceedingly optimal that we wouldn't want to sacrifice for database purposes. CPU developments like AVX-512 get you surprisingly close to the practical utility of a GPU for databases without the weaknesses.
Anyway, this is a really big, really large conversation. It doesn't fit in the margin of an HN post. :-)
Assuming the analytic workload does have some compute, however, that's more of a comment about traditional systems having bad bandwidth than GPUs themselves. GPUs are already built for latency hiding, so it's more like CPUs are playing catchup to them. Two super interesting things have been happening here going forward, IMO:
- Nvidia finally got tired of waiting for the rest of the community to expose enough bandwidth. $-wise, they bought Mellanox and are now trying for ARM. In practice, this means providing more storage->device and network->device bandwidth through improved hw+sw like https://developer.nvidia.com/blog/gpudirect-storage . I'm not privy to hyperscaler discussions on bandwidth across racks/nodes, but the outside trend does seem to be "more bandwidth", and I've been watching for straight-from-network ingest from cloud providers.
- Price drops for more hetero hardware. E.g., the T4 on AWS is ~6X cheaper than the V100, with less choice in stuff like 64b vs 32b, yet has similar memory and perf within that sweet spot of choices. Nvidia pushes folks to DGX (big SSD -> local multi-GPU), which works for some scales, but in the wild, I see people often land on single-T4 / many-node once you take network bandwidth + cost into consideration in larger and more balanced systems.
For our own workloads, we don't trust GPU DBs enough to get it right, so we found RAPIDS to be a nicer sweet spot, where even junior devs can code it (Python dataframes), while perf people can predictably tune it and appropriately plug in the latest & greatest. Out-of-memory / streaming / etc. only became a thing starting in ~December, e.g., see the recent https://github.com/rapidsai/tpcx-bb results + writeups, so it's been a wild couple of years. We still stick to single-GPU / in-memory for our workloads as we care about sub-second, but have been experimenting & architecting for when ^^^^ smooths out for our use (and to help our customers who have different-shaped workloads). I've been impressed by stuff like the T4-many-node experience as layers like dask-cudf and blazingsql build up.
This categorical insight should be front and center of any discussion about the relative merits of using GPUs vs CPUs.
I would agree that productive GPU data frameworks in streaming modes are nascent, e.g., https://medium.com/rapids-ai/gpu-accelerated-stream-processi... .
Maybe Nvidia will start to create ML-type cards with memory expansion options at some point.
Microsoft is bringing the DirectStorage API from Xbox to Windows; Nvidia calls theirs RTX IO. I think they're the same class of idea, like Vulkan vs. Metal.
They do have SBCs, I think, but other than being the basis for the Nintendo Switch I haven't heard much about them.
What does the JVM have to do with joins?
Completely original excuse for an over-staffed engineering organization to justify doing some crazy stuff.
I'm not incredibly familiar with Ares save the linked article, but we aren't a DBMS and don't manage data in any way.
BlazingSQL is a SQL engine, it's easier to think of it similar to SparkSQL, Presto, Drill, etc.
We're core contributors to RAPIDS cuDF (CUDA DataFrame), which is a Python and C++ library for Apache Arrow data in GPU memory. The Python library follows a pandas-like API, and the compute kernels are in C/C++.
BSQL binds to the same C++ as the pandas-like cuDF. What this enables users to do is interact with a DataFrame with either SQL or pandas depending on their needs or preferences. This interoperability means that the rest of the RAPIDS stack can be applied to a variety of different use cases (data viz, ML, Graph, Signal Processing, DL, etc), with the same DataFrame.
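A minimal sketch of what that interoperability looks like in practice, assuming the RAPIDS stack (cudf, blazingsql) is installed and a GPU is available; "trips.parquet" and the column names are made up for illustration:

    # Illustrative sketch; file and column names are hypothetical.
    import cudf
    from blazingsql import BlazingContext

    bc = BlazingContext()

    # Load data into GPU memory with the pandas-like cuDF API.
    trips = cudf.read_parquet("trips.parquet")

    # Register the same GPU DataFrame as a SQL table and query it with BSQL.
    bc.create_table("trips", trips)
    busy = bc.sql("SELECT city, COUNT(*) AS n FROM trips GROUP BY city")

    # The result is again a cuDF DataFrame, so you can keep going with the
    # pandas-style API, or hand it to cuML, cuGraph, visualization, etc.
    top10 = busy.sort_values("n", ascending=False).head(10)

For multi-GPU runs like the ones linked below, BlazingContext is typically backed by a Dask cluster of GPU workers, but the single-GPU flow above is the core idea.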
The DataFrame also has performant libraries for IO, Joins, Aggregations, Math operations, and more.
Here is an example of running a query on ~1TB on a single GPU in under 9 minutes. The data was stored on AWS S3 in Apache Parquet. https://twitter.com/blazingsql/status/1303370102348361729
Here is an example of scaling that same query up to 32 GPUs and running it in 16 seconds. https://twitter.com/blazingsql/status/1304450203030880257
Again, think of BSQL as a query engine, that runs queries on data wherever and however you have it. Here is a BSQL user running 1-2 minute queries on 1.5TB of CSV files using 2 GPUs. https://twitter.com/tomekdrabas/status/1303824164273270789
Let me know if that helps at all (or not).