* accelerating data management systems
* [...] streamlining their development
* [...] consolidate and unify data management systems
Can someone translate this to English? I can see and recognize the individual meanings of the words, but I don't understand what they're trying to say.
The section titled "An Overview of Velox" gets into the meat of it: you give it some data and an optimised plan of the operators you want applied (expression evaluation, aggregation, etc.), and Velox then executes that plan as efficiently as possible given the available compute resources.
That way multiple top-level systems like analytics databases, dataframe implementations, etc. can all share the same underlying execution engine.
No, it's close to the core internals of an OLAP database. The "operators" in question are things like filter, join, aggregate, group by, projection (select), things of that nature. It makes sure to use available resources like SIMD and multithreading to do that efficiently.
If you built a SQL parser -- and also the glue to create query plans from that -- you could attach it to Velox to do all that on some data source, for example. But you'd still need a storage layer (disk, S3), and some kind of higher coordination layer if you wanted a complete database running across multiple machines. The query execution engine is a critical component, however.
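To make the "plan of operators" idea concrete, here is a deliberately tiny, hypothetical sketch, written in Rust purely for illustration (Velox itself is C++ and its real API looks nothing like this): a "plan" is just an ordered list of operators, and the engine applies them to batches of columnar data.

```rust
// Hypothetical illustration of "a plan is a chain of operators applied to
// columnar batches". This is NOT the Velox API, just the general idea.

type Column = Vec<i64>; // one column of a batch (real engines use typed vectors, nulls, SIMD, ...)

struct Batch {
    user_id: Column,
    amount: Column,
}

// An operator consumes a batch and produces a batch (filter, project, aggregate, ...).
trait Operator {
    fn execute(&self, input: Batch) -> Batch;
}

// A filter operator: keep rows where amount > min_amount.
struct Filter {
    min_amount: i64,
}

impl Operator for Filter {
    fn execute(&self, input: Batch) -> Batch {
        let mut out = Batch { user_id: vec![], amount: vec![] };
        for (uid, amt) in input.user_id.iter().zip(input.amount.iter()) {
            if *amt > self.min_amount {
                out.user_id.push(*uid);
                out.amount.push(*amt);
            }
        }
        out
    }
}

// The "engine": run the operators of the plan in order over a batch.
fn run_plan(plan: &[Box<dyn Operator>], mut batch: Batch) -> Batch {
    for op in plan {
        batch = op.execute(batch);
    }
    batch
}

fn main() {
    let plan: Vec<Box<dyn Operator>> = vec![Box::new(Filter { min_amount: 100 })];
    let batch = Batch { user_id: vec![1, 2, 3], amount: vec![50, 150, 250] };
    let result = run_plan(&plan, batch);
    println!("kept ids {:?} with amounts {:?}", result.user_id, result.amount);
}
```

A real engine layers vectorized/SIMD execution, null handling, memory management, spilling, parallelism and many more operator types on top, but the shape is the same: the plan is data handed to a generic executor.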
Spark and Presto have big farms of workers. Velox is a replacement for the part that owns the memory where the computations happen on those workers. It comes with a handful of fast support and I/O libraries, but it can also be used by Spark and Presto. Because it is written in C++, it is probably faster to start up and has less fixed overhead (beyond the space used by the work area) than the usual runtimes. It is also probably a better bridge for DL-type workloads.
You can think of it as similar to a "Python kernel", but distributed and language-agnostic, for "big data"-type jobs.
"In common usage scenarios, Velox takes a fully optimized query plan as input and performs the described computation. Considering Velox does not provide a SQL parser, a dataframe layer, or a query optimizer, it is usually not meant to be used directly by end-users; rather, it is mostly used by developers integrating and optimizing their compute engines."
So the way you use it is that you describe some computation over your data as a query plan, and you implement a dataframe layer so Velox knows how to retrieve data from your database, and then Velox will efficiently execute the query plan? But it doesn't even optimize, so the problem it solves is that these systems like Spark and Presto don't efficiently execute optimized queries?
This world is very far removed from me; does anyone have a concrete example of how Velox might help them? Why is Velox better than both the Presto worker and the Spark engine? Aren't those core components of those systems?
Disclaimer: I work at Meta. I don't work on Velox, but my work intersects with Velox in multiple ways.
The short answer is consistent semantics. We have a large data warehouse, and several different engines that query that data warehouse. We want, as much as possible, consistent semantics across all of our engines for our users. That is, as much as possible, we want the same query to produce the same results on different surfaces. The value of that is so that users can draft a query on one surface, and use that same query elsewhere with confidence that they will get the same results. If we can consolidate on the same execution engine inside of our query engines, we can achieve that.
Minor quibble on terminology: "Velox knows how to retrieve data from your database". I would instead say that Velox knows how to retrieve data from your storage. Velox is deeply integrated into the query engine, and the combination of the query engine and the storage is "the database." In large data warehouses, we've already separated storage from compute to achieve scalability.
If this is all too abstract, think of it this way: your query engine (such as Presto) is like a full computer system, while Velox is like the processor. Processors, by themselves, are not useful. They need to be attached to a motherboard which has RAM and connections to hard drives, GPUs and other external devices. Your query engine is like the computer system that contains that motherboard and all the components connected to it. There's enormous value in having multiple computer systems with different capabilities, but using the same kind of processor: you get consistent behavior when the capabilities are the same. Velox is that processor, ready to be plugged into different query engines.
> Ultimately, this fragmentation results in systems with different feature sets and inconsistent semantics — reducing the productivity of data users that need to interact with multiple engines to finish tasks.
> In order to address these challenges and to create a stronger, more efficient data infrastructure for our own products and the world, Meta has created and open sourced Velox.
Maybe I'm missing something here, but it sounds like a lot of separate services got created that solve the same or similar problem in slightly different ways. These services became hard to use because they were fragmented. So the solution is to keep all the services and build a complex service as a middle man?
Why not unify the good parts of all the services into one central service? Then deprecate and transition off all the old fragmented ones? I understand that it's really hard to coordinate all of this and properly transition, but isn't the alternative of having to maintain many slightly different services (and now a complex middle man) more detrimental long term?
> Velox leverages numerous runtime optimizations, such as filter and conjunct reordering, key normalization for array and hash-based aggregations and joins, dynamic filter pushdown, and adaptive column prefetching.
That's a strong set of capabilities. I'm excited to see where this goes -- this could catalyze a Cambrian explosion of data systems that offload execution to Velox.
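For anyone wondering what "filter and conjunct reordering" means in practice, here is a simplified sketch of the general technique (not Velox's actual implementation, and the predicates are made up): track how often each conjunct rejects a row, and periodically reorder so the most selective conjunct runs first and short-circuits the rest.

```rust
// Sketch of adaptive conjunct reordering: evaluate AND-ed predicates in an
// order learned from observed rejection rates, so the cheapest path to
// discarding a row is taken first. Illustrative only.

struct Conjunct {
    name: &'static str,
    pred: fn(i64) -> bool,
    evaluated: u64, // rows this conjunct has seen
    rejected: u64,  // rows it filtered out
}

impl Conjunct {
    fn rejection_rate(&self) -> f64 {
        if self.evaluated == 0 { 0.0 } else { self.rejected as f64 / self.evaluated as f64 }
    }
}

fn passes_all(conjuncts: &mut [Conjunct], value: i64) -> bool {
    for c in conjuncts.iter_mut() {
        c.evaluated += 1;
        if !(c.pred)(value) {
            c.rejected += 1;
            return false; // short-circuit: later conjuncts never see this row
        }
    }
    true
}

fn main() {
    let mut conjuncts = vec![
        Conjunct { name: "x % 2 == 0", pred: |x: i64| x % 2 == 0, evaluated: 0, rejected: 0 },
        Conjunct { name: "x > 990",    pred: |x: i64| x > 990,    evaluated: 0, rejected: 0 },
    ];

    for batch in 0..10i64 {
        for x in (batch * 100)..((batch + 1) * 100) {
            passes_all(&mut conjuncts, x);
        }
        // After each batch, move the conjunct that rejects the most rows to the front.
        conjuncts.sort_by(|a, b| b.rejection_rate().partial_cmp(&a.rejection_rate()).unwrap());
    }

    for c in &conjuncts {
        println!("{}: rejection rate {:.2}", c.name, c.rejection_rate());
    }
}
```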
I see this as a continuation of the trend of middleware being rewritten in C++, Rust and Go to replace Java. It seems the common wisdom that "Java can be as fast as C" has finally been abandoned as this trend progresses (Kubernetes and other newer cloud middleware written in Go instead of Java, etc.)
> Kubernetes was originally written in Java and rewritten in Go after some Go advocates joined the team.
Just to be clear, this didn't go very far. It's more that Brendan was willing to write Java, and a lot of other folks hated it. I don't think any of the core folks were Go advocates; most of "us" were just C/C++ people. Docker itself being in Go meant it would lower a lot of friction, and besides, we had great tooling for Go.
tl;dr: It was really "Should we (re)write this in C++? Eh, how about Go?".
Maybe! I'd definitely argue for the Kubelet to be in Rust, if starting from scratch. The apiserver is less clear. But note that at Google it's unclear those would be written in Rust even today.
I broadly agree, though I think you might be under-counting the degree to which accumulated ecosystem value drives this phenomenon, especially amongst a group of languages not known for strong mutual interoperability.
Java is fucking fast; I imagine Cliff Click would like a word with anyone arguing golang outperforms a well-tuned JVM. But the language isn't aging the best, and interop with non-JVM languages is also not the best.
golang seems pretty optimized for polyglot SOA-type setups, where having a comparatively modest amount of existing code compared to, say, C++ isn't a real drawback because you're hitting that stuff over the network anyway.
Rust is probably the better language in C++ niches at this point when you don’t have a big C++ ecosystem investment and can do a more “greenfield”-type project. That can become a holy war and I’ll leave my position at: “there are good reasons to choose both”.
Containerization also helps the trend. Previously, you managed the JVM runtime separately in your cluster, and deployment was simpler (just deploy the jars). Without containerization, maintaining separate runtime dependencies for a compiled language was not as smooth. But containerization changed the equation: now you deploy runtime dependencies alongside your binary (if needed). It is actually a liability now to deploy a JVM runtime alongside your jars (actually, can you even deploy Java container images that rely on a JVM already on the machine?).
For me, it's mostly around tooling/dependency management. In Go, you install the compiler and you are done: you can create a project, add dependencies (directly from GitHub/git), and ship it as a self-standing binary.
In Java (granted, I haven't done Java for a few years), you install the compiler/VM, and that is all you have. You have to decide whether you are a Maven or a Gradle shop (and install one of those, or go bare lib/ mode), then install and configure said tools. When adding a dependency, you hope it's on Maven Central, but sometimes it's not, and random git repos are harder to try out/consume. Then eventually you build your project and end up with a jar file, but you still have to manage its dependencies: you need the VM to run it, so you need to figure that out (jpackage also needs configuring in Maven/Gradle), and you need your dependencies too, so you also need to figure that out (fat jar?).
Maybe things have gotten better in recent years (and I'm happy to hear how), but my impression is still that the amount of dependency management you need to do in Java far exceeds what you need to do in Go.
First of all, you can deploy everything together, just like people ship their whole computer with containers.
Secondly, anyone accessing Maven Central directly is doing it wrong; the repo should be internal, validated by IT and legal, so for consumers it doesn't matter how the JAR got there.
In a way it is ironic to see the whole containers/WebAssembly ecosystem redoing Java App Servers, 20 years later.
The JVM trades memory for all kinds of optimizations, so it's less about raw performance and more about memory allocation tricks (and related FFI issues). This is where golang has an edge over it at the moment, assuming you can tolerate its primitive abstractions.
Once Loom&Co get merged, the JVM is going to be unrecognizable. It's also very possible that one of those Kotlin-native/GraalVM AOT projects goes mainstream. But right now it's a problem for high-performance systems.
Personally I regret this trend because Java/Golang are much easier for people like me than C++/Rust. I cannot imagine myself going back to manually managing memory. Browsing Big Data open source code will not be as educational anymore.
You don't do manual memory management in Rust either. AFAIK (correct me if I'm wrong), even in C++ manual memory management is discouraged in favor of RAII.
You surely do. First of all, the borrow checker is only a compile-time validation that you are writing the manual code correctly, and no one is magically writing Drop trait implementations for the user.
Likewise on the C++ side, someone has to write those constructor/destructor pairs, and there are ways to get RAII wrong.
Yes, you have to think about ownership, because of the single-owner model. But in general, not about alloc/free. I work in Rust full time (after doing C++ for the 10ish years prior), and a couple of nights ago was honestly the first time I had to really think about this: I had to do a 'forget' because I was passing a pointer to a vector back from WebAssembly to our Rust-based runtime, and then clean it up over there instead of having Rust free it when it fell out of scope at the end of the stack frame.
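For readers who haven't run into this, the pattern described above looks roughly like the following sketch (the function names and the FFI boundary are invented for illustration): `mem::forget` stops Rust from freeing the buffer when the local `Vec` goes out of scope, and the other side later rebuilds the `Vec` from its raw parts so it is freed exactly once.

```rust
use std::mem;

// Hand ownership of a Vec's buffer across an FFI-style boundary: leak it
// deliberately with mem::forget so Rust does not free it when `v` goes out
// of scope here.
fn give_away(mut v: Vec<u8>) -> (*mut u8, usize, usize) {
    let ptr = v.as_mut_ptr();
    let len = v.len();
    let cap = v.capacity();
    mem::forget(v); // without this, the buffer would be freed right here
    (ptr, len, cap)
}

// Later, the receiving side rebuilds the Vec from its raw parts; dropping it
// there frees the buffer exactly once.
fn take_back(ptr: *mut u8, len: usize, cap: usize) {
    // Safety: ptr/len/cap must come from a Vec<u8> that was forgotten, not freed.
    let v = unsafe { Vec::from_raw_parts(ptr, len, cap) };
    drop(v);
}

fn main() {
    let (ptr, len, cap) = give_away(vec![1, 2, 3]);
    // ... the raw parts cross the boundary (WebAssembly, C, etc.) ...
    take_back(ptr, len, cap);
}
```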
Think of Rust as a world where pretty much everything is inside a std::unique_ptr.
Most developers used to a GC language will have little problem working in Rust once they understand the single ownership model.
Yes, GUI work is a pain because of the way event loops and ownership in existing UI toolkits work; they're generally not designed for this. But Arc<Mutex<...>> is likely your friend here.
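A minimal sketch of that `Arc<Mutex<...>>` pattern, using plain threads to stand in for UI handlers (the click counter and the thread count are invented for illustration):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Shared, mutable state with no single owner: each handler gets its own clone
// of the Arc and locks the Mutex to mutate the value.
fn main() {
    let clicks = Arc::new(Mutex::new(0u32));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let clicks = Arc::clone(&clicks);
            thread::spawn(move || {
                // stand-in for e.g. a button-click handler updating shared state
                *clicks.lock().unwrap() += 1;
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    println!("total clicks: {}", *clicks.lock().unwrap());
}
```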
Async can be a pain, but you learn the ways. I work in a codebase with quite a bit of it.
There are appropriate and inappropriate places to apply Rust.
> no one is magically writing Drop traits implementations for the user.
Yes, the compiler is, for almost all structs. I've been working with Rust since around 1.0 and I can count on my hands the number of times I've had to manually write a Drop implementation. Unless you are writing lower-level parts of the stack (which you rarely need to, since for many of those there are good crates already available) where you are responsible for resources that need a custom Drop implementation, the automatically generated Drop is good enough.
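To illustrate with a small, hypothetical example: the first struct below needs no `Drop` at all, because the compiler-generated drop glue already closes the file and frees the buffer; a hand-written `Drop` only appears when you hold a resource the compiler can't see (the `RawHandle` here is made up).

```rust
use std::fs::File;

// No Drop impl needed: when a Session goes out of scope, the generated drop
// glue drops each field, which closes the file and frees the Vec.
struct Session {
    log: File,
    buffer: Vec<u8>,
}

// A custom Drop only shows up for resources the compiler cannot see, such as
// a raw handle obtained from a C library (made up here).
struct RawHandle {
    fd: i32, // hypothetical handle returned by an FFI call
}

impl Drop for RawHandle {
    fn drop(&mut self) {
        // In real code this would call the C library's release function.
        println!("releasing handle {}", self.fd);
    }
}

fn main() {
    let _session = Session {
        log: File::create("session.log").expect("create log file"),
        buffer: Vec::new(),
    };
    let _handle = RawHandle { fd: 42 };
    // Both are cleaned up automatically at the end of main, in reverse
    // declaration order.
}
```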
It looks to me like the kind of library (or libraries) that would normally be an internal component of a database or data analytics tool: data containers/structures, data manipulation operators, and facilities for moving them around.
What this is is part of a nice trend of open-sourcing some of the fundamental R&D that is happening inside the BigCorps.
Also, this kind of thing (and, indeed, middleware as well) has been done in C++ by default inside Google (and probably Facebook as well) since forever.
I think it's a little more nuanced than that. Java was fast enough that, combined with its security and safety guarantees, it was worth it in the total calculation. But Go and Rust are both faster, with safety guarantees at least as good if not better, so you are seeing a move away. Language choice is never wholly about speed; Java filled a niche of being fast enough, with certain guarantees and affordances that allowed it to fill certain roles.
C++ has gained a lot of affordances to improve safety as well, as long as you can enforce their usage in the product, so it's starting to eat into Java's market share too. But I think that, long term, Rust and Go will absorb more of what you would have done in Java than C++ will.
Sounds like Beam is something you can use on top: Beam is more of a query planner that can translate its plans to several other engines, while Velox executes those plans.
Is this similar to Arrow DataFusion, but in C++? TBH I think every hot new dataframe or analytics DB has such components; the basic idea is not too different from the textbook at first glance.
So this is an Apache Arrow-based database engine that gets integrated into other databases? My main takeaway is that it's great to see more projects standardizing on Arrow and pushing it further down the stack.
Just a clarification about "fb presto": there isn't an "fb presto"; it's Linux Foundation Presto, which has its own unique innovations and is used reliably at all scales, including at places like Meta, Uber, Alibaba and, more recently, ByteDance. Linux Foundation Presto isn't controlled by one vendor; it has 10 member orgs overseeing the project, following Linux project practices and doing community outreach like PrestoCon. See for yourself at prestodb.io. (Disclaimer: I'm the Ahana CEO, providing a cloud managed service for Presto-based SQL lakehouses, and a member of the Linux Foundation's Presto Foundation.)
Airflow has always been focused on workflows, and its tasks don't have to query data in an external system. You could have code running in Airflow that runs a query that is executed by Velox.