Velox: An open-source unified execution engine (fb.com)
151 points by polyrand on Sept 1, 2022 | 52 comments



* [...] unified execution engine

* accelerating data management systems

* [...] streamlining their development

* [...] consolidate and unify data management systems

Can someone translate this to English? I can see and recognize the individual meanings of the words, but I don't understand what they're trying to say.


The section titled "An Overview of Velox" gets into the meat of it: you give it some data and an optimised plan of the operators you want applied (expression evaluation, aggregation, etc.), and Velox then executes that plan as efficiently as possible given the available compute resources.

That way multiple top-level systems like analytics databases, dataframe implementations, etc. can all share the same underlying execution engine.


So it's like an operating system for cloud workloads?


No, it's closer to the core internals of an OLAP database. The "operators" in question are things like filter, join, aggregate, group by, and projection (select). It makes sure to use available resources like SIMD and multithreading to do that efficiently.

If you built a SQL parser -- and also the glue to create query plans from it -- you could attach it to Velox to run all of that against some data source, for example. But you'd still need a storage layer (disk, S3) and also some kind of higher coordination layer if you wanted to use multiple machines for a complete database. The query execution engine is a critical component, however.


spark and presto have big farms of workers. it's a replacement for the part that owns the memory where the computations happen on the workers. it ships a handful of fast support libraries and i/o libraries and can also be used by spark and presto. it is probably faster to start up and has less fixed overhead (beyond the space used by the work area) than the usual runtimes because it is written in c++. it is also probably a better bridge for dl-type workloads.

you can think of it as similar to a "python kernel" but distributed and language agnostic for "big data" type jobs.


> Meta’s infrastructure plays an important role in supporting our products and services.


lol, that's basically the definition of infrastructure


"In common usage scenarios, Velox takes a fully optimized query plan as input and performs the described computation. Considering Velox does not provide a SQL parser, a dataframe layer, or a query optimizer, it is usually not meant to be used directly by end-users; rather, it is mostly used by developers integrating and optimizing their compute engines."

So the way you use it is that you describe some computation over your data as a query plan, you implement a dataframe layer so Velox knows how to retrieve data from your database, and then Velox efficiently executes the query plan? But it doesn't even optimize, so the problem it solves is that systems like Spark and Presto don't efficiently execute optimized queries?

This world is very far removed from me; does anyone have a concrete example of how Velox might help them? Why is Velox better than both the Presto worker and the Spark engine? Aren't those core components of those systems?


Disclaimer: I work at Meta. I don't work on Velox, but my work intersects with Velox in multiple ways.

The short answer is consistent semantics. We have a large data warehouse and several different engines that query it. We want, as much as possible, consistent semantics across all of our engines for our users; that is, we want the same query to produce the same results on different surfaces. The value is that users can draft a query on one surface and use that same query elsewhere with confidence that they will get the same results. If we can consolidate on the same execution engine inside our query engines, we can achieve that.

Minor quibble on terminology: "Velox knows how to retrieve data from your database". I would instead say that Velox knows how to retrieve data from your storage. Velox is deeply integrated into the query engine, and the combination of the query engine and the storage is "the database." In large data warehouses, we've already separated storage from compute to achieve scalability.

If this is all too abstract, think of it this way: your query engine (such as Presto) is like a full computer system, while Velox is like the processor. Processors, by themselves, are not useful. They need to be attached to a motherboard which has RAM and connections to hard drives, GPUs and other external devices. Your query engine is like the computer system that contains that motherboard and all the components connected to it. There's enormous value in having multiple computer systems with different capabilities but the same kind of processor: you get consistent behavior wherever the capabilities overlap. Velox is that processor, ready to be plugged into different query engines.


> Why is Velox better than both Presto worker and Spark engine

For a start, Spark doesn't support AVX2/AVX512.

I believe the JVM still hasn't finalised the Vector API, and Spark would then need to be updated to take advantage of it.


Anyone know how this compares to Photon by Databricks? That's probably the benchmark + architecture comparison I'd like to see…

https://www.databricks.com/product/photon


> Ultimately, this fragmentation results in systems with different feature sets and inconsistent semantics — reducing the productivity of data users that need to interact with multiple engines to finish tasks.

> In order to address these challenges and to create a stronger, more efficient data infrastructure for our own products and the world, Meta has created and open sourced Velox.

Maybe I'm missing something here, but it sounds like a lot of separate services got created that solve the same or similar problems in slightly different ways. These services became hard to use because they were fragmented. So the solution is to keep all the services and build a complex service as a middleman?

Why not unify the good parts of all the services into one central service, then deprecate and transition off all the old fragmented ones? I understand that it's really hard to coordinate all of this and properly transition, but isn't the alternative of maintaining many slightly different services (and now a complex middleman) more detrimental long term?


>Why not unify the good parts of all the services into one central service?

https://xkcd.com/927/


> Velox leverages numerous runtime optimizations, such as filter and conjunct reordering, key normalization for array and hash-based aggregations and joins, dynamic filter pushdown, and adaptive column prefetching.

That's a strong set of capabilities. I'm excited to see where this goes -- this could catalyze a Cambrian explosion of data systems that offload execution to Velox.


I see this as a continuation of middleware being rewritten in C++, Rust and Go to replace Java. It seems the common wisdom that "Java can be as fast as C" has finally been abandoned as this trend progresses (Kubernetes and other newer cloud middleware written in Go instead of Java, etc.)


Apparently you missed the part where it is integrated into Java libraries via JNI.

Kubernetes was originally written in Java and rewritten in Go after some Go advocates joined the team.

Docker was originally written in Python and rewritten in Go after some Go advocates joined the team.

Don't conflate technical capabilities with people wanting to refresh their CVs.


> Kubernetes was originally written in Java and rewritten in Go after some Go advocates joined the team.

Just to be clear, this didn't go very far. It's more that Brendan was willing to write Java and a lot of other folks hated it. I don't think any of the core folks were Go advocates; most of "us" were just C/C++ people. Docker itself being in Go meant it would lower a lot of friction, and besides, we had great tooling for Go.

tl;dr: It was really "Should we (re)write this in C++? Eh, how about Go?".


The history told at FOSDEM was a bit different, if I remember correctly.

And to come back to my point, if Docker or Kubernetes were invented today, I bet they would be using Rust or Zig instead of Go.


Maybe! I'd definitely argue for the Kubelet to be in Rust, if starting from scratch. The apiserver is less clear. But note that at Google it's unclear those would be written in Rust even today.


I broadly agree, though I think you might be under-counting the degree to which accumulated ecosystem value drives this phenomenon, especially amongst a group of languages not known for strong mutual interoperability.

Java is fucking fast; I imagine Cliff Click would like a word with anyone arguing golang outperforms a well-tuned JVM. But the language isn't aging the best, and interop with non-JVM languages is also not the best.

golang seems pretty optimized for polyglot SOA-type setups, where a comparatively modest amount of existing code (compared to, say, C++) isn't a real drawback because you're hitting that stuff over the network anyway.

Rust is probably the better language in C++'s niches at this point, when you don't have a big C++ ecosystem investment and can do a more "greenfield"-type project. That can become a holy war, so I'll leave my position at: "there are good reasons to choose both".


Containerization also helps the trend. Previously, you managed the JVM runtime separately in your cluster, which made deployment simpler (just deploy the jars). Without containerization, maintaining separate runtime dependencies for a compiled language is not as smooth. But containerization changed the equation: now you deploy runtime dependencies alongside your binary (if needed). It is actually a liability to ship a JVM runtime alongside your jars now (actually, can you even deploy Java container images that rely on a JVM on the host machine?)


For me, it's mostly about tooling/dependency management. In Go, you install the compiler and you are done: you can create a project, add dependencies (directly from GitHub/git), and ship it as a self-standing binary.

In Java (granted, I haven't done Java for a few years), you install the compiler/VM, and that is all you have. You have to decide if you are a Maven or a Gradle shop (and install these, or go bare lib/ mode), then install and configure said tools. When adding a dependency, you hope it's on Maven Central, but sometimes it's not, so random git repos are harder to try out/consume. Then eventually you build your project and end up with a jar file, but you still have to manage its dependencies. You need the VM to run it, so you need to figure that out (jpackage also needs configuring in Maven/Gradle). You need your dependencies too, so you also need to figure that out (fat jar?).

Maybe things have gotten better in recent years (and I'm happy to hear how), but my impression is still that the amount of dependency management you need to do in Java far exceeds what you need to do in Go.


First of all you can deploy everything together, just like people ship their whole computers with containers.

Secondly, anyone accessing Maven Central directly is doing it wrong; the repo should be internal, validated by IT and legal, so for consumers it doesn't matter how the JAR got there.

In a way it is ironic to see the whole containers/WebAssembly ecosystem redoing Java App Servers, 20 years later.


The JVM trades memory for all kinds of optimizations, so it's less about raw performance and more about memory allocation tricks (and related FFI issues). This is where golang currently has an edge over it, assuming you can tolerate its primitive abstractions.

Once Loom&Co get merged, the JVM is going to be unrecognizable. It's also very possible that one of those Kotlin-native/GraalVM AOT projects goes mainstream. But right now it's a problem for high-performance systems.

Personally, I regret this trend because Java/Golang are much easier for people like me than C++/Rust. I cannot imagine going back to manually managing memory, and browsing Big Data open source code will not be as educational anymore.


You don't do manual memory management in Rust either. AFAIK (correct me if I'm wrong), even in C++ manual memory management is discouraged in favor of RAII.


You surely do: first of all, the borrow checker is only compiler validation that you are writing the manual code correctly, and no one is magically writing Drop trait implementations for the user.

Likewise on the C++ side, someone has to write those constructor/destructor pairs, and there are ways to get RAII wrong.


Seriously, no, re: Rust.

Yes, you have to think about ownership, because of single ownership. But in general, not alloc/free. I work in Rust full time (after doing C++ for the 10ish years prior), and a couple of nights ago was honestly the first time I had to really think about this: I had to do a 'forget' because I was passing a pointer to a vector from WebAssembly back to our Rust-based runtime, and cleaning it up there instead of having Rust free it when the stack frame exited.

Think of Rust as a world where pretty much everything is inside a std::unique_ptr.

Most developers used to a GC language will have little problem working in Rust once they understand the single ownership model.


Try to write a native GUI application in Rust, or async code, and you will see how little you have to think about it.


Yes, GUI work is a pain because of the way event loops and ownership in existing UI toolkits work; they're generally not designed for this. But Arc<Mutex<...>> is likely your friend here.

Async can be a pain, but you learn the ways. I work in a codebase with quite a bit of it.

There are appropriate and inappropriate places to apply Rust.


It surely is my friend, and I will need to manually call clone() and borrow().


> no one is magically writing Drop traits implementations for the user.

Yes, the compiler is, for almost all structs. I've been working with Rust since around 1.0, and I can count on my hands the number of times I had to manually write a Drop implementation. Unless you are writing lower-level parts of the stack (which you rarely need to, since good crates are already available for many of those) where you are responsible for resources that need a custom Drop implementation, the automatic Drop is good enough.


Trivial types don't count; their Drop implementation basically does nothing.

And for the Rust standard library, where Drop actually does something, someone else wrote the implementation for you.


Maybe, except this isn't "middleware?"

It looks to me like the kind of librar(ies) that would normally be an internal part of a database or data analytics tool: data containers/structures, data manipulation operators, and facilities for moving them around.

What this is is part of a nice trend to open source some of the fundamental R&D happening inside the BigCorps.

Also, this kind of thing (and, indeed middleware as well) has been done in C++ by default inside Google (and probably Facebook as well) since forever.


I think it's a little more nuanced than that. Java was fast enough, combined with security and safety guarantees, to make it worth it overall. But Go and Rust are both faster, with safety guarantees at least as good if not better, so you are seeing a move away. Language choice is never wholly about speed; Java filled a niche of being fast enough, with certain guarantees and affordances that let it fill certain roles.

C++ has gained a lot of affordances to improve safety as well, as long as you can enforce their usage in the product, so it's starting to eat into Java's market share too. But I think that, long term, Rust and Go will absorb more of what you would have done in Java than C++ will.


It sounds very similar to Apache Beam. You can actually create runners for various data management systems [1]

[1] https://beam.apache.org/documentation/runners/


Sounds like Beam is something you could use on top: Beam is more of a query planner that can translate its plans to several other engines, while Velox executes those plans.



Interesting to see how Databricks reacts to this given that they have their own Project Lightspeed (replacing the Spark execution engine).


Is this similar to Arrow DataFusion, but in C++? Tbh, I think every hot new dataframe or analytics DB has such components; the basic idea is not too different from the textbook at first glance.


This seems analogous to LLVM. Looks like we could (finally) build various front ends for analytics on top of this?


So this is an Apache Arrow database engine integrated into other databases? My main takeaway is that it's great to see more projects standardizing on Arrow and pushing it further down the stack.


Cool that it’s being integrated into presto.


Note that it's "fb presto" rather than the more popular and widely developed Trino fork.


Just a clarification about "fb presto": there isn't an "fb presto". It's Linux Foundation Presto, which has its own unique innovations and is used reliably at all scales, including places like Meta, Uber, Alibaba, and more recently ByteDance. Linux Foundation Presto isn't controlled by one vendor; it has 10 member orgs overseeing the project, following Linux Foundation project practices and doing community outreach like PrestoCon. See for yourself @ prestodb.io. (Disclaimer: I'm the Ahana CEO, providing a cloud managed service for Presto-based SQL lakehouses, and a member of the Linux Foundation's Presto Foundation.)


Is it the same as YARN?


I think it's more like Databricks' Photon: a rewrite of the execution engine that can be plugged into existing Spark deployments.


[flagged]


Please don't do this here.


Ok! Point taken. Sorry!


If that's the case, the domain is short for "Fine Bodies" presumably.


it's airflow with more specifics around transportable data structures? instead of junky xcoms?


Airflow has always been focused on workflows, and its tasks don't have to query data in an external system. You could have code running in Airflow that runs a query that is executed by Velox.


could this be a name conflict with this https://www.thermofisher.com/order/catalog/product/VELOX ?



