> the Query Graph Model (QGM) representation is quite abstract and hardcodes many properties, making it exceptionally difficult to understand. Its claimed extensibility is also questionable.
Agree, substrait is a really cool project! Related: if you like substrait you might want to check out datafusion too. The project is a query execution engine built on top of Apache Arrow (incl. SQL parser, query planner & optimizer, execution engine, extensible user defined functions, among others) and it implements a substrait provider and consumer: https://github.com/apache/arrow-datafusion/tree/main/datafus...
In general I'm not interested in analytics/OLAP, but in more OLTP type workloads, so the Arrow stuff and that whole world is of not much interest to me.
After a quick look, I'm not sure if I would call this “industrial strength”. In particular, the join optimizer (typically the heart of a large-scale SQL optimizer) looks very rudimentary? And the statistics it uses have zero idea about correlation, no histograms beyond min/max…
I was wondering about the same claim. However, I believe that JOIN's are a common weakness among OLAP database engines, and DataFusion is built on top of a columnar storage format - Apache Arrow.
Getting the IR correct so that it's both easy to use and flexible enough to be useful is a really interesting design challenge. Our primary abstraction in the query plan is called a Node, and is way more general than the IR type described in the article from OP. This has probably hurt us: we only recently separated the responsibility to fetch rows into its own part of the runtime, out of the IR -- originally row fetching was coupled to the Node type directly.
First installment is mostly introductory but gets more interesting near the end, and the rest of the series sounds promising. I'm looking forward to it.
I also recommend the sqlite.org docs "overview of the optimizer" and "vbdb bytecode" which you could see as an IR.
Thanks for your comment. I'm the author of this post.
I have been thinking about sharing something about database internals for a long time, but none of my writing ideas was out of cliché then(e.g. introduce some algorithms, summarize some papers).
But I just realized that it maybe an interesting thing to talk about "how" and "why" instead of "what".
The rest posts are coming soon, hope you can see them on hacker news again.
An honest question, since you're there: Why do you consider MySQL 8.0's optimizer to be based on relational algebra? It's true that the new executor introduced in 8.0 is Volcano-style, but the optimizer is pretty much ad-hoc as I see it.
I agreed with you on the ad-hoc part. The thing is Oracle guys have made significant improvement in MySQL 8.0, including the Volcano-style executor.
They also implemented DPhyp join reorder algorithm but with an ad-hoc IR. The join optimizer IR(they call it RelationalExpression) is based on relational algebra. I think this is a good start.
But it’s not easy to migrate a project lived for decades, especially one with poor design.
Yeah, I know, I wrote the new executor and the hypergraph join optimizer (the one based on DPhyp), that's why I wondered :-)
It's true that if you are using the hypergraph optimizer, you will get a rewrite from the array of tables into RelationalExpression. But I find it hard to call that relational algebra; in particular, it only supports joins and tables as operations. Filters are pushed down ad-hoc, and things like grouping or windowing operators are simply not representable in this structure at all. Columns are not dealt with at all either (projection is unavailable).
And perhaps more importantly; RelationalExpression is hardly used. Most of the optimizer works on the old array-of-tables structure, then it briefly becomes RelationalExpression for condition pushdown, then the hypergraph is created and RelationalExpression is never to be seen again. The entire hypergraph optimizer works by inducing subgraphs of a hypergraph; it does not use relational algebra.
Also, notably, MySQL 8.0 does not actually _use_ the hypergraph optimizer by default. You need to explicitly compile it in (it's off in release builds), and then enable it using an optimizer switch. So unless you go to fairly great lengths to enable it yourself, RelationalExpression and friends is never used.
I agree that using a relational algebra IR would be a good idea; it's a better structure than what's in there right now (which comes all the way from MySQL 3.x, and is extremely unflexible to work with). It's just that I don't think MySQL 8.0 does it. :-)
(I obviously don't speak for Oracle, not the least because I haven't worked there in a while)
If you're into languages and compilers and PL theory, you might really enjoy the "Alice" book (Foundations of Databases): http://webdam.inria.fr/Alice/
The title format "What We Talk About When We Talk About X," which today perhaps would be called a meme started with a famous short story by Raymond Carver, part of his short story collection published in 1981 with the same title (https://en.wikipedia.org/wiki/What_We_Talk_About_When_We_Tal...). It's a great story, strongly recommend it.
This format is now very widely used in blogs and other pieces, almost to the point of being overdone. TIL from (https://lithub.com/what-we-talk-about-when-we-talk-about-thi...) that Murakami "asked Tess Gallagher, Carver’s widow, for permission to use the title form." for his memoir published in 2009. He's probably an important factor in the resurgence of the format.
Also worth mentioning that the title was made up by Carver's editor Gordon Lish, who significantly altered the story (originally called Beginners). Lish made relatively extensive changes to many of Carver's stories, and is widely accepted as having improved them, though not everyone agrees.
Great point! Lish was a tight controller who was universally liked, to say the least. He’s generally insufferable (judge for yourself don’t this PM interview: https://www.theparisreview.org/interviews/6423/the-art-of-ed....) A sexier version of Maxwell Perkins with better hair, perhaps? Yet he did teach writing and edited well.
Way, way past the point of being overdone! And I'm sure most of the people who keep overdoing it are referencing the Murakami book, rather than the great Carver short story.
Hmm, no credit to Haruki Murakami, who wrote the memoir of the title you yanked “What I Talk About When I Talk About Running”?
Murakami credited Raymond Carver's collection of short stories called “What We Talk About When We Talk About Love” for inspiring his title. Might be nice for you to include a small mention.
It's pretty standard to not cite cute references like this. Nobody puts Dr Strangelove in their blog bibliography when they do a "How I stopped worrying and learned to love X" post.
This is a good point. Maybe I missed the mark here. I'm not a writer, just a reader/appreciator so if my advice is bad from a writer's perspective then I hope the author ignores it altogether.
Thanks for reminding, I will update the post later. By the way, the storytelling of Haruki Murakami influenced my a lot, I should have mentioned it when the title first occured to me.
You can of course do what you want in your own writing but the GP's suggestion that you need to 'credit' or explain every allusion is not good writing advice - it would mean not having allusions at all.
I enjoyed the article and learned some things reading it. Re-reading my comment, it could have been worded better so I'm sorry about that. As I mentioned in another comment here, I'm not a writer, just an admirer of Murakami's, so if this suggestion contradicts what is common or best practice for authors I hope you'll ignore it.
Things like that are for readers to enjoy when they notice them, as you did, and it's fun to share the noticing with other people. You can just write something along the lines of 'Enjoyed the piece, was your title a reference to X? X was inspired by Y!'. That way don't accidentally come across as a messageboard nitpicker but as the erudite reader that you are.
I don't know much about the context, but it was interesting to note that Materialize scrapped their QGM code last year: https://github.com/MaterializeInc/materialize/pull/17139
Also, a couple of interesting projects in the IR space:
- https://substrait.io/ is a cross-language serialization for Relational Algebra
- https://www.lingo-db.com/ is an MLIR-based (LLVM) query engine described extensively in this paper https://db.in.tum.de/~jungmair/papers/p2485-jungmair.pdf?lan...