
OctoSQL v0.3.0: Streaming, Temporal SQL on Event Streams in Go - cube2222
https://github.com/cube2222/octosql/releases/tag/v0.3.0
======
cube2222
Hey everybody, one of the authors here!

For those unfamiliar with OctoSQL, it’s a local database/file query tool which
lets you use SQL to query multiple databases and file formats simultaneously.

This is a very big release that we’ve been working on for almost a year as
our bachelor thesis. It’s an almost-total rewrite from the ground up. It adds
support for streams of events, event-time semantics, watermarks and multiple
temporal SQL extensions, such as triggers, inspired by last year’s “One SQL To
Rule Them All” paper. It also adds durability, which means you can kill and
restart the OctoSQL process with no record duplication; it will pick up where
it left off.

You can see the features for yourself. Below, however, I’ll describe some of
what we learned and what we’d do differently. We’re an open source project, so
no bullshit marketing here; I think such “lessons learned” are very valuable
overall.

First and foremost, I think we hadn’t really decided what we wanted OctoSQL to
be. Thus it evolved into a mutation of two ideas. The first is a stream
processing system like Flink or Spark. The second is a local developer tool
that’s quick, friendly, simple to use and very flexible, much akin to jq, jql
or textql. I don’t think building such a mutation well is within our reach
currently, so OctoSQL will have to choose one going forward.

Now, if we go the stream-processing-system path, then OctoSQL’s architecture,
similar to the Volcano query engine, is much slower than if it used a
column-oriented, micro-batch internal representation. If we were building such
a system today, we’d probably orient it around Apache Arrow, like DataFusion
did.

We also stuck with our decision to connect to the actual databases only in the
execution phase, which made many schema-based optimisations impossible.

However, if we go the local-tool path, then everything we did around
durability, building everything on top of the local transactional Badger
key-value store, providing restartability and exactly-once semantics in the
face of restarts, isn’t worth much, and it has a hugely negative impact on
performance (10x+), which we clawed back a bit by parallelizing everything.
Though it is quite magical when a restarted process picks up where it left
off.

Where next? Maybe we’ll scrap all the transactional storage stuff and keep
OctoSQL a handy single-machine tool to query many databases and explore data,
instead of trying to be something it won’t ever really manage to become. The
roadmap is up to date too, with items like ergonomic JSON manipulation still
to do. The lack of columnar processing is a drawback, but a boon too: we can
try to be as ergonomic and flexible as possible. Is the restartability feature
useful to you? What do you think? We’d really love to hear your feedback on
all the aforementioned issues, and how they’d impact your usage of OctoSQL!
Please comment!

Regarding development speed and releases: as mentioned, we did all this as our
bachelor thesis (it’s been pretty intense combined with university and a job),
so work has slowed down recently as we regain our energy. But we’re slowly
feeling motivated once again, so hopefully the project will pick up momentum
now.

Anyways, it's been a great ride so far, hope some of you will find it useful!

Our bachelor thesis is also publicly available, it contains a more detailed
description of implementation details, tradeoffs and comparisons to other
projects:
https://github.com/cube2222/octosql/raw/master/whitepapers/thesis.pdf

BTW, if you like what OctoSQL is doing and would like to help speed up
development (with the vision I’ve described above), feel free to sponsor us on
GitHub; we’ll appreciate it and be able to spend more time on the project.

