OctoSQL allows you to join data from different sources using SQL (github.com/cube2222)
114 points by BerislavLopac on July 16, 2022 | 54 comments



What I really want to see is a general-purpose query engine. This has some of what I'm after, but I want to see features like:

- Very easy to model HTTP APIs as a table.

- Easy to model basically anything as a table for example files on my filesystem.

- Optional caching to make development quicker (don't need to hit external services for every iteration).

- The ability to build up my own personal library of "tables" or "graphs" over time in an incremental way.

- A decent query planner so that I can avoid expensive things (like API calls) if I can determine if I need the object based on something cheaper (like a local disk access).

- Ideally a language more compostable than SQL so that I can define common filters and views that can be shared across queries.

I think the main fundamental difference is that this wants all of the data upfront in a data file. I want something that is easy to extend to sources that are possibly non-listable, or where, at the very least, I don't need to have all of the data available.
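To make that concrete, here's the kind of query I'd like to be able to write. It's purely hypothetical: the `fs.files` and `ci.builds` tables don't exist in any tool I know of, they just stand in for "my filesystem" and "some HTTP API":

    SELECT f.path, b.status
    FROM fs.files f                                  -- local, cheap to scan
    JOIN ci.builds b ON b.artifact_sha = f.sha256    -- backed by an HTTP API
    WHERE f.modified > now() - INTERVAL '7' DAY      -- cheap local predicate first
      AND b.status = 'failed'                        -- only then hit the API, ideally cached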

IIUC this is possible with something like a Postgres or Sqlite foreign data wrapper but IIUC the optimization isn't always great and they are difficult to write.


I'm building out Trustfall[0] with most of the same goals in mind: query everything with a single composable syntax and fast performance, incrementally evolve the schema, and build up the types of objects it can understand and query into your own personal (or shared!) graph.

I recently gave a conference talk on how we use queries over Dockerfiles + project configuration + deployment info to triage bugs and prevent them from coming back at my workplace. An example query is "find Python projects whose pyproject.toml specifies Python 3.9 but which get deployed with a Dockerfile that installs Python 3.7 instead." Here's the talk link: https://www.hytradboi.com/2022/how-to-query-almost-everythin...

If you'd like to see Trustfall in action, I recently used it to build a Rust semver-checking tool[1] that's just a bunch of Trustfall queries like "find structs that used to be public but now aren't public anymore"[2] etc.

Trustfall definitely doesn't yet do everything I'd like it to (e.g. caching), and the Adapter API needs some more polish to improve ergonomics. But if you're interested in trying it out despite those caveats, I'd love to hear what you think!

[0]: https://github.com/obi1kenobi/trustfall

[1]: https://crates.io/crates/cargo-semver-checks

[2]: https://github.com/obi1kenobi/cargo-semver-check/blob/main/s... -- the query is in GraphQL syntax, you can copy-paste it into an editor to get syntax highlighting


You should look at Apache Calcite[0]. Like OctoSQL, you can join data from different data sources. It's also relatively easy to add your own data sources ("adapters" in Calcite lingo) and rules to efficiently query those sources. Calcite already has adapters that do things like read from HTML tables over HTTP, files on your file system, running processes, etc. This is in addition to connecting to a bunch of traditional databases including SQL databases, MongoDB, Redis, and several others.

Disclaimer: I am a member of the PMC, but unfortunately not particularly active atm.

[0] https://calcite.apache.org/


That looks pretty cool. I might reach for that next time.

I was hoping for something where sources were more "on the fly" though. Like where I could run an arbitrary command to make a row. It seems like making an "adapter" is a bit more of an investment here.


Exposing a "row" is not enough for the query system though; it has to be able to do filtering and joining. If it has to scan every row for any operation, you might as well have an endpoint that returns every row... and then you're back at the currently supported workflow.


That's what the complexity is.

Some data sources will be expensive and can't be filtered, so you want to avoid looking up rows as much as possible (do the filtering using other predicates if possible, do the lookup only if required). Some data sources will support efficient filtering with pushed-down predicates.

Ideally when "asking questions" I don't need to worry about it. The optimizer can use the info provided to do something sensible.


How can the info be provided if you don't write a full adapter? I feel like I'm misunderstanding what you were asking for.


I guess you do end up with a "full adapter", but you can do it piecemeal. For most of my use cases my custom tables don't need to be highly optimal, I can just give a rough expectation of cost, and if I don't provide a batch get then obviously you can't do push-down of predicates.

Some metadata can be provided by the definition, and maybe some can be learned on-the-fly (like latency and data distribution?)
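In Postgres-FDW-flavored pseudo-DDL (purely illustrative; these option names don't come from any real wrapper), the "rough expectation of cost" idea might look something like:

    -- hypothetical registration: declare the shape plus rough cost hints;
    -- no batch-get is declared, so the planner knows it can't push predicates down
    CREATE FOREIGN TABLE github_issues (
      id     bigint,
      title  text,
      state  text
    ) SERVER github_api
      OPTIONS (estimated_cost 'high', supports_pushdown 'false');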


I designed Trustfall[0] out of frustration with existing systems that either:

- didn't allow optimizations like batch get and predicate pushdown even when I knew they'd help, or

- made it really difficult to add them piecemeal after-the-fact on an existing schema, by either demanding everything be batched and optimized, or that nothing be batched and optimized.

In my experience, it's usually a small component of the schema that is causing most of the runtime cost, and it's not always predictable ahead of time where it's going to be. It's dependent on the queries and the data size and shape, so it changes over time.

So I wanted to be able to write a working schema and adapter quickly, and then optimize the parts that were slow in practice for my workload and not just ones that "hypothetically might be slow," and this is what Trustfall lets me do.

More info in my reply on an ancestor comment in the thread.

[0]: https://github.com/obi1kenobi/trustfall


FWIW, you can do this with Calcite. All you really need to write an adapter is something which gives all the data in a table. If that isn't efficient enough, you can write some rules for the planner which allow operators to be pushed down.


I'm curious what you mean by "arbitrary command to make a row." Could you give more of an example of a potential use case?


The point is that it is very flexible. For example I could compute a hash of an input string, run a virus scan on a file or whatever else I wanted.

In one concrete example I was trying to explore some things about Super Mario Bros so the command would be to run the emulator based on the input and output some information about the game state (character location, speed, level...)


I'm still not totally following your example. If it helps, it wouldn't be too challenging to make an adapter which pulls data from arbitrary shell commands.


Echoing this, you have essentially described Calcite.

The community is incredible, too.


We've done something like this at Hasura.

It allows users to write adapters for datasources in any language, expose them over an HTTP API, and then combine them together into a federated GraphQL API:

(Disclaimer: I work on this feature)

https://techcrunch.com/2022/06/28/hasura-now-lets-developers...

https://hasura.io/blog/hasura-graphql-data-connectors/


Based on your "model everything as a table" sensibility, you might enjoy https://www.visidata.org/. It basically treats everything (CSVs, Excel files, the filesystem, SQLite files, etc) as tables and then lets you interact with them through a uniform table manipulation interface.


Yeah, something like a client side only GraphQL would be nice. Define your service/object relations and then query remote (REST or otherwise) services as a connected graph. Data is woven together client side but the benefit is services are agnostic to it and you could weave together many public services.


Trustfall (see sibling comment: https://news.ycombinator.com/item?id=32120007 ) is my attempt at what you're describing: GraphQL syntax with custom directives like @recurse / @optional / aggregations / arbitrary filters, executed client-side over whatever remote APIs are available (REST or otherwise).

It's by no means fully polished yet and lacks many features that'd be useful to have, but I'd love your thoughts if you decide to take a peek! It supports Rust, Python, and (extremely soon!) WASM in the browser.

https://github.com/obi1kenobi/trustfall


Yeah, something like that would be really cool. Although I think the actual computational abilities of graphql are less than I am envisioning, but I'm not too familiar. I'm thinking of SQL-like (or even more powerful) aggregations and computations.


The rest sounds great too, but I particularly like the analogy of higher levels of abstraction as compost -- using composability to enrich your queries.


Hey!

> I think the main fundamental difference is that this wants all of the data upfront in a data file.

Absolutely not! Moreover, OctoSQL can push down predicates to databases so that it only has to download a small subset of the table, if the datasource and query allow it.

> Very easy to model HTTP APIs as a table.

"Very easy" is relative, but you can take a look at the random_data[0] datasource which is exactly this. I'm also planning to add a GitHub datasource fairly soon. That said, there is Steampipe[1] for which this is the main use case afaik (hitting APIs and exposing them as tables through Postgres FDWs written in Go), so it might be a smoother and more polished experience. There are also tons of plugins already available for it.

> Easy to model basically anything as a table for example files on my filesystem.

Yep, definitely. That's the idea behind OctoSQL: to create a tool for easily exposing anything through SQL (like your machine's process list or an API) and joining that with a file or a database. There's still lots of documentation work left to do though, in order to make the plugin authoring experience easier.

> A decent query planner so that I can avoid expensive things (like API calls) if I can determine if I need the object based on something cheaper (like a local disk access).

Probably depends on the use case, and it sometimes needs you to be fairly explicit, but OctoSQL does in fact do that. It will push down predicates to underlying databases, which means joining something small with something very big (while only taking very small amounts of the latter) can be very fast with LOOKUP JOINs.
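For illustration (table and column names made up, and assuming a configured Postgres plugin), that looks something like:

    SELECT o.id, o.total, c.email
    FROM customers.csv c
    LOOKUP JOIN pg.orders o ON o.customer_id = c.id

Only the order rows matching the (small) CSV's customer ids ever get fetched from Postgres.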

> I want something that is easy to extend to sources that are possibly non-listable or at the very least I don't want to have all of the data available.

Doable. An example of this is the `plugins.available_versions` table[2]. It requires you to provide the plugin name as a predicate, as the versions need to be downloaded from the plugin's own repository (and listing all plugin repositories on each query isn't really what you want to be doing; you can also LOOKUP JOIN with the `plugins.available_plugins` table if that is indeed what you want).
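For example (the column name here is an assumption), something like:

    SELECT *
    FROM plugins.available_versions
    WHERE name = 'postgres'

works, whereas without the name predicate there's nothing to look the versions up by.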

[0]: https://github.com/cube2222/octosql-plugin-random_data

[1]: https://steampipe.io

[2]: https://github.com/cube2222/octosql/blob/main/datasources/pl...


Hey, author of OctoSQL here, happy to see it submitted and happy to answer any questions!


Great work. If you could run it as a server that other applications could connect to (e.g. over JDBC or some other common protocol), you would have the makings of an open data virtualization platform.


Thanks!

Definitely considering adding a server-mode with Postgres wire protocol compatibility.

It's tricky for the more dynamic/dataflow'y parts, as OctoSQL is able to give you a live-updating output table (which the Postgres wire protocol doesn't support), but I can take a similar approach to the one Materialize[0] uses for those use cases - creating a live-updating materialized view that you can query from.
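The rough shape of that pattern, in Materialize-style syntax (the `logs` source is just a placeholder):

    -- register the live-updating query once
    CREATE MATERIALIZED VIEW recent_errors AS
      SELECT service, count(*) AS errors
      FROM logs
      GROUP BY service;

    -- plain Postgres-wire clients then just read the view
    SELECT * FROM recent_errors;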

That said, for now I'm still concentrating on the overall local usage experience and ergonomics, there's still much to improve there.

[0]: https://materialize.com


Materialize is really neat; also check out https://github.com/mit-pdos/noria. It inverts the query problem and processes the data on insert, exactly like what most applications end up doing with a NoSQL solution.


It certainly seems more lightweight and modular, but I'm curious if you see this overlapping with Presto at all, in terms of product fit, scale/performance, etc.

I'll almost certainly use this to pull parquet, orc, and csv data into postgres at some point in the next month or two, so thanks.


The main difference in product fit is that OctoSQL is single-machine-only and I have no plans for distributed execution.

I'm not sure about the performance of Presto, but I'd expect OctoSQL to be of fairly similar speed for complex operations/data types (operating on strings, objects, running some functions on them) and slower for vectorizable arithmetic (avg of a column of floats in a parquet file).

Other than that, OctoSQL should be much easier to write plugins for, much easier to use, and even though there's no plugin for it yet, it should have much better support for streaming inputs (e.g. Kafka).


How hard would it be to use this as a library? Our application does a lot of CSV processing and I would love to try and see if this could help us streamline a lot of that.

Our application is written in Ruby so I was thinking just writing a wrapper which shells out to the cli.


Probably not too hard. The root.go file (in the root directory) does all the plumbing, so you should be able to either use OctoSQL as a CLI, or create your own based on that file.


Just did a quick search about embedding go code in Ruby applications and it looks relatively straightforward…looks like I have a weekend project on my hands :)


How does it compare to Teiid? It seems to be at least partially in the same space.


Thanks for mentioning it! I've never heard of Teiid and it indeed does seem to have some architectural commonalities.

To me, the main differences look like they are: 1. Teiid is mainly targeted at OLTP workloads, while OctoSQL is more OLAP-oriented; 2. OctoSQL is meant to be used as a standalone CLI, while Teiid integrates into your Java application (if I understand their docs correctly).


Teiid can also be deployed as a standalone server (see the Wildfly route). There is a JDBC driver to connect to it, and they emulate the Postgres wire protocol so you can even use psql to connect to it and run queries.


Hi, this looks like a great addition to the dataflow toolbox!

I wondered if OctoSQL supports memory-mapping parquet files like Vaex does, or whether the whole file has to be loaded into memory first?


I'm not sure how Vaex works, but for what it's worth, OctoSQL doesn't load whole Parquet files into memory.

It will only read the columns that are used in your query and will read those columns page by page (Parquet columns are organized into pages on disk) and process pages record by record. It won't ever buffer the whole thing if it's not necessary for the query as a whole.


How does the plugin system work? I thought Go didn't allow for dynamically loading code; were you able to find a workaround?


When a plugin starts, it begins listening on a local unix socket with a gRPC server.

The main OctoSQL process then connects to that gRPC server for each plugin.

Omitting a bunch of details, each record stream will end up being a gRPC streaming connection.

This approach is cross-platform (unix sockets work on Windows since recently!) and more than fast enough (millions of records per second), while being easy to keep organized (everything stays on the file system).

That said, all of this is abstracted away by the plugin SDK, so you only have to implement the right interface and the rest will get handled by the SDK.

Here's an example plugin that's simple and fairly easy to read: https://github.com/cube2222/octosql-plugin-random_data

And here's the gRPC schema for plugins, which might be useful to get a better picture: https://github.com/cube2222/octosql/blob/main/plugins/intern...


I gave this a try. Looks neat, however one thing I noticed is that it does not seem very performant when querying parquet files. Take this simple query for example: 'select * from ./order.parquet limit 5;' It takes roughly 38 seconds to execute on my machine (I ran it a few times). The order file is about 300 MB with 7.5M records in it. For comparison, DuckDB can run this query in less than a second (while also directly accessing the parquet file).


Thanks for writing about this!

This is a performance regression in the v0.8.0 release it seems, where the outer LIMIT clause doesn't correctly short-circuit evaluation.

Overall, for a `LIMIT 10` query, it should basically be instantaneous (and is, with v0.7.3). Will fix this asap.

That said, DuckDB will *definitely* be much faster for processing Parquet files, that's for sure.


This is now fixed[0]. I've released a new version (v0.8.1) and now a query like this should once again take just ~tens of milliseconds to complete.

[0]: https://github.com/cube2222/octosql/commit/ba4a8896be35ffdcc...


Congratulations, very cool project.

I'm interested in learning about how federated data source query engines like this handle query planning and optimization.

If I wanted to study your implementation for hints where would I start?


Thanks!

I'll start by saying that OctoSQL is single-machine-only, as I'm not sure what exactly you meant with "federated".

I'd recommend starting by going with a debugger through the execution of the root function in root.go, as that calls out to all the macro transformations.

Then you can take a look at the optimizer (optimizer directory) and the Postgres plugin source[0], as an example of a plugin that is able to push down predicates to the underlying database, as well as the Typecheck (logical -> physical) and Materialize (physical -> execution) transformations.

I'm planning to write a few technical documents about the implementation soon, while writing some actual usage documentation as well.

[0]: https://github.com/cube2222/octosql-plugin-postgres


Great, thanks!


Join AWS and study Athena Federated query feature


I like the idea of that, though I'm curious if an EL approach is the right way to go, especially at scale, where queries might all get blocked or run out of memory in OctoSQL.


Could you please explain in more detail what you mean (I'm not sure what you mean with "EL approach")?

OctoSQL isn't really meant to be used "at scale", as it's not distributed. It's not trying to compete with Spark, Flink, or BigQuery. You should be good to use it with data up to the order of Gigabytes / Tens of Gigabytes, locally.

OctoSQL never loads the whole table into memory if it doesn't need to. Processing happens record by record, and if predicates allow it, only part of the underlying table will ever be read (by pushing down predicates to a SQL DB you're reading from for example).


Ever heard of PostgreSQL FDW (Foreign Data Wrapper) and the associated community-developed extensions?


Yes! There's even a great project built around them to streamline their use, Steampipe[0].

That said, the last time I tried it, it was orders of magnitude slower than OctoSQL. Moreover, reusing the Postgres query engine has its advantages, but it also takes away flexibility you might want for customizing the dialect - in the case of OctoSQL, that's e.g. support for streaming and live-updating queries.

[0]: https://steampipe.io


> The last time I tried it, it was orders of magnitude slower than OctoSQL.

Any link to blogpost ? what was the use case ?


I tried the Steampipe CSV plugin on the taxi dataset that is used in the OctoSQL benchmarks. Simple grouping.


No, I haven’t.


OctoSQL is an awesome project, and Kuba has a lot of great experience to share from building it that I'm excited to learn from.

And while building a custom database engine does allow you to do pretty quick queries, there are a few issues.

First, the SQL implemented is nonstandard. When I was looking for documentation, it pointed me to `SELECT * FROM docs.functions fs`. I tried to count the number of functions, but octosql crashed (a Go panic) when I ran `SELECT count(1) FROM docs.functions fs` and `SELECT count(*) FROM docs.functions fs`, which is what I lazily do in standard SQL databases. (`SELECT count(fs.name) FROM docs.functions fs` worked.)

This kind of thing will keep happening because this project just doesn't have as many resources today as SQLite, Postgres, DuckDB, etc. It will support a limited subset of SQL.

Second, the standard library seems pretty small. When I counted the builtin functions there were only 29. Now, this is an easy thing to rectify over time, but I'm just noting the state today.

And third, this project only has builtin support for querying CSV and JSON files. Again, this could be easy to rectify over time, but I'm just mentioning the state today.

octosql is a great project but there are also different ways to do the same thing.

I built dsq [0], which runs all queries through SQLite, so it avoids point 1. It has access to SQLite's standard builtin functions plus a battery of extra statistical aggregation, string manipulation, URL manipulation, date manipulation, hashing, and math functions custom-built to help with the kind of interactive querying developers commonly do [1].

And dsq supports not just CSV and JSON but Parquet, Excel, ODS, ORC, YAML, TSV, and Apache and nginx logs.

A downside to dsq is that it is slower for large files (say, over 10GB) when you only want a few columns, whereas octosql does better in some of those cases. I'm hoping to improve this over time by adding a SQL filtering frontend to dsq, but in all cases dsq will ultimately use SQLite as the query engine.

You can find more info about similar projects in octosql's Benchmark section, but I also have a comparison section in dsq [2] and an extension of the octosql benchmark with a different set of tools [3], including duckdb.

Everyone should check out duckdb. :)

[0] https://github.com/multiprocessio/dsq

[1] https://github.com/multiprocessio/go-sqlite3-stdlib

[2] https://github.com/multiprocessio/dsq#comparisons

[3] https://github.com/multiprocessio/dsq#benchmark*


This is mostly right, but there are quite a few important caveats here.

> And third this project only has builtin support for querying CSV and JSON files.

This is actually not true :)

There's also Parquet, and you can read arbitrary text files line by line (which with soon-to-come regex capture should let you easily work with all kinds of textual log files).

Other than that, there are plugins for other databases which are intentionally kept out of the core.

> I tried to count the number of functions but octosql crashed (a Go panic) when I ran `SELECT count(1) FROM docs.functions fs` and `SELECT count(*) FROM docs.functions fs` which is what I lazily do in standard SQL databases.

Looks like a bug in the docs datasource, thanks! Already diagnosed the issue and it should be fixed soon.

> It will support a limited subset of SQL.

It's not really a subset. It also has extensions not available in standard SQL, like temporal SQL, as well as ergonomic handling of columns with multiple possible value types (e.g. a String | Int column), which is a common occurrence in e.g. JSON files and is not well-solved by standard SQL.

There are many features currently in OctoSQL and that I'm planning to add in the future that wouldn't be possible with the SQLite approach, nor with the Steampipe Postgres approach.

Simple example: tailing a log file live while filtering lines using SQL.
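Very roughly, and with syntax that's illustrative rather than final (the `line` column name is made up):

    -- follow a growing log file and filter it as new lines arrive
    SELECT t.line
    FROM app.log t
    WHERE t.line LIKE '%ERROR%'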

But yeah! Overall there are definitely tradeoffs, and you should choose the tool that makes sense for you.

After I'm done stabilizing the core I will move on to implementing a ton of functions available in other engines, so that should cease to be an issue soon as well - there aren't that many of them, it just hasn't been a priority so far.


The docs panic issue has now been fixed[0] and released[1]. Thanks again for mentioning it!

[0]: https://github.com/cube2222/octosql/commit/f7a1b92ba65584bc6...

[1]: https://github.com/cube2222/octosql/releases/tag/v0.8.1




