Show HN: OctoSQL – Query and join multiple databases and files, written in Go (github.com/cube2222)
310 points by cube2222 on July 16, 2019 | hide | past | favorite | 70 comments


Hey, one of the authors here.

The motivation behind this project is that I always wanted a simple command-line tool that lets me join data from different places, without needing to set up stuff like Presto or Spark. On the other hand, I've never encountered a tool that allows me to easily query CSV and JSON data using SQL (which, at least in my opinion, is fairly ergonomic to use).

This started as a university project, but we're now continuing it as an open-source one, as it's been a great success so far.

Anyways, feedback is greatly requested and appreciated!


> command-line tool that lets me join data from different places, without needing to set up stuff like Presto or Spark. On the other hand, I've never encountered a tool that allows me to easily query CSV and JSON data using SQL (which, at least in my opinion, is fairly ergonomic to use).

This exactly describes Drill (https://drill.apache.org/) which can query any data source under the sun (RDBMS, NoSQL, files, clusters, serialized data (JSON, CSV, Parquet...), object storage (S3, OpenStack, ...)) using SQL and JOIN between them with pushdown optimizations. It's got an awesome CLI and can be used in any language via REST or JDBC.


Thanks, didn't know about it! I'll make sure to read up on it!


I last tried this program several years ago, and I remember the setup being somewhat difficult and arcane. It's still really useful, though; I should give it another look.


I have a recurring use case where I have huge CSV files that I need to aggregate/summarize using a pretty old netbook. The only way I can get this done is through Spark/Scala. Would Drill be a better option?


Sure. Create a filesystem data source in the Drill config (https://drill.apache.org/docs/file-system-storage-plugin/) pointing at the directories your CSVs live in. Then query them however you'd like. Use the CLI if you want to produce another CSV, or REST/JDBC if you need the data somewhere else. If the query is complex and long-running, you can monitor its state in the Drill web UI.


You might need to tweak the Drill JVM memory settings a little bit though if you're doing this on a netbook without much memory!
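If Drill's JVM turns out to be too heavy for a netbook, the aggregation itself can often be done in a single streaming pass with no dependencies at all. A minimal sketch in stdlib Python (the column names are invented placeholders, not from any dataset mentioned here):

```python
import csv
from collections import defaultdict

def sum_by_key(path, key_col, value_col):
    """Stream a CSV once, summing value_col per distinct key_col.

    Memory use is proportional to the number of distinct keys,
    not the file size, so huge files work on low-RAM machines.
    """
    totals = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row[key_col]] += float(row[value_col])
    return dict(totals)
```

This trades SQL's expressiveness for constant memory; for anything beyond simple group-by aggregates, a real engine like Drill or OctoSQL is the better fit.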


Generally, I make use of file_fdw module of Postgres to achieve this.

https://www.postgresql.org/docs/11/file-fdw.html


Didn't know about that, thanks!

Is there a way to connect this to other databases, to basically have views of your redis/MySQL/other databases in postgres?



Seems like it's solving part of the problems we're having really well; I'll have to read up more on it, thanks!


Sure. You might also wanna check this:

PG community extensions - https://pgxn.org/

There are a lot of open-source or free extensions to make PG as a data platform rather than just a RDBMS.


Postgres Foreign Data Wrappers are an incredibly powerful tool. In the past I've built a basic data warehouse for a distributed system by pulling in read replicas of assorted services' databases as FDW sources in a central Postgres instance, after which Postgres lets you treat them all as one big database.


This is a helpful utility for sure - but beware, joins between foreign data sources and local ones can be slow and poorly optimized, especially with bigger data sources.

I used to work for a company that made a data federation software platform. I'm always surprised that people forget that management of external data is actually part of the SQL standard - SQL/MED ("SQL Management of External Data") [1]!

[1] https://en.wikipedia.org/wiki/SQL/MED


Completely unrelated, but that URL really threw me for a loop. I am not used to seeing a slash in the article topic (section after '/wiki/').


I wonder if you or the other Postgres users here could give me some tips.

What would be the best way to use the data from a huge Excel file in my web apps?

Currently I'm converting it to CSV and use `BULK INSERT` with MSSQL. I know that with MSSQL I can also use Excel files as "external database" but AFAIK, Excel files are not indexed and are super slow when used as an "external database".

I wouldn't mind switching to Postgres. I actually prefer Postgres.


The process is similar: you would convert to CSV and then use the \copy command to bulk insert the data. In PG 12, the COPY command has gained a lot of very useful features like WHERE filtering, so if you get the chance, go for 12:

https://www.cybertec-postgresql.com/en/tech-preview-improvin...


As an MSSQL guy who rarely connects it to excel, if your workflow allows it consider pulling the entirety of the excel sheet into a table in the db and working on it just from there. What's huge for excel isn't for MSSQL.

As a decades-long MSSQL user who is getting sick of being repeatedly done over by MS, could you explain why you prefer Postgres? That would be incredibly interesting, TIA.


For the Excel sheet, I was wondering if there was a better way, since I want to update all the data many times per day to keep it up to date. I thought maybe I was wasting some resources.

> could you explain why you prefer postgres?

I don't have many reasons to be honest. Both MSSQL and Postgres are fine I guess but I like that Postgres is open source and so I can use it in personal projects for fun.

I guess the thing I hate the most about MSSQL is that, at work, we never have the latest version, because why pay again. And every single time I search how to do something I must use the less optimal way since the new way only works on the latest version.

At least with Postgres I can always run the latest version.


I can't really help you, but decide where the master data lives - in the spreadsheet or in the DB? Try to move it into the DB, so the data needn't touch the spreadsheet unless you want to look at it there - never alter it there.

DBs handle and manipulate data far better than Excel. Excel does data presentation better.

A decent DB, well looked after and backed up, is the safest place for any quantity of data. Spreadsheets are not - excel is not reliable once you start filling it up and doing complex updates.

So that's my advice. Of course your needs and workflow may not fit that at all well, so YMMV!

Thanks for the opinion on mssql vs postgres.


Not OP, but in my mind (having used both extensively), Postgres is basically the closest you can get to MSSQL in the open-source world, in terms of completeness as a fully functional relational database that scales - not to mention Postgres has tons of features that MSSQL has been playing catch-up with for years.

I like both of them personally. Postgres didn't use to have as strong a showing on Windows, nor tools as fancy as Management Studio/Query Analyzer/Azure Data Studio, but I think today Postgres is absolutely viable as a replacement for SQL Server if you want to switch to open source.



Do FDWs/standard connectors support predicate pushdown well? The documentation doesn't say much about that.


Yep, PG 9.5 and later support it.

Prefix your query with "EXPLAIN VERBOSE" to get more info.


My god. I didn't know postgres can do this. Thank you!


At Hasura, we recently released a preview of Remote Joins, which lets you join different GraphQL APIs. So all you need is a GraphQL wrapper over your DB/sources (Postgres comes built-in) and you can then join whatever sources you like.

https://blog.hasura.io/remote-joins-a-graphql-api-to-join-da...


A great motivation and keeping it simple from an interface perspective is a cool intention.

It's a great start with tons of directions you can take it, and many interesting challenges along the way - keep up the good work!


Thanks for the kind words! Appreciate it!


I've been approaching this problem from a slightly different direction: I'm constructing a set of tools for converting different data formats into SQLite, because once your data is in SQLite you can join it, filter it and do all kinds of other useful SQL things on top of it.

Tools I've written so far include:

https://github.com/simonw/csvs-to-sqlite - convert CSV files to a SQLite database

https://sqlite-utils.readthedocs.io/en/stable/cli.html - convert JSON files

https://github.com/simonw/db-to-sqlite - export data from any SQL database supported by SQLAlchemy (e.g. MySQL and PostgreSQL) into a SQLite file

More tools here: https://datasette.readthedocs.io/en/stable/ecosystem.html#to...

The end goal is to have databases I can load up in my Datasette web application tool so I can run SQL and export the results as JSON/CSV: https://datasette.readthedocs.io/
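The "convert everything into SQLite, then join" workflow the tools above automate can be sketched in a few lines of stdlib Python (the table names, column names, and sample data here are invented purely for illustration; this is not how those tools are implemented):

```python
import csv, io, json, sqlite3

def load_csv(conn, table, text):
    """Create a table from CSV text; first row becomes the column names."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    conn.execute(f"CREATE TABLE {table} ({', '.join(header)})")
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", data)

def load_json(conn, table, text):
    """Create a table from a JSON array of flat objects."""
    records = json.loads(text)
    header = list(records[0])
    conn.execute(f"CREATE TABLE {table} ({', '.join(header)})")
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})",
                     [[r[c] for c in header] for r in records])

conn = sqlite3.connect(":memory:")
load_csv(conn, "people", "id,name\n1,ada\n2,grace\n")
load_json(conn, "cities", '[{"id": "1", "city": "London"}]')
joined = conn.execute(
    "SELECT p.name, c.city FROM people p JOIN cities c ON p.id = c.id"
).fetchall()  # -> [('ada', 'London')]
```

Once both sources are SQLite tables, every SQL feature (joins, filters, aggregates) comes for free - which is exactly the appeal of this approach.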


This looks great!

However, we're aiming for data sources which are either too big to download or change frequently, so we have very different goals really. (The same SQL-for-everything inspiration, though!)


Cool project! I've had success with ClickHouse's local utility, which is extremely fast. It helps that it's basically a "local" version of an already insanely efficient columnar database.

https://www.altinity.com/blog/2019/6/11/clickhouse-local-the...



Sounds like a neat project. On the topic, you can actually make things like CSV files (but not JSON) available to PostgreSQL via Foreign Data Wrappers. Worth checking out if you use PostgreSQL already.


Microsoft LogParser (https://www.microsoft.com/en-us/download/details.aspx?id=246...) was good for querying various text formats from the command line with an SQL-like syntax.


Did it start as a support tool for easy query of open data or what was the motivation?


The only motivation was to make a tool that lets you query multiple data sources simultaneously in SQL - and that was also the starting point.


There's also Datasette[1] to look at (perhaps to get some inspiration or port some ideas).

[1]: https://github.com/simonw/datasette


Pretty cool stuff.

Does it support writing to (single) data sources too or only reading queries?


Thanks! We only support reads, and for now we're not planning to add write support.

You can use the csv output format though and later import that somewhere.


Really cool, thanks for sharing. You all might want to look at Apache Calcite (http://calcite.apache.org/) as well for inspiration, which has similar functionality as a subset of its features!


Apache Drill vs Calcite? They seem very similar.


Drill depends on Calcite for SQL parsing and cost-based query optimization!


Lots of Apache projects have very similar counterparts, which has always puzzled me - I wonder if there are publicly available official selection criteria for new Apache projects?


This doesn't bother the Apache Software Foundation. So long as a project is able to follow the "Apache Way" (licensing, treatment of source artifacts, organizational structure of project members and community, etc.), then the project can exist in the ASF without conflict of other similar projects.

The ASF's mission is not focused on bringing one best-of-breed product to every category. Rather, the ASF is focused on the stewardship and delivery of the software. It is more interested in the organization and community surrounding a project than in the code itself.

https://www.apache.org/theapacheway/


Looks like it's solving part, or even all, of the same problems we are.

I'll definitely be researching how they approached it.

Thanks for it! It will surely help us going forward.


Cool project. On a side note, Apache Drill (https://drill.apache.org/) is also worth checking out. The homepage doesn't make it very apparent, but I've used it to query CSVs, etc., using SQL.


Yeah, and Drill has been around for a few years. I'll likely continue to use Drill over OctoSQL (no offense, OP).


"Sqawk is an Awk-like program that uses SQL and can combine data from multiple files. It is powered by SQLite."

https://github.com/dbohdan/sqawk


Consider adding a server/daemon with jobs and scheduling. This would let people copy and reconcile data between systems, or prep data at night, etc... It could also save indexes and table statistics between queries.


That's definitely our plan for the future!


This is great. I keep the following in my bash profile for querying CSVs with SQL, which works OK, but this is probably better:

  function csv_to_sqlite() {
    echo "usage: csv_to_sqlite PATH_TO_CSV TABLE_NAME"
    sqlite3 -csv -cmd ".import '$1' $2" -cmd ".headers on"
  }
Two suggestions:

1. It would be a lot more useful if I could link an entire database rather than just tables

2. Sometimes I just want to query a single CSV, so a command line argument where I can just specify the path would be nice


Thanks!

1. Initially we wanted to do one table per config entry, but that'll be an easy UX improvement to make.

2. We're planning to add support for piping the CSV/JSON file through stdin! I think that would be the most seamless workflow.
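A minimal sketch of what that pipe-through-stdin workflow could look like, using SQLite as a stand-in engine (nothing here is OctoSQL's actual interface; the fixed table name `t` is an arbitrary choice for the example):

```python
import csv, sqlite3, sys

def query_stdin_csv(sql, stream=None):
    """Load CSV from stdin into an in-memory table 't', then run sql.

    Reads the whole stream into SQLite first, so this sketch is not
    streaming - it just shows the shape of the UX.
    """
    rows = list(csv.reader(stream or sys.stdin))
    header, data = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE t ({', '.join(header)})")
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO t VALUES ({placeholders})", data)
    return conn.execute(sql).fetchall()
```

Wrapped in a small CLI, this would enable invocations like `cat data.csv | tool "SELECT ... FROM t"`.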


If anyone is trying it out, beware that it requires you to use an alias for your table, e.g.

    select * from tbl as t
not

    select * from tbl


Thanks for sharing. This is cool, just what I wanted. I like the fact that I can create config files and query not just the database but also JSON.


Make sure to let us know if you have any feedback after testing it out!

The Go standard library really helped with handling all the different JSON file formats you might expect (like the whole file being one table, or one record per line).
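The two JSON layouts mentioned - a single top-level array vs. newline-delimited records - can be handled with a small sniffing helper. This is a generic sketch in Python, not OctoSQL's actual Go implementation:

```python
import json

def read_json_records(text):
    """Return a list of records from either common JSON layout.

    Accepts a whole-file array ([{...}, {...}]) or newline-delimited
    JSON (one object per line), distinguished by the first character.
    """
    stripped = text.lstrip()
    if stripped.startswith("["):
        return json.loads(stripped)  # whole-file array
    return [json.loads(line)
            for line in stripped.splitlines() if line.strip()]
```

A real implementation would stream the line-delimited case instead of materializing a list, but the sniffing logic is the same.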


Hey, at work we have been forced to use Denodo. How does this compare to Denodo? Seems a lot better and doesn’t try to do any “data virtualization”. I am a bit scared by the query optimizer because the Denodo one is super buggy. Overall I think the concept is great and this might be a great companion to etl tools.


I used to work for a competitor to Denodo. Query optimization for data virtualization/federation across heterogeneous systems is very complicated - that much I can assure you.

You might be interested in Dremio (which is a modernized and open source version of Drill, but also has a commercial version) too. I had previously studied Spark, PrestoDb and Hive, among other things, for similar purposes.

If you ever want to chat feel free to contact me via my profile email.


If visualization is the focus, I'd recommend Postgres FDW over anything else here, just because PG is supported by almost every visualization tool (Metabase, Redash, many Apache projects, Tableau, Google Data Studio (free), etc.)


I've been working on a new company that's doing cloud-hosted Postgres foreign data wrappers as a service (plus a bunch of interface/workflow stuff on top), and while I'm biased, I totally agree. The Postgres FDW system has gotten very sophisticated in later releases. It does performance optimizations like pushing down filters and aggregates, and it goes further by supporting remote EXPLAIN calls to the underlying data sources for use in query planning. In one of my local tests against Denodo, it smartly decided _not_ to push a GROUP BY down when the cardinality of the grouped column was very high (since the grouping wouldn't reduce the number of rows sent over the wire very much), which blew me away.

If anyone is interested I'd love to chat: steve (at) syndetic.co
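The GROUP BY decision described above boils down to a cardinality heuristic: pushing an aggregate to the remote source only pays off if it shrinks the result set. A toy version of that planner rule (the threshold is an invented stand-in for a real cost model, not Postgres's actual logic):

```python
def should_push_down_group_by(row_count, distinct_keys,
                              reduction_threshold=0.5):
    """Decide whether to push a GROUP BY to a remote data source.

    If nearly every row has a unique grouping key, aggregating remotely
    barely reduces the rows sent over the wire, so scanning plainly and
    grouping locally can be the cheaper plan.
    """
    if row_count == 0:
        return False
    return distinct_keys / row_count <= reduction_threshold
```

Real planners feed estimates like these (from remote statistics or EXPLAIN output) into a full cost model rather than a single ratio, but the shape of the decision is the same.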


Hey, I have no idea really. I've never used Denodo, and a quick glance at their homepage doesn't tell me much, so I don't understand what they mean by data virtualization.

We're obviously not as battle-tested, this being the initial release, but hopefully we'll be able to compare favorably!

The query optimizer is kinda simple currently: mostly pushing filters down under maps, and pushing supported workloads down to the data sources. Though I don't know how it will evolve going forward ;)


When joining tables from two different data sources, is the join performed by loading both tables in full into memory? If yes, are there any plans to improve this, so that it doesn't OOM or kill the remote database? If not, how do you handle this case?


No, currently we support only lookup joins. We're planning to add table/hash-joins based on table-size heuristics.
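The difference between the lookup join described here and the planned hash join can be sketched generically: a hash join scans the smaller side once, builds an in-memory index, then streams the larger side through it. This is illustrative Python, not OctoSQL's implementation:

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Join two lists of dicts on `key`.

    Builds a hash table over `right` in one pass, then probes it while
    streaming `left` - O(len(left) + len(right)), instead of rescanning
    the right side for every left row as a naive lookup join does.
    """
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index[l[key]]]
```

The table-size heuristic mentioned above then amounts to choosing which side to build the hash table over (the smaller one), and falling back to lookups when the build side won't fit in memory.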


How do you look up CSV records?


Hey, sorry for the late response. Currently it's totally inefficient - we scan the whole file each time. We plan to add an in-memory or on-disk index for such cases.


Really nice!

Something similar (with a GUI) is https://www.dremio.com/ that I think has gained some traction.


Dremio is quite cool and still something I am watching develop closely. Their involvement with the open source ecosystem is quite nice. Apache Drill, Arrow, Parquet and Gandiva are all super important contributions that the team behind Dremio has been involved in AFAIK.


awesome, really cool!


A small tangential comment: in the README you mention octopi. Although not wrong per the Oxford English Dictionary - which lists octopuses, octopi and octopodes, in that order - it is normally better to use octopuses.


I did that intentionally cause I think octopi sounds kind of cute :)


Why not let that word live its own little life?



