
Show HN: OctoSQL – Query and join multiple databases and files, written in Go - cube2222
https://github.com/cube2222/octosql
======
cube2222
Hey, one of the authors here.

The motivation behind this project is that I always wanted a simple
command-line tool that lets me join data from different places, without
needing to set up something like Presto or Spark. On the other hand, I never
encountered any tool which allows me to easily query CSV and JSON data using
SQL (which, at least in my opinion, is fairly ergonomic to use).

This started as a university project, but we're now continuing it as an open
source one, as it's been a great success so far.

Anyway, feedback is greatly appreciated!

~~~
truth_seeker
Generally, I make use of file_fdw module of Postgres to achieve this.

[https://www.postgresql.org/docs/11/file-fdw.html](https://www.postgresql.org/docs/11/file-fdw.html)
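
For anyone who hasn't set it up before, a minimal file_fdw sketch looks roughly like this (the server, table, and file names are made up for illustration):

    
    
        CREATE EXTENSION file_fdw;
        CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;
        
        -- Map a CSV file on disk to a foreign table.
        CREATE FOREIGN TABLE people (id integer, name text)
          SERVER csv_files
          OPTIONS (filename '/tmp/people.csv', format 'csv', header 'true');
        
        SELECT * FROM people WHERE id > 10;
    

After that it joins against regular tables like any other relation.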

~~~
brunoqc
I wonder if you or the other Postgres users here could give me some tips.

What would be the best way to get the data from a huge Excel file into my web
apps?

Currently I'm converting it to CSV and using `BULK INSERT` with MSSQL. I know
that with MSSQL I can also use Excel files as an "external database", but
AFAIK Excel files are not indexed and are super slow when used that way.
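
(For reference, the CSV route I'm describing is roughly the following, with a hypothetical table and path:)

    
    
        BULK INSERT dbo.MyData
        FROM 'C:\data\export.csv'
        WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);
    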

I wouldn't mind switching to Postgres. I actually prefer Postgres.

~~~
tempguy9999
As an MSSQL guy who rarely connects it to Excel: if your workflow allows it,
consider pulling the entire Excel sheet into a table in the DB and working on
it just from there. What's huge for Excel isn't huge for MSSQL.

As a decades-long MSSQL user who is getting sick of being repeatedly done
over by MS, could you explain why you prefer Postgres? That would be
incredibly interesting, TIA.

~~~
brunoqc
For the Excel sheet, I was wondering if there was a better way, since I want
to update all the data many times per day to stay up-to-date. I thought maybe
I was wasting some resources.

> could you explain why you prefer postgres?

I don't have many reasons to be honest. Both MSSQL and Postgres are fine I
guess but I like that Postgres is open source and so I can use it in personal
projects for fun.

I guess the thing I hate the most about MSSQL is that, at work, we never have
the latest version, because why pay again? And every single time I search how
to do something, I have to use a less optimal way, since the new way only
works on the latest version.

At least with Postgres I can always run the latest version.

~~~
tempguy9999
I can't help you really, but decide where the master data lives - in the
spreadsheet or in the DB? Try to move it into the DB, so nothing needs to
touch the spreadsheet; if you want, keep the spreadsheet just for looking at
the data, and never alter it there.

DBs handle and manipulate data far better than Excel does. Excel does data
presentation better.

A decent DB, well looked after and backed up, is the safest place for any
quantity of data. Spreadsheets are not - Excel is not reliable once you start
filling it up and doing complex updates.

So that's my advice. Of course your needs and workflow may not fit that at all
well, so YMMV!

Thanks for the opinion on mssql vs postgres.

------
MoOmer
Really cool, thanks for sharing. You all might want to look at Apache Calcite
([http://calcite.apache.org/](http://calcite.apache.org/)) as well for
inspiration, which has similar functionality as a subset of its features!

~~~
jadbox
Apache Drill vs Calcite? They seem very similar.

~~~
GordonS
Lots of Apache projects have very similar counterparts, which has always
puzzled me - I wonder if there are publicly available official selection
criteria for new Apache projects?

~~~
taftster
This doesn't bother the Apache Software Foundation. So long as a project is
able to follow the "Apache Way" (licensing, treatment of source artifacts,
organizational structure of project members and community, etc.), then the
project can exist in the ASF without conflicting with other similar projects.

The ASF's mission is not focused on bringing one best-of-breed product to
every category. Rather, the ASF is focused on the stewardship and delivery of
the software. They are more interested in the organization and community
surrounding the project than in the code itself.

[https://www.apache.org/theapacheway/](https://www.apache.org/theapacheway/)

------
cplat
Cool project. On a side note, Apache Drill
([https://drill.apache.org/](https://drill.apache.org/)) is also worth
checking out. The homepage doesn't make it very apparent, but I've used it to
query CSVs, etc., using SQL.

~~~
dk-
Yeah, Drill has been around for a few years. I'll likely continue to use
Drill over OctoSQL (no offense, OP).

------
blacksqr
"Sqawk is an Awk-like program that uses SQL and can combine data from multiple
files. It is powered by SQLite."

[https://github.com/dbohdan/sqawk](https://github.com/dbohdan/sqawk)

------
irq-1
Consider adding a server/daemon with jobs and scheduling. This would let
people copy and reconcile data between systems, or prep data at night, etc...
It could also save indexes and table statistics between queries.

~~~
cube2222
That's definitely our plan for the future!

------
mjirv
This is great. I keep the following in my bash profile for querying CSVs with
SQL, which works OK, but this is probably better:

    
    
      function csv_to_sqlite() {
        echo "usage: csv_to_sqlite PATH_TO_CSV TABLE_NAME"
        sqlite3 -csv -cmd ".import \"$1\" $2" -cmd ".headers on"
      }
    

Two suggestions:

1. It would be a lot more useful if I could link an entire database rather
than just tables.

2. Sometimes I just want to query a single CSV, so a command-line argument
where I can just specify the path would be nice.

~~~
cube2222
Thanks!

1. Initially we wanted to do one table per config entry, but that will be an
easy UX improvement to make.

2. We're planning to add support for piping the CSV/JSON file through stdin!
That would be the most seamless workflow, I think.

------
quickthrower2
If anyone is trying it out, beware it requires you use an alias for your
table, e.g.

    
    
        select * from tbl as t
    

not

    
    
        select * from tbl

------
srameshc
Thanks for sharing. This is cool, just what I wanted. I like the fact that I
can create config files and query not just the database but also JSON.

~~~
cube2222
Make sure to let us know if you have any feedback after testing it out!

The Go standard library really helped with handling all the different JSON
file formats you might encounter (like a whole file being one table, or one
record per line).
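
For illustration, the two layouts would look roughly like this (made-up records). A whole file as one table:

    
    
        [
          {"id": 1, "name": "Ann"},
          {"id": 2, "name": "Bob"}
        ]
    

versus one record per line:

    
    
        {"id": 1, "name": "Ann"}
        {"id": 2, "name": "Bob"}
    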

------
kfk
Hey, at work we have been forced to use Denodo. How does this compare to
Denodo? Seems a lot better and doesn’t try to do any “data virtualization”. I
am a bit scared by the query optimizer because the Denodo one is super buggy.
Overall I think the concept is great and this might be a great companion to
etl tools.

~~~
minxomat
If visualization is the focus, I'd recommend Postgres FDW over anything else
here, just because PG is supported by almost every visualization tool
(Metabase, Redash, many Apache projects, Tableau, Google Data Studio (free),
etc.).

~~~
stevepike
I've been working on a new company that's doing cloud hosted postgres foreign
data wrappers as a service (plus a bunch of interface / workflow stuff on top)
and while I'm biased I totally agree. The postgres FDW system has gotten very
sophisticated in later releases. It does performance optimizations like
pushing down filters / aggregates, and it even goes further by supporting
sending remote explain calls out to the underlying datasources to use in query
planning. In one of my local tests vs. Denodo it smartly decided _not_ to push
a group by down when the cardinality of the column being grouped on was very
high (since the grouping doesn't reduce the number of rows sent over the wire
very much), which blew me away.

If anyone is interested I'd love to chat: steve (at) syndetic.co

------
polskibus
When joining tables from 2 different data sources, is the join performed by
loading both tables in full into memory? If yes, any plans on improving this
so that it doesn't OOM or kill the remote database? If no, how do you handle
this case?

~~~
cube2222
No, currently we support only lookup joins. We're planning to add table/hash-
joins based on table-size heuristics.
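
To sketch it: for a query like the one below (hypothetical table names, say orders backed by a CSV file and users by a database), a lookup join scans one table and, for each row, fetches the matching rows from the other source:

    
    
        SELECT u.name, o.total
        FROM orders o
        JOIN users u ON u.id = o.user_id
    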

~~~
riku_iki
How do you look up CSV records?

~~~
cube2222
Hey, sorry for the late response. Currently it's totally inefficient - we
scan the whole file each time. We plan on adding an in-memory or on-disk
index for such cases.

------
sails
Really nice!

Something similar (with a GUI) is
[https://www.dremio.com/](https://www.dremio.com/) that I think has gained
some traction.

~~~
sixdimensional
Dremio is quite cool and still something I am watching develop closely. Their
involvement with the open source ecosystem is quite nice. Apache Drill, Arrow,
Parquet and Gandiva are all super important contributions that the team behind
Dremio has been involved in AFAIK.

------
21stio
awesome, really cool!

------
yannis
A small tangential comment: in the README you mention octopi. Although not
wrong as per the Oxford English Dictionary, which lists octopuses, octopi and
octopodes (in that order), it is normally better to use octopuses.

~~~
cube2222
I did that intentionally cause I think octopi sounds kind of cute :)

