The motivation behind this project is that I always wanted a simple command-line tool that lets me join data from different places, without needing to set up stuff like Presto or Spark. On the other hand, I never encountered any tool which allows me to easily query CSV and JSON data using SQL (which, at least in my opinion, is fairly ergonomic to use).
This started as a university project, but we're now continuing it as an open source one, as it's been a great success so far.
Anyways, feedback greatly requested and appreciated!
> commandline tool allowing me to join data from different places, without needing to set up stuff like presto or spark. On another hand, I never encountered any tool which allows me to easily query csv and json data using SQL (which at least in my opinion is fairly ergonomic to use).
This exactly describes Drill (https://drill.apache.org/) which can query any data source under the sun (RDBMS, NoSQL, files, clusters, serialized data (JSON, CSV, Parquet...), object storage (S3, OpenStack, ...)) using SQL and JOIN between them with pushdown optimizations. It's got an awesome CLI and can be used in any language via REST or JDBC.
I last tried this program several years ago, and I remember the setup was somewhat difficult and arcane. It still seems really useful, though; I should give it another look.
I have a recurring use case where I have huge CSV files that I need to aggregate/summarize using a pretty old netbook. The only way I can get this done is through Spark/Scala. Would Drill be a better option?
Sure. Create a filesystem data source in the Drill config (https://drill.apache.org/docs/file-system-storage-plugin/) pointing at the directories your CSVs are in. Then query them however you'd like. Use the CLI if you want to produce another CSV, or REST/JDBC if you need the data somewhere else. If the query is complex and long-running, you can monitor its state in the Drill web UI.
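For example, once the storage plugin is set up, a query against a headerless CSV looks roughly like this (file path and column positions are made up; Drill exposes each headerless CSV row as a `columns` array):

```sql
-- Aggregate a hypothetical sales CSV via the "dfs" filesystem plugin.
-- columns[0] is the first CSV field, columns[2] the third, etc.
SELECT columns[0]                  AS city,
       SUM(CAST(columns[2] AS INT)) AS total
FROM dfs.`/data/sales.csv`
GROUP BY columns[0];
```

If your CSVs have a header row, you can configure the format with `extractHeader: true` and then refer to columns by name instead.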
Postgres Foreign Data Wrappers are an incredibly powerful tool. In the past I've built a basic data warehouse for a distributed system by pulling in read replicas from assorted services' databases as FDW sources on a central Postgres instance, after which Postgres lets you treat them all as one big database.
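The wiring for one such source looks roughly like this (host, database, schema and credentials are all made up for illustration):

```sql
-- On the central warehouse instance: attach one service's read replica
-- as a foreign server via postgres_fdw.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER orders_replica
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'orders-db.internal', dbname 'orders');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER orders_replica
    OPTIONS (user 'warehouse_ro', password 'secret');

-- Pull the remote tables in as local foreign tables:
IMPORT FOREIGN SCHEMA public
    FROM SERVER orders_replica INTO orders;
```

After that, `orders.*` tables can be joined against any other schema on the warehouse instance like ordinary local tables.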
This is a helpful utility for sure - but beware, joins between foreign data sources and local ones can be slow and very un-optimized, especially with bigger data sources.
I used to work for a company that made a data federation software platform. I am always surprised that people forget that management of external data is actually part of the SQL standard - SQL/MED ("SQL Management of External Data") [1]!
I wonder if you or the other Postgres users here could give me some tips.
What would be the best way to use the data from a huge Excel file in my web apps?
Currently I'm converting it to CSV and using `BULK INSERT` with MSSQL. I know that with MSSQL I can also use Excel files as an "external database", but AFAIK Excel files are not indexed and are super slow when used that way.
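For reference, the current pipeline step looks roughly like this (table and file names are made up):

```sql
-- T-SQL: load the converted CSV into a staging table.
BULK INSERT dbo.StagingPrices
FROM 'C:\data\prices.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2   -- skip the CSV header row
);
```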
I wouldn't mind switching to Postgres. I actually prefer Postgres.
The process is similar: you would convert to CSV and then use the \copy command to bulk insert the data. In pg 12, the COPY command has grown a lot of very useful features like WHERE filtering, so if you get the chance, go for 12.
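Concretely, the Postgres side looks like this (table and file names are made up; the WHERE clause on COPY requires Postgres 12+):

```sql
-- psql client-side bulk load from a local CSV with a header row:
\copy prices FROM 'prices.csv' WITH (FORMAT csv, HEADER)

-- Server-side COPY with row filtering, new in Postgres 12:
COPY prices FROM '/data/prices.csv' WITH (FORMAT csv, HEADER)
    WHERE price > 0;
```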
As an MSSQL guy who rarely connects it to excel, if your workflow allows it consider pulling the entirety of the excel sheet into a table in the db and working on it just from there. What's huge for excel isn't for MSSQL.
As a decades-long MSSQL guy who is getting sick of being repeatedly done over by MS, could you explain why you prefer Postgres? That would be incredibly interesting, TIA.
For the Excel sheet, I was wondering if there was a better way, since I want to update all the data many times per day to keep it up-to-date. I thought maybe I was wasting some resources.
> could you explain why you prefer postgres?
I don't have many reasons to be honest. Both MSSQL and Postgres are fine I guess but I like that Postgres is open source and so I can use it in personal projects for fun.
I guess the thing I hate the most about MSSQL is that, at work, we never have the latest version, because why pay again. And every single time I search how to do something I must use the less optimal way since the new way only works on the latest version.
At least with Postgres I can always run the latest version.
I can't really help you, but decide where the master data lives - in the spreadsheet or in the DB? Try to move it into the DB, so nothing needs to touch the spreadsheet unless you want to look at the data there - and never alter it there.
DBs handle and manipulate data far better than Excel. Excel does data presentation better.
A decent DB, well looked after and backed up, is the safest place for any quantity of data. Spreadsheets are not - Excel is not reliable once you start filling it up and doing complex updates.
So that's my advice. Of course your needs and workflow may not fit that at all well, so YMMV!
Not OP but, in my mind (having used both extensively), Postgres is basically the closest I feel you can get to MSSQL in the open source world, in terms of completeness of a fully functional relational database that scales - not to mention Postgres has tons of features that MSSQL has been playing catch up with for years.
I like both of them personally. Postgres didn't used to have as strong a showing on Windows, nor tools as fancy as Management Studio/Query Analyzer/Azure Data Studio, but I think today Postgres is absolutely viable as a replacement for SQL Server if you want to switch to open source.
At Hasura, we recently released the preview of Remote Joins that allows you to join different GraphQL APIs. So, all you need is a graphql wrapper over your DB/sources (Postgres comes in-built) and you can then join whatever sources.
I've been approaching this problem from a slightly different direction: I'm constructing a set of tools for converting different data formats into SQLite, because once your data is in SQLite you can join it, filter it and do all kinds of other useful SQL things on top of it.
The end goal is to have databases I can load up in my Datasette web application tool so I can run SQL and export the results as JSON/CSV: https://datasette.readthedocs.io/
However, we're aiming for data sources which are either too big to download or change frequently, so we also have very different goals, really. (The same SQL-for-everything inspiration, though!)
Cool project! I've had success with ClickHouse's local utility, which is extremely fast. It helps that it's basically a "local" version of an already insanely efficient columnar database.
Sounds like a neat project. On the topic, you can actually make things like CSV files (but not JSON) available to PostgreSQL via Foreign Data Wrappers. Worth checking out if you already use PostgreSQL.
Really cool, thanks for sharing. You all might want to look at Apache Calcite (http://calcite.apache.org/) as well for inspiration, which has similar functionality as a subset of its features!
Lots of Apache projects have very similar counterparts, which has always puzzled me - I wonder if there are publicly available official selection criteria for new Apache projects?
This doesn't bother the Apache Software Foundation. So long as a project is able to follow the "Apache Way" (licensing, treatment of source artifacts, organizational structure of project members and community, etc.), then the project can exist in the ASF without conflict of other similar projects.
The ASF's mission is not focused on bringing one best-of-breed product in every category. More, the ASF is focused on the stewardship and delivery of the software. They are more interested in the organization and community surrounding the project than the code itself.
Cool project. On a side note, Apache Drill (https://drill.apache.org/) is also worth checking out. The homepage doesn't make it very apparent, but I've used it to query CSVs, etc., using SQL.
Consider adding a server/daemon with jobs and scheduling. This would let people copy and reconcile data between systems, or prep data at night, etc... It could also save indexes and table statistics between queries.
Make sure to let us know if you have any feedback after testing it out!
The Go standard library really helped with handling all the different JSON file formats you might encounter (like a whole-file table, or one record per line).
Hey, at work we have been forced to use Denodo. How does this compare to Denodo? Seems a lot better and doesn’t try to do any “data virtualization”. I am a bit scared by the query optimizer because the Denodo one is super buggy. Overall I think the concept is great and this might be a great companion to etl tools.
I used to work for a competitor to Denodo. Query optimization for data virtualization/federation across heterogeneous systems is very complicated - that much I can assure you.
You might be interested in Dremio (which is a modernized and open source version of Drill, but also has a commercial version) too. I had previously studied Spark, PrestoDb and Hive, among other things, for similar purposes.
If you ever want to chat feel free to contact me via my profile email.
If vis is the focus, I'd recommend Postgres FDW over anything else here, just because PG is supported by almost every vis software (metabase, redash, many Apache projects, Tableau, Google Data Studio (free) etc.)
I've been working on a new company that's doing cloud hosted postgres foreign data wrappers as a service (plus a bunch of interface / workflow stuff on top) and while I'm biased I totally agree. The postgres FDW system has gotten very sophisticated in later releases. It does performance optimizations like pushing down filters / aggregates, and it even goes further by supporting sending remote explain calls out to the underlying datasources to use in query planning. In one of my local tests vs. Denodo it smartly decided _not_ to push a group by down when the cardinality of the column being grouped on is very high (since the grouping doesn't reduce the number of rows sent over the wire very much), which blew me away.
If anyone is interested I'd love to chat: steve (at) syndetic.co
Hey, I have no idea really. I've never used Denodo, and a quick glance at their homepage doesn't tell me much, so I don't understand what they mean by data virtualization.
We're obviously nowhere near as battle-tested, this being the initial release, but hopefully we'll be able to compare favorably!
The query optimizer is fairly simple currently: mostly pushing filters down under maps, and pushing supported workloads down to the data sources. I don't know how it will evolve going forward, though ;)
When joining tables from 2 different data sources, is the join performed by loading two tables in full into memory ? If yes - any plans on improving this, so that it does not OOM or kill the remote database ?
If no - how do you handle this case ?
Hey, sorry for the late response. It's currently totally inefficient - we scan the whole file each time. We plan on adding an in-memory or on-disk index for such cases.
Dremio is quite cool and still something I am watching develop closely. Their involvement with the open source ecosystem is quite nice. Apache Drill, Arrow, Parquet and Gandiva are all super important contributions that the team behind Dremio has been involved in AFAIK.
A small tangential comment: In the README you mention octopi. Although not wrong as per the Oxford English Dictionary which lists octopuses, octopi and octopodes (in that order), normally it is better to use octopuses.