I just downloaded duckdb and used it to query parquet files. It's amazing. Finally, an easy way to query Parquet data without spinning up a local Spark instance or doing pd.read_parquet. It's pretty fast too.
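For anyone who hasn't tried it yet, this is roughly what it looks like from Python (a minimal sketch; the file name is just a placeholder):

    import duckdb

    con = duckdb.connect()  # in-memory database, nothing to configure

    # DuckDB can scan a Parquet file directly in the FROM clause
    print(con.execute("SELECT COUNT(*) FROM 'taxi.parquet'").fetchall())
    print(con.execute("SELECT * FROM 'taxi.parquet' LIMIT 5").fetchall())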
I've always felt that Parquet (or ORC) ought to be more popular mainstream formats than they are, but in practice the interactive tooling for Parquet has always been weak. For years, there wasn't a good GUI viewer (well, there were a few half-hearted attempts) or a good SQL query tool for Parquet like there is for most relational databases, and lazily examining the contents of a large Parquet folder from the command line has always been a chore. At least now, DuckDB's parquet support moves the needle on the last point.
Which GUI do you use now? For me, I have to use Impala via HUE, and only after I've created an external table over the parquet file. I really don't like web-based GUIs, especially when my employer has a crappy server doing the GUI rendering.
I don't because I haven't found a good one for Parquet. I would really like something like SSMS for SQL Server.
I would poke at local Parquet files either through a local instance of Spark (or Databricks, when I was using that for remote data folders) or with pandas' read_parquet from Jupyter.
Same, I would really love a local-running tool like SSMS or Teradata SQL Assistant. CDSW has been good to use with Python (interacting with spark.sql) as a sort-of IDE, but it's also web-based. The best thing I've gotten to work is Python within VS Code (running locally) tunnelling to a CDSW project, where I can again run spark.sql. However, it still prints tables out as ASCII art instead of the nice grid you get in SSMS or Teradata SQL Assistant.
We used Apache Drill a few years back and it could do exactly that - run locally and provide a SQL interface to query parquet files. Still great to see more and better options!
Drill is a fair choice (it supports tons of formats), but for me it was a bit heavy -- it required Java, and it returned results in a form that wasn't easily manipulated (the result is a table drawn in the console with text characters [1]). If you had a really long result set, you couldn't do a buffered scroll due to text-UI limitations -- everything gets printed out, and you either do manual pagination with LIMIT clauses or keep a really long scrollback buffer. That kind of breaks the continuity of many data analysis workflows.
With DuckDB, the return result is a Python data structure (list of tuples), which is amenable to further manipulation without exporting to another format.
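A rough sketch of what I mean (the file and column handling here are placeholders):

    import duckdb

    con = duckdb.connect()
    rows = con.execute("SELECT * FROM 'events.parquet' LIMIT 100").fetchall()

    # rows is an ordinary list of tuples, so normal Python applies
    first_ten = rows[:10]
    first_columns = [r[0] for r in rows]

    # and if a DataFrame is more convenient, that's one call away
    df = con.execute("SELECT * FROM 'events.parquet' LIMIT 100").df()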
Many GUI DB tools return results in a tabular form that can be copy-pasted into other GUI apps (or copied out as CSV).
That's a fair question, and I guess my answer would be "ergonomics". I tend to work in Python, not Scala, and there's a certain amount of boilerplate needed just to get Spark functional. For instance, here is the kind of snippet I use to spin up local Spark (not Spark Shell); I keep it stored in a file and have to look it up, because there's no way I can commit it to memory:
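Something along these lines (the exact options and values vary; these are illustrative):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")                    # run Spark locally on all cores
        .appName("poke-at-parquet")            # arbitrary app name
        .config("spark.driver.memory", "4g")   # illustrative setting
        .getOrCreate()
    )

    df = spark.read.parquet("/path/to/folder")  # placeholder path
    df.createOrReplaceTempView("data")
    spark.sql("SELECT COUNT(*) FROM data").show()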
Fair enough. I’m more comfortable in Scala than Python. Have you tried using a local Jupyter notebook? That would seem to tick all the boxes.
https://jupyter.org/install
I do use Jupyter and the boilerplate code above is what I usually deploy in Jupyter.
Spark syntax is still just a touch more complicated than necessary for my needs because it's solving the much more general distributed-computing problem (not to mention the impedance mismatch -- Spark is very much a Scala-native app, and it feels that way to a Python user).
DuckDB on the other hand seems to provide a quick way to interact with local Parquet files in Python, which I appreciate.
DuckDB is awesome. I played with it as a pandas replacement a while back, because I prefer SQL to pandas' arcane API. A few things were slower than they should be, but it's clearly getting better all the time, and now that the basics are in place I'm looking forward to all sorts of crazy optimisation work.
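In case anyone is curious what the pandas-replacement usage looks like, a small sketch (the frame and column names are made up):

    import duckdb
    import pandas as pd

    # an ordinary in-memory DataFrame; names and values are made up
    orders = pd.DataFrame({
        "customer": ["a", "b", "a", "c"],
        "amount":   [10.0, 5.5, 7.25, 3.0],
    })

    # DuckDB can run SQL directly over DataFrames in the local scope,
    # replacing the equivalent groupby/agg chain in pandas
    result = duckdb.query("""
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
        ORDER BY total DESC
    """).df()
    print(result)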
Well done, and thanks, DuckDB team!
I tried out DuckDB before (previous comment: https://news.ycombinator.com/item?id=24531732 ) and thought it looked interesting, but this tutorial has really made me excited about the project.
I've been meaning to figure out Parquet for ages, but now it turns out that "pip install duckdb" gives me everything I need to query Parquet files in addition to the other capabilities DuckDB has!
Now, if only I could somehow take that parquet processing optimizer out of DuckDB and into a Postgres fdw reading parquet files... that would make my ETLs work great.
Does anyone know of another open source project that uses an optimizer to perform queries on parquet files?
- Postgres is a row store optimized for transactional workloads, whereas Parquet is a column-oriented format optimized for analytical workloads.
- Database storage is typically very expensive SSD storage optimized for fast IO and high availability. Parquet files, on the other hand, can be stored in inexpensive object storage such as S3.
- Loading is an additional and possibly unnecessary step.
Sure, but what is the scenario where you wouldn't want to use DuckDB and you'd want to use PostgreSQL instead, but wouldn't want to load into PostgreSQL? DuckDB is already a SQL engine. The only additional benefit I'd imagine from PostgreSQL would be the things you get from it being on disk?
Oh yeah that makes sense. I assumed the commenter was just looking for any system to run SQL over Parquet files. If you don't want to use this system that makes it easy, fair enough.
> If you don't want to use this system that makes it easy, fair enough.
Be kind. Don't be snarky. Have curious conversation; don't cross-examine. Please don't fulminate. Please don't sneer, including at the rest of the community.
It didn't read as unkind or snarky at all to me - just an amicable way of ending a thread of investigation where the result was a simple "a ha, I understand" moment.