Querying Parquet with Precision Using DuckDB (duckdb.org)
129 points by mytherin on June 25, 2021 | 32 comments



I just downloaded DuckDB and used it to query Parquet files. It's amazing. Finally, an easy way to query Parquet data without spinning up a local Spark instance or doing pd.read_parquet. It's pretty fast too.

I've always felt that Parquet (or ORC) ought to be more mainstream than it is, but in practice the interactive tooling for Parquet has always been weak. For years there wasn't a good GUI viewer (well, there were a few half-hearted attempts) or a good SQL query tool for Parquet like there is for most relational databases, and lazily examining the contents of a large Parquet folder from the command line has always been a chore. At least now DuckDB's Parquet support moves the needle on the last point.


Glad that DuckDB is useful to you!

Some more notes that might be of use:

* We also have a command line interface (with syntax highlighting) [1]

* You can use .df() instead of .fetchall() to retrieve the results as a Pandas DataFrame. This is faster for larger result sets.

* You can also use DuckDB to run SQL on Pandas DataFrames directly [2]

[1] https://duckdb.org/docs/installation/

[2] https://duckdb.org/2021/05/14/sql-on-pandas.html
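
For the record, a minimal sketch of what the .df() call and SQL-on-Pandas look like in practice (the file and DataFrame names below are just placeholders):

    import duckdb
    import pandas as pd

    # retrieve a query result as a Pandas DataFrame instead of a list of tuples
    counts = duckdb.query("SELECT COUNT(*) AS n FROM 'some_file.parquet'").df()

    # run SQL directly against an existing Pandas DataFrame;
    # DuckDB picks up local DataFrames by their variable name
    my_df = pd.DataFrame({"x": [1, 2, 3]})
    print(duckdb.query("SELECT SUM(x) AS total FROM my_df").df())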


Thanks! Those sound like awesome features! (.df() is extremely useful).


I especially like (2). Well thought-out presentation, thanks!


Which GUI do you use now? For me, I have to use Impala via HUE, and only after I've created an external table from the Parquet file. I really don't like web-based GUIs, especially when my employer has a crappy server running the GUI rendering.


I don't because I haven't found a good one for Parquet. I would really like something like SSMS for SQL Server.

I would poke at local Parquet files either through a local instance of Spark (or Databricks, when I was using that for remote data folders) or use Pandas' read_parquet from Jupyter.
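
(For concreteness, the Jupyter side of that is basically just the following; the path is a placeholder:)

    import pandas as pd

    # quick look at a single Parquet file from a notebook
    df = pd.read_parquet("some_folder/part-00000.parquet")
    df.head()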


Same, I would really love a local-running tool like SSMS or Teradata SQL Assistant. CDSW has been good to use with Python (interacting with spark.sql) as a sort-of IDE, but it's also web-based. The best thing I've gotten to work is Python within VS Code (running locally), tunnelling to a CDSW project, where I can again run spark.sql. However, it still prints tables out as ASCII instead of the nice grid view of SSMS or Teradata SQL Assistant.


We used Apache Drill a few years back and it could do exactly that - run locally and provide a SQL interface to query parquet files. Still great to see more and better options!


Drill is a fair choice (it supports tons of formats), but for me it was a bit heavy -- it required Java, and it returned results in a form that wasn't easily manipulated (the result is a table drawn in the console with text characters [1]). If you had a really long result set, you couldn't do a buffered scroll due to text-UI limitations -- everything gets printed out, and you have to either paginate manually with LIMIT clauses or keep a really long scrollback buffer. That kind of breaks the continuity of many data-analysis workflows.

With DuckDB, the return result is a Python data structure (list of tuples), which is amenable to further manipulation without exporting to another format.
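
For example (a toy sketch; the file and column names are made up), the tuples can be sorted, sliced, or handed to other libraries without ever leaving Python:

    import duckdb

    # fetchall() gives back a plain list of tuples...
    rows = duckdb.query(
        "SELECT cluster_id, COUNT(*) AS n FROM 'comments.parquet' GROUP BY cluster_id"
    ).fetchall()

    # ...so further manipulation is just ordinary Python
    top_five = sorted(rows, key=lambda r: r[1], reverse=True)[:5]
    print(top_five)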

Many GUI DB tools also return results in a tabular form that can be copy-pasted into other GUI apps (or copy-pasted as CSV).

[1] https://drill.apache.org/docs/drill-in-10-minutes/#querying-...


You didn't connect to Drill using JDBC? You can use any open-source GUI with that, like DBeaver.


Oh I didn't think to do that. That's something I could definitely explore, particularly for apps like Tableau. Thanks!


> Finally, an easy way to query Parquet data without spinning up a local Spark instance

What's the burning pain point here? Spinning up a Spark shell is extremely quick and easy.


That's a fair question, and I guess my answer would be "ergonomics". I tend to work in Python, not Scala, and there's a certain amount of boilerplate necessary just to get Spark functional. For instance, this is the snippet to spin up local Spark (not the Spark shell) that I keep stored in a file and have to look up, because there's no way I can commit it to memory:

    import findspark
    findspark.init()  # locate the local Spark installation so pyspark can be imported
    from pyspark.sql import SparkSession

    # start (or reuse) a local SparkSession using all available cores
    sqlContext = SparkSession.builder.master("local[*]").getOrCreate()
    df = sqlContext.read.parquet('path-to-file/commentClusters.parquet')
Not to mention the fussy things you have to do in Spark to alias a table:

    df.createOrReplaceTempView("tableA")
    sqlContext.sql("SELECT count(*) from tableA").show()
Compare this to these 2 (!) DuckDB commands in the Python REPL that I just ran moments ago:

    >>> import duckdb
    >>> duckdb.query("SELECT COUNT(*) FROM '*.parquet'").fetchall()
    [(2,)]


> I tend to work in Python not Scala

Fair enough. I’m more comfortable in Scala than Python. Have you tried using a local Jupyter notebook? That would seem to tick all the boxes. https://jupyter.org/install


I do use Jupyter and the boilerplate code above is what I usually deploy in Jupyter.

Spark syntax is still just a touch more complicated than necessary for my needs, because it's solving the much more general distributed-computing problem (not to mention the impedance mismatch: Spark is very much a Scala-native app, and it feels that way to a Python user).

DuckDB on the other hand seems to provide a quick way to interact with local Parquet files in Python, which I appreciate.


DuckDB is awesome. I played with it as a pandas replacement a while back, because I prefer SQL to pandas' arcane API. A few things were slower than they should be, but it's clearly getting better all the time, and now that the basics are in place I'm looking forward to all sorts of crazy optimisation work. Well done and thanks, DuckDB team!


I tried out DuckDB before (previous comment: https://news.ycombinator.com/item?id=24531732 ) and thought it looked interesting, but this tutorial has really made me excited about the project.

I've been meaning to figure out Parquet for ages, but now it turns out that "pip install duckdb" gives me everything I need to query Parquet files in addition to the other capabilities DuckDB has!


Now, if only I could somehow take that Parquet-processing optimizer out of DuckDB and put it into a Postgres FDW reading Parquet files... that would make my ETLs work great. Does anyone know of another open-source project that uses an optimizer to perform queries on Parquet files?


Here is a DuckDB FDW for Postgres! I have not used it, but it sounds like what you need! https://github.com/alitrack/duckdb_fdw


DuckDB is designed to be used as an embedded library, so it should be totally feasible to use it as part of a FDW.

Alternatively, you can run DuckDB as part of a plpython function:

  CREATE FUNCTION pyduckdb ()
    RETURNS integer
  AS $$
    import duckdb
    con = duckdb.connect()
    return con.execute('select 42').fetchall()[0][0]
  $$ LANGUAGE plpythonu;
  SELECT pyduckdb();


There is a parquet fdw for postgres:

https://github.com/adjust/parquet_fdw


Yes, I know, but is it as smart as DuckDB?


I haven't tried it yet, but the parquet_fdw docs say you have to run ANALYZE on the table for optimal performance.


Why don't you just copy your Parquet data into Postgres?


Three reasons I can think of:

- Postgres is a row store optimized for transactional workloads, whereas Parquet is a column-oriented format optimized for analytical workloads.

- Database storage is typically very expensive SSD storage optimized for fast IO and high availability. Parquet files, on the other hand, can be stored in inexpensive object storage such as S3.

- Loading is an additional and possibly unnecessary step


Sure, but what is the scenario where you wouldn't want to use DuckDB, would want to use PostgreSQL instead, yet wouldn't want to load the data into PostgreSQL? DuckDB is already a SQL engine. The only additional benefit I'd imagine from PostgreSQL would be the stuff you get from it being on disk?

But that's why I'm asking, I don't know.


It depends on your situation: if you already have a Postgres instance, then the FDW is probably the easier solution.


Oh yeah that makes sense. I assumed the commenter was just looking for any system to run SQL over Parquet files. If you don't want to use this system that makes it easy, fair enough.


> If you don't want to use this system that makes it easy, fair enough.

Be kind. Don't be snarky. Have curious conversation; don't cross-examine. Please don't fulminate. Please don't sneer, including at the rest of the community.

https://news.ycombinator.com/newsguidelines.html


It didn't read as unkind or snarky at all to me - just an amicable way of ending a thread of investigation where the result was a simple "a ha, I understand" moment.


Vertica provides great support to work with Parquet files as external tables: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/Ad...

We use this feature to build a staging layer for incoming files before loading to the data warehouse.


The best part of this is just how easy it is: a single pip install and you're up and running with industry-standard, Postgres-style SQL!



