Query Parquet files in SQLite (cldellow.com)
84 points by cldellow on June 23, 2018 | 19 comments



I really liked that 2/3 of the post is a walk-through of solutions using existing tools before getting to the project.

I also like how it combines a "big data" format (parquet) with the main "your data isn't actually big data" tool of choice (sqlite).


Yes, as a demonstration of a good expository post, I rate this as 100% of what I'm looking for.


We used to pump CSV files into BigQuery from remote locations. Data availability became a problem because of choppy/low-bandwidth internet at many of the remote wind/solar sites we work in. We explored converting CSVs to Parquet format, mainly for the reduced size Parquet offers. BigQuery accepts Parquet files, but that support is still in beta. We then just zipped the CSV files, which reduced the size to almost 1/8, and BigQuery accepts zipped files directly. Works like a charm; the only downside is that if your CSV files have jagged rows, errors are thrown up. You can change some settings and have BigQuery accept a certain number of jagged rows.

If you are considering only file sizes, then I would recommend this zipping method. If you are more interested in query-time capabilities/advantages, then you must consider Parquet.
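For reference, a minimal sketch of that load path with the BigQuery Python client, assuming gzip-compressed CSVs in GCS; the bucket, dataset, and table names are made up:

    # Hedged sketch: load a compressed CSV from GCS into BigQuery,
    # tolerating a limited number of jagged/bad rows.
    # Bucket and table names are hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.CSV
    job_config.skip_leading_rows = 1      # header row
    job_config.allow_jagged_rows = True   # pad missing trailing columns with NULLs
    job_config.max_bad_records = 100      # tolerate up to 100 unparseable rows

    load_job = client.load_table_from_uri(
        "gs://my-bucket/readings/2018-06-22.csv.gz",
        "my_project.telemetry.readings",
        job_config=job_config,
    )
    load_job.result()  # wait for the load to finish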

We are still experimenting with Parquet because our dataset keeps increasing by about 100GB every day.


MySQL, Postgres etc. all support transparent compression. I'd be curious how small the database would end up after compression, and what the impact would be on querying time.

I'm skeptical it would be as good as the Parquet/SQLite option the author came up with (Postgres, I believe, does compression value by value; I can't remember how MySQL does it).


I can't speak for MySQL, but I suspect the Postgres compression you're referring to is TOAST (https://www.postgresql.org/docs/current/static/storage-toast...).

Its sweet spot is much larger rows. In fact, it only kicks in for rows whose content is larger than about 2KB (a quarter of the 8KB page size), so it doesn't trigger for this case, where the average row size is about 120 bytes (and only 80 bytes of that is content).

I bet you could build the DB, stop Postgres, move its data dir to a squashfs filesystem, and then start Postgres in read-only mode for a huge space savings with minimal query cost, though.

Hmm, in fact, it'd be easy to do that with the SQLite DB since it's just a single file. I might give that a shot.
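If anyone else wants to try it, a rough sketch of the SQLite variant (needs squashfs-tools and root for the mount; the file names and mount point are made up):

    # Rough sketch: compress a SQLite file with squashfs, mount it read-only,
    # and query it. Paths are hypothetical.
    import sqlite3
    import subprocess

    subprocess.run(["mksquashfs", "census.sqlite", "census.squashfs"], check=True)
    subprocess.run(["sudo", "mount", "-o", "loop,ro",
                    "census.squashfs", "/mnt/census"], check=True)

    # immutable=1 tells SQLite the file can't change underneath it,
    # so it skips locking on the read-only mount.
    conn = sqlite3.connect("file:/mnt/census/census.sqlite?immutable=1", uri=True)
    print(conn.execute("SELECT COUNT(*) FROM census").fetchone()[0])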


Squashing the SQLite file works pretty well -- it's a bit bigger and slower than Parquet, but maybe a reasonable trade off to not have to deal with Parquet.

I added a section at https://cldellow.com/2018/06/22/sqlite-parquet-vtable.html#s... to mention it. Thanks for the inspiration!


On the MySQL side, https://dev.mysql.com/doc/refman/8.0/en/innodb-compression-b....

It reads like row-level compression plus index compression? They claim indexes make up a fair chunk of the disk usage, so there may be advantages.


For analytics/OLAP you can use ZFS compression with a large block size; zstd support is just around the corner, too. I would still use compression for an OLTP database, but with a much lower block size, 16KB max.
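Something like the following, assuming an existing dataset (the pool/dataset name is made up, and this needs root):

    # Hedged sketch of the ZFS tuning above: large records plus lz4
    # (zstd isn't in mainline ZFS yet). "tank/olap" is a hypothetical dataset.
    import subprocess

    subprocess.run(["zfs", "set", "recordsize=1M", "tank/olap"], check=True)
    subprocess.run(["zfs", "set", "compression=lz4", "tank/olap"], check=True)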


If you have a database that supports SQL/MED, you can probably already query external data without loading/serialising between formats. On some systems you can even optionally generate an index.

MariaDB has the CONNECT engine, Postgres has foreign data wrappers (FDW), and for MSSQL you can define an OPENROWSET using an ODBC or OLE DB driver and some config. DB2 and Informix have federated options.

The author has missed a lot of detail about the breadth of options the FDW for CSV provides. Just like a normal table, you should define column names and ideally their types so you don't end up with just ordinal/generated column names (if there is no header line), which the author does anyway for SQLite. The same problem of denormalisation exists, in which case data scrubbing cannot be avoided; yet more serialisation eating precious cycles and memory.
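For example, a hedged sketch of a Postgres file_fdw table over a CSV with explicit column names and types; the connection string, file path, and column list are all made up:

    # Hedged sketch: file_fdw foreign table over a CSV with named, typed columns.
    # Connection details, file path, and columns are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=census")
    with conn, conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS file_fdw")
        cur.execute("CREATE SERVER IF NOT EXISTS csv_files "
                    "FOREIGN DATA WRAPPER file_fdw")
        cur.execute("""
            CREATE FOREIGN TABLE IF NOT EXISTS census_csv (
                geo_code        text,
                characteristic  text,
                total           bigint
            ) SERVER csv_files
            OPTIONS (filename '/data/census.csv', format 'csv', header 'true')
        """)
        cur.execute("SELECT count(*) FROM census_csv")
        print(cur.fetchone()[0])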

My point is, read the manual for your database, it might surprise you as to what it can already do, and how much time it can save you.


Hey there - I'm the author.

I think you may have missed the point of the article. This is essentially an FDW for SQLite so that it can read Parquet files, because CSVs are space inefficient.
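For the curious, usage from Python looks roughly like this; the extension and Parquet file paths here are just placeholders:

    # Roughly what usage looks like: load the extension, then expose a
    # Parquet file as a virtual table. Paths are placeholders.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.enable_load_extension(True)
    conn.load_extension("./libparquet")  # built from sqlite-parquet-vtable
    conn.enable_load_extension(False)

    conn.execute("CREATE VIRTUAL TABLE census USING parquet('census.parquet')")
    print(conn.execute("SELECT COUNT(*) FROM census").fetchone()[0])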


With the sqlite-parquet-vtable code on GitHub, the parquet/Makefile contents make it look a bit scary when wondering whether it will work on other platforms (e.g. Windows/OS X).

The "You're almost certainly going to regret your life." in the README.md goes along with that. ;)

How hard was it to get up and running for your Ubuntu 16.04 system?

Asking because we (sqlitebrowser.org) have started moving our Windows installer to be MSI based, and are considering adding useful SQLite extensions to the mix.

https://github.com/sqlitebrowser/sqlitebrowser/issues/1401#i...

Thoughts? :)


I'd suspect you'd have a hard time building it for other platforms, but once you got it to build, it'd work. It only uses the standard library and parquet-cpp. From what I can tell, parquet-cpp has a documented build experience for both Windows and OS X.

I may be overstating the difficulty of building it for Linux. I hadn't written C++ code in ages prior to starting this, so I'm sure I was tripping over things that others would sail past. I was also simultaneously trying to build a version of pyarrow that supported some bleeding edge features. However, I _definitely_ had an incorrect mental model of where the libraries were getting installed, which led to some wasted time and frustration.

With time and some emotional distance, I'd like to take a stab at cleaning up the Linux build so that it can hook into travis-ci and codecov.

I think once that's happened, getting the OS X build to work shouldn't be too difficult. I don't have an OS X machine, though, so I'm not in a great position to test that.

As for Windows... if I know very little about building C++ code the right way in Linux, I know nothing for Windows :( I do have a Windows machine, so I'm sure with enough goading I could be convinced to take a look at it. Feel free to open issues for an OS X and/or Windows build, although I can't guarantee a timely resolution to either of them!


Thanks, will do. :)


AWS Redshift can directly query Parquet files from S3, but I found that functionality to be somewhat poorly documented. It's not clear which Parquet types map to which Redshift types.


AWS Athena does the same and is fully managed. Works reasonably well unless you have lists with complex objects in them.
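If it helps, the rough shape of the Athena side via boto3; the bucket names, database, and table columns are made up, and this omits polling for query completion:

    # Hedged sketch: declare a Parquet-backed external table in Athena and
    # query it. Bucket names, database, and columns are hypothetical.
    import boto3

    athena = boto3.client("athena")

    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS census (
      geo_code string,
      characteristic string,
      total bigint
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/census-parquet/'
    """

    for sql in (ddl, "SELECT COUNT(*) FROM census"):
        athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": "default"},
            ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
        )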


Nice work! How did you convert CSV to Parquet? That is the step I currently struggle with; most tools I found require me to run some kind of server component.


The author released a tool called csv2parquet about 3 hours ago: https://github.com/cldellow/csv2parquet
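If you'd rather not add a tool, a local pyarrow sketch also works with no server component; the file names are placeholders:

    # Minimal local CSV -> Parquet conversion with pyarrow.
    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    table = pv.read_csv("census.csv")
    pq.write_table(table, "census.parquet", compression="snappy")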


I use Apache Drill, selecting from a CSV table into a Parquet table. If you use AWS, they now provide the ability to convert as well.


I have to admit I got stuck on just how small the Canadian census data is.

Like, arguably too large to efficiently work with as a single plain text file, but not quite so large as to warrant, you know, tooling.



