
Query Parquet files in SQLite - cldellow
https://cldellow.com/2018/06/22/sqlite-parquet-vtable.html
======
rmgraham
I really liked that 2/3 of the post is a walk through the solutions using
existing tools before getting to the project.

I also like how it combines a "big data" format (parquet) with the main "your
data isn't actually big data" tool of choice (sqlite).

~~~
banku_brougham
Yes, as a demonstration of a good expository post, I rate this as 100% of what
I was looking for.

------
minpur
We used to pump CSV files into BigQuery from remote locations. Data
availability became a problem because of the choppy/low-bandwidth internet at
many of the remote wind/solar sites we work in. We explored converting CSVs to
Parquet format, mainly for the reduced size Parquet offers. BigQuery accepts
Parquet files, but that support is still in beta. We then just zipped the CSV
files, which reduced their size to almost 1/8, and BigQuery accepts zipped
files directly. It works like a charm; the only downside is that if your CSV
files have jagged rows, errors are thrown. You can change some settings to
have BigQuery accept a certain number of jagged rows.

If you are only considering file sizes, then I would recommend this zipping
method. If you are more interested in query-time capabilities/advantages, then
you should consider Parquet.

We are still experimenting with Parquet because our dataset keeps growing by
about 100GB every day.
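
A rough sketch of that pipeline, assuming the google-cloud-bigquery Python
client; the bucket, dataset, table, and file names below are placeholders, not
details from the comment:

    import gzip
    import shutil

    from google.cloud import bigquery

    # gzip the CSV before shipping it over the slow link.
    with open("readings.csv", "rb") as src, gzip.open("readings.csv.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,     # header line
        allow_jagged_rows=True,  # accept rows with missing trailing columns
        max_bad_records=100,     # tolerate a bounded number of bad rows
        autodetect=True,
    )

    # Assumes the gzipped file has already been copied up to GCS.
    load_job = client.load_table_from_uri(
        "gs://example-bucket/readings.csv.gz",
        "example_dataset.readings",
        job_config=job_config,
    )
    load_job.result()  # wait for the load to finish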

------
Twirrim
MySQL, Postgres etc. all support transparent compression. I'd be curious how
small the database would end up after compression, and what the impact would
be on querying time.

I'm skeptical it would be as good as the Parquet/SQLite option the author came
up with (Postgres, I believe, does compression value-by-value; I can't remember
how MySQL does it).

~~~
cldellow
I can't speak for MySQL, but I suspect the Postgres compression you're
referring to is TOAST
([https://www.postgresql.org/docs/current/static/storage-
toast...](https://www.postgresql.org/docs/current/static/storage-toast.html)).

Its sweet spot is much larger rows. In fact, it only kicks in when a row's
content exceeds a threshold of about 2KB (a quarter of the default 8KB page
size), so it doesn't trigger in this case, where the average row is about 120
bytes (and only 80 bytes of that is content).

I bet you could build the DB, stop Postgres, move its data dir to a squashfs
filesystem, and then start Postgres in read-only mode for a huge space savings
with minimal query cost, though.

Hmm, in fact, it'd be easy to do that with the SQLite DB since it's just a
single file. I might give that a shot.

~~~
cldellow
Squashing the SQLite file works pretty well -- it's a bit bigger and slower
than Parquet, but maybe a reasonable trade-off to avoid having to deal with
Parquet.

I added a section at [https://cldellow.com/2018/06/22/sqlite-parquet-
vtable.html#s...](https://cldellow.com/2018/06/22/sqlite-parquet-
vtable.html#sqlite-database-indexed--squashed) to mention it. Thanks for the
inspiration!
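
For the curious, querying the squashed database is just a matter of opening it
read-only from wherever the image is mounted. A minimal sketch (the mount point
and table name are placeholders); the immutable flag tells SQLite the file
can't change, so it skips locking and journal checks:

    import sqlite3

    # The squashfs image is assumed to be mounted at /mnt/census-squash.
    conn = sqlite3.connect(
        "file:/mnt/census-squash/census.sqlite?mode=ro&immutable=1", uri=True
    )
    cur = conn.execute("SELECT COUNT(*) FROM census")
    print(cur.fetchone()[0])
    conn.close()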

------
mathnode
If you have a database that supports SQL/MED, you can probably already query
external table data without loading/serialising between formats. On some
systems you can even optionally generate an index.

MariaDB has the CONNECT engine, Postgres has foreign data wrappers (FDW), and
for MSSQL you can define an OPENROWSET using an ODBC or OLE DB driver and some
config. DB2 and Informix have federated options.

The author has missed a lot of detail about the breadth of options the FDW for
CSV provides. Just like with a normal table, you should define column names and
ideally their types, so you don't end up with only ordinal/generated column
names (if there is no header line) -- which the author does anyway for SQLite.
The same problem of denormalisation exists, in which case data scrubbing cannot
be avoided: yet more serialisation eating precious cycles and memory.

My point is: read the manual for your database. It might surprise you with what
it can already do, and how much time it can save you.
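
For example, a Postgres file_fdw setup looks roughly like this, driven from
Python with psycopg2 here (the connection string, CSV path, and column list
are placeholders):

    import psycopg2

    # Declare the CSV as a foreign table with explicit column names and types,
    # rather than relying on whatever the header line contains.
    ddl = """
    CREATE EXTENSION IF NOT EXISTS file_fdw;
    CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;
    CREATE FOREIGN TABLE census (
        census_year integer,
        geo_code    text,
        population  bigint
    ) SERVER csv_files
    OPTIONS (filename '/data/census.csv', format 'csv', header 'true');
    """

    with psycopg2.connect("dbname=scratch") as conn:
        with conn.cursor() as cur:
            cur.execute(ddl)  # run once to register the foreign table
            cur.execute(
                "SELECT census_year, SUM(population) FROM census GROUP BY 1"
            )
            print(cur.fetchall())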

~~~
cldellow
Hey there - I'm the author.

I think you may have missed the point of the article. This is essentially an
FDW for SQLite so that it can read Parquet files, because CSVs are space
inefficient.

------
justinclift
Looking at the sqlite-parquet-vtable code on GitHub, the parquet/Makefile
contents make it look a bit scary when wondering whether it will work on other
platforms (e.g. Windows/OS X).

The "You're almost certainly going to regret your life." in the README.md goes
along with that. ;)

How hard was it to get up and running for your Ubuntu 16.04 system?

Asking because we (sqlitebrowser.org) have started moving our Windows installer
to be MSI based, and are considering adding useful SQLite extensions to the
mix.

[https://github.com/sqlitebrowser/sqlitebrowser/issues/1401#i...](https://github.com/sqlitebrowser/sqlitebrowser/issues/1401#issuecomment-399766764)

Thoughts? :)

~~~
cldellow
I suspect you'd have a hard time building it for other platforms, but once you
got it to build, it'd work. It only uses the standard library and parquet-cpp.
From what I can tell, parquet-cpp has a documented build process for both
Windows and OS X.

I may be overstating the difficulty of building it for Linux. I hadn't written
C++ code in ages prior to starting this, so I'm sure I was tripping over
things that others would sail past. I was also simultaneously trying to build
a version of pyarrow that supported some bleeding edge features. However, I
_definitely_ had an incorrect mental model of where the libraries were getting
installed, which led to some wasted time and frustration.

With time and some emotional distance, I'd like to take a stab at cleaning up
the Linux build so that it can hook into travis-ci and codecov.

I think once that happened, getting the OS X build to work shouldn't be too
difficult. I don't have an OS X machine, though, so I'm not in a great
position to test that.

As for Windows... if I know very little about building C++ code the right way
in Linux, I know nothing for Windows :( I do have a Windows machine, so I'm
sure with enough goading I could be convinced to take a look at it. Feel free
to open issues for an OS X and/or Windows build, although I can't guarantee a
timely resolution to either of them!

~~~
justinclift
Thanks, will do. :)

------
adrianN
AWS Redshift can directly query Parquet files from S3, but I found that
functionality to be somewhat poorly documented. It's not clear which Parquet
types map to which Redshift types.

~~~
bob_roboto
AWS Athena does the same and is fully managed. It works reasonably well unless
you have lists with complex objects in them.

------
brianoconnor
Nice work! How did you convert CSV to Parquet? That is the step I currently
struggle with; most tools I found require me to run some kind of server
component.

~~~
simonw
The author released a tool called csv2parquet about 3 hours ago:
[https://github.com/cldellow/csv2parquet](https://github.com/cldellow/csv2parquet)
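
If you'd rather not add another tool, pyarrow alone can also do the conversion
with no server component. A minimal sketch, assuming a reasonably recent
pyarrow (file names are placeholders):

    import pyarrow.csv as pacsv
    import pyarrow.parquet as pq

    # Read the CSV (column types are inferred) and write it back out as
    # snappy-compressed Parquet.
    table = pacsv.read_csv("census.csv")
    pq.write_table(table, "census.parquet", compression="snappy")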

------
tylerl
I have to admit I got stuck on just how small the Canadian census data is.

Like, _arguably_ too large to work with efficiently as a single plain text
file, but not quite large enough to warrant, you know, tooling.

