
The comparison was a little unfair. I didn't have to load the data over a busy network connection, and I didn't have to decrypt the data once it was loaded (I had the benefit of a firewall and the raw data on a USB3 hard drive). I think there was a conversion from CSV to Parquet on the cluster as well. And the engineer who set up the cluster was multitasking, so I'm sure there were some latency issues just from that. But my analysis still only took a few hours (5? Maybe 6?).

There are a handful of things that make a difference. First of all, don't use INSERTs; use the .import command. This alone is enough to saturate all the available write bandwidth on a 7200rpm drive. It is not transactional, so you don't have to worry about that overhead; it bypasses the query engine entirely and is really more like a shell command that marshals data directly into the table's on-disk representation. You can also disable journaling and increase the page size for a tiny additional boost.
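Not the parent's exact setup, but a minimal sketch of that load path (file, database, and table names are made up):

```shell
#!/bin/sh
# Generate a tiny stand-in CSV; in practice this would be the real dataset.
rm -f bulk.db data.csv
printf 'id,val\n1,a\n2,b\n3,c\n' > data.csv

sqlite3 bulk.db <<'EOF'
-- Tuning before the load: no rollback journal, larger pages.
PRAGMA journal_mode = OFF;
PRAGMA page_size = 65536;
-- With no pre-existing table, .import creates one, using the CSV
-- header row for column names, and streams the rows straight in.
.mode csv
.import data.csv t
EOF

sqlite3 bulk.db 'SELECT COUNT(*) FROM t;'   # prints 3
```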

Once imported into SQLite, you get the benefit of a binary representation, which (for my use case) really cut down on the dataset size for the read queries. I only had a single join, and it was against a dimension table that fit in memory, so the indexes were small and took insignificant time to build. One single table scan with some aggregation, and that was it.
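The read side might look something like this (schema and column names are invented for illustration): one index on the join key, then a single aggregating scan of the fact table joined against the small dimension table.

```shell
#!/bin/sh
rm -f read.db
sqlite3 read.db <<'EOF'
-- Toy fact/dimension schema standing in for the real dataset.
CREATE TABLE facts (dim_id INTEGER, amount REAL);
CREATE TABLE dims (dim_id INTEGER PRIMARY KEY, label TEXT);
INSERT INTO dims VALUES (1, 'a'), (2, 'b');
INSERT INTO facts VALUES (1, 10.0), (1, 5.0), (2, 2.5);
-- An index on the join key keeps the dimension lookups cheap.
CREATE INDEX idx_facts_dim ON facts(dim_id);
-- One table scan with aggregation: sum per dimension label.
SELECT d.label, SUM(f.amount)
  FROM facts f JOIN dims d ON f.dim_id = d.dim_id
 GROUP BY d.label;
EOF
```

The query prints one `label|sum` row per dimension value in SQLite's default list mode.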

Good tips. I'll give it a try!
