Speed is only one factor to consider in a data warehouse. Having moved through three of these in the past year, we have settled on a mix of Spark, Presto and BigQuery depending on the workload.

- Presto is not good at longer queries: if a node dies mid-query, the whole query fails and has to be restarted. It is, however, orders of magnitude faster than any of the other solutions when it comes to geospatial functions - the team behind it are simply wizards. (There's a sketch of a geospatial query after this list.)

- BigQuery is also super fast and a fantastic tool for ad-hoc analysis of huge amounts of data, and GIS functionality has just started to land, so we are watching it closely. Some of our analysts have been stung on pricing where partitioning wasn't possible, meaning we were charged for scanning 10TB+ of data for a relatively simple query - it has a learning curve for sure! (A dry-run sketch for estimating scan costs also follows the list.)

- Redshift was our original data warehouse. It was great for prescribed data in an ETL pipeline, but scaling a cluster takes hours, and data skew meant the entire cluster would fill up during queries unless sort keys and distribution keys were precisely calibrated - quite difficult when the dimensions of your data keep changing. (The DDL sketch below shows what that calibration looks like.)

- Spark / EMR / Tez has been our standout workhorse for many things now. It is much slower than any of the above, but many tools work with Spark and the ecosystem is growing rapidly. We had to perform a cross join of 16B records against 140M ranges, and every single one of the above solutions either crapped out on us or became prohibitively expensive to run at that scale with meaningful output. Spark took longer (1h 25m), but progress was steady and quantifiable. Presto often died mid-query for a number of reasons (including that we wanted to run this on pre-emptible instances on GCP, and it doesn't support fault tolerance). The shape of that join is also sketched below.
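
To illustrate the Presto geospatial point, here is a minimal sketch using the presto-python-client package. The host, table and column names are hypothetical; ST_Contains, ST_Point and ST_GeometryFromText are standard Presto geospatial functions.

    # Minimal sketch: point-in-polygon count via Presto geospatial functions.
    # Host, table and column names are hypothetical.
    import prestodb

    conn = prestodb.dbapi.connect(
        host='presto-coordinator.internal',  # hypothetical coordinator
        port=8080,
        user='analyst',
        catalog='hive',
        schema='default',
    )
    cur = conn.cursor()
    cur.execute("""
        SELECT count(*)
        FROM events
        WHERE ST_Contains(
            ST_GeometryFromText('POLYGON ((-74.05 40.68, -73.90 40.68,
                                           -73.90 40.88, -74.05 40.88,
                                           -74.05 40.68))'),
            ST_Point(longitude, latitude)
        )
    """)
    print(cur.fetchone()[0])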
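
On the BigQuery pricing point, a dry run reports how many bytes a query would scan before you are billed for it. A minimal sketch with the official google-cloud-bigquery client (the project, dataset and table names are made up):

    # Minimal sketch: estimate BigQuery scan cost with a dry run.
    # Nothing is billed for a dry run; the table name is hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    sql = """
        SELECT user_id
        FROM `myproject.analytics.events`
        WHERE event_date = '2018-09-01'  -- prunes to one partition if the
                                         -- table is partitioned on event_date
    """
    job = client.query(sql, job_config=job_config)
    print("Would scan {:.2f} TB".format(job.total_bytes_processed / 1e12))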
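
For the Redshift point, distribution and sort keys are declared up front in the DDL, which is exactly the calibration that kept biting us. A sketch (hypothetical table and credentials; Redshift speaks the Postgres wire protocol, so psycopg2 works as a client):

    # Minimal sketch: Redshift table with explicit distribution and sort keys.
    # Cluster endpoint, credentials and schema are hypothetical.
    import psycopg2

    conn = psycopg2.connect(
        host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
        port=5439, dbname='warehouse', user='etl', password='...',
    )
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE events (
                user_id    BIGINT,
                event_time TIMESTAMP,
                payload    VARCHAR(4096)
            )
            DISTSTYLE KEY
            DISTKEY (user_id)     -- co-locates rows that join on user_id
            SORTKEY (event_time); -- enables range-restricted scans
        """)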
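
And the 16B-to-140M job above was shaped roughly like this range join in PySpark (a sketch with hypothetical column names and paths, not our production code):

    # Minimal sketch of a records-to-ranges join in PySpark.
    # Paths and column names (value, range_start, range_end) are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("range-join").getOrCreate()

    records = spark.read.orc("s3://bucket/records/")  # ~16B rows
    ranges = spark.read.orc("s3://bucket/ranges/")    # ~140M ranges

    # A non-equi join like this degenerates into a cross join plus a filter,
    # which is why progress is slow but steady rather than failing outright.
    joined = records.join(
        ranges,
        (col("value") >= col("range_start")) & (col("value") < col("range_end")),
    )
    joined.write.orc("s3://bucket/joined/")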

File formats are a HUGE differentiator when it comes to these systems as well. We chose ORC as our file format due to the availability of bloom filters and predicate pushdown in Presto, which means we can load a 10TB dataset in a couple of minutes and query the files directly without having to load them into a store first. A sketch of the write path is below.
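
Here is a minimal sketch of writing ORC with bloom filters in PySpark so that engines with predicate pushdown (such as Presto) can skip non-matching stripes. The option names are the standard ORC writer options; the paths and column choice are hypothetical.

    # Minimal sketch: write ORC with per-stripe bloom filters.
    # Paths and the filtered column are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orc-write").getOrCreate()
    df = spark.read.orc("s3://bucket/raw/")

    (df.write
       .option("orc.bloom.filter.columns", "user_id")  # columns to index
       .option("orc.bloom.filter.fpp", "0.05")         # false-positive rate
       .orc("s3://bucket/curated/"))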

Our order of preference is ORC > Parquet > Avro > CSV.

Basically, these benchmarks are quite good for determining speed, but there are factors other than raw speed that will bite you in the ass unless you are aware of them :)




Geospatial for BigQuery just went Beta:

https://cloud.google.com/bigquery/docs/gis-intro


I've been in the private beta for a while now, using it for ad-hoc queries. It's still missing a few things I require, but I think it will get better down the line!


Late reply, but which things is it missing for your needs? I work on the BigQuery team, so I can pass along any feedback that you have.


I've given most of this feedback in the beta group already. First off, I love the format conversion capabilities; it's just missing a few things like clustering algorithms. If parity can be achieved with the PostGIS ST_* functions, it will be fantastic!


Thanks!



