- Presto is not good at longer queries: if a node dies, the query fails and has to be restarted. It is, however, orders of magnitude faster than any of the other solutions when it comes to geospatial functions; the team behind it are simply wizards.
- BigQuery is also super fast and a fantastic tool for ad hoc analysis of huge amounts of data. They have just started to implement GIS functionality, so we are watching it closely. Some of our analysts have been stung on pricing where partitioning wasn't possible, meaning we were charged for scanning 10TB+ of data for a relatively simple query. It has a learning curve for sure!
- Redshift was our original data warehouse. It was great for prescribed data in an ETL pipeline, but scaling a cluster takes hours, and data skew meant that the entire cluster would fill up during queries if sort keys and distribution keys weren't precisely calibrated. That is quite difficult when you have changing dimensions of data.
- Spark / EMR / Tez has been our standout workhorse for many things now. It is much slower than any of the above, but there are many tools that work with Spark and the ecosystem is growing rapidly. We had to perform a cross join of 16B records to 140M ranges, and every single one of the above solutions either crapped out on us or became prohibitively expensive to run at scale and get meaningful output. Spark took longer (1h 25m) but the progress was steady and quantifiable.
Presto often died mid-query for a number of reasons (including that we wanted to run this on pre-emptible instances on GCP and it doesn't support fault tolerance).
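To make the shape of that records-to-ranges problem concrete, here is a minimal single-machine sketch in Python. It is not the Spark job described above, just an illustration of matching values to ranges without comparing every pair. It assumes the ranges are inclusive and non-overlapping, and `match_ranges` and the `(start, end, label)` tuple layout are made up for the example.

```python
from bisect import bisect_right


def match_ranges(values, ranges):
    """Yield (value, label) for each value that falls inside some range.

    Assumption: `ranges` holds (start, end, label) tuples, inclusive on
    both ends, and no two ranges overlap.
    """
    # Sort once by start so each lookup is a binary search, not a full scan.
    ordered = sorted(ranges, key=lambda r: r[0])
    starts = [r[0] for r in ordered]
    for value in values:
        # Index of the last range whose start is <= value (or -1 if none).
        i = bisect_right(starts, value) - 1
        if i >= 0 and value <= ordered[i][1]:
            yield value, ordered[i][2]


if __name__ == "__main__":
    ranges = [(10, 19, "a"), (30, 39, "b"), (0, 5, "c")]
    for value, label in match_ranges([3, 10, 20, 35], ranges):
        print(value, label)
```

A naive cross join does one comparison per (record, range) pair; sorting the ranges first cuts each lookup to a binary search. Overlapping ranges would need a different structure, such as an interval tree.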
File formats are a HUGE differentiator when it comes to these systems as well. We chose ORC as our file format due to the availability of bloom filters and predicate pushdown in Presto. This means we can load a 10TB dataset in a couple of minutes and query the files directly without having to specifically load them into a store.
Our order of preference is ORC > Parquet > Avro > CSV.
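For anyone unfamiliar with why bloom filters help here, below is a toy sketch in Python of the general idea: each chunk of a file carries a small filter, and a reader skips any chunk whose filter says the wanted value is definitely absent. This is not ORC's actual on-disk format or Presto's implementation; `BloomFilter`, `stripes_to_read`, the bit count, and the hash scheme are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive several bit positions by hashing the value with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def stripes_to_read(stripes, wanted):
    """Return the names of stripes whose filter cannot rule out `wanted`."""
    return [name for name, bloom in stripes if bloom.might_contain(wanted)]


if __name__ == "__main__":
    stripes = []
    for n in range(4):
        bloom = BloomFilter()
        for user_id in range(n * 100, (n + 1) * 100):
            bloom.add(user_id)
        stripes.append((f"stripe-{n}", bloom))
    # Only stripes that might hold user 250 need to be read from disk.
    print(stripes_to_read(stripes, 250))
```

The stripe that really holds the value is always in the result. Other stripes can occasionally show up as false positives, which costs an unnecessary read but never a wrong answer.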
Basically, I will say that these benchmarks are quite good for determining speed, but sometimes there are factors other than raw speed that will bite you in the ass unless you are aware of them :)