
BigQuery vs. Redshift vs. Athena simple benchmark - gabriel_b
https://www.gab.lc/articles/bigquery-vs-redshift-vs-athena
======
rsynnott
For a dataset this size, though, you'd probably realistically be using
ORC/Parquet rather than CSV with Athena, which would cut query times and cost
dramatically. Note that the table shows Athena scanning all the data on each
query; that would not be the case if ORC or Parquet were used.
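
A rough sketch of the Parquet version of the table, via boto3 (bucket,
prefix, and columns here are made up, not from the article):

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Same table, backed by Parquet: Athena then reads only the columns a
    # query touches instead of scanning the whole CSV every time.
    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS events_parquet (
      event_time timestamp,
      user_id    string,
      value      double
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/events-parquet/'
    """

    athena.start_query_execution(
        QueryString=ddl,
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )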

~~~
mej10
Do you know of any good ways of outputting log data directly into Parquet
format without hadoop/spark?

~~~
mason55
I'm curious about this as well. I was evaluating Athena a few months ago and
was surprised that there was no good way to get my data into Parquet or ORC
format without spinning up an EMR cluster and loading into a Hive table of the
desired format.

My guess is that in the past there was no real use for these formats unless
you already had a Hadoop cluster running. If Amazon wants these "Hadoop as a
service" concepts to take off it seems like it would be wise for them to make
it easier to get data onto S3 in a better format than CSV.
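
One cluster-free way to do the conversion is pandas + pyarrow; a minimal
sketch, with file, bucket, and key names made up (and assuming the chunks
infer a consistent schema):

    import boto3
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Stream the CSV through in chunks so a multi-GB file never has to fit
    # in memory, appending each chunk to a single Parquet file.
    writer = None
    for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
        table = pa.Table.from_pandas(chunk)
        if writer is None:
            writer = pq.ParquetWriter("events.parquet", table.schema)
        writer.write_table(table)
    writer.close()

    boto3.client("s3").upload_file("events.parquet", "my-bucket",
                                   "events-parquet/events.parquet")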

~~~
arnon
Mostly what I'm hearing here is:

"You're using it wrong"

"You should start up a Hadoop cluster, create a table, load the CSV into that
table, then export it as Parquet, and then load that into S3, so that Athena
can scan it"

Wouldn't you honestly just be better off creating a table and loading the CSV
directly into a columnar database, like on Amazon RDS (even if it means
bringing the server instance down after you're done)?

------
aanfhn
Loading a single, large file into Redshift? I get the impression the author
has, at best, a passing knowledge of Redshift. I'm not too familiar with
BigQuery, but for Redshift the recommended approach is to split the load into
multiple files (ideally a multiple of the number of slices in the cluster) so
every slice loads in parallel. No wonder it took 9+ hours to load that file.
The author also doesn't mention distributing the data on a particular column;
I wonder if he used random distribution.
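
The usual pattern is roughly this (cluster, bucket, and role names below are
placeholders): split and gzip the file into a multiple of the slice count,
put the parts under one S3 prefix, and issue a single COPY, which loads the
parts in parallel across slices.

    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                            port=5439, dbname="dev", user="master",
                            password="...")
    with conn, conn.cursor() as cur:
        # One COPY over a key prefix picks up every part file
        # (events/part_000.gz, part_001.gz, ...) and spreads the load
        # across all slices.
        cur.execute("""
            COPY events
            FROM 's3://my-bucket/events/part_'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
            GZIP
            DELIMITER ','
            TIMEFORMAT 'auto';
        """)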

And his problem with the field delimiter really shows a lack of experience;
of course you can't use a multi-character field delimiter. I've never seen
anyone use comma+space as a delimiter. That file could probably be about 75%
of its original size if he re-created it with just a comma as the field
delimiter.

Sorry to bash on the author - I don't mean to sound harsh, but a lot of
people are publishing benchmarks with minimal context and whatnot.

~~~
gabriel_b
> of course you can't use a multi-character field delimiter

It is indeed an unconventional format, but that's how the source CSV is
formatted at
[https://sdm.lbl.gov/fastbit/data/samples.html](https://sdm.lbl.gov/fastbit/data/samples.html)
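
If anyone wants to normalize it first, Python's csv module handles the
comma+space case directly via skipinitialspace; a quick sketch (file names
made up):

    import csv

    # Rewrite the comma+space delimited source as plain CSV;
    # skipinitialspace makes the reader swallow the space after each comma.
    with open("stardata.csv", newline="") as src, \
         open("stardata_clean.csv", "w", newline="") as dst:
        reader = csv.reader(src, skipinitialspace=True)
        writer = csv.writer(dst)
        writer.writerows(reader)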

------
openasocket
The redshift loading result seems suspect to me. I know firsthand that
redshift can scale to load over a trillion records / hour (with a big enough
cluster). Even with a basic setup this should be at least an order of
magnitude faster. I'm not sure exactly what the problem is; maybe try
breaking the file up into smaller batches and loading those.

It would also be helpful to see what schema you used for redshift,
specifically the encoding and the distribution and sort key(s).

To give Athena more of a fighting chance, other people have mentioned Parquet
or ORC, but also remember to partition the data. Generally you're supposed to
give Athena a directory with data partitioned into different subfolders based
on field values. Like if you're dealing with time-series type data you can
partition your data in the format
"year=<xxxx>/month=<yy>/day=<zz>/<uuid>.csv". I'm guessing you should do that
for this data partitioning by eventTime but it kind of depends. Of course then
you need some other component to put the data into S3 into the desired format,
and you should probably count that as part of the loading time.
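
Roughly what that looks like on the Athena side (table, bucket, and column
names are made up; Parquet per the other comments):

    import boto3

    athena = boto3.client("athena")
    results = {"OutputLocation": "s3://my-bucket/athena-results/"}

    # Partition columns are declared separately from data columns and map
    # onto the year=/month=/day= directories in the S3 keys.
    athena.start_query_execution(QueryString="""
        CREATE EXTERNAL TABLE IF NOT EXISTS events (
          event_time timestamp,
          value      double
        )
        PARTITIONED BY (year string, month string, day string)
        STORED AS PARQUET
        LOCATION 's3://my-bucket/events/'
    """, ResultConfiguration=results)

    # Register whatever partitions already exist under the prefix.
    athena.start_query_execution(QueryString="MSCK REPAIR TABLE events",
                                 ResultConfiguration=results)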

DISCLAIMER: work for AWS, not on the redshift or athena teams, though I do use
redshift for work.

------
pingles
Difficult to draw much (or any) conclusion without the Redshift schema.

From direct experience working on a moderately sized Redshift cluster,
performance is horrific once you start querying tables outside of the sort
and distribution keys they were designed with. That's fine when you have a
well-defined use case, but we couldn't make Redshift work for ad-hoc analysis
or queries across lots of dimensions.
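
For anyone unfamiliar, "designed with" means keys baked into the DDL,
something like this (columns and key choices here are illustrative only):

    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                            port=5439, dbname="dev", user="master",
                            password="...")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE events (
              event_time timestamp,
              user_id    varchar(64),
              value      double precision
            )
            DISTKEY (user_id)      -- joins/filters on user_id stay node-local
            SORTKEY (event_time);  -- time-range scans can skip most blocks
        """)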

In contrast, I've seen BigQuery perform much more consistently in the face of
large datasets with varying workloads.

------
rockostrich
I think you can get the "data scanned per query" from Redshift by querying the
STL_QUERY_METRICS table.

[http://docs.aws.amazon.com/redshift/latest/dg/r_STL_QUERY_ME...](http://docs.aws.amazon.com/redshift/latest/dg/r_STL_QUERY_METRICS.html)

I would be interested in seeing the settings for the table in Redshift. I
don't know much about how BigQuery works, but Redshift relies heavily on
distribution and sort keys to efficiently query data.
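
Something along these lines should work (column names per the linked doc;
Redshift blocks are 1 MB each, so blocks_read gives a rough bytes-scanned
figure; the query id is a placeholder, see STL_QUERY for real ids):

    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                            port=5439, dbname="dev", user="master",
                            password="...")
    with conn.cursor() as cur:
        cur.execute("""
            SELECT query, segment, step_type, rows, blocks_read
            FROM stl_query_metrics
            WHERE query = %s
            ORDER BY segment, step_type;
        """, (12345,))
        for row in cur.fetchall():
            print(row)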

~~~
gabriel_b
I will look into that, thanks

------
scapecast
@gabriel_b - can you describe the process / command you used to load data
into Redshift from S3? I have a hard time believing those huge load times.

