
Redshift Performance and Cost - AirbnbNerds
http://nerds.airbnb.com/redshift-performance-cost
======
TY
Redshift is based on ParAccel, not on Postgres. ParAccel exposes a Postgres-like
API for historical reasons, but the underlying technology is different.

For a basic overview: <http://en.wikipedia.org/wiki/Paraccel>

As for the rest of the article, it feels like basic Data Warehousing 101
rediscovered. It should have been titled "Analytics: Back To The Future" :-)

~~~
meritt
No kidding. The number of startups that have flocked to Hadoop for "data
analytics" over the past five years is extremely disheartening. Almost all of
these cases would be far better served by an off-the-shelf RDBMS, let alone a
column-oriented one. Same thing with MongoDB.

How much time and money would have been saved learning Database
Theory/SQL/Data Warehousing/Dimensional Modeling instead of cramming
everything into an unstructured data-store?

~~~
mixedbit
Which off-the-shelf RDBMS can handle queries over 3 billion rows?

~~~
meritt
Counter-question: Which startup has an actual data table with over 3 billion
rows?

~~~
PanMan
We have just crossed 2 billion items in our datastore. While not 3 billion
yet, I expect that to happen later this year.

Too bad Redshift can't handle JSON files: Converting everything will be
annoying.

~~~
fujibee
Our idea is to convert from JSON continuously while loading into Redshift.
<http://www.hapyrus.com/pages/flydata-for-redshift>
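
The conversion PanMan dreads is mostly mechanical; here is a minimal sketch (field names and delimiter are hypothetical, not from the article) of flattening newline-delimited JSON into pipe-delimited rows that a Redshift COPY could ingest:

```python
import csv
import io
import json

def jsonl_to_delimited(jsonl_text, fields, delimiter="|"):
    """Flatten newline-delimited JSON records into delimited rows.
    Field names passed in `fields` are illustrative, not Redshift-specific."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Missing keys become empty strings rather than raising KeyError.
        writer.writerow([record.get(f, "") for f in fields])
    return out.getvalue()

sample = '{"id": 1, "city": "Lisbon"}\n{"id": 2, "city": "Porto"}'
print(jsonl_to_delimited(sample, ["id", "city"]))
```

In practice you would also need to handle nested objects and escaping, which is presumably what a hosted converter takes care of.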

------
panarky
The article mentions this briefly, but it should be emphasized: parallel
loading from S3 is MUCH faster.

This weekend I loaded 2 billion rows from S3 both ways:

- From a single gzipped object: 4 hours 42 minutes

- From 2000 gzipped slices of 1M rows each: 17 minutes

(Loading from gzipped files is considerably faster, in addition to saving S3
charges.)

The article notes that choice of distribution key is critical. I'd add that
choice of sort key is equally important. In my testing, a better sort key
improved compression from 1.5:1 to 4:1, and also made common queries 5x
faster.

Unfortunately, you only get one dist key and one sort key per table, so less
common queries could get slower.
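
The compression effect of a good sort key can be seen even with plain gzip-style compression: sorting puts duplicate values next to each other, which LZ-family compressors exploit. A toy demonstration (illustrative only; Redshift uses its own column encodings, not zlib):

```python
import random
import zlib

random.seed(0)
# Simulate a low-cardinality column of 50,000 values, stored as text
values = [random.randint(0, 99) for _ in range(50_000)]
unsorted_bytes = "\n".join(map(str, values)).encode()
sorted_bytes = "\n".join(map(str, sorted(values))).encode()

unsorted_size = len(zlib.compress(unsorted_bytes))
sorted_size = len(zlib.compress(sorted_bytes))
print(unsorted_size, sorted_size)  # sorted data compresses far smaller
```

The same row content compresses to a fraction of the size once sorted, which is roughly the mechanism behind the 1.5:1 vs 4:1 ratios observed above.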

~~~
fujibee
Also, the more instances you launch in a cluster, the faster the load. Our
survey: [http://www.slideshare.net/Hapyrus/scalability-of-amazon-
reds...](http://www.slideshare.net/Hapyrus/scalability-of-amazon-redshift-
data-loading-and-query-speed) We also tried loading many more, smaller files
(5MB each), but that takes longer in total. We're trying to find the right
file size and file count.

------
csarva
This is the second article I've seen where the authors forget to multiply by
the number of Redshift nodes. A single XL node is $0.85/hr, so 16 nodes would
be $13.60/hr. Still cheaper than their Hive configuration, obviously, but less
than a buck?

~~~
taf2
damn, yeah - I initially thought: wow, for $632.40 a month I'm setting this up
next week, or even later this week, provisioning Redshift - but at 16x that,
$10,118.40 a month, I'll continue to wait... probably a good thing, better to
stay focused on features anyway
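
The arithmetic in this subthread checks out; a quick sketch (assuming the $0.85/hr XL on-demand rate quoted above, and a 31-day month, which the $632.40 figure implies):

```python
HOURLY_RATE = 0.85         # $/hr per XL node, as quoted in the thread
NODES = 16
HOURS_PER_MONTH = 24 * 31  # the $632.40 figure implies a 31-day month

single_node_month = HOURLY_RATE * HOURS_PER_MONTH  # 632.40
cluster_hourly = HOURLY_RATE * NODES               # 13.60
cluster_month = single_node_month * NODES          # 10118.40

print(round(single_node_month, 2), round(cluster_hourly, 2),
      round(cluster_month, 2))
```

So the per-node monthly price and the full 16-node cluster price differ by exactly the factor csarva points out.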

------
monstrado
Which storage format did you use for Hive? This matters a lot for how
performance plays out: are you using Snappy or LZO compression? Also, here is
a relevant comment from a Hive committer
(<http://news.ycombinator.com/item?id=5248485>).

------
twog
Still running on Posterous. Weird to think that this is probably the last
Posterous blog post I will read.

------
polskibus
It's good to see a Redshift evaluation. I'm wondering how Redshift compares to
Airbnb's Hadoop setup when taking data loading and transformation into
consideration, as well as running aggregate queries. I mean, if you want to
run analyses fairly often, do you need to reload everything into Redshift?
From a maintenance point of view, is the Hadoop setup more flexible and
cheaper than Redshift?

~~~
bloomfilter
The Hadoop setup we have is actually EMR, and we use S3 for data storage, so
for us Hive/Hadoop doesn't save much in terms of data loading. And once you
have a process set up to load and update the data in Redshift, you don't have
to reload everything.

------
serbaut
The first query seems awfully slow. I have a six-node Vertica cluster with a
100-column table with 7Bn rows in it, and a similar query takes less than 3
seconds.

~~~
mallipeddi
Disclosure: I work on the Redshift team.

The OP's cluster is a 16-node hs1.xlarge cluster (has 3 spindles per node).
There's actually a more powerful node-type hs1.8xlarge which has 24 spindles
on each node. More info: <http://aws.amazon.com/redshift/pricing/>

So it's not fair to compare Redshift performance to your Vertica cluster
unless the hardware is similar.

------
jacques_chester
Dumb question: was the data ETL'd into a star schema first? That can make a
big difference, especially in columnar stores.

~~~
bloomfilter
We made it sort of a partial star schema, but not strictly, since random joins
can be expensive.

