
AWS Redshift: How Amazon Changed The Game - rgrzywinski
http://blog.aggregateknowledge.com/2013/05/16/aws-redshift-how-amazon-changed-the-game/
======
vosper
This blog post is great: extremely detailed and informative. We're running
Infobright (the paid version), so I love hearing from people using other
technologies at large scale (> 1B rows) and how they compare to what they
were using before. Redshift has definitely attracted our attention.

I listened to the recent Redshift performance webinar and was surprised by:

- the hard maximum of 15 concurrent queries (14 when you reserve one for
management)

- the need to pre-assign queries to queues so that fast queries don't get
blocked by slower ones

- the recommendation to vacuum after large inserts (showing its Postgres
roots) and to limit query concurrency to 3 while vacuuming.
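For what it's worth, since Redshift speaks the Postgres wire protocol, that post-load vacuum can be scripted like any other Postgres maintenance step. A minimal sketch (the table, bucket, and credentials are hypothetical; the statements would be run through psycopg2 or any other Postgres driver):

```python
# Sketch of a Redshift bulk-load routine: COPY from S3, then VACUUM and
# ANALYZE afterwards, per the webinar's recommendation. Table, bucket, and
# credential strings below are made up for illustration.

def bulk_load_statements(table, s3_path, creds):
    """Return the SQL statements for one bulk load, in execution order."""
    return [
        f"COPY {table} FROM '{s3_path}' CREDENTIALS '{creds}' GZIP;",
        f"VACUUM {table};",   # re-sort rows and reclaim space after the load
        f"ANALYZE {table};",  # refresh the planner's statistics
    ]

stmts = bulk_load_statements("events", "s3://my-bucket/events/",
                             "aws_iam_role=arn:aws:iam::123456789012:role/load")
for s in stmts:
    print(s)
```

Each statement would then be executed over an ordinary Postgres connection; nothing Redshift-specific is needed on the client side.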

Have these limitations caused problems for anyone?

Edit: formatting, grammar.

~~~
monstrado
We're currently running Impala in production with a table of over 4.5B rows
that powers an internal log-analysis website. We don't have any hard limits
on concurrent queries, and no vacuuming, since the data lives within Hadoop.
We've been pretty happy with it so far.

~~~
dzderic
What kind of performance are you getting, and how many nodes do you have?

~~~
monstrado
We have a 14-node cluster; the nodes have anywhere from 4 to 6 disks each.
Performance has been pretty amazing; we can do ad-hoc queries on this 4.5B-row
table. Each node gets read throughput of about 1.3GB/s for full table scans
(data is Snappy-compressed, stored as RCFile, a columnar format).

~~~
vosper
That sounds pretty fantastic. When you say "ad-hoc", do you mean it's fast
enough to be queried directly from a UI? Are we talking seconds or minutes
for your queries?

What drawbacks have you found with Impala? I've been keeping an eye on it, and
also Shark: <http://shark.cs.berkeley.edu/>

~~~
monstrado
It depends. If you scan our entire data set (roughly 2.8TB), it could take
30-40 seconds, but we have our data partitioned on a key that makes sense for
the kind of data you'd need to populate a web page, and those queries are fast
enough (< 2 seconds) for the aggregations that come in via AJAX.

We haven't yet had a chance to optimize our environment either. For example,
our nodes are still running a pretty old version of CentOS, so we have LLVM
disabled (which would help a lot for huge batch computations...see
<http://blog.cloudera.com/blog/2013/02/inside-cloudera-impala-runtime-code-generation/>).

Also, our data is stored in RCFile, which is not exactly the most optimized
columnar storage format. We're working on a plan to get everything over to the
new Parquet (<http://parquet.io/>) columnar format for another boost in
performance.

We haven't come across any real drawbacks using Impala as of yet; it fits our
needs pretty well.

Disclaimer: I work for Cloudera on their internal Tools team; we like to
dogfood our stuff :).

Edit: One drawback of Impala is the lack of UDF support, but this is something
that will be coming in a later release.

------
mwexler
If you like Redshift, you really should look at BitYota
(<http://www.bityota.com/>). Launched at the AWS re:Invent show on the same
day as Redshift, it provides much of Redshift's performance with much more
customization, as well as a more rounded set of SQL extensions (including
SQL:2003 OLAP functions and UDFs in multiple languages).

I don't work for them, but they make a good product.

~~~
binarydud
One of the big advantages of Redshift is its Postgres protocol compatibility,
which means I don't have to rewrite code to connect to Redshift and run
queries. It seems that BitYota wants me to run queries via JSON. Really? Why
would I lock my code into something like that?
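To make the drop-in point concrete: the only change an existing Postgres client typically needs is the endpoint and port. A hedged sketch with a made-up cluster endpoint (Redshift listens on 5439 rather than Postgres's default 5432):

```python
# Because Redshift speaks the Postgres wire protocol, an existing client only
# needs a new connection string. The cluster endpoint below is made up.

def redshift_dsn(endpoint, dbname, user, password):
    # Same libpq-style DSN a plain Postgres application would use;
    # only the host and port differ from a vanilla Postgres setup.
    return (f"host={endpoint} port=5439 dbname={dbname} "
            f"user={user} password={password}")

dsn = redshift_dsn("my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                   "analytics", "admin", "secret")
print(dsn)
# An unmodified Postgres driver then connects as usual, e.g.:
#   import psycopg2
#   conn = psycopg2.connect(dsn)
```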

~~~
mwexler
I think you are misreading the BitYota pages. On the contrary, you CAN run
queries with JSON, but you can also just pass in regular SQL statements. I
guess the direct Postgres protocol is handy; I find that ODBC covers most of
my needs, so I haven't actually looked at BitYota as a Postgres drop-in.

It might be worth playing with what BitYota offers; Postgres is awesome but
sometimes makes you bend over backwards to do things that are easier in other
systems. ParAccel/Redshift has some good extensions, but sometimes there are
better ways to do things.

------
stcredzero
You down with MPP? You know what it means?

(Seriously, what does it mean?)

~~~
graedus
I had to look that up. Massively Parallel Processing.

~~~
samspenc
What the author is trying to say is that (IMHO) running a database on an MPP
system (quite common these days, and easily stood up in a data center or on
Amazon) is still a pain.

Outside of Postgres, none of the truly open-source databases scale well on
MPP systems. You're still stuck with mostly single-core processing.

So you can pay the commercial database providers good $$$$$ for a database
that _may_ scale to your needs.

Or use Redshift for a great analytics DB at a reasonable price. (Note: we
heavily use HBase where I am, but that's comparing apples to oranges.)

~~~
nemothekid
Apples to oranges how so, if you don't mind my asking? What about HBase makes
it different from Redshift/Greenplum/Oracle?

~~~
jackowayed
HBase isn't built for analysis workloads. It doesn't have a complex query
engine, so you end up having to do massive scans (which aren't especially
fast), transfer a ton of data to the node where your analytics code is
running, and do the computation there. If things are too big to run on one
machine, that's your problem, not HBase's.

On the other hand, Impala, Redshift, Oracle Exadata, etc. let you ask the
database to do work at a much higher level. That allows for much better
performance, because the data storage and computation layers can work in
tandem (so you can prune down to only the data your query needs at each
storage node before hitting the network, for example), and the database, not
the writer of the analysis routine, does the work of optimizing across
multiple cores and nodes.

------
ironchef
I think there are still definitely workloads suited to other data-store
setups like graph, tabular, etc. (at least based on performance testing I've
seen in the past). On the plus side, a lot of standard crappy workloads
(weblog reduction) will work wonderfully in columnar setups like Redshift.

------
CurtMonash
The comments on pricing of alternative products were ... odd. E.g., Netezza
was available for purchase, hardware of course included, at $20K/TB several
years ago, counting 2.25X compression.

Well, except for the last such comment, which contradicted some of the prior
ones by 1-2 orders of magnitude.

------
monstrado
> Sure, sure there are other open-source options out there that aren’t Hadoop,
> but for example Impala 1.0 just came out...

Although Impala isn't technically Hadoop, its daemons are co-located with
each of your Hadoop DataNodes, and it talks directly to HDFS and HBase. Also,
there is no ETL process with Impala; you could drop a few billion-row flat
files into Hadoop and query them within a few minutes.
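That no-ETL path amounts to pointing an external table at files already sitting in HDFS. A sketch that just builds the DDL string, with made-up table, column, and path names (the schema is an assumption for illustration, not monstrado's actual layout):

```python
# Sketch of the no-ETL path: files dropped into an HDFS directory become
# queryable by declaring an external table over that directory. The table
# name, columns, and path below are hypothetical.

def external_table_ddl(table, hdfs_dir):
    """Build a Hive/Impala-style CREATE EXTERNAL TABLE statement."""
    return (f"CREATE EXTERNAL TABLE {table} "
            "(ts BIGINT, user_id STRING, url STRING) "
            "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
            f"LOCATION '{hdfs_dir}';")

print(external_table_ddl("weblogs", "/data/weblogs"))
```

No data is copied or transformed; dropping more files into the directory makes them visible to the next query.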

------
joebeetee
"..hop onto Hadoop and ride the open-source rodeo with Hive or Pig by your
side." - Laughed at this sentence. Well written!

------
dzhiurgis
I wish Salesforce would follow. At the moment, 2GB of database storage there
costs nearly the same as 2TB of storage on Redshift. And yes, that doesn't
include the license itself.

------
pbbakkum
Well-written article! Realistic from an ops perspective, and it also works
through marginal AWS cost calculations, which are very interesting but
difficult to estimate.

