
Amazon RedShift vs. local PostgreSQL - rarestblog
http://rarestblog.com/blog/2013/02/15/redshift-postgresql-in-the-cloud/
======
iblaine
What this test is essentially doing is comparing Postgres against a single
node of Redshift. It is not surprising that Postgres is faster. But Redshift
is not meant to be used on a single node.

What Postgres & Redshift represent are are two different products for two very
different problems. Postgres is good for small sets of transactional data like
orders in a shopping cart system (less than 1TB). Redshift is good for big
sets of data involving user behavior and clickstream analysis (greater than
1TB). I would not want to manage clickstream data on a single instance of
Postgres nor would I want to manage an order system in Redshift.

A better test of Redshift would be to see how it compares to
Asterdata...particularly with both in AWS. That should be telling.

~~~
eksith
We don't run a shopping cart, but one of our databases at present is at 11.3TB
on PostgreSQL 9.1 and we're by no means dealing with small sets. We routinely
juggle several Gigs at a time when we need to do analytics. We didn't see a
reason to put this on a cloud since bandwidth + electricity is still cheaper
for us than bandwidth + storage in the cloud at present.

~~~
monstrado
If you have a few servers to spare, I'd recommend installing Cloudera Impala
on them. You can use Apache Sqoop to pull the data out of Postgres and into
HDFS.. Directly after, you can run SQL queries which will query the data in
parallel (similar to redshift).

------
monstrado
I don't think comparing RedShift to Postgres is accurate, RedShift was not
designed for transactions, it was designed to store/query billions of rows
using a columnar storage format...it's more like an analytic database
(Greenplum, Teradata). Also, these databases are designed to scale out, and so
you usually don't really see compelling performance gains until you start
adding a few nodes to help influence parallelization.

With that being said, I'd be interested to see how RedShift compares to
Impala.

------
lcampbell
I really don't understand what's going on here.

* You're measuring request latency. What part of that (for RedShift) is due to the network? (EDIT: I re-read and saw you're using `SELECT 1` as a gauge for round-trip latency and subtracting it from the results. Are you only doing this for RedShift, or also for local PostgreSQL? To me, it seems like that heuristic is over broad -- it encapsulates not only network latency, but syscall overhead, query parsing, etc).

* In your tests, PostgreSQL _without indices_ performs on-par with RedShift. Does RedShift not support indexing? Is there some metric you're trying to show by not using indices? As designed, this benchmark does not map to any use-case I've ever seen.

~~~
rarestblog
* I noted in post that "SELECT 1" was done in 100ms round-trip, so I subtracted that from all points where it was reasonable. (No point to subtract it from 8 seconds).

* Not sure about the index support. Didn't try.

My idea was quite simple. I have some data at work (databases up to 30GB).
Sometimes we hope to find something better. The main question was - will
RedShift help, will it be radically faster? Will it be radically easier?

The answer for me - no, it won't help in my case, we need that 30GB data in
real time, it looks like RedShift is more when you have 1TB+ data. Yes, it is
radically easier.

~~~
lcampbell
Thanks for the reply. I figured for a dataset of that size, the main
bottleneck might be not indexing -- maybe RedShift stores row data in a
higher-latency medium while keeping indexes in-memory. Curious, I checked the
documentation[1] and found this:

> Amazon Redshift doesn’t require indexes or materialized views and so uses
> less space than traditional relational database systems.

Reading through the rest of their FAQ, it sounds like they echo your
conclusion -- RedShift shines the most for use-cases where the dataset is
large enough that, to use PostgreSQL, you'd have to shard out multiple
instances.

\--

[1] <http://aws.amazon.com/redshift/faqs/#0030>

------
rubyrescue
very interesting. one of the reasons we picked mysql for a very high-volume
app over postgres is that we have RDS and didn't want to do
backups/snapshots/etc. Could we now use RedShift as a postgres-API RDS?

~~~
agotterer
You wouldn't want to replace RDS (or MySQL/postgres) with Redshift for OLTP
workloads. Redshift is built for analytics and complex query workloads. The
major benefit is the ability to do MPP (Massive Parallel Processing) and
distribute the work out to nodes. I believe you can have up to 100 nodes in
the cluster with the 8XL machines and 32 with the single XL.

------
amalag
You need to run this with a column store database like Infobright. Postgres is
more of a transactional database, Infobright is suited towards the similar
large dataset analytics that this is aimed towards.

~~~
scotth
Redshift isn't Postgres. It's built on top of Postgres, similar to how
Infobright is built on top of MySQL.

~~~
zwily
Redshift is built on ParAccel, not Postgres. It speaks the Postgres API, but
the storage engine is not Postgres.

~~~
scw
Some of the code is based on Postgres, based on their own marketing materials:
"Paraccel has leveraged Postgres for some of its parsing and planning
functions". So the head node continues to have Postgres origins, but not the
compute nodes. Also see this write-up on their paper discussing the links to
Postgres: [http://dbmsmusings.blogspot.com/2009/07/paraccel-and-
their-p...](http://dbmsmusings.blogspot.com/2009/07/paraccel-and-their-
puzzling-tpc-h.html)

------
eduardordm
I run 3 large oracle RDS instances I wonder if redshift could be effectively
used the same way, we have been thinking about migrating to postgresql.

~~~
fusiongyro
How big is your data?

~~~
eduardordm
16tb per instance

~~~
fusiongyro
Cool. :)

I wouldn't know a better way to deal with that with Postgres than sharding
across instances. But my "big data" is about 0.75 TB so it fits nicely in one
instance. I don't know how people with real problems do it.

------
Whitespace
Wouldn't it have been better to do an EXPLAIN ANALYZE for the timing
measurements instead of having the results returned locally?

~~~
rarestblog
Honestly I don't know. Do you think it would be better? Could you explain why?

~~~
kapilvt
because it shows the estimated cost (plan) of the queries independent of the
magically subtracted network roundtrip time. it also serves to show if the
redshift even supports it.it.

~~~
rarestblog
Ahhh, you meant on the RedShift side. I did not think of it. It would have
definitely been better. Good point!

It seems that it actually supports it
<http://docs.aws.amazon.com/redshift/latest/dg/r_EXPLAIN.html>

------
ozgune
This is a pretty interesting. I wonder how query performance differs between
Redshift and local PostgreSQL for other types of benchmarks as well, say TPC-H
queries. (And I guess how Redshift scales out as the dataset size increases in
TPC-H.)

------
csummers
I'd like to see some more information about the local setup, including
hardware and the postgresql.conf. Otherwise, this tells me very little in
terms of comparison.

~~~
rarestblog
In anyway this test won't tell you much, just how different systems behave to
bigger load.

The local setup was quite usual: PostgreSQL 9.2, Mint 13, default conf in
VirtualBox in iMac i5 12GB. (read: home computer, no tuning)

For me the result is that mostly RedShift is on par with local PostgreSQL,
sometimes even winning for <5M rows. So with better PostgreSQL tuning you can
probably stretch it, but not for as much as RedShift can do for REALLY big
data.

Also the big deal was that RedShift scaled linearly.

~~~
fusiongyro
It scaled linearly, but also went unresponsive for five minutes. (Yeah yeah,
it's a new service).

The default Postgres configuration is pretty weak. work_mem is set _way_ to
low, for instance, and that's bitten me a few times. I wouldn't say it's
unrealistic--lots of people run with it that way in production and never find
out how easily they could speed things up. Even me, for years.

But ultimately I'm more swayed by your interaction with it and I hate the
endless benchmark tweaking that comes after every blog post about performance
testing stuff. The point of this Redshift thing is hugeness first and
foremost, so it's interesting.

------
crazydoggers
Data warehousing often involves star schemas, which means lots of joins in
your queries. I'd love to see how a real world OLAP tool performs on this.

~~~
ramarnat
While you could use Redshift as a source for OLAP, most OLAP tools will have
their own data store. But if you are referring to ROLAP, then it can perform
well if tuned properly for the star schema. This would include BI tools like
Microstrategy and Mondrian with Jaspersoft.

The main issue with Redshift is the lack of multiple sort orders on a table.
Take a look at our blog post on first impressions gleaned during the preview.
Disclosure: we are one of a couple of systems integrator partners for
Redshift.

[http://www.full360.com/2013/02/14/aws-redshift-
full360-first...](http://www.full360.com/2013/02/14/aws-redshift-
full360-first-impressions.html)

