

Amazon Redshift - What You Need To Know - kungfooey
http://dailytechnology.net/2013/08/03/redshift-what-you-need-to-know/

======
monstrado
These are very different databases, PostgreSQL is a transactional database,
while Redshift (aka: ParAccel) is an analytical database. Each of these
databases have implemented much different design decisions, which improve
queries on certain type of workloads.

PostgreSQL is optimized for blazing fast record mutations, inserts, while
maintaining adequate query response times on medium to large size data.
Redshift (ParAccel), or your other analytical databases like Greenplum,
Teradata, or Netezza make optimizations that make sense for queries that
pertain to the majority of a tables data (full table scans).

For example, Redshift stores stores a tables columns in separate locations,
allowing you to not only skip reading columns which don't pertain to the
query, but also allows for easier disk parallelization. Keeping your columns
separate from each other slows down things like record reconstruction, since
you're performing n disk seeks, where n = # of columns...this is bad for
databases where you need a single record. Databases like Redshift are meant to
compliment your MySQL and PostgreSQL databases, not replace them.

Shameless plug (I work at Cloudera): For reasons unknown, the article
dismissed Hadoop without listing any reasons. If you're interested in having a
secondary system designed for doing ad-hoc queries over your large datasets
(billions of rows), I suggest trying out Impala. You can run it across a few
servers you have sitting around...
[http://rideimpala.com/](http://rideimpala.com/)

~~~
kungfooey
> These are very different databases..

Sure, they are. The point is that the API is similar, so developers already
familiar with SQL, or libraries that speak SQL (SQLAlchemy in our case) can
speak to Redshift. I attempted to make that point in the post, hopefully I can
make that more clear.

> For reasons unknown, the article dismissed Hadoop without listing any
> reasons.

I didn't feel that it was the place for it, but the basics are that we didn't
have the skills in place to restructure our system to use Hadoop. Second, our
data structures are constantly changing, which Hadoop didn't accommodate very
well. Third, we were not able to accommodate ad-hoc queries as every reduce
operation needed to be figured out prior to running. Last, we were looking for
something to handle "large" data sets, but not "big data." Hadoop works great
for "big data" (petabytes) but it was overkill for us (working with
terabytes).

I'm sure all of these problems could have been resolved, but the point of my
write-up is not to kick dirt at Hadoop so much as to explain how Redshift fit
our needs. Other companies, I'm sure, have different needs and should of
course make their own evaluations.

~~~
monstrado
Thanks for the clarifications, I wasn't not meaning to discredit your article
in anyway. I am just trying to help people understand that these databases are
very different from eachother, and were created to solve different use cases.

> Hadoop works great for "big data" (petabytes) but it was overkill for us
> (working with terabytes).

This is a very common misunderstanding about Hadoop, terabytes of data is
still very large when you want to apply complex transformations, or allow
someone to run queries over it with low latency expectations (a few seconds).

Impala does not use MapReduce, it was designed with low latency goals in
mind...speeds are comparable to RedShift
([https://amplab.cs.berkeley.edu/benchmark/](https://amplab.cs.berkeley.edu/benchmark/)),
although they are citing an older version.

~~~
kungfooey
Interesting. I will definitely check out Impala. Thanks for bringing it up.

------
nemothekid
In the COUNT(*) example, looks like you annotated the wrong snippet.

"5 seconds! That’s an improvement. Note that I didn’t make any adjustements to
the data: no indexes, no differences in table structure." should be 1.5s.

~~~
jmduke
Looks like he typo'd and put a space in between the "1." and the "5", which
got interpreted as an ordered list.

~~~
kungfooey
Whoops, thanks for catching that. I didn't actually have a space there, but it
did get interpreted that way.

------
stevoski
select count( * ) from dummy_table;

takes 10 minutes on PostgreSQL to return the result 21454134?

With H2, an open source embedded SQL database I use daily, a "select count( *
) from table_name" query returns the result instantly. I assumed therefore
that this was the norm...

~~~
saurik
As mentioned by sargun (who was irritatingly downvoted by someone) the
difference is that PostgreSQL uses the MVCC concurrency model, which allows
you to do quite complex queries concurrently with other people making writes
to the database, but comes at the cost that there is no longer any objective
count of the rows in a given table: the number of rows is related to which
transactions are considered live at any given moment.

[http://wiki.postgresql.org/wiki/Slow_Counting](http://wiki.postgresql.org/wiki/Slow_Counting)

~~~
kungfooey
In my case, this was inadvertently a good comparison. Most of the "ad-hoc"
queries we're doing are AVG, SUM, or similar operations that require scanning
the entire table.

I may try to edit this post to compare those operations, as it may be more
meaningful.

Thanks for posting the link. TIL.

------
superails
> Redshift smelled enough like PostgreSQL

I'm curious. How does something smell like Postgres? Does it use some of the
same PG-specific datatypes?

~~~
kungfooey
It uses similar data types and has a similar API. The differences between the
two are summarized in the documentation.

[http://docs.aws.amazon.com/redshift/latest/dg/c_redshift-
and...](http://docs.aws.amazon.com/redshift/latest/dg/c_redshift-and-postgres-
sql.html)

------
ceyhunkazel
I think beside familiar PostgreSQL enviroment there is not much advantage over
SAP HANA. I listed advantages of HANA on my posting
[https://news.ycombinator.com/item?id=6466222](https://news.ycombinator.com/item?id=6466222)

~~~
mtam
Redshift is a SQL relational database for large structured data volumes,
whereas Amazon EMR (Elastic MapReduce) is for non-structured data. The
Redshift FAQ has more details on this topic:
[http://aws.amazon.com/redshift/faqs/#0120](http://aws.amazon.com/redshift/faqs/#0120)

Other than the marketing buzz from SAP, I have not seen any reputable
(neutral) comparison between Hana and Redshift that includes price. I also
have never met anyone that uses/run Hana that was not already an SAP ERP
client, albeit the fact that SAP keeps saying Hana is the fastest-growing
product in the company's history.

