Hacker News new | comments | show | ask | jobs | submit login
Redshift Performance and Cost (airbnb.com)
145 points by AirbnbNerds 1610 days ago | hide | past | web | 45 comments | favorite

Redshift is based on ParAccel, not on Postgres. ParAccel uses APIs similar to Postgres due to historical reasons, but not the technology.

For a basic overview: http://en.wikipedia.org/wiki/Paraccel

As for the rest of the article, it feels like a basic Data Warehousing 101 re-discovered. It should have been titled "Analytics: Back To The Future" :-)

No kidding. The amount of startups that have flocked to hadoop for "data analytics" over the past 5 years is extremely disheartening. Almost all of the cases are far more suitable for any off-the-shelf RDBMS much less a column-oriented one. Same thing with MongoDB.

How much time and money would have been saved learning Database Theory/SQL/Data Warehousing/Dimensional Modeling instead of cramming everything into an unstructured data-store?

I think part of the issue why so many people have gone with Hive is that good, production-ready column stores are expensive. Redshift is posed to change that. If you're shopping in this space, Infobright is also worth checking out.

And even for moderate data sizes (10+ GB per table), row store DBs tend to become painful. This is especially true when you need to support ad-hoc reporting queries, since the usual technique of matching your schema, indexes, and queries won't be effective any more. With true ad-hoc reporting, your only hope becomes lots of shallow indices rather than ones tuned to a particular query.

Dimensional modeling (I'm a fan of Kimball's approach) mitigates these problems quite well while still offering very flexible ad-hoc reporting. Works great on a row-based RDBMS, even better on columnar.

http://en.wikipedia.org/wiki/Dimensional_modeling http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimens...

Redshift is indeed a solid product but all these comparisons against Hive are surprising, as that's not the right tool in the first place. Infobright, greenplum, aster, vertica, etc are the products which Redshift seeks to disrupt.

I realised a few years ago that pretty much every database course taught only teaches OLTP. OLAP never really gets a lookin.

At my university, standard normalisation was taught in the "databases" course. OLAP was mentioned as part of the "advanced databases" course.

The database course at that time blew about half its time on building PHP applications to talk to the database. I hate to second guess my professors, but I can't help but feel that a more productive use of the time would have been to teach normalised OLTP in the first half, and dimensionally modelled OLAP in the second half. Better yet, to divide them into two courses and spend some time talking about database history ("here's why network and hierarchical databases sucked") and maybe some introduction to how query planners work.

Do you know of any resource that talks about the same topic as your comment, but in more detail?

Philip Greenspan's SQL tutorial is a nice starting point: http://philip.greenspun.com/sql/

It's a bit old, but still pretty good.

I'm not sure I follow you.

To be more stereotypical: I am intrigued and would like to sign up to your newsletter.

Well ... really, just read a good pair of textbooks on each side of the spectrum. Date's Databases and Kimball's The Data Warehouse Toolkit are good.

Edit: actually, maybe not Date. It's up to you. It's good, but it's controversial because he's not a fan of SQL and so he uses his own language.

The one I used in uni was Ramakrishnan & Gehrke's Database Management. It was OK but there's a certain amount of at-the-time trendy bullshit that to me detracts from a focus on relational databases for their own sake.

Edit 2: and Joe Celko's SQL for Smarties contains good oil on the relational paradigm.

We use Infobright at SnowPlow (https://github.com/snowplow/snowplow), and are currently working on our Redshift integration.

One thing to be aware about with both is the lack of any support for wide tables - Infobright inherits MySQL's limit of 65,535 bytes per row (and UTF8 means 3 bytes per char); with Redshift you can stored wider rows but you can't query them (http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE...). Obviously not a deal breaker, but it locks you firmly in the densely populated rows, relational mindset.

This aside, we're super-excited about Redshift!

Hi, I have started your project previously and while I haven't had a chance to test it out, I must say the idea of using cloud-front as a collector is a superb brilliant idea to scale an analytics platform and would be both very scalable, reliable and economical.

Btw, do you have more experience to share? e.g. with infobright, how many events can be processed per second? what would be the "ETL latency"? can infobright handle 10TB of data easily, any caveat besides the row limit? Thanks.

Hi tszming - happy to share more over email. alex@snowplowanalytics.com

How much time and money would have been saved by running analytics on samples instead of the whole population?

Which off-the-shell RDBMS can handle queries over 3 billion rows?

Counter-question: Which startup has a actual data table with over 3 billion rows?

We have just crossed 2 billion items in our datastore. While not 3 billion yet, I expect that to happen later this year.

Too bad Redshift can't handle JSON files: Converting everything will be annoying.

Our idea is to change from JSON on loading to Redshift, continuously. http://www.hapyrus.com/pages/flydata-for-redshift

When you log every mousedown because the founder misunderstands A/B testing, 3 billion rows is easy to come by. Besides - you're busy changing the world, so you should expect to use the same technology as Facebook and Google.

In 2007 I worked for a firm with a 4 billion row join table in PostgreSQL. Might've been 7 or 8, I don't recall which. It ran on a quad core server with 16Gb of RAM. Joins going through this table took about 2-3 seconds to complete.

But I suspect the join must have been over an indexed column, so it did not touched 4bln rows, otherwise 2-3 seconds would be hard to believe. The group by query in the article must access all 3bln rows, which makes a huge difference.

All the columns were indexed.

I remember it well, because I was trying to explain why having tens of gigabytes of indexes wouldn't help them much if they only had 16Gb of RAM.

In terms of group-by performance, it depends a lot on the kind of data and how it's stored. For example, taking a sum on a columnar store is quite amenable to parallel solutions and a lot of databases will do that way.

I can't think of an off the shelf RDBMS which can't handle queries on 3 billion rows.

SQL Server can

Oracle can

Postgres can

Even MySQL can (!)

The limitations are almost always in the hardware, not the software.

If you're looking at column based systems, you can look at Greenplum (does both row and column-based storage), InfiniDB (MySQL based), and all sorts of expensive but very fast appliance options like Netezza, Teradata, etc.


SAP HANA would be one but it is basically in memory so very, vey expensive.

And it's not a RDBMS. It's basically the same technology as RedShift, but not cloud based (yet).

Actually HANA One is available in the AWS Marketplace: https://aws.amazon.com/marketplace/pp/B009KA3CRY/ref=mkt_ste...

Seriously can you and your ilk just please stop.

It's so exhausting to hear how much smarter you are and if we just educated ourselves we would realise the error of our ways. People who choose the technologies aren't stupid or masochistic. They understand their use case and the fact is that there are plenty of situations where SQL is suboptimal.

I don't think he is epeen waving (where e this time is education).

Sometimes with technologies going through the Gartner Hype Cycle people choose the incorrect one, because of the buzz, the glamour around it.

NoSQL is most definately in vouge, quite rightly, too many people often use heavy RBDMS when they are not required.

But too many people perhaps are too quick to dismiss the regular database without actually understanding it.

Any suggestion to avoid hype of technology, question your use cases fully is in my mind a good suggestion.

Thanks for pointing it out, we have correct it in our post

The article mentions this briefly, but it should be emphasized: parallel loading from S3 is MUCH faster.

This weekend I loaded 2 billion rows from S3 both ways:

- From a single gzipped object: 4 hours 42 minutes

- From 2000 gzipped slices of 1M rows each: 17 minutes

(Loading from gzipped files is considerably faster, in addition to saving S3 charges.)

The article notes that choice of distribution key is critical. I'd add that choice of sort key is equally important. In my testing, a better sort key improved compression from 1.5:1 to 4:1, and also made common queries 5x faster.

Unfortunately, you only get one dist key and one sort key per table, so less common queries could get slower.

Also if you launch the more instance in a cluster, the faster to load. Our survey: http://www.slideshare.net/Hapyrus/scalability-of-amazon-reds... We tried much more files (5MB each) to load, but it takes longer time in total.. We're trying to get appropriate size and file numbers.

This is the second article I've seen where the authors forget to multiply by the number of redshift nodes. A single XL node is $0.85/hr so 16 nodes would be $13.60/hr. Still cheaper than their Hive configuration obviously but less than a buck?

damn yeah, I initially thought wow for $632.4 bucks i'm setting this up next week or even later this week provisioning the redshift - but 16x that at $10118.4 - I'll continue wait... probably a good thing to stay focused on features anyway

Thanks for pointing this out! We just updated the article to reflect this.

Which storage format did you use for Hive? This is very important to how performance plays out, are you using snappy or LZO compression? Also, this is a relevant comment from a Hive committer (http://news.ycombinator.com/item?id=5248485).

Still running on posterous. Weird to think that this is probably the last posterous blog post I will read.

It's good to see a Redshift evaluation. I'm wondering how does Redshift compare to hadoop airbnb setup when taking data loading and transformation into consideration as well as running aggregate queries? I mean if you want to run analysis fairly often, do you need to reload everything in Redshift? From maintenance point of view, is Hadoop setup more flexible and cheaper than Redshift?

The hadoop setup we have is actually EMR, and we use s3 for data storage, so for us Hive/Hadoop doesn't save much in terms of data loading. And once you have a process setup to load and update the data in Redshift, you don't have to reload everything

The first query seems awfully slow. I have a six node vertica cluster with a 100 column table with 7Bn rows in it and a similar query takes less than 3 seconds.

Disclosure: I work on the Redshift team.

The OP's cluster is a 16-node hs1.xlarge cluster (has 3 spindles per node). There's actually a more powerful node-type hs1.8xlarge which has 24 spindles on each node. More info: http://aws.amazon.com/redshift/pricing/

So it's not fair to compare Redshift performance to your Vertica cluster unless the hardware is similar.

Dumb question: was the data ETL'd into a star schema first? That can make a big difference, especially in columnar stores.

We sort of made it a partial star schema, but not strictly. since random join can be expensive

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact