Amazon Redshift – New Features (aws.typepad.com)
29 points by ahjones on Nov 11, 2013 | hide | past | favorite | 10 comments

I'd be interested in hearing from anybody using Redshift in production. Currently shopping around for a database to store 1.5 billion interesting things, and Redshift's at the top of my list.

Specifically - how many (small/large) nodes are you using, how big is your data, how many simultaneous users does your setup support, and what kind of performance do you see?

Thanks in advance.

Not in production yet, but my company's in the process of moving from Infobright to Redshift, mostly because Infobright, while extremely fast, is a single-server solution and we think that within a year we'll have outgrown its capacity.

Our data is 3 denormalized tables with the same ~1000 column schema, the largest of which contains 7B rows. Most of our queries are simple aggregations down a few columns, with simple constraints like date range + customer ID.

So far we're really impressed with Redshift's load performance and how easy it was to get up and running, but we're still an order of magnitude away from Infobright's query performance. Next steps are playing with distribution and sort keys, as well as trying different cluster configurations; to be fair to Redshift, we've not run it on more than a six-node XL cluster yet.
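For anyone unfamiliar, the knobs in question look roughly like this. The table and column names are invented for illustration; DISTKEY and SORTKEY themselves are standard Redshift CREATE TABLE attributes:

```python
# Illustrative Redshift DDL showing distribution and sort keys.
# Table/columns are hypothetical; a real schema would differ.
ddl = """
CREATE TABLE events (
    customer_id BIGINT,
    event_date  DATE,
    metric_a    DOUBLE PRECISION
)
DISTKEY (customer_id)  -- co-locate each customer's rows on one node slice
SORTKEY (event_date);  -- lets date-range predicates skip disk blocks
"""
print(ddl)
```

Picking a DISTKEY that matches the common join/filter column avoids shuffling rows between nodes, and a SORTKEY on the date column is what makes "date range + customer ID" style constraints cheap.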

We use Redshift in production to power a customer-facing app, but don't have web apps hit it directly (that really couldn't work with the concurrent query limits). Workers query RS and cache results in another database and the web app hits that database. We handle writes in the low 10s of billions per day. Our data format is very simple and we get great compression, so we currently fit everything on two clusters — one of 3 XLs and one of 6 XLs.

Performance is great — except when it's not. Simple aggregations on tables that are < 1B rows and two table joins on tables that are < 100M rows are blazingly fast (maybe 1-30 seconds depending on the query). Larger tables than those and it can start to crawl.

We're using it as a replacement for our enterprise data warehouse. Total size so far is ~1 TB compressed, with a daily delta in the 10s of GB compressed. We're running a two-node XL cluster. The best part is that you can increase the cluster size without downtime (though you can't switch from XL to 8XL this way). Right now we have a single-digit number of simultaneous users (analysts), and we haven't had to tweak the Workload Management configuration much so far.

I'd be interested to hear from others with bigger workloads than ours; for example, someone from Netflix (they switched their data warehouse from Vertica to Redshift this year).

I have not used Redshift in production, but it's been getting pretty good reviews thus far. I did some testing using a pretty small dataset, which you can read more about here: http://bicortex.com/amazon-redshift-review-data-warehouse-in... I don't think it's the answer to all data storage/warehousing needs, but it does look promising and should only get better.

Opened your page with interest but the tiny white text on black background is eye-searing.

Even with manual zooming it is still a bit harsh due to the colours (maybe it's just me, though).

HyperLogLog-based approximate COUNT DISTINCT: that's an interesting development in there (a simple algorithm, but complex math).

I wonder what hashes they are using.

Only tangentially related to your comment, but last year there was a series of postings called "Damn Cool Algorithms", whose final post, on cardinality estimation, covered HyperLogLog [0]. There was some discussion in the Hacker News submission [1].

[0] http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinali... [1] https://news.ycombinator.com/item?id=4488946

One of the biggest benefits of HLL is that if you are only interested in set cardinalities, you don't have to store the original items at all, just a small HLL data structure, which can be used to compute set unions and, with some caveats, intersections.
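For intuition, here's a toy sketch of the idea in Python. The register count, hash choice, and bias constant are illustrative; this isn't how Redshift implements it internally:

```python
import hashlib
import math

class HLL:
    """Toy HyperLogLog sketch for illustration (not production-grade)."""

    def __init__(self, p=12):
        self.p = p            # 2**p registers; more registers = less error
        self.m = 1 << p
        self.registers = [0] * self.m

    def _hash(self, item):
        # 64-bit hash from SHA-1 (an arbitrary choice here).
        return int.from_bytes(
            hashlib.sha1(str(item).encode()).digest()[:8], "big")

    def add(self, item):
        h = self._hash(item)
        idx = h >> (64 - self.p)               # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining bits
        # rank = 1-based position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def union(self, other):
        # Element-wise max of registers: exactly the sketch of the set union.
        out = HLL(self.p)
        out.registers = [max(a, b)
                         for a, b in zip(self.registers, other.registers)]
        return out

    def count(self):
        # Standard HLL estimator with small-range (linear counting) correction.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)
        return int(est)
```

The whole sketch here is 4096 one-byte registers, regardless of how many items were added, and merging two sketches with `union` loses nothing relative to having built one sketch over both inputs, which is what makes pre-aggregated storage of sketches attractive.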

It would be really great if Redshift supported HLL as a proper data type, and not just an optimization for COUNT(). A proper data type would allow us to store pre-aggregated data instead of billions of unnecessarily granular rows.

It sounds as if multi-node Redshift could actually be useful now (because it's picked up the minimally necessary data distribution options).
