We're hugely excited about this for SnowPlow (https://github.com/snowplow/snowplow) - Redshift, being Postgres-based, is a really attractive storage target for event-stream data. It's a bit of a shame they don't support hstore/JSON yet, but hopefully that will come in time.
We're going to work on SnowPlow-Redshift integration next week, using the COPY command + SnowPlow S3 event files. It's great timing as we've been hitting the limits of what we can do in Infobright (which inherits MySQL's limit of 65532 bytes per row - an unfortunate restriction for a columnar database).
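For anyone curious what that integration looks like, the S3 load is a single SQL statement. A minimal sketch in Python (the table name, bucket path, and credentials are made up; you'd run the resulting statement over any Postgres driver connected to Redshift):

```python
def build_copy_sql(table, s3_uri, access_key, secret_key):
    """Build a Redshift COPY statement that bulk-loads gzipped,
    tab-delimited event files straight from S3."""
    return (
        "COPY {} FROM '{}' ".format(table, s3_uri)
        + "CREDENTIALS 'aws_access_key_id={};aws_secret_access_key={}' ".format(
            access_key, secret_key)
        + "DELIMITER '\\t' GZIP"
    )

# Hypothetical bucket layout for illustration:
sql = build_copy_sql("events", "s3://my-bucket/events/", "AKID", "SECRET")
```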
I think this is a smart move. I know companies doing custom data warehousing on the free version of Infobright (another column-store database). I'm sure they'd be interested in dumping a lot of custom scripts and doing all their querying on Amazon, since their data is there anyway.
Initially I was really excited by Redshift, but when I got a chance to play with it I found out that there is no JDBC support for any kind of bulk insert or trickle loading.
When you try to do batch inserts, the Postgres JDBC driver runs each statement individually, and you end up inserting tens of rows a second.
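One common workaround for per-statement round trips is to collapse a batch into a single multi-row INSERT. A sketch (table and column names are invented):

```python
def build_multirow_insert(table, columns, rows):
    """Build one parameterized multi-row INSERT plus its flattened
    parameter list, so the whole batch goes in a single round trip."""
    placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join([placeholder] * len(rows)))
    params = [value for row in rows for value in row]
    return sql, params

sql, params = build_multirow_insert(
    "events", ["user_id", "action"], [(1, "click"), (2, "view")])
```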
What if DynamoDB doesn't solve my problem because I need transactions?
Why do I have to write code to perform an extra step and pay the extra cost and latency of pushing data through S3 just to get it into Redshift?
Not supporting trickle loading is a leaky abstraction IMO. It's not a ton of code to log statements until you have enough to justify an import, but you shouldn't push that complexity onto every database user.
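That buffering logic really is small. A rough sketch (class name and threshold are arbitrary; `bulk_load` stands in for whatever issues the actual COPY or batched INSERT):

```python
class MicroBatcher:
    """Buffer rows in memory and hand them to a bulk-load callback
    once the batch is large enough to justify an import."""

    def __init__(self, bulk_load, threshold=1000):
        self.bulk_load = bulk_load  # callback taking a list of rows
        self.threshold = threshold
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.threshold:
            self.flush()

    def flush(self):
        # Ship whatever is buffered, then start a fresh batch.
        if self.rows:
            self.bulk_load(self.rows)
            self.rows = []
```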
Postgres supports copying from a binary stream, why not support that?
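For reference, stock Postgres drivers already expose this: with psycopg2 you serialize rows into COPY's text format and stream the whole batch in one round trip. A sketch (the table name is made up; the commented line shows the actual driver call):

```python
import io

def rows_to_copy_buffer(rows):
    """Serialize rows into the tab-delimited text format that
    Postgres COPY ... FROM STDIN consumes."""
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(str(value) for value in row) + "\n")
    buf.seek(0)
    return buf

# With a live connection, this streams the batch in one shot:
# cursor.copy_from(rows_to_copy_buffer(rows), "events", sep="\t")
```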
I'd have to agree with arielweisberg here. Our organization was really excited about Redshift a few days ago, but after seeing each of our individual INSERTs take upwards of 2 seconds, and hearing that we should first upload to S3 or Dynamo, we decided the platform would not fit our needs.
Our goal is minimal architectural complexity, and uploading log files or other data to a file system before loading them into a data warehouse just doesn't make sense.
We're currently looking into Hadoop/HDFS/Impala due to cost constraints (Vertica would have been our primary choice). If anyone has any other suggestions it would be great to hear them.
It runs each statement individually? Or commits each statement individually?
Column stores are destined to have slower inserts, due to how the data is stored on disk. But if they are actually committing each statement individually, that is a problem.
We use the SnowPlow JS with a custom Django app that we include in each project. It stores clicks and events in Redis, as well as in gzipped logfiles for permanent storage. The Redis data expires after a configurable period of time.
Redshift integrates seamlessly with existing data warehouse software, and it handles everything you'd normally have to configure, import, read, and query yourself using RDS; plus it uses S3 to store its backups and lets clients restore their datasets from it. So it doesn't compete with RDS - my understanding is that you might as well use both, with Redshift as a ready-made package that plugs into existing data warehousing solutions.
Amazon RDS is for OLTP workloads. Redshift is a distributed, column-oriented store that's designed for OLAP workloads. For more info: http://aws.amazon.com/redshift/faqs/#0110