We're hugely excited about this for SnowPlow (https://github.com/snowplow/snowplow) - Redshift, being Postgres-based, is a really attractive storage target for event-stream data. It's a bit of a shame they don't support hstore/JSON yet, but hopefully that will come in time.
We're going to work on SnowPlow-Redshift integration next week, using the COPY command + SnowPlow S3 event files. It's great timing as we've been hitting the limits of what we can do in Infobright (which inherits MySQL's limit of 65532 bytes per row - an unfortunate restriction for a columnar database).
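For anyone curious what that integration looks like, the S3 load is a single SQL statement. A minimal sketch in Python (the table name, bucket path, and credentials are made up; you'd run the resulting statement over any Postgres driver connected to Redshift):

```python
def build_copy_sql(table, s3_uri, access_key, secret_key):
    """Build a Redshift COPY statement that bulk-loads gzipped,
    tab-delimited event files straight from S3."""
    return (
        "COPY {} FROM '{}' ".format(table, s3_uri)
        + "CREDENTIALS 'aws_access_key_id={};aws_secret_access_key={}' ".format(
            access_key, secret_key)
        + "DELIMITER '\\t' GZIP"
    )

# Hypothetical bucket layout for illustration:
sql = build_copy_sql("events", "s3://my-bucket/events/", "AKID", "SECRET")
```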
I think this is a smart move. I know companies doing custom data warehousing on the free version of Infobright (another column-store database). I'm sure they'd be interested in dumping a lot of custom scripts and doing all their querying on Amazon, since their data is there anyway.
Initially I was really excited by Redshift, but when I got a chance to play with it I found out that there is no JDBC support for any kind of bulk insert or trickle loading.
When you try to do batch inserts, the Postgres JDBC driver runs each statement individually, and you end up inserting tens of rows a second.
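One common workaround for per-statement round trips is to collapse a batch into a single multi-row INSERT. A sketch (table and column names are invented):

```python
def build_multirow_insert(table, columns, rows):
    """Build one parameterized multi-row INSERT plus its flattened
    parameter list, so the whole batch goes in a single round trip."""
    placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join([placeholder] * len(rows)))
    params = [value for row in rows for value in row]
    return sql, params

sql, params = build_multirow_insert(
    "events", ["user_id", "action"], [(1, "click"), (2, "view")])
```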
What if DynamoDB doesn't solve my problem because I need transactions?
Why do I have to write code to perform an extra step and pay the extra cost and latency of pushing data through S3 just to get it into Redshift?
Not supporting trickle loading is a leaky abstraction IMO. It's not a ton of code to log statements until you have enough to justify an import, but you shouldn't push that complexity onto every database user.
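That buffering logic really is small. A rough sketch (class name and threshold are arbitrary; `bulk_load` stands in for whatever issues the actual COPY or batched INSERT):

```python
class MicroBatcher:
    """Buffer rows in memory and hand them to a bulk-load callback
    once the batch is large enough to justify an import."""

    def __init__(self, bulk_load, threshold=1000):
        self.bulk_load = bulk_load  # callback taking a list of rows
        self.threshold = threshold
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.threshold:
            self.flush()

    def flush(self):
        # Ship whatever is buffered, then start a fresh batch.
        if self.rows:
            self.bulk_load(self.rows)
            self.rows = []
```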
Postgres supports copying from a binary stream, why not support that?
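For reference, stock Postgres drivers already expose this: with psycopg2 you serialize rows into COPY's text format and stream the whole batch in one round trip. A sketch (the table name is made up; the commented line shows the actual driver call):

```python
import io

def rows_to_copy_buffer(rows):
    """Serialize rows into the tab-delimited text format that
    Postgres COPY ... FROM STDIN consumes."""
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(str(value) for value in row) + "\n")
    buf.seek(0)
    return buf

# With a live connection, this streams the batch in one shot:
# cursor.copy_from(rows_to_copy_buffer(rows), "events", sep="\t")
```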
I'd have to agree with arielweisberg here. Our organization was really excited about Redshift a few days ago, but after seeing each of our individual INSERTs take upwards of 2 seconds, and hearing that we should first upload to S3 or Dynamo, we decided the platform would not fit our needs.
Our goal is minimal architectural complexity, and uploading log files or other data to a file system before loading them into a data warehouse just doesn't make sense.
We're currently looking into Hadoop/HDFS/Impala due to cost constraints (Vertica would have been our primary choice). If anyone has any other suggestions it would be great to hear them.
It runs each statement individually? Or commits each statement individually?
Column stores are destined to have slower inserts, due to how the data is stored on disk. But if they are actually committing each statement individually, that is a problem.
We use the SnowPlow JS with a custom Django app that we include in each project. It stores clicks and events in Redis, as well as in gzipped logfiles for permanent storage. The Redis data expires after a configurable period of time.
Redshift integrates seamlessly with existing data warehouse software, and it handles everything you'd normally have to configure, import, read, and query yourself using RDS; plus it uses S3 to store its backups and lets clients restore their datasets from it. So it doesn't compete with RDS - my understanding is that you might as well use both, with Redshift as a ready-made package that plugs into existing data warehousing solutions.
Amazon RDS is for OLTP workloads. Redshift is a distributed, column-oriented store that's designed for OLAP workloads. For more info: http://aws.amazon.com/redshift/faqs/#0110