Amazon Redshift Now Available to All Customers (amazon.com)
91 points by jrnkntl on Feb 15, 2013 | 21 comments



We're hugely excited about this for SnowPlow (https://github.com/snowplow/snowplow) - Redshift's flavor of Postgres is a really attractive storage target for event-stream data. Bit of a shame they don't support hstore/JSON yet, but hopefully that will come in time.

We're going to work on SnowPlow-Redshift integration next week, using the COPY command plus SnowPlow's S3 event files. It's great timing, as we've been hitting the limits of what we can do in Infobright (which inherits MySQL's limit of 65,535 bytes per row - an unfortunate restriction for a columnar database).


I think this is a smart move. I know companies that are doing their custom data warehousing on the free version of Infobright (another column-store database). I am sure they will be interested in dumping a lot of custom scripts and doing all their querying on Amazon, since their data is there anyway.


Initially I was really excited by Redshift, but when I got a chance to play with it I found out that there is no JDBC support for any kind of bulk insert or trickle loading.

When you try to do batch inserts, the Postgres JDBC driver runs each statement individually, and you end up inserting tens of rows a second.
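
For anyone who wants to reproduce this, the pattern in question is plain JDBC batching; here's a minimal sketch, with a made-up cluster endpoint, table, and columns:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class RedshiftBatchInsert {
        public static void main(String[] args) throws Exception {
            // Hypothetical cluster endpoint and table, for illustration only.
            Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://my-cluster.example.com:5439/dev", "user", "password");
            conn.setAutoCommit(false);
            PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO events (event_id, event_type) VALUES (?, ?)");
            for (int i = 0; i < 10000; i++) {
                ps.setInt(1, i);
                ps.setString(2, "click");
                ps.addBatch();
            }
            // As described above, the driver ends up sending these as individual
            // INSERT statements, so throughput is tens of rows per second.
            ps.executeBatch();
            conn.commit();
            ps.close();
            conn.close();
        }
    }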

I wish they had gone with something like Vertica.


Look for the COPY command. Upload your raw data to S3 (gzipped if you like) and you can pull it straight in from there.
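
As a rough sketch (cluster endpoint, bucket, table, and access keys below are placeholders), the S3 load issued over JDBC looks something like this:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class RedshiftCopyFromS3 {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://my-cluster.example.com:5439/dev", "user", "password");
            Statement stmt = conn.createStatement();
            // COPY pulls the gzipped, pipe-delimited files straight out of S3.
            stmt.execute(
                "COPY events FROM 's3://my-bucket/events/' "
                + "CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>' "
                + "DELIMITER '|' GZIP");
            stmt.close();
            conn.close();
        }
    }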


You can use the COPY command to do bulk imports from S3. We also support importing from DynamoDB.


What if DynamoDB doesn't solve my problem because I need transactions?

Why do I have to write code to perform an extra step and pay the extra cost and latency of pushing data through S3 just to get it into Redshift?

Not supporting trickle loading is a leaky abstraction, IMO. It's not a ton of code to log statements until you have enough to justify an import, and you shouldn't push that complexity onto every database user.

Postgres supports copying from a binary stream; why not support that?
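
For reference, this is the sort of streaming load the stock Postgres JDBC driver exposes through CopyManager (a minimal text-format sketch with a made-up table), which is roughly what's being asked for here:

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import org.postgresql.copy.CopyManager;
    import org.postgresql.core.BaseConnection;

    public class PostgresStreamCopy {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/dev", "user", "password");
            CopyManager copy = new CopyManager((BaseConnection) conn);
            // Rows are streamed straight into the table, no staging in S3 first.
            String rows = "1\tclick\n2\tpageview\n";
            copy.copyIn("COPY events FROM STDIN",
                        new ByteArrayInputStream(rows.getBytes(StandardCharsets.UTF_8)));
            conn.close();
        }
    }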


I'd have to agree with arielweisberg here. Our organization was really excited about Redshift a few days ago, but after seeing each of our individual INSERTs take upwards of 2 seconds, and hearing that we should first upload to S3 or Dynamo, we decided the platform would not fit our needs.

Our goal is minimal architectural complexity, and uploading log files or other data to a file system before loading them into a data warehouse just doesn't make sense.

We're currently looking into Hadoop/HDFS/Impala due to cost constraints (Vertica would have been our primary choice). If anyone has any other suggestions, it would be great to hear them.


It runs each statement individually? Or commits each statement individually?

Column stores are destined to have slower inserts, due to how the data is stored on disk. But if they are actually committing each statement individually, that is a problem.


That is not the case with Vertica. Trickle loading is table stakes functionality IMO.


What are the best options for clickstream tracking for storing in a data warehouse?

I've looked at Snowplow (https://github.com/snowplow/snowplow) -- is that what most people are using, are you rolling your own, etc?


We use the Snowplow JS tracker with a custom Django app that we include in each project. It stores clicks and events in Redis, as well as in gzipped logfiles for permanent storage. The Redis data expires after a configurable period of time.
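
Not the Django code itself, but to illustrate the expiring Redis buffer, here's a rough Jedis sketch (key name, payload, and TTL are invented):

    import redis.clients.jedis.Jedis;

    public class EventBuffer {
        public static void main(String[] args) {
            Jedis redis = new Jedis("localhost", 6379);
            // Append the event to a per-day list, then let the whole key expire
            // after the retention window (7 days here, purely illustrative).
            String key = "events:2013-02-15";
            redis.rpush(key, "{\"event\":\"click\",\"page\":\"/pricing\"}");
            redis.expire(key, 7 * 24 * 60 * 60);
            redis.close();
        }
    }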


What's the difference between this and Amazon's RDS or S3? Is it just data storage with an easy way to query the aforementioned data?

Seems like an "odd" product that kind of competes with Amazon's existing offerings in many ways...


This is optimized for bulk analysis. RDS is more optimized for low-latency queries. This is because Redshift is column-oriented (http://en.wikipedia.org/wiki/Column-oriented_DBMS).


Yep. Redshift is based on ParAccel. We used that DB for some really awesome large-scale analytics on a previous project.

http://www.zdnet.com/amazon-redshift-paraccel-in-costly-appl...


This integrates seamlessly with existing data warehouse software, and Redshift handles everything you'd normally have to configure, import, read, and query yourself using RDS; plus it uses S3 to store its backups and lets the client restore their datasets from it. So it doesn't compete with those offerings; my understanding is that it may well use both, but in a ready-made package that integrates with existing data warehousing solutions.


Amazon RDS is for OLTP workloads. Redshift is a distributed, column-oriented store that's designed for OLAP workloads. For more info: http://aws.amazon.com/redshift/faqs/#0110


Both Redshift and Google's BigQuery are very useful for enterprise applications that involve aggregating data in many ways.


How is this different from or better than Google BigQuery?

How does speed/performance compare to something like what's shown here:

https://cloud.google.com/bigquery-tour


Perfect, but how do we upload the 300 GB of raw data we've already collected?


Use the AWS Import/Export service to bring it into S3; then you can load it from there:

http://aws.amazon.com/importexport


You can ship actual disks/drives to AWS for bulk loading, or set up a high-bandwidth connection with them for bulk loads.



