
Amazon Announces new Data Warehousing Product - secalex
http://aws.amazon.com/redshift/
======
dude_abides
This has the potential of really disrupting the enterprise data warehouse
sector. All the MPP vendors today (HP Vertica, EMC Greenplum, Teradata) have
exorbitant pricing and ridiculous licensing. With Amazon's pricing of $1000 per
TB per year, I would be really worried if I were Teradata (not so much if I
were IBM).

~~~
jasondc
A lot of large enterprises won't be comfortable hosting their data outside of
their own data centers. The killer application is making a portable, on
premises version of this functionality without the high price.

~~~
optimusclimb
I've heard this argument time and time again in the context of various
solutions/technologies. I still feel that for 95% of the companies that
"feel this way", it's simply the result of foolish paranoia among older, upper
management. It's the type of thing that separates a "Fast Company" from stodgy
companies that are likely to be disrupted.

~~~
andyzweb
Question: would you trust all your medical data/history and shopping
data/history tied up in their cloud?

~~~
falcolas
What makes you think it isn't already?

Hospitals & doctors outsource, and as long as the provider is HIPAA compliant
(which AWS is[1]), your data is probably out there already.

[1]
[http://awsmedia.s3.amazonaws.com/AWS_HIPAA_Whitepaper_Final....](http://awsmedia.s3.amazonaws.com/AWS_HIPAA_Whitepaper_Final.pdf)

------
23david
Have to say that this is pretty amazing. The price is so low that it's a
no-brainer to just give it a try. For the same 2TB capability, a Vertica license
would run between $20-40K, with high annual subscription fees.

The bigger question for me is why Amazon has been able to figure out the
technical details necessary to run this kind of service for this price. It's
just ridiculous. Talk about taking the oxygen out of the market...

~~~
perlgeek
> The bigger question for me is why Amazon has been able to figure out the
> technical details necessary to run this kind of service for this price.

I guess they grew the infrastructure for themselves, optimizing it bit by bit
over the years. And then noticed that it could be sold too.

~~~
justincormack
They seem to have taken their own business requirements for amazon.com and
reimplemented them on commodity hardware.

------
bravura
Does anyone have insight into how painful it is for non-technical people to
_query_ their data warehouses?

I'm building a tool that allows business people and non-technical analysts to
query their data warehouses using _natural language_. (Currently, you must ask
a technical person to write ad-hoc queries for you, or build you a dashboard.
This bogs down your data people.)

Does anyone have insight into the demand for such a product?

[edit: I'd love to chat with anyone with insight into this topic. Reach me at
Joseph at metaoptimize dot com]

~~~
meritt
I would suggest reading some books on the topic of Dimensional Modeling [1]
such as "The Data Warehouse Toolkit" [2]. The critical thing you need to
expose to your users is the ability to ask for things which make sense in
their world that are actually really difficult for even an engineer to code.
Things like: "Show me average 9am-12pm sales on Mondays, Wednesdays and Fridays
for 1st quarter, 2012"

    
    
[1] <http://en.wikipedia.org/wiki/Dimensional_modeling>

[2] <http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247>

~~~
alex_anglin
Speaking as someone who does his fair share of dimensional modeling, I would
just point out that the example you cite could only involve two tables in a
well designed dimensional model (sales fact and time/date dimension, I
reckon). The challenge is in getting to that point.
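
To make that two-table shape concrete, here is a minimal sketch using Python's built-in sqlite3. The table and column names (sales_fact, date_dim, hour_of_day, day_name) are invented for illustration, and "average sales" is read here as the average amount per sale row. A real warehouse would likely model time of day differently, so treat this as one possible layout rather than the way to do it.

```python
import sqlite3


def build_demo():
    """Create a tiny in-memory star schema: one fact table, one date dimension."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE date_dim (
            date_key INTEGER PRIMARY KEY,
            year     INTEGER,
            quarter  INTEGER,
            day_name TEXT
        );
        CREATE TABLE sales_fact (
            date_key    INTEGER REFERENCES date_dim(date_key),
            hour_of_day INTEGER,   -- 0..23, hypothetical column
            amount      REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO date_dim VALUES (?, ?, ?, ?)",
        [
            (20120102, 2012, 1, "Monday"),
            (20120103, 2012, 1, "Tuesday"),
            (20120404, 2012, 2, "Wednesday"),
        ],
    )
    conn.executemany(
        "INSERT INTO sales_fact VALUES (?, ?, ?)",
        [
            (20120102, 9, 10.0),   # Monday morning, Q1: included
            (20120102, 11, 30.0),  # Monday morning, Q1: included
            (20120102, 14, 99.0),  # Monday afternoon: excluded
            (20120103, 10, 50.0),  # Tuesday: excluded
            (20120404, 10, 70.0),  # Q2: excluded
        ],
    )
    return conn


def avg_morning_sales(conn):
    """Average sale amount, 9am-12pm, Mon/Wed/Fri, 1st quarter 2012."""
    row = conn.execute(
        """
        SELECT AVG(f.amount)
        FROM sales_fact AS f
        JOIN date_dim AS d ON d.date_key = f.date_key
        WHERE d.year = 2012
          AND d.quarter = 1
          AND d.day_name IN ('Monday', 'Wednesday', 'Friday')
          AND f.hour_of_day BETWEEN 9 AND 11   -- 9:00 through 11:59
        """
    ).fetchone()
    return row[0]
```

The business question turns into plain filters on dimension attributes, which is the part a semantic layer can expose to non-technical users.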

To speak to OP's point about difficulty in querying data warehouses, most
business intelligence tools that I'm aware of provide semantic layer[1]-type
capabilities, whereby the user interface of the tool is presented in the
language of the business domain. Nevertheless, I agree that this is still
difficult work, unfortunately. That it is getting more complicated in
some respects, such as through unstructured data, doesn't help either.

[1] <http://en.wikipedia.org/wiki/Semantic_layer>

~~~
meritt
I guess I wasn't clear enough if I came across like my example was complex.
It's one easily solved via DM and one that's extremely hard to execute in most
non-dimensionally-modeled setups. That's exactly why I'm a huge advocate of DM
instead of just throwing a ton of servers, Hadoop & MR at everything.

------
monstrado
I'm curious what technology they are using to power it. According to the
website, the technology described seems very similar to what Cloudera recently
open sourced (Impala), which sits alongside Hadoop, allowing ad-hoc MPP-style
querying on petabytes of data.

<https://github.com/cloudera/impala>

~~~
jeremyjh
I'm guessing it is quite a bit different from that. It is a relational data
warehouse. It supports a Postgres protocol and API, which sounds more like
what Netezza has built. In fact, I would expect Netezza to be one of the most
likely companies to partner with Amazon at this kind of price-point.

~~~
lazyjones
Other candidates:

* Yahoo's Everest

* Greenplum

* Aster Data

All mentioned here:
[http://www.cubrid.org/blog/dev-platform/database-technology-...](http://www.cubrid.org/blog/dev-platform/database-technology-for-large-scale-data/)

The Register wrote that Amazon's solution is a column-oriented database
possibly based on Postgres, like Yahoo's:

[http://www.theregister.co.uk/2012/11/28/amazon_aws_redshift_...](http://www.theregister.co.uk/2012/11/28/amazon_aws_redshift_data_warehousing/)

------
kanwisher
It will be interesting to see whether this is a viable competitor to
column-oriented SQL engines like Vertica or other OLAP solutions like SAP HANA.
It would be nice if there were a simple SQL-based OLAP solution that I could
spin up for offline reporting and that can scale to terabytes of data.

------
zrail
Now if only Amazon would offer PostgreSQL on normal RDS.

~~~
seiji
Doesn't RDS crap itself whenever there's a core AWS problem?

~~~
semiquaver
Yes, since it relies on EBS for persistence, which has been one of the
flakiest parts of AWS so far.

~~~
ceejayoz
All it takes is an EBS outage to determine this.

In the last one, RDS and ELB had issues (and were flagged as such on their
status board) due to needing EBS, but I don't believe DynamoDB was.

------
23david
Update! The entire keynote is now available on youtube:
<http://www.youtube.com/watch?v=8FJ5DBLSFe4>

The discussion about Amazon Redshift begins at 52:50
[http://www.youtube.com/watch?feature=player_detailpage&v...](http://www.youtube.com/watch?feature=player_detailpage&v=8FJ5DBLSFe4#t=3175s)

------
rpicard
What is the use case for something like this versus a regular RDS service?

~~~
dgreensp
The answer is in the term "data warehousing" --
<http://en.wikipedia.org/wiki/Data_warehouse> -- which implies that
you're going to be doing data mining on vast amounts of data, often historical
data like logs or transaction histories.

Google has systems like this for analyzing its request logs. Think of how many
HTTP requests hit Google's front-end servers per second or hour or day. Each
one has a few dozen pieces of data associated with it -- URL, client IP,
headers, etc. Suppose I want to make a bar chart of how many requests came
from France containing a certain header, each day for the last year. The
system can do this query quickly if the requests are already bucketed by time
interval, organized by column, compressed, and stored so that exactly the
information needed can be brought into RAM quickly.
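
A toy sketch of the "organized by column" idea, in Python. This is not how Google's or Amazon's systems are implemented; the field names and sample rows are made up. The point is only that the query reads the three columns it needs and never touches the others (URL, client IP, and so on).

```python
from collections import Counter

# Column-oriented layout: one list per field, aligned by row index.
# Only the columns this query needs are shown; in a real column store the
# other fields would live in separate storage and would not be read at all.
requests = {
    "day":        ["2012-11-27", "2012-11-27", "2012-11-28", "2012-11-28"],
    "country":    ["FR",         "US",         "FR",         "FR"],
    "has_header": [True,         True,         False,        True],
}


def daily_counts(columns, country):
    """Count requests per day from `country` that carry the header of interest."""
    counts = Counter()
    rows = zip(columns["day"], columns["country"], columns["has_header"])
    for day, row_country, has_header in rows:
        if row_country == country and has_header:
            counts[day] += 1
    return dict(counts)
```

Calling `daily_counts(requests, "FR")` yields one bar per day for the chart described above.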

It is a little funny, when you step back, that "storing," "archiving," and
"warehousing" are different things and Amazon has services for each. Try
explaining the difference between S3, RDS, EBS, Glacier, and Redshift to a
layperson.

~~~
rpicard
Thanks for the response. Would it make sense to say that this is more likely
to be used for metadata (i.e. analytics, logs, etc.) while a normal RDB (or
NoSQL DB) would be used for application data (i.e. users, settings, etc.)?

------
K2h
It's called Redshift!

wow.. I just finished reading the sci-fi book a few weeks ago - "Redshift
Rendezvous" by John E Stith. I wonder if this is where the name comes from? In
the book, Redshift is the name of the spaceship that runs cargo missions
through folded space. The obvious problem is that, since you are traveling
within just a few m/s of the speed of light, just walking on the ship while
underway causes color shift - thus redshift.

I read that Stith has a physics degree and worked as an engineer for NORAD
Cheyenne Mountain. That made me really interested in what novel he would come
up with. <http://www.neverend.com/short-bio-john-e-stith>

~~~
jmoiron
Redshift is a real physical phenomenon describing the way light wavelengths get
"shifted" (stretched, to visualize) towards the red as they are seen coming
from something moving away from the observer:

<http://en.wikipedia.org/wiki/Redshift>

<http://en.wikipedia.org/wiki/Hubbles_law>

------
kzahel
It seems that the price (~$1 / GB / year) in the best case (3 year reserved)
is comparable to S3 at its lowest tiers (~$0.1 / GB / month)

~~~
pierrend
It's "Price per TB per Year" not GB.

------
23david
Very cool that this will support regular SQL queries and that queries can be
sent using PostgreSQL drivers. PostgreSQL drivers are super stable and supported
everywhere. Driver support is usually overlooked with 'Enterprise' Data
Warehousing solutions. I recall that it was really hard to get the Vertica
drivers installed and stable under Linux.

I took a few screenshots from the keynote and included one showing the mention
of Postgresql and ODBC/JDBC support. Included here if you want to see for
yourself: <http://wp.me/p2sRpx-1e>
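
For anyone wondering what "using PostgreSQL drivers" might look like in practice, here is a rough sketch with psycopg2, a common Python PostgreSQL driver. Everything specific here is an assumption: the host, database, and user names are placeholders, and the default port of 5439 is not something stated in the keynote. The driver import is deferred so the helper can be used without psycopg2 installed.

```python
def redshift_conn_params(host, dbname, user, password, port=5439):
    """Build keyword arguments for a PostgreSQL driver's connect() call.

    The port default is an assumption; use whatever your cluster reports.
    """
    return {
        "host": host,
        "port": port,
        "dbname": dbname,
        "user": user,
        "password": password,
    }


def run_query(params, sql):
    """Run one SQL statement over the PostgreSQL wire protocol and fetch all rows."""
    # Imported lazily so this module loads even where psycopg2 is not installed.
    import psycopg2

    conn = psycopg2.connect(**params)
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()
```

Usage would be something like `run_query(redshift_conn_params("my-cluster.example.invalid", "analytics", "reporter", "secret"), "SELECT 1")`, with the placeholder values swapped for a real endpoint.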

------
sologoub
Maybe a naive question, but how does this compare with Google Big Query?

~~~
rorrr
<https://developers.google.com/bigquery/docs/pricing#table>

$1474 per TB per year for storage alone ($0.12 * 1024 * 12)

plus

$35.84 per TB queried

Amazon is definitely cheaper.
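
The arithmetic behind those figures, as a small Python check. The $0.12 per GB per month storage price is the one used above; the $0.035 per GB query price is inferred from $35.84 / 1024 and is an assumption, as is reusing the $1000 per TB per year figure quoted for Redshift elsewhere in this thread.

```python
GB_PER_TB = 1024


def bigquery_storage_per_tb_year(price_per_gb_month=0.12):
    """Yearly storage cost for 1 TB at a per-GB monthly price."""
    return price_per_gb_month * GB_PER_TB * 12


def bigquery_query_per_tb(price_per_gb=0.035):
    """Cost to query 1 TB; the per-GB price is inferred, not quoted."""
    return price_per_gb * GB_PER_TB


def storage_ratio(redshift_per_tb_year=1000.0):
    """How many times more BigQuery storage costs than the quoted Redshift price."""
    return bigquery_storage_per_tb_year() / redshift_per_tb_year
```

Note that the Redshift figure covers the whole cluster, while the BigQuery storage figure excludes query charges, so the ratio is a lower bound on the gap for anyone who actually runs queries.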

------
polskibus
I cannot find information on whether Redshift supports queries in MDX. Lots of
DWs today are run on Microsoft SQL Server Analysis Services and its MDX spec
is now supported by several DW vendors. MDX support would mean you could
switch the DW engine and keep your visualisation suite (or Excel, what the
hell), which would make for an easy switch to the cloud - you'd just pick a
different data source in your tool.

------
mgl
Looks impressive and very interesting, signed up to review and compare with
Teradata/Netezza.

Can we run more complex in-database processes implemented as stored procedures
on this platform or is it going to be limited to pure SQL querying/analytics?

And does anyone have an idea how to upload 1 TB of data to this service using
the Internet connection from your in-house company server? ;)

~~~
maineldc
AWS has pretty good support for taking external drives and importing them to
S3 which could then be used with this service:

<http://aws.amazon.com/importexport/>

I am assuming that you have 1TB to start, not generating 1TB per day which
obviously changes the equation.

------
alexatkeplar
This looks awesome - we'll definitely be plugging SnowPlow into this.

------
baltcode
So is this the Amazon clone of Google's Spanner?

~~~
scott_s
No. Spanner is a globally distributed database which supports transactions. It
is meant for applications which need to make frequent updates to a database,
but the storage for the database may be distributed around the world.

Redshift is a different usage model. You upload your data once, then ask
questions of it - but you don't update it. Google does have something similar
to Redshift: BigQuery (<https://cloud.google.com/products/big-query>).

