
Redshift Spectrum – Exabyte-Scale In-Place Queries of S3 Data - hepha1979
https://aws.amazon.com/blogs/aws/amazon-redshift-spectrum-exabyte-scale-in-place-queries-of-s3-data/
======
georgewfraser
This appears to be Amazon Athena / Presto, embedded into Redshift. The syntax
for CREATE EXTERNAL TABLE is exactly the same, and the supported file formats
and compression encodings are a subset of Athena. The big advantage of an
approach like this versus running Redshift and Athena separately is that you
can write a single query that joins data stored in Redshift with data stored
in Athena.
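For illustration, a sketch of what that single-query join might look like (schema, table, bucket, and IAM role names here are made up; the DDL follows the documented CREATE EXTERNAL SCHEMA / CREATE EXTERNAL TABLE shape):

```sql
-- Register an external schema backed by the data catalog
-- (database name and role ARN are hypothetical).
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/mySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Describe the S3 data in place; nothing is loaded into the cluster.
CREATE EXTERNAL TABLE spectrum.sales (
  sale_id   BIGINT,
  item_id   INT,
  sale_date DATE,
  amount    DECIMAL(12,2)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales/';

-- One query joining S3-resident data with a local Redshift table.
SELECT i.category, SUM(s.amount) AS revenue
FROM spectrum.sales s
JOIN items i ON s.item_id = i.item_id   -- "items" lives in Redshift
GROUP BY i.category;
```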

~~~
idunno246
And you can actually materialize a query in Redshift, since Athena doesn't have
CREATE TABLE AS SELECT. This is neat, though I wonder if opening Athena up to
connect to Postgres (Presto can) and doing the join inside Athena would have
been better.
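Concretely, materializing S3-backed data into a local Redshift table is just CTAS over the external table (illustrative names; `spectrum.sales` stands in for any external table):

```sql
-- Materialize an aggregate of external (S3) data as a local table;
-- Athena lacked CTAS at the time, so this is the Redshift-side win.
CREATE TABLE daily_sales AS
SELECT sale_date, SUM(amount) AS total
FROM spectrum.sales        -- external table over S3
GROUP BY sale_date;
```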

~~~
openasocket
I'd say that I have more confidence in the Redshift query planner and
execution engine than in Presto's. And having the option to join that with
data stored in memory on Redshift nodes is very attractive.

------
teddyknox
When I think of exabyte-scale queries on a columnar datastore I think of
aggregations, but then I have this question: why do we need to do exabyte-
scale queries in the first place? Wouldn't statistical inference via random
sampling be faster and accurate enough?

(Granted, aggregations often happen after some filtering, at which point the
relation being aggregated might be considerably smaller than exabyte scale.)

~~~
awgupta
It really depends on what you are doing. A large data set shouldn't be limited
to longitudinal analysis. If you're storing every log record or every stock
bid/ask, there may be times when you need to understand the specifics of what
exactly was going on. There may be a lot of filtering on the underlying corpus
for these sorts of exact-match queries, but data set sizes continue to grow.

~~~
awgupta
That said, I agree that approximate functions should be part of a modern
database system. Redshift has approximate count distinct (based on
HyperLogLog) and approximate percentiles (based on quantile summaries).
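For reference, both surface in Redshift SQL as APPROXIMATE variants (column and table names here are made up):

```sql
-- HyperLogLog-based approximate distinct count.
SELECT APPROXIMATE COUNT(DISTINCT user_id) AS uniques
FROM events;

-- Quantile-summary-based approximate median.
SELECT APPROXIMATE PERCENTILE_DISC(0.5)
       WITHIN GROUP (ORDER BY latency_ms) AS p50
FROM events;
```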

------
jnordwick
This seems really slow for queries, especially when taking into account all
the computing power being thrown at it:

Over 6 billion rows (not huge by modern standards), a relatively common
aggregation query with 4 basic aggregates (2 SUMs, 2 AVGs), one WHERE clause,
and two GROUP BY columns, over 1 table (no joins), takes about 4.25 minutes
(254.650 seconds).

On some columnar databases on good hardware, a single machine could probably
get this down to a couple of seconds, possibly faster.
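For context, the query shape being described is roughly this (hypothetical table and column names):

```sql
-- ~6B-row scan: 2 SUMs, 2 AVGs, one WHERE, two GROUP BY columns, no joins.
SELECT region,
       product,
       SUM(quantity) AS total_qty,
       SUM(price)    AS total_price,
       AVG(quantity) AS avg_qty,
       AVG(price)    AS avg_price
FROM sales
WHERE sale_year = 2016
GROUP BY region, product;
```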

~~~
teej
Just for comparison's sake, I recreated Jeff's example queries on my in-house
table that's ~8B rows. COUNT(*) takes 6s, and the SELECT with 2 groups, 2
sums, and 2 averages takes 52s. The downside is that the table takes 600 GB of
disk space compressed, which is a major chunk of our cluster. But here's the
kicker - I would never query this table for data I needed in under 5 min, so
my current performance is meaningless.

Given our access patterns for this table, I'm going to investigate using
Redshift Spectrum for it. It seems like a huge win for us.

~~~
jnordwick
What are you using?

~~~
teej
Sorry, this is just vanilla redshift.

------
filereaper
Very BigQuery-like system.

Has anything been sacrificed for this type of scale-out?

BQ and other Dremel-based systems are weak at joins for star schemas, and
frequently advise denormalizing data.

~~~
scribu
> Very BigQuery like system.

What makes you say that?

The main innovation in BigQuery was the ability to store and query nested
data.

Amazon Redshift doesn't support querying nested data. It only has some
convenience functions for loading flat data from nested JSON files hosted on
S3.
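The convenience functions being referred to are along the lines of COPY's JSON mode and JSON_EXTRACT_PATH_TEXT, which flatten on the way in rather than querying nested structures in place (bucket, table, and role names here are illustrative):

```sql
-- Load flat fields out of nested JSON files on S3 into a regular table.
COPY events
FROM 's3://my-bucket/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/myCopyRole'
FORMAT AS JSON 'auto';

-- Or pull a nested field out of a JSON string column after loading.
SELECT JSON_EXTRACT_PATH_TEXT(payload, 'user', 'id') AS user_id
FROM raw_events;
```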

And what I assume Spectrum does is just perform that loading step behind the
scenes.

~~~
puzzle
On a first read, it sounded like you don't need to run a cluster ("Spectrum
scales to thousands of instances"), which is BigQuery's big advantage. Later,
it says that you still need a cluster, but it's only used in the final
processing stages. So it looks like it's 3/4 of the way to BQ.

------
johns
Can anyone clarify when or why you would use this instead of, or with,
Athena?

~~~
bladeaod
It seems to me that this is a bridge between Redshift and S3, so you can join
data from both sources. I believe Athena is S3-only.

------
nside
Great PR!

How much does it cost to store 1 exabyte on S3?

1 exabyte = 1,000,000,000 GB

Cost of storing 1 GB on S3 in us-west-2 = $0.024 per month

That's 24 million dollars a month. What am I missing?

~~~
Dunedan
Fun fact: Using any other public US AWS region would save 3 million dollars
per month!

Aside from that: using infrequent-access storage for the parts of the data
which aren't accessed frequently would save a lot, and I'm pretty sure that at
that scale AWS would be happy to discuss possible discounts as well.

------
bsg75
Does anyone know what this is based on? Since Redshift is a fork of Postgres
v8, I would not expect it to be a foreign data wrapper.

~~~
filereaper
Looks like Google BigQuery, which is based on the Dremel paper.

~~~
jeffbarr
This is not BigQuery.

~~~
puzzle
It's not, but it resembles BQ a lot in the central part of the query
execution, where it transparently performs work on your behalf on thousands of
other machines that you don't have to be aware of. Is that correct?

~~~
openasocket
1. No, you still need a Redshift cluster, which performs the work. A better
example would be something like Athena.

2. That's not really a big similarity. Having many machines coordinate on
data processing is really common. So if BQ and Redshift Spectrum are similar,
then so are Athena, Presto, any MapReduce-based system, Spark, etc.

~~~
puzzle
My main point is that there is work happening behind the scenes that you don't
have to worry about, on an undefined number of machines that you don't have to
care about. Yes, there is still a cluster that does some kind of planning and
post-processing, plus there is obviously a lot of coordination, but it's not
like Hadoop or classic Redshift, where you are constrained to the hardware
that you paid for and set up. It's the provisioning in the central phase that
resembles BQ the most.

~~~
openasocket
In that case Athena is more what you're talking about. Redshift Spectrum is
weirder: you still have to provision and pay for a Redshift cluster, but for
some S3 processing it calls out to other, independent servers. I guess I'd
consider that a hybrid approach.

------
nieksand
From the blog post, it's very unclear what the positioning is between Redshift
Spectrum and Athena.

------
buremba
ORC support?

~~~
fs111
What for, if you have Parquet? ORC feels like a "me too" technology from
Hortonworks.
~~~
buremba
We have 300 TB of ORC data and can't convert it to Parquet. Also, ORC performs
better for our use case.

~~~
hcoyote
What, specifically, is preventing the conversion? I've converted hundreds of
terabytes to Parquet (including moving away from that Hive ACID stuff).

~~~
buremba
Mainly the development cost. We trust ORC and have good experience with it;
it's stable, fast, and compact. We don't have any experience with Parquet, and
it's just too hard/risky to make this kind of conversion on a production
system.

