
Amazon Athena: Query S3 Using SQL - polmolea
https://aws.amazon.com/athena/
======
danso
This looks very neat. I'm someone who deals with a lot of plaintext data from
a variety of sources, and so I find using ack/grep and csvkit to be efficient
enough for my purposes of exploration. I love using SQL and SQLite but rarely
do it for "fun" \-- that is, I'll use it when I've committed to building a
project, but not for exploration. This seems like it could lighten the
friction quite a bit.

If anyone from AWS is here: how is this used internally at Amazon?

~~~
ktamura
The real question to ask is, will Amazon contribute back to open source?
Presto itself is plenty proven and scalable: after all, it was created at
Facebook.

------
bsg75
"Amazon Athena uses Presto with ANSI SQL support and works with a variety of
standard data formats, including CSV, JSON, ORC, and Parquet."

I wonder if this is essentially a Presto SaaS product?

~~~
maslam
Yes

------
spullara
It looks really interesting but I'm surprised they launched it with the create
table flow broken. The query you see here was generated by their wizard...

[https://www.dropbox.com/s/s4cw5x7yyrdl3ch/Screenshot%202016-...](https://www.dropbox.com/s/s4cw5x7yyrdl3ch/Screenshot%202016-11-30%2009.56.04.png?dl=0)

------
jakozaur
Looks very similar to Google Big Query.

Even the pricing is same: $5 / TB of data scanned.

~~~
estefan
When I tried it it was slower than bigquery. Plus you've got to mess about
creating hive schemas.

~~~
spullara
I don't know why you are getting downvoted. For all those data formats you
have to painstakingly make table schemas for them before you can query them.
Not like Snowflake or BigQuery. One of the biggest strikes against Presto
IMHO.

~~~
ktamura
It's not Presto per se, but running any data processing workload against
unoptimized data formats is the issue.

Then again, both BigQuey and Snowflake require that you move data into their
storage engine (Redshift too), and that's an additional step that's
proportional to the size and complexity of your data. At the same time, it's
stupid to store your logs as OLAP optimized formats and completely lose
legibility. In sum, Athena trades off performance for convenience.

No matter what database vendors say, you can't defy the principles of computer
science.

~~~
fhoffa
Note that BigQuery has been able to read files straight from GCS, Drive, and
even Google Spreadsheets for a while:

[https://cloud.google.com/bigquery/federated-data-
sources](https://cloud.google.com/bigquery/federated-data-sources)

(I'm Felipe Hoffa and I work for Google
[https://twitter.com/felipehoffa](https://twitter.com/felipehoffa))

------
buremba
I hope that Amazon contributes back to the Presto community.

------
nimrody
Would be useful if AVRO files were supported. This was the data can also be
imported into Redshift if needed (Redshift does support Avro).

Other formats are schema-less (JSON,CSV, etc.) or not supported by Redshift
(ORC, Parquet). Perhaps less efficient for some queries (AVRO is not a
columnar format) but still useful.

------
dhananjayc
Is it possible to connect Athena to existing Hive Metastore?

------
nodesocket
Anybody have an example of storing NGINX access logs and using Athena to
search them?

------
neximo64
Any examples of queries and what this can do? S3 was file storage as far as i
thought?

~~~
raghavsethi
Athena (Presto) supports standard ANSI SQL - you can query data that's stored
in S3.

~~~
neximo64
How does that work though, so say my bucket has 10,000 json files in it and I
want all of them with the name attributes being like '%john'. Is that
possible?

------
cdevs
$5 a terabyte jeebus...don't f that query up

~~~
bsg75
Just like with BigQuery, a carefully thought out partitioning scheme is
critical, or your queries need to be carefully locked down to prevent
excessive table scanning. I burned through my BigQuery trial credit fast, by
not using partitions during a quick-and-dirty test.

------
nulagrithom
Wondering if I could use this like SQLite for Lambdas. I'd like to build some
serverless apps, but the commitment to a monthly fee from DynamoDB puts me
off. Could I use Athena to drive down my cost to zero as long as the app is
unused?

~~~
brilliantcode
DynamoDB is like $5 or $10 bucks a month? but I understand the need to keep it
to a minimum.

Athena is really interesting and if it can be as it is advertised "Serverless
SQL" then they've got a killer product in the pipes: A future where developers
no longer need to spend time on scaling, configuring, maintaining,
strategizing deployments but upload code and instantly begin reaping the
benefits of serverless tech.

The only missing component that would be a killer feature is something that
answers to Azure's Active Directory. It would be nice if we had serverless
plug-and-play user authentication and access control that integrated with
Lambda and Athena.

I'd imagine some sort of "RoR on Serverless" type of framework that will
scaffold out CRUD, User Management & REST Api is going to be in the works as
well.

The only potential downside I see at the moment for Serverless is the
uncertainty surrounding cold boots, it will directly affect user experience.
It's fine when you got enough traffic to keep things in the "warm" state but
there needs to be no dead zone when the call to the API Gateway is taking many
seconds waiting for Lambda function to fire.

~~~
asteadman
Re: users auth. Isn't that what Cognito is supposed to be? I mean, I don't
fully understand it, but I think so.

As for the cold boot issue, I thought the standing solution was to have a
"fast-exit" ping-like code-path within the lambda. Query it on a regular basis
(you can even do it with a lambda scheduled-event). That way your lambda
should be kept warm.

~~~
brilliantcode
TIL Cognito!

That completely flew under my radar, not sure why I didn't see it before (oh
that's right I was heads down in Azure).

With Athena the circle is complete for me.

That fast exit ping thing is pretty cool, any more information regarding that?

Your comment is probably the most valuable one I came across to date since
signing up, I wish there was a way to award a gold star like on reddit :D

There's very little objection at this point in moving to a Serverless
architecture = Athena (SQL) + Lambda (CPU) + Cognito (User).

~~~
asteadman
What's your usecase for Athena (Really curious how this changes anything)?

You can hook your lambda up to a cloudwatch scheduled event
([http://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Sc...](http://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html)).
If you don't do much when this happens (like, you exit immediately), then you
won't be charged much compute time. Can't find the reference to how often you
should ping it to keep it warm, probably varies, IIRC hourly would be plenty.

~~~
brilliantcode
building user management, security has always been a "build your own wheel".
While I think a properly secured SQL table with encrypted keywords is
certainly possible the risk remains. We see even for prolific and large
websites their entire user tables are dumped online on a regular basis.

Off-loading this burden on a cloud vendor whose sales are directly tied to the
security, gives me peace of mind. Ashley Madison had no financial incentive to
keep their user base as secure as possible or actively defend against it using
R&D. Yahoo Mail as well. But AWS and Active Directory are products that sell
this security and for them to fuck up on an epic scale would mean their end.

so the tldr is piggybacking on the tremendous resources from a cloud giant
which frees up resources to focus on the core product.

This is not to say that it's impossible to roll out your own wheel, I'm just
saying it makes more sense to align the financial incentives of vendors
maintaining the user base security. This may or may not mean I'm open to
focusing on companies solely focused on user auth/management products, it's
tough to beat the branding and trust built into AWS & Azure that developers
are voting everyday with their money & data.

------
balls187
Tried it twice, and it crashed big time.

~~~
balls187
Also gives me a 500 on US-WEST-2

------
asafm
I wonder why they haven't chose Apache Drill over Presto. Anyone knows?

------
intrasight
what does "point your data in S3" mean?

~~~
justinsaccount
Are you talking about this? you left out a word.

Simply point to your data in Amazon S3

~~~
intrasight
Still makes no sense. Please explain if you understand.

~~~
asteadman
To me the obvious use case is querying your log files as stored on s3. Query
for a specific combination of features, or do some (simple) processing on
them.

It's really only useful for a small list of file formats. Doesn't really do
much for you if you primarily use s3 for binary data or static web hosting.

------
mrwnmonm
John Forstrom: Amazon Athena - welcome to 2010!
[https://twitter.com/jforstrom/status/804007642246938624](https://twitter.com/jforstrom/status/804007642246938624)

