
A Billion NYC Taxi and Uber Rides in AWS Redshift - marklit
http://tech.marksblogg.com/billion-nyc-taxi-rides-redshift.html
======
ktamura
Quite frankly, for data of this scale, PostgreSQL is more than adequate on a
modern machine (although the OP's step-by-step guide is surely helpful for
those new to Redshift).

For those interested in fun and insightful analyses of the data, check out
this blog: http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/

~~~
JimmyAustin
I'm not sure I would want to try analyzing a 267 GB database on most modern
machines. Redshift, on the other hand, is optimized quite well for rapidly
querying flat data sets like this.

That said, since Redshift is PostgreSQL 8.0.2 under the hood, any code you
write for analysis should be easy to move over (with the exception of having
to add dist and sort keys).
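
For anyone who hasn't seen Redshift DDL before, the dist/sort key part is
just a couple of table attributes. Here's a minimal sketch (the cluster
endpoint, credentials, and table layout are all placeholders, not from the
article; Redshift speaks the Postgres wire protocol, so stock psycopg2
connects fine):

    import psycopg2

    # Placeholder endpoint/credentials -- Redshift listens on 5439 and
    # speaks the PostgreSQL wire protocol, so a stock Postgres driver works.
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="taxi", user="admin", password="...")

    ddl = """
        CREATE TABLE trips (
            trip_id          BIGINT,
            pickup_datetime  TIMESTAMP,
            pickup_latitude  DECIMAL(9,6),
            pickup_longitude DECIMAL(9,6),
            total_amount     DECIMAL(8,2)
        )
        DISTKEY (trip_id)           -- decides which node stores each row
        SORTKEY (pickup_datetime);  -- decides on-disk order within a node
    """
    with conn.cursor() as cur:
        cur.execute(ddl)
    conn.commit()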

~~~
manigandham
> I'm not sure I would want to try analyzing a 267 GB database on most
> modern machines.

Why? Modern servers come with more RAM than that. Even multi-terabyte
datasets are not a problem. Today's hardware is way better than most people
think.

~~~
lwhalen
Could not agree more. Production data needs to be bigger than a petabyte or
three before my default response is anything other than 'let's load it into
RAM.' :-)

~~~
sokoloff
You mean terabyte, right?

~~~
lwhalen
Well shucks, this is why I try not to HN at 3am on the regular... Yes,
terabyte, not petabyte. And I'm well past the edit interval at this point; how
embarrassing, mea culpa :-)

------
Narkov
Really interesting article, but I wish there were more data on the
speed/performance of the Redshift queries. It seems to cover everything but
the actual performance metrics!

~~~
JimmyAustin
From my experience analysing similarly sized datasets, it really comes down
to the structure of the query and the dataset. If you nail your dist keys
and your queries don't have many large joins, it's incredibly quick. If your
dist keys aren't great and you have many joins, it will take a long, long
time.

~~~
stingraycharles
Exactly. You have to keep in mind that other than dist keys / sort keys,
there are no indexes, so all you can do to speed things up is organize your
data better.

For a ballpark figure, we work with a dataset of ~500 million rows and 12
columns, and complex queries with a lot of joins typically take around
10-20 minutes to complete on Redshift's smallest cluster (~$250/month).
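
If you want to know whether a given join is going to hurt before running it,
EXPLAIN helps: Redshift labels each join step with how it moves data. A
rough sketch, with a placeholder connection and made-up table names:

    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="taxi", user="admin", password="...")

    # DS_DIST_NONE in the plan means the join is co-located on the dist
    # keys (fast); DS_BCAST_INNER / DS_DIST_BOTH mean rows get broadcast
    # or redistributed across nodes, which is where the 10-20 minute
    # queries come from.
    with conn.cursor() as cur:
        cur.execute("""
            EXPLAIN
            SELECT t.trip_id, z.borough
            FROM trips t
            JOIN zones z ON z.zone_id = t.pickup_zone_id
        """)
        for (line,) in cur.fetchall():
            print(line)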

------
abalone
I've been curious whether this dataset could be exploited to invade privacy,
e.g. by targeting rides to/from a sensitive address (somewhere
medical-related, a strip club), or by cross-checking it with other data,
like a private detective matching a pick-up recorded on a security camera
against this database to see where the passenger went. I guess this makes it
easier to find out.
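
To make the concern concrete, here's a hypothetical sketch of the kind of
query I mean, assuming the raw pickup/dropoff coordinate columns the
pre-2016 TLC dumps carry; the connection details, table name, and the
coordinates below are all made up:

    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="taxi", user="admin", password="...")

    # A small bounding box around a known address pulls out every ride
    # that started there, together with where each one ended.
    lat, lon, r = 40.7484, -73.9857, 0.0005  # roughly a 50 m box (made up)
    with conn.cursor() as cur:
        cur.execute("""
            SELECT pickup_datetime, dropoff_latitude, dropoff_longitude
            FROM trips
            WHERE pickup_latitude  BETWEEN %s AND %s
              AND pickup_longitude BETWEEN %s AND %s
        """, (lat - r, lat + r, lon - r, lon + r))
        for row in cur.fetchall():
            print(row)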

~~~
HappyTypist
Addresses are indeed anonymised somewhat, so it might be difficult.

------
jorgecurio
really wish the AWS tutorials were this clean and a smiling photo of the
author on the left menu and it gives a super human feel and it was easier to
understand for me.

