
How to Analyze Every Reddit Submission and Comment, in Seconds, for Free - minimaxir
http://minimaxir.com/2015/10/reddit-bigquery/
======
astazangasta
Question: why the hell do people use CSVs? Use tabs, goddammit. Documents are
littered with commas. It's trivial to parse a TSV with awk - not so a CSV.
There is literally no reason to prefer CSV to TSV, and yet it seems the former
is the default for many people. Why? As a guy who parses text files all day, I
never ever want to see a CSV again. End rant.

~~~
minimaxir
Pretty much every programming language in existence has a utility for parsing
CSVs into tabular data while handling commas sanely. (Using awk for processing
is not a typical pipeline.)

More importantly, CSVs are handled natively by statistical programs such as
Excel and R. (You can change the expected delimiter while importing, but in my
experience that leads to unexpected results.)
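
For example, R's built-in reader handles a quoted comma out of the box -- a
minimal sketch:

    df <- read.csv(text = 'name,quote\nalice,"hello, world"')
    df$quote  # "hello, world" -- the embedded comma survives intact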

~~~
Pxtl
Microsoft often fails with CSVs, actually. I've found you can't copy/paste
from SQL Server Management Studio into Excel if you have a mismatched
double-quote in a cell - it fails silently and garbles your data.

That was embarrassing.

The CSV format is so messy that even big corporate implementations will have
subtle bugs like that. Do not trust it.

~~~
0x0
Microsoft Excel has the worst CSV defaults. The separating character is even
locale-specific! Many locales use semicolon ";" instead of "," as a separator.
Great stuff when creating web app exporters.
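
R at least ships a dedicated reader for that locale variant -- a minimal
sketch:

    # read.csv2 expects ";" separators and "," decimal marks
    df <- read.csv2(text = "a;b\n1,5;2,5")
    df$a  # [1] 1.5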

------
jules
To figure out what's the best time to post to reddit you want the number of
successful submissions divided by the total number of submissions in that time
period, not just the count of successful submissions. Otherwise you may just
be measuring which time slot has the most submissions.
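
In R terms, the difference is dividing by the per-slot total -- a sketch,
where the data frame submissions and its created_utc and hit_front_page
columns are hypothetical names:

    library(dplyr)
    submissions %>%
      group_by(hour = as.integer(format(created_utc, "%H"))) %>%
      summarize(total = n(),
                success_rate = mean(hit_front_page))  # successes / total, not a raw count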

~~~
minimaxir
The normalization constraint is enforced by the limited size of the front page
(30 entries).

~~~
hueving
But the front page can churn a lot more during the busy times.

------
ZainRiz
Really cool analysis. It's especially fascinating to see how simple BigQuery
makes it to query the data, easily surfacing powerful information like
"What's the best time to post this on reddit?"

------
vegabook
Man those charts are beautiful, and this coming from someone who knows R
graphics and ggplot pretty well. Please may I ask a) what platform (OS X /
Windows / Linux?) and b) what are the axis and title fonts?

~~~
minimaxir
I have a separate post on how the ggplot2 charts are generally done:
[http://minimaxir.com/2015/02/ggplot-
tutorial/](http://minimaxir.com/2015/02/ggplot-tutorial/)

Platform is OSX, title font is Source Sans Pro, axis numeric fonts are Open
Sans Condensed Bold (the same fonts used on the web page itself; synergy!)
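
For the curious, wiring fonts like those into a ggplot2 theme looks roughly
like this -- a sketch, not the post's actual theme code, and it assumes the
fonts are already registered with R (e.g. via the extrafont package):

    library(ggplot2)
    ggplot(mtcars, aes(wt, mpg)) + geom_point() +
      ggtitle("Example") +
      theme(plot.title = element_text(family = "Source Sans Pro"),
            axis.text  = element_text(family = "Open Sans Condensed Bold"))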

~~~
e12e
This is great, thank you. One note, though: I've yet to try it, but according
to this Stack Overflow post:

[http://stackoverflow.com/questions/1898101/qplot-and-anti-
al...](http://stackoverflow.com/questions/1898101/qplot-and-anti-aliasing-
in-r)

it should now be possible to get AA on both Windows and Linux via the Cairo
library.

I'll be playing a bit with R notebooks for my stats intro class; if I find
the time, I'll try to see whether a) I get AA out of the box, or b) I can get
AA easily by setting some parameters.

[ed: And if I do, I suppose a PR to [https://github.com/minimaxir/ggplot-tutorial](https://github.com/minimaxir/ggplot-tutorial) will be in order]
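
For reference, on R builds where capabilities("cairo") is TRUE (common on
Linux), anti-aliased output is just a device argument away -- a minimal
sketch:

    png("plot.png", width = 800, height = 600, type = "cairo")
    plot(rnorm(100))  # any plot; the raster output is anti-aliased
    dev.off()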

~~~
vegabook
Funny you should mention a subject which is close to my heart :-).
[http://stackoverflow.com/questions/6023179/anti-aliasing-
in-...](http://stackoverflow.com/questions/6023179/anti-aliasing-in-r-
graphics-under-windows-as-per-mac)

------
stevepike
To get back to the point of the article, is 130MM rows past the point where
you should start reaching for "big data" tools? I recently did an ad-hoc
analysis of a 50MM row, 3GB dataset and saw similar performance to the OP with
a local postgres database. Is it worth using specialized analysis tools for
exploring these types of datasets? I ended up writing a script to slice the
data, but with `wc -l` taking 2 minutes on my SSD, it wasn't a great
interactive experience. What do other people use for this stuff?

~~~
e12e
With "just" 3GB of data, did you try just working on a copy located on a
tmpfs/ramdisk?
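
In R that comparison is a one-liner each way -- a rough sketch, assuming a
Linux box where /dev/shm is a tmpfs mount and data.csv stands in for the
dataset:

    file.copy("data.csv", "/dev/shm/data.csv")
    system.time(read.csv("data.csv"))           # on-disk copy
    system.time(read.csv("/dev/shm/data.csv"))  # RAM-backed copy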

~~~
stevepike
Wow, on my desktop with 16GB of cheap, 4-year-old RAM and a Samsung SSD I
bought this year, I'm getting 30x faster I/O on the ramdisk. Clever solution.

~~~
MikeTV
Even old DDR-200 (aka PC-1600, 1600 MB/s)[1] is faster than the theoretical
maximum of SATA III (600 MB/s). Wikipedia says SATA 3.2 can reach 1,969 MB/s,
but the more recent DDR3-1600 is 12,800 MB/s and some DDR4 reach 21,300 MB/s
[2] ... SSDs aren't even close.

DDR4 also theoretically supports 512 GB sticks. [3] Can't wait.

1) [http://www.crucial.com/usa/en/support-memory-speeds-
compatab...](http://www.crucial.com/usa/en/support-memory-speeds-
compatability)

2) [http://www.crucial.com/usa/en/memory-ballistix-
elite](http://www.crucial.com/usa/en/memory-ballistix-elite)

3)
[https://en.wikipedia.org/wiki/DDR4_SDRAM](https://en.wikipedia.org/wiki/DDR4_SDRAM)
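
As a back-of-the-envelope check in R, using the theoretical numbers above:

    sata3 <- 600    # MB/s, SATA III ceiling
    ddr3  <- 12800  # MB/s, DDR3-1600
    ddr4  <- 21300  # MB/s, fast DDR4
    c(ddr3, ddr4) / sata3  # ~21x and ~36x the SATA III ceiling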

------
fhoffa
Note that part of the beauty of BigQuery is how it lets you go deeper into
any dimension without pre-planning (i.e., without defining partitions or
indexes ahead of time).

Case in point: many comments on this page point at the "best time to post"
chart. But what if that time varies by subreddit?

The answer is at
[https://www.reddit.com/r/bigquery/comments/3neghj/qotd_best_...](https://www.reddit.com/r/bigquery/comments/3neghj/qotd_best_time_to_post_for_success_on_each/)
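
In dplyr terms, that deeper dimension is just one more grouping column -- a
sketch with hypothetical column names, not the query behind the linked
answer:

    library(dplyr)
    submissions %>%
      group_by(subreddit, hour) %>%
      summarize(success_rate = mean(hit_front_page)) %>%
      filter(success_rate == max(success_rate))  # best posting hour per subreddit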

------
putlake
Slightly OT but BigQuery has another fascinating and useful public dataset:
the HTTP Archive. [https://www.igvita.com/2013/06/20/http-archive-bigquery-
web-...](https://www.igvita.com/2013/06/20/http-archive-bigquery-web-
performance-answers/)

~~~
rohwer
Great link. Just ran Ilya's first query -- median time to first render --
against the September pages data:

"median": "2389", "seventy_fifth": "3543", "ninetieth": "5191"

How did Internet page renders get slower since 2013?

(For comparison: 2.2s median, 3.3s at the 75th percentile, and 4.7s at the
90th percentile in httparchive:runs.2013_06_01_pages.)
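
For anyone wanting to sanity-check those numbers locally, the same
percentiles fall out of R's quantile() -- a sketch, where render_ms is a
hypothetical vector of per-page first-render times in milliseconds:

    quantile(render_ms, probs = c(0.50, 0.75, 0.90))  # median / 75th / 90th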

------
newman314
Does anyone know what "R tricks" were used to generate the graph depicting
when the best time to post is?

I quite like that visualization and would like to use it for something else.

"2.5 seconds, 2.39 GB processed, and a few R tricks results in this:"
[http://minimaxir.com/img/reddit-bigquery/reddit-
bigquery-2.p...](http://minimaxir.com/img/reddit-bigquery/reddit-
bigquery-2.png)

~~~
Veratyr
He put a link to his R code in the blog post:
[https://github.com/minimaxir/reddit-
bigquery/blob/master/red...](https://github.com/minimaxir/reddit-
bigquery/blob/master/reddit_bigquery.ipynb)

------
noobie
Amazing how the 1,000 GB of free monthly query processing can suffice. Kudos!

