
FiveThirtyEight has a GitHub repo with story-related data and scripts - fnordo
https://github.com/fivethirtyeight/data/
======
smcl
No story about Nate Silver and 538 is complete without mentioning his new
rival in the prediction space - Carl Diggler - who has been kicking his butt
on the presidential primaries, calling tough elections correctly where Nate
has refused to give more than a "Maybe Sanders, Maybe Clinton..." prediction.

Carl is a work of fiction, invented by a couple of journalists/friends as a
parody of pundits:
[https://www.washingtonpost.com/posteverything/wp/2016/05/09/...](https://www.washingtonpost.com/posteverything/wp/2016/05/09/our-fictional-pundit-predicted-more-correct-primary-results-than-nate-silver-did/)
\- the story is hilarious. Watching this whole thing unfold on Twitter has
been a total blast too (check out @carl_diggler), they never break character
and have a lot of running gags (such as Carl's troubles in "Family Court").
Here's a wee excerpt from that WP link:

" Carl exists to satirize all that is vacuous, elitist and ridiculous about
the media class. From his sycophantic love of candidates in uniform to his
hatred of Bernie Bros, from his reverence for “the discourse” to his constant
threats of suing the people who troll him on Twitter, Carl is predicated on
being myopic, vain and — frankly — wrong.

But something funny happened along the way. Biederman and I, who are neither
statisticians nor political scientists, started making educated guesses for
our parody about the results of the primaries. And we were right. A lot.

We beat the hacks at their own game by predicting every Democratic winner on
Super Tuesday. We told readers who would win in the unpredictable caucuses
that FiveThirtyEight didn’t even try to forecast, such as those in Minnesota,
Wyoming and even American Samoa. We called 19 out of the past 19 contests.
FiveThirtyEight, whose model cannot work without polling, accurately predicted
13 "

~~~
InclinedPlane
538 is the classic "fighting the last war" problem. In 2012 the issue was that
polling was unreliable, it was difficult to figure out what was actually going
on with the electorate. FiveThirtyEight figured out how to crack that problem
by applying number crunching. This cycle, the problem is that the electorate
hasn't actually decided what it wants yet, and statistical methods aren't
helpful there at all. There are many examples of 538 "predicting" a 99%
probability of a certain victory due to a large margin in the polling, and
then things going completely differently on election day because the
electorate changed its mind.

~~~
ssharp
Was 538 that good at predicting outcomes of the 2012 primaries? I know it's
well established that they did a good job on the 2008 and 2012 general
elections.

Primaries have to be harder because there is far less polling information
compared to the general election. Yes, they were unable to predict the Trump
victory, but if they get 90-100% of states correct in the general election,
then I doubt that much has changed with how polling reflects the electorate.

~~~
freehunter
They were able to predict Trump winning. All of their data showed it. The
polls showed Trump winning everywhere. What they failed to do was trust the
statistics. The polls showed Trump as the favorite, and they said "yeah, but
he won't make it past the first state." The polls showed Trump beating other
candidates, and they said "he has a ceiling of 30%." Other candidates started
to drop out while the polls still showed voters rallying behind Trump, and
they said "the polls are wrong, the voters will go to Cruz."

I'm not a Trump supporter by any means, but 538 went _way_ out of their way to
ignore every poll that was put in front of them in favor of punditry.

------
danso
Sorry to seemingly play a game of one-upmanship, as 538's repo is wonderful,
but BuzzFeed also posts its data and stories online on their BuzzFeedNews
Github account...however, what I really like about their repo setup is that
they have a repo for every project...and then a separate repo that has a
tabular listing of repo, date, description, and story link * ...IMO, they have
the best setup (in terms of discovery) among the news nerds:

[https://github.com/BuzzFeedNews/everything](https://github.com/BuzzFeedNews/everything)

\- * edit: Oops, I forgot that 538's data repo also has a readable table
listing below the fold...I suppose whether separate repos per dataset or one
main repo is the better setup depends on the user.

They also list their standalone datasets and tools, e.g. their standardized
H-2 certification data [1], which they've probably used in their many H-2
visa stories, and twick [2], the tool they use to quickly fetch newsworthy
Twitter account data on short notice...

And those aren't even all the useful tools that that team has created...Jeremy
Singer-Vine (their data editor) has several great Python utilities in his
personal repo (github.com/jsvine), including waybackpack, pdfplumber (a
wrapper around pdfminer and similar to Ruby's tabula), and markovify.

[1] [https://github.com/BuzzFeedNews/H-2-certification-data](https://github.com/BuzzFeedNews/H-2-certification-data)

[2] [https://github.com/jsvine/twick](https://github.com/jsvine/twick)

------
minimaxir
The importance of statistical transparency is part of the reason why I've
switched to GitHub and Jupyter notebooks, as GitHub renders them natively
which makes it obvious what code is being run and what the results of the code
are. (two recent examples of mine are processing AngelList data
[[https://github.com/minimaxir/sfba-compensation/blob/master/a...](https://github.com/minimaxir/sfba-compensation/blob/master/angelist_sfbayarea_jobs.ipynb)]
and creating graph networks from Reddit data
[[https://github.com/minimaxir/reddit-graph/blob/master/subred...](https://github.com/minimaxir/reddit-graph/blob/master/subreddit_network_pdf.ipynb)])

That said, there are a few reasons I've seen writers/researchers intentionally
_not_ release data. Either the code is sloppy/inefficient/embarrassing, or the
data is used in bad content marketing for a startup that specializes in data
collection, so that they can say "we have data, pay us if you want more,
neener neener neener." (I see the latter submitted to HN all the time; it's a
pet peeve of mine.)

------
jawns
I picked a project at random -- the Tarantino one, which lists every death or
curse word in every Tarantino movie -- and browsed through the data:

[https://github.com/fivethirtyeight/data/blob/master/tarantin...](https://github.com/fivethirtyeight/data/blob/master/tarantino/tarantino.csv)
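For anyone who wants to reproduce that kind of tally themselves, here's a minimal Python sketch. The sample rows and the column names (`movie`, `type`, `word`) are assumptions about the file's shape, not verified against the actual CSV:

```python
import csv
import io
from collections import Counter

# Tiny made-up sample in roughly the shape of the repo's tarantino.csv.
# Column names and rows here are hypothetical, for illustration only.
sample = """movie,type,word,minutes_in
Reservoir Dogs,word,fuck,0.40
Reservoir Dogs,word,shit,0.55
Reservoir Dogs,death,,1.30
Pulp Fiction,word,fuck,2.10
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Tally curse words per film, skipping the rows that record deaths.
counts = Counter(
    (r["movie"], r["word"]) for r in rows if r["type"] == "word"
)

for (movie, word), n in counts.most_common():
    print(f"{movie}: {word} x{n}")
```

Pointing `csv.DictReader` at the real file (and adjusting the column names if they differ) would reproduce the per-word totals the comment describes.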

I found it interesting that the author was willing to catalog each curse word
by writing it in its entirety -- except one, the n-word (which, by the way,
appears 179 times in Tarantino's movies, with the bulk in "Django Unchained"
and "Jackie Brown"). The article for which this data set was compiled
similarly censors none but the n-word:

[http://fivethirtyeight.com/features/complete-catalog-curses-...](http://fivethirtyeight.com/features/complete-catalog-curses-deaths-quentin-tarantino-films/)

I happen to think that no one of any ethnic group should use the n-word, and I
do not use it myself, but I think in the context of cataloging curse words,
where you're writing at a sort of meta level (i.e. writing about words, rather
than invoking them yourself) it's just as acceptable as in the other cases to
write the word out. But the fact that the author made this one exception
reveals a bit about how he sees the word in relation to the others.

For instance, you might argue that the varied versions of the f-word might be
strong language, but in a different category than the n-word because the
latter is an ethnic slur.

But the author has chosen to write out other ethnic slurs in their entirety
(e.g. the w-word, referring to Mexicans, and the j-word, referring to Japanese
people, and the g-word, related to Asians), and also slurs related to other
groups (e.g. the f-word, referring to gay people).

So, even if you're talking about just the set of curse words that insult a
particular group of people, the author has set the n-word apart.

~~~
Normal_gaussian
n-word -> nigger and f-word -> fuck I understand, but what is the w-word?

I really don't get this obsession over hiding the existence of these words.
They obviously don't exist in isolation and are a symptom of a different
problem. Perhaps by hiding them people seek to pretend the world has solved
these problems?

On top of that, their usage as negative words is only propagated by these
shortenings. Words like nigger had (and still do have) neutral and positive
connotations [0].

Language is something I care deeply about and try to use accurately. People
controlling which words I can and can't say really, really upsets me, at a
level I can't quite find the words to express.

[0]
[https://en.wikipedia.org/wiki/Nigger#Etymology_and_history](https://en.wikipedia.org/wiki/Nigger#Etymology_and_history)

~~~
graedus
W-word is "wetback", I think. I've never heard anyone use it.

~~~
freehunter
The only time I've ever heard the word was in discussions surrounding its
existence as a racial slur. There are much more common slurs for Hispanic
immigrants. But I've never heard it called "the w-word".

~~~
thesimpsons1022
I've heard it a few times growing up in Southern California. Believe it or
not, people are pretty racist.

------
sideproject
I'm reading through Nate Silver's book The Signal and the Noise right now - highly
recommend it. And this repo is an awesome way to replicate the results that
they show on their site (and perhaps some stuff in the book too).

------
gklitt
This is great. Next up, academic researchers please...

~~~
theaustinseven
Seriously though... The number of researchers who expect us to accept wild
claims at face value without releasing their data is unbelievable. Releasing
data should be a prerequisite at most journals, since it would help protect
the journals from fraudulent results.

------
vintermann
Data is nice, but even with it, 538 has gone alarmingly down the path of
clickbait lately. Especially Casselman, with smarmy headlines like "The Rising
Unemployment Rate Is Good News" or "Stuck In Your Parents’ Basement? Don’t
Blame The Economy".

~~~
pessimizer
It's not just lately. There was a precipitous drop in quality (of both the
math and the subject matter) as soon as they went to ESPN.

