
Dataset: Ten Years of NFL Plays Analyzed, Visualized, Quizzified - glaugh
http://blog.statwing.com/dataset-ten-years-of-nfl-plays-analyzed-visualized-quizzified-downloadable/
======
burntsushi
If you're willing to limit yourself to the last five years, you can avoid the
pain of parsing those free-form text descriptions with nflgame[1] or nfldb[2].
Disclaimer: I am the author of those tools. We've been slowly building up a
small community of people using them. In fact, I'm currently in a small fantasy
playoff league that's running off nflgame.

They also include live updates while games are playing.

I'd be curious to see if you get any substantially different results using
structured data (nflgame gets it from NFL.com's JSON feed) as opposed to
parsing the text descriptions.

[1] -
[https://github.com/BurntSushi/nflgame](https://github.com/BurntSushi/nflgame)

[2] -
[https://github.com/BurntSushi/nfldb](https://github.com/BurntSushi/nfldb)

~~~
glaugh
OP here. That is awesome.

I'd definitely recommend people use this over the download at the bottom of
the blog post, it's _really_ painful to parse the free text (there's a lot of
weird edge cases). I'll add a link to this stuff in the post.

edit: clarity

update: link added to post

~~~
burntsushi
> it's really painful to parse the free text

I can only imagine. I've tried motivating myself to do it a few times so I
could increase the amount of data in nfldb (I believe they are available in
one form or another all the way back to 1999), but it's a rather daunting task
when there are so many statistical categories.[1]

[1] -
[https://github.com/BurntSushi/nfldb/wiki/Statistical-categories](https://github.com/BurntSushi/nfldb/wiki/Statistical-categories)

------
jib
The 4th and 2 on the 2 argument bothers me every time someone brings it up. To
me it feels like it's an incorrect use of statistics. EV isn't the end-all -
kicking the field goal has way lower variance. If your goal is winning
consistently then giving up high variance high EV plays for low variance
slightly lower EV plays is often the right choice, as a football season is
made up of a very low number of discrete events.

About 45% of all games finish with a spread of 7 or less according to a quick
search. Making a play that has a close to 50% chance of making you be down 3
points is costing you a lot of the margin if you think you are a close
favorite.

You can't win more in football but you can lose a sure win, so if you believe
you are say a 3-4 point favorite then the right play is to take the field goal
every time - giving up the safe points means you take half the games and make
them a crap shoot.
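
To put rough numbers on the EV-versus-variance trade-off, here is a minimal Python sketch. The outcome probabilities are made up for illustration and are not taken from the dataset: a field goal is assumed good 95% of the time, and a fourth-down try is assumed to end in a touchdown (counted as 7 points) half the time.

```python
import math

# Hypothetical outcome distributions as (points, probability) pairs.
# These numbers are illustrative assumptions, not values from the dataset.
FIELD_GOAL = [(3, 0.95), (0, 0.05)]
GO_FOR_IT = [(7, 0.50), (0, 0.50)]


def mean_and_std(outcomes):
    """Return the expected points and standard deviation of a play."""
    mean = sum(points * prob for points, prob in outcomes)
    variance = sum(prob * (points - mean) ** 2 for points, prob in outcomes)
    return mean, math.sqrt(variance)


for name, outcomes in [("field goal", FIELD_GOAL), ("go for it", GO_FOR_IT)]:
    mean, std = mean_and_std(outcomes)
    print(f"{name}: expected points {mean:.2f}, standard deviation {std:.2f}")
```

Under these assumed numbers, going for it has the higher expected points (3.50 vs 2.85) but roughly five times the spread (3.50 vs 0.65). Whether that extra variance is worth it is exactly the point in dispute.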

~~~
m_myers
The question specifies "early in the game"; the correct call can and will vary
in a late game or late first half situation (and of course depends on the
strengths and weaknesses of the teams involved).

As Bill Barnwell of Grantland is fond of saying, in the first half, your only
goal is to maximize your expected points. You don't know exactly how many
points you're going to need in the game until the end.

~~~
loganfrederick
The Bill Barnwell aphorism reads pretty empty to me. You could just as easily
flip it to: "In the first half, your only goal is to minimize your opponent's
expected points. You don't know exactly how many points they will need until
the end."

His phrase doesn't really have any impact on the argument being made here
unless you really dissect the relative importance of offense versus defense in
the game, which requires a lot more evidence to make a case one way or the
other.

~~~
jamesaguilar
You have missed the point of what Barnwell is saying. He is saying that
tactical calling of plays to trade expected point value against variance does
not make sense in the first half, because you don't know how close it's going
to be.

To illustrate: in an infinitely long game, you always pick whatever gives you
the highest expected value. In a game with one play left, you always pick the
play with the highest probability to put you over the amount of points the
opponent has, even if it has a lower expected value than other plays.

The quote is saying that the first half is more similar to playing an
infinitely long game than a game with one play left.
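
The two extremes in that illustration can be sketched in a few lines of Python. The plays and their probabilities are hypothetical, chosen only to make the contrast visible, and the function names are invented for this example.

```python
# Hypothetical plays, each a list of (points, probability) outcomes.
# The probabilities are illustrative assumptions, not measured values.
PLAYS = {
    "field_goal": [(3, 0.95), (0, 0.05)],
    "go_for_it": [(7, 0.50), (0, 0.50)],
}


def expected_points(outcomes):
    """Average points scored by a play."""
    return sum(points * prob for points, prob in outcomes)


def prob_of_taking_lead(outcomes, deficit):
    """Probability that a single play scores more than the current deficit."""
    return sum(prob for points, prob in outcomes if points > deficit)


def best_in_long_game(plays):
    """Infinitely long game: pick the play with the highest expected points."""
    return max(plays, key=lambda name: expected_points(plays[name]))


def best_with_one_play_left(plays, deficit):
    """Last play: pick the play most likely to put you ahead."""
    return max(plays, key=lambda name: prob_of_taking_lead(plays[name], deficit))


print(best_in_long_game(PLAYS))           # go_for_it
print(best_with_one_play_left(PLAYS, 2))  # field_goal
print(best_with_one_play_left(PLAYS, 4))  # go_for_it
```

With these assumed numbers, the long-game rule always goes for it, while the last-play rule takes the lower-EV field goal when trailing by 2 and goes for it only when a field goal cannot win the game.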

------
socrates1998
This sounds really good and NFL teams would definitely want something like
this, but even coaches would be wary about using mostly data.

For example, let's say you are the coach of the Patriots and you have been
running the ball very successfully the past few games, even winning games
because of your running game.

And it's 4th down and 2 yards to go against the Broncos.

The data says, run the ball. Especially since you have done it well in the
past. However, you forgot to take into account the stud defensive tackle that
has just started playing really well for the Broncos. So, you try to run the
ball and you lose the game.

This is just one example of the inability of data to deal with match-ups and
schemes.

As a person who both likes data and has coached football, I would love the
integration of the two, but football has too many variables.

If you have all this data, you are actually going to make the wrong decision
because the matchup is bad for your team.

Matchups and schemes trump data.

~~~
JackFr
Additionally, it could be that the statistics show a result that is an
artifact of coaches' decisions. If a particular decision is chosen by the pro
coaches extremely rarely, but has a moderately better expected result, it
could be that the _typical_ expected result is very poor, but the only times
coaches make the decision is when they observe a decisive matchup or a special
situation -- making it appear statistically like that decision is the better
one. (Like looking at basketball shooting percentages, and wondering why
centers don't take all the shots given how high their FG%s are.)
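
A small simulation makes this selection effect concrete. It is a sketch under strong assumptions: each situation has a hidden success probability drawn uniformly from 0 to 1, and coaches call the risky play only when that probability is above a hypothetical threshold of 0.8. None of these numbers come from the dataset.

```python
import random


def simulate(n_situations=100_000, threshold=0.8, seed=0):
    """Compare the success rate seen in the data with the typical rate.

    Each situation gets a hidden success probability drawn uniformly from
    [0, 1] (an assumption). Coaches are assumed to call the risky play only
    when that probability exceeds `threshold`.
    """
    rng = random.Random(seed)
    chosen_attempts = 0
    chosen_successes = 0
    all_successes = 0
    for _ in range(n_situations):
        p_success = rng.random()
        succeeded = rng.random() < p_success
        all_successes += succeeded
        if p_success > threshold:
            chosen_attempts += 1
            chosen_successes += succeeded
    observed_rate = chosen_successes / chosen_attempts
    typical_rate = all_successes / n_situations
    return observed_rate, typical_rate


observed, typical = simulate()
print(f"success rate when coaches chose the play: {observed:.2f}")
print(f"success rate if called in every situation: {typical:.2f}")
```

In this toy model the play succeeds around 90% of the time in the recorded attempts, yet only around 50% of the time if it were called in every situation, so the recorded data overstates how good the call is in general.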

~~~
socrates1998
That is an excellent point.

Coaches do what they think will be successful. If they can't run the ball
well, then they don't run the ball, skewing the data.

Also, tendencies change from week to week. You really need a deep
understanding of football to get the right data for that week.

This is why data hasn't dominated football: you need to have tons of knowledge
before anything else.

~~~
the_watcher
The best use of data is to give coaches a high-level picture of what the
likely result of a given play call will be (trusting sample size to even it
out). You hire a coach to then adjust for those other things. However, if the
data says that 99% of the time a given play call will have a specific result,
coaches shouldn't try to be the exception (unless the game situation demands
it). There is a happy medium here. The Rays have figured it out pretty well
with Joe Maddon, the NFL will get there too. No one thinks a machine should
call all the plays (yet). We just think machine learning and big data should
be leaned on as tools the same way that film is. Don't forget, Bill Sharman
was considered a nut for watching film in the 1980s. Imagine coaching without
it now.

------
MengYuanLong
I am going to a group discussion tomorrow about first-refusal rights in the
NFL and happened to do a brief naive analysis of extra point vs 2 point
conversions earlier today. In brief, the EV of a 2 point conversion was .91
while an extra point was .99. That said, for every 19 extra point attempts
there was only one two point conversion attempt. Frankly, I am all for
variance so I am rooting for the more ambitious two-point attempt.
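
For reference, the success rates implied by those EV figures can be backed out with simple arithmetic. This sketch assumes EV is just points times success rate, ignoring rare outcomes such as defensive returns; the function name is made up for the example.

```python
def implied_success_rate(expected_points, points_if_good):
    """Back out a success rate, assuming EV = points * success rate."""
    return expected_points / points_if_good


two_point_rate = implied_success_rate(0.91, 2)
extra_point_rate = implied_success_rate(0.99, 1)

# 19 extra point attempts for every 1 two-point attempt.
two_point_share = 1 / (19 + 1)

print(f"two-point success rate: {two_point_rate:.3f}")
print(f"extra point success rate: {extra_point_rate:.3f}")
print(f"share of tries that were two-point attempts: {two_point_share:.0%}")
```

Under that assumption, the quoted figures imply a two-point success rate of about 45.5% and an extra point success rate of 99%, with two-point attempts making up 5% of all tries.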

------
cubecul
"You got 2 out of 5 answers correct. When you try this quiz with a sorry quiz-
taker like you, that’s the result you’re going to get."

Sassy. Jesus Christ.

~~~
glaugh
OP here

Made the message a bit friendlier for future folks, thanks for the feedback :)

~~~
hansy
"You got 4 out of 5 answers correct. When you try this quiz with a sorry quiz-
taker like you, that’s the result you’re going to get. [That's a joke, we
think you did fine :)]"

------
market_hacker
I think there may be a problem with this kind of analysis - it seems to me
that the "riskier" plays (2 point conversion, going for it, etc.) - are more
likely attempted when coaches think they will work - not randomly. To really
do a fair analysis of expectancy you would need trials where the play
selection is chosen randomly. Anyone else agree with me?

------
jackschultz
Hey guys, I did this a few months ago. At least attempted to organize the data
from the descriptions on Advanced NFL Stats.
([https://github.com/jackschultz/nfl-data](https://github.com/jackschultz/nfl-
data)). Turns out there were tons of special cases. Just curious as to how you
decided to organize the data.

------
neovive
Do coaches have access to real-time data on the field or in the coordinator's
box? If so, if they followed the datasets exactly, would the results reverse
themselves over time due to adjustments by offense or vice versa?

~~~
JackFr
There is some game theory involved. If everyone ALWAYS ran for two point
conversions, the defense could just ignore the pass and put 11 in the box and
stuff the run.

It's most likely that the optimal solution is a mixed strategy; you're
going to need to mix it up to keep defenses honest.
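
As a rough sketch of what such a mixed strategy looks like, here is the standard closed-form solution for a 2x2 zero-sum game in Python. The conversion probabilities are invented for illustration, and the formula assumes neither side has a pure strategy that is always best.

```python
def offense_mix(run_vs_run_d, run_vs_pass_d, pass_vs_run_d, pass_vs_pass_d):
    """Solve a 2x2 zero-sum game for the offense.

    Arguments are the offense's conversion probabilities for each
    (offensive call, defensive call) pair. Returns (p_run, value): how often
    the offense should run, and the conversion rate that mix guarantees.
    Assumes no pure-strategy saddle point exists.
    """
    a, b, c, d = run_vs_run_d, run_vs_pass_d, pass_vs_run_d, pass_vs_pass_d
    denom = a - b - c + d
    if denom == 0:
        raise ValueError("degenerate game: no unique mixed strategy")
    # Choose p_run so the defense is indifferent between its two calls.
    p_run = (d - c) / denom
    if not 0 <= p_run <= 1:
        raise ValueError("a pure strategy is optimal for these payoffs")
    value = (a * d - b * c) / denom
    return p_run, value


# Hypothetical conversion rates, not taken from the dataset.
p_run, value = offense_mix(0.30, 0.60, 0.55, 0.35)
print(f"run {p_run:.0%} of the time, convert {value:.0%} of the time")
```

With these made-up payoffs, an offense that always runs converts 30% of the time against a stacked box and one that always passes converts 35% against pass coverage, while running 40% of the time guarantees 45% no matter what the defense calls.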

------
gojomo
Fun stuff!

You might want to make it clear you want the decisions most likely to succeed,
not the decision most common among professional coaches (who are presumably
optimizing some other form of career-stability-against-criticism).

------
vacri
5/5. Looks like I chose the right thing to do each time... except... _which_
run should I be calling :)

------
viveksodera
Numberfire (www.numberfire.com) may have this data, if not more, for their
fantasy football tool.

