
Fun with NFL Stats, Bokeh, and Pandas - J253
https://j253.github.io/blog/fun-with-nfl-stats.html
======
sndean
> The small spikes at 5 yard increments is interesting and I don't really have
> a good explanation other than to think that whoever recorded the yardage
> data liked rounding to the nearest 5 if it was close. Anyone else have any
> other ideas?

I'll go with the idea that the refs are biased with their ball placement and
tend to put the ball on lines [0]. Also, players/teams practice, speak, think
in 5 yard increments, so that's got to bias things a bit. "We've got to get to
the 35 yard line for Morten to have a chance."

[0] [https://gutterstats.wordpress.com/2015/11/03/are-nfl-officia...](https://gutterstats.wordpress.com/2015/11/03/are-nfl-officials-biased-with-their-ball-placement/)

~~~
slg
There are multiple other factors involved that aren't discussed fully in that
article.

To start, the ref bias is not necessarily unintentional. Measuring first downs
and exact placement of the line of scrimmage becomes more difficult and takes
more time when the ball is not initially spotted on a 5 yard increment or
individual hash mark. The league offices might directly instruct refs to spot
the ball on an exact yard line if there is any doubt of the spot in order to
speed up the game. An obvious example of this intentional bias is when the
ball is punted out of bounds. It is nearly impossible for a ref to get an
exact spot in that situation and yet it is almost always spotted exactly on a
hash mark.

Also lots of drives will start on a set yard line and penalties are often
handed out in 5 yard increments. So while there are no rules that will start a
drive on the 15, there are rules that will start a drive on the 20 or 25 and a
standard 5 or 10 yard penalty would place the ball exactly at the 15 yard
line.

Lastly, the spot is a continuous data point that is being recorded by humans
as a discrete data point. That means the herding could be exaggerated in the
recorded data and not necessarily as pronounced during the actual game. A ball
might be spotted by the ref at 14.4 yards but the person responsible for data
entry might eye the spot and record it as being at the 15 yard line.
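
A rough way to see how much the recording step alone could matter is to simulate it. The sketch below assumes true spots are spread uniformly across the field and that the scorer snaps to the nearest multiple of 5 whenever the ball is within `snap_window` yards of it. Both the uniform spread and the 0.75 yard window are made-up assumptions, not anything measured from real data.

```python
import random
from collections import Counter

def record_spot(true_spot, snap_window=0.75):
    """Record a continuous spot as a whole yard line.

    Assumption: the scorer snaps to the nearest multiple of 5 when the
    true spot is within `snap_window` yards of it, and otherwise rounds
    to the nearest yard.
    """
    nearest_five = 5 * round(true_spot / 5)
    if abs(true_spot - nearest_five) <= snap_window:
        return nearest_five
    return round(true_spot)

def simulate(n=100_000, seed=0):
    rng = random.Random(seed)
    # True spots are uniform between the 1 and the 49, so any spike in
    # the recorded counts comes purely from the recording step.
    return Counter(record_spot(rng.uniform(1, 49)) for _ in range(n))

counts = simulate()
print("recorded at 14:", counts[14])
print("recorded at 15:", counts[15])
print("recorded at 16:", counts[16])
```

Under these assumptions the 15 collects noticeably more plays than the 14 or the 16, even though the simulated refs have no bias at all.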

------
dbt00
As someone who's done a lot of numerical analysis and watches a lot of
football, the analysis here is pretty rudimentary.

> On third down, pass attempts outnumber run attempts at almost a 4 to 1 clip.
> This is likely out of increased desperation to make a first down.

Running is a low variance low yardage option, passing is a high variance high
yardage option. Passing on third and medium to long is an obvious dominant
strategy. Pulling the goaltender in hockey or bringing the keeper forward in
soccer when trailing late in games/matches serves much the same effect.
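
To put toy numbers on the variance argument, here is a minimal sketch. The yardage distributions and the completion rate are invented purely for illustration and are not fitted to any play-by-play data.

```python
from statistics import NormalDist

# Made-up yardage models, chosen only to illustrate the variance argument.
RUN = NormalDist(mu=4.0, sigma=2.0)             # low mean, low spread
PASS_COMPLETE = NormalDist(mu=11.0, sigma=6.0)  # high mean, high spread
COMPLETION_RATE = 0.6                           # incompletions gain 0 yards

def p_convert_run(to_go):
    """Chance that a run gains at least `to_go` yards."""
    return 1.0 - RUN.cdf(to_go)

def p_convert_pass(to_go):
    """Chance that a pass is completed and gains at least `to_go` yards."""
    return COMPLETION_RATE * (1.0 - PASS_COMPLETE.cdf(to_go))

for to_go in (1, 3, 5, 8, 12):
    print(f"3rd and {to_go:>2}: "
          f"run {p_convert_run(to_go):.2f}  pass {p_convert_pass(to_go):.2f}")
```

With these invented numbers, running converts more often on third and 1, while passing is clearly ahead by third and 8, which is the shape of the argument above.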

------
arglebarnacle
Does anyone have any insight about what kind of jobs are out there for people
with the kind of skills demonstrated in this post?

I have a lot of data exploring, cleaning and visualizing skills, python/SQL
skills and experience using it to make business decisions, but this type of
thing falls short of what most people would consider "data science"

~~~
J253
Agreed. And I definitely make no claims about this being earth-shattering
"data science". I just happened to spend a few hours over the weekend making
some plots and commenting about what I saw with some Python tools.

I'll also state that I am neither a data scientist nor a statistician. I'm a
Python application engineer with a background in mechanical engineering, so
that might help set the context a bit more.

~~~
rhcom2
Most companies don't need "earth-shattering 'data science'", they need a way
to convey a narrative with their information and maybe try to deduce something
from it.

I work as a programmer at an architecture company and we do visualizations
like this all the time for campus classroom usage for example. Is it
groundbreaking? Of course not, but it helps the clients and designers a ton.

------
chaosbutters
I feel like this is a waste of bokeh's potential and matplotlib would have
sufficed. I love bokeh for the interactive capability and controlling what you
plot, zooming, and just overall more immersive feeling than a static 2d plot.

Still very interesting and insightful and lovely plots generated.

~~~
J253
Thanks. And I agree with Bokeh being overkill for this. My original intention
was to make it fully interactive but I hit some snags on keeping the
interactivity through Pelican SSG so I just kept 'em static.

------
nubb
I've always enjoyed this project for pulling nfl stats.
[https://github.com/BurntSushi/nflgame](https://github.com/BurntSushi/nflgame)

~~~
burntsushi
That project is no longer maintained because I don't use it any more, but
others have picked up the baton:
[https://github.com/derek-adair/nflgame](https://github.com/derek-adair/nflgame)

Back in the day, I used nflgame along with

    https://github.com/BurntSushi/nfldb
    https://github.com/BurntSushi/nflvid
    https://github.com/BurntSushi/nflfan

to set up a simple local web UI that allowed me to quickly search through every
play and _watch any single play I wanted_. Video footage was available as soon
as the game was over, and play info was available live as the game was
playing. It was amazing.

This worked because nflvid downloaded full HD NFL games from their CDN, which
was unprotected at the time. (I paid for an NFL Game Pass subscription and
never distributed the video footage.) They also had XML files that delineated
the time at which each play started and its duration. Some ffmpeg slicing and
dicing was all it took to cut up a full game and associate each clip with each
play. That's all part of what nflvid does.
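
For anyone curious what that slicing step might look like, here is a minimal sketch. It is not taken from nflvid: the XML layout, attribute names, and file names below are hypothetical, since the real timing files are not shown here. It only builds the ffmpeg commands and does not run them.

```python
import xml.etree.ElementTree as ET

# Hypothetical timing file; the real NFL XML layout is an assumption here.
SAMPLE_XML = """
<plays>
  <play id="36" start="12.5" duration="9.0"/>
  <play id="58" start="47.0" duration="11.25"/>
</plays>
"""

def clip_commands(xml_text, game_file):
    """Build one ffmpeg command per play (commands are returned, not run)."""
    commands = []
    for play in ET.fromstring(xml_text.strip()).iter("play"):
        out_file = f"play-{play.get('id')}.mp4"
        commands.append([
            "ffmpeg",
            "-ss", play.get("start"),    # seek to the start of the play (seconds)
            "-i", game_file,
            "-t", play.get("duration"),  # keep only this many seconds
            "-c", "copy",                # copy streams without re-encoding
            out_file,
        ])
    return commands

for cmd in clip_commands(SAMPLE_XML, "game.mp4"):
    print(" ".join(cmd))
```

Each list could be handed to `subprocess.run` to actually cut the clip. One caveat: with `-c copy`, cuts land on keyframes, so clip boundaries can be slightly off unless you re-encode.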

I hacked all of this together in my free time years ago, and I bet a lot of
people would find it amazing. One wonders why the NFL doesn't build this and
sell it themselves. When I used Game Pass a few years ago, you could search
for plays with rudimentary criteria, but only over a single game at a time. It
was artificially very limited.

~~~
diminoten
We briefly spoke via GitHub about a month ago, and during that convo (it was
in a ticket), you mentioned that the source has inaccuracies. Is there any
elaboration there or do the NFL people use a different data source to do
things like Fantasy and official stats?

~~~
burntsushi
I'm not an NFL insider. I don't know what they do internally. I only know that
1) the undocumented NFL GameCenter JSON is not 100% accurate and that 2) any
user of a fantasy league would notice these inaccuracies. I did a test a while
back by comparing GameCenter data with Yahoo's data. Kickers tend to have the
most inaccuracies:
[https://github.com/BurntSushi/nflgame/blob/master/test-data/...](https://github.com/BurntSushi/nflgame/blob/master/test-data/results-yahoo-2012-play/k.tsv)
QB stats are more solid for example, but there are still minor problems:
[https://github.com/BurntSushi/nflgame/blob/master/test-data/...](https://github.com/BurntSushi/nflgame/blob/master/test-data/results-yahoo-2012-play/qb.tsv)

From those observations, you can't really make any solid conclusions. But if
you think about it for a bit, you might be able to reason your way to some
guesses. For example, one possibility is that the GameCenter data is NFL's own
construction that's only used for their GameCenter interfaces, whereas places
where "official" data is needed might be powered by Elias[1]. Why the
discrepancy? Again, I don't know. It could be legacy software related. It
could be contract/legal related. Or it could just be plain old bugs. e.g.,
Maybe GameCenter hooks into an initial lossy but fast feed that is updated
during the game, but never receives updates from a slower but more accurate
feed later.

Or maybe the NFL purposely inserts data canaries because they know this JSON
feed is unprotected, and they intend on using those data canaries to detect
folks using their data in an unlicensed fashion. I'm pretty sure IMDb does
this, for example. Or maybe they just insert errors purposely to make it too
costly for anyone to use this data in situations that require 100% accuracy
(like fantasy football leagues).

My guess is some innocuous blend of legal and legacy software reasons.

[1] - [http://www.esb.com/](http://www.esb.com/)

------
tunesmith
For a while I had a process to extract the top ten highest-WPA plays from my
favorite team's (Broncos) most recent game. But then my data source dried up.
I'm glad to find out about nflscrapR; it seems like I might be able to
figure out how to do that report again with recent play-by-play data.

Incidentally, that report was _really_ fun during the 2011 Broncos season.
Normally when you are finding the plays with the largest WPA swings, you'd
expect them to be distributed among both teams. But since Tim Tebow became
starting quarterback, I searched for the top ten largest WPA swings for the
rest of the season - and from what I recall, every single one of those
dramatic plays was in the Broncos' favor. Weird. :-)
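
The report itself is simple once you have a WPA value per play. A minimal sketch, assuming each play is a dict with a signed `wpa` field measured from your team's point of view (the field names and the sample rows are invented, not nflscrapR columns or real plays):

```python
import heapq

# Invented sample rows; `wpa` is the change in our team's win probability,
# so positive values helped us and negative values hurt us.
plays = [
    {"desc": "play A", "wpa": 0.05},
    {"desc": "play B", "wpa": -0.22},
    {"desc": "play C", "wpa": 0.31},
    {"desc": "play D", "wpa": 0.12},
    {"desc": "play E", "wpa": -0.02},
]

def top_swings(plays, n=10):
    """Return the n plays with the largest absolute WPA swing."""
    return heapq.nlargest(n, plays, key=lambda p: abs(p["wpa"]))

top = top_swings(plays, n=3)
in_our_favor = sum(1 for p in top if p["wpa"] > 0)
for p in top:
    print(f"{p['desc']}: {p['wpa']:+.2f}")
print(f"{in_our_favor} of {len(top)} biggest swings went our way")
```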

------
MaxLeiter
> Passing becomes more and more popular as you use up your downs.

Just in case author sees this, this is wrong, right? It should read Running
becomes more and more popular as you use up your downs? (Regarding
[https://j253.github.io/blog/images/article_01/01_play_by_dow...](https://j253.github.io/blog/images/article_01/01_play_by_down.png))

~~~
dragonwriter
> Just in case author sees this, this is wrong, right? It should read Running
> becomes more and more popular as you use up your downs?

Er, no, the author is right; the share of all plays that are passing plays
goes up with down number (until dropping at 4), the graph shows that quite
clearly.

------
catbird
Very cool! I would love to see heatmaps of play type with downs on one axis
and yards-to-first-down on the other.
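
The table behind such a heatmap could be built roughly like this. It is only a sketch: the rows are invented, and the short/medium/long cutoffs are an arbitrary assumption.

```python
from collections import defaultdict

# Invented rows: (down, yards_to_go, play_type)
plays = [
    (1, 10, "run"), (1, 10, "pass"), (1, 10, "run"),
    (2, 7, "pass"), (2, 2, "run"),
    (3, 8, "pass"), (3, 9, "pass"), (3, 1, "run"), (3, 2, "pass"),
]

def distance_bucket(to_go):
    """Bucket yards-to-go; the cutoffs are an arbitrary choice."""
    if to_go <= 3:
        return "short"
    if to_go <= 6:
        return "medium"
    return "long"

def pass_share(plays):
    """Map (down, distance bucket) -> fraction of plays that were passes."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for down, to_go, play_type in plays:
        cell = (down, distance_bucket(to_go))
        totals[cell] += 1
        passes[cell] += play_type == "pass"
    return {cell: passes[cell] / totals[cell] for cell in totals}

grid = pass_share(plays)
for cell in sorted(grid):
    print(cell, round(grid[cell], 2))
```

Each key of `grid` is one heatmap cell, so the dict can be handed to whatever plotting library you prefer.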

~~~
petersalas
Something like this?

[http://www.yardsgained.com/#(passes_~_sacks)_~_first_down_at...](http://www.yardsgained.com/#\(passes_~_sacks\)_~_first_down_attempts_~_distance_~_down%2F%3A%2F%2B)

~~~
catbird
Oh wow, that site is excellent. I think this is the query I was thinking of
before, the percentage of passes conditional on down and distance:

[http://www.yardsgained.com/#(passes_~_sacks)_~_(passes_~_sac...](http://www.yardsgained.com/#\(passes_~_sacks\)_~_\(passes_~_sacks_~_rushes\)_~__distance_~_down%2F%3A%2B%2B%2F%2B)

------
seanplusplus
this is super cool! well done.

