
Every NFL play for the past 10 years in CSV format - edw519
http://www.advancednflstats.com/2010/04/play-by-play-data.html
======
edw519
From Line 42536 of the 2008 CSV file:

 _20090201_PIT@ARI,2,30,18,ARI,PIT,1,1,1,(:18) (Shotgun) K.Warner pass short
middle intended for A.Boldin INTERCEPTED by J.Harrison at PIT 0. J.Harrison
for 100 yards TOUCHDOWN. Super Bowl Record longest interception return yards.
Penalty on ARZ-E.Brown Face Mask (15 Yards) declined. The Replay Assistant
challenged the runner broke the plane ruling and the play was
Upheld.,7,10,2008_

They forgot: for(i=0;i<92;i++){yell('edw519','GO!')}

Seriously, I had plans for the next 4 days, but I just scrapped them. Funny
how jazzed I get when it's data that I can really relate to...

I've already structured my data warehouse and started the loads. (I'll
probably need a whole day just to parse the text in Field 10.) Then I'm going
to build a Business Intelligence system on top of it. I will finally have the
proof I need that I, not the offensive coordinator, should be texting each
play to Coach Tomlin.
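
For anyone else eyeing Field 10: a few regexes go a long way before a full
parser is justified. A minimal sketch (the name pattern is my own guess at the
common case, not anything the file documents), pulling the passer, intended
receiver, and interceptor out of a description like the Super Bowl play above:

```python
import re

# The Field 10 text from the sample row above (abridged).
desc = ("(:18) (Shotgun) K.Warner pass short middle intended for A.Boldin "
        "INTERCEPTED by J.Harrison at PIT 0. J.Harrison for 100 yards "
        "TOUCHDOWN.")

# Players appear as initial-dot-surname; this pattern covers the common
# case, not every edge case in ten years of data.
PLAYER = r"[A-Z]\.[A-Za-z'\-]+"
interception = re.compile(
    rf"(?P<passer>{PLAYER}) pass .*?intended for (?P<receiver>{PLAYER}) "
    rf"INTERCEPTED by (?P<interceptor>{PLAYER})"
)

m = interception.search(desc)
if m:
    print(m.group("passer"), m.group("receiver"), m.group("interceptor"))
    # K.Warner A.Boldin J.Harrison
```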

See you guys on Monday.

EDIT: OK, I'm back, but not for long. I'm having way too much fun with this...

fleaflicker: Cool website & domain name. Thanks for the tips. I expect
shortcomings in the data, but it looks like it's in a lot better shape than
the usual free form enterprise quality/vendor/customer comments I usually have
to parse. We'll see...

MattSayer & sjs382: I don't plan to do any analysis. I prefer to build an app
that enables others to do their own analyses, answering questions that nobody
else is asking. Like "Which Steeler makes the most tackles on opposing runs of
more than 5 yards when it's 3rd down and longer than 8 yards to go, the
temperature is below 38, and edw519 is twirling his Terrible Towel clockwise?"

jerf: Nice thought. I've spent years trying to earn enough money to buy the
Pittsburgh Steelers just to fire the slackers and fumblers and win the Super
Bowl every year. Maybe I should just take an easier route and solve that
problem like any self-respecting hacker should: with data & logic. No Steeler
game this weekend; I may have found my destiny </sarcasm>

~~~
fleaflicker
You'll find that the text descriptions aren't consistently formatted. It's
tough to extract structured data from all play descriptions.

For example, first initial plus last name does not uniquely identify a
player. You'll need accurate roster data first, and even then there are
clashes.

We store play data by its structured components (players involved, play type,
player roles, etc) and then derive the text description. This allows us to
reassemble pbp data from different pro games to show a "feed" for your fantasy
team.

Baseball has a smaller set of play outcomes/transitions so it's easier to model
this way. As your example from the Steelers Super Bowl shows, football plays
can be very complex.

~~~
fennecfoxen
"It's tough to extract structured data from all play descriptions."

Which means you can treat it a bit like a text mining problem. NASA had a text
mining contest in 2007 as part of the SIAM conference on data mining which was
really similar - instead of football plays it was textual descriptions of
aeronautics incident reports and their classification. There were several
papers that came out of that (I was with a group that did one of them, using
an approximate nonnegative matrix classification approach - got beat out by
some ensemble approaches).

Anyway - if you'd like to do something with unstructured football play
descriptions, text mining might be able to empower you to some extent without
going through a full manual analysis, and those papers could be a good
starting point. I think some of them ended up in a volume titled _Survey of
Text Mining II_.

~~~
textminer
Incredibly interested in your work here. For small-dimensional problems (or
problems with features that can be engineered to be small-dimensional),
ensemble methods through random forests and bagging and the like are
incredibly useful.

But for high-dimensional text problems that're pure classification, I tend to
rely simply on 1NN classifiers (against a single centroid of training data of
a target category, of which there tend to be many). I've spent a lot of time
with NMF, for its potential as an incredibly interesting data-exploration tool
("There's a pronoun cluster! There's a Spanish cluster! There's a 404 Error
axis!") or low-dimension projection step. I've even spent a good amount of
time on implementing the algorithm in a number of memory-efficient ways.

Could you expand a bit on how you used NMF for these problems in practice
(similar to how a sparse autoencoder captures reduced-dimensional features en
route to supervised learning), or how others used ensemble methods?
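
A minimal sketch of the nearest-centroid scheme just described, with made-up
toy vectors (cosine similarity is the usual choice for text; the numbers here
are purely illustrative):

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-12)

def nearest_centroid(vec, centroids):
    # centroids: label -> mean vector of that category's training docs.
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

centroids = {
    "rush": [3.0, 0.2, 0.1],
    "pass": [0.1, 4.0, 0.3],
}
print(nearest_centroid([0.2, 3.5, 0.4], centroids))  # prints "pass"
```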

~~~
fennecfoxen
Afraid it's been a while, and I wasn't really at the core of the project
design - if you're REALLY interested look up _Anomaly Detection Using
Nonnegative Matrix Factorization_ and contact Michael W Berry (whom I assume
still teaches at the University of Tennessee, Knoxville).

The main idea, though, is to generate a term-by-document matrix (count words,
maybe throw out stopwords, normalize counts), then do Math to factor your
matrix (approximately) into two: term-by-feature and feature-by-document. When
you want to classify a new document, you can use its contents (more terms) to
calculate a feature vector.

(The math seems to typically involve random initialization followed by
iterative improvements. Other work in the field discusses the specifics.)

The matrices are "nonnegative" because, conceptually, features are a
_positive_ thing, and you can't say that a certain term makes something _less_
a member of a feature cluster (only more).
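
A toy sketch of that factorization, using Lee & Seung's multiplicative
updates (sizes, data, and iteration count are arbitrary choices, not anything
from the contest work):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 4))   # term-by-document matrix: 6 terms, 4 documents
k = 2                    # number of latent "features"
W = rng.random((6, k))   # term-by-feature
H = rng.random((k, 4))   # feature-by-document

for _ in range(200):     # random init + iterative improvement
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

# W and H stay nonnegative, and W @ H approximates V. A new document's
# feature vector can be estimated the same way, holding W fixed.
print(np.linalg.norm(V - W @ H))  # reconstruction error
```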

The tricky part is figuring out how to map features to things which are
semantically interesting to your application, and I don't want to comment too
much on the state of that because it's been five years and I honestly forgot
what exactly we did there, and it was all done in Matlab (which I'd never used
before), and there's probably more recent work in the field. But if you fiddle
with it manually, you can come up with your matrices and essentially have a
nice little classifier.

------
tghw
Looking through the 2002 season, there's an oddity around touchdowns and extra
points. It seems that the 6 points for the touchdown are bundled with the
extra point, and the score is not updated until the extra point is complete.

It seems this might result in bugs, as in the Oct 20, 2002 game between Dallas
and Arizona. In the third quarter, with a score of Arizona 6 - Dallas 0,
Dallas scored a touchdown (row 13900) but "aborted" the extra point (row
13901). The 6 points for the Cowboys are not recorded in the data.

The game eventually went to overtime, with the Cardinals kicking a winning
field goal in OT for a final score of Arizona 9 - Dallas 6, but the data here
records it as Arizona 6 - Dallas 0.
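
A quick way to hunt for other games with the same bug, hedged heavily: the
column layout below (description 10th, scores 11th/12th) is inferred from the
sample row quoted elsewhere in this thread, the rows are toy data, and a real
check would also need to handle the score columns swapping when possession
changes.

```python
def missing_td_points(rows, lookahead=3):
    """Flag row indices whose description says TOUCHDOWN but whose
    score columns never change over the next `lookahead` rows."""
    flagged = []
    for i, row in enumerate(rows):
        desc, off_s, def_s = row[9], row[10], row[11]
        if "TOUCHDOWN" not in desc:
            continue
        later = rows[i + 1:i + 1 + lookahead]
        if later and not any((r[10], r[11]) != (off_s, def_s) for r in later):
            flagged.append(i)
    return flagged

# Toy rows in the 13-field layout: ...,description,offscore,defscore,season
rows = [
    ["g", "3", "5", "0", "DAL", "ARI", "1", "10", "20",
     "E.Smith up the middle for 20 yards TOUCHDOWN.", "0", "6", "2002"],
    ["g", "3", "5", "0", "DAL", "ARI", "0", "0", "2",
     "Extra point attempt ABORTED.", "0", "6", "2002"],
    ["g", "3", "4", "55", "DAL", "ARI", "1", "10", "30",
     "K.Johnson kicks 65 yards.", "0", "6", "2002"],
]
print(missing_td_points(rows))  # [0]: the six points never show up
```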

------
danso
There's a FAQ for this data that is on the site's main nav:

<http://www.advancednflstats.com/2007/02/contact.html>

Of particular interest:

 _Where did you get your data?_

Most of my team data comes from open online sources such as espn.com, nfl.com,
myway.com, and yahoo.com. It's easy for anyone to grab whatever they're
interested in from those sites.

 _My play-by-play data comes from a source that's not publicly available, and
at this time I regret that I cannot share it._ However, I am working hard to
develop a way to spread the wealth. One of my biggest goals is to help create
a larger, more open, and more collaborative community for football research.

----

There's no real terms of service so I'm curious as to the constraints in using
this for commercial purposes. I most definitely want to use this for teaching
purposes (how to text-mine, how to build a web app from data, etc) but want to
know what terms the data can be redistributed.

~~~
bendmorris
IANAL, but it has been ruled that NFL player names and statistics are
protected by the First Amendment, i.e. no one "owns" it and anyone is free to
use it for any purpose.

[http://blogs.trb.com/sports/custom/business/blog/2009/04/cbs...](http://blogs.trb.com/sports/custom/business/blog/2009/04/cbssportscom_wins_fantasy_game.html)

However, you do have to get the data, and unauthorized access of computers
(which constitutes trespassing) can be a legal gray area. I'd love to hear a
lawyer weigh in on the legality of scraping the data directly from espn.com.

~~~
pseut
Last time I checked, the play by play data on espn.com was pretty error-
ridden. This was three or so years ago, so it might have changed, and I was
hypothetically interested in the score columns, so it may not matter depending
on other hypothetical uses. But I'd hypothetically avoid scraping ESPN for
that reason alone.

------
petersalas
This seems like as good a time as any to share something I've been working on
which uses the same source data, even though it's pretty rough at the moment
(slow, bad data, only currently goes through week 8 of 2012, etc.):

<http://nfl-query.herokuapp.com/>

The basic syntax is [stats] [conditions] : [row] / [column].

There's some autocompletion to try to make it possible to discover what is
accepted.

Examples:

passing yards : team / season

first downs / first down attempts : down / distance

rushing yards min 100 rushing yards : player, game, quarter

rushing yards / carries min 200 carries : player

One of the biggest problems is that it's currently way too easy to shoot
yourself in the foot by making a really slow query.

------
arscan
I'm frankly surprised that this information is allowed to be distributed. I
spent a while in the financial services industry, and while it was really easy
to obtain "public" information like stock quote data, I recall that we weren't
allowed to simply scrape data from public sites... we had to pay a license fee
to get a feed of the data if we were planning on repackaging & distributing
it.

It seems to me that the NFL would want to have exclusive rights to distribute
this data and charge people a fee for access to it. Clearly I'm no expert in
these legal affairs though.

~~~
aidenn0
IANAL, but I asked one about this a while ago; let's see if I can remember: It's
complicated. The NFL broadcasts are copyrighted, and come with a statement
that (among other things) distributing descriptions of the game is not
allowed. That could be considered a derivative work.

On the other hand, a live performance is generally not protected by
copyright, so if you attend a live game to collect the data, you may be in the
clear.

The data isn't owned by the NFL, but all recordings of the games are, and so
any data obtained by watching recordings of the games could potentially be
controlled by the NFL.

~~~
_delirium
It might not even violate the NFL's copyright if extracted from tapes. For one
thing, something is only a "derived work" for copyright purposes if it's a
"creative work" subject to copyright at all, and in the U.S., data sets
comprising factual information aren't typically considered "creative". For
another, it's not clear whether data about a recording is derived from the
recording for copyright purposes. For example, a re-edit or mash-up of a film
is clearly a derived work, but is a count of how many minutes each character
speaks a derived work? Or is a Spotify-style algorithmic analysis of a song's
musical style a derived work of the song?

I wouldn't want to put a large bet on where exactly those lines are drawn,
though.

------
dude_abides
Here is an idea: build a predictive model of an offensive coach that predicts
the play he will call, given a game situation (and based on that, build a
predictiveness quotient for a coach).

~~~
fleaflicker
It doesn't work like that in practice. Football is very dependent on matchups.
Coaches will vary gameplans from week-to-week to exploit weaknesses they see
on film.

~~~
dude_abides
Matchup would be a part of the model. My experience with predictive modeling
in various domains has taught me that people tend to underestimate how
predictive they are (NFL offensive/defensive coaches are no exception).

~~~
lftl
I'm interested in doing some predictive modeling for a couple of project ideas
I've been kicking around. Are there any specific resources you would recommend
as good starter material?

------
euroclydon
How many of you are thinking right now: I'm going to generate an HTML page for
every game and throw ads on it? Be Honest!

~~~
DanBC
Is anyone going to try a 'moneyball' style Fix_Your_Fantasy_League_LineUp site
with ads?

~~~
404error
I might try to use the data to create mock drafts.

------
burntsushi
The CSV file format is nice, but if you're looking for a Python API to play
with NFL stats without having to parse play-data fields, check out nflgame
[1]. I've written up a quick primer. [2] It also includes the ability to get
play-by-play statistics live.

[1] - <https://github.com/BurntSushi/nflgame>

[2] - <http://blog.burntsushi.net/nfl-live-statistics-with-python>

------
ImJasonH
I've started uploading these CSVs to a public Google BigQuery dataset called
[nfl], so you can run queries over them like this:

    
    
        SELECT off, COUNT(off) AS count
        FROM [nfl.2012reg]
        WHERE description CONTAINS "INTERCEPTED"
        GROUP BY off
        ORDER BY count DESC
    

(This counts the number of plays that resulted in an interception by the team
that threw the interception, sorted from most to fewest INTs)

~~~
danvoell
I'm new to BigQuery, how do I access a public dataset? I ran the query and got
the error Not Found: Dataset 578707073226:nfl

~~~
ImJasonH
Sorry, I didn't actually make it public it seems. Should work now.

~~~
danvoell
thanks!

------
patrickk
Here's some soccer data, doesn't include play-by-play though (soccer generally
isn't suited to that kind of breakdown, although Opta Sports do track it).

<http://www.football-data.co.uk/downloadm.php>

Tons of European leagues, going back to 1993 in some cases.

Here's some sites that give detailed stats and match reports:

<http://www.eplindex.com/>

<http://www.whoscored.com/>

<http://www.soccerstats.com/>

<http://www.soccerway.com/>

<http://www.squawka.com/>

Man City use their petro-dollars to open up Opta Sports (detailed match stats)
to all: <http://www.mcfc.co.uk/the-club/mcfc-analytics>

Someone needs to compile stats equivalent to these NFL ones for European
football! Hmmmm...

------
ScottWhigham
The comments on that are awesome too - great advice for parsing, categorizing,
and such. I couldn't download 2010 though - "Sorry, we are unable to generate
a view of the document at this time. Please try again later."

~~~
ScottWhigham
If you click the little down arrow (top left), it will download the file. Just
a heads-up in case others see this message as well.

------
gavinlynch
Amazing!!! Thanks to www.advancednflstats.com for doing all the leg-work.
Highly recommend their site too. Their in-game win probability statistics are
always a must-have for me on game-day ^_^

~~~
tghw
I really like his 4th Down analysis:

[http://www.advancednflstats.com/2009/09/4th-down-study-
part-...](http://www.advancednflstats.com/2009/09/4th-down-study-part-1.html)

The tl;dr version can be found at:

<http://www.advancednflstats.com/2010/05/4th-down-briefs.html>

The conclusion is that teams should go for it on 4th down much more often than
they currently do.

He also has a calculator where you can get the exact values:

<http://wp.advancednflstats.com/4thdncalc1.php>

~~~
yukoncornelius
The Patriots have used this analysis:

[http://www.math.toronto.edu/mpugh/Teaching/Sci199_03/Footbal...](http://www.math.toronto.edu/mpugh/Teaching/Sci199_03/Football_game_theory.htm)

I also believe Belichick/Adams have funded some football economics research.

------
danso
This looks like great fun...Judging by some of the sample entries, it will
also be an instructive example of the limitations of CSV and why serious
analysts who want to work with unstructured data need to know a scripting
language, or at least regexes.

Sample description field:

> _20020905_SF@NYG,1,59,20,NYG,SF,3,11,81,(14:20) (Shotgun) K.Collins pass
intended for T.Barber INTERCEPTED by T.Parrish (M.Rumph) at NYG 29. T.Parrish
to NYG 23 for 6 yards (T.Barber).,0,0,2002_

In the comments section of the OP, someone posted this sample Excel function:

    
    
        =IF(ISNUMBER(SEARCH("right tackle",J2)),"rush",
         IF(ISNUMBER(SEARCH("right guard",J2)),"rush",
         IF(ISNUMBER(SEARCH("left guard",J2)),"rush",
         IF(ISNUMBER(SEARCH("up the middle",J2)),"rush",
         IF(ISNUMBER(SEARCH("left tackle",J2)),"rush",
         IF(ISNUMBER(SEARCH("left end",J2)),"rush",
         IF(ISNUMBER(SEARCH("right end",J2)),"rush",
         IF(ISNUMBER(SEARCH("pass",J2)),"pass",
         IF(ISNUMBER(SEARCH("kneel",J2)),"kneel",
         IF(ISNUMBER(SEARCH("punt",J2)),"punt",
         IF(ISNUMBER(SEARCH("kicks",J2)),"kickoff",
         IF(ISNUMBER(SEARCH("extra point",J2)),"extrapoint",
         IF(ISNUMBER(SEARCH("sacked",J2)),"sack",
         IF(ISNUMBER(SEARCH("PENALTY",J2)),"penalty",
         IF(ISNUMBER(SEARCH("field goal",J2)),"fieldgoal",
         IF(ISNUMBER(SEARCH("FUMBLES",J2)),"fumble",
         IF(ISNUMBER(SEARCH("spiked",J2)),"spike",
         IF(ISNUMBER(SEARCH("scrambles",J2)),"rush","rush"))))))))))))))))))
    
    

Dear god, at what point do people finally realize that it's worth learning
some simple scripting to work with text files?

~~~
HelloMcFly
The Excel function looks ridiculous, but it probably didn't take more than 10
minutes to make, tops. Nested conditionals are easy.

At any rate, what would you recommend most to accomplish the task? I'm
learning Python and know R a bit, so I was just wondering how I was going to
about combing through the data.

~~~
danso
Python or Ruby is fine...the main trick is to be able to process those fields
with regular expressions...which, IIRC, requires throwing in VBscript if you
were to handle it solely in Excel.

Python and Ruby would also allow for more elegant-looking -- i.e. more
maintainable -- functions to handle that field.
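
For instance, the whole Excel formula above collapses into a first-match-wins
rule table. This sketch keeps the same patterns, precedence, and "rush"
fall-through as the formula; `re.IGNORECASE` mirrors Excel SEARCH's
case-insensitivity:

```python
import re

# Same patterns and precedence as the nested-IF formula; first match wins.
RULES = [
    (r"right tackle", "rush"), (r"right guard", "rush"),
    (r"left guard", "rush"), (r"up the middle", "rush"),
    (r"left tackle", "rush"), (r"left end", "rush"),
    (r"right end", "rush"), (r"pass", "pass"),
    (r"kneel", "kneel"), (r"punt", "punt"),
    (r"kicks", "kickoff"), (r"extra point", "extrapoint"),
    (r"sacked", "sack"), (r"PENALTY", "penalty"),
    (r"field goal", "fieldgoal"), (r"FUMBLES", "fumble"),
    (r"spiked", "spike"), (r"scrambles", "rush"),
]

def play_type(description):
    for pattern, label in RULES:
        if re.search(pattern, description, re.IGNORECASE):
            return label
    return "rush"  # the formula's fall-through default

print(play_type("K.Collins pass intended for T.Barber INTERCEPTED"))  # pass
print(play_type("T.Barber up the middle for 3 yards"))                # rush
```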

~~~
pseut
R is fine too: it has regular expressions and probably excels if you plan to
do statistics using all of that data. Python seems to have reasonable
statistics functionality as well (with pandas, etc) but I haven't used it
personally.

~~~
HelloMcFly
Thanks for the note. I've not really looked into how one would do that with R
(doing it in Python seems more clear), but am checking it out now. If anyone
else is looking I'm finding this PDF helpful:
[http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SvetlanaEdenR...](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SvetlanaEdenRFiles/regExprTalk.pdf)

------
AlwaysBCoding
God this is such interesting stuff. How do we still not have a fully featured
open source NFL stats-rosters-game charting API? Who wouldn't want to
contribute to that project?

Other than cool data visualization stuff, the obvious implication is the
potential to devise a profitable system to pick games against the spread. The
guys at Football Outsiders have done a decent job at it and made a proprietary
algorithm that picked games at 58% this year (which is over the threshold you
need to be profitable in Vegas). But even those guys are still having some
trouble getting access to and aggregating the data in a usable format.

I really want to sit down and start playing around with some of this data so I
appreciate you putting this together for everyone. The NFL needs an open
source API and this is definitely a step in the right direction.

~~~
burntsushi
I believe my library, nflgame [1], would fit the bill. Features all play-by-
play data back to 2009, and includes the ability to track play-by-play data
live.

[1] - <https://github.com/BurntSushi/nflgame>

------
nchuhoai
For my ML class, I used this dataset to play around with predicting plays. I'm
obviously no expert, but I wanted to share my report anyway:

<https://www.dropbox.com/s/cy04oxaq83mxvoz/report.pdf>

------
evanjacobs
A bit OT, but I thought this might be a good opportunity to mention the
upcoming SportsHackDay in Seattle from Feb 1-3 which culminates in a group
viewing of the SuperBowl. <http://sportshackday.com/>

------
kevinburke
I wrote a small wrapper around the 4th down calculator on that site, which
should help you figure out if your team should go for it on 4th down:

<http://downanddistance.herokuapp.com/>

~~~
stuff4ben
Need a bounds check or two in there. Tried setting the number of yards you
need to 1 and the yards away from the endzone to 99 and it threw up a nice
exception. Cool calculator though!

~~~
brianbreslin
As a Madden (game) aficionado my first thought was "ALWAYS go for it on 4th!"
but then again I play super recklessly...

------
jredwards
I've used data from Brian Burke's site before. I think it's the exact PBP data
the NFL has, but you'll find that the structure and common phrasings change
over the years. I had to write a lot of regular expressions and I was still
catching edge cases for weeks.

btw, pro-football-reference has pbp data now too, and it probably goes back a
lot further, but I think they discourage mass scraping of their site.

------
pmarsh
There is a lot to have fun with here. I would imagine though that in a lot of
NFL coaching rooms there has to be a balance between coaching and analysis.

Like someone else said, it's about match-ups.

Semi-related : [http://profootballtalk.nbcsports.com/2013/01/03/polian-
think...](http://profootballtalk.nbcsports.com/2013/01/03/polian-thinks-
moneyball-wont-work-in-nfl/)

------
p4bl0
As a French person not interested in sports at all, this would have made no
sense to me before I watched the TV series The League [1]. Now I kind of enjoy
the fact that these stats exist and are available in an open format, even if I
don't really care myself.

[1] <http://www.imdb.com/title/tt1480684/>

------
zempf
There's an interface into this sort of play-by-play data (since 2000) at
<http://pro-football-reference.com/play-index/play_finder.cgi> \-- lets you do
queries on down/distance/position on field/score differential, all that sort
of stuff.

------
grogenaut
This is slightly off topic but does anyone know of a resource to get the odds
at gametime historically for NFL games?

~~~
glamp
[https://www.dropbox.com/s/ikczgv737lllh0a/nfl_spreads_1985-2...](https://www.dropbox.com/s/ikczgv737lllh0a/nfl_spreads_1985-2010.csv)

~~~
grogenaut
thanks!

------
crabasa
If you live in the vicinity of Seattle, there is a sports-themed hackathon
going on Superbowl weekend. Google, ESPN and a bunch of tech companies are
sponsoring. The grand prize will be passes to the Sloan Sports Conference.
More details to come:

<http://sportshackday.com>

------
bsims
You should send this to the Buffalo Bills' new analytics department. Maybe a
hacker could get hired by an NFL team.

[http://www.nfl.com/news/story/0ap1000000121582/article/bill-...](http://www.nfl.com/news/story/0ap1000000121582/article/bill-
polian-moneyball-does-not-work-in-the-nfl)

------
activus
It would be interesting to take this data and build an app around it for
fantasy football. If you have all the tendencies, and how players like your
player have played against certain teams, you could make better guesses on who
to play.

~~~
cacciatc
I did something like that for an AI class in college. We used FuzzyCLIPS to
write an expert system for drafting fantasy football teams. A ruby script
pulled the CSV data from some other site that had the previous three years of
NFL data, and then converted the CSV to fact files which the system then read
in.

When all was said and done it worked, but made some pretty crappy draft picks!
I should find that code....

------
eel
FYI, for baseball fans, you can get similar data about each play in the MLB
from <http://www.retrosheet.org>

------
wildster
Shame it does not have player ids. I wonder how many players with the same
surname and the same first initial play for the same team.

------
bdittmer
Now all we need is some historical line & associated movement data...

------
JohnFromBuffalo
What ... no NHL? Oh ya.

------
winstonian
Would be great if there were columns:

Head Coach \- Offensive Coordinator \- Defensive Coordinator \- Formation \-
Play

------
mcs
wow

------
asc76
Wow. Simply. Wow.

