
From Line 42536 of the 2008 CSV file:

20090201_PIT@ARI,2,30,18,ARI,PIT,1,1,1,(:18) (Shotgun) K.Warner pass short middle intended for A.Boldin INTERCEPTED by J.Harrison at PIT 0. J.Harrison for 100 yards TOUCHDOWN. Super Bowl Record longest interception return yards. Penalty on ARZ-E.Brown Face Mask (15 Yards) declined. The Replay Assistant challenged the runner broke the plane ruling and the play was Upheld.,7,10,2008

They forgot: for(i=0;i<92;i++){yell('edw519','GO!')}

Seriously, I had plans for the next 4 days, but I just scrapped them. Funny how jazzed I get when it's data that I can really relate to...

I've already structured my data warehouse and started the loads. (I'll probably need a whole day just to parse the text in Field 10.) Then I'm going to build a Business Intelligence system on top of it. I will finally have the proof I need that I, not the offensive coordinator, should be texting each play to Coach Tomlin.
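Here's roughly where I'll start on Field 10 (a sketch only: the filename and column positions are guesses from the sample row above, and the regexes will need many more cases):

    import csv
    import re

    # Pull a few structured facts out of the free-text play description.
    PASS_RE = re.compile(r"(?P<passer>[A-Z]\.[\w'-]+) pass (?P<depth>short|deep) "
                         r"(?P<direction>left|middle|right)")
    INT_RE = re.compile(r"INTERCEPTED by (?P<defender>[A-Z]\.[\w'-]+)")

    def parse_play(row):
        desc = row[9]  # Field 10 in the row quoted above
        play = {'gameid': row[0], 'desc': desc}
        if (m := PASS_RE.search(desc)):
            play.update(m.groupdict())
        if (m := INT_RE.search(desc)):
            play['intercepted_by'] = m.group('defender')
        return play

    with open('2008_nfl_pbp_data.csv', newline='') as f:  # guessed filename
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            print(parse_play(row))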

See you guys on Monday.

EDIT: OK, I'm back, but not for long. I'm having way too much fun with this...

fleaflicker: Cool website & domain name. Thanks for the tips. I expect shortcomings in the data, but it looks like it's in a lot better shape than the free-form enterprise quality/vendor/customer comments I usually have to parse. We'll see...

MattSayer & sjs382: I don't plan to do any analysis. I prefer to build an app that enables others to do their own analyses, answering questions that nobody else is asking. Like "Which Steeler makes the most tackles on opposing runs of more than 5 yards when it's 3rd down with longer than 8 yards to go, the temperature is below 38, and edw519 is twirling his Terrible Towel clockwise?"
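In warehouse terms that's just a filter-and-group (pandas sketch; every column name here is hypothetical, and towel telemetry is not yet available):

    import pandas as pd

    plays = pd.read_csv('warehouse_extract.csv')  # hypothetical export

    q = plays[(plays.play_type == 'run') &
              (plays.yards_gained > 5) &
              (plays.down == 3) &
              (plays.yards_to_go > 8) &
              (plays.temperature_f < 38) &
              (plays.defense == 'PIT')]
    print(q.groupby('tackler').size().sort_values(ascending=False).head(1))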

jerf: Nice thought. I've spent years trying to earn enough money to buy the Pittsburgh Steelers just to fire the slackers and fumblers and win the Super Bowl every year. Maybe I should just take an easier route and solve that problem like any self-respecting hacker should: with data & logic. No Steeler game this weekend; I may have found my destiny </sarcasm>



You'll find that the text descriptions aren't consistently formatted. It's tough to extract structured data from all play descriptions.

For example, first initial plus last name does not uniquely identify a player. You'll need accurate roster data first, and even then there are clashes.

We store play data by its structured components (players involved, play type, player roles, etc.) and then derive the text description. This allows us to reassemble PBP data from different pro games to show a "feed" for your fantasy team.

Baseball has a smaller set of play outcomes/transitions, so it's easier to model this way. As your example from the Steelers' Super Bowl shows, football plays can be very complex.
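A minimal sketch of the idea (field names invented for illustration, not our actual schema):

    from dataclasses import dataclass, field

    @dataclass
    class Play:
        play_type: str                                # 'pass', 'run', 'interception', ...
        players: dict = field(default_factory=dict)   # role -> player
        yards: int = 0
        touchdown: bool = False

        def describe(self):
            # The text is derived from the structured parts, never stored.
            if self.play_type == 'interception':
                text = (f"{self.players['passer']} pass intended for "
                        f"{self.players['target']} INTERCEPTED by "
                        f"{self.players['defender']}, returned {self.yards} yards")
                return text + (" for a TOUCHDOWN." if self.touchdown else ".")
            return f"{self.play_type} for {self.yards} yards."

    pick_six = Play('interception',
                    {'passer': 'K.Warner', 'target': 'A.Boldin',
                     'defender': 'J.Harrison'},
                    yards=100, touchdown=True)
    print(pick_six.describe())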


"It's tough to extract structured data from all play descriptions."

Which means you can treat it a bit like a text mining problem. NASA ran a text mining contest in 2007, as part of the SIAM conference on data mining, that was really similar - instead of football plays it was textual descriptions of aeronautics incident reports and their classification. Several papers came out of that (I was with a group that did one of them, using an approximate nonnegative matrix factorization approach - we got beat out by some ensemble approaches).

Anyway - if you'd like to do something with unstructured football play descriptions, text mining might be able to empower you to some extent without going through a full manual analysis, and those papers could be a good starting point. I think some of them ended up in a volume titled _Survey of Text Mining II_.


Incredibly interested in your work here. For low-dimensional problems (or problems with features that can be engineered to be low-dimensional), ensemble methods like random forests and bagging are remarkably useful.

But for high-dimensional text problems that are pure classification, I tend to rely on simple 1NN classifiers (against a single centroid of the training data for each target category, of which there tend to be many). I've spent a lot of time with NMF, both as an incredibly interesting data-exploration tool ("There's a pronoun cluster! There's a Spanish cluster! There's a 404 Error axis!") and as a low-dimension projection step. I've even spent a good amount of time implementing the algorithm in a number of memory-efficient ways.
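For concreteness, the centroid flavor is just this (a sketch with placeholder data):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Training docs grouped by category (placeholder examples).
    cats = {'pass': ['K.Warner pass short middle to A.Boldin for 12 yards'],
            'run':  ['W.Parker right tackle for 3 yards']}

    vec = TfidfVectorizer()
    vec.fit([d for docs in cats.values() for d in docs])

    # One centroid per category: the mean of its training vectors.
    centroids = {c: np.asarray(vec.transform(docs).mean(axis=0)).ravel()
                 for c, docs in cats.items()}

    def classify(text):
        # Assign to whichever centroid is nearest by cosine similarity.
        v = vec.transform([text]).toarray().ravel()
        def cos(a, b):
            return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        return max(centroids, key=lambda c: cos(centroids[c], v))

    print(classify('B.Leftwich pass deep left to H.Ward'))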

Could you expand a bit on how you used NMF for these problems in practice (similar to how a sparse autoencoder captures reduced-dimensional features en route to supervised learning), or how others used ensemble methods?


Afraid it's been a while, and I wasn't really at the core of the project design - if you're REALLY interested, look up _Anomaly Detection Using Nonnegative Matrix Factorization_ and contact Michael W. Berry (who I assume still teaches at the University of Tennessee, Knoxville).

The main idea, though, is to generate a term-by-document matrix (count words, maybe throw out stopwords, normalize counts), then do Math to factor your matrix (approximately) into two: term-by-feature and feature-by-document. When you want to classify a new document, you can use its contents (more terms) to calculate a feature vector.

(The math seems to typically involve random initialization followed by iterative improvements. Other work in the field discusses the specifics.)

The matrices are "nonnegative" because, conceptually, features are a _positive_ thing: you can't say that a certain term makes something less a member of a feature cluster, only more.

The tricky part is figuring out how to map features to things that are semantically interesting to your application. I don't want to say too much about the state of that: it's been five years and I honestly forget exactly what we did there, it was all done in Matlab (which I'd never used before), and there's probably more recent work in the field. But if you fiddle with it manually, you can come up with your matrices and essentially have a nice little classifier.
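For a rough feel of the pipeline in today's terms (scikit-learn rather than Matlab; the toy documents are placeholders):

    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ['K.Warner pass short middle intended for A.Boldin',
            'W.Parker up the middle for 3 yards',
            'B.Roethlisberger pass deep right to S.Holmes TOUCHDOWN',
            'E.James left tackle for no gain']

    # Term counts; scikit-learn is document-by-term, the transpose of
    # the term-by-document layout described above. Same idea.
    vec = CountVectorizer(stop_words='english')
    X = vec.fit_transform(docs)

    # X ~= W @ H, all entries nonnegative.
    # W: document-by-feature, H: feature-by-term.
    nmf = NMF(n_components=2, init='random', random_state=0, max_iter=500)
    W = nmf.fit_transform(X)
    H = nmf.components_

    # A new document's feature vector, via the learned H.
    print(nmf.transform(vec.transform(['K.Warner pass deep left'])))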


I asked a question on Stack Overflow a while ago looking for guidance on parsing exactly this kind of stuff. http://stackoverflow.com/questions/8198923/natural-language-...


I eagerly await your next eBook, "How I Pivoted Into The NFL".


"How I Pivot-tabled Into The NFL"



Can we use the play descriptions to find corresponding clips on YouTube to reconstruct the entirety of the past 10 years of the league? :)


Funny, I was looking for just that line to see how the more complicated plays were described.

First thing I noticed was that the Game ID matches the URL format on CBS's website: 20090201_PIT@ARI == http://www.cbssports.com/nfl/gametracker/playbyplay/NFL_2009...

Also, I went to compare this PBP to both ESPN and CBS and found that both have exactly the same PBP data, which is interesting because it seems they got it directly from the NFL (or from the same source, at least). I guess this makes sense, but it's something I hadn't considered.

For reference, ESPN's PBP for the same game: http://espn.go.com/nfl/playbyplay?gameId=290201022&perio...
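The game ID looks self-describing; assuming the format holds across the whole file, it splits cleanly:

    from datetime import datetime

    def parse_gameid(gameid):
        # '20090201_PIT@ARI' -> (date, away team, home team)
        date_s, matchup = gameid.split('_')
        away, home = matchup.split('@')
        return datetime.strptime(date_s, '%Y%m%d').date(), away, home

    print(parse_gameid('20090201_PIT@ARI'))
    # (datetime.date(2009, 2, 1), 'PIT', 'ARI')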


It looks like the same format they use for NFL Game Rewind too. I would guess that there is an official syntax and that the data is provided by the NFL; if not, you would have all kinds of formats and opinions about the game baked into each team's data. I would also guess that the same office that keeps records (game, individual, all-time, etc.) is the one that keeps the play-by-play too.

Overall this is neat, but it's hard to find real-life context within this data. Was the QB pressured, was a coverage blown, was there a pre-snap audible or motion or change by the defense, what was the formation, how much sleep did the players get the night before, etc.


I think they all get their data from Elias Sports Bureau.


It looks like a lot of the work you are hoping to do has already been done on http://statsheet.com/nfl

Though perhaps not as open...


See also http://www.pro-football-reference.com/, but I'd love to see something on GitHub.


It exists. Check out nflgame. [1]

[1] - https://github.com/BurntSushi/nflgame
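For a taste, roughly the sort of thing its README shows (paraphrased from memory, so check the repo for the current API):

    import nflgame

    # Top five rushers from week 1 of the 2013 season.
    games = nflgame.games(2013, week=1)
    players = nflgame.combine_game_stats(games)
    for p in players.rushing().sort('rushing_yds').limit(5):
        print('%s: %d carries, %d yards, %d TDs'
              % (p, p.rushing_att, p.rushing_yds, p.rushing_tds))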


Please post your results, I'd love to read your analysis.


Agreed. I'd love to learn about the process, too. Not just the results/findings.


@edw519: Awesome! Thank you for sharing what you're doing here with the data. I was thinking of playing with this dataset too, and since I'm new to this field (data), I look forward to learning from you if you post more info in the future!

Thanks!


One of the "new" ways that Burke (the data's creator) et al. are using this type of data is finding the Expected Points Added (EPA) for each play. EPA lets you determine how valuable players are to a team's performance.

http://www.advancednflstats.com/2010/01/expected-points-ep-a...

I've been trying to work at the college football level with this same strategy, but I'm still trying to figure out how it's calculated. It seems trivial, but it takes a lot of data organizing.
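As I understand it, the calculation itself is simple once the bookkeeping is done: estimate the expected net points of each down/field-position state from historical outcomes, then score every play by the change in state value. A toy version (all column names invented, and precomputing 'next_score' per play is the real work):

    import pandas as pd

    plays = pd.read_csv('pbp.csv')  # hypothetical tidy play-by-play extract

    # Expected points for a state = average of the next points scored
    # from that state (negative when the opponent scores next).
    plays['ydline_bin'] = plays['yardline'] // 10
    ep = plays.groupby(['down', 'ydline_bin'])['next_score'].mean()

    # EPA of a play = EP of the resulting state minus EP of the
    # starting state, plus any points scored on the play itself.
    def epa(before, after, points=0):
        return points + ep.get(after, 0.0) - ep.get(before, 0.0)

    print(epa(before=(1, 2), after=(2, 3)))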



