
From Line 42536 of the 2008 CSV file:

20090201_PIT@ARI,2,30,18,ARI,PIT,1,1,1,(:18) (Shotgun) K.Warner pass short middle intended for A.Boldin INTERCEPTED by J.Harrison at PIT 0. J.Harrison for 100 yards TOUCHDOWN. Super Bowl Record longest interception return yards. Penalty on ARZ-E.Brown Face Mask (15 Yards) declined. The Replay Assistant challenged the runner broke the plane ruling and the play was Upheld.,7,10,2008

They forgot: for(i=0;i<92;i++){yell('edw519','GO!')}

Seriously, I had plans for the next 4 days, but I just scrapped them. Funny how jazzed I get when it's data that I can really relate to...

I've already structured my data warehouse and started the loads. (I'll probably need a whole day just to parse the text in Field 10.) Then I'm going to build a Business Intelligence system on top of it. I will finally have the proof I need that I, not the offensive coordinator, should be texting each play to Coach Tomlin.
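Here's roughly where I'll start on Field 10 (a sketch only: the filename and column positions are guesses from the sample row above, and the regexes will need many more cases):

    import csv
    import re

    # Pull a few structured facts out of the free-text play description.
    PASS_RE = re.compile(r"(?P<passer>[A-Z]\.[\w'-]+) pass (?P<depth>short|deep) "
                         r"(?P<direction>left|middle|right)")
    INT_RE = re.compile(r"INTERCEPTED by (?P<defender>[A-Z]\.[\w'-]+)")

    def parse_play(row):
        desc = row[9]  # Field 10 in the row quoted above
        play = {'gameid': row[0], 'desc': desc}
        if (m := PASS_RE.search(desc)):
            play.update(m.groupdict())
        if (m := INT_RE.search(desc)):
            play['intercepted_by'] = m.group('defender')
        return play

    with open('2008_nfl_pbp_data.csv', newline='') as f:  # guessed filename
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            print(parse_play(row))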

See you guys on Monday.

EDIT: OK, I'm back, but not for long. I'm having way too much fun with this...

fleaflicker: Cool website & domain name. Thanks for the tips. I expect shortcomings in the data, but it looks like it's in a lot better shape than the free-form enterprise quality/vendor/customer comments I usually have to parse. We'll see...

MattSayer & sjs382: I don't plan to do any analysis. I prefer to build an app that enables others to do their own analyses, answering questions that nobody else is asking. Like "Which Steeler makes the most tackles on opposing runs of more than 5 yards when it's 3rd down with longer than 8 yards to go, the temperature is below 38, and edw519 is twirling his Terrible Towel clockwise?"
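In warehouse terms that's just a filter-and-group (pandas sketch; every column name here is hypothetical, and towel telemetry is not yet available):

    import pandas as pd

    plays = pd.read_csv('warehouse_extract.csv')  # hypothetical export

    q = plays[(plays.play_type == 'run') &
              (plays.yards_gained > 5) &
              (plays.down == 3) &
              (plays.yards_to_go > 8) &
              (plays.temperature_f < 38) &
              (plays.defense == 'PIT')]
    print(q.groupby('tackler').size().sort_values(ascending=False).head(1))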

jerf: Nice thought. I've spent years trying to earn enough money to buy the Pittsburgh Steelers just to fire the slackers and fumblers and win the Super Bowl every year. Maybe I should just take an easier route and solve that problem like any self-respecting hacker should: with data & logic. No Steeler game this weekend; I may have found my destiny </sarcasm>



You'll find that the text descriptions aren't consistently formatted. It's tough to extract structured data from all play descriptions.

For example, first initial plus last name does not uniquely identify a player. You'll need accurate roster data first, and even then there are clashes.

We store play data by its structured components (players involved, play type, player roles, etc.) and then derive the text description. This allows us to reassemble PBP data from different pro games to show a "feed" for your fantasy team.

Baseball has a smaller set of play outcomes/transitions, so it's easier to model this way. As your example from the Steelers' Super Bowl shows, football plays can be very complex.
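A minimal sketch of the idea (field names invented for illustration, not our actual schema):

    from dataclasses import dataclass, field

    @dataclass
    class Play:
        play_type: str                                # 'pass', 'run', 'interception', ...
        players: dict = field(default_factory=dict)   # role -> player
        yards: int = 0
        touchdown: bool = False

        def describe(self):
            # The text is derived from the structured parts, never stored.
            if self.play_type == 'interception':
                text = (f"{self.players['passer']} pass intended for "
                        f"{self.players['target']} INTERCEPTED by "
                        f"{self.players['defender']}, returned {self.yards} yards")
                return text + (" for a TOUCHDOWN." if self.touchdown else ".")
            return f"{self.play_type} for {self.yards} yards."

    pick_six = Play('interception',
                    {'passer': 'K.Warner', 'target': 'A.Boldin',
                     'defender': 'J.Harrison'},
                    yards=100, touchdown=True)
    print(pick_six.describe())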


"It's tough to extract structured data from all play descriptions."

Which means you can treat it a bit like a text mining problem. NASA ran a text mining contest in 2007, as part of the SIAM conference on data mining, that was really similar - instead of football plays it was textual descriptions of aeronautics incident reports and their classification. Several papers came out of that (I was with a group that did one of them, using an approximate nonnegative matrix factorization approach - we got beat out by some ensemble approaches).

Anyway - if you'd like to do something with unstructured football play descriptions, text mining might be able to empower you to some extent without going through a full manual analysis, and those papers could be a good starting point. I think some of them ended up in a volume titled _Survey of Text Mining II_.


Incredibly interested in your work here. For low-dimensional problems (or problems with features that can be engineered to be low-dimensional), ensemble methods like random forests and bagging are remarkably useful.

But for high-dimensional text problems that are pure classification, I tend to rely on simple 1NN classifiers (against a single centroid of the training data for each target category, of which there tend to be many). I've spent a lot of time with NMF, both as an incredibly interesting data-exploration tool ("There's a pronoun cluster! There's a Spanish cluster! There's a 404 Error axis!") and as a low-dimension projection step. I've even spent a good amount of time implementing the algorithm in a number of memory-efficient ways.
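For concreteness, the centroid flavor is just this (a sketch with placeholder data):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Training docs grouped by category (placeholder examples).
    cats = {'pass': ['K.Warner pass short middle to A.Boldin for 12 yards'],
            'run':  ['W.Parker right tackle for 3 yards']}

    vec = TfidfVectorizer()
    vec.fit([d for docs in cats.values() for d in docs])

    # One centroid per category: the mean of its training vectors.
    centroids = {c: np.asarray(vec.transform(docs).mean(axis=0)).ravel()
                 for c, docs in cats.items()}

    def classify(text):
        # Assign to whichever centroid is nearest by cosine similarity.
        v = vec.transform([text]).toarray().ravel()
        def cos(a, b):
            return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        return max(centroids, key=lambda c: cos(centroids[c], v))

    print(classify('B.Leftwich pass deep left to H.Ward'))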

Could you expand a bit on how you used NMF for these problems in practice (similar to how a sparse autoencoder captures reduced-dimensional features en route to supervised learning), or how others used ensemble methods?


Afraid it's been a while, and I wasn't really at the core of the project design - if you're REALLY interested, look up _Anomaly Detection Using Nonnegative Matrix Factorization_ and contact Michael W. Berry (who I assume still teaches at the University of Tennessee, Knoxville).

The main idea, though, is to generate a term-by-document matrix (count words, maybe throw out stopwords, normalize counts), then do Math to factor your matrix (approximately) into two: term-by-feature and feature-by-document. When you want to classify a new document, you can use its contents (more terms) to calculate a feature vector.

(The math seems to typically involve random initialization followed by iterative improvements. Other work in the field discusses the specifics.)

The matrices are "nonnegative" because, conceptually, features are a _positive_ thing: you can't say that a certain term makes something less a member of a feature cluster, only more.

The tricky part is figuring out how to map features to things that are semantically interesting to your application. I don't want to say too much about the state of that: it's been five years and I honestly forget exactly what we did there, it was all done in Matlab (which I'd never used before), and there's probably more recent work in the field. But if you fiddle with it manually, you can come up with your matrices and essentially have a nice little classifier.
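For a rough feel of the pipeline in today's terms (scikit-learn rather than Matlab; the toy documents are placeholders):

    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ['K.Warner pass short middle intended for A.Boldin',
            'W.Parker up the middle for 3 yards',
            'B.Roethlisberger pass deep right to S.Holmes TOUCHDOWN',
            'E.James left tackle for no gain']

    # Term counts; scikit-learn is document-by-term, the transpose of
    # the term-by-document layout described above. Same idea.
    vec = CountVectorizer(stop_words='english')
    X = vec.fit_transform(docs)

    # X ~= W @ H, all entries nonnegative.
    # W: document-by-feature, H: feature-by-term.
    nmf = NMF(n_components=2, init='random', random_state=0, max_iter=500)
    W = nmf.fit_transform(X)
    H = nmf.components_

    # A new document's feature vector, via the learned H.
    print(nmf.transform(vec.transform(['K.Warner pass deep left'])))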


I asked a question on Stack Overflow a while ago looking for guidance on parsing exactly this kind of stuff. http://stackoverflow.com/questions/8198923/natural-language-...


I eagerly await your next eBook, "How I Pivoted Into The NFL".


"How I Pivot-tabled Into The NFL"



Can we use the play descriptions to find corresponding clips on YouTube to reconstruct the entirety of the past 10 years of the league? :)


Funny, I was looking for just that line to see how the more complicated plays were described.

First thing I noticed was that the Game ID matches the URL format on CBS's website: 20090201_PIT@ARI == http://www.cbssports.com/nfl/gametracker/playbyplay/NFL_2009...

Also, I went to compare this PBP to both ESPN and CBS and found that both have exactly the same PBP data, which is interesting because it seems they got it directly from the NFL (or from the same source, at least). I guess this makes sense, but it's something I hadn't considered.

For reference, ESPN's PBP for the same game: http://espn.go.com/nfl/playbyplay?gameId=290201022&perio...
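The game ID looks self-describing; assuming the format holds across the whole file, it splits cleanly:

    from datetime import datetime

    def parse_gameid(gameid):
        # '20090201_PIT@ARI' -> (date, away team, home team)
        date_s, matchup = gameid.split('_')
        away, home = matchup.split('@')
        return datetime.strptime(date_s, '%Y%m%d').date(), away, home

    print(parse_gameid('20090201_PIT@ARI'))
    # (datetime.date(2009, 2, 1), 'PIT', 'ARI')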


It looks like the same format they use for NFL Game Rewind too. I would guess that there is an official syntax and that the data is provided by the NFL; if not, you would have all kinds of formats and opinions about the game baked into each team's data. I would also guess that the same office that keeps records (game, individual, all-time, etc.) is the one that keeps the play-by-play too.

Overall this is neat, but it's hard to find real-life context within this data. Was the QB pressured, was a coverage blown, was there a pre-snap audible or motion or change by the defense, what was the formation, how much sleep did the players get the night before, etc.


I think they all get their data from Elias Sports Bureau.


It looks like a lot of the work you are hoping to do has already been done on http://statsheet.com/nfl

Though perhaps not as open...


See also http://www.pro-football-reference.com/, but I'd love to see something on GitHub.


It exists. Check out nflgame. [1]

[1] - https://github.com/BurntSushi/nflgame
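For a taste, roughly the sort of thing its README shows (paraphrased from memory, so check the repo for the current API):

    import nflgame

    # Top five rushers from week 1 of the 2013 season.
    games = nflgame.games(2013, week=1)
    players = nflgame.combine_game_stats(games)
    for p in players.rushing().sort('rushing_yds').limit(5):
        print('%s: %d carries, %d yards, %d TDs'
              % (p, p.rushing_att, p.rushing_yds, p.rushing_tds))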


Please post your results, I'd love to read your analysis.


Agreed. I'd love to learn about the process, too. Not just the results/findings.


@edw519: Awesome! Thank you for sharing what you're doing here with the data. I was thinking of playing with this dataset too, and since I'm new to this field (data), I look forward to learning from you if you post more info in the future!

Thanks!


One of the "new" ways that Burke (the data's creator) et al. are using this type of data is finding the Expected Points Added (EPA) for each play. EPA lets you determine how valuable players are to a team's performance.

http://www.advancednflstats.com/2010/01/expected-points-ep-a...

I've been trying to work at the college football level with this same strategy, but I'm still trying to figure out how it's calculated. It seems trivial, but it takes a lot of data organizing.
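As I understand it, the calculation itself is simple once the bookkeeping is done: estimate the expected net points of each down/field-position state from historical outcomes, then score every play by the change in state value. A toy version (all column names invented, and precomputing 'next_score' per play is the real work):

    import pandas as pd

    plays = pd.read_csv('pbp.csv')  # hypothetical tidy play-by-play extract

    # Expected points for a state = average of the next points scored
    # from that state (negative when the opponent scores next).
    plays['ydline_bin'] = plays['yardline'] // 10
    ep = plays.groupby(['down', 'ydline_bin'])['next_score'].mean()

    # EPA of a play = EP of the resulting state minus EP of the
    # starting state, plus any points scored on the play itself.
    def epa(before, after, points=0):
        return points + ep.get(after, 0.0) - ep.get(before, 0.0)

    print(epa(before=(1, 2), after=(2, 3)))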



