Dataset: Ten Years of NFL Plays Analyzed, Visualized, Quizzified (statwing.com)
75 points by glaugh on Jan 31, 2014 | 38 comments

If you're willing to limit yourself to the last five years, you can avoid the pain of parsing those free-form text descriptions with nflgame[1] or nfldb[2]. Disclaimer: I am the author of those tools. We've been slowly building up a small community of people using them. In fact, I'm currently in a small fantasy playoff league that's running off nflgame.

They also include live updates while games are playing.

I'd be curious to see if you get any substantially different results using structured data (nflgame gets it from NFL.com's JSON feed) as opposed to parsing the text descriptions.

[1] - https://github.com/BurntSushi/nflgame

[2] - https://github.com/BurntSushi/nfldb

Another option for some of these types of queries is an app I made called Yards Gained:


The data is from play-by-play text parsing going back to 2000.

I recently joined nfl.com as a FE dev; I see a lot of 404s on Game Center data feeds. The feeds are being requested past the end of the data (each feed tells you where to look for new data, until the last feed returns a 200 with no data; beyond that it's a 404). Not sure how to reduce the 404s, suppose we could document the feeds and make them openly available. Hmm.
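A minimal sketch of a client that would avoid those 404s, assuming the behavior described above: each response points at the next one, and an empty 200 marks the end. The field names `plays` and `next`, the URLs, and the `fetch` callable are all hypothetical; the thread does not document the real feed format.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical response shape: (HTTP status, decoded JSON payload).
Response = Tuple[int, Dict]


def follow_feed(fetch: Callable[[str], Response], start_url: str,
                max_requests: int = 1000) -> List[str]:
    """Follow 'next' pointers and stop at the first 200 that carries no data."""
    collected: List[str] = []
    url: Optional[str] = start_url
    for _ in range(max_requests):
        if url is None:
            break
        status, payload = fetch(url)
        if status != 200:
            break  # already past the end of the data
        plays = payload.get("plays", [])
        if not plays:
            break  # empty 200 marks the end: do not request the next URL
        collected.extend(plays)
        url = payload.get("next")
    return collected


# In-memory stand-in for the feed so the sketch runs without a network.
FAKE_FEED = {
    "/feed/1": (200, {"plays": ["kickoff", "run"], "next": "/feed/2"}),
    "/feed/2": (200, {"plays": ["pass"], "next": "/feed/3"}),
    "/feed/3": (200, {"plays": [], "next": "/feed/4"}),  # end of data
}
requested: List[str] = []


def fake_fetch(url: str) -> Response:
    requested.append(url)
    return FAKE_FEED.get(url, (404, {}))


plays = follow_feed(fake_fetch, "/feed/1")
```

With this stand-in, the client never asks for `/feed/4`, which is the request that would have produced the 404.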

I like finding projects in the wild that do creative things with nfl.com data. This guy is building an Arduino Fantasy Football trophy (https://github.com/sambrenner/future-trophy), and this person built an OS X app using score strip XML for current scores (https://github.com/kchau/NFL-Menu).

> I recently joined nfl.com as a FE dev

Wow, nice! Tell me, how many different unique identifiers do you have for each player/game? :P (Elias id, GSIS id, profile id, ...)

> I see a lot of 404s on Game Center data feeds.

Hmm, I'm not sure what you mean? It seems like the URL stays the same: http://www.nfl.com/liveupdate/game-center/2012080953/2012080...

> Not sure how to reduce the 404s, suppose we could document the feeds and make them openly available.

Yeah, that'd be great! I had figured that you guys kept quiet about them purposefully.

One of the things that has bitten me is that the JSON feeds at the URL above only exist back to 2009. I haven't been able to discover a similar feed for older games. Any ideas? :-)

Those projects are pretty neat, btw. The trophy one is really cool.

OP here. That is awesome.

I'd definitely recommend people use this over the download at the bottom of the blog post; it's really painful to parse the free text (there are a lot of weird edge cases). I'll add a link to this stuff in the post.

edit: clarity

update: link added to post

> it's really painful to parse the free text

I can only imagine. I've tried motivating myself to do it a few times so I could increase the amount of data in nfldb (I believe they are available in one form or another all the way back to 1999), but it's a rather daunting task when there are so many statistical categories.[1]

[1] - https://github.com/BurntSushi/nfldb/wiki/Statistical-categor...

> update: link added to post

Thank you so much for helping to spread the word. I really appreciate it!

Oh man, this is just the best -- I can't wait to play around with this.

Come on IRC/FreeNode on #nflgame if you have any questions or need help.

I don't want to hijack from the OP, just want to point out that these are both awesome. Seriously, as a data geek and huge NFL fan, great work to both of you.

Wow, is this being updated each year to account for any changes to json/xml schema by NFL.com?

It's JSON. I've been an active maintainer since I released it two years ago. Proof is in the issue tracker and the wiki. :-)

(But there haven't been many, if any, changes to the JSON feed's structure. At most, they add some statistical categories.)

please have my babies

I like your username.

The 4th and 2 on the 2 argument bothers me every time someone brings it up. To me it feels like it's an incorrect use of statistics. EV isn't the end-all: kicking the field goal has way lower variance. If your goal is winning consistently, then giving up high-variance, high-EV plays for low-variance, slightly lower-EV plays is often the right choice, as a football season is made up of a very low number of discrete events.

About 45% of all games finish with a spread of 7 or less according to a quick search. Making a play that has close to a 50% chance of leaving you 3 points short of what the field goal would have given you costs you a lot of that margin if you think you are a narrow favorite.

You can't win more in football but you can lose a sure win, so if you believe you are, say, a 3-4 point favorite, then the right play is to take the field goal every time: giving up the safe points means you take half the games and turn them into a crapshoot.
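To put rough numbers on the EV-versus-variance trade-off argued above, here is a small sketch. The probabilities are assumptions picked for illustration (a 98% chip-shot field goal, a 50% touchdown when going for it), not figures from the dataset, and the model ignores field position after a failed attempt.

```python
import math


def mean_and_std(outcomes):
    """outcomes: list of (probability, points) pairs whose probabilities sum to 1."""
    mean = sum(p * pts for p, pts in outcomes)
    variance = sum(p * (pts - mean) ** 2 for p, pts in outcomes)
    return mean, math.sqrt(variance)


# Assumed outcome distributions (illustrative only).
field_goal = [(0.98, 3), (0.02, 0)]
go_for_it = [(0.50, 7), (0.50, 0)]

fg_mean, fg_std = mean_and_std(field_goal)
go_mean, go_std = mean_and_std(go_for_it)

print(f"field goal: EV {fg_mean:.2f}, std dev {fg_std:.2f}")
print(f"go for it:  EV {go_mean:.2f}, std dev {go_std:.2f}")
```

Under these assumptions, going for it is worth about half a point more on average, but its spread of outcomes is roughly eight times wider, which is the trade the comment is objecting to.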

The question specifies "early in the game"; the correct call can and will vary in a late game or late first half situation (and of course depends on the strengths and weaknesses of the teams involved).

As Bill Barnwell of Grantland is fond of saying, in the first half, your only goal is to maximize your expected points. You don't know exactly how many points you're going to need in the game until the end.

The Bill Barnwell aphorism reads pretty empty to me. You could just as easily flip it to: "In the first half, your only goal is to minimize your opponent's expected points. You don't know exactly how many points they will need until the end."

His phrase doesn't really have any impact on the argument being made here unless you really dissect the relative importance of offense versus defense in the game, which requires a lot more evidence to make a case one way or the other.

You have missed the point of what Barnwell is saying. He is saying that tactical calling of plays to trade expected point value against variance does not make sense in the first half, because you don't know how close it's going to be.

To illustrate: in an infinitely long game, you always pick whatever gives you the highest expected value. In a game with one play left, you always pick the play with the highest probability of putting you over the number of points the opponent has, even if it has a lower expected value than other plays.

The quote is saying that the first half is more similar to playing an infinitely long game than a game with one play left.
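The two extremes can be sketched as two different selection rules over the same set of plays. The outcome distributions below are assumed for illustration, and `deficit` is a hypothetical parameter meaning how many points you trail by.

```python
def expected_points(play):
    """play: list of (probability, points) pairs."""
    return sum(p * pts for p, pts in play)


def prob_exceeds(play, deficit):
    """Probability that the play scores strictly more than the deficit."""
    return sum(p for p, pts in play if pts > deficit)


# Assumed outcome distributions (illustrative only).
PLAYS = {
    "field_goal": [(0.98, 3), (0.02, 0)],
    "go_for_it": [(0.50, 7), (0.50, 0)],
}


def best_long_game(plays):
    """'Infinitely long game': maximize expected points."""
    return max(plays, key=lambda name: expected_points(plays[name]))


def best_last_play(plays, deficit):
    """One play left: maximize the chance of overtaking the opponent."""
    return max(plays, key=lambda name: prob_exceeds(plays[name], deficit))


print(best_long_game(PLAYS))
print(best_last_play(PLAYS, deficit=2))
print(best_last_play(PLAYS, deficit=4))
```

With these numbers the long-game rule goes for it, the last-play rule kicks when trailing by 2, and it goes for it when trailing by 4, since a field goal can no longer win.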

Indeed, the pure point-differential EV isn't enough. And beyond variance, there are mental factors involved: cascading micro-rewards from scoring, thinkability of a win or loss, social dominance.

I believe the NYTimes "4th Down Bot" actually performs a likelihood-of-winning analysis, as well.

This sounds really good and NFL teams would definitely want something like this, but even coaches would be wary about using mostly data.

For example, let's say you are the coach of the Patriots and you have been running the ball very successfully the past few games, even winning games because of your running game.

And it's 4th down and 2 yards to go against the Broncos.

The data says, run the ball. Especially since you have done it well in the past. However, you forgot to take into account the stud defensive tackle that has just started playing really well for the Broncos. So, you try to run the ball and you lose the game.

This is just one example of the inability of data to deal with match-ups and schemes.

As both a person who likes data and coached football, I would love the integration of the two, but football has too many variables.

If you have all this data, you are actually going to make the wrong decision because the matchup is bad for your team.

Matchups and schemes trump data.

Additionally, it could be that the statistics show a result that is an artifact of coaches' decisions. If a particular decision is chosen by the pro coaches extremely rarely, but has a moderately better expected result, it could be that the typical expected result is very poor, but the only times coaches make that decision are when they observe a decisive matchup or a special situation, making it appear statistically like that decision is the better one. (Like looking at basketball shooting percentages, and wondering why centers don't take all the shots given how high their FG%s are.)
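A small worked example of that selection effect, with every number assumed for illustration: suppose 20% of situations are favorable matchups where the call works 70% of the time, the rest are ordinary situations where it works 35% of the time, and coaches make the call far more often in the favorable ones.

```python
# Each situation: (share of all situations, how often coaches make the call,
# true success rate of the call). All values are assumed, not from the dataset.
SITUATIONS = {
    "favorable": (0.20, 0.90, 0.70),
    "ordinary": (0.80, 0.05, 0.35),
}

# What the play-by-play data shows: successes divided by attempts.
attempts = sum(share * called for share, called, _ in SITUATIONS.values())
successes = sum(share * called * rate for share, called, rate in SITUATIONS.values())
observed_rate = successes / attempts

# What would happen if the call were made in every situation.
always_call_rate = sum(share * rate for share, _, rate in SITUATIONS.values())

print(f"success rate seen in the data: {observed_rate:.3f}")
print(f"success rate if always called: {always_call_rate:.3f}")
```

Here the data would show roughly a 64% success rate for a call that would succeed only 42% of the time if it were made indiscriminately.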

That is an excellent point.

Coaches do what they think will be successful. If they can't run the ball well, then they don't run the ball, skewing the data.

Also, tendencies change from week to week. You really need a deep understanding of football to get the right data for that week.

This is why data hasn't dominated football: you need to have tons of knowledge before anything else.

The best use of data is to give coaches a high level picture of what the likely result of a given play call will be (trusting sample size to even it out). You hire a coach to then adjust for those other things. However, if the data says that 99% of the time a given play call will have a specific result, coaches shouldn't try to be the exception (unless the game situation demands it). There is a happy medium here. The Rays have figured it out pretty well with Joe Maddon, the NFL will get there too. No one thinks a machine should call all the plays (yet). We just think machine learning and big data should be leaned on as tools the same way that film is. Don't forget, Bill Sharman was considered a nut for watching film in the 1980s. Imagine coaching without it now.

I think it's more than "additionally"; your point is the major point.

It's difficult, but data can in fact control for matchups and schemes.

The point of all the data is to give coaches more objective information. To use your example: The data says run the ball. The stud defensive tackle has just started playing really well. Here's the beauty of the way this data can be presented: While there is a way to present it as "expected points," you can also present it as "likelihood of a given result."

In this case, the positive outcomes for the offense are a first down (or score, which would also be a first down). If the data says you will pick up a first down 51% of the time and you know the Broncos tackle has recently begun playing well, you adjust down for it, and likely decide not to run (and no analytics guy will fault you for it). If the data says you'll pick up a first down 85% of the time, you need a better reason to justify not going for it (is this new tackle the greatest of all time? Or just an upgrade on what they had?). Data is not best used as a replacement for a coach. It is best used to give a coach the likelihood of a given event occurring, and then let the coach's knowledge of matchups, injuries, and personnel issues adjust that likelihood up or down.

I am going to a group discussion tomorrow about first-refusal rights in the NFL and happened to do a brief naive analysis of extra points vs. two-point conversions earlier today. In brief, the EV of a two-point conversion was 0.91 points, while that of an extra point was 0.99. That said, for every 19 extra-point attempts there was only one two-point conversion attempt. Frankly, I am all for variance, so I am rooting for the more ambitious two-point attempt.
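The arithmetic behind those EVs can be unpacked in a few lines. The success rates below are simply the quoted EVs divided by the points on offer, so they are implied by the comment rather than taken from the data.

```python
def expected_value(success_rate, points):
    """Expected points from a single attempt."""
    return success_rate * points


# Success rates implied by the quoted EVs of 0.91 and 0.99.
two_point_rate = 0.91 / 2    # about 45.5% of two-point tries succeed
extra_point_rate = 0.99 / 1  # about 99% of extra points are good

# Success rate at which a two-point try matches the extra point's EV.
break_even_rate = expected_value(extra_point_rate, 1) / 2

# One two-point try for every 19 extra-point attempts.
two_point_share = 1 / (19 + 1)

print(f"implied two-point success rate: {two_point_rate:.3f}")
print(f"break-even two-point rate:      {break_even_rate:.3f}")
print(f"share of tries that go for two: {two_point_share:.2%}")
```

Under these figures the two-point try needs to succeed about 49.5% of the time to break even on expected points, about four percentage points above its implied rate.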

"You got 2 out of 5 answers correct. When you try this quiz with a sorry quiz-taker like you, that’s the result you’re going to get."

Sassy. Jesus Christ.

Richard Sherman [1] must have written this quiz.

[1] http://www.youtube.com/watch?v=yjOkTib5eVQ

OP here

Made the message a bit friendlier for future folks, thanks for the feedback :)

"You got 4 out of 5 answers correct. When you try this quiz with a sorry quiz-taker like you, that’s the result you’re going to get. [That's a joke, we think you did fine :)]"

Haha it was totally fine and funny, just a little surprising. Didn't expect to see something with so much attitude in a page like that.

I think there may be a problem with this kind of analysis: it seems to me that the "riskier" plays (2 point conversion, going for it, etc.) are more likely to be attempted when coaches think they will work, not randomly. To really do a fair analysis of expectancy you would need trials where the play selection is chosen randomly. Anyone else agree with me?

Hey guys, I did this a few months ago, or at least attempted to organize the data from the descriptions on Advanced NFL Stats (https://github.com/jackschultz/nfl-data). Turns out there were tons of special cases. Just curious as to how you decided to organize the data.

Do coaches have access to real-time data on the field or in the coordinator's box? If so, and they followed the datasets exactly, would the results reverse themselves over time due to adjustments by offense or vice versa?

There is some game theory involved. If everyone ALWAYS ran for two point conversions, the defense could just ignore the pass and put 11 in the box and stuff the run.

Then it's most likely that the optimal solution is a mixed strategy; you're going to need to mix it up to keep defenses honest.
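As a sketch of what "mixing it up" means, the run/pass choice on a two-point try can be treated as a 2x2 zero-sum game and solved with the standard closed-form solution. The conversion rates in the payoff table are invented for illustration, and the function name is hypothetical.

```python
def solve_2x2_zero_sum(a, b, c, d):
    """Mixed-strategy solution of the zero-sum game [[a, b], [c, d]].

    Rows are the offense's calls, columns the defense's, and each entry is
    the offense's success probability. Returns (p, q, value): p is how often
    the offense picks row 0, q how often the defense picks column 0.
    """
    denom = a - b - c + d
    if denom == 0:
        raise ValueError("degenerate game: no unique mixed strategy")
    p = (d - c) / denom  # makes the defense indifferent between its columns
    q = (d - b) / denom  # makes the offense indifferent between its rows
    if not (0 < p < 1 and 0 < q < 1):
        raise ValueError("a pure strategy is optimal; no mixing needed")
    value = (a * d - b * c) / denom
    return p, q, value


# Assumed conversion rates (illustrative only):
# (defense stacks the box, defense plays the pass)
RUN = (0.30, 0.65)
PASS = (0.60, 0.35)

run_share, stack_share, success = solve_2x2_zero_sum(RUN[0], RUN[1], PASS[0], PASS[1])
always_run = min(RUN)  # what the offense gets once the defense knows a run is coming

print(f"offense runs {run_share:.1%} of the time")
print(f"defense stacks the box {stack_share:.1%} of the time")
print(f"conversion rate when mixing: {success:.3f} (always running: {always_run:.2f})")
```

With this made-up table the offense converts 47.5% of the time by running on about 42% of tries, versus 30% if it always runs and the defense adjusts.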

Fun stuff!

You might want to make it clear you want the decisions most likely to succeed, not the decision most common among professional coaches (who are presumably optimizing some other form of career-stability-against-criticism).

5/5. Looks like I chose the right thing to do each time... except... which run should I be calling :)

Numberfire (www.numberfire.com) may have this data, if not more, for their fantasy football tool.
