
Open Football Data - vinhnx
http://openfootball.github.io
======
keithxm23
This is impressive for the amount of work put into formatting the data and
making it easy to use in different ways.

For details and advanced analytics though, this one is much better:
[https://github.com/soccermetrics/soccermetrics-client-py](https://github.com/soccermetrics/soccermetrics-client-py)

~~~
toyg
Holy fullbacks, Batman! I didn't know about SoccerMetrics. The Python client
seems to be just a wrapper for their REST API:
[http://soccermetrics.github.io/fmrd-summary-api/started.html](http://soccermetrics.github.io/fmrd-summary-api/started.html)

Lots of numbers to crunch there!

This is all very nice, but it would be nicer if there were some sort of cheap
software that amateur teams could use to gather and then analyze their own
data. There's a massive market out there for this sort of thing; the football
world is very conservative and tends to move slowly.

~~~
hhamil
Hi,

I'm the founder of Soccermetrics and the creator of the Soccermetrics API.
Thanks for the attention.

I've had the site for a little over five years now. I had been working on data
models and algorithms for analysis of soccer matches and thought I would have
a go at creating a company out of it. I even applied to YC which was a bit of
a laugh in retrospect :) Right now I have another job that pays the bills but
there are a few projects I do on the side, the API being one of them.

The API is the latest iteration of my data models exposed to the world as well
as my attempt to build as close to a REST API as I could. I don't claim
perfection and I'm sure others will have their opinion on it, which I welcome.

I wrote the Python client, which is, as you say, a wrapper over the API; it
serves well as a starting point. If you have ideas on how to extend it, please
fork and contribute.

There are a few software tools out there that do what you wish. Statzpack is
one, SportyBird is another. I have my doubts about how big the market really
is for this kind of service, but everyone is very early in this space.

~~~
toyg
Thanks so much for your work and for your awesome suggestions!

My family (everyone except me, lol) sort-of runs an amateur football club, so
I have some second-hand knowledge of that world (at least in Italy). They tell
me that systematic, professional and data-driven approaches are incredibly
scarce but very effective. It's a system that still runs on personal networks
and a lot of outdated knowledge and "magic", not unlike the baseball world
described in _Moneyball_ [1]. As you said it's early days, but still, most
coaches under 40 now bring a tablet with them on the bench, and not to take
funny pictures.

[1]
[http://www.amazon.co.uk/gp/product/0393324818/ref=as_li_ss_t...](http://www.amazon.co.uk/gp/product/0393324818/ref=as_li_ss_tl?ie=UTF8&camp=1634&tag=subclassed-21&creative=19450&creativeASIN=0393324818&linkCode=as2)

~~~
hhamil
What your family says is true, but "systematic, professional, and data-driven"
packages that are appealing to semi-pro or amateur clubs tend to be expensive.
To further complicate things, each club will insist that certain specific
events be tracked in order to verify that their gameplan is being implemented,
which leads to "custom" data that are actually a tagged composite of basic
data points.

By the way, thanks for the pull requests on the client. Developing the client
and the API backend has been a learning experience at every step, so I
appreciate contributions from experienced developers.

------
sourc3
You, my friend, are the best. As a huge soccer fan and a developer, getting
this sort of data is really hard unless you shell out hundreds of dollars a
month.

Already thinking about the apps that will use this! Thank you.

~~~
sourc3
And kudos for calling it football :)

~~~
leemcalilly
Yes this is a stellar idea!

------
cabbeer
Does anyone know if this is available for (American) football?

~~~
dirtestbird
I've used this
[https://github.com/BurntSushi/nfldb](https://github.com/BurntSushi/nfldb) and
this
[https://github.com/BurntSushi/nflgame](https://github.com/BurntSushi/nflgame)
to make this [http://www.ffgraf.com/](http://www.ffgraf.com/)

~~~
burntsushi
Author here! That's a pretty nice visualization :-)

I thought I'd just squeeze in a few words about nflgame/nfldb. Both offer
access to the same stuff: play-by-play data back to 2009. Both can be used
with _live_ games so that they are updated in real time (well, at least as
frequently as NFL.com).

nflgame is responsible for pulling the JSON data and provides some rudimentary
searching features. But it's slow.

nfldb stores all this data for you in a relational database. It comes with a
script that updates the database while games are playing so that you can get
access to live data. (It will even migrate the database for you if I've made
any changes to the schema.)

Here's a quick example that shows how to get all of Julian Edelman's touchdown
plays from last season:

    
    
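        import nfldb
        
        # Connect to the nfldb database.
        db = nfldb.connect()
        
        # Julian Edelman's offensive touchdown plays in the
        # 2013 regular season.
        q = nfldb.Query(db)
        q.game(season_year=2013, season_type='Regular')
        q.player(full_name='Julian Edelman').play(offense_tds=1)
        for play in q.as_plays():
            print(play)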
    

Easy as pie!

There's an _extensive_ wiki (almost 20,000 words) with tons of examples and
explanation:
[https://github.com/BurntSushi/nfldb/wiki](https://github.com/BurntSushi/nfldb/wiki)

Other features: aggregating data, player meta data (college, height, weight,
etc.) and fuzzy player name matching.

------
rpedela
A very cool project, but I have one question/issue.

The data format seems to be a custom text format, though admittedly I could be
wrong about that. Is it possible to use TSV or CSV instead? That would be far
more useful, since it could be imported directly into relational databases,
Excel, etc.
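
For what it's worth, a conversion to CSV would not take much code. A minimal
sketch, assuming match lines look roughly like `Arsenal FC  1-3  Aston Villa`
(that layout, the regex and the `matches_to_csv` helper are assumptions for
illustration, not the project's documented format or tooling):

```python
import csv
import io
import re

# Assumed line shape: "<home team>  <home goals>-<away goals>  <away team>".
# This is a guess at the layout, not the project's official grammar.
MATCH_LINE = re.compile(
    r"^\s*(?P<home>.+?)\s+(?P<hg>\d+)-(?P<ag>\d+)\s+(?P<away>.+?)\s*$"
)


def matches_to_csv(text):
    """Return CSV text with one row per line that matches MATCH_LINE."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["home", "home_goals", "away_goals", "away"])
    for line in text.splitlines():
        m = MATCH_LINE.match(line)
        if m:  # lines without a score (headers, dates, blanks) are skipped
            writer.writerow([m["home"], int(m["hg"]), int(m["ag"]), m["away"]])
    return out.getvalue()


# Illustrative sample input only.
sample = """Matchday 1
[Sat Aug/17]
  Arsenal FC  1-3  Aston Villa
  Liverpool FC  1-0  Stoke City
"""
csv_text = matches_to_csv(sample)
print(csv_text)
```

The resulting text can be fed to a spreadsheet or a database import tool as
is; anything the regex does not recognise is silently dropped, so a real
converter would want to report skipped lines.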

------
fiatjaf
What about RSSSF? [http://www.rsssf.com/](http://www.rsssf.com/)

~~~
m0skit0
Impressive, thanks for sharing!

------
ddispaltro
Is there an open database for horse racing?

~~~
phillc73
That depends on your definition of open.

The short answer is no. I've searched long and hard, high and low, for free
(beer) horse racing databases for UK/IRE and Australia. To a lesser extent
I've searched for HK, FR and GER data. I'm yet to find anything that is
comprehensive and no cost.

There are a couple that I do use for UK/IRE racing, which cost in the region
of £35-£45 per month for access. Betwise/Smartform provides a historical
database in MySQL, and daily race card/results updates. UKHorseRacing.co.uk
provides CSV files with historical race data, their ratings and race results.
I take these CSV files, combine them into a SQLite database and interrogate it
with R.
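
Roughly what that CSV-to-SQLite step looks like, sketched here in Python
rather than R and using only the standard library. The column names and the
`load_csv_into_sqlite` helper are made up for illustration; the real files
have their own layout.

```python
import csv
import io
import sqlite3


def load_csv_into_sqlite(conn, table, csv_file):
    """Append the rows of one CSV file (with a header row) to `table`."""
    reader = csv.reader(csv_file)
    header = next(reader)
    cols = ", ".join('"{}"'.format(c) for c in header)
    marks = ", ".join("?" for _ in header)
    # The table is created on the first file and reused for later ones.
    conn.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, cols))
    conn.executemany('INSERT INTO "{}" VALUES ({})'.format(table, marks), reader)
    conn.commit()


# Two in-memory "files" stand in for separate downloads (hypothetical columns).
jan = io.StringIO("race_date,horse,position\n2014-01-04,Example Horse A,1\n")
feb = io.StringIO("race_date,horse,position\n2014-02-01,Example Horse B,2\n")

conn = sqlite3.connect(":memory:")
for f in (jan, feb):
    load_csv_into_sqlite(conn, "results", f)

total = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(total)
```

Every value is stored as text in this sketch; a real loader would declare
column types or convert values before inserting.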

A slightly longer answer is, sort of. The Betfair API is currently open access
for non-commercial and low volume use (as far as I'm aware). This will allow
you to retrieve basic racing data: the cards before the race with horse name,
jockey, barrier, etc. and the race results post-race, including the Betfair
Starting Price. After interrogating the API, you'll obviously need to compile
the data into your own database. A bit of work, but feasible. Betfair has a
developer programme and there are API bindings available in a number of
different languages. I use R (R package developed by Betwise mentioned above),
but I know Python is available. One caveat to mention is that Betfair are
upgrading their API, so this will obviously have an impact on existing
programs using the old one.

If anyone else has additional information or could point me in the direction
of something else "free" I'd appreciate it as well.

~~~
joosters
As well as Betfair's 'live' API, they also provide historical betting data at
[http://data.betfair.com/](http://data.betfair.com/)

It is free but you need an active account with them to download the CSV files.

~~~
phillc73
At this page you can download all historical Betfair price data in CSV format.

[https://promo.betfair.com/betfairsp/prices/index.php](https://promo.betfair.com/betfairsp/prices/index.php)

------
chevreuil
The data format bothers me. Why not use a standard one like JSON?

~~~
wambotron
They do have a json web service:
[http://footballdb.herokuapp.com/api/v1/event/en.2013_14/roun...](http://footballdb.herokuapp.com/api/v1/event/en.2013_14/round/today)

~~~
grahamel
all the scores are null

how often is the feed updated?

~~~
Nilzor
I'm guessing "not often enough"

------
fatihpense
Also have a look at this one:
[http://www.football-data.co.uk/data.php](http://www.football-data.co.uk/data.php)

~~~
iamwithnail
Yeah, it's pretty awesome. I built [http://test.gmbl.io](http://test.gmbl.io)
off that data set to learn to code. Good place to start.

Kickdex is also pretty awesome, they use the Opta data to produce real time
indices for teams and players.

------
abeisgreat
I'm curious if this data is actually public domain. Where are they sourcing it
from? Are they legally allowed to redistribute? Etc.

~~~
bronson
Why wouldn't they? It's just raw facts, presented in their own minimal style.

~~~
buro9
In the UK this rule applies:
[http://www.ipo.gov.uk/types/copy/c-otherprotect/c-databaseri...](http://www.ipo.gov.uk/types/copy/c-otherprotect/c-databaseright.htm)

    
    
        For copyright protection to apply, the database must
        have originality in the selection or arrangement of
        the contents and for database right to apply, there
        must have been a substantial investment in obtaining,
        verifying or presenting its contents. It is possible
        that a database will satisfy both these requirements
        so that both copyright and database right apply.
    

They would have a "database right" if they had placed a person at each match
to gather the data and verify it, as that is a substantial investment.

How they originally acquired the data is important and shouldn't be presumed.

However that doesn't stop you from implementing your own database and re-
acquiring the facts in some trivial way. Just bear in mind that accessing
historical data may breach someone else's database right.

Database rights are usually proven by fake data inserted into the database to
catch people copying it.

For example you could argue that the Rare Record Price Guide (
[http://www.rarerecordpriceguide.com/](http://www.rarerecordpriceguide.com/) )
is just a collection of facts, and decide to copy it... but you'll discover
when sued, that a few of the bands in the guide are fictional and designed to
demonstrate that the database is theirs, and that it's not trivial to acquire
and verify the data.
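
The mechanics of that check are simple. A toy sketch (all names and entries
here are invented for illustration, and this is not a claim about how any
particular publisher does it): the owner keeps a private list of planted
entries and looks for them in a suspected copy.

```python
def find_planted_entries(suspect_records, planted_entries):
    """Return the planted (fictional) entries that appear in a suspect dataset."""
    return sorted(set(suspect_records) & set(planted_entries))


# Hypothetical data: only the owner knows which entries are fictional.
planted = {("The Nonexistent Band", "Imaginary Single")}
suspect = [
    ("Real Band", "Real Single"),
    ("The Nonexistent Band", "Imaginary Single"),
]

hits = find_planted_entries(suspect, planted)
print(hits)
```

An independent compilation of the real facts would never contain the planted
rows, which is why a hit is strong evidence of copying.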

~~~
bronson
Great replies joosters and buro9. Thanks.

So, for the sake of argument, if the dataset had no fake data then it would be
OK? Or would they still need to demonstrate "substantial investment", no
matter the state of the data?

If the latter, then that gets weird quick. How many lines of code is
considered substantial? How many hours hunched over a microfiche machine? It
sounds like it would ultimately depend on the skill of your lawyer.

~~~
ugexe
You need to think of fake data as a broader term than that. If we talk about
play-by-play for American college football, you will notice that ncaa.com,
espn.com, foxsports.com and others have slight differences in what a play's
down/togo/time/etc. is. It is not as simple as ESPN inserting an entire fake
team or fake game; compared with the previous example, it would be a real
record with a slightly modified price. I analyze college football data sets
and can determine where they came from, so I have no doubt that companies can
as well.

If you have enough data sources you could theoretically recreate a play by
play from all of them and have a data set that would be difficult to prove was
stolen from someplace in particular. I say theoretically because (at least
with college football) you are often not given enough information to recreate
the game (simple example would be how long a play took to execute to determine
drive possession time), so often you are left using a best guess method.
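
To make the fingerprinting idea concrete, here is a toy sketch: count how many
records in an unknown data set agree exactly with each candidate source, and
see which source wins. The play ids, field values and source names are all
invented; real play-by-play data is much messier.

```python
from collections import Counter


def likely_source(unknown, sources):
    """Count, per source, how many plays in `unknown` match that source exactly.

    `unknown` maps play_id -> record; `sources` maps source name -> the same
    kind of mapping. Returns a Counter of exact-match counts.
    """
    scores = Counter({name: 0 for name in sources})
    for play_id, record in unknown.items():
        for name, plays in sources.items():
            if plays.get(play_id) == record:
                scores[name] += 1
    return scores


# Invented records: (down, yards to go, clock) for the same three plays.
sources = {
    "site_a": {1: (1, 10, "14:52"), 2: (2, 7, "14:20"), 3: (3, 2, "13:41")},
    "site_b": {1: (1, 10, "14:52"), 2: (2, 6, "14:21"), 3: (3, 2, "13:40")},
}
unknown = {1: (1, 10, "14:52"), 2: (2, 6, "14:21"), 3: (3, 2, "13:40")}

scores = likely_source(unknown, sources)
print(scores.most_common(1)[0][0])
```

Plays on which every source agrees carry no signal; it is the small
disagreements (a yard here, a second there) that act as the fingerprint.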

------
yitchelle
This is an interesting lecture [1] at Linuxwochen Wien 2013 that focuses on
the usage of football.db. More data should be put into public domain.

[1]
[https://cfp.linuxwochen.at/en/lww2013/public/events/61](https://cfp.linuxwochen.at/en/lww2013/public/events/61)

------
llimllib
I did something pretty similar, but it seems definitely less comprehensive:
[https://github.com/llimllib/soccerdata/](https://github.com/llimllib/soccerdata/)
. Will be using this, thanks!

------
ntietz
This is really cool! Does anyone know if there are similar datasets for other
sports out there? Even less clean datasets, as long as they have permissive
licensing to allow sanitation and republication.

~~~
cwyers
The gold standard for freely-available sports data is baseball, with the
Retrosheet project:

[http://retrosheet.org/](http://retrosheet.org/)

The license on the data is a pretty permissive one, simply requiring
attribution of the data to the Retrosheet project. Software to process
Retrosheet files is available, under the GPL:

[http://chadwick.sourceforge.net/doc/index.html](http://chadwick.sourceforge.net/doc/index.html)

~~~
bnycum
I'll add this: Sean Lahman's Database is also widely used, though it's mainly
whole-season statistics rather than game by game, along with post-season,
all-star games, schools and salaries.

[http://www.seanlahman.com/baseball-archive/statistics/](http://www.seanlahman.com/baseball-archive/statistics/)

Then of course MLB has a bunch of data here; the PitchF/X data since 2008 is
mainly gathered from here.

[http://gd2.mlb.com/components/game/mlb/](http://gd2.mlb.com/components/game/mlb/)

~~~
cwyers
There are several scrapers to parse the MLB XML data; the most popular (I
think) is Baseball On A Stick, in Python:

[http://sourceforge.net/projects/baseballonastic/](http://sourceforge.net/projects/baseballonastic/)

------
redshirtrob
This looks cool. I see Gold Cup and NA Champion's League repos. Is there a
plan to add MLS data? I know some people who would be super excited to get
baseball-reference.com level data for MLS.

------
ngoel36
Is there anywhere to get real-time play-by-play data?

~~~
packetslave
There are several, and you'll pay a lot of money for them.

~~~
thom
That's true. There are some circumstances in which Opta let you do interesting
things with non-realtime data though:

[http://www.optasports.com/playground-section.aspx](http://www.optasports.com/playground-section.aspx)

------
fiatjaf
I don't know where this came from, but I like open formats. Where does the
data come from?

------
dalek2point3
It's a shame that this is not being done under the Wikidata framework. Those
guys have been thinking about databases like this for a while, and can be
reliably trusted to at least keep it up for a reasonable amount of time.

------
veganarchocap
Where's Derby County's stats?! Just kidding this looks great!

------
ins429
awesome, exactly what I need

------
rurabe
perfect timing. thanks!

