Depending on which data you need, there are already some good sources of free football data.[1][2][3]
Someone has also conveniently wrapped much of this in an R library.[4]
Football is actually one of the better sports in terms of easily obtainable data at no cost. It is much more difficult to find extensive datasets for rugby, although there are some interesting attempts.[5]
Decent cricket data also exists in a few places[6], but it generally needs to be updated faster and more regularly. However, there are R libraries for cricket data too.[7] This one scrapes from the ESPN Cricinfo site.
It is possible to obtain horse racing data for the UK and Ireland at a reasonable price, for personal use.[8] Hong Kong does a great job of making a huge volume of horse racing data available at no cost, but not in a particularly machine-usable format (extensive scraping required). Sadly, other large racing jurisdictions such as Australia and the US don't have anything free, or even reasonably priced, as far as I'm aware. Ray Paulick has covered this as a general problem for the sport for a few years now.[9]
I would argue that almost all of the information in your post is stats, not data.
The type of data that people in this thread are talking about would be more in line with detailed positional information about each of the players on a football pitch over 90 minutes. In a cricket context, it would be more along the lines of the exact release angle and speed for each of the bowlers.
This type of information is clearly available, as Michael Caley is able to quickly generate xG maps for an entire game[1], but I do not believe it's public.
Your [9] link points out that much more information is available to baseball bettors, but even baseball has a significant walled garden in terms of data. For example, the raw data used to generate the stats in [2] is not open to the public.
You make a good point and my post requires clarification.
My links were all to post-event data, not live in-play data sources. I still wouldn't call, for example in a cricket match, the number of wickets taken by a bowler a stat. It's just data. A stat is derived from the data, for example bowling strike rate or economy. Or take the fact that a trainer had a winner at a certain race track: that's just the post-event data. If you want to derive further statistics, you have to calculate them yourself.[1]
The links above just have, for the most part, raw event data.
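To make that concrete, here is a minimal Python sketch of deriving strike rate and economy from ball-by-ball records. The `Ball` record and its field names are hypothetical, not the schema of any of the linked sources, and it assumes every delivery is legal and every run is charged to the bowler (no wides, no-balls, byes or leg byes).

```python
from dataclasses import dataclass


@dataclass
class Ball:
    """One delivery. Hypothetical record layout, for illustration only."""
    bowler: str
    runs_conceded: int
    wicket: bool


def bowling_stats(balls, bowler):
    """Derive simple bowling stats for one bowler from per-ball records.

    Assumes all deliveries are legal and all runs count against the bowler.
    """
    own = [b for b in balls if b.bowler == bowler]
    n_balls = len(own)
    runs = sum(b.runs_conceded for b in own)
    wickets = sum(1 for b in own if b.wicket)
    return {
        "balls": n_balls,
        "runs": runs,
        "wickets": wickets,
        # Strike rate: balls bowled per wicket taken.
        "strike_rate": n_balls / wickets if wickets else None,
        # Economy: runs conceded per six-ball over.
        "economy": runs / (n_balls / 6) if n_balls else None,
    }


# Made-up sample: two overs from bowler "A", with wickets on the 4th and 10th balls.
runs_per_ball = [0, 4, 1, 0, 0, 1, 0, 0, 2, 0, 1, 0]
wicket_balls = {3, 9}
sample = [
    Ball("A", runs, i in wicket_balls) for i, runs in enumerate(runs_per_ball)
]

stats = bowling_stats(sample, "A")
print(stats)
```

With real data you would also have to handle extras and decide which dismissals are credited to the bowler, which is exactly the sort of thing you can only do if you have the per-ball records.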
The number of wickets taken is a stat. The raw data that informs it is the collective set of all balls bowled by a bowler.
I'm not being needlessly pedantic; it's an important distinction when considering the level of analysis that one is able to perform. If you are doing serious cricket analytics, you need ball-by-ball information, including as much detail as possible about the bowler's position, movement and arm motion, the batter's position, movement and stroke, how the field is set up, the condition of the pitch, the situation in the match, etc.
For example, consider a situation where we're attempting to compare two bowlers. Bowler A may have got a wicket off a shot that 95% of batters would not play, whereas Bowler B did not get a wicket despite bowling a ball that achieves a wicket 10% of the time. The stats suggest that bowler A is in better form, but a data-driven view of the game suggests that bowler B is actually in better form.
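A rough sketch of that comparison in Python, in the spirit of xG. The per-delivery wicket probabilities are assumed to come from some model fitted on detailed ball data; the numbers below are just my reading of the example above (roughly 0.05 for Bowler A's ball, 0.10 for Bowler B's), not real figures.

```python
def summarize(deliveries):
    """Return (actual wickets, expected wickets) for a list of deliveries.

    Each delivery is a (wicket_probability, took_wicket) pair. The
    probabilities are hypothetical model outputs, not real data.
    """
    actual = sum(1 for _, took_wicket in deliveries if took_wicket)
    expected = sum(prob for prob, _ in deliveries)
    return actual, expected


# One delivery each, matching the example in the text.
bowlers = {
    "A": [(0.05, True)],   # took a wicket with a low-probability ball
    "B": [(0.10, False)],  # no wicket despite a more dangerous ball
}

for name, deliveries in bowlers.items():
    actual, expected = summarize(deliveries)
    print(f"Bowler {name}: actual={actual}, expected={expected:.2f}")
```

The wicket count ranks A above B, while the expected-wicket total ranks B above A. You can only compute the second number if you have the underlying ball data.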
As it stands, stats are available in abundance for every major sport, but detailed data is not. If a bettor had access to the latter, and they were able to parse it with an in-depth understanding of the sport, they'd be at a huge advantage over bettors who did not, and they would reap the benefits.
The equivalent from Opta is thousands a year per competition. I was fortunate enough to get to play with detailed Opta data and ChyronHego data as part of a Man City hack day a couple of years ago. The latter data simply isn't commercially available.
For cricket, you can do something interesting with ball-by-ball data, but ideally you want ball-tracking data. You want to know speed of release, length, speed and movement after the ball has pitched, and speed after interaction with the batsman, along with angles, etc., and that's just to get started. Ideally you want positional data on fielders, etc. too.
Don't get me wrong, this is a great starting set to get people interested, but there's a way to go before high-quality data is accessible to the hobbyist or academic researcher (although I believe Opta gives academics discounts to help make them "the" standard for clubs, etc.).
[1]http://www.football-data.co.uk/data.php
[2]https://github.com/openfootball
[3]https://github.com/jokecamp/FootballData
[4]https://github.com/dashee87/footballR
[5]http://api.drop22.net/
[6]https://cricsheet.org/
[7]https://github.com/tvganesh/cricketr
[8]https://www.betwise.co.uk/
[9]https://www.paulickreport.com/news/the-biz/gardner-horse-rac...