
How I'm Predicting Baseball Outcomes - zsch
http://blog.zachschnell.com/post/50397273741/the-braves-tomorrow-and-in-october
======
icelancer
As someone who has written 10k+ LOC of simulators and Markov chain code, and done
tons of analysis on baseball projections... I don't know where to start.

I honestly don't. (You are very, very far off the correct methodology.)

But maybe this is a good place, I guess.

<http://www.oakton.edu/user/4/pboisver/AABaseballMath.html>

I am not trying to be a jerk. But ask yourself: If it were that trivial, why
wouldn't it be a snap to crush online sportsbetting?

~~~
thret
<http://espn.go.com/blog/playbook/dollars/post/_/id/2935/meet-the-worlds-top-nba-gambler>

Just a random article. I've met him through poker. He had a radio show for a
while, and he stated that if you have a 2% edge over a bookie, you can easily
make a million dollars a year. Anyone who thinks they have an edge and isn't
also very wealthy is simply mistaken.

~~~
icelancer
2% is enormous and can't last for very long, of course.

------
feniv
Rather than waiting for data from future games, you should backtest on old
data and see how it performs (<http://en.wikipedia.org/wiki/Backtesting>).
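
A minimal sketch of what such a backtest could look like, assuming each past
season is available as a chronological list of 1 (win) and 0 (loss). The
function names, the Brier-score metric, and the `last_n_rate` stand-in for a
streak-style model are illustrative assumptions, not the author's code, and the
season in the demo is synthetic.

```python
import random


def backtest(results, predict, warmup=10):
    """Walk forward through results (1 = win, 0 = loss).

    predict(history) only ever sees games played before the one being scored.
    Returns the mean squared error of the predicted win probability
    (the Brier score; lower is better).
    """
    if len(results) <= warmup:
        raise ValueError("need more games than the warmup period")
    errors = []
    for i in range(warmup, len(results)):
        p = predict(results[:i])
        errors.append((p - results[i]) ** 2)
    return sum(errors) / len(errors)


def base_rate(history):
    """Baseline ("weighted chance"): win fraction over all games so far."""
    return sum(history) / len(history)


def last_n_rate(history, n=5):
    """Hypothetical streak-style model: win fraction over the last n games."""
    recent = history[-n:]
    return sum(recent) / len(recent)


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic season: independent games at a fixed 58% win probability.
    season = [1 if rng.random() < 0.58 else 0 for _ in range(162)]
    print("baseline Brier:", backtest(season, base_rate))
    print("last-5 Brier:  ", backtest(season, last_n_rate))
```

With real data you would run this over many teams and seasons; a model only
earns its keep if it beats the baseline consistently.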

You might also be interested in checking out the book The Signal and The Noise
by Nate Silver (He runs the FiveThirtyEight political/data blog that notably
predicted last year's election results with great accuracy.)

~~~
pondababa
More relevant--before he did statistical analysis of elections, he was a
leading figure in sabermetrics, the statistical analysis of baseball. His
program is still one of the best.

~~~
icelancer
And before that, he was an excellent online gambler. Or was it during that
time? ;)

PECOTA is alright, but not really his anymore. cwyers took it over at BPro.
It's also tough to say if it's one of the best for any number of statistically
boring reasons (read Phil Birnbaum's blog for more info).

~~~
jaredmck
Yeah, PECOTA isn't one of the best at all. It just reports percentile ranges
for variance, which is imprecise because no one knows what a percentile
actually means under the model. What is luck and what is skill?

Birnbaum is one of the best bloggers on advanced stuff. Tango's blog is the
best, period: anyone good who participates in the "open source" movement, so to
speak, is there or shows up when they write good stuff. Until you're aware of
the work to date, you will be spinning in circles with awful biased errors.

------
pesenti
My bet is that your model is no better than (weighted) chance to predict the
next game. Just run the model on the years before 2012 and you'll see what I
mean.

~~~
Dwolb
On my phone right now so I can't put this analysis on paper, but this feels a
lot like looking at the associative property of addition.

i.e. the author regrouped the data to come up with the same result

------
anovikov
As a guy who spent years doing contract work on sports prediction, I can say
this is so naive that it isn't even worth discussing.

To start with, you can't 'predict' the result of a single game. You can have
some advantage (or more likely, disadvantage) relative to the quality of the
betting line a bookie gives you.
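
For illustration, the "advantage over the line" idea can be written down as
follows. This sketch assumes American moneyline prices (e.g. -110, +150), which
the thread does not specify, and the function names are hypothetical.

```python
def implied_probability(moneyline):
    """Break-even win probability for an American moneyline price.

    Negative prices (favorites) mean "stake |price| to win 100";
    positive prices (underdogs) mean "stake 100 to win price".
    """
    if moneyline < 0:
        return -moneyline / (-moneyline + 100.0)
    return 100.0 / (moneyline + 100.0)


def expected_profit_per_unit(model_p, moneyline):
    """Expected profit per 1 unit staked, if the true win probability is model_p."""
    payout = 100.0 / -moneyline if moneyline < 0 else moneyline / 100.0
    return model_p * payout - (1.0 - model_p)


if __name__ == "__main__":
    line = -110
    print("break-even probability:", implied_probability(line))
    print("EV at a 55% true win rate:", expected_profit_per_unit(0.55, line))
```

The bet is only worth making when your estimated win probability is above the
break-even figure, and the estimate itself has to be right for the edge to be
real.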

------
iguana
Isn't this an example of the gambler's fallacy, that previous outcomes impact
future outcomes?

~~~
gojomo
If game results were independent, yes. But lots of things in baseball aren't
independent: teams play the same opponent repeatedly in short series, runs of
home or away series occur, pitchers rotate, players get injured, teams may
slack when overperforming or intensify their efforts to avoid extended losing
streaks, etc.

Most of all: baseballers are fairly superstitious. They believe in streaks,
lucky rituals, jinxes, and the gambler's fallacy (being 'due' for a win or a
loss). So some serial correlation could be a self-fulfilling prophecy.

Still, I'd expect tons of other available team stats to outperform the last N
game results in predictive power.
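
One rough way to check for serial correlation in a real schedule is to compare
the win rate after a win with the win rate after a loss. This is only a sketch
with a hypothetical function name; a serious test would also have to account
for sampling noise and opponent strength.

```python
def conditional_win_rates(results):
    """results: chronological sequence of 'W' / 'L' outcomes.

    Returns (P(win | previous game won), P(win | previous game lost)).
    A rate is None when there are no games of that kind to average over.
    """
    pairs = list(zip(results, results[1:]))
    after_win = [cur for prev, cur in pairs if prev == 'W']
    after_loss = [cur for prev, cur in pairs if prev == 'L']

    def rate(outcomes):
        if not outcomes:
            return None
        return sum(o == 'W' for o in outcomes) / len(outcomes)

    return rate(after_win), rate(after_loss)


if __name__ == "__main__":
    print(conditional_win_rates("WWLWLLWWWLWL"))
```

If games were independent, the two rates should be close to each other and to
the overall win percentage, up to sampling noise.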

~~~
jaredmck
Baseball players believe those things, but no one can prove they actually
exist.

------
mlntn
If you want a broader dataset, check out the data from Retrosheet
(<http://www.retrosheet.org/>). You can get box scores and play-by-play data
from many years back to dive really deep into stats. You can use Chadwick
command-line tools (<http://chadwick.sourceforge.net/>) to parse the data into
SportsML or other formats.

------
teeeler
See also: <http://cran.r-project.org/web/packages/Lahman/index.html>

Baseball stats pre-molded into a nicely workable form, available from your
handy R interpreter.

------
gz5
Hall of Fame manager Earl Weaver said: "Momentum? Momentum is the next day's
starting pitcher."

Baseball fan in me says that is correct but statistician in me would like to
see more models like yours to quantify it.

------
gwern
What, this is just streak-based? Do streaks even exist in baseball?

~~~
zsch
Yeah, for now... They're most definitely a thing, though I have a document's
worth of baseball elements I hope to incorporate.

~~~
jerf
Are they?

    jerf@jerfhom:~$ python
    Python 2.7.3 (default, Sep 26 2012, 21:51:14)
    [GCC 4.7.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import random
    >>> 94.0/(94+68)
    0.5802469135802469
    >>> winp = 94.0/(94+68)
    >>> games = []
    >>> for x in range(50):
    ...     games.append('w' if random.random() < winp else 'L')
    ...
    >>> ''.join(games)
    'wLwLwwwwwLwLwLwwwLLLwLwwLwLLLwwLwwwwwLwwwLwwLLwLLw'

In my full simulation of 162 games, the longest streak was a 7 game _losing_
streak, despite the higher win percentage. Of course you'll get different
results each run; my next run produced a 9 game winning streak, which some
quick Googling suggests is in line with what happened in 2010.

Combine this with the fact that real play is not drawn uniformly (you may play
a much worse team against which you have a much better win percentage for
several games in a row) and I don't see much need for some sort of meaningful,
statistically-predictive "streak" to explain game results.
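
A sketch extending the session above (Python 3): it repeats the same
independent-coin-flip season many times and tallies the longest streak in each
one, so you can see how often long streaks show up by chance alone. The
function names, the default of 2000 seasons, and the fixed seed are arbitrary
choices for illustration.

```python
import random
from collections import Counter


def longest_streak(results):
    """Return (length, outcome) of the longest run of identical outcomes."""
    best_len, best_outcome = 0, None
    run_len, prev = 0, None
    for r in results:
        run_len = run_len + 1 if r == prev else 1
        prev = r
        if run_len > best_len:
            best_len, best_outcome = run_len, r
    return best_len, best_outcome


def simulate_season(win_p, n_games=162, rng=random):
    """Independent games: 'W' with probability win_p, otherwise 'L'."""
    return ['W' if rng.random() < win_p else 'L' for _ in range(n_games)]


def streak_distribution(win_p, n_seasons=2000, seed=0):
    """Count how often each longest-streak length occurs across simulated seasons."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_seasons):
        length, _outcome = longest_streak(simulate_season(win_p, rng=rng))
        counts[length] += 1
    return counts


if __name__ == "__main__":
    dist = streak_distribution(94.0 / (94 + 68))
    for length in sorted(dist):
        print(length, dist[length])
```

Any real season's longest streak can then be compared against this
distribution before concluding that streaks carry extra information.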

~~~
zsch
The 2012 data I used as the basis of my program actually had the same thing
you describe – the longest streak was an 8 game losing streak despite having
more wins than losses overall.

And I understand exactly where you're coming from. This is very preliminary,
and if anything it was good coding practice for me. Though I very much intend
to incorporate more significant factors like the lineup, the opposing team,
and their history.

~~~
jaredmck
First improvement: do this for every team ever. Then combine for all teams,
first in an individual season, then try basing the win% iteratively based on
more history.

Based on these models, you should have some good examples of selection bias,
and see how the model changes based on what you are not testing for but what
is implicit in the data. (The data is merely a set of samples generated by one
iteration of the true talent functions, which are unknowable to some degree,
for each team: player, lineup decision, injury, close call by an ump, etc.)

If you're interested in going down the rabbit hole, there's tons of people who
can show the way (and they're nice! At least tangotiger is way nicer than he
should be in listening to people who have put no effort in understanding what
is good and what is beginner's blind bliss)

Hot and cold streaks are just random variance. So is whether balls are hit
within reach of fielders or safely out of reach, given a certain contact
quality. Ground balls, fly balls, infield pop-ups, and line drives all have
vastly different tendencies to fall for a hit: line drives ~.600-.700 BABIP if
I recall, FB ~ low .200ish, GB ~ .300, pop-ups 0ish? Point is, these are all
known, to some degree, given the historical data.

If anyone wants to explore this stuff further, let me know and I can point you
to the right spots for a specific interest.

------
localhost3000
google tangotiger and then read his book. baseball prediction has been done to
death by the sabermetrics community but this guy is one of the absolute best.

~~~
the_cat_kittles
tango is awesome, but saying he is the best is misleading. He is just one
point of view, albeit a fairly wise and proven one. There are many other
people with other opinions, using other techniques, and they have gotten good
results- He is a bit of a curmudgeon when it comes to modern statistical
techniques and machine learning.

edit: sorry, misread "one of the" as "the"...

------
metdos
Not baseball, but for college football: <http://winningformula.espn.com/>

------
jaredmck
<http://fangraphs.com>

~~~
jaredmck
Replying to myself since I hit enter on mobile - this is a great site with
data, analysis, and win expectancy charts in real time during games based on
score and run states.

------
skizm
I could make a better program:

- check bovada.lv for the line

- guess that as the winner

- fin.

(crowdsource ftw)

Streaks from last year's team? Seriously?

------
mattdennewitz
see also: <http://www.baseball-reference.com/about/wpa.shtml>

~~~
zsch
Their data is excellent – thanks for sharing. Interesting when they take into
account so much more than streaks. I would love to dive into the relation of
more of that data in the future.

