
Prophet: forecasting at scale - benhamner
https://research.fb.com/prophet-forecasting-at-scale/
======
confounded
Worth noting that Prophet is a set of R/Python wrappers around models with
reasonable defaults, written in and fit by _Stan_, a probabilistic programming
language and Bayesian estimation framework.

Stan is amazing in that you can fit pretty much any model you can describe in
an equation (given enough time and compute, of course)!

More on Stan here: [http://mc-stan.org/](http://mc-stan.org/)

~~~
dragandj
... and if you like Clojure, you might try Bayadera, which has its own engine
running the analysis on the GPU.

[http://github.com/uncomplicate/bayadera](http://github.com/uncomplicate/bayadera)

~~~
diab0lic
I'm pretty interested in this as I do most of my work on the JVM, and I'd love
to try this out on our stream processor at work.

Cloned and tried to build it, but I'm getting an error about
uncomplicate:commons:0.3.0-SNAPSHOT being unavailable on Clojars. Is that
something you currently have installed in your local Maven repo? I don't see
it here:
[https://clojars.org/repo/uncomplicate/commons/](https://clojars.org/repo/uncomplicate/commons/)

I can get it to build with 0.2.2 but it is missing the "releaseable?"
function.

In any case this looks awesome and I'll be keeping an eye on it / playing with
it for some new projects.

EDIT: I was able to get it building by cloning your commons library and
running "lein install". :)

~~~
dragandj
Please also note that it currently requires OpenCL 2.0 compatibility, and is
optimized for AMD GPUs. I plan to add CUDA support later this year.

~~~
diab0lic
Yeah, I'm running into issues with this using a mid-2015 MacBook Pro with an
Nvidia GPU. Looks like it only supports OpenCL 1.2.

------
rodionos
I didn't know Wikipedia page view counters were available for public use.

The wikipediatrend R package relies on
[http://stats.grok.se/](http://stats.grok.se/), which in turn relies on
[https://dumps.wikimedia.org/other/pagecounts-raw/](https://dumps.wikimedia.org/other/pagecounts-raw/),
which has been deprecated.

The new dump is located at
[https://dumps.wikimedia.org/other/pageviews/](https://dumps.wikimedia.org/other/pageviews/)

Data is available in hourly intervals.

* pageviews-20170227-050000
    
    
      en Peyton_Manning 58 0
    

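For anyone scripting against these dumps: each line is whitespace-separated. A
minimal parser sketch, with field meanings inferred from the sample above
(project code, page title, hourly view count, bytes transferred - the last
often 0 in the newer dumps):

```python
def parse_pageview_line(line):
    # Fields inferred from the sample above: project code, page title,
    # hourly view count, and bytes transferred (often 0 in newer dumps).
    project, title, views, size = line.split()
    return {
        "project": project,
        "title": title.replace("_", " "),
        "views": int(views),
        "bytes": int(size),
    }

record = parse_pageview_line("en Peyton_Manning 58 0")
```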
[edit] There is a Wikimedia-hosted OSS viewer for these logs, e.g. Swedish
crime stats:

[https://tools.wmflabs.org/pageviews/?project=en.wikipedia.or...](https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=Crime_in_Sweden)

~~~
abbe98
The Wikimedia Foundation provides a public page view API for most Wikimedia
projects:

[https://wikitech.wikimedia.org/wiki/Analytics/PageviewAPI](https://wikitech.wikimedia.org/wiki/Analytics/PageviewAPI)

~~~
rodionos
Thanks, that's a good resource. I'm surprised, though: it seems that the top
1000 articles by monthly views are 90% about celebrities and movies. I think
tags or categories would be more useful.

[https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.w...](https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2017/01/all-days)

------
saosebastiao
This is an interesting project, and it's in one of the areas where almost all
businesses could do better. Anecdotally, there is a ton of money left on the
table by established businesses that do forecasting poorly, which also leaves
lots of room for resume-padding technical experience. So anything that claims
to improve the state of the art of automated forecasting is definitely worth
watching.

That being said, this claim in point #1 baffles me:

> Prophet makes it much more straightforward to create a reasonable, accurate
> forecast. The forecast package includes many different forecasting
> techniques (ARIMA, exponential smoothing, etc), each with their own
> strengths, weaknesses, and tuning parameters. We have found that choosing
> the wrong model or parameters can often yield poor results, and it is
> unlikely that even experienced analysts can choose the correct model and
> parameters efficiently given this array of choices.

The forecast package contains an auto.arima function which does full parameter
optimization using AIC, and which is just as hands-free as is claimed of
Prophet. I have been using it commercially and successfully for years now.
Maybe Prophet produces better models (I'll definitely take a look myself), but
to claim that it's not possible to get good results without experience seems a
bit disingenuous.
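For readers wondering what "parameter optimization using AIC" looks like
mechanically, here is a toy sketch in plain Python - not the forecast
package's actual algorithm, and two made-up candidate models rather than
ARIMA orders - just the selection idea: fit each candidate, score it with
AIC = n·ln(RSS/n) + 2k under a Gaussian-error approximation, keep the minimum.

```python
import math

def aic(rss, n, k):
    # Akaike information criterion, Gaussian-error approximation:
    # lower is better; the 2k term penalizes extra parameters.
    return n * math.log(rss / n) + 2 * k

def fit_constant(y):
    # Candidate 1: constant mean. One parameter.
    mean = sum(y) / len(y)
    rss = sum((v - mean) ** 2 for v in y)
    return rss, 1

def fit_linear_trend(y):
    # Candidate 2: ordinary least-squares line against time. Two parameters.
    n = len(y)
    xs = range(n)
    xbar, ybar = (n - 1) / 2, sum(y) / n
    slope = sum((x - xbar) * (v - ybar) for x, v in zip(xs, y)) / sum(
        (x - xbar) ** 2 for x in xs)
    intercept = ybar - slope * xbar
    rss = sum((v - (intercept + slope * x)) ** 2 for x, v in zip(xs, y))
    return rss, 2

def select_model(y):
    # Score every candidate and keep the AIC minimizer, auto.arima-style.
    candidates = {"constant": fit_constant(y), "trend": fit_linear_trend(y)}
    scores = {name: aic(rss, len(y), k)
              for name, (rss, k) in candidates.items()}
    return min(scores, key=scores.get)

trending = [1.0, 2.1, 2.9, 4.2, 5.1, 5.9, 7.0, 8.1]
best = select_model(trending)
```

auto.arima does the same thing over a (stepwise-searched) space of ARIMA
orders instead of these two toy models.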

As an aside, anybody interested in a great introductory book on time series
forecasting should check out Rob Hyndman's book which is freely available
online. [https://www.otexts.org/fpp](https://www.otexts.org/fpp)

~~~
dxbydt
> Anecdotally, there is a ton of money left on the table by established
> businesses...

True. FWIW, I worked on the same project at Twitter 4 years back - the
Facebook folks call it capacity planning at scale; we called it capacity
utilization modeling. The goal was the same - there are all these "jobs":
10s of 1000s of programs running on distributed clusters, hogging CPU, memory,
and disk. Can we look at a snapshot in time of the jobs' usage, and then
predict/forecast what the next quarter's usage will be? If you get these
forecasts right (within reasonable error bounds), the folks making purchasing
decisions (how many machines to lease for the next quarter for the
datacenters) can save a bundle.

From an engineering pov, every job would need to log its p95 and p99 CPU
usage, memory stats, disk stats... Since Twitter was running some 50k programs
back then (2013ish) on these Mesos clusters, the underlying C++ API had hooks
to obtain CPU and memory stats, even though the actual programs running were
all coded up in Scala (mostly), or Python/Ruby (bigger minority), or
C/Java/R/Perl (smaller minority). There's an interesting Quora discussion on
why Mesos was in C++ while the rest of Twitter is Scala-land... mostly because
you can't do this sort of CPU/memory/disk profiling in JVM-land as well as you
can in C++.

OK, so you now have all these CPU stats. What do you do with them? Before you
get to that, you have the usual engineering hassles - how often should you
obtain the CPU stats? Where would you store them?

So at Twitter we got these stats every minute ( serious overkill :) and stored
them in a monstrous JSON ( horrible idea given 50,000 programs * number of
minutes in a day * all the different stats you were storing :))

So every day I'd get a gigantic 20GB JSON from infra, and then I'd have to do
the modeling.

In those days, you couldn't find a single Scala JSON parser that would load up
that gigantic JSON without choking. We tried them all. Finally we settled on
GSON - Google's JSON parser, written in Java - which handled these gigantic
JSONs with no hiccups.

Before you get to the math, you would have to parse the JSON and build a data
structure that would store these (x,t) tuples in memory. You had 50k programs,
so each program would get a model, and each model originated from a shitton of
(x,t) tuples. The t being minutely, plus the fact that some of these programs
had been running for years, meant you had very large datasets.

The math was relatively straightforward... I used so-called "LAD" - least
absolute deviation from the mean - as opposed to simple OLS, because least
squares wasn't quite predictive for that use case. Building the LAD modeling
thing in Scala was somewhat interesting... Most of the work was done by the
Apache commons-math libraries; I mostly had to ensure the edge cases wouldn't
throw you off, because LAD admits multiple solutions to the same dataset -
it's not like OLS, where you give it a dataset and it finds a unique best-fit
line. Here you'd have many lines sitting in an array, depending on how long
you let the Simplex solver run. Then came the problem of visualizing these
50,000 piecewise-line models using javascript, heh heh. The front-end guys had
a ball with the models I spit out.

If someone's doing this from scratch these days, NNs would be your best bet.
Regime changes are a big part of that.

------
schlarpc
Moderately relevant short story: [https://www.facebook.com/notes/robin-sloan/julie-rubicon/985...](https://www.facebook.com/notes/robin-sloan/julie-rubicon/985697811525170)

~~~
JoshTriplett
That was the first thing I thought of when I saw the title.

------
techno_modus
It seems that they have developed a model for only univariate forecasts and
only numeric regular time series, which is the classical use case in
statistics. Yet most data sources have many dimensions (for example, energy
consumption, temperature, humidity, etc.) as well as categorical data like a
current state (On, Off). The situation is even more difficult if the data is
not a regular time series but is more like an asynchronous event stream. It
would be interesting to find a good forecasting model for some of these use
cases. In particular, it would be interesting to know if this Prophet model
can be generalized and applied to multivariate data.

~~~
unoti
> most data sources have many dimensions (for example, energy consumption,
> temperature, humidity etc.) as well as categorical data like current state
> (On, Off). The situation is even more difficult if the data is not a regular
> time series but is more like asynchronous event stream. It would be
> interesting to find a good forecasting model for some of these use cases.

I'm guessing you already know about this based on the way you described the
situation, but the Hyndman Forecasting book [1] discusses various models at
length for doing multivariate forecasting models. It's loaded with code and
samples in R.

1. [https://www.otexts.org/fpp](https://www.otexts.org/fpp)

------
cardosof
That's very cool, congrats and thank you to the Facebook guys!

A few days ago I was asked to do some forecasting with a daily revenue series
for a client. Due to her business's nature, the series was really tricky, with
weekdays and months/semesters having specific effects on the data. I, like
many, use Hyndman's forecast package, but I threw this data at Prophet and it
delivered a nice plot with the (correct) overall trend and seasonalities. Very
cool, and easy to get something going.

------
yoghurtio
We at [https://yoghurt.io/](https://yoghurt.io/) have been working towards a
similar forecasting solution. So far the feedback has been that automated
solutions can also bring good results at a far lower cost compared to hiring
an expert analyst.

It's a completely managed solution - no need to set anything up yourself. Just
upload the data and predict next week's data, today. There is a free trial,
and if anyone here is looking for an extended trial, they can reach out to me.

~~~
redindian75
your website is very sparse on details - any examples/demos?

~~~
yoghurtio
Example: say you want to predict your website's app downloads for the coming
week. Just upload the data in time-series format - date against app downloads
- from the last 30 weeks. It will return the next 7 days' predicted app
downloads along with the analytical confidence. It can predict any KPI:
visitors, app downloads, conversion, etc. Just sign up and start predicting.

~~~
ainiriand
Your website is not working for me. The upload never completes. Tried Chrome
56 and Firefox 51.

~~~
yoghurtio
Can you please try uploading XLS or XLSX format? Normally, it should show an
error message in this case. We are going to fix it soon. CSV and other format
support is coming soon.

------
anacleto
This is so great!

I've been using CausalImpact by Google [0] for months. This seems pretty
straightforward.

[0]
[https://google.github.io/CausalImpact/CausalImpact.html](https://google.github.io/CausalImpact/CausalImpact.html)

------
jl6
I wonder what Sungard/FIS think of the name, which is the same as their
commercial financial modelling/forecasting tool.

~~~
vinw
FIS Prophet is targeted at actuaries, and really no one else, so I don't know
if anyone will care. They have had the name a lot longer than Facebook,
though!

~~~
T-A
This other Prophet has also been around for a while:
[https://github.com/Emsu/prophet](https://github.com/Emsu/prophet)

~~~
vinw
Right, but some of the source code in FIS/Prophet goes back to the 1980s.

------
pacifika
The more Facebook grows, the more its tooling aligns with that of intelligence
services.

------
asafira
So... how well will this do at forecasting stock prices? =)

Very cool though --- I would be interested to dive into the methods they've
implemented sometime in the near future!

~~~
blazespin
Probably just help verify that the stock market is a random walk with a meager
trend upwards that doesn't beat inflation + trading costs.

~~~
etjossem
> Probably just help verify that the stock market is a random walk with a
> meager trend upwards that doesn't beat inflation + trading costs.

That doesn't sound right. Let me clear that up for you. Since 1950:

    
    
      S&P 500 Annual Price Change: 7.2%
      S&P 500 Annual Div Dist: 3.6%
      S&P 500 Annual Total Return: 11.0%
      Annual Inflation: 3.8%
      Annual Real Price Change: 3.3%
      Annual Real Total Return: 7.0 %
    

Buying the straight S&P 500 beats inflation by seven percent, on average,
every year. You're welcome!

~~~
T-A
Buying the S&P 500 in 1950 and holding 67 years does.

One sample tells you nothing about randomness. What if you buy in August 1929?
What if you hold for a more realistic 20 or 30 years from peak earning years
to retirement?

~~~
etjossem
Bought way back in August 1929:

    
    
      Annual Total Return: 9.1%
      Annual Real Total Return: 5.9%
    

Bought in January 1987, held for a realistic 30 years:

    
    
      Annual Total Return: 9.8%
      Annual Real Total Return: 7.0%
    

There's always going to be some deviation, but over any given multi-decade
holding period, you will generally end up with a predictable 5-9% annualized
(inflation-adjusted) return. That is more than zero. My point stands: long-
term investment in the S&P 500 _can_ be reasonably expected to gain value
faster than inflation.

If you're interested, here's a simulator that looks at historic market data.
You'll note that even the lowest possible percentile of 30-year holding
periods will still yield a 3.43% inflation-adjusted total return:
[https://dqydj.com/sp-500-historical-return-calculator-popout...](https://dqydj.com/sp-500-historical-return-calculator-popout/)

~~~
T-A
You conveniently ignored half the problem by buying in 1929 and holding for 88
years, which is reasonable if you are currently about 140 years old.

If not, look at [http://www.macrotrends.net/1319/dow-jones-100-year-historica...](http://www.macrotrends.net/1319/dow-jones-100-year-historical-chart)

Let's buy in August 1929 at 5338.69, and sell 20 years later, in August 1949,
at 1822.87 (inflation-adjusted). Congratulations, you lost two thirds of your
money.

Sell 30 years later instead? August 1959, at 5525.23. Wow, after 30 years
you're up almost 3.5%!

------
hnarayanan
Is there a way to extend these models to handle spatial variation (e.g.
weather forecasting, property price estimation etc.) as well?

~~~
rodionos
This would be non-trivial. Consider this paper on marijuana usage where the
researchers had to group statistics by adjacent counties in Oregon and
Washington in order to control the tests.

[https://papers.ssrn.com/sol3/papers2.cfm?abstract_id=2841267](https://papers.ssrn.com/sol3/papers2.cfm?abstract_id=2841267)

~~~
hnarayanan
Thank you for the pointer, will read the article.

All my attempts thus far have pointed me to something called Gaussian
processes, which I am still working on grokking.

------
dmichulke
I have been working for a few years on a similar project using evolutionary
algorithms on top of other models (linear / ANN). It works quite well (e.g.,
for equidistant energy demand/supply forecasts) but there's still lots of
stuff to do.

Its major benefit is that it figures out the relationships to the target time
series by itself, so you can just throw in all your time series and see what
comes out.

The language is Clojure: 20kloc, incanter, encog. If anyone is interested in
working for/with it, let me know. I'm currently developing a REST API for it
and plan to release it as open source once the major code smells are dealt
with.

~~~
feld
Why not release sooner and document the code smells? Maybe you'll get patches

~~~
dmichulke
I'd like to have a tested use case that mostly and simply works - something to
put in the README.md that shows how it works, and that it works. Almost
there...

------
Steeeve
This actually looks incredibly useful and pretty simple to learn.

Between this and Stan I think my free time for the next week is gone.

------
zebrafish
So.... I don't understand how this is better or worse than using forecast.

You talk about having to choose the best algorithm, but it seems like Prophet
is just another algorithm to choose from. Is there some kind of built-in grid
search, or are you just stating that results from your AM have been more
accurate than ARIMA?

------
hn_username
This is a nice piece of work - thanks for sharing with the community!

Some feedback: it'd be nice to see you actually quantify how accurate
Prophet's forecasts are on the landing page for the project. In the Wikipedia
page view example, you go as far as showing a Prophet forecast, but it'd be
nice to have you take it one step further and quantify its performance. Maybe
withhold some of the data you use to fit the model and see how it performs on
that out of sample data. It's nice that you show qualitatively that it
captures seasonality, but you make bold claims about its accuracy and the data
to back those claims up is conspicuously absent. Related, it might be worth
benchmarking its performance against existing automated forecasting tools.

I'll definitely be checking it out!

------
SmellTheGlove
For us insurance/financial services folks, I would like to simply clarify that
this is not the Sungard/FIS risk management platform that is also called
Prophet! :D

I got really excited for a second. Actually, I'm still pretty excited about
this even if it was something else entirely.

------
nickfzx
This looks amazing, congratulations.

We're planning to add forecasting to our SaaS analytics product
([https://chartmogul.com](https://chartmogul.com)) later this year, I'm going
to look and see if we can use this in our product now.

~~~
tommynicholas
I was trying to sort out whether adding this to an existing charting/analytics
product makes sense, but it looks like you've checked it out and think it
does. I couldn't tell only because it seems to be built to do the
charting/plotting itself - but I guess you can just use the data/API to get
the forecasts and then plot them yourself, yes?

I may do a test implementation into Airbnb Superset actually to see how it
flies.

------
minimaxir
Interesting definition of "scale" in this context, as it does _not_ imply "big
data" like almost every other usage of the word in data science. The tool
works on, and is optimized for, day-to-day, mundane data.

See also the R vignette, which shows that the data is returned per column,
giving you a lot of flexibility if you only want certain values:
[https://cran.r-project.org/web/packages/prophet/vignettes/qu...](https://cran.r-project.org/web/packages/prophet/vignettes/quick_start.html)

------
paulvs
For a corporate credit analyst working at a bank, what is some good
introductory material for getting into forecasting using tools like these?

I see this being applicable to analysts when deciding on a company's
creditworthiness.

~~~
zebrafish
There are some models out there which could be used, but I'm not sure that
forecasting is actually what you would use.

I would think if you're already assigning credit ratings, you can set that as
your dependent variable and use things like company revenue, number of
employees, age of company, etc. as your independent variables. You can use a
number of different models to assess credit worthiness based on this data.
Evaluate several to determine the most accurate.
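A toy sketch of that setup (all features, numbers, and ratings invented for
illustration; a real model would normalize the features and use something
stronger than nearest neighbour):

```python
def nearest_rating(history, features):
    # Assign the rating of the most similar past company (1-NN):
    # history is a list of (feature_vector, rating) pairs.
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    _, rating = min(history, key=lambda item: dist(item[0], features))
    return rating

# (revenue $M, employees, company age in years) -> agency-style rating;
# all values are made up.
history = [
    ((500.0, 2000, 40), "A"),
    ((120.0, 300, 12), "BBB"),
    ((8.0, 25, 3), "B"),
]
rating = nearest_rating(history, (100.0, 250, 10))
```

Note that raw Euclidean distance lets the largest-scaled feature (here,
employee count) dominate, which is exactly why real pipelines normalize
features first.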

------
syntaxing
The fact that Prophet follows the "sklearn model API" and that it's very well
integrated with pandas makes it super appealing and usable!

------
monkeydust
Very cool - I've got loads of sensor data from around my house, over a year's
worth, so I'm curious to throw it at Prophet.

Has anyone managed to get this working on Windows with Jupyter (Anaconda
build)? I'm struggling with PyStan errors. Any guidance welcomed.

------
eternalban
/please ignore: Oracle & Prophet. Oracle sifts through signs but Prophet has a
line to the larger picture. I suppose the next 'product' will be called
Messiah to complete the picture.

------
elwell
Why do we need Prophet when we already have Temple OS
([http://www.templeos.org/](http://www.templeos.org/))?

------
nodesocket
Are there any startups/services where you pass in a series and get back
forecast models? That's something I'd be willing to pay for.

~~~
yoghurtio
You can try [https://yoghurt.io/](https://yoghurt.io/). It's a fully managed
platform - no need to set anything up yourself. Example: say you want to
predict your website's app downloads for the coming week. Just upload the data
in time-series format - date against app downloads - from the last 30 weeks.
It will return the next 7 days' predicted app downloads along with the
analytical confidence. It can predict any KPI: visitors, app downloads,
conversion, etc. Just sign up and start predicting.

~~~
nodesocket
Is it possible for example to send you monthly revenue numbers for my startup
for the last two years (24 data points) and have yoghurt predict the next two
years of monthly revenue?

~~~
throwaway_374
If the model is autoregressive you can only forecast N steps ahead. Any
further forecasting will be based on these generated near-future forecasts. In
English, no. See
[https://www.youtube.com/watch?v=tJ-O3hk1vRw#t=01h16m](https://www.youtube.com/watch?v=tJ-O3hk1vRw#t=01h16m)
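A minimal sketch of what "forecasts built on forecasts" means for an AR(1)
model (the coefficient and starting value are made up):

```python
def ar1_forecast(last_value, phi, mean, steps):
    # Iterate x_{t+1} = mean + phi * (x_t - mean), feeding each prediction
    # back in as the next input. With |phi| < 1 the forecasts decay toward
    # the series mean, carrying less and less information per step.
    forecasts = []
    x = last_value
    for _ in range(steps):
        x = mean + phi * (x - mean)
        forecasts.append(x)
    return forecasts

path = ar1_forecast(last_value=10.0, phi=0.8, mean=0.0, steps=24)
```

Each step feeds the previous prediction back in, so the far-horizon "forecast"
collapses toward the unconditional mean - which is why forecasting two years
of months from 24 monthly points isn't meaningful.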

~~~
nodesocket
Thanks for posting this talk by Jeffrey Yau. I am 9 minutes into it and can't
stop watching. He explains things very clearly and simply.

------
alexpetralia
Slightly inconvenient that the main image's <figure> needs to be replaced by
an <img> tag just to have the image appear in printouts.

------
poppingtonic
This is very interesting. Forecasters who participate in the Good Judgment
Project, such as myself, will find this useful.

------
recurser
Very cool. Could this be re-purposed for detecting anomalies/outliers in time
series data?

~~~
techno_modus
> Could this be re-purposed for detecting anomalies/outliers in time series
> data?

If you define an anomaly as something unexpected, then yes. In this case, if
reality differs significantly from the forecast (= expectation), then it is an
anomaly (according to our definition). In the numeric univariate case, there
can be positive anomalies, where you get more than expected, and negative
anomalies, where you get less than expected.
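A minimal sketch of that definition (plain Python; the sigma threshold is an
arbitrary choice, and the "forecast" here is just a stand-in constant rather
than a Prophet output):

```python
import statistics

def find_anomalies(actual, forecast, n_sigmas=3.0):
    # Flag points whose residual (actual - forecast) is unusually large,
    # labeling them "positive" (more than expected) or "negative" (less
    # than expected), per the definitions above.
    residuals = [a - f for a, f in zip(actual, forecast)]
    sigma = statistics.stdev(residuals)
    return [
        (i, "positive" if r > 0 else "negative")
        for i, r in enumerate(residuals)
        if abs(r) > n_sigmas * sigma
    ]

forecast = [10.0] * 12
actual = [10.1, 9.8, 10.2, 9.9, 10.0, 30.0,  # spike at index 5
          10.1, 9.9, 10.2, 10.0, 9.8, 10.1]
anomalies = find_anomalies(actual, forecast, n_sigmas=2.0)
```

One caveat this sketch shares with real detectors: the anomaly itself inflates
the residual spread, so robust scale estimates (e.g. MAD) are often preferred
over the plain standard deviation.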

------
ayayecocojambo
Can we use other features (like temperature?), or does it have to be only
time-based?

------
agounaris
How different is this framework from statsmodels?

~~~
adw
Statsmodels is a grab-bag of various statistical models from linear regression
upwards. This is an opinionated library for (some relevant parts of)
econometrics.

------
hubot
Can someone explain the meaning of this line?

> df['y'] = np.log(df['y'])

~~~
slashcom
df is a dataframe, which is like a spreadsheet. This line takes the logarithm
of the column named 'y' and updates it in place.
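As for why one would do that (the usual motivation in forecasting, not
something stated in the line itself): a log transform turns multiplicative,
percent-per-period growth into additive growth, which linear trend models fit
more naturally, and np.exp undoes it afterwards. A plain-Python illustration,
with `math.log`/`math.exp` standing in for the numpy versions:

```python
import math

# A series growing 10% per step: multiplicative structure.
y = [100 * 1.1 ** t for t in range(6)]

# After a log transform the step-to-step differences are constant,
# i.e. the growth is now additive/linear and easier to model.
log_y = [math.log(v) for v in y]
diffs = [b - a for a, b in zip(log_y, log_y[1:])]

# The transform is invertible: exp(log(y)) recovers the original series,
# which is how you map forecasts back to the original scale.
roundtrip = [math.exp(v) for v in log_y]
```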

~~~
hubot
thanks. that part i can understand but why do that?

------
Helmet
Just wanted to point out to potential Windows users: this will only run on
Python 3.5, due to dependencies (PyStan only works on Python 3.5 on Windows).

------
fagnerbrack
Facebook...

