Prophet: forecasting at scale (fb.com)
520 points by benhamner on Feb 27, 2017 | 110 comments

Worth noting that Prophet is a set of R/Python wrappers around some models with reasonable defaults, written in and fit by Stan, a probabilistic programming language and Bayesian estimation framework.

Stan is amazing in that you can fit pretty much any model you can describe in an equation (given enough time and compute, of course)!

More on Stan here: http://mc-stan.org/

... and if you like Clojure, you might try Bayadera, which has its own engine running the analysis on the GPU.


I'm pretty interested in this as I do most of my work on the JVM and I'd love to try this out on our stream processor at work.

Cloned and tried to build it but I'm getting an error regarding uncomplicate:commons:0.3.0-SNAPSHOT being unavailable on clojars. Is that something you currently have installed to your local maven repo? I don't see it here: https://clojars.org/repo/uncomplicate/commons/

I can get it to build with 0.2.2 but it is missing the "releaseable?" function.

In any case this looks awesome and I'll be keeping an eye on it / playing with it for some new projects.

EDIT: I was able to get it building by cloning your commons library and running "lein install". :)

Also, there are a few presentations available at http://dragan.rocks. Just follow the blog posts.

Please also note that it currently requires OpenCL 2.0 compatibility, and is optimized for AMD GPUs. I plan to add CUDA support later this year.

Yeah I'm running into issues with this using a mid 2015 MacBook pro w/ Nvidia gpu. Looks like it only supports OpenCL 1.2.

This looks like it could be awesome but it has almost no information about what its purpose is or how to use it.

You are right. The docs have been due to be written for many months now, and that is the main reason the library has not been released yet. On the other hand, the test folder contains many tests, among them full examples from many chapters from the book Doing Bayesian Data Analysis, recommended above.

Oh I see. That's cool! I do want to check it out but will probably wait for its release.

And I wait with bated breath.

Readme has neither useful docs, nor any link to docs. =/

Do you know of any good beginner tutorials for Stan or probabilistic programming in general? All the examples that I found seemed quite complex and I was a bit overwhelmed by all the math. Which might also be a sign that I should brush up my math skills. What kind of math/stats should I revise to be able to better understand probabilistic programming?

Probabilistic Programming & Bayesian Methods for Hackers [1] by Cameron Davidson-Pilon is exactly what you want, starting from a computational-first perspective, then introducing the maths later, although it uses PyMC rather than Stan. It's freely available as a set of Jupyter notebooks, as well as a printed edition.

[1] http://camdavidsonpilon.github.io/Probabilistic-Programming-...

Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan. It is very approachable and also has lots of practice problems. It's not a math-heavy book at all.

Edit: I wouldn't recommend Probabilistic Programming and Bayesian Methods for Hackers. When I tried using it, I felt that too much was glossed over. The book that I recommend excels at conveying a strong intuition for how these various techniques work.

I'd second this. DBDA made things click for me where a handful of other books failed.

Doing Bayesian Data Analysis https://amzn.com/dp/B00PYZ2VR6/ ($65).

Free is hard to beat, especially for someone just testing the waters.

Probabilistic graphical models.

That's the foundation. The way you set up your model is by nodes and edges that specify the flow of influence (directed or undirected). Then it seems that there are general methods for inference and learning on any kind of graph one might pose.

For simple graphs (and simple is something one might want when modelling) the methods should be fairly effective.

Unfortunately, the biggest book on the subject that I know (Koller & Friedman) isn't accessible. Koller's course is also not that accessible.

Stan is nice but its GPL license is taboo in my corporate environment :( .

I am puzzled how they managed to release Prophet under BSD with such a dependency.

Stan has a BSD core. Prophet must avoid the GPLv3 interfaces.

It doesn't avoid the R interface, which is GPL'd (version 3).

Smells like more Facebook licensing drama, something new this time instead of the revocable patent license!

The Python API (pystan) is also GPLv3.

I didn't know wikipedia page view counters are available for public usage.

The wikipediatrend R package relies on http://stats.grok.se/, which in turn relies on https://dumps.wikimedia.org/other/pagecounts-raw/ which has been deprecated.

The new dump is located at https://dumps.wikimedia.org/other/pageviews/

Data is available in hourly intervals.

* pageviews-20170227-050000

  en Peyton_Manning 58 0
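For anyone wanting to read those dump lines programmatically, here is a minimal sketch. It assumes each line is four whitespace-separated fields (project code, page title, hourly view count, and a trailing size field that is 0 in the sample above); the field names are my own labels, not something taken from the dump documentation.

```python
from typing import NamedTuple


class PageviewRecord(NamedTuple):
    project: str     # assumed: project code, e.g. "en" for English Wikipedia
    title: str       # page title, with underscores instead of spaces
    views: int       # view count for the hour covered by the file
    bytes_sent: int  # assumed meaning of the last field; 0 in the sample


def parse_pageview_line(line: str) -> PageviewRecord:
    # split() with no arguments also swallows the leading indentation
    project, title, views, bytes_sent = line.split()
    return PageviewRecord(project, title, int(views), int(bytes_sent))


record = parse_pageview_line("  en Peyton_Manning 58 0")
print(record.title, record.views)
```

Summing `views` per title across the hourly files would give the daily series that a forecasting tool expects.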
[edit] There is a wikipedia-hosted OSS viewer for these logs, e.g. Swedish crime stats:


BigQuery also has the public dataset of Wikipedia page views. Handy for quick SQL and sampling.

An intro by Felipe Hoffa (Google): https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_...

The Wikimedia Foundation provides a public page view API for most Wikimedia projects:


Thanks, that's a good resource. I'm surprised though. It seems that Top-1000 articles by monthly views are 90% about celebrities and movies. I think tags or categories would be most useful.



What's up with Java? (Set "logarithmic scale" to improve the visualization)

the wiki page is about Java the country, not the programming language. Haven't found any relevant news around that time though.

Indeed, that's why I used "Java (programming language)".

Cool! I wonder what spiked the views for artificial intelligence on 10/11/2016?


I think that spike peaks on October 12, which is when this was released: https://obamawhitehouse.archives.gov/blog/2016/10/12/adminis...

This is an interesting project, and in one of the areas where almost all businesses could do better. Anecdotally, there is a ton of money left on the table by established businesses that do it poorly, which also leaves lots of room for resume-padding technical experience. So anything that claims to improve the state of the art of automated forecasting is definitely worth watching.

That being said this claim in point #1 baffles me:

> Prophet makes it much more straightforward to create a reasonable, accurate forecast. The forecast package includes many different forecasting techniques (ARIMA, exponential smoothing, etc), each with their own strengths, weaknesses, and tuning parameters. We have found that choosing the wrong model or parameters can often yield poor results, and it is unlikely that even experienced analysts can choose the correct model and parameters efficiently given this array of choices.

The forecast package contains an auto.arima function which does full parameter optimization using AIC which is just as hands free as is claimed of Prophet. I have been using it commercially and successfully for years now. Maybe prophet produces better models (I'll definitely take a look myself), but to claim that it's not possible to get good results without experience seems a bit disingenuous.

As an aside, anybody interested in a great introductory book on time series forecasting should check out Rob Hyndman's book which is freely available online. https://www.otexts.org/fpp

> Anecdotally, there is a ton of money left on the table by established businesses...

True. fwiw, I worked on the same project at Twitter 4 years back - the Facebook folks call it capacity planning at scale, we called it capacity utilization modeling. The goal was the same - there are all these "jobs" - 10s of 1000s of programs running on distributed clusters, hogging CPU, memory and disk. Can we look at a snapshot in time of the jobs usage, and then predict/forecast what the next quarter jobs usage would be ? If you get these forecasts right ( within reasonable error bounds ), the folks making purchasing decisions ( how many machines to lease for the next quarter for the datacenters) can save a bundle.

From an engineering pov, every job would need to log its p95 and p99 CPU usage, memory stats, disk stats... Since Twitter was running some 50k programs back then (2013ish) on these Mesos clusters, the underlying C++ API had hooks to obtain CPU and memory stats, even though the actual programs running were all coded up in Scala (mostly), or python/Ruby (bigger minority), or C/Java/R/perl ( smaller minority ). There's an interesting Quora discussion on why Mesos was in C++ while rest of Twitter is Scalaland... mostly because you can't do this sort of CPU/memory/disk profiling in the jvmland as well as you can in C++.

OK, so you now have all these CPU stats. What do you do with them ? Before you get to that, you have the usual engineering hassles - how often should you obtain the CPU stats ? Where would you store them ?

So at Twitter we got these stats every minute ( serious overkill :) and stored them in a monstrous JSON ( horrible idea given 50000 programs * number of minutes in day * all the different stats you were storing :))

So every day I'd get a gigantic 20gb JSON from infra, then I'd have to do the modeling.

In those days, you couldn't find a single Scala JSON parser that would load up that gigantic JSON without choking. We tried them all. Finally we settled on GSON - Google's JSON parser written in Java, that handled these gigantic jsons with no hiccups.

Before you get to the math, you would have to parse the JSON and build a data structure that would store these (x,t) tuples in memory. You had 50k programs, so each program would get a model, each model originated from a shitton of (x,t) tuples, the t being minutely and the fact that some of these programs had been running for years, meant you had very large datasets.

The math was relatively straightforward... I used so-called "LAD" (least absolute deviations), as opposed to simple OLS, because least squares wasn't quite predictive for that use case. Building the LAD modeling thing in Scala was somewhat interesting... Most of the work was done by the commons math Apache libraries, I mostly had to ensure the edge cases wouldn't throw you off, because LAD admits multiple solutions to the same dataset - it's not like OLS where you give it a dataset and it finds a unique best fit line. Here you'd have many lines sitting in an array, depending on how long you let the Simplex solver run. Then came the problem of visualizing these 50,000 piecewise line models using javascript heh heh. The front end guys had a ball with the models I spit out.
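To make the LAD vs OLS difference concrete, here is a small pure-Python sketch. It is not the Simplex-based approach described above: it uses brute force instead, relying on the fact that some optimal LAD line passes through two of the data points. The `usage` data is made up, and when several lines tie it simply keeps the first one it finds.

```python
from itertools import combinations


def lad_line(points):
    """Fit y = a + b*x by least absolute deviations (brute force).

    Some optimal LAD line passes through two data points, so trying
    every pair with distinct x values is enough for small inputs.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # vertical line, not a valid y = a + b*x fit
        b = (y2 - y1) / (x2 - x1)
        a = y1 - b * x1
        cost = sum(abs(y - (a + b * x)) for x, y in points)
        if best is None or cost < best[0]:
            best = (cost, a, b)
    return best[1], best[2]


def ols_line(points):
    """Ordinary least squares fit of y = a + b*x, for comparison."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Hypothetical usage samples on the line y = 1 + 2t, with one wild spike.
usage = [(t, 2.0 * t + 1.0) for t in range(10)]
usage[7] = (7, 100.0)

lad_a, lad_b = lad_line(usage)
ols_a, ols_b = ols_line(usage)
print(f"LAD slope: {lad_b:.2f}, OLS slope: {ols_b:.2f}")
```

The LAD fit stays on the underlying line while the OLS slope gets dragged upward by the single spike, which is one plausible reason least squares could be less predictive on noisy usage data.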

If someone's doing this from scratch these days, NNs would be your best bet. Regime changes are a big part of that.

That was the first thing I thought of when I saw the title.

It seems that they have developed a model for only univariate forecasts and only numeric regular time series which is a classical use case in statistics. Yet, most data sources have many dimensions (for example, energy consumption, temperature, humidity etc.) as well as categorical data like current state (On, Off). The situation is even more difficult if the data is not a regular time series but is more like asynchronous event stream. It would be interesting to find a good forecasting model for some of these use cases. In particular, it is interesting if this Prophet model can be generalized and applied to multivariate data.

> most data sources have many dimensions (for example, energy consumption, temperature, humidity etc.) as well as categorical data like current state (On, Off). The situation is even more difficult if the data is not a regular time series but is more like asynchronous event stream. It would be interesting to find a good forecasting model for some of these use cases.

I'm guessing you already know about this based on the way you described the situation, but the Hyndman Forecasting book [1] discusses various models at length for doing multivariate forecasting models. It's loaded with code and samples in R.

1. https://www.otexts.org/fpp

That's very cool, congrats and thank you to the Facebook guys!

A few days ago I was asked to do some forecasting with a daily revenue series for a client. Due to the nature of her business the series was really tricky, with weekdays and months/semesters having some specific effects on the data. Like many, I use Hyndman's forecast package, but I threw this data at prophet and it delivered a nice plot with the (correct) overall trend and seasonalities. Very cool and easy to use.

We at https://yoghurt.io/ have been working towards similar forecasting solution. So far the feedback has been that automated solutions can also bring good results at a far lesser cost compared to hiring an expert analyst.

It's a completely managed solution. No need to set up anything yourself. Just upload the data and predict next week's data, today itself. There is a free trial and if anyone here is looking for an extended trial, they can reach out to me.

your website is very sparse on details - any examples/demos?

Example: say you want to predict your website's app downloads for the coming week. Just upload the data in time series format, with the date and the app downloads from the last 30 weeks. It will return the predicted app downloads for the next 7 days along with the analytical confidence. It can predict any KPI like visitors, app downloads, conversion etc. Just sign up and start predicting.

Your website is not working for me. The upload never completes. Tried Chrome 56 and firefox 51.

Can you please try uploading in XLS or XLSX format? Normally, it should show an error message in this case. We are going to fix it soon. Support for CSV and other formats is coming soon.

This is so great!

I've been using CausalImpact by Google [0] for months. This seems pretty straightforward.

[0] https://google.github.io/CausalImpact/CausalImpact.html

I wonder what Sungard/FIS think of the name, which is the same as their commercial financial modelling/forecasting tool.

FIS Prophet is targeted at actuaries, and really no-one else so I don't know if anyone will care. They have had the name a lot longer than Facebook though!

This other Prophet has also been around for a while: https://github.com/Emsu/prophet

Right, but some of the source code in FIS/Prophet goes back to the 1980s.

The more Facebook grows, the more its tooling aligns with that of intelligence services.

So... How well will this do at forecasting stock prices? =)

Very cool though --- I would be interested to dive into the methods they've implemented sometime in the near future!

> So... How well will this do at forecasting stock prices? =)

Probably quite poorly (due to stocks appearing "random" at scale), especially for indexes, which are a sum of their parts.

On the other hand, this would probably be quite useful for things that have non-random trends (like the Global Energy Forecasting Competition: http://www.drhongtao.com/gefcom)

It would probably perform pretty poorly, as others have suggested. This is mainly because stock prices by themselves are a pretty non-stationary dataset/measurement. Most of these probabilistic models are poorly equipped to make accurate predictions for non-stationary data since its trends are seemingly similar to noise.

Faced with phenomena I view as self-affine, other students take an extremely different tack. Most economists, scientists and engineers from diverse fields begin by subdividing time into alternating periods of quiescence and activity. Examples are provided by the following contrasts: between turbulent flow and its laminar inserts, between error-prone periods in communication and error-free periods, and between periods of orderly and agitated ("quiet" and "turbulent") Stock Market activity. Such subdivisions must be natural to human thinking, since they are widely accepted with no obvious mutual consultation. Rene Descartes endorsed them by recommending that every difficulty be decomposed into parts to be handled separately. Such subdivisions were very successful in the past, but this does not guarantee their continuing success. Past investigations only tackled variability and randomness that are mild, hence, local. In every field where variability / randomness is wild, my view is that such subdivisions are powerless. They can only hide the important facts, and cannot provide understanding. My alternative is to move to the above-mentioned apparatus centered on scaling.

-Mandelbrot, in the foreword to Multifractals and 1/f Noise.

It's worth saying that Mandelbrot was apparently a large influence on E. Fama, who proposed the efficient market hypothesis in the first place.

Probably just help verify that the stock market is a random walk with a meager trend upwards that doesn't beat inflation + trading costs.

> Probably just help verify that the stock market is a random walk with a meager trend upwards that doesn't beat inflation + trading costs.

That doesn't sound right. Let me clear that up for you. Since 1950:

  S&P 500 Annual Price Change: 7.2%
  S&P 500 Annual Div Dist: 3.6%
  S&P 500 Annual Total Return: 11.0%
  Annual Inflation: 3.8%
  Annual Real Price Change: 3.3%
  Annual Real Total Return: 7.0 %
Buying the straight S&P 500 beats inflation by seven percent, on average, every year. You're welcome!
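For anyone checking the arithmetic, a quick sketch of how the nominal and real figures relate, assuming the usual compounding adjustment (divide growth factors rather than subtract percentages). The inputs are the rounded numbers from the table, so the result lands near the quoted real figures rather than exactly on them.

```python
def real_return(nominal: float, inflation: float) -> float:
    """Convert a nominal annual return into an inflation-adjusted one."""
    return (1 + nominal) / (1 + inflation) - 1


# Rounded inputs taken from the table above.
real_total = real_return(0.110, 0.038)  # total return vs inflation
real_price = real_return(0.072, 0.038)  # price change vs inflation

print(f"real total return: {real_total:.2%}")
print(f"real price change: {real_price:.2%}")
```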

Buying the S&P 500 in 1950 and holding 67 years does.

One sample tells you nothing about randomness. What if you buy in August 1929? What if you hold for a more realistic 20 or 30 years from peak earning years to retirement?

Bought way back in August 1929:

  Annual Total Return: 9.1%
  Annual Real Total Return: 5.9%
Bought in January 1987, held for a realistic 30 years:

  Annual Total Return: 9.8%
  Annual Real Total Return: 7.0%
There's always going to be some deviation, but over any given multi-decade holding period, you will generally end up with a predictable 5-9% annualized (inflation-adjusted) return. That is more than zero. My point stands: long-term investment in the S&P 500 can be reasonably expected to gain value faster than inflation.

If you're interested, here's a simulator that looks at historic market data. You'll note that even the lowest possible percentile of 30-year holding periods will still yield a 3.43% inflation-adjusted total return: https://dqydj.com/sp-500-historical-return-calculator-popout...

You conveniently ignored half the problem by buying in 1929 and holding for 88 years, which is reasonable if you are currently about 140 years old.

If not, look at http://www.macrotrends.net/1319/dow-jones-100-year-historica...

Let's buy in August 1929 at 5338.69, and sell 20 years later, in August 1949, at 1822.87 (inflation-adjusted). Congratulations, you lost two thirds of your money.

Sell 30 years later instead? August 1959, at 5525.23. Wow, after 30 years you're up almost 3.5%!
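Those two outcomes are easy to reproduce from the quoted inflation-adjusted index levels. A small sketch (the function names are mine) that also annualizes the 30-year result:

```python
def total_change(start: float, end: float) -> float:
    """Overall fractional gain or loss between two index levels."""
    return end / start - 1


def annualized(start: float, end: float, years: float) -> float:
    """Constant yearly rate that turns start into end over the period."""
    return (end / start) ** (1 / years) - 1


# Inflation-adjusted levels quoted above.
buy_1929 = 5338.69
sell_1949 = 1822.87
sell_1959 = 5525.23

loss_20y = total_change(buy_1929, sell_1949)
gain_30y = total_change(buy_1929, sell_1959)
rate_30y = annualized(buy_1929, sell_1959, 30)

print(f"20-year change: {loss_20y:.1%}")
print(f"30-year change: {gain_30y:.1%}")
print(f"30-year annualized: {rate_30y:.2%}")
```

Spread over 30 years, the roughly 3.5% total gain works out to a little over a tenth of a percent per year.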

You don't need prophet for that.


> the stock market is a random walk with a meager trend upwards that doesn't beat inflation + trading costs.

That assumes that the efficient-market hypothesis holds true, but it has yet to be thoroughly proven or disproven... (and funds like Medallion would strongly suggest otherwise for the medium term: https://www.bloomberg.com/news/articles/2016-11-21/how-renai...)

It doesn't assume the efficient-market hypothesis: empirical studies of returns support random returns without imposing a model (non-parametric tests).

That's not to say returns are actually random, but in any given time range, they appear to be.

EMH and random walk theory are intrinsically linked; you can't have one without the other...

Or are you saying that movements aren't actually random, and only appear to be?

The latter

Some people are making a pretty penny for being so random.

The same could be said about the lottery.

Is there a way to extend these models to handle spatial variation (e.g. weather forecasting, property price estimation etc.) as well?

This would be non-trivial. Consider this paper on marijuana usage where the researchers had to group statistics by adjacent counties in Oregon and Washington in order to control the tests.


Thank you for the pointer, will read the article.

All my attempts thus far have pointed me to something called Gaussian Processes that I am still working through grokking.

I have been working for a few years on a similar project using evolutionary algorithms on top of other models (linear / ann). It works quite well (e.g., for equidistant energy demand / supply forecasts) but there's still lots of stuff to do.

Its major benefit is that it figures out the relationships to the target time series by itself, so you can just throw in all time series and see what comes out.

Language is Clojure, 20kloc, incanter, encog. If anyone is interested in working for/with it, let me know. I currently develop a Rest Api for it and plan to release it as open source once the major code smells are dealt with.

Why not release sooner and document the code smells? Maybe you'll get patches

I'd like to have a tested use case that mostly and simply works. Something to put in readme.md that shows how it works and that it works. Almost there...

This actually looks incredibly useful and pretty simple to learn.

Between this and Stan I think my free time for the next week is gone.

So.... I don't understand how this is better or worse than using forecast.

You talk about having to choose the best algorithm but it seems like Prophet is just another algorithm to choose from. Is there some kind of built in grid-search or are you just stating that results from your AM have been more accurate than ARIMA?

This is a nice piece of work - thanks for sharing with the community!

Some feedback: it'd be nice to see you actually quantify how accurate Prophet's forecasts are on the landing page for the project. In the Wikipedia page view example, you go as far as showing a Prophet forecast, but it'd be nice to have you take it one step further and quantify its performance. Maybe withhold some of the data you use to fit the model and see how it performs on that out of sample data. It's nice that you show qualitatively that it captures seasonality, but you make bold claims about its accuracy and the data to back those claims up is conspicuously absent. Related, it might be worth benchmarking its performance against existing automated forecasting tools.
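The kind of holdout check suggested above could look roughly like this. It is only a sketch: the series is synthetic, and a seasonal-naive forecaster (repeat the last observed week) stands in for Prophet, so the only thing it demonstrates is the withhold-then-score procedure.

```python
def mape(actual, predicted):
    """Mean absolute percentage error (assumes no zero actuals)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def seasonal_naive(train, horizon, period=7):
    """Stand-in forecaster: repeat the last full season of the training data."""
    last_season = train[-period:]
    return [last_season[i % period] for i in range(horizon)]


# Synthetic daily series (hypothetical): upward trend plus a weekly pattern.
weekly = [0, 1, 2, 3, 2, -4, -5]
series = [100 + 0.5 * day + weekly[day % 7] for day in range(140)]

# Withhold the last 28 days, fit on the rest, and score on the held-out part.
horizon = 28
train, test = series[:-horizon], series[-horizon:]
forecast = seasonal_naive(train, horizon)
error = mape(test, forecast)
print(f"holdout MAPE: {error:.2%}")
```

Swapping the stand-in for any other forecaster and comparing the resulting errors on the same held-out window is the benchmark being asked for.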

I'll definitely be checking it out!

For us insurance/financial services folks, I would like to simply clarify that this is not the Sungard/FIS risk management platform that is also called Prophet! :D

I got really excited for a second. Actually, I'm still pretty excited about this even if it was something else entirely.

This looks amazing, congratulations.

We're planning to add forecasting to our SaaS analytics product (https://chartmogul.com) later this year, I'm going to look and see if we can use this in our product now.

I was trying to sort out whether adding this to an existing charting/analytics product makes sense but it looks like you've checked it out and think it does. I couldn't tell only because it seems to be built to do the charting/plotting itself, but I guess you can just use the data/API to get the forecasts then plot them yourself yes?

I may do a test implementation into Airbnb Superset actually to see how it flies.

Interesting definition of "scale" in this context, as it does not imply "big data" like every other usage of the word scale in data science. The tool works on, and is optimized for, day-to-day, mundane data.

See also the R vignette, which shows that the data is returned per-column which gives it a lot of flexibility if you only want certain values: https://cran.r-project.org/web/packages/prophet/vignettes/qu...

For a corporate credit analyst working at a bank, what are some good introduction material for getting into forecasting using tools like these?

I see this being applicable to analysts when deciding on a company's credit worthiness.

There are some models out there which could be used but I'm not sure that forecasting is actually what you would use.

I would think if you're already assigning credit ratings, you can set that as your dependent variable and use things like company revenue, number of employees, age of company, etc. as your independent variables. You can use a number of different models to assess credit worthiness based on this data. Evaluate several to determine the most accurate.

The fact that Prophet follows the "sklearn model API" and that it's very well integrated with pandas makes it super appealing and usable!

Very cool, got loads of sensor data around my house, over a year's worth, so curious to throw it at Prophet.

Has anyone managed to get this working on Windows with Jupyter (Anaconda build)? I'm struggling with PyStan errors. Any guidance welcomed.

/please ignore: Oracle & Prophet. Oracle sifts through signs but Prophet has a line to the larger picture. I suppose the next 'product' will be called Messiah to complete the picture.

Why do we need Prophet when we already have Temple OS (http://www.templeos.org/)?

Are there any startups/services where you pass it a series and it returns forecast models? That's something I'd be willing to pay for.

You can try https://yoghurt.io/. It's a fully managed platform with no need to set up anything yourself. Example: say you want to predict your website's app downloads for the coming week. Just upload the data in time series format, with the date and the app downloads from the last 30 weeks. It will return the predicted app downloads for the next 7 days along with the analytical confidence. It can predict any KPI like visitors, app downloads, conversion etc. Just sign up and start predicting.

I'm curious...are you worried about this release? Seems like all I'd need to compete with you (vastly simplified but for arguments sake) is hack together a simple webpage with a submit button that uses Prophet. Assuming both models yield reasonably useful results (obviously you could compete on accuracy or ease of use where you're currently ahead for business-y customers).

Is it possible for example to send you monthly revenue numbers for my startup for the last two years (24 data points) and have yoghurt predict the next two years of monthly revenue?

If the model is autoregressive you can only forecast N steps ahead. Any further forecasting will be based on these generated near-future forecasts. In English, no. See https://www.youtube.com/watch?v=tJ-O3hk1vRw#t=01h16m
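A toy illustration of that point, assuming the simplest autoregressive model, AR(1), fitted by least squares. The 24 "monthly revenue" values are made up and noise-free. Each forecast beyond the first is computed from the previous forecast, so a long horizon just drifts toward the model's long-run level.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_p = sum(prev) / n
    mean_c = sum(curr) / n
    sxx = sum((p - mean_p) ** 2 for p in prev)
    sxy = sum((p - mean_p) * (c - mean_c) for p, c in zip(prev, curr))
    phi = sxy / sxx
    return mean_c - phi * mean_p, phi


def forecast_recursive(series, steps):
    """Forecast step by step, feeding each prediction back in as input."""
    c, phi = fit_ar1(series)
    history = list(series)
    out = []
    for _ in range(steps):
        nxt = c + phi * history[-1]  # only the first step uses real data
        out.append(nxt)
        history.append(nxt)
    return out


# Hypothetical 24 months of revenue following x[t] = 5 + 0.8 * x[t-1].
revenue = [10.0]
for _ in range(23):
    revenue.append(5 + 0.8 * revenue[-1])

forecasts = forecast_recursive(revenue, 24)
print([round(f, 3) for f in forecasts[:3]], "...", round(forecasts[-1], 3))
```

On real, noisy data the errors of the early steps would also compound into the later ones, which is why two years out from 24 points is a stretch.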

Thanks for posting this talk by Jeffrey Yau. I am 9 minutes into it and can't stop watching. He explains things very easily and clearly.

It's very simple to use Yoghurt: just upload the data and it does the rest automatically. 24 data points is too few to make any accurate prediction; you need more data points. However, Yoghurt currently supports 1-week prediction only, and very soon we will be adding prediction up to 1 month and beyond.

Excuse my ignorance, but how does 1 week fit into the equation? Why does the time scale (x-axis) matter? I.E. if I pass 180 points of revenue (y-axis) does it matter if they were sampled each day or each hour in terms of forecasting?

They probably take into account day-specific trends, such as if the data shows sales are usually lower on a Monday than a Tuesday, they would take that into account in the forecast. This is as far as I understand.

So, assuming they are doing this, the time scale does matter. What I am trying to say is that these solutions (like prophet) are opinionated and that is why they can get accurate, as they are taking into account these time-scale specific trends.

But being opinionated means that they are assuming stuff about your data. For example saying that the number of sales you make in a day is a function of or correlated to the day of the week is probably a reasonable statement. However if you move away from sales and marketing, and try to forecast say the number of seismic events in a day, nature doesn't care if it's a Monday or Tuesday or holiday. So any such correlation the program is able to find out and use in forecasting would be incorrect. Like maybe there are more earthquakes on Monday than any other day in a particular dataset, but that would just be incidental and doesn't mean earthquakes are more likely to occur in future on Mondays. It's not a good example but there could be other such cases where such assumptions could be wrong.

Yes it would matter. Our algorithm (SandDune) is built around measuring data on a daily basis at this stage. It takes daily input data and predicts the next week's data on a daily basis. If you give it 180 daily data points, it will predict the next 7 data points.

I see. Any plans to support monthly data? That is more useful for financial models.

We are working on it and will be live with it soon.

Slightly inconvenient that the main image <figure> needs to be replaced by an <img> tag just to have the image appear in print outs.

This is very interesting. Forecasters who participate in the Good Judgment Project, such as myself, will find this useful.

Very cool. Could this be re-purposed for detecting anomalies/outliers in time series data?

>Could this be re-purposed for detecting anomalies/outliers in time series data?

If you define anomaly as something unexpected then yes. In this case, if the reality differs significantly from the forecast (=expectation) then it is an anomaly (according to our definition). In numeric univariate case, there could be positive anomalies where you get more than expected, and negative anomaly where you get less than expected.
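That definition is simple enough to sketch. The numbers below are made up; in practice the lower and upper bounds would come from a forecast's uncertainty interval (Prophet's forecast output, for example, includes `yhat_lower` and `yhat_upper` columns).

```python
def flag_anomalies(actual, lower, upper):
    """Label each observation relative to its forecast interval."""
    labels = []
    for value, lo, hi in zip(actual, lower, upper):
        if value > hi:
            labels.append("positive")  # more than expected
        elif value < lo:
            labels.append("negative")  # less than expected
        else:
            labels.append("normal")
    return labels


# Hypothetical observations and forecast interval bounds.
actual = [102, 98, 140, 101, 60]
lower = [90, 90, 92, 93, 94]
upper = [110, 110, 112, 113, 114]

labels = flag_anomalies(actual, lower, upper)
print(labels)
```

How wide the interval is decides how "significant" a deviation has to be before it counts.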

My guess would be yes. I'm thinking this could be used to find out how effective a particular marketing campaign was. Just compare the forecast with actuals and the difference would be the number of sales/clicks you got from that campaign.

Can we use other features (like temperature?), or does it have to be only time-based?

How different is this framework from statsmodels?

Statsmodels is a grab-bag of various statistical models from linear regression upwards. This is an opinionated library for (some relevant parts of) econometrics.

can someone explain what's the meaning of this line

> df['y'] = np.log(df['y'])

I have not read the code, but assuming df is a pandas dataframe, it sets the 'y' column to the log of what was previously the 'y' column.


df is a dataframe, which is like a spreadsheet. This line takes the logarithm of the column named 'y' and updates it in place.

thanks. that part i can understand but why do that?
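One common reason for a log transform (a guess on my part, since the thread doesn't say why the Prophet example does it) is that it turns multiplicative growth into additive steps, which suits a model that adds up trend and seasonality components. A small sketch with made-up numbers:

```python
import math

# Hypothetical weekly page views growing 10% per week.
views = [1000 * 1.10 ** week for week in range(6)]

# On the raw scale the week-over-week increase keeps getting bigger...
raw_steps = [b - a for a, b in zip(views, views[1:])]

# ...but on the log scale every step is the same size, log(1.10).
log_views = [math.log(v) for v in views]
log_steps = [b - a for a, b in zip(log_views, log_views[1:])]

print([round(s, 1) for s in raw_steps])
print([round(s, 4) for s in log_steps])
```

Forecasts made on the log scale need `exp` applied to get back to the original units.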

just wanted to point out to potential windows users - this will only run on python 3.5 due to dependencies (pystan only works on python 3.5 for windows)

