How FlightCaster (YC S09) was built (RoR + JVM + Clojure + Hadoop)

minalecs · on Aug 19, 2009

This is an awesome post, would love to see more overview and analysis like this from more of the YC companies. Good stuff.

dschobel · on Aug 19, 2009

I sent it to all my functional programming fancying friends who are working at big anonymous corporations in ten year old codebases of OO code and their response was equal parts despair and envy.

bradfordcross · on Aug 20, 2009

I prefer a functional to an imperative style, but IMO, let's not bash OO too badly. :-)

I think the kind of message-centric OO that Alan Kay was talking about was very different from what we see in practice today.

Likewise, if you read "The Art of the Metaobject Protocol" there are some deep insights about the configurability and power you get when you have pre and post hooks into everything in the system. MOP systems are very powerful.

It is also nice to have the implicit "this," "self" or "message recipient" that you don't have to pass around all the time into the same family of functions (methods.)

The kind of typical "enterprise java" code that you see in the wild is what I call "class-oriented programming" - using classes as containers for procedural code. This is at the extreme end of crappiness for the imperative world.

If you go back to the initial spirit of OO, and combine that together with techniques like constructor injection with good citizenship, lots of immutable value objects, and polymorphic strategies, you get a style of OO that is more friendly with FP. It still isn't that delightful in verbose languages like java, but is a lot better than the norm.
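To make that style concrete, here is a minimal sketch in Python. It is only an illustration of the three techniques named above, not code from FlightCaster; every name in it (`Flight`, `DelayStrategy`, `NoDelay`, `FixedDelay`, `Scheduler`) is hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Protocol


@dataclass(frozen=True)
class Flight:
    """Immutable value object: compared by value, cannot be reassigned."""
    number: str
    scheduled_minutes: int  # minutes after midnight


class DelayStrategy(Protocol):
    """Polymorphic strategy: any object with this method qualifies."""

    def estimate(self, flight: Flight) -> int: ...


class NoDelay:
    def estimate(self, flight: Flight) -> int:
        return 0


class FixedDelay:
    def __init__(self, minutes: int) -> None:
        self._minutes = minutes

    def estimate(self, flight: Flight) -> int:
        return self._minutes


class Scheduler:
    """Collaborators arrive through the constructor, not via global lookups."""

    def __init__(self, strategy: DelayStrategy) -> None:
        self._strategy = strategy

    def expected_departure(self, flight: Flight) -> Flight:
        # Return a new value instead of mutating the argument.
        delay = self._strategy.estimate(flight)
        return replace(flight, scheduled_minutes=flight.scheduled_minutes + delay)
```

Because `Scheduler` never mutates its input and only talks to the strategy it was handed, `expected_departure` behaves like a pure function, which is what makes this flavor of OO sit comfortably next to FP.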

The problem is that not a lot of people deeply grok OO in this way and you don't run into projects using this style often.

But let's not bash OO, let's learn from the cool stuff and throw out the garbage.

greentree · on Aug 20, 2009

"If you go back to the initial spirit of OO, and combine that together with techniques like constructor injection with good citizenship, lots of immutable value objects, and polymorphic strategies, you get a style of OO that is more friendly with FP. It still isn't that delightful in verbose languages like java, but is a lot better than the norm."

Any references for this style of development? Have you written about your development/architecture practices anywhere?

onewland · on Aug 19, 2009

I resemble this remark.

dschobel · on Aug 19, 2009

If only there were places where you could work with cool tech and have stability as well.

sigh

icey · on Aug 19, 2009

You can use anything you want for personal tools, right?

Like, if you know you're going to have to sanitize a ton of data, couldn't you use anything you wanted?

onewland · on Aug 31, 2009

Actually, I just did this in Ruby (I'm the grandparent post). Still, I'd love to use Scheme/Lisp/Haskell all day.

alain94040 · on Aug 20, 2009

If it's stable, by definition it won't remain cool for long.

That's why what you ask for doesn't exist.

gcheong · on Aug 20, 2009

I'm also interested in hearing about how the idea to create something like this came to them.

bradfordcross · on Aug 20, 2009

The idea for Flightcaster is pain-driven. :-)

physcab · on Aug 19, 2009

I'm curious to know what type of machine learning algorithms are being used for the prediction. Anyone have any thoughts?

dschobel · on Aug 19, 2009

When trying to learn a really complicated concept in the absence of a clear domain model, the best solution when I was in school a few years back was neural networks.

That would be my first instinct if I had to solve this problem.

bradfordcross · on Aug 20, 2009

Flightcaster doesn't intend to be a complete black box. To an extent, the more that sophisticated users know, the better - because they can help us by reporting specific details of use cases where we are not doing well so that we can find ways to incorporate those cases into the research and the model.

@dschobel - you are right that it is difficult to learn complicated domain models in domains like this with a lot of subtle logic. To deal with this we use a blend of analytical and inductive learning. There is a nice discussion of combining analytical and inductive learning in "Machine Learning" - Mitchell, ch. 11 & 12.

As to the specifics of whether we use SVMs, Bayesian networks, decision trees, and so on - we don't feel that it would be beneficial to get into detail.

This is not just to keep everything top-secret, but also because all of this is an area of active research and we may completely change the approach at any moment. Most of what gives us an edge is 1) the ability to deal with the data (preprocessing, joining, etc.) 2) the infrastructure to do fast, distributed learning, and 3) the domain expertise to add richness of logic to the individual classifiers we use. The exact techniques by which we weave everything together are open to constant tweaking.

Another important point outside of the specific learning algorithms is that we use rigorous approaches like 100-fold cross-validation and such. But I like to think of it like Popper - so I call it invalidation rather than validation. :-)
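For readers who have not met k-fold cross-validation, a rough sketch in plain Python follows. It is a generic illustration, not FlightCaster's pipeline: the helper names `k_fold_splits` and `cross_validate` are made up here, the folds are contiguous and unshuffled for simplicity, and `fit` and `predict` stand in for whatever learner is being tested. With `k=100` this would be the 100-fold case.

```python
from typing import Any, Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (features, label)


def k_fold_splits(n: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    base, extra = divmod(n, k)
    splits = []
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional example each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        splits.append((train, test))
        start += size
    return splits


def cross_validate(
    data: Sequence[Example],
    k: int,
    fit: Callable[[List[Example]], Any],
    predict: Callable[[Any, Sequence[float]], int],
) -> float:
    """Mean held-out accuracy across k folds."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(data), k):
        model = fit([data[i] for i in train_idx])
        hits = sum(predict(model, data[i][0]) == data[i][1] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)
```

Each example is held out exactly once, so the averaged score reflects only predictions on data the model did not see while training. Real use would normally shuffle first, or respect time order when the data is a time series.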

IMO, the process of bringing scientific rigor to a problem is also more important than the hypothesis under test. A lot of the best stuff comes from a rigorous application of scientific method that yields unexpected results that lead you to an alternative hypothesis.

fizx · on Aug 19, 2009

I'd be somewhat surprised if it isn't Bayesian networks.

Oblig. wiki link: http://en.wikipedia.org/wiki/Bayesian_network

zemariamm · on Aug 19, 2009

me too, but I also would love to know which data sources they are using

gojomo · on Aug 19, 2009

They may consider their methods too proprietary to supply details, so I'll speculate:

Information on earlier-in-day flights alone would probably be enough (mixed with historical data) to be fairly accurate about later-in-day flight delays -- even those that aren't continuations of the same plane -- because of giant overlaps in delay-causing conditions (weather, airport/mechanical mishaps, crew issues, etc.).

Weather forecasts might give another advantage, especially in predicting 'seed' delays that then hint at later delays.

If there are any other semi-public feeds related to FAA reporting or air-traffic control -- even if mostly meant for other pilots or General Aviation -- those would be incredibly valuable.

If those regional and national maps of planes-in-flight also contain sufficient positioning detail to notice when they're spending a little extra time on runways, or waiting for/at gates, etc. -- another positive early influencer for predictions.

physcab · on Aug 19, 2009

I don't think it would be harmful for them to give a clue about what algorithms they use. There is so much tuning required to get these algorithms to perform the way you'd expect that I think they could still keep their IP locked up even if they gave a general hint.

With that said, I'll speculate as well: Perhaps you'd need some type of ideal dataset, one that included departure times/arrivals and distances and weather conditions of flights that came in on time as expected. Then you might start introducing some noisy data, i.e. flights that made the same trip but came in late or early with the same weather conditions. Then you'd add in the effect of weather conditions and see how flights fared. I'd speculate that you could get away with doing some type of regression analysis, but maybe you'd need to resort to a more complex algorithm for classifying ("On-time", "Early", "Late") based on a series of features ("Distance", "Weather", "Mechanical", "Time", etc). SVM could pull this off, or perhaps even a naive Bayes classifier. For research purposes I'd probably check out RVM because it might need less information to classify. Not sure if it would be realistic to use it though...this problem is in need of a highly scalable solution.
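As a rough illustration of the naive Bayes option, here is a small categorical classifier in plain Python with add-one smoothing. It is pure speculation in the same spirit as the comment above: the class name `NaiveBayes` and the feature names and values used with it are invented, and nothing here is known to match what FlightCaster runs.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Features = Dict[str, str]  # e.g. {"weather": "storm", "time": "evening"}


class NaiveBayes:
    """Categorical naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, rows: List[Tuple[Features, str]]) -> "NaiveBayes":
        self.label_counts = Counter(label for _, label in rows)
        # counts[label][feature][value] -> number of training rows
        self.counts = defaultdict(lambda: defaultdict(Counter))
        self.values = defaultdict(set)  # distinct values seen per feature
        for features, label in rows:
            for name, value in features.items():
                self.counts[label][name][value] += 1
                self.values[name].add(value)
        return self

    def predict(self, features: Features) -> str:
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior
            for name, value in features.items():
                seen = self.counts[label][name][value]
                vocab = len(self.values[name]) or 1
                # Smoothed log likelihood of this value given the label.
                score += math.log((seen + 1) / (n + vocab))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Training is a single counting pass over the rows, which is part of why this family of models is attractive when the problem needs a scalable solution.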

bradfordcross · on Aug 20, 2009

You are right that Flightcaster doesn't want to tell everyone the recipe for our special sauce, but we will say that the kinds of features that you mention are the kinds of features and sources that we are looking at.

It is all based on captured real-time data, so we are limited by what we can get access to in real time. You are correct that some is public and some is semi-public. It is not the most efficient space so there is a lot of data that we will need to screen scrape and such.

A lot of the problem is just obtaining and pre-processing all the data from heterogeneous sources, and performing distributed joins to get it into the proper view for analysis.
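The join step can be pictured with a single-process sketch like the one below. It is an assumption-laden toy, not the actual pipeline: `hash_join` and the field names in the usage example are hypothetical, and a distributed version would partition records by key across machines instead of building one in-memory index.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def hash_join(left: Iterable[Record], right: Iterable[Record], key: str) -> Iterator[Record]:
    """Inner-join two record streams on a shared key field."""
    # Index the right-hand source by key so each left row is matched in O(1).
    index: Dict[Any, List[Record]] = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    for row in left:
        for match in index.get(row[key], []):
            # Left-hand fields win if both sources share a field name.
            yield {**match, **row}
```

For example, joining `[{"airport": "SFO", "flight": "A1"}]` against `[{"airport": "SFO", "weather": "fog"}]` on `"airport"` yields one record carrying both the flight and the weather fields, which is the kind of combined view a classifier would then consume.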

mbarr · on Aug 19, 2009

http://www.flightcaster.com/faq#data_sources

What are FlightCaster's data sources?

FlightCaster uses data from:

FAA Bureau of Transportation Statistics

FAA Air Traffic Control System Command Center

FlightStats

National Weather Service

euroclydon · on Aug 19, 2009

I wonder if they'll be able to transition all that logic toward analyzing a different set of data if predicting flight delays doesn't prove that lucrative? I guess if they take this far enough though, the government or an airline consortium might just buy it.

bradfordcross · on Aug 20, 2009

Yes. Some people seem to want us to do traffic too. Seems like we've gotten ourselves into trouble by attacking the messy problems. :-)

brown9-2 · on Aug 20, 2009

Very interesting read.

Let's say I know next to nothing about Clojure, Lisp, functional programming, etc.

Could anyone suggest some resources for a beginner on this topic who is interested in learning more about functional programming? Some tutorials perhaps?

tim_sw · on Aug 21, 2009

Programming Clojure is good for Java devs coming to Clojure and functional programming. For Haskell, check this one out: http://learnyouahaskell.com/. The Little Schemer (and the subsequent books) are good too; I personally started from The Little Schemer a few years back.

zandorg · on Aug 19, 2009

Sounds like a great service - I'm not in the USA and don't travel much, but it sounds like some great data mining!

californiaguy · on Aug 19, 2009

Wait, so... this tells me that my flight is "probably" delayed?

What if I take that as gospel and show up to the airport late or otherwise make plans based on that data and then it turns out it wasn't actually delayed?

Isn't the equilibrium action for me to show up at the airport on time no matter what?

physio · on Aug 19, 2009

Agreed; this is fairly useless on its own. However, the TC blurb says:

"In the future, the company plans to offer a list of alternative flights so you can quickly rebook once you learn of a delay."

I think the real application of this is going to be purchasing fully-refundable tickets, then switching an 85%-likely-delayed flight for an 85%-likely-on-time flight, for a 72% chance of making the right call.
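The 72% figure is just the two predictions multiplied together, which assumes they are independent of each other (an assumption, and probably an optimistic one, since the same weather can affect both flights). A quick sketch, with `joint_probability` as a made-up helper name:

```python
from math import prod


def joint_probability(*probabilities: float) -> float:
    """Probability that all events happen, assuming they are independent."""
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    return prod(probabilities)


# Original flight really is delayed AND the replacement really is on time.
both_right = joint_probability(0.85, 0.85)

# Both predictions wrong: you gave up an on-time flight for a delayed one.
both_wrong = joint_probability(0.15, 0.15)
```

Here `both_right` comes out at roughly 0.72 and `both_wrong` at roughly 0.02; the remaining quarter or so covers the cases where exactly one of the two predictions was off.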

If enough people do this, airlines are going to end up cancelling entire flights when everyone switches their tickets. This will mean they either A) offer cheaper flights on the cancelled-then-rebooked "new" flight for anyone who might need it, B) offer discounts if you DON'T cancel, or C) raise the prices of refundable tickets so high that you will go to another carrier at the outset.

Better information for customers inevitably leads to more competition and lower prices.

Of course airlines are barely surviving as it is, so this kind of thing would (eventually) kill off some number of stragglers.

On an editorial note, I say good riddance. As far as I'm concerned, the entire airline industry can go out of business for not standing up for their customers. I don't want to be physically molested or scanned naked, deal with "freedom baggies", a lack of water, taking off my shoes (note in other public venues you MUST wear shoes due to health codes), power-tripping morons, people rifling through my luggage, stuff stolen out of my luggage, late luggage, damaged luggage, "lost" luggage, showing my ID, showing my ID to 3 different people, mission creep leading to arrests for NON-safety-related issues, "behavior detection" specialists looking to harass nervous and/or agitated individuals, "no fly lists", quasi-police-powers bestowed upon flight attendants so "interfering with a flight crew", e.g. arguing with a stewardess, is now a federal crime; a hundred other things, all capped off by secret laws which heretofore had always been held to be unconstitutional, but now we aren't allowed to know what the laws are, pertaining to aviation security.

Finally, if you want to know if a given flight will be late, the answer "yes" also works about 85% of the time.

greyboy · on Aug 20, 2009

"If enough people do this, airlines are going to end up cancelling entire flights when everyone switches their tickets."

I have a family member who spent many years working for a large carrier, and this isn't how the airline industry works (in America, and I assume many other nations). Flights are almost never cancelled except due to mechanical failure, crew fatigue, or inclement weather. The reason is that the same plane taking you from NYC to SFO is also the plane that 130 people are waiting for to go from SFO to SEA. So, whether it is 100% full or only has 2 people, it still makes the trip. Have you ever been on a plane with only 2 passengers aboard? I have - and you can pick any seat you want! They will even send an empty plane out (with crew, of course).

"C) raise the prices of refundable tickets so high that you will go to another carrier at the outset."

Refundable tickets are already that high - almost nobody buys them. They are already 3-5x as high as your average discount ticket and almost solely purchased by business travellers or foreigners who need that flexibility. The 99% of cow-herded casual travellers buy the bottom-of-the-barrel discount tickets, usually from Travelocity and the like, with the most restrictions. Unfortunately, I don't see that changing unless there is a massive uprooting of the current airline industry.

californiaguy · on Aug 20, 2009

Armed with this info, why would I ever buy an 85% late flight to begin with?

... I think I answered my own original question.

But wait, if historically delayed flights were delayed because of the high passenger loads and congestion effect of peak times and routes, if everyone were armed with the same historical information, would the new passenger distribution "stampede" to historically non-delayed routes, thus causing those to be delayed, or would the distribution normalize over time to help congestion overall?

I guess we'll see.