
Google Prediction API - pelle
http://code.google.com/apis/predict/
======
m0th87
This is easily the most interesting announcement so far. Machine learning has
so many applications, but its use is constrained by the high barriers to
entry. Recommendation engines, for example, are huge sales drivers, but few
among even the largest ecommerce stores use them. A simple prediction
interface that's built on the ML expertise at Google is a win for everyone.

~~~
sketerpot
Directed Edge, a promising YC startup, makes recommendation engines
surprisingly easy:

<http://www.directededge.com/>

It's quite a bit higher-level than what Google is offering here, with all the
benefits and drawbacks that entails.

~~~
hussong
Thanks for the friendly plug :-)

You can find the full documentation on our developer site at
<http://developer.directededge.com/>, and we offer a free developer account
for non-commercial purposes:
<http://www.directededge.com/signup-developer.html>

If anybody is giving the Google Prediction API a whirl for recommendations,
we'd love to hear about your findings!

------
zefhous
I'd love to see how well it could predict comment ratings from Hacker News.

The following data would be a good start:

1\. Text of comment

2\. How many points the comment has

3\. How many points the article has

4\. Time article was posted

5\. Time comment was posted

I'd also be interested to see what kind of user bias there is. If you don't
provide user names, you could see what kind of rating a comment _should_ have
based on its content, and what rating it actually has because certain users
are generally loved (pg) or hated (jasonmcalacanis) by the community.
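
As a rough sketch of how those five fields could be turned into training rows: the label-first CSV layout, the score buckets, the derived features, and the sample records below are all assumptions made for illustration, not something taken from the Prediction API docs. User names are left out on purpose, so the model only sees content and timing.

```python
import csv
import io
from datetime import datetime

def score_bucket(points):
    """Turn a raw comment score into a discrete label (thresholds are arbitrary)."""
    if points < 1:
        return "negative"
    if points < 5:
        return "low"
    return "high"

def to_training_row(comment):
    """Build one row: label first, then features derived from the five fields."""
    article_time = datetime.fromisoformat(comment["article_time"])
    comment_time = datetime.fromisoformat(comment["comment_time"])
    delay_minutes = int((comment_time - article_time).total_seconds() // 60)
    return [
        score_bucket(comment["points"]),  # what we want to predict
        comment["text"],
        comment["article_points"],
        article_time.hour,                # hour of day the article was posted
        delay_minutes,                    # how long after the article the comment came
    ]

# Hypothetical sample records, made up for illustration
comments = [
    {"text": "Great write-up, thanks.", "points": 7, "article_points": 120,
     "article_time": "2010-05-19T16:00:00", "comment_time": "2010-05-19T16:25:00"},
    {"text": "First!", "points": 0, "article_points": 120,
     "article_time": "2010-05-19T16:00:00", "comment_time": "2010-05-19T16:01:00"},
]

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC)
for comment in comments:
    writer.writerow(to_training_row(comment))
training_csv = buffer.getvalue()
print(training_csv)
```

Training a second model with the user name added as an extra column, and comparing the two, would be one way to get at the bias question.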

~~~
crystalis
To be fair, several posters can earn their points on name alone:

\- grellas can post a one liner on a law issue

\- DarkShikari can post about video codecs

\- tptacek can post about security

\- patio11 can post about bingo

\- edw519 can post a grocery list

------
T_S_
Not enough details available on how it works. Would rather build my own at
this point. Plus the way this is billed oversimplifies the whole model design
process.

Sorry to sound so negative, but I just earned a PhD in Machine Learning. How
would you feel if you were replaced by an API? :-(

~~~
akadien
I know. It's worse than being outsourced.

~~~
eob
Next thing we know, the Tea Party is going to start protesting APIs taking our
jobs.

EDIT: In seriousness, though, this is a _good_ thing for ML PhDs like you (and
eventually me in a couple more years). Building the progress that's already
been made in the field into easy-to-use APIs frees us up to work on the next
steps.

Now if they just released an Integer Linear Programming API that would give us
access to their compute cloud...... :)

------
metaguri
I was guessing that google (in their never-ending desire to consume more data)
would want to use us as guinea-pigs to improve their algorithms. It's not 100%
clear to me, but from the terms of service:

 _By submitting, posting, displaying, or transmitting Data on or through the
Service, you give Google permission to process your Data for the sole purpose
of enabling Google to provide you with the Service in accordance with its
privacy policy. You hereby grant Google all licenses to your Data necessary to
process the Data and provide you with the Service in accordance with its
privacy policy. As a part of the Service and through provided interfaces,
Google may allow you to remotely access, view, and download results of the
processing of your Data._ (via
<http://code.google.com/apis/predict/docs/terms.html>)

I imagine that they might claim the right to use your data anonymously to
improve their algorithms, much like they do for your personal data in their
other apps. I mean, what better way to refine their supervised learning
algorithms than via an endless supply of training sets? But I hate wading
through legalese. Does anyone have any insights?

~~~
btilly
Did you miss the phrase, _for the sole purpose of enabling Google to provide
you with the Service in accordance with its privacy policy_? That phrase tells
me that the ONLY thing they get permission to do with the data is use it for
processing your requests.

And they definitely need that. In order to process requests, Google has to
make a bunch of copies of your data, create models, etc. Furthermore Google
will need to keep copies so that it can use the model it generated for future
requests. If the data that you have uploaded is confidential, proprietary,
etc, then this requires copyright permission. (Particularly since in the
previous clause they made it clear that you retain full copyright.)

~~~
metaguri
No, I did see that, but to zoom in further, what is meant by _"in accordance
with its privacy policy"_?

For Google's other apps, their _privacy policy_ lets them anonymize and then
use data about you to improve their services. I don't see anything that stops
them from doing stuff with your data as part of "providing you with the
service", without you giving up your ownership of it.

Yes, I'm speculating. But it just struck me as a reason for google to offer
this service. And honestly, if I were a user I might be ok with them using my
data anonymously to improve said service that I am using.

------
frisco
Need more information. What kind of algorithms are they employing on the
backend?

------
carbocation
From the very little information that I see available so far, it appears that
Google will first take a stab at discrete predictions. That is, I don't see
probabilistic output yet.

Also, from <http://code.google.com/apis/predict/docs/developer-guide.html>,
it is clear that they perform accuracy analysis using the training data. That
is, there is no "testing" vs "training" dataset distinction at this point;
there is just cross-validation of the training set.

~~~
jey
> That is, there is no "testing" vs "training" dataset distinction at this
> point; there is just cross-validation of the training set.

If they just create a test set from the training set, and omit that from the
training, what's the difference? The main thing is that you don't want to
include the test set in the training step, and I assume they're doing that.

~~~
carbocation
According to what they wrote, there is no separate testing set, so they
estimate accuracy based on the training set. They use cross-validation to
reduce their overfitting bias. (I could be wrong since I haven't actually used
their service, but this is the take-home message of the wording of their
explanation. If they intended to convey otherwise, they used the wrong
language.)
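
To make the distinction in this subthread concrete, here is a minimal sketch comparing accuracy scored on the very points a model was trained on against k-fold cross-validated accuracy, where each point is held out of training exactly once. The 1-nearest-neighbor model, the toy data, and all function names are assumptions chosen for illustration; nothing here reflects what Google actually runs.

```python
def nearest_neighbor_predict(train, x):
    """Predict the label of the training point closest to x (1-NN)."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label

def accuracy(train, test):
    """Fraction of test points whose predicted label matches the true one."""
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return hits / len(test)

def cross_validated_accuracy(data, k):
    """Average accuracy over k folds; each fold is left out of training once."""
    scores = []
    for i in range(k):
        held_out = data[i::k]
        training = [p for j, p in enumerate(data) if j % k != i]
        scores.append(accuracy(training, held_out))
    return sum(scores) / k

# Toy 1-D data with noisy labels (made up for illustration)
data = [(0.0, "a"), (1.0, "a"), (2.0, "b"), (3.0, "a"), (4.0, "a"),
        (5.0, "b"), (6.0, "b"), (7.0, "a"), (8.0, "b"), (9.0, "b")]

resubstitution = accuracy(data, data)  # scored on points the model memorized
cross_validated = cross_validated_accuracy(data, k=5)
print(resubstitution, cross_validated)
```

A 1-NN model always scores perfectly on its own training points, so the first number says nothing about generalization. The cross-validated number is lower because every prediction is made for a point the model did not see, which is the property jey is asking about.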

------
caffeine
Darn ... I'm halfway through writing one of these. I guess I just have to make
it better..

~~~
jimbokun
But wonderful validation of your idea, no?

------
jbrennan
I knew this was coming.

~~~
izendejas
I had this idea myself (basically machine learning as a service) three or more
years ago. Somehow, I also knew Google would implement something like this.
So while I still consider this a viable startup idea, I knew it would be tough
to compete against a behemoth that already has tons of data and experience
training countless machine learning algorithms.

~~~
rw
The "AI API" is the dream application. Just imagine: you get to implement (and
even discover) cutting-edge prediction algorithms, make them scale, and expose
them via your protocol-of-choice. No frontends to write, no Joe Luser to
support, just beautiful math and hardcore infrastructure engineering.

There are reasons this hasn't been done before. I don't think it's a viable
startup idea. Think about the capital you would need to even get something
like this off the ground. Doing the AI API 'right' would require a
supercomputer, if you take it far enough.

A guy can dream though. Maybe doing high-frequency trading would get you most
of the way there.

IMHO, go for the niche: elsewhere in this thread Directed Edge was mentioned,
and they are a good example of a business tackling a well-defined problem with
exciting tools and techniques.

~~~
izendejas
interesting. i actually envisioned a simple web interface that anyone,
including joe luser, could use. the idea was to empower anyone to be more data
driven, from the individual business owner in africa, to the small and medium
business owners everywhere. i did recently run into directed edge. there's
also data applied (their ui is too complex though).

by implementing a web app that anyone could use (mobile or not), i also
envisioned a sort of community/market place where people could post their data
and do simple stuff and/or have experts try to tackle it for a service fee.
and whatever algorithms came out of that would be made available for future
data that has similar features. i recently came across a similar site. can't
remember the name.

anyway, i know this is not necessarily a viable startup idea. and if it is,
it's ultimately all about execution. i'm still dreaming though and was psyched
google launched their predict api.

------
mseebach
Are there any (preferably FOSS) libraries that do anything like this?

~~~
conesus
In Python there is a wonderful library called the Natural Language Toolkit
(NLTK) available free and open source at <http://www.nltk.org/>.

With NLTK you can build classifiers, decision trees, and train/predict with
bayesian classifiers similarly to Google's Prediction API examples. It's
pretty easy to get started, and it's code that you run locally, so there is no
network traffic.

I use it on <http://www.protopub.com> for classifying rss feed stories based
on user feedback, so Protopub can recommend future stories that you might
like. NLTK is far easier than rolling your own classifiers, but even that is
not too difficult. See the O'Reilly book Programming Collective Intelligence.
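
For a sense of what "rolling your own" involves, here is a minimal sketch of a multinomial naive Bayes text classifier using only the standard library. It is not NLTK's implementation and not what Protopub runs; the class name, the whitespace tokenizer, and the sample stories are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over words, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of documents
        self.vocabulary = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.label_counts[label] += 1
        self.vocabulary.update(words)

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            # Work in log space so many small probabilities do not underflow
            score = math.log(doc_count / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical feedback on feed stories, made up for illustration
classifier = NaiveBayes()
classifier.train("python machine learning library released", "like")
classifier.train("new classifier beats benchmark on text data", "like")
classifier.train("celebrity gossip and fashion roundup", "dislike")
classifier.train("sports scores and celebrity interviews", "dislike")

print(classifier.predict("machine learning for text"))
print(classifier.predict("celebrity fashion"))
```

A real system would want better tokenization and far more training data, which is where a library starts to pay off.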

~~~
sketerpot
There's also Weka, which can use almost exactly the same file format that
Google is using, and do the same kind of things (though perhaps with different
algorithms). It's pretty pleasant.

<http://www.cs.waikato.ac.nz/ml/weka/>

~~~
jim-greer
I recently used Weka to create a simple rules-based fraud model for
Kongregate. It worked very well, and had a lot of options for algorithms. The
UI is a little weird, but it's worth checking out.

------
natfriedman
This is interesting:

"Automatically selects from several available machine learning techniques"

So not only does it learn, it's learning which learning techniques work best
for different problems.

------
mikecane
As a non-techie, I don't understand the language example they're using. It
seems to me many prediction engines are originally built to try to forecast
winning lottery numbers or other such gambling events. Google expects me to
believe they did this for language?

~~~
btilly
The point is that they took a large number of documents, which are clearly
labelled as to language, gave it as a training set to the machine, and they
now have a classifier that lets you input random text and tells you the
language it was probably written in.

In principle you can do this with any data set and any set of discrete
outcomes.

In general, though, you should expect that the resulting classifier won't give
you much insight on why it came up with the answers that it did. Plus it
frequently is less accurate than a trained human. But it is much, much
cheaper.
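
A toy version of that language example, as a sketch only: it compares character trigram counts with cosine similarity, which is one common approach and not necessarily what Google uses. The three one-sentence "training sets" and all names are made up for illustration; a usable classifier would need many documents per language.

```python
import math
from collections import Counter

def trigrams(text):
    """Count overlapping three-character sequences, padded with spaces."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Tiny hand-written training set, labelled by language (illustrative only)
training = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y el gato",
    "french": "le renard brun rapide saute par dessus le chien paresseux et le chat",
}
profiles = {language: trigrams(text) for language, text in training.items()}

def guess_language(text):
    """Return the language whose trigram profile is most similar to the text."""
    sample = trigrams(text)
    return max(profiles, key=lambda language: cosine(sample, profiles[language]))

print(guess_language("the dog jumps over the fox"))
print(guess_language("el gato salta sobre el zorro"))
```

Note that the output is just a label. As with the larger classifiers discussed here, nothing in it explains why a given text was assigned to a language.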

~~~
jodrellblank
_they now have a classifier that lets you input random text and tells you the
language it was probably written in._

Incidentally, Google Translate does this and starts guessing the source
language as you start typing. I found it interesting that when you type a
single character, w is guessed as Polish, i is Norwegian, s is Czech, e is
Portuguese...

~~~
ovi256
Frequency of first-character of words in those languages.

------
abossy
Are there any input/output samples?

~~~
sketerpot
I couldn't find any, but here's an explanation of the input format, with some
example snippets:

<http://code.google.com/apis/predict/docs/developer-guide.html#data-format>

It's pretty straightforward.

------
tszming
Basically what I see is they implemented an open platform for running
classification algorithms that gives discrete categories as output. Automatic
selection from multiple machine learning methods - maybe just simple cross-
validation.
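
If the automatic selection really is just cross-validation, it could look something like the sketch below: fit each candidate technique with one point held out, score it on that point, and keep whichever technique scores best overall. The three candidate techniques, the leave-one-out scheme, the toy data, and every name here are assumptions for illustration, not a description of Google's system.

```python
from collections import Counter
from statistics import mean

def majority_class(train):
    """Always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def nearest_neighbor(train):
    """Predict the label of the single closest training point."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def nearest_centroid(train):
    """Predict the label whose training mean is closest."""
    centroids = {label: mean(x for x, y in train if y == label)
                 for label in {y for _, y in train}}
    return lambda x: min(centroids, key=lambda label: abs(centroids[label] - x))

def leave_one_out_accuracy(fit, data):
    """Train on all points but one, test on that one, repeat for every point."""
    hits = 0
    for i, (x, y) in enumerate(data):
        model = fit(data[:i] + data[i + 1:])
        hits += model(x) == y
    return hits / len(data)

def select_technique(candidates, data):
    """Return the name of the best-scoring technique and all the scores."""
    scores = {name: leave_one_out_accuracy(fit, data)
              for name, fit in candidates.items()}
    return max(scores, key=scores.get), scores

# Toy 1-D data with noisy labels (made up for illustration)
data = [(0, "a"), (1, "a"), (2, "b"), (3, "a"), (4, "a"),
        (5, "b"), (6, "b"), (7, "a"), (8, "b"), (9, "b")]
candidates = {
    "majority_class": majority_class,
    "nearest_neighbor": nearest_neighbor,
    "nearest_centroid": nearest_centroid,
}
best, scores = select_technique(candidates, data)
print(best, scores)
```

On this data the smoother nearest-centroid rule comes out ahead of the nearest-neighbor rule, which gets thrown off by the noisy labels.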

------
tybris
"Upload your data to Google Storage for Developers, then use the Prediction
API to make real-time decisions in your applications."

I can understand the necessity of this, but that'll be some serious lock-in.

~~~
pyre
Not necessarily. Unless I'm misunderstanding, you're not transforming your
historical data in Google Storage. So as long as you kept it backed up outside
of Google Storage, then you shouldn't have any issues.

------
lsb
Would you give every data point of _yours_ over to Google?

~~~
frisco
Data doesn't have to be readable. Often, preprocessed datasets are totally
incomprehensible to everyone except whoever prepared them. Multiple fields are
combined into composites, rescaled, transposed, etc.

~~~
tel
You're providing a lot of entropy to a body that has even more sitting around.
If anyone could find some part of a dataset that might be sufficient to
deanonymize your data, Google could.

Not saying you need to be so paranoid, just that non-readability of data might
be comparable to obfuscating your javascript to keep people from prying into
your code.

At least to a sufficiently knowledgeable corporation.

------
joubert
Does google use the data you upload for other purposes besides just driving
your (private) results via the prediction api?

~~~
deepu_k
From their TOS, it seems like they won't use it for any other purpose. This is
discussed in a comment thread above.

------
cb33
I predict this will be a hit.

