
PredictionIO – A machine learning server - Anon84
https://github.com/PredictionIO/PredictionIO
======
peter_l_downs

        https://github.com/PredictionIO/PredictionIO/blob/develop/process/engines/itemsim/evaluations/scala/topkitems/src/main/scala/io/prediction/evaluations/itemsim/topkitems/TopKItems.scala
    

Why is this structured like this? These directories seem ridiculous to me, it
would be great if someone could explain.

~~~
mcphilip
Is the inconveniently deep tree data structure housing an OO framework really so
important that it should be discussed front and center on HN? Isn't the
function of the framework more interesting? Honest question.

~~~
vidarh
People who might want to understand the code and make changes in order to take
advantage of it will care about form. It matters. I often pick tools based not
only on function, but on how easy it will be for me to make changes or
maintain the code base if necessary, and I doubt I'm the only one.

~~~
tsenkov
It matters, but not as much as this:
[https://news.ycombinator.com/item?id=6574237](https://news.ycombinator.com/item?id=6574237)

------
ethnt
This seems like a cool project, but I'm really not appreciating the
unsolicited email I got based on the fact that I starred the Play Framework
repo on GitHub[1].

[1] [https://dl.dropboxusercontent.com/u/2938195/predictionio-
ema...](https://dl.dropboxusercontent.com/u/2938195/predictionio-email.png)

~~~
victorf
I got an unsolicited email to check out a Github project once and I didn't
mind it. I ended up looking at the new project and it was neat. What made this
one so irritating?

~~~
eropple
So have I - "I saw your project X, thought you might be interested in this."
That's one thing--personalized, actually in the context of what I've done and
detailing _why_ I might be interested.

"You starred Play so come look at our thing" is not. It's an email blast. It's
spam.

------
Radim
From their docs:

    
    
        Note: Please be patient. It may take a long time to train the data model the first time even for very small dataset. It is normal because PredictionIO implements an distributed algorithm by default, which is not optimized for small dataset. You can change that later.
    

Sums up my experience with the Mahout/Hadoop world nicely. Not a good fit for
small-medium projects -- too complex, too cumbersome, too slow. By the time
you really need the scale (=often never, save for using "Big Data" for
marketing), you're big enough and know enough about your domain to roll a
custom, efficient, domain-optimized solution.

Bringing machine learning to the masses is an honourable goal though, so
thumbs up for PredictionIO.

------
mapleoin
Offtopic: Haha, I guess this is the extreme example of using original titles
for submissions (see the discussion at:
[https://news.ycombinator.com/item?id=6572466](https://news.ycombinator.com/item?id=6572466))

I wonder if this title was given in jest.

~~~
leeoniya
it's great pg gave an explanation. unfortunately, the justification and end
result is no less absurd.

i think letting the community upvote and downvote the titles themselves would
give clear indication to mods for changing them. instead, they choose to
justify doing nothing by saying they don't have the resources to read all
articles and evaluate each of them.

just seems like intentional friction and reluctance to fix a recurring and
aggravating issue that has so many viable solutions.

~~~
skj
This is an example of people searching for things to complain about. If the yc
folks spent the energy to address this problem, the same people would find
something else to gripe about. The only difference made by not addressing the
problem is that they didn't waste their time.

~~~
leeoniya
you can apply this logic to every problem and never address anything.

~~~
skj
Only if every problem were as inconsequential as this one is.

------
paulasmuth
Could you expand on how prediction.io would handle a real world data set
containing a few million items/users? How long would it take to generate a
single user<->user recommendation at this scale? Does prediction.io require
that I keep the whole dataset in main memory and how much memory would I need?

I'm asking because for us (dawanda.com, one of the biggest ecommerce
platforms in Germany) most of the development effort on our
soon-to-be-open-sourced recommendation engine was spent on scaling the CF up
from a few thousand test records to a 150 million record production data set.

In the first iteration we also built it completely in Scala, but as we were
putting more and more data into it, memory usage was exploding. We realized
that boxed types had too much overhead and that we had to implement the whole
sparse rating/similarity matrix in C [1]. Also we decided to go for a hybrid
memory/disk approach which allowed us to process 80GB datasets on a machine
with only 64GB main memory.
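
To illustrate the general idea (this is not the libsmatrix API, just a minimal sketch with hypothetical names such as `SparseRatings`), a sparse rating matrix can be kept in a few flat, typed arrays instead of one boxed object per rating. The Python version below assumes integer user and item ids and stores only the nonzero ratings in a CSR-style layout:

```python
from array import array
from math import sqrt


class SparseRatings:
    """CSR-style sparse user x item rating matrix backed by typed arrays.

    Hypothetical illustration only; not taken from any library in the thread.
    """

    def __init__(self, rows):
        # rows: list of dicts {item_id: rating}, one dict per user
        self.offsets = array("l", [0])  # start of each user's slice
        self.items = array("l")         # item ids, sorted within each user
        self.values = array("f")        # ratings as unboxed 32-bit floats
        for row in rows:
            for item in sorted(row):
                self.items.append(item)
                self.values.append(row[item])
            self.offsets.append(len(self.items))

    def row(self, user):
        """Return the (item ids, ratings) slices for one user."""
        start, end = self.offsets[user], self.offsets[user + 1]
        return self.items[start:end], self.values[start:end]

    def cosine(self, u, v):
        """Cosine similarity of two users via a merge over sorted item ids."""
        items_u, vals_u = self.row(u)
        items_v, vals_v = self.row(v)
        i = j = 0
        dot = 0.0
        while i < len(items_u) and j < len(items_v):
            if items_u[i] == items_v[j]:
                dot += vals_u[i] * vals_v[j]
                i += 1
                j += 1
            elif items_u[i] < items_v[j]:
                i += 1
            else:
                j += 1
        norm_u = sqrt(sum(x * x for x in vals_u))
        norm_v = sqrt(sum(x * x for x in vals_v))
        if norm_u == 0.0 or norm_v == 0.0:
            return 0.0  # a user with no ratings is similar to nobody
        return dot / (norm_u * norm_v)
```

Memory then grows with the number of stored ratings rather than with users times items, and the flat arrays could in principle be written to disk or memory-mapped, which is roughly the kind of hybrid approach described above.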

How did you manage to solve the memory consumption issue for prediction.io in
Scala? Did you use Java raw memory access or did you also swap out data to
disk/SSD?

[1]
[http://github.com/paulasmuth/libsmatrix](http://github.com/paulasmuth/libsmatrix)

~~~
dszeto
PredictionIO is a serving and evaluation framework on top of a bunch of
algorithms. Currently a majority of them come from the Apache Mahout library
[1].

Computation time and resource requirements depend on the choice of technology.
If a non-distributed implementation is chosen using the framework, the rule of
thumb from Apache [2] is a good guideline. For distributed implementations
based on Hadoop, the 10M MovieLens data set [3] finishes training on a single
m1.large AWS instance (7.5GB RAM) within 30 minutes. Although we do not have
an accurate account of how much computation time and resources will be required
at the scale of your production data set, a user has reported using his own
production data set of similar size with 2M users, and training finished in
about an hour using Amazon EMR.

That said, PredictionIO does not do anything special about memory consumption,
nor does it have a special memory access model. It really depends on the
underlying libraries that do the actual work.

We imagine your project requires a much faster turnaround time according to
your spec, which is an interesting application to us as well.

PS. The work you posted is pretty cool. :)

[1] [http://mahout.apache.org/](http://mahout.apache.org/)
[2] [https://cwiki.apache.org/confluence/display/MAHOUT/Recommend...](https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+First-Timer+FAQ)
[3] [http://grouplens.org/datasets/movielens/](http://grouplens.org/datasets/movielens/)

------
elwell
Has anyone integrated this into their project yet? Would be great to see
examples of it working well.

~~~
victorhooi
We're using this at my work for a product recommendation engine. It's
essentially a wrapper around Apache Mahout (which uses Hadoop). It makes the
whole Hadoop/Mahout setup much more accessible, but it still has the same
drawbacks (sucks memory like anything, lots of overhead in ramping up the
jobs).

------
devd
Interesting - The server is licensed under AGPL, the clients are Apache v2.0
and there is a promise that a client is a separate "work"

~~~
saosebastiao
I understand the reason the AGPL exists, but has it ever been used by a
project successfully? They can't (or rather, won't) be used by businesses,
because corporate lawyers aren't dumb. So that leaves personal and academic
projects. I've never seen a successful open source project that could survive
on toy interest like that.

~~~
oomkiller
The mongodb server is a clear example of this.

~~~
devd
Thanks for pointing this out - I was under the assumption that mongodb is
under a dual GPL/commercial license

------
JDDunn9
Are there any benchmarks for the different prediction engines out there?

~~~
roboben
can you please name the other engines you know? we're currently looking for a
good system, but have only found libs like mahout so far. thanks.

~~~
dszeto
[http://graphlab.org/](http://graphlab.org/) is also a good choice

------
meritt
I'm gonna say the subtitle would be better for this submission :)

> PredictionIO, a machine learning server for data engineers and software
> developers.

------
namuol
Very cool, but for some reason I really just want to know if you're using
text-to-speech for that short demo video. I honestly can't tell with
certainty.

~~~
GhotiFish
It's like they narrated the script with a TTS engine, then got an actual human
to mimic it.

------
b0b0b0b
a collaborative filtering server.

It strikes me as a lot of overengineering with no real meat at the core.

------
vladimir_zv
Why no contact details? Is this a side project? If so, kudos to you on the
execution. very slick.

------
yeukhon
Server. What server? Do you mean an OS or just a server application?

 _edit_

Why downvote? This is a good question. _smh_

------
mrcactu5
this seems almost too good to be true. does it solve everything?

~~~
dmazin
Yes.

It. Solves. _Everything_.

~~~
btown
So it just returns 42?

------
anmalhot
is it functionally different from the Google prediction APIs?

------
csharp_gooru
This project sucks, because there are no proper examples that actually tell me
how I can make predictions.

~~~
nivla
Try their main site [http://prediction.io/](http://prediction.io/) . It has
videos and other write-ups. It's still not thorough; I guess you might have to
install and play around with it to get a better understanding.

