Amazon Machine Learning – Make Data-Driven Decisions at Scale (amazon.com)
251 points by leef on Apr 9, 2015 | 51 comments

Meh. The more I do machine learning in industry the more I realize how little the ML part matters compared to everything else. A typical project I've seen takes 3-6 months and contains thousands of lines of code, but the machine learning part will take a week or two and be 100 lines of code. What Amazon ML is doing would probably take an hour and 30 lines of R code you can easily find online.

And here's the not-too-hidden secret: the ML part is the fun part. It's a big reason we spend months creating banking.csv. Josh Willis did a very funny presentation at MLconf partly about this. It's like waiting in line at a theme park for an hour, and then paying someone to cut in line at the last minute and record the ride for you. https://www.youtube.com/watch?v=4Gwf5zsg4vI&feature=youtu.be...

The hardest part in machine learning is not training the model but debugging the model. How do you improve precision/recall after the first cut? Do you need more training data? Is some of your training data bad? Is it properly distributed? Do your features have bugs? Are you missing features to cover some cases? Is your feature selection effective? Did you tune parameters carefully?

All these scenarios are difficult to debug because it's "statistical debugging". There are no breakpoints to set or watch windows to look at. There is no stack trace and there are no exceptions. Any Joe can train a model given training data; it takes a fair bit of genius to debug these issues and push model performance to the next level. Unfortunately all these new and old "frameworks" almost completely ignore this debugging part. I think the first framework that has great debugging tools will revolutionize ML like Borland revolutionized programming with its visual IDEs.
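To make the "statistical debugging" idea concrete, here is a minimal sketch in plain Python that scores precision/recall overall and then per slice of one feature, which is one common first step for spotting where a model goes wrong. The row layout and the `country` feature in the usage note are made up for illustration; nothing here comes from any of the frameworks mentioned in the thread.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def per_slice_report(rows, slice_key):
    """Group (features, y_true, y_pred) rows by one feature value and score each group."""
    groups = {}
    for features, t, p in rows:
        truths, preds = groups.setdefault(features[slice_key], ([], []))
        truths.append(t)
        preds.append(p)
    return {key: precision_recall(ts, ps) for key, (ts, ps) in groups.items()}
```

Calling `per_slice_report(rows, "country")` on rows shaped like `({"country": "US"}, 1, 1)` gives one (precision, recall) pair per country. A slice that scores far below the rest is a hint to look for bad labels, a buggy feature, or too little training data in that region.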

This. The pity is that as soon as we get the results, after a week, the project is over and we move back to data wrangling hell!

You hit the nail on the head. This agrees completely with my experience at Kaggle and in applying machine learning across a broad range of industries.

Maybe it's just me, but the "tedious" feature design and extraction IS the fun part. Am I the only one?

I mean, it's time consuming and frustrating, but it's also the essence of ML work and the place where I get to apply creativity and gain insight.

Agree 100%. In that light, does anyone know how far away we are from having data wrangling be more automated? I saw a demo for a product called Paxata a few weeks ago, and it looked like a good start. Does anyone know more about things like that?

There are lots of new attempts at data wrangling approaches/tools, each with different caveats - Datameer, Platfora, Trifacta..

I can say that this is my day job now.

I think the "1 part fun 9 parts of perspiration" ratio is typical of most software fields - especially fields working in established industries. That's why dealing with software in a professional context is called a job and not an enjoyable hobby which it otherwise would be :)

I think this is one of the advertised advantages of deep learning: it will find useful and unobvious features in your data corpus without much effort on your part.

I think that works in theory, but in many real world cases it actually takes a human to map the data into a subset of salient features. It's not simply a matter of excluding irrelevant dimensions.

Actually, with deep learning the fact of success is leading the theory. We don't know why deep learning works as much as we know it does work.

Edit: in certain domains such as images and speech

Isn't the point here that you can do it on huge datasets that don't work nicely with R?

There are plenty of tools for that already. The point here is to make it as easy as possible.

I guess this could be useful for some people, but it seems rudimentary to me. If I'm reading their FAQ right they're just fitting a logistic regression to everything. I'm hoping this is just a starting point. Also, not being able to export the actual model seems like a huge dealbreaker to me.

My guess is they're using liblinear or vowpal wabbit under the hood. Both support SGD-based learning and work well in a streaming setting where data could be on disk or in memory.

Do you mean that the stuff surrounding the machine learning, like gathering data to build a dataset, takes months and other resources, whereas the fun stuff is actually very short and easy to do?

I smell commoditization.

Can you elaborate?

Is it just Amazon catching up with Azure ML, launched last year? (And cutting prices by 80%)

Azure ML also supports R and Python custom code, which can be dropped directly into your workspace.

And this was even before Microsoft acquired Revolution Analytics. Amazon ML seems to be less flexible in regards to importing your own models:

Q: Can I export my models out of Amazon Machine Learning?


Q: Can I import existing models into Amazon Machine Learning?





No... it's Amazon ML and Azure ML trying to catch up with BigML. They copied many things from our service but forgot to copy the ease of use. Services like Azure ML, Amazon ML and even Google Predict API work like a black box, and lock your model away, making you extremely dependent on their proprietary service. With BigML, you can easily export your models and use them anywhere for free. If the goal is to democratize machine learning, then the ability to extract your models and use them as you see fit is essential, and only BigML offers that level of freedom.

I just tried out BigML and it looks awesome. I use the Google Prediction API to fill in a value on a form in a web request, and I need the result immediately. Why does BigML require two web requests and take so long to get a prediction from a trained model?

If you use BigML's web forms, the first request caches the model locally so that all the subsequent predictions are performed directly in your browser.

Yeah sure, why not make your business process depend on a closed proprietary cloud-based product?

(in all fairness Amazon is better than many when it comes to not unexpectedly withdrawing products)

I would be less worried about that and more worried about cost. I know of two different startups that aren't profitable but would be if they hadn't put their entire platform on amazon services. One of those startups was lucky enough to be acquired but it's going to take them many unprofitable years to migrate away.

So the pricing is $100 per million data points, at minimum. That doesn't seem like it scales well for big data at all.

However, that's 5x cheaper than what BigML is offering (https://bigml.com/pricing/credits) for its ad hoc service, so I might be wrong.

BigML cofounder here. Most BigML customers doing machine learning at scale use either BigML subscriptions (starting at $30/mo) or private deployments – both of which provide unlimited model training and predictions and are suitable for developers and large enterprises alike. In addition, with BigML you can export your models (for cluster analysis and anomaly detection and not just classification/regression) to run locally and/or to be incorporated in related systems and services.

Did they basically just put a wrapper around VW[1] ?

[1] https://github.com/JohnLangford/vowpal_wabbit

No -- see https://aws.amazon.com/machine-learning/faqs/ --

"Q: What algorithm does Amazon Machine Learning use to generate models?

Amazon Machine Learning currently uses an industry-standard logistic regression algorithm to generate models."

But disappointingly:

"Q: Can I export my models out of Amazon Machine Learning?


Q: Can I import existing models into Amazon Machine Learning?


Note that they are doing classification and regression on iid feature vectors. Of course, ML is much larger than this setting, but this setting is generic enough that it has some applicability to lots of problems.

This does not mean they are not using Vowpal Wabbit. It is very easy to run Vowpal Wabbit with a logistic loss function.

Also, vw is what I'd consider "industry standard."

Or Weka.

> at scale

I'd say Apache Spark

I am really amazed at the kind of things Amazon turns into a service. And this ML service is just wow'ing. I have fiddled with basic SVMs before, but this takes away the part of writing code and makes it sort of an end-user product (you are still expected to know the basics of ML). On the other hand, I also don't think this will take off very well. Maybe a few companies/startups who have cash in their pocket will use it/try it out, but the audience is really limited beyond that in my opinion.

> Maybe a few companies/startups who have cash in their pocket will use it/try it out

Honestly, I'd see it the other way around. Small companies without a DS team might be drawn to this. I don't see how any company with a lick of sense would lock down their prediction model into AWS. They very clearly won't let you export your model once the training is done.

Small companies without a DS team will likely fall into the ML pitfalls which make the resulting analysis invalid.

> Maybe a few companies/startups who have cash in their pocket will use it/try it out

This would be really nice to use at my startup, but it's cost-prohibitive even on a very large budget.

I am setting up Spark Streaming to handle model creation and updates for recommendations based on what a user interacts with. If I were to even attempt something similar with this AWS service, it's $10 for every 1 million predictions, which isn't sustainable (not including the costs to create and update the model).

> but the audience is really limited beyond that in my opinion.

Definitely, largely as a result of cost. I would love to not have to worry about Spark in my infrastructure (it's another piece...) but at this price the AWS service is just too expensive.

Worse than that, isn't it $0.10 per 1000? So $100 for 1 million predictions.
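For anyone who wants to check the arithmetic, here is a tiny sketch. The $0.10 per 1,000 predictions rate is the figure quoted in this thread, not something verified against the pricing page, and it ignores the cost of creating and updating the model.

```python
# Rate quoted in this thread (USD per 1,000 predictions); treat it as an assumption.
PRICE_PER_1000_PREDICTIONS = 0.10


def prediction_cost(num_predictions, price_per_1000=PRICE_PER_1000_PREDICTIONS):
    """Return the prediction cost in USD at a flat per-1,000 rate."""
    return num_predictions / 1000 * price_per_1000


print(f"${prediction_cost(1_000_000):,.2f}")  # prints $100.00
```

So at that rate a million predictions comes to $100, ten times the $10 figure mentioned above.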

Hah yeah I realized that when I walked away. So expensive.

At first glance, this looks to go somewhat beyond Google's Prediction API, which (at least from my experience) is pretty limited in its usefulness.

It's nice to see tools for analyzing your data as well as multi-class classification and some tunable parameters, but this doesn't seem to bring anything 'new' to the game.

All the hard parts, feature selection, noise, unlabeled data, etc are still up to the end user, which makes me wonder how many people will try this out and get poor results.

It would be nice to get an idea of what sort of model they are using on the backend or even having a choice of models.

The system also uses logistic regression and is limited to a 100 GB dataset. Prediction with LR isn't that expensive and training can be done online with something like stochastic gradient descent. That can be done on a single computer. Given that the models aren't exportable and you can't import a model, I'm hard pressed to see the immediate value. Long term, though, I'm sure there's plenty of growth.
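As a rough illustration of the "online training with SGD on a single computer" point, here is a minimal pure-Python sketch of logistic regression updated one example at a time. It is not what Amazon ML runs (nobody in the thread knows that); the class name, the learning rate, and the synthetic data stream are all made up for the example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class OnlineLogisticRegression:
    """Binary logistic regression trained with one SGD step per example."""

    def __init__(self, n_features, learning_rate=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict_proba(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def update(self, x, y):
        """Apply one gradient step on the log loss for a single example (y is 0 or 1)."""
        error = self.predict_proba(x) - y
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error


def stream(n, seed=0):
    """Yield synthetic (features, label) pairs; the label is 1 when x0 + x1 > 0."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, 1 if x[0] + x[1] > 0 else 0


model = OnlineLogisticRegression(n_features=2)
for x, y in stream(5000):
    model.update(x, y)  # only one example is held in memory at a time
```

Because each update touches a single example, the data never has to fit in memory, which is why dataset size alone doesn't force you off one machine. Prediction is just a dot product and a sigmoid.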

It's kind of unclear, but it looks from the screenshots as if AWS is doing feature selection behind the scenes. But it seems that unless AWS does feature selection or model selection really efficiently behind the scenes, the cost of that extra work time is placed on the user.

What differences did you notice beyond Google's Prediction API?

This may be different now, but when I used Prediction API a few years ago, I don't remember it having any data analysis tools or multi-class classification. The UI was also pretty lacking. Haven't looked at it in a while but perhaps it has gotten better?

Did anyone actually give it a try? I only get this error with any dataset (even a humble Iris): Amazon ML cannot create an ML model: 1 validation error detected: Value null at 'predictiveModelType' failed to satisfy constraint: Member must not be null

go to the datasources tab and see if there's an error message from data source creation. i had the same error due to an issue with variable names.

I like the "Introduction to Machine Learning" which sort of briefly outlines the basics of machine learning for people who don't know about it.

I predict we will see more cloud based machine learning services. Since machine learning is hard to learn and write for the average person, providing the services will greatly help them.

It would be good if there were an open source tool like LibreOffice that does machine learning in its spreadsheet app. It would be a good feature to add, and then the competitors would have to add it to their software as well.

Google's competing product: https://cloud.google.com/prediction/docs

Cannot find it (in N. Virginia)? Is that only me?

(if anyone has the direct link for the console, please share :)


(weird, still doesn't show in the menu)

It takes the teams a while to launch everything.

Some have already taken this kinda thing a few steps further:

