There's this enormous focus on 'web scale' technologies. That focus necessarily involves visualizing and making sense of terabytes, and eventually even petabytes, of data; conventional approaches would take thousands or millions of man-hours to accomplish the same level of analysis that computers can perform in hours or days.

Tom Mitchell defines machine learning algorithms as those that improve their performance at some task with experience, which is precisely how humans go about learning to perform the same tasks that formerly took thousands or millions of hours.
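
To make that framing concrete, here's a minimal sketch (mine, not anything from the comment above) of the task/performance/experience idea: a plain-Python perceptron whose test accuracy (the performance measure P) at a toy binary classification task (T) improves as it sees more labelled examples (the experience E). The synthetic data and the hidden rule x + y > 1 are invented purely for illustration.

    # Illustrative only: performance P (accuracy) at task T (binary
    # classification) improving with experience E (labelled examples seen).
    import random

    random.seed(0)

    def make_example():
        # Points labelled by the hidden rule "x + y > 1" -- the 'true'
        # function the learner never sees directly.
        x, y = random.random(), random.random()
        return (x, y), 1 if x + y > 1 else -1

    def train(n_examples):
        w = [0.0, 0.0]
        b = 0.0
        for _ in range(n_examples):
            (x, y), label = make_example()
            pred = 1 if w[0] * x + w[1] * y + b > 0 else -1
            if pred != label:              # mistake-driven perceptron update
                w[0] += label * x
                w[1] += label * y
                b += label
        return w, b

    def accuracy(w, b, n_test=2000):
        correct = 0
        for _ in range(n_test):
            (x, y), label = make_example()
            pred = 1 if w[0] * x + w[1] * y + b > 0 else -1
            correct += (pred == label)
        return correct / n_test

    for n in (10, 100, 1000, 10000):
        w, b = train(n)
        print(f"experience E = {n:5d} examples -> accuracy P = {accuracy(w, b):.3f}")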

For high-dimensional problems, such as text classification (e.g., spam detection) or image classification (e.g., face detection), it's almost impossible to hard-code an algorithm to accomplish the goal without using machine learning. It's much easier to use a binary spam/not-spam or face/not-face labeling system that, given the attributes of each example, can learn which attributes beget that specific label. In other words, it's much easier for a learning system to determine which variables matter for the ultimate classification than it is to try to model the "true" function that gives rise to the labeling.
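
As a hedged illustration of that last point, here's a tiny naive Bayes spam classifier in plain Python. The four training messages and the word features are made up for the example; the point is only that the learner estimates from labelled data which attributes push a message toward "spam", rather than anyone hand-coding the true rule.

    # Illustrative sketch: learn which word attributes predict the spam label.
    from collections import Counter
    import math

    train_docs = [
        ("win cash prize now", "spam"),
        ("cheap meds win big", "spam"),
        ("meeting agenda attached", "ham"),
        ("lunch tomorrow with the team", "ham"),
    ]

    # Count word frequencies per class.
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in train_docs:
        class_counts[label] += 1
        word_counts[label].update(text.split())

    vocab = set(w for c in word_counts.values() for w in c)

    def log_posterior(text, label):
        # log P(label) + sum of log P(word | label), with add-one smoothing.
        lp = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        return lp

    def classify(text):
        return max(("spam", "ham"), key=lambda label: log_posterior(text, label))

    print(classify("win a cash prize"))        # -> spam
    print(classify("agenda for the meeting"))  # -> ham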




Great comment.

Probably also worth speculating on why this is happening NOW. Why is this breaking out of CS departments in 2011 and not 2002?

The datasets are new.

Bandwidth? Storage capacity? Computing power? All of the above?


Actually, this has been actively researched since ICs started gaining widespread usage in the 1970s! Even before that, there were plenty of journal papers dealing with the basics of ML and AI.

It wasn't until the 1990s, when computers became reasonably priced and more accessible to researchers and hobbyists, that we began seeing exponential growth in the amount of research output. In many ways, one could argue that the proliferation and development of AI has very much followed Moore's law, since these are extremely complex and costly calculations.

Bandwidth increases have certainly improved the availability of data sets (Google has its entire ngrams data set fully available, and it's multiple terabytes in size), but storage capacity (hard disk, RAM, and CPU cache) and computing power have really formed the bottleneck. It's not just storage capacity, either: I/O read/write times are also immensely important. It's all just a huge balancing act right now.



