The description of Naive Bayes is misleading. Almost all supervised learning problems assume that "Inputs are classified in isolation where no input has an effect on any other inputs" (quote from the article), but that's not why Naive Bayes is called naive.
The naive assumption made by Naive Bayes is that the features (or attributes) of each input point are independent. Let me explain with a simple example:
Suppose you want to find people who receive benefits they are not entitled to. The input data might have two attributes: cash in the bank account, and the amount received in benefits. Although you could look for data points that score high on both attributes, the naive assumption made in Naive Bayes says that you can make your classification without correlating multiple attributes; Naive Bayes assumes you can explain the labeling of the data just by looking at each attribute in isolation. In this example, the assumption is clearly unfounded: if you only look at benefits, or only at the cash balance, you won't be able to tell how a person should be classified.
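To make that concrete, here is a tiny sketch (the numbers are invented purely for illustration) of how a Gaussian Naive Bayes classifier from scikit-learn would model the two attributes independently:

```python
# Toy illustration of the Naive Bayes feature-independence assumption.
# Features: [cash_balance, benefits_received]; label 1 = "not entitled".
# All numbers are made up for illustration only.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([
    [50000, 1200],   # high cash AND high benefits -> flagged
    [60000,  100],   # high cash, low benefits     -> fine
    [ 2000, 1100],   # low cash, high benefits     -> fine
    [ 1500,   80],   # low cash, low benefits      -> fine
])
y = np.array([1, 0, 0, 0])

clf = GaussianNB().fit(X, y)

# Each feature gets its own per-class Gaussian, so the model scores
# "cash" and "benefits" separately and never explicitly models the
# interaction "high in BOTH attributes at the same time".
print(clf.predict([[55000, 1150]]))
```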
The data independence assumption made by almost all ML algorithms is that different data points are not correlated: the label of a single data point (person in the above problem) does not depend on the attributes of other data points.
I agree. Most supervised learning classifiers are derived based on the independent and identically distributed assumption for each (x,y) pair.
To be more specific about the Naive Bayes assumption, the features of a data point are conditionally independent rather than simply independent. This means that, given a certain label, the features are independent of one another.
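Written out (this is the standard textbook formulation, not something quoted from the article), the assumption is that the class-conditional likelihood factorizes over the features:

```latex
P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)
\quad\Longrightarrow\quad
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
```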
Hi, I'm Stephanie Kim and wrote the talk/post. Thanks for the comment! Yes you are correct I should have specified that it is the features of each input rather than the inputs that are regarded as independent from one another! I will revise that in the post. Again, thanks for pointing that out since it's an important distinction, especially for people just starting out!
I cannot recommend scikit-learn enough to anyone interested in machine learning who likes python. I have been working with it for part of my thesis, and it can do so much, with so little code. It is amazing.
When I first started using scikit-learn, I was overwhelmed by the number of classes and options available. I just chose some basic classifiers I was familiar with and stuck with most of the default settings. The book explains many of the other models and when they would be useful, but it also spends a lot of time exploring the datasets (using pandas), preprocessing data and building data pipelines, finding the best hyperparameters, the best ways to evaluate a model's performance, etc. The library feels less like a big bag of algorithms now and more like a cohesive data pipeline.
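For anyone curious what that "cohesive pipeline" feel looks like in practice, here's a minimal sketch; the dataset choice and parameter grid are just illustrative assumptions, not anything from the book:

```python
# Minimal scikit-learn sketch: preprocessing, a classifier,
# hyperparameter search, and evaluation wired together in one object.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),   # preprocessing step
    ("clf", SVC()),                # the model itself
])

# Hyperparameters are addressed as "<step name>__<param name>".
param_grid = {"clf__C": [0.1, 1, 10], "clf__gamma": [0.01, 0.1, 1]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```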
It just came out a few weeks ago, so it's relatively unknown; I've read the first few chapters and so far it's good! The first author, Andreas Mueller, is one of the scikit-learn core devs.
I've been somewhat addicted to HackerRank challenges over the last couple of weeks. Why is not important, don't judge :)
The python packages and tooling around learning and science are truly amazing.
Try and do the Craigslist category classification without using python and see what I mean.
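As a taste of why, here's a bare-bones text-classification sketch in scikit-learn; the example posts and categories are invented stand-ins, since the actual Craigslist data isn't included here:

```python
# Minimal text-classification sketch in scikit-learn.
# The tiny "posts" below are invented stand-ins for real Craigslist data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = [
    "2 bedroom apartment near downtown, pets ok",
    "selling my mountain bike, barely used",
    "studio for rent, utilities included",
    "vintage road bike, new tires",
]
categories = ["housing", "for-sale", "housing", "for-sale"]

# TF-IDF features feeding a linear classifier, all in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(posts, categories)

print(model.predict(["1 bedroom condo with parking"]))  # expect "housing"
```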
I've made a similar list for economists. It includes a list of practical applications of ML, so developers can get a sense of what the discipline can do before jumping in.
## Emerging applications
1. Stanford’s CS229 Course, Student projects. See “Recent years’ projects.” Hundreds of short papers. http://cs229.stanford.edu/
2. Kaggle, Competitions. (Make sure you choose “All Competitions” and then “Completed”.) Each competition has a leaderboard. When users publish their solutions on GitHub, you can find links to these solutions on the leaderboard. https://www.kaggle.com/competitions
Industrial solutions are more powerful and complex than these examples, but they are not publicly available. Data-driven companies post some details about this work in their blogs.
I recently started playing around with the data sets on past Kaggle competitions and have been learning a lot.
The Data Science use cases there are quite interesting. Are there any publicly available data-sets (other than the ones available in competitions) to work with (especially for the marketing use cases)?
Hi, skymind cofounder here. To offer some context on our book there: We have appendixes covering some of the fundamental concepts such as linear algebra and statistics.
If you're interested in the big data side of things, there's Spark (http://spark.apache.org/) and MLlib for it (http://spark.apache.org/docs/latest/ml-guide.html). H2O (http://www.h2o.ai/) also provides ML algorithms on top of Spark (and I think independently of Spark as well, not sure of the current status). These are all written on the JVM, either in Scala (Spark) or Java (H2O).
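If you'd rather stay in Python, Spark also ships PySpark bindings for its ML library. A rough sketch (the toy DataFrame below is just an assumption for illustration, not from either project's docs) might look like:

```python
# Rough PySpark ML sketch: logistic regression on a toy DataFrame.
# The data is invented; in practice you'd load it from HDFS/S3/etc.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

train = spark.createDataFrame(
    [
        (0.0, Vectors.dense([0.0, 1.1])),
        (1.0, Vectors.dense([2.0, 1.0])),
        (0.0, Vectors.dense([0.5, 0.3])),
        (1.0, Vectors.dense([1.8, 1.2])),
    ],
    ["label", "features"],
)

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)
print(model.coefficients)

spark.stop()
```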
In the slides for unsupervised learning, what is meant by "Maximum Entropy"? Doesn't this just imply that the distribution will be uniform; i.e. it's no better than making a blind guess?
I have only seen a maximum entropy model as part of the supervised realm, where it is a discriminative model. In other words, given some labeled data, we can draw a decision boundary. Maximum entropy in this context is almost certainly the information-theoretic definition, where you measure the entropy of a collection of data based on its distribution of classes: high entropy if each class is equally probable, lower entropy otherwise.
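To illustrate that last point numerically (just a quick sketch, not from the slides):

```python
# Entropy of a class distribution: maximal when classes are equally likely.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # ignore zero-probability classes
    return -np.sum(p * np.log2(p))    # entropy in bits

print(entropy([0.5, 0.5]))    # 1.0 bit  -> maximum for two classes
print(entropy([0.9, 0.1]))    # ~0.469   -> lower, distribution is skewed
print(entropy([1.0, 0.0]))    # 0.0      -> no uncertainty at all
```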
Honestly, this is a good run-through of resources and examples of different machine learning algorithms/techniques, be it supervised, unsupervised, or model validation... however, the wording used and the mistakes made when describing supervised learning or Naive Bayes suggest that this is an attempt at summarizing an O'Reilly book in a short article... while making errors... How did it get so many points on ycombinator?
Hi! I'm Stephanie Kim and wrote the article. This post and slides were from a talk I gave as a basic introduction to machine learning at a women's programming conference in Seattle. I did update the language, which was a mistake rather than a misunderstanding of Naive Bayes. I have professional machine learning experience, and while I am definitely not an expert, the talk was geared toward web developers with no prior experience in machine learning. Thanks for your feedback.
Hi Stephanie. Sorry if my comment sounded harsh and nitpicky. I actually passed it on to a fellow software engineer at work and he found it really insightful and useful for the work he is doing. Not everybody makes the effort to share their knowledge and I really appreciate you doing so. Have a good day :)
Supply-demand dynamics at play. The general answer is no, though. Chances are higher if you use these in your own domain and become an applied expert, or if you win some competition on Kaggle or a similarly tough environment.