The naive assumption made by Naive Bayes is that the features (or attributes) of each input point are independent. Let me explain with a simple example:
Suppose you want to find people who receive benefits they are not entitled to. The input data might have two attributes: cash in the bank account, and the amount received in benefits. Although you could look for data points with high values in both attributes, Naive Bayes assumes you can make your classification without correlating multiple attributes: it assumes the labeling of the data can be explained by looking at each attribute in isolation. In this example the assumption is clearly unfounded, since if you look only at benefits, or only at the cash balance, you cannot tell how a person should be classified.
The data independence assumption made by almost all ML algorithms is that different data points are not correlated: the label of a single data point (person in the above problem) does not depend on the attributes of other data points.
To be more specific about the Naive Bayes assumption: the features of a data point are conditionally independent rather than simply independent. This means that given a certain label, the features are independent of one another.
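To make the factorization concrete, here is a toy, hand-rolled version of the Naive Bayes score for the benefits example. All probabilities are invented, and the per-feature probabilities are deliberately chosen so that neither attribute is informative on its own, which is exactly the case where the conditional-independence assumption breaks down:

```python
# Naive Bayes scores a label via P(label) * P(f1 | label) * P(f2 | label).
# All numbers below are made up purely for illustration.
priors = {"flagged": 0.2, "ok": 0.8}

# Each attribute alone says nothing: P(high | label) is the same for both labels.
p_high_cash = {"flagged": 0.5, "ok": 0.5}
p_high_benefits = {"flagged": 0.5, "ok": 0.5}

def nb_score(label, high_cash, high_benefits):
    """Unnormalized Naive Bayes posterior: prior times per-feature factors."""
    p1 = p_high_cash[label] if high_cash else 1 - p_high_cash[label]
    p2 = p_high_benefits[label] if high_benefits else 1 - p_high_benefits[label]
    return priors[label] * p1 * p2

# Even the suspicious combination (high cash AND high benefits) cannot move
# the decision, because the model only multiplies per-feature factors:
for label in priors:
    print(label, nb_score(label, True, True))
```

Since each feature is uninformative in isolation, every feature combination yields the same 4:1 ratio in favor of "ok", so the jointly suspicious case can never be flagged.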
When I first started using scikit-learn, I was overwhelmed by the number of classes and options available. I just chose some basic classifiers I was familiar with and stuck with most of the default settings. The book explains many of the other models and when they would be useful, but also spends a lot of time exploring the datasets (using pandas), preprocessing data and building data pipelines, finding the best hyperparameters, the best ways to evaluate a model's performance, etc. The library feels less like a big bag of algorithms now and more like a cohesive data pipeline.
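As a sketch of that "cohesive pipeline" view: preprocessing, model, and hyperparameter search can be composed into a single object. The dataset and parameter grid here are arbitrary choices for illustration, not from the book:

```python
# Compose scaling + classifier into one Pipeline, then tune it with
# cross-validated grid search. Iris and the C grid are arbitrary examples.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

# Parameters of pipeline steps are addressed as "<step>__<param>".
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))  # scaler was fit on training folds only
```

The nice part is that the scaler is refit inside every cross-validation fold automatically, which avoids leaking test data into preprocessing.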
It just came out a few weeks ago so it's relatively unknown. I've read the first few chapters and so far it's good! The first author, Andreas Mueller, is one of the scikit-learn core devs.
The Jupyter notebooks including the book's code are on GitHub: https://github.com/amueller/introduction_to_ml_with_python
I've been somewhat addicted to HackerRank challenges over the last couple of weeks. Why is not important, don't judge :)
The python packages and tooling around learning and science are truly amazing.
Try and do the Craigslist category classification without using python and see what I mean.
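For a sense of why Python makes this so easy, here is a minimal sketch of the kind of text-category classification that challenge asks for, using scikit-learn. The tiny corpus and categories are made up for illustration:

```python
# Classify short listings into categories with a bag-of-words pipeline.
# The training corpus below is invented; a real solution would train on
# actual labeled Craigslist posts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

posts = [
    "2005 honda civic low miles",
    "selling my toyota corolla",
    "two bedroom apartment downtown",
    "studio for rent near campus",
]
categories = ["cars", "cars", "housing", "housing"]

# TF-IDF features feeding a multinomial Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(posts, categories)

print(model.predict(["2008 toyota camry"]))
```

Three lines of actual modeling code; the rest is data. That's roughly what the comment above is getting at.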
Worth checking out, in parallel to or in place of NLTK: https://spacy.io
## Applied machine learning cases
1. Kaggle, Data Science Use cases. An outline of business applications. Few companies have the data to implement these things. https://www.kaggle.com/wiki/DataScienceUseCases
2. Kaggle, Competitions. (Make sure you choose “All Competitions” and then “Completed”.) Each competition has a leaderboard. When users publish their solutions on GitHub, you can find links to these solutions on the leaderboard. https://www.kaggle.com/competitions
Industrial solutions are more powerful and complex than these examples, but they are not publicly available. Data-driven companies post some details about this work in their blogs.
## Emerging applications
1. Stanford’s CS229 Course, Student projects. See “Recent years’ projects.” Hundreds of short papers. http://cs229.stanford.edu/
2. CMU ML Department, Student projects. More advanced problems, compared to CS229. http://www.ml.cmu.edu/research/data-analysis-projects.html
3. arXiv, Machine Learning. Drafts of important papers appear here first; later they get published in journals. http://arxiv.org/list/stat.ML/recent
4. CS journals. Applied ML research also appears in engineering journals. https://scholar.google.com/citations?view_op=top_venues&hl=e...
5. CS departments. For example: CMU ML Department, PhD dissertations. http://www.ml.cmu.edu/research/phd-dissertations.html
## Government
1. Bloomberg and Flowers, “NYC Analytics.” The NYC Mayor’s Office of Data Analytics describes its data management system and improvements in operations. http://www.nyc.gov/html/analytics/downloads/pdf/annual_repor...
2. UK Government, Tax Agent Segmentation. https://www.gov.uk/government/uploads/system/uploads/attachm...
3. Data.gov, Applications. Some are ML-based. http://www.data.gov/applications
4. StackExchange, Applications. http://opendata.stackexchange.com/questions/3346/examples-of...
## See also
The original article: https://antontarasenko.com/2015/12/28/machine-learning-for-e...
A related list of cases: https://www.quora.com/What-are-some-practical-applications-o...
The Data Science use cases there are quite interesting. Are there any publicly available datasets (other than the ones available in competitions) to work with, especially for the marketing use cases?
Also check out "Academic Torrents". Lots of large datasets there, from millions of tweets to labeled photos of fish in the wild.
and they have a textbook to go with it:
as well as an online course:
For other machine learning libraries in java: