
Introduction to Machine Learning for Developers - erinjerri1678
http://blog.algorithmia.com/introduction-machine-learning-developers/
======
rav
The description of Naive Bayes is misleading. Almost all supervised learning
problems assume that "Inputs are classified in isolation where no input has an
effect on any other inputs" (quote from the article), but that's not why Naive
Bayes is called naive.

The naive assumption made by Naive Bayes is that the _features_ (or
attributes) of each input point are independent. Let me explain with a simple
example:

Suppose you want to find people who receive benefits they are not entitled to.
The input data might have two attributes: cash on bank account, and amount
received in benefits. Although you could look for data that have a high value
in both attributes, the naive assumption made in Naive Bayes says that you can
in fact make your classification without correlating multiple attributes;
Naive Bayes assumes you can explain the labeling of data just by looking at
attributes in isolation. In this example, this assumption is clearly
unfounded, since if you only look at benefits or only at cash balance, you
won't be able to tell how a person should be classified.

The data independence assumption made by almost all ML algorithms is that
different data points are not correlated: the label of a single data point
(person in the above problem) does not depend on the attributes of other data
points.
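
A toy sketch of this failure mode (hypothetical XOR-style data, using
scikit-learn's GaussianNB; the data, labels, and variable names here are
made up for illustration): the label depends only on the _interaction_ of the
two features, so a classifier that looks at each feature in isolation cannot
do better than the class prior.

```python
# Hypothetical XOR-style data: the label is 1 exactly when the two
# features differ, so no single feature is informative on its own.
from sklearn.naive_bayes import GaussianNB

X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 10
y = [0, 1, 1, 0] * 10

clf = GaussianNB().fit(X, y)
# Per-feature class-conditional distributions are identical for both
# classes, so Naive Bayes scores every point the same way and accuracy
# collapses to chance level on this data.
print(clf.score(X, y))  # 0.5
```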

~~~
machineman44
I agree. Most supervised learning classifiers are derived based on the
independent and identically distributed assumption for each (x,y) pair.

To be more specific about the Naive Bayes assumption: the features of a data
point are _conditionally_ independent, not simply independent. This means
that given a certain label, the features are independent of one another.
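
In symbols (the standard formulation, not quoted from the article): the
conditional independence assumption factorizes the joint likelihood of the
features given the label, and classification picks the label with the
highest posterior.

```latex
% Conditional independence of features x_1, ..., x_n given label y:
P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)
% Classification then maximizes the (unnormalized) posterior:
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```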

------
Bedon292
I cannot recommend scikit-learn enough to anyone interested in machine
learning who likes Python. I have been working with it for part of my thesis,
and it can do so much with so little code. It is amazing.

~~~
atsheehan
If anyone is interested in learning more about scikit-learn, I'd recommend
"Hands-On Machine Learning with Scikit-Learn and TensorFlow" from O'Reilly:

[http://shop.oreilly.com/product/0636920052289.do](http://shop.oreilly.com/product/0636920052289.do)

When I first started using scikit-learn, I was overwhelmed with the number of
classes and options available. I just chose some basic classifiers I was
familiar with and stuck with most of the default settings. The book explains
many of the other models and when they would be useful, but also spends a lot
of time exploring the datasets (using pandas), preprocessing data, building
data pipelines, finding the best hyperparameters, evaluating a model's
performance, etc. The library feels less like a big bag of algorithms
now and more like a cohesive data pipeline.
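
The "cohesive data pipeline" idea can be sketched like this (a minimal
example assuming scikit-learn; the dataset, scaler, model, and parameter
grid are illustrative choices): preprocessing and the model are chained into
one estimator, and the hyperparameter search cross-validates the whole
pipeline at once.

```python
# Chain preprocessing + model, then grid-search hyperparameters with CV.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
grid = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10]}, cv=5).fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Note that scaling happens inside the pipeline, so each cross-validation fold
is scaled using only its own training data, avoiding leakage.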

~~~
a_bonobo
There's also O'Reilly's "Introduction to Machine Learning with Python", which
is very much about scikit-learn:

[http://shop.oreilly.com/product/0636920030515.do](http://shop.oreilly.com/product/0636920030515.do)

It just came out a few weeks ago, so it's relatively unknown. I've read the
first few chapters and so far it's good! The first author, Andreas Mueller,
is one of the scikit-learn core devs.

The Jupyter notebooks including the book's code are on GitHub:
[https://github.com/amueller/introduction_to_ml_with_python](https://github.com/amueller/introduction_to_ml_with_python)

------
adamnemecek
For anyone trying to get into the field, I put together a list of resources I
found useful:

[https://news.ycombinator.com/item?id=12900448](https://news.ycombinator.com/item?id=12900448)

------
peterhadlaw
Although I did my studies with NLTK, it looks like spaCy has stepped up to the
plate, particularly for NLP-related tasks.

Worth checking out, in parallel to or in place of NLTK:
[https://spacy.io](https://spacy.io)

~~~
voiceclonr
Never heard of it. Thanks for the pointer!

------
pknerd
For Python programmers, Harrison's website is an awesome resource:

[http://pythonprogramming.net](http://pythonprogramming.net)

------
anton_tarasenko
I've made a similar list for economists. It includes practical applications
of ML, so developers can get a sense of what the discipline can do before
jumping in.

APPLIED MACHINE LEARNING CASES

## Business

1\. Kaggle, Data Science Use cases. An outline of business applications. Few
companies have the data to implement these things.
[https://www.kaggle.com/wiki/DataScienceUseCases](https://www.kaggle.com/wiki/DataScienceUseCases)

2\. Kaggle, Competitions. (Make sure you choose “All Competitions” and then
“Completed”.) Each competition has a leaderboard. When users publish their
solutions on GitHub, you can find links to these solutions on the leaderboard.
[https://www.kaggle.com/competitions](https://www.kaggle.com/competitions)

Industrial solutions are more powerful and complex than these examples, but
they are not publicly available. Data-driven companies post some details about
this work in their blogs.

## Emerging applications

1\. Stanford’s CS229 Course, Student projects. See “Recent years’ projects.”
Hundreds of short papers.
[http://cs229.stanford.edu/](http://cs229.stanford.edu/)

2\. CMU ML Department, Student projects. More advanced problems, compared to
CS229. [http://www.ml.cmu.edu/research/data-analysis-projects.html](http://www.ml.cmu.edu/research/data-analysis-projects.html)

3\. arXiv, Machine Learning. Drafts of important papers appear here first.
Then they get published in journals.
[http://arxiv.org/list/stat.ML/recent](http://arxiv.org/list/stat.ML/recent)

4\. CS journals. Applied ML research also appears in engineering journals.
[https://scholar.google.com/citations?view_op=top_venues&hl=e...](https://scholar.google.com/citations?view_op=top_venues&hl=en&vq=eng_theoreticalcomputerscience)

5\. CS departments. For example: CMU ML Department, PhD dissertations.
[http://www.ml.cmu.edu/research/phd-dissertations.html](http://www.ml.cmu.edu/research/phd-dissertations.html)

## Government

1\. Bloomberg and Flowers, “NYC Analytics.” NYC Mayor’s Office of Data
Analysis describes their data management system and improvements in
operations.
[http://www.nyc.gov/html/analytics/downloads/pdf/annual_repor...](http://www.nyc.gov/html/analytics/downloads/pdf/annual_report_2013.pdf)

2\. UK Government, Tax Agent Segmentation.
[https://www.gov.uk/government/uploads/system/uploads/attachm...](https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/396175/Report348.pdf)

3\. Data.gov, Applications. Some are ML-based.
[http://www.data.gov/applications](http://www.data.gov/applications)

4\. StackExchange, Applications.
[http://opendata.stackexchange.com/questions/3346/examples-of...](http://opendata.stackexchange.com/questions/3346/examples-of-useful-applications-that-are-being-developed-using-open-data)

## See also

The original article: [https://antontarasenko.com/2015/12/28/machine-learning-for-e...](https://antontarasenko.com/2015/12/28/machine-learning-for-economists-an-introduction/)

A related list of cases: [https://www.quora.com/What-are-some-practical-applications-o...](https://www.quora.com/What-are-some-practical-applications-of-big-data/answer/Anton-Tarasenko-2?srid=i9vl)

~~~
lonewolf_ninja
I recently started playing around with the data sets on past Kaggle
competitions and have been learning a lot.

The data science use cases there are quite interesting. Are there any
publicly available datasets (other than the ones available in competitions)
to work with (especially for the marketing use cases)?

~~~
goberoi
I highly recommend subscribing to the "Data is Plural" mailing list. You'll
get interesting datasets mailed to you each week! Here's the last one:
[http://tinyletter.com/data-is-plural/letters/data-is-plural-...](http://tinyletter.com/data-is-plural/letters/data-is-plural-2016-11-02-edition)

Also check out "Academic Torrents". Lots of large datasets there, from
millions of tweets to labeled photos of fish in the wild.
[http://academictorrents.com/browse.php?c6=1&sort_field=times...](http://academictorrents.com/browse.php?c6=1&sort_field=times_completed&sort_dir=DESC&page=1)

~~~
lonewolf_ninja
Thanks for that! I looked it up and just subscribed to the mailing list. Looks
quite interesting.

------
mi100hael
Cool, this is a helpful intro. Anyone have any recommended reading for ML in a
JVM context?

~~~
esfandia
Weka is a popular machine learning toolkit in Java:
[http://www.cs.waikato.ac.nz/ml/weka/](http://www.cs.waikato.ac.nz/ml/weka/)

and they have a textbook to go with it:
[http://www.cs.waikato.ac.nz/ml/weka/book.html](http://www.cs.waikato.ac.nz/ml/weka/book.html)

as well as an online course:
[https://weka.waikato.ac.nz/dataminingwithweka/preview](https://weka.waikato.ac.nz/dataminingwithweka/preview)

------
pineapple_sauce
In the slides for unsupervised learning, what is meant by "Maximum Entropy"?
Doesn't this just imply that the distribution will be uniform; i.e. it's no
better than making a blind guess?

~~~
machineman44
I have only seen a maximum entropy model as part of the supervised realm where
it is a discriminative model. In other words, given some labeled data, we can
draw a decision boundary. Maximum entropy in this context is almost certainly
associated with the information theory definition, where the entropy of a
collection of data based on the distribution of classes is measured: high
entropy if each class is equally probable, lower entropy otherwise.
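
That relationship is easy to check directly (a small standard-library
sketch; the probabilities here are made-up examples): a uniform class
distribution maximizes Shannon entropy, and any skew lowers it.

```python
# Shannon entropy (in bits) of a discrete class distribution.
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability classes are skipped."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))              # 1.0 bit: two equally likely classes
print(entropy([0.9, 0.1]))              # ~0.47 bits: skewed, lower entropy
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: uniform over 4 classes
```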

------
machineman44
Honestly, this is a good run-through of resources and examples of different
machine learning algorithms and techniques, be it supervised, unsupervised,
or model validation. However, the wording and the mistakes made when
describing supervised learning and Naive Bayes suggest this is an attempt at
summarizing an O'Reilly book in a short article, errors included. How did it
get so many points on ycombinator?

~~~
princesspea
Hi! I'm Stephanie Kim, and I wrote the article. This post and the slides were
from a talk I gave as a basic introduction to machine learning at a women's
programming conference in Seattle. I did update the language, which was a
mistake rather than a misunderstanding of Naive Bayes. I have professional
machine learning experience, and while I am definitely not an expert, the
talk was geared toward web developers with no prior experience in machine
learning. Thanks for your feedback.

~~~
machineman44
Hi Stephanie. Sorry if my comment sounded harsh and nitpicky. I actually
passed it off to a fellow software engineer at work and he found it really
insightful and useful for the work he is doing. Not everybody makes the effort
to share their knowledge and I really appreciate you doing so. Have a good day
:)

------
highCs
If one understands all of that decently, does one get a job?

~~~
lisivka
If you are able to debug all that, you will get a job.

~~~
tnecniv
Which is way harder than it sounds.

------
hota_mazi
Not even a mention of TensorFlow or Torch?

