Introduction to Machine Learning for Developers (algorithmia.com)
547 points by erinjerri1678 on Nov 10, 2016 | 31 comments

The description of Naive Bayes is misleading. Almost all supervised learning problems assume that "Inputs are classified in isolation where no input has an effect on any other inputs" (quote from the article), but that's not why Naive Bayes is called naive.

The naive assumption made by Naive Bayes is that the features (or attributes) of each input point are independent. Let me explain with a simple example:

Suppose you want to find people who receive benefits they are not entitled to. The input data might have two attributes: cash in the bank account, and amount received in benefits. Although you could look for data points that have a high value in both attributes, the naive assumption made in Naive Bayes says that you can in fact make your classification without correlating multiple attributes; Naive Bayes assumes you can explain the labeling of data just by looking at attributes in isolation. In this example, this assumption is clearly unfounded, since if you only look at benefits or only at cash balance, you won't be able to tell how a person should be classified.

The data independence assumption made by almost all ML algorithms is that different data points are not correlated: the label of a single data point (person in the above problem) does not depend on the attributes of other data points.

I agree. Most supervised learning classifiers are derived based on the independent and identically distributed assumption for each (x,y) pair.

To be more specific about the Naive Bayes assumption, the features of a data point are conditionally independent rather than simply independent. This means that, given a certain label, the features in this set are independent of one another.
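A minimal sketch of what conditional independence means in practice, assuming binary features and Laplace smoothing. The function names and the toy data are made up for illustration and are not from the article: the score for a label is log P(y) plus a sum of per-feature terms log P(x_i | y), so no term ever looks at two features together. The toy labels depend only on whether the two features differ, an extreme case of a label that can only be explained by the attributes jointly.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count labels and per-feature values so P(y) and P(x_i | y) can be estimated."""
    label_counts = Counter(labels)
    # value_counts[(feature_index, label)][value] -> number of occurrences
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return label_counts, value_counts


def predict(model, row, n_values=2, alpha=1.0):
    """Return (best_label, scores) with score = log P(y) + sum_i log P(x_i | y)."""
    label_counts, value_counts = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(row):
            # Laplace smoothing (alpha) avoids log(0) for unseen values.
            seen = value_counts[(i, label)][value]
            score += math.log((seen + alpha) / (count + alpha * n_values))
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical toy data: label is 1 only when the two features differ.
rows = [(0, 0), (1, 1), (0, 1), (1, 0)] * 10
labels = [0, 0, 1, 1] * 10

model = train_naive_bayes(rows, labels)
for row in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    _, scores = predict(model, row)
    print(row, scores)
```

On this data both labels receive the same score for every input, because each feature taken alone carries no information about the label. A model that looked at the two features together would separate the classes perfectly.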

Hi, I'm Stephanie Kim and I wrote the talk/post. Thanks for the comment! Yes, you are correct: I should have specified that it is the features of each input, rather than the inputs, that are regarded as independent from one another! I will revise that in the post. Again, thanks for pointing that out, since it's an important distinction, especially for people just starting out!

I cannot recommend scikit-learn enough to anyone interested in machine learning who likes Python. I have been working with it for part of my thesis, and it can do so much with so little code. It is amazing.

If anyone is interested in learning more about scikit-learn, I'd recommend "Hands-On Machine Learning with Scikit-Learn and Tensorflow" from O'Reilly:


When I first started using scikit-learn, I was overwhelmed by the number of classes and options available. I just chose some basic classifiers I was familiar with and stuck with most of the default settings. The book explains many of the other models and when they would be useful, but also spends a lot of time on exploring the datasets (using pandas), preprocessing data and building data pipelines, finding the best hyperparameters, the best ways to evaluate a model's performance, etc. The library feels less like a big bag of algorithms now and more like a cohesive data pipeline.

There's also O'Reilly's 'Introduction to Machine Learning with Python', which is very much about scikit-learn.


It just came out a few weeks ago, so it's relatively unknown. I've read the first few chapters and so far it's good! The first author, Andreas Mueller, is one of the scikit-learn core devs.

The Jupyter notebooks including the book's code are on GitHub: https://github.com/amueller/introduction_to_ml_with_python

I second this.

I've been somewhat addicted to HackerRank challenges over the last couple of weeks. Why is not important, don't judge :)

The Python packages and tooling around learning and science are truly amazing. Try to do the Craigslist category classification without using Python and you'll see what I mean.

For anyone trying to get into the field, I put together a list of resources I found useful:


Although I did my studies with NLTK, it looks like spaCy has stepped up to the plate, particularly with NLP-related tasks.

Worth checking out, in parallel to or in place of NLTK: https://spacy.io

Never heard of it. Thanks for the pointer!

For Python programmers, Harrison's website is an awesome resource:


I've made a similar list for economists. It included a list of practical applications of ML. Developers can get a sense of what the discipline can do before jumping in.


## Business

1. Kaggle, Data Science Use cases. An outline of business applications. Few companies have the data to implement these things. https://www.kaggle.com/wiki/DataScienceUseCases

2. Kaggle, Competitions. (Make sure you choose “All Competitions” and then “Completed”.) Each competition has a leaderboard. When users publish their solutions on GitHub, you can find links to these solutions on the leaderboard. https://www.kaggle.com/competitions

Industrial solutions are more powerful and complex than these examples, but they are not publicly available. Data-driven companies post some details about this work in their blogs.

## Emerging applications

1. Stanford’s CS229 Course, Student projects. See “Recent years’ projects.” Hundreds of short papers. http://cs229.stanford.edu/

2. CMU ML Department, Student projects. More advanced problems, compared to CS229. http://www.ml.cmu.edu/research/data-analysis-projects.html

3. arXiv, Machine Learning. Drafts of important papers appear here first. Then they get published in journals. http://arxiv.org/list/stat.ML/recent

4. CS journals. Applied ML research also appears in engineering journals. https://scholar.google.com/citations?view_op=top_venues&hl=e...

5. CS departments. For example: CMU ML Department, PhD dissertations. http://www.ml.cmu.edu/research/phd-dissertations.html

## Government

1. Bloomberg and Flowers, “NYC Analytics.” NYC Mayor’s Office of Data Analysis describes their data management system and improvements in operations. http://www.nyc.gov/html/analytics/downloads/pdf/annual_repor...

2. UK Government, Tax Agent Segmentation. https://www.gov.uk/government/uploads/system/uploads/attachm...

3. Data.gov, Applications. Some are ML-based. http://www.data.gov/applications

4. StackExchange, Applications. http://opendata.stackexchange.com/questions/3346/examples-of...

## See also

The original article: https://antontarasenko.com/2015/12/28/machine-learning-for-e...

A related list of cases: https://www.quora.com/What-are-some-practical-applications-o...

I recently started playing around with the data sets on past Kaggle competitions and have been learning a lot.

The Data Science use cases there are quite interesting. Are there any publicly available data-sets (other than the ones available in competitions) to work with (especially for the marketing use cases)?

I highly recommend subscribing to the "Data is Plural" mailing list. You'll get interesting datasets mailed to you each week! Here's the last one: http://tinyletter.com/data-is-plural/letters/data-is-plural-...

Also check out "Academic Torrents". Lots of large datasets here, from millions of Tweets to labeled photos of fish in the wild. http://academictorrents.com/browse.php?c6=1&sort_field=times...

Thanks for that! I looked it up and just subscribed to the mailing list. Looks quite interesting.

Cool, this is a helpful intro. Anyone have any recommended reading for ML in a JVM context?

Weka is a popular machine learning toolkit in Java: http://www.cs.waikato.ac.nz/ml/weka/

and they have a textbook to go with it: http://www.cs.waikato.ac.nz/ml/weka/book.html

as well as an online course: https://weka.waikato.ac.nz/dataminingwithweka/preview

Hi, skymind cofounder here. To offer some context on our book there: We have appendixes covering some of the fundamental concepts such as linear algebra and statistics.

For other machine learning libraries in Java:




If you're interested in the big data side of things, there's Spark (http://spark.apache.org/) and MLlib for it (http://spark.apache.org/docs/latest/ml-guide.html). H2O (http://www.h2o.ai/) also provides ML algorithms on top of Spark (and I think independent of Spark as well, not sure of the current status). These all run on the JVM and are written either in Scala (Spark) or Java (H2O).

The Skymind.io co-founders wrote a book that references their open-source "deep learning for Java and Scala framework"[1]. They are in the YC16 batch.

[1] https://deeplearning4j.org/about

I haven't read it, but this book looks reasonable: https://www.amazon.com/Scala-Machine-Learning-Patrick-Nicola...

In the slides for unsupervised learning, what is meant by "Maximum Entropy"? Doesn't this just imply that the distribution will be uniform; i.e. it's no better than making a blind guess?

I have only seen a maximum entropy model as part of the supervised realm, where it is a discriminative model. In other words, given some labeled data, we can draw a decision boundary. Maximum entropy in this context is almost certainly associated with the information theory definition, where the entropy of a collection of data is measured based on the distribution of classes. Entropy is high if each class is equally probable and lower otherwise.
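As a rough illustration of that information theory definition, here is a small sketch of Shannon entropy, H(p) = -sum(p_i * log2(p_i)), measured in bits. The two example distributions are made up for illustration.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical class distributions over four classes.
uniform = [0.25, 0.25, 0.25, 0.25]  # every class equally probable
skewed = [0.7, 0.1, 0.1, 0.1]       # one class dominates

print(entropy(uniform))  # 2.0, the maximum for four classes
print(entropy(skewed))   # lower than the uniform case
```

The uniform distribution gives the highest possible entropy for a fixed number of classes, which matches the "equally probable" case described above.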

Honestly, this is a good run-through of resources and examples of different machine learning algorithms/techniques, be it supervised, unsupervised, or model validation... however, the wording used and the mistakes made when describing supervised learning or Naive Bayes show that this is an attempt at taking an O'Reilly book and trying to summarize it in a short article... while making errors... How did it get so many points on ycombinator?

Hi! I'm Stephanie Kim and I wrote the article. This post and slides were from a talk I gave as a basic introduction to machine learning at a women's programming conference in Seattle. I did update the language; it was a wording mistake rather than a misunderstanding of Naive Bayes. I have professional machine learning experience, and while I am definitely not an expert, the talk was geared toward web developers with no prior experience in machine learning. Thanks for your feedback.

Hi Stephanie. Sorry if my comment sounded harsh and nitpicky. I actually passed it on to a fellow software engineer at work and he found it really insightful and useful for the work he is doing. Not everybody makes the effort to share their knowledge and I really appreciate you doing so. Have a good day :)

If one understands all of that decently, does one get a job?

Supply-demand dynamics at play. The general answer is no, though. Chances are higher if you use these in your own domain and become an applied expert, or if you win some competition on Kaggle or a similarly tough environment.

If you are able to debug all that, you will get a job.

Which is way harder than it sounds.

Not even a mention of TensorFlow or Torch?
