I recommend starting with Witten and Frank's "Data Mining: Practical Machine Learning Tools and Techniques", for two reasons:
1) Your mathematical background will make it easy for you to spot the areas where they gloss over important theoretical details (e.g. if I recall correctly they don't treat Mercer's theorem deeply, if at all; see the short sketch after this list). This will keep you from making rookie mistakes like picking your own strange kernel functions for SVMs (all of this will make sense to you after reading W+F).
2) W+F is accompanied by Weka, which has decent-but-not-great implementations of a wide variety of algorithms in an open-source toolkit with a functional (though not particularly usable) GUI. You can be up and running on test problems from the UCI Machine Learning repository in 20 minutes.
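(Quick sketch of the Mercer point, in case it's useful: a symmetric function k is a legitimate SVM kernel only if it is positive semi-definite, i.e.

$$\sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j\, k(x_i, x_j) \;\ge\; 0 \quad \text{for all } x_1,\dots,x_n \text{ and } c_1,\dots,c_n \in \mathbb{R},$$

in which case $k(x, x') = \langle \phi(x), \phi(x') \rangle$ for some feature map $\phi$. A hand-rolled similarity score generally fails this condition, and the SVM's optimization problem then stops being convex.)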
Russell and Norvig is a good reference book, but my strong suggestion is to grab W+F, Weka, and some data from UCI's ML repo (http://www.ics.uci.edu/~mlearn/). Use Weka to run the standard algorithms from W+F on real data.
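If it helps, here's a minimal sketch of driving Weka through its Java API instead of the GUI. It assumes weka.jar is on your classpath; "data/iris.arff" is just a placeholder for whichever ARFF file you use (iris ships in Weka's data/ directory, and UCI datasets convert easily):

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class WekaQuickStart {
    public static void main(String[] args) throws Exception {
        // Load an ARFF dataset and mark the last attribute as the class label
        Instances data = DataSource.read("data/iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Train a C4.5-style decision tree (J48) with 10-fold cross-validation
        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));

        // Print accuracy, kappa, error rates, etc.
        System.out.println(eval.toSummaryString());
    }
}
```

Swapping J48 for NaiveBayes, SMO (SVM), IBk (k-NN), and so on is a one-line change, which is exactly what makes it a nice companion while working through W+F.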
After that things can get as complicated as you want. I'm personally betting that graphical models (see Koller and Friedman's "Probabilistic Graphical Models") will continue to grow in importance.
On the statistical side, I strongly recommend "Applied Linear Regression" by Weisberg to get a sense for the kinds of ideas statisticians bring to the party (significance, parameter intervals, chained analyses like ANOVA).
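For a flavour of that mindset: a lot of the book is about quantifying uncertainty in each fitted coefficient, e.g. per-parameter intervals of the form

$$\hat\beta_j \;\pm\; t_{n-p,\,1-\alpha/2}\;\widehat{\operatorname{se}}(\hat\beta_j),$$

rather than just producing point predictions, which is something ML-oriented texts often gloss over.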
Hope that helps :-)
Edit: spelling errors
PS: Almost forgot MacKay's excellent "Information Theory, Inference, and Learning Algorithms" (free as a PDF). It does a great job linking coding, ML, inference, and Bayesian probability together. Read at least as far as the section linking K-means clustering to Expectation-Maximization over a Mixture of Gaussians; it's worth it.
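(A taste of that link, paraphrased rather than in MacKay's exact notation: in a Gaussian mixture with shared spherical covariance $\sigma^2 I$, the E-step computes soft responsibilities

$$r_{nk} \;=\; \frac{\pi_k \exp\!\left(-\lVert x_n - \mu_k\rVert^2 / 2\sigma^2\right)}{\sum_j \pi_j \exp\!\left(-\lVert x_n - \mu_j\rVert^2 / 2\sigma^2\right)},$$

and the M-step re-estimates each mean as $\mu_k = \sum_n r_{nk}\, x_n \,/\, \sum_n r_{nk}$. As $\sigma^2 \to 0$ the responsibilities harden into 0/1 assignments to the nearest mean, and the two steps collapse into exactly the assignment and update steps of K-means.)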