It's hard for me to say without knowing which firm or reading the job description, but to me research engineer implies a PhD level of knowledge. You won't get that from some open source projects and kaggle competitions.
also a great resource.
The smaller startups seemed to want more "data engineering" experience.
What do you mean by "what if any bias was there in your job search"?
What technique are you referring to?
My wife is a PhD Data Scientist, and I'd like to at least have a basic understanding of the theory, processes, and tools used in her field.
Yours would be https://news.ycombinator.com/saved?id=catilac&comments=t (access it by clicking on your username on the top right then "saved comments")
I recently interviewed someone taking a (reputable) online masters in machine learning, and they couldn't describe how or why any of the models they were using worked, nor could they answer most basic questions about the problems / data they were working on.
I teach at a well respected university for what is essentially a data science masters program, and most of my students come from CS. They are woefully unprepared, and even worse few of them seem to care at all about learning the mathematics behind anything that is going on.
Personally, I think if you can't read linear algebra with the same proficiency that you read English, you have no business calling yourself a data scientist. Unfortunately, in my experience that describes most people who label themselves data scientists.
Most books on the subject assume you already know what linear regression is; Naive Bayes gets only a brief theoretical explanation before the book jumps straight into code in Spark, R, or Clojure. UC Berkeley's course, by contrast, is very theoretical: almost no code is shown, and what little there is is MATLAB (I don't remember the course's name off the top of my head). Their Spark course, though, is heavy on code, with IPython activities.
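For anyone who hasn't seen it spelled out, the theory behind Naive Bayes really does fit in a few lines. Here's a minimal sketch in plain Python (a multinomial model with Laplace smoothing, on toy spam/ham data of my own invention):

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (tokens, label). Returns per-class word counts and priors."""
    counts, totals, priors = {}, Counter(), Counter()
    for tokens, label in docs:
        priors[label] += 1
        counts.setdefault(label, Counter()).update(tokens)
        totals[label] += len(tokens)
    vocab = {w for c in counts.values() for w in c}
    return counts, totals, priors, vocab

def classify(tokens, model):
    counts, totals, priors, vocab = model
    n = sum(priors.values())
    best, best_lp = None, float("-inf")
    for label in priors:
        # log P(label) + sum of log P(word | label), assuming word independence
        lp = math.log(priors[label] / n)
        for w in tokens:
            # Laplace (+1) smoothing avoids zero probability for unseen words
            lp += math.log((counts[label][w] + 1) / (totals[label] + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [("buy cheap pills now".split(), "spam"),
        ("cheap pills cheap".split(), "spam"),
        ("meeting agenda for monday".split(), "ham"),
        ("monday project meeting notes".split(), "ham")]
model = train_nb(docs)
print(classify("cheap pills".split(), model))      # spam
print(classify("project meeting".split(), model))  # ham
```

The "naive" part is the independence assumption in the inner loop: each word contributes its log-probability separately, which is wrong for real language but works surprisingly well in practice.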
I'm torn about the blackbox thing. On one hand, it's important to understand the underpinnings of a model. On the other, we utilize a multitude of things in our daily lives of which we have no fundamental understanding; that's abstraction in a nutshell.
Edit: another pointer here https://algorithmicfairness.wordpress.com
Some examples: blackbox application of classifiers (e.g. WEKA gui as used by some for data exploration) can ignore parameter optimization, unbalanced sets, parsimony in features, dimensionality reduction, etc. etc.
(I'm assuming you're comfortable with multivariable calculus.)
Andrew Ng's coursera course is good.
PRML (Pattern Recognition and Machine Learning) by Bishop is good, and has a useful introduction to probability theory.
You also want a good grounding in linear algebra. Strang is basically the authority on linear algebra: http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-...
You want a strong grounding in probability theory and statistics. (This is the basic language and intuition of the entire field.) I don't have as many preferences here, although it's the most important area; someone in this thread pointed to a course on statistical learning at Stanford that's good.
A good understanding of optimization is helpful. Here's a link that leads to a useful MOOC for that: http://stanford.edu/~boyd/cvxbook/
There's a lot of other useful stuff (Markov decision processes, Gaussian processes, and Monte Carlo methods come to mind) that I'm not pointing to, but if you've covered the material above then you'll probably be able to find those things on your own.
If you're into it, https://www.coursera.org/course/pgm is good but not vital.
You may want to know about reinforcement learning. This answer does better than I can: https://www.quora.com/What-are-the-best-books-about-reinforc...
Deep learning seems popular these days :) (http://www.deeplearningbook.org/)
Otherwise, it depends on the domain.
For NLP, there's a great stanford course on deep learning + NLP (http://cs224d.stanford.edu/syllabus.html), but there's a ton of domain knowledge for most NLP work (and a lot of it really centers around data preparation).
For speech, theoretical computer science matters (weighted finite state transducers, formal languages, etc.)
For vision, again, stanford: (http://cs231n.stanford.edu/syllabus.html)
For other applications, well, ask someone else? :)
EDIT: unfortunately, there's also a lot of practitioner's dark art; I picked a lot up as a research assistant, and then my first year in industry felt like being strapped to a rocket.
With so many programs and courses springing up about ML, in a few years I suspect there will be a glut of Data Scientists on the market, and once again companies will use the PhD as a filter. Already it seems like a good number of places will only hire PhDs for such roles.
In my experience hiring and chatting with other people hiring data scientists, there's the same trouble as there is with software engineers. No matter how many people have the training, there's still a dearth of applicants who are truly talented and can actually get things done. PhDs fleeing academia for the promise of easy employment and money make up a huge share of the new data scientists I've seen, and most of them have a very hard time taking deep knowledge and applying it to solve real-world problems.
At least in tech I think the future of Data Science lies in the perpetually small group of people that will have a proven track record of coming into companies and actually solving problems, just as it has been in software development.
I'm not sure that's a valid comparison. There's a relatively fixed and fairly small pool of companies who need quants. Machine Learning, OTOH, can be used by almost any company in existence (even if most of them don't realize it yet). And plenty of companies don't need somebody doing cutting edge academic research in ML... they need somebody who can use a pre-packaged library or service and apply linear regression, or k-means, or build a simple neural network with backprop.
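To make that point concrete, here's roughly what "apply k-means" means under the hood: a toy 1-D sketch in plain Python. (Real libraries add k-means++ initialization, convergence checks, and vectorization; this is just the core loop.)

```python
def kmeans_1d(points, k, iters=20):
    """Tiny 1-D k-means; library versions add smarter init and stopping rules."""
    step = max(1, len(points) // k)
    centers = sorted(points)[::step][:k]  # crude spread-out initialization
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            # assign each point to its nearest current center
            nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)

print(kmeans_1d([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], k=2))  # ≈ [1.0, 10.0]
```

Someone using a pre-packaged library never writes this loop, but knowing that this alternation (assign, then re-average) is all that's happening makes the library's knobs much less mysterious.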
The vast majority of startups and small businesses - those whose customer base measures in the dozens to hundreds - should be going out, engaging their customers person-to-person, and looking for qualitative data, because that's what'll move the needle on their sales. There's no point in understanding "your customer base" as a unit until it's big enough that it behaves, statistically, as a unit; instead, you should be focusing on "your customers", individually. Once you get into the thousands of customers you can start applying some basic learning models, and once you get into the millions machine-learning becomes as fundamental as pricing.
But you gotta get there first, and many businesses haven't. And even if they have, userbase-wise, they need to build the infrastructure (through web & mobile devs, backend engineers, data scientists, etc.) to log, store, and clean all that data before they can apply machine-learning to it.
Agreed. But many have as well. So I'll still argue that there are more potential positions for people doing "applied ML" than there are for quants. I'm open to being proven wrong though.
> And even if they have, userbase-wise, they need to build the infrastructure (through web & mobile devs, backend engineers, data scientists, etc.) to log, store, and clean all that data before they can apply machine-learning to it.
We're working on a MLaaS offering to help reduce the need for a lot of that stuff. And there are some offerings in that space already.
Problem is, writing an effective machine-learning model doesn't really require knowing the algorithms well. It requires knowing your data well. You can provide tools for this, and AML does, but there's no substitute for actually working with the data day in and day out and developing an intuition for it.
(Deep learning promises to change that a bit, since the relevant features are extracted for you by the algorithm and you don't need to do any particular data cleaning or feature extraction work. You still need to understand your data well to understand how to train the model, though, and how to apply primitive ML operations - classification, regression, clustering, etc. - to a real-world problem.)
And this is the kind of stuff that I believe can be done by people who don't necessarily need phd's in Stats or ML. A decent grounding in statistics / ML, and good domain knowledge should be enough to support using pre-packaged algorithms to solve business problems.
Hey Wow. Hoo Boy (or Girl). Maybe you should start by learning the basics, get some work experience, and only then try to pull a Zark Muckerberg or a Bacefook.
Touché. As a web developer, I have no illusions about competing with people with the proper mathematical background. However, there are still some problems that are relevant and interesting in our field (collaborative filtering, for instance).
So I guess if you temper your expectations, you won't get burned.
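For what it's worth, a toy version of collaborative filtering is quite approachable without heavy math. Here's a sketch of the user-based flavor with made-up ratings data (real systems handle sparsity, scale, and cold starts very differently):

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse rating dicts (item -> score)."""
    dot = sum(a[i] * b[i] for i in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(user, ratings):
    """Suggest items the most similar other user liked that `user` hasn't rated."""
    others = {u: r for u, r in ratings.items() if u != user}
    nearest = max(others, key=lambda u: cosine(ratings[user], others[u]))
    return [item for item, score in sorted(ratings[nearest].items(),
                                           key=lambda kv: -kv[1])
            if item not in ratings[user]]

ratings = {
    "alice": {"matrix": 5, "inception": 4, "titanic": 1},
    "bob":   {"matrix": 5, "inception": 5, "memento": 4},
    "carol": {"titanic": 5, "notebook": 4},
}
print(recommend("alice", ratings))  # ['memento']
```

Bob is the nearest neighbour here because he agrees with Alice on the movies they've both rated, so his unseen favourite gets recommended. That's the whole intuition; the production-grade versions are mostly engineering on top of it.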
True for now, but it will change. Once terms like 'Statistics', 'ML', and 'Data Science' become commonplace, there will be a ridiculous amount of demand for automation jobs.
Sure, you won't be fixing the beta-release bugs for Skynet 1.0. But almost any job with scope for data collection, analysis, and automation will demand skills in these areas.
I graduated from uni with a Computer Engineering degree in 2005; ML wasn't really a thing being offered at that point. I loved algebra and calculus but hated statistics. All I hear about these days is machine learning, so I wanted to see what the fuss was all about, as I've really never been exposed to it. I also wanted a university-style exposure because I wanted to ease into some of the necessary statistical concepts; I wasn't good at them way back when, and I haven't practiced them in 10 years.
Finally, after some initial exposure, I may find that ML or some of its concepts will be another tool in my problem solving tool bag.
I feel like a lot of people have my same motivation. We are hearing all the hype, we weren't really exposed in our formative years, and we're curious. Also, my particular course was free. So what's the harm?
At least in my Computer Engineering curriculum it was very much about the electrical engineering and software development fundamentals.
Traditional statistics is very very good at helping us learn a lot about relatively simple (or carefully and deliberately simplified) processes, and provides a rich background in study design.
ML techniques are good at helping us learn a little bit about arbitrarily complicated processes, and apply that knowledge quickly. A modern practitioner in either field should have a working knowledge of both [families of] paradigms.
ML is still supposed to do the third step. IMO, where ML often falls down due to the immaturity of the field is in not creating good experiments to test the models (hypotheses) generated by the algorithms.
Sure, the research portion of the field has made a lot of strides since 2011 but for anything that's not PhD or research level stuff, Ng's class is perfectly up to date.
You'd be hard pressed to find a better class anywhere.
As well as the problem sets: https://see.stanford.edu/Course/CS229
They are not any more recent, though.
As stated in another comment, the basics haven't changed much. The libraries you will use have evolved though. My impression is that that is where the innovation has been.
There's also the Geoffrey Hinton class on Coursera, although I'm not sure if additional sections of it are being offered per se. But you can still enroll in it and watch the videos and such. I don't know if it's any more recent than Ng's class, but it goes into more detail in some areas and covers slightly different topics. At worst, it's a good complement to Ng's class.
But I wouldn't really be able to keep up with it if I hadn't taken Ng's Machine Learning course first. The basics it teaches aren't out of date at all, and there's lots of regular old ML stuff in there that is useful now and hasn't changed in the interim apart from maybe which library you might pick.
Distributed Systems was far more fascinating in my opinion.
Introducing senses and movement - i.e., robotics - makes the boring parts worth it for me though.
In academia, it's publish or perish, so much of the cutting edge research is over-engineered (over-researched?) and too brittle to be relied on in production. Not to mention lacking a usable implementation.
Because in practice, many business problems that call for automation and ML can be solved using the simplest of techniques to a satisfactory degree. The challenge fresh graduates face is rarely advanced math. It's usually solving the right problem and making the solution robust enough to be reliable in production (+communicating this to all stakeholders).
Model interpretability and your ability to analyse errors and iterate on the solution are worth far more than a few percent gain in accuracy (and accuracy/F1 are rarely the measures most relevant to the business goals anyway; the cost matrix is usually trickier than that). Pulling every opaque deep learning library under the sun into a problem that could be solved with a few regexps and ifs, just to get a 5% KPI boost, is not a good idea.
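To illustrate the "few regexps and ifs" point, here's a hypothetical rules-based baseline for routing support tickets. The keywords and queue names are invented, but this is the kind of transparent, debuggable baseline worth beating before reaching for a model:

```python
import re

# Hypothetical routing rules: checked in order, first match wins.
RULES = [
    (re.compile(r"\b(refund|charge[ds]?|invoice)\b", re.I), "billing"),
    (re.compile(r"\b(crash\w*|error|bug|stack ?trace)\b", re.I), "engineering"),
    (re.compile(r"\b(password|login|2fa)\b", re.I), "account"),
]

def route(ticket):
    for pattern, queue in RULES:
        if pattern.search(ticket):
            return queue
    return "triage"  # fall through to a human

print(route("I was charged twice, please refund me"))    # billing
print(route("App crashes with a stack trace on start"))  # engineering
print(route("How do I export my data?"))                 # triage
```

Every decision this makes can be explained in one sentence, and a misrouted ticket points directly at the rule to fix; that interpretability is exactly what the opaque model gives up for its few extra percent.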
Building practical Machine Learning models is as much about solid engineering and understanding the business objectives as it is about math and theory, though the math cannot be skipped. We're not nearly at the stage where some "generic ML in the cloud" can cut it, wherever there's real money on the line. Successful systems are still very domain specific, built with significant SME expertise.
Source & plug: we run an ML mentoring program for promising university students, as well as corporate training on Machine Learning in Python: http://rare-technologies.com
But the arguments to an ML algorithm provided by third-party libraries include various parameters to tune the algorithm and trade off accuracy against time, etc. Debugging the algorithm for your particular dataset also requires some knowledge of how it works.
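As a small illustration of why tuning those parameters requires knowing the algorithm: in k-nearest-neighbours, k trades off noise-robustness against locality, and you can only pick it sensibly if you understand the vote. A toy 1-D sketch in plain Python:

```python
from collections import Counter

def knn_predict(train, x, k=3):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Made-up data: cluster of "a" near 1.0, cluster of "b" near 3.0,
# plus one noisy mislabelled "b" sitting at 1.5.
train = [(1.0, "a"), (1.2, "a"), (1.4, "a"),
         (3.0, "b"), (3.1, "b"), (1.5, "b")]

# With k=1 the lone noisy point wins; with k=3 its neighbours outvote it.
print(knn_predict(train, 1.5, k=1))  # b
print(knn_predict(train, 1.5, k=3))  # a
```

The same query flips its answer depending on k, and knowing *why* (a larger vote smooths over noise at the cost of blurring class boundaries) is exactly the kind of knowledge the parent comment is talking about.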
But there are use cases where you need not know what's happening behind the scenes. For example, you can use an ML-as-a-Service provider to classify your photos without knowing how its algorithms work.
Don't rely on a magic box.
That's quite funny given the circumstances. I think the neural network algorithm disagrees, and it really wants you to rely on a magic box.