Now, having that in mind, how realistic is it for someone in their mid-30s, a very good software developer (enthusiastic about functional programming, if that matters), to pick up enough knowledge about data science to actually take on a data-scientist role at some company?
Edit: rephrased the question.
In the beginning the Data Science Team at Stack Overflow was just me and Kevin Montrose. We're both ~30, we had some skills that were data-ish, but they were general critical thinking and math skills. By no means were we statisticians or data scientists by trade like Dave Robinson. However, the data at Stack Overflow was very amenable to analysis, and we were able to ship a personalized machine learning product after a year of work.
That year was plagued by (mostly my) engineering failures (http://jasonpunyon.com/blog/2015/02/12/providence-failure-is...) and a bit of an unfortunate lack of empiricism, not "data science" problems. And post-launch of Providence, the engineering remains the bigger issue. It's just hard to get this shit right. Our record on getting experiments right the first time is spotty, and at Stack Overflow levels of traffic you might be dumping a week or more of time on a single screwup. We've had to rerun many experiments because of various things going wrong with the code (screwed-up A/B testing code, not counting the right things, visual issues with the ads we didn't catch during design, etc. etc. etc.). There's a litany of things you've gotta get right engineering-wise before data even comes into the picture.
So if you're gonna take the plunge, know that it's 100% possible but 100% difficult, and maybe not for the reasons you think.
While people often focus on applying the latest deep-learning thought-vector approach to their BIG DATA, there's an enormous gulf between the common condition of data and this aspiration.
You don't need PhD-level stats and machine learning to apply the things that many companies could actually benefit from. Storing, maintaining, and managing the data properly is a start. Then comes working with the people in the business to get insights from the data they have. Often simple aggregations and visualisations can provide enormous benefit, as in the sketch below. Being able to show correlations, and sometimes even just being able to show how noisy things are, can be important.
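For instance, a minimal sketch of the kind of simple aggregation meant here; the file sales.csv and its columns are hypothetical, and pandas is just one reasonable tool choice:

```python
# A minimal sketch of "simple aggregations" on hypothetical sales data.
import pandas as pd

df = pd.read_csv("sales.csv")  # assumed columns: region, revenue, deal_size

# Total, average, and spread of revenue per region: often this alone
# is news to the business (and shows how noisy things are).
summary = df.groupby("region")["revenue"].agg(["sum", "mean", "std"])
print(summary)

# A quick look at a correlation (or the lack of one we expected).
print(df[["revenue", "deal_size"]].corr())
```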
Big questions: How is reality different from what we think? How are these differences important for the company?
That might be a correlation we don't expect, or a lack of one we think is there. A next step might be to become more proactive and design new experiments to answer questions that can't quite be answered yet due to a lack of data. Beyond that, you're heading towards bringing a new feature into a product.
Does your current company have some data? Do they spend a lot of time emailing spreadsheets around, or do they have Salesforce or a proper database? See if you can make something useful for your work that's based on an analysis of that data (and possibly do some work to smooth those workflows of passing data around).
Over time I came to appreciate exactly what you were saying. Many companies can be helped by fairly simple analyses.
Fast forward: data is getting much cheaper, and now, all these years later, my more advanced stats skills seem to actually matter. But even in this apparent abundance... data management and data quality assurance are often lacking or significantly underfunded, and simple aggregations and analyses still make the most difference in many organizations.
Collecting and processing data is going to be extremely time-consuming and expensive regardless of what you do, but you can make it easier on yourself through smart experimental design. P-values are still widely used for analysis in biology, to the chagrin of most statisticians. I don't think it's as much of a problem as it might seem, though. Bayesian reasoning is inherently included in the scientific process as part of the experimental design stage. The reliance on p-values becomes a problem when journalists report on research findings in a single paper that is "significant" because of p-values. The professionals in the field are capable of a balanced reading of p-value-based research, but we end up with issues like anti-vaxxers when research is presented outside that circle of professionals.
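To make the p-value pitfall concrete, here's a toy simulation (the setup and numbers are my own illustration, not from the thread): under a true null hypothesis, roughly 5% of experiments come out "significant" at p < 0.05 purely by chance, which is why a single significant paper proves little.

```python
# Toy simulation: false positives under a true null hypothesis.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_experiments = 1000
false_positives = 0

for _ in range(n_experiments):
    # Two groups drawn from the SAME distribution: no real effect exists.
    a = rng.normal(loc=0, scale=1, size=30)
    b = rng.normal(loc=0, scale=1, size=30)
    if ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(false_positives / n_experiments)  # expect roughly 0.05
```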
Of course there are still ways the process can allow incorrect research to present itself. For one thing, journals are not interested in publishing studies that do not have "significant" results, which means that bad research can stick around for far too long.
I personally think that the solution to these problems is open data, since so much of the research depends on how the data was manipulated and 'cleaned' prior to analysis.
There is such a shortage of data scientists that most companies will settle for 2 out of 3 of the trinity of Stats, Programming and Domain Knowledge. If the candidate is smart enough, sometimes 1 out of 3 is enough.
It's not sufficient to say "oh, we used a support vector machine" and then, when prompted for more detail about how it works, to shrug and say you just copied and pasted code from the internet and tweaked parameters until it worked.
Why do people get religious about tech? Funny. Let them use Excel, for God's sake; it's a great tool :)
I personally think that Excel is one of the best tools I have ever used. I do about 40% of my work in Excel. I graduated from college with a degree in biology, and I spent hundreds or thousands of hours in Excel manipulating data and graphing. Excel can and will graph almost anything you need to graph, but anything more complex than the simplest line graph requires getting creative with the formatting of your data and how you use series and data sets. The maximum complexity of a graph in Excel is technically restricted by the 254 series limit, but creating a graph with 254 series will probably take the better part of a week.
So ultimately, Excel isn't a bad tool, but when graphing it's a bit like using a shovel to dig a hole. It has a time and a place, but at a certain point using an excavator will be faster, safer and less expensive than digging the hole by hand.
I don't see how this is an argument for using Excel. You could have spent a small fraction of that time learning the basics of R and gotten everything afterwards done 10-100x faster. I know, because I used to be one of the people who spent insane amounts of time on incredibly basic stuff. That said, I still sometimes use Excel to inspect data and for some other simple tasks.
Excel is a great tool for manipulating data, especially if you are working with other people. Excel produces serviceable graphs, but if you need to produce a graph of any sort of complexity (and know how to program) R is certainly a better tool. Sometimes you need a quick visualization of some very simple data. I'd prefer to use Excel rather than R or Python at that point.
In particular: it's not so much whether the tool "works" or not, but the side effects (how well it plays with other tools, and hence the extent to which it can be integrated into larger, maintainable projects) that need to be taken into consideration as well. Which is why better engineers detect more than a whiff of bad "design smell" when they hear someone say "just use Excel" (most of the time).
Regardless, even if the recipient has Excel, I think it is poor form to insert an Excel object into your reports and emails. The paste menu lets you paste charts and graphs as images, and that's a superior option to inserting an actual graph (or table) the majority of the time. Not only does it preserve the formatting and allow better cross-platform compatibility, but documents with images instead of embedded graphs open faster and are generally smaller.
Putting people down for using Excel for simple tasks is snobbery, plain and simple. I'm not a Microsoft apologist in the least (I use Linux as my desktop OS), and I can confidently say Excel is a great program. While there are more powerful tools, Excel is often perfectly suited to smaller tasks. When I need to clean and link a small set of data, I'd rather use filters and VLOOKUP in Excel than deal with SQL/R/Python.
The faster I can turn data into information the happier my coworkers are, and the happier I am in turn.
While interesting from a technical point of view, why not make yourself 1000x more useful to society by working, for example, on "cognitive health" problems? These are problems that lean heavily on statistics and are interesting and, imho, more rewarding at the same time.
> In game theory and economic theory, a zero-sum game is a mathematical representation of a situation in which each participant's gain (or loss) of utility is exactly balanced by the losses (or gains) of the utility of the other participant(s).
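In symbols (a standard formalization, not from the quoted text): for $n$ participants whose utility changes are $u_1, \dots, u_n$, a zero-sum game satisfies

$$\sum_{i=1}^{n} u_i = 0,$$

whereas the negative-sum game mentioned below has $\sum_{i=1}^{n} u_i < 0$.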
Advertising is (to a first approximation) zero-sum in the sense that what you sell, your competitor will not sell. The contribution of the advertisement to society is zero. You could say that you have provided the customer the service of making them aware of the product, but that only holds in a second approximation, as customers generally do not want this service. Also, they can only spend their money once. Further, ads cost money, so perhaps we can even call it a negative-sum game :)
Anyway, it would be nice to have the effects properly quantified.
Our goal is to solve those problems, and it is not a zero-sum game.
No, it's not. A major portion of the effort in advertising most products goes into spreading knowledge of, and/or perceived need for, the product category, not just competing for a share of the existing demand.
> The contribution to society of the advertisement is zero.
Even in the purely competitive case, that's only clearly true where the products are perfect substitutes with neither cost nor utility differences for the purchaser.
This is the best line in the entire (interesting) post.
Doing some kind of public work will make your next job hunt so much easier. Whether that's SO answers, a blog, or a GitHub profile, showing beats telling.
The beta distribution: in the free book "Think Stats: Probability and Statistics for Programmers" there is a chapter about how the beta distribution can be used as a prior to model an unknown probability, and how Bayes' theorem lets us update that prior into a posterior distribution that is also a beta distribution; that important property is called conjugacy (the beta is the conjugate prior for the binomial). Since the two parameters of the beta in that intuitive explanation are essentially just counts of successes (hits) and failures (at-bats without a hit), the example is a very intuitive and clear way to explain what Bayesian statistics is. I think you would enjoy the Think Stats book; it is aimed at programmers and tries hard to build intuition.
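Here's a minimal sketch of that update, assuming hypothetical prior parameters and batting numbers (the specific values are just for illustration):

```python
# Beta-binomial update: the posterior is also a beta distribution.
from scipy.stats import beta

# Prior belief about a batter's hit probability, e.g. Beta(81, 219)
# encodes a prior batting average of roughly 81 / (81 + 219) = 0.270.
alpha_prior, beta_prior = 81, 219

# New (made-up) evidence: 100 hits in 300 at-bats.
hits, at_bats = 100, 300

# Conjugacy: add successes to alpha and failures to beta.
alpha_post = alpha_prior + hits
beta_post = beta_prior + (at_bats - hits)

posterior = beta(alpha_post, beta_post)
print(posterior.mean())          # updated estimate of the hit probability
print(posterior.interval(0.95))  # 95% credible interval
```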
I also enjoyed how you described the atmosphere in your office; it seems that you work in a lovely place where statistics is a well-respected tool and people try to explore and innovate in a fun way without excessive pain.
Nice post, I enjoyed it.
Unfortunately, hiring seems to be a very difficult problem, and I can imagine that many key features are hidden and can't be obtained, since the candidate can't be put into a controlled experiment that would let us get at the deep information that is usually hidden. Perhaps a way to gain more information (and wealth) is to communicate with your users and clients in such a way that what is now hidden can be measured and new features can be obtained. That is, you need to think about a model for your users, and that model must use features related to mental and human capabilities.
This shows "git" and "github" in a separate cluster from "C++" and "Python"? I don't understand this. These tools are used regardless of what other technologies are being used. For example, there are many Python and many C++ projects on github.
But note that precisely because git/github is used in combination with almost all other tags, it doesn't have a high correlation with any particular technology. A correlation (roughly) means "if I know you use tag X, you're more likely to also use tag Y." But knowing someone uses git doesn't let you guess what other technologies they use because, as you note, they could be using almost anything.
In short, if you're connected to everything, that means you're correlated with nothing.
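A toy illustration of that point (the data here is made up): a tag that everyone uses carries no information about the others.

```python
# Rows are developers; columns are binary "uses this tag" indicators.
import pandas as pd

usage = pd.DataFrame({
    "git":    [1, 1, 1, 1, 1, 1],  # everyone uses git
    "python": [1, 1, 1, 0, 0, 0],
    "pandas": [1, 1, 0, 0, 0, 0],
    "cpp":    [0, 0, 0, 1, 1, 1],
})

# Pairwise Pearson correlations between tags. "git" has zero variance
# here, so its correlation with everything is undefined (NaN):
# connected to everything, correlated with nothing.
print(usage.corr())
```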