One Year as a Data Scientist at Stack Overflow (varianceexplained.org)
189 points by var_explained on June 21, 2016 | 44 comments

I always liked data crunching, databases and data topics in general. Also, most of the software development knowledge I have I didn't learn in the official school curriculum, but rather from books, online courses and real-world experience.

Now, with that in mind, how realistic is it for a guy in his mid-30s, a very good software developer (enthusiastic about functional programming, if that matters), to pick up enough knowledge about data science to actually take a data scientist role at some company?

Edit: Rephrase the question.

Hey, I'm the Jason Punyon drob's gushing about in the post. I wanted to tell you that this is 100% totally realistic: everyone on the Data team at Stack Overflow besides Dave Robinson is living proof.

In the beginning the Data Science Team at Stack Overflow was just me and Kevin Montrose. We're both ~30, we had some skills that were data-ish, but they were general critical thinking and math skills. By no means were we statisticians or data scientists by trade like Dave Robinson. However, the data at Stack Overflow was very amenable to analysis, and we were able to ship a personalized machine learning product after a year of work.

That year was plagued by (my, mostly) engineering failures (http://jasonpunyon.com/blog/2015/02/12/providence-failure-is...) and a bit of unfortunate lack-of-empiricism, not "data science" problems. And post launch of Providence, the engineering remains the bigger issue. It's just hard to get this shit right. Our record on getting experiments right the first time is spotty, and at Stack Overflow levels of traffic you might be dumping a week or more of time on a single screwup. We've had to rerun many experiments because of various things going wrong with the code (screwed up the A/B testing code, not counting the right things, visual issues with the ads we didn't catch during design, etc etc etc). There's a litany of things you've gotta get right engineering wise before data even comes into the picture.

So if you're gonna take the plunge, know that it's 100% possible but 100% difficult, and maybe not for the reasons you think.

Totally realistic. It'll depend on what the company is, and what they're looking for (and hopefully, this will match what they should be looking for).

While people can often focus on applying the latest deep learning thought-vector approach to their BIG DATA, there's an enormous gulf between the common condition of data and this aspiration.

You don't need PhD level stats and machine learning to apply the things that many companies actually could benefit from. Storing, maintaining and managing the data properly is a start. Then working with the people in the business to get insights from the data they have. Often simple aggregations and visualisations can provide enormous benefit. Being able to show correlations, and sometimes even just being able to show how noisy things are can be important.

Big questions: How is reality different from what we think? How are these differences important for the company?

That might be a correlation that we don't expect, or a lack of one we think is there. Part of a next step might be to become more "pro-active" and design new experiments to answer questions that can't quite be answered yet due to a lack of data. Beyond that you're heading towards bringing a new feature into a product.

Does your current company have some data? Do they spend a lot of time emailing spreadsheets around, or have salesforce or a proper database? See if you can make something useful for your work that's based on an analysis of that data (and possibly do some work to smooth those workflows of passing data around).

IanCal - I upvoted you but let me comment for extra emphasis. I remember coming out of PhD level quantitative social science studies in academia back in the late 90s where I was using K-means clustering, Factor analysis, multiple linear regression, ANOVA and more. When I moved into marketing research and dealing with company data it was shocking and disheartening how little of those skills I could actually deploy. Data quality, data management and just the cost of capturing relevant data was so high that we were reduced to much simpler analyses.

Over time I came to appreciate exactly what you were saying. Many companies can be helped by fairly simple analyses.

Fast forward - data is getting much cheaper and now, all these years later, my more advanced stats skills seem to actually matter. But even in this apparent abundance ... data management and data quality assurance are often lacking or significantly underfunded, and simple aggregations and analyses still make the most difference in many organizations.

I went into marketing from a biology background, and I felt the transition was very easy. Collecting data in biology, especially on an ecosystem scale, is extremely difficult, and your data has a lot of noise in it. How the data is collected and processed is almost more important than how the data was analyzed.

Collecting and processing data is going to be extremely time consuming and expensive regardless of what you do, but you can make it easier on yourself through smart experimental design. P-values are still widely used for analysis in biology, to the chagrin of most statisticians. I don't think it is as much of a problem as it comes across, though. Bayesian reasoning is inherently included in the scientific process as part of the experimental design stage. The reliance on p-values becomes a problem when journalists report on research findings in a single paper that is "significant" because of p-values. The professionals in the field are capable of balanced analysis of p-value based research, but we end up with issues like anti-vaxxers when research is presented outside the field of professionals.

Of course there are still ways the process can allow incorrect research to present itself. For one thing, journals are not interested in publishing studies that do not have "significant" results, which means that bad research can stick around for far too long.

I personally think that the solution to these problems is open data, since so much of the research depends on how the data was manipulated and 'cleaned' prior to analysis.

Absolutely realistic, though it helps to have good intuition about math. I've seen smart people without data science experience make the leap after a 12 week boot camp, or finishing the Johns Hopkins Coursera certificate. If you're good at picking stuff up online, the latter is probably the best start.

There is such a shortage of data scientists that most companies will settle for 2 out of 3 of the trinity of Stats, Programming and Domain Knowledge. If the candidate is smart enough, sometimes 1 out of 3 is enough.

You could absolutely learn all the skills and knowledge to be a good data scientist, the real question is if you could get hired. I have been looking into data science and I think the demand for data scientists is not as large as it seems, but it is growing. There are very few junior data science positions. Especially without a PhD or an inside contact it could be challenging to get your foot in the door. My advice would be to check out Kaggle, try out a problem and see if you enjoy it.

I hire plenty of data scientists - if I had to give a single piece of advice: gain a deep understanding of the techniques you're using.

It's not sufficient to say "oh, we used a support vector machine", and when prompted for more detail about how it works to shrug and say you just copied and pasted code from the internet and tweaked parameters until it worked.

To take a page from the author's playbook... create some artifacts. Find interesting data sets, do interesting stuff with said data sets and write about it :)

I'm also a self taught programmer. I recently started using https://www.dataquest.io/ (not affiliated with them) and have found it to be a great data science resource for people who prefer a hands-on approach to learning. Previously I'd tried various MOOCs but the lecture format with only a few questions to work on was not helpful for me.

Depends on the stage the company is at... I have a similar background, but took up a Data Scientist role. However, here's the rub: the company has so little data pipeline set up and functioning that much of the work is laying out the plumbing, ensuring it works without getting blocked, etc. I've tried my hand at a few predictive models and cross-validation accuracy looks good, but I can't test them on real-life scenarios, as some of the data sources they depend on are not coming in. What's worse is they don't have a capable and dedicated devops person, so it's the CTO who started this data puller project and hasn't delegated or handed it over either. So I'm reduced to sitting on my ass, warming the seat.

I agree with the previous messages. There are basically two paths: A) Learn from others. There are some good programs like Data Science Retreat. B) Self-learn and use Kaggle competitions to prove it.

There are multiple Data Science boot camps. The fact that these places exist and claim to be able to get people jobs after three months learning suggests you could do it. Look at Signal Data Science.

> It makes me sad when brilliant software engineers open up Excel to make a line graph!

Why do people get religious about tech? Funny. Let him/her use Excel, for God's sake; it's a great tool :)

Excel is a great tool, but its graphing ability leaves a lot to be desired. I feel that graphing in Excel has gotten worse as they have tried making it easier to use as well. I'm 99% sure the graphing engine in Office 2007-2016 is the same one as in 2003 and earlier, but all of the damn menus they have added slow down the graph making process so much. If you are trying to use Excel to make a graph for a reasonably technical audience you are going to need to make a lot of tweaks to the graph to make it acceptable, and tweaking each part of the graph takes dozens of clicks where it used to take one or two.

I personally think that Excel is one of the best tools I have ever used. I do about 40% of my work in Excel. I graduated from college with a degree in biology, and I spent hundreds or thousands of hours in Excel manipulating data and graphing. Excel can and will graph almost anything you need to graph, but anything more complex than the simplest line graph requires getting creative with the formatting of your data and how you use series and data sets. The maximum complexity of a graph in Excel is technically restricted by the 254 series limit, but creating a graph with 254 series will probably take the better part of a week.

So ultimately, Excel isn't a bad tool, but when graphing it's a bit like using a shovel to dig a hole. It has a time and a place, but at a certain point using an excavator will be faster, safer and less expensive than digging the hole by hand.

>"I graduated from college with a degree in biology, and I spent hundreds or thousands of hours in Excel manipulating data and graphing...creating a graph with 254 series will probably take the better part of a week."

I don't see how this is an argument for using Excel. You could have spent a small fraction of that time learning the basics of R and gotten everything afterwards done 10-100x faster. I know, because I used to be one of the people who spent an insane amount of time on incredibly basic stuff. That said, I still do sometimes use Excel to inspect data and for some other simple tasks.

I wasn't arguing that one should use Excel. I wasn't arguing that you shouldn't use Excel either. I was arguing the exact same point you just made - you should use the appropriate tool for the job.

Excel is a great tool for manipulating data, especially if you are working with other people. Excel produces serviceable graphs, but if you need to produce a graph of any sort of complexity (and know how to program) R is certainly a better tool. Sometimes you need a quick visualization of some very simple data. I'd prefer to use Excel rather than R or Python at that point.

Nothing "religious" about that observation; quite the opposite, in fact.

In particular: it's not so much whether the tool "works" or not, but the side effects (how well it plays with other tools; and hence, the extent to which it can be integrated into larger, maintainable projects) that also need to be taken into consideration. Which is why better engineers detect more than a whiff of bad "design smell" when they hear someone say "just use Excel" (most of the time).

I think the concern is less whether it's a good tool, and more that one invariably receives emails with attached documents that open well only in Excel.

I re-read that part and I think the sentiment was more like "If you are a great computer scientist you have most of the skills you need to be able to use R."

Regardless, even if the recipient has Excel, I think it is poor form to insert an Excel object into your reports and emails. The copy/paste menu allows you to paste charts and graphs as images, and it's a superior option to inserting an actual graph (or table) the majority of the time. Not only does it preserve the formatting, and allow better cross-platform compatibility, but documents with images instead of graphs open faster and are generally smaller.

The same can be said for screen clipping an Excel graph, which can be done at any point, and there's an additional bonus in that the layperson can see and manipulate the data as well. It's a(n admittedly poor) self-contained document that anyone can poke at, versus an image that only the anointed can recreate with different data. For better or for worse.

That point was in favor of Excel actually, sorry I wasn't very clear. I was actually referring to the built-in functionality in Office that allows you to choose to paste Excel graphs as images instead of as full excel objects.

Putting people down for using Excel for simple tasks is snobbery, plain and simple. I'm not a Microsoft apologist in the least (I use Linux as my desktop OS), and I can confidently say Excel is a great program. While there are more powerful tools, Excel is often perfectly suited for smaller tasks. When I need to clean and link a small set of data, I'd rather use filter and VLOOKUP in Excel than deal with SQL/R/Python.

The faster I can turn data into information the happier my coworkers are, and the happier I am in turn.

Ah man, I wanted that job. I guess I didn't waste enough time on the Internet answering questions during my PhD.

The article is a very nice write-up, but in the end he is still optimizing advertisement click-through rates (a zero-sum game).

While interesting from a technical point of view, why not make yourself 1000x more useful to society by working, for example, on "cognitive health" problems? These are problems that lean heavily on statistics, and are interesting and imho more rewarding at the same time.

How is matching people to jobs a zero-sum game? If you are able to devise a method to find the ideal job for people, I think that society will benefit from it: better products and services will be built, so this is in no way a zero-sum game. It is a win-win situation if job seekers and job providers are able to match the right candidate with the right job at the right price.

Optimizing CTRs is not at all zero sum. Advertising a board game to me instead of Beyonce tickets doesn't take my clicks away from Beyonce. I'm just not interested in seeing her show. Another client might be the reverse.

Interesting, how is optimising advertisement CTR a zero-sum game? Just curious.

From wikipedia [1]:

> In game theory and economic theory, a zero-sum game is a mathematical representation of a situation in which each participant's gain (or loss) of utility is exactly balanced by the losses (or gains) of the utility of the other participant(s).

Advertisement is (to first approximation) zero-sum in the sense that what you sell, your competitor will not sell. The contribution to society of the advertisement is zero. You can say that you have provided the customer the service of making them aware of the product, but that is only in second approximation, as customers generally do not want this service. Also, they can only spend their money once. Further, ads cost money, so perhaps we can even call it a negative-sum game :)

[1] https://en.wikipedia.org/wiki/Zero-sum_game

I don't think advertising for jobs is a zero-sum game, especially if the matching algorithm is good enough to match employers and employees that had no knowledge of each other before. If you are able to reduce those search frictions, you have created value. Several economics professors won the Nobel prize for their work in this area:



Well, those Nobel prizes are for a specific theory. In practice, in this case, you see developers looking to get problems solved (or solving problems), and getting distracted to look for another job. So even if you are removing market friction, there's a huge cost in people switching (or even getting distracted all the time). In my opinion, if people are looking for a job, they should go to a job-hunting website (even if it has lower-quality data about them).

Anyway, it would be nice to have the effects properly quantified.

There are many developers looking for a good job. There are many companies having difficulty finding qualified developers.

Our goal is to solve those problems, and it is not a zero-sum game.

> Advertisement is (to first approximation) zero-sum in the sense that what you sell, your competitor will not sell.

No, it's not. A major portion of the effort in advertising of most products goes into spreading knowledge and/or perceived need for the product category, not just competing for share of the existing demand.

> The contribution to society of the advertisement is zero.

Even in the purely competitive case, that's only clearly true where the products are perfect substitutes with neither cost nor utility differences for the purchaser.

yep agree, found myself working on clinical neurology applications instead (probably less well paid tho)

I wouldn't call it waste if it helped him land the job.

> public work is not a waste of time

This is the best line in the entire (interesting) post.

Doing some kind of public work will make your next job hunt so much easier. Whether that is SO answers, a blog, or a GitHub profile, showing beats telling.

> For example, if you visit mostly Python and Javascript questions on Stack Overflow, you’ll end up getting Python web development jobs as advertisements

But what if you are an excellent C++ developer, who needs some assistance with Python and Javascript?

I'm one of the Ad Server devs that David mentioned in his post. We have plans in the works for allowing a user to specify what technologies/tags they're more interested in seeing jobs for, as well as things like customizing the geographical location (if any) you'd like to see prioritized. We hope to roll those out this year (we're a small team - just got our 3rd dev)

I just read your R course lessons and I find it very well explained, I enjoyed the lessons about data.table and ggplot2.

The beta distribution: In the free book "Think Stats: Probability and Statistics for Programmers" there is a chapter about how the beta distribution can be used as a prior to model an unknown probability, and how Bayes' Theorem allows us to update that prior into a posterior distribution that is also a beta distribution. That important property is called conjugacy: the beta is a conjugate prior for the binomial. Since the two parameters of the beta in that intuitive explanation just track the number of successes (hits) and failures (misses) among the experiments (at-bats), that example is a very intuitive and clear way to explain what Bayesian statistics is. I think that you would enjoy the Think Stats book; it is aimed at programmers and it tries hard to enhance intuition.
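For anyone who wants to see that update in code, here is a minimal Python sketch. The prior of Beta(81, 219) and the season of 100 hits in 300 at-bats are made-up numbers for illustration, and `update_beta` and `beta_mean` are hypothetical helper names, not anything from the post or the book.

```python
def update_beta(alpha, beta, successes, failures):
    """Conjugate update: a Beta prior plus binomial data gives a Beta posterior."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Expected value of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior belief about a batting average (mean 0.27).
prior = (81, 219)

# Hypothetical observed season: 100 hits in 300 at-bats, so 200 misses.
posterior = update_beta(*prior, successes=100, failures=200)

print("prior mean:    ", round(beta_mean(*prior), 4))
print("posterior mean:", round(beta_mean(*posterior), 4))
```

The whole update is two additions, which is why the beta is such a convenient prior here: the posterior mean lands between the prior mean and the observed rate, weighted by how much data each one represents.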

I also enjoyed how you describe the atmosphere in your office, it seems that you work in a lovely place in which statistics is a well respected tool and people try to explore and innovate in a fun way without excessive pain. Nice post, I enjoyed it.

If I were to work in your place trying to match people to jobs, I would study some psychology and NLP to try to predict, from what a user writes and answers, what their mental state is and how that mental state fits the jobs.

Unfortunately, hiring seems to be a very difficult problem, and I can imagine that many key features are hidden and can't be obtained, since a candidate can't be put into a controlled experiment that would reveal them. Perhaps a way to gain more information and wealth is to communicate with your users and clients in such a way that what is now hidden can be measured and new features can be obtained. That is, you need to think about a model for your users, and that model must use features related to mental and human capabilities.

What about the millions of people who've never written anything on Stack Overflow?

> For that, I might look at another source of data, Stack Overflow Careers profiles, and see which technologies tend to be used by the same developers

> http://varianceexplained.org/images/network2.jpeg

This shows "git" and "github" in a separate cluster from "C++" and "Python"? I don't understand this. These tools are used regardless of what other technologies are being used. For example, there are many Python and many C++ projects on github.

That is a heavily-filtered network: if he didn't drop most of the weak connections he'd end up with everything connected to everything else. On the other hand, if you don't use a sophisticated way to define what a "noisy" edge is, you'll end up with some curious cases like the one you point out. He might have used a naive global threshold -- it's the easiest way to go about it: drop all connections with weight lower than x. But it's also very wrong most of the time :-) Something like the disparity filter [1] usually works well, although you have to make sure that its null model is aligned with what you think is the generative process of your network. The field is "network backboning" and it's a nice one from network science.

[1] https://en.wikipedia.org/wiki/Disparity_filter_algorithm_of_...
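To make the difference between the two filters concrete, here is a rough Python sketch. It is not what the post's author ran; the edge weights are invented, and `global_threshold` and `disparity_filter` are hypothetical helper names. The disparity filter scores each edge from each endpoint's point of view as (1 - w/s)^(k-1), where s is the node's total edge weight and k its degree, and keeps the edge if that score is below alpha for at least one endpoint.

```python
from collections import defaultdict


def global_threshold(edges, min_weight):
    """Naive filter: keep every edge whose weight is at least min_weight."""
    return {edge: w for edge, w in edges.items() if w >= min_weight}


def disparity_filter(edges, alpha=0.05):
    """Keep an edge if it is unexpectedly heavy for at least one endpoint."""
    strength = defaultdict(float)  # total edge weight per node
    degree = defaultdict(int)      # number of edges per node
    for (u, v), w in edges.items():
        for node in (u, v):
            strength[node] += w
            degree[node] += 1

    kept = {}
    for (u, v), w in edges.items():
        for node in (u, v):
            k = degree[node]
            if k < 2:
                continue  # the null model says nothing about degree-1 nodes
            p_value = (1 - w / strength[node]) ** (k - 1)
            if p_value < alpha:
                kept[(u, v)] = w
                break
    return kept


# Made-up co-occurrence weights: four tags that all go together equally,
# plus "r", which is weakly tied to them but strongly tied to "ggplot2".
edges = {
    ("git", "python"): 10, ("git", "c++"): 10, ("git", "java"): 10,
    ("python", "c++"): 10, ("python", "java"): 10, ("c++", "java"): 10,
    ("r", "ggplot2"): 6, ("r", "python"): 0.1, ("r", "git"): 0.1,
}

print(sorted(global_threshold(edges, min_weight=8)))
print(sorted(disparity_filter(edges, alpha=0.05)))
```

With these toy weights the global threshold keeps the six uniform edges and throws away r/ggplot2, while the disparity filter does the opposite: the uniform edges carry no information about any one node, but r/ggplot2 dominates r's neighbourhood.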

mikk14 is right; the connections were filtered out below a particular threshold (set that threshold too low and everything is connected to everything; set it too high and you miss meaningful connections).

But note that precisely because git/github is used in combination with almost all other tags, it doesn't have a high correlation with any particular technology. A correlation (roughly) means "If I know you use tag X, you're more likely to also use tag Y." But knowing someone uses git doesn't let you guess what other technologies they use, because as you note they can use almost anything.

In short, if you're connected to everything, that means you're correlated with nothing.
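A tiny Python sketch of that point, using eight invented developer profiles (not real Stack Overflow data) and a hand-rolled Pearson correlation on 0/1 "uses this tag" vectors; `tag_vector` and `pearson` are hypothetical helper names.

```python
import math

# Invented profiles: git shows up equally often in both camps.
profiles = [
    {"python", "django", "git"},
    {"python", "django", "git"},
    {"python", "django", "git"},
    {"python", "django"},
    {"c++", "qt", "git"},
    {"c++", "qt", "git"},
    {"c++", "qt", "git"},
    {"c++", "qt"},
]


def tag_vector(tag):
    """1 if a developer lists the tag, else 0, for every profile."""
    return [1 if tag in profile else 0 for profile in profiles]


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


print("python vs django:", pearson(tag_vector("python"), tag_vector("django")))
print("python vs git:   ", pearson(tag_vector("python"), tag_vector("git")))
```

In this toy data python and django are perfectly correlated, while python and git come out at zero even though git co-occurs with python three times out of four: it co-occurs with c++ just as often, so knowing someone uses git tells you nothing about which camp they are in.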

Remember all this data comes from Stack Overflow. Some of the clusters might look a bit odd, but that's because they're all derived from the relationships we've found in Stack Overflow content.
