Ask HN: Which new skills for a data science career?
75 points by dsbyaccident on April 20, 2022 | 42 comments
Hi HN, long time lurker with a throwaway here:

I recently quit academia (social sciences) and I plan to transition into a data science career within the next year or so. This is the part I liked about my academic job and the stuff I am good at (statistics, analytical problem solver, data wrangling/modeling).

Reading through job ads the technical skill palette for DS seems overwhelming: python (+pandas, scikit, ...), R, docker, k8s, PowerBI, Tableau, PostgreSQL, DevOps/pipelines, different cloud providers, maybe add some javascript, various ML toolkits (Tensorflow, etc.).

I have 15 years of experience as a statistician (R/STATA/Linux/SQL) and a sabbatical year in front of me, which new skill(s) should I learn/prioritize?

Edit: I have a PhD.

Thanks!



Do you have a PhD? If you do, it's going to help a lot with the transition.

Skills for someone like you to work on - Python, the Python data ecosystem, machine learning, deep learning, being a good software developer.

Things not to worry about right now - Kubernetes, DevOps, cloud providers.

You don't need JavaScript. Don't learn any Tableau/PowerBI and don't apply for jobs that require them unless you want a more analytics/business intelligence focused role. Or do learn them if you want to go in that direction but those jobs are quite different even if they have the same job title.

(If a job description asks for TensorFlow/PyTorch and PowerBI/Tableau, it means that they have no idea what they're looking for whatsoever.)

Maybe I should have started there - figure out if you want a more analytics/product/decision-making kind of role or more of an applied ML kind of role and then focus on that skillset. For the applied ML kind of data science job, you need the skills that I listed, for the other kind you need the stats background that you already have, some SQL, much less coding and a couple of BI tools.


Seconding this. Take Tableau off of your resume unless you want to be a dashboard monkey (sorry, “BI Developer”). Dashboards != analytics, but lots of companies seem to think they do.

I would suggest looking at the Certified Analytics Professional certification from INFORMS. Operations Research is the original data science, and preparing for and passing the test is a good signal that you understand analytics.

Machine learning is just one tool in the arsenal of providing value using data.


> Take Tableau off of your resume unless you want to be a dashboard monkey (sorry, “BI Developer”).

As opposed to a regression-and-printf monkey (sorry, "Data Scientist")?

Like it or not - a lot of companies run on dashboards. To, you know, actually get their data out to people. So they can run their business, so you can get paid.

If you dig math and like doing mathy things, good for you. There's no need to be snide about the work that others do that (like it or not) is just as essential to keeping the business afloat as yours.

You are absolutely right though that pigeonholing in this industry is very much a thing. And intellectual honesty, not so much. Lesson being -- if one is targeting certain fields, it is often best to leave certain things off our resumes (or tone them way down).


Hah, that is indeed a good description of most “Data Science” work.

At the end of the day, the point of corporate analytics is to provide actionable insights to improve a company’s profitability.

Descriptive analytics are the lowest form of analytics. If it is appropriate for your problem, and if you have the brainpower to do it, I am much more in favor of prescriptive analytics.

What’s better:

1) a dashboard that shows me inventory levels of various commodities at dozens of locations, from which inventory management employees are supposed to come up with a resupply schedule.

2) an optimization model that is specifically formulated for this problem, that runs every morning at 7am, with a quick double check (indeed, maybe visualized in a dashboard) of the output by a skilled human.

A company that uses the latter approach will dominate a company that relies on humans staring at graphs and charts.

If it’s a regular decision, it should be automated (with human double checking, absolutely). If it’s an irregular decision, a slide deck with matplotlib or similar visuals is fine. If that decision becomes more frequent, it should be built into an automated tool so humans don’t have to sift through dozens of charts and tables to come up with what will still be a suboptimal solution.
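To make the descriptive vs. prescriptive contrast concrete, here is a minimal sketch. It is not a real optimization model, just a simple order-up-to rule standing in for one, and the location names, reorder points and target levels are all made up for illustration. The point is only that the output is a proposed action for a human to double check, not a chart to stare at.

```python
def resupply_plan(inventory, reorder_point, target_level):
    """Propose order quantities with a simple order-up-to rule.

    Any location whose stock is below its reorder point gets an order
    that brings it back up to its target level. All thresholds here are
    hypothetical inputs, not derived from any real model.
    """
    plan = {}
    for location, on_hand in inventory.items():
        if on_hand < reorder_point[location]:
            plan[location] = target_level[location] - on_hand
    return plan


# Made-up morning snapshot of stock levels per location.
inventory = {"north": 12, "south": 40, "east": 3}
reorder_point = {"north": 20, "south": 25, "east": 10}
target_level = {"north": 50, "south": 60, "east": 30}

plan = resupply_plan(inventory, reorder_point, target_level)
print(plan)  # {'north': 38, 'east': 27}
```

A real version would replace the rule with a properly formulated model (costs, lead times, capacity), but the shape stays the same: data in, recommended decision out, human review at the end.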

So perhaps I should have clarified my initial point. Most dashboarding that I see is purely descriptive. Descriptive analytics are neat, and absolutely have their place in exploratory data analysis and sense making about the basics of business, but are horribly slow to translate into actions as they must be filtered through humans who are getting deluged with visual information.

This reply is getting long, but it comes down to “what is the role of an analyst”? In my opinion, an analyst should tell me, the executive, what I should do. Or they should build me a system that makes this decision for me regularly. What I don’t want is to have an “analyst” merely present me with graphs and tables of my data. If you are not actually doing anything but visualizing the data, you are not analyzing it, and are therefore not an analyst. A data visualizer, or BI developer, but not an analyst.


Well said. You need Tableau/PBI to tell the story of data, and distribute it to stakeholders. They aren't going to read a 10 page Word document of your findings... Maybe in 1999 they did. Looked at another way, you'd never publish a mathematical or statistical based thesis and not include any graphs or visuals. BI is just one of the steps after the analytics that is a necessary evil. It's not even a hard tool to use anyhow.


Sigh..another program that requires you to maintain your certificate by doing PDUs. This is like a 90s dark pattern that still creeps up from time to time. I can't get behind any organization that threatens to take education away from you.


> Dashboards != analytics, but lots of companies seem to think they do.

A lot of people (including people whose job title contains the word 'analyst') think this.


This is very good advice. There's a world of difference between building and deploying models and doing BI.


Came here just to weigh in for each and every point. This is the best answer so far.

A PhD is seen as a magical badge of super power. Even if you can get a job without one, consider having it: it is a 1.5x-2x salary multiplier.


Definitely not my experience in big tech data science. 30-50% of data scientists I work with don’t have a PhD, and took the self-taught hacker path, and earn just as much as PhDs. I probably earned a total of a million bucks by the time I was as old as the average PhD data scientist hire out of academia.


I've only seen PhD's increase salary for machine learning researcher positions. For ML eng, ML ops, and analytics, I haven't seen any pay differential. YMMV


I don't think there's a definite one-size-fits-all answer to that, it will depend on what's most enjoyable to you, what jobs you'll be applying to etc. My personal recommendation (which is based on my n=1 subjective experience) is:

- python DS ecosystem; fundamentals: numpy pandas matplotlib seaborn sklearn scipy. From there you can branch in many different directions - interactive visualization libraries (e.g. plotly / bokeh), stats / probability stuff (statsmodels / pymc3), NLP, sklearn addons, ML explainability, ...

- solid "software engineering" - writing good code, unit tests, documentation, logging, basics of deploying a service

- TF / pytorch if you want to get into the deep learning hype
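As a rough illustration of the "software engineering" bullet, here is a small sketch of what a documented, logged and unit-tested helper might look like. The function `safe_mean` and the logger name `cleaning` are hypothetical, chosen only for the example; only the standard library is assumed.

```python
import logging
import unittest

logger = logging.getLogger("cleaning")  # hypothetical logger name


def safe_mean(values):
    """Return the mean of the non-missing values, or None if there are none."""
    present = [v for v in values if v is not None]
    if not present:
        # Log instead of failing silently, so the gap shows up in pipeline logs.
        logger.warning("safe_mean called with no usable values")
        return None
    return sum(present) / len(present)


class SafeMeanTest(unittest.TestCase):
    def test_ignores_missing(self):
        self.assertEqual(safe_mean([1, None, 3]), 2)

    def test_empty_input_logs_warning(self):
        # assertLogs fails the test if no WARNING is emitted on this logger.
        with self.assertLogs("cleaning", level="WARNING"):
            self.assertIsNone(safe_mean([None]))
```

Saved in a file, the tests can be run with `python -m unittest`.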

Best of luck, and more importantly, enjoy!


Since you already know R/STATA/Linux/SQL, you're already way ahead of most people. You'll find that a lot of the specific technical stuff is project/industry dependent and since you're just starting in the field, think of any hard requirements for a specific technology as a warning sign. Good employers will give you room to learn their niche technicals.

The thing about data sciencey stuff is that the data you're working with will often be extraordinarily messy. It's not uncommon to realize two weeks into a project that you made a terrible cleanup assumption on day one... then have to run hours and hours of mostly-hand-executed ETL work again. And then again when you realize you forgot that you ran a unix one-liner on a random input file, but didn't write it down anywhere.

So, one of the most important things to learn off the bat is learning how to clean data programmatically, specifically with the goal of making sure that your cleanup is repeatable at any step of the way. You want to be able to get to a point where you feel confident that you can mostly trivially recover after deleting all cache/temp files, tables, etc. Makefiles are great for this.
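A minimal sketch of what "repeatable" can mean in practice: every cleanup rule lives in one function, so the cleaned data can be regenerated from the raw input at any time. The column names (`name`, `value`) and the sample rows are invented for the example.

```python
import csv
import io


def clean_rows(raw_text):
    """Deterministically clean raw CSV text: same input, same output."""
    reader = csv.DictReader(io.StringIO(raw_text))
    seen = set()
    cleaned = []
    for row in reader:
        name = row["name"].strip().lower()
        value = row["value"].strip()
        if not name or not value:
            continue  # drop rows with a missing field
        try:
            value = float(value)
        except ValueError:
            continue  # drop rows whose value is not numeric (e.g. "n/a")
        key = (name, value)
        if key in seen:
            continue  # drop exact duplicates after normalisation
        seen.add(key)
        cleaned.append({"name": name, "value": value})
    return cleaned


# Made-up messy input standing in for a raw export.
RAW = "name,value\n Alice ,1.5\nBOB,n/a\nalice,1.5\n,3\ncarol, 2 \n"
result = clean_rows(RAW)
print(result)
# [{'name': 'alice', 'value': 1.5}, {'name': 'carol', 'value': 2.0}]
```

Because no step is done by hand, deleting every intermediate file and rerunning gives the same result, and a Makefile target can simply call a script like this.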

You'll save a lot of time in the long run if you can get good at this.


So, basically data engineering. If you want to be good at data science you should at least be a mediocre data engineer.


The reason the skills matrix seems overwhelming is because there is no true singular "data scientist" role. It's a title that's been abused. So my answer is, it depends a lot on the TYPE of data scientist you want to become:

- Data Scientist MIGHT mean Applied ML Scientist (important distinction)

- Data Scientist MIGHT mean Data Engineer

- Data Scientist MIGHT mean ML Engineer

- Data Scientist MIGHT mean Data Analyst

- Data Scientist MIGHT mean Statistician

- Data Scientist MIGHT mean Product Analyst

The traditional idea of a data scientist for the last decade or so is someone who is able to do insight extraction, create models (ML and otherwise) and build dashboards and presentations. In practice this has mostly proved to not be practical and little value is being extracted from the role, so a mature organization will properly break out the responsibilities into the above mentioned functional areas.

I think it's super important that you reach out to industry data scientists at companies or in domains you're interested in and ask them what it is that they actually do on the daily. Be careful with most data science roles as they are really just data analyst roles in disguise.

Very few true data "science" roles exist, and I'd argue that you might not actually want to work in those roles since they likely exist in companies that have no idea what they want out of them.

That being said I think the 2 absolutely crucial technical skills to have for almost any modern data related job will be:

- Python + Pandas

- SQL

That's really going to be the technical foundation to make a data career in industry.

The breadth of other skills and required knowledge is too much for any single person or post to tell you.

I'm not entirely sure what your PhD is in, but if it has an associated domain in industry and if you're actually interested in that domain, I would recommend starting there and seeing what Data Scientists with similar academic backgrounds as you might be doing. LinkedIn is a great place to find people and connect!

Best of luck.


Have you considered/are you open to a mgmt role? Not to be derogative to coding, but finding a developer is easier than a stats expert with data science chops. You could be in a leadership/SME role for a small team, and be a value "multiplier". For this, you need to be expert in the tools of your trade .. so R for sure, and possibly Python(+pandas, scikit). Your dev team can handle other parts.

Where are you based, btw? I have a ton of Parallel Computing experience (PhD too). My weakness is on the math/stats side. Happy to give more specific advice. Market is hot right now .. don't delay.


If you are looking for a job, get a job.

Self study is not a job.

It's attractive because it's easier than looking for a job.

Or to put it another way, job hunting is the key skill for any career transition.

Good luck.


I've got a slightly left-field suggestion that I think might better fit your existing expertise and mind-set. Skip all the mechanics and go for a high level understanding of the relevance and use of data. Take Scott E Page's course on Model Thinking. Download NetLogo and go through all the examples. Then learn Python and take at least one good course on Systems Thinking. Best of luck.


Python, pandas, PyTorch: be comfortable with them, and that demonstrates enough. Stick with the Python ecosystem; that's enough tools, and you can add ad hoc things on the job. The other stuff gets more into data engineering and visualising, which is fine, but you're not looking to be a unicorn of all skills; you just need your first job.


I made this transition a few years ago. Prioritize studying Python. Leetcode. Make sure you're familiar with git. And then pick 2 or 3 sub-genres of Data Science and study those. Operations Research, experimentation and statistical tests, Causal Inference, Computer Vision, NLP, etc. Then apply for Senior Data Scientist or Senior Applied Scientist roles in those areas. The hiring process is broken and based on quickfire technical tests. There are huge numbers of people trying to break in to DS. Be prepared for lots of rejections. Be positive during initial calls with recruiters; they are just looking for a reason to filter you out and don't actually know much about the job. Good luck!


I recently made the transition from academia to industry (also a PhD holder). I would echo a lot of what other commenters said about learning python and the associated data science tooling. Coming from an R heavy scientific discipline (quantitative ecology), I found python to be quite a bit better at things that base R struggled with, particularly string manipulations.

Aside from programmatic and cloud tools as identified in your post, one of the biggest hurdles is whittling down your academic CV into a resume. Spending time re-framing your academic accomplishments in the short form will be the best time investment for getting in for interviews. I ended up following the google XYZ resume formula: https://www.inc.com/bill-murphy-jr/google-recruiters-say-the... It kind of hurts to distill your academic achievements into "Published [X] peer reviewed papers [Y] by driving the analysis [Z]", but I think it really helped me start getting calls vs. desk rejects. Relatedly, only include publications that either highlight your expertise for a specific job posting or if they further highlight your expertise in statistics in a way that could set you apart from other candidates.


Based on your skills I would highly recommend Python and some JavaScript.

Python should be the main language to focus on because it's great for working with data and general scripting needs (working with files, etc). For data, it covers everything from basic data access to a lot of the math and statistics that you would find in R.

I would recommend some JavaScript so that you have the ability to easily read it and because I feel that when you learn multiple languages it improves your skills overall for each language. Doesn't have to be long - perhaps a few days of focused learning (or even a day or less).

Recently I have been working on something that uses a lot of statistics, analytics, etc. I have been using Python the most for it, and actually some SQL, as all major databases now support Percent Rank and other statistical functions. For my project I'm using JavaScript a little for work such as web scraping, but most work is done in Python and then final reporting is SQL.
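For anyone who hasn't met it, SQL's `PERCENT_RANK()` window function is defined as (rank - 1) / (rows - 1), with tied values sharing the lowest rank. A quick pure-Python sketch of the same calculation follows; the helper name and the sample numbers are just for illustration.

```python
def percent_rank(values):
    """Mimic SQL PERCENT_RANK(): (rank - 1) / (n - 1), ties share the lowest rank."""
    n = len(values)
    if n <= 1:
        return [0.0] * n  # SQL returns 0 when the partition has a single row
    ordered = sorted(values)
    # ordered.index(v) counts the values strictly smaller than v, i.e. rank - 1.
    return [ordered.index(v) / (n - 1) for v in values]


scores = [10, 20, 20, 40, 50]  # made-up data
print(percent_rank(scores))  # [0.0, 0.25, 0.25, 0.75, 1.0]
```

Note that this differs slightly from percentile conventions that divide by n rather than n - 1, so results may not match other tools' "percent rank" options exactly.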

Python is also great for Machine Learning. Even for basic API access with TensorFlow I prefer using Python over their JavaScript API.

Good luck with your new career!


Basically the more full stack you can be the better. In a perfect world you have the ability to frame statistical problems, decide on a modelling approach and evaluate it, deliver it not as a notebook but as a Python library, build the data pipelines to integrate it into the business, write all this code as an experienced software engineer would, then be able to visualise or reason about the model outputs with an eye to the wider business. In reality, few people are good at all these things, especially early in their career, but they’re all worthwhile skills to pursue.

In terms of prioritisation, Python and general software engineering skills multiply your ability to deliver stuff but also learn and experiment with new libraries. If all you knew was logistic regression or gradient boosting, but you could deliver a whole pipeline and insights on top of it, you’d be extremely valuable to most businesses. You’re also set to fill those roles in future startups you found yourself, where needed.

It’s true you can also do world class research and never leave Jupyter - that’s valid too, but probably more hit and miss as a career.


I think everyone’s advice here is good, I also want to throw on your radar that with your background you can also try targeting Applied Scientist roles instead of Data Science directly e.g. https://www.amazon.jobs/en/landing_pages/ops-tech-applied-sc...

Generally the highest value for your background will be Python (do a bit of leetcode), Pandas and Sklearn. PyTorch would also complement your existing skill set.

A lot of the other stuff you can just learn on the job, like cloud, dev ops etc. It all differs by where you work. It’s also the shiniest stuff that you can get distracted by and waste time on for less value.


- Find out what you want to do

  - Think about what are your strongest applicable skills

  - Talk to as many DSes/MLEs/MLOps engineers you can

  - Experiment with various fields (watch videos, OSS work etc)

- Find companies that you actually want to work for

- Find out what DS means at those places, what they do, and whether that's what you want to do. Ask as many questions about the details of the job as possible at the interview; the hiring manager will be glad you want to avoid getting a job you are not committed to.

Essentially: Do your homework then make a decision that is good for _you_ rather than trying to fit to an abstract idea.


It sounds like the main thing for you is going to be machine learning. Do some Kaggle rounds. To get good results on that you will be forced to learn a lot of the stuff data scientists do, plus it always looks good on your CV.


Prioritize getting a job and go all in on the skills that job needs and wants. Consider applying for Amazon Scholars or an equivalent programme.

> The Amazon Scholars program has broadened opportunities for academics to join Amazon in a flexible capacity, in particular part-time arrangements and sabbaticals.

https://www.amazon.science/scholars#:~:text=Amazon%20Scholar....


Stick with mastering Python and focus on working with either Google or AWS. You want to go work where they have a lot of interesting data. Frame your job search around that.


Truthfully, most roles advertised as data science in most corporations and healthcare orgs are mostly just data wrangling for presentations or dashboard work. You have plenty of skills to get in the door at most places if you interview well. Your big problem will be weeding through the jobs to find one where you will really get to do data science work and not just write somewhat complex SQL queries for people.


Based on jobs data: Python, SQL, R, and Tableau are all in very high demand. See: https://i.imgur.com/WXn8Cny.png

Source: https://www.kaggle.com/code/nomilk/data-science-language-and...


This doesn't show exactly demand. One thing about data science job listings is that some requirements aren't descriptive of the job itself. It's common to put in something like "Python/R" or "Python/R/Julia" as a way to show that they want experience with at least one of them, but then the job is almost always Python; and not having Python experience is an immediate disadvantage because so many other candidates know it. I don't know about the Australian market but in my job search in the UK last year, 100% of the jobs I came across were Python jobs even if there was a long tail of other languages in the job descriptions.

If someone sees this visualisation and decides to focus on R/SAS/MATLAB or anything even lower on the list, they're making a career mistake in the current data science market.

(tl;dr - scraping job listings can lead to misleading results)


For sure, job listings are a proxy and should be taken with a grain of salt. 3501 python jobs, 2490 R jobs still indicates strong demand for R though, even if it's not going as bananas as python right now.

I suspect a lot of python jobs would hire a good 'R person' knowing they'll be up to speed in a month or two, and vice versa. Similar to other CS domains, where strong fundamentals are much more important than whatever language you learned them in.


> 3501 python jobs, 2490 R jobs still indicates strong demand for R though

My point is that out of those 2490 job listings that mention R, the number of jobs that require you to actually write R more than any other language will be closer to 0 than it is to 2490. Some of them will still actually use R but it's a much much smaller number than the visualisation leads you to believe.


The data seems to support your idea. Although there were 2370 jobs that mentioned both python and R, there were about 8 times as many 'python only' jobs (261) as there were 'R only' jobs (31):

https://i.imgur.com/vkjGSyU.png
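For what it's worth, this kind of split is easy to reproduce on scraped listings. Below is a rough sketch with invented listing texts and a deliberately crude matching rule (a language counts as mentioned if its name appears as a standalone token); it is not the method behind the notebook linked above.

```python
import re
from collections import Counter


def mentions(text, language):
    """True if `language` appears as a standalone token (crude heuristic)."""
    pattern = r"(?<![A-Za-z])" + re.escape(language) + r"(?![A-Za-z])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def classify(text):
    """Bucket a listing by which of Python and R it mentions."""
    has_python = mentions(text, "Python")
    has_r = mentions(text, "R")
    if has_python and has_r:
        return "both"
    if has_python:
        return "python_only"
    if has_r:
        return "r_only"
    return "neither"


# Invented listings, purely for illustration.
listings = [
    "Data scientist: Python/R, SQL",
    "ML engineer, Python and PyTorch",
    "Biostatistician, R and SAS",
    "Analyst: Python, R or Julia",
    "Python developer for data pipelines",
]

counts = Counter(classify(text) for text in listings)
print(dict(counts))  # {'both': 2, 'python_only': 2, 'r_only': 1}
```

A single-letter name like R is prone to false positives (for example "R&D" would match), which is one more reason to treat mention counts as a loose proxy.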


I'm in a similar place to the OP - I have a PhD in health science with R and Python experience (and a little hobby experience with TensorFlow and some AWS stuff). Is it worth doing kaggle competitions? I've heard mixed things - some people say companies ignore them, others say they're useful evidence on a CV. Thanks!


Do them if you're interested but I don't think they're that useful on a CV. If you win a competition, you'll obviously get some attention but just participating and putting it on your CV doesn't do much.


thanks - do you think there's anything obvious I might be missing coming from an academic background in terms of getting attention? I don't know how useful side projects really are, for instance (particularly vs things I've done as part of my research).


I have been in a data science consulting role for several years at a big management consulting firm. Consider this path if it interests you. Money is better. Problems can be more varied and interesting imo. Mentorship is much better. We get lots of PhDs from random fields.

That aside, I work with many data scientists at clients. Most companies are still Windows shops. Some may get Linux. I've never seen a data science group using STATA successfully. R is ok, but in my experience is falling out of favor rapidly. While R is great for data science, its relevance as the glue for other things is not so great. I would softly advise Python.

I wouldn't bother learning viz tools like tableau beyond basic familiarity enough to slap it on your resume as something you've touched. It's all company specific. Same for the cloud and pipeline shit.

Think hard about what kinds of problems you want to solve. Most problems, that is, every single problem I have worked on, are poor fits for neural networks. PyTorch has never been relevant. Real world business problems just don't really benefit from that kind of stuff all that much. A real data science value add is picking up the low hanging fruit by being smarter about decisions that used to be made on gut instinct or whatever. Unless you really want to work on computer vision or whatever, it's just not something you need to bother with. I typically end up using a lightGBM model for pretty much everything at the end of the day. Which is basically just a fancier random forest.

Many data science shops fail to achieve anything because the data scientists are too complacent. Be a business person. Show a willingness to engage on problems and grill business folks for how they make decisions, and discuss how your model could change that process to add value. Don't make book reports on your findings and expect them to figure out how to use it. It's so, so common to have data scientists who don't feel qualified to take that part of the job on, and so they build nice models and visuals that everyone applauds and that then collect dust on a shelf. Every output should be clearly dictating a path to generating value, be it money or some other worthy metric.

Highlighting that you're someone who can use their data science to solve real problems will be much more appealing in interviews than saying you can do data science things the best. IMO, a good question would be "I want to make sure I'm joining a group that has the power and support to really influence how decisions get made. Can you give some examples of the work that the business has adopted from your outputs?". This works both from a virtue signaling perspective and from a genuine desire to avoid joining a back office data science skunkworks that nobody listens to.

edit: I'm on the east coast, and not interested in working for big tech, for context. All of my examples here are from experience working with "normal" companies from dozens of industries, but no west coast tech.


> I typically end up using a lightGBM model for pretty much everything at the end of the day

I laughed a bit when I read this. This matches my experience exactly. Most of the crap I do daily is basically the ML equivalent of CRUD apps. Oh we need to make some data-driven decision here to optimize this metric. Typically any basic ML, especially something like LightGBM or CatBoost, works so well it's pointless trying to squeeze out a little more performance. And yeah, this is also normal-company shit, not MAANG stuff. There is so much low-hanging fruit at normal companies. Another bonus is that you get to do full-stack DS, from understanding the business problem, gathering the data, building the solution, getting it to production and showing the result. I feel like I get to keep up to date with pretty much all the dev tooling, leaving an exit back to software dev if things go to shit (I was full-stack SWE about 7 years ago), or just continuing to specialize in statistics and ML (currently applying mixed-effect models and survival analysis in a domain that no one has probably thought of applying it to).


I know R, Python, ASP.Net, Docker, Postgres, tons of fullstack, machine learning, deep learning etc and still don't even get interviewed for data science jobs. So skills aren't the whole picture...


I might ask:

(1) Do the people hiring really have an actual real need?

(2) Do these people really know what skills they need or are they asking for all the skills they can think of to have confidence that the people they hire will have enough skills, no matter what the need?

(3) Given a long list of skills, there are nearly no jobs that actually need and will use all those skills. With a long list of skills in a job that uses only a few of them, soon the skills that the person still knows and is proficient with will shrink. Then what happens to the career, that is, to the ability to have a chance to get hired from job ads that have a long list of skills?

(4) Don't forget that one reason for job ads with a collection of skills that is long and particular is because the people placing the ad already have a candidate for the job who claims to have those skills and someone is wondering how easy it would be to hire someone else with those skills.

(5) My experience is strongly that nearly no one in business with enough money to hire technical people actually (a) has work that they understand and that needs technical skills and (b) knows what skills are needed.

(6) Really, if they are recruiting for a long list of skills, then likely they are offering at best a version of gig work and not a real job and career.

(7) Learning a long list of skills prior to getting a job seems to mean there is something wrong with the job being applied for.

(8) My strong experience is that any very productive use of any specialized skills will totally torque off and threaten essentially everyone in the management chain all the way, literally, to the BoD.

(9) At Easter dinner this year, I learned a good lesson: There actually are employers who are ready, willing, eager, and able to hire people with just good, basic abilities and some background in some technical area for a job they have where: (a) They actually do have important work to be done and do understand the importance of that work. (b) Want their people to attack the work and learn any necessary material as they go. They don't ask the people they hire to have a long list of skills before they are hired.

(10) I'd suggest doing what you have to do to get a basic income now and otherwise start a business. In that business, learn the skills the business needs. Typically for the technical skills, the list might be particular to the business but otherwise is not very long.

Lesson: In the end, in nearly all of the US economy, to have a lot of money to pay people to have good jobs there has to be a business founder who found some important problem to solve, got a good solution, and got a lot of people to pay for that solution. So, BE one of those people.


Python programming. 4-6 hours/day. Work on the Project Euler problems.



