Ask HN: Am I too late for the “Data Science” wave?
199 points by kylebenzle on Nov 15, 2020 | 143 comments
I did my MS and PhD in agriculture but am looking to pivot into something that gets me out of the lab.

I'm now doing the NYC Data Science course (https://online.nycdatascience.com) but am getting concerned I may have missed the boat. What does HN think?

I work as a data scientist and have some perspective on this. There's no boat to miss; you'll probably be fine. Just keep a few things in mind:

- The fundamental skills that you need are mathematics and software engineering. Depending on your background it might take years of additional studying.

- There is a big oversupply of people for the junior-mid level data science jobs. There are more people who want to get in the field than there are jobs. If you drastically switch careers, you'll take yourself out of a field where your skillset is incredibly rare and your competition is limited and put yourself in a place where everyone else wants the same job.

- The fact that you have a PhD is going to help you. Personally, I don't think that a PhD in a field other than mathematics/computer science is that relevant but employers tend to favor applicants with PhDs mostly because there are too many candidates for any given job and asking for a PhD is just a strong initial filter. There are also research jobs within data science for which a PhD requirement (in a relevant discipline) makes more sense but these are a small proportion of all the data science jobs.

- If you're already employed with your agriculture PhD, there must be a number of opportunities for you to apply the techniques that you're currently learning without leaving the industry. That's probably the path that I would suggest - it would allow you to expand your skillset without taking big risks and you'll have more options in the future. Use the career capital that you already have and explore your options instead of making a sharp turn in your career direction that might leave you disappointed.

> - If you're already employed with your agriculture PhD, there must be a number of opportunities for you to apply the techniques that you're currently learning without leaving the industry. That's probably the path that I would suggest - it would allow you to expand your skillset without taking big risks and you'll have more options in the future. Use the career capital that you already have and explore your options instead of making a sharp turn in your career direction that might leave you disappointed.

This is real gold. If you have existing knowledge about some area, then you can learn these techniques and apply them there. You'll get the traction that you want earlier, and it will be more rewarding in the end.

If I had the chance, I wouldn't miss that opportunity. Good luck.

In general, domain knowledge is incredibly valuable and will generally trump someone who may have some more technical skills but doesn't know the field. I'd add that a lot of larger companies--heard someone from Shell talking about this a couple months ago--are training up people who have knowledge of the business with "citizen data scientist" skill sets.

Without knowing all the details of your situation, it seems at least a much lower risk path to acquire some data science skills--maybe your company will even pay for it--that you can pair with your existing domain knowledge.

Your comment will forever be underrated due to how spot on it is. It's a fundamental issue in tech that people don't appreciate. While yes, some problems are universal and general tech can solve them without industry experience, I'm pretty sure most people agree that those problems are either solved or there's an army of devs already on it. These days it's not really enough to just be a dev or data sci. You need to know a field and apply your tech knowledge in better ways than the uninitiated would ever imagine.

The problem is that real world, physical business data is taken from what one can get, not what one would want. And usually "what the business was collecting for (unrelated purpose)."

This means it has nearly infinite caveats and assumptions. A specifications doc or readout will never sufficiently express all of these. Especially if humans were involved in the data generated.

Consequently, the most useful data products are going to turn on whether or not you (did this small thing) to (correct for this obvious bias or flaw that anyone familiar with the industry knows).

>The problem is that real world, physical business data is taken from what one can get, not what one would want.

Yea, sorry, but part of your job in data sci is to collect the right data. Data doesn't magically exist and we are not stuck with what's out there. A data sci's job is to figure this stuff out. Tech has a weird culture of not doing their job. Kind of like the Zip Recruiter ads. "Working as a hiring manager, hiring new people is the worst part of my job." Bitch, that IS your job. If you don't do that, what's the point in keeping you around? Bee keepers collect honey. Yea it's not exactly easy if you're not careful, but they don't bitch about it because they knew what they signed up for.

Data sci/analysis is about collecting and analyzing data in ways that are not straightforward. Because if it were easy and didn't require any effort, why would they be needed?

You're both right. Sometimes we have to work with the data we have. Other times we have to create or buy the data we need.

Some companies aren't experienced at building infra to collect data and don't know how to do it. Or their environment is too complex or expensive to sample data from. The data scientist's job in such cases is to do their best with what exists, show success and make a business case for investing resources into data collection infrastructure.

In other cases, when the required sensors don't exist and the information is critical to decision making, you can either buy the data or work with an engineering group or external vendor to integrate and build out the sensors needed. Need foot traffic data? You can buy from a data marketplace like https://datarade.ai, where there exist various vendors (like SafeGraph -- which was recently used in a COVID-19 study published in Nature) aggregating foot traffic data from cell phones. There are datasets that can be used as inferential proxies (so called "alternative data") for the actual data one needs.

Need to collect in-store data? I was at the NRF conference (the world's largest retail tech conference) in NYC back in January and there were a boatload of vendors hawking different types of retail analytics sensors.

In certain small scale operations, you can even engage field operations and get the in-store retail staff to help collect data and upload manually. (you'll need a good relationship with the field supervisor of course)

Sometimes the data does exist but is inaccessible, say in the ERP or in some proprietary format -- then you have to negotiate with certain business groups or with OEM vendors in order to get the data out.

It all boils down to whether the data has value that exceeds (by a margin) the cost of collecting it. If the answer is yes, there's often a way to do it (albeit sometimes imperfectly).

Is it part of the data scientist's job description to create or participate in creating data collection infrastructure? I guess this depends on the company but for many companies the answer is yes.

I agree with you too. I think part of it is that execs and management don't fully grasp the job and its implications if you shortcut too much. At the same time, too many data sci are in it for the keyword/sexiness of the job and are not of the personality type to take hardline stands. Inexperience leads to a lack of trust from higher ups. A lack of backbone from the experienced results in more incompetent work, which results in more lack of trust. Experienced personnel leave, more inexperienced people come in and do things the cheap, shortcut, buy-bad-data way, with no backbone to push back when they see it... and you see how this can spiral into a shitshow that I've been noticing in some consulting projects I've been in.

But yes, data collection should be part of their job. I'm having a hard time understanding why the person who analyzes the data shouldn't at least have a say in what data is collected.

Have you collected data from and deployed products to a 2000+ store environment?

Okay, how is my argument changed if I answer yes or no? Is what you're talking about a data sci's responsibility or not? If collecting, analyzing and deploying data in reports or a db is too difficult for you, data sci isn't for you. I'm not telling them HOW to do their job. I'm clarifying that you have to DO the job if you signed up for it. Don't like it? Get out. We've all screwed up by taking jobs we didn't like. Nothing wrong with that. Get out of the kitchen if you don't like the heat or the smell.

I'm curious about the relative experience you're talking from.

You're trying to make a point about the fact that you need to push the business and/or obtain data yourself, but I'm saying that can be a vastly more difficult problem than you think (or just flat impossible) at scale.

Why should a single data scientist be responsible for obtaining data from 2000 stores? Data at scale requires people at scale. Data engineers would do this job.

There is a book "Range: Why Generalists Triumph in a Specialized World" which claims there are domain-specific problems that are more likely to be solved by people originally outside that domain.

I don't doubt there are examples where a fresh set of eyes and lack of knowledge about what can and can't be done can break out from "the way we've always done things." But it's probably not the way to bet in the general case.

The book claims that "generalists" are the rule, at least if we look at the very top of certain fields, e.g., Nobel laureates.

The data science professionals my company has had on staff have been a joke. It’s similar to the wave of inexperienced developers hired in the late 1990s; employers didn’t have the skills to prove that the contractors didn’t know what they were doing, so a ton of money was poured in with big dreams.

Part of that is that they don’t understand what we do and weren’t trained. The other part is that the business was just told they needed data people.

What businesses need today are to understand what they’re interested in first, and then hire people with proven experience and knowledge to accomplish those goals.

Having a PhD in any applicable subject in addition to sufficient practical knowledge of data science, machine learning, and AI would be a substantial plus for a niche position in the industry, a non-profit, or even in finance.

Don’t give up.

> practical knowledge of data science

This is key. What do companies actually want when they hire a data scientist? Actionable products that make their business better.

What does it take to produce actionable products? A data strategy (collection, ingress, normalize, enrich, store, expose), a compute provisioning strategy, data engineering (pull from source system(s), land in target stores in a reliable, available, automated manner), data science, and data application (reporting, integration with target systems, app development).

Which of those components do they typically have? A data scientist. Because they just hired one.

Career-wise, the more unifaceted your skillset is, the more you're limited to employers that already have all the other pieces in place. Which effectively limits you to very large enterprise (~T100).

Start to be able to fill some of the other roles yourself, and you can compete for and succeed in smaller and more interesting opportunities.

>the fundamental skills that you need are mathematics and software engineering

So much this. If I have to interview another junior-level DS who has a MNIST project in their github and still somehow can't manage fizzbuzz or a fibonacci function I'm probably going to take up religious asceticism.

EDIT: I said junior, but I meant Senior. We're talking people with PhD's who claim to have done extensive software engineering in previous roles.
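For reference, the two screening exercises mentioned above are small. A minimal sketch in Python follows; the language and the function names are illustrative assumptions, not something the commenter specified:

```python
def fizzbuzz(n: int) -> list[str]:
    """Return the FizzBuzz sequence for 1..n as a list of strings."""
    out = []
    for i in range(1, n + 1):
        if i % 15 == 0:          # divisible by both 3 and 5
            out.append("FizzBuzz")
        elif i % 3 == 0:
            out.append("Fizz")
        elif i % 5 == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return out


def fibonacci(n: int) -> int:
    """Return the n-th Fibonacci number, with F(0) = 0 and F(1) = 1."""
    if n < 0:
        raise ValueError("n must be non-negative")
    a, b = 0, 1
    for _ in range(n):           # iterative, so no recursion-depth issues
        a, b = b, a + b
    return a
```

Calling `fizzbuzz(15)` and `fibonacci(10)` is enough to exercise every branch of both functions.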

Try filtering them on their specific ability to visualize things. My neighbor was 80+ year old math guy who could see graphs of equations in his head.

My hypothesis is that good stats people probably don't visualize (Aphant) or visualize very specific types of data in a unique way. Without visualization, people tend to fall back to logical thinking - or emotional thinking, depending.

For example, I have a friend who can look at 2D seismic data and see what the underground formation looks like in his mind, in 3D.

The person you are replying to seemed to be complaining about their lack of programming ability, not statistical ability, so how would this help in their situation?

Yup, I think I and that child are talking past each other. :-(

It's good though because they raise some really valid points about the importance of intuition. I joke with my colleagues that all we're doing is encoding data as a hilbert space and slapping an algorithm on it, but that elides the fact that intuition like the child is talking about is important for knowing how to build that hilbert space.

I think there is an issue with the focus of the role. If the job is more focused on programming, then the candidate must be at least proficient in coding. Otherwise, if you need a mathematical modelling person, then you need to look into relevant training background (undergrad degree etc). A lot of my colleagues in my research institute are from a physics background, because molecular biology requires a lot of statistics. They are ok coders but what they really contribute is the modelling part.

If I have to interview another whose only ML tools are GLMs, random forests and boosted trees (only ever with one hot encoding, of course) I’m going to do the same.

Those and SVR's get me through 99% of the algorithmic part of my job, though!

The rest is some unsupervised stuff like k-means and PCA.

What would you like to see instead? (INB4 CNNs/RNNs other deep learning topics)
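For readers unfamiliar with the one-hot encoding mentioned above, here is a minimal sketch in plain Python. The helper name `one_hot` and the crop labels are hypothetical; in practice a library encoder would usually be used instead:

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns (categories, rows): the sorted distinct categories, and one
    0/1 row per input value with a single 1 in that value's column.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1    # mark the column for this category
        rows.append(row)
    return categories, rows


# Hypothetical example data
categories, rows = one_hot(["wheat", "corn", "wheat", "soy"])
```

Each distinct category becomes its own column, which is why the encoding gets unwieldy for high-cardinality features, and is presumably part of the complaint about using it "only ever".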

As a software engineer who has worked intimately with data scientists in Big Tech companies, this comment sums it up exactly.

One minor thing to add is that not only is the data science field overflowing with junior/mid candidates, but the entire computer science field as a whole.

What happens in seven years when those candidates become experienced? Will they have dropped out? Will there be a glut of experienced software developers?

I sincerely doubt a large portion of them are cut-out for the sector at all. That's the issue.

> There is a big oversupply of people for the junior-mid level data science jobs.

Honest question is there any field where that isn’t true right now? Seems like there are no junior jobs in any sector.

German here. From my perspective it's hard to find

- embedded systems programmers, meaning people who know real-time systems, have good C/C++ knowledge, know their way around the Linux kernel and are also able to do basic things with a scope.

- good(!) C++ programmers in general

- people who know devops and software development infrastructure

C++ might not be sexy, but there's a vast amount of legacy software out there which is not going away anytime soon.

It's funny because I have all these (though it's been a few years since I've been doing C++ development daily), yet if I were laid off tomorrow I know from experience it could take me a whole year to find a new job in any of that domain.

By the way, "junior C++ developer" is a euphemism, because no company will hire a junior C++ developer. It's one of the most difficult programming languages, and it can take years of practice and study to be able to use it.

That is true. I found that for embedded C++ to be successful, you need to have some good (excellent sometimes was harmful, but that's a different topic) C++ _as well as_ good domain and optimization skills. It's a combination that is more and more scarce. C++ tends to be used in domains where size and speed matter. You need to know what the compiler and machine will do with your code to be fast. And what's more, only the domain knowledge will let you pick the right data structures to be fast and compact in the first place.

And there are a lot of foot guns around... ;-)

> I found that for embedded C++ to be successful, you need to have some good (excellent sometimes was harmful, but that's a different topic)

Why were excellent C++ skills sometimes harmful?

First of all, it leaves the team behind if you use features that are too advanced. Makes collaboration hard if you have wildly diverging skills (in both ways).

Also, it did in our case take you away so far from the bare metal, that the code was elegant, but slow. It did not play well with our static allocators and allocated/deallocated way too much, especially temporaries.

You might, e.g., check out talks like [0].

[0] https://m.youtube.com/watch?v=NH1Tta7purM

I agree. Another thing to consider is portability. You might have to port your software to obscure platforms with bad C++ compilers where compiler bugs are not uncommon at all, so it's better to stay clear of more advanced features. Also, you might have a legacy system which does not have modern C++ compilers, so you might have to restrict yourself to C++03. Another thing is code size, so templates should be used judiciously.

These replies both don't seem like they describe someone with "excellent" C++ skills. Sounds more like someone who knows a lot of advanced features of the language, but doesn't make the correct tradeoffs considering portability and/or performance.

Maybe it could be that he didn't know all the requirements up front and his mediocre colleague accidentally wrote code that better suited the company's unstated requirements (e.g. portability to niche systems). But that's hardly the most likely explanation.

Ah, ok. In my head I do have a distinction between excellent C++ and excellent domain skills. So that person had excellent C++ skills but did not properly choose which C++ subset to use for the domain. From your comment I think you would say "excellent" implies "in all relevant aspects", whereas I just meant the pure programming language. What I tried to write in my first comment is that it is more and more difficult to find both combined in one person.

I think for C++ programmers either you have to pay a premium and hire senior people in their 40s (like 60s and 70s for COBOL), or find someone who is willing to learn and have the patience to get him up to speed not now, but 3-6 months from now.

I feel that it's rare that companies choose the second road so usually they pay a premium.

But all three are by no means junior.

I live in Amsterdam. What I'm seeing here is that the security field is lagging behind the data science / web dev field. It's not easy to find junior jobs, but a lot easier than data science / web dev.

The most fun example that I currently have is that my LinkedIn profile is super wide. I market myself as a generalist software dev, because that's what I am. I'm not a specialist and I've dabbled in almost everything.

Naturally, no recruiter contacts me because of that. But you know who does? A European company looking for a reverse engineer.

I found it quite surprising that they were capable of reading that I have some IDA Pro experience, because it's in the fine print of my profile and not at all readable when you skim it.

> I found it quite surprising that they were capable of reading that I have some IDA Pro experience, because it's in the fine print of my profile and not at all readable when you skim it.

Wouldn't searching profiles for "IDA Pro" find it?

I guess, but I wouldn't be on the top of the list as I am mentioning it once and there's also a lot more mobile and web buzz words on my profile.

They don't read your profile at first. Sometimes they don't even read it at all before contacting you.

They search for keywords, and rely on the search algorithm prioritisation and a good looking one-line summary of your profile to pick you out of pages of search results.

They'll find your "IDA Pro" if that's in the list of terms they are looking for, no matter how deeply it's buried in the description of your past jobs. If they don't read your profile they might not even know that it's buried.

One factor here is that 'senior' doesn't really mean anything anymore other than 3+ years of experience.

"There is a big oversupply of people for the junior-mid level data science jobs."

I feel this is true for most tech jobs. I see tons of senior job postings, but relatively very few entry-level or mid-level postings.

I don't exactly agree on why employers prefer PhDs even if not exactly in the field.

Having a PhD shows that you have committed to one field hard enough and long enough to be granted the title, even if it is a PhD in philosophy.

Also, if you switch from another career try to figure out how you can use your past experience to differentiate yourself from other people.

For example, I have studied theoretical math and I have spent 20 years as software developer working on varied projects from embedded to algorithmic trading to backend to frontend applications.

I am currently switching to electronics and will use my knowledge to work on projects that require high level of both math and software engineering skills. Think in terms of realtime control systems, building reliable and performant stuff (reliability to differentiate from competition), etc.

Never think of moving as rebooting your career. Always start with mindset of building on top of what you know -- always move forward, not backward.

A little change in mindset goes a long way.

Could you please say more about how you find maths-intense software engineering niches and projects? What niches are there?

I too am a pure maths PhD with a background in software, but I am somewhat disappointed with the data science world.

I actually think that we agree on the PhD requirement.

> Having PhD shows that you have committed to one field hard enough and long enough to be granted the title even if it is PhD in philosophy.

Yes, exactly. That's what I meant by "strong initial filter".

This reply should be made into a billboard and placed in front of every "Data science bootcamp" out there.

> Depending on your background it might take years of additional studying.

Excellent summary but this one bit is trying to gloss over reality. It will most certainly take multiple years of additional studying when you come in contact with anything related to software engineering, because that's just what it is.

Of course there is no shortage of people who learn one language, one framework, one tool and sit on their ass depending on the people they work with to pick up the slack. If that's you, none of the things being said in this thread matter - just go get your job, make your money and live your life :)

I slightly disagree about your reasoning on the PhD thing, for some roles. Having a PhD in an experimental science is a (weak) indicator that you’ve learned how to design, conduct, and analyze an experiment on your own.

What kind of math? It takes me a while to learn math so any focused area would help cut down the time taken immensely (not that I don't enjoy learning tidbits like supremum/infimum)

Multivariable calculus, differential equations, linear algebra, probability theory, statistics.

Ranked from most to least essential, I'd suggest: probability, statistics, linear algebra, basic calculus, operations research, multivariate and partial and vector calculus, diff eq.

True but order of importance is different from order of study. You should really start with calculus because you can't study probability without it.
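To illustrate why calculus comes first: for a continuous random variable $X$ with density $f$ (standard textbook notation, not taken from the thread), both probabilities and expectations are defined as integrals:

```latex
P(a \le X \le b) = \int_a^b f(x)\,dx,
\qquad
\mathbb{E}[X] = \int_{-\infty}^{\infty} x\, f(x)\,dx
```

As a worked example, take an exponential distribution with rate $\lambda > 0$, so $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$. Computing its mean already requires integration by parts:

```latex
\mathbb{E}[X]
= \int_0^{\infty} x\,\lambda e^{-\lambda x}\,dx
= \Bigl[-x e^{-\lambda x}\Bigr]_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\,dx
= 0 + \frac{1}{\lambda}
= \frac{1}{\lambda}
```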

>There is a big oversupply of people for the junior-mid level data science jobs.

Which shouldn't be too much of a problem because they have the skills to analyse the market and find a local optimum that suits them.

Have you explored opportunities to apply data science to agriculture?

I'm interested in this space; I do some work with agricultural data acquisition hardware and software (e.g., soil moisture, environmental conditions, sap flow, plant/fruit growth monitoring, irrigation, fertiliser application) and I'm interested in ways this data could be used in predictive models, but I'm not at the stage of being able to focus on that aspect yet (still getting the core data logging/display tech working well, though we’re nearly there).

Feel free to get in touch (email in profile) if you'd like to connect and discuss.

To address your question, I think the world is still mostly at a very basic stage in its use of data analysis and statistics. Most of the talented people are employed on big salaries by a relatively small number of large companies with huge budgets for specific applications (e.g., ad targeting, risk assessment, algorithmic trading).

But outside of that, not much is happening, so I think there are big opportunities to apply data science in new fields and make the benefits more widely distributed.

>apply data science in new fields

I see what you did there.

But seriously, if you have a background in agriculture (and don't hate it) and want to get into data science, aim for the intersection of the venn diagram between agriculture and data science.

I understand agriculture is getting quite technical and data driven these days, and that can surely only become more the case in the future. Especially if vertical farms and robot tractors become a real thing.

This. Leverage the strengths you already have —agriculture— to help you get to where you want to get to. In another field, a junior data scientist with an agriculture phd is just a junior data scientist. In your field, you’re either a phd with a strong extra skill in data science, or a junior data scientist with a superpower.

On a related note, I can't help but be excited by what Planet are offering in the agriculture sector.

Whoever can crack ground level data with something like Planet's bird's eye view is going to be onto something groundbreaking.

The skill sets most needed for vertical farms probably have nothing to do with agriculture. If that is your goal then you should research or design fusion reactors instead.

It is hard to imagine that all those millennia of knowledge about growing crops are going to be irrelevant, even if you manage to take soil, seasons and pests out of the equation.

It would be more intriguing to apply agriculture in data science.

> Have you explored opportunities to apply data science to agriculture?

This is the most straightforward route imo. Just pick up some of the skills required to get going, which it sounds like they are based on the post, and just start tweaking with stuff in your field. They already have a great advantage of specialized knowledge about the subject they would be applying it to and Ag, from someone who grew up and worked on a row crop farm, seems very ripe for exploration through data science.

I just started getting interested in sensors and ranching/ag. Can you link me any resources? Curious about the right terminology, some players in the space, etc. Any help pointing me in the right direction would be really appreciated!

Definitely not. Let me put things in perspective. There are two types of companies, Company A - statisticians working as Data scientists, good engineers deploying models in production.

Company B - have no clue what ML or AI is and feeling the heat. They could be a multi million dollar company or a small SMB.

You will always find both A and B, at least until ML and AI are well democratised. They are not, not even close. We are still at the early stage of the curve, but there will be rapid growth in the next 5-8 years.

You have a few options:

1. Start with SQL. It’s not hard; join as an analyst and learn to code. Make sure the team or product you join deploys models.

2. Learn basic Python and some orchestration tools (Airflow, Spark or the AWS/Azure equivalent). Join as a data engineer, along with basic SQL skills.
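To give a sense of how small the "start with SQL" step can be, here is a hedged sketch using Python's built-in `sqlite3` module. The `yields` table, its columns, and the sample rows are hypothetical and chosen only for illustration:

```python
import sqlite3


def average_yield_by_crop(rows):
    """Load (farm, crop, tonnes) rows into an in-memory SQLite table
    and return the average yield per crop, ordered by crop name."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema, for illustration only
        conn.execute("CREATE TABLE yields (farm TEXT, crop TEXT, tonnes REAL)")
        conn.executemany("INSERT INTO yields VALUES (?, ?, ?)", rows)
        return conn.execute(
            "SELECT crop, AVG(tonnes) FROM yields GROUP BY crop ORDER BY crop"
        ).fetchall()
    finally:
        conn.close()


sample = [("farm_a", "corn", 10.0), ("farm_b", "corn", 14.0), ("farm_a", "soy", 4.0)]
result = average_yield_by_crop(sample)
```

A `GROUP BY` with an aggregate like this covers a large share of day-to-day analyst queries, which is one reason SQL is a reasonable first step.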

I don't think I've been to two companies that had the same or even a similar definition of 'data science'. Often it meant something like: see if you can use Tableau to produce new insights.

Absolutely this. My side of the engineering department has taken a hard stance on what our definitions of Data Analyst, Data Engineer, Data Scientist, Machine Learning Engineer and Applied Scientist are. We have had a few issues with people wanting a different job title while interviewing and we supply them with our description plus a path from the position we believe they are to what they want to be. The ones who have taken these offers have been some of our best hires.

On the other hand, the analytics/sales teams have many DSs and MLEs, and a DA is basically a “junior” for them, but they routinely have to reach over to an engineering team for pretty basic SWE tasks that they are supposed to cover themselves.

It's been a big deal for hundreds of years, most notably in economics and the military. Any company worth its salt was doing "data science" in the 1950s but calling it statistics or logistics or business intelligence, to grab a phrase from decades past.

What boat do you think you're catching besides a reliable middle class job?

I think the difference now is that even a small company can aggregate data from all over the world which makes it more valuable- therefore a good data scientist/ engineer can be paid more than just ‘middle class’

Genuinely shocked by this question. There is no part of the data science stack that is in any way settled. Even more broadly I don’t think statistical best practice is at all established, or at least well distributed.

You haven’t even missed the boat in terms of being able to make money off raw buzzwords and zero skills.

At the very basic technical level, there’s infinite work to be done optimising machine learning systems. This includes not just the fashionable issues of faster more accurate (or even less accurate in terms of floating point!) deep learning, but also moving Bayesian approaches like MCMC to multiple cores and GPUs.

There’s infinite work to be done on finding the right topology for a machine learning system. This applies not just to neural network layers but also to traditional stats (e.g. multiverse analysis).

There’s infinite work to be done in understanding, cleaning and preparing datasets. Something as clunky as tidyverse can’t be close to the final form here. We’ve only just started talking about feature stores etc.

There’s infinite work to be done improving notebooks, integrating better software engineering practise into the workflow but also in terms of productionisation of models created therein.

All this is just platform stuff as well. It doesn’t even touch on the fact that businesses everywhere are terrible at formulating questions to be answered by stats, terrible at communicating those answers and terrible at even knowing this is a valuable endeavour in the first place.

I cannot imagine a boat harder to miss.

And hope that boat is not the Golgafrinchan Ark Fleet Ship B:


And yet, what would we give for more telephone sanitizers right now?

I've worked at productionising data science models for the past 4 years. I'm currently responsible for delivering technology platforms to ~180 data scientists.

I find the data scientist label misleading.

Roughly 70% of of the data scientists I've encountered are actually Excel analysts with little experience outside of a Windows desktop bar Facebook on a Mac. They're unable to use basic software engineering tools such as git, vscode and python. Excel users and their managers are hostile to solutions that aren't excel-like. They will fist-fight you if you restrict them from downloading and exploring data on their computer. Few understand their compliance/legal obligations.

Another 20% are familiar with a wide range of tools - such as Matlab, R, Jupyter notebooks and various ML/AI toolsets. As developers they're unaware of the tech stack, short of "I installed anaconda and it doesn't work" but are happy to work in the cloud and learn new tech. They understand PII requirements and memory/cpu limits but don't always demonstrate the latter in practice. Nonetheless they produce the bulk of your analysis, having studied classification, and are reasonably cost efficient if you pair them with a SWE.

The final 10% have mastered containers, venvs, wheels, cloud sdks and how to configure their software in an environment-independent way. They require help to achieve production quality but are great self-starters. Given enough time and support they're able to quickly replicate this effort and teach others. As relative superstars they're in high demand which makes capacity planning difficult. This pushes up their premium.

IMO the best data scientists are 1 in 10. Because we're desperate for quality, almost anyone can assume the title, meaning the market is open to newcomers - you just need to be skilled in Excel (harder than it seems - most developers can learn a lot observing an analyst/consultant use Excel).

To answer your question: No - you're not too late. Just by posting here I expect you'll be in the top 30% - an asset in demand.

This is a deeply misleading (though somewhat accurate) comment.

The reason it's misleading is because the 70% above (who may be called data scientists) are not actually data scientists, at best they are data analysts.

In general, the core difference between data scientists and data analysts is that the former can code in at least one language (SQL doesn't count, unfortunately).

However, because the term data science became so popular, everyone re-branded their analyst roles as data scientists leading to this concern.

Additionally, the post I'm replying to is pretty biased, as the OP talks about productionising models. While this is a major facet of DS work, it's not the whole thing. TBH, I can find people to productionise models a lot quicker than I can find people who can figure out what to model, and how to measure it.

Some of those people are most comfortable with Excel, and while I'd prefer they used a different tool, I can't argue with their output.

Also, the OP here is focused on deployment of Python ML models, which again is a subset of a very, very broad field.

That being said, I agree with most of the categorisations, except that the two critical attributes of good data scientists are a strong background in statistics and data common sense.

Data common sense is a weird attribute: you look at the numbers and can tell whether they are reasonable. For example, if you are running a mobile gaming company and see an ARPU of $5, something has either gone horribly wrong, or you're going to be a billionaire (assuming you have equity).
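To make that concrete, here is a rough sketch of the kind of check I mean. The function names and the plausible range are made up for illustration; the right bounds depend entirely on your own business and history.

```python
def arpu(revenue, active_users):
    """Average revenue per user over some period."""
    if active_users <= 0:
        raise ValueError("active_users must be positive")
    return revenue / active_users


def looks_reasonable(value, low, high):
    """True if the metric falls inside the range you would expect to see."""
    return low <= value <= high


# Hypothetical numbers: 50,000 in revenue across 10,000 active users.
value = arpu(50_000, 10_000)

# The expected range below is an assumption, not an industry benchmark.
if not looks_reasonable(value, low=0.01, high=1.00):
    print(f"ARPU of {value:.2f} is outside the expected range, check the pipeline")
```

The point is not the code, it is having an expected range in your head before you look at the number.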

This attribute is actually not that common amongst DS people, so it tends to be the limiting factor, rather than ability with containers and deployment (which I do agree is very important).

> are not actually data scientists, at best they are data analysts.

Unfortunately the phrase's usage has been corrupted by HR departments, and the BA types of job listings now outnumber the "real data scientist" listings.

Didn't Facebook start this by calling their Data Analysts Data Scientists?

No, it was a different role. The original data scientists were people who could both run large scale social science experiments and write map reduce jobs to analyse them.

Unfortunately, it was such a great name that everyone stole it, and they eventually had to call all of their analytics people data scientists (as otherwise they couldn't hire).

I remember being very angry when they changed all the product analytics people to be data scientists as many of them (the ones I knew, at least) were strictly SQL monkeys.

It's up to you. You should have much more knowledge in the agriculture field than most people.

I believe it makes no sense to discard the knowledge you already have. You can combine data science methods with that knowledge, either by going to work for a company that applies new methods to agriculture or by creating one.

Do you want to spend your life doing surveillance and spying on people like everybody else? This is fashionable but people are starting to resist and develop antibodies for it as they understand it more. The TV or the phone that I bought spying on me is not acceptable.

Agriculture will grow enormously in the future with things like LED and other methods to give energy to plants, or plankton or whatever. Drones controlling pests or humidity or temperature. Using natural insect predators for bio farming. Growing materials like cotton directly from cell cultures.

The methods that are used today for growing marijuana indoors will be applied for more common things when prices go down.

Nobody is better placed than you to identify the markets that will grow in the future. It is also a very good idea to know (or associate with someone who knows) economics, marketing and selling.

Reasoning from first principles is extremely useful for identifying new waves that will carry you in the future with no effort. https://www.youtube.com/watch?v=NV3sBlRgzTI

All those things are hard problems. I have worked on those in the south of Spain and in Holland as an engineer and entrepreneur.

The real life is not Academia, your title means nothing if you can not apply it and give results, but means a lot if you can. So you will need some time to adapt to a different mentality.

I work in ed-tech – currently building an intro to data science course – you have by no means missed the boat. The field will only continue to grow, rapidly, for quite some time.

On top of that, you have something incredibly valuable to a budding data scientist: domain expertise. Being able to manipulate data is great, but to most effectively solve real-world problems you have to understand (or communicate very closely with someone who understands) the main problems in a domain. I can't count the number of times I've heard scientists frustrated by their lack of data skills, and data scientists frustrated by some arcane domain fact that stymies their model production.

Far from being a liability (or just a sunk cost) your background in agriculture will make you extremely valuable as a data scientist.

If I may give you some advice: Python, R and deep learning are sexy but the most important skill to start in data science is SQL. It will help you get your first data role and will be your main tool to solve 97% of the problems you will ever face.

Bonus point: it is very easy to learn.

> If I may give you some advice: Python, R and deep learning are sexy but the most important skill to start in data science is SQL. It will help you get your first data role and will be your main tool to solve 97% of the problems you will ever face.

That's a bit of an exaggeration but I will say: most of data science is not the fun stuff. Everyone goes into the field thinking it's all about doing machine learning to uncover astounding insights that will fundamentally transform the business.

In reality about 5% of most data scientists' time is spent doing that. Maybe less. The bulk of the work is getting the data and cleaning it up, doing a tiny bit of the sexy stuff, then writing it up into a report or a presentation to give to people who will either not believe it or scoff because they already knew it.

I work as a data scientist and every statement in your post sounds incorrect to me. SQL is useful but it's far from the most important tool. It's certainly unlikely to land you a job. It definitely won't solve 97% of problems. You either have a very skewed perspective of what data science is or you spend a lot of time on LinkedIn/Medium where bad advice like this is parroted a lot.

Think about this - there are a bunch of other software developers who know SQL very well. If your advice was true, then every backend developer would be able to immediately land a data science job and do great at it without having to learn a bunch of math, ML-specific stuff and a whole other tech stack.

A surprisingly vast contingent of software developers do not know SQL, let alone know it very well. And as others have mentioned in this thread, data science looks different at different companies.

For many, the math and "ML-specific stuff" ends up being a very small part of the process. For them, data quality and data cleaning take up the overwhelming majority of hours in a given project, and SQL chops will take you much farther in that kind of an environment.

Plus SQL is not going anywhere anytime soon. So worst case scenario, OP will learn a skill that's not likely to be dated in a few more ticks of the hype cycle.

I think you're both right.

I find it hard to imagine successful data scientists who don't know SQL.

OTOH, I find it hard to imagine (even though I've met some) successful data scientists who only know SQL.

I suppose it's necessary but not sufficient.

I don’t think it’s that surprising. Most web dev is just doing mundane CRUD, and mostly through ORMs or other db abstractions. If you don’t practice something how can you be expected to be good at it?

My job title is data scientist even though you may not think of me as a 'true' data scientist. Just to give you some context I work in an ecommerce startup. Depending on your industry and the size of your company things may be very different.

I maintain one machine learning model that is very core to our business but doing 'machine learning' is a very small portion of my job.

> Think about this - there are a bunch of other software developers who know SQL very well. If your advice was true, then every backend developer would be able to immediately land a data science job and do great at it without having to learn a bunch of math, ML-specific stuff and a whole other tech stack.

In some companies data scientists are very software development oriented but that is not the case everywhere. Think about this: software developers who know SQL very well usually don't like cleaning data, they don't necessarily have the interpersonal skills required to solve business problems, they are not necessarily interested in solving business problems, and they may tend to think that more software is the solution to all problems.

> there are a bunch of other software developers who know SQL very well. If your advice was true, then every backend developer would be able to immediately land a data science job

I fully disagree. Most backend developers don't know SQL beyond their ORM library or CRUD statements. The business intelligence world has utilized SQL to analyze data and make effective business decisions for 40+ years.

ML is 90% hype to check a box for investors, and the actual business problems could be solved by a semi-competent analyst armed with Excel or SQL, not a bunch of overpaid "scientists" who completed a few Andrew Ng courses.

Totally agree. I believe the set theory thinking one gains with SQL helps to deal with tables (databases) in any framework.

SQL can become super tricky as well (depending on the context). Say you want to get the list of users who were active for 'n' consecutive days from a dataset that has daily user activity for a year. It's not very difficult but needs some effort.
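The consecutive-days question is usually handled with the "gaps and islands" pattern. A minimal sketch, run through Python's built-in sqlite3 so it is self-contained; the `activity` table, its `user_id` and `day` columns and the `users_with_streak` helper are made up for illustration, and the window function assumes SQLite 3.25 or newer:

```python
import sqlite3

# Within one user's ordered days, (day number - row number) stays constant
# across a run of consecutive days, so it can serve as a streak identifier.
QUERY = """
WITH days AS (
    SELECT DISTINCT user_id, day FROM activity
),
numbered AS (
    SELECT user_id,
           CAST(julianday(day) AS INTEGER)
             - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY day) AS grp
    FROM days
)
SELECT DISTINCT user_id
FROM numbered
GROUP BY user_id, grp
HAVING COUNT(*) >= ?
ORDER BY user_id
"""


def users_with_streak(rows, n):
    """Return users active on at least n consecutive days.

    rows is an iterable of (user_id, 'YYYY-MM-DD') tuples.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE activity (user_id TEXT, day TEXT)")
        conn.executemany("INSERT INTO activity VALUES (?, ?)", rows)
        return [row[0] for row in conn.execute(QUERY, (n,))]
    finally:
        conn.close()


sample = [
    ("a", "2020-01-01"), ("a", "2020-01-02"), ("a", "2020-01-03"),
    ("b", "2020-01-01"), ("b", "2020-01-03"), ("b", "2020-01-04"),
]
print(users_with_streak(sample, 3))
```

The same query shape should carry over to most databases with window functions, though the date arithmetic differs between dialects.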

However, for a data science beginner, SQL is the best place to start.

"However, for a data science beginner, SQL is the best place to start."

I totally agree with that statement. Being a beginner myself in the DS field, I'm living through this right now in my job. And, as a plus, working with SQL every day is also helping me a lot to have different perspectives in handling the Python/Pandas DataFrame.

I'd agree with this. I'm not a data scientist but work closely with a few and the jump from the academic world to business/govt seems to have been jarring.

I think previously they'd been used to consuming data from exports and CSVs, scraping websites and plugging into APIs directly. Having to navigate (often messy) database schemas wasn't what they imagined they'd be doing!

SQL is almost never used in algorithmic trading by data scientists.

Learn Python, it is used universally.

Don't learn R.

Why not learn R? In the last year I've spent around 80% of my time working with R, coming from the last five years almost exclusively with Python, and there are some great reasons to use R. Although if the tidyverse didn't exist I'm not sure I'd be saying that. I find that suite of packages together to be a very cohesive set of tools for doing data science.

Three reasons (I know both languages well): (1) R is used much less in the data science industry. (2) Python is a more universally useful language: if he learns it for data science then he can easily write utility scripts, build a back-end, etc. (3) The overlap between R and Python capabilities is so significant that it would be a waste to start with R; I would only suggest picking it up if he needs some niche package that he can't get in Python.

About the best time to ‘invest’ in anything that’s close-to-certain of yielding results: “The best time to plant a tree was twenty years ago. The second best time is now.” – Chinese proverb

Advanced data analysis / machine learning isn’t dated or old-fashioned at all, and I guess it will stay (or become even more) relevant for at least another decade. Not all ships have left the harbor.

As with any role/skillset, the explosion of data science demand has led to an explosion of "data scientist" supply. When demand was greatly exceeding supply, you naturally saw the definition and requirements of data science roles loosening. When I contact companies about DS roles, I like to ask what they mean by "data science". Oftentimes you would find roles that were some mix of: SQL analyst (no DB admin skills required), Python hacker (no "real" software eng skills required), and basic stats (can you calculate a confidence interval?).
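For reference, the confidence interval question can be answered in a few lines. A rough sketch, assuming a normal approximation for the mean (for small samples a t-distribution would be the more careful choice); `mean_confidence_interval` is a hypothetical helper name, not something from a library:

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean of a sample."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    centre = mean(sample)
    # Standard error of the mean, using the sample standard deviation.
    std_err = stdev(sample) / sqrt(n)
    # Two-sided critical value from the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre - z * std_err, centre + z * std_err


low, high = mean_confidence_interval([1, 2, 3, 4, 5])
print(f"95% CI for the mean: ({low:.3f}, {high:.3f})")
```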

What I've seen in more recent years with growing supply and maturation of departments is the need for specialization. Can you do hardcore statistics? Are you an ML practitioner? Are you a data architect? Basically, the blended role DS hacker is more commonly (and correctly IMO) relegated to various analytical and strategy roles.

Honestly you haven't missed the boat and you don't need a formal education, but I would highly recommend having a depth of skills in one or more areas of data science generally with an example or two to back it up. Basic skills are just table stakes at this point.

1970- Hey Guys I want to start programming with the PDP-10. Did I miss the boat?

1980- Guys I want to learn about micro-controllers, is it too late?

1990- GUI programming

2000- Linux, Internet, you name it

2010- JavaScript

In 30 years' time (at the very least) there will still be Data Science. So if you are really up to it, it does not matter whether you should have started 5 years ago or now. If you suck at it or really don't like it, it would make no difference either.

But there's been Data Science jobs since at least the 50's.

50-60's: Operations Research

70's: Statistics

80's: KDD (knowledge discovery in databases)

90's: analytics (statistics again)

00's: Data science

10's: ML/AI

20's: ???

The tools and problems may have changed, but the core skills (statistics, some coding and data awareness) are identical.

For those here who are just "taking a DS course": you have definitely missed the boat (IMO)

For those with solid Math, Stat, Tech skills that require years to master: it is your time to shine

As someone with a PhD in a different field who's made that switch, I actually have to disagree with most of the advice here.

This is a bit too bitter and jaded (I'm not quite this pessimistic), but I think it needs to be said to counter a lot of the rosier advice.

Having a PhD in a non-CS field is a _massive_ negative in the eyes of potential employers. Even if you're looking at moving into a role where your domain knowledge would be immensely relevant (lots of ag + remote sensing startups these days), you will be seen as underqualified compared to someone with only a BS. You'll be seen as underqualified compared to someone with no BS or a BS in an irrelevant field. You need to be strictly better at multiple roles than anyone they could hire to be considered, and even then you'll be expected to work for 1/3 to 1/2 of what they'd pay someone with only a BS and no experience. Folks usually despise domain experts because they see the role of their company as "disrupting" all of the prior knowledge. You represent what they want to replace and you're likely to disagree with them about key approaches.

You will be much more effective, but no one cares about how much you contribute to the company's bottom line. People only care about appearances.

The appearance companies want is a "self taught college dropout". That holds for data science and machine learning positions every bit as much as it does for developer positions, in my experience.

The upshot of this is that you likely know how to learn quickly. Pick up multiple additional skillsets.

You won't get hired because you're an expert in X field that the company needs + a data scientist. You'll get hired because you can throw together a crappy web app on short notice, or debug their crazy duct-tape-and-glue CI system, or save a lot on their AWS bill by switching some things around. You need those skills _on top of_ being a domain expert and a data scientist.

You have to be able to do more than anyone else they could possibly hire for the role to even be considered. Otherwise, you'll never overcome the fact that you have a PhD in their eyes.

Again, that's the bitter/jaded view. Take the above with a grain of salt.

I'm going to chime in to disagree strongly with this. I've seen and worked with lots of non-CS PhDs in data science who have been terrific. It's pretty common too—there's even the Insight program that trains PhDs to go into data science [1]. I don't think it will be seen as a downside and will instead show that you know how to do research and solve problems.

[1] https://insightfellows.com/data-science

Anecdotal, but I’ve worked with dozens of PhD data scientists, and not one of them had a CS PhD. The most common one I see is physics, followed by math. (And I don’t work in a physics-related field.)

If you already have a PhD, do you have a chance to do something with big data in your domain? I'm sure there are some big data projects you can do in agriculture. PhDs in DNA research tend to have much more relevant experience than those who just go through a few training camp courses, as they have to build models for PB-level data, which forces them to use the CLI, optimize their algorithms and use advanced tools such as Hadoop, Spark etc.

Very valuable advice. DS itself is nothing unless it is implemented in a specific domain.

Yeah. I'd really want to work as a data engineer for a bio PhD who deals with PB-scale data, for free or even for a small fee...

As many people have already recommended, you should absolutely look into Applying Data Science to problems in your Agricultural field. Your domain expertise combined with skill in computing technology would be a killer combo.

PS: You might find the book Cloud Computing for Science and Engineering useful - https://cloud4scieng.org/

No, I don't think you are too late. The predictions that data science will be fully automated in the near future are, in my professional opinion, unrealistic.

I see the future of data science benefiting from better software design, e.g. transformative frameworks like pytorch and sklearn, which are powerful tools but hardly fully automated. We'll continue to need skilled workers who are current in the latest software stacks.

It also benefits from what I'll call the "lotto effect", where data scientists will occasionally multiply the bottom line by 10x or more. This is of course rare, but companies will continue to chase that fantasy and hire data scientists because it's too tempting to ignore.

My only advice would be to lean heavily into the software side of things. There are too many data scientists who are novice programmers.

Most definitely not. It’s just that the focus is perhaps more on domain expertise and the production-side of things these days, rather than manually putting Tensorflow models together, which might give the impression that every problem has already been “solved”.

No. Data Science is still a very immature field. Like DevOps 5ish years ago. Everyone wants it. Few people know what they could actually do with it. Lots of people hire for it and then let those they hire figure it out.

'Data Science' is a secular shift in how companies work, not cyclical.

It's a 'new kind of job' that's going to be here for the foreseeable future.

While competition might increase for jobs, the number of jobs is likely only to grow over time.

Agriculture PhD plus even modest data science skills is hugely in demand at the giant ag-biotech company where I work. Many of the data engineers I lead are former science PhDs. Literally all of our people strategy discussions in R&D are about how we need more people like you. There are only a handful of big science companies left in ag, but I bet the other ones have similar needs.

It's not too late, no, but can I ask: why are you interested in a data scientist role? Is that the only, or best way to "get out of the lab" given your background?

Because I must confess that I don't see the immediate connection between an MSc and PhD in agriculture and a (presumably generic) data scientist job. You should be perfectly capable of performing data science tasks in an agriculture context, but it seems to me you are asking for something different, a "pivot".

At the risk of sounding rude (for which I apologise in advance) are you asking simply whether there's still space on the current bandwagon? If so, I must advise you against it, because employment bandwagons are awful things to get on. Crowded, badly paid, poorly understood, not that useful, scarcely productive; in short, short-term and not very fulfilling.

Is it just a matter of making lots and lots of money with the skills you clearly have and that you must have worked hard to acquire? Well then, there should be much, much better placements for you, outside the lab, in the sector you studied about.

My concern is that seeking to jump on the data science bandwagon right now will only flood industry with more and more semi-skilled, half-baked professionals, who don't really understand and don't really care to understand their subject matter, similar to what has happened with software development. The world is full of bad devs who are "passionate about javascript" or something like that and who are struggling to promote their personal brand because they have no other skills than the promotion of their personal brand. Don't allow yourself to fall that low.

Edit: in the interest of full disclosure, I'm a CS graduate with an MSc in data science and studying for a PhD in AI, but I'm not looking for data scientist jobs and am not interested in them, because I find them boring, unproductive and unfulfilling. I have actually worked as a (freelance) data scientist for a while.

A lot of people are saying to leverage your agriculture PhD, but I don't see people asking if you like agriculture. Is the interest in data science because you want to get far away from agriculture, or just concern that there isn't much future in it?

I do not understand the concept of "being too late". If you are too late, what about the future generations? You can only be too late if the subject is already dead and not used anymore.

Being too late means that, like in the field of agriculture, all the fruits of innovation are already taken and the rest is just hard mundane work.

An agriculture engineer, a food scientist and a botanist from 100 years in the future have just read this and they are still laughing.

Umm, there is a ton of innovation in that field too. That makes no sense.

There is huge potential around better data usage & analytics in agriculture & crop science. Your domain expertise would place you in a unique position compared to other data scientists with more of a generic background.

I've personally spoken to a few large enterprises in the agricultural sector who are just beginning to build out their data science department. It seems like the industry as a whole is just getting started in data science.

Also, there are many promising startups emerging in this space, e.g. vertical farming.

Climate change and carbon.

Ag, farmers, lawn owners are all errr... "ripe for disruption" as their genetic systems lag and only innovate at the paltry rate of once per season.

There are many ways to play the field (no pun intended) from predicting and capitalizing on misfortune to minimizing the same. I hope some will be most interested in using our talents to mitigate crop failures, maximize sustainable nutrition or find the optimal low risk carbon sink.

As others have noted, many data scientists work in Excel. This book, which teaches data science, does just that:

"Data Smart: Using Data Science to Transform Information into Insight" by John W. Foreman


This is likely the quickest way to start.

Big companies started data science, AI projects, etc. 5 to 10 years ago. Medium and Smaller ones are starting just now. So I'd say you're fine.

Yeah it's too late now, people already moved to data engineering ;-)

Joking aside, this is a very good overview of data science and engineering for the year 2020:


This might be shining a light on something surprising, after taking a Data Science course at Microsoft last year.

Try to remember that the very beginning of Data Science is often cleaning up data in Excel, then learning to do it with Excel functions, and then learning to do the same in Python.

I wonder with schools producing CS graduates at such high rates, how people want to distinguish themselves and find jobs?

The market seems to be flooded with students with CS skills. Actually, people from all fields — mechanical engineering, computational biology, EE, math etc — are entering data science. It is worrisome.

My background: I am a tech architect/manager on IoT projects who designs and helps develop analytics systems. I do some basic data modeling in my role, but defer to experts to build a better model once we find some area of interest.

The actual skills of light programming (R, Python), data literacy/manipulation and some basic modeling (statistical/machine learning, Bayesian methods, time series, etc) are useful for many job roles, and I think will be considered basic skills for college graduates in the near future. This isn't new - operations research, particularly Six Sigma and quality control, have used statistics and some light programming to solve business problems for decades.

By itself, I don't see Data Science evolving the way promised by schools and boot camps. Most of the positions named "Data Scientist" (at non-tech companies) are really just senior business analysts; I work with a group of them at my company and 90% of their day to day is just extracting and analyzing various reports for other managers and directors. When an interesting (and potentially lucrative) business problem does come along, they usually outsource to a specialized analytics firm and the data scientist helps coordinate that project. (If you have a good data dictionary, a clear outcome in mind, and some basic knowledge of the field it's relatively easy to outsource the advanced work.)

st1x7 had the best advice below -- learn the basic methods and then apply them to your field. If you google "agriculture iot research papers" you'll find tons of examples of people using sensors for data collection and then analyzing the data to improve some process.

TL;DR I see Data Science melting into other roles, but the basic skills/data literacy are useful for almost everyone.

> am looking to pivot into something that gets me out of the lab

Will data science get you out of the "lab"?

Maybe you can become a farmer, and iterate on patentable new agri-tech you will inevitably develop along the way; then sell that.

There are plenty of data science problems in agriculture. Big ag, small ag, automated ag. Look for problems people have in your domain that could be solved with data science.

I remember feeling the same way about CS in '98 - how silly of me then! You'll be just fine.

Look at Google X. They have some workstreams going on right now that combine agriculture and ML.

Definitely not

There will always be a next fad.

Data science is just a buzzword for statistics where you have more data than you know what to do with, instead of working hard on getting more quality data. More or less.

No, not at all. There is a lot to come in this field, so this is the right time to enter it.

I think this is the crux of data science: extrapolating


Also known as Linear Regression.

No, never too late to learn.

Yes, you're too late.

Yes, you are too late. But if you retrain and become a data critic, you are definitely ahead of the curve ;) The robot armies of AI experts and big data scientists are doing incredible harm to society. There is a real task for you there.

Hijacking the top comment because it's plain wrong.

1/ No there's no "incredible harm" to the society, not more than any technological revolution in the past and it's bringing more good than bad, like any technological revolution.

2/ You can perfectly well do a data science training (or preferably an ML one); there are thousands of free courses on the net, but if you get into a renowned program it's better, and then go find work. Also try to find a niche such as embedded systems, or robotics, or whatever is scarcer than doing "image recognition" lol

> 1/ No there's no "incredible harm" to the society, not more than any technological revolution in the past and it's bringing more good than bad, like any technological revolution.

Incorrect at best, apathetic and myopic at worst:



You think AIs are worse and more biased than some typical American judges? I'll give you a big fat LOL.

Any atrocity and injustice has been done at all scales by humankind. AI will not amplify that, and actually I think that, used well, it could help fight them.

You are just choosing to see the wrong side of it because you think it makes you look woke.

> You think AIs are worse and more biased than some typical American judges?

Did I once say that they were?

I can read and my point stands: of course AI is biased, because people are. But in the long term it will make society better.

> I can read

What I doubt is your ability to actually comprehend.

Yes, you are too late. Instead, focus on data critique. There is a big market for that as the robot armies of AI experts are doing incredible harm to society.
