Hacker News new | comments | show | ask | jobs | submit login

There are two "poles" in data science: math/modeling and backend/data-wrangling. Most of the time, the backend/data-wrangling piece is a prerequisite to the math/modeling. The vast majority of small and medium sized companies have not set up the systems they would need to support a data scientist who knows only math/modeling. Depending on the domain, it's not uncommon to find that a small/medium company outsourced analytics to Firebase, Flurry, etc...

That's fine, but when it comes time to create some customer segmentation models (or whatever) the data scientist they hire is going to need to know how to get the raw data. Questions become: how do I write code to talk to this API? How do I download 6 months of data, normalize it (if needed) and store it in a database? Those questions flow over into: how do I set up a hosted database with a cloud provider? What happens if I can't use the COPY command to load in huge CSV files? How do I tee up 5 TB of data so that I can extract from it what I need to do the modeling? Then you start looking at BigQuery or Hadoop or Kafka or NiFi or Flink and you drown for a while in the Apache ecosystem.

If you take a job at a place that has those needs, be prepared to spend months or even up to a year to set up processes that allow you to access the data you need for modeling without going through a painful 75 step process each time.

Case in point: I recently worked on a project where the raw data came to me in 1500 different Excel workbooks, each of which had 2-7 worksheets. All of the data was in 25-30 different schemas, in Arabic, and the Arabic was encoded with different codepages, depending on whether it came from Jordan, Lebanon, Turkey, or Syria. My engagement was to do modeling with the data and, as is par for the course, it was an expectation that I would get the data organized. Well - to be more straightforward, the team with the data did not even know that the source format would present a problem. There were ~7500 worksheets, all riddled with spelling errors and the type of things that happen when humans interact with Excel: added/deleted columns, blank rows with ID numbers, comments, different date formats, PII scattered everywhere, etc.

A data scientist's toolkit needs to be flexible. If you have in mind that you want to do financial modeling with an airline or a bank, then you probably can focus on the mathematics and forget the data wrangling. If you want the flexibility to move around, you're going to have to learn both. The only way to really learn data wrangling is through experience, though, since almost every project is fundamentally different. From that perspective, having a rock solid understanding of some key backend technologies is important. You'll need to know Postgres (or some SQL database) up and down; how to install, configure, deploy, secure, access, query, tweak, delete, etc. You really need to know a very flexible programming language that comes with a lot of libraries for working with data of all formats. My choice there was Python. Not only do you need to know the language well, you need to know the common libraries you can use for wrangling data quickly and then also for modeling.

IMO, job descriptions for "Data Scientist" positions cover too broad of a range, often because the people hiring have just heard that they need to hire one. Think about where you want to work and/or the type of business. Is it established? New? Do they have a history of modeling? Are you their first "Data Scientist?" All of these questions will help you determine where to focus first with your skill development.

So basic DBA skills + expert programming skills + very good math/stats?

Also - your model of asking questions before starting a new gig is very relevant to nearly every programming job. Could also be some of the questions a candidate asks in an interview.

Have you ever needed any Microsoft skills(MSSQL/C#) so far?

Yep, I’ve used MS SQL products and I write C# sometimes and read and write code to parse it very often because it is the primary language of the products I support.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact