Is Data Science Your Next Career? (ieee.org)
98 points by ohjeez on May 28, 2013 | 48 comments



I, for one, am glad that IEEE is making an effort to get people excited about a formal education in data science. My education was a disjoint combination of CS and Statistics (my degree is formally in statistics), with no union between the two except what I made of it. In neither CS nor Statistics did my education formally cover problems associated with having too much data to fit in memory or store on one hard disk.

My biggest issue with the teams of statisticians I've worked with before is that they lack a basic understanding of computer science. My biggest complaint dealing with the software developers on analytics projects is they don't understand statistics. I heard a great quote for which I don't remember the source (I paraphrase here): "A data scientist is someone who knows more computer science than a statistician, and more statistics than a computer scientist." The nature of the analytics world right now suggests that this type of specialty is sorely needed in many places.


Here's the source for the paraphrased quote: https://twitter.com/josh_wills/status/198093512149958656

"Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician."


A more cynical definition would add "... and who is worse at statistics than any statistician and worse at software engineering than any software engineer." ;)


I don't think you have to be very cynical to take that view.

I'm a data scientist, and I'll readily admit that your definition describes me well.


As a person who falls mostly into this camp I laughed at this since there is a grain of truth to it in my experience. There's also a certain scrappy pragmatism to being in this kind of role. Statisticians and software engineers tend to forget their technical ability is not innately valuable.


That's the one I was looking for. Thanks!


So, I'm considering grad school for either a masters in pure maths or statistics. Ultimately, I would love to teach math at a 2 yr college, but I also know that may not ever happen. Those jobs are just way competitive to get. A statistics degree would give me more versatility. I could get teaching jobs here in California's 2 yr CC system. But, I could also go out into the real world and work as a statistician.

I have a background in IT and programming. I did it for over 10 years. I was wondering, though, how much demand is there for data scientists? And, what kind of salaries could I expect?


I like the new Data Science term, because it suggests people might be more open to paying money for math. Coming from an operations research background, I know it can be a challenge to convince people that this is worthwhile.

But really, the term to me more or less means "mathematically literate." I know, there are some techniques that seem to be specifically associated with data science, like the analysis of large scale datasets, but many engineering and mathematical disciplines do deal with this already.

There's a reason these jobs want someone who has a degree in... math, statistics, computer science, operations research, physics, engineering, hell, let's just say "or related field" and be done with it.

It's partly because these fields contribute to the intersection called "data science", but my real guess is that a degree in any of these fields means that you're probably mathematically literate. You've done one of those majors that requires you to take calculus of several variables, linear algebra, some differential equations, some kind of probability and statistics, probably write a computer program or two, and then focus on some more specific branch in depth where you learn to model things mathematically.

A good humanities curriculum will impart knowledge, sure, but it also trains you to read dense material, make sense of it, and express some kind of insight about it. A good "stem" curriculum does the same thing, except with numbers and data.

There was a time when someone could get a job by being highly literate. I see this as a similar situation - if you're mathematically literate at a reasonably high level, you're probably employable.


I don't know.

When I think of big data, the first thing that pops into my head is insurance actuarial tables, and that's not interesting to me. Is statistics suddenly the hottest and most interesting thing in the world because we can run experiments over larger datasets? Maybe, but I think that most engineers capable of doing the kinds of analysis these firms want would be better suited to harder problems.

Don't get me wrong, data analysis is important, I just wonder if the IEEE has a duty to encourage organizations like this or if they should be trying to influence kids back towards the "hard" engineering practices.

To be honest, as long as people are doing something that makes them happy, I'm not one to judge, but I do think there's something to be said for attacking things that are harder than statistics.


What makes computer science a "hard" engineering and statistics not a "hard" subject? I'd hesitate to call statistics an engineering discipline, granted, but I object to differentiating statistics from CS on the basis of the word "hard".

I hope you're not trivializing statistics simply because of the elementary approach most schools tend to take for the first three or four statistics classes offered at the university level. I would argue that statistics can be just as difficult as computer science, and just like CS has intractable problems, so too does statistics.

If your idea of big data is actuarial tables, you're not thinking big enough. Actuarial tables are largely still small enough to be handled in orthodox ways using orthodox software (SAS has a death grip in managed health--what I do for a living--and it's a 50 year old piece of software). In fact, most of the time I hear people say their data has a very high volume, it really doesn't, and the traditional methods of data analysis can handle it. Once you step into Facebook/Amazon/Google sized data, things change. And that's only looking at "big data" as large volume--the most common definitions of the phrase involve several other variables (see: the four V's).

Just because what's commonly done or visible is elementary doesn't mean the entire field is elementary. It would be like assuming CS isn't a hard engineering if all you saw were web designers/developers.


It is pretty boring. Big data doesn't necessarily mean big insights. So lots of times it can end up being a wild goose chase with management not realizing they are funding science experiments.

For anyone thinking of it as a career, I highly recommend Nate Silver's The Signal and the Noise: Why So Many Predictions Fail - But Some Don't.


The impression I got from the "cs/se/stats" remark was it was the fusion of all three disciplines. A statistician can't sit down and pull data from a MySQL table, and certainly can't write a library to gather data on users on the site. As someone that's basically in this exact confluence point, I can tell you that there is very much interplay between the statistics and the software engineering and computer science disciplines.


Hmm. What kind of statistician these days can't pull data from a MySQL table? Even the almost-retired people I know have to interact with data sources in some SQL product or another.


HN is not "average". False consensus effect: http://en.m.wikipedia.org/wiki/False-consensus_effect


I hardly think I'm special for posting on HN. Have you read job adverts with "statistician" in the title? In order to do your day-job as a statistician you need to work with data products and tools which require more than MS Office type computer skills.


Data Science != Statistics.


To me, this looks like a fad. Let me explain.

Five years ago, talk of Business Intelligence was all the rage. It was the 'hot' new thing that companies were pouring millions into. You needed the analytical and statistical skills to interpret large data sets efficiently, along with enough vision to cut through the noise and deliver meaningful metrics. Technical knowledge of manipulating data using multi-dimensional cubes and datasets was also required.

Now it seems that 'Data Science' is set to pick up where BI left off. The fields appear very similar.

To avoid it being an oxymoron, I would clearly define the boundaries and goals relative to similar fields: BI / Data Warehousing / Data Analyst / Database Architecture

Disclosure: I've made a VERY good living since graduating working for Investment Banks in BI/Data analytics. I know from experience that money in these fields is more down to the industry you apply it to. Number crunching payroll or scientific data, low salary. Number crunching bank regulatory or trading data, massive money (regardless of what you call yourself).


As someone else working in the BI space, my coworkers and I have been saying similar things. I feel like "Big Data" has the same feel that BI had years ago.

Also, we're all pretty sure that the title "Data Scientist" will be applied far too liberally. I have friends at other BI firms who are already calling themselves data scientists because they attended a convention where the words "Hadoop" and "Cloudera" were spoken.


Exactly, I find there is not enough definition and way too much overlap. Even on Kaggle's front page, where they show 3 examples of 'the world's top data scientists', it says:

Alexander Larko: -Experienced Computer Scientist & Data Miner with wide ranging skillset

No disrespect to this man's skills but I'm sure there are hundreds of us on here that could easily fall under that category!?!


I don't think a lot of people know about the Insight Fellows program yet, but it's highly relevant here: http://insightdatascience.com/

They're taking people in STEM fields who are over-qualified and under paid, and helping them transition into new careers as data scientists at top technology companies (Google, Facebook, Square, LinkedIn, etc.). It's a really interesting model because they're filling a big hole that universities have right now in that there's no degree for data science. Close to 100% of their Fellows make the transition successfully and I think the idea is something that others are going to try to copy in the near future because it's clear there's a supply-demand mismatch right now.

Fwiw, the company is a YC alumnus (a hard pivot from their original idea).


So, universities are going to start trying to teach a person to be a programmer (in several languages), a sys/ops admin, a statistician, a business/systems analyst, and a DBA all at once? Good luck with that...


I feel obligated to post, hopefully it's not unwelcome. While I'm not in recruiting, we're always looking for more great engineers at Inflection (Inflection.com). We're a big-data company and are crunching billions of records. If you're at all interested, feel free to shoot me an email (tjbiddle at the-website-i-mentioned-above).


My take on data science, and what makes it distinct from other fields, is that it combines a knowledge of the business logic (i.e. software engineering) with knowledge of statistics.

For example, a statistician might wonder exactly what a particular ID referred to. Does it mean a person, an IP address, a single "session". They could, of course, find this out, but the data scientist would already know this.

Similarly, a software engineer might wonder what information they need to be collecting from the user. The data scientist knows what analysis will ultimately be done, and so knows what information must be collected.

So data science combines statistics and software engineering, and this is useful because it allows a holistic view of the data analysis process, from the collection of data, to the statistical analysis of the processed data.


I think that the word "Science" deserves some attention. I completely respect the data structure side of a data scientist, but more often than I'd like, I find that people calling themselves data scientists are very good software engineers but not as good at statistics or model building.

I may be wrong, but I disagree with those who say that the difference between a data analyst and a data scientist is that the data scientist is a software engineer.

I would say instead that the difference between a software engineer and a data scientist is that the data scientist is a scientist that has a strong CV in data structures and algorithms, as well as in (maybe pure) science, with experience in statistics, math, or physics, and that knows very well how to work with models, test hypotheses, spot patterns, anomalies etc...


What major differences are there between the new data science careers and what developers have been doing in university research departments for years now? Is it simply a matter of scope, transitioning from relatively small clinical trial sets to marketing data? Is credentialing needed because there is greater responsibility to formulate mathematical models in this path? If so, how is that different from creating domain-specific algorithms? Basically, I'm trying to understand why this is seen as a unique career path as opposed to just another pivot developers may have to adapt to if they want to stay relevant or be on the cutting edge.


"I worry that the Data Scientist role is like the mythical 'webmaster' of the 90s: master of all trades" - Aaron Kimball, CTO at Wibidata

http://blogs.msdn.com/b/microsoftenterpriseinsight/archive/2...


What really differentiates data science from econometrics, other than (I guess) the type of data being analysed?


I would like to know this as well.


1) Size of data: Most econometricians work on small data sets (mostly in MBs) which they can keep in RAM, and they use R and Excel to analyze the data. Modern-day data scientists have to deal with GBs (sometimes TBs or even PBs) of data. For data that large you need multiple machines, or even hundreds of machines, so you need to be good at distributed computing and frameworks like Hadoop, Hive, etc.

2) Visualization: Such large datasets cannot always be expressed in bar charts or pie charts, so standard charting tools like Excel and R don't work. You need good knowledge of charting libraries like d3 or OpenGL (for 3D visualization) to analyze and express your findings.

3) Type of data: Econometricians are never comfortable with unstructured data sets consisting of Twitter feeds and Apache logs. Good knowledge of machine learning and graph algorithms is becoming essential. Apache Mahout, a machine learning framework built on Hadoop, is looking extremely promising.


I would also add that econometricians are highly focused, almost exclusively focused in fact, on finding causal relationships.

This means that descriptive work such as clustering, dimension reduction, is often either ignored, or considered as a kind of pre-processing before the real work starts.


I think this is a big one, and one of the reasons I would be uncomfortable calling myself a "data scientist" despite meeting some of the more tool-oriented definitions - my work has a much larger focus on attempting to infer causality.


Things I didn't learn in my econometrics classes that are used all over in data science:

1) Machine learning techniques for analysing data sets as opposed to parametric models

2) Clustering (k-means, etc.)

3) TF / IDF

4) Using a variety of data sources / tools - my econometrics education was heavily Stata dependent. Learn a little bit of SQL, R, and Matlab so that getting up to speed doesn't take you longer than a month.


Thanks. What do you mean TF / IDF?


It's a method of determining which words in an arbitrary collection of documents (tweets, for instance) are most important when classifying those documents.

Term-Frequency-Inverse-Document-Frequency. Assigns each word a score based on how often it appears in a document relative to how often it appears in all documents.

https://en.wikipedia.org/wiki/Tf%E2%80%93idf
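
For anyone who wants to see the scoring concretely, here is a minimal sketch in plain Python. It assumes the simplest textbook variant (term frequency as a fraction of the document's words, inverse document frequency as log(N / number of documents containing the word)) and naive whitespace tokenization. The function name `tf_idf` and the three toy documents are made up for illustration; real libraries use smoothed variants and better tokenizers.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf  = occurrences of the term in the document / words in the document
    idf = log(number of documents / number of documents containing the term)
    """
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy documents, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
scores = tf_idf(docs)

# Highest-scoring term in each document.
top_terms = [max(s, key=s.get) for s in scores]
print(top_terms)
```

A word like "the" appears in every document, so its idf is log(1) = 0 and it scores zero everywhere, while a word that appears in only one document gets the highest weight there.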


Can anyone suggest a structured approach[books, tutorials, free online courses] to learn data science?


Here's some free book lists

http://www.kaggle.com/wiki/Tutorials

http://metaoptimize.com/qa/questions/186/good-freely-availab...

I think the best single resource is Kevin Murphy's ML text, but there's lots of relevant stuff on linear algebra, Bayesian analysis, etc.

______________

Here's a "curriculum"

http://blog.zipfianacademy.com/post/46864003608/a-practical-...

And look at the Coursera courses by Koller, Andrew Ng, NLP by Collins, etc.


There is a comprehensive list of online resources that we put together on our blog: http://blog.zipfianacademy.com/post/46864003608/a-practical-...


This is a free version of the book "Mining Massive Datasets".

Basically statistical methods that work with big datasets, which is the core of data science.

http://infolab.stanford.edu/~ullman/mmds.html


Thanks. I was expecting maybe a series of books, lectures, or videos which can take me from novice to intermediate level in data science.

I am willing to spend 10 hours a week on this.


This seems to fit the bill: https://www.coursera.org/course/datasci


I did the first two weeks of this course and found it quite accessible, although the second week question of implementing matrix algebra in SQL didn't seem to have much preparatory material in the lectures. Unfortunately I've had to drop out due to a concussion, but I think that most HN'ers would be able to take this course.
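
For readers wondering what matrix algebra in SQL can look like, here is a rough sketch. It is not the course's actual assignment, just one common approach: store each matrix sparsely as (row, column, value) rows and multiply with a join plus GROUP BY. The table names `a` and `b`, the column names, and the sample values are assumptions made up for this example, and it uses Python's built-in sqlite3 module so it runs anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (row_num INTEGER, col_num INTEGER, value REAL);
    CREATE TABLE b (row_num INTEGER, col_num INTEGER, value REAL);
""")

# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]]; zero entries are not stored.
conn.executemany("INSERT INTO a VALUES (?, ?, ?)",
                 [(0, 0, 1), (0, 1, 2), (1, 1, 3)])
conn.executemany("INSERT INTO b VALUES (?, ?, ?)",
                 [(0, 0, 4), (1, 0, 5), (1, 1, 6)])

# (A x B)[i][j] = sum over k of A[i][k] * B[k][j].
# The join pairs entries that share the inner index k;
# GROUP BY collects the products for each output cell (i, j).
product = conn.execute("""
    SELECT a.row_num, b.col_num, SUM(a.value * b.value)
    FROM a JOIN b ON a.col_num = b.row_num
    GROUP BY a.row_num, b.col_num
    ORDER BY a.row_num, b.col_num
""").fetchall()

for row_num, col_num, value in product:
    print(row_num, col_num, value)
```

Because only non-zero cells are stored, the join never touches the zeros, which is why this representation suits large sparse matrices.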


Data science seems, for now, to be what software engineering was supposed to be: a career where you choose your own tools and problems (with some constraints) and that gives you the freedom to move about in different industries instead of being stuck to one in the way that most programmers now are.

In many organizations, data scientists are full-time programmers who get dibs on the most interesting projects. I identify as a data scientist as a code word for "no-hire if the work's not interesting". There's plenty of hard engineering (in addition to traditional data science, where statistical intuition is more important) in data science. There are plenty of data scientists working on OS hacks, compilers, and other "hard engineering" topics. The difference and advantage for a data scientist is that your boss doesn't think he could do your job if he wanted to. If your title shows that you actually know math, you're not "just a code monkey".


I've seen you write something similar to this before, but I don't agree.

Counterexample: a good friend of mine specialized in computer graphics where he did a lot of math-heavy research. I specialized in machine learning where I also did a lot of math-heavy research. We are both full-time programmers who choose our tools and flexibly work on interesting projects, but he would never get hired as a "data scientist", whereas I did. I think anyone who has specialized enough in something that's useful to a company can find a good, flexible job (e.g. my graphics friend and me).

The fact that the term "data scientist" is so ubiquitous simply means it's cooler/more useful than other specializations right now. Some companies may get away with abusing the title because it's so vague, but it does mean something to companies that actually know they need one.


>>There are plenty of data scientists working on OS hacks, compilers, and other "hard engineering" topics.

Can you give some examples on this? Seems very interesting.


I encourage you to peruse the GitHub page of the R celebrity Hadley Wickham. While trained as a statistician and employed as a professor of statistics at Rice, he writes quite a few R packages that deal with more traditional development roles (unit testing and debugging, not to mention plyr and ggplot2): https://github.com/hadley?tab=repositories

Also, Duncan Temple Lang, another professor of statistics, has done some really amazing work with R compilers and CUDA interfaces. https://github.com/duncantl?tab=repositories http://www.omegahat.org/


Rob Zinkov (https://github.com/zaxtax) is a first-rate hacker. He has a popular ML blog (http://zinkov.com/) and organizes the Los Angeles machine learning users group.


Nothing comes off the top of my head, and it's not like they get to specialize in compilers. They mostly end up doing one-off hacks to make an existing algorithm more performant.

What's fun about machine learning is that it touches so many other parts of computer science. You could be at a high level writing DSLs in Clojure to make it possible for statisticians to specify their models directly, or you could go to the low level and write GPU code.

The general rule is that if your boss thinks he can do your job, you lose. If he doesn't think that, you win. When you're a data scientist, your odds are much higher of coming out in the second category.


To your last point, it depends on what role you currently serve to your company. If you're among statisticians, you make yourself valuable by knowing more about computer science and programming than the rest of the group. If you're amongst engineers (probably more common than the former--at least on HN), knowing probability and statistics gives you that edge.



