My biggest issue with the teams of statisticians I've worked with before is that they lack a basic understanding of computer science. My biggest complaint dealing with the software developers on analytics projects is they don't understand statistics. I heard a great quote for which I don't remember the source (I paraphrase here): "A data scientist is someone who knows more computer science than a statistician, and more statistics than a computer scientist." The nature of the analytics world right now suggests that this type of specialty is sorely needed in many places.
"Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician."
I'm a data scientist, and I'll readily admit that your definition describes me well.
I have a background in IT and programming. I did it for over 10 years. I was wondering, though, how much demand is there for data scientists? And, what kind of salaries could I expect?
But really, the term to me more or less means "mathematically literate." I know, there are some techniques that seem to be specifically associated with data science, like the analysis of large-scale datasets, but many engineering and mathematical disciplines already deal with this.
There's a reason these jobs want someone who has a degree in... math, statistics, computer science, operations research, physics, engineering, hell, let's just say "or related field" and be done with it.
It's partly because these fields contribute to the intersection called "data science", but my real guess is that a degree in any of these fields means that you're probably mathematically literate. You've done one of those majors that requires you to take calculus of several variables, linear algebra, some differential equations, some kind of probability and statistics, probably write a computer program or two, and then focus on some more specific branch in depth where you learn to model things mathematically.
A good humanities curriculum will impart knowledge, sure, but it also trains you to read dense material, make sense of it, and express some kind of insight about it. A good STEM curriculum does the same thing, except with numbers and data.
There was a time when someone could get a job by being highly literate. I see this as a similar situation - if you're mathematically literate at a reasonably high level, you're probably employable.
When I think of big data, the first thing that pops into my head is insurance actuarial tables, and that's not interesting to me. Are statistics suddenly the hottest and most interesting thing in the world because we can run experiments over larger datasets? Maybe, but I think that most engineers capable of doing the kinds of analysis these firms want would be better suited to harder problems.
Don't get me wrong, data analysis is important, I just wonder if the IEEE has a duty to encourage organizations like this or if they should be trying to influence kids back towards the "hard" engineering practices.
To be honest, as long as people are doing something that makes them happy, I'm not one to judge, but I do think there's something to be said for attacking things that are harder than statistics.
I hope you're not trivializing statistics simply because of the elementary approach most schools tend to take for the first three or four statistics classes offered at the university level. I would argue that statistics can be just as difficult as computer science, and just like CS has intractable problems, so too does statistics.
If your idea of big data is actuarial tables, you're not thinking big enough. Actuarial tables are largely still small enough to be handled in orthodox ways using orthodox software (SAS has a death grip on managed health, which is what I do for a living, and it's a 50-year-old piece of software). In fact, most of the time I hear people say their data has a very high volume, it really doesn't, and traditional methods of data analysis can handle it. Once you step into Facebook/Amazon/Google-sized data, things change. And that's only looking at "big data" as large volume; the most common definitions of the phrase involve several other variables (see: the four V's).
Just because what's commonly done or visible is elementary doesn't mean the entire field is elementary. It would be like assuming CS isn't hard engineering if all you saw were web designers/developers.
For anyone thinking of it as a career, I highly recommend Nate Silver's The Signal and the Noise: Why So Many Predictions Fail, but Some Don't.
Five years ago, talk of Business Intelligence was all the rage. It was the 'hot' new thing that companies were pouring millions into. You needed the analytical and statistical skills to interpret large datasets efficiently, whilst having enough vision to cut through the noise and deliver meaningful metrics. Technical knowledge of manipulating data using multi-dimensional cubes and datasets was also required.
Now it seems that 'Data Science' is set to pick up where BI left off. The fields appear very similar.
To avoid it being an oxymoron, I would clearly define the boundaries and goals relative to similar fields: BI / Data Warehousing / Data Analyst / Database Architecture.
Disclosure: I've made a VERY good living since graduating working for Investment Banks in BI/Data analytics. I know from experience that money in these fields is more down to the industry you apply it to. Number crunching payroll or scientific data, low salary. Number crunching bank regulatory or trading data, massive money (regardless of what you call yourself).
Also, we're all pretty sure that the title "Data Scientist" will be applied far too liberally. I have friends at other BI firms who are already calling themselves data scientists because they attended a convention where the words "Hadoop" and "Cloudera" were spoken.
- Experienced Computer Scientist & Data Miner with a wide-ranging skill set
No disrespect to this man's skills but I'm sure there are hundreds of us on here that could easily fall under that category!?!
They're taking people in STEM fields who are over-qualified and under paid, and helping them transition into new careers as data scientists at top technology companies (Google, Facebook, Square, LinkedIn, etc.). It's a really interesting model because they're filling a big hole that universities have right now in that there's no degree for data science. Close to 100% of their Fellows make the transition successfully and I think the idea is something that others are going to try to copy in the near future because it's clear there's a supply-demand mismatch right now.
Fwiw, the company is a YC alumnus (a hard pivot from their original idea).
For example, a statistician might wonder exactly what a particular ID referred to. Does it mean a person, an IP address, a single "session"? They could, of course, find this out, but the data scientist would already know.
Similarly, a software engineer might wonder what information they need to be collecting from the user. The data scientist knows what analysis will ultimately be done, and so knows what information must be collected.
So data science combines statistics and software engineering, and this is useful because it allows a holistic view of the data analysis process, from the collection of data, to the statistical analysis of the processed data.
I may be wrong, but I disagree with those who say that the difference between a data analyst and a data scientist is that the data scientist is a software engineer.
I would say instead that the difference between a software engineer and a data scientist is that the data scientist is a scientist: someone with a strong background in data structures and algorithms as well as in (maybe pure) science, with experience in statistics, math, or physics, who knows very well how to work with models, test hypotheses, spot patterns and anomalies, etc.
2) Visualization: such large datasets cannot always be expressed in bar charts or pie charts, so standard charting tools like Excel and R don't work. You need good knowledge of charting libraries like d3 or OpenGL (for 3D visualization) to analyze and express your findings.
4) Type of data: econometricians are never comfortable with unstructured datasets consisting of Twitter feeds and Apache logs. Good knowledge of machine learning and graph algorithms is becoming essential. Apache Mahout, a machine learning framework built on Hadoop, is looking extremely promising.
This means that descriptive work such as clustering and dimension reduction is often either ignored or treated as a kind of pre-processing before the real work starts.
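To make the "descriptive work" concrete: a minimal sketch of dimension reduction via PCA, using NumPy on made-up toy data (a real pipeline would use a library implementation such as scikit-learn's):

```python
import numpy as np

def pca(X, n_components):
    """Project data onto its top principal components
    (the directions of greatest variance)."""
    Xc = X - X.mean(axis=0)                       # center each feature
    # right singular vectors of the centered data are the principal axes
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# toy data: 3-D points that actually lie near a 1-D line
rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(100, 3))

Z = pca(X, 1)   # reduced to a single descriptive dimension
```

Because the toy data is essentially one-dimensional, the single projected coordinate recovers the underlying parameter `t` almost exactly.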
1) Machine learning techniques for analysing data sets as opposed to parametric models
2) Clustering (k-means, etc.)
3) TF-IDF
4) Using a variety of data sources / tools - my econometrics education was heavily Stata dependent. Learn a little bit of SQL, R, and Matlab so that getting up to speed doesn't take you longer than a month.
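Item 2 above is easy to demo: a minimal k-means sketch in NumPy, on made-up toy data (the naive first-k initialization is just for determinism; real code would use random or k-means++ init):

```python
import numpy as np

def kmeans(points, k, iters=100):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    # naive init: take the first k points as centroids
    centroids = points[:k].copy()
    for _ in range(iters):
        # distance from every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):   # converged
            break
        centroids = new
    return labels, centroids

# two well-separated blobs should come out as two clusters
data = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
labels, centroids = kmeans(data, k=2)
```

The alternating assign/update loop is the whole algorithm; everything else in production implementations is about initialization and scale.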
Term Frequency-Inverse Document Frequency. Assigns each word in a document a score based on how often it appears in that document, scaled down by how many of the documents in the collection contain it.
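That definition translates almost directly into code. A minimal sketch with toy documents (the exact weighting varies by library; this uses raw term frequency and an unsmoothed log):

```python
import math
from collections import Counter

def tf_idf(docs):
    """For each document, score each word as
    (term frequency) * log(total docs / docs containing the word)."""
    n = len(docs)
    df = Counter()                      # document frequency per word
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: (tf[w] / len(doc)) * math.log(n / df[w])
                       for w in tf})
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
scores = tf_idf(docs)
```

In the first document, "cat" outscores "the" even though "the" appears twice as often, because "the" shows up across the collection and is discounted accordingly.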
I think the best single resource is Kevin Murphy's ML text, but there's lots of relevant stuff on linear algebra, Bayesian analysis, etc.
Here's a "curriculum"
And look at the Coursera courses by Koller, Andrew Ng, NLP by Collins, etc.
Basically statistical methods that work with big datasets, which is the core of data science.
I am willing to spend 10 hours a week on this.
In many organizations, data scientists are full-time programmers who get first dibs on the most interesting projects. I identify as a data scientist as a code word for "no-hire if the work's not interesting". There's plenty of hard engineering (in addition to traditional data science, where statistical intuition is more important) in data science. There are plenty of data scientists working on OS hacks, compilers, and other "hard engineering" topics. The difference and advantage for a data scientist is that your boss doesn't think he could do your job if he wanted to. If your title shows that you actually know math, you're not "just a code monkey".
Counterexample: a good friend of mine specialized in computer graphics where he did a lot of math-heavy research. I specialized in machine learning where I also did a lot of math-heavy research. We are both full-time programmers who choose our tools and flexibly work on interesting projects, but he would never get hired as a "data scientist", whereas I did. I think anyone who has specialized enough in something that's useful to a company can find a good, flexible job (e.g. my graphics friend and me).
The fact that the term "data scientist" is so ubiquitous simply means it's cooler/more useful than other specializations right now. Some companies may get away with abusing the title because it's so vague, but it does mean something to companies that actually know they need one.
Can you give some examples on this? Seems very interesting.
Also, Duncan Temple Lang, another professor of statistics, has done some really amazing work with R compilers and CUDA interfaces.
What's fun about machine learning is that it touches so many other parts of computer science. You could be at a high level writing DSLs in Clojure to make it possible for statisticians to specify their models directly, or you could go to the low level and write GPU code.
The general rule is that if your boss thinks he can do your job, you lose. If he doesn't think that, you win. When you're a data scientist, your odds are much higher of coming out in the second category.