"A data scientist is a statistician who lives in San Francisco."
Am I being dense?
(though hopefully nobody bleeds out on a table when someone misconstrues statistical data)
We've lived through an amazing time where one could learn by doing, and talented people have been able to compete without the benefit of formal education (myself included), but in my opinion those days are numbered.
I've personally observed respected PhD statisticians stumble on the type of problems a data scientist is expected to address. The combination of complex software and often counterintuitive mathematics makes this an imposing field for all but perhaps the top one percent of practitioners. Most everybody else needs to really hit the books for a few years, in a formal setting.
With that pre-coffee rant out of the way, I'm looking forward to finding some new sources here myself. So, in that spirit, thanks for the question.
"Data kiddies like me are coming.
I just ran multiple passes of the Broyden–Fletcher–Goldfarb–Shanno algorithm with a 100-layer neural network on a tf-idf-vectorized dataset. I have no clue exactly what all that means; all I know is that it took under an hour and it gives a higher (top 10%) AUC score.
Kaggler amateurs are beating the academics by brute force or smarter use of the many tools that are currently freely available.
Show a regular Python dev some examples and library docs and she can compete in ML competitions.
I was getting good results with LibSVM before I even understood, at a surface level, how SVMs work. Feed it the correct input format and some parameters and you are good to go. Random Forests can be applied to nearly anything and get you 75%+ accuracy.
Maybe I am just an engineer looking for pragmatic and practical uses of techniques from ML and data science. Hard data scientists will be the statisticians, the algorithmic theory experts, the experimental physicists. It takes me 7 years to understand a complex mathematical paper. It takes me 7 minutes to train a model and predict on a 1-million-row test set with Vowpal Wabbit."
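The workflow that comment describes really is that short. A minimal sketch with scikit-learn, where the toy spam-vs-normal corpus and labels are invented purely for illustration:

```python
# A hedged sketch of the "feed it the right format and go" workflow:
# tf-idf features piped into an off-the-shelf classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy corpus: 1 = spam-ish, 0 = normal
texts = ["free money now", "win a prize today", "meeting at noon",
         "lunch with the team", "claim your free prize", "project status update"]
labels = [1, 1, 0, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
pred = int(clf.predict(["free prize money"])[0])  # classified as spam-ish
```

Getting a score out is easy; knowing whether to trust it is the hard part.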
The point is that a data scientist is really a blend of statistician and software engineer. Sure, there are brilliant people who will invent new ML algorithms, but you don't need to invent that stuff to be of tremendous value to a business that has data it isn't currently getting much value from. Just as a software engineer at a small business doesn't need to write a database, she just needs to be able to implement one somebody else wrote to add tremendous value.
And what happens when this person gets a new data set and they are suddenly getting garbage out of some standard SVM? Is it just a matter of the data not being well-separated using a linear model but throwing some simple kernel at the SVM will do the trick?
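That failure mode is easy to reproduce. A minimal scikit-learn sketch (the dataset and parameters are my own illustration) where a linear SVM fails on concentric rings and swapping in an RBF kernel fixes it:

```python
# Two concentric rings: no straight line separates them, so a linear
# SVM does little better than chance, while an RBF kernel nails it.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear_acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)
# linear_acc hovers near 0.5; rbf_acc is close to 1.0
```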
Even something as simple as taking a mean can fall apart when you are dealing with data which doesn't live in a Euclidean space, let alone something like PCA or SVM which also make assumptions of linearity.
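The mean example is concrete for directional data: for angles on a circle, the arithmetic mean can point the wrong way entirely. A small sketch using the standard circular mean (compute the mean on the unit circle, then take the angle back):

```python
import math

def circular_mean_deg(angles):
    """Mean direction of angles in degrees, computed on the unit circle."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c)) % 360

headings = [350, 10]                   # two headings straddling north
naive = sum(headings) / len(headings)  # 180.0: due south, the wrong way
circ = circular_mean_deg(headings)     # ~0 (mod 360): due north, as expected
```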
The point is, it isn't just about being able to invent new methods. Things like SVM make assumptions about your data and applying them in cases when these assumptions don't hold can give completely worthless information, even if it looks good on the surface. Using something you don't understand, even if it is at a (much) more basic level than someone with a PhD in statistics, is just asking for trouble.
I'm not trying to be snarky, but honestly, unless you know what you're looking for it's a fool's game. Once you've got the feel for a subject, you tend to find several authors who crop up time and time again, or landmark papers that really shifted the field. But that takes a long time; it takes most PhD students a year just to collate and fully understand the background of a topic they may think they already know a lot about.
That and no one actually reads journals. You do a search on Web of Knowledge or ADS or arXiv or whatever your poison and you see what comes up. Point is, you need to know what you're looking for.
This is akin to saying that if you read Phys Rev enough, you'll become a physicist. Sure, sure, keep up with the trends, but big important results get press which is enough to rely on to start off with.
To become a data scientist? Read the recommended textbooks and take a proper degree in statistics, computer or data science. Look at the courses on EdX and Coursera for a starting point, they'll help you decide whether this is something you seriously want to pursue.
Even if this is just a hobby, e.g. you're a coder who wants to branch out, you should still take the time to invest in education properly. Data science, like statistics in general, is very easy to mess up. When people draw bad conclusions from data (and a skilled data scientist can conjure almost any conclusion from almost any data set), bad things inevitably happen. Entire lines of research have been undermined because somewhere, someone messed up their stats and apparently important results turned out to be meaningless.
- Hear or read about something that sounds neat
- See if there's a Wikipedia article (I always cringe when I hear some colleagues of mine say never, ever use it)
- Get a high-level understanding of the topic from the Wikipedia article... that usually leads to some other Wikipedia articles plus plain old Google searches... just fishing for whatever comes up
[I also search for TED talks, youtube videos and MOOCs related to the topic]
- Scribble stuff down on a piece of paper and structure it in a way that makes sense to me (sometimes it's just a list, sometimes a full blown mindmap)
...at this point I have a decent high level understanding...which basically means I could describe the topic to someone without stumbling (which I usually try at this point)
- From the high-level understanding I usually also get: key terms for searches, intro-level books/articles that are linked, etc.
- At this point working at a university comes in handy because it lets me get behind the annoying paywalls at will... search Google Scholar or similar databases for the mined keywords. Everything that looks remotely interesting... oh wait, BEST TOOL EVER
- Zotero is sick good; it comes as a Firefox plugin... great. If you search scientific databases and the like, a little icon pops up in the browser's address bar indicating it has identified the sources... click, mark everything -> it goes into your collection (with full-text access)
[I order it by topic so for AI I might have Expert Systems and Rule Based, Fuzzy etc.]
- So basically I just wade through the databases and get everything that sounds interesting from the title into Zotero. Always a good idea to get some "history of XYZ" or "XYZ since author Y" sources
- Once done, I read the abstracts and the conclusions and make a rough note of what the articles are about. I also scan the sources to grow my collection of relevant articles (I mark what I don't think is relevant or move it into a special subcollection)
- I usually try to establish a history of the field with the major stepping stones, this is usually easy (sometimes not, worth a paper to make it easier for future researchers :P)
- If it's related to programming in any way I also search google or github directly for anything related. Code is good :)
[often there are tomes that are the de facto standards in their fields that serve as a massive source collection as well. Perfect example would be AI - A Modern Approach]
You need to develop serious skills in at least four of the following disciplines:
RDBMS query development
Natural Language Processing
Web crawling and data harvesting techniques
Programming to access data APIs
Business systems that generate data, including CRM, ERP, and more
Geospatial data systems
Each of these areas would have its own set of resources both formal and informal.
I’m not a “data scientist” (or statistician, for that matter), but of the (excellent) data scientists that I know, the only specific skill they really have in common is statistical analysis. I’d say the truth is probably closer to “statistical analysis + ability to do independent research + computational chops using whatever their tools of choice may be.”
Usually you have a team where each person is "specialized" in a few of those categories.
You can call a data scientist a statistician, but I don't think you can necessarily call a statistician a data scientist.
The truth is, you need only a shallow understanding of machine learning and stats to be a data scientist. But you also need the know-how to collect data - this ends up being the much bigger issue to tackle in my environment. (For what it's worth, you need to have a strong understanding of how data points relate to one another, how accurate they might be, why they might not be accurate, and you also need to be constantly thinking about the long term vision for your data.)
Most of what I have been reading on the topic seems to define data science as the intersection of the kinds of things I have listed. I guess my larger point was that each of these areas has its own learning curve, and some, like statistics or machine learning, benefit from formal training. A person does not become a data scientist by reading blogs and journals.
Downloading scikit-learn and R and such is not going to work. At that level you are only qualified to be bossed around by a real scientist or statistician. You are an "analyst".
He is the prolific author of many R packages, which are more like little languages than libraries. His papers are both philosophical and practical, and informed by writing a huge amount of code.
The first one on that page is really good, and along with another paper of his it got me explicitly thinking about organizing my data in R using the relational model (a thing people with computer science backgrounds will know well).
It made me realize that R is actually a better SQL. It's a language for tables, or an algebra of tables.
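The relational mindset carries over beyond R. Here is a hypothetical pandas sketch (table names and data invented for illustration) of the same algebra of tables: a SQL-style join plus aggregate expressed as method chaining:

```python
import pandas as pd

# Invented example tables
orders = pd.DataFrame({"order_id": [1, 2, 3],
                       "cust_id": [10, 10, 20],
                       "amount": [5.0, 7.5, 3.0]})
customers = pd.DataFrame({"cust_id": [10, 20], "name": ["Ada", "Bo"]})

# Roughly: SELECT name, SUM(amount) FROM orders
#          JOIN customers USING (cust_id) GROUP BY name
totals = (orders.merge(customers, on="cust_id")
                .groupby("name", as_index=False)["amount"]
                .sum())
```

Each step takes tables in and gives tables out, which is exactly the "algebra of tables" point.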
Both from O'Reilly (with some Packt mixed in). Excellent content.
and the HN for Data Sci - datatau.com