Languages and libraries are just tools: knowing APIs alone doesn't tell you how to solve a problem; they just give you things to throw at it. You need to know a few tools, but honestly they're easy, and you can go surprisingly far with a few relatively simple ones. Knowing how, when, and where to apply them is the hard part, and that often boils down to understanding the mathematics and the domain you are working in.
And don't overuse viz. Pictures do communicate effectively, but often people visualize without understanding. The result is pretty pictures that, people eventually realize, communicate little real domain insight. You'd be surprised: sometimes simple, ugly pictures communicate more insight than beautiful ones do.
My arsenal of tools: Python, scipy/matplotlib, Mathematica, Matlab, various specialized solvers (e.g., CPLEX, Z3). Mathematical arsenal: stats, probability, calculus, Fourier analysis, graph theory, PDEs, combinatorics.
(Context: Been doing data work for decades, before it got its recent “data science” name.)
I don't necessarily agree with this. Yes, a sound understanding of the domain and knowledge of the mathematics and statistics are vital to gaining insights. But. I would make a very clear distinction between exploratory data viz and explanatory data viz. Data visualization when presenting those insights is an important part of driving decision making.
I don't fully agree with this either. Especially for mathematical concepts, visualization can give insight into how theorems are constructed and combined. This can prove to be vital when applying concepts and theorems to new problems.
I would especially like to bring up 3blue1brown, a creator of videos that beautifully visualize and explain complex mathematical problems. His work has given me an insight into math that theorems explained in text and variables never could.
However I do see your point that visualizations without understanding can be misleading. Hence the pure, written math is important to read and reason about, but I do believe that some concepts need to be visualized to be fully understood.
With regards to gaining math skills, this upcoming MOOC from Microsoft on EdX looks promising.
If you don't have an industry in mind, you can use a site like glassdoor.com and search for data scientist positions by city and industry to get a feel for demand.
Basically, any problem where you can establish relations between elements can be treated as a graph. I've used graphs for image analysis before too: pixels are vertices, edges represent neighborhood relations - especially useful when you make nonlocal connections (e.g., nonlocal means; graph-cut methods for segmentation; etc...)
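The pixels-as-vertices idea can be sketched in a few lines. This is a minimal illustration assuming a simple 4-neighborhood adjacency (the function name is mine, not from any library):

```python
def pixel_grid_graph(height, width):
    """Build an adjacency list where each pixel (r, c) is a vertex and
    edges connect 4-neighborhood pixels (up/down/left/right)."""
    graph = {}
    for r in range(height):
        for c in range(width):
            neighbors = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < height and 0 <= nc < width:
                    neighbors.append((nr, nc))
            graph[(r, c)] = neighbors
    return graph

g = pixel_grid_graph(2, 2)
print(len(g[(0, 0)]))  # 2: a corner pixel has two neighbors
```

Nonlocal methods like the ones mentioned above go further by also adding edges between distant pixels with similar patches, but the vertex/edge framing is the same.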
I've worked with them in three of the above contexts: cybersecurity (my current projects), retail analytics, and image analysis. I've avoided social network stuff - never cared for that area much.
I definitely think a solid mathematical understanding helps build the quantitative and critical thinking skills that are key in data science.
For example, if you have a bachelor's degree from a top engineering school (MIT, Caltech, Stanford, Berkeley, etc.), you have proven that you are intelligent and can work hard.
People without a masters degree, but more business experience, bring a different perspective, and are often more business results focused, and potentially work more collaboratively than an individual who just graduated from a masters program.
Source: I am a Data Science hiring manager, and have interviewed 100+ candidates at several companies
I think the commenter’s point is (implicitly) about specific, directly relevant Master’s degrees. Obviously a general Master’s wouldn’t provide much of an advantage. The difficulty isn’t demonstrating intelligence and work ethic, it’s demonstrating targeted expertise.
> People without a masters degree, but more business experience, bring a different perspective, and are often more business results focused, and potentially work more collaboratively than an individual who just graduated from a masters program.
To be honest with you, this sounds to me like complete speculation. I’m not saying it’s wrong; rather it seems like it’s at best unempirical, and at worst unfalsifiable. The qualifiers you’re using (like “potentially”, or “often”) don’t seem like strong heuristics.
I think it would be helpful to discuss straightforward job descriptions. For most real data science roles, I would not weight any of what you’ve listed (except collaboration) as being remotely as useful as demonstrable expertise in computer science and statistics. For candidates without a Master’s degree, I wouldn’t take business experience or lack thereof as a signal whatsoever - I’d look for a relevant heuristic to replace it.
I've had questions ranging from reversing strings on a whiteboard to checking for valid email addresses. I had another question about flipping biased coins and calculating probabilities. It's all nonsense, and totally unrelated to the skills I developed during my PhD, which primarily consisted of performing massive amounts of machine learning on high-performance computing systems over large sets of data to extract important insights.
But — if solving these algorithm puzzles quickly and without errors is the key to a $300k+ job, so be it. I'll just practice this nonsense until I've optimized for the skill of "interviewing", and then maybe I can contribute in some kind of meaningful way to the company with actual data science.
Is there a consensus about what kind of Master's would be most useful for data science? Computer science? Stats?
I think the actual term is Data Engineer.
It’s why technical interviews can be so brutal, unfortunately. There are a lot of frauds out there. Money attracts frauds.
What’s the fizzbuzz test for data scientists anyway?
My phone screen "fizzbuzz" is having them calculate a standard deviation from an array of data with only basic operators (no numpy.std). Then explain why they chose population or sample, and explain the difference.
I studied math in undergrad so one of my requirements is "knows more math than me".
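A sketch of what an answer to that phone screen might look like, using only basic operators (the function name is made up):

```python
def std_dev(xs, sample=True):
    """Standard deviation with basic operators only (no numpy.std).

    sample=True uses the n-1 denominator (Bessel's correction) for an
    unbiased variance estimate from a sample; sample=False uses n for
    the population standard deviation.
    """
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)  # sum of squared deviations
    variance = ss / (n - 1 if sample else n)
    return variance ** 0.5

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(std_dev(data, sample=False))  # 2.0 (population form)
```

The population/sample distinction the interviewer asks about is exactly the choice of denominator in the `variance` line.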
What kind of questions are you asking to ensure that they’re correct when they’re speaking about math you don’t know?
1) People who are great at the mathematics behind the statistical tooling
2) People who are great at conceptualizing a relevant question, operationalizing it, and then using a computer to apply appropriate models.
I think in most cases, for businesses needing to solve business problems, the latter kind is probably more useful. There are applications where the former is required, but you probably know if you need this kind of data scientist.
I should also add that these traits aren't mutually exclusive, but that individual data scientists typically are stronger or weaker along approximately those axes.
In general, I still dislike the term "data science" because it obfuscates meaningful distinctions between math nerds, computer science nerds, and research nerds who happen to do some applied stats.
I do, however, think anyone with any lick of a statistics background should know the formula for a standard deviation, considering how fundamental the idea of variance is in statistics.
Also, for our role, we're specifically hiring someone with extensive stats background since a large part of the role is learning domain-specific statistics of the industry we're targeting and figuring out how we can adopt those models with our data.
It’s a filter that theoretically allows false positives (which is why you continue with other questions), but it really shouldn’t have any false negatives.
This is just my n=1 opinion, but this is a terrible test for data science skills. I've had to calculate standard deviation by hand many times in my life, but my short term memory is such that despite doing that dozens of times over the past two decades, I still can't recall the formula off the top of my head. And then there's the whole n vs (n-1) thing in the denominator which has something to do with degrees of freedom, but I would just Google that as soon as I needed to know (depending on exactly what I was trying to do with the data).
So I don't understand how your question in any way tests someone's skills at analyzing data to extract valuable business insights. At best, it tests someone's ability to memorize formulas and minutiae (although I'll grant you that understanding the difference between a sample and the population is important).
Personally, I think take-home interviews with real data sets are the best way to gauge a candidate's skills. You're actually testing them with a work sample, and they are not under artificial time or memorization constraints.
Read through, and do all the exercises in, one textbook each for:
1. Calculus
2. Linear Algebra
3. Abstract Algebra
4. Real Analysis
5. Topology
6. Probability Theory
7. Number Theory
...more or less in that order. Make sure your calculus book covers single variable and multivariable calculus. Supplement with applied mathematical statistics. Do that, and you have the equivalent of a mathematics undergrad (as far as relevant courses are concerned).
You could even do this with something like UIllinois’ NetMath program, or some courses on Coursera. You can swap out Number Theory for Complex Analysis or deeper Probability Theory and it’d be more relevant.
Skip abstract algebra, topology and analysis. If you find yourself in the same room as a number theory book, walk away slowly without making eye contact lest it cast a spell on you.
Sure, skip number theory. Like I said, you could swap that out.
One can learn the necessary topology, analysis (etc.) in the relevant places (and the relevant depths) that they come up.
1. The context is knowing more math than someone who has an undergraduate degree in it,
2. Abstract algebra is part of such a degree, and contributes significantly to overall mathematical maturity, and
3. You can avoid some subjects in the short term, but in the long term you can’t progress further without a reasonable mastery of algebra and analysis.
Probability theory and linear algebra are heavily used in data science. You won’t be as competitive a candidate for a job if you don’t have a firm grasp of both subjects. At a certain point, linear algebra ceases to be distinct from abstract algebra, and those exercises you were doing become applicable to real world results.
But, what you're describing I would consider "data engineering" (at least how I have been hired to do it). Working through the business problems and pragmatically facilitating data, pipelines, databases, and models to solve those problems. It's less established and less "hot" but, IMO, it's a much more valuable job to most businesses.
Much like how the early hires at Twitter were not deeply experienced in high-availability work, such as segregating the architecture of a predominantly RoR code base to be resilient at scale, which led to countless "fail whale" outages, before they eventually landed someone who helped them rethink their architecture: use RoR for what it's good at, while introducing the JVM and other languages to handle other aspects of the workload.
I'm having a little trouble trying to parse that sentence. Could you explain it better?
Based on what I think is being asked, the question is essentially: What is a STD? I think this is a very straightforward and fair question.
For less stat-y HNers: the STD (standard deviation) is the square root of the variance, for any distribution, not just a normal one. The variance is the average of the squared differences between the data points and the mean. Essentially: take a point, find its distance to the mean, square that, and average over all the points. That's the variance. Take the square root, and that's the STD.
I use the `tidyverse` from R for everything people use `pandas` for. I think the syntax is so much more pleasant to use. It's declarative and, because of pipes and "quosures", highly readable. Combined with the power of `broom`, fitting simple models to the data and working with the results is really nice. Add to that that `ggplot` (+ any sane styling defaults like `cowplot`) is the fastest way to iterate on data visualizations that I've ever found. "R for Data Science" is a great free resource for getting started.
Snakemake  is a pipeline tool that submits steps of the pipeline to a cluster and handles waiting for steps to finish before submitting dependent steps. As a result, my pipelines have very little boilerplate, they are self documented, and the cluster is abstracted away so the same pipeline can work on a cluster or a laptop.
An example is the janitor::clean_names function I like to use for standardizing the column names on a data.frame.
However, the tidyverse is really serious about API consistency and functional style, with pipes and purrr's functionality. The unixy style of base R is unproductive for iterating quickly on an analysis. Also, the idea of "everything in a data frame" (or tibble, with list columns and whatnot), together with the tidy data principles, really takes the cognitive load off just getting things started.
It's like bundler or cargo for R
As a half-solution, I ended up restricting myself to a very few libraries in this family (mainly dplyr, lubridate, stringr, broom) and to using packrat to consistently freeze the library versions for these.
There are definitely some issues if you have to reliably run scripts (not to mention the difficulties of putting into production)
The thing I really like about R over Python is that for SPECIFIC tasks, like inspecting data and trying to get an answer out quickly, there really isn't a quicker or better tool to use. The ONLY reason I still even use R is the ease of getting answers with the tidyverse.
I personally find that Jupyter feels like a hack compared to something like RStudio. You have to open a terminal and launch a web server?
I on the other hand, find most R packages provide barely readable documentation. I can just hope that the vignette exists and actually explains the inputs/outputs.
You think this is better than barely readable?
Aside from programming languages, Jupyter notebooks and interactive workflows are invaluable, along with maintaining reproducible coding environments using Docker.
I think memorizing basic stats knowledge is not as useful as understanding deeper concepts like information theory, because most statistical tests can easily be performed nowadays using a library call. No one asks people to program in assembler to prove they can program anymore, so why would you memorize 30 different frequentist statistical tests and all of the assumptions that go along with each? Concepts like algorithmic complexity, minimum description length, and model selection are much more valuable.
On this specific point, it's worth noting that up until now there's been a single massive repository of every Julia package ever published, regardless of its current state or utility. Starting with the upcoming 0.7 release, Julia will introduce the concept of "curated" repositories so that, going forward, if you stick just with the default curated repository of packages you should have much less chance of running into a broken or unmaintained package.
- Jupyter + Pandas for exploratory work, quickly define a model
- Go (Gonum/Gorgonia) for production quality work. (here's a cheatsheet: https://www.cheatography.com/chewxy/cheat-sheets/data-scienc... . Additional write-up on why Go: https://blog.chewxy.com/2017/11/02/go-for-data-science/)
I echo ms013's comment very much. Everything is just tools; it's more important to understand the math and the domain.
Go is quite straightforward though - WYSIWYG for the most part, hence you probably won't find a lot of sexy tutorials. Almost everything is just a loop away, and in the next version of Gorgonia, even more native looping capability is coming.
... and in particular the resources lists at
Also, Dan's GopherCon talk on Go for data science is a great way to get yourself convinced enough to try it out:
- python (for general purpose programming)
- R (for statistics)
- bash (for cleaning up files)
- SQL (for querying databases)
- Pandas (for Python)
- RStudio (for R)
- Postgres (for SQL)
- Excel (the format your customers will want ;-) )
- SciPy (ecosystem for scientific computing)
- NLTK (for natural language)
- D3.js (for rendering results online)
It is worth understanding the concepts of numpy and pandas. Furthermore, try out IPython/Jupyter, especially for rapid publishing (people run their blogs on jupyter notebooks).
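A minimal taste of both concepts (the data here is illustrative): numpy gives you vectorized operations over whole arrays, and pandas layers labeled, tabular operations on top of it.

```python
import numpy as np
import pandas as pd

# numpy: vectorized math on whole arrays, no explicit loops
a = np.array([1.0, 2.0, 3.0, 4.0])
print(a.mean(), (a ** 2).sum())  # 2.5 30.0

# pandas: labeled tabular data with split-apply-combine built in
df = pd.DataFrame({"group": ["x", "x", "y"], "value": [1, 2, 10]})
print(df.groupby("group")["value"].mean())
```

Once vectorization and groupby "click", most of the two libraries' APIs follow from those two ideas.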
I think certain libraries depend very much on where you focus. Machine learning? Natural language processing? Visualization? Something in economics? Fundamental sciences? For instance, I never need NLTK in theoretical astrophysics ;-) Instead, I need powerful GPU-based visualization, which is however very old school with VTK and VisIt/Amira/Paraview (also very much pythonic).
If you're doing a lot of work with matrices and model fitting in production, then Python seems fine. However, a lot of data scientists I see are more like scrappy data analysis / visualization types who are churning out small dashboards. In that case, R's tidyverse and Shiny are just incredibly fast to develop with.
For powerful GPU viz, have you considered vispy? Four authors of four independent Python science visualization libs got together to build it.
Very few enterprise data science teams are 100% Python (in fact none I've heard of). R is still very heavily used (and in fact all data science teams I've worked in it has been the dominant technology).
There is a reason Microsoft purchased Revolution.
The real selection happens when you consider what's available in the open-source world. What code do you not have to write? What high-quality libraries are available, versus which ones will you have to write yourself?
On this topic, R has vast advantage over python in some domains, such as bioinformatics for example, while python definitely shines when it comes to deep learning (and using for loops).
You can't just claim that one shouldn't look at R because you personally know one language better than the other, quite likely because in your domain it's not used as much.
I do prefer the deep learning, NLP, and production serving story in Python, but you will have to pry dplyr+ggplot from my cold dead hands for quick analysis and charting. Not to mention that pandas's API is a clusterfuck compared to R's native data frames.
$ docker run -it --rm -p 8888:8888 jupyter/datascience-notebook
WRT bash, where to begin? In the past 40 years, there’s pretty much a better tool for everything someone tries to do with bash. It lives on pretty much through inertia and pride.
1. Easier interpretation of results than frequentist methods for lay people (business strata, elected officials, or other decision makers)
2. Uncertainty can be quantified and visualized reasonably well, which helps decision makers not think of stats as a magic box that produces a single answer.
3. Sensitivity analysis can be placed right up front: selection of priors representative of the beliefs of differing opinions / ideologies can inform decision makers of when they should consider changing their minds, and when they might still hold out.
Downsides of Bayesian methods:
1. Conceptually more involved than typical maximum likelihood estimation methods
2. Computationally expensive
3. Methods might not be as well known to a nominally stats-savvy audience.
Also, get used to reading the Stan forums on Discourse. Happy Stanning
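A minimal sketch of the uncertainty-quantification point (2) above, using a conjugate Beta-Binomial model approximated on a grid in plain Python rather than Stan; all the numbers are made up for illustration:

```python
# Posterior for a conversion rate: flat Beta(1,1) prior,
# 12 successes out of 40 trials (hypothetical data).
successes, trials = 12, 40
grid = [i / 1000 for i in range(1, 1000)]

# Unnormalized posterior = flat prior * binomial likelihood
post = [p ** successes * (1 - p) ** (trials - successes) for p in grid]
total = sum(post)
post = [w / total for w in post]

# Posterior mean; conjugacy says it should be near (12+1)/(40+2)
mean = sum(p * w for p, w in zip(grid, post))
print(mean)  # ~0.31

# 90% equal-tailed credible interval from the cumulative posterior
cum, lo, hi = 0.0, None, None
for p, w in zip(grid, post):
    cum += w
    if lo is None and cum >= 0.05:
        lo = p
    if hi is None and cum >= 0.95:
        hi = p
print(lo, hi)
```

Reporting "the rate is probably between `lo` and `hi`" is exactly the kind of quantified uncertainty that keeps decision makers from treating stats as a magic box.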
A sound understanding of mathematics, in particular statistics.
It's amazing how many people will talk endlessly about the latest Python/R packages (with interactive charting!!!) who can't explain Student's t-test.
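For reference, Student's t-test compares two sample means relative to their pooled variability. A minimal two-sample sketch in plain Python (equal-variance pooled form; the data is made up):

```python
from math import sqrt

def students_t(xs, ys):
    """Two-sample Student's t statistic, equal-variance (pooled) form."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)  # sample variances
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    # Pooled standard deviation, weighted by degrees of freedom
    sp = sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / (sp * sqrt(1 / nx + 1 / ny))

t = students_t([5.1, 4.9, 5.3, 5.0], [4.2, 4.4, 4.1, 4.5])
print(round(t, 2))  # 6.2
```

The statistic is then compared against a t distribution with nx + ny - 2 degrees of freedom; in practice you'd call a library, but being able to explain each term is the point of the comment above.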
- Dask for distributed processing
- matplotlib/seaborn for graphing
- IPython/Jupyter for creating shareable data analyses
- S3 for data warehousing, I mainly use parquet files with pyarrow/fastparquet
- EC2 for Dask clustering
- Ansible for EC2 setup
My problems usually can be solved by 2 memory-heavy EC2 instances. This setup works really well for me. Reading and writing intermediate results to S3 is blazing fast, especially when partitioning data by days if you work with time series.
Lots of difficult problems require custom mapping functions. I usually use them together with dask.dataframe.map_partitions, which is still extremely fast.
The most time-consuming activity is usually nunique/unique counting across large time series. For this, Dask offers hyperloglog based approximations.
To sum it up, Dask alone makes all the difference for me!
I just see the term flung around so much recently, and applied to so many different roles, that it has all become a tad blurred.
Maybe we need a Data Scientist to work out what a Data Scientist is?
It means someone who can work with business stakeholders to break down a problem e.g. "we don't know why customers are churning", produce a machine learning model or some adhoc analysis (usually the former) and either communicate the results back or assist in deploying the model into production.
Typically there will be data engineers who will be doing acquisition and cleaning and so the data scientists are all about (a) understanding the data and (b) liaising with stakeholders.
As for technologies, it is typically R/Python with Spark/H2O on top of a data lake, e.g. HDFS or S3. Every now and again on top of a SQL store, e.g. EDW or Presto, or a feature store, e.g. Cassandra.
const Y = a => (b => b(b))(b => a(x => b(b)(x)));
(Disclaimer: I wrote the post at the above link).
If you have a sound design you can still create a huge amount of value even with a very simple technical toolset. By the same token, you can have the biggest, baddest toolset in the world and still end up with a failed implementation if you have bad design.
There are resources out there for learning good design. This is a great introduction and points to many other good materials:
1. You need research skills that will allow you to ask the right questions, define the problem and put it in a mathematical framework.
2. Familiarity with math (which? depends on what you are doing) to the point where you can read articles that may have a solution to your problem and the ability to propose changes, creating proprietary algorithms.
3. Some scripting language (Python, R, w/e)
4. (optional) Software Engineering skills. Can you put your model into production? Will your algorithm scale? Etc.
Here are 3 questions I was recently asked in a bunch of DS interviews in the Valley.
1. The probability of seeing a whale in the first hour is 80%. What's the probability you'll see one by the next hour? The next two hours?
2. In a closely contested election with 2 parties, what's the chance a single person's vote will swing the result, if there are n = 5 voters? n = 10? n = 100?
3. Difference between Adam and SGD.
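One way to reason about the first two questions, under common interview assumptions (Poisson-process sightings for question 1, i.i.d. fair-coin voters for question 2; both assumptions are mine, and an interviewer may accept other models):

```python
import math
from math import comb

# Q1: assume whale sightings follow a Poisson process with constant rate.
p1 = 0.8                      # P(sighting within 1 hour)
lam = -math.log(1 - p1)       # rate from 1 - exp(-lam) = 0.8
p2 = 1 - math.exp(-2 * lam)   # P(sighting within 2 hours)
# equivalently 1 - (1 - p1)**2, since disjoint hours are independent
print(round(p2, 2))           # 0.96

# Q2: assume each of the other n-1 voters flips a fair coin; your vote
# is decisive exactly when they tie, which requires n-1 to be even.
def pivotal(n):
    if (n - 1) % 2:
        return 0.0
    return comb(n - 1, (n - 1) // 2) / 2 ** (n - 1)

print(pivotal(5))  # 0.375
```

The pivotal probability shrinks roughly like 1/sqrt(n) as the electorate grows, which is usually the follow-up discussion. Question 3 (Adam vs. SGD) is conceptual: Adam adapts per-parameter learning rates using running moment estimates, while vanilla SGD uses one global rate.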
Numba for custom algorithms.
Dataiku (amazing tool for preprocessing and complex flows)
Amazon RDS (Postgres), but thinking about Redshift.
Tableau or plotly/seaborn
* statistical methods (more math)
* big, in-production model fitting (more python)
* quick, scrappy data analyses for internal use (more R)
For example, I would feel weird writing a robust web server in R, but it's straightforward in python. On the other hand R's shiny lets you put up quick, interactive web dashboards (that I wouldn't trust in exposing to users).
Deep learning addresses it to some extent, but isn’t always the best choice if you don’t have image / text data (eg tabular datasets from databases, log files) or a lot of training examples.
I’m the developer of a library called Featuretools (https://github.com/Featuretools/featuretools) which is a good tool to know for automated feature engineering. Our demos are also a useful resource to learn using some interesting datasets and problems: https://www.featuretools.com/demos
No it won't.
That combination can't handle the large datasets that are typical for most data science teams, so you'd want to at least include PySpark. And then it's very limited as far as ML/DL technologies go.
Pandas and Spark are both DataFrame libraries, and seem to offer very similar functionality to me. Why do you prefer Spark over Pandas?
> very limited so far as ML/DL technologies
I mean, getting Tensorflow up and running with GPU support isn't trivial, but it's not exactly hard, and Keras provides excellent support for a wide variety of other backends. What, in your experience, is less limited?
Personally, I think two areas often lacking are software development skills and general statistics knowledge. The former is necessary for writing production-quality code, assisting with any sort of data engineering pipeline, writing reliable, reusable code, and creating custom solutions. Unfortunately, the latter is often skimped on (if not skipped entirely) in favor of more 'hot' fields like ML/DL, with the result being a fuzzy understanding across the board. (You'd be amazed at the number of candidates lacking fundamental knowledge about GLMs, basic nonparametric stats, popular distributions, etc.)
I would bet that the mean size of dataset people are dealing with is a lot bigger than the median size.
Pattern matching helps you write code faster (that is, spending less human time).
Algebraic data types, particularly sum types, let you represent complicated kinds of data concisely.
Coconut is an extension of Python that offers all of those.
Test driven development also helps you write more correct code.
It seems like getting into the upper echelons of Kaggle is a matter of refining your model, and I do wonder how much value these refinements offer over a more basic and general approach in a real world scenario. To be clear, when I say I wonder, I'm not saying I'm rejecting the value, I really do mean it, I'm uncertain about the value. I think it's probably very scenario specific.
Think of it this way - a predictive value of 90% vs 95% could be the difference between placing in the top 10% and the bottom third. Now, 5% isn't nothing, it could be very valuable. It really depends.
But Kaggle is an environment where the question is already posed, the data has been collected, the test and train sets are already split apart for you, and winning model is the one that scores best on a hidden test set by a predefined goodness of fit score.
In a real world scenario, suppose someone does a great job figuring out the question to ask, gathering the data, and determining the most effective way to act on the results, but uses a fairly basic, unrefined model. Someone else does a middling job on those things, but builds a very accurate model as measured by the data that has been collected. I'd say the first scenario is likely to be more valuable, but again, it depends of course.
A couple other things, since I am a fan of Kaggle and do highly recommend it. First, these things aren't necessarily exclusive - you can have a particularly well conceived and refined model as well as a thorough and excellent business and data collection process (though you may have to decide where to put your time and resources).
Also, refining a model with Kaggle can be an exceptional training opportunity to really understand what drives these things. So go for it! (I also find these things kinda fun).
My best was somewhere in the top third, so I'm not an especially strong Kaggle competitor. But even that took a lot of data parsing, piping, cleaning, moving some things to a database, populating a model, and parallelizing the processing so I could run things on a cloud in an hour rather than 100 hours on my laptop. I learned a lot from it.
If you can score high on Kaggle, you definitely have some skill. And it's hardly like people who can do this never have the other skills necessary to manage the other stages of a data science project.
I probably wouldn't hire someone purely on Kaggle scores, but sure, it's a positive indicator of programming and data management ability.
A fantastic tree visualization framework; it's intended for phylogenetic analysis but can really be used for any type of tree/hierarchical structure.
That's fine, but when it comes time to create some customer segmentation models (or whatever) the data scientist they hire is going to need to know how to get the raw data. Questions become: how do I write code to talk to this API? How do I download 6 months of data, normalize it (if needed) and store it in a database? Those questions flow over into: how do I set up a hosted database with a cloud provider? What happens if I can't use the COPY command to load in huge CSV files? How do I tee up 5 TB of data so that I can extract from it what I need to do the modeling? Then you start looking at BigQuery or Hadoop or Kafka or NiFi or Flink and you drown for a while in the Apache ecosystem.
If you take a job at a place that has those needs, be prepared to spend months or even up to a year to set up processes that allow you to access the data you need for modeling without going through a painful 75 step process each time.
Case in point: I recently worked on a project where the raw data came to me in 1500 different Excel workbooks, each of which had 2-7 worksheets. All of the data was in 25-30 different schemas, in Arabic, and the Arabic was encoded with different codepages, depending on whether it came from Jordan, Lebanon, Turkey, or Syria. My engagement was to do modeling with the data and, as is par for the course, it was an expectation that I would get the data organized. Well - to be more straightforward, the team with the data did not even know that the source format would present a problem. There were ~7500 worksheets, all riddled with spelling errors and the type of things that happen when humans interact with Excel: added/deleted columns, blank rows with ID numbers, comments, different date formats, PII scattered everywhere, etc.
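A tiny illustration of the codepage problem described above (the sample word is hypothetical): the same Arabic bytes round-trip correctly only through the codepage they were written with.

```python
# "Hello" in Arabic (hypothetical sample), encoded with the Windows
# Arabic codepage, as files from some of those sources might be
text = "مرحبا"
raw = text.encode("cp1256")

# Decoding with the right codepage recovers the text...
assert raw.decode("cp1256") == text

# ...while a wrong codepage silently produces mojibake instead of
# raising an error, which is what makes mixed-source files painful
assert raw.decode("latin-1") != text
```

With thousands of worksheets from four countries, detecting which codepage each file used had to happen before any actual modeling could start.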
A data scientist's toolkit needs to be flexible. If you have in mind that you want to do financial modeling with an airline or a bank, then you probably can focus on the mathematics and forget the data wrangling. If you want the flexibility to move around, you're going to have to learn both. The only way to really learn data wrangling is through experience, though, since almost every project is fundamentally different. From that perspective, having a rock solid understanding of some key backend technologies is important. You'll need to know Postgres (or some SQL database) up and down; how to install, configure, deploy, secure, access, query, tweak, delete, etc. You really need to know a very flexible programming language that comes with a lot of libraries for working with data of all formats. My choice there was Python. Not only do you need to know the language well, you need to know the common libraries you can use for wrangling data quickly and then also for modeling.
IMO, job descriptions for "Data Scientist" positions cover too broad of a range, often because the people hiring have just heard that they need to hire one. Think about where you want to work and/or the type of business. Is it established? New? Do they have a history of modeling? Are you their first "Data Scientist?" All of these questions will help you determine where to focus first with your skill development.
Also - your model of asking questions before starting a new gig is very relevant to nearly every programming job. Could also be some of the questions a candidate asks in an interview.
Have you ever needed any Microsoft skills (MSSQL/C#) so far?
However, I can't seem to recall the name. Has any one seen what I'm talking about?
Oh, I don't know about that. Programming languages are force multipliers, and each language has a different force coefficient for different problem domains. They are not all equivalent. They have their different points of leverage, and simply being good in one does not mean you can solve problems in any domain with ease. In fact, the wrong programming language can often be harmful if it's ill-suited to the problem at hand, and especially if it contorts your mental model of what you can do with the data.
One example I encounter a lot in industry is Excel VBA. I'm fairly good at VBA and have seen very sophisticated code in VBA. I've also seen many basic operations implemented badly in VBA that should not have been written in VBA at all. By solving the problem in VBA, the solution is often "hemmed in" by the constraints of VBA.
For instance, unpivoting data is often done badly in VBA (with for-loops), but is trivial to do well in dplyr or pandas.
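For what it's worth, here's a minimal sketch of that unpivot in pandas (the column names are made up for the example):

```python
import pandas as pd

# A wide table: one row per product, one column per quarter
wide = pd.DataFrame({
    "product": ["A", "B"],
    "Q1": [10, 20],
    "Q2": [15, 25],
})

# melt() unpivots in one call -- no hand-rolled for-loops as in VBA
long = wide.melt(id_vars="product", var_name="quarter", value_name="sales")
```

In dplyr the equivalent would be a single `pivot_longer` call; either way the language hands you the operation instead of making you reimplement it.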
So I would say one has to choose one's programming language somewhat carefully. Not any language will do.
Every single large-scale data science team (e.g. Google, Spotify, AirBnb) will be using Spark for most of their work. It is by far the de facto standard for working with large datasets, especially since it integrates so well with machine learning (H2O) and different languages (Scala, Python, R).
Would you use pyspark mllib in a webservice instead of scikit?
However, if you use a lot of UDFs where Spark has to serialize your Python functions, you might consider rewriting those UDFs in a JVM language. Serialization overhead is still fairly substantial. Arrow is trying to address this by implementing a common in-memory format, but it's still early days.
I would still recommend PySpark to most people. It's more than good/fast enough for most data munging tasks. Scala does buy you two things: type safety and low serialization overhead (the latter can be significant!), which can be critical in some situations, but not all.
Also, the Python way has always been to prototype fast, profile, and rewrite bottlenecks in a faster language, and PySpark conforms to that pattern.
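That prototype-profile-rewrite loop can be sketched with nothing but the standard library; `slow_sum` here is an invented stand-in for a real bottleneck:

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately naive loop -- the kind of hotspot profiling surfaces."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Profile the prototype to find where the time actually goes
profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Report the top functions by cumulative time
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
```

Once the report pins down the hot spot, only that function needs rewriting in a faster language (or in vectorized NumPy); the rest of the prototype stays as-is.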
2) Spark MLlib is still fairly rudimentary in its coverage of major ML algorithms, and Spark's linear algebra support, while serviceable, is currently not very sophisticated. There are a few functions that are useful in the data prep stage (encoding, tokenizers, etc.) but overall, we don't really use MLlib very much.
Companies that have simple needs (e.g. a simple recommender) and that don't have a lot of in-house expertise might use MLlib, though -- I believe someone from a startup said that they did at a recent meetup.
Most of us need better algorithmic coverage and Scikit's coverage is currently much better, plus it is more mature. We also have Numpy at our disposal, which lets us do matrix-vector manipulation easily. There is some serialization cost, but we can usually just throw cloud computational power at it.
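The kind of matrix-vector manipulation meant here is a one-liner in NumPy; the numbers are made up for illustration:

```python
import numpy as np

# Feature matrix (3 samples, 2 features) and a weight vector
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -0.5])

# One matmul scores every sample at once -- no explicit Python loop
scores = X @ w
```

Because the loop runs inside NumPy's compiled code, this scales to millions of rows without the per-row serialization cost a Python UDF would incur.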
Also note that for most workloads, the majority of the cost is incurred in training. For models in production, one is typically processing a much smaller amount of data using a trained model, so less horsepower is required.
About MLlib - yes, we concur with you on algorithmic coverage. And yes, training is the major issue. For example, from what I read of Uber's Michelangelo infrastructure, it seems they train using Spark and save to a custom format that is deserialized (using custom code) and made available as a Docker image.
There is value in consistency - using Spark through and through. Wonder what you thought of that?
2) I'm not that familiar with what Uber is doing. My take is I'd like to use Spark for as much as I can, but there are parts that are either more performant or easier to accomplish in Python.
Spark with Arrow will definitely change the game.
XGBoost, LibLinear, Apache Arrow, MXNet
"Data scientist" title would apply only if you are applying scientific method to discover new fact about natural world exclusively through data analysis (as opposed to observation and experiments).
The analysis part is usually quite simple; if it gets really complex, that's often a sign that the data is being tortured. Sometimes the marginal gains that complex methods create (vs. simple but good approaches) are not worthwhile even if they are valid - simply in terms of time spent and difficulty of communication.
Or maybe the humanities as a whole should be considered "not science."
Besides, a data analyst who doesn't use the scientific method is just a bad analyst. Some media outlets showcase blatantly misleading charts made by people who understand the technicals but get everything wrong about the concepts.
So this is my advice: focus on understanding the concepts before the tooling. That is what will really create your value.