Ask HN: As a data scientist, what should be in my toolkit in 2018? - mxgr
======
ms013
Mathematics. Which branch of math is domain dependent. Stats come up
everywhere. Graphs do too. In addition to baseline math, you really need to
understand the problem domain and goals of the analysis.

Languages and libraries are just tools: knowing APIs doesn’t tell you at all
how to solve a problem. They just give you things to throw at a problem. You
need to know a few tools, but to be honest, they’re easy and you can go
surprisingly far with few and relatively simple ones. Knowing how, when, and
where to apply them is the hard part: and that often boils down to
understanding the mathematics and domain you are working in.

And don’t overuse viz. Pictures do communicate effectively, but often people
visualize without understanding. The result is pretty pictures that eventually
people realize communicate little effective domain insight. You’d be surprised
that sometimes simple and ugly pictures communicate more insight than
beautiful ones do.

My arsenal of tools: python, scipy/matplotlib, Mathematica, Matlab, various
specialized solvers (eg, CPLEX, Z3). Mathematical arsenal: stats, probability,
calculus, Fourier analysis, graph theory, PDEs, combinatorics.

(Context: Been doing data work for decades, before it got its recent “data
science” name.)

~~~
edem
I would really like to get a degree in Mathematics, but I simply don't have the time
to throw at it (work, children, etc). What do you suggest I should do to have
something on my resume? MOOC maybe?

~~~
CoVar
Usually MOOCs on a resume don't help, as everyone does them. The advice I
found useful for resume building is working on projects that you can catalog
in a portfolio.

With regards to gaining math skills, this upcoming MOOC from Microsoft on EdX
looks promising[1].

[1] [https://www.edx.org/course/essential-mathematics-for-
artific...](https://www.edx.org/course/essential-mathematics-for-artificial-
intelligence)

~~~
edem
So you suggest that I should learn from a MOOC, then go on and work on some
projects so that I can prove I really know it.

~~~
CoVar
Exactly. And to take it one step further, choose one industry you are
interested in. That way you will gain invaluable domain experience as you add
relevant portfolio projects.

If you don't have an industry in mind, you can use a site like glassdoor.com
and search for data scientist positions by city and industry to get a feel for
demand.

~~~
edem
Full disclosure: I've been in the industry for 10+ years as a programmer. I just
realized that if I want to move in the AI direction I'll need some math
education. I don't want to become a data scientist.

------
elsherbini
I'm a scientist (PhD student in microbiology) that works with lots of data. My
data is on the order of hundreds of gigabytes (genome collections and other
sequencing data) or megabytes (flat files).

I use the `tidyverse` from R[0] for everything people use `pandas` for. I
think the syntax is soooo much more pleasant to use. It's declarative and
because of pipes and "quosures" is highly readable. Combined with the power of
`broom`, fitting simple models to the data and working with the results is
really nice. Add to that that `ggplot` (+ any sane styling defaults like
`cowplot`) is the fastest way to iterate on data visualizations that I've ever
found. "R for Data Science" [1] is a great free resource for getting started.

Snakemake [2] is a pipeline tool that submits steps of the pipeline to a
cluster and handles waiting for steps to finish before submitting dependent
steps. As a result, my pipelines have very little boilerplate, they are self
documented, and the cluster is abstracted away so the same pipeline can work
on a cluster or a laptop.

[0] [https://www.tidyverse.org/](https://www.tidyverse.org/)

[1] [http://r4ds.had.co.nz/](http://r4ds.had.co.nz/)

[2]
[http://snakemake.readthedocs.io/en/stable/](http://snakemake.readthedocs.io/en/stable/)

~~~
nonbel
Sometimes I think I'm the only one who isn't really a fan of the tidyverse.
I've found it slower, more prone to dependency issues, more prone to silent
errors, and less well documented than most R packages (ie most of what you
find on CRAN).

~~~
in9
Dependency management, in my opinion, is one of the problems in the R
ecosystem. The lack of namespaces when calling functions has led the community
to have many little packages that only do one thing, and you are not really
sure where a function actually came from, unless you know the code and the
package.

An example is the janitor::clean_names function I like to use for
standardizing the column names on a data.frame.

However, the tidyverse is really serious about API consistency and functional
style, with pipes and purrr's functionality. The unixy style of base R is
unproductive when you want to iterate quickly on an analysis. Also, the idea
of "everything in a data frame" (or tibble, with list columns and whatnot)
together with the tidy data principles really takes the cognitive load off
just getting things started.

~~~
kirillseva
You should try
[https://github.com/robertzk/lockbox](https://github.com/robertzk/lockbox) for
dependency management

It's like bundler or cargo for R

------
Xcelerate
As a data scientist who has been using the language for 5 years now, I find
Julia by far the best programming language for analyzing and processing data. That
said, it’s common to find many Julia packages that are only half-maintained
and don’t really work anymore. (I still don’t know how to connect to Postgres
in a bug-free way using Julia.) And you’d be hard pressed to find teams of
data scientists that use Julia. So in that sense, Python has much more mature
and stable libraries, and it’s used everywhere. (But I really hope Julia
overtakes it in the next couple of years because it’s such a well-designed
language.)

Aside from programming languages, Jupyter notebooks and interactive workflows
are invaluable, along with maintaining reproducible coding environments using
Docker.

I think memorizing basic stats knowledge is not as useful as understanding
deeper concepts like information theory, because most statistical tests can
easily be performed nowadays using a library call. No one asks people to
program in assembler to prove they can program anymore, so why would you
memorize 30 different frequentist statistical tests and all of the assumptions
that go along with each? Concepts like algorithmic complexity, minimum
description length, and model selection are much more valuable.
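
To the point about library calls: running a frequentist test really is a one-
liner these days; the hard part is knowing whether its assumptions hold. A
minimal sketch with made-up numbers:

    from scipy import stats

    # two small made-up samples; Welch's t-test (unequal variances)
    a = [2.1, 2.5, 2.2, 2.8, 2.4]
    b = [1.9, 2.0, 2.3, 1.8, 2.1]
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
    print(t_stat, p_value)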

~~~
chubot
Has Julia converged on a solution for data frames? I watched some JuliaCon
videos and got the impression that they hadn't. There seem to be a lot of
different overlapping efforts.

~~~
0kto
Well, only the DataFrames.jl package comes to my mind. However, there exist a
few packages that extend this package (DataFramesMeta.jl or Query.jl; these
overlap to some extent, but the newer Query package seems to go beyond
DataFrames and offers some piping capabilities to interface with plotting
packages). In general: During the three years of my PhD some language /
package upgrades broke some of my scripts (during 0.4 -> 0.5 and -> 0.6), but
the language (and its extensive documentation, online and from the source code
of the packages) is very pleasant to use - the deprecation warnings usually
help you to adjust your code in time. I have been relying heavily on said
DataFrame package, and am quite happy - the community is usually responsive
and helpful in case of problems or questions.

------
chewxy
My toolkit hasn't changed since 2016:

\- Jupyter + Pandas for exploratory work, quickly define a model

\- Go (Gonum/Gorgonia) for production quality work. (here's a cheatsheet:
[https://www.cheatography.com/chewxy/cheat-sheets/data-
scienc...](https://www.cheatography.com/chewxy/cheat-sheets/data-science-in-
go-a/) . Additional write-up on why Go:
[https://blog.chewxy.com/2017/11/02/go-for-data-
science/](https://blog.chewxy.com/2017/11/02/go-for-data-science/))

I echo ms013's comment very much. Everything is just tools. More important to
understand the math and domain.

~~~
ZeroCool2u
I'm a big Go fan, but this is the first time I've seen someone recommend Go
for data science. After looking at this cheat sheet you've got me convinced
though. Would you mind pointing me to any other less cheat sheet style and
more in depth examples that you particularly like?

~~~
chewxy
Working on it. Part of my goal for 2018 is to write a lot more soft
documentation - tutorials etc.

Go is quite straightforward though - WYSIWYG for the most part, hence you
probably won't find a lot of sexy tutorials. Almost everything is just a loop
away, and in the next version of Gorgonia, even more native looping capability
is coming.

~~~
ZeroCool2u
Awesome, thank you!

------
trevz
A couple of thoughts, off the top of my head:

Programming languages:

    
    
      - python (for general purpose programming)
      - R (for statistics)
      - bash (for cleaning up files)
      - SQL (for querying databases)
    

Tools:

    
    
      - Pandas (for Python)
      - RStudio (for R)
      - Postgres (for SQL)
      - Excel (the format your customers will want ;-) )
    

Libraries:

    
    
      - SciPy (ecosystem for scientific computing)
      - NLTK (for natural language)
      - D3.js (for rendering results online)

~~~
ktpsns
I make the claim that you can go very far in the SciPy ecosystem without ever
touching R.

It is worth understanding the concepts of numpy and pandas. Furthermore, try
out IPython/Jupyter, especially for rapid publishing (people run their blogs
on jupyter notebooks).
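
A toy illustration of the core numpy/pandas concepts worth internalizing
(vectorized operations and split-apply-combine), with made-up data:

    import numpy as np
    import pandas as pd

    # vectorization: operate on whole arrays instead of looping over rows
    prices = np.array([10.0, 12.5, 9.9, 12.5])
    qty = np.array([3, 1, 4, 2])
    revenue = prices * qty  # elementwise, no explicit loop

    # pandas: the labeled, tabular version of the same idea
    df = pd.DataFrame({"price": prices, "qty": qty, "revenue": revenue})
    print(df.groupby("price")["revenue"].sum())  # split-apply-combine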

I think certain libraries depend very much on where you focus. Machine
learning? Natural language processing? Visualization? Something in economics?
Fundamental sciences? For instance, I never need NLTK in theoretical
astrophysics ;-) Instead, I need powerful GPU based visualization, which is
however very old school with VTK and Visit/Amira/Paraview (also very much
pythonic).

~~~
albertgoeswoof
Agree, I would drop R, Python has you mostly covered now. Julia is also worth
learning.

~~~
threeseed
I wouldn't recommend dropping R at all.

Very few enterprise data science teams are 100% Python (in fact none I've
heard of). R is still very heavily used (and in fact in all data science teams
I've worked in, it has been the dominant technology).

There is a reason Microsoft purchased Revolution Analytics.

------
xitrium
If you care about quantifying uncertainty, knowing about Bayesian methods is a
good idea I don't see represented here yet. I care so much about uncertainty
quantification and propagation that I work on the Stan project[0] which has an
extremely complete manual (600+ pages) and many case studies illustrating
different problems. Full Bayesian inference such as that provided by Stan's
Hamiltonian Monte Carlo inference algorithm is fairly computationally
expensive so if you have more data than fits into RAM on a large server, you
might be better served by some approximate methods (but note the required
assumptions) like INLA[1].

[0] [http://mc-stan.org/](http://mc-stan.org/) [1]
[http://www.r-inla.org/](http://www.r-inla.org/)
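
For anyone curious what that looks like in practice, a minimal sketch using
the PyStan interface with toy data (not from the parent comment):

    import pystan

    # estimate a mean and scale, with full posteriors
    model_code = """
    data { int<lower=0> N; vector[N] y; }
    parameters { real mu; real<lower=0> sigma; }
    model { y ~ normal(mu, sigma); }
    """
    model = pystan.StanModel(model_code=model_code)
    fit = model.sampling(data={"N": 5, "y": [1.2, 0.7, 1.9, 1.1, 0.8]},
                         iter=2000, chains=4)
    print(fit)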

~~~
hobolord
do you have a recommended guide/textbook on learning stan? I've recently
started doing more Bayesian analysis, mainly "Bayesian estimation supersedes
the t-test".

~~~
wishart_washy
As someone who uses Stan - I would recommend reading the Stan reference
documentation, it's essentially a textbook.

Also, get used to reading the Stan forums on Discourse. Happy Stanning

------
piqufoh
> what tools should be in my arsenal

A sound understanding of mathematics, in particular statistics.

It's amazing how many people will talk endlessly about the latest python/R
packages (with interactive charting!!!) who can't explain the Student's
t-test.

------
justusw
Dealing with large data processing problems my main tools are as follows:

Libs:

\- Dask for distributed processing

\- matplotlib/seaborn for graphing

\- IPython/Jupyter for creating shareable data analyses

Environment:

\- S3 for data warehousing, I mainly use parquet files with pyarrow/fastparquet

\- EC2 for Dask clustering

\- Ansible for EC2 setup

My problems usually can be solved by 2 memory-heavy EC2 instances. This setup
works really well for me. Reading and writing intermediate results to S3 is
blazing fast, especially when partitioning data by days if you work with time
series.

Lots of difficult problems require custom mapping functions. I usually use
them together with dask.dataframe.map_partitions, which is still extremely
fast.

The most time-consuming activity is usually nunique/unique counting across
large time series. For this, Dask offers hyperloglog based approximations.

To sum it up, Dask alone makes all the difference for me!
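
For illustration, a rough sketch of the two Dask patterns mentioned above
(bucket and column names are hypothetical):

    import dask.dataframe as dd

    # day-partitioned time series stored as parquet on S3
    df = dd.read_parquet("s3://my-bucket/events/*.parquet")

    # custom per-partition logic, applied in parallel
    def enrich(pdf):
        pdf["duration"] = pdf["end_ts"] - pdf["start_ts"]
        return pdf

    df = df.map_partitions(enrich)

    # approximate distinct counts via HyperLogLog
    print(df["user_id"].nunique_approx().compute())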

------
trollied
What does "Data Scientist" actually mean these days? Does it mean "Write 10
lines of Python or R, and not fully understand what it actually does"? Or
something else?

I just see the term flung around so much recently, and applied to so many
different roles, that it has all become a tad blurred.

Maybe we need a Data Scientist to work out what a Data Scientist is?

~~~
threeseed
I hire data scientists so I can tell you.

It means someone who can work with business stakeholders to break down a
problem e.g. "we don't know why customers are churning", produce a machine
learning model or some adhoc analysis (usually the former) and either
communicate the results back or assist in deploying the model into production.

Typically there will be data engineers who will be doing acquisition and
cleaning and so the data scientists are all about (a) understanding the data
and (b) liaising with stakeholders.

As for technologies it is typically R/Python with Spark/H2O on top of a data
lake i.e. HDFS, S3. Every now and again on top of an SQL store e.g. EDW,
Presto, or a feature store e.g. Cassandra.

------
schaunwheeler
A lot of people in this thread are focusing on technical tools, which is
normal for a discussion of this type, but I think that focus is misplaced.
Most technical tools are easily learnable and are not the limiting factor in
creating good data science products.

[https://towardsdatascience.com/data-is-a-
stakeholder-31bfdb6...](https://towardsdatascience.com/data-is-a-
stakeholder-31bfdb650af0)

(Disclaimer: I wrote the post at the above link).

If you have a sound design you can still create a huge amount of value even
with a very simple technical toolset. By the same token, you can have the
biggest, baddest toolset in the world and still end up with a failed
implementation if you have bad design.

There are resources out there for learning good design. This is a great
introduction and points to many other good materials:

[https://www.amazon.com/Design-Essays-Computer-
Scientist/dp/0...](https://www.amazon.com/Design-Essays-Computer-
Scientist/dp/0201362988)

------
severo
I'd say:

1\. You need research skills that will allow you to ask the right questions,
define the problem and put it in a mathematical framework.

2\. Familiarity with math (which? depends on what you are doing) to the point
where you can read articles that may have a solution to your problem and the
ability to propose changes, creating proprietary algorithms.

3\. Some scripting language (Python, R, w/e)

4\. (optional) Software Engineering skills. Can you put your model into
production? Will your algorithm scale? Etc.

------
dxbydt
> What’s the fizzbuzz test for data scientists anyway?

Here are 3 questions I was recently asked in a bunch of DS interviews in the
Valley.

1\. Probability of seeing a whale in the first hour is 80%. What's the
probability you'll see one by the next hour? Next two hours?

2\. In a closely contested election with 2 parties, what's the chance only one
person will swing the vote, if there are n=5 voters? n = 10? n = 100?

3\. Difference between Adam and SGD.
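
Not model answers, but a sketch of the usual starting points for questions 1
and 2 under the standard simplifying assumptions (independent hours,
independent 50/50 voters):

    from scipy.special import comb

    # Q1: if P(at least one sighting per hour) = 0.8 and hours are
    # independent, then P(no sighting in k hours) = 0.2 ** k.
    p_within_2_hours = 1 - 0.2 ** 2  # 0.96

    # Q2: under one common reading, a single vote swings the election
    # exactly when the remaining voters split evenly -- a binomial tie.
    def tie_probability(m):
        # P(m independent 50/50 voters split exactly m/2 vs m/2), m even
        return comb(m, m // 2) * 0.5 ** m

    print(p_within_2_hours)
    print(tie_probability(4))    # the other 4 of n=5 voters tied: 0.375
    print(tie_probability(100))  # ~0.08 for a large even electorate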

------
ever1
Python: Jupyter, pandas, numpy, scipy, scikit-learn

Numba for custom algorithms.

Dataiku (amazing tool for preprocessing and complex flows)

Amazon RDS (Postgres), but thinking about Redshift.

Spark

Tableau or plotly/seaborn

------
closed
I would think about which of these you see yourself doing more:

* statistical methods (more math)

* big, in-production model fitting (more python)

* quick, scrappy data analyses for internal use (more R)

For example, I would feel weird writing a robust web server in R, but it's
straightforward in python. On the other hand R's shiny lets you put up quick,
interactive web dashboards (that I wouldn't trust in exposing to users).

------
greyman
If you work at a bigger company doing data analytics, you may also come across
Tableau instead of Excel. Apart from SQL, if there is more data, you might
want to use BigQuery or something similar.

------
kmax12
One crucial skill you will need is feature engineering. Formal methods for it
aren’t typically in data science classes. Still, it’s worth understanding in
order to build ML applications. Unfortunately, there aren't many available
tools today, but I expect that to change this year.

Deep learning addresses it to some extent, but isn’t always the best choice if
you don’t have image / text data (eg tabular datasets from databases, log
files) or a lot of training examples.

I’m the developer of a library called Featuretools
([https://github.com/Featuretools/featuretools](https://github.com/Featuretools/featuretools))
which is a good tool to know for automated feature engineering. Our demos are
also a useful resource to learn using some interesting datasets and problems:
[https://www.featuretools.com/demos](https://www.featuretools.com/demos)
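
For a feel of what automated feature engineering looks like, a minimal sketch
using the bundled demo data (see the demos link above for fuller examples; the
exact API may differ by version):

    import featuretools as ft

    # toy retail entityset that ships with the library
    es = ft.demo.load_mock_customer(return_entityset=True)

    # Deep Feature Synthesis: automatically build features per customer
    feature_matrix, feature_defs = ft.dfs(entityset=es,
                                          target_entity="customers")
    print(feature_matrix.head())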

------
fredley
IPython/Jupyter, Pandas/Numpy and Python will get you everywhere you need to
go. Currently, until maybe Go gets decent DataFrame support, in terms of the
total time to get to your solution, I'd be amazed if any other setup got you
there quicker.

~~~
threeseed
> get you everywhere you need to go

No it won't.

That combination can't handle large datasets that are typical for most data
science teams i.e. maybe include PySpark. And then it's very limited as far as
ML/DL technologies go.

~~~
fredley
> i.e. maybe include PySpark

Pandas and Spark are both DataFrame libraries, and seem to offer very similar
functionality to me. Why do you prefer Spark over Pandas?

> very limited so far as ML/DL technologies

I mean, getting Tensorflow up and running with GPU support isn't trivial, but
it's not exactly hard, and Keras[1] provides excellent support for a wide
variety of other backends. What, in your experience, is less limited?

[1]: [https://keras.io/](https://keras.io/)

~~~
jwilbs
Spark sits on top of YARN/Mesos, and is used for data processing scalability
that pandas can't handle.

Personally, I think two areas often lacking are software development skills
and general statistics knowledge. The former is necessary for writing
production-quality code, assisting with any sort of data engineering pipeline,
writing reliable, reusable code, and creating custom solutions. Unfortunately,
the latter is often skimped on (if not skipped entirely) in favor of more
'hot' fields like ml/dl, with the result being a fuzzy understanding across
the board. (You'd be amazed at the quantity of candidates lacking fundamental
knowledge about glm's, basic nonparametric stats, popular distributions, etc).
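
On the stats side, the fundamentals being referred to are only a library call
away, but you still need to know what the model assumes. A minimal logistic-
regression GLM sketch with made-up data:

    import numpy as np
    import statsmodels.api as sm

    # toy binary outcome driven by a single predictor
    rng = np.random.RandomState(0)
    x = rng.normal(size=200)
    y = (rng.uniform(size=200) < 1 / (1 + np.exp(-2 * x))).astype(int)

    X = sm.add_constant(x)
    model = sm.GLM(y, X, family=sm.families.Binomial()).fit()
    print(model.summary())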

------
cwyers
You can get a lot of mileage out of just using R, dplyr, ggplot2 and lm/glm.
OLS still performs well in a lot of problem spaces. Understanding your data is
the key there, and a lot of exploratory visualization there will help a lot.

------
innovather
Hey everyone, I'm not a data scientist or a developer but I work with a lot of
them. My company, Introspective Systems, recently released xGraph, an
executable graph framework for intelligent and collaborative edge computing
that solves big problems: those that have massive decision spaces, tons of
data, are highly distributed, dynamically reconfigure, and need instantaneous
decision making. It's great for the modeling work that data scientists do.
Comment if you want more info.

------
drej
grep, cut, cat, tee, awk, sed, head, tail, g(un)zip, sort, uniq, split; curl;
jq, python3

~~~
proc0
So unix? lol

------
Jeff_Brown
Static typing lets you catch errors before running the code.

Pattern matching helps you write code faster (that is, spending less human
time).

Algebraic data types, particularly sum types, let you represent complicated
kinds of data concisely.

Coconut is an extension of Python that offers all of those.

Test driven development also helps you write more correct code.

------
ChrisRackauckas
A good understanding of calculus (probability), linear algebra, and your
dataset/domain. Anything else can be picked up as you need it. Oh, and test-
driven development in some programming language, otherwise you can't develop
code you know is correct.

------
ak_yo
Experimental design and observational causal inference would be excellent
skills to have. Especially if you’re working with people who are asking you
“why” questions, ML is helpful but isn’t going to cut it alone.

------
pentium10
As 1 TB of processing is free every month, the winning solution for us is
Google BigQuery, using the SQL 2011 standard combined with JavaScript UDFs,
together with Dataprep.

------
bitL
Spark + MLlib, Python + Pandas + NumPy + Keras + TensorFlow + PyTorch, R, SQL,
top placement in some Kaggle competitions. This would get you a long way.

~~~
geebee
Good tool set recommendations (+1 for mentioning SQL, immensely helpful), and
I enjoy Kaggle. Not sure how critical top placement is, though.

It seems like getting into the upper echelons of Kaggle is a matter of
refining your model, and I do wonder how much value these refinements offer
over a more basic and general approach in a real world scenario. To be clear,
when I say I wonder, I'm not saying I'm rejecting the value, I really do mean
it, I'm uncertain about the value. I think it's probably very scenario
specific.

Think of it this way - a predictive value of 90% vs 95% could be the
difference between placing in the top 10% and the bottom third. Now, 5% isn't
nothing, it could be very valuable. It really depends.

But Kaggle is an environment where the question is already posed, the data has
been collected, the test and train sets are already split apart for you, and
the winning model is the one that scores best on a hidden test set by a predefined
goodness of fit score.

In a real world scenario, suppose someone does a great job figuring out the
question to ask, gathering the data, and determining the most effective way to
act on the results, but uses a fairly basic, unrefined model. Someone else
does a middling job on those things, but builds a very accurate model as
measured by the data that has been collected. I'd say the first scenario is
likely to be more valuable, but again, it depends of course.

A couple other things, since I am a fan of Kaggle and do highly recommend it.
First, these things aren't necessarily exclusive - you can have a particularly
well conceived and refined model as well as a thorough and excellent business
and data collection process (though you may have to decide where to put your
time and resources).

Also, refining a model with Kaggle can be an exceptional training opportunity
to really understand what drives these things. So go for it! (I also find
these things kinda fun).

~~~
bitL
Top placement in Kaggle attracts recruiters for higher positions; e.g. I
observed a top-10 person getting a Head/VP of Analytics job at a large
European company even though, let's say, their formal education wasn't top
100. I agree that in the real world it is often useless, but people are drawn
to proven winners.

~~~
geebee
I'm not too surprised to hear that. In fact, I'd say a top score on Kaggle is
probably a pretty positive indicator. Yeah, refining the model probably isn't
as big a deal in a real project as it is on Kaggle, but it still takes some
decent chops to get a good score like that.

My best was somewhere in the top third, so I'm not an especially strong Kaggle
competitor. But even that took a lot of data parsing, piping, cleaning, moving
some things to a database, populating a model, and parallelizing the
processing so I could run things on a cloud in an hour rather than 100 hours on my
laptop. I learned a lot from it.

If you can score high on Kaggle, you definitely have some skill. And it's
hardly like people who can do this never have the other skills necessary to
manage the other stages of a data science project.

I probably wouldn't hire someone purely on Kaggle scores, but sure, it's a
positive indicator of programming and data management ability.

------
larrykwg
Nobody mentioned this yet: ETE:
[http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.ht...](http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.html)

a fantastic tree visualization framework; it's intended for phylogenetic
analysis but can really be used for any type of tree/hierarchical structure.
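
A minimal sketch of using it outside phylogenetics (the tree structure here is
made up):

    from ete3 import Tree

    # any hierarchy expressed as a newick string works
    t = Tree("((marketing,sales),(eng,(data,infra)));")
    print(t.get_ascii(show_internal=False))  # quick terminal view
    # t.render("tree.png")  # image output (needs the Qt dependencies)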

------
nrjames
There are two "poles" in data science: math/modeling and backend/data-
wrangling. Most of the time, the backend/data-wrangling piece is a
prerequisite to the math/modeling. The vast majority of small and medium sized
companies have not set up the systems they would need to support a data
scientist who knows only math/modeling. Depending on the domain, it's not
uncommon to find that a small/medium company outsourced analytics to Firebase,
Flurry, etc...

That's fine, but when it comes time to create some customer segmentation
models (or whatever) the data scientist they hire is going to need to know how
to get the raw data. Questions become: how do I write code to talk to this
API? How do I download 6 months of data, normalize it (if needed) and store it
in a database? Those questions flow over into: how do I set up a hosted
database with a cloud provider? What happens if I can't use the COPY command
to load in huge CSV files? How do I tee up 5 TB of data so that I can extract
from it what I need to do the modeling? Then you start looking at BigQuery or
Hadoop or Kafka or NiFi or Flink and you drown for a while in the Apache
ecosystem.

If you take a job at a place that has those needs, be prepared to spend months
or even up to a year to set up processes that allow you to access the data you
need for modeling without going through a painful 75 step process each time.

Case in point: I recently worked on a project where the raw data came to me in
1500 different Excel workbooks, each of which had 2-7 worksheets. All of the
data was in 25-30 different schemas, in Arabic, and the Arabic was encoded
with different codepages, depending on whether it came from Jordan, Lebanon,
Turkey, or Syria. My engagement was to do modeling with the data and, as is
par for the course, it was an expectation that I would get the data organized.
Well - to be more straightforward, the team with the data did not even know
that the source format would present a problem. There were ~7500 worksheets,
all riddled with spelling errors and the type of things that happen when
humans interact with Excel: added/deleted columns, blank rows with ID numbers,
comments, different date formats, PII scattered everywhere, etc.

A data scientist's toolkit needs to be flexible. If you have in mind that you
want to do financial modeling with an airline or a bank, then you probably can
focus on the mathematics and forget the data wrangling. If you want the
flexibility to move around, you're going to have to learn both. The only way
to really learn data wrangling is through experience, though, since almost
every project is fundamentally different. From that perspective, having a rock
solid understanding of some key backend technologies is important. You'll need
to know Postgres (or some SQL database) up and down; how to install,
configure, deploy, secure, access, query, tweak, delete, etc. You really need
to know a very flexible programming language that comes with a lot of
libraries for working with data of all formats. My choice there was Python.
Not only do you need to know the language well, you need to know the common
libraries you can use for wrangling data quickly and then also for modeling.

IMO, job descriptions for "Data Scientist" positions cover too broad of a
range, often because the people hiring have just heard that they need to hire
one. Think about where you want to work and/or the type of business. Is it
established? New? Do they have a history of modeling? Are you their first
"Data Scientist?" All of these questions will help you determine where to
focus first with your skill development.

~~~
dermybaby
So basic DBA skills + expert programming skills + very good math/stats?

Also - your model of asking questions before starting a new gig is very
relevant to nearly every programming job. Could also be some of the questions
a candidate asks in an interview.

Have you ever needed any Microsoft skills (MSSQL/C#) so far?

~~~
nrjames
Yep, I’ve used MS SQL products and I write C# sometimes and read and write
code to parse it very often because it is the primary language of the products
I support.

------
in9
I saw a simple tool somewhere a while ago (maybe a month or so ago): a simple
CLI for data inspection in the terminal. It seemed very useful for inspecting
data when ssh'ed into a machine.

However, I can't seem to recall the name. Has anyone seen what I'm talking
about?

------
anc84
Any programming language that you are proficient in. A solid understanding of
how a computer works. A solid basis of statistics. Anything else is just sprinkles,
trends and field-specific.

~~~
wenc
> Any programming language that you are proficient in.

Oh I don't know about that. Programming languages are force multipliers, and
each language has different force coefficients for different problem
domains. _They are not all equivalent_. They have their different points of
leverage, and simply being good in one does not mean you can solve problems in
any domain with ease. In fact the wrong programming language can often be
harmful if it's ill-suited to the problem at hand, and especially if it
_contorts your mental model_ of what you can do with the data.

One example I encounter a lot in industry is Excel VBA. I'm fairly good at VBA
and have seen very sophisticated code in VBA. I've also seen many basic
operations implemented badly in VBA that should not have been written in VBA
at all. By solving the problem in VBA, the solution is often "hemmed in" by
the constraints of VBA.

For instance, unpivoting data is often done badly in VBA (with for-loops), but
is trivial to do well in dplyr or pandas.
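
For example, a minimal unpivot in pandas with made-up data:

    import pandas as pd

    # wide "spreadsheet-style" data: one column per month
    wide = pd.DataFrame({
        "region": ["N", "S"],
        "jan": [100, 80],
        "feb": [110, 95],
    })

    # unpivot to long/tidy form in one call
    long = wide.melt(id_vars="region", var_name="month", value_name="sales")
    print(long)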

So I would say one has to choose one's programming language somewhat
carefully. Not any language will do.

------
eggie5
a lot of people using spark?

~~~
sandGorgon
same question that i have. Anyone using pyspark in production ?

Would you use pyspark mllib in a webservice instead of scikit ?

~~~
wenc
1) Yes, PySpark is great if you're mostly just doing dataframe manipulation in
Spark, using built-in functions. PySpark actually has similar performance to
Scala Spark for dataframes. (We've moved away from RDDs)

However, if you use a lot of UDFs where Spark has to serialize your Python
functions, you might consider rewriting those UDFs in a JVM language.
Serialization overhead is still fairly substantial. Arrow is trying to address
this by implementing a common in-memory format, but it's still early days.
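
To make the built-in vs. Python UDF distinction concrete, a small sketch (toy
dataframe; the built-in expression stays in the JVM, while the UDF round-trips
each value through a Python worker):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2.0), (2, 3.5)], ["id", "x"])

    # built-in column expression: no Python serialization
    df.withColumn("x2", F.col("x") * 2).show()

    # equivalent Python UDF: rows are pickled to Python and back
    double_udf = F.udf(lambda v: v * 2, DoubleType())
    df.withColumn("x2", double_udf(F.col("x"))).show()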

I would still recommend PySpark to most people. It's more than good/fast
enough for most data munging tasks. Scala does buy you two things: type safety
and low serialization overhead (i.e. significant!), which can be critical in
some situations, but not all.

Also, the Python way has always been to prototype fast, profile, and rewrite
bottlenecks in a faster language, and PySpark conforms to that pattern.

2) Spark MLlib is still fairly rudimentary in its coverage of major ML
algorithms, and Spark's linear algebra support, while serviceable, is
currently not very sophisticated. There are a few functions that are useful in
the data prep stage (encoding, tokenizers, etc.) but overall, we don't really
use MLlib very much.

Companies that have simple needs (e.g. a simple recommender) and that don't
have a lot of in-house expertise, might use MLlib though -- I believe someone
from a startup said that they did at a recent meetup.

Most of us need better algorithmic coverage and Scikit's coverage is currently
much better, plus it is more mature. We also have Numpy at our disposal, which
lets us do matrix-vector manipulation easily. There is some serialization
cost, but we can usually just throw cloud computational power at it.

Also note that for most workloads, the majority of the cost is incurred in
training. For models in production, one is typically processing a much smaller
amount of data using a trained model, so less horsepower is required.

~~~
sandGorgon
Hi, thanks for the answer. What you said resonates with me - with a few
changes. Spark 2.3 will come with Arrow UDFs, which should be a significant
performance boost. In that way, yes - we are taking a forward-looking bet.

About MLlib - yes, we concur with you on algorithmic coverage. And yes,
training is the major issue. For example, from what I read of Uber's
Michelangelo infrastructure, it seems they train using Spark and save to a
custom format that is deserialized (using custom code) and made available as a
docker image.

There is value in consistency - using Spark through and through. I wonder what
you thought of that?

~~~
wenc
1) I've heard about vectorized Python UDFs in Spark 2.3. Thanks for reminding
me of that.

[https://databricks.com/blog/2017/10/30/introducing-
vectorize...](https://databricks.com/blog/2017/10/30/introducing-vectorized-
udfs-for-pyspark.html)
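
For reference, the vectorized form described in that post looks roughly like
this (Spark 2.3+, toy dataframe; batches move via Arrow as pandas Series
instead of row-by-row pickling):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2.0), (2, 3.5)], ["id", "x"])

    @pandas_udf("double", PandasUDFType.SCALAR)
    def times_two(v):
        # v is a whole pandas Series, not a single value
        return v * 2

    df.withColumn("x2", times_two(df["x"])).show()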

2) I'm not that familiar with what Uber is doing. My take is I'd like to use
Spark for as much as I can, but there are parts that are either more
performant or easier to accomplish in Python.

Spark with Arrow will definitely change the game.

------
latenightcoding
If you use Python: scikit-learn, Pandas, NumPy, Tensorflow or PyTorch

Language agnostic: XGBoost, LibLinear, Apache Arrow, MXNet

------
spdustin
OpenRefine (openrefine.org) is definitely a handy (and automate-able) part of
my data-cleansing workflow.

------
eps
You probably mean "data analyst".

"Data scientist" title would apply only if you are applying scientific method
to discover new fact about natural world exclusively through data analysis (as
opposed to observation and experiments).

~~~
sgt101
Designing experiments is a key part of Data Science work. Another key part is
determining where & how revealing observations can be made.

The analysis part is usually quite simple; often if it gets really complex
then that's a sign that the data is being tortured. Sometimes the marginal
gains that complex methods create (vs simple but good approaches) are not
worthwhile even if they are valid - simply in terms of time spent and
difficulty in communications.

------
sdfjkl
numpy, Jupyter (formerly IPython Notebook) and probably Mathematica anyway.

------
amelius
Any book recommendations?

------
ellisv
Counting and dividing.

------
topologie
Random Matrix Theory.

------
kome
Excel, VBA, SPSS ;)

~~~
babayega2
OpenRefine has helped me a lot in data cleaning tasks.

