Ask HN: As a data scientist, what should be in my toolkit in 2018?
341 points by mxgr on Feb 20, 2018 | 169 comments



Mathematics. Which branch of math matters is domain dependent. Stats come up everywhere. Graphs do too. In addition to baseline math, you really need to understand the problem domain and the goals of the analysis.

Languages and libraries are just tools: knowing APIs doesn’t tell you how to solve a problem at all. They just give you things to throw at a problem. You need to know a few tools, but to be honest they’re easy, and you can go surprisingly far with a few relatively simple ones. Knowing how, when, and where to apply them is the hard part, and that often boils down to understanding the mathematics and the domain you are working in.

And don’t overuse viz. Pictures do communicate effectively, but often people visualize without understanding. The result is pretty pictures that, people eventually realize, communicate little real domain insight. You’d be surprised how often simple, ugly pictures communicate more insight than beautiful ones do.

My arsenal of tools: python, scipy/matplotlib, Mathematica, Matlab, various specialized solvers (eg, CPLEX, Z3). Mathematical arsenal: stats, probability, calculus, Fourier analysis, graph theory, PDEs, combinatorics.

(Context: Been doing data work for decades, before it got its recent “data science” name.)


> And don’t overuse viz. Pictures do communicate effectively, but often people visualize without understanding. The result is pretty pictures that, people eventually realize, communicate little real domain insight. You’d be surprised how often simple, ugly pictures communicate more insight than beautiful ones do.

I don't necessarily agree with this. Yes, a sound understanding of the domain and knowledge of the mathematics and statistics are vital to gaining insights. But I would make a very clear distinction between exploratory data viz and explanatory data viz. Data visualization when presenting those insights is an important part of driving decision making.


> And don’t over use viz. [...]

I don't fully agree with this either. Especially for mathematical concepts, visualization can give insight into how theorems are constructed and combined. This can prove vital when applying concepts and theorems to new problems.

I would especially like to bring up 3blue1brown[1]. He creates videos that beautifully visualize and explain complex mathematical problems. His efforts have given me an insight into math that theorems explained in text and variables never could.

However, I do see your point that visualizations without understanding can be misleading. Hence the pure, written math is important to read and reason about, but I do believe that some concepts need to be visualized to be fully understood.

[1]: http://www.3blue1brown.com


I think what he/she means is poorly designed visualizations. Just because a plot is grayscale and not interactive doesn't mean it's worse than a cluttered, poorly designed, super-interactive web widget. It's a poor choice of wording, but I think by "overuse" they might mean "unclear but eye-catching". Besides, "overuse" literally means use in a quantity that is excessive.


Good viz is what connects non-ML/AI users to the "magical" results of ML/AI


Indeed. Too many people when asked about their skills or experience just rattle off a list of tools or libraries. Usually the same ones as everyone else!


Out of interest, can you give an example of a problem you've solved using Z3?


One data problem boiled down to being an instance of the set cover problem (https://en.m.wikipedia.org/wiki/Set_cover_problem). Pretty easy to pose as an integer constraint problem, and Z3 solved it in about 20 minutes for me.
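
For the curious, here is a minimal sketch of how such a problem can be posed with Z3's Python bindings (the toy universe and subsets are invented for illustration, not the actual problem I solved):

  from z3 import Bool, If, Optimize, Or, Sum, is_true, sat

  universe = {1, 2, 3, 4, 5}
  subsets = {"s1": {1, 2, 3}, "s2": {2, 4}, "s3": {3, 4}, "s4": {4, 5}}

  pick = {name: Bool(name) for name in subsets}
  opt = Optimize()
  # every element must be covered by at least one chosen subset
  for e in universe:
      opt.add(Or([pick[name] for name, s in subsets.items() if e in s]))
  # minimize the number of chosen subsets
  opt.minimize(Sum([If(v, 1, 0) for v in pick.values()]))
  if opt.check() == sat:
      model = opt.model()
      print([name for name, v in pick.items() if is_true(model[v])])  # ['s1', 's4']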


I would really like to get a degree in Mathematics, but I simply don't have the time to throw at it (work, children, etc.). What do you suggest I do to have something on my resume? A MOOC, maybe?


MOOCs usually don't help a resume, as everyone does them. The advice that I found useful for resume building is to work on projects that you can catalog in a portfolio.

With regards to gaining math skills, this upcoming MOOC from Microsoft on EdX looks promising[1].

[1] https://www.edx.org/course/essential-mathematics-for-artific...


Link didn't work for me. This one [1] did.

[1] https://www.edx.org/course/essential-mathematics-for-artific...


So you suggest that I should learn from MOOCs, then go on and work on some projects so that I can prove I really know it.


Exactly. And to take it one step further, choose one industry you are interested in. That way you will gain invaluable domain experience as you add relevant portfolio projects.

If you don't have an industry in mind, you can use a site like glassdoor.com and search for data scientist positions by city and industry to get a feel for demand.


Full disclosure: I've been in the industry for 10+ years as a programmer. I just realized that if I want to move in the AI direction I'll need some math education. I don't want to become a data scientist.


You could work through one of the ML in R/Python books, e.g. Géron or Raschka/Mirjalili, and then dig into the linear algebra, probability, stats, and calculus/analysis you see there with the books everybody recommends: linear algebra by Axler or Strang, probability by Bertsekas/Tsitsiklis, real analysis by Pugh, Abbott, Strichartz, etc.


I have a good background in graph theory (IMHO) but don't know many data science use cases (I'm an amateur at that). Could you point to some good starting points?


Graphs show up all over the place. Social media: who is connected, which people interact. Cybersecurity: which computers/programs/users interact with which other computers/programs/users. Retail analytics: which products are bought with which other products; which products are more important in a graph than others.

Basically, any problem where you can establish relations between elements can be treated as a graph. I've used graphs for image analysis before too: pixels are vertices, edges represent neighborhood relations - especially useful when you make nonlocal connections (e.g., nonlocal means; graph-cut methods for segmentation; etc...)
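
If you want a quick feel for this in code, here is a toy co-purchase graph with networkx (the products and weights are made up):

  import networkx as nx

  G = nx.Graph()
  # edges are "bought together" relations, weighted by frequency
  G.add_edge("bread", "butter", weight=12)
  G.add_edge("bread", "jam", weight=5)
  G.add_edge("beer", "chips", weight=9)
  G.add_edge("bread", "beer", weight=2)

  # which products are most "important" in the purchase graph?
  print(nx.degree_centrality(G))
  print(nx.pagerank(G, weight="weight"))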

I've worked with them in three of the above contexts: cybersecurity (my current projects), retail analytics, and image analysis. I've avoided social network stuff - never cared for that area much.


R is not present in your list. Did you ever try it, and what's your opinion of it?


R is good for one-off analysis problems, but is really bad for large distributed production systems. My team moved a large internal analysis application from R to Python because Python works well for both statisticians and software engineers.


I know R and have used it in the past. I just don’t like the language. I keep RStudio around though because on rare occasions I do look around in it to see if it has something I need. So rarely though that I forgot to list it...


I think it is safe to say that they don't think it's too important for them given their main message, but hey, maybe they have an opinion anyway...


If you have Mathematica, you might not need R as both are like Swiss Army Chainsaws for Data Analytics.


Not sure why I was downvoted here. I've used both products in this problem area. Mathematica is definitely more than a CAS. Both are great in their own ways.


I think visualization can be a helpful tool to understand the data. I have seen some DS's get caught up in visualization for visualization's sake which I think can be wasteful.

I definitely think a solid mathematical understanding helps to build quantitative and critical-thinking skills, which are key in data science.


@ms013 interested to know how you are using the solvers, are you willing to share any further details?


Responded to someone else earlier about this. Used solvers for problems that end up requiring solutions to problems like minimum set cover or schedule optimization problems. Basically, problems where a naïve or brute force approach will take forever to run and you need to use a real solver to attack it. These usually are data problems that end up looking like what would traditionally be considered under the umbrella of operations research.


Honestly, in any STEM major, esp. the physics heavy ones, those maths areas should be well understood. Is any (physics heavy) STEM major also a Data Scientist then too?


[flagged]


Unless you do a masters degree in Data Science, AI, CS, Statistics, or a related field, a masters degree only serves as proof that you are intelligent and work hard. Someone without a masters degree can still have those attributes, but they would just have to prove it some other way.

For example, if you have a bachelors degree from a top engineering school (MIT, Cal Tech, Stanford, Berkeley, etc.) you have proven that you are intelligent and can work hard.

People without a masters degree, but more business experience, bring a different perspective, and are often more business results focused, and potentially work more collaboratively than an individual who just graduated from a masters program.

Source: I am a Data Science hiring manager, and have interviewed 100+ candidates at several companies


> Unless you do a masters degree in Data Science, AI, CS, Statistics, or a related field, a masters degree only serves as proof that you are intelligent and work hard. Someone without a masters degree can still have those attributes, but they would just have to prove it some other way.

I think the commenter’s point is (implicitly) about specific, directly relevant Master’s degrees. Obviously a general Master’s wouldn’t provide much of an advantage. The difficulty isn’t demonstrating intelligence and work ethic, it’s demonstrating targeted expertise.

> People without a masters degree, but more business experience, bring a different perspective, and are often more business results focused, and potentially work more collaboratively than an individual who just graduated from a masters program.

To be honest with you, this sounds to me like complete speculation. I’m not saying it’s wrong; rather it seems like it’s at best unempirical, and at worst unfalsifiable. The qualifiers you’re using (like “potentially”, or “often”) don’t seem like strong heuristics.

I think it would be helpful to discuss straightforward job descriptions. For most real data science roles, I would not weight any of what you’ve listed (except collaboration) as being remotely as useful as demonstrable expertise in computer science and statistics. For candidates without a Master’s degree, I wouldn’t take business experience or lack thereof as a signal whatsoever - I’d look for a relevant heuristic to replace it.


During my recent data science job hunt, I received a lot of resistance due to my lack of a Master's/PhD: https://twitter.com/minimaxir/status/951117788835278848


I read your tweets and totally agree that data science interviewing is really bizarre. I have a PhD and currently work as a principal data scientist at a large company, but I've been interviewing with some other companies and the interviews are outright strange.

I've had questions ranging from reversing strings on a whiteboard to checking for valid email addresses. I had another question about flipping biased coins and calculating probabilities. It's all nonsense and totally unrelated to the skills I developed during my PhD which primarily consisted of performing massive amounts of machine learning on high performance computing systems over large sets of data to extract important insights.

But — if solving these algorithm puzzles quickly and without errors is the key to a $300k+ job, so be it. I'll just practice this nonsense until I've optimized for the skill of "interviewing", and then maybe I can contribute in some kind of meaningful way to the company with actual data science.


I'd call myself more of a "data plumber", with a double-BS in physics and computer science, but I'm considering returning for a MS so I'm more qualified to do interesting work.

Is there a consensus about what kind of Master's would be most useful for data-sciency stuff? Computer science? Stats?


> I'd call myself more of a "data plumber"

I think the actual term is Data Engineer.


Lots of universities have Data Science Master's Programs, which may ultimately fall under the Computer Science, Mathematics or Stats department. So, it depends on the university.


Data Science degrees seem to be such a hodge-podge of topics, with very uneven quality between programs. For long-term benefit, I'll admit I'm more inclined to tackle a traditional MS like CS, math, or stats.


I have needed this term for years. Thanks!


I think data scientist, much like software engineer, is something you can call yourself without having any credentials whatsoever.

It’s why technical interviews can be so brutal, unfortunately. There are a lot of frauds out there. Money attracts frauds.

What’s the fizzbuzz test for data scientists anyway?


I'm a data engineer for a startup that's trying to hire its first data scientist. The range of candidates that apply with this title is massive. Defining our expectations has been challenging.

My phone screen "fizzbuzz" is having them calculate a standard deviation from an array of data w/out with only basic operators (no numpy.std). Then explain why they choose population/sample and explain the difference.

I studied math in undergrad so one of my requirements is "knows more math than me".


> I studied math in undergrad so one of my requirements is "knows more math than me".

What kind of questions are you asking to ensure that they’re correct when they’re speaking about math you don’t know?


This is a pretty hard problem I haven't solved just yet. Generally, my in-person interview is based on a set of DS problems I've been working on and had to do research myself to solve. What I look for is a strong intuition for the underlying math: I can give them a formula, and they can intuitively express what it means, explain it to me, and then explain the next place they would take the solution. It's not a perfect measure, but I've found comfort with core concepts to be the most common trend among the great data scientists I've worked with in the past.


I think regurgitation of math formulas is a terrible way to hire for most data science positions. I've seen a breakdown of data scientists into two categories:

1) People who are great at the mathematics behind the statistical tooling

2) People who are great at conceptualizing a relevant question, operationalizing it, and then using a computer to apply appropriate models.

I think in most cases, for businesses needing to solve business problems, the latter kind is probably more useful. There are applications where the former is required, but you probably know if you need this kind of data scientist.

I should also add that these traits aren't mutually exclusive, but that individual data scientists typically are stronger or weaker along approximately those axes.

In general, I still dislike the term "data science" because it obfuscates meaningful distinctions between math nerds, computer science nerds, and research nerds who happen to do some applied stats.


I actually agree with your breakdown. But, as a "data engineer" with a math background who's spent 5 years building analytics tools, I already identify as your type-2. We're working in a field that already has a rich history of established statistics that needs to be interpreted and broken down, so I think we're looking for someone who's a type-1.

I do, however, think anyone with any lick of a statistics background should know the formula for a standard deviation, considering how fundamental the idea of variance is in statistics.


At least with fizzbuzz you are working through how to logically solve a problem. This is just regurgitating a formula. I don't see how this is helpful.


It's designed to quickly weed out people who don't know the underlying math, just as FizzBuzz is designed to quickly weed out people who don't know programming.


I work as a data scientist, and my graduate research involved harmonic analysis over compact groups, optimization over Riemannian manifolds, and loopy belief propagation. You'd reject me in an interview because I couldn't remember the formula for standard deviation off the top of my head?


It's not a hard weed-out for us. But, if you talk through variance for 10 minutes, you get pretty close to the formula for standard deviation.

Also, for our role, we're specifically hiring someone with extensive stats background since a large part of the role is learning domain-specific statistics of the industry we're targeting and figuring out how we can adopt those models with our data.


Personally, I don't ask weed-out questions. Never have, never will. I'm just saying that's what they're doing.


No, using numpy would be analogous to a simple formula. Doing it without numpy requires actually understanding what’s going on.

It’s a filter that theoretically allows false positives (which is why you continue with other questions), but it really shouldn’t have any false negatives.


> My phone screen "fizzbuzz" is having them calculate a standard deviation from an array of data w/out with only basic operators

This is just my n=1 opinion, but this is a terrible test for data science skills. I've had to calculate standard deviation by hand many times in my life, but my short term memory is such that despite doing that dozens of times over the past two decades, I still can't recall the formula off the top of my head. And then there's the whole n vs (n-1) thing in the denominator which has something to do with degrees of freedom, but I would just Google that as soon as I needed to know (depending on exactly what I was trying to do with the data).

So I don't understand how your question in any way tests someone's skills at analyzing data to extract valuable business insights. At best, it tests someone's ability to memorize formulas and minutiae (although I'll grant you that understanding the difference between a sample and the population is important).

Personally, I think take-home interviews with real data sets are the best way to gauge a candidate's skills. You're actually testing them with a work sample, and they are not under artificial time or memorization constraints.


I have to admit this scares me just a little bit. I'm a senior sysadmin who is trying to make a lateral transition into data science, but I'm no math whiz; I'm just good at pragmatic use of tech stacks and have a generally analytical mind. If you are a math undergrad, how could I ever expect to know more math than you? Of course a standard deviation should be easy, but your comment on math just stuck out to me.


> If you are a math undergrad how could I ever expect to know more math than you?

Read through, and do all the exercises in, one textbook each for:

1. Calculus

2. Linear Algebra

3. Abstract Algebra

4. Analysis

5. Topology

6. Probability Theory

7. Number Theory

...more or less in that order. Make sure your calculus book covers single variable and multivariable calculus. Supplement with applied mathematical statistics. Do that, and you have the equivalent of a mathematics undergrad (as far as relevant courses are concerned).

You could even do this with something like UIllinois’ NetMath program, or some courses on Coursera. You can swap out Number Theory for Complex Analysis or deeper Probability Theory and it’d be more relevant.


I would amend as follows:

Skip abstract algebra, topology and analysis. If you find yourself in the same room as a number theory book, walk away slowly without making eye contact lest it cast a spell on you.


If you skip analysis and at least elementary topology your understanding will be limited to discrete probability, at best. If you skip abstract algebra, you’ll miss out on a lot of buildup to advanced linear transformations and operations in vector spaces.

Sure, skip number theory. Like I said, you could swap that out.


You could also spend months of your life learning the intricacies of measure theory, but then every measure in probability is a positive sigma-additive unit measure.

One can learn the necessary topology, analysis (etc.) in the relevant places (and the relevant depths) that they come up.


That doesn't seem like a good use of time. I've tried reading through and doing the exercises in an abstract algebra textbook. It's a lot of work and the applicability to real world problems is virtually non-existent. I think a more targeted approach would give you a better return on your time.


Sure, I agree. Abstract algebra isn’t directly helpful. But:

1. The context is knowing more math than someone who has an undergraduate degree in it,

2. Abstract algebra is part of such a degree, and contributes significantly to overall mathematical maturity, and

3. You can avoid some subjects in the short term, but in the long term you can’t progress further without a reasonable mastery of algebra and analysis.

Probability theory and linear algebra are heavily used in data science. You won’t be as competitive a candidate for a job if you don’t have a firm grasp of both subjects. At a certain point, linear algebra ceases to be distinct from abstract algebra, and those exercises you were doing become applicable to real world results.


Thank you for this list, going on the todo, along with every other relevant comment on this thread. (emacs org mode is my ds notebook and todo app)


Honestly, the market is so oversaturated with PhDs who are switching to DS that I don't see how anyone can transition into it from a different role. I'm speaking for myself as someone with a math background: most people don't consider a math degree and 4 years of applying math models as a data engineer to be experience enough to be a "data scientist". They just weed out anyone without a PhD.

But what you're describing I would consider "data engineering" (at least as I have been hired to do it): working through the business problems and pragmatically facilitating data, pipelines, databases, and models to solve those problems. It's less established and less "hot", but IMO it's a much more valuable job to most businesses.


Maybe consider being a data engineer or a systems engineer? There's a pretty big demand for people that can set up, maintain, and assist the data scientists with the more complex tech stacks out there. In a former job as a systems engineer, I set up Hadoop clusters and helped manage data going into and out of it. And if you do decide to continue learning to become a data scientist, you'll already have a solid footing on the tech they actually use.


That might be an option to learn from, but it's not my end goal. As a senior sysadmin who was working for and reporting to PhD execs, I saw directly that what was needed was someone to do the data science and then bring convincing results and reports to the execs, essentially distilling the knowledge and wisdom of what needed to be done. I really want to fill that disconnect. (E.g., one of my failures as a sysadmin was focusing too much on the technical; now I want to expand and play the business boardroom politics game, but with data science.)


I think it is a reasonable expectation to require a certain baseline of expertise, considering the first thing the poster admitted to was: "I'm a data engineer for a startup that's trying to hire its first data scientist."

Much like how the early hires at Twitter were not deeply experienced in high-availability work -- segregating the architecture of a predominantly RoR code base to be resilient at scale -- which led to countless "fail whale" outages, before they eventually landed someone who helped them rethink their architecture to use RoR for what it's good at while introducing the JVM and other languages to handle other aspects of their workload.


> ... of data w/out with only basic operators... (emphasis mine)

I'm having a little trouble trying to parse that sentence. Could you explain it better?

Based on what I think is being asked, the question is essentially: What is a STD? I think this is a very straightforward and fair question.

For less stat-y HNers: the STD is just the root of the variance. The variance is the average of the squared differences between the data points and the mean. Essentially: take a point, find its distance to the mean, square that, and average over all the points. That's the variance. Take the root of the variance, and that's the STD.
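
In code, that whole recipe is a few lines of plain Python (population form by default; divide by n - 1 for the sample version):

  def std(xs, sample=False):
      n = len(xs)
      mean = sum(xs) / n
      # average squared distance to the mean (n - 1 for the sample version)
      var = sum((x - mean) ** 2 for x in xs) / (n - 1 if sample else n)
      return var ** 0.5

  print(std([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0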


This was useful info for a noob, thanks. It makes sense to me. If you understand SD in principle, you don't need to memorize the formula for a simple exercise like this, so it's a good filter for candidates who can demonstrate they do understand some basic math.


I'm in that boat. I think a technical bachelor's plus work experience and self-study allows me to get along, but similarly to mr_overalls in the sibling comment, I'm going back to school for a master's degree. Getting into a data science career can be done without a master's, but it seems it would be hard to advance without one.


Yes, they do. Having a masters or not is orthogonal to being a quack, especially in an environment as buzzword-laden as data science/ML.


I agree that a graduate degree goes a long way. Even though the requirements of the role vary greatly from team to team, it almost always involves going out to learn and apply new math to solve a problem. That's one of those things a graduate degree (humanities as well, not just STEM) tends to require of you. And a PhD demonstrates you've done that for years.


I'm a scientist (PhD student in microbiology) who works with lots of data. My data is on the order of hundreds of gigabytes (genome collections and other sequencing data) or megabytes (flat files).

I use the `tidyverse` from R[0] for everything people use `pandas` for. I think the syntax is soooo much more pleasant to use. It's declarative and, because of pipes and "quosures", highly readable. Combined with the power of `broom`, fitting simple models to the data and working with the results is really nice. Add to that that `ggplot` (+ any sane styling defaults like `cowplot`) is the fastest way to iterate on data visualizations that I've ever found. "R for Data Science" [1] is a great free resource for getting started.

Snakemake [2] is a pipeline tool that submits steps of the pipeline to a cluster and handles waiting for steps to finish before submitting dependent steps. As a result, my pipelines have very little boilerplate, they are self documented, and the cluster is abstracted away so the same pipeline can work on a cluster or a laptop.
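
To give a flavor of how little boilerplate is involved, a minimal Snakefile might look like this (the sample names and shell command are invented):

  SAMPLES = ["a", "b"]

  rule all:
      input: expand("results/{s}.stats.txt", s=SAMPLES)

  rule stats:
      input: "data/{s}.fastq"
      output: "results/{s}.stats.txt"
      shell: "wc -l {input} > {output}"

The same file runs on a laptop as-is, or fans each job out to the scheduler when you pass a --cluster option.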

[0] https://www.tidyverse.org/

[1] http://r4ds.had.co.nz/

[2] http://snakemake.readthedocs.io/en/stable/


Sometimes I think I'm the only one who isn't really a fan of the tidyverse. I've found it slower, more prone to dependency issues, more prone to silent errors, and less well documented than most R packages (ie most of what you find on CRAN).


Dependency management, in my opinion, is one of the problems in the R ecosystem. The lack of namespaces when calling functions has led the community to produce many little packages that each do only one thing, and you are never really sure where a function actually came from unless you know the code and the package.

An example is the janitor::clean_names function I like to use for standardizing the column names on a data.frame.

However, the tidyverse is really serious about API consistency and functional style, with pipes and purrr's functionality. The unixy style of base R is unproductive for quickly iterating on an analysis. Also, the idea of "everything in a data frame" (or tibble, with list columns and whatnot), together with the tidy data principles, really takes the cognitive load off and lets you just get things started.


You should try https://github.com/robertzk/lockbox for dependency management

It's like bundler or cargo for R


I agree on these reservations, especially in terms of silent errors (which get compounded through minor ways in which backwards incompatibility can sneak in to the existing scripts) and dependencies.

As a half-solution, I ended up restricting myself to a very few libraries in this family (mainly dplyr, lubridate, stringr, broom) and to using packrat to consistently freeze the library versions for these.


I really enjoy the tidyverse, especially dplyr. I do most of my work in Python now and find myself moving more and more of my time to Python.

There are definitely some issues if you have to reliably run scripts (not to mention the difficulties of putting them into production).

The thing I really like about R over Python is that for SPECIFIC tasks, like inspecting data and trying to get an answer out quickly, there really isn't a quicker or better tool to use. The ONLY reason I still even use R is the ease of getting answers with the tidyverse.


I would love some examples where R makes it easier to get answers than Python. Probably would make good Pandas2 issue too!


This is super controversial, but no need to use Jupyter.

I personally find that Jupyter feels like a hack compared to something like RStudio. You have to open a terminal and launch a web server?


You're not the only one. I've found there seems to be a bit of a cult surrounding the tidyverse: a mere hint of criticism usually results in outrage and attacks on other tools/packages (by users, not the authors).


I like the Tidyverse. My only complaint is that it presents another headache when moving from one language (R) to another (usually Python/SQL). Using the base R functions while integrating loops and functions lessens the fatigue of moving between R and Python.


I tend to think that, if I am working with other people, python is the best choice. But if I'm working alone, R is the way to go.


> less well documented than most R packages

I on the other hand, find most R packages provide barely readable documentation. I can just hope that the vignette exists and actually explains the inputs/outputs.


Here it is for one of the most often used functions:

https://www.rdocumentation.org/packages/ggplot2/versions/2.2...

You think this is better than barely readable?


I am looking here and it's great: http://ggplot2.tidyverse.org/reference/aes.html


It is the exact same thing except the examples are run. So ok, it seems some people consider this great documentation.


You are not alone. I think it’s a great thing for some people, but a net negative for the R community in the long run.


A non-insignificant fraction of the R community only exists because of Tidyverse.


As a data scientist who has been using the language for 5 years now, Julia is by far the best programming language for analyzing and processing data. That said, it’s common to find many Julia packages that are only half-maintained and don’t really work anymore. (I still don’t know how to connect to Postgres in a bug-free way using Julia.) And you’d be hard pressed to find teams of data scientists that use Julia. So in that sense, Python has much more mature and stable libraries, and it’s used everywhere. (But I really hope Julia overtakes it in the next couple of years because it’s such a well-designed language.)

Aside from programming languages, Jupyter notebooks and interactive workflows are invaluable, along with maintaining reproducible coding environments using Docker.

I think memorizing basic stats knowledge is not as useful as understanding deeper concepts like information theory, because most statistical tests can easily be performed nowadays using a library call. No one asks people to program in assembler to prove they can program anymore, so why would you memorize 30 different frequentist statistical tests and all of the assumptions that go along with each? Concepts like algorithmic complexity, minimum description length, and model selection are much more valuable.
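
On the "library call" point: most of those frequentist tests really are one-liners today, e.g. a Welch's t-test in scipy (toy numbers):

  from scipy import stats

  a = [2.1, 2.5, 2.8, 3.0]
  b = [1.9, 2.0, 2.4, 2.2]
  t, p = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test in one call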


> That said, it’s common to find many Julia packages that are only half-maintained and don’t really work anymore.

On this specific point, it's worth noting that up until now there's been a single massive repository of every Julia package ever published, regardless of its current state or utility. Starting with the upcoming 0.7 release, Julia will introduce the concept of "curated" repositories so that, going forward, if you stick just with the default curated repository of packages you should have much less chance of running into a broken or unmaintained package.


Has Julia converged on a solution for data frames? I watched some JuliaCon videos and got the impression that they hadn't. There seem to be a lot of different overlapping efforts.


Well, only the DataFrames.jl package comes to mind. However, there exist a few packages that extend it (DataFramesMeta.jl or Query.jl; these overlap to some extent, but the newer Query package seems to go beyond DataFrames and offers some piping capabilities to interface with plotting packages). In general: during the three years of my PhD, some language/package upgrades broke some of my scripts (during 0.4 -> 0.5 and 0.5 -> 0.6), but the language (and its extensive documentation, online and in the source code of the packages) is very pleasant to use - the deprecation warnings usually help you adjust your code in time. I have been relying heavily on said DataFrames package and am quite happy - the community is usually responsive and helpful in case of problems or questions.


My toolkit hasn't changed since 2016:

- Jupyter + Pandas for exploratory work, quickly define a model

- Go (Gonum/Gorgonia) for production quality work. (here's a cheatsheet: https://www.cheatography.com/chewxy/cheat-sheets/data-scienc... . Additional write-up on why Go: https://blog.chewxy.com/2017/11/02/go-for-data-science/)

I echo ms013's comment very much. Everything is just tools. It's more important to understand the math and the domain.


I'm a big Go fan, but this is the first time I've seen someone recommend Go for data science. After looking at this cheat sheet you've got me convinced though. Would you mind pointing me to any other less cheat sheet style and more in depth examples that you particularly like?


Working on it. Part of my goal for 2018 is to write a lot more soft documentation - tutorials etc.

Go is quite straightforward though - WYSIWYG for the most part, hence you probably won't find a lot of sexy tutorials. Almost everything is just a loop away, and in the next version of Gorgonia, even more native looping capability is coming.


Awesome, thank you!


You might also want to have a look at:

- http://gopherdata.io

... and in particular the resources lists at

- https://github.com/gopherdata/resources

Also, Dan's GopherCon talk on Go for data science is a great way to get yourself convinced enough to try it out:

- https://www.youtube.com/watch?v=D5tDubyXLrQ


A couple of thoughts, off the top of my head:

Programming languages:

  - python (for general purpose programming)
  - R (for statistics)
  - bash (for cleaning up files)
  - SQL (for querying databases)
Tools:

  - Pandas (for Python)
  - RStudio (for R)
  - Postgres (for SQL)
  - Excel (the format your customers will want ;-) )
Libraries:

  - SciPy (ecosystem for scientific computing)
  - NLTK (for natural language)
  - D3.js (for rendering results online)


I make the claim that you can go very far in the SciPy ecosystem without ever touching R.

It is worth understanding the concepts of numpy and pandas. Furthermore, try out IPython/Jupyter, especially for rapid publishing (people run their blogs on jupyter notebooks).

I think certain libraries depend very much on where you focus. Machine learning? Natural language processing? Visualization? Something in economics? Fundamental sciences? For instance, I never need NLTK in theoretical astrophysics ;-) Instead, I need powerful GPU-based visualization, which is however very old school with VTK and VisIt/Amira/Paraview (also very much pythonic).


I disagree, even though python is the language I do most of my development in. But it probably depends on the problems we're thinking of a data scientist solving.

If you're doing a lot of work with matrices, model fitting in production, then Python seems fine. However, a lot of data scientists I see are more like scrappy data analysis/visualization types who are churning out small dashboards. In that case R's tidyverse and shiny are just incredibly fast to develop with.


I second that R is nice to have, but not needed. I’ve been doing science in Python for a decade without ever needing R.

For powerful GPU viz, have you considered vispy? Four authors of four independent Python science visualization libs got together to build it.


Agree, I would drop R, Python has you mostly covered now. Julia is also worth learning.


I wouldn't be recommending to drop R at all.

Very few enterprise data science teams are 100% Python (in fact none I've heard of). R is still very heavily used (and in fact all data science teams I've worked in it has been the dominant technology).

There is a reason Microsoft purchased Revolution.


R, python and Julia are all Turing-complete languages, so of course you can drop any two and get by with just the third.

The real selection happens when you consider what's available in opensource world. What code you don't have to write? What high-quality libraries are available vs which ones you will have to write yourself?

On this topic, R has a vast advantage over Python in some domains, such as bioinformatics, while Python definitely shines when it comes to deep learning (and using for loops).

You can't just claim that one shouldn't look at R because you personally know one language better the other, quite likely because in your domain it's not being used as much.

I do prefer the deep learning, NLP, and production serving story in Python, but you will have to pry dplyr+ggplot from my cold dead hands for quick analysis and charting. Not to mention that pandas's API is a clusterfuck compared to R's native data frames.


Maybe SpaCy for NLP. Way more intuitive and fast too. Good list.


Most of these are conveniently packaged in:

$ docker run -it --rm -p 8888:8888 jupyter/datascience-notebook


I'd gently suggest basic CLI Perl over bash for cleaning up files, as it combines grep/sed/awk in a language that's more generally useful.


Agreed. Perl was designed for text munging, and is superior to pretty much everything for this task.

WRT bash, where to begin? In the past 40 years, there’s pretty much a better tool for everything someone tries to do with bash. It lives on pretty much through inertia and pride.


FreeBSD sh(1) (not bash(1)) man page. That's just how I understood how to shell. Nowadays I'm running Debian and my $SHELL is /bin/bash, but when I was on FreeBSD I really learnt tools like make(1), sh(1); the man pages were pieces of art. Having read sh(1), I do have a nice grasp of how shell works in general, to which knowledge I can add anytime the higher-level goodies bash has to offer (though I generally prefer keeping it POSIX, and using an actual programming language when it doesn't cut it).


Good list. I would add the tidyverse from the R ecosystem to it.


I would go as far as saying the tidyverse is an essential piece of working with R. Base R sans tidyverse is not a pleasant experience.


It's not that bad. It's inconsistent and clunky, but all of the tools are there (and tend to be faster than the tidyverse versions). Don't get me wrong, I love the tidyverse but R is very, very usable without it.


If you care about quantifying uncertainty, knowing about Bayesian methods is a good idea I don't see represented here yet. I care so much about uncertainty quantification and propagation that I work on the Stan project[0], which has an extremely complete manual (600+ pages) and many case studies illustrating different problems. Full Bayesian inference, such as that provided by Stan's Hamiltonian Monte Carlo algorithm, is fairly computationally expensive, so if you have more data than fits into RAM on a large server, you might be better served by some approximate methods (but note the required assumptions) like INLA[1].

[0] http://mc-stan.org/

[1] http://www.r-inla.org/


I think this is an important point. Having worked in / proximate to public policy kinds of problems, Bayesian methods have some really great properties:

1. easier interpretation of results than frequentist methods for lay people (business strata, elected officials, or other decision makers)

2. Uncertainty can be quantified and visualized reasonably well, which helps decision makers not think of stats as a magic box that produces a single answer.

3. Sensitivity analysis can be placed right up front: selection of priors representative of the beliefs of differing opinions / ideologies can inform decision makers of when they should consider changing their minds, and when they might still hold out.

Downsides of Bayesian methods:

1. Conceptually more involved than typical maximum likelihood estimation methods

2. Computationally expensive

3. Methods might not be as well known to a nominally stats-savvy audience.


I have also used Bayesian quantification of uncertainty in pricing forecast models. Decision makers love a measure of uncertainty when one recommends a pricing scenario that can have significant impact on revenue. Also, you get the chance to build multilevel models to combine knowledge from independent samples. PyMC3 is fantastic for building these models within Jupyter, and Gelman's Bayesian Data Analysis is a great introduction to different Bayesian model applications.
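
For anyone who hasn't tried it, a minimal PyMC3 model looks roughly like this (made-up data; just estimating a mean with uncertainty):

  import numpy as np
  import pymc3 as pm

  data = np.random.normal(5.0, 2.0, size=100)  # fake observations

  with pm.Model():
      mu = pm.Normal("mu", mu=0, sd=10)      # prior on the mean
      sigma = pm.HalfNormal("sigma", sd=5)   # prior on the spread
      pm.Normal("obs", mu=mu, sd=sigma, observed=data)
      trace = pm.sample(1000)                # posterior samples via NUTS

  print(np.percentile(trace["mu"], [2.5, 97.5]))  # 95% credible interval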


Do you have a recommended guide/textbook for learning Stan? I've recently started doing more Bayesian analysis, mainly "Bayesian estimation supersedes the t-test".


As someone who uses Stan - I would recommend reading the Stan reference documentation, it's essentially a textbook.

Also, get used to reading the Stan forums on Discourse. Happy Stanning


> what tools should be in my arsenal

A sound understanding of mathematics, in particular statistics.

It's amazing how many people will talk endlessly about the latest Python/R packages (with interactive charting!!!) who can't explain Student's t-test.


Dealing with large data processing problems my main tools are as follows:

Libs:

  - Dask for distributed processing
  - matplotlib/seaborn for graphing
  - IPython/Jupyter for creating shareable data analyses

Environment:

  - S3 for data warehousing; I mainly use parquet files with pyarrow/fastparquet
  - EC2 for Dask clustering
  - Ansible for EC2 setup

My problems usually can be solved by 2 memory-heavy EC2 instances. This setup works really well for me. Reading and writing intermediate results to S3 is blazing fast, especially when partitioning data by days if you work with time series.

Lots of difficult problems require custom mapping functions. I usually use them together with dask.dataframe.map_partitions, which is still extremely fast.

The most time-consuming activity is usually nunique/unique counting across large time series. For this, Dask offers hyperloglog based approximations.
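
For context, the map_partitions + approximate-counting pattern looks roughly like this (bucket path and column names are hypothetical):

  import dask.dataframe as dd

  df = dd.read_parquet("s3://my-bucket/events/")  # partitioned by day

  # custom per-partition transformation
  def add_duration(part):
      part = part.copy()
      part["duration"] = part["end_ts"] - part["start_ts"]
      return part

  df = df.map_partitions(add_duration)

  # hyperloglog-based approximate distinct count
  n_users = df["user_id"].nunique_approx().compute()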

To sum it up, Dask alone makes all the difference for me!


What does "Data Scientist" actually mean these days? Does it mean "Write 10 lines of Python or R, and not fully understand what it actually does"? Or something else?

I just see the term flung around so much recently, and applied to so many different roles, that it has all become a tad blurred.

Maybe we need a Data Scientist to work out what a Data Scientist is?


I hire data scientists, so I can tell you.

It means someone who can work with business stakeholders to break down a problem e.g. "we don't know why customers are churning", produce a machine learning model or some adhoc analysis (usually the former) and either communicate the results back or assist in deploying the model into production.

Typically there will be data engineers who will be doing acquisition and cleaning and so the data scientists are all about (a) understanding the data and (b) liaising with stakeholders.

As for technologies, it is typically R/Python with Spark/H2O on top of a data lake, i.e. HDFS, S3. Every now and again on top of an SQL store, e.g. EDW, Presto, or a feature store, e.g. Cassandra.


That's a good meta reflection. Let's make a Y combinator of Data Scientist a and Data Scientist b (a recursive data scientist) to prove they can support recursion if Data Scientists a and b are first-class functions, just because we can:

  const Y = a => (b => b(b))(b => a(x => b(b)(x)));
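
And, to check that our hypothetical data scientist actually recurses:

  const fact = Y(a => n => n <= 0 ? 1 : n * a(n - 1));  // fact(5) === 120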


A lot of people in this thread are focusing on technical tools, which is normal for a discussion of this type, but I think that focus is misplaced. Most technical tools are easily learnable and are not the limiting factor in creating good data science products.

https://towardsdatascience.com/data-is-a-stakeholder-31bfdb6...

(Disclaimer: I wrote the post at the above link).

If you have a sound design you can still create a huge amount of value even with a very simple technical toolset. By the same token, you can have the biggest, baddest toolset in the world and still end up with a failed implementation if you have bad design.

There are resources out there for learning good design. This is a great introduction and points to many other good materials:

https://www.amazon.com/Design-Essays-Computer-Scientist/dp/0...


I'd say:

1. You need research skills that will allow you to ask the right questions, define the problem and put it in a mathematical framework.

2. Familiarity with math (which? depends on what you are doing) to the point where you can read articles that may have a solution to your problem and the ability to propose changes, creating proprietary algorithms.

3. Some scripting language (Python, R, w/e)

4. (optional) Software Engineering skills. Can you put your model into production? Will your algorithm scale? Etc.


> What’s the fizzbuzz test for data scientists anyway?

Here's 3 questions I was recently asked on a bunch of DS interviews in the Valley.

1. Probability of seeing a whale in the first hour is 80%. What's the probability you'll see one by the next hour? Next two hours?

2. In a closely contested election with 2 parties, what's the chance only one person will swing the vote, if there are n = 5 voters? n = 10? n = 100?

3. Difference between Adam and SGD.
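
For what it's worth, the standard intended answer to (1), assuming sightings in disjoint hours are independent with the same probability: the chance of missing in one hour is 0.2, so

  p_hour = 0.8
  p_two_hours = 1 - (1 - p_hour) ** 2  # = 0.96, given independent hours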


Python: Jupyter, pandas, numpy, scipy, scikit-learn

Numba for custom algorithms.

Dataiku (amazing tool for preprocessing and complex flows)

Amazon RDS (Postgres), but thinking about Redshift.

Spark

Tableau or plotly/seaborn


I would think about which of these you see yourself doing more..

* statistical methods (more math)

* big, in-production model fitting (more python)

* quick, scrappy data analyses for internal use (more R)

For example, I would feel weird writing a robust web server in R, but it's straightforward in python. On the other hand R's shiny lets you put up quick, interactive web dashboards (that I wouldn't trust in exposing to users).


If you work in a bigger company doing data analytics, you may also come across Tableau instead of Excel. Apart from SQL, if there is more data, you might want to use BigQuery or something similar.


One crucial skill you will need is feature engineering. Formal methods for it aren't typically taught in data science classes. Still, it's worth understanding in order to build ML applications. Unfortunately, there aren't many tools available today, but I expect that to change this year.

Deep learning addresses it to some extent, but isn’t always the best choice if you don’t have image / text data (eg tabular datasets from databases, log files) or a lot of training examples.

I’m the developer of a library called Featuretools (https://github.com/Featuretools/featuretools) which is a good tool to know for automated feature engineering. Our demos are also a useful resource to learn using some interesting datasets and problems: https://www.featuretools.com/demos
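
A minimal run against the mock customer data bundled with the library looks roughly like this:

  import featuretools as ft

  # toy entityset shipped with the library for demos
  es = ft.demo.load_mock_customer(return_entityset=True)

  # deep feature synthesis: auto-generates aggregation/transform features
  feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="customers")
  print(feature_defs[:5])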


IPython/Jupyter, Pandas/Numpy and Python will get you everywhere you need to go. Currently, until maybe Go gets decent DataFrame support, in terms of the total time to get to your solution, I'd be amazed if any other setup got you there quicker.


> get you everywhere you need to go

No it won't.

That combination can't handle the large datasets that are typical for most data science teams (i.e. maybe include PySpark). And then it's very limited as far as ML/DL technologies go.


> i.e. maybe include PySpark

Pandas and Spark are both DataFrame libraries, and seem to offer very similar functionality to me. Why do you prefer Spark over Pandas?

> very limited so far as ML/DL technologies

I mean, getting Tensorflow up and running with GPU support isn't trivial, but it's not exactly hard, and Keras[1] provides excellent support for a wide variety of other backends. What, in your experience, is less limited?

[1]: https://keras.io/


Spark sits on top of YARN/Mesos, and is used for data processing scalability that pandas can't handle.

Personally, I think two areas often lacking are software development skills and general statistics knowledge. The former is necessary for writing production-quality code, assisting with any sort of data engineering pipeline, writing reliable, reusable code, and creating custom solutions. Unfortunately, the latter is often skimped on (if not skipped entirely) in favor of more 'hot' fields like ML/DL, with the result being a fuzzy understanding across the board. (You'd be amazed at the number of candidates lacking fundamental knowledge about GLMs, basic nonparametric stats, popular distributions, etc.)


>typical for most data science teams

I would bet that the mean size of dataset people are dealing with is a lot bigger than the median size.


You can get a lot of mileage out of just using R, dplyr, ggplot2 and lm/glm. OLS still performs well in a lot of problem spaces. Understanding your data is the key there, and a lot of exploratory visualization there will help a lot.


Hey everyone, I'm not a data scientist or a developer but I work with a lot of them. My company, Introspective Systems, recently released xGraph, an executable graph framework for intelligent and collaborative edge computing that solves big problems: those that have massive decision spaces, tons of data, are highly distributed, dynamically reconfigure, and need instantaneous decision making. It's great for the modeling work that data scientists do. Comment if you want more info.


grep, cut, cat, tee, awk, sed, head, tail, g(un)zip, sort, uniq, split; curl; jq, python3


So unix? lol


Static typing lets you catch errors before running the code.

Pattern matching helps you write code faster (that is, spending less human time).

Algebraic data types, particularly sum types, let you represent complicated kinds of data concisely.

Coconut is an extension of Python that offers all of those.

Test driven development also helps you write more correct code.


A good understanding of calculus (probability), linear algebra, and your dataset/domain. Anything else can be picked up as you need it. Oh, and test-driven development in some programming language, otherwise you can't develop code you know is correct.


Experimental design and observational causal inference would be excellent skills to have. Especially if you’re working with people who are asking you “why” questions, ML is helpful but isn’t going to cut it alone.


As 1 TB of processing is free every month, and it supports the SQL:2011 standard combined with JavaScript UDFs, the winning solution for us is Google BigQuery, combined with Dataprep.


Spark + MLlib, Python + Pandas + NumPy + Keras + TensorFlow + PyTorch, R, SQL, top placement in some Kaggle competitions. This would get you long way.


Good tool set recommendations (+1 for mentioning SQL, immensely helpful), and I enjoy Kaggle. Not sure how critical top placement is, though.

It seems like getting into the upper echelons of Kaggle is a matter of refining your model, and I do wonder how much value these refinements offer over a more basic and general approach in a real-world scenario. To be clear, when I say I wonder, I'm not rejecting the value; I really do mean I'm uncertain about it. I think it's probably very scenario-specific.

Think of it this way - a predictive value of 90% vs 95% could be the difference between placing in the top 10% and the bottom third. Now, 5% isn't nothing, it could be very valuable. It really depends.

But Kaggle is an environment where the question is already posed, the data has been collected, the test and train sets are already split apart for you, and winning model is the one that scores best on a hidden test set by a predefined goodness of fit score.

In a real world scenario, suppose someone does a great job figuring out the question to ask, gathering the data, and determining the most effective way to act on the results, but uses a fairly basic, unrefined model. Someone else does a middling job on those things, but builds a very accurate model as measured by the data that has been collected. I'd say the first scenario is likely to be more valuable, but again, it depends of course.

A couple other things, since I am a fan of Kaggle and do highly recommend it. First, these things aren't necessarily exclusive - you can have a particularly well-conceived and refined model as well as a thorough and excellent business and data collection process (though you may have to decide where to put your time and resources).

Also, refining a model with Kaggle can be an exceptional training opportunity to really understand what drives these things. So go for it! (I also find these things kinda fun).


Top placement in Kaggle attracts recruiters for higher positions; e.g., I observed a top-10 person getting a job as Head/VP of Analytics at a large European company even if, let's say, the formal education wasn't top-100. I agree that in the real world it is often useless, but people are drawn to proven winners.


I'm not too surprised to hear that. In fact, I'd say a top score on Kaggle is probably a pretty positive indicator. Yeah, refining the model probably isn't as big a deal in a real project as it is on Kaggle, but it still takes some decent chops to get a good score like that.

My best was somewhere in the top third, so I'm not an especially strong Kaggle competitor. But even that took a lot of data parsing, piping, cleaning, moving some things to a database, populating a model, and parallelizing the processing so I could run things on a cloud in an hour rather than 100 hours on my laptop. I learned a lot from it.

If you can score high on Kaggle, you definitely have some skill. And it's hardly like people who can do this never have the other skills necessary to manage the other stages of a data science project.

I probably wouldn't hire someone purely on Kaggle scores, but sure, it's a positive indicator of programming and data management ability.


Nobody mentioned this yet: ETE: http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.ht...

a fantastic tree visualization framework; it's intended for phylogenetic analysis but can really be used for any type of tree/hierarchical structure


There are two "poles" in data science: math/modeling and backend/data-wrangling. Most of the time, the backend/data-wrangling piece is a prerequisite to the math/modeling. The vast majority of small and medium sized companies have not set up the systems they would need to support a data scientist who knows only math/modeling. Depending on the domain, it's not uncommon to find that a small/medium company outsourced analytics to Firebase, Flurry, etc...

That's fine, but when it comes time to create some customer segmentation models (or whatever) the data scientist they hire is going to need to know how to get the raw data. Questions become: how do I write code to talk to this API? How do I download 6 months of data, normalize it (if needed) and store it in a database? Those questions flow over into: how do I set up a hosted database with a cloud provider? What happens if I can't use the COPY command to load in huge CSV files? How do I tee up 5 TB of data so that I can extract from it what I need to do the modeling? Then you start looking at BigQuery or Hadoop or Kafka or NiFi or Flink and you drown for a while in the Apache ecosystem.
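
Concretely, the first weeks of a job like that are often glue code of this shape (the endpoint, credentials, and table names here are of course hypothetical):

  import pandas as pd
  import requests
  from sqlalchemy import create_engine

  # pull six months of raw events from a vendor API
  resp = requests.get("https://api.example.com/v1/events",
                      params={"days": 180}, timeout=60)
  df = pd.DataFrame(resp.json())  # normalize/clean as needed

  # land it in a hosted Postgres so modeling can start
  engine = create_engine("postgresql://user:pass@host:5432/analytics")
  df.to_sql("raw_events", engine, if_exists="append", index=False)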

If you take a job at a place that has those needs, be prepared to spend months or even up to a year to set up processes that allow you to access the data you need for modeling without going through a painful 75 step process each time.

Case in point: I recently worked on a project where the raw data came to me in 1500 different Excel workbooks, each of which had 2-7 worksheets. All of the data was in 25-30 different schemas, in Arabic, and the Arabic was encoded with different codepages depending on whether it came from Jordan, Lebanon, Turkey, or Syria. My engagement was to do modeling with the data and, as is par for the course, it was expected that I would get the data organized. To be more straightforward: the team with the data did not even know that the source format would present a problem. There were ~7500 worksheets, all riddled with spelling errors and the type of things that happen when humans interact with Excel: added/deleted columns, blank rows with ID numbers, comments, different date formats, PII scattered everywhere, etc.
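The first pass over a mess like that tends to look something like this sketch (paths and provenance columns are stand-ins; the real cleaning, encoding repair, and schema mapping is the hard 90%):

    import glob
    import pandas as pd

    frames = []
    for path in glob.glob("raw/**/*.xls*", recursive=True):
        # sheet_name=None returns every worksheet as a dict of DataFrames.
        for sheet, df in pd.read_excel(path, sheet_name=None).items():
            df["source_file"] = path    # keep provenance for debugging
            df["source_sheet"] = sheet
            frames.append(df)

    # One big frame; sorting out the 25-30 schemas comes afterwards.
    raw = pd.concat(frames, ignore_index=True)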

A data scientist's toolkit needs to be flexible. If you have in mind that you want to do financial modeling with an airline or a bank, then you probably can focus on the mathematics and forget the data wrangling. If you want the flexibility to move around, you're going to have to learn both. The only way to really learn data wrangling is through experience, though, since almost every project is fundamentally different. From that perspective, having a rock solid understanding of some key backend technologies is important. You'll need to know Postgres (or some SQL database) up and down; how to install, configure, deploy, secure, access, query, tweak, delete, etc. You really need to know a very flexible programming language that comes with a lot of libraries for working with data of all formats. My choice there was Python. Not only do you need to know the language well, you need to know the common libraries you can use for wrangling data quickly and then also for modeling.

IMO, job descriptions for "Data Scientist" positions cover too broad of a range, often because the people hiring have just heard that they need to hire one. Think about where you want to work and/or the type of business. Is it established? New? Do they have a history of modeling? Are you their first "Data Scientist?" All of these questions will help you determine where to focus first with your skill development.


So basic DBA skills + expert programming skills + very good math/stats?

Also - your model of asking questions before starting a new gig is very relevant to nearly every programming job. Could also be some of the questions a candidate asks in an interview.

Have you ever needed any Microsoft skills (MSSQL/C#) so far?


Yep, I've used MS SQL products, and I write C# sometimes; I read and parse C# code very often because it's the primary language of the products I support.


I saw a simple tool somewhere a while ago (maybe a month or so ago): a simple CLI for data inspection in the terminal. It seemed very useful for inspecting data when ssh'ed into a machine.

However, I can't seem to recall the name. Has anyone seen what I'm talking about?


Any programming language that you are proficient in. A solid understanding of how a computer works. A solid basis in statistics. Anything else is just sprinkles: trends and field-specific extras.


> Any programming language that you are proficient in.

Oh, I don't know about that. Programming languages are force multipliers, and each language has a different force coefficient for different problem domains. They are not all equivalent. They have different points of leverage, and simply being good in one does not mean you can solve problems in any domain with ease. In fact, the wrong programming language can often be harmful if it's ill-suited to the problem at hand, especially if it contorts your mental model of what you can do with the data.

One example I encounter a lot in industry is Excel VBA. I'm fairly good at VBA and have seen very sophisticated code in VBA. I've also seen many basic operations implemented badly in VBA that should not have been written in VBA at all. By solving the problem in VBA, the solution is often "hemmed in" by the constraints of VBA.

For instance, unpivoting data is often done badly in VBA (with for-loops), but is trivial to do well in dplyr or pandas.
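For example, a sketch of what "trivial" looks like in pandas (made-up column names):

    import pandas as pd

    # Wide data: one column per month.
    wide = pd.DataFrame({"store": ["A", "B"],
                         "jan": [10, 20],
                         "feb": [15, 25]})

    # Unpivot to long format: one row per (store, month) pair.
    long = wide.melt(id_vars="store", var_name="month", value_name="sales")
    print(long)

The VBA equivalent is nested loops over ranges, and every schema change breaks it.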

So I would say one has to choose one's programming language somewhat carefully. Not any language will do.


Hard to say... I was more proficient in PHP than Python, but for AI work we use Python anyway, since some of the necessary libraries just aren't there in PHP...


Are a lot of people using Spark?


Absolutely

Every large-scale data science team (e.g., Google, Spotify, Airbnb) will be using Spark for most of their work. It is by far the de facto standard for working with large datasets, especially since it integrates so well with machine learning tools (e.g., H2O) and different languages (Scala, Python, R).


Definitely. It's very nice to do large jobs in such a scalable manner, and interacting with databases is very straightforward. I'd also recommend Scala, especially if using Spark. I've grown to like it as much as, if not more than, Python, and you can use Zeppelin/Jupyter notebooks with it as well.


Maybe too soon, but this framework[0] claims to be 2x faster than Spark.

[0]: https://datafusion.rs/


We use Spark for most of our work. We love it; it has been able to handle all our use cases so far, and we really appreciate the fact that Scala also runs on the JVM.


Same question I have. Is anyone using PySpark in production?

Would you use PySpark MLlib in a web service instead of scikit-learn?


1) Yes, PySpark is great if you're mostly just doing dataframe manipulation in Spark, using built-in functions. PySpark actually has similar performance to Scala Spark for dataframes. (We've moved away from RDDs)

However, if you use a lot of UDFs where Spark has to serialize your Python functions, you might consider rewriting those UDFs in a JVM language. Serialization overhead is still fairly substantial. Arrow is trying to address this by implementing a common in-memory format, but it's still early days.

I would still recommend PySpark to most people. It's more than good/fast enough for most data munging tasks. Scala does buy you two things: type safety and low serialization overhead (the savings can be significant!), which can be critical in some situations, but not all.

Also, the Python way has always been to prototype fast, profile, and rewrite bottlenecks in a faster language, and PySpark conforms to that pattern.
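As a rough illustration of the two paths (a sketch with made-up data; the built-in stays in the JVM, while the UDF round-trips every row through Python):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

    # Fast path: built-in function, executed in the JVM.
    df.withColumn("upper", F.upper(F.col("name"))).show()

    # Slow path: a Python UDF forces per-row serialization between
    # the JVM and the Python workers.
    to_upper = F.udf(lambda s: s.upper(), StringType())
    df.withColumn("upper", to_upper(F.col("name"))).show()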

2) Spark MLlib is still fairly rudimentary in its coverage of major ML algorithms, and Spark's linear algebra support, while serviceable, is currently not very sophisticated. There are a few functions that are useful in the data prep stage (encoding, tokenizers, etc.), but overall, we don't really use MLlib very much.

Companies that have simple needs (e.g., a simple recommender) and that don't have a lot of in-house expertise might use MLlib, though -- I believe someone from a startup at a recent meetup said that they did.

Most of us need better algorithmic coverage and Scikit's coverage is currently much better, plus it is more mature. We also have Numpy at our disposal, which lets us do matrix-vector manipulation easily. There is some serialization cost, but we can usually just throw cloud computational power at it.

Also note that for most workloads, the majority of the cost is incurred in training. For models in production, one is typically processing a much smaller amount of data using a trained model, so less horsepower is required.
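That training/serving split is roughly this pattern (a sketch with random stand-in data):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Training sees the heavy historical dataset...
    X_train = np.random.rand(10000, 20)          # stand-in features
    y_train = np.random.randint(0, 2, 10000)     # stand-in labels
    model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

    # ...while production scoring handles a small batch per request.
    X_live = np.random.rand(32, 20)
    preds = model.predict(X_live)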


Hi, thanks for the answer. What you said resonates with me, with a few changes. Spark 2.3 will come with Arrow UDFs, which should be a significant performance boost. In that sense, yes, we are taking a forward-looking bet.

About MLlib: yes, we concur with you on algorithmic coverage. And yes, training is the major issue. For example, from what I've read of Uber's Michelangelo infrastructure, it seems they train using Spark and save to a custom format that is deserialized (using custom code) and made available as a Docker image.

There is value in consistency: using Spark through and through. I wonder what you think of that?


1) I've heard about vectorized Python UDFs in Spark 2.3. Thanks for reminding me of that.

https://databricks.com/blog/2017/10/30/introducing-vectorize...

2) I'm not that familiar with what Uber is doing. My take is I'd like to use Spark for as much as I can, but there are parts that are either more performant or easier to accomplish in Python.

Spark with Arrow will definitely change the game.
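For reference, the vectorized UDFs from that post look roughly like this (a sketch; it assumes Spark 2.3+ with PyArrow installed):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()

    # Arrow-backed UDF: receives whole pandas Series batches instead
    # of one Python object per row, so serialization cost drops a lot.
    @pandas_udf("double", PandasUDFType.SCALAR)
    def plus_one(v):
        return v + 1.0

    spark.range(0, 1000).withColumn("y", plus_one("id")).show(3)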


If you use Python: scikit-learn, Pandas, NumPy, Tensorflow or PyTorch

Language agnostic: XGBoost, LibLinear, Apache Arrow, MXNet


OpenRefine (openrefine.org) is definitely a handy (and automate-able) part of my data-cleansing workflow.


You probably mean "data analyst".

The "data scientist" title would apply only if you are applying the scientific method to discover new facts about the natural world exclusively through data analysis (as opposed to observation and experiments).


Designing experiments is a key part of Data Science work. Another key part is determining where & how revealing observations can be made.

The analysis part is usually quite simple; if it gets really complex, that's often a sign that the data is being tortured. Sometimes the marginal gains that complex methods create (vs. simple but good approaches) are not worthwhile even if they are valid, simply in terms of time spent and difficulty of communication.


Define "natural world"... and gather a consensus around your definition...

Or maybe the humanities as a whole should be considered "not science".

Besides, a data analyst who doesn't use the scientific method is just a bad analyst. Some media outlets showcase blatantly lying charts made by people who understand the technicals but get everything wrong about the concepts.

So this is my advice: focus on understanding the concepts before the tooling. That is what will really determine your value.


NumPy, Jupyter (formerly IPython Notebook), and probably Mathematica anyway.


Any book recommendations?


Counting and dividing.


Random Matrix Theory.


Excel, VBA, SPSS ;)


OpenRefine has helped me a lot in data cleaning tasks.



