Ask HN: Data scientists, what tools do you use the most on a daily basis? - rusht
======
laingc
An answer in two parts:

The software tools: For me, PyCharm and Jupyter for development, Dask for data
preprocessing, Tensorflow, Theano, and various scipy libraries for ML, as well
as helpers such as PyMC3 and Edward for probabilistic programming.

The important tool: a PhD in applied mathematics that taught me how to think
originally and creatively about problems, to abstract and codify them with
mathematics, and to approach them numerically and computationally.

I suspect that the answer you were looking for is the first set of tools. This
set is extremely nice to have; the second set, at least for me, is essential.

~~~
DrNuke
Nevertheless, 80/20 (an 80% solution in 20% of the time) does wonders with the
right set of tools and a rudimentary appreciation of applied maths, and 80/20
is often what businesses really need. Unpopular as it may be, the latest tools
work pretty well as black boxes, so creativity shifts from the maths to the
application domains. No need to scare the OP, then; he is not a direct threat
to your job.

~~~
laingc
> No need to scare the OP, then, he is not a direct threat to your job.

I find it a bit disheartening that this is how you read my response to OP's
question. The denizens of HN may in general disapprove, but I believe my
education to be one of the most important factors in doing my job. I don't
think it's fair to respond with thinly-veiled accusations of elitism,
especially when I explicitly tried to also give him or her a "practical"
answer.

> Nevertheless, 80/20 (80% solution in 20% time) does wonders with the right
> set of tools and a rudimentary appreciation of applied maths, 80/20 being
> often what businesses really need.

In some cases, you're right, but more and more often I see people assuming
that they've found the "80% solution", when in reality it's something like a
30 or 40% solution. My personal experience is that a bit of thought often
shows you that seemingly simple problems are hiding both a lot of complexity
and a lot of potential - and I've seen real money left on the table because of
a failure to recognise this.

My opinion is that conceptualising the problem in the first place is what
allows you to really see what an "80/20" solution looks like, and in my case I
don't think I would be able to do that without my most potent weapon: my
mathematical education.

Not every problem is like that: sometimes a simple, out-of-the-box solution
does work really well, and solves a concrete business need. My point is that
when out-of-the-box solutions are about the limit of what you can do, you
don't really know whether your problem is in this class or not.

~~~
DrNuke
My constructive point was not elitism but this: "Unpopular as it may be, the
latest tools work pretty well as black boxes, so creativity shifts from the
maths to the application domains." Several tech fields, and all the human
sciences, are directly benefitting from an 80/20 data science approach; these
are the new domains being added, and they also have a higher ceiling, if you
like.

------
cardosof
That really depends on the task at hand and for each case there are lots of
similar tools, so let's consider just the most frequently used.

Daily I'd say RStudio, Excel, Tableau, PowerPoint. Either I'm coding in R or
I'm presenting.

------
roystonvassey
Development - Jupyter IDE for Python
Text Editor - Sublime Text
Knowledge Base - Stackoverflow
Visualization - Matplotlib + Excel Charts
Presentation - PowerPoint

------
dagw
Much of the data I analyse has a spatial/geometric/geographic component and
for that I wouldn't want to do my job without FME. In fact it's probably one
of my all-time favorite pieces of software.

Otherwise jupyter with numpy and all the other great python libraries is where
I spend much of my time.

------
numbernine
Atom + hydrogen addon
Python 3
Keras and Tensorflow (I'm doing nlp)

------
usgroup
Out of interest, is everyone using Spark/Hadoop/etc. because you need to,
because you chose to, or for some other reason? IMO, legitimate use cases are
relatively sparse.

~~~
syllogism
Let's say you're developing your analysis script, and it takes 45 minutes to
run. That doesn't sound like much of a problem, but it means you only really
get to try 10 things a day. You can kick something off and then start trying
something different, but that's actually quite a burden: it's cognitively
easier to run tasks one after another than to juggle several streams of work
in parallel.
Even if a job is taking 15-20 minutes, that can really wear you down, and give
you a pretty unproductive week. It's hard to keep your place.

That said: I think these 45-minute tasks should often be much faster, and
folks would actually benefit from optimising their code a bit more. If I have
to do a word count over 30GB of text, I'd rather implement it carefully or use
HyperLogLog. I find it weird that people would rather use Spark, which I
always find quite painful. I guess it comes down to familiarity.
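
For instance, a rough sketch of the "implement carefully / HyperLogLog" route
for counting distinct words in a single streaming pass - the datasketch
library and the file path are just placeholder choices of mine, not anything
prescriptive:

    # Stream a large text file once: keep exact per-word counts in a
    # Counter (fine if the vocabulary fits in memory) and estimate the
    # number of distinct words with a HyperLogLog sketch.
    from collections import Counter
    from datasketch import HyperLogLog  # pip install datasketch

    hll = HyperLogLog(p=14)   # ~0.8% relative error on the distinct count
    exact = Counter()

    with open("corpus.txt", encoding="utf8", errors="ignore") as f:
        for line in f:
            for word in line.split():
                exact[word] += 1
                hll.update(word.encode("utf8"))

    print("approx distinct words:", int(hll.count()))
    print("exact distinct words: ", len(exact))

No cluster required, and it runs at roughly the speed you can read the file
off disk.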

------
francisb07
Jupyter + Spark. Tableau if I'm feeling lazy for viz.

