
Data scientists mostly just do arithmetic and that’s a good thing - clorenzo
https://m.signalvnoise.com/data-scientists-mostly-just-do-arithmetic-and-that-s-a-good-thing-c6371885f7f6#.3q7sy2r4p
======
kfk
"Knowing what matters is the real key to being an effective data scientist."

Absolutely. What is absurd, though, is that even for the simplest of metrics,
we don't have really good tools in place to:

1\. Do matrix calculations (think R dataframes)

2\. Build quick dynamic dashboards with strong data segregation by user (you
don't want to show all data to all employees)

3\. Clearly separate values from dimensions so hierarchies and mappings can be
managed properly

4\. Allow for manual inputs

5\. Decently interface with Excel in some way

~~~
existencebox
4\. Holy shit, 4. I actually disagree with some of your other points; if I need
DF calculation I'll do pandas in Python, and in a pinch can make some raw
SQL/terrible LINQ work. Dynamic dashboards exist, but data segregation is
indeed a problem I don't see solved outside of the aforementioned
"make views and give very fine-grained permissions" runaround. On 3, I'm not
sure I'm understanding you right, because I feel like I can do this with a
combination of SQL views and PowerPivot. And for 5, see the above: PowerPivot.
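For context, the pandas route I mean is only a few lines. A toy sketch (the sales data and column names are made up for illustration):

```python
import pandas as pd

# Made-up sales data standing in for whatever you'd pull from SQL.
sales = pd.DataFrame({
    "region": ["China", "Taiwan", "China", "Hong Kong"],
    "revenue": [120.0, 45.0, 80.0, 30.0],
})

# The kind of "R dataframe" calculation in question:
# group, aggregate, and derive a new column in a couple of lines.
totals = sales.groupby("region", as_index=False)["revenue"].sum()
totals["share"] = totals["revenue"] / totals["revenue"].sum()
print(totals)
```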

But 4: someday, if it weren't now the property of some company, I would LOVE to
show off some of the hacks I've done to get manual value input in Excel
working for live re-processing of data originally sourced elsewhere. Probably
some of the least maintainable and least robust code I have _ever_ written, and
that's saying something at this point, unfortunately...

~~~
kfk
I should have mentioned that all my points should be easily doable in an Excel-
like interface, so no coding allowed. Point 3 is about how you look at things:
maybe team A defines China as just China, but team B defines it as China +
Taiwan + Hong Kong. At scale, this kind of use case is very hard to manage in a
centralized BI system.

I'd be interested in hearing more about your hack; my email is in my profile.
Manual inputs are one of the biggest pains in BI and one of the reasons people
default to Excel files by email. In general, I am very passionate about these
topics, but I am having a hard time building an audience around them.

~~~
existencebox
Re: your profile, I saw a "contact us" for what looked to be a startup, and
your Medium page, but no email (unless I'm particularly blind/you meant the
"contact us").

That being said, it's a simple enough explanation, so I'll just lay it out
here. The process I use is certainly constrained to certain requirements on the
dataset, but it's "worked for me" as a pattern a few times now.

1\. Get your data into Power BI, from a data source of your choosing. Build any
custom columns you want primarily in the data model.

2\. Bring the Power BI data into Excel as a PivotTable (not directly as a
chart).

3\. This is where things get messy/constrained, and you will have needed to
plan ahead for this bit. Treat a cell in the sheet as your "input cell", and
create new cells/new rows with formulas that use the input cell as WELL as the
cells underneath the pivot table you pulled in. NOTE: Excel REALLY DOESN'T
like you doing this. If you try to select the raw cells in the pivot table, it
will get confused and not behave as you expect. If you type the cell
references in as you would with raw Excel (B3:B24, etc.) and they "just
happen" to fall under a pivot table, however, it works fine.

This step is largely where the unmaintainability/fragility comes into play,
since, as you can probably tell, the whole workaround relies on the _exact_
structure of the data as it relates to cell locations. You may need to enforce
this constraint yourself; e.g. if I had a time-series pivot table that I append
to, I'd need to order descending and keep the most recent 100 rows, to ensure
my static Excel formulas were being populated as expected. I feel like my
explanation of this is a bit lacking, so please don't hesitate to ask
questions if I'm unclear; it's a rather mundane solution in and of itself.

4\. Once you're confident your data isn't going to change "structure" during a
refresh, you can build up a typical formula chain -> chart as you would with
pure Excel, except that the formulas use your input row for custom on-the-fly
user inputs and are sourced from refreshable external data.

------
apohn
I work for a company that does a lot of Data Science Consulting. And I'm still
surprised by how many companies I see that try to use complex methods to solve
problems that can be solved by high school statistics and a little lateral
thinking. There's just so much hype around "Data Science" right now.

I see this a lot when I walk into companies where I'm introduced to employees
with "Joe has a Master's in Applied Mathematics. Nitya has a PhD in Computer
Science with a focus on AI." Some managers love the idea that they are going
to spearhead some initiative that will solve some problem in the most complex
way possible, thus getting some visibility from people in the C-Suite.

A few years ago we worked with a customer who was trying to reduce issues with
their industrial processes. Most of the battle was just processing the raw
sensor data quickly enough. Using Stats 101 methods to predict issues, they
ended up saving a lot(!!) of money every year and were extremely happy. They
were so happy they agreed without hesitation to present the work at some major
conferences. During one conference presentation, an audience member basically
said "You're not using Machine Learning, so you're not doing Predictive
Analytics or Data Science" and acted like we were a bunch of fakers and
idiots.

The hype is strong with Data Science.

Absolutely agree with the following from the post.

"Talk to customers. Watch what products sell and which ones don’t. Think about
the economics that drive the business and how you can help it succeed more.

Knowing what matters is the real key to being an effective data scientist."

All that being said, I do see plenty of the opposite as well. Lots of people
only know about sums and means, so that's all they want to use. There are
plenty of cases where advanced methods make more sense. Knowing the difference
is why you hire a good Data Scientist.

~~~
slv77
Asking the stupid question...

Was the stats 101 basic control charts and statistical sampling?

Do you have a link to the paper?

~~~
apohn
Stats 101 is basically what you would learn in the first applied statistics
course taken by an undergraduate (or in advanced secondary schooling): mean,
median, standard deviations, z-scores, t-tests, etc. For control charting,
just look at the "Control chart" page on Wikipedia, which covers the different
methods used to figure out when data points are outside of limits.

I do have a published paper, but I try to avoid linking my internet accounts
to my real life. So I'm reluctant to link to it as that would give away my
identity.

------
slv77
I work in fraud prevention, and after 10 years of working with complex data I
am just starting to use machine learning (with a lot of software that makes it
simple). I'd be hesitant to call myself a Data Scientist, but I have borrowed
heavily, so to speak. Thanks guys!

As my abilities to work with data have grown, I've been amazed at the power of
some of the simpler statistical tools, and that they aren't more widely used
elsewhere. My top X list:

\- Statistical sampling to get a quick estimate. Most of the time a quick
answer that is accurate to within 10% is worth more than an 'exact' answer
that takes a month.

\- Statistical process control techniques to monitor operational processes.
Stop focusing on the noise and let the system tell you when something is
broken.

\- How to incorporate uncertainty into your models using Monte Carlo analysis
to improve decision making. "Our model projects that savings may be between
negative $10M and $20M" conveys more information than "our model projects $10M
in savings."

\- How to incorporate trend and seasonality into forecasts using Holt-Winters.

\- Use of split testing to know when you have a significant result.
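The Monte Carlo bullet in particular is less work than it sounds. A toy sketch with nothing but the stdlib (the distributions and all dollar figures are invented for illustration):

```python
import random

random.seed(42)

def simulate_savings():
    # Hypothetical model: uncertain benefit minus uncertain cost, in $M.
    benefit = random.gauss(25.0, 8.0)  # assumed normal
    cost = random.gauss(15.0, 5.0)     # assumed normal
    return benefit - cost

runs = sorted(simulate_savings() for _ in range(100_000))

# Report a central estimate plus a 90% interval instead of a single
# point forecast -- the interval is what carries the information.
median = runs[len(runs) // 2]
p5 = runs[int(0.05 * len(runs))]
p95 = runs[int(0.95 * len(runs))]
print(f"median ${median:.1f}M, 90% interval ${p5:.1f}M to ${p95:.1f}M")
```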

------
Eridrus
> My advice is simple: no. What you should probably do is make sure you
> understand how to do basic math, know how to write a basic SQL query, and
> understand how a business works and what it needs to succeed.

There is an assumption here that people want to work with data and would be
happy doing simple SQL queries with sums.

It's good that the author is happy with their job, but I don't know how
"having highly trained people doing menial tasks" is a good thing.

I think that most people who want to get into "data science" don't just want
to write SQL queries and do simple arithmetic, but are interested because of
all the recent advances in computing that have made machine learning more
tractable. If the best they can hope for is to be doing SQL and simple
arithmetic, then maybe there just aren't enough interesting jobs and the
advice should be to avoid the profession entirely.

------
mamcx
Related:

More important than math? Having the data in the correct order & shape.

Cleaning & fixing & having well-designed tables/etc. is by far the most
significant contribution to easing everything else.

~~~
themcgruff
(Full disclosure I work with the author, Noah, at Basecamp.)

Not sure if it's more important than math, but you're right that it's
extremely important.

One of the biggest ongoing problems we have isn't getting data in -- it's
helping everyone get data out. Even with training sessions, documentation, and
some fairly fleshed out "self help" tools, there is still confusion about
where to look and how to "find" and "combine" the right data to answer a given
question.

One of the ways we've partially solved this problem is through Tableau which
is a commercial solution. (I was skeptical about whether we would stick with
solutions like Tableau but it has been worth every penny.)

~~~
mamcx
Niklaus Wirth coined the idea:

Programs = Algorithms + Data Structures

It's easy to think that the "Algorithms" side (where it's more obvious how to
apply math) is dominant. But actually, it's the data side.

Math is worthless if the data doesn't fit it. You can't do binary search until
the data is sorted/unique. That's why I think math is secondary.

Of course, data structures have math in them, and still the data side
dominates. Your O(1) get only works when the data structure is made for it...

Probably some counterpoint exists (anyway, I'm more of a database guy with
minimal math skills), but I think the idea works most of the time.
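To make the binary search point concrete, a minimal Python sketch (the data values are arbitrary): the algorithm only gives correct answers once the data has been put in shape.

```python
import bisect

data = [41, 7, 23, 88, 15, 7, 62]

# Binary search assumes sorted, deduplicated data; prepare it first.
prepared = sorted(set(data))

def contains(sorted_values, x):
    """O(log n) membership test, only valid on sorted input."""
    i = bisect.bisect_left(sorted_values, x)
    return i < len(sorted_values) and sorted_values[i] == x

print(contains(prepared, 23))  # True, once the data fits the algorithm
```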

