
As a Data Scientist, what challenges do you face? - karishmakunder
Some of the challenges I find most prevalent are finding value in data, integrating open-source software, and not enough platforms for feedback and improvement.
======
eeegnu
The most frustrating challenges I've faced boil down to just cleaning the
data. It's not too bad when everything is stable and you're just cleaning up a
database, though even that can be pretty hard depending on the scale of the
operations required. The worst is when I have a live data feed that is liable
to occasionally mess up. In one instance I was reading in stock data from an
API, and on their end they messed up and sent the same timestamp for two
different records, which caused my local data aggregation to merge them
together into a series; later, when that value was actually queried expecting
a numpy float, everything just crashed. So what I've done to face this is
write data processing code that anticipates potential noise, with mechanisms
to resolve it, or that raises errors early by asserting on my assumptions
instead of my finding the problems a week later.
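A minimal sketch of that "assert your assumptions early" idea, assuming a simple (timestamp, price) tick format; the function and data here are hypothetical illustrations, not the actual code from the story:

```python
# Hypothetical sketch: validate a batch of (timestamp, price) ticks before
# aggregating them, so a duplicated timestamp from the feed fails loudly at
# ingestion instead of crashing a query a week later.
import numbers

def validate_ticks(ticks):
    """Raise ValueError on duplicate timestamps or non-numeric prices."""
    seen = set()
    for ts, price in ticks:
        if ts in seen:
            raise ValueError(f"duplicate timestamp from feed: {ts}")
        seen.add(ts)
        if not isinstance(price, numbers.Real):
            raise ValueError(f"non-numeric price at {ts}: {price!r}")
    return ticks

good = [(1, 101.5), (2, 101.7)]
bad = [(1, 101.5), (1, 101.7)]  # same timestamp twice, as in the API mishap

validate_ticks(good)  # passes silently
try:
    validate_ticks(bad)
except ValueError as e:
    print("rejected:", e)
```

The point is just to fail at the ingestion boundary, where the bad record is still identifiable, rather than deep inside the aggregation.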

I do agree with the general lack of feedback/improvement platforms, at least
on the non-analysis side (I've seen good feedback on Kaggle forums before when
it comes to questions about problem-solving methodology). I don't really
follow the 'not finding value in data' part, though; in my experience it's
pretty much a binary question, like 'can I use this data I've found to solve
my problem, or improve my solution?', and if so it's valuable relative to that
application.

~~~
karishmakunder
That’s one of the major challenges: you need the data processing code to
reassess whatever comes in.

Is there any way you share this kind of code across teams? One of the
challenges I have seen is how to avoid re-inventing the wheel. It's all
there, somewhere, but across team members it's quite difficult to pass on
the knowledge of “hey, we already have this data processing script” for
another similar use case.

~~~
eeegnu
Private git repos, with Jupyter notebooks documenting the scripts, are the
primary means of sharing. I have inadvertently duplicated quite a few things,
though, just because they were fairly simple and I didn't ask around first.
That's more of a communication issue than anything else.

------
stevesycombacct
A significant chunk of my work involves no use of algorithms, statistics, or
math. For the most part, my days are filled with one thing: cleaning up data.

If more firms took an educated, standardized approach to their data, I would
do my job significantly faster, and the company would have access to their
data-related products in far, far less time. Firms I have worked for, overall,
refuse to do that.

Instead, such notions are treated as toxic, and I am too, by the associative
property. They think I'm making _extra_ work ("You want us to stop copying
Powerpoints into Excels and put the data into a whole new Excel that stops me
from pasting in whatever I want? But I already have it in my own Excel!"), or
they think I'm automating someone's job away to be replaced by a robot. This
makes getting support from both employees and leaders near impossible.

It could save firms millions a year in reduced labor costs alone to not have
entire teams of people sitting in basements fixing problems with data.
However, data scientists are sold on what math they can do and what algorithms
they can write, not on what processes they can improve.

It's, frankly, hell.

~~~
karishmakunder
Absolutely agree. It reminds me of an article I read with the statistic that
data scientists spend 80% of their time cleaning data rather than creating
insights.

Curious to know: did you manage in any way to standardise how data was
collected/stored?

~~~
stevesycombacct
It's more like 90%.

On existing processes, standardizing collection has been nearly impossible
unless I promise that it will be easier, that it will take less time, and
that I'll do all the setup. It always takes leadership backing, and if I
don't get it from one leader, I go over their head to the next.

On new processes, if I jump forward and volunteer, I find I'm given leeway to
do it my way, that is, a standardized way.

If this sounds ruthless, it is.

~~~
karishmakunder
It is, what it is :)

------
DataDaoDe
If you think about the scientific lifecycle: Gather Information => Form
Hypothesis => Test Hypothesis => Analyze Data => Interpret => Repeat.

Then I would say the hardest parts are the "Gather Information" and "Test
Hypothesis" phases. But it's like this in every scientific endeavour, and it
is nothing unique to data science.

One interesting point, perhaps, is that we as data scientists are aware that
our sources for gathering information and our means of testing hypotheses are
often tied to man-made software or hardware systems, as opposed to dynamical
real-world structures. This means that, theoretically and practically, only
ingenuity and will-power keep us from building better and less time-consuming
ways of automating away the tedious parts (data cleansing/prep/etc.) of those
processes.

------
cyberdrunk
In my company, the biggest issue is finding the right data sets in the
company's vast data landscape and figuring out the exact definition/meaning
of each column, etc. Then it's dealing with the data quality issues. Then
it's getting access to the data and setting up an ingestion job for it to be
copied to some common storage (e.g. Hadoop). At the very end, it's the actual
data science. I suspect a lot of the PhDs we hire start drinking before they
reach the data science stage :)

~~~
karishmakunder
Yeah, that! Do you use any tool to centralise the data that you use across
your Data Science teams? Like a central repository of sorts, so that each one
can be given access to it and from there you can start off building your
models for training etc?

~~~
cyberdrunk
Yes. We ingest all that data onto a central Hadoop cluster, where the data
science team can access all of it in a uniform way. This solves the physical
access problem.

Unfortunately, the DQ and the meaning of the data are harder to solve. They
essentially require caretaking of the datasets by the data owners (it cannot
be done by a centralized unit). My organization is currently undergoing a
transition in which it will be the responsibility of each data owner to
maintain the metadata of their dataset and also to measure its data quality,
but implementing that across the whole org is a journey that will take a
long time.
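The owner-maintained metadata idea could be sketched as a tiny catalog where each dataset entry records its owner, column meanings, and a DQ check; every name here (datasets, owners, checks) is made up for illustration:

```python
# Hypothetical sketch of owner-maintained dataset metadata: a catalog mapping
# dataset names to an owner, column descriptions, and a data-quality check
# that the owner, not a central unit, is responsible for keeping accurate.
catalog = {
    "trades_daily": {
        "owner": "markets-team",
        "columns": {"ts": "trade timestamp (UTC)", "px": "execution price"},
        # DQ rule the owner maintains: every row must have a positive price
        "dq_check": lambda rows: all(r["px"] > 0 for r in rows),
    },
}

def run_dq(name, rows):
    """Look up a dataset's DQ check in the catalog and apply it."""
    entry = catalog[name]
    return entry["dq_check"](rows)

sample = [{"ts": 1, "px": 101.5}, {"ts": 2, "px": 101.7}]
print(run_dq("trades_daily", sample))  # True for this clean sample
```

In practice this would live in a metadata service rather than a dict, but the division of labor is the same: central infrastructure hosts the catalog, owners keep the entries truthful.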

------
danielscrubs
Salespeople trying to coax us into the next AutoML or drag-and-drop ML tool.
We have an army of PhDs at my job, and you think your jack-of-all-trades,
master-of-none software is going to impress us? The whole reason they started
seems to be “AI is so hot right now and it will look good on our CV.” A
single grey-beard domain expert in our field would have impressed us more.

~~~
karishmakunder
Ah, true almost all the time! :) There are only a handful of folks who have a
rare balance of understanding what's hot and knowing how to make an
impressive product/solution.

------
nxpnsv
Managers not understanding what you are doing or why it isn't instant.

~~~
natalyarostova
It took me four years before I built up the experience and confidence to
advocate for myself and for the length things take. Now I'm trying to protect
newer data scientists:

"Hey, new_hire, can you build this model by Friday?" "Sure, let me just get
the data" Me: No you can't. The data will take a month to get. No one is
getting anything from you on Friday, because I'm not going to let our team
develop a culture where data scientists work all night on thankless failing
data jobs.

~~~
karishmakunder
Kudos to you for that!

It starts with “We need accuracy for this problem statement” and ends with
“wait, let's find and prep the data first.” Timelines, anyone?

