
Ask HN: How to deliver machine learning results under startup pressure? - kuro-kuris
Hi HN,
I just started working for a very early-stage startup that mainly wants to do intent extraction on email datasets. I thought I would be working on the natural language processing. I have been here for two weeks and I am struggling with extreme pressure between sprints. We have no data, no feature engineering, no users, nothing. What can I do? I am trying to build up a data processing pipeline, but it is difficult under the pressure. How can I keep delivering and build a better machine learning environment in the company?

Thanks for your advice HN!
======
apohn
This may seem like really basic and obvious advice, but I’ve always found it
helpful to think about solving a much smaller problem, solving that, and then
going to management with that as a guidepost for building out the larger
project. Putting “Build Data+ML Pipeline from scratch” + “In 2 weeks” is a
great way to lose all hope and feel like you are drowning.

For example, PaulHoule mentioned Enron emails. Could you build a pipeline to
ingest those emails and provide sentiment or some basic text analysis? That
would give you a deliverable (so management/the team sees you are working
towards something) you can quantify into sprints. Make sure you are building
something that could be developed into a production-ready pipeline, and not
just a toy script you can put out in a day or two.
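A minimal version of that Enron-ingest idea could be sketched with just the Python standard library. Everything here is a made-up placeholder for illustration (the word lists, the sample message, the field names), not a real sentiment method:

```python
from email import message_from_string

# Toy word lists for illustration only; a real pipeline would swap in a
# proper sentiment model behind the same function signature.
POSITIVE = {"great", "good", "thanks", "excellent", "agree"}
NEGATIVE = {"bad", "problem", "late", "sorry", "fail"}

def parse_email(raw: str) -> dict:
    """Parse a raw RFC 822 message into the fields the pipeline needs."""
    msg = message_from_string(raw)
    body = msg.get_payload() if not msg.is_multipart() else ""
    return {"subject": msg.get("Subject", ""),
            "sender": msg.get("From", ""),
            "body": body}

def sentiment_score(text: str) -> int:
    """Crude lexicon count: positive words minus negative words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

raw = "From: a@enron.com\nSubject: Q3 numbers\n\nThanks, the numbers look great."
record = parse_email(raw)
record["sentiment"] = sentiment_score(record["body"])  # 2 for this sample
```

The point is less the scoring than the shape: each stage is a small function you can replace later without touching the rest of the pipeline.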

After that is built, I’d use that as a guidepost for what you need to build.
Show them the project and then use that to define the different stages you
need to build for the production application. Include risks, blocking issues,
and what you need from other team members. For example, you can point out that
after a certain point in developing the pipeline you simply cannot move
forward without data or emails.

Management/the team is going to come back to you with one of the following:
1) that is a good plan and we like/dislike the timeline; 2) that is a bad
plan, change it; 3) leave us alone, we hired you so we don't have to deal
with this stuff.

If they do 3) that’s bad. If they do 2), you should evaluate who is being
unrealistic – you or them. At my last job we lost great data scientists
because lots of other people thought Stats/ML was magic that could be done
with a simple R/Python script. The end result was severely rushed work done to
unrealistic standards, resulting in massive technical debt – some projects
were just a house of cards.

------
adamwi
Many great answers already, but I'll add my two cents as well. All the tips
regarding planning poker, breaking work down into manageable chunks,
identifying the critical path to finish the sprint, and so on are highly
relevant, but I also think there is a broader question about how work is
planned and the team is managed.

The situation you describe with an ambitious product roadmap and very limited
time available (and other resources lacking) is very common in early stage
start-ups with very visionary founders that have limited managerial experience
(I have been there myself).

In my book, the team leader/manager role is to make sure the team is
performing at its best and can be productive; the personal productivity of
the team leader is irrelevant. A common mistake is to underestimate the
effect of investing time in creating a productive environment for the team.

If I were in your situation, I would make a prioritised plan with time
estimates (using the techniques mentioned in other posts) and then sit down
with the team leader/manager/founder to refine it and agree on deadlines.
This removes pressure because the timeline is jointly agreed; it then just
becomes important to be very clear if any deviations come up.

If the team leader/manager/founder is someone you can have an open discussion
with, just share your thoughts as in your post. In my team we aim for a
brutally transparent culture when it comes to morale and stress levels; it
allows us as a team to be much more productive in the long run (and we have
more fun). Remember that start-ups should be a marathon and not a sprint,
despite working in "sprints". I hope this helps!

------
tedmiston
You can't do much machine learning without users and data, or at least real
world datasets. I would focus on picking off the small problems you can solve
today (or this week) for now.

------
ncouture
I hope this helps...

You might find out like me that there is very little pressure when your goals
are well defined and you have a list of all the tasks needed to bring them to
completion.

This lets you focus on a set of specific tasks that are ideally ordered by
priority, or by effort (to some extent).

Add clocking your work and you get a very clear picture of how your time is
spent, and in some cases of where it would be wise to lower the amount of
time put into certain things.

Sorry, this is as vague as I could explain it.

TL;DR Organization can really make miracles.

------
PaulHoule
This is a great question! Here is what I think.

Both the ways "data scientists" typically work and the agile methods used in
many software development organizations are poorly suited to the commercial
use of machine learning and other data-rich methods.

In your question I am hearing two themes: (i) how to organize the actual work
("no data", "no features", "no users") and (ii) how to slot the work into the
sprint system.

The typical sprint system often introduces risk and uncertainty to data rich
projects. Here is an example. I was working on a project where the sprints
were typically two weeks, but one part of building the knowledge base was
running a batch job that took two days. Of course if you set the batch job up
wrong you might have to do it more than once.

When I was doing the batch job I would account for the risk and spend maybe
two days getting ready for the batch job and run the batch job at the very
beginning of the sprint, then even if things went horribly wrong with the
batch job and I had to do it two or three times I was certain the KB would be
ready on time. Practically I had a PERT chart in my head that I was using to
plan my own work.

Even though I told them what I just told you, the first time some other team
members did the batch job, they started it on the last day of the sprint,
which meant it wasn't ready and the sprint shipped with an old and
inappropriate KB.

As a retrospective item, it would be good to turn the 2-day batch job into a
2-hour batch job (it started out as a 2-century batch job!). Also, the
reliability of the batch job is every bit as important as its speed in a
situation like that. More fundamentally, I think some thinking about the
ordering of work (PERT charts) should have been built into the process.
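The "PERT chart in my head" reasoning amounts to a longest-path (forward-pass) computation over the task graph. Here is a sketch with an invented task graph, durations in days; the task names and numbers are hypothetical, not from the project described above:

```python
from functools import lru_cache

# Hypothetical sprint task graph: duration in days, plus dependencies.
durations = {"prep": 2, "batch_job": 2, "build_kb": 1, "integrate": 3}
deps = {"prep": [], "batch_job": ["prep"],
        "build_kb": ["batch_job"], "integrate": ["build_kb"]}

@lru_cache(maxsize=None)
def earliest_finish(task: str) -> int:
    """PERT forward pass: a task finishes at the latest finish of its
    dependencies plus its own duration."""
    start = max((earliest_finish(d) for d in deps[task]), default=0)
    return start + durations[task]

# The critical path length tells you the earliest day the sprint goal
# can possibly be ready, so risky long tasks must start early.
critical_path_days = max(earliest_finish(t) for t in durations)
```

With these numbers the critical path is 8 days, which makes it obvious that a 2-day batch job started on the last day of a sprint cannot ship on time.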

There are lots of cases like that, but note the risk-amplifying property of
the sprint. If some input to the sprint is a day late, everything that
depends on that input slips two weeks.

For that project we also did two hour "planning poker" meetings and that was
another problem because with two hours we didn't have enough time to make
certain decisions. If we'd had two or three people think about things for a
day we could have made consistently better decisions about certain things
which would mean doing the right work in the next sprint, similarly saving two
weeks of calendar time.

It is very easy for little failures of the type described above to cascade and
produce a recurring pattern of failure that is awful for productivity, morale,
etc.

It is very important to push back on management and address these kinds of
problems.

Now this sounds very negative for agile in data-rich projects and that's not
the only thing you should take away. In the long run, data rich projects
benefit hugely from continuous improvement that is done on a regular cadence.

You meet "data scientists" or "junior programmers" who have started a number
of projects and sent deliverables over to other people who get them ready for
production. They think they have a great batting average, but when you look
at it from a wider perspective you see that 4x the man-hours they put into
the project got spent getting their deliverables ready for production. Had
the team "begun with the end in mind", the total cost of the project could
have been cut in half or more and the risk greatly reduced.

Big and very capable co's like IBM and Nuance, as well as many smaller ones
you have not heard of, have built data-rich systems that turned out to be like
building a nuclear reactor. We are not talking something that cost $22,000
when it should have cost $21,000, but rather something that cost $20 billion
when it should have cost $5. The people involved will tell you they don't know
what they're going to do next but they do know they are never going to do that
again.

So your process, technology, everything, has to be designed to control (1)
risk and (2) cost, and you've got to communicate that to the people you work
with.

What most people don't know/accept/believe is that most teams would control
cost best if they tried to control risk first, see:

[https://www.amazon.com/Rapid-Development-Taming-Software-Sch...](https://www.amazon.com/Rapid-Development-Taming-Software-Schedules/dp/1556159005)

As for your other issues, this is what I am going to say.

Short term there are two things that really matter: (1) getting data, and (2)
developing the basic interfaces between the ML component and the rest of the
system. If you have (2) you can really contribute to the sprints; if you
don't, you cannot. Without (1), any data pipeline work, feature engineering,
etc. is going to be largely a waste of time.
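Point (2), the interface between the ML component and the rest of the system, can be as small as one typed contract that a heuristic baseline fulfills until a trained model exists. The names, intent labels, and keyword rules below are invented for illustration:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Intent:
    label: str
    confidence: float

class IntentExtractor(Protocol):
    """The contract the rest of the system codes against; any model or
    heuristic that provides extract() can be dropped in later."""
    def extract(self, email_body: str) -> Intent: ...

class KeywordBaseline:
    """Stand-in implementation: keyword rules until a trained model exists."""
    RULES = {"unsubscribe": "opt_out", "meeting": "schedule", "invoice": "billing"}

    def extract(self, email_body: str) -> Intent:
        text = email_body.lower()
        for keyword, label in self.RULES.items():
            if keyword in text:
                return Intent(label, 0.5)  # low confidence flags the heuristic
        return Intent("unknown", 0.0)

extractor: IntentExtractor = KeywordBaseline()
```

Because callers only see `IntentExtractor`, swapping the baseline for a real model later does not touch the rest of the system, which is what lets you contribute to sprints before the data exists.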

For data, start out with the Enron emails or your own emails and label enough
of them that you can start thinking about the other issues. Your early data
set will be nowhere near large enough to get useful results, and that's
another issue you'll need to bring up with management once you've reached it.
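Bootstrapping that labeled set can start tiny. This sketch (with made-up example emails and intent labels) keeps hand-labeled messages as JSON Lines, a format that is trivial to append to and diff, and checks the class balance so you can see which intents need more examples:

```python
import json
from collections import Counter

# Hypothetical seed set: a handful of hand-labeled emails.
seed = [
    {"text": "Please send the invoice for October.", "intent": "billing"},
    {"text": "Are you free for a call on Friday?", "intent": "schedule"},
    {"text": "Remove me from this list.", "intent": "opt_out"},
]

# One JSON object per line; in practice you would write this to a .jsonl file.
jsonl = "\n".join(json.dumps(row) for row in seed)

# A quick class-balance check shows which intents are under-represented.
counts = Counter(row["intent"] for row in seed)
```

Even a few hundred rows in this shape is enough to wire up the full pipeline end to end and make the "we need more data" conversation with management concrete.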

~~~
kuro-kuris
Paul, thanks for your great reply. I am going to bring up your suggestions,
and hopefully a larger focus on risks can help with our deliverables. PERT
charts seem a bit over-bureaucratic for our team, but considering our
dependencies and risks more sounds great. Improving the connection between
risks and features in the minds of the team will definitely help. I still
feel really constrained and over-pressured in the current sprint structure;
hopefully bringing the risks out will decrease my personal anxiety as well.

With regards to the workflow, we went down the personal-email road initially,
though getting data from the main backend of our service seems extremely
cumbersome. As for the interface between our ML component and the rest of the
system, we are currently using simple heuristics to fulfill the ML
requirements, so the flow of information is there in one direction but not in
the other.

------
nyddle
Yeah, I've noticed a trend: startups looking for devs with extensive ML and
high-load experience to build their first prototype. They plan to grow
rapidly, that's why.

~~~
tedmiston
It's a great optimistic plan, but in practice, most don't. It's probably
overkill for most very early stage startups.

