
End-to-end implementation of a machine learning pipeline (2017) - spandan-madan
https://spandan-madan.github.io/DeepLearningProject/docs/Deep_Learning_Project-Pytorch.html
======
roystonvassey
Great explanation and I love the fact that the entire presentation _is_ a
Jupyter Notebook!

A non-academic observation - the 'real-world' challenge of ML pipelines is
what I call the 'last-mile' problem of ML - operationalizing your model. You
begin to run into problems of:

1\. How often do you 'score' live data? How will this affect latency, data
ingestion etc?

2\. How often do you have to update your weights, if you want your model's
performance to be consistent?

3\. Integration with source systems

4\. If you build your final scoring model on library-dependent languages like
Python, how do you ensure no breakages? (Docker solves this to a large extent
though)
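One possible guard for point 4, sketched under assumptions: record the library versions when the model is trained, then compare them before scoring. The function names (`snapshot_environment`, `find_mismatches`) and the package list are hypothetical, chosen only for illustration. This complements a Docker image and does not replace it.

```python
import json
import sys
from importlib import metadata


def snapshot_environment(packages):
    """Record the Python version and the installed version of each package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return {"python": python_version, "packages": versions}


def find_mismatches(saved, current):
    """Return {name: (saved_version, current_version)} for every difference."""
    mismatches = {}
    if saved["python"] != current["python"]:
        mismatches["python"] = (saved["python"], current["python"])
    for name, version in saved["packages"].items():
        now = current["packages"].get(name)
        if now != version:
            mismatches[name] = (version, now)
    return mismatches


# Training time: store the manifest next to the model artifact.
# The package names are placeholders; list whatever the model depends on.
manifest_json = json.dumps(snapshot_environment(["pip", "setuptools"]))

# Scoring time: reload the manifest and compare before loading the model.
saved = json.loads(manifest_json)
problems = find_mismatches(saved, snapshot_environment(list(saved["packages"])))
if problems:
    raise RuntimeError(f"Environment drift detected: {problems}")
```

Failing loudly at load time is a design choice here; a service could instead log the mismatch and continue.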

~~~
DrNuke
That’s it, really. Any good reference to keep up to date with the last-mile
best practices for the average ML practitioner? Thanks!

~~~
rasmi
I link these resources often, but they are often relevant! See "The ML Test
Score: A Rubric for ML Production Readiness and Technical Debt Reduction" [1]
and the Rules of Machine Learning [2]. Another classic: "Machine Learning: The
High Interest Credit Card of Technical Debt" [3], and recently added:
Responsible AI Practices [4].

[1]
[https://ai.google/research/pubs/pub46555](https://ai.google/research/pubs/pub46555)

[2] [https://developers.google.com/machine-learning/rules-of-ml/](https://developers.google.com/machine-learning/rules-of-ml/)

[3]
[https://ai.google/research/pubs/pub43146](https://ai.google/research/pubs/pub43146)

[4] [https://ai.google/education/responsible-ai-practices](https://ai.google/education/responsible-ai-practices)

------
spandan-madan
For feature requests on this, please create an issue on the GitHub repo!

For future tutorial suggestions, mail me at smadan@mit.edu. A new one on NLP
is coming soon!

~~~
d_burfoot
Is your code intentionally verbose (for the sake of being explicit)? It seems
like it could be condensed a lot by using Pythonic structures. For example you
could replace block 39 with a one-liner:

Genre_ID_to_name=dict([(g['id'], g['name']) for g in list_of_genres])

In other places, you would benefit a lot from the enumerate(..) function,
which returns (index, item) tuples when called on a list.
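
A minimal sketch of both suggestions, for comparison with the explicit style. The sample `list_of_genres` below is made up for illustration; only the `id` and `name` keys come from the one-liner above.

```python
# Hypothetical sample data standing in for the notebook's real genre list.
list_of_genres = [
    {"id": 28, "name": "Action"},
    {"id": 12, "name": "Adventure"},
    {"id": 16, "name": "Animation"},
]

# Explicit loop, in the spirit of the tutorial's verbose style.
genre_id_to_name_loop = {}
for g in list_of_genres:
    genre_id_to_name_loop[g["id"]] = g["name"]

# Equivalent dict comprehension (same result as the dict([...]) one-liner).
Genre_ID_to_name = {g["id"]: g["name"] for g in list_of_genres}

# enumerate yields (index, item) pairs, so no manual counter is needed.
indexed_names = [(i, g["name"]) for i, g in enumerate(list_of_genres)]
```

Both forms build the same dictionary, so the choice between them is about readability rather than behavior.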

~~~
spandan-madan
Precisely. I strongly believe that the purpose of tutorials is to be inclusive
of all people. That's something I realized as a TA: making things explicit
never hurts. There's always someone who can gain from more detail :)

~~~
roenxi
I'm going to throw out a plug for that mindset going beyond tutorials.

It is a suspect proposition that anything is gained by turning 3 lines of code
into one line of code, unless it is JavaScript for the Google homepage or
somesuch where the bytes matter. Moving code from a bad data model to a good
one usually correlates with a big reduction in line count, but the gain is in
choosing more appropriate data structures and not in the number of lines
removed.

Every reader of code, including the author after 3 months, is going to have to
read and understand the code from scratch. One line doing a multidimensional
transform of the data is going to scan for a small fraction of people. That
one liner would take about 3 times as long to understand as any one line of
the tutorial code. The data model hasn't changed either. If anything, I'd
argue that the nature of the transform being done is clearer in 3 lines.

~~~
golergka
Readability of the code is always relative to the reader. In any language that
I've mastered, I've always preferred more verbose options at first, but with
experience I came to find condensed versions more elegant and
time-saving.

In the end, choose a code style that matches the level of your audience. If you
write a one-off Python script that automates some build process in a mostly
non-Python codebase, it should probably be very verbose and easy to understand.
If, on the other hand, you're writing code in a decently advanced codebase and
most of your colleagues are fluent in the language (or at least, supposed to
be), it makes sense to use as much condensed syntactic sugar as possible.

------
jamesblonde
This is a very good way to get started building ML pipelines. When you do it
at scale, you often need to use a broader range of tools. Here's how we do it
in Hopsworks with Python the whole way (using Airflow to orchestrate the
different steps):

[https://hops.readthedocs.io/en/latest/hopsml/hopsML.html](https://hops.readthedocs.io/en/latest/hopsml/hopsML.html)

~~~
Octokat
Your pipeline design looks pretty sweet!

------
junke
Note: at the beginning, you speak about learning a function "g" that
approximates a function "f", then later, you swap them and learn a function
"f" that approximates "g". That could be confusing.

------
ratsimihah
This is great, most tutorials out there assume the dataset already exists.
Nice move covering the entire pipeline!

Also, looks like that notebook needs an update.

[https://github.com/ContinuumIO/anaconda-issues/issues/6678](https://github.com/ContinuumIO/anaconda-issues/issues/6678)

------
heinrichhartman
Did they have to put a big "Harvard University" banner at the top of the
GitHub repo:
[https://github.com/Spandan-Madan/DeepLearningProject](https://github.com/Spandan-Madan/DeepLearningProject)?
This is a private repository, right? Is the code owned by Harvard?

For me personally, I find this off-putting. Let your content speak for itself.
Advertising the affiliation like this adds no credibility.

Content looks good, though! :)

~~~
Reebz
Yes, this confused me too. Especially considering Spandan Madan is an MIT
researcher, I don't understand the overt branding of a different school.

------
dang
An earlier discussion:
[https://news.ycombinator.com/item?id=14781888](https://news.ycombinator.com/item?id=14781888)

~~~
spandan-madan
Yup! That version was in Keras. It's now been re-written in PyTorch as well!
Thanks to [https://github.com/AnshulBasia](https://github.com/AnshulBasia).

~~~
avinassh
Any update on the NLP tutorial? I keep checking this repo [0] but it seems it
hasn't been updated lately. I hope you didn't abandon this project.

[0] - [https://github.com/Spandan-Madan/NLP-Intuition-and-
Applicati...](https://github.com/Spandan-Madan/NLP-Intuition-and-Applications-
of-word-embeddings)

~~~
chupasaurus
[https://news.ycombinator.com/item?id=18298670](https://news.ycombinator.com/item?id=18298670)

------
zdk
As a self-taught beginner, I find this style of tutorial practical and
persuasive for practitioners. Thank you very much for investing the time to do
this.

------
laurentl
The link seems dead. Is there another URL?

~~~
sooheon
[https://spandan-
madan.github.io/DeepLearningProject/PyTorch_...](https://spandan-
madan.github.io/DeepLearningProject/PyTorch_version/Deep_Learning_Project-
Pytorch.html)

------
antpls
It's interesting; however, the webpage seems broken on Firefox for Android
(latest, Android 8.1):

\- some values in the command outputs don't match the author's comments (or
maybe I misunderstood some?)

\- there are some big red blocks of errors in the outputs

\- the outputs of the trainings are way too verbose for mobile reading

I guess these are issues on the Jupyter side. It would be nice if
mobile were treated as a first-class viewer.

------
lolitan
thank you guys

