Hacker News
End-to-end implementation of a machine learning pipeline (2017) (spandan-madan.github.io)
504 points by spandan-madan 6 months ago | 42 comments

Great explanation and I love the fact that the entire presentation is a Jupyter Notebook!

A non-academic observation - the 'real-world' challenge of ML pipelines is what I call the 'last-mile' problem of ML - operationalizing your model. You begin to run into problems of:

1. How often do you 'score' live data? How will this affect latency, data ingestion etc?

2. How often do you have to update your weights, if you want your model's performance to be consistent?

3. Integration with source systems

4. If you build your final scoring model on library-dependent languages like Python, how do you ensure no breakages? (Docker solves this to a large extent though)
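
On point 4, one lightweight guard is to compare the installed package versions against your pins when the scoring service starts. This is only a minimal sketch and does not replace Docker or a lock file. `PINNED_VERSIONS` and `find_version_mismatches` are hypothetical names, and the pinned versions are placeholders. The only library call assumed is `importlib.metadata.version` from the Python 3.8+ standard library.

```python
from importlib import metadata

# Hypothetical pins; in practice these would be read from a lock file.
PINNED_VERSIONS = {"numpy": "1.24.4", "scikit-learn": "1.3.2"}


def find_version_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, installed)} for every pin that is not met.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for package, expected in pinned.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    # Report every pin the current environment does not satisfy.
    for package, (expected, installed) in find_version_mismatches(PINNED_VERSIONS).items():
        print(f"{package}: expected {expected}, found {installed}")
```

A scoring service could refuse to start when the returned dict is non-empty, which turns a silent library drift into an explicit failure.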

Seconding this. I have run a data science and machine learning team for the last couple of years. By far the most challenging part of our work has been convincing our data management team that we aren't just another front end widget factory and our development/operations staff that we aren't choosing "non-standard" tech to deliver model results into production. The model maintenance is difficult, too, due to poor data management practices but it's less challenging than the other items for my team.

What have you found to work best when coordinating with your data management and development/operations staff?

Every organization and team is different. Often I've found two approaches work best: going around the road blocks and managing everything end to end, then getting buy-in for data and ops to own it properly after the fact (playing up the political angle of owning more stuff after we do the heavy lifting), and the brute force method of just meeting after meeting to educate people about the differences in use cases and deployment for ML products.

Same question here. I'm keen to understand this, especially around responsibilities, OKRs and KRAs.

That’s it, really. Any good reference to keep up to date with the last-mile best practices for the average ML practitioner? Thanks!

I link these resources often, but they are often relevant! See "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction" [1] and the Rules of Machine Learning [2]. Another classic: "Machine Learning: The High Interest Credit Card of Technical Debt" [3], and recently added: Responsible AI Practices [4].

[1] https://ai.google/research/pubs/pub46555

[2] https://developers.google.com/machine-learning/rules-of-ml/

[3] https://ai.google/research/pubs/pub43146

[4] https://ai.google/education/responsible-ai-practices

It's a theme I like to read about (I mean "practical issues around ML in production settings"). I find company blogs and some research publications are great resources. Examples:

- https://eng.uber.com/

- https://code.fb.com/

and many more. Google also publishes papers on various engineering practices obviously, some ML-related, but I can't find a blog where they focus on that specifically.

Also it's not "to keep up to date", but there's a great paper (from Google) that's often cited:

Machine Learning: The High Interest Credit Card of Technical Debt https://ai.google/research/pubs/pub43146

It talks about issues you face over the long run (I've experienced some of those). It also provides interesting pointers for further reading, e.g. about "pipeline jungles".

If others have pointers, I'm curious to hear about them as well.

Could that be because of using Jupyter notebook itself? I like Jupyter for data and machine learning 'journalism', but I don't see it as a proper medium to address the 'last-mile'. The insights derived from Jupyter, in my opinion, are not actionable and well integrated enough. It is becoming a de-facto medium reminding me of shared Excel files.

Could be. Using Jupyter for ML development or even prototyping (as opposed to presentations / demonstration / teaching like the OP — that's where Jupyter really shines) is a red flag.

I see a similar pattern with Pandas: some people use Pandas not because it's the right tool for the job (Pandas has many strengths), but because they're scared of writing comprehension loops and basic data structures. To avoid the CS-y stuff. But without the CS-y stuff, the result ends up a mess of lambdas, weird reindexing and buggy copy/view semantics.

And then "the next guy", the one whose job it is to clean up and productionalize the maverick's output, ends up having to reinvent and fix the entire solution. Basically doing both jobs.

How do you suggest prototyping without Jupyter? (in case prototyping means researching an approach)

Yes, Jupyter is for initial exploration. Then you write solid normal production code. Then you might write further notebooks that import that production code and run/visualize metrics and reporting for your client (probably non-technical people).

I had a "data scientist" submit notebooks to us as if we could ship any of that in production. (We fired him.) It's for hacking and blogging, not for production work.

For feature requests on this, please create an issue on the github Repo!

For future tutorial suggestions, mail me at smadan@mit.edu. A new one on NLP is coming soon!

Is your code intentionally verbose (for the sake of being explicit)? It seems like it could be condensed a lot by using Pythonic structures. For example you could replace block 39 with a one-liner:

Genre_ID_to_name=dict([(g['id'], g['name']) for g in list_of_genres])

In other places, you would benefit a lot from the enumerate(..) function, which returns (index, item) tuples when called on a list.
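
For anyone comparing the two styles, here is a minimal sketch of the explicit loop next to the one-liner. `list_of_genres` is made-up sample data. The assumption is that each entry is a dict with 'id' and 'name' keys, which is what the one-liner implies. `verbose_mapping` is a hypothetical name for the loop-built result.

```python
# Hypothetical sample data; assumed shape: dicts with 'id' and 'name' keys.
list_of_genres = [
    {"id": 28, "name": "Action"},
    {"id": 12, "name": "Adventure"},
    {"id": 35, "name": "Comedy"},
]

# Explicit, tutorial-style version: build the mapping one entry at a time.
verbose_mapping = {}
for genre in list_of_genres:
    verbose_mapping[genre["id"]] = genre["name"]

# Condensed version: the one-liner suggested above.
Genre_ID_to_name = dict([(g["id"], g["name"]) for g in list_of_genres])

# Both approaches produce the same dictionary.
print(verbose_mapping == Genre_ID_to_name)
```

Both versions do the same work, so the choice is purely about which one the intended reader finds easier to follow.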

Precisely. I strongly believe that the purpose of tutorials is to be inclusive of all people. That's something I realized as a TA, making things explicit never hurts. There's always someone who can gain from more detail :)

I'm going to throw out a plug for that mindset going beyond tutorials.

It is a suspect proposition that anything is gained by turning 3 lines of code into one line of code. Unless it is javascript for the Google homepage or somesuch where the bytes matter. Moving code from a bad data model to a good one usually correlates with a big reduction in line count, but the gain is in choosing more appropriate data structures and not in the number of lines removed.

Every reader of code, including the author after 3 months, is going to have to read and understand the code from scratch. One line doing a multidimensional transform of the data is going to scan for a small fraction of people. That one liner would take about 3 times as long to understand as any one line of the tutorial code. The data model hasn't changed either. If anything, I'd argue that the nature of the transform being done is clearer in 3 lines.

Readability of the code is always relative to the reader. In any language that I've mastered, I've always preferred more verbose options at first, but with experience I found out that I find condensed versions to be more elegant and time-saving.

In the end, choose a code style that matches the level of your audience. If you write a one-off Python script that automates some build process in a mostly non-Python codebase, it should probably be very verbose and easy to understand. If, on the other hand, you're writing code in a decently advanced codebase and most of your colleagues are fluent in the language (or at least, supposed to be), it makes sense to use as much condensed syntactic sugar as possible.

I agree, gains should go to reading the code most of the time. But when you say

> It is a suspect proposition that anything is gained by turning 3 lines of code into one line of code.

Some code is way more verbose than its description would be. A named function signals intent:

  function multidimensional_transform(the_data)

And for JavaScript ES2015 there is new syntax (like spread operators) that improves code the same way.

Also, the code for the PyTorch version has been contributed by https://github.com/AnshulBasia. But it is basically a port of my original version in Keras, which was equally verbose :)

Couldn't agree more. Coming from a world where I strive to come up with great self-explanatory naming conventions, Python code most often looks minified to me, and I have a hard time reading it...

I agree with the sentiment in general, but that example from the OP is actually pretty clear, no? Some might argue it is at least as clear as the code it is intended to replace.

A more Pythonic way to do this would be

    id_to_name = {g['id']: g['name'] for g in list_of_genres}

    for i in range(len(list_of_genres)):

is really a dangerous antipattern better replaced with

    for genre in list_of_genres:

And if you need the index:

    for idx, genre in enumerate(list_of_genres):
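
To make the three loop forms concrete, here is a small runnable sketch. `list_of_genres` is hypothetical sample data (dicts with 'id' and 'name' keys are an assumption), and the result list names are invented for illustration.

```python
# Hypothetical sample data; assumed shape: dicts with 'id' and 'name' keys.
list_of_genres = [
    {"id": 28, "name": "Action"},
    {"id": 12, "name": "Adventure"},
    {"id": 35, "name": "Comedy"},
]

# Index-based loop: works, but the index exists only to look the item up again.
names_by_index = []
for i in range(len(list_of_genres)):
    names_by_index.append(list_of_genres[i]["name"])

# Direct iteration: same result without any index bookkeeping.
names_direct = []
for genre in list_of_genres:
    names_direct.append(genre["name"])

# enumerate: use it when the position is genuinely needed alongside the item.
numbered = []
for idx, genre in enumerate(list_of_genres):
    numbered.append(f"{idx}: {genre['name']}")

print(names_by_index == names_direct)
print(numbered)
```

Direct iteration also works on any iterable, including ones that support neither `len()` nor indexing, which the index-based form cannot handle.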

Out of curiosity, how is that dangerous?

My guess is based on Python 2, where range(n) would return a fully inflated list of 0..(n-1). If len(x) is large, you could be allocating giant temporary lists just to iterate through them once. Using xrange() was the Python 2 solution (it would return a lazy xrange object instead of a list), and Python 3 fixed this so that range() itself is lazy: it returns a range object that computes values on demand rather than building a list.
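
A quick way to check the Python 3 behaviour yourself: the sketch below assumes Python 3 only. Strictly speaking, `range()` there returns a lazy sequence object, not a generator. The variable names are arbitrary.

```python
import sys
from collections.abc import Iterator, Sequence

# A range over a billion integers; no billion-element list is ever built.
big = range(10**9)

# The object stays tiny because it stores only start, stop and step.
size_in_bytes = sys.getsizeof(big)

# range is a lazy sequence: it supports len(), indexing and membership tests.
is_sequence = isinstance(big, Sequence)
length = len(big)
last_value = big[-1]
contains_last = (10**9 - 1) in big

# It is not an iterator or generator: it has no __next__ and can be looped over repeatedly.
is_iterator = isinstance(big, Iterator)

print(size_in_bytes, is_sequence, is_iterator, length, last_value, contains_last)
```

So the memory concern applies to Python 2's `range()` only. In Python 3 the argument against `range(len(...))` is about readability, not allocation.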

Just started an ML course this semester. I am not sure if I even have time to use this as an additional resource, but it looks super awesome after skimming through it. Definitely going into my favorites, and if I don't use it as an additional resource now, I will read it later. Thanks for making all this work public!

Does the ML course have videos that can be accessed online?

This is a very good way to get started building ML pipelines. When you do it at scale, you often need to use a broader range of tools. Here's how we do it in Hopsworks with Python the whole way (using Airflow to orchestrate the different steps):


Your pipeline design looks pretty sweet!

Note: at the beginning, you speak about learning a function "g" that approximates a function "f", then later, you swap them and learn a function "f" that approximates "g". That could be confusing.

This is great, most tutorials out there assume the dataset already exists. Nice move covering the entire pipeline!

Also, looks like that notebook needs an update.


Did they have to put a big "Harvard University" banner at the top of the GitHub repo: https://github.com/Spandan-Madan/DeepLearningProject ? This is a private repository, right? Is the code owned by Harvard?

For me personally, I find this off-putting. Let your content speak for itself. No added credibility when the affiliation is advertised like this.

Content looks good, though! :)

Yes, this confused me too. Especially considering Spandan Madan is an MIT researcher, I don't understand the overt branding of a different school

Yup! That version was in Keras. It's now been re-written in PyTorch as well! Thanks to https://github.com/AnshulBasia.

Any update on the NLP tutorial? I keep checking this repo [0] but seems it hasn't been updated lately. I hope you didn't abandon this project

[0] - https://github.com/Spandan-Madan/NLP-Intuition-and-Applicati...

As a self-taught beginner, I find this style of tutorial practical and persuasive for practitioners. Thank you very much for investing time to do this.

The link seems dead. Is there another URL?

It's interesting, however the webpage seems broken on Firefox Android (latest, Android 8.1):

- some values in the command outputs don't match the author's comments (or maybe I misunderstood some?)

- there are some big red blocks of errors in the outputs

- the outputs of the trainings are way too verbose for mobile reading

I guess they are issues on Jupyter's framework side. It would be nice if mobile were treated as a first-class viewer.

thank you guys
