ML Engineering Online Book

jebarker · on Jan 23, 2024

This is gold. I spend my days debugging LLM training setups in support of research and I'd have loved these notes when I started!

mentos · on Jan 23, 2024

I'm a game developer looking to get into ML/DL. The biggest thing for me is finding a problem with real value that's not too difficult, one I can work on to learn. I believe I found one, and I'm curious if you'd offer any thoughts on it.

Right now there are two systems for capturing motion capture data for animations in games/movies: inertial and optical. Inertial is easier and more affordable but comes with more errors/inaccuracies in the capture that require manual correction. Optical is more accurate and requires less cleanup but has expensive hardware and space requirements.

My thought is to have someone wear an inertial mocap suit, record inertial and optical sessions at the same time, and then use ML to learn how to automatically fix the inertial mocap data. Then you could theoretically get the precision of optical capture from inertial recordings after passing them through the ML model.

Curious if you think something like this is tractable for a first project? If you have any suggestions for how you'd solve this, or if there are any existing projects you could point me to, I'd appreciate any help!

HanClinto · on Jan 23, 2024

Yes, I think this is very tractable for a first project. I've played around with using AI to do optical-only with pose detection models -- if I had to do it again, I would probably start with this model and try to get it running locally:

https://github.com/facebookresearch/co-tracker

This sounds like a perfect place for you to get started!

mentos · on Jan 23, 2024

Incredible, I would never have guessed that pixel tracking is possible like that! Thank you.

jebarker · on Jan 23, 2024

If you have the means to collect the data then this seems pretty tractable to me. Mostly because you don't need to deal with the raw optical data but rather just the derived trajectories. So data formats and volumes shouldn't be a distraction.

I'm assuming you've done some tutorials or courses on basics of DL and can program Python. At that point the easiest first step would be to just train an MLP to convert a single time step from inertial data to match the optical prediction (presumably they are in the same coordinate system and temporally close enough).

The crux of building something good would be in how you handle the temporal aspect I'd imagine. Clearly you want to use multiple samples over time from the inertial to get more accurate positional estimates. I'd imagine a fixed window of the past n inertial samples would be a good start. I wouldn't worry about more complicated temporal modeling, e.g. RNN or transformer, unless you can't get satisfactory results with the MLP.
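
To make the fixed-window idea concrete, here is a minimal sketch of turning two time-aligned recordings into training pairs: each input is the past n inertial frames flattened into one vector, and each target is the optical frame at the same time step. The function name `make_windows`, the toy data, and the assumption that both streams are already resampled to the same rate are all illustrative, not something from a real mocap pipeline.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # one time step of flattened sensor values


def make_windows(
    inertial: Sequence[Frame],
    optical: Sequence[Frame],
    n: int,
) -> Tuple[List[List[float]], List[List[float]]]:
    """Pair the past n inertial frames (flattened) with the optical frame at the same step."""
    if len(inertial) != len(optical):
        raise ValueError("streams must be time-aligned and equally long")
    if n < 1:
        raise ValueError("n must be at least 1")
    inputs: List[List[float]] = []
    targets: List[List[float]] = []
    # The first full window ends at index n - 1.
    for t in range(n - 1, len(inertial)):
        window = inertial[t - n + 1 : t + 1]
        inputs.append([value for frame in window for value in frame])
        targets.append(list(optical[t]))
    return inputs, targets


# Toy data (hypothetical): 5 time steps, 2 inertial values and 1 optical value per step.
inertial = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
optical = [[0.0], [1.0], [2.0], [3.0], [4.0]]
X, y = make_windows(inertial, optical, n=3)
```

With n=1 this reduces to the single-time-step setup, so the same pairs can feed the first MLP experiment and the windowed one.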

My gut says there's probably a non-ML approach to this too, some sort of Kalman Filter etc. Always best to avoid ML if a simpler solution exists :)
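
For a feel of what the non-ML route looks like, here is a minimal sketch of a scalar Kalman filter with a random-walk (constant position) model, applied to one noisy coordinate. The function name `kalman_1d`, the noise variances, and the toy measurements are assumptions for illustration; a real mocap filter would track a full state (position, velocity, orientation) per joint.

```python
from typing import List


def kalman_1d(measurements: List[float], process_var: float, meas_var: float) -> List[float]:
    """Scalar Kalman filter assuming the true value drifts slowly (random walk)."""
    if not measurements:
        return []
    estimate = measurements[0]  # initialize from the first measurement
    error_var = meas_var        # initial uncertainty equals the measurement noise
    filtered = [estimate]
    for z in measurements[1:]:
        # Predict: the state is assumed unchanged, but uncertainty grows.
        error_var += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = error_var / (error_var + meas_var)
        estimate += gain * (z - estimate)
        error_var *= 1.0 - gain
        filtered.append(estimate)
    return filtered


# Hypothetical noisy readings of a coordinate whose true value is near 1.0.
noisy = [1.2, 0.8, 1.1, 0.9, 1.3, 0.7]
smooth = kalman_1d(noisy, process_var=1e-4, meas_var=0.04)
```

A small `process_var` relative to `meas_var` tells the filter to trust its running estimate more than any single reading, which is what smooths the jitter.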

mentos · on Jan 23, 2024

Awesome, thank you! Really appreciate you taking the time to reply. I'm still learning Python and working my way through this course https://fleuret.org/dlc/ but without a problem I care about it's hard. Your encouragement and tips are much appreciated! Knowing that this idea isn't too ambitious will give me the motivation to keep pushing. Thank you.

grepLeigh · on Jan 23, 2024

I started learning ML at a video game studio too. My 2c is to start on a tractable problem that immediately pays off, so you have the grace to learn something more complicated later. I started with a recommendation system for our in-game store and email promotions (custom coupons based on previous purchases).

cyrux004 · on Jan 23, 2024

As somebody who works alongside Applied Scientists, helping them with tasks related to model training and deployment: how does one get exposure to lower-level engineering work like optimization, performance, etc.? We have an ML infra team, but their goal is building tools around the platform, not necessarily getting workloads to run optimally.

dayeye2006 · on Jan 23, 2024

I think no optimization is possible without profiling. Getting yourself familiar with the tools for understanding the performance of a model might be the first step, e.g., https://pytorch.org/tutorials/recipes/recipes/profiler_recip...

tanelpoder · on Jan 23, 2024

Yes - understand first, then fix. And you’ll understand by measuring/profiling things.
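
As a small illustration of the measure-first habit, here is a sketch using Python's built-in `cProfile` and `pstats` modules. This is a stand-in, not the PyTorch profiler mentioned above, and the workload functions (`slow_sum_of_squares`, `workload`) are made up for the example.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n: int) -> int:
    """Deliberately naive loop so it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def workload() -> int:
    """Hypothetical workload: call the slow function repeatedly."""
    return sum(slow_sum_of_squares(20_000) for _ in range(20))


# Profile only the region of interest.
profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Collect the report into a string, sorted by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The report lists call counts and time per function, which is usually enough to tell whether the hot spot is where you guessed it was before changing any code.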

I’d also recommend the detailed pytorch optimization case studies by Paul Bridger:

https://paulbridger.com/

grepLeigh · on Jan 23, 2024

Brendan Gregg's work on system performance and profiling is a good place to start. A lot of ML perf boils down to Linux perf or what the heck is happening in an HPC scheduling system like SLURM. https://www.brendangregg.com/linuxperf.html

HanClinto · on Jan 23, 2024

I really appreciate everything in the "Unsolicited Advice" in the AI Battlefield section [1]. It's a very realistic take on the frenetic pace of everything and the emotional tax that comes with feeling like one is always drowning in the relentlessly rapid advance of AI development.

Thanks!

[1] https://github.com/stas00/ml-engineering/blob/master/insight...

legerdemain · on Jan 23, 2024

How widespread is Slurm?

p4ul · on Jan 23, 2024

Slurm is absolutely ubiquitous in the high-performance computing (HPC) community. I believe its only similar competitors in the HPC space are the SGE [1] and Torque/PBS [2] resource schedulers.

I'm not sure of the exact numbers, but I would guess that an overwhelming majority of the Top 500 Supercomputers [3] are running Slurm. And as others have noted, research computing centers in academia all mostly run Slurm. And Slurm also dominates in the DoE national labs in the US.

Oh, and as a [potentially apocryphal] fun fact, the name "Simple Linux Utility for Resource Management (SLURM)" is a backronym from the soda in Futurama! [4]

[1] https://en.wikipedia.org/wiki/Oracle_Grid_Engine

[2] https://github.com/adaptivecomputing/torque

[3] https://www.top500.org/

[4] https://futurama.fandom.com/wiki/Slurm

jhfdbkofdchk · on Jan 23, 2024

According to Wikipedia, "Slurm is the workload manager on about 60% of the TOP500 supercomputers." I have used it as a job manager front end for most computational clusters in the last 10 years or so.

nikhilsimha · on Jan 23, 2024

Llama 2 models were trained on slurm

claytonjy · on Jan 23, 2024

related, has anyone had success moving from Slurm to Kubernetes for a physical (non-cloud) cluster primarily used for training large models on lots of GPUs?

vulcan01 · on Jan 23, 2024

It's used in most high-performance computing clusters (except for the folks that are still on Torque, I guess).

legerdemain · on Jan 23, 2024

I see, so it's limited to HPC contexts? I'm just surprised that as a data engineer, I've never seen it in real life.

a_bonobo · on Jan 23, 2024

Definitely! I was in academia for ten years and SLURM is everywhere. It's free! Now outside academia, SLURM is nowhere. AWS and Snowflake are king.

0cf8612b2e1e · on Jan 23, 2024

Both of my last two companies used Slurm. Probably just comes down to if the company maintains its own internal compute cluster.

jebarker · on Jan 23, 2024

> Now outside academia, SLURM is nowhere

Do you mean outside of academia _and_ HPC? Industry HPC clusters using slurm are quite common.

sbekman · on Jan 23, 2024

Many compute clouds provide SLURM env - definitely AWS and GCP do.

Scene_Cast2 · on Jan 23, 2024

I randomly clicked on repeatability and am still curious about how it's achieved with distributed training. Wouldn't deterministic synchronization make things slow? But I have heard that at least in a couple of big companies, their training is repeatable.

eru · on Jan 23, 2024

You would want to make training updates commutative as much as possible. That way it doesn't matter which order you apply the updates in.
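
One reason order matters in the first place: floating-point addition is commutative but not associative, so summing the same updates in a different grouping can give a slightly different result. The sketch below shows this with three numbers and then uses `math.fsum`, which returns a correctly rounded sum and is therefore independent of input order. The function name `order_independent_sum` is hypothetical, and this is only an illustration of the idea, not how any particular training framework reduces gradients.

```python
import math
from typing import Iterable

# Same three values, different grouping, different float result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
order_matters = left_to_right != right_to_left


def order_independent_sum(updates: Iterable[float]) -> float:
    """Sum updates so the result does not depend on arrival order.

    math.fsum tracks exact partial sums and rounds once at the end.
    """
    return math.fsum(updates)


# Hypothetical updates arriving from workers in two different orders.
arrival_a = [0.1, 0.2, 0.3]
arrival_b = [0.3, 0.1, 0.2]
same_result = order_independent_sum(arrival_a) == order_independent_sum(arrival_b)
```

A cheaper alternative with the same effect is to always reduce in a fixed order (for example by worker rank), trading some waiting for determinism.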

hahnchen · on Jan 23, 2024

How do you get experience in this stuff without having a job?

eru · on Jan 23, 2024

By reading books like the one submitted, and doing your own small projects?

It's not that different from learning how to program without already having a programming job.

(That isn't to say either of these two is easy. They both require a lot of dedication.)

rmbyrro · on Jan 23, 2024

If your goal is to get a job, set realistic expectations.

There's a tiny job market for that - in comparison to, say, web dev - and these projects require professionals with very deep knowledge. This is not the kind of work where chatgpt or stackoverflow will help you a lot.

dayeye2006 · on Jan 23, 2024

Side projects, or working on other people's side projects. The most important thing is to connect with the community and learn the technical language to speak with them. This is a relatively small community and you need a bunch of different skills to get started: some ML, coding for sure, some knowledge of how modern accelerators work, and some ability to read and understand papers in this direction.

HanClinto · on Jan 23, 2024

In my experience, best thing is to do side projects. Don't just learn a technology -- pick a doable project that leverages a new technology that you want to learn, and tackle it. Picking something "doable" is often the tricky part, so don't be afraid to re-evaluate after a few weeks and adjust your expectations as needed.

Important thing is to keep moving.

mistercheph · on Jan 23, 2024

Do the fast.ai course. With a bit of elbow grease and creativity, you'll be able to finetune models to produce SOTA results in <2 weeks.

the_g0d_f4ther · on Jan 24, 2024

I really want to start experimenting with this, but I don't really have a solid GPU. How do you guys actually run these?

mayilian · on Jan 29, 2024

What Twitter accounts to follow to stay updated?

amelius · on Jan 23, 2024

Is there a pdf somewhere? I see there are instructions for building it, but not the actual file.

sbekman · on Jan 25, 2024

OK, the pdf is ready now: https://github.com/stas00/ml-engineering#pdf-version

sbekman · on Jan 23, 2024

It will be ready in a few weeks. The building workflow is ready, but I need to finish the stylesheets and restructuring the chapters.