I'm curious about the day-to-day of a Machine Learning engineer. If you work in this field, could you share what your typical tasks and projects look like? What are you working on?
Getting my models dunked on by people who can't open MS Outlook more than 3 tries out of 5, but who have remarkable depth and insight into their chosen domain of expertise. It's rather humbling.
Collaborating with nontechnical people is oddly my favorite part of doing MLE work right now. It wasn't the case when I did basic web/db stuff. They see me as a magician. I see them as voodoo priests and priestesses. When we get something trained up and forecasting that we both like, it's super fulfilling. I think for both sides.
Most of my modeling is healthcare related. I tease insights out of a monstrous data lake of claims, Rx, doctor notes, vital signs, diagnostic imagery, etc. What is also monstrous is how accessible this information is. HIPAA my left foot.
Since you seemed to be asking about the temporal realities, it's about 3 hours of meetings a week, probably another 3 doing task grooming/preparatory stuff, fixing some ETL problem, or doing a one-off query for the business; the rest is swimming around in the data trying to find a slight edge to forecast something that surprised us for a $million or two using our historical snapshots. It's like playing Where's Waldo with math. And the Waldo scene ends up being about 50TB or so in size. :D
The Dead Internet Theory says most activity on the internet is by bots [1]. The Dead Privacy Theory says approximately all private data is not private; but rather is accessible on whim by any data scientist, SWE, analyst, or db admin with access to the database.
> The Dead Privacy Theory says approximately all private data is not private; but rather is accessible on whim by any data scientist, SWE, analyst, or db admin with access to the database.
At least more of that access is logged now. It used to be only the production databases that were properly logged, now it’s more common for every query to be logged, even in dev environments. The next step will be more monitoring of those logs.
That was my experience as well - training documentation for fresh college grads (i.e. me) directed new engineers to just... send SQL queries to production to learn. There was a process for gaining permissions, there were audit logs, but the only sign-off you needed was your manager, permission lasted 12 months, and the managers just rubber-stamped everyone.
That was ten years ago. Every time I think about it I find myself hoping that things have gotten better and knowing they haven't.
> If it wasn’t then, theoretically the US healthcare industry would grind to a halt considering the number of intermediaries for a single transaction.
It just occurred to me that cleaning up our country's data privacy / data ownership mess might have extraordinarily positive second-order effects on our Kafkaesque and criminally expensive healthcare "system".
Maybe making it functionally impossible for there to be hundreds of middlemen between me and my doctor would be a... good thing?
I think it would have the opposite effect. We could end up with a few all-in-one systems that would dominate the market and have little incentive to improve or compete on usability and price.
This is why I advocate all PII data be encrypted at rest at the field level.
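For anyone unfamiliar, here's a minimal sketch of what field-level encryption can look like using Python's cryptography package (key management is the hard part and is hand-waved here; in practice the key comes from a KMS/HSM, scoped per field or per tenant):

    from cryptography.fernet import Fernet

    # illustrative only: a real deployment pulls the key from a KMS, not generated inline
    key = Fernet.generate_key()
    fernet = Fernet(key)

    # what actually lands in the DB column is the ciphertext, not the raw PII
    dob_ciphertext = fernet.encrypt(b"1985-02-14")
    dob_plaintext = fernet.decrypt(dob_ciphertext)  # only code holding the key can read it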
Worked on EMRs (during the aughts). Had yearly HIPAA and other security theater training. Not optional for anyone in the field.
Of course we had root access. I forget the precise language, but HIPAA exempts "intermediaries" such as ourselves. How else could we build and verify the system(s)? And that's why HIPAA is a joke.
Yes, our systems had consent and permissions and audit logs cooked in. So theoretically peeking at PII could be caught. Alas, it was just CYA. None of our customers reviewed their access logs.
--
I worked very hard to figure out how to protect patient (and voter) privacy. Eventually conceded that deanon always beats anon, because of Big Data.
I eventually found the book Translucent Databases. Shows how to design schemas to protect PII for common use cases. Its One Weird Trick is applying proper protection of passwords (salt + hash) to all other PII. Super clever. And obvious once you're clued in.
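The trick can be sketched in a few lines; this is just an illustration of the idea, not the book's actual schema design:

    import hashlib, os

    SALT = os.urandom(16)  # in practice: a salt stored separately from the data it protects

    def opaque_token(pii: str, salt: bytes = SALT) -> str:
        # store this token instead of the raw value; equality lookups still work,
        # but the raw SSN/name/MRN never sits in the table
        return hashlib.pbkdf2_hmac("sha256", pii.encode(), salt, 100_000).hex()

    ssn_token = opaque_token("123-45-6789")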
That's just 1/2 of the solution.
The other 1/2 is policy (non technical). All data about a person is owned by that person. Applying property law to PII transmutes 3rd party retention of PII from an asset to a liability. And once legal, finance, and accounting people get involved, orgs won't hoard PII unless it's absolutely necessary.
(The legal term "privacy" means something like "sovereignty over oneself", not just the folk understanding of "keeping my secrets private.")
HIPAA protects you against gossipy staff. Beyond that, it's mostly smoke - entire industries are built on loopholes.
Pre-HIPAA, the hospital would tell a news reporter about your status. Now, drug marketing companies know about your prescriptions before your pharmacy does.
My wife about 9 years ago was admitted to the hospital for a ruptured ectopic pregnancy. The baby would have been about two months along.
On the would-be due date, a box arrived via FedEx of Enfamil samples and a chirpy “welcome baby” message. Of course there was no baby.
It turns out that using prescription data, admission data, and other information that is aggregated as part of subrogation and other processes, you can partially de-anonymize/rebuild a health record and make an inference. Enfamil told me what list they used and I bought it; it contained a bunch of information, including my wife's. I also know everyone in my zip code who had diabetes in 2015.
There’s even more intrusive stuff around opioid surveillance.
Insult to injury. They're all just ghouls. Profiting from other people's pain.
> what list they used
Who watches the watchers? Nobody.
We had an instance of LexisNexis' (formerly Seisint) demographic database to play with. Most people just can't wrap their heads around how bad things were in the mid-aughts. I was called a "sweaty paranoid kook" by our local paper for simply describing how things worked (no opinion or predictions).
Oh well. Maybe society has started to recognize what's happened.
There were a couple "not prod" environments, but they were either replicated directly from prod or so poorly maintained that they were unusable (empty tables, wrong schemas, off by multiple DB major versions, etc), no middle ground. So institutional culture was to just run everything against prod (for bare selects that could be copied and pasted into the textbox in the prod-access web tool) or a prod replica (for anything that needed a db connection). The training docs actually did specify Real Production, and first-week tasks included gaining Real Production access. If I walked in and was handed that training documentation today I'd raise hell and/or quit on the spot, but that was my first job out of college - it was basically everyone's first job out of college, they strongly preferred hiring new graduates - and I'd just had to give up on my PhD so I didn't have the confidence, energy, or pull to do anything about it, even bail out.
That was also the company where prod pushes happened once a month, over the weekend, and were all hands on deck in case of hiccups. It was an extraordinarily strong lesson in how not to organize software development.
(edit: if what you're really asking is "did every engineer have write access to production", the answer was, I believe, that only managers did, and they were at least not totally careless with it. not, like, actually responsible, no "formal post-mortem for why we had to use break-glass access", but it generally only got brought out to unbreak prod pushes. Still miserable.)
I worked on a project to analyze endoscope videos to find diseases. I examined a lot of images and videos annotated with symptoms of various diseases labeled by assistants and doctors. Most of them are really obvious, but others are almost impossible to detect. In rare cases, despite my best efforts, I couldn't see any difference between the spot labeled as a symptom of cancer and the surrounding area. There's no a-ha moment, like finding an insect mimicking its environment. No matter how many times I tried, I just couldn't see any difference.
Mind sharing how to get a foot into the field? I've got a good amount of domain knowledge from my studies in life science and rather meager experience from learning to code on my own for a few years. It seems like I can't compete with CS majors and gotta find a way to leverage my domain knowledge.
I'm not an expert in machine learning, but rather a web developer and data engineer helping develop a system to detect diseases from endoscopy images using the model developed by other ML engineers. And it was 5 years ago when I worked on the project, so please take it with a grain of salt.
If you want to learn machine learning for healthcare in general, it may help to start with problems using tabular data like CSVs instead of images. Image processing is a lot harder, and takes a lot of time and computational power. But it's best to learn whatever you're most interested in.
Anyway, first you need to be familiar with the basics: Python, machine learning, and popular libraries like scikit-learn, matplotlib, numpy, and pandas. There are tons of articles, textbooks, and videos to help you learn them.
If you grasp the basics, I think it's better to learn from actual code to train/evaluate models rather than more theories. Kaggle may be a good starting point. They host a lot of competitions for machine learning problems. There are easy competitions for beginners, and competitions and datasets in the medical field.
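To make the "start with tabular data" advice concrete, a first pass often looks something like this sketch (a dataset bundled with scikit-learn stands in for a downloaded CSV; illustrative only, not a recipe for good results):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # a small medical-ish tabular dataset that ships with scikit-learn
    data = load_breast_cancer(as_frame=True)
    X, y = data.data, data.target

    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)
    print(roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]))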
You can view notebooks (actual code to solve those problems, well written by experts), and the popular ones are very educational. You can learn a lot by reading that code, understanding the concepts and how to use the libraries, and modifying some of the code to see how it changes the result. ChatGPT is also helpful.
If you want to work with images, the technologies used to detect objects in images and videos are called image classification and object detection. They use CNNs, a type of deep neural network. You also need to learn basic image processing, how to train a deep neural network, how to evaluate it, and libraries like OpenCV/Pillow/PyTorch/TorchVision. There are a lot of image classification competitions in the medical field on Kaggle too[0][1].
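As a rough sketch of the transfer-learning pattern most of those notebooks use (a pretrained TorchVision backbone with a new classification head; the two-class setup is just a made-up example):

    import torch.nn as nn
    from torchvision import models, transforms

    # standard ImageNet preprocessing for a pretrained backbone
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, 2)  # e.g. lesion vs. no lesion

    # from here: wrap your labeled images in a Dataset/DataLoader, apply `preprocess`,
    # and fine-tune with an optimizer and cross-entropy loss, as the beginner notebooks do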
To run those notebooks, I recommend Google Colab. Image processing often needs a lot of GPU power, and you may not have a GPU; even if you do, it can be difficult to set up the right environment. It's easier to use those dedicated cloud services, and it doesn't cost much.
It's hard to learn, but sometimes fun, so enjoy your journey!
This sounds almost exactly like my day-to-day as a solo senior data engineer — minus building and training ML models, and I don't work in healthcare. My peers are all very non-technical business directors who are very knowledgeable about their domains, and I'm like a wizard who can conjure up time savings/custom reporting/actionable insights for them.
Collaborating with them is great, and has been a great exercise in learning how to explain complex ideas to non-technical business people. Which has the side effect of helping me get better at what I do (because you need a good understanding of a topic to be able to explain it both succinctly and accurately to others). It has also taught me to appreciate the business context and reasoning that can drive decisions about how a business uses or develops data/software.
They really do. This has been my longest tenure at any position by far and Engineer QoL is a massive part of it. Our CTO came up through the DBA/Data/Engineering Management ranks and the empathy is solidly there.
As we grow, I'm ever watchful for our metamorphosis into a big-dumb-company, but no symptoms yet. :)
>Getting my models dunked on by people who can't open MS Outlook more than 3 tries out of 5, but who have remarkable depth and insight into their chosen domain of expertise. It's rather humbling.
The people who have lasted in those roles have built up a large degree of intuition on how their domains work (or they would've done something else).
The scale of health insurance claims is incredible. My company has a process that simply identifies when car insurance should pay the bill instead of Medicaid/Medicare after a traffic accident (subrogation). Seems a minor thing, right? We process right at 1 billion a year in related claims (and I don't know our US market share, maybe like 10-20%).
I am guessing every data scientist that works for a BlueCross BlueShield in an individual state deals with processes that touch multiple millions of dollars of claims.
We even now have various dueling systems -- one company has a model to tack on more diagnoses to push up the bill, another has a process to identify that upcoding. One company has models to auto accept/deny claims, another has an automatic prior authorization process to try to usurp that later denial stage, etc.
Man, respectfully, sounds like you are doing the dirty immoral work of insurance companies. AI algos for adding diagnoses to rack up the bill? That is completely disgusting. Wholly unacceptable, and if I ever find out I've been the target of such a scam I will raise holy hell.
I am speaking more generally of the industry. You can look at my profile and follow the breadcrumb trail to see whom I work for. It is not insurance; we are a client of state systems, and I work to do the opposite of what you suggest (identify fraud/waste/abuse in insurance claims).
I do not know the chances that any individual provider uses a system to identify what codes to tack onto your bill.
I've been doing machine learning since the mid 2000s. About half of my time is spent keeping data pipelines running to get data into shape for training and using in models.
The other half is spent doing tech support for the bunch of recently hired "AI scientists" who can barely code, and who spend their days copy/pasting stuff into various chatbot services. Stuff like telling them how to install python packages and use git. They have no plan for how their work is going to fit into any sort of project we're doing, but assert that transformer models will solve all our data handling problems.
I'm considering quitting with nothing new lined up until this hype cycle blows over.
There are companies where applied scientists are required to code well. Just ask how they are hired before joining (that should be a positive feature).
Yeah, we used to be like that. Then, when this hype cycle started ramping up, the company brought in a new exec who got rid of that. I brought it up with the CEO, but nothing changed, so that's another reason for me to leave.
I like to feel useful, and like I'm actually contributing to things. I probably didn't express it well in my first post, but the attitude is very much that my current role is obsolete and a relic that's just sticking around until the AI can do everything.
It means I'm marginalized in terms of planning. The company has long term goals that involve making good use of data. Right now, the plan is that "AI" will get us there, with no plan B if it doesn't work. When it inevitably fails to live up to the hype, we're going to have a bunch of cobbled-together systems that are expensive to run, rather than something that we can keep iterating on.
It means I'm marginalized in terms of getting resources for projects. There's a lot of good my team could be doing if we had the extra budget for more engineers and computing. Instead that budget is being sent off to AI services, and expensive engineer time is being spent on tech support for people that slapped "LLM" all over their resume.
As it was in the beginning and now and ever shall be amen
At the staff/principal level it’s all about maintaining “data impedance” between the product features that rely on inference models and the data capture
This is to ensure that as the product or features change it doesn’t break the instrumentation and data granularity that feed your data stores and training corpus
For RL problems, however, it's about making sure you have the right variables captured for the state and action space tuples, and then finding how to adjust the interfaces or environment models for reward feedback
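To make that concrete, the capture side often boils down to logging a well-defined transition record; this is a hypothetical schema just to show the shape of the thing:

    import json
    from dataclasses import asdict, dataclass

    @dataclass
    class Transition:
        state: dict        # the variables you committed to capturing
        action: str
        reward: float
        next_state: dict

    def log_transition(t: Transition, path: str = "transitions.jsonl") -> None:
        # append-only log, so the training corpus keeps its granularity as the product changes
        with open(path, "a") as f:
            f.write(json.dumps(asdict(t)) + "\n")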
As somebody whose machine learning expertise consists of the first cohort of Andrew Ng's MOOC back in 2011, I'm not too surprised. One of the big takeaways I took from that experience was the importance of getting the features right.
This was very important with classical machine learning; now, with deep learning, feature engineering has become useless as the model can learn the relevant features by itself.
However, having a quality and diverse dataset is more important now than ever.
That depends on the type of data, and regardless, your goal is to minimize the input data since it has a direct impact on performance overhead and duration of inference.
Same here, it's tons of work to collect, clean, validate data, followed by a tiny fun portion where you train models, then you do the whole loop over again.
This is a large problem in industry: defining away some of the most important parts of a job or role as (should be) someone else's.
There is a lot of toil and unnecessary toil in the whole data field, but if you define away all of the "yucky" parts, you might find that all of those "someone elses" will end up eating your lunch.
It's not about "yucky" so much as specialization and only having a limited time in life to learn everything.
Should your researcher have to manage nvidia drivers and infiniband networking? Should your operations engineer need to understand the math behind transformers? Does your researcher really gain any value from understanding the intricacies of docker layer caching?
I've seen what it looks like when a company hires mostly researchers and ignores other expertise, versus what happens when a company hires diverse talent sets to build a cross domain team. The second option works way better.
If other people's work is reliant on yours, then you should know how their part of the system transforms your inputs
Similarly you should fully understand how all the inputs to your part of the system are generated
No matter your coupling pattern, if you have more than a one-person product, knowing at least one level above and below your stack is a baseline expectation
This is true with personnel leadership too, I should be able to troubleshoot one level above and below me to some level of capacity.
> I've seen what it looks like when a company hires mostly researchers and ignores other expertise, versus what happens when a company hires diverse talent sets to build a cross domain team. The second option works way better.
I've seen these too, and you aren't wrong. Division into specializations can work "way better" (i.e. the overall potential is higher), but in practice the differentiating factors that matter will come down to organizational and ultimately human factors. The anecdotal cases I draw my observations from are organizations operating at the scale of 1-10 people, as well as 1,000s, working in this field.
> Should your researcher have to manage nvidia drivers and infiniband networking? Should your operations engineer need to understand the math behind transformers? Does your researcher really gain any value from understanding the intricacies of docker layer caching?
To realize the higher potential mentioned above, what they need to be doing is appreciating the value of those things and of the people who do them, beyond "these are the people who do the things I don't want to do or don't want to understand." That appreciation usually comes from having done and understood that work.
When specializations are used, they tend to also manifest into organizational structures and dynamics which are ultimately comprised of humans. Conway's Law is worth mentioning here because the interfaces between these specializations become the bottleneck of your system in realizing that "higher potential."
As another commenter mentions, the effectiveness of these interfaces, corresponding bottlenecking effects, and ultimately the entire people-driven system is very much driven by how the parties on each side understand each other's work/methods/priorities/needs/constraints/etc, and having an appreciation for how they affect (i.e. complement) each other and the larger system.
> There is a lot of toil and unnecessary toil in the whole data field, but if you define away all of the "yucky" parts, you might find that all of those "someone elses" will end up eating your lunch.
See: the use of "devops" to encapsulate "everything besides feature development"
Used to do this job once upon a time - can't overstate the importance of just being knee-deep in the data all day long.
If you outsource that to somebody else, you'll miss out on all the pattern-matching eureka moments, and will never know the answers to questions you never think to ask.
I count at least a half dozen “just use X” replies to this comment, for at least a half dozen values of X, where X is some wrapper on top of pip or a replacement for pip or some virtual environment or some alternative to a virtual environment etc etc etc.
Why is python dependency management so cancerously bad? Why are there so many “solutions” to this problem that seem to be out of date as soon as they exist?
Are python engineers just bad, or?
(Background: I never used python except for one time when I took a coursera ML course and was immediately assaulted with conda/miniconda/venv/pip/etc etc and immediately came away with a terrible impression of the ecosystem.)
I think it is worth separating the Python ML ecosystem from the rest. While traditional Python environment management has many sore points, it is usually not terrible (though there are many gotchas and still-to-this-day problems that should have been corrected long ago).
The ML ecosystem is a whole other stack of problems. The elephant in the room is Nvidia, which is not known for playing well with others. Aside from that, the state of the art in ML is churning rapidly as new improvements are identified.
- You can’t have two versions of the same package in the namespace at the same time.
- The Python ecosystem is very bad at backwards compatibility
This means that you might require one package that requires foo below version 1.2 and another package that requires foo version 2 and above.
There is no good solution to the above problem.
This problem is amplified when lots of the packages were written by academics 10 years ago and are no longer maintained.
The bad solutions are:
1) Have 2 venvs - not always possible and if you keep making venvs you’ll have loads of them.
2) Rewrite your code to only use one library
3) Update one of the libraries
4) Don’t care about the mismatch and cross your fingers that the old one will work with the newer library.
Disk space is cheap, so where it's possible to have 2 (or more) venvs, that seems easiest. The problem with venvs is that they don't automatically activate. I've been using a very simple wrapper around python to automatically activate venvs, so I can just cd into the directory, do python foo.py, and have it use the local venv.
After many years of Django development I've settled on what I find the simplest solution. It includes activation of virtual environments when I cd into a directory.
pyenv for installing and managing python versions.
direnv for managing environments and environment variables (a highly underrated package imo).
With those two installed I just include a .envrc file in every project. It looks like this:
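(The snippet didn't make it into the comment; a typical .envrc for this kind of pyenv + direnv setup looks roughly like the following - illustrative, not the author's exact file, and the variable names are made up.)

    # .envrc
    # 'layout python' is direnv's built-in helper: it creates and activates a
    # project-local virtualenv, using whichever interpreter pyenv's shims resolve to
    layout python
    # project-specific environment variables, e.g. for Django:
    export DJANGO_SETTINGS_MODULE=myproject.settings.local
    export DATABASE_URL=postgres://localhost/myproject_dev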
You’re already managing a few hundred dependencies and their versions. Each venv roughly doubles the number of dependencies and they all have slightly different versions.
Now you're 15 venvs deep, and have over 3000 different package version combinations installed. Your job is to upgrade them right now because of a level 8 CVE.
It's not bad. It works really well. There's always room for improvement. That's technology for you. Python probably does attract more than its fair share of bad engineers, though.
Unironically yes, it really is that bad. A moderately bad language that happened to have some popular data science and math libraries from the beginning.
I can only imagine it seemed like an oasis to R which is bottom tier.
So when you combine data scientists, academics, mathematicians, juniors, grifters selling courses…
things like bad package management, horrible developer experience, absolutely no drive in the ecosystem to improve anything beyond the “pinnacle” of wrapping C libraries are all both inevitable and symptoms of a poorly designed ecosystem.
As an amateur game engine developer, I morosely reflect that my hobby seems to actually consist of endlessly chasing things that were broken by environment updates (OS, libraries, compiler, etc.). That is, most of the time I sit down to code, I actually spend nuking and reinstalling things that (I thought) were previously working.
Your comment makes me feel a little better that this is not merely some personal failing of focus, but happens in a professional setting too.
Happens in AAA too but we tend to have teams that shield everyone from that before they get to work. I ran a team like that for a couple years.
For hobby stuff at home though I don't tend to hit those types of issues because my projects are pretty frozen dependency-wise. Do you really have OS updates break stuff for you often? I'm not sure I recall that happening on a home project in quite a while.
Can recommend using conda, more specifically mambaforge/micromamba (no licensing issues when used at work).
This works way better than pip, as it does more checks/dependency checking, so it does not break as easily as pip, though this makes it definitely way slower when installing something. It also supports updating your environment to the newest versions of all packages.
It's no silver bullet and mixing it with pip leads to even more breakages, but there is pixi [0] which aims to support interop between pypi and conda packages
Yeah, I agree, maybe I should have also mentioned the bad things about it, but after trying many different tools that's the one that I stuck with, as creating/destroying environments is a breeze once you got it working and the only time my environment broke was when I used pip in that environment.
> The Python people probably can't even imagine how great dependency management is in all the other languages...
Yep, I wish I could use another language at work.
> Choosing between Anaconda/Miniconda...
I went straight with mamba/micromamba as anaconda isn't open source.
To add: Conda has parallelized downloads and is faster now. Not as fast as mamba, but faster than previously. PR merged Sep 2022 -> https://github.com/conda/conda/pull/11841
Yes I started with conda I think and ended up switching to venv and I can’t even remember why, it’s a painful blur now. It was almost certainly user error too somewhere along the way (probably one of the earlier steps), but recovering from it had me seriously considering buying a Linux laptop.
Yeah, the workday there looks pretty similar though, except that installing pytorch and pillow is usually no problem. Today it was flash-attn I spent the afternoon on.
Isn't this what containers are for? Someone somewhere gets it configured right, and then you download and run the pre-set-up container and add your job data? Or am I looking at the problem wrong?
But then how do you test out the latest model that came out from who knows where and has the weirdest dependencies and a super obscure command to install?
Dependency management is just.. hard. It is one of the things where everything relies upon it but nobody thinks "hey, this is my responsibility to improve" so it is left to people who have the most motivation, academic posts, or grant funding. This is roughly the same problem that led to heartbleed for OpenSSL.
Do you know which other ecosystem comes closest to the one existing in Python? I've heard good things about Julia.
13 years ago, when I was trying to explore the field, R seemed to be the most popular, but it looks like that's not the case anymore. (I didn't get into the field and just do regular SWE work, so I'm not aware of the trends.)
There is also a lot of development in the Elixir ecosystem around the subject [1].
I don't think the ML community has an appetite for learning a different language.
There are Microsoft-backed F# bindings for Spark and Torch, but no one seems interested. And this is despite a pretty compelling proposition of lightweight syntax, strong typing, script-ability and great performance.
The answer will probably be JavaScript.
Everyone already knows the language - all that's missing is operator overloading and a few key bindings.
> There are Microsoft-backed F# bindings for Spark and Torch
> this is despite a pretty compelling proposition of lightweight syntax, strong typing, script-ability and great performance.
For exactly the reasons you mentioned, I feel like F# would have been the perfect match for both MLE/ETL (Spark pipeline) work and some of the deep learning/graph modelling such as PyTorch.
Sadly, even from MSFT, the investment in F# has dried up.
I see your contemporary hardware choices and raise you my P900 ThinkStation with 256GB of RAM and 48 Xeon cores. Eventually it might even acquire modern graphics hardware.
I wish my company would understand this and let us use something else. Luckily, they don't really seem to care that I use my Linux based gaming machine most of the time
MPS implementations generally lag behind CUDA kernels, especially for new and cutting edge stuff. Sure, if you're only running CPU inference or only want to use the GPU for simple or well established models, then things have gotten to the point where you can almost get the plug and play experience on Apple silicon. But if you're doing research level stuff and training your own models, the hassle is just not worth it once you see how convenient ML has become in the cloud. Especially since you don't really want to store large training datasets locally anyways.
Really can't recommend Nix for Python stuff more, it's by far the best at managing dependencies that use FFI/native extensions. It can be a pain sometimes to port your existing Poetry/etc. project to work with nix2lang converters but really pays off.
Legends say there were times when you'd have a program.c file and just run cc program.c, and then could just execute the compiled result. Funny that the programmer's job is highly automatable, yet we invent tons of intermediate layers for ourselves which we absolutely have to deal with manually.
I agree simplicity is king. But you're comparing a script that uses dependencies (and the tooling for those dependencies) with a C program that has no dependencies. You can download a simple Python script and run it directly if it has no dependencies besides the stdlib (which is way larger in Python). That's why I love using bottle.py, for example.
Agree. But even with dependencies, running "make" seems way simpler than having to install a particular version of the tools for a project, make a venv, and then pick versions of dependencies.
The point is the same - we had it simpler and now, with all capabilities for automation, we have it more complex.
Frankly, I suspect most of the efforts now are spent fighting non-essential complexities, like compatibilities, instead of solving the problem at hand. That means we create problems for ourselves faster than removing them.
I actually did a small C project a couple of years ago, the spartan simplicity there can have its own pain too, like having to maintain a Makefile. LOL. It’s swings and roundabouts!
In theory you're right, CPython is written in C and it could segfault or display undefined behavior. In practice, you're quite wrong.
It's not really much of a counterargument to say that Python is good enough that you don't have to care what's under the hood, except when it breaks because C sucks so badly.
I was specifically talking about python packages using C. You type "pip install" and god knows what's going to happen. It might pull a precompiled wheel, it might just compile and link some C or Fortran code, it might need external dependencies. It might install flawlessly and crash as soon as you try to run it. All bets are off.
I never experienced CPython itself segfault, it's always due to some package.
Some Linux distros are moving that way, particularly for the included Python/pip version. My Arch Linux has already done so for some years, and I did not set it up myself - so I think it is the default.
A lot do. Personally, every single time I try to go back to conda/mamba whatever, I get some extremely weird C/C++ related linking bug - just recently, I ran into an issue where the environment was _almost_ completely isolated from the OS distro's C/C++ build infra, except for LD, which was apparently so old it was missing the vpdpbusd instruction (https://github.com/google/XNNPACK/issues/6389). Except the thing was, that wouldn't happen when building outside of the Conda environment. Very confusing. Standard virtualenvs are boring but nearly always work as expected in comparison.
I'm an Applied Scientist vs. ML Engineer, if that matters.
It's probably easier to reinstall everything anew from time to time. Instead of fixing a broken 18.04, just move to 22.04. Most tools should work, if you don't have a huge codebase which requires an old compiler...
Conda... it interferes with the OS setup and doesn't always have the best utils. Like, ffmpeg is compiled with limited options, probably due to licensing.
I do all the time, and always have (in fact my first job was bare metal OS install automation), this was Rocky 9.4. New codebase, new compiler weird errors. I did actually reinstall and switch over to Ubuntu 24.04 after that issue lol.
It causes so many entirely unnecessary issues. The conda developers are directly responsible for maybe a month of my wasted debugging time. At my last job, one of our questions for helping debug client library issues was "are you using conda", and if so we would just say we can't help you. Luckily it was rare, but if conda was involved it was 100% conda's fault somehow, and it was always a stupid decision they made that flew in the face of the rest of the python packaging community.
Data scientist python issues are often caused by them not taking the 1-3 days it takes to fully understand their tool chain. It's genuinely quite difficult to fuck up if you take the time once to learn how it all works, where your python binaries are on your system, etc. Maybe not the case 5 years ago. But today it's pretty simple.
Fully agree with this. Understand the basic tools that currently exist and you'll be fine. Conda constantly fucks things up in weird hard to debug ways...
Oh my! This hits home. We have some test scripts written in python. Every time I try to run them after a few months I spend a day fixing the environment, package dependencies and other random stuff. Python is pretty nice once it works, but managing the environment can be a pain.
You could learn how to use Python. Just spend one of those 4 hours actually learning. Imagine just getting into a car and pressing controls until something happened. This wouldn't be allowed to happen in any other industry.
Could you be a bit more specific about what you mean by "You could learn how to use python"?
What resources would you recommend to learn how to work around problems the OP has?
What basic procedures/resources can you recommend to "learn python"?
I work as a software developer alongside my studies and often face the same problems as OP that I would like to avoid. Very grateful for any tips!
Basically just use virtual environments via the venv module. The only thing you really need to know is that Python doesn't support having multiple versions of a package installed in the same environment. That means you need to get very familiar with creating (and destroying) environments. You don't need to know any of this if you just use tools that happen to be written in Python. But if you plan to write Python code then you do. It should be in Python books really, but they tend to skip over the boring stuff.
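For concreteness, the whole create/use/destroy cycle is only a handful of commands (shown for Linux/macOS; the requirements file is whatever your project pins):

    python3 -m venv .venv              # create an isolated environment in ./.venv
    source .venv/bin/activate          # use it in the current shell
    pip install -r requirements.txt    # install the project's pinned dependencies
    deactivate                         # stop using it
    rm -rf .venv                       # destroy it and start fresh when it breaks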
How do you learn anything in the space of software engineering? In general, there are many different problems and even more solutions with different tradeoffs.
To avoid spending hours on fixing broken environments after a single "pip install", I would make it easy to rollback to a known state. For example, recreate virtualenv from a lock requirements file stored in git: `pipenv sync` (or corresponding command using your preferred tool).
Oh wow, I’ve been a Python engineer for over a decade and getting dependencies right for machine learning has very little to do with Python and everything to do with c++/cuda
I've been programming with python for decades and the problem they are describing says more about the disastrous state of python's package management and the insane backwards compatibility stance python devs have.
Half of the problems I've helped some people solve stem from python devs insisting on shuffling std libraries around between minor versions.
Some libraries have a compatibility grid with different python minor versions, because of how often they break things.
Yes, I’ve switched from conda to a combination of dev containers and pyenv/pyenv-virtualenv on both my Linux and MacBook machines and couldn’t be happier
Why Ubuntu specifically? Not even being snarky. Calling out a specific distro, vs. the operating system itself. I've had more pain setting up ML environments with Ubuntu than a MacBook, personally - though pure Debian has been the easiest to get stable from scratch. Ubuntu usually screws me over one way or another after a month or so. I think I've spent a cumulative month of my life tracking down things related to changes in netplan, cloud-init, etc. Not to mention Ubuntu Pro spam being incessant, as official policy of Canonical [0]. I first used the distro all the way back at Warty Warthog, and it was my daily driver from Feisty until ~Xenial. I think it was the Silicon Valley ad in the MotD that was the last straw for me.
Not nearly as hard of a problem. Python does work just fine when it’s pure Python. The trouble comes with all the C/Cuda dependencies in machine learning
But you do because your local node_modules and upstream are out of sync and CI is broken. Happens at least once a month just before a release of course. I'd rather have my code failing locally than trying to debug what's out of sync on upstream.
Then you get one of my favorites: NVIDIA-<something> has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
I’m a regular software dev but I’ve had to do ML stuff by necessity.
I wonder how “real” ML people deal with the stochastic/gradient results and people’s expectations.
If I do ordinary software work the thing either works or it doesn’t, and if it doesn’t I can explain why and hopefully fix it.
Now with ML I get asked "why did this text classifier not classify this text correctly?" and all I can say is "it was 0.004 points away from meeting the threshold", and "it didn't meet it because of the particular choice of words or even their order", which seems to leave everyone dissatisfied.
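For what it's worth, that situation is easy to reproduce in a toy example (made-up data, not my actual classifier); the only thing separating "right" and "wrong" is whether a score lands on one side of an arbitrary threshold:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["refund my order", "love this product", "item arrived broken", "great service"]
    labels = [1, 0, 1, 0]  # 1 = complaint, 0 = not

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(texts, labels)

    proba = clf.predict_proba(["the service was fine but the item broke"])[0, 1]
    THRESHOLD = 0.5
    print(proba, int(proba >= THRESHOLD))  # whether this crosses 0.5 is all that decides the label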
Not all ML is built on neural nets. Genetic programming and symbolic regression are fun because the resulting model is just code, and software devs know how to read code.
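A hedged illustration of that readability, assuming the gplearn library (just one option among several): the fitted "model" is literally a printable expression.

    import numpy as np
    from gplearn.genetic import SymbolicRegressor

    # toy target: y = x0^2 - x1, plus a little noise
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 2))
    y = X[:, 0] ** 2 - X[:, 1] + rng.normal(0, 0.01, size=200)

    est = SymbolicRegressor(population_size=500, generations=10, random_state=0)
    est.fit(X, y)
    print(est._program)  # prints an expression tree, e.g. something like sub(mul(X0, X0), X1)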
Symbolic regression has the same failure mode; the reasons why the model failed can be explained in a more digestible way, but the actual truth of what happened is fundamentally similar -- some coefficient was off by some amount and/or some monomial beat out another in some optimization process.
At least with symbolic regression you can treat the model as an analyzable entity from first principles theories. But that's not really particularly relevant to most failure modes in practice, which usually boil down to either missing some qualitative change such as a bifurcation or else just parameters being off by a bit. Or a little bit of A and a little bit of B.
Genetic programming however isn’t machine learning, but instead it’s an AI algorithm. An extremely interesting one as well! It was fun to have my eyes opened after being taught genetic algorithms, to then be brought into genetic programming
My job title is ML Engineer, but my day to day job is almost pure software engineering.
I build the systems to support ML systems in production. As others have mentioned, this includes mostly data transformation, model training, and model serving.
Our job is also to support scientists to do their job, either by building tools or modifying existing systems.
However, looking outside, I think my company is an outlier. It seems that in the industry the expectations for an ML Engineer are more aligned with what a data/applied scientist does (e.g. building and testing models). That introduces a lot of ambiguity into the expectations for each role in each company.
In my experience your company is doing it right, and doing it the way that other successful companies do.
I gave a talk at the Open Source Summit on MLOps in April, and one of the big points I try to drive home is that it's 80% software development and 20% ML.
My company is largely the same. I’m an MLE and partner with data scientists. I don’t train or validate the models. I productionize and instrument the feature engineering pipelines and model deployments. More data engineering and MLOps than anything. I’m in a highly regulated industry so the data scientists have many compliance tasks related to the models and we engineers have our own compliance tasks related to the deployments. I was an MLE at another company in the very same industry before and did everything in the model lifecycle and it was just too much.
Been working as an MLE for the last 5 years and as another comment said most of the work is close to SWE. Depending on the stage of the project I'm working on, day-to-day work varies but it's along the lines of one of these:
- Collaboration with stakeholders & TPMs and analyzing data to develop hypotheses to solve business problems with high priority
- Framing business problems as ML problems and creating suitable metrics for ML models and business problems
- Building PoCs and prototypes to validate the technical feasibility of the new features and ideas
- Creating design docs for architecture and technical decisions
- Collaborating with the platform teams to set up and maintain the data pipelines based on the needs of new and existing ML projects
- Building, deploying, and maintaining ML microservices for inference
- Writing design docs for running A/B tests and performing post-test analyses
- Setting up pipelines for retraining of ML models
Not my main work, but spending a lot of time gluing things together. Tweaking existing open source. Figuring out how to optimize resources, retraining models on different data sets. Trying to run poorly put together python code. Adding missing requirements files. Cleaning up data. Wondering what could in fact really be useful to solve with ML that hasn't been done years ago already. Browsing the prices of the newest GPUs and calculating whether it would be worth it to get one rather than renting overpriced hours off hosting providers. Reading papers until my head hurts, that is, just one by one, by the time I finish the abstract and have glanced over a few diagrams in the middle.
I work on optimizing our inference code, "productizing" our trained models and currently I'm working on local training and inference since I work in an industry where cloud services just aren't very commonly used yet. It's super interesting too since it's not LLMs, meaning that there aren't as many pre made tools and we have to make tons of stuff by ourselves. That means touching anything from assessing data quality (again, the local part is the challenge) to using CUDA directly as we already have signal processing libs that are built around it and that we can leverage.
Sometimes it also involves building internal tooling for our team (we are a mixed team of researchers/MLEs), to visualize the data and the inferences as again, it's a pretty niche sector and that means having to build that ourselves. That allowed me to have a lot of impact in my org as we basically have complete freedom w.r.t tooling and internal software design, and one of the tools that I built basically on a whim is now on the way to be shipped in our main products too.
Although I studied machine learning and was originally hired for that role, the company pivoted and is now working with LLMs, so I spend most of my day working on figuring out how different LLMs work, what parameters work best for them, how to do RAG, how to integrate them with other bots.
There is a vanishingly small percentage of people actually working on the design and training of LLMs vs all those who call themselves "AI engineers" who are just hitting APIs.
50%+ of my time is spent on backend engineering because the ML is used inside a bigger API.
I take responsibility for the end to end experience of said API, so I will do whatever gives the best value per time spent. This often has nothing to do with the ML models.
We live in a time when there are many more alleged machine learning roles than real ones.
I'd argue that if you are not spending >50% of your time in model development and research then it is not a machine learning role.
I'd also say that nothing necessitates the vast majority of an ML role being about data cleaning, etc. I'd suggest that indicates that the role is de facto not a machine learning role, although it may say so on paper.
not sure if this counts as ML engineering, but I support all the infra around the ML models: caching, scaling, queues, decision trees, rules engines, etc.
In larger companies, and, specifically, bigger projects, systems tend to have multiple ML components, and those are usually a mix of large NN models and more classical (ML) algorithms, so you end up tweaking multiple parts at once.
In my case, optimising such systems is ~90% of the work. For instance, can I make the model lighter or go faster and keep the performance? Loss changes, pruning, quantisation, dataset optimisation, etc. -- most of the time I'm testing out those options & tweaking parameters.
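As a toy illustration of those knobs in PyTorch (not a claim about any particular production setup), the first pass is often something like:

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))

    # prune 30% of the smallest-magnitude weights in the first layer, then bake it in
    prune.l1_unstructured(model[0], name="weight", amount=0.3)
    prune.remove(model[0], "weight")

    # post-training dynamic quantization of the Linear layers to int8
    quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)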
There is of course the deployment part, but this one is usually a quickie if your team has specific processes/pipelines for this. There's a checklist of what you must do while deploying, along with cost targets.
In my case, there are established processes and designated teams for cleaning & collecting data, but you still do a part of it yourself to provide guidelines. So, even though data is always a perpetual problem, I can shed off most of that boring stuff.
Ah, and of course you're not a real engineer if you don't spend at least 1-2% of your time explaining to other people (surprisingly often to technical staff, just not ML-oriented) why doing X is a really bad idea. Or just explaining how ML systems work with ill-fitted metaphors.
* 15% of my time in technical discussion meetings or 1:1's. Usually discussing ideas around a model, planning, or ML product support
* 40% ML development. In the early phase of the project, I'm understanding product requirements. I discuss an ML model or algorithm that might be helpful to achieve product/business goals with my team. Then I gather existing datasets from analysts and data scientists. I use those datasets to create a pipeline that results in a training and validation dataset. While I wait for the train/validation datasets to populate (could take several days or up to two weeks), I'm concurrently working on another project that's earlier or further along in its development. I'm also working on the new model (written in PyTorch), testing it out with small amounts of data to gauge its offline performance, to assess whether or not it does what I expect it to do. I sanity check it by running some manual tests using the model to populate product information. This part is more art than science because without a large scale experiment, I can only really go by the gut feel of myself and my teammates. Once the train/valid datasets have been populated, I train a model on large amounts of data, check the offline results, and tune the model or change the architecture if something doesn't look right. After offline results look decent or good, I then deploy the model to production for an experiment. Concurrently, I may be making changes to the product/infra code to prepare for the test of the new model I've built. I run the experiment and ramp up traffic slowly, and once it's at 1-5% allocation, I let it run for weeks or a month. Meanwhile, I'm observing the results and have put in alerts to monitor all relevant pipelines to ensure that the model is being trained appropriately so that my experiment results aren't altered by unexpected infra/bug/product factors that should be within my control. If the results look as expected and match my initial hypothesis, I then discuss with my team whether or not we should roll it out and if so, we launch! (Note: model development includes feature authoring, dataset preparation, analysis, creating the ML model itself, implementing product/infra code changes)
* 20% maintenance – Just because I'm developing new models doesn't mean I'm ignoring existing ones. I'm checking in on those daily to make sure they haven't degraded and resulted in unexpected performance in any way. I'm also fixing pipelines and making them more efficient.
* 15% research papers and skills – With the world of AI/ML moving so fast, I'm continually reading new research papers and testing out new technologies at home to keep up to date. It's fun for me so I don't mind it. I don't view it as a chore to keep me up-to-date.
* 10% internal research – I use this time to learn more about other products within the team or the company to see how my team can help or what technology/techniques we can borrow from them. I also use this time to write down the insights I've gained as I look back on my past 6 months/1 year of work.
I select papers based on references from coworkers, Twitter posts by prominent ML researchers I follow, ML podcasts, and results.
The research becomes relevant immediately because my team is always looking to incorporate it into our production models right away. Of course it does take some planning (3-6 months) before it's fully rolled out in production.
The same job title could have very different responsibilities. I have been more of a "full-stack" MLE, I work from research and prototyping all the way to production software engineering and ML infra/dev-ops. I have also been mentoring MLEs and ML tech leads/managers - schedule a free 30-min intro call at: https://cal.com/studioxolo/intro.
junior-level role, but currently it is a mix of half proxy product owner work and half software engineering.
the users are researchers and have deep technical knowledge of their use case. it is still a challenge to map their needs into design decisions of what they want in the end. thanks to open-source efforts, the model creation is rather straightforward. but everything around making that happen and shaping it like a tool is a ride.
especially love the ever-changing technical stack of "AI" services by major cloud providers rn. it makes mlops nothing more than a demo imho.
90% of the time it's figuring out what data to feed into neural networks, 2% of the time figure out stuff about neural networks and the other 8% of the time figure out why on earth the recall rate is 100%.
Pretty much the same as the others, building tools, data cleaning, etc. But something I don't see mentioned: experiment design / data collection protocols.
Demand is higher for flashy things that look good on directors' desks, definitely. But there's less attention on less flashy applications of machine learning, unless your superiors are so clueless that they think what you're doing is GenAI. Sometimes the systems/models being trained are legitimately generative, but in the more technical, traditional sense.
Underrated comment. At my place of work, I find this to be a huge part of the MLE job. Everyone knows R but none of the cloud tools have great R support.
I worked as an engineer on an academic project that used sound to detect chewing. I acted as a domain expert to help the grad student doing the project understand how to select an appropriate microphone to couple to skin, how to interface that microphone to an ADC, offered insights into useful features to explore based on human biology, found a sound processing library that gave us 100's of new features for very little work, helped design a portable version of the electronics, did purchasing, helped troubleshoot during the times we were getting poor results, monitored human safety so we weren't connecting student volunteers to earth ground or AC power, helped write up human subjects committee review application forms, tracked down sources of electrical noise in the system, helped analyze results, reviewed 3D models for a headset to hold the microphone and electronics, wrote some signal processing/feature extracting code, selected an appropriate SBC system, explored confounding factors that were different between lab tests and field tests, and more. The grad student organized tests to record chewing sounds and ground truth video (a cheap camera in the bill of a baseball hat), arranged for a Mechanical Turk-like service to mark up the sound data based on the ground truth using a video analysis app I found, tried different pre-written ML algorithms to figure out which worked the best across different types of food, and analyzed which features contributed most to accuracy so we could eliminate most of them and get the thing to run on a wearable system with reasonable battery life. Then we collaborated with some medical doctors (our target customer for the device, it was not intended for consumers) to run tests on kids in their eating labs and wrote papers about it. The project ended when the grad student graduated.
I worked on other ML projects as well. A system that analyzed the syslogs of dozens of computers to look for anomalies. I wrote the SQL (256 fields in the query! Most complex SQL I've ever written) that prefiltered the log data to present it to the ML algorithm. And built a server that sniffed encrypted log data we broadcast on the local network in order to gather the data in one place continuously. Another system used heart rate variability to infer stress. I helped design a smartwatch and implemented the drivers that took in HRV data from a Bluetooth chest strap. We tested the system on ourselves. None of our ML projects involved writing new ML algorithms; we just used already implemented ones off the shelf. The main work was getting the data, cleaning the data, and fine tuning or implementing new feature extractors. The CS people weren't familiar with the biological aspects (digging into Gray's Anatomy), sensors, wireless, or electronics, so I handled a lot of that. I could have done all the work, it's not hard to run ML algorithms (we often had to run them on computing clusters to get results in a reasonable amount of time, which we automated) or figure out which features are important, but then the students wouldn't have graduated :-) Getting labeled data for supervised learning is the most time consuming part. If you're doing unsupervised learning then you're at the mercy of what is in the data, and you hope what you want is in there with the amount of detail you need. Interfacing with domain experts is important. Depending on the project a wide variety of different skills can be required, many not related to coding at all. You may not be responsible for them all, but it will help if you at least understand them.