Ask HN: Machine learning engineers, what do you do at work?
365 points by Gooblebrai 8 months ago | 226 comments
I'm curious about the day-to-day of a Machine Learning engineer. If you work in this field, could you share what your typical tasks and projects look like? What are you working on?



Getting my models dunked on by people who can't open MS Outlook more than 3 tries out of 5 but who have remarkable depth and insight into their chosen domain of expertise. It's rather humbling.

Collaborating with nontechnical people is oddly my favorite part of doing MLE work right now. It wasn't the case when I did basic web/db stuff. They see me as a magician. I see them as voodoo priests and priestesses. When we get something trained up and forecasting that we both like, it's super fulfilling. I think for both sides.

Most of my modeling is healthcare related. I tease insights out of a monstrous data lake of claims, Rx, doctor notes, vital signs, diagnostic imagery, etc. What is also monstrous is how accessible this information is. HIPAA my left foot.

Since you seem to be asking about the temporal realities: it's about 3 hours of meetings a week, probably another 3 doing task grooming/preparatory stuff, fixing some ETL problem, or doing a one-off query for the business. The rest is swimming around in the data trying to find a slight edge to forecast something that surprised us for a million or two, using our historical snapshots. It's like playing Where's Waldo with math. And the Waldo scene ends up being about 50TB or so in size. :D


The Dead Internet Theory says most activity on the internet is by bots [1]. The Dead Privacy Theory says approximately all private data is not private, but rather accessible on a whim by any data scientist, SWE, analyst, or db admin with access to the database.

[1] https://en.wikipedia.org/wiki/Dead_Internet_theory


> The Dead Privacy Theory says approximately all private data is not private, but rather accessible on a whim by any data scientist, SWE, analyst, or db admin with access to the database.

I like this so much I'm definitely stealing it!


Damn, I've talked about this many times at my last job (a startup that went from 100k patients to ~2.5M in 5 years). I love the name Dead Privacy Theory.


At least more of that access is logged now. It used to be that only the production databases were properly logged; now it's more common for every query to be logged, even in dev environments. The next step will be more monitoring of those logs.


> HIPAA my left foot.

That was my experience as well - training documentation for fresh college grads (i.e. me) directed new engineers to just... send SQL queries to production to learn. There was a process for gaining permissions, there were audit logs, but the only sign-off you needed was your manager, permission lasted 12 months, and the managers just rubber-stamped everyone.

That was ten years ago. Every time I think about it I find myself hoping that things have gotten better and knowing they haven't.


So…

Is it surprising that engineers in healthcare don't read the actual HIPAA documentation?

Use of health data is permitted so long as it’s for payment, treatment or operations. Disclosures and patient consent are not required.

There are helpful summaries of the various rules (Security, Privacy & Notification) on the US Department of Health and Human Services website.

Source: https://www.hhs.gov/hipaa/for-professionals/privacy/guidance...

This allowance is granted to covered entities and, by extension, their vendors (business associates) by HIPAA.

If it wasn’t then, theoretically the US healthcare industry would grind to a halt considering the number of intermediaries for a single transaction.

Example:

Doctor writes script -> EHR -> Pharmacy -> Switch -> Clearinghouse -> PA Processing -> PBM/Plan makes determination

Along this flow there are other possible branches and vendors.

It’s beyond complex.


> If it wasn’t then, theoretically the US healthcare industry would grind to a halt considering the number of intermediaries for a single transaction.

It just occurred to me that cleaning up our country's data privacy / data ownership mess might have extraordinarily positive second-order effects on our Kafkaesque and criminally expensive healthcare "system".

Maybe making it functionally impossible for there to be hundreds of middlemen between me and my doctor would be a... good thing?


Correct. It would also stop a lot of scams like identity theft.

A national identity service, like a normal mature economy, would be a massive public policy and public health win.

But it'd slightly decrease profits. So of course it's quite impossible to implement.


Only if your mortgage isn’t paid by such a middleman.


I think it would have the opposite effect. We could end up with a few all-in-one systems that would dominate the market and have little incentive to improve or compete on usability and price.


This is why I advocate all PII data be encrypted at rest at the field level.

Worked on EMRs (during the aughts). Had yearly HIPAA and other security theater training. Not optional for anyone in the field.

Of course we had root access. I forget the precise language, but HIPAA exempts "intermediaries" such as ourselves. How else could we build and verify the system(s)? And that's why HIPAA is a joke.

Yes, our systems had consent and permissions and audit logs cooked in. So theoretically peeking at PII could be caught. Alas, it was just CYA. None of our customers reviewed their access logs.

--

I worked very hard to figure out how to protect patient (and voter) privacy. Eventually conceded that deanon always beats anon, because of Big Data.

I eventually found the book Translucent Databases. It shows how to design schemas to protect PII for common use cases. Its One Weird Trick is applying the proper protection we give passwords (salt + hash) to all other PII. Super clever. And obvious once you're clued in.
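
Roughly, the trick looks like this (a minimal sketch of the idea, not the book's actual schema; pii_token and the field names are made up):

    import hashlib
    import os

    def pii_token(value: str, salt: bytes) -> str:
        # Deterministic, salted hash: store and join on this token instead of the raw value.
        return hashlib.pbkdf2_hmac("sha256", value.encode(), salt, 100_000).hex()

    # One secret salt per field/table, kept outside the main database. Lookups still
    # work: hash the value you're searching for with the same salt and compare tokens.
    salt = os.urandom(16)
    row = {"patient_id": pii_token("123-45-6789", salt), "lab_result": "A1C 7.2"}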

That's just 1/2 of the solution.

The other 1/2 is policy (non technical). All data about a person is owned by that person. Applying property law to PII transmutes 3rd party retention of PII from an asset to a liability. And once legal, finance, and accounting people get involved, orgs won't hoard PII unless it's absolutely necessary.

(The legal term "privacy" means something like "sovereignty over oneself", not just the folk understanding of "keeping my secrets private.")


> Use of health data is permitted so long as it’s for payment, treatment or operations

Please, do tell me: where does "estimating better pricing models to extract more profit" fit into those?

> Doctor writes script -> EHR -> Pharmacy -> Switch -> Clearinghouse -> PA Processing -> PBM/Plan makes determination

Everything the OP mentions happens well after the paths you described have run their course. People have already been charged. Treatment was already decided.


Define operations, because that sounds like a loophole that basically allows you to use it for anything


HIPAA protects you against gossipy staff. Beyond that, it's mostly smoke - entire industries are built on loopholes.

Pre-HIPAA, the hospital would tell a news reporter about your status. Now, drug marketing companies know about your prescriptions before your pharmacy does.


Mid-aughts, we devs sat in the meetings discussing how to enable this, with teams from Google Health, Microsoft HealthVault, and some pharmas.

In attendance were marketing, biz dev, execs. Ours and theirs. But no legal. Hmmm.

And it would have worked if it weren't for those meddling teenagers.

From your comment, I'm inferring they're now allowed to use health records for marketing.


About 9 years ago my wife was admitted to the hospital for a ruptured ectopic pregnancy. The baby would have been about two months along.

On the would-be due date, a box arrived via FedEx of Enfamil samples and a chirpy “welcome baby” message. Of course there was no baby.

It turns out that using prescription data, admission data, and other information that is aggregated as part of subrogation and other processes, you can partially de-anonymize/rebuild a health record and make an inference. Enfamil told me what list they used and I bought it; it contained a bunch of information, including my wife's. I also know everyone in my zip code who had diabetes in 2015.

There’s even more intrusive stuff around opioid surveillance.


I'm sorry for your loss. Shit.

> Enfamil samples...

Insult to injury. They're all just ghouls, profiting from other people's pain.

> what list they used

Who watches the watchers? Nobody.

We had an instance of LexisNexis' (formerly Seisint) demographic database to play with. Most people just can't wrap their heads around how bad things were in the mid-aughts. I was called a "sweaty paranoid kook" by our local paper for simply describing how things worked (no opinion or predictions).

Oh well. Maybe society has started to recognize what's happened.

Thanks for reading my rant. Peace.


>There’s even more intrusive stuff around opioid surveillance.

Can you please explain more about this or point me towards a source?


Basically the only thing you can't do with the data is disclose it to someone who doesn't also fall under HIPAA


... you didn't have a UAT environment?


There’s an old joke that everyone’s got a testing environment, but some people are lucky enough to have a separate production environment.


There were a couple "not prod" environments, but they were either replicated directly from prod or so poorly maintained that they were unusable (empty tables, wrong schemas, off by multiple DB major versions, etc.) - no middle ground. So the institutional culture was to just run everything against prod (for bare selects that could be copied and pasted into the textbox in the prod-access web tool) or a prod replica (for anything that needed a db connection). The training docs actually did specify Real Production, and first-week tasks included gaining Real Production access.

If I walked in and was handed that training documentation today I'd raise hell and/or quit on the spot, but that was my first job out of college - it was basically everyone's first job out of college, they strongly preferred hiring new graduates - and I'd just had to give up on my PhD, so I didn't have the confidence, energy, or pull to do anything about it, even bail out.

That was also the company where prod pushes happened once a month, over the weekend, and were all hands on deck in case of hiccups. It was an extraordinarily strong lesson in how not to organize software development.

(edit: if what you're really asking is "did every engineer have write access to production", the answer was, I believe, that only managers did, and they were at least not totally careless with it. not, like, actually responsible, no "formal post-mortem for why we had to use break-glass access", but it generally only got brought out to unbreak prod pushes. Still miserable.)


I worked on a project to analyze endoscope videos to find diseases. I examined a lot of images and videos annotated with symptoms of various diseases labeled by assistants and doctors. Most of them are really obvious, but others are almost impossible to detect. In rare cases, despite my best efforts, I couldn't see any difference between the spot labeled as a symptom of cancer and the surrounding area. There's no a-ha moment, like finding an insect mimicking its environment. No matter how many times I tried, I just couldn't see any difference.


Mind sharing how to get a foot in the door in this field? I've got a good amount of domain knowledge from my studies in life science and rather meager experience from learning to code on my own for a few years. It seems like I can't compete with CS majors and have to find a way to leverage my domain knowledge.


I'm not an expert in machine learning, but rather a web developer and data engineer who helped develop a system to detect diseases from endoscopy images using a model developed by other ML engineers. And it was 5 years ago when I worked on the project, so please take this with a grain of salt.

If you want to learn machine learning for healthcare in general, it may help to start with problems on tabular data like CSVs instead of images. Image processing is a lot harder and takes a lot of time and computational power. But it's best to learn what you're interested in the most.

Anyway, first you need to be familiar with the basics: Python, machine learning, and popular libraries like scikit-learn, matplotlib, numpy, and pandas. There are tons of articles, textbooks, and videos to help you learn them.
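
To make that concrete, a first pass at a tabular problem usually looks roughly like this (just a sketch; patients.csv and the diagnosis column are made-up placeholders):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Hypothetical CSV with numeric features and a binary "diagnosis" label.
    df = pd.read_csv("patients.csv")
    X = df.drop(columns=["diagnosis"])
    y = df["diagnosis"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )

    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)

    # AUC is a common choice for imbalanced medical labels.
    print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))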

Once you grasp the basics, I think it's better to learn from actual code to train/evaluate models rather than more theory. Kaggle may be a good starting point. They host a lot of competitions for machine learning problems. There are easy competitions for beginners, and competitions and datasets in the medical field.

You can view notebooks (actual code to solve those problems, well written by experts), and the popular ones are very educational. You can learn a lot by reading that code, understanding the concepts and how the libraries are used, and modifying some code to see how it changes the result. ChatGPT is also helpful.

If you want to work with images and videos, the relevant techniques are image classification and object detection, which use CNNs, a kind of deep neural network. You also need to learn basic image processing, how to train a deep neural network, how to evaluate it, and libraries like OpenCV/Pillow/PyTorch/TorchVision. There are a lot of image classification competitions in the medical field on Kaggle too [0][1].
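
As a rough starting point (a sketch only, assuming your images are laid out one subfolder per class under data/train/; normalization, validation, and multiple epochs are omitted), fine-tuning a pretrained CNN looks something like this:

    import torch
    from torch import nn
    from torchvision import datasets, models, transforms

    tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    train_ds = datasets.ImageFolder("data/train", transform=tfm)
    loader = torch.utils.data.DataLoader(train_ds, batch_size=32, shuffle=True)

    # Start from a pretrained CNN and replace the final layer for your classes.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, len(train_ds.classes))

    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    for images, labels in loader:  # a single pass; real training runs many epochs
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()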

To run those notebooks, I recommend Google Colab. Image processing often needs a lot of GPU power, and you may not have GPUs - or even if you do, it's difficult to set up the right environment. It's easier to use a dedicated cloud service, and it doesn't cost much.

It's hard to learn, but sometimes fun, so enjoy your journey!

[0] https://www.kaggle.com/datasets/paultimothymooney/chest-xray... [1] https://www.kaggle.com/code/arkapravagupta/endoscopy-multicl...


This sounds almost exactly like my day-to-day as a solo senior data engineer — minus building and training ML models, and I don't work in healthcare. My peers are all very non-technical business directors who are very knowledgeable about their domains, and I'm like a wizard who can conjure up time savings/custom reporting/actionable insights for them.

Collaborating with them is great, and has been a great exercise in learning how to explain complex ideas to non-technical business people. Which has the side effect of helping me get better at what I do (because you need a good understanding of a topic to be able to explain it both succinctly and accurately to others). It has also taught me to appreciate the business context and reasoning that can drive decisions about how a business uses or develops data/software.


> I tease insights out of a monstrous data lake of claims, Rx, doctor notes, vital signs

I'm curious to know the tech stack behind converting unstructured data to structured data (for reporting and analysis).


Take a look at AWS Healthlake and AWS Comprehend Medical


3 hours of meetings a week, that’s incredible. Sounds like your employer understands and values your time!


They really do. This has been my longest tenure at any position by far and Engineer QoL is a massive part of it. Our CTO came up through the DBA/Data/Engineering Management ranks and the empathy is solidly there.

As we grow, I'm ever watchful for our metamorphosis into a big-dumb-company, but no symptoms yet. :)


13 meetings/week, at least one full day of work wasted for me


>Getting my models dunked on by people who can't open MS Outlook more than 3 tries out of 5 but who have remarkable depth and insight into their chosen domain of expertise. It's rather humbling.

The people who have lasted in those roles have built up a large degree of intuition on how their domains work (or they would've done something else).


HIPAA your left foot because nobody reads fine print anymore and signs their soul away for a one-time $60 discount.


What business are you in that predicting health data can make you millions?


The scale of health insurance claims is incredible. My company has a process that simply identifies when car insurance should pay the bill instead of Medicaid/Medicare after a traffic accident (subrogation). Seems like a minor thing, right? We process right at $1 billion a year in related claims (and I don't know our US market share, maybe like 10-20%).

I am guessing every data scientist who works for a Blue Cross Blue Shield in an individual state deals with processes that touch many millions of dollars in claims.

We even now have various dueling systems -- one company has a model to tack on more diagnoses to push up the bill, another has a process to identify that upcoding. One company has models to auto accept/deny claims, another has an automatic prior authorization process to try to usurp that later denial stage, etc.


Man, respectfully, it sounds like you are doing the dirty, immoral work of insurance companies. AI algos for adding diagnoses to rack up the bill? That is completely disgusting. Wholly unacceptable, and if I ever find out I've been the target of such a scam I will raise holy hell.


I am speaking more generally about the industry. You can look at my profile and follow the breadcrumb trail to see whom I work for. It is not insurance; we are a client of state systems, and I work to do the opposite of what you suggest (identify fraud/waste/abuse in insurance claims).

I do not know the chances that any individual provider uses a system to identify what codes to tack onto your bill.


Insurance, health benefits.


I've been doing machine learning since the mid 2000s. About half of my time is spent keeping data pipelines running to get data into shape for training and using in models.

The other half is spent doing tech support for the bunch of recently hired "AI scientists" who can barely code, and who spend their days copy/pasting stuff into various chatbot services. Stuff like telling them how to install python packages and use git. They have no plan for how their work is going to fit into any sort of project we're doing, but assert that transformer models will solve all our data handling problems.

I'm considering quitting with nothing new lined up until this hype cycle blows over.


There are companies where applied scientists are required to code well. Just ask how they are hired before joining (that should be a positive feature).


Yeah, we used to be like that. Then, when this hype cycle started ramping up, the company brought in a new exec who got rid of that. I brought it up with the CEO, but nothing changed, so that's another reason for me to leave.


I just quit a day ago with nothing lined up for the same reason.


I wonder if this is how the OG VR guys felt in 2016.


Well, Palmer Luckey sold Oculus and now makes military gear, so I guess he chose violence after his VR era.


Why help them? Tell your boss you're pointing them to the other ones you helped previously.


You’re living the dream. Why quit ?


I like to feel useful, and like I'm actually contributing to things. I probably didn't express it well in my first post, but the attitude is very much that my current role is obsolete and a relic that's just sticking around until the AI can do everything.

It means I'm marginalized in terms of planning. The company has long term goals that involve making good use of data. Right now, the plan is that "AI" will get us there, with no plan B if it doesn't work. When it inevitably fails to live up to the hype, we're going to have a bunch of cobbled-together systems that are expensive to run, rather than something that we can keep iterating on.

It means I'm marginalized in terms of getting resources for projects. There's a lot of good my team could be doing if we had the extra budget for more engineers and computing. Instead that budget is being sent off to AI services, and expensive engineer time is being spent on tech support for people that slapped "LLM" all over their resume.


Try to have some fun.


Is that really your idea of a dream?


My dreams are usually more disturbing, or fun…

But yes. My work is kind of similar… I do some data curation / coding, and help 2 engineers who report to me. I enjoy it.


The opposite of what you’d think when studying machine learning…

95% of the job is data cleaning, joining datasets together, and feature engineering. 5% is fitting and testing models.
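
For anyone outside the field, that 95% mostly looks like this (made-up tables and columns, just to show the flavor):

    import pandas as pd

    claims = pd.read_csv("claims.csv", parse_dates=["service_date"])
    members = pd.read_csv("members.csv", parse_dates=["birth_date"])

    # Cleaning: drop duplicates and obviously bad values.
    claims = claims.drop_duplicates("claim_id")
    claims = claims[claims["amount"] >= 0]

    # Joining: attach member attributes to each claim.
    df = claims.merge(members, on="member_id", how="left")

    # Feature engineering: derive model inputs from the raw columns.
    df["age_at_service"] = (df["service_date"] - df["birth_date"]).dt.days // 365
    df["claims_per_member"] = df.groupby("member_id")["claim_id"].transform("count")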


As it was in the beginning, is now, and ever shall be, amen

At the staff/principal level it’s all about maintaining “data impedance” between the product features that rely on inference models and the data capture

This is to ensure that as the product or features change it doesn’t break the instrumentation and data granularity that feed your data stores and training corpus

For RL problems, however, it's about making sure you have the right variables captured for the state and action space tuples, and then finding out how to adjust the interfaces or environment models for reward feedback


As somebody whose machine learning expertise consists of the first cohort of Andrew Ng's MOOC back in 2011, I'm not too surprised. One of the big takeaways I took from that experience was the importance of getting the features right.


I remember that class. Someone from Blackrock taught it at Hacker Dojo. The good old days of support vector machines and Matlab.


This was very important with classical machine learning. Now, with deep learning, feature engineering has become mostly useless, as the model can learn the relevant features by itself.

However, having a quality and diverse dataset is more important now than ever.


That depends on the type of data, and regardless, your goal is to minimize the input data, since it has a direct impact on performance overhead and inference time.


no, we just replaced feature engineering with architecture engineering


>was the importance of getting the features right.

Yeah, but also knowing which features to get right. Right?


In a sense, the data _is_ the model (inductive bias) so splitting « data work » and « model work » like you do is arbitrary.


Same here, it's tons of work to collect, clean, validate data, followed by a tiny fun portion where you train models, then you do the whole loop over again.


> it's tons of work to collect, clean, validate data

That's my fun part. The discovery process is a joy especially if it means ingesting a whole new domain and meeting people.


Sounds like a Data Scientist job?


This is a large problem in industry: defining away some of the most important parts of a job or role as (should be) someone else's.

There is a lot of toil and unnecessary toil in the whole data field, but if you define away all of the "yucky" parts, you might find that all of those "someone elses" will end up eating your lunch.


It's not about "yucky" so much as specialization and only having a limited time in life to learn everything.

Should your researcher have to manage Nvidia drivers and InfiniBand networking? Should your operations engineer need to understand the math behind transformers? Does your researcher really gain any value from understanding the intricacies of Docker layer caching?

I've seen what it looks like when a company hires mostly researchers and ignores other expertise, versus what happens when a company hires diverse talent sets to build a cross domain team. The second option works way better.


My answer is yes to both of those

If other people's work relies on yours, then you should know how their part of the system transforms your inputs

Similarly you should fully understand how all the inputs to your part of the system are generated

No matter your coupling pattern, if you have more than a one-person product, knowing at least one level above and below your stack is a baseline expectation

This is true with personnel leadership too, I should be able to troubleshoot one level above and below me to some level of capacity.


The parent comment had three examples...


2/3 is close enough in ML world


> I've seen what it looks like when a company hires mostly researchers and ignores other expertise, versus what happens when a company hires diverse talent sets to build a cross domain team. The second option works way better.

I've seen these too, and you aren't wrong. Division into specializations can work "way better" (i.e. the overall potential is higher), but in practice the differentiating factors that matter come down to organizational and ultimately human factors. The anecdotal cases I draw my observations from span organizations operating at the scale of 1-10 people as well as 1,000s of people working in this field.

> Should your researcher have to manage Nvidia drivers and InfiniBand networking? Should your operations engineer need to understand the math behind transformers? Does your researcher really gain any value from understanding the intricacies of Docker layer caching?

To realize the higher potential mentioned above, what they need to be doing is appreciating the value of those things and of the people who do them, beyond "these are the people that do the things I don't want to do or don't want to understand." That appreciation usually comes from having done and understood that work.

When specializations are used, they tend to also manifest into organizational structures and dynamics which are ultimately comprised of humans. Conway's Law is worth mentioning here because the interfaces between these specializations become the bottleneck of your system in realizing that "higher potential."

As another commenter mentions, the effectiveness of these interfaces, the corresponding bottlenecking effects, and ultimately the entire people-driven system are very much driven by how well the parties on each side understand each other's work/methods/priorities/needs/constraints/etc., and by how much they appreciate how those affect (i.e. complement) each other and the larger system.


> There is a lot of toil and unnecessary toil in the whole data field, but if you define away all of the "yucky" parts, you might find that all of those "someone elses" will end up eating your lunch.

See: the use of "devops" to encapsulate "everything besides feature development"


Used to do this job once upon a time - can't overstate the importance of just being knee-deep in the data all day long.

If you outsource that to somebody else, you'll miss out on all the pattern-matching eureka moments, and will never know the answers to questions you never think to ask.


My partner is a data engineer; from what I've gathered, the departments are often very small or one person, so the roles end up blending together a lot.


A good DS can double as an MLE.


And sometimes, a good MLE can double as a DS.

Personally I think we calcified the roles around data a little too soon but that's probably because there was such demand and the space is wide.


“Scientist”? Is this like Software Engineer?


I guess it means "someone who has or is about to have a PhD".


Sounds like a data engineer job to me


pip install pytorch

Environment broken

Spend 4 hours fixing python environment

pip install Pillow

Something something incorrect cpu architecture for your Macbook

Spend another 4 hours reinstalling everything from scratch after nuking every single mention of python

pip install … oh time to go home!


I count at least a half dozen “just use X” replies to this comment, for at least a half dozen values of X, where X is some wrapper on top of pip or a replacement for pip or some virtual environment or some alternative to a virtual environment etc etc etc.

Why is python dependency management so cancerously bad? Why are there so many “solutions” to this problem that seem to be out of date as soon as they exist?

Are python engineers just bad, or?

(Background: I never used python except for one time when I took a coursera ML course and was immediately assaulted with conda/miniconda/venv/pip/etc etc and immediately came away with a terrible impression of the ecosystem.)


I think it is worth separating the Python ML ecosystem from the rest. While traditional Python environment management has many sore points, it is usually not terrible (though there are many gotchas and still-to-this-day problems that should have been corrected long ago).

The ML ecosystem is a whole other stack of problems. The elephant in the room is Nvidia, who is not known for playing well with others. Aside from that, the state of the art in ML is churning rapidly as new improvements are identified.


Two problems intersect:

- You can't have two versions of the same package in the namespace at the same time.

- The Python ecosystem is very bad at backwards compatibility.

This means that you might require one package that requires foo below version 1.2 and another package that requires foo version 2 and above.

There is no good solution to the above problem.

This problem is amplified when lots of the packages were written by academics 10 years ago and are no longer maintained.

The bad solutions are:

1) Have 2 venvs - not always possible, and if you keep making venvs you'll have loads of them.

2) Rewrite your code to only use one library.

3) Update one of the libraries.

4) Don't care about the mismatch and cross your fingers that the old one will work with the newer library.

Most of the tooling follows approach 1 or 4


Disk space is cheap, so where it's possible to have 2 (or more) venvs, that seems easiest. The problem with venvs is that they don't automatically activate. I've been using a very simple wrapper around python to automatically activate venvs, so I can just cd into the directory, do python foo.py, and have it use the local venv.

I threw it online at https://github.com/fragmede/python-wool/


After many years of Django development I've settled on what I find the simplest solution. It includes activation of virtual environments when I cd into a directory.

pyenv for installing and managing python versions.

direnv for managing environments and environment variables (a highly underrated package imo).

With those two installed I just include a .envrc file in every project. It looks like this:

layout python ~/.pyenv/versions/3.11.0/bin/python3

export VARIABLE1=variable1

export VARIABLE2=variable2

etc...


Oh nice. Stick

    PYTHONPATH="`pwd`/venv/lib/python3.11/site-packages"
or whatever in .envrc and Bob's your uncle.


You're already managing a few hundred dependencies and their versions. Each venv roughly doubles the number of dependencies, and they all have slightly different versions.

Now you're 15 venvs deep and have over 3000 different package version combinations installed. Your job is to upgrade them right now because of a level 8 CVE.


Yeah, that sucks. Something like:

    for venv in $(find ~/projects/ -type d -name 'venv'); do
      (
        source ${venv}/bin/activate
        pip install --upgrade pip
        pip install --upgrade package_with_cve
        deactivate
      ) &
    done
should do the trick. Sucks that we're in that world, but let's not work any harder than we have to.


I forgot about parallel, which would do better than spawning off all the pip install --upgrade at once.


It's not bad. It works really well. There's always room for improvement. That's technology for you. Python probably does attract more than its fair share of bad engineers, though.


Just replace python with mac or apple in your comment and I think you will understand.


Unironically, yes, it really is that bad. A moderately bad language that happened to have some popular data science and math libraries from the beginning.

I can only imagine it seemed like an oasis compared to R, which is bottom tier.

So when you combine data scientists, academics, mathematicians, juniors, grifters selling courses…

things like bad package management, horrible developer experience, absolutely no drive in the ecosystem to improve anything beyond the “pinnacle” of wrapping C libraries are all both inevitable and symptoms of a poorly designed ecosystem.


As an amateur game engine developer, I morosely reflect that my hobby seems to actually consist of endlessly chasing things that were broken by environment updates (OS, libraries, compiler, etc.). That is, most of the time I sit down to code I actually spend nuking and reinstalling things that (I thought) were previously working.

Your comment makes me feel a little better that this is not merely some personal failing of focus, but happens in a professional setting too.


Happens in AAA too but we tend to have teams that shield everyone from that before they get to work. I ran a team like that for a couple years.

For hobby stuff at home though I don't tend to hit those types of issues because my projects are pretty frozen dependency-wise. Do you really have OS updates break stuff for you often? I'm not sure I recall that happening on a home project in quite a while.


Oh god yes I remember trying to support old Android games several OS releases later… Impossible, I gave up!

It's why I still use React; their backcompat is amazing.


Can recommend using conda, more specifically mambaforge/micromamba (no licensing issues when used at work).

This works way better than pip, as it does more checks/dependency resolution, so it doesn't break as easily as pip, though that definitely makes it way slower when installing something. It also supports updating your environment to the newest versions of all packages.

It's no silver bullet, and mixing it with pip leads to even more breakage, but there is pixi [0], which aims to support interop between PyPI and conda packages.

[0] https://prefix.dev/


I had a bad experience with Conda:

- If they're so good at dependency management, why is Conda installed through a magical shell script?

- It's slow as molasses.

- Choosing between Anaconda/Miniconda...

When forced to use Python, I prefer Poetry, or just pip with frozen dependencies.

The Python people probably can't even imagine how great dependency management is in all the other languages...


I absolutely hate conda. I had to support a bunch of researchers who all used it and it was a nightmare.


rye and uv, while "experimental", are orders of magnitude better than poetry and pip IMHO.


Yeah, I agree - maybe I should have also mentioned the bad things about it, but after trying many different tools, that's the one I stuck with, as creating/destroying environments is a breeze once you get it working, and the only time my environment broke was when I used pip in that environment.

> The Python people probably can't even imagine how great dependency management is in all the other languages...

Yep, I wish I could use another language at work.

> Choosing between Anaconda/Miniconda...

I went straight with mamba/micromamba as anaconda isn't open source.


Mamba/micromamba solves the slowness problem of conda


To add. Conda has parallelized downloads and is faster. Not as fast as mamba, but faster than previously. pr merged sep 2022 -> https://github.com/conda/conda/pull/11841


Yes I started with conda I think and ended up switching to venv and I can’t even remember why, it’s a painful blur now. It was almost certainly user error too somewhere along the way (probably one of the earlier steps), but recovering from it had me seriously considering buying a Linux laptop.

This happened about a week ago


Whenever anyone recommends conda I automatically assume they don't know much about python.


> Environment broken

>Something something incorrect cpu architecture for your Macbook

I’m glad I have something in common with the smart people around here.


Yeah, getting a good Python environment set up is a very humbling experience.


I thought most ML engineers use their laptops as dumb terminals and just remote into a Linux GPU server.


Spoiler: my main role isn’t ML engineer :) and that doesn’t sound like a bad idea at all


Yeah, the workday there looks pretty similar though, except that installing pytorch and pillow is usually no problem. Today it was flash-attn I spent the afternoon on.


Isn't this what containers are for? Someone somewhere gets it configured right, and then you download and run a pre-set-up container and add your job data? Or am I looking at the problem wrong?


But then how do you test out the latest model that came out from who knows where and has the weirdest dependencies and a super obscure command to install?


Just email all your data to the author and ask them to run it for you.


Python's dominance is holding us back. We need a stack with a more principled approach to environments and native dependencies.


Here's what getting PyTorch built reproducibly looks like: https://hpc.guix.info/blog/2021/09/whats-in-a-package/

Since then the whole python ecosystem has gotten worse.

We are building towers on quicksand.

It's not about python, it's about people who don't care about dependencies.


Dependency management is just... hard. It is one of those things where everything relies upon it but nobody thinks "hey, this is my responsibility to improve", so it is left to the people who have the most motivation, academic posts, or grant funding. This is roughly the same problem that led to Heartbleed for OpenSSL.


Do you know which other ecosystem comes closest to the existing Python one? I've heard good things about Julia.

13 years ago when I was trying to explore the field R seemed to be the most popular, but looks like not anymore. (I didn't get into the field, and do just a regular SWE, so I'm not aware of the trends).

There is also a lot of development in Elixir ecosystem around the subject [1].

[1](https://dashbit.co/blog/elixir-ml-s1-2024-mlir-arrow-instruc...)


I don't think the ML community has an appetite for learning a different language.

There are Microsoft-backed F# bindings for Spark and Torch, but no one seems interested. And this is despite a pretty compelling proposition of lightweight syntax, strong typing, script-ability and great performance.

The answer will probably be JavaScript.

Everyone already knows the language - all that's missing is operator overloading and a few key bindings.


> There are Microsoft-backed F# bindings for Spark and Torch

> this is despite a pretty compelling proposition of lightweight syntax, strong typing, script-ability and great performance

For exactly the reasons you mentioned, I feel like F# would have been the perfect match for both MLE/ETL (Spark pipeline) work and some of the deep learning/graph modeling such as PyTorch. Sadly, even from MSFT, the investment in F# has dried up.


Not sure what you mean here - Microsoft is actively working on F# and the .NET Spark and Torch bindings.


If you're still doing ML locally in 2024 and also use an ARM macbook, you're asking for trouble.


> ARM macbook

Funnily, the only real competitor for Nvidia's GPUs is a MacBook with 128GB of RAM.


I see your contemporary hardware choices and raise you my P900 ThinkStation with 256GB of RAM and 48 Xeon cores. Eventually it might even acquire modern graphics hardware.


And they don't compete in performance.


I wish my company would understand this and let us use something else. Luckily, they don't really seem to care that I use my Linux based gaming machine most of the time


Can you expand on this a bit? My recent experiences with MLX have been really positive, so I'm curious what footguns you're alluding to here.

(I don't do most of my work locally, but for smaller models its pretty convenient to work on my mbp).


MPS implementations generally lag behind CUDA kernels, especially for new and cutting edge stuff. Sure, if you're only running CPU inference or only want to use the GPU for simple or well established models, then things have gotten to the point where you can almost get the plug and play experience on Apple silicon. But if you're doing research level stuff and training your own models, the hassle is just not worth it once you see how convenient ML has become in the cloud. Especially since you don't really want to store large training datasets locally anyways.


For real


What can I say, I enjoy pain!?


Nahh im l33t - intel macbook and no troubles.


can't read? parent clearly says "ARM macbook".


I can recommend trying Poetry. It is a lot more successful at resolving dependencies than pip.

Although I think the UX of Poetry is stupid and I don't agree with some design decisions, I haven't had any dependency conflicts since I started using it.


Just use Linux. Then you only have to fight the nvidia drivers.


Really can't recommend Nix for Python stuff more, it's by far the best at managing dependencies that use FFI/native extensions. It can be a pain sometimes to port your existing Poetry/etc. project to work with nix2lang converters but really pays off.


Maybe pip should not work by default (but python -m venv then pip install should)


Legends say there were times when you'd have a program.c file and just run cc program.c, and then could just execute the compiled result. Funny that a programmer's job is highly automatable, yet we invent tons of intermediate layers for ourselves which we absolutely have to deal with manually.


I agree simplicity is king. But you're comparing a script that uses dependencies (plus the tooling for those dependencies) with a C program that has no dependencies. You can download a simple Python script and run it directly if it has no dependencies besides the stdlib (which is way larger in Python). That's why I love using bottle.py, for example.


Agreed. But even with dependencies, running "make" seems way simpler than having to install a particular version of the tools for a project, making a venv, and then picking versions of dependencies.

The point is the same - we had it simpler, and now, with all our capabilities for automation, we have it more complex.

Frankly, I suspect most of the effort now is spent fighting non-essential complexities, like compatibility, instead of solving the problem at hand. That means we create problems for ourselves faster than we remove them.


I actually did a small C project a couple of years ago; the spartan simplicity there can have its own pain too, like having to maintain a Makefile. LOL. It's swings and roundabouts!


You can do the same in python if you're not new to programming.


And then you'd have to deal with wrong glibc versions or mysterious segfaults or undefined behavior or the code assuming the wrong arch or ...


Python solves none of those issues. It just adds a myriad of ways those problems can get to you.

All of a sudden you have people with C problems, who have no idea they're even using compiled dependencies.


In theory you're right, CPython is written in C and it could segfault or display undefined behavior. In practice, you're quite wrong.

It's not really much of a counterargument to say that Python is good enough that you don't have to care what's under the hood, except when it breaks because C sucks so badly.


I was specifically talking about Python packages using C. You type "pip install" and god knows what's going to happen. It might pull a precompiled wheel, it might just compile and link some C or Fortran code, it might need external dependencies. It might install flawlessly and crash as soon as you try to run it. All bets are off.

I never experienced CPython itself segfault, it's always due to some package.


Some Linux distros are moving that way, particularly for the included Python/pip version. My Arch Linux has already done so for some years, and I did not set it up myself - so I think it is the default.


I do this... but air-gapped :(


Oof. At our company only CI/CD agents (and laptops) are allowed to access the internet, and that's bad enough.


that sounds very painful.


Do people doing ML/DS not use conda anymore?


A lot do, personally, every single time I try to go back to conda/mamba whatever, I get some extremely weird C/C++ related linking bug - just recently, I ran into an issue where the environment was _almost_ completely isolated from the OS distro's C/C++ build infra, except for LD, which was apparently so old it was missing the vpdpbusd instruction (https://github.com/google/XNNPACK/issues/6389). Except the thing was, that wouldn't happen when building outside of of the Conda environment. Very confusing. Standard virtualenvs are boring but nearly always work as expected in comparison.

I'm an Applied Scientist vs. ML Engineer, if that matters.


It's probably easier to reinstall everything anew from time to time. Instead of fixing broken 18.04 just move to 22.04. Most tools should work, if you don't have huge codebase which requires old compiler...

Conda... it interferes with the OS setup and doesn't always have the best utils. For example, ffmpeg is compiled with limited options, probably due to licensing.


I do all the time, and always have (in fact my first job was bare metal OS install automation), this was Rocky 9.4. New codebase, new compiler weird errors. I did actually reinstall and switch over to Ubuntu 24.04 after that issue lol.


If they are they should stop.

It causes so many entirely unnecessary issues. The conda developers are directly responsible for maybe a month of my wasted debugging time. At my last job, one of our questions for helping debug client library issues was "are you using conda?" If so, we would just say we can't help you. Luckily it was rare, but if conda was involved it was 100% conda's fault somehow, and it was always a stupid decision they made that flew in the face of the rest of the Python packaging community.

Data scientist Python issues are often caused by them not taking the 1-3 days it takes to fully understand their toolchain. It's genuinely quite difficult to fuck up if you take the time once to learn how it all works, where your Python binaries are on your system, etc. Maybe that wasn't the case 5 years ago. But today it's pretty simple.


Fully agree with this. Understand the basic tools that currently exist and you'll be fine. Conda constantly fucks things up in weird hard to debug ways...


I can't point at a single reason, but I got sick of it.

The interminable solves were awful. Mamba made it better, but can still be slow.

Plenty of more esoteric packages are on PyPI but not Conda. (Yes, you can install pip packages in a conda env file.)

Many packages have a default version and a conda-forge version; it's not always clear which you should use.

In Github CI, it takes extra time to install.

Upon installation it (by default) wants to mess around with your .bashrc and start every shell in a "base" environment.

It's operated by another company instead of the Python Software Foundation.

idk, none of these are deal-breakers, but I switched to venv and have not considered going back.


Oh my! This hits home. We have some test scripts written in python. Every time I try to run them after a few months I spend a day fixing the environment, package dependencies and other random stuff. Python is pretty nice once it works, but managing the environment can be a pain.


You could learn how to use Python. Just spend one of those 4 hours actually learning. Imagine just getting into a car and pressing controls until something happened. This wouldn't be allowed to happen in any other industry.


Could you be a bit more specific about what you mean by "You could learn how to use python"? What resources would you recommend to learn how to work around problems the OP has? What basic procedures/resources can you recommend to "learn python"? I work as a software developer alongside my studies and often face the same problems as OP that I would like to avoid. Very grateful for any tips!


Basically just use virtual environments via the venv module. The only thing you really need to know is that Python doesn't support having multiple versions of a package installed in the same environment. That means you need to get very familiar with creating (and destroying) environments. You don't need to know any of this if you just use tools that happen to be written in Python. But if you plan to write Python code then you do. It should be in Python books really, but they tend to skip over the boring stuff.


How do you learn anything in the space of software engineering? In general, there are many different problems and even more solutions with different tradeoffs.

To avoid spending hours on fixing broken environments after a single "pip install", I would make it easy to rollback to a known state. For example, recreate virtualenv from a lock requirements file stored in git: `pipenv sync` (or corresponding command using your preferred tool).


Oh wow, I've been a Python engineer for over a decade, and getting dependencies right for machine learning has very little to do with Python and everything to do with C++/CUDA.


I've done it. Isn't it just following instructions? What part of that means destroying every mention of Python on the system?


I've been programming with Python for decades, and the problem they are describing says more about the disastrous state of Python's package management and the insane backwards compatibility stance Python devs have.

Half of the problems I've helped some people solve stem from python devs insisting on shuffling std libraries around between minor versions.

Some libraries have a compatibility grid for different Python minor versions, because of how often things break.


uv is a great drop-in replacement for pip: https://astral.sh/blog/uv


Docker is your friend

or pyenv at least


Yes, I’ve switched from conda to a combination of dev containers and pyenv/pyenv-virtualenv on both my Linux and MacBook machines and couldn’t be happier


Why do ML, especially nowadays on a Mac, when you can do it on an Ubuntu based machine.

Surely work can provide that?


Why Ubuntu specifically? Not even being snarky. Calling out a specific distro vs. the operating system itself. I've had more pain setting up ML environments with Ubuntu than with a MacBook, personally - though pure Debian has been the easiest to get stable from scratch. Ubuntu usually screws me over one way or another after a month or so. I think I've spent a cumulative month of my life tracking down things related to changes in netplan, cloud-init, etc. Not to mention Ubuntu Pro spam being incessant, as official policy of Canonical [0]. I first used the distro all the way back at Warty Warthog, and it was my daily driver from Feisty until ~Xenial. I think it was the Silicon Valley ad in the MOTD that was the last straw for me.

[0] https://bugs.launchpad.net/ubuntu/+source/ubuntu-meta/+bug/1...


So what I do in my free time, but getting paid for it?? Sign me up.


Use standard cloud images

> Something something incorrect cpu architecture for your Macbook


...and people criticise node's node_modules. At least you don't spend hours doing this


Not nearly as hard of a problem. Python does work just fine when it’s pure Python. The trouble comes with all the C/Cuda dependencies in machine learning


But you do, because your local node_modules and upstream are out of sync and CI is broken. It happens at least once a month, just before a release of course. I'd rather have my code failing locally than trying to debug what's out of sync upstream.


Docker fixes this!


Just use conda.


Then you get one of my favorites: NVIDIA-<something> has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


Iirc I originally used conda because I couldn’t get faiss to work in venv, lol. That was a while ago though


I’m a regular software dev but I’ve had to do ML stuff by necessity.

I wonder how “real” ML people deal with the stochastic/gradient results and people’s expectations.

If I do ordinary software work the thing either works or it doesn’t, and if it doesn’t I can explain why and hopefully fix it.

Now with ML I get asked "why did this text classifier not classify this text correctly?" and all I can say is "it was 0.004 points away from meeting the threshold", and "it didn't meet it because of the particular choice of words or even their order", which seems to leave everyone dissatisfied.
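
The whole conversation usually boils down to something like this (a toy scikit-learn sketch, not my actual model; the threshold and training data are made up):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Tiny toy classifier: 1 = complaint, 0 = not a complaint.
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(["refund please", "great product", "broken again", "love it"], [1, 0, 1, 0])

    proba = clf.predict_proba(["kind of broken but also kind of great"])[0, 1]
    THRESHOLD = 0.5
    print(proba, proba >= THRESHOLD)
    # The only honest answer to "why wasn't this flagged?" is "its score was just
    # under the threshold" - there's no single if-statement in the model to point at.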


Not all ML is built on neural nets. Genetic programming and symbolic regression is fun because the resulting model is just code, and software devs know how to read code.


Symbolic regression has the same failure mode; the reasons why the model failed can be explained in a more digestible way, but the actual truth of what happened is fundamentally similar -- some coefficient was off by some amount and/or some monomial beat out another in some optimization process.

At least with symbolic regression you can treat the model as an analyzable entity from first principles theories. But that's not really particularly relevant to most failure modes in practice, which usually boil down to either missing some qualitative change such as a bifurcation or else just parameters being off by a bit. Or a little bit of A and a little bit of B.


Genetic programming, however, isn't machine learning; instead it's an AI algorithm. An extremely interesting one as well! It was fun to have my eyes opened, after being taught genetic algorithms, to then be brought into genetic programming.


This seems to be the absolute worst of all worlds: the burden of software engineering with the tools of an English Language undergrad.


The English degree helps explain why word choice and order matter, giving you context and guidelines for software design.


My job title is ML Engineer, but my day to day job is almost pure software engineering.

I build the systems to support ML systems in production. As others have mentioned, this includes mostly data transformation, model training, and model serving.

Our job is also to support scientists to do their job, either by building tools or modifying existing systems.

However, looking outside, I think my company is an outlier. It seems that in the industry the expectations for an ML Engineer are more aligned with what a data/applied scientist does (e.g. building and testing models). That introduces a lot of ambiguity into the expectations for each role at each company.


In my experience your company is doing it right, and doing it the way that other successful companies do.

I gave a talk at the Open Source Summit on MLOps in April, and one of the big points I try to drive home is that it's 80% software development and 20% ML.

https://www.youtube.com/watch?v=pyJhQJgO8So


My company is largely the same. I’m an MLE and partner with data scientists. I don’t train or validate the models. I productionize and instrument the feature engineering pipelines and model deployments. More data engineering and MLOps than anything. I’m in a highly regulated industry so the data scientists have many compliance tasks related to the models and we engineers have our own compliance tasks related to the deployments. I was an MLE at another company in the very same industry before and did everything in the model lifecycle and it was just too much.


That's really the kind of job I'd love. Whatever the data is, I don't care. I make sure that the users get the correct data quickly.


Highly paid cleaning lady. With dirty data you get no proper results. BTW: Perl is much better than Python at this.

Highly paid motherboard troubleshooter, because all those H100s really get hot, even with water cooling, and we have no dedicated HW guy.

Fighting misbehaving third-party deps, like everyone else.


Could you talk more about “BTW: perl is much better than python on this.”?


I haven't touched Perl in more than 20 years... but I (routinely) miss something like:

   $variable = something() if sanity_check()
And

   do_something() unless $dont_do_that


Both work in python:

        variable = something() if sanity_check() else None


        do_something() if not dont_do_that else None


There exists a ternary if statement?

foo = something() if sanity_check else None

Can replace None with foo (or any other expression), if desired.


Additionally to the provided answers:

if sanity_check(): variable = something()

variable = sanity_check() and something()

(but the standard python two line way is more readable unless it's a throwaway single use code, then whatever)


Been working as an MLE for the last 5 years and as another comment said most of the work is close to SWE. Depending on the stage of the project I'm working on, day-to-day work varies but it's along the lines of one of these:

- Collaborating with stakeholders & TPMs and analyzing data to develop hypotheses for solving high-priority business problems

- Framing business problems as ML problems and creating suitable metrics for ML models and business problems

- Building PoCs and prototypes to validate the technical feasibility of the new features and ideas

- Creating design docs for architecture and technical decisions

- Collaborating with the platform teams to set up and maintain the data pipelines based on the needs of new and existing ML projects

- Building, deploying, and maintaining ML microservices for inference (see the sketch after this list)

- Writing design docs for running A/B tests and performing post-test analyses

- Setting up pipelines for retraining of ML models
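
For the inference-microservice item, a minimal sketch of what that usually looks like, assuming FastAPI and a pickled scikit-learn classifier; the file names and the feature schema are hypothetical:

    # serve.py: hypothetical minimal inference service (FastAPI + scikit-learn)
    import joblib
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()
    model = joblib.load("model.pkl")  # hypothetical artifact from the training pipeline

    class Features(BaseModel):
        values: list[float]  # hypothetical; mirrors the training-time feature order

    @app.post("/predict")
    def predict(features: Features):
        score = model.predict_proba([features.values])[0][1]
        return {"score": float(score)}

    # Run with: uvicorn serve:app --port 8000

In practice the interesting work is everything around this: validating inputs against the feature definitions, versioning the model artifact, and logging scores for later analysis.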


The number of responses here may be self-explanatory.

Not my main work, but I spend a lot of time gluing things together. Tweaking existing open source. Figuring out how to optimize resources, retraining models on different data sets. Trying to run poorly put-together Python code. Adding missing requirements files. Cleaning up data. Wondering what would actually be useful to solve with ML that hasn't already been done years ago. Browsing the prices of the newest GPUs and calculating whether it would be worth it to get one rather than renting overpriced hours from hosting providers. Reading papers until my head hurts, and that's just one at a time, by the time I've finished the abstract and glanced over a few diagrams in the middle.


Where do you locate/how do you select papers?


arXiv.org searches

Dedicated groups on social. e.g: https://www.linkedin.com/newsletters/top-ml-papers-of-the-we...


I work on optimizing our inference code and "productizing" our trained models, and currently I'm working on local training and inference, since I work in an industry where cloud services just aren't very commonly used yet. It's super interesting too, since it's not LLMs, meaning there aren't as many pre-made tools and we have to build tons of stuff ourselves. That means touching anything from assessing data quality (again, the local part is the challenge) to using CUDA directly, as we already have signal processing libs built around it that we can leverage.

Sometimes it also involves building internal tooling for our team (we are a mixed team of researchers/MLEs) to visualize the data and the inferences, as again, it's a pretty niche sector and that means having to build that ourselves. That has allowed me to have a lot of impact in my org, as we basically have complete freedom w.r.t. tooling and internal software design, and one of the tools I built basically on a whim is now on its way to being shipped in our main products too.


Although I studied machine learning and was originally hired for that role, the company pivoted and is now working with LLMs, so I spend most of my day figuring out how different LLMs work, what parameters work best for them, how to do RAG, and how to integrate them with other bots.
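
For the RAG piece, the core loop is small enough to sketch. This toy version uses a made-up embed() function and an in-memory store; a real setup swaps in an embedding model and a vector database:

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Placeholder embedding: a real system calls an embedding model here.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.standard_normal(16)

    documents = ["doc one ...", "doc two ...", "doc three ..."]
    doc_vectors = np.stack([embed(d) for d in documents])

    def retrieve(query: str, k: int = 2) -> list[str]:
        q = embed(query)
        # Cosine similarity between the query and every document.
        sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
        return [documents[i] for i in np.argsort(sims)[::-1][:k]]

    def build_prompt(query: str) -> str:
        context = "\n".join(retrieve(query))
        return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

    # build_prompt(...) then goes to whatever LLM API the product uses.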


Would you not consider LLMs as a part of machine learning?


There is a vanishingly small percentage of people actually working on the design and training of LLMs vs all those who call themselves "AI engineers" who are just hitting APIs.


Probably it's because we're not training them anymore, just using them with prompts. It seems like more of a regular SWE type of job.


Except regular SWE is way more fun than writing prompts.


I'd say deep learning is a subset of machine learning, and LLMs are a subset of deep learning.


They are the result of machine learning.


50%+ of my time is spent on backend engineering because the ML is used inside a bigger API.

I take responsibility for the end to end experience of said API, so I will do whatever gives the best value per time spent. This often has nothing to do with the ML models.


We live in a time when there are many more alleged machine learning roles than real ones.

I'd argue that if you are not spending >50% of your time in model development and research then it is not a machine learning role.

I'd also say that nothing necessitates the vast majority of an ML role being about data cleaning and the like. If it is, I'd suggest that indicates the role is de facto not a machine learning role, even if it says so on paper.


Not sure if this counts as ML engineering, but I support all the infra around the ML models: caching, scaling, queues, decision trees, rules engines, etc.


MLOps, sure.


What do you do with decision trees specifically?


In larger companies, and specifically on bigger projects, systems tend to have multiple ML components, usually a mix of large NN models and more classical ML algorithms, so you end up tweaking multiple parts at once. In my case, optimising such systems is ~90% of the work. For instance: can I make the model lighter, or make it go faster, while keeping the performance? Changing the loss, pruning, quantisation, dataset optimisation, etc.; most of the time I'm testing out those options and tweaking parameters. There is of course the deployment part, but that one is usually quick if your team has specific processes/pipelines for it: a checklist of what you must do while deploying, along with cost targets.

In my case, there are established processes and designated teams for collecting and cleaning data, but you still do part of it yourself to provide guidelines. So, even though data is a perpetual problem, I can offload most of that boring stuff.

Ah, and of course you're not a real engineer if you don't spend at least 1-2% of your time explaining to other people (surprisingly often to technical staff who just aren't ML-oriented) why doing X is a really bad idea. Or just explaining how ML systems work, with ill-fitting metaphors.
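
For the curious, a first pass at the pruning/quantisation part can literally be a handful of PyTorch calls; the model below is a stand-in, and the real gains depend heavily on the architecture and the target hardware:

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))  # stand-in model

    # Unstructured L1 pruning: zero out the 30% smallest weights in each Linear layer.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.3)
            prune.remove(module, "weight")  # bake the pruning mask into the weights

    # Post-training dynamic quantisation: Linear layers run in int8 at inference time.
    quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    # Quick sanity check that the quantised model still runs.
    print(quantized(torch.randn(1, 128)).shape)

The harder (and more interesting) part is measuring what those calls cost you in accuracy and latency on the real workload.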


In a given week, I usually do the following:

* 15% of my time in technical discussion meetings or 1:1's. Usually discussing ideas around a model, planning, or ML product support

* 40% ML development. In the early phase of a project, I'm understanding product requirements. I discuss with my team an ML model or algorithm that might help achieve the product/business goals. Then I gather existing datasets from analysts and data scientists. I use those datasets to create a pipeline that results in a training and validation dataset. While I wait for the train/validation datasets to populate (which could take several days or up to two weeks), I'm concurrently working on another project that's earlier or further along in its development. I'm also working on the new model (written in PyTorch), testing it out with small amounts of data to gauge its offline performance and assess whether it does what I expect it to do (a minimal sketch of that offline loop appears after this list). I sanity-check it by running some manual tests using the model to populate product information. This part is more art than science, because without a large-scale experiment I can only really go by my own gut feel and that of my teammates. Once the train/valid datasets have been populated, I train a model on large amounts of data, check the offline results, and tune the model or change the architecture if something doesn't look right. After the offline results look decent or good, I deploy the model to production for an experiment. Concurrently, I may be making changes to the product/infra code to prepare for the test of the new model I've built. I run the experiment and ramp up traffic slowly; once it's at 1-5% allocation, I let it run for weeks or a month. Meanwhile, I'm observing the results and have put in alerts to monitor all relevant pipelines, to ensure the model is being trained appropriately and that my experiment results aren't altered by unexpected infra/bug/product factors that should be within my control. If the results look as expected and match my initial hypothesis, I discuss with my team whether or not we should roll it out, and if so, we launch! (Note: model development includes feature authoring, dataset preparation, analysis, creating the ML model itself, and implementing product/infra code changes.)

* 20% maintenance – Just because I'm developing new models doesn't mean I'm ignoring existing ones. I'm checking in on those daily to make sure they haven't degraded and resulted in unexpected performance in any way. I'm also fixing pipelines and making them more efficient.

* 15% research papers and skills – With the world of AI/ML moving so fast, I'm continually reading new research papers and testing out new technologies at home to keep up to date. It's fun for me so I don't mind it. I don't view it as a chore to keep me up-to-date.

* 10% internal research – I use this time to learn more about other products within the team or the company to see how my team can help or what technology/techniques we can borrow from them. I also use this time to write down the insights I've gained as I look back on my past 6 months/1 year of work.
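
For the ML-development bullet above, the offline train/eval loop itself is the simple part; here's a toy sketch with random stand-in data (the real datasets come out of the pipelines described earlier):

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # Toy stand-ins for the real training and validation datasets.
    X_train, y_train = torch.randn(1000, 20), torch.randint(0, 2, (1000,))
    X_valid, y_valid = torch.randn(200, 20), torch.randint(0, 2, (200,))
    train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=64, shuffle=True)

    model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(5):
        model.train()
        for xb, yb in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            optimizer.step()

        # Offline check: accuracy on the held-out validation set.
        model.eval()
        with torch.no_grad():
            accuracy = (model(X_valid).argmax(dim=1) == y_valid).float().mean().item()
        print(f"epoch {epoch}: validation accuracy {accuracy:.3f}")

The time sink is everything around it: getting the train/valid tables populated correctly and making sure the online experiment actually measures what the offline metrics claim.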


Keeping up to date with research papers is key in this area. Every week, new models and new architectures come out.


How do you select what papers to read? How often does that research become relevant to your job?


I select papers based on references from coworkers, Twitter posts by prominent ML researchers I follow, ML podcasts, and results.

The research becomes relevant immediately because my team is always looking to incorporate it into our production models right away. Of course it does take some planning (3-6 months) before it's fully rolled out in production.


The same job title can have very different responsibilities. I have been more of a "full-stack" MLE: I work from research and prototyping all the way to production software engineering and ML infra/DevOps. I have also been mentoring MLEs and ML tech leads/managers; schedule a free 30-min intro call at: https://cal.com/studioxolo/intro.


Junior-level role, but currently it's a mix of working as a proxy product owner and half software engineering.

The users are researchers with deep technical knowledge of their use case. It's still a challenge to map their needs into design decisions about what they want in the end. Thanks to open-source efforts, the model creation is rather straightforward, but everything around making that happen and shaping it into a tool is a ride.

I especially love the ever-changing technical stack of "AI" services from major cloud providers right now. It makes MLOps feel like nothing more than a demo, IMHO.


A junior-level role as product owner?


No, a junior ML engineer in a small team.


90% of the time it's figuring out what data to feed into the neural networks, 2% of the time figuring out stuff about the neural networks themselves, and the other 8% figuring out why on earth the recall rate is 100%.
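
For anyone who hasn't hit this yet: a 100% recall rate is usually the model predicting the positive class for everything (or labels leaking into the features), and a quick check against precision makes it obvious. Illustrative numbers only:

    from sklearn.metrics import precision_score, recall_score

    y_true = [0, 0, 0, 0, 1, 1]
    y_pred = [1, 1, 1, 1, 1, 1]   # a degenerate model that always predicts "positive"

    print(recall_score(y_true, y_pred))     # 1.0, which looks great in isolation
    print(precision_score(y_true, y_pred))  # 0.33..., which gives the game away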


What tools are people using? Are feature platforms like Tecton on the list?


Pretty much the same as the others: building tools, data cleaning, etc. But something I don't see mentioned: experiment design and data collection protocols.


Clean data and try to get people to understand why they need to have clean data.
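
And the "clean data" part is rarely glamorous. A typical first pass, with entirely hypothetical file and column names, looks something like:

    import pandas as pd

    df = pd.read_csv("records.csv")  # hypothetical raw extract

    df = df.drop_duplicates()
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")       # junk strings become NaN
    df["event_date"] = pd.to_datetime(df["event_date"], errors="coerce")
    df = df.dropna(subset=["customer_id", "amount"])                  # drop rows missing key fields
    df["country"] = df["country"].str.strip().str.upper()             # normalise categorical values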


Do people feel like they are more or less in demand with the hype around GenAI?


Demand is higher for flashy things that look good on directors' desks, definitely. But there's less attention on less flashy applications of machine learning, unless your superiors are so clueless that they think what you're doing is GenAI. Sometimes the systems/models being trained are legitimately generative, but in the more technical, traditional sense.


Teaching others Python.


Underrated comment. At my place of work, I find this to be a huge part of the MLE job. Everyone knows R but none of the cloud tools have great R support.


I've interviewed for a few of the ML positions and turned them down because they were just data jockey positions.


I worked as an engineer on an academic project that used sound to detect chewing. I acted as a domain expert to help the grad student doing the project understand how to select an appropriate microphone to couple to skin and how to interface that microphone to an ADC, offered insights into useful features to explore based on human biology, found a sound processing library that gave us hundreds of new features for very little work, helped design a portable version of the electronics, did purchasing, helped troubleshoot during the times we were getting poor results, monitored human safety so we weren't connecting student volunteers to earth ground or AC power, helped write up human subjects committee review application forms, tracked down sources of electrical noise in the system, helped analyze results, reviewed 3D models for a headset to hold the microphone and electronics, wrote some signal processing/feature-extraction code, selected an appropriate SBC system, explored confounding factors that differed between lab tests and field tests, and more.

The grad student organized tests to record chewing sounds and ground-truth video (a cheap camera in the bill of a baseball hat), arranged for a Mechanical Turk-like service to mark up the sound data against the ground truth using a video analysis app I found, tried different pre-written ML algorithms to figure out which worked best across different types of food, and analyzed which features contributed most to accuracy so we could eliminate most of them and get the thing to run on a wearable system with reasonable battery life.

Then we collaborated with some medical doctors (our target customer for the device; it was not intended for consumers) to run tests on kids in their eating labs and wrote papers about it. The project ended when the grad student graduated.

I worked on other ML projects as well. One was a system that analyzed the syslogs of dozens of computers to look for anomalies. I wrote the SQL (256 fields in the query! Most complex SQL I've ever written) that prefiltered the log data to present it to the ML algorithm, and built a server that sniffed encrypted log data we broadcast on the local network in order to gather the data in one place continuously.

Another system used heart rate variability to infer stress. I helped design a smartwatch and implemented the drivers that took in HRV data from a Bluetooth chest strap. We tested the system on ourselves.

None of our ML projects involved writing new ML algorithms; we just used already-implemented ones off the shelf. The main work was getting the data, cleaning the data, and fine-tuning or implementing new feature extractors. The CS people weren't familiar with the biological aspects (digging into Gray's Anatomy), sensors, wireless, or electronics, so I handled a lot of that. I could have done all the work; it's not hard to run ML algorithms (we often had to run them on computing clusters to get results in a reasonable amount of time, which we automated) or figure out which features are important, but then the students wouldn't have graduated :-)

Getting labeled data for supervised learning is the most time-consuming part. If you're doing unsupervised learning, then you're at the mercy of what is in the data, and you hope what you want is in there with the amount of detail you need. Interfacing with domain experts is important. Depending on the project, a wide variety of skills can be required, many not related to coding at all. You may not be responsible for them all, but it helps if you at least understand them.



