
For some reason they focus on inference, which is the computationally cheap part. If you're working on ML (as opposed to deploying someone else's ML) then almost all of your workload is training, not inference.



Agreed that there are workloads where inference is not expensive, but it's really workload dependent. For applications that run inference over large amounts of data in the computer vision space, inference ends up being a dominant portion of the spend.


The way I see it, generally every new data point (on which the production model runs inference once) becomes part of the data set used to train every subsequent model, so that same data point gets processed many more times during training. Training therefore unavoidably takes more effort than inference.

Perhaps I'm a bit biased towards all kinds of self-supervised, human-in-the-loop, or semi-supervised models, but the notion of discarding large amounts of good domain-specific data that gets processed only for inference and never used for training afterward feels a bit foreign to me, because you can usually extract an advantage from it. But perhaps that's the difference between data-starved domains and overwhelming-data domains?
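
To put rough numbers on it (everything below is invented purely to illustrate the ratio, not measured from any real system):

    # Cost of one forward pass is normalized to 1 unit per sample.
    FORWARD  = 1.0
    BACKWARD = 2.0 * FORWARD       # backward pass costs roughly 2x the forward pass
    EPOCHS   = 10                  # passes over the data per training run (made up)
    RETRAINS = 12                  # future model versions that include this sample (made up)

    inference_cost = FORWARD                                   # the sample is served once
    training_cost  = (FORWARD + BACKWARD) * EPOCHS * RETRAINS  # and reprocessed every retrain

    print(training_cost / inference_cost)   # 360.0 in this invented scenario

The exact multiplier is obviously domain-dependent, but as long as every data point keeps getting folded back into future training runs, it ends up being much bigger than 1.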


What you say re saving all data is the ideal. I'd add a couple of caveats. One is that in many fields you often get lots of redundant data that adds nothing to training (for example, if an image classifier is looking for some rare class, you can be drowning in images of the majority class). Or you can just have lots of data that is unambiguously and correctly classified - some kind of active learning can tell you what is worth keeping.
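
By "some kind of active learning" I mean even something as crude as uncertainty filtering - a hypothetical sketch (the threshold is invented, and probs is just whatever your current model outputs):

    import numpy as np

    def worth_keeping(probs, confidence_threshold=0.95):
        # Keep a sample for labeling/training only if the current model
        # isn't already confident about it (plain uncertainty sampling).
        return float(np.max(probs)) < confidence_threshold

    worth_keeping(np.array([0.99, 0.01]))   # False: redundant majority-class image, discard
    worth_keeping(np.array([0.55, 0.45]))   # True: ambiguous, worth keeping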

The other is that, for various reasons, the customer doesn't want to share their data (or at least doesn't want sharing built into the inference system), so even if you'd like to have everything they record, it's just not available. Obviously something to discourage, but it seems common.


There's one piece of the puzzle you're missing: field-deployed devices.

If I play chess on my computer, the games I play locally won't hit the Stockfish models. When I use the feature on my phone that allows me to copy text from a picture, it won't phone home with all the frames.


Yup, exactly. It's a good point that for self-supervised workloads, the training set can become arbitrarily large. For a lot of other workloads in the vision space, most data needs to be labeled before it can be used for training.


I have not found this to be true at all in my field (natural language generation).

We have a 7 figure GPU setup that is running 24/7 at 100% utilization just to handle inference.


Also true of self-driving. You train a perception model for a week and then log millions of vehicle-hours on inference.


How do you train new models if your GPUs are being used for inference? I guess the training happens significantly less frequently?

Forgive my ignorance.


We have different servers for each. But the split is usually 80%/20% for inference/training. As our product grows in usage the 80% number is steadily increasing.

That isn't because we aren't training that often - we are almost always training many new models. It is just that inference is so computationally expensive!


Are you training new models from scratch or just fine tuning LLMs? I'm from the CV side and we tend to train stuff from scratch because we're still highly focused on finding new architectures and how to scale. The NLP people I know tend to use LLMs and existing checkpoints so their experiments tend to be a lot cheaper.

Not that anyone should think any aspect (training or inference) is cheap.


Typically a different set of hardware for model training.


Maybe from the researcher or data scientist's perspective. But if you have a product that uses ML and inference doesn't dominate training, you're doing it wrong.


Think Google: Every time you search, some model somewhere gets invoked, and the aggregate inference cost would dwarf even very large training costs if you have billions of searches.
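
Rough math with invented numbers, just to show the shape of it (nothing here is anyone's actual data):

    queries_per_day    = 5e9      # "billions of searches" (made up)
    flops_per_query    = 1e9      # inference cost of a smallish ranking model (made up)
    inference_per_year = queries_per_day * 365 * flops_per_query   # ~1.8e21 FLOPs

    examples_per_run   = 1e10     # training examples seen per run (made up)
    flops_per_example  = 3e9      # forward + backward per example (made up)
    retrains_per_year  = 10       # made up
    training_per_year  = examples_per_run * flops_per_example * retrains_per_year  # ~3e20 FLOPs

    print(inference_per_year / training_per_year)   # ~6x, and it grows with query volume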

Marketing blogspam like this is always targeting big (not Google, but big) companies, hoping to divert those companies' big IT budgets into their own coffers: "You have X million queries to your model every day. Imagine if we billed you per-request, but scaled the price so in aggregate it's slightly cheaper than your current spending."

People who are training-constrained are early-stage (i.e. that correlates with not having money), and then the vendor needs to buy an entirely separate set of GPUs to support them (e.g. T4s are good for inference, but they need V100s for training). So the vendors choose to ignore them entirely.


This depends a lot on what you're doing. If you are ranking 1M qps in a recommender system, then training cost will be tiny compared to inference.
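
Just counting requests (and assuming perfectly flat traffic, which no real system has):

    qps = 1_000_000
    inferences_per_day = qps * 60 * 60 * 24   # 86,400,000,000 forward passes per day

Even a full daily retrain over a few billion logged examples touches an order of magnitude fewer samples than serving does.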


I wonder if there's room for model caching. Sacrifice some personalization and accept near-identical results so you aren't hitting the model so often.
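
Something like memoizing on a coarsened version of the input, maybe - a hypothetical sketch (the feature names, bucket sizes, and the run_model stub are all made up):

    import functools

    def run_model(coarse_key):
        # Stand-in for the expensive inference call.
        return [f"item_{hash(coarse_key) % 100}"]

    def coarsen(user):
        # Bucket features so similar users share a cache key;
        # this is exactly where the personalization gets sacrificed.
        return (user["country"], user["age"] // 10, round(user["affinity"], 1))

    @functools.lru_cache(maxsize=100_000)
    def cached_recommendations(coarse_key):
        return run_model(coarse_key)

    def recommend(user):
        return cached_recommendations(coarsen(user))

    recommend({"country": "US", "age": 34, "affinity": 0.72})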


Yeah we did lots of things like this at Instagram. Can be very brittle and dangerous though to share any caching amongst multiple users. If you work at Facebook you can search for some SEVs related to this lol


If you are training models that are intended to be used in production at scale, then training is dirt cheap compared to inference. There is a reason why Google focused on inference first with their TPUs, even though Google does a lot of ML training.


I think another part of the question is whether you're scaling on your own hardware or the customers' hardware.


> If you're working on ML (as opposed to deploying someone else's ML) then almost all of your workload is training, not inference.

Wouldn't that depend on the size of your customer base? Or at least, requests per second?


With more customers, revenue and profit usually grow, then the team becomes larger, wants to run more experiments, spends more on training, and so on. Inference is just so computationally cheap compared to training.

That's what I've seen in my experience, but I concur that there might be cases where the ML is a more-or-less solved problem for a very large customer base, and there inference spend is the larger share. I've rarely seen it happen, but other people are sharing scenarios where it happens frequently. So I guess it massively depends on the domain.


AlphaZero used 5,000 TPUs to generate games (inference only), and 16 to train the networks.

The split definitely depends on what you're doing past developing/deploying.

(Source: https://kstatic.googleusercontent.com/files/2f51b2a749a284c2...)


Completely agreed. For some of these large language models, it would take a long time before inference spend dominates training spend.


Is your inference running on some daily jobs? That's not a ton of inference compared to running online for every live request (10k QPS?)


More to the point, you don't do training and inference in the same program, so they don't have to run on the same hardware or even the same machine. It's two separate problems with separate hardware solutions.
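
e.g. the only thing the two sides need to share is a serialized checkpoint - a minimal PyTorch-flavored sketch (the toy model and file name are placeholders):

    import torch
    import torch.nn as nn

    # --- training job: runs on training hardware (e.g. the V100s mentioned upthread) ---
    model = nn.Linear(16, 2)
    # ... training loop elided ...
    torch.save(model.state_dict(), "model.pt")

    # --- inference service: a separate process, possibly on cheaper serving hardware (e.g. T4s) ---
    serving_model = nn.Linear(16, 2)
    serving_model.load_state_dict(torch.load("model.pt"))
    serving_model.eval()
    with torch.no_grad():
        scores = serving_model(torch.randn(1, 16))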



