
Too big to deploy: How GPT-2 is breaking servers - calebkaiser
https://towardsdatascience.com/too-big-to-deploy-how-gpt-2-is-breaking-production-63ab29f0897c
======
heavyarms
Concerns like this are not appreciated enough in the data science/ML
community. And it's not just the size and resource consumption of the final
model. For any real enterprise/business solution, the best ML model is usually
not the one with the highest benchmark scores, but the one that delivers the
greatest value and can be put online quickly by integrating with the existing
systems and software already in place.

You really have to start with performance and use case considerations from the
beginning. Before you even try to train a model, you have to know how scalable
it is to load/process the inputs and what you will do with the output. For
example, in an NLP use case for text categorization or conversational agents,
do you have to load historical data like customer notes, emails, etc., that
are sitting in a production SQL instance and have to be queried with
complicated joins and where clauses? How performant is the current API for
doing that? Will you have to run preprocessing on the raw inputs each time?
Does it make sense to keep a preprocessed copy of the data in a different data
source? If so, how frequently should that data be synced? Depending on the
answers to any of those questions, what looks like a great model in a Jupyter
Notebook suddenly becomes either not feasible or too expensive to justify.
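
As a rough sketch of that tradeoff (Python, with made-up table and column
names), the difference between joining and cleaning the raw production tables
on every request versus reading a preprocessed copy that a batch job keeps in
sync looks something like this:

    import sqlite3

    def clean_text(text):
        # stand-in for whatever preprocessing the model actually needs
        return " ".join(text.lower().split())

    def features_from_raw_tables(conn, customer_id):
        # joins the raw tables and preprocesses on every request; latency
        # now depends on the join and on how busy the production DB is
        cur = conn.execute(
            """
            SELECT n.note_text, e.email_body
            FROM notes n
            JOIN emails e ON e.customer_id = n.customer_id
            WHERE n.customer_id = ?
            """,
            (customer_id,),
        )
        return [clean_text(t) for row in cur.fetchall() for t in row]

    def features_from_preprocessed_copy(conn, customer_id):
        # reads an already-cleaned, denormalized table; cheap at inference
        # time, but now you own the job and schedule that keeps it in sync
        cur = conn.execute(
            "SELECT feature_text FROM customer_features WHERE customer_id = ?",
            (customer_id,),
        )
        return [row[0] for row in cur.fetchall()]

    # hypothetical local copy, just so the sketch is runnable end to end
    conn = sqlite3.connect("crm_copy.db")

Neither option is free: the first pushes cost into every prediction, the
second pushes it into a sync pipeline someone has to build and monitor.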

I understand that a data scientist or ML engineer can't also be a cloud
infrastructure expert or software architect who understands how all of the
pieces are connected and how performant/expensive certain options are. But it
makes no sense to start training a model without at least having an answer for
what data will be needed during inference, how fast it has to respond to work
in real time, how much traffic the model will get, and how much that compute
might cost in the cloud.
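
Even a crude back-of-the-envelope estimate is better than nothing. The numbers
below are made-up assumptions, but this is the kind of arithmetic worth doing
before anyone spends a week training:

    # all figures are illustrative assumptions, not measurements
    requests_per_day = 200_000
    latency_s = 0.35            # time one request occupies a GPU
    gpu_hourly_cost = 1.20      # assumed on-demand price, USD/hour

    gpu_hours_per_day = requests_per_day * latency_s / 3600
    monthly_cost = gpu_hours_per_day * gpu_hourly_cost * 30
    print(f"{gpu_hours_per_day:.1f} GPU-hours/day, ~${monthly_cost:,.0f}/month")

If that number already looks uncomfortable at average traffic, it will only
get worse at peak, and that's worth knowing before training starts.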

