
Beyond L2 Loss – How We Experiment with Loss Functions - rebello95
https://eng.lyft.com/beyond-l2-loss-how-we-experiment-with-loss-functions-at-lyft-51f9303f5d2d
======
jsinai
Really interesting article! This is a great example of scientific strategy and
thought being used in an applied setting. I also like the effort to make the
different choices interpretable.

Loss functions are an often overlooked area of machine learning and tend to be
taken for granted.

Too many times I have seen others (and been guilty of the same) just
arbitrarily choose torch.nn.NLLLoss/TensorFlow equivalent or some other loss
function that’s the flavour of the day for the task.

------
j7ake
One could get deeper theoretical insight into the different loss functions by
viewing each as a different prior over the parameters you're fitting (e.g., a
Gaussian prior corresponds to L2, a Laplace prior to L1).
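That correspondence is easy to check numerically. Read over residuals, each loss is the negative log-likelihood of the matching distribution (up to constants), so minimizing L2 recovers the sample mean (the Gaussian MLE) and minimizing L1 recovers the median (the Laplace MLE). A minimal sketch, not from the article (the toy data and grid search are mine):

```python
import numpy as np

# Toy data: draws from a distribution centered at 3.0.
rng = np.random.default_rng(0)
y = rng.normal(3.0, 1.0, size=1001)

# Grid-search the location parameter under each loss.
grid = np.linspace(0.0, 6.0, 6001)
l2 = np.array([np.sum((y - m) ** 2) for m in grid])   # Gaussian NLL, up to constants
l1 = np.array([np.sum(np.abs(y - m)) for m in grid])  # Laplace NLL, up to constants

# The L2 minimizer lands on the sample mean, the L1 minimizer on the median.
print(grid[l2.argmin()], y.mean())
print(grid[l1.argmin()], np.median(y))
```

The same pairing can instead be read as a prior over parameters when the loss appears as a regularizer (ridge ↔ Gaussian prior, lasso ↔ Laplace prior); the algebra is identical.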

I wonder what prior distributions the other losses you're looking at, such as
Patton losses, would correspond to under that interpretation.

~~~
mlthoughts2018
In Bayesian inference this isn’t actually as helpful as you make it sound. In
some special applications there have been physically motivated (even
_discovered_ ) prior distributions, like the Weibull, Gumbel, or Rayleigh
distributions.

But you almost never start from the freedom to pick an arbitrary loss function
and work backwards to the prior it would imply. That’s a mathematical
curiosity that doesn’t connect to any understanding of the problem domain.

More usually you will put simple, known-form priors on the quantities you need
to model, then abstract the parameters of those priors into hyperprior
distributions, and keep repeating this process until the parameters you end up
with are so far removed from how they influence the inference that it is
perfectly valid to assume uninformative priors at that stage.
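The stacking described above can be sketched in a couple of lines. This is my own toy illustration of the idea, not anything from the article: per-group means get a simple Normal prior, that prior's parameters get a hyperprior, and the top level falls back to a flat (improper) prior.

```python
import numpy as np
from scipy import stats

# Toy data: one observed summary statistic per group.
group_obs = np.array([1.2, 0.8, 1.5])

def log_joint(theta, mu, tau):
    """Unnormalized log joint for a two-level hierarchy (illustrative only)."""
    # Likelihood: group summaries given group-level means theta.
    lp = stats.norm.logpdf(group_obs, loc=theta, scale=0.5).sum()
    # Simple known-form prior on each theta, parameterized by (mu, tau).
    lp += stats.norm.logpdf(theta, loc=mu, scale=tau).sum()
    # Hyperprior: half-Normal on tau; mu gets a flat improper prior
    # (i.e., it contributes nothing to the log density).
    lp += stats.halfnorm.logpdf(tau, scale=1.0)
    return lp

print(log_joint(np.array([1.0, 1.0, 1.0]), mu=1.0, tau=0.5))
```

The point of the structure is exactly what the comment says: by the time you reach `mu`, it is far enough from the data that a flat prior is a defensible default.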

You would only adjust the functional form of your intermediate, parameterized
priors if doing this and carrying out model fitting ended in very poor model
fit or poor posterior predictive checks. In that case it is the specific way
the model fits poorly that informs you about the problem structure and leads
to a revised prior, not the other way around.

Even when there is poor model fit, moving away from simple priors and
uninformative hyperpriors won’t be the first reaction. You might try data
cleaning, transformation of variables, dimensionality reduction, collecting
more data, changing the core likelihood function of the model, and many other
things before having good enough reason to think that manipulating the prior
distributions is going to be important for the solution.

