
The prior can generally only be understood in the context of the likelihood - selimthegrim
https://arxiv.org/abs/1708.07487
======
ta1929901
Nice article, in the sense that I would assign it to students, although I
think it presents something of a strawman and doesn't really introduce
anything that hasn't been discussed at length elsewhere.

It provides a nice introduction to types of objective priors, hinting at their
advantages and disadvantages. Also a nice discussion of why priors matter.
Incidentally, I agree that "subjective" and "objective" are poor labels for
priors--better labels would be something like "estimand-predictive" and
"inferential-property" priors, respectively.

I'm not really sure why they suggest objective priors violate assumptions
about not using information from the data. The model, and thus many objective
priors, can be specified given the design, without any knowledge of what the
data looks like. This is the strawman part of it.

I get the sense Gelman has been wrestling with objective priors the last few
years.

~~~
sgt101
Odd because I read the abstract of the paper as a claim for a fundamental
advance in statistics; systematic and objective determination of priors in
Bayesian analysis. I can't speak to whether the paper is successful in
delivering this as I am very dense and it takes me weeks to understand things,
but if they have identified mechanisms that allow complex priors to be
constructed that minimize overfitting:

1. I will do a little dance

2. I will learn how to construct such priors

3. I will attempt to apply this in practice to see what happens.

I don't see overfitting risk as a strawman, I see it as a nasty business that
means that production models can't be trusted.

~~~
ta1929901
A lot of what they discuss is in the literature on reference priors, if not in
other literature on objective priors as well.

It's a little complex for a comment on HN, but IMHO the best formalization of
overfitting is in the literature on minimum description length, and related
information-theoretic literature
([https://en.wikipedia.org/wiki/Minimum_description_length](https://en.wikipedia.org/wiki/Minimum_description_length)
; the wikipedia page is a little off on some things but the general points are
probably about right).

The relationship between MDL/NML and Bayesian statistics is a little
complicated--Barron, Roos, and Watanabe have a nice recent paper about it
([https://arxiv.org/abs/1401.7116](https://arxiv.org/abs/1401.7116)) -- but
the short story is that there's a certain equivalence (or at least very close
relationship) between MDL/NML and Bayesian inference with reference priors
(i.e., the "capacity-achieving prior" in IT parlance).

So Bayesian inference with reference priors minimizes risk of overfitting in a
very technical minimax sense.
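To make the minimax idea concrete, here's a small sketch (my own construction, not from any of the papers above) of the NML normalizer for the Bernoulli model; its log is the model's "parametric complexity", the regret a minimax code pays for the model's flexibility:

```python
from math import comb

def bernoulli_nml_normalizer(n):
    """Sum of the maximized Bernoulli likelihood over all 2^n binary
    sequences of length n, grouped by the number of successes k.
    The log of this quantity is the NML parametric complexity COMP(n)."""
    total = 0.0
    for k in range(n + 1):
        p_hat = k / n  # maximum-likelihood estimate given k successes
        # Note: 0.0 ** 0 evaluates to 1.0 in Python, which is the
        # convention we want at the boundary cases k = 0 and k = n.
        total += comb(n, k) * p_hat ** k * (1 - p_hat) ** (n - k)
    return total
```

For the Bernoulli family, the log of this normalizer grows like (1/2) log n, and Bayes with the Jeffreys prior achieves essentially the same regret asymptotically, which is the MDL/Bayes equivalence I mentioned.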

The problem is that reference priors are very difficult in general to
construct, although that is changing rapidly, and they have been worked out
for certain important cases (this paper provides an interesting new approach
with a nice overview of recent papers
[https://arxiv.org/abs/1704.01168](https://arxiv.org/abs/1704.01168)). The
Jeffreys prior is a form of the reference prior for models meeting certain
constraints.
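As a tiny worked example of that last point (mine, not from the linked paper): for a single Bernoulli observation the Fisher information is 1/(θ(1−θ)), so the Jeffreys prior is proportional to θ^(−1/2)(1−θ)^(−1/2), i.e. the Beta(1/2, 1/2) density:

```python
import math

def fisher_info_bernoulli(theta):
    # Fisher information of one Bernoulli(theta) draw: the variance of
    # the score function, which works out to 1 / (theta * (1 - theta)).
    return 1.0 / (theta * (1.0 - theta))

def jeffreys_density(theta):
    # Jeffreys prior: proportional to sqrt(Fisher information).
    # The normalizing constant is B(1/2, 1/2) = pi, so this is exactly
    # the Beta(1/2, 1/2) density.
    return math.sqrt(fisher_info_bernoulli(theta)) / math.pi
```

Note the density is lowest at θ = 1/2 and blows up at the boundaries: the prior spreads mass where the parameter is statistically hardest to pin down.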

A lot of the issues Gelman et al. touch on have been written about in various
places in the MDL/NML/reference prior/information theory literature.

~~~
Xcelerate
> but IMHO the best formalization of overfitting is in the literature on
> minimum description length

It's always surprised me how few people know about MDL, considering it's
pretty darn close to what we might consider a "universal predictor" (granted,
the difficulty with MDL is that it's uncomputable in the general sense). Even
among most data scientists I know, very few understand what overfitting really
is (and thus cross-validation and regularization are merely tools to what
seems like the blackbox "art" of model selection).

But the concepts of MDL/MML and Kolmogorov complexity are very deep and
fundamental—to such an extent that I think the path to true AGI will rely much
more heavily upon algorithmic information theory than neural networks in the
future.

~~~
ta1929901
Yeah, I have a similar reaction about being surprised MDL and algorithmic
statistics aren't better known. As you say, it's very fundamental stuff. It
seems so fundamental to me that I just sort of assume, without thinking about
it or even questioning it, that it will eventually become more prominent.

My guesses as to why it's been slow to be adopted so far are that (1) it's
relatively new, in the grand scheme of things, (2) certain things about it are
really challenging to everyone, and (3) it has a certain perspective on
inference that can be alien to a lot of people.

~~~
sgt101
Also (4) it doesn't really work or make sense.

The data isn't necessarily representative of the domain theory; in many
domains it can't be, because the domain is so large that you can't capture the
whole of it in a tractable training set (images, for example). Other data
doesn't capture the domain because it was generated in a regime that no longer
holds when the model is used - for example bull runs vs. bear runs in the
markets. Bayesian analysis is attractive because we can include informative
priors that capture our knowledge that, in circumstances outside the data,
other determining behaviours exist. This is also one of the reasons why deep
networks _can_ outperform support vector machines; deep networks can learn to
prefer domain theories that are not the minimally descriptive one.

The other interesting thing about MDL is: where does the idea that the
minimal theory is the right theory come from?

Most people say "oh, it's Occam's razor", but where did Occam's razor come
from? Who was Mr (Fr.) Occam? Well, he was a 13th-century philosopher - part
of the Oxford school and of the tradition of Scotus, which was invented to
construct a story that supported the Trinitarian God... and this is why we
prefer the idea that "entities will not multiply beyond necessity": because it
says that you have a Trinity because the Universe ABSOLUTELY cannot work
without it, and that's why you have three and not two and not four.

I am happy with all this, but why should we think it's a good way to do
machine learning? After all, there are lots of examples of theories that were
simple but don't work as well as complex alternatives - gravity is a good one.

------
kgwgk
> With the above prior and likelihood, the posterior for β is a product of
> independent Gaussians with unit variance and mean given by the least squares
> estimator of β. The problem is that standard concentration of measure
> inequalities show that this posterior is not uniformly distributed in a unit
> ball around the ordinary least squares estimator but rather is exponentially
> close in the number of coefficients to a sphere of radius 1 centered at the
> estimate.

Of course the posterior is not uniformly distributed in a unit ball (the
density is higher at the center of the ball, and it extends beyond the limits
of the ball). But the fact that it is exponentially close to a sphere doesn’t
have anything to do with that! If the posterior were uniformly distributed in
a unit ball, it would also be exponentially close to a sphere.
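That last point is easy to check numerically: in high dimension, almost all of a unit ball's volume sits right next to its boundary sphere. A quick simulation sketch (the dimension and sample count are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 2000

# Sample uniformly from the unit ball in d dimensions: a uniform
# direction (normalized Gaussian) times a radius r with density
# proportional to r^(d-1), i.e. r = u**(1/d) for u ~ Uniform(0, 1).
x = rng.standard_normal((n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)
r = rng.uniform(0, 1, (n, 1)) ** (1.0 / d)
samples = r * x

# Virtually every draw has norm just under 1: the uniform ball
# concentrates on its bounding sphere in high dimension.
norms = np.linalg.norm(samples, axis=1)
```

With d = 1000, the mean norm is about d/(d+1) ≈ 0.999, so "uniform in the ball" and "concentrated near the sphere" are not in tension at all.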

------
selimthegrim
[http://andrewgelman.com/2017/09/05/never-total-eclipse-prior...](http://andrewgelman.com/2017/09/05/never-total-eclipse-prior/)

------
kgwgk
> For a fully informative prior for δ, we might choose normal with mean 0
> because we see no prior reason to expect the population difference to be
> positive or negative and standard deviation 0.001 because we expect any
> differences in the population to be small, given the general stability of
> sex ratios and the noisiness of the measure of attractiveness.

That’s a _very_ strong prior. To put it into context (if I did my calculations
right): assuming that _every single child_ from very attractive parents in the
study was a girl (let’s say 600 out of 600) while for the rest we have the
expected distribution (i.e. 1150 girls out of 2400) the estimate of the
difference would be 0.1%.
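For the record, here is the back-of-the-envelope version of that calculation (same hypothetical counts as above, with the sampling variance approximated conservatively using p(1−p) ≤ 1/4):

```python
# Hypothetical counts from the scenario above: 600/600 girls in the
# "very attractive parents" group, 1150/2400 girls in the rest.
p1, n1 = 600 / 600, 600
p2, n2 = 1150 / 2400, 2400
diff = p1 - p2  # observed difference, about 0.52

# Conservative sampling variance of the difference in proportions.
se2 = 0.25 / n1 + 0.25 / n2

# Posterior mean under a normal(0, tau) prior on the difference:
# shrink the data estimate by prior precision / total precision.
tau = 0.001
posterior_mean = diff * tau**2 / (tau**2 + se2)
```

`posterior_mean` comes out to roughly 0.001, i.e. 0.1%: even perfectly one-sided data barely moves the estimate off zero under that prior.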

------
kgwgk
> Stein's trick was to notice that the point µ = 0 has the property that if y
> is sufficiently close to it,

It’s not clear in this presentation but shrinkage can be done toward any
arbitrary point, not just toward the origin.
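A sketch of the (positive-part) James-Stein estimator with an arbitrary shrinkage target ν, to make that concrete; shrinking toward ν = 0 is just the special case in the article's presentation:

```python
import numpy as np

def james_stein(y, nu, sigma2=1.0):
    """Positive-part James-Stein estimate of a normal mean vector,
    shrinking the observation y toward an arbitrary fixed point nu."""
    y = np.asarray(y, dtype=float)
    nu = np.asarray(nu, dtype=float)
    p = y.size
    d2 = np.sum((y - nu) ** 2)
    if d2 == 0.0:
        return nu.copy()
    # Shrinkage factor; the max() is the "positive-part" refinement,
    # which plain James-Stein omits.
    factor = max(0.0, 1.0 - (p - 2) * sigma2 / d2)
    return nu + factor * (y - nu)
```

Any fixed ν works: for p ≥ 3 the risk improvement over the raw y holds no matter where the true mean is, though the gain is largest when the true mean happens to lie near ν.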

------
westurner
Bayes assumes/requires conditional independence of observations; which is
sometimes the case.

For example:

- Are the positions of the Earth and the Moon conditionally independent? No.

- In the phrase "the dog and the cat", are "and" and "the" independent? No.

- In a biological system, are we to assume conditional independence? We
should not.

[https://en.wikipedia.org/wiki/Conditional_independence](https://en.wikipedia.org/wiki/Conditional_independence)

...

"Efficient test for nonlinear dependence of two continuous variables"
[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4539721/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4539721/)

- In no particular sequence: CANOVA, ANOVA, Pearson, Spearman, Kendall, MIC,
Hoeffding

~~~
westurner
From [https://plato.stanford.edu/entries/logic-inductive/](https://plato.stanford.edu/entries/logic-inductive/) :

> It is now generally held that the core idea of Bayesian logicism is fatally
> flawed—that syntactic logical structure cannot be the sole determiner of the
> degree to which premises inductively support conclusions. A crucial facet of
> the problem faced by Bayesian logicism involves how the logic is supposed to
> apply to scientific contexts where the conclusion sentence is some
> hypothesis or theory, and the premises are evidence claims. The difficulty
> is that in any probabilistic logic that satisfies the usual axioms for
> probabilities, the inductive support for a hypothesis must depend in part on
> its prior probability. This prior probability represents how plausible the
> hypothesis is supposed to be based on considerations other than the
> observational and experimental evidence (e.g., perhaps due to relevant
> plausibility arguments). A Bayesian logicist must tell us how to assign
> values to these pre-evidential prior probabilities of hypotheses, for each
> of the hypotheses or theories under consideration. Furthermore, this kind of
> Bayesian logicist must determine these prior probability values in a way
> that relies only on the syntactic logical structure of these hypotheses,
> perhaps based on some measure of their syntactic simplicities. There are
> severe technical problems with getting this idea to work. Moreover, various
> kinds of examples seem to show that such an approach must assign intuitively
> quite unreasonable prior probabilities to hypotheses in specific cases (see
> the footnote cited near the end of section 3.2 for details). Furthermore,
> for this idea to apply to the evidential support of real scientific
> theories, scientists would have to formalize theories in a way that makes
> their relevant syntactic structures apparent, and then evaluate theories
> solely on that syntactic basis (together with their syntactic relationships
> to evidence statements). Are we to evaluate alternative theories of
> gravitation (and alternative quantum theories) this way?

~~~
nonbel
> "This prior probability represents how plausible the hypothesis is supposed
> to be based on considerations other than the observational and experimental
> evidence (e.g., perhaps due to relevant plausibility arguments)."

I guess I don't know how "Bayesian logicism" differs from "Bayesian
probability", but this is totally false in the latter case. The prior is just
supposed to be independent of the current data (e.g., devised before it was
collected). In practice, information almost always leaks into the prior and
model via tinkering. That is why a priori predictions are so important for
proving you are onto something.

~~~
westurner
Bayesian logicism is the logic derived from Bayesian probability.

Magic numbers are an anti-pattern: which constants are used, and why, should
be justified, OR it should be shown that a non-expert-biased form converges
regardless of the choice.

~~~
nonbel
The use of the term prior probability in that paragraph is not consistent with
its use in bayesian probability, so something is wrong.

Also, I am not sure what magic numbers you are referring to.

~~~
westurner
Arbitrary priors are magic numbers.

Is there a frequentist statistic that can be used in a deterministic function
to determine which arbitrary priors to use?

