
Neural Language Modeling from Scratch - ofirpress
http://ofir.io/Neural-Language-Modeling-From-Scratch/
======
hacker_9
_" In recent months, we’ve seen further improvements to the state of the art
in RNN language modeling. The current state of the art results are held by two
recent papers by Melis et al. and Merity et al.. These models make use of
most, if not all, of the methods shown above, and extend them by using better
optimization techniques, new regularization methods, and by finding better
hyperparameters for existing models. Some of these methods will be presented
in part two of this guide."_

Am I right in saying that the recently publicised Google Transformer [1]
Neural Network is actually the state of the art now, over RNNs?

[1] [https://research.googleblog.com/2017/08/transformer-novel-ne...](https://research.googleblog.com/2017/08/transformer-novel-neural-network.html)

~~~
yorwba
The Transformer network is solving a different problem: translating a given
sentence into another with the same meaning.

The problem discussed here is predicting the next word given a partial
sentence, where AFAIK some variety of RNN is still the best approach. It might
be possible to adapt the Transformer architecture to that task, but that would
make it a different model.
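For concreteness, "predicting the next word" means producing a probability distribution over the vocabulary at every step. A toy Elman-style RNN step in pure Python (all names, sizes, and weights here are illustrative, not from the article, and the model is untrained):

```python
import math
import random

random.seed(0)
VOCAB = ["<s>", "the", "cat", "sat", "</s>"]
V, H = len(VOCAB), 4  # vocabulary size, hidden size

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

W_xh = rand_matrix(H, V)  # input-to-hidden weights
W_hh = rand_matrix(H, H)  # hidden-to-hidden (recurrent) weights
W_hy = rand_matrix(V, H)  # hidden-to-output weights

def rnn_step(word_id, h):
    # h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1}); x_t is a one-hot word vector,
    # so W_xh @ x_t just selects column word_id of W_xh.
    h_new = [math.tanh(W_xh[i][word_id] + sum(W_hh[i][j] * h[j] for j in range(H)))
             for i in range(H)]
    # Softmax over the vocabulary gives P(next word | history so far).
    logits = [sum(W_hy[k][i] * h_new[i] for i in range(H)) for k in range(V)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return h_new, [e / z for e in exps]

h = [0.0] * H
for w in ["<s>", "the", "cat"]:
    h, probs = rnn_step(VOCAB.index(w), h)

# probs is now a distribution over the word that follows "<s> the cat"
print(len(probs))  # 5, one probability per vocabulary word
```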

------
technomalogical
Misread this. I thought this was Neural Language Modeling _in Scratch_, the
visual programming language.

------
tw1010
Whenever I see stuff like this I think: why RNNs, why not just multivariate
polynomials? Every property you want from an RNN you can get from polynomials,
except polynomials are significantly more exhaustively studied. Want certain
invariants to be guaranteed? You got it! Just look up any undergraduate
textbook on algebraic geometry. I'm glad that Yann Lecun went against the
established paradigm of creating image filters manually. But why stop there.
Let's go beyond the constraint of using only the mathematics commonly taught
in engineering schools. Let's take some inspiration from other departments.
Cross pollination is the key to revolutionary jumps in innovation.

~~~
yorwba
Two reasons multivariate polynomials are not commonly used in machine
learning:

1. The number of parameters grows as (number of variables)^(degree of
polynomial), which is highly inefficient. You could assume that the polynomial
is a linear combination of easily factored ones, but that's equivalent to a
neural network with one logarithmic-activation layer and one exponential-
activation layer, followed by a linear layer. And most multivariate-polynomial
theory probably hasn't focused on this special case.
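Point 1 is easy to make concrete: the number of monomials of total degree at most d in n variables is C(n + d, d), which grows like n^d for fixed d. A quick check (my numbers, purely for illustration):

```python
from math import comb

def num_monomials(n_vars, degree):
    # Number of monomials of total degree <= degree in n_vars variables:
    # C(n_vars + degree, degree), ~ n_vars**degree for fixed degree.
    return comb(n_vars + degree, degree)

# A linear layer on a 1000-dimensional input has ~1000 parameters per output,
# but a general polynomial already explodes at low degree:
print(num_monomials(1000, 1))  # 1001 (an affine map)
print(num_monomials(1000, 2))  # 501501
print(num_monomials(1000, 3))  # 167668501
```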

2. To handle potentially unbounded sequences you'll have to use your
multivariate polynomial in some kind of iterative/recursive scheme. That's
what an RNN _is_. You could build an RNN out of multivariate polynomials. It
probably won't work very well, because accumulating error will put you in an
area of fast divergence. LSTMs use addition with a bounded function to avoid
this.
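The divergence in point 2 shows up even in a one-dimensional toy iteration: repeatedly applying a mild quadratic update blows up, while squashing the same update through tanh keeps the state bounded (a sketch of the failure mode, not an actual RNN):

```python
import math

# Iterate two toy state updates over a "sequence" of 16 steps.
poly_h = 0.5  # polynomial recurrence:  h <- 1.1*h + 0.1*h**2
tanh_h = 0.5  # squashed recurrence:    h <- tanh(1.1*h + 0.1*h**2)
for _ in range(16):
    poly_h = 1.1 * poly_h + 0.1 * poly_h ** 2
    tanh_h = math.tanh(1.1 * tanh_h + 0.1 * tanh_h ** 2)

print(poly_h > 1000)       # True: the polynomial iterate has diverged
print(abs(tanh_h) < 1.0)   # True: the bounded activation keeps the state in (-1, 1)
```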

~~~
tw1010
The growth issue is only really a problem if you assume you're picking your
polynomials from R[x,y,...]. But there are other choices that would be more
appropriate. Often you want the model to be invariant to rotation (e.g. if
you're doing computer vision), in which case you'd use R[x,y,...]^G, where G
is the group of rotations.
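As a toy illustration of what membership in R[x,y,...]^G means (my example, not the parent's): p(x, y) = x^2 + y^2 is invariant under every 2D rotation of its inputs, so it belongs to R[x,y]^G for G the rotation group:

```python
import math

def p(x, y):
    # A rotation-invariant polynomial: the squared norm.
    return x ** 2 + y ** 2

def rotate(x, y, theta):
    # Rotate the point (x, y) by angle theta.
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

x, y = 3.0, 4.0
for theta in (0.3, math.pi / 2, 1.7):
    xr, yr = rotate(x, y, theta)
    # p takes the same value on every rotated copy of the input.
    assert abs(p(xr, yr) - p(x, y)) < 1e-9

print(p(x, y))  # 25.0
```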

~~~
yorwba
I'm not familiar with the R[x,y,...]^G notation, but I'll assume it refers to
the group of polynomials that are invariant under rotation of their inputs.
I'm not sure whether the rotations of discretely sampled images really do form
a group, since intuitively two 45° rotations lose information compared to a
90° rotation, but maybe you can fix that by assuming the right kind of
periodicity.

Even assuming that rotations aren't lossy, I get at best a reduction in the
number of parameters by a factor of √(number of variables), by fixing the
rotation of a set of variables (representing sample points) so that one of them
lies on a specific axis. In other words, this reduces the exponent by 1/2,
which is still not small enough to make even second-degree polynomials
feasible.
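To put rough numbers on that feasibility claim (my arithmetic, assuming one variable per pixel of a 256x256 grayscale image):

```python
from math import comb

n = 256 * 256              # one variable per sample point (pixel)
full = comb(n + 2, 2)      # monomials of degree <= 2, no symmetry assumed
reduced = full / n ** 0.5  # the hoped-for 1/sqrt(n) reduction from symmetry

print(full)           # 2147581953 -- over two billion terms at degree 2
print(int(reduced))   # 8388992 -- still millions even after the reduction
```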

However, that doesn't mean I think symmetry priors like this are useless, so
if you can point out further literature on this topic, that would be great!
(It might also help me understand how exponentiating a group by another makes
sense.)

~~~
tw1010
The notation R^G comes from invariant theory[1]. I don't know whether it has
any real connection to exponentiation or whether that's just notation, but it
wouldn't surprise me, because notation in algebra seems to have a tendency to
be really sneaky and well connected.

[1]
[https://en.wikipedia.org/wiki/Invariant_theory](https://en.wikipedia.org/wiki/Invariant_theory)

~~~
yorwba
Thank you for the link, now I know at least how to search for more
information.

Can you confirm or refute my calculation of the growth of the number of
parameters when the input variables are the sample points of an image and the
polynomial has to be rotation-invariant?

