
Transformer: A Novel Neural Network Architecture for Language Understanding - andrew3726
https://research.googleblog.com/2017/08/transformer-novel-neural-network.html
======
emeijer
Very interesting approach, and intuitively it makes sense to treat language
less as a sequence of words over time and more as a collection of words/tokens
with meaning in their relative ordering.

Now I'm wondering what would happen if a model like this were applied to
different kinds of text generation, like chat bots. Maybe we could build
actually useful bots if they can attend to the entire conversation so far
plus additional metadata. Think customer service bots with access to
customer data that learn to interpret questions, associate them with the
customer's account information through the attention model, and generate
useful responses.

~~~
hacker_9
No doubt a holy grail for chat bots, but I'll believe it when I see it.

~~~
VikingCoder
"Even without evidence, everyone should believe it will solve the problem, but
I won't believe it will solve the problem until there's evidence."

Is that what you just said? :)

------
devindotcom
DeepL (was on HN earlier this week) also uses an attention-based mechanism
like this (or at least, with the same intention and effect). They didn't
really talk about it but the founder mentioned it to me. The two seem to have
independently pursued the technique, perhaps from some shared ancestor like a
paper they both were inspired by.

~~~
albertzeyer
Discussion about DeepL:
[https://news.ycombinator.com/item?id=15122764](https://news.ycombinator.com/item?id=15122764)

Attention is not new. Everyone uses it (for translation and many related
tasks). It's very much the standard right now.

Avoiding recurrent connections inside the encoder or decoder is also not
completely new. That came up when people tried to only use convolutions.

Google's Transformer was made public in June 2017, in the paper "Attention Is
All You Need",
[https://arxiv.org/abs/1706.03762](https://arxiv.org/abs/1706.03762),
including TensorFlow code,
[https://github.com/tensorflow/tensor2tensor](https://github.com/tensorflow/tensor2tensor)
. Note that the new thing here is that they use neither recurrence nor
convolution but rely entirely on self-attention, with simple fully-connected
layers, in both the encoder and the decoder.

DeepL directly compares their model to Transformer, in terms of performance
(BLEU score), here:
[https://www.deepl.com/press.html](https://www.deepl.com/press.html)

------
rayuela
The key to this paper is the "Multi-Head Attention", which looks a lot like a
convolutional layer to me.
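
One way to see the analogy (and the difference): multi-head attention also
applies several parallel learned projections and concatenates the results, but
each head mixes information across all positions rather than a fixed local
window. A rough, self-contained NumPy sketch (my own illustration, not the
paper's code):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
        """Split the model dimension into n_heads, attend per head, concatenate, project."""
        n, d_model = X.shape
        d_head = d_model // n_heads
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        heads = []
        for h in range(n_heads):
            s = slice(h * d_head, (h + 1) * d_head)
            scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
            heads.append(softmax(scores) @ V[:, s])   # each head attends over all positions
        return np.concatenate(heads, axis=-1) @ Wo    # unlike a conv filter, no fixed local window

    rng = np.random.default_rng(1)
    X = rng.normal(size=(6, 8))                       # 6 tokens, d_model=8, 2 heads of size 4
    Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
    print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=2).shape)  # (6, 8)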

------
jatsign
Has anyone come across good ML models for Arabic-English translation? There
seems to be a complete lack of decent training data.

~~~
woodson
There's the GALE data provided by LDC
([https://catalog.ldc.upenn.edu/](https://catalog.ldc.upenn.edu/)):

- GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24)
- GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 (LDC2008T09)
- GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02)
- GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03)
- GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (LDC2009T09)
- GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 (LDC2012T06)
- GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (LDC2012T14)
- GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17)
- GALE Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18)
- GALE Phase 2 Arabic Web Parallel Text (LDC2013T01)

------
mykeliu
I'm a novice when it comes to neural network models, but would I be correct in
interpreting this as a convolutional network architecture with multiple
stacked encoders and decoders?

~~~
knowtheory
Nah, the paper explicitly states that their system is neither recurrent nor
convolutional:

> _To the best of our knowledge, however, the Transformer is the first
> transduction model relying entirely on self-attention to compute
> representations of its input and output without using RNNs or convolution._

------
sandGorgon
Would something like this work well on mixed/pidgin languages, e.g. Hinglish,
which is a mixture of Hindi and English used in daily vernacular?

------
bra-ket
Does it mean we don't need gradient descent after all to achieve the same
result?

~~~
sanxiyn
Nope, Transformer is still trained with gradient descent.
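
The attention weights only replace recurrence/convolution in the forward pass;
the parameters (the projection matrices in the sketches above) are still
learned by backpropagation with a gradient-based optimizer (Adam in the
paper). A toy illustration of the basic update rule, on a plain linear model
rather than the Transformer itself:

    import numpy as np

    # Toy gradient descent on one weight matrix of a linear model (illustrative only).
    rng = np.random.default_rng(2)
    X, Y = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
    W = rng.normal(size=(8, 8))
    lr = 0.1
    for _ in range(100):
        pred = X @ W
        grad = 2 * X.T @ (pred - Y) / len(X)   # gradient of mean squared error w.r.t. W
        W -= lr * grad                          # same update principle trains the Transformer's weights
    print(((X @ W - Y) ** 2).mean())            # loss shrinks over the iterations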

