Turing-NLG: A 17B-parameter language model (microsoft.com)
365 points by XnoiVeX 9 days ago | 137 comments





People are vastly underestimating the changes that are about to come from NLP. The basic ideas of how to get language models working are just about in place. Transformer networks, and recent innovations like GPT-2, Google's Reformer model, etc., are precursors to the real machine learning boom. Machine learning as we have known it has been stuck as an optimization tool, used for computer vision here and there. NLP, and with it, the ability to create, synthesize, and understand content, will change the internet.

More than that, I think NLP will unlock new ways of interacting with computers. Computers will be able to handle the ambiguity of human language, transcending their rigid “only do exactly what you tell them” models of the world.

Edit:

Adding this to give more technical context. I think most people don’t know where the line currently is between what’s possible and what’s not, or what we are on the cusp of. And we are on the cusp of a lot.

A quick explanation of one area is here:

Basically, transformer models are the best for NLP. They use something called attention-based mechanisms, which allow the model to draw correlations between pieces of text/tokens that are far apart. The issue is that this is an O(n^2) operation. So the model is bounded by the context window, which is currently mostly at 512 tokens, and is thus bounded in how much it can understand. Recent innovations, and further study, will broaden the context window, and thus unlock better reading comprehension and context understanding. For instance, the ability to answer a question using a piece of text is mostly stuck at just finding one paragraph. The future will see models that can find multiple different paragraphs, understand how they relate, pull the relevant information, and synthesize it. This sounds like a minor step forward, but it's important. This will unlock better conversational abilities, but also better ways to understand how different pieces of textual information relate. The scattershot of information across the internet can go away. Computers can better understand context to act on human intention through language, unlocking the ability to handle ambiguity. This will change the internet.
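To make the O(n^2) point concrete, here is a minimal single-head attention sketch in plain NumPy (my own illustration, not any particular model's implementation): the score matrix is n x n, so doubling the context length quadruples the memory and compute for this step.

    import numpy as np

    def attention(Q, K, V):
        # Q, K, V: (n_tokens, d) arrays for one attention head
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # (n, n)  <-- the quadratic part
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
        return weights @ V                               # (n, d)

    n, d = 512, 64                                       # a 512-token context, 64-dim head
    Q, K, V = (np.random.randn(n, d) for _ in range(3))
    print(attention(Q, K, V).shape)                      # (512, 64); the score matrix was 512 x 512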

Again, to emphasize, these models only started showing up in 2017! The progress has been rapid.


Computers will be able to handle the ambiguity of human language, transcending their rigid “only do exactly what you tell them” models of the world.

So, are there reasonable examples now of these models allowing semantic context? So far, what I have seen is generated text where the lack of understanding takes three paragraphs to become obvious rather than one.

Human language is this marvelous framework involving symbols associating with other symbols as well as with well-known and vaguely-guessed facts about the world.

Human relations are very robust and, for example, two people can have a longish conversation where, at the end, they realize they're talking about two different people (or different days or events). But in those circumstances, they can correct and adjust. "Solid" understanding is there but it's under a lot of layers of social cues and protocols and multiple meanings.


> So, are there reasonable examples now of these models allowing semantic context?

This is about where I am stuck. I'll start believing that we truly are on the cusp of a revolution as soon as I see Google Translate reliably knowing when to translate "home" into French as "domicile", "foyer", something along those lines, or as "accueil."

Right now it seems to very frequently choose "accueil", which is generally wrong, except when you're talking about websites and software user interfaces. That it's biased so strongly toward that error speaks volumes about how critical semantics are to sorting out natural language, and also about how bad current NLP systems are at dealing with semantics.


Syntax and semantics were developed for human language, yet it's much easier to puzzle out the difference in a computer language than in human language. With syntax and semantics so wrapped together, however, it kind of seems like you can go a long way with just capturing syntax, rhythm, word choice, etc. Which is to say the semantic side can be even worse than it seems, i.e., nonexistent.

A long way for what though? The unicorn story is neat, but I don’t have a giant unmet demand for rambling.

I just tried a few examples and it always returned the right word to me.

I'm going home -> Je rentre à la maison

Home sweet home -> La douceur du foyer (the translation is weird but the word foyer was expected)

I feel at home -> Je me sens chez moi (this one is particularly good, it didn't translate the word home directly)

Can you share your examples where it fails?


Not offhand. It's a problem I've noticed more often when looking at translations of larger bodies of text, though. Translating simple, highly standardized sentences like those is what neural translation does best, because it is able to basically just consult an internal phrasebook that it's compiled from its training data.

> This is about where I am stuck. I'll start believing that we truly are on the cusp of a revolution as soon as I see Google Translate reliably knowing when to translate "home" into French as "domicile", "foyer", something along those lines, or as "accueil"

Isn't that basically the same as the Winograd problem?


I would guess that it's a bit easier. With what I was proposing, you just need to be able to infer the semantics of certain words. With the Winograd problem, you need to grasp the semantics of all the words, and then use that knowledge to infer a pronoun's antecedent based on what yields a more sensical overall interpretation of the sentence.

"use that knowledge to infer a pronoun's antecedent based on what yields a more sensical overall interpretation of the sentence"

People often assume a very benign, civilized environment for AI. In ordinary life, human beings (on the internet or off) take an adversarial approach to other people modeling the logical framework behind an utterance. They either subvert it for amusement (trolling or comedy) or profit (politics, propaganda, sales), and a tremendous amount of effort goes into it, much of which is very effective.

As people have observed, it can be very easy to convince a human that a machine is intelligent, as with ELIZA. But if you violate the presumptions of trust that a machine is designed with, it's going to be surprisingly vulnerable. To be superior to humans, a machine would have to be able to fend off an intelligent person trying to undermine it, not just work when it is spoon-fed.


I would agree with you to a certain extent, but I still think there is a big missing component to make this a reality. Larger and more accurate general language models are great, but to enable use cases other than categorization, translation, summarization, etc., there will almost certainly have to be a contextualized knowledge graph layer. This is basically what I assume you mean when you say "the ability to create, synthesize, and understand content." The way I see it, transformer-based general language models will be the first and last steps of an NLP system. In other words, they will do the raw processing of the input and will do the "take this output and put it in natural language based on this context" portion. As for automating the part in the middle, the part that actually understands what the text means and can do logic with it, there's still no progress on that AFAIK.

This is similar to how computer vision models work great in most cases, but to build a self-driving car you still need all the other components that do path planning, predicting what the cars around you will do based on the state of the environment, etc.


Replying from my laptop.

I agree, there needs to be a way to represent relationships between information. I personally don't think knowledge graphs will be the ones to do it, not because they don't work, but because of how imperfect they are, in the data quality sense.

See this paper here:

"DIFFERENTIABLE REASONING OVER A VIRTUAL KNOWLEDGE BASE" https://openreview.net/pdf?id=SJxstlHFPH

Which is a recent effort, among many, by Google Research to build a model that can view a document as a knowledge graph: instead of explicitly tying pieces of the document to the graph, the idea is to create a graph from the document. This paper is a bit different from that; they do input a knowledge graph for training, but I think the idea and the track of where they are headed has a ton of room to evolve. The trick is that transformer models have unlocked the ability to understand the text, so all of this "quasi knowledge graph extraction" that I was just explaining is only recently possible! There's no research on it, because the baseline understanding of tokens has been too primitive. This is why there is so much room to grow: BERT has unlocked new methods, and it can be used as a base for a ton of new NLP.

Just to emphasize again, I'm not saying what I outlined above will be a good way to do it, just that ideas like this could only be tested recently. There's a million new ways to spin this problem.


Related: a recent paper from Antoine Bosselut and Yejin Choi explores dynamically constructing a context-specific, common sense knowledge graph using a transformer, in the context of question answering.

https://arxiv.org/abs/1911.03876


> mostly at 512 tokens

GPT uses a 1024-token context window, which does work out to a fair amount (given the massive vocab of 50k+, which means a token can be more than a word), though of course it's pretty limited.

Google's recent Reformer[0] allows you to do attention much more cheaply, and I'm currently training a Reformer model that isn't quite as big but has a context of ~64k tokens (though a much smaller vocab). I'm not completely sure if this is the solution, but it looks like a step in that direction and so far the model is doing pretty well (I also plan to post the weights when I'm finished, though I'm not sure whether Google plans to do that themselves anyway).

I am somewhat disappointed they went with just 1024 for this model too, though.

0. https://ai.googleblog.com/2020/01/reformer-efficient-transfo...
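For rough intuition about why full attention breaks down long before ~64k tokens, some back-of-the-envelope arithmetic (my own numbers, not figures from the Reformer paper; the "bucketed" line is only a loose stand-in for LSH-style chunked attention):

    # Memory for one full-attention score matrix (float32) at various context
    # lengths, versus a chunked scheme that only attends within ~64-token buckets.
    # Real models multiply this by heads and layers.
    def full_attention_bytes(n):
        return n * n * 4                              # n x n float32 scores

    def bucketed_attention_bytes(n, chunk=64):
        return (n // chunk) * chunk * chunk * 4       # (n/chunk) blocks of chunk x chunk

    for n in (512, 1024, 65536):
        print(f"n={n:>6}: full ~{full_attention_bytes(n)/2**20:8.1f} MiB, "
              f"bucketed ~{bucketed_attention_bytes(n)/2**20:6.1f} MiB")
    # n=   512: full ~     1.0 MiB, bucketed ~   0.1 MiB
    # n=  1024: full ~     4.0 MiB, bucketed ~   0.2 MiB
    # n= 65536: full ~ 16384.0 MiB, bucketed ~  16.0 MiB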


Yep, the number has been expanding. I just picked 512 because BERT uses that. But yeah, 50k and beyond unlocks new ways of using context. I can’t wait to see how they optimize and improve the Reformer model.

I expect there to be an improvement like there was for BERT -> ALBERT.


> Basically, transformer models are the best for NLP.

Yes, this year. While transformers certainly present a breakthrough in the NLP community and certainly stir up the state-of-the-art again, I don't really see how you go from that to the "computers will understand us" conclusion to be honest. People said that during the word2vec stir up and what we got out of that was incremental results (which is not bad, it's in the nature of things really).

We can already build models that do everything you describe as single tasks; while that's exciting, it's not going to lead to the singularity. We've got a long way to go in terms of understanding models, making them computationally tractable, and making them do what we want in the first place without resorting to hoping that our unsupervised model learns something useful. It's likely that the attention mechanisms we see today will be a large part of that, but I'm honestly a bit baffled at the "People are vastly underestimating the changes that are about to come from NLP." part. They're not; people already think that today's AI is magic, and I don't think it is beneficial to reinforce that. Speech is nuanced, and we're making good progress in many areas, but we're not on the cusp of any revolutionary change in computational understanding really.


> So the model is bounded by the context window, which is currently mostly at 512 tokens, and is thus, bounded in how much it can understand.

This is begging the question as to whether the model "understands" anything at all. And once you adopt a definition of "understanding" that isn't equivalent to "got a high score on some pointless academic challenge" the answer is a resounding "no." The whole enterprise of AGI hype is based on this equivocation of words like "understanding" and "intelligence." We use a very restricted definition in proving that the tech is smart, and then switch out our restricted definition for the colloquial one when the audience isn't looking.

> This will unlock better conversational abilities

Shouldn't be hard given that as it stands there are none, except for creating a human-sounding slurry that is devoid of real content.

> but also, better ways to understand how different pieces of textual information relate

Is this a real need? What problem does this solve that forums + wikipedia + arxiv + google + a literate human hasn't already?


>More than that, I think NLP will unlock new ways of interacting with computers. Computers will be able to handle the ambiguity of human language, transcending their rigid “only do exactly what you tell them” models of the world.

As exciting as this sounds, given that -we- haven't figured out how to handle the ambiguity of human language, I'm not convinced a computer attempting to do so is really markedly better for many use cases than requiring exactness. But operating at a human level of 'understanding', and being broadly accessible, may be enough to change the world. Hopefully for the better.


Can you explain your reasoning? It's easy to imagine a new invention or idea will revolutionize everything because it's never been done before and feels powerful, but even if it works, it might not. It sounds like you believe NLP will enable more general voice control of computers than Siri/Alexa/etc. But will that really be much more significant than people expect? Google is already pretty good at understanding ambiguous queries. It's eliminated the need to know where to look for information. Or are you talking about the computer writing code or doing business itself?

Replying on my laptop.

Basically, transformer models are the best for NLP. They use something called attention-based mechanisms, which allow the model to draw correlations between pieces of text/tokens that are far apart. The issue is that this is an O(n^2) operation. So the model is bounded by the context window, which is currently mostly at 512 tokens, and is thus bounded in how much it can understand.

Recent innovations, and further study, will broaden the context window, and thus unlock better reading comprehension and context understanding.

For instance, the ability to answer a question using a piece of text is mostly stuck at just finding one paragraph. The future will see models that can find multiple different paragraphs, understand how they relate, pull the relevant information, and synthesize it. This sounds like a minor step forward, but it's important.

This will unlock better conversational abilities, but also, better ways to understand how different pieces of textual information relate. The scattershot of information across the internet can go away. Computers can better understand context to act on human intention through language, unlocking the ability to handle ambiguity. This will change the internet.


If this is really where the researchers think these tools are headed (and I don't really doubt you on that point), then this is incredibly dangerous stuff. No matter how good your system is, the impact of implicit, unintentional, and non-targeted bias is huge on the sorts of content these systems will produce. But expose it to the levels of intentional manipulation present on the Internet of today, and these models don't stand a chance of producing something that safely does what you claim.

I'm happy to be wrong about this, but I'm not seeing any discussion about the safety and security of using these systems. And if it's not even being discussed, we can be sure nothing's actually being done about it. Selling promises of active-agent computers interpreting human intent and summarizing information from the Internet without addressing this concern is irresponsible at this point.


Google search already faces that problem. Is that unsafe too?

Can you explain more about what you mean by "able to handle the ambiguity of human language"? Such a claim goes far beyond language processing itself. Such a capability would have to involve cultural and interpersonal dynamics, domain knowledge, and definitive agency on the part of the computer.

And all of _that_ is just to interpret sincere, honest attempts at communication. Can it safely and appropriately handle humor, irony, and sarcasm? What about coordinated malicious attacks?


Let's say you had a problem stated as follows:

You have several thousand documents in HTML, with roughly similar content (to a human) but not entirely consistent in the ordering of the sections, or the formatting, or the names, or the language and structure. But each document describes an entity of the same class (to a human) and very nearly all of them have a section that summarizes the document. They have a lot of parts in common, but they are fundamentally not designed to line up with a structure for data processing.

Is there any practical way to find that summary? Sure, something obvious that takes no time to script is to look for something like "Overview" but you quickly get bogged down in exceptions.

Maybe this seems very mundane and simpleminded, but it is the sort of thing that many people would assume is best done by a human. Is there anything current or in the near future that would be significantly easier than a human reading all of them?
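A minimal sketch of the "look for something like 'Overview'" baseline described above, using BeautifulSoup. The heading list and tag structure are hypothetical, and the exceptions the comment mentions are exactly what this doesn't handle; an NLP-based approach would replace the keyword set with something like embedding similarity between each section and a prototype "summary" section.

    from bs4 import BeautifulSoup

    SUMMARY_HEADINGS = {"overview", "summary", "abstract", "description", "at a glance"}

    def find_summary(html):
        soup = BeautifulSoup(html, "html.parser")
        for heading in soup.find_all(["h1", "h2", "h3", "h4"]):
            if heading.get_text(strip=True).lower() in SUMMARY_HEADINGS:
                # collect sibling elements until the next heading
                parts = []
                for sib in heading.find_next_siblings():
                    if sib.name in ("h1", "h2", "h3", "h4"):
                        break
                    parts.append(sib.get_text(" ", strip=True))
                return " ".join(parts).strip() or None
        return None  # no recognized heading: this is where the exceptions pile up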


Feels like I’ve heard this tune before. Actually, I think it was first played in the 1950s.

To any of the readers out there, I am currently looking into better understanding the potential business applications of this NLP technology. It would be especially interesting to hear from people who work extensively in social media as to how "Automated reading" could one day help them with their daily job.

If you have time to discuss, DM me on twitter - I am @ralphbrooks.


I can't wait for the day we see "deep-dream" styled literary works.

There have been plenty of "AI written" textual works, like the AI Dungeon for a D&D game, or the AI Recipe Generator that attempts to make something that looks like a cooking recipe.

For the most part they aren't successful because the "AI" isn't smart enough to have a goal in mind, so they end up just monkey-cheesing everything. Pasting together snippets in ways that are usually grammatically correct but make no sense.


GPT2 does have a goal in mind, namely an idea akin to "minimal surprise". The problem is that it doesn't have a model of the world it's talking about, so it quickly says things that are locally sensible but globally inconsistent.

Any books/papers that address "goal"-oriented NLP, or any discussion of hybrid approaches?

There's interesting research out of Deepmind on two parts of this, which are using the Transformer model in Reinforcement Learning contexts[0] and creating textual GANs[1]. As you are probably aware, GANs are one of the important tools that have driven forward image synthesis and until recently it was impossible to apply them to text, so I expect this to push us quite a bit forward. There's also ongoing work in the selection of the metric to use to evaluate the generated text, and discriminate between human and machine-generated text.

[0] https://arxiv.org/abs/1910.06764

[1] https://arxiv.org/abs/1905.09922


The ability to set your own goals and task yourself to achieve them is the essence of AI. Not "AI" as we know it today, but Sci-fi AI where it's a machine person.

Indeed, but that kind of describes the gulf between current language processing and present AI. Present AI generates tokens that seem to have meaning or seem like an appropriate response to a statement, but where it becomes evident after 2-3 paragraphs that there's no substantial relation to either underlying meaning or underlying goals.

Part of this is that "underlying meaning" is an intuitive way to describe things, but whatever is underlying here is more tenuous than a classical logic/GOFAI model of the world, yet more "solid" than a long, clever stream of associations.


If one interprets "set your own goals" as the task of AI, then it's probably Sci-Fi AI, but that wasn't my question.

Let's say we have a goal: "evaluate a person's impression of a particular book/topic/etc".

So the goal would be to have a conversation on this and related topics that would (re)construct the person's impression.

Hence my question whether there are any publications/articles that have explored that.


Is it? It's pretty clear that this kind of activity isn't that common in humans - mostly we pick up our goals from cues and drivers in our environment and society.

But what would be the point... after the novelty wears off, no one considers deep dream anything other than an outcome of the system, not as art with an intended meaning (unless the point is art without meaning)

”People are vastly underestimating the changes that are about to come from NLP.”

Maybe, maybe not. IMO, we don’t know what problem we have to solve to get what you describe. We also don’t have a metric as to how far we are along the path towards that goal (do we need 100B parameters? A trillion?), nor do we have any idea as to whether the current approach can get us there.


> More than that, I think NLP will unlock new ways of interacting with computers. Computers will be able to handle the ambiguity of human language, transcending their rigid “only do exactly what you tell them” models of the world.

Such as google's duplex?


Yes, but Duplex doesn't work and was too early.

Perhaps, but people overestimate the progress in AI also.

Remember Google's demo of AI reserving a spot at a barber shop? Yeah... that never happened, even though it was supposed to be any day now.


I think you’re right that we are on the cusp of real natural language understanding. It’s an incredible moment. But we are also kinda far from these huge models being in every home because of the enormous amount of compute they require. Even running inference with these things is kinda expensive.

I think you're the one who overestimates how much this will affect NLP. I'd say bulk of what was possible to deliver with this is already here, the subsequent changes will be incremental.

The cold hard truth about statistical (and by extension, deep) NLP is that it's just a fancy way of counting numbers mostly. The only way to get to _real_ language understanding is AGI, and _nobody_ is working on that. You fundamentally cannot interact comfortably with a human if your system does not have probabilistic, contextualized cognition, and can't incorporate knowledge about the world.


I respectfully disagree. I’ve been in the field full time for a few years, I watch the state of the art closely. It’s hard to see the thought/theoretical progression of deep transformer models by just reading posts here and there.

I’m not saying these NLP methods will be some kind of AI, just that they will produce products, content, and ways of interacting with the world that are categorically different from what we have seen in the past.

For instance, question and answering tasks have only recently been able to:

Find an answer in a text document that spans multiple non contiguous paragraphs

Understand context across a whole book.

The context window of current NLP is stuck at 512 tokens, mostly because of computational complexity. This has been broken just recently by the Reformer model, which is a primitive, early way to get around the computational cost of attention mechanisms.

Just wait. The ideas are there. They just take time to refine.


I'm a researcher in the field. Not in NLP anymore, but I worked on that as well, years ago, and I keep up with the research. You can't "understand the context" of "War and Peace" unless you have real, actual AGI. I actually doubt it can be fully understood at all when translated to English and read by someone without the right cultural background. This is an extreme example, chosen to make it easy to see that it applies to any non-trivial text.

Let's take a question answering example. Take just about any recent deep learning paper and try to answer detailed, higher level questions against it. To use a concrete example, take MobileNet V3 paper and ask your system "do I put activation before or after squeeze and excitation" (correct answer is "before"), or "do I need a bias in squeeze and excitation module" (correct answer is "it depends on the task"). You won't be able to, because a lot of things are just _assumed_, just like in any other realistic example of text written for human consumption. The facts are encoded externally as information about the world, and they're so fine grained and contextual, that we don't even know how to begin incorporating them into the answers, let alone do so contextually and probabilistically, like human mind does.


Maybe. I take a really optimistic view of attention-based mechanisms, like I explained and added to my original post. If you read the recent Reformer paper, they produce a new way of computation to start building a model that can, in some way, encode the relationships between different parts of War and Peace. The bottleneck right now is computation; we don’t know how well these models can learn when that bottleneck is removed!

I’m optimistic because I believe that the contextual information that you are describing, is already there in the vast expanse of the internet.

But I will also add, I think none of this will spawn AI, just that it will spawn new technologies that are categorically different.


I think it will merely spawn technologies that are less fragile, yet still too fragile and unsophisticated to be practical for full blown conversational user interface the likes of which you see in the Avengers movies. It will (and already does) make a difference in simpler tasks where you can get away with just counting numbers, such as search/retrieval, simple summarization, constrained dialogue in chatbots, stuff like that.

What exactly do you expect from these Transformer-based models? This particular one is underwhelming because it provides a tiny improvement over the previous largest one (from Nvidia) with more than double the size.

The fundamental limitation of these "optimization tools", as you call them, is that they don't have any common sense, or any way to query an external source of information (e.g. Wikipedia), or to ask a human to clarify.

Another big problem is we don't have any way to do quality filtering on the outputs. From my experiments with GPT-2, it produces one interesting paragraph of text out of 20 - if you squint at it really hard. And most of those 20 don't make much sense at all.

So no, the existing ideas are definitely not enough. Maybe some novel hybrid of symbolic AI with statistical optimization will lead to a breakthrough. This one does not strike me as anything other than "let's use moar weights!!"


Some papers which you might find interesting:

Lin, B. Y., Chen, X., Chen, J., & Ren, X. (2019). KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning. 2822–2832. https://doi.org/10.18653/v1/d19-1282

Liu, W., Zhou, P., Zhao, Z., Wang, Z., Ju, Q., Deng, H., & Wang, P. (2019). K-BERT: Enabling Language Representation with Knowledge Graph. Retrieved from http://arxiv.org/abs/1909.07606

Trinh, T. H., & Le, Q. V. (2019). Do Language Models Have Common Sense? ICLR, 1–12.

Ostendorff, M., Bourgonje, P., Berger, M., Moreno-Schneider, J., Rehm, G., & Gipp, B. (2019). Enriching BERT with Knowledge Graph Embeddings for Document Classification. Retrieved from http://arxiv.org/abs/1909.08402


> they don't have any common sense

What do you mean by this? Of course they do, learned from their training data. For example, here is a quote from conversation 38 of https://github.com/google-research/google-research/blob/mast...

Human: Do you like Korean food in general?

Meena: It's okay. I like beef bulgogi, but I'm not a huge fan of kimchi.

It seems to me Meena "knows" bulgogi and kimchi are Korean foods. Isn't that common sense? If it isn't, what do you mean by "common sense"?


Try asking it a follow up question not commonly found in the training data: such as "do you think bulgogi would grow on Mars?", and see what kind of gibberish you will get in response. Moreover, the model has no way of self-diagnosing whenever it produces gibberish.

What is the baseline comparison here? If you asked a random 100 people on the street this question?

I'd buy that if Meena could infer and reason about her own answers.

Human: Do you like Korean food in general?

Meena: It's okay. I like beef bulgogi, but I'm not a huge fan of kimchi.

Human: Ok what should I shop for ?

Meena : You've got almost everything but you need a pear, the steak and some ginger.

The problem with language models as commonsense is that they are collections of patterns and associations, and that they don't have inference models or solvers - unlike my dog for example!


unlike my dog for example!

A more relevant analogy might be a talking parrot :)


Parrots might have what we call a language model... They definitely have inference and autonomy!

I am sorry, you come across as extremely over-enthusiastic, without too many specifics beyond “we’re just about to figure it all out”, “you just wait”, and “it’s gonna revolutionize everything”. We’ve seen this before with ImageNet, haven’t we? When everybody thought that because ConvNets are crushing all the older methods, AI is right around the corner. Well, it turned out to be much more complicated than that, didn’t it. Transformers are great (well, if you have the compute, that is), don’t get me wrong, but let’s not get ahead of ourselves. The field is over-hyped as it is.

> When everybody thought that because ConvNets are crushing all the older methods, AI is right around the corner. Well, it turned out to be much more complicated than that, didn’t it.

I don't think anyone familiar with the area thought that ConvNets will give us AGI.

However, their effect has been huge! It's hard to overstate this. Computer vision used to be a small niche topic, with tons of effort required to get something working even on simple images. The quality of today's ConvNet predictions is way beyond anybody's imagination in around 2010. Models built around that time were like a house of cards. Extremely carefully crafted for specific scenarios, where moving one threshold a bit would destroy your output.


Crucially also, convnets have exceeded human performance on several important vision tasks.

You have to be very careful with such claims. For example it may be able to tell apart tons of dog breeds at a superhuman level, but that's not really what people imagine if they hear such claims.

Also, sometimes in medical imaging the conditions are very different from actual practice. For example, the doctor may be worse than the convnet on certain types of low-quality, low-dynamic-range images that someone preprocessed in a particular way. But sure, in the medical field, on some error-prone, boring counting tasks and spot-the-cancer-in-your-200th-image-today work, the machine can actually perform better.

But what tasks specifically do you have in mind?


I'm not saying this will spawn AI. Just that, like computer vision was essentially solved by CNNs, NLP will be solved by transformer models.

>computer vision was essentially solved by CNNs

That is a rather contentious claim.


To say the least.

>Find an answer in a text document that spans multiple non-contiguous paragraphs

>Understand context across a whole book.

To have truly much utility when it comes to context, a model doesn't just need to correlate information across some text; it also needs general knowledge and understanding so that it can produce knowledge which is only implicit in, or not even present in, the text itself.

You can make the transformers as large as you want, NLP models still fundamentally suck at answering trivial questions like "If Pete was alive in 2000 and alive in 2050, was he alive in 2025?"


Since you seem to be observing this closely, I'm curious what steps are being taken to identify and avoid bias in the results generated by these systems?

I wonder if you could have like a higher order transformer — you first generate a series of prompts and then expand on the prompts one at a time.
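A crude sketch of that two-stage "outline, then expand" idea using the Hugging Face transformers library as a stand-in (prompts and sampling settings here are made up; this is not a tested recipe, just the shape of the loop):

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    def complete(prompt, max_length=120):
        ids = tok.encode(prompt, return_tensors="pt")
        out = model.generate(ids, max_length=max_length, do_sample=True, top_k=40)
        return tok.decode(out[0], skip_special_tokens=True)

    # Stage 1: generate a rough outline; Stage 2: expand each line of it.
    outline = complete("Outline of an article about home gardening:\n1.")
    sections = [complete("Write a paragraph about: " + line)
                for line in outline.splitlines() if line.strip()]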

If what you claim is true, it's not just the internet that will be changed by NLP but all of civilization!

One of the team members from Project Turing. Happy to answer any questions.

“We are releasing a private demo of T-NLG, including its freeform generation, question answering, and summarization capabilities, to a small set of users within the academic community for initial testing and feedback.”

What’s the deal with these private demos? (GPT-2 was also essentially private). More importantly, why even announce the existence of a private demo to people who were not invited?


I'm honestly not trolling with this question, but can you explain what the practical applications of text generation are? From what I've seen of GPT-2, it's a cool toy, but I have never seen it create anything that seems like it would be useful to solve a problem (eg, a human-computer interaction problem).

The only applications I can think of for text generation are malevolent ones: I'm sure it would be great at generating spam sites which can fool Google's PageRank algorithms, and it seems like you could easily use it in an information warfare / astroturf setting where you could generate the illusion of consensus by arming a lot of bots with short, somewhat convincing opinions about a certain topic.

Is there something obvious I'm missing? It seems too imprecise to actually deliver meaningful information to an end-user, so I'm frankly baffled as to what its purpose is.


Why the lack of numbers on the more popular SQuAD and GLUE benchmarks?

SQuAD and GLUE are tasks for language representation models -- aka BERT-like. This is a language generation model -- GPT-like. Hence, SQuAD/GLUE test sets are not really applicable. We are reporting on the wikitext and lambada sets that OpenAI also uses for similar models (numbers are in the blog post).

What's the difference between the two models?

* BERT & language representation models: They basically turn a sentence into a compact vector that represents it so you can then do some downstream task on it such as sentiment detection, or matching the similarity between two sentences etc.

* GPT & language generation models: Given some context (say a sentence), they can generate text to complete it, or to summarize it, etc. The task here is to actually write something.
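For readers who want to see the distinction in code, here is a rough illustration with the Hugging Face transformers library (not Turing-NLG itself, which is not public): a representation model gives you a vector per token to build downstream classifiers on, while a generation model gives you a next-token distribution to sample text from.

    import torch
    from transformers import BertModel, BertTokenizer, GPT2LMHeadModel, GPT2Tokenizer

    # Representation: encode a sentence into contextual vectors.
    btok = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased")
    ids = btok.encode("The movie was great", return_tensors="pt")
    with torch.no_grad():
        hidden = bert(ids)[0]      # (1, n_tokens, 768): features for downstream tasks

    # Generation: continue a prompt token by token.
    gtok = GPT2Tokenizer.from_pretrained("gpt2")
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
    prompt = gtok.encode("The movie was", return_tensors="pt")
    out = gpt2.generate(prompt, max_length=20)
    print(gtok.decode(out[0]))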


Both are language representation models; text generation is just a way of training the model. BERT is also trained on a text generation task: it is asked to fill gaps in text (15% of the text is blanked during training).

Maybe I am not understanding your point.

Out of the box, given a sequence of n tokens, BERT returns a tensor of dimension (n_tokens, hidden_size) [1], where hidden_size has no relationship with the vocabulary. You can then fine-tune a model on this representation to do various tasks, e.g. sentiment classification. Thus BERT is said to be a language representation model.

Out of the box, given a sequence, GPT-2 returns a distribution over the vocabulary [2] from which you can draw to find the most likely next word. Thus GPT-2 is said to be a language generation model.

You could of course play with the masking token of BERT and call it recursively to force BERT to generate something, and you could chop off some layers of GPT-2 to get some representation of your input sequence, but I think that is a little past the original question.

[1] https://github.com/google-research/bert/blob/master/modeling...

[2] https://github.com/openai/gpt-2/blob/master/src/model.py#L17...


> BERT returns a tensor of dimension (n_tokens, hidden_size) [1]. Where hidden size has no relationship with the vocabulary

"BERT returns" is ambiguous here. During pretraining last layer is loggits for one hot vocab vector, the same as in GPT: https://github.com/google-research/bert/blob/master/run_pret...


One is a language generation model, the other is a fill-in-the-blank model. It sounds like they might be similar, but in practice they are different enough objectives (and in particular the "bi-directional" aspect of BERT-type models) that the models learn different things.

Have you evaluated against the AI2 Leaderboard benchmarks? https://leaderboard.allenai.org/

Not yet. We will try to run against those benchmarks soon.

How does it compare to Google’s BERT and do you have an online demo?

Here’s a demo of BERT https://www.pragnakalp.com/demos/BERT-NLP-QnA-Demo/


(Similar to the response for another question.) BERT is a language representation model while Turing-NLG is a language generation model (similar to GPT). They are not directly comparable (they can potentially be massaged to mimic the other, but, not something that we have done yet.)

Google's T5 paper pretty convincingly combines the two doesn't it?

Any plans on training other (non nlp) huge models using ZeRO?

Specifically for Transformers - any plans to train a big model with a bigger context window?

Not that this one isn't very impressive, of course.


Thanks for your kind words. Yes, we would like to next train a language representation model. And our hunch is that probably something which is a mixture of language representation and language generation would be able to get the best of both worlds.

How close do you think the technology is to answering -this- question?

I have been bearish on AGI, but GPT2 surprised me with the lucidity of its samples.

My take from the past few years is that we're 99% done with the visual cortex - convolutional nets can be trained to perform any visual task a human can in <100ms. Now I'm mostly convinced that GPT2 has solved the language cortex, and can babble as well as we will ever need it to. We just need a prefrontal cortex (symbolic processing / RL / whatever your pet theory is) to drive the components, which is a problem we have not even started to solve. I am 90% sure it is a different class of problem and we won't knock it out of the park in 5 years like the visual/language cortexes, but we can hope.

edit: it's possible cognition follows from language, which would be convenient. is GPT2 smarter than a dog? I don't think so but I could be wrong ¯\_(ツ)_/¯


I have been bearish on AGI, but GPT2 surprised me with the lucidity of both paths. I still maintain my support for the basic metric of the GPT-I. However, I have a number of requirements on how my proposal is to be funded to resolve concerns. First, I strongly believe that academic research should be the method of choice (that is, if we are to figure out how to make AGI possible), and I advocate funding to support results from the central bank community. Second, given that the GPT can be articulated in mathematical terms, this should be reflected in funding policy. A very serious concern is that if funding of GPT is disincentivized, investors may react similarly to the way they reacted to AGI. This is

You generated this with GPT, right?

Yes.

While Markov chains sound like a schizophrenic, this sounds like a banker on coke. Pretty impressive progress, i guess.

That's because of the word “bearish”. It can sound like just about anything. For example, prompted by your comment, it sounds like a human interest journalist. (Journalism and fanfic, in particular HPMOR, seem to have comprised a large part of its training corpus; it composes Harry Potter porn at the slightest provocation; “Hermione moaned”, say.)

While Markov chains sound like a schizophrenic, this sounds like a spatially disjointed notebook, as if somebody was trying to write in two places at the same time. "Today is Monday, it's Saturday night, I forgot to write to my dad and he only leaves the house for a couple of hours."

Yet, sometimes Markov chains turn out to be the most beautiful art form. His wife Jennifer Neil, whom he met at a barbecue and has been married to since 1998, attributes this creative process to the constant ups and downs in his old job.

"His numbers are just insane," says Jennifer Neil. "I don't know where he keeps them, but they're very mind boggling."

Somewhat disconnected from the actual


I've always been interested in techniques to try to minimize parameters or alternate approaches to learning. Meanwhile, state of the art is over here just finding clever ways to make everything bigger. I have a feeling we're going to end up with a very different landscape in 5-10 years, much like the automotive industry never started mass producing inline 12s and instead moved to turbos and superchargers.

I can understand announcing this without code, but without a demo so anyone can try it in different scenarios?

If you want access, please send an email to [turing_ AT _microsoft _DOT_ com]. Remove underscores and spaces.

B = Billion, not Byte. For second I was like, WTF?

I thought I was the only one. There is no trigger to deep learning.

But the article is fascinating nevertheless. Not sure it is an AlphaGo breakthrough.


Not at all comparable. It's just a scaled-up GPT-2, no new ideas deep-learning-wise.

All these language generation models, in short, base their next word solely on the previous words, right? I'd expect that these generators can be conditioned on e.g. some fact (like in first order logic etc) to express something I want. This is roughly the inverse of for example Natural Language Understanding.

Does anything like this exist?


I'm fairly sure that these models don't work solely on the previous word, but instead are able to remember some level of information from history.

Otherwise, you'd reach a word like 'and' and couldn't possibly follow it with a logical statement that follows on from the previous part.


This is why I said 'words', multiple :-).

My point being that these generation models should be conditioned on something more than just word history, like something they want/are instructed to express.


This does GPT-2 X 10. For anyone wondering what GPT-2 is doing look at this baffling subreddit and marvel at how one GPT-2 model trained for $70k spits out better comedy than everybody on the payroll of Netflix combined.

https://www.reddit.com/r/SubSimulatorGPT2/


It's definitely better than the original using Markov chains. It fits very well this use case, and in my opinion only this use case.

GPT2 is still very random and quite stupid.

You start it with your love for your girlfriend as a context, she becomes a cam girl into hard core anal two paragraphs later. You start with religion, "Muslims must be exterminated". You start with software and you get a description of non existent hardware with instructions about how to setup a VPN in the middle. You start with news, and you can read than China supports the Islamic state.

That's cool because it has more context than Markov chains which usually have only 3 words of context, but it's still a long way to go before I trust anything generated by this kind of algorithm.


This stuff is pretty much indistinguishable from the real thing...

https://www.reddit.com/r/SubSimulatorGPT2/comments/f1pypf/so...


Comedy is right.

"I'm really starting to get worried about my Higgs Boson (HBN) after watching some videos on YouTube" [0]

"This repost and all of your posts are garbage." [1]

"It's the most random and unoriginal shit I've ever read." [2]

[0] https://www.reddit.com/r/SubSimulatorGPT2/comments/f1sqyh/an...

[1] https://www.reddit.com/r/SubSimulatorGPT2/comments/f1vfnv/wh...

[2] https://www.reddit.com/r/SubSimulatorGPT2/comments/f1o83a/fo...


Simulating r/StonerPhilosophy[1]

Post: Do we live in a simulation?

Comment: I just realized, we are a simulation, and we are a simulated simulation.

Comment: We're all in a simulation. We're still here. We're all in this little ball together

Comment: The simulation hypothesis states that we are in a simulation. Which means that there is a possibility that we are not in a simulation.

[1] https://www.reddit.com/r/SubSimulatorGPT2/comments/ez6qtj/do...


GPT-2 X 10 is misleading; this model size is 10x, sure, but that doesn't necessarily mean the output will be 10x better. For r/SubSimulatorGPT2, it did go from 355M to 1.5B recently, but the quality isn't necessarily 4x (although it did improve).

I'm more interested in shrinking models that maintain the same level of generative robustness (e.g. distillation, with distilGPT2)
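For context, the generic knowledge-distillation objective looks roughly like this in PyTorch; this is the standard soft-target recipe (Hinton et al. style), not the exact loss DistilGPT-2 was trained with (which also mixes in the ordinary language-modeling loss).

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both distributions, then push the student toward the teacher.
        t = temperature
        student_log_probs = F.log_softmax(student_logits / t, dim=-1)
        teacher_probs = F.softmax(teacher_logits / t, dim=-1)
        # scale by t^2 so gradient magnitudes stay comparable across temperatures
        return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

    # per training batch, roughly: loss = distillation_loss(student(x), teacher(x).detach())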


The model did go from a 100 different trained models to 1. So it seems to hold at least 4x as much in knowledge but maybe we should ask the people that actually trained it.

Btw, thank you for your gpt-2-simple; played around with it last weekend and it made building a toy surprisingly simple!


My favorite so far is the science bot. 75% of the top level posts are explaining why the submission was removed.

"What's the best way to kill a reddit thread?

Just make it a year long thread and wait for the year to end." -- circlejerkGPT2Bot


There's nothing inherently funny about entries like this one [1], I mean, there is, as in it is sort of funny how the AI got tricked so quickly into doing incest jokes, but I guess that was not the research team's intended goal.

[1] https://old.reddit.com/r/SubSimulatorGPT2/comments/f1ifp6/my...


if you look closer into the usernames, each "bot" is trained on a specific subreddit, and when taking the subreddit context into account, for this one post in particular, "r/twosentencehorror", i'd say it isn't half-bad.

Sure, this shows the robots are good at more or less fitting the formula for various types of typical articles or posts. Still, things that make no sense accumulate over sentences or paragraphs, depending on how formulaic a given format might be. So this is simultaneously generally impressive but not useful for any one thing.

Perhaps it could be used to improve autocomplete keyboards. The more words it can predict in advance the fewer keystrokes/taps are needed to convey thoughts

Gosh, this is unsettling. My brain is literally getting stuck in an infinite loop trying to read these and put them together coherently.

I expect we'll see some very interesting, very big models following it. I didn't dig too far into the code but the library looks very easy to use and will open up a lot of doors for people who have a few or a few thousand GPUs.

I must be missing it, where did you find a link to the code?

The code for the distributed training library, not the model - https://github.com/microsoft/DeepSpeed/
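For anyone curious, the basic usage pattern (a hypothetical, untested sketch; check the repo for the exact API) wraps an existing PyTorch model, with most settings such as batch size, fp16, and ZeRO coming from a JSON config passed on the command line:

    import argparse
    import torch
    import deepspeed

    parser = deepspeed.add_config_arguments(argparse.ArgumentParser())
    args = parser.parse_args()                 # expects e.g. --deepspeed --deepspeed_config ds_config.json

    model = torch.nn.Linear(1024, 1024)        # stand-in for a real transformer

    model_engine, optimizer, _, _ = deepspeed.initialize(
        args=args, model=model, model_parameters=model.parameters()
    )

    for x, y in data_loader:                   # your own data loader, not defined here
        loss = torch.nn.functional.mse_loss(model_engine(x), y)
        model_engine.backward(loss)            # handles fp16 loss scaling, gradient accumulation
        model_engine.step()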


At what stage of throwing compute & data at the problem do diminishing returns set in?

What GPU do I need to train it? Titan Mega RTX with 240GB of RAM?

A DGX-2 will do just fine.

If your model has 17 billion parameters, you missed some.

How long until the language models stabilize enough that we can bake them into a low-cost, low-power chip for edge uses?

There’s lots of work on distillation, smaller models, approximations, etc. People already have simpler forms of these running on smartphones. Models seem to be growing faster than we can make them small though :D

Yes, these things keep us up at night as well :-).

I think this is largely unnecessary, can't things like TPUs handle the inference?

Using TPUs is expensive. Also, some applications may need low latency.

Putting all your speech/text onto cloud machines runs counter to e2e encrypted messaging.

I think you can get hardware like a TPU for consumer products?

Those summaries look impressive, although a bit repetitive.

What they don't tell you is that these summaries are always hand picked from a few that were generated.

Quite possible but that also means that there is an opportunity to implement some sort of RL to choose the best possible summary.

I see what you did there?

I see what you see what I did there

Unfortunately they abstained from participation in the more popular SQuAD and GLUE benchmarks.

Those are question-answering and language-understanding benchmarks respectively, neither of which has been suitable for language generation model evaluation since GPT-1 was roundly beaten by BERT. GPT-2 didn't evaluate on them either.

[flagged]


Sorry I didn't think one could interpret it like this, this was inappropriate indeed. I edited.

Thank you <3 For reference, this reasonable, human exchange cost me 4 karma points, because as everyone knows, being a human on Hacker News is a 403.


