GPT-2: 6-Month Follow-Up (openai.com)
160 points by xcodevn 26 days ago | 93 comments

"Cornell University is studying human susceptibility to digital disinformation generated by language models."

"The Middlebury Institute of International Studies Center on Terrorism, Extremism, and Counterterrorism (CTEC) is exploring how GPT-2 could be misused by terrorists and extremists online."

"The University of Oregon is developing a series of “bias probes” to analyze bias within GPT-2."

But apparently no university studies the social and economic impact of using terabytes of public data to train algorithms that for all practical purposes end up being inaccessible to an average person.

If things go on the way they're going right now, in 20 years millions of people will be "mechanical turked". Most information processing tools will be mediated exclusively through companies like Google and Amazon. They will be less like normal tools (e.g. word processors) and more like systems you have to be a part of. Can you imagine the levels of inequality involved? The hyper-centralization of power? This is the foremost challenge presented by AI, not some hypothetical nonsense involving terrorists using a text generator.

And it's not like there aren't any solutions. Douglas Engelbart, for example, pointed out a great way of introducing technology into society without screwing most of the society over:


We kind of followed his vision for a while, with good results, but AI seems to be going in an entirely different direction.

We (the Middlebury Institute's CTEC) are an extremism and terrorism research lab, and so we're tracking the ways that tech is used by terrorists and extremists.

For a lot of nonstate orgs with sophisticated propaganda arms, an ideologically cohesive text generation capability would be a huge advantage in scaling up info ops. We are looking to measure whether or not GPT-2 or other neural text generators are useful for this, or if that risk is, as you say, nonsense.

I think their point isn't that terrorists leveraging this tech is not a problem. It is certainly a problem. But the greater problem is a few large entities being the only ones who have access to or control over it.

I think it's pretty clear that terrorists or any other bad actor will find great value & utility in this tech. The article from OpenAI says 'Humans can be convinced by synthetic text.' & research at Cornell found people find it almost as convincing as New York Times articles. I would be interested in learning about the methods you guys are using to determine it. I wonder how that could be measured?

So let's assume the answer is 'YES! This technology is dangerous'. The Middlebury program, Cornell, and more and more universities and research groups find the same thing. Then what will the recommendations be? Certainly not to release it into the wild. I think they will be to keep it locked up. To keep it in the hands of a few large and powerful companies, with the resources to 'manage' such a thing.

This seems to be what the original comment is trying to illustrate, and I think it's an interesting point to consider the implications of long term. The tech exists now. There is no going back. So is it worse to let it out of the box, or to let but a few have control over it?

In spite of all that we're studying wrt abuse potential, I (and my team) generally support open-sourcing tech, and I hope that we can contribute not to "oh this is dangerous, don't release" but rather to "oh this is dangerous, it's already released, what are we going to do now?"

Great, keep up the good work! Are you able to discuss how studies like yours work? Is it along the lines of determining if people can distinguish between human written and AI generated text? Sounds like a difficult question to answer.

I suspect they will release the full model in time. It's already trending in that direction.

do you not consider the scenario described in your comment's parent to be worse than any terrorist scenario?

Clearly. I also think that the pain of the centralization of tech like this will be felt in the scope of years, while the increase in the automation of propaganda and radicalization will be felt in the coming months.

Like I replied to the other poster, I strongly support open-sourcing tech. Centralization of tech like this helps exacerbate the problem: state and sophisticated nonstate groups have the resources to develop it indigenously, while the public can't dig into it and start developing a set of norms and best practices to approach detection and mitigation.

I'm not sure if this captures the full set of tensions at play. I'm a huge fan of Doug Engelbart, and he was my inspiration for a long time...but it turns out augmentation of power without checks and balances can end up very messy (I'm also a fan of James Madison, though Engelbart is closer to a first love!).

One could argue that I actually started work on misinformation through Engelbart. I was working on various projects specifically to achieve a vision like his, and it is still my guiding light. But it turns out that some technologies (initially focusing on Facebook and YouTube's engagement dynamics...), without appropriate checks and balances, are sort of anti-augmentation of intellect. They give an asymmetric advantage to those who are trying to weaken our intellects. So I ended up dropping those projects to attempt to address urgent misinformation issues in May of 2016.

Going back to GPT-2 and research release, I and my co-author recently went deep into understanding the types of risks and tradeoffs in a recent paper. You can see the summary here: https://medium.com/@aviv/reducing-malicious-use-of-synthetic... or go directly to arXiv https://arxiv.org/pdf/1907.11274.pdf . The goal of our paper is specifically to go past the angry invective of the "here is the most important problem" (in your case, "AI inequality") and actually dive into the weeds of threat models and tradeoffs.

You're just proving my point. If we go past the fashionable rhetoric about AI and apply the same reasoning to existing technologies, we could conclude that Photoshop was too dangerous to release to the public without "appropriate checks and balances". It can be used to create harmful memes and doctor photos! So what tools should an average person be allowed to use?

Also, with so many people concerned about disinformation, where is research on tools that would empower individual users to process information in better ways and make sound judgements?

It seems like this is missing the point of public data? When you make an edit to Wikipedia, anyone in the world can read it. You don't benefit when they read an article, but it doesn't cost you anything either.

"Anyone" includes researchers. That's part of the deal. Yes, they benefit, but you aren't harmed. That's zero-sum thinking.

I think you are correct that nothing is taken away from an author when someone reads their Wikipedia article.

Perhaps what the poster above you is saying is that there is a continuum of information, some more personal and sensitive, like your current location, and some less personal, like the Wikipedia article on Elephants.

Taking data about specific humans, (or humans in general), and turning it into code that has predictive power seems like a different type of power than the knowledge given by an encyclopedia.

It doesn't cost you anything when someone uses your Reddit posts to train a model either. The supposed harm is very tenuous.

Look up some semi-recent talks on social media by Jaron Lanier.

Whoa, it’s remarkable how well that paper holds up for being nearly 60 years old. I wonder if it’s just a timeless problem.


Seems accessible to me

That isn't running the full model.

It's running the latest model to be released, which is 774M. Since the initial announcement, larger models have been released every few months, so we're on track to have the full model released by 2020. (This is what OP is literally about -- the roadmap for releasing larger subsets of the model.)

"Accessible" means more than "there's a web service available somewhere that's running some version of it".

The OpenAI approach to managing the release of the larger dataset strikes me as totally flawed and upside down. The biggest concern the team seem to have is that the fully trained GPT2 model will be used to spread propaganda and misinformation. They also imply that the biggest hurdle to training a similar model is money needed to pay for the training resources.

The problem with this approach is that the users most likely to be malicious users of GPT2 are state actors. China, for example, already spends millions on an immense propaganda factory. Money is not a serious obstacle for a state. Given that other research entities are, by the sound of things, already far along with development of similar models it seems unlikely that China and the US don't already have functional models internally.

On the other hand, legitimate business and research is clearly hamstrung by withholding the full model. What we have is the maximum degree of inconvenience and the minimum degree of security. It feels almost perfectly analogous to the ban on liquids in airports. The motivation for that ban was that existing security measures couldn't detect liquids, but simply announcing that a ban was to be enforced didn't change the fact that liquids were undetectable. Instead millions of travelers were pointlessly inconvenienced at great cost.

Release the kraken already!

I’d recommend re-reading the original GPT2 announcement, particularly this section regarding their release policy:

This decision, as well as our discussion of it, is an experiment: while we are not sure that it is the right decision today, we believe that the AI community will eventually need to tackle the issue of publication norms in a thoughtful way in certain research areas.

This release approach is an experiment used to force the conversation around a release strategy before we actually and unambiguously need it.

Uh, you're forgetting about spammers and malware authors.

They do seem to have a bit of a mismatch between what they say and what they do. They wanted to be the non-profit benefiting humanity with their advanced research, but had to raise money because that wasn't working. And now they claim to have these impressive models, but also claim it's not safe to release them to the public. Okay, so what exactly are you developing that's good for anybody?

For finetuning GPT-2 on custom text, my gpt-2-simple package (https://github.com/minimaxir/gpt-2-simple) gets close to going OOM when finetuning the 345M model, even on a 16GB VRAM server GPU. Doubling the size of the model with the 774M model might cause it to not work at all, so I’ll need to test.

Of course, the default output from the model might be sufficient, although it’ll take twice as long to generate text compared to the 345M which is slow even on a GPU.

How exactly the large GPT-2 models are deployed is a mystery I really wish was open-sourced more.

>How exactly the large GPT-2 models are deployed is a mystery I really wish was open-sourced more.

TalkToTransformer.com uses preemptible P4 GPUs on Google Kubernetes Engine. Changing the number of workers and automatically restarting them when they're preempted is easy with Kubernetes.

To provide outputs incrementally rather than waiting for the entire sequence to be generated, I open a websocket to a worker and have it do a few tokens at a time, sending the output back as it goes. GPT-2 tokens can end partway through a multi-byte character, so to make this work you need to send the raw UTF-8 bytes to the browser and then have it concatenate them _before_ decoding the string.
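
A minimal sketch of the "concatenate before decoding" point, written in Python rather than browser JavaScript. It assumes the worker sends arbitrary byte chunks; the chunk contents and the `stream_decode` name are made up for illustration. An incremental decoder buffers a partial character until the rest arrives, which has the same effect as concatenating the bytes first.

```python
import codecs


def stream_decode(chunks):
    """Yield text from byte chunks that may split multi-byte UTF-8 characters."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        # Incomplete trailing bytes are held back until the next chunk.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush; raises UnicodeDecodeError if the stream ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Hypothetical chunks: "é" (bytes 0xC3 0xA9) is split across two messages.
chunks = [b"caf", b"\xc3", b"\xa9 ok"]
pieces = list(stream_decode(chunks))
joined = "".join(pieces)
print(pieces, joined)
```

Decoding each chunk on its own with `bytes.decode("utf-8")` would fail on the second chunk here, which is the problem described above. In a browser, `TextDecoder` with the `stream: true` option plays a similar role.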

While my workers can batch requests from multiple users, the modest increase in performance is probably not worth the complexity in most cases.

Any thoughts on the larger model? Doesn't seem materially better than the last one. Maybe the fine tuning exercises will show the benefit?

I've already tried training with nshepperd's codebase. Sampling works, but even with the memory checkpointing and freezing the embedding and using SGD rather than Adam, it OOMs on a 1080ti's 11GB. Either additional tricks or CPU training are going to be required.

I'm the lead researcher on the Middlebury Institute project looking at fine-tuning the bigger models, and I got OOM on 774M and 1.5B originally. I had to get an Azure instance with 24GB VRAM to handle it (using nshepperd's codebase). It works, but takes a while (~500 epochs takes 12 hours on a 100k word training dataset).

Ouch! So 11GB is nowhere close to being enough, then. I wonder if even switching to FP16 will be adequate?

Might be able to get 774M to work on a single GPU. I'm definitely not using all 24GB, so fp16 might be able to get it down enough.

How would you use fp16 to get it to work on a single GPU? And if you did, what GPU should you use?

Figured. I'll make changes to allow sampling from the default model more easily.

Are you using FP16?

No. We weren't sure if that'd be a good idea since it wasn't trained with low-precision, and 345M thankfully didn't require going that far. 774M might, though. (Another option is model parallelism since I have 2 GPUs and that might be enough, perhaps freezing more layers and training incrementally, or reducing the 1024 token window to smaller ones like 700.)

On the NVIDIA GPT-2 implementation:

>What would be the largest model one could train across 2x 2080Ti?

>~800M gpt2. This is largely due to the memory required to house parameters + optimizer states. If one uses a smaller optimizer than Adam, training something larger should be possible. Make sure to turn on activation checkpointing with --checkpoint-activations
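
A back-of-the-envelope sketch of the "parameters + optimizer states" point. It assumes fp32 values (4 bytes each), one gradient per parameter, and that Adam keeps two extra values per parameter while plain SGD keeps none. Activations and framework overhead are ignored, so real usage will be higher. The function and argument names are hypothetical.

```python
def training_memory_gb(n_params, bytes_per_value=4, optimizer_slots=2):
    """Rough lower bound in GiB: weights + gradients + optimizer state only."""
    weights = n_params * bytes_per_value
    grads = n_params * bytes_per_value
    optimizer = n_params * bytes_per_value * optimizer_slots
    return (weights + grads + optimizer) / 1024**3


# Parameter counts of the GPT-2 sizes mentioned in this thread.
for name, n in [("345M", 345e6), ("774M", 774e6), ("1558M", 1558e6)]:
    adam = training_memory_gb(n)                     # two Adam moments per parameter
    sgd = training_memory_gb(n, optimizer_slots=0)   # no optimizer state
    print(f"{name}: Adam >= {adam:.1f} GiB, SGD >= {sgd:.1f} GiB")
```

Under these assumptions the optimizer state is half of the Adam total, which is why switching optimizers or halving `bytes_per_value` with fp16 changes what fits on a card.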


They haven't released such models, though, and I don't know if it would be drop-in compatible with the OA GPT-2-774M checkpoint (they're training their own GPT-2s using their own webtext corpus).

I haven't looked into it at all myself, but he also said:

>We do provide training code that should work out of the box for gpt2 117M/345M


It would take forever (or $$$) to train even the 117M model from scratch.

I read that as meaning you can start with the actual pre-trained GPT-2 models, but I never got an answer when I specifically asked if that was the case.

Maybe you should try TensorFlow automatic mixed precision! https://github.com/zihangdai/xlnet/pull/200

fp16 saves a lot of memory and is worth doing. I've not had trouble fine tuning all these models with fp16.

Have you fine tuned 774 successfully using a single GPU?

I recommend Nvidia Apex, it offers several ways to mix precision.

Possibly a stupid question, but does AMD lift such restrictions on models with its unified memory, by allowing the GPU to "page out" chunks of vram to system ram?

My guess is it would be much slower, because the GPU processor would be waiting for data. Compare the bandwidth of system RAM to GPU memory (PCIe), 16GBps, with GPU memory to GPU processor, 900GBps.
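
To put the two bandwidth numbers in perspective, here is a rough sketch. It assumes the 774M model stored as fp32 (4 bytes per parameter) and a single full pass over the weights. The 16 and 900 GB/s figures are the ones quoted above, not measurements, and the names are hypothetical.

```python
def transfer_seconds(num_bytes, gb_per_s):
    """Time to move num_bytes at a given bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / (gb_per_s * 1e9)


weights_bytes = 774e6 * 4                      # assumed: 774M fp32 parameters
pcie = transfer_seconds(weights_bytes, 16)     # system RAM -> GPU over PCIe
vram = transfer_seconds(weights_bytes, 900)    # GPU memory -> GPU processor
slowdown = pcie / vram

print(f"PCIe: {pcie * 1e3:.0f} ms, VRAM: {vram * 1e3:.1f} ms, ratio: {slowdown:.1f}x")
```

The ratio is just 900 / 16, so any work that has to pull its weights across PCIe on every step would be bandwidth-bound by a factor of that order.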

No idea how modeling works on AMD. (Most discussions are about NVIDIA/CUDA.)

Isn't that a macOS specific feature?

<rant> Are there any real use cases for GPT-2? Does it solve any problem? I've read almost all the state-of-the-art leaderboards for all NLP tasks on paperswithcode.com, and the truth is that except for text generation, OpenAI does not hold a single state of the art; they are not even visible in the leaderboards. OpenAI is maybe the AI research center with the most funding, and compared to other well-known ones (Microsoft, Facebook, Google or even Zalando...) they are the ones with the fewest results.

From my observations, most SOTAs come from Chinese researchers by far, followed by DeepMind.

BTW isn't it a sad truth that not even one of the major AI actors has a draft of an AGI architecture, something comparable to CYC or OpenCog? https://wiki.opencog.org/w/CogPrime_Overview

Two other observations I would like to share: Many important NLP tasks have almost nobody publicly working on them, it seems. On paperswithcode.com or NLP-progress (on GitHub) some tasks have only one or two papers... And many others have not evolved since 2016. Most of the time it seems trivial to beat the old state of the art, just use BERT or XLnet on a task where nobody applied it before and hop, free state of the art for you! Yet researchers don't seem to chase those low-hanging, high-return fruits. Also, researchers seem to work a lot in isolation: many new generic improvements like new optimizers (RAdam for example) and new activation functions (Swish) make it possible to beat most older state-of-the-art results on almost all tasks just by using them. Yet researchers will take years before using them because of an absurd inertia. Also, unlike an open source program, BERT and XLnet have very low response and activity on GitHub despite major open issues... </rant>

Inference, question-answering, NER detection/disambiguation are pretty important NLP tasks (at least from a practitioner's perspective). While GPT-2 has gained mindshare for its generative capabilities, BERT and other pre-trained Transformer Encoder models are used for production workloads (we use BERT for text classification and explored using it for clustering).

It's useful to view the GitHub projects for these models as reference implementations. They're intended to provide a roadmap for reproducing the research and to aid in implementing production libraries.

Regarding the latter, take a look at the work by HuggingFace, the Flair project, Spark-NLP and others.

"Inference, question-answering, NER detection/disambiguation are pretty important NLP tasks" Yes indeed.

"While GPT-2 has gained mindshare for its generative capabilities, BERT and other pre-trained Transformer Encoder models are used for production workloads" You rephrased my point pretty well, while openAI search for "fun" tasks, deepmind and others allow progress on real world tasks.

You use BERT which is nice but do you consider using it's successor: XLnet?

"take a look at the work by HuggingFace, the Flair project, Spark-NLP and others." I was aware of Flair (from Zalando) but thank you for Huggingface and Spark-NLP, I will take a look!

The article mentions TabNine, which has made it to the HN frontpage before.

For reference, the blog article for the release of "Deep TabNine", the auto-completion engine based on GPT-2: https://www.tabnine.com/blog/deep/.

And earlier this month they released a local version: https://www.tabnine.com/blog/local/.

> Are there any real use cases?

Writing Bloomberg's "market wrap" articles.

> Most of the time it seems trivial to beat the old state of the art, just use BERT or XLnet on a task where nobody applied it before

If it was a high-return fruit somebody would be doing it. Not necessarily publishing papers about it or trying to beat useless artificial benchmarks on it.

"If it was a high-return fruit somebody would be doing it." Not necessarily.

"Not necessarily publishing papers about it" Yes.

"trying to beat useless artificial benchmarks on it." Wtf is this bullshit? AI benchmarks are what direct progress in AI and allow us to quantify it. And they are less and less artificial and more and more real-world: e.g. Quora, Reddit, Wikipedia and Facebook datasets.

>"If it was a high-return fruit somebody would be doing it." Not necessarily. With high likelihood given current funding of ML with $$$ applications but yes, not necessarily.

> AI benchmarks are what direct progress in AI

Sadly this is largely true.

The AI benchmarks + culture around it are the bullshit.

What actually moves forward the field of AI is:

- accessible

- reproducible

- comprehensible

results done with some thought and reasoning which is explained well, published well, and justified by more than some #$!& "our F1 score went up by 2 therefore our approach makes sense" bullshit.

AI benchmarks have done as much to retard progress in AI as they have to promote it.

Current AI benchmark top scores are gamification for big companies to waste even more resources running algorithms they can't explain. They are not machine learning, they are machine pissing contests.

Many important NLP tasks have almost nobody publicly working on them

Well, then perhaps you should go work on them, instead of ranting here.

Why the ad hominem? I am pointing out a problem of resource allocation in the AI research field. It's not up to me to fix that, but yes, I am actively working on a logical fallacy detector, which is the first in human history and works for the 256 possible forms of syllogisms; I'm expanding it to other logical forms such as modus ponens/tollens.
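
For anyone wondering where 256 comes from: in the traditional account, each of the three propositions in a syllogism is one of four types (A, E, I, O), and the middle term can be arranged in one of four figures. The sketch below only enumerates those combinations; it says nothing about how the detector itself works, and the variable names are hypothetical.

```python
from itertools import product

PROPOSITION_TYPES = "AEIO"  # universal/particular crossed with affirmative/negative
FIGURES = (1, 2, 3, 4)      # the four placements of the middle term

# One form = (major premise type, minor premise type, conclusion type, figure).
forms = [
    (major, minor, conclusion, figure)
    for major, minor, conclusion in product(PROPOSITION_TYPES, repeat=3)
    for figure in FIGURES
]

print(len(forms))  # 4**3 moods times 4 figures
```

Only a small subset of these forms is valid, which is what makes checking them mechanically a reasonable task.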

It's not up to me to fix that

There's nothing to fix. People work on what they want to work on. Things that seem important to you are not important to me, and the opposite. I'm OK with that.

"People work on what they want to work" ideally yes, but ultimately they work on something that please them AND that give them a decent salary. Funding should not go to fun (but useless in the real world) Nlp tasks. "Things that seem important to you are not important to me, and the opposite." and here's go relativism or the abandon of thought... It's indeed difficult to quantify cardinally the utility of an NLP task against an other, but we can agree on an ordinality (order of magnitude) E.g do you understand that POS tagging or dependency/constictuency parsing are angular tasks needed by much of the others. Thus making them the most important NLP tasks as they enable other Nlp tasks and are the most used in practice? You think that what exactly is more important? Are you talking about text generation? Why is that important? Something important enable to solve important problems in the real world. How text generation solve any real world problem is beyond my knowledge. But if you rationally think that it's more important that angular Nlp tasks, you can probably explain why and give an example or two? Yes, an AGI will need to emit text just as humans do, indeed. But before that she needs to understund the natural language before emitting it. GPT-2 maybe capture an aesthetic of the initial input pretty well but it does not generate meaningful sentences or only by accident, so no GPT-2 does not advance the quest to create an intelligent agent mastering natural language.

do you understand that POS tagging or dependency/constictuency parsing are angular tasks needed by much of the others.

I'm not sure. I rarely have to do that explicitly in my head. Perhaps a model should learn to infer/guess them implicitly, from context, just like I do.

what exactly is more important?

In my opinion, having a world model (for common sense) and situational awareness (e.g. through sensor fusion, or from prior conversational history, or using some externally supplied conditioning) would be far more important.

GPT-2 does not generate meaningful sentences or only by accident

You think adding POS tags would help it generate meaningful sentences?

"I'm not sure. I rarely have to do that explicitly in my head." Well, I can't prove it, but I strongly believe that our brains use parts of speech too, unconsciously. "Perhaps a model should learn to infer/guess them implicitly, from data." That's exactly what deep learning POS taggers do; they are far better than hard-coded algorithms. SOTA is 97.96% accuracy.

"In my opinion, having a world model (for common sense) and situational awareness (e.g. through sensor fusion, or from prior conversational history, or using some externally supplied conditioning) would be far more important." Haha, you basically want a general intelligence (AGI). I want it too! And not enough people work on "architecting" such a thing. OpenCog may interest you a lot, then. But the reality is that many other "simpler" tasks are needed to make this happen.

"having a world model (for common sense)" is an NLP task. There are some interesting results: https://github.com/sebastianruder/NLP-progress/blob/master/e... OpenAI does not work on this task sadly, at least for now.

"You think adding POS tags would help it generate meaningful sentences?" It would clearly be insufficient, yet necessary. I believe they already use internally a POS tagger and a dependency parser.

they already use internally a POS tagger and a dependency parser.

Interesting. Where did you see that?

Well, it was just a belief. I may be wrong. I asked them out of curiosity: https://github.com/openai/gpt-2/issues/168 So we will know.

How do you think it could be used there? A separate model just for providing tags, or the same model but trained to predict tags as well?

I was imagining using a separate model just for providing tags as they are very accurate. It would theoretically give gpt-2 useful data.

GPT-2 has not (yet) been trained to predict POS tags to my knowledge, nor have BERT, ERNIE 2.0 or XLNet, but I think they have great potential to improve POS accuracy.

Peer review is so shit at major AI conferences that his paper was most likely rejected for nonsensical reasons

Why do you say that? Does it add anything to the conversation?

Hopefully someone will make a working demo of it, like Adam King did for 345M. People should be able to experiment with this stuff without relying on the hype of press releases:


Not sure why OpenAI doesn't do this themselves. That fully aligns with their stated mission.

It appears TalkToTransformer has been updated for 774M: https://twitter.com/AdamDanielKing/status/116387950071694131...

I made a discord chatbot for interacting with gpt 2: https://github.com/itsmehemant123/gpt2-discord-bot

It's not particularly hard to check out the source & run it on your own machine. For those who don't know how to use git or are afraid of the command line, there's TalkToTransformer.

I was able to take all of Donald Trump's tweets and use GPT-2 to make a program that would mimic his tweets.

I found that it might be very effective. I have the test at


I got the information from trumptwitterarchive.com

I also explored creating a system that could recognize fake tweets from real ones and I believe I got 94% accuracy. It was a Bayes classifier but I think I have to double check my work.
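
The comment doesn't say which kind of Bayes classifier was used, so what follows is only a generic sketch of a multinomial naive Bayes text classifier with add-one smoothing, in plain Python. The training strings are made-up placeholder tokens, not tweets, and the function names are hypothetical.

```python
import math
from collections import Counter


def train(samples):
    """samples: list of (text, label). Returns word counts per label, label counts, vocab."""
    word_counts = {}
    label_counts = Counter()
    vocab = set()
    for text, label in samples:
        tokens = text.lower().split()
        word_counts.setdefault(label, Counter()).update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log prior plus summed log likelihoods."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        counts = word_counts[label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for token in text.lower().split():
            # Add-one smoothing so an unseen word does not zero out the score.
            score += math.log((counts[token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Placeholder data for illustration only.
samples = [
    ("alpha beta gamma", "real"),
    ("alpha beta delta", "real"),
    ("omega omega sigma", "generated"),
    ("sigma omega tau", "generated"),
]
model = train(samples)
print(predict(model, "beta alpha"), predict(model, "omega tau"))
```

An accuracy figure like the one above only means something when it is measured on held-out examples that were not used for training, which may be part of what needs double-checking.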

"I also explored creating a system that could recognize fake tweets from real ones and I believe I got 94% accuracy. It was a Bayes classifier but I think I have to double check my work." Is it open source? This interest me a lot!

Hmm, no mention of Megatron in their timeline? https://nv-adlr.github.io/MegatronLM

They do mention the 8B+ "GPT-2" model trained by NVIDIA, which is their reference to Megatron.

Oh! You are right! How did I miss that..

Even if GPT-2 were released, very very few would have the hardware to run it because of GPU RAM running out (and doing some sort of load-unload system would make training times infeasibly long). And those who have the hardware to run it have probably already made a version of their own, or have reasons not to. So I'm wondering if this GPT-2 hype is a genuine concern of OpenAI, or if it's mostly a PR flex to say 'Look at us, we made a good model!'.

As an example, look here by Nvidia https://devblogs.nvidia.com/training-bert-with-gpus/ who made GPT-2 8B, which is ~5 times as large as GPT-2.

>As part of our staged release strategy, our current plan is to release the 1558M parameter model in a few months, but it’s plausible that findings from a partner, or malicious usage of our 774M model, could change this.

This seems naive but I think it's a misdirection. Of course the model will have malicious users. Propaganda teams started testing its integration as soon as it was released. It's likely that OpenAI is counting on this for insights into HOW the model can be used maliciously. It's also possible that the model results have inherent trackable markers and OpenAI can later say that X% of social media posts were made using this model.

So what are the positive applications, aside from prettifying data like sports and weather reports?

Even with Skyrim's 800+ books, you frequently ran into the same book. Imagine libraries filled with plausible text that hides nuggets of lore seeded by developers. Along with more realistic text-to-speech this can allow games to support a large diversity of NPCs that have true radiant dialogue and sound more realistic than "I saw a mudcrab the other day".

With some modifications, I think models like this can outweigh even their nefarious applications:

Defense against text decomposition analysis. The model can be used to obfuscate writing patterns that can reveal a person's identity, either by randomizing form or standardizing it. Take your post and run it through the formatter to get the same idea and intent, but in a style that can't be traced to your other writing. Or you reform it into the style of Ernest Hemingway, like thousands of others.

Realtime plausible deniability encryption. Messages in a monitored chat can look like mundane conversation but contain encrypted messages. This would require the model accept seeds and work partially in reverse to diff two sets of text to reveal the hidden message.

In its current form it doesn't look like it can do any of those things, but there's the potential.

Are there any applications for the GPT-2 models beyond text synthesis? Inference, question-answering, NER detection/disambiguation, anything like this?

BERT and its descendants do better at all of this, and are the industry standard now https://arxiv.org/abs/1810.04805

Except that BERT is now obsoleted by https://github.com/zihangdai/xlnet (but xlnet would never have existed without BERT)

Kind of, there are a bunch of transformers that might perform better than BERT (Ernie 2.0 being stronger than xlnet, for example), but often this is a function of training size (xlnet trained on 10x more data than original BERT). Realistically there are now BERTs released finetuned for specialized corpora (biobert, clinical bert, etc) so if you want to work on those kinds of texts you are better off starting with a BERT that was previously fine-tuned on something close to your task (and then fine-tune it more yourself).

Well, your comment was really interesting to me because I didn't know about ERNIE 2.0, and its concept of continual learning seems to really be a step forward!

But some of your statements seem incorrect: "Ernie 2.0 being stronger than xlnet" XLNet is the neural net with the largest number of first places on benchmark leaderboards. Cf: https://paperswithcode.com/paper/xlnet-generalized-autoregre... While ERNIE 2.0 currently has 0 first places on paperswithcode.com https://paperswithcode.com/paper/ernie-20-a-continual-pre-tr...

"xlnet trained on 10x more data than original BERT" No, I've read on a GitHub issue of xlnet that xlnet base is the same size as bert base and xlnet large is the same size as bert large. (I don't know for ERNIE 2.0.)

Well, your point on finetuned BERT vs non-finetuned XLNet is interesting. RoBERTa is so fine-tuned it beats XLNet on some tasks. But generally non-finetuned XLNet beats finetuned BERT, and there are more and more finetuned XLNets each week. But your point does apply for RoBERTa, and for the few tasks where BERT has been applied but XLNet hasn't yet.

> xlnet trained on 10x more data than original BERT No, I've read on a github issue of xlnet that xlnet base is same size as bert base and xlnet large is same size as bert large. (I don't know for ernie 2)

It's not about the size of the model, but the training data. If you read the XLNet paper https://arxiv.org/pdf/1906.08237.pdf they clearly state in section 3.1:

"Following BERT [10], we use the BooksCorpus [41] and English Wikipedia as part of our pretraining data, which have 13GB plain text combined. In addition, we include Giga5 (16GB text) [23], ClueWeb 2012-B (extended from [5]), and Common Crawl [6] for pretraining. We use heuristics to aggressively filter out short or low-quality articles for ClueWeb 2012-B and Common Crawl, which results in 19GB and 78GB text respectively. After tokenization with SentencePiece [16], we obtain 2.78B, 1.09B, 4.75B, 4.30B, and 19.97B subword pieces for Wikipedia, BooksCorpus, Giga5, ClueWeb, and Common Crawl respectively, which are 32.89B in total"

If you compare to the BERT paper https://arxiv.org/pdf/1810.04805.pdf, the "Pre-training data" paragraph, also in section 3.1, says:

"Pre-training data The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus we use the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences."

So 32.89B subword pieces for XLNet vs 3.3B words for BERT, which is where the roughly 10x figure comes from.
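
A quick arithmetic check of the two quoted passages, using only the numbers given in them (in billions). Note the units are not identical: the XLNet figures are subword pieces after SentencePiece tokenization, while the BERT figures are words, so the ratio is only a rough comparison. The variable names are hypothetical.

```python
# Billions of subword pieces, from the XLNet quote above.
xlnet_pieces_b = {
    "Wikipedia": 2.78,
    "BooksCorpus": 1.09,
    "Giga5": 4.75,
    "ClueWeb": 4.30,
    "Common Crawl": 19.97,
}

# Billions of words, from the BERT quote above (800M and 2,500M).
bert_words_b = {"BooksCorpus": 0.8, "English Wikipedia": 2.5}

xlnet_total = sum(xlnet_pieces_b.values())
bert_total = sum(bert_words_b.values())
ratio = xlnet_total / bert_total

print(f"XLNet: {xlnet_total:.2f}B, BERT: {bert_total:.1f}B, ratio: {ratio:.1f}x")
```

Because a word is often split into more than one subword piece, the ratio in actual words would be somewhat lower than this.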

We've also run fine-tuning experiments supplementing with an additional private medical corpus (~10B words) and felt starting from clinical-bert was better than xlnet (for our rather specific use cases).

Then Facebook's RoBERTa came out, which beat XLNet and is essentially a more intelligently trained BERT. I included XLNet as being a descendant of BERT, but I guess they are all descendants of GPT-1.

Gee, it seems like the state of the art moves very quickly. These are less than twelve months old?!

All of those were part of the original benchmarks GPT-2 was evaluated on.

I really doubt it as they are not on any state of the art leaderboard from both NLP-progress and paperswithcode.com

I'm curious about the "fine-tuning based detection" mentioned in the report ("Fine-tunes a language model to 'detect itself'... over a range of available settings"). Does anyone know good articles/papers (or have an off-the-top tl;dr) to get a high-level grasp of "self-detection" for generative models?

Hiya, I work at OpenAI. I think the Grover paper is a good place to read about some of this: https://arxiv.org/abs/1905.12616 We're likely publishing more on detecting fine-tuned outputs in the future, also.

Many thanks! Looking forward to reading the OpenAI research when it comes out as well.

Anyone wired a "talktotransformer"-style system to this one yet? Would like to see how it works without going through the steps of setting it up.

EDIT: Looks like https://talktotransformer.com/ already uses the 774M one!
