I don't think this would be an improvement per se. The byte pair encodings that GPT-3 (and GPT-2) use already take higher-level text representations and compress them down into tokens, which are then reflected in the training of the model.
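To make that concrete, here is a minimal sketch using the Hugging Face GPT-2 tokenizer (essentially the same BPE vocabulary GPT-3 uses); the sample sentence is just for illustration:

    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    text = "The parties hereby agree to indemnify and hold harmless the Company."
    token_ids = tokenizer.encode(text)
    tokens = tokenizer.convert_ids_to_tokens(token_ids)

    # Common words and boilerplate phrases compress to roughly one token
    # apiece; rarer words get split into multiple sub-word fragments.
    print(tokens)
    print(len(text), "characters ->", len(token_ids), "tokens")

So the compression already happens below the level of words; whether adding a second, structural layer (clauses, sections) on top of that would help is the open question.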
As an extreme example, if I put 50 contracts involving multiple different parties into one big document and the attention mechanism was triggered on the first document title, would the subsequent document titles sufficiently demarcate a new contract to reset the context? Or would the attention stay "on" the first high-accuracy context match and mix up the terms and parties? In the context of the paper, GPT-3 missed a ton of issues, which indicates to me that the attention mechanism is not being properly triggered. Again, I may be wrong that tokenizing and predicting based on clauses, paragraphs, and sections would help improve the output, but that was one way I thought we could capture the different levels of context.
The goal should be to properly ascertain context at multiple levels. When I am reviewing a document, I scan the document title and section headings to grasp the structure of the document before diving into the relevant clauses and their elements. If there are external references, I will integrate them before reading the clause in order to capture the complete rule. A crucial mistake in contractual interpretation is falsely attributing an element from one rule or section to another, or excluding an element. What applies in context A might not apply in context B, or it may be conditional on another factor C.
I did not intend to argue against deep meaning as an approach categorically. Sorry if I gave that impression. I think the right approach would combine deep meaning with frequently updated fixed references. How that plays out on novel vs. boilerplate language would be an interesting follow-up. Boilerplate language seems to favor authorities more than novel language does, but there are a lot of hypotheticals to consider.
That said, it doesn't hurt to have multiple attempts (aside from the cost of using GPT-3, ahem).
Since words are chosen in order, picking one word differently will likely mean that every word that follows will be different, and this choice between suffixes is not made intelligently.
This randomness is there for a reason. Yes, you can turn it off, but then you get loops, which is obviously not intelligent. A bit of randomness makes it less obvious.
In more open-ended generation such as summarization, question answering, and story generation, beam search leads to poor and repetitive outputs. Different (stochastic) sampling methods lead to more interesting, diverse, and functional outputs.
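A rough illustration of that trade-off, using GPT-2 via transformers as a stand-in (GPT-3's decoder isn't publicly runnable); the prompt and parameters are arbitrary:

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    inputs = tokenizer("The contract states that", return_tensors="pt")

    # Greedy decoding: deterministic, but prone to repetitive loops.
    greedy = model.generate(**inputs, max_length=50)

    # Nucleus (top-p) sampling: a little randomness avoids the loops.
    sampled = model.generate(**inputs, max_length=50, do_sample=True,
                             top_p=0.9, temperature=0.8)

    print(tokenizer.decode(greedy[0]))
    print(tokenizer.decode(sampled[0]))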
GPT-3, Bloviator: OpenAI’s language generator has no idea what it’s talking about
GPT-3 seems like an extended Eliza-effect device. It has a general database of word fragments sufficient to give the impression it's following along on any topic, but that doesn't involve any coherence; rather, it shows how much of ordinary language is "just associations" (which isn't entirely unimportant, but still).
Altogether, it even seems less sensible than narrow, hand-tuned chatbots like Alicebot, but it's harder to see its limitations because of its huge dataset.
One thing I would note is the potential for a feedback loop between summaries and the model: if a summary of a specific ToS or piece of law is wrong, you can hardwire an expert-vetted one (there are only so many ToSes or pieces of law, and it'll be a long tail), and you can feed the corrected one back in as training data to finetune the model. The bigger the GPT model, the smarter it is, and the less feedback it takes to correct its summaries: https://openai.com/blog/learning-to-summarize-with-human-fee...
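A back-of-the-envelope sketch of what that loop could look like, using GPT-2 via transformers since GPT-3's weights aren't public; `corrected_pairs` is a made-up placeholder for the expert-vetted (document, summary) data:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    # Hypothetical: each pair is (full document, expert-corrected summary).
    corrected_pairs = [("<full ToS text>", "<expert-vetted summary>")]

    model.train()
    for doc, summary in corrected_pairs:
        text = doc + "\ntl;dr: " + summary + tokenizer.eos_token
        ids = tokenizer(text, return_tensors="pt", truncation=True,
                        max_length=1024).input_ids
        loss = model(ids, labels=ids).loss  # standard causal-LM objective
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

(The OpenAI post linked above uses human feedback as a reward signal rather than plain finetuning, but the data-collection story is the same: expert corrections are exactly the supervision a summarizer is starved for.)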
I love seeing all of these articles about legal summarization, but they always cover abstractive summarization! I want an effective highlighter model. To be fair, I did build a system for using transformers to do extractive summarization in an unsupervised manner, but the results aren't that great. I'm sure that lawyers would find something that could highlight the most legally salient sections extremely useful if it were reasonably accurate.
I really did figure that we'd get effective word-level extractive summaries before we'd get effective abstractive summaries, but I guess that intuition was wrong...
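To give a sense of what "unsupervised extractive" can mean in practice, one common baseline (a generic sketch, not the system I built) is to embed each sentence with a transformer and highlight the ones closest to the document centroid; this assumes the sentence-transformers package, and `contract_text` is a placeholder:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

    contract_text = "..."  # placeholder: the document to highlight
    sentences = contract_text.split(". ")  # crude sentence splitting

    embeddings = model.encode(sentences)
    centroid = embeddings.mean(axis=0)

    # Cosine similarity of each sentence to the document as a whole.
    scores = embeddings @ centroid / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(centroid))

    # "Highlight" the five most central sentences, in document order.
    for i in sorted(np.argsort(scores)[-5:]):
        print(sentences[i])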
It is also true that high-quality data is necessary. There is a reason why lawyers rely on Westlaw and LexisNexis to search for relevant laws and even scholarly articles: they are trusted sources. A better approach would rely on something like those, with a narrower universe of quality sources. There is a ton of labeling work that needs to be done, even beyond the "KeyCite" type of labels Westlaw applies to documents. Note that YC company www.rossintelligence.com ran into some trouble recently with Westlaw and LexisNexis.
The quality vs. quantity of data debate is particularly relevant here. The power of GPT-3 is supposed to come in part from the sheer scale of its training dataset. It would be nice to leverage the semantic training from a large non-legal dataset so the model can write in layman's terms, while sourcing the authorities from more closely vetted sources.
I want to point out that a big part of building anything on GPT-3 (or other large LMs for that matter) is "prompt engineering", which means you try out thousands of prompts and sampling parameters until you find something that works reasonably well for your use case.
Taking two default templates and a few different temperatures is like taking some tutorials for a new framework, building a proof of concept from them, then making a judgement from that. Sure, it can provide a good first assessment, but that's it. You would need much deeper experience to come to a meaningful conclusion.
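For a sense of what that deeper pass looks like, prompt engineering in practice is basically a brute-force sweep like this (2020-era OpenAI Completion API; `sample_contract` and the scoring function are hypothetical stand-ins for your test document and quality metric):

    import openai

    openai.api_key = "..."  # your API key

    sample_contract = "<contract text to summarize>"  # hypothetical test doc

    def score_output(summary):
        # Hypothetical scorer: in reality you'd check against hand-vetted
        # reference summaries; length is just a dumb stand-in here.
        return -abs(len(summary.split()) - 80)

    templates = [
        "{doc}\n\ntl;dr:",
        "Summarize the following contract for a layperson:\n\n{doc}\n\nSummary:",
    ]
    temperatures = [0.0, 0.3, 0.7]

    results = []
    for template in templates:
        for temp in temperatures:
            resp = openai.Completion.create(
                engine="davinci",
                prompt=template.format(doc=sample_contract),
                temperature=temp,
                max_tokens=150,
            )
            results.append((score_output(resp.choices[0].text), template, temp))

    print(max(results))  # best (score, template, temperature) combination

And that's just six combinations; multiply out real prompt variations, stop sequences, top-p, and frequency penalties and you see why it takes thousands of trials.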
As someone who is instinctively skeptical about these language models, this kind of statement makes my antennae twitch. You have this black box model that generates all sort of plausible outputs, you jiggle the handle until those outputs meet your expectations for some range of tested inputs, and then ... you assume it's just going to work?
For parlor tricks, or even low-stakes real world activities that might be enough - but how can you trust it?
How do you know you're not just "programming" a chatbot (by indirectly filtering for the text pieces you want), just in the most indirect and unguaranteed fashion possible? I suppose the advantage is that you can say "look, it's intelligent".
For instance, they didn't even use the word "understand" once to refer to what GPT-3 is doing.
I think there is some constructive dialogue in this thread, and I am thrilled by that. Lawyers and engineers need to work together on this to be successful. The goal of summarization is to enable more people to accurately understand text in less time. It is pretty clear that the feedback of a lawyer in the training loop (at least to label meaning and context) would lead to a significant improvement. IIRC, Andrej Karpathy labeled a lot of data when his team achieved a 50% jump in classification and detection accuracy on ImageNet.
Can more (and better) labeled data get us to an accuracy level that is good enough to generate term sheets out of contracts? I would like to find out. I am interested in diving deeper and connecting with anyone who is game to tackle some of the critical challenges.
Some of the motivation on my side is judging at which point it becomes appropriate to explore having an algorithm provide a summary of a medical conversation, since my day job at Medcorder is building a system that acts as a patient advocate to help them better understand what their doctors are saying. The universal feature request from day one has been summaries (which are fraught if you get them wrong!), so I'm keen to develop a sense of when we're going to get there.
(As for the downvotes on my comment above, I'd bet they are, appropriately ironically, from people who themselves didn't actually read the article, since the comment was made with a wink to the fact that the article explicitly evaluates the appropriateness of "tl;dr" summaries and found them wanting.)