GPT-3, Esq? Evaluating AI Legal Summaries [pdf] (davidvictorrodriguez.com)
51 points by gavelin 15 days ago | 41 comments

> Second, text ought to be tokenized (a term used in natural language processing wherein text is assigned a numerical value) at many different levels (character, word, sentence, paragraph, section, etc.) in order to make predictions that are relevant to the excerpt of a text, while also remaining consistent to the broader document. This is challenging because doing so involves a tremendous amount of computational resources. However, it may be necessary to accurately capture meaning at different levels of abstraction.

I don't think this would be an improvement per se. The Byte Pair Encodings that GPT-3 (and GPT-2) use are constructed so that they already take higher-level text representations and compress them down into tokens, which are then reflected in the training of the model.
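To make the compression point concrete, here is a minimal sketch of how BPE merges are learned (a toy version, not the actual GPT-2/GPT-3 tokenizer, which operates on bytes with a pre-tokenization step): start from characters and repeatedly fuse the most frequent adjacent pair into a new token.

```python
from collections import Counter

def bpe_merges(text, num_merges):
    """Toy BPE: start from characters and repeatedly merge the most
    frequent adjacent pair of tokens into a single new token."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # fuse the pair into one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

# Frequent fragments like "th" / "the" get merged first.
merges, tokens = bpe_merges("the theme of the thesis", 4)
```

The point is that the merges are driven purely by frequency statistics, so common word fragments become single tokens without any explicit sentence- or paragraph-level representation.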

And the attention mechanism should take care of "sentence, paragraph, section" level context. Attributing this to tokenization is a weird mistake to make.

Tarq0n: Also a great point. The question then is whether the attention mechanism is being triggered on the proper word or character sequences. Lawyers employ a form of attention mechanism when “issue spotting.” For example, in a non-compete I might scan for the duration (how many years until the client can join a similar venture?) and breadth (what is the definition of a similar venture?). For an attention mechanism to work well for legal summaries, it seems to me that it must trigger on many relevant context triggers at different levels.

As an extreme example, if I put 50 contracts involving multiple different parties into one big document and the attention mechanism was triggered on the first document title, would the subsequent document titles sufficiently demarcate a new contract to reset the context? Or would the attention stay “on” the first high-accuracy context match and mix up the terms and parties? In the context of the paper, GPT-3 missed a ton of issues, indicating to me that the attention mechanism is not being properly triggered. Again, I may be wrong that tokenizing and predicting based on clauses, paragraphs, and sections would help improve the output, but that was one way I thought would capture the different levels of context.

minimaxir: Great insight and probably merits an edit for precision. My understanding is that Byte-Pair encoding is done at the character and word level (and maybe even the sub-character level), but not at the higher-level representations such as paragraph, section—and beyond. Am I mistaken? Taking a few steps back, is that an effective way to pinpoint context?

The goal should be to properly ascertain context at multiple levels. When I am reviewing a document, I scan the document title and section headings to grasp the structure of the document prior to diving into the relevant clauses and their elements. If there are external references, I will integrate them before reading the clause in order to capture the complete rule. A crucial mistake in contractual interpretation is falsely attributing an element from one rule or section to another, or excluding an element. What applies in A context might not apply in B context, or it may be conditional on another factor C.

The criticism I intended to make was that GPT-3 likely is not (accurately) identifying the right context “bucket” before making the prediction, and that this perhaps could be improved by tokenizing different levels of context. I may have falsely reasoned that this could be best accomplished through tokenization at different levels. In the context of the paper, GPT-3 referenced Tinder and MommyMeet when I inputted LinkedIn’s Privacy Policy. Also, if GPT-3 contains Section 230 in its training data, it did not look to the definition list at the bottom of the statute to define a key term (I excluded the definitions from the input). My hunch is that a better approach would localize based on the document, section, and clause type to precisely narrow the context before utilizing character- and word-level predictions.
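The "localize first" idea could be sketched as a preprocessing step: split the document into labeled sections before any model sees it, so each prediction runs against a narrowed context. The heading pattern and pipeline below are illustrative assumptions, and real contracts would need a far more robust parser.

```python
import re

def split_sections(document):
    """Rough sketch: split a contract into (heading, body) pairs
    using numbered-heading patterns like '1. Definitions' or
    'Section 2. ...'. Purely illustrative, not production-grade."""
    pattern = re.compile(r'^(?:Section\s+\d+\.?|\d+\.)\s+(.+)$', re.MULTILINE)
    matches = list(pattern.finditer(document))
    sections = []
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(document)
        sections.append((m.group(1).strip(), document[start:end].strip()))
    return sections

doc = """1. Definitions
"Similar Venture" means any business competing with the Company.
2. Non-Compete
The Employee shall not join a Similar Venture for two years."""

sections = split_sections(doc)
# Each section could then be summarized with its own narrowed prompt,
# resolving defined terms from the Definitions section first.
```

The same demarcation step would also address the 50-contracts-in-one-document problem: each contract's sections get their own context bucket instead of bleeding into one another.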

I would claim it is easy to think you're seeing GPT-3 fail because it's taking the wrong association path (not noticing the hierarchical decomposition of the context). But the general problem is that there is no fixed decomposition of the meaning of a text.

It's tempting to think a "simple" procedure like summary doesn't need "deep" meaning, but it doesn't seem like that's the case. It should especially be noted that a lot of the plausible "summaries" could be summaries of any privacy policy, or just what people say about privacy policies on the net. It would have been interesting to give the system a novel bit of text to interpret instead.

A bright-line rule like "It is unlawful to exceed 65MPH in any vehicle on Highway 101" has a pretty narrow meaning and could be concretely decomposed sufficient for the average human to understand. I think you are right to point out that standards such as "It is unlawful to exceed 65MPH in any vehicle on Highway 101 unless it is reasonable under the circumstances" break down into many more concepts (what is reasonable?) and therefore seem like there is not a fixed point to decompose the text. In that case, lawyers look to precedent, among other sources, to get guidance on how the rule plays out in a sufficient number of contexts to determine the threshold of reasonability to properly analogize to new circumstances.

I did not intend to argue against deep meaning as an approach categorically. Sorry if I gave off that impression. I think the right approach would include a combination of deep meaning and frequently updated fixed references. Regarding how that plays out on novel v. boilerplate language would be an interesting follow-up. Boilerplate language seems to favor authorities more than novel language, but there are a lot of hypotheticals to consider.

Consider boilerplate language in a Privacy Policy that reads that “The Company stores your data and aggregates your data with data from multiple users to make inferences of users habits and may from time to time sell that data to 3rd parties.” In 2007 that might have meant that Google figured out that you and Jane both like M&Ms and sells chocolate companies your craving data. But what if Garmin recorded your GPS data and that of your friend who sleeps on your couch on Fridays, and sold your data to a 3rd party who then mailed you wedding planning cards with your faces on them? The reference table, unless prospectively updated with new hypotheticals, would be challenged to explain that hypothetical. Would deep meaning fare better? I can think of a few ways it could.

GPT-3 picks words literally at random (according to a probability distribution) so it would be good to run each experiment multiple times to get a sense of the probability distribution. I doubt it would change the conclusions, though.

There's a big difference between sampling from a learned probability distribution and "picks words literally at random". The temperature = 0 examples have zero sampling by construction, while higher temperatures introduce a degree of sampling.
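The temperature = 0 versus temperature > 0 distinction can be shown in a few lines; this is a simplified sketch of temperature sampling over raw logits, not OpenAI's actual implementation:

```python
import math
import random

def sample_next_token(logits, temperature):
    """Pick a token index from logits. temperature == 0 is greedy
    argmax (fully deterministic); temperature == 1 samples from
    the raw softmax; higher temperatures flatten the distribution."""
    if temperature == 0:
        # Greedy decoding: no sampling at all.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.0, 0.1]
sample_next_token(logits, 0)    # always returns index 0
sample_next_token(logits, 0.7)  # usually 0, sometimes 1 or 2
```

So "random" is doing a lot of work in the grandparent comment: the draw is weighted heavily toward what the model considers likely, and at temperature 0 there is no draw at all.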

That said, it doesn't hurt to have multiple attempts (aside from the cost of using GPT-3, ahem).

I guess I could have left out "literally," but picking words from a learned probability distribution is picking at random.

Since words are chosen in order, picking one word differently will likely mean that every word that follows will be different, and this choice between suffixes is not made intelligently.

This randomness is there for a reason. Yes, you can turn it off, but then you get loops, which is obviously not intelligent. A bit of randomness makes it less obvious.

In a professional setting this is definitely done: the generations are then ranked with separate models that predict different quality metrics, such as "interestingness", "safety" (in an inclusiveness sense), whether the answer fits the style you want, and whether the facts in the answer seem to make sense. It makes a big difference, actually.

Good tip! I repeated a few inputs to see if the variation was significant enough to warrant including that, and as your intuition suggested, it was not with the parameters I selected. A more robust experiment should definitely include repeat attempts.

I thought generating text from language models was a deterministic operation, searching for the maximum likelihood sequence using beam search?

It is now common knowledge in the NLP community that beam search only works well in situations where the output space is very constrained, such as neural machine translation.

In more open-ended generation such as summarization, question answering, and story generation, beam search leads to poor and repetitive outputs. Different (stochastic) sampling methods lead to more interesting, diverse, and functional outputs.

Great blog post here exploring the different methods of generation: https://huggingface.co/blog/how-to-generate

It can be as deterministic as you want it to be -- there are parameters that control how much randomness is used during the sampling process. Finding the most probable sequence from the learned distribution is intractable for all but the shortest of sequences. As you've said, beam search is used, but this is just a local search heuristic and provides no guarantees of producing the most probable output.
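A toy version makes the "local search heuristic" point concrete: beam search only ever keeps the top-scoring prefixes at each step, so a prefix that looks weak early but leads to a strong continuation can be pruned before its payoff is visible. The bigram table below is an invented example.

```python
import math

def beam_search(step_logprobs, beam_width, length):
    """Toy beam search. step_logprobs(prefix) returns a dict of
    token -> log-probability. Keeps only the beam_width best
    prefixes at each step, so there is no guarantee of finding
    the globally most probable sequence."""
    beams = [((), 0.0)]
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for token, lp in step_logprobs(prefix).items():
                candidates.append((prefix + (token,), score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

def toy_model(prefix):
    """An invented bigram 'language model' over five tokens."""
    last = prefix[-1] if prefix else "<s>"
    table = {
        "<s>": {"the": math.log(0.6), "a": math.log(0.4)},
        "the": {"the": math.log(0.5), "cat": math.log(0.5)},
        "a":   {"dog": math.log(0.9), "the": math.log(0.1)},
        "cat": {"the": math.log(1.0)},
        "dog": {"the": math.log(1.0)},
    }
    return table[last]

best = beam_search(toy_model, beam_width=2, length=3)[0]
```

Note that at step two the prefix ("the", "cat") is pruned from the width-2 beam even though its continuation scores better than the surviving ("the", "the") branch — exactly the kind of local-search failure the parent comment describes.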

That article just reinforces what I gathered from this one:

GPT-3, Bloviator: OpenAI’s language generator has no idea what it’s talking about


It's remarkable that the article seems to be saying "it's not working now but maybe with a few tweaks we could swing this" where the real situation seems to be that this won't be doing any "meaning-critical" tasks for a long time, if ever.

GPT-3 seems like an extended Eliza-effect device [1]. It seems to be a general database of word-fragments sufficient to give the impression it's following along on any topic, but one that doesn't involve any coherence; rather, it shows how much of ordinary language is "just associations" (which isn't entirely unimportant, but still).

Altogether, it even seems less sensible than narrow, hand-tuned chatbots like Alicebot [2], but it's harder to see its limitations because of its huge dataset.

[1] https://en.wikipedia.org/wiki/ELIZA_effect [2] https://en.wikipedia.org/wiki/Artificial_Linguistic_Internet...

Also of interest is Arbel's paper on GPT-3 legal summaries (written with AID at the time, though he recently got GPT-3 access): https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3740356

One thing I would note is the potential for a feedback loop between summaries and the model: if a summary of a specific ToS or piece of law is wrong, you can hardwire an expert-vetted one (there's only so many ToSes or pieces of law and it'll be a long tail), and you can feed back in the correct one as training data to finetune the model. The bigger the GPT model, the smarter it is, and the less feedback it takes to correct its summaries: https://openai.com/blog/learning-to-summarize-with-human-fee...
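The "hardwire an expert-vetted summary" idea can be sketched as a lookup-first pipeline. Everything below is a hypothetical design, not any real product's API: serve the vetted summary when the exact document is known, and fall back to the model otherwise.

```python
import hashlib

# Hypothetical lookup-first pipeline: map a hash of the document
# to an expert-vetted summary; fall back to the model when unseen.
VETTED_SUMMARIES = {}  # sha256(document) -> expert-written summary

def summary_key(document):
    return hashlib.sha256(document.strip().encode("utf-8")).hexdigest()

def summarize(document, model_summarize):
    """Return (summary, source). Model outputs that later get
    corrected by an expert can be added to VETTED_SUMMARIES and
    used as fine-tuning data."""
    key = summary_key(document)
    if key in VETTED_SUMMARIES:
        return VETTED_SUMMARIES[key], "vetted"
    return model_summarize(document), "model"

tos = "We may share your data with third parties."
VETTED_SUMMARIES[summary_key(tos)] = "Your data can be sold."
result = summarize(tos, lambda d: "(model output)")
# -> ("Your data can be sold.", "vetted")
```

Exact-match hashing is deliberately naive here (any whitespace change misses the cache); a real system would normalize or fuzzily match the boilerplate, but the long-tail argument in the parent comment still holds.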

Thanks for the paper link! I think your reasoning of hardwiring boilerplate to an expert-vetted (or written) summary is a much more accurate approach. If only we could scrape the clause explanation footnotes from quality sources I would not be forced to write them!

This is at the intersection of my research interests. I'd love to see what happens if you run it on a large debate case with slight variations of the input prompt queries.

I love to see all of these articles about legal summarization, but it's always covering abstractive summarization! I want an effective highlighter model. To be fair, I did build a system for using transformers to do extractive summarization in an unsupervised manner, but the results aren't that great. I'm sure that lawyers would find something that could highlight the most legally salient sections to be extremely useful, if it's reasonably accurate.
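For readers unfamiliar with the extractive approach, here is a minimal unsupervised sketch: score each sentence by its centrality to the rest of the document and "highlight" the most central ones. To stay self-contained it uses bag-of-words cosine similarity; a transformer-based version would swap in sentence embeddings for the vectors.

```python
import math
import re
from collections import Counter

def extract_top_sentences(text, k=2):
    """Unsupervised extractive sketch: score each sentence by its
    summed cosine similarity (bag-of-words) to all sentences, then
    return the k most central sentences in document order."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    vecs = [Counter(re.findall(r'\w+', s.lower())) for s in sentences]

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    scores = [sum(cosine(v, u) for u in vecs) for v in vecs]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

text = ("The Company collects your data. "
        "The Company may sell your data to third parties. "
        "We like cats. "
        "Your data may be shared with partners of the Company.")
top = extract_top_sentences(text, k=2)
```

The off-topic "We like cats." sentence scores lowest and is never highlighted. This word-overlap heuristic is of course far from "legally salient" — which is presumably where the parent's transformer-based attempt ran into trouble.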

I really did figure that we'd get effective word-level extractive summaries before we'd get effective abstractive summaries but I guess that intuition was wrong...

If keyphrase extraction and extractive summarization of legal documents interests you, I have a side project I’ve been wrenching on that I could use a hand with. Contact info in my profile.

The author seems to forget that fine-tuning exists altogether, which is how real-world NLP applications are actually made. In a real-world setting, what the following paragraph asks for would be achieved through model fine-tuning on high-quality data: "First, there must be greater transparency. The sources of GPT-3’s references must at least be referenceable and perhaps tweaked to follow a proper hierarchy of authorities. Although it is challenging to audit a 175 billion parameter algorithm, it would be beneficial to understand the most influential semantic parameters used to generate the output text. This would ideally enable users to choose word choice (perhaps as a level of sophistication), appropriate voice, and tone."

You are quite correct that fine-tuning is necessary to improve accuracy. I did not intend to dismiss that. The point I intended to make there is that it would be nice to be able to see whether, when interpreting a statute, GPT-3 relied heavily on a blog interpreting a statute v. legislative history v. Supreme Court precedent. The right approach would be to control the proper hierarchy of authorities. It would be helpful to understand, even as a textual matter, what GPT-3 was most heavily relying on for a given prediction. That would speed up categorically fixing bad prediction patterns.

It is also true that high quality data is necessary. There is a reason why lawyers rely on Westlaw and LexisNexis to search for relevant laws and even scholarly articles. They are trusted sources. A better approach would rely on something like those with a more narrow universe of quality sources. There is a ton of labeling work that needs to be done, even beyond the “KeyCite” type of labels Westlaw applies to documents. Note that YC company www.rossintelligence.com ran into some trouble recently with Westlaw and LexisNexis.

The quality v. quantity of data debate is particularly relevant here. The power of GPT-3 is in part supposed to come from the sheer scale of its training dataset size. It would be nice to leverage some of the semantic training from a large non-legal dataset to be able to stylistically output in layman’s terms while sourcing the authorities from more closely vetted sources.

To those that are interested in the state of the art in commercial legal summarization (US law) as of Q1/2021: https://arxiv.org/pdf/2102.05757.pdf

Great link. Thank you for sharing this.

Thank you for your analysis, it's great to have the insights of someone with both legal and ML backgrounds.

I want to point out that a big part of building anything on GPT-3 (or other large LMs for that matter) is "prompt engineering", which means you try out thousands of prompts and sampling parameters until you find something that works reasonably well for your use case.

Taking two default templates and a few different temperatures is like taking some tutorials for a new framework, building a proof of concept from them, then making a judgement from that. Sure, it can provide a good first assessment, but that's it. You would need much deeper experience to come to a meaningful conclusion.
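Mechanically, prompt engineering is often just a search loop over templates and sampling parameters. The sketch below is purely illustrative: `complete` stands in for a real API call and `score` for an evaluation function (e.g. a lawyer-vetted rubric), neither of which exists as named here.

```python
import itertools

def search_prompts(complete, score, prompt_templates, temperatures, document):
    """Hypothetical prompt-engineering loop: try every template and
    temperature, score each completion, keep the best configuration.
    `complete` and `score` are stand-ins for a real model API call
    and a real quality metric."""
    best = None
    for template, temp in itertools.product(prompt_templates, temperatures):
        prompt = template.format(document=document)
        output = complete(prompt, temperature=temp)
        s = score(output)
        if best is None or s > best[0]:
            best = (s, template, temp, output)
    return best

templates = [
    "Summarize for a layperson:\n{document}\ntl;dr:",
    "Summarize the key legal obligations in:\n{document}\nSummary:",
]
# Dummy "model" and "metric" just to exercise the loop.
best = search_prompts(lambda p, temperature: p[-20:],
                      lambda out: len(out),
                      templates, [0.0, 0.7], "Some contract text.")
```

In practice the search space is far larger than two templates and two temperatures, which is the parent's point: the paper's setup is closer to a tutorial proof of concept than to an engineered application.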

>I want to point out that a big part of building anything on GPT-3 (or other large LMs for that matter) is "prompt engineering", which means you try out thousands of prompts and sampling parameters until you find something that works reasonably well for your use case.

As someone who is instinctively skeptical about these language models, this kind of statement makes my antennae twitch. You have this black box model that generates all sort of plausible outputs, you jiggle the handle until those outputs meet your expectations for some range of tested inputs, and then ... you assume it's just going to work?

For parlor tricks, or even low-stakes real world activities that might be enough - but how can you trust it?

(As one of them) every professional researcher in NLP at every large company (incl me) knows you can't rely on generation right now, and huge teams everywhere are working on reliability in text generation

So, you take this "general purpose model" (with a huge corpus of standard text) and you attempt to use it for a narrow purpose. The model requires a lot of prompt tweaking and other things for this narrow purpose, but eventually you "make it work".

How do you know you're not just "programming" a chatbot (by indirectly filtering for the text-pieces you want) but in the most indirect and unguaranteed fashion possible? I suppose the advantage is you can say "look, it's intelligent".

I'm getting concerned that even researchers who know better are anthropomorphizing GPT-3 in their descriptions of its output.

As a researcher, it's pretty obvious from the language and from the analysis that the author is a lawyer and a novice at machine learning, not a machine learning researcher.

The author seems to have a very clear understanding of how GPT-3 works and their down-to-Earth, plain language analysis is miles away from the wild flights of fancy we're used to reading about GPT-3.

As a for instance, they didn't even use the word "understand" once, to refer to what GPT-3 is doing.

I'm talking about knowledge of the NLP literature beyond GPT-3, and other machine learning basic technical stuff

Is any of that referenced in the article in a way that needs a deep explanation which isn't there?

Thank you for the support—means a lot to me. I wrote the piece after finding that recent journalism on GPT-3 did not provide a sufficiently accurate snapshot of how vanilla GPT-3 scores on legal tasks (not to mention the misleading snapshots of promising sandbox outputs). Meanwhile, even the most capable people in the ML community do not get to the papers discussing the minutiae of handling the hard problems in legal texts. I never intended the piece to comprehensively address and solve the technical problems GPT-3 (or NLP more broadly) has in handling legal texts. The inability to examine and audit GPT-3 at different levels of the network makes any investigation a partially speculative endeavor. Instead, I merely wanted to provide an overview of outputs critiqued by a lawyer, and offer up some ideas for how to improve performance on legal tasks, drawing from ML and legal knowledge.

I think there is some constructive dialogue in this thread, and I am thrilled by that. Lawyers and engineers need to work together on this to be successful. The goal of summarization is to enable more people to accurately understand text in less time. It is pretty clear that the feedback of a lawyer in the training loop (at least to label meaning and context) would lead to a significant improvement. IIRC, Andrej Karpathy labeled a lot of data when his team achieved a 50% improved classification and detection jump on ImageNet.

Can more (and better) labeled data get us to an accuracy level that is good enough to generate term sheets out of contracts? I would like to find out. I am interested in diving deeper and connecting with anyone who is game to tackle some of the critical challenges.

Correct. I hope that meant that this was more accessible than the average write-up (and not less accurate!) :)

Great read, thanks for sharing. It's always interesting to find out how AI is impacting occupations outside the technology sector.

Thank you for reading!

tl;dr - Don't use GPT-3 to summarize your legal documents yet.

"tl;dr" may very well be the problem! There is a tendency of many tl;dr summaries on the web to oversimplify and skew concepts. If those are included in GPT-3's dataset, GPT-3's output will try to match the dataset style (according to the parameters) and likely not meet the legal standard. There is a section following the conclusion in the paper where I touch on other ways GPT-3 might be improved for the legal summarization use case. The only way we get there is by delving into the nuance of WHY GPT-3 is not yet good enough to replace lawyers, and HOW we can improve on it as an architecture.

Absolutely, and I am grateful for your detailed analysis here. These kinds of spot checks on subjective quality for usability in a domain are a critical part of scoring "are we there yet?" and providing solid exemplars that are motivating to the researchers. It's not hard to imagine a GPT-5 recap in a few years with very different outcomes.

Some of the motivation on my side is judging at which point it becomes appropriate to explore having an algorithm provide a summary of a medical conversation, since my day job at Medcorder is building a system that acts as a patient advocate to help them better understand what their doctors are saying. The universal feature request from day one has been summaries (which are fraught if you get them wrong!), so I'm keen to develop a sense of when we're going to get there.

(As for the downvotes on my comment above, I'd bet they are, appropriately ironically, from people who themselves didn't actually read the article, since the comment was made with a wink to the fact that the article explicitly evaluates the appropriateness of "tl;dr" summaries and found them wanting.)

