The prompt used: "Write a New Yorker-style story given the plot below. Make sure it is at least {{word_count}} words. Directly start with the story, do not say things like ‘Here’s the story [...]"
They didn't even bother using multiple passes to prompt for better-quality creative writing, like "Write an outline for the story...", "Expand this outline, focusing on improving style and tone...", or "Here's a story written by an amateur writer. As a professional writer, give it a rewrite improving vocabulary, metaphor, and overall quality."
As with a lot of the LLM stuff I see these days, I have to wonder how much of what's being measured is the capacity of the tool and how much is a measurement of the capacity of its users.
I'd never imagine using an LLM zero shot in a single pass for any creative writing task, and measuring the inability to perform under suboptimal conditions isn't all that revealing or novel (pun intended).
If you are testing a system, you can't simultaneously tune the system - or at least, you can't while establishing a new testing approach. Once the testing approach is established, someone could experiment with ways to improve the system on the benchmark. But if you're establishing a benchmark, tuning the system at the same time is going to make such a benchmark kind of meaningless.
A better-faith version of this would have been for the experimenters to create the creative writing pipeline they found produced the best output, and then test that.
Using naive zero shot prompts in a single pass with no CoT doesn't even reflect real world production usage outside of the casual user of ChatGPT who discovered it a week ago.
So yes, surprise surprise - casual and naive use of a tool isn't comparable to experts without the tool.
And I'm sure a deck I built with a hammer would turn out much worse than one an expert carpenter built without a hammer. But that says less about the utility of the hammer than about my ability to use it.
One of the coolest prompt techniques I've seen in the research is asking "what expert would best be equipped to answer this task?" and then plugging that answer into "you are XYZ expert, answer this task."
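A minimal sketch of that two-step technique, assuming a hypothetical `ask(system, user)` function standing in for any chat-model call (the stub here just echoes so the flow is runnable):

```python
# Hypothetical stand-in for a real chat-model call; a real version would
# send `system` and `user` as system/user messages and return the reply.
def ask(system, user):
    return f"[model reply given system={system!r}]"

def answer_as_expert(task):
    # Pass 1: ask the model which expert is best suited to the task.
    expert = ask(
        "You answer with a short job title only.",
        f"What expert would best be equipped to answer this task?\n\n{task}",
    )
    # Pass 2: plug that answer into an expert-persona system message.
    return ask(f"You are {expert}. Answer the task.", task)
```

The point is that the persona in pass 2 comes from the model itself rather than being hand-picked.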
Let's say you want an LLM to generate product descriptions.
You might have a first pass where you generate a handful (each run will be slightly different).
Then you can throw those into a pass where the system message casts the model as an in-market shopper matching your target demographic, asking for the points most relevant to making a purchase decision.
Then you give the descriptions and bullet points to a final pass with a system message as an expert copywriter summarizing the multiple descriptions with an emphasis on the bullet points.
You'll have MUCH better results than anything you'd be able to generate in a single pass. You'll need to play with zero shot vs. few shot in each step, and there are other little hacks for consistency here and there that would be relevant, but a multipass approach like this makes a world of difference.
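The three passes above can be sketched as a small pipeline. `llm(system, user)` is a hypothetical stand-in for any chat-model call; the stub just echoes so the flow is runnable:

```python
# Hypothetical model call; swap in a real API client to use this for real.
def llm(system, user):
    return f"[reply to: {user[:40]}...]"

def describe_product(product_brief, n_drafts=3):
    # Pass 1: several independent drafts (each real run would differ slightly).
    drafts = [
        llm("You are a copywriter.",
            f"Write a product description:\n{product_brief}")
        for _ in range(n_drafts)
    ]
    # Pass 2: an in-market shopper persona extracts the decision-relevant points.
    bullets = llm(
        "You are a shopper in our target demographic, actively comparing products.",
        "List the points most relevant to making a purchase decision:\n\n"
        + "\n---\n".join(drafts),
    )
    # Pass 3: an expert copywriter merges the drafts, emphasizing those points.
    return llm(
        "You are an expert copywriter.",
        "Summarize these descriptions into one, emphasizing these points:\n"
        f"{bullets}\n\n" + "\n---\n".join(drafts),
    )
```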
As for code, it'd certainly be worth a try to take the output from pass 1 and then run it through another pass: "Can you refactor this code with an emphasis on reducing cyclomatic complexity while improving readability and its future maintenance?"
Though in general I'd love to see a dedicated fine-tuned refactoring layer (or language-specific layers) as an offering in the next year or so, as it's a task nuanced enough that I don't know a general-purpose model is going to be great at it. But yes, I could definitely see running initial code generations through a second pass to improve the results.
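That generate-then-refactor idea is a two-liner once you have a model call to wrap. Again `llm` is a hypothetical stand-in (here an echo stub so this runs):

```python
# Hypothetical model call; the echo stub just returns the user prompt.
def llm(system, user):
    return user

REFACTOR_PROMPT = (
    "Can you refactor this code with an emphasis on reducing cyclomatic "
    "complexity while improving readability and its future maintenance?"
)

def generate_then_refactor(task):
    # Pass 1: initial code generation.
    draft = llm("You are a programmer.", task)
    # Pass 2: feed the draft back with the refactoring prompt.
    return llm("You are a programmer.", f"{REFACTOR_PROMPT}\n\n{draft}")
```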
> Is it a good idea to do multiple passes when asking for code?
Probably. However, I see their point when it comes to articles, having used LLMs to duplicate writing styles. Having one kick off the beginning without a planned-out ending and just letting it ramble on is asking for crap, whether it's from an LLM or a graduate student writing a paper.
I'm not saying it should be obvious, because I only learned this through trial and error. It just won't compose an article with a deliberate structure if you ask it to just start writing. It'll ramble and pack it with filler, in my experience. It'll also make frequent use of ineloquent and unnecessary transition clauses, such as "Finally,", "Subsequently,", or "On the other hand,", unless each paragraph is made independently and zooms out while following an outline. There might be another way, but telling it to just write the first chapter of a new James Joyce novel is asking for rambling crap. On the other hand, asking it to rewrite the first 5 pages of another novel in James Joyce's style, with example writing given in-context, will yield something much more coherent.
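One way to sketch that paragraph-by-paragraph, outline-following approach, with a hypothetical `llm` stub standing in for a real model call:

```python
# Hypothetical model call; the stub echoes the last line of the prompt
# so the flow is runnable without an API.
def llm(system, user):
    return f"[paragraph: {user.splitlines()[-1]}]"

def write_from_outline(topic, outline_points):
    paragraphs = []
    for point in outline_points:
        # Each paragraph is generated independently against one outline
        # point, which in my experience cuts down on filler transitions
        # like "Subsequently," between sections.
        paragraphs.append(llm(
            "You write one self-contained paragraph at a time. "
            "Do not open with a transition clause.",
            f"Article topic: {topic}\nWrite the paragraph covering: {point}",
        ))
    return "\n\n".join(paragraphs)
```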
Whether written out or not, few writers just start typing a story and see where it goes.
There was actually a great neuroscience paper last year looking at how human predictive processing seems best categorized as a three-layer approach - broad, medium, and narrow temporality - and discussing the idea that this might apply to LLMs.
So if you have a paragraph summary of a story you want to run 10 pages, making the first step generating a one-page treatment of the story from the paragraph, and then feeding that into writing the full story, will lead to better pacing.
Want more interesting character arcs? Have it generate short character motivations, vulnerabilities, etc. for key characters, and feed those in before the one-page step.
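The broad → medium → narrow expansion above can be chained the same way. `llm` is again a hypothetical stand-in for a real model call:

```python
# Hypothetical model call; the stub echoes so the pipeline is runnable.
def llm(system, user):
    return f"[expanded: {user[:30]}...]"

def write_story(paragraph_summary):
    # Optional first step: short motivations/vulnerabilities for key characters.
    characters = llm(
        "You are a story editor.",
        "Write short motivations and vulnerabilities for the key characters "
        f"in this story:\n{paragraph_summary}",
    )
    # Broad -> medium: expand the paragraph into a one-page treatment.
    treatment = llm(
        "You are a story editor.",
        f"Write a one-page treatment of this story.\nSummary:\n"
        f"{paragraph_summary}\nCharacters:\n{characters}",
    )
    # Medium -> narrow: write the full story from the treatment.
    return llm(
        "You are a fiction writer.",
        f"Write the full ~10-page story from this treatment:\n{treatment}",
    )
```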
"Fancy autocomplete" is a powerful tool, but it isn't a replacement for story workshopping. The latter can, however, be embedded as a process into the former fairly successfully, particularly with few-shot intermediate steps.
They’re comparing LLM output to professional creative writers? That seems like a high bar to start with - I’d think that would be the highest bar they could set. How does it compare, though, to your average college student?
Also, I feel like I haven’t heard people say that much about their ability to be creative. It’s more about their ability to generate human-like content — and most human content isn’t that creative either.
Yes, it's a testament to how powerful these models have become in such a short time that we've gone from being impressed that they can write a coherent paragraph to being critical that they can't win a professional writing competition, all in the span of 5 years.
> They’re comparing LLM output to professional creative writers?
I mean, that seems a fair experiment to judge by, given the rhetoric employed by AI enthusiasts, who talk of things like "democratizing art" (as though art as it exists is somehow anti-democratic?) and letting any layman slam some text into a prompt and get a work of Shakespeare, a painting by da Vinci, or a Mozart symphony back.
Yeah, GPT-4, the current SOTA, isn't generally at a professional/expert level in most domains, so this isn't exactly a surprising revelation even if true.
This paper also suffers from the "single pass problem". LLMs aren't that new anymore. Most researchers should know that whatever you get in a single pass is often much lower quality than if you take the effort to be more specific, provide examples, or break the task down and go through multiple passes.
From my experience getting several LLMs to write, this doesn't hold any less true in the domain of creative writing so there's a very good chance this paper severely underestimates the potential of even current LLMs.
An interesting study, but a horrible title. Calling the entire concept of creativity by LLMs a "false promise" in 2023, three years after the launch of the first commercial LLMs, is like someone in 1897, three years after the first production car, calling the automobile a "false promise of commuting".
The most interesting part to me is that Claude 1.3 outperformed GPT-4.[1] They also note that Claude stories are "more likely to be attributed to an amateur writer than an AI, whereas GPT3.5 and GPT4 stories are 80%+ attributed to AI."
That's honestly not surprising. Whatever RLHF OpenAI does to the GPT models really messes with their creative writing - like a Hallmark filter over the output if you don't try to steer it away. The Instruct models don't have the same problem to that degree.
I really hope OpenAI releases a GPT-4 Instruct model, for this reason and for chess.
>We prompt three top-performing LLMs: GPT3.5, GPT4, and Claude V1.3 to generate a story of similar length to each New Yorker story, based on the one-sentence plot summary
This is kinda like resizing an image down to 100x100px and then asking AI to upscale it and comparing against the original. Of course the attempts to reverse lossy compression won't be as good, even if the system is capable of similar quality work when correctly prompted.
Not really. The point of the LLM exercise is to measure creativity. Upscaling is not a creative process, it's basically the opposite of creativity. The only reason to keep the prompt in the same vein as the original story is for the judges, to at least keep the premise in the same ballpark so they're comparing comparable stories.
I'm seeing a lot of AI-generated images on my facebook feed (they're of, um, predictable subjects) - if a human generated one, it would definitely be considered "art". Is Shrek not art because computers generated it based on input from humans?
> Is Shrek not art because computers generated it based on input from humans?
Computers didn't "generate" Shrek based on input from humans - human artists used their creativity and expertise to design those models, build the environments and animate the scenes. Those people went to college for years and studied and honed their craft, and applied it.
If you can't see the difference between that and typing words into a prompt and having a machine stochastically generate something that hopefully kind of resembles what you want, but that you wouldn't be able to create on your own, and that you have no real ability to edit or fine tune, then I don't know what to tell you.
And I agree with you - a lot of what AI is generating now looks really good. The thing is, even if it could be considered art, no one using those tools should consider themselves an artist. They didn't create anything, they're not capable of creating anything, and as long as they keep using AI, they will never be capable of creating anything. They're cheating themselves out of the experience of real creativity for instant gratification and a hit of endorphins.
Which is fine as long as that's accepted. That's got to be the deal with AI - you give up the right to ever be talented at anything for the instant gratification machine that makes generic pretty pictures, and you'll never be anything more than a masturbating monkey until you put it aside and actually learn something.
A lot of generic images aren’t art, regardless of how they’re produced. Much the same way as someone’s holiday snapshots are not art, even if some photography definitely is.