LLM Fight Club

advael · 2024-06-24T17:47:58.000000Z

Played with this a little more and I think the assessment at the end is both the weak link in the UX and a very illustrative debugging tool, not only of LLM limitations, but oddly, the format of internet debate

I've thrown about a hundred prompts at this, half of them phrased as neutral questions, the other half as statements comprising the position of one side of the debate. I notice that when doing the latter, LLM A always takes the position I state, and LLM B is always said to "win" the debate, and the form of this is usually the same: B claims that there's insufficient substantiating evidence for A's claim. Neither says much substantial about the topic of the prompt after the first exchange. Usually A's first paragraph will expand a little bit on the topic and B will mostly try to discredit the arguments made, then they'll repeat this three or four times while insulting each other's intelligence without bringing up anything new, expanding further on the topic, or notably, trying to use any examples aside from the ones initially discussed in the first reply. What's interesting about the "neutral question" style is that the debate tends to go about the same, but the assessment at the end only considers B to have won about 90% of the time

As someone who studies AI and has gotten into a lot of internet arguments, this is a fascinating example of how LLMs are great at capturing the average case with essentially zero chance of capturing the best-case of the format, in terms of being informative on the topic at hand to a third-party observer

advael · 2024-06-24T06:20:18.000000Z

Weak debates but great spectacle

I kind of wonder if systems of LLMs are going to become its own new entertainment medium

unraveller · 2024-06-24T07:18:18.000000Z

I wouldn't listen to a knowledge based podcast without an LLM at the ready and that makes it a rather different experience to the other kinds of podcasts available. I suspect the hosts will start to get wind of this and just glance a lot of surfaces and let the audience branch out as they want.

Entertainment could just become a prompt example dive off point for the user to remix to their liking.

advael · 2024-06-24T07:37:19.000000Z

To be frank this is the opposite of how you should use an LLM. It's not going to be useful for diving deep on stuff that needs factual accuracy. It will on the other hand be useful for giving you shallow overviews you can drill deeper into, maybe even by helping come up with good search terms, or sometimes interrogating your thought process

unraveller · 2024-06-24T10:37:31.000000Z

Deep diving or search would be tough to do in real time. Just rubber ducking while listening to economics and business type podcasts works great, accepting the limits of the tool and domain. It's no scholar in your pocket yet.

Sometimes hosts/guests gloss over topics as they are trying to sell their book as the one way find out more. LLMs can enhance the conversation as if the host drilled in a little further on some sore point similar to if it were a popular post with lots of comments. Not that those will be free of LLM influence for much longer.

advael · 2024-06-24T16:26:21.000000Z

I mean I get that people are doing that but it mostly seems like a way to poison your information stream by multitasking in-depth recitation and thereby making yourself vulnerable to what I'd call "recall exploits", where people remember some wrong information and can't even remember where they heard it, possibly misattributing it to the podcast itself, kinda like old people do to themselves constantly by just casually having cable news on in the background. Even if you assume no malice, like you're self-hosting your own fine-tuned llama or whatever, lack of accuracy is enough for this to be a problem

Like I'm aware that 80% of econ and business talk is gonna be BS anyway but turbocharging that effect seems unwise

unraveller · 2024-06-25T11:23:16.000000Z

On the contrary, not using LLM extensively makes you more susceptible to slop in all forms as you haven't learned to avoid it by seeing how the sausage is made. Most bias you see in media is surface level anyway it's not deep divable, cable news is such a limited hangout it could only add to the experience to push against in a pinch like OP's debate summary.

For some reason people trust the error rate of the internet over the single entity LLM, it all depends on correct use and remaining sceptical. Either way the same gambit that got everyone online will get everyone on LLMs: I'll 1000x your knowledge for some low % error rate (reducing every month).

It's like muskets that didn't shoot straight either when they first arrived but you could compensate and you'd be unwise to stick to the broadsword because of tradition. Ask Scottish highlanders.

advael · 2024-06-25T21:42:02.000000Z

Eh, I was already good at checking things in parallel from a combination of wikipedia and google scholar, and so it seems like this use case of LLMs is just a more expensive, higher error rate version of a workflow a lot of integrated multitaskers have been good at for literally over a decade. Multimonitor or multi-device workflows with known good sources still beat LLMs for augmenting linear information streams with detail we're mostly going to fail to retain, especially because in nearly every such use case, the cost of negative errors (omissions) is essentially nonexistent but the cost of positive errors (fabrications) can be significant. These outsized claims of automated efficiency have yet to bear out in any tangible way for any real person I've encountered, because even for the wikipedia version of the use case, retention rate is extremely low, and an LLM is inherently an unaudited source and therefore we have less good probabilistic epistemic guarantees about it

As I've said before, I am using LLMs in my workflows, primarily in places where either factual accuracy doesn't matter (ideation/elaboration is a great use of any generative model) or it's feasible to both test the output in a tight feedback loop and obtain primary sources from which it's drawing information. They're also exceptionally good at translating fuzzy questions into actionable search terms, a use case that google's been trying to approximate poorly for years and which probably would have made a much bigger splash as the gamechanging practical NLP result it is if they hadn't (it looks like incremental improvement from an end-user perspective but decidedly isn't). Everyone who tells me how "efficient" they're getting from adding yet another captured passive stream to their already oversaturated information diet seems no more competent or knowledgeable than prior to the widespread usage of LLMs, but they are about 1000x more smugly assured I'm a luddite for thinking their unfathomable credulity in their naive use case being the future is overwrought (and are, statistically, also likely the people who buy the smear campaign by industrialists that's resulted in the modern usage of that word)

From my perspective you're the guy swinging the musket like it's a broadsword, and I would very much like for fewer people to be doing that with such enthusiasm, given the hazard to benefit ratio. But at least you're not the guy trying to chop vegetables with the musket, which unfortunately is the level at which a decent chunk of the business world currently functions in this analogy

unraveller · 2024-06-26T07:13:00.000000Z

To ruin your analogy remix further it was actually bayonets affixed to muskets that defeated Little Bonnie Prince's army. Again confident compensation won the day.

Lies by omission are worth railing against even if the first guess at those omissions are wildly wrong, the important thing is you've started the refinement process instead of letting the implied falsehood fester and mix with other choices you might make. The hazard of LLMs is only for people who expect to be babied. The people who want no one else but the elite knowledge experts using LLMs are just chuffed at their own trivia recall abilities and perhaps a little afraid of being easily shown up after dedicating so much effort to the legacy way.

advael · 2024-06-26T08:46:25.000000Z

lol you really are one of these AI cultists huh? Do I get to hear "it's a godsend for the untalented" for the 28043rd time?

Your proposed model for adoption doesn't "show up the elites", it destroys the ability of anyone who doesn't already have elite-level epistemic integrity checking skills to ground themselves in reality, and honestly may well over time subvert even decent epistemics via automation blindness. Using new tools is something that people should absolutely do, but using them stupidly for the sake of using them creates a small-to-modest momentary edge for an infinitessimal fraction of this especially reckless class of early adopters who luck out while most of them just serve as guinea pigs as the world figures out what the things are actually good for and irons out the rough edges. Regulating the way you ingest new information such that you can tell fact from fiction isn't "trivia recall", it's a way to retain a semblance of soundness of mind. It's readily apparent from the increasingly unhinged landscape of conspiracy theories that it is in fact not better to wildly guess at gaps in your knowledge if those guesses have a chance of becoming sticky, and human brains are a lot better at filling in gaps than unfilling them. You're not making a good case for not grounding usage of these tools in a realistic understanding of what they're good at - even setting aside the nominative determinism that's becoming increasingly apparent in your replies

gregw2 · 2024-06-24T11:30:15.000000Z

“Iceberg is better than delta lake.” Fight!

Plato or Hegel this is not. But it is kinda amusing.

Seeing the fight club topics others asked would be an additionally fun and engaging feature. I asked some funny or interesting ones but would like to see others’…

SushiHippie · 2024-06-24T11:33:20.000000Z

> Seeing the fight club topics others asked would be an additionally fun and engaging feature

I'll start, "Tabs vs Spaces" ;)

SushiHippie · 2024-06-24T10:13:22.000000Z

Really digging that UI, what did you use for designing the frontend?

And are LLM A and B different models or just different system prompts, where the system prompt for LLM B includes that it disagrees with the output of LLM A?

SushiHippie · 2024-06-24T11:48:08.000000Z

Okay after looking at it on my computer:

- OpenGraph data somehow is for https://claros.so an "AI Shopper"

- It uses nextjs and tailwindcss

- The model says it is GPT-4, but you'll never know if that's true

- LLM B seems to be instructed to 'argue against the topic'

timonoko · 2024-06-24T05:59:16.000000Z

TIL. It understand languages, but debates only in Ænglish.

Anyways. The eternal kwestion "Oliko Urho Kekkonen diktaattori?" produced better discussion I have seen in years.

busssard · 2024-06-24T14:24:28.000000Z

i managed to let it argue in german

joshagilend · 2024-06-24T05:26:54.000000Z

This is fun :) I like it a lot! Good job :)

fzliu · 2024-06-24T05:51:52.000000Z

Not my website, but I agree - it is fun indeed.

gregw2 · 2024-06-24T11:45:49.000000Z

The author aquajet first posted it as a ShowHN about a month ago it seems.

huxflux · 2024-06-24T09:14:15.000000Z

Which LLM does it use?