Thanks for sharing. To me, purely on personal preference, the Gemini models did best on this task, which also fits with my experience using Google's models to summarize extensive, highly specialized text. Gemini's 2.0 models do especially well on needle-in-a-haystack-type tests, from what I've seen.
Seeing the other models, I actually come away impressed with how well GPT-4.5 is organizing the information and how well it reads. I find it a lot easier to quickly parse. It's more human-like.
I noticed 4o mini didn't follow the directions to quote users. My favourite part of the 4.5 summary was how it quoted Antirez. 4o mini brought out the same quote, but failed to attribute it as instructed.
It's fascinating, but while this does mean it strays from the given example, I actually feel the result is a better summary. The 4.5 version is so long you might just read the whole thread yourself.
Interesting, thanks for doing this. I'd say that (at a glance) for now it's still worth using more passes with smaller models rather than one pass with 4.5.
Now, if you wanted to generate training data, I could see wanting the best answers possible, where even slight nuances would matter. 4.5 seems to adhere to instructions much better than the others. You might get the same result by generating n samples and "reflecting" on them with a mixture of models, but then again you might not. Going through thousands of generations manually is also costly.
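For what it's worth, the "generate n samples and reflect on them with a mixture of models" idea could look roughly like the sketch below. This is only an illustration under assumptions: `best_of_n`, `generate`, and `judges` are hypothetical names, and the model calls are left as plain callables you would wire up to whatever APIs you actually use. It simply picks the candidate with the highest average judge score, which is one of several possible ways to "reflect".

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical interfaces (assumptions, not any real library's API):
# a generator maps (prompt, sample_index) to a candidate answer,
# and a judge maps (prompt, candidate) to a numeric score.
Generator = Callable[[str, int], str]
Judge = Callable[[str, str], float]


def best_of_n(prompt: str, generate: Generator,
              judges: Sequence[Judge], n: int = 4) -> str:
    """Sample n candidates and return the one with the highest mean judge score."""
    if n < 1 or not judges:
        raise ValueError("need at least one sample and one judge")

    # One pass per sample, e.g. with a small, cheap model.
    candidates = [generate(prompt, i) for i in range(n)]

    def average_score(candidate: str) -> float:
        # Each judge (possibly a different model) scores the candidate.
        return mean([judge(prompt, candidate) for judge in judges])

    # max() keeps the first candidate on ties.
    return max(candidates, key=average_score)
```

In practice the judges would be prompts to different models asking them to rate adherence to the instructions, and whether that matches a single 4.5 pass is exactly the open question.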
Compared to GPT-4.5, I prefer the GPT-4o version because it is less wordy. It summarizes and gives the gist of the conversation rather than reproducing it along with commentary.
GPT-4o: https://gist.github.com/simonw/592d651ec61daec66435a6f718c06...
GPT-4o Mini: https://gist.github.com/simonw/cc760217623769f0d7e4687332bce...
Claude 3.7 Sonnet: https://gist.github.com/simonw/6f11e1974e4d613258b3237380e0e...
Claude 3.5 Haiku: https://gist.github.com/simonw/c178f02c97961e225eb615d4b9a1d...
Gemini 2.0 Flash: https://gist.github.com/simonw/0c6f071d9ad1cea493de4e5e7a098...
Gemini 2.0 Flash Lite: https://gist.github.com/simonw/8a71396a4a219d8281e294b61a9d6...
Gemini 2.0 Pro (gemini-2.0-pro-exp-02-05): https://gist.github.com/simonw/112e3f4660a1a410151e86ec677e3...