Because otherwise we are talking about LLMs augmented with external tools (e.g., Python interpreters). My original comment was pointing out the limitations of LLMs writing code by themselves.
You wouldn't ask a programmer to solve a problem and then not let them write down the source code or debug the program as they write it, would you?
Are you asking it not to write down a general algorithm? LLMs are doing a pretty good job on mathematical proofs.
I still don't understand why you wouldn't let it use its full reasoning abilities by letting it write code or even call out to another agent. We should be testing for the result, not the method.
I'm simply pointing out the limitations of LLMs as code writers. Hybrid systems like ChatGPT-o1 that augment LLMs with tools like Python interpreters certainly have the potential to improve their performance. I am in full agreement!
It is worth noting that even ChatGPT-o1 doesn't seem capable of finding this code optimization, despite having access to a Python interpreter.
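For what it's worth, the "augmented with a Python interpreter" setup being discussed is roughly a loop like the sketch below. This is only an illustration of the idea; run_python and the llm_complete callable are hypothetical placeholders, not any particular vendor's API:

    # Minimal sketch of an "LLM + Python interpreter" loop (assumptions:
    # llm_complete is whatever model call you are testing and returns raw
    # Python source; the prompt format is made up for illustration).
    import subprocess
    import sys
    import tempfile
    from typing import Callable

    def run_python(source: str, timeout: float = 10.0) -> str:
        """Write the generated code to a temp file, run it, and capture its output."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(source)
            path = f.name
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
        return result.stdout + result.stderr

    def solve_with_tools(task: str, llm_complete: Callable[[str], str],
                         max_rounds: int = 3) -> str:
        """Let the model write code, see what it prints, and revise: the 'hybrid' setup."""
        feedback = ""
        for _ in range(max_rounds):
            code = llm_complete(
                f"Task: {task}\nOutput of your previous attempt:\n{feedback}\n"
                "Reply with Python code only."
            )
            feedback = run_python(code)
        return feedback

The point of the sketch is just that the model sees the interpreter's output and gets to revise, which is exactly the capability the "by themselves" framing excludes.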
But programmers are, in effect, LLMs augmented with the ability to run code. It seems odd to add that restriction when testing whether an LLM is "as good as" a programmer, because if the LLM knows what it would need to do with the externally run code, that's just as good.