> "In an experiment, I asked 40 questions about PostgreSQL. 23 came back with misleading or simply inaccurate information. Of those, 9 came back with answers that would have caused (at best) performance issues. One of the answers could result in a corrupted database (deleting WAL files to recover disk space)."
I am curious whether these tests could be written down and run against several versions of GPT (2, 3, 3.5, 4), not because I think any of them will solve it, but to see a trend and to have a new benchmark ready if we see 4.5 or 5. It might even be worth testing with the 'Code Interpreter' plugin.