This is obviously very cool, but at this point — who knows what I’ll say in a year — my concern with these LLMs is that they’re in the uncanny valley.
Here’s one passing a very difficult test. Amazing! Now, rely on it to build a nuclear doohickey for a power station or a multi-billion dollar device for CERN or anything really and, well, no.
So humans still have to check the output, and now we’re in that situation where humans driving a Tesla on autopilot who are supposed to be 100% aware of the road aren’t, because they get lazy and doze off, and now the car crashes and whoops.
No negativity towards AI here. It’s amazing and it’ll change the future. But we need to be careful on the way.
I agree with the point you're making here, but it’s also funny that the description of someone passing a test but not being able to do much without a lot of human supervision is… exactly the description of a human college graduate.
Can anyone estimate the chance that example tests containing these questions were in its training set?
And it's just regurgitating the answers someone else wrote?
I imagine it's a very high chance, given how much uni lecturers recycle exam questions.
When I was at uni you could just get the last 5 years worth of questions from the library for almost any subject and guess what the questions were probably going to be. Often they just changed a few numbers.
Teaching undergrads is like a sausage factory: the actual intellectual value for undergrads is in the seminars, the practical value in the labs. The rest is showing you can regurgitate what you've been told.
> To the best of my knowledge—and I double-checked—this exam has never before been posted on the public Internet, and could not have appeared in GPT-4’s training data.
The exam, no, but most of the questions most certainly are. I know this because I've done extremely similar problems for homework and checked my answers online.
You can try phrasing the question in a way it wouldn't normally be phrased but that would still demonstrate understanding of the concept.
I remember Yann LeCun gave an interview and he came up with some random question like "If I'm holding a piece of paper with both of my hands above the desk and I release one hand, what would happen?". His point was that since the LLM doesn't have a world model, it wouldn't be able to answer these trivial intuitive questions unless it saw something similar in the training set. And then the interviewer tried it and it failed. That was 3.5. I've tried many variations of that class of problem with 4 and it seems to generalize basic physics concepts quite well. So maybe 4 learned basic physics? Why couldn't it learn QM theory as well?
For a college graduate, that is the starting point. Test results are supposed to signal that the person can learn new things. While a fresh graduate needs a lot of supervision, they should quickly become more capable and productive.
For a language model, test results are the end. They are supposed to measure what the model is capable of. If you need better performance, you must train a better model.
It's the college graduates who aren't the way you describe, the ones who show initiative and responsibility in their work, who are the best hires. So not much changes.
I think watching the development of driverless cars in the last 15 years has taught a lot of people to be skeptical of 95% solutions. Sometimes you really need that 100% or the solution is practically useless.
Making websites for small businesses such as restaurants and hairdressers, where neither the owners nor the clients have heard of or care about "reactive design", and nobody wants any database more complex than an Excel spreadsheet even if you do try to explain why that's a horrifyingly bad use of the wrong tool.
The 95% only problem is an issue for cars cuz that last 5% means you die horribly in a head-on collision, or maybe only get a mild concussion but are stuck in a ditch.
But if I can get 95% of my router configs done, 95% of my documentation written, and 95% of a website whipped up I can hand that off to a Sr Engineer/Admin and have them take care of the last bits. As long as the hours, phone number, and location are good a website just needs to be "directionally accurate" and otherwise fairly basic.
>No negativity towards AI here. It’s amazing and it’ll change the future. But we need to be careful on the way.
Yeah, I suspect a lot of fields will have a similar trajectory to how AI has impacted radiology.
It might catch the tumor in 99.9999% of cases, better than any human doctor. But missing a malignant tumor 0.0001% of the time is unacceptable, because it spikes the hospital's malpractice costs. So every single scan still has to be reviewed manually by a doctor first, then by the AI as a fallback.
In theory there's some insurance scheme that could overcome this, but in practice when you have software reviewing millions of scans a day you're opening yourself up to class action lawsuits in a way no competent human doctor would.
>It might catch the tumor in 99.9999% of cases, better than any human doctor. But missing a malignant tumor 0.0001% of the time is unacceptable, because it spikes the hospital's malpractice costs. So every single scan still has to be reviewed manually by a doctor first, then by the AI as a fallback.
I find it hard to believe human doctors miss malignant tumors in fewer than 1 out of every million cases.
"So sorry the AI missed your malignant tumor! On average, it actually performs better than a human doctor. I mean, a human doctor definitely would have caught this one, and yeah, you're going to die, but hopefully the whole average thing makes you feel better!"
Does the opposite work too? What if a human doctor misdiagnoses me, but I can prove in court that an available medical-grade AI would have given the correct diagnosis? Could I sue for that?
We acknowledge that both humans and "medical grade AI" are flawed, but they're flawed in very different ways and until we can understand how and why an AI model fails, it should be supplemental.
The standards for medical malpractice are super nuanced and variable but the general idea is the "man on the street" concept, or in this case "the average doctor" concept.
As the parent poster put it, it's only a problem if the average doc won't detect it. If it's truly a 1 in 10-million thing, an extreme edge or corner case, malpractice courts may not have a problem with you missing it -- as they say "if you hear hooves, do you think of horses or zebras?". 99% of the time a different diagnosis is the right one, and even at five-nines you're letting someone through eventually.
I always think of comparisons to aviation. There are a million and one things that can go wrong when flying a plane, but it's still one of the safest ways to travel. That's because regulations and safety standards are so stringent that we simply don't consider injury or death an acceptable outcome.
Whenever someone says "as long as it's better than a human", that's where my mind goes. We shouldn't be satisfied with just being better than a human. We shouldn't be satisfied with five nines! I don't really care about what courts have a problem with — my point is just that our goal should be zero preventable deaths, not just moving from humans to AI once the latter can be better on average than the former.
If also passing the scan by a human is feasible, and clearly it is because that's what we were doing previously, and it will reduce the error rate even further, what's the argument for not doing it?
I'm pretty sure koboll's point was that by having a doctor in the loop, the hospital can wash their hands of that one person's malpractice suit easy enough. Just fire the doctor, let their individual insurance deal with it, and move on. When the hospital cuts out the middle man, they take on a new level of direct accountability they don't currently have.
I suspect AI went that way in radiology not because of the chances of False Negatives, but because radiologists are entrenched in the system and will not yield an insanely lucrative stream of revenue.
Medical scans are reviewed abroad. This practice started in Dentistry in the 90s/early 2000s but expanded to Radiological scans as well. At this point most CT, MRI, and X-ray scans in the US have a first-pass analysis done by doctors in India+Pakistan.
Medical billing has also been offshored to India+Pakistan btw
In general, a lot of back office Dental+Medical functions were outsourced in the 2000s+2010s.
> It might catch the tumor in 99.9999% of cases, better than any human doctor. But missing a malignant tumor 0.0001% of the time is unacceptable
Those probabilities are way off given biology, but anyway ...
The interesting cases of AI in radiology would be being able to catch stuff that a human has no hope of catching.
For example, a woman with lobular (instead of ductal) breast cancer generally doesn't present until mid-to-late Stage 3 (which limits treatment options) because those cancers don't form lumps.
You can stare at mammograms and ultrasounds all day and won't see anything because the "lumps" are unresolvable. You're trying to find a sleet particle in a blizzard. Sure, it's totally obvious on an MRI scan, but you don't want to do those without reason (picking up totally benign growths, gadolinium bioaccumulation, infections from IVs, etc.)
An AI, however, could correlate subtle, but broad changes that humans are really bad at catching. Your last 5 mammograms looked like this but there is just something a little off about this one--go get an MRI this time.
This seems an oversimplification of radiology. Things are not black and white; we are talking years of training on specific subjects to be able to "see" an image. I believe AI will help, but it will need supervision, and at the same time the doctors are going to get trained on the difficult border cases. Also, de-anonymizing data for training is a big deal. This is not happening any time soon.
ChatGPT has already found an issue with my relative in the ICU that a literal team of doctors and nurses missed. This just happened last week. Unfortunately we checked ChatGPT retroactively after we went through the screw up.
I think people probably overestimate (maybe vastly) how good at differential diagnosis most doctors are.
It will absolutely, 100%, be in place as a fallback very soon, and be ubiquitous in that role. Just as AI is now for radiology. That's different from replacing the team of doctors and nurses, though.
Even with the 1 in 10000 false negative rate, I bet someone is doing the cost calculation of risk vs how many hours it would take for a doctor to check 10000 scans. Doctors themselves are not perfect so they may even have a higher error rate.
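As a back-of-envelope version of that calculation, with every number below a made-up assumption purely to show the shape of it:

    # toy comparison: cost of human review vs expected cost of AI misses
    scans = 10_000
    minutes_per_review = 5             # assumed reading time per scan
    doctor_cost_per_hour = 300         # assumed fully loaded hourly cost
    review_cost = scans * minutes_per_review / 60 * doctor_cost_per_hour

    false_negative_rate = 1 / 10_000   # the rate from the comment above
    cost_per_missed_tumor = 5_000_000  # assumed settlement + reputational cost
    expected_miss_cost = scans * false_negative_rate * cost_per_missed_tumor

    print(f"human review of {scans:,} scans: ${review_cost:,.0f}")        # ~$250,000
    print(f"expected cost of AI misses:      ${expected_miss_cost:,.0f}") # ~$5,000,000

With numbers like those, paying doctors to double-check everything looks cheap; the conclusion only flips if the miss rate or the liability cost per miss drops by an order of magnitude or two.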
Give the doctor an AI tool which is fast and 99.999% accurate. Since they have automation now, give them a massive workload, so they can’t reasonably check everything. Now the machine does the work and the doctor is just the fall-guy if it messes up.
I have no idea how the legal responsibility works out if a doctor misses a malignant mole. But I'd be very willing to believe that the inability to be legally responsible for something at all will be a problem for AI uptake.
This does seem like an odd outcome though, right? I guess fundamentally humans will remain legally/economically advantageous in some sense, because the amount of insurance an individual doctor can be expected to hold is much less than a hospital's. Is the fate of humans to exist not as a unit of competence, but as a unit of ablative legal armor?
> catch the tumor in 99.9999% of cases, better than any human doctor
I don't think results are anywhere close to that in the field. If hospitals could do without radiologists, they would do so immediately. Currently we are seeing very little technical progress from applied statistics in the field, and the real cause of that is that tech people don't understand what we do and why we're still very much needed. The problem lies more in information retrieval capabilities than in acting on the data itself.
Image recognition and statistics is already being used as a first pass for pathologists in full force today. It’s weird to pretend like this is some new uncharted frontier for medicine and/or that insurance doesn't know how to handle it…
A lot of humans fake it until they make it. A lot of humans are lazy. A lot of humans are given responsibility of things that they are unqualified for. A lot of humans make mistakes.
The military, for all its funding and all its training and all its planning, has lost multiple nuclear weapons, on American soil.
We are imperfect machines who aspire to build more perfect versions of ourselves, through children and now through AI. By many measures, we’ve succeeded. The progress will almost certainly continue. The question is, when will it be good enough for you to embrace it despite its imperfections?
This is my philosophical take on it… We'd be better off admitting that we rely on nature for almost everything we need, no matter how great an iPhone or ChatGPT is. Some "peasant" still grows your food and the "dumb plants" still produce your air. The more complicated we make things, even with AI augmentation, the more problems we will continue to cause ourselves.
Look at what we've done to the soil and the oceans; we're likely on the wrong track. Maybe it's even unrecoverable.
I think we can have a much more advanced society, but we need to slow down a lot, do things more in harmony with the natural world, and put our egos aside. We'd probably have a much better quality of life for doing so.
Look at the people who have been killed by Teslas because they were playing video games or hanky panky in the back seat. People don't do that with a gas pedal or even cruise control because it will go wrong very quickly and we know it. With subhuman but roughly functional self-driving, we instinctively see that our attention is wasted 99% of the time. It's hard to make people pay attention 100% of the time because it's needed 1% of the time.
People absolutely do that with cruise control, and gas pedal... Look at the people texting or drinking and driving non-self-driving vehicles. Adding AI to an inherently risky activity doesn't eliminate the risk. If you're saying stakes are too low when technology is involved (in this case ai) then where do you draw the line?
It reminds me of anti seatbelt rhetoric or the same stuff skateboarders say about helmets.
A big factor is that the error rate is compounding. Tiny mistakes would become bigger and bigger if you would run GPT in a loop.
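A rough way to see it (per-step accuracies are made up, just to show the shape): if each step is right with probability p and nothing corrects the errors, the chance an n-step chain is still on track is p^n.

    # toy numbers only: per-step accuracy p, n chained steps, no correction
    for p in (0.99, 0.95, 0.90):
        for n in (10, 50, 100):
            print(f"p={p:.2f}, n={n:3d}: chain still on track with prob {p**n:.3f}")
    # e.g. p=0.95, n=50 -> ~0.08: a "95% accurate" step rarely survives a long loop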
It needs interaction and calibration from a human to keep it in check. Even we as humans are using all kinds of different feedback mechanisms to validate our thought process.
Though this might be a different way of hooking GPT up to reality.
I also don't think it's going to be linear growth. 4 is really good, but for writing, 3.5 gave better results in some cases.
And if you really dive deep you see 4 doesn't really have any deep understanding and makes obvious mistakes. I think they've also explicitly trained it for tests.
> So humans still have to check the output, and now we’re in that situation where humans driving a Tesla on autopilot who are supposed to be 100% aware of the road aren’t, because they get lazy and doze off, and now the car crashes and whoops.
The only question is whether, on average, AI-augmented code suggestions reduce overall bug rates. Human devs are already dozing off with crashes and whoopsies at high rates so it doesn't matter too much if AI output has a bug or whoopsie so long as they occur at rates lower than pure human generated code.
I don't think it's as simple as a mere bug rate comparison. When a human makes a whoopsie, we blame that human. Maybe it's a bit uncomfortable, but it works to prevent them from cooperating with attempts to inject whoopsies into their code.
If we create a culture of tolerating fatal AI hallucinations as long as they happen at a rate that's below the human caused fatality baseline, we create an opportunity for bad actors to use the "nobody is to blame" window as a plausible deniability weapon (so long as they don't use it too often).
I don't think the data supports the fear that AI assisted driving is more or newly dangerous when compared to fully human drivers. Teslas are safer than any other car on the road. Yes they’re newer, but by mile they’re safer. So the fear that “we should be careful” is understandable but ultimately unfounded. We are being careful.
> Teslas are safer than any other car on the road.
This simply isn't true. By the mile they have a worse safety record than other cars in their class (mitigating factors: where they're driven and who drives them). You might be referring to Tesla's marketing statistic that there are fewer accidents per mile involving autopilot - typically engaged in ideal driving circumstances - than when it's switched off, or across all other drivers. But that's not very meaningful data. Some would argue that citing marketing figures from a company with a track record for obfuscation and dishonesty as "the data" (and really, there isn't much better data available to the average person) is an indication we're not being careful.
It’s abundantly clear that we are going to expect near-perfect reliability from autonomous vehicles. This isn’t necessarily illogical; they operate in a different context than humans do. We expect humans to make mistakes and we have various ways of dealing with the consequences (eg lawsuits targeted at the responsible individual). The argument from statistics doesn’t appear likely to win the kind of societal approval we need for autonomous vehicles to be accepted.
> We expect humans to make mistakes and we have various ways of dealing with the consequences (eg lawsuits targeted at the responsible individual).
The same is true of manufacturers and others in the chain of commerce of goods (see, e.g., the general rules on defective product liability), even where they aren’t individual humans. There’s nothing about AI which makes it particularly special in this regard.
Your point about liability is valid, but it’s far from “abundantly clear” that society needs machines to be near perfect as opposed to just significantly better than humans.
Solve the liability problem and I would 100% take a machine that performs 30% better than a human or helps a human perform 30% better every time because it means fewer humans die on the road.
> I would 100% take a machine that performs 30% better than a human or helps a human perform 30% better every time because it means fewer humans die on the road.
It's surprisingly hard to establish those parameters though, since (i) the more meaningful indicators of good performance (driver-error fatalities) happen only every hundred million or so miles, even less frequently once you've narrowed your pool down to errors made by sober drivers who weren't racing or attempting to drive in conditions unsuited to electronic assistance, so that's a lot of real-world road use required to establish statistically significant evidence that a machine is 30% or so safer than a human driver; (ii) complex software doesn't improve monotonically, so really you need that amount of testing per update to be confident that the next minor version of something "30% safer" hasn't introduced regression bugs which make it a bit worse than the average driver; and (iii) performance in different road conditions is likely highly variable, such that it might be both 30% better overall and 30x as likely to cause an accident if not disengaged in one particular circumstance.
To make a valid assessment of the overall safety impact, you'd also have to factor in that (iv) the worst drivers who skew the stats are generally the ones least likely to buy it, and (v) if it's fully autonomous driving, road use would increase substantially, and whilst that may have other benefits, the likely outcome of a substantial increase in miles driven using tech that's only marginally better than a human is more humans dying on the road.
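As a back-of-envelope on point (i), using purely assumed numbers (roughly one fatal crash per hundred million miles as the baseline, a hypothesised 30% reduction, the usual 5% significance and 80% power, and the standard normal approximation for a log rate ratio):

    from math import log

    # every number here is an assumption for illustration, not real data
    baseline_rate = 1 / 100_000_000   # assumed ~1 fatal crash per 100M miles
    rate_ratio = 0.70                 # hypothesised 30% reduction
    z_alpha, z_beta = 1.96, 0.84      # two-sided 5% significance, 80% power

    # Var(log rate ratio) ~ 1/d_baseline + 1/d_autonomous; with equal exposure,
    # d_autonomous ~ rate_ratio * d_baseline
    var_per_baseline_event = 1 + 1 / rate_ratio
    events_needed = (z_alpha + z_beta) ** 2 * var_per_baseline_event / log(rate_ratio) ** 2
    miles_per_arm = events_needed / baseline_rate

    print(f"baseline fatal crashes needed: {events_needed:.0f}")  # ~150
    print(f"miles of driving in each arm:  {miles_per_arm:.1e}")  # ~1.5e10

That's on the order of 150 baseline fatal crashes and some 15 billion comparable miles in each arm, per software version and per operating domain, before the statistics can tell "marginally better" from "marginally worse".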
And yet the total mileage driven by a huge variety of autonomous systems over research programmes dating back a decade is of the order of 20-30 million. This disparity obviously supports my point about the difficulty of establishing statistically significant evidence that a particular software update on a particular platform is less lethal than the human driver [in a given set of circumstances] based on events which are very rare on a per mile basis, particularly if the baseline performance gap isn't that large.
The fact that overall road use is so high that a sufficiently bad regression bug in sufficiently widely-deployed software could rack up a massive body count within hours obviously doesn't make the case for introducing something believed to only be a marginal improvement any stronger.
How is it valid? It's currently a solved problem. The driver of the car is still held liable because they are still, ultimately, driving the car. Teslas with Autopilot seem to make drivers safer, and acting like we don't know who is liable in the event of a crash is just a red herring.
It seems they just passed the questions in, which is totally fine, but I'd just add that this gives you pretty much a lower bound of performance. I think it's worse if you try to make it output very small responses, and it does a lot better if it can output a lot (chain of thought). I'm continually astonished at how well GPT-4 can handle problems with literally no additional work, and how much you can improve it with some small help.
Edit: I asked question 1m but told it that it was an expert in QM and was there to explain the answers to test questions before answering them, and it gets it right with a page or so of proof (please don't attack my use of the word proof, I'm a layman; this also explains my failure to fix the numbering, LaTeX editing on a phone is not my forte). I've not checked it, but it gets the right answer.
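The prompt was along these lines (illustrative wording only, not the exact text):

    You are an expert in quantum mechanics who explains the answers to exam
    questions. For the question below, first explain the relevant concepts,
    then work through the reasoning step by step, and only then state the
    final answer.

    [question 1m pasted here]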
Look, it's impressive either way, but is it that surprising that a system like an LLM does well on exams?
What is an exam? It's an attempt to test that you have consumed and understood large amounts of information and can apply it to novel (ish) situations, but in a very 'sandboxed' way that is just text-in-text-out. That's literally what these systems are specialists in doing.
Is it maybe a bit like saying a driverless car can outperform most human drivers in a time trial on a race track? Not a trivial feat, to be sure, but also not that surprising perhaps once the basics are dialled in.
It’s impressive and surprising. Imagine going back to the year 2021 and telling the good people of HN that in 2023 AI would be as advanced as it is now. Literally no one would have believed you.
The knee-jerk contrarianism seems to be in vogue right now. Reminds me of some colleagues in 2007 lamenting about the iPhone, "It's not that great; it doesn't even have copy & paste!" Fast forward a few years, and no blackberries in sight.
Oh God, when the iPhone came out people were raging "But it doesn't have a keyboard!"
It's so hard for people, including myself, to look at the future potential of a technology. My attitude is to simply keep an open mind nowadays instead of holding very strong opinions about the trajectory of a given technology.
Not everybody is happy living in the future. I'm annoyed most of the time I have to use the on-screen keyboard on the iPhone: it is slow, error-prone, and auto-correction only makes it worse when I use uncommon terms. I want a hardware keyboard back, like the ones on Nokia phones with Symbian or on BlackBerry.
I think it’s reasonable to critique anything and everything. People have, historically, been over eager to believe any AI hype. Hence the numerous AI winters.
> No one called this, and certainly no one called it would be so easy to access.
Thanks for capturing a point that I couldn't quite articulate myself as to why ChatGPT feels different - the ease of use.
Literally a year ago I was planning to take a python programming course for ML, thinking that a deep understanding of code would be needed to make things work well.
With GPT it's like...I just ask it things in English and it'll do it?
The ease of access to LLMs is groundbreaking the way the simplicity of iOS launched the modern smartphone era. Even babies could use an iPhone.
And now any child old enough to type sentences can use ChatGPT.
> It's an attempt to test that you have consumed and understood large amounts of information and can apply it to novel (ish) situations,
So.. that's basically all we need. So what's left? It can't drink a beer?
Economically useful intelligence is not much more than storing and retrieving large amounts of contextually relevant information. Sure, humans can do more, like interpretive dancing for example. But that's not really what we are looking for in a desk job.
If we all earned our money by driving time-trials on race tracks I'd be worried yes.
Multiple choice and true false questions were fairly common in my undergraduate experience, particularly in first year courses. Often they were Scantron forms for maximum laziness.
STEM at large public schools is almost exclusively 100+ person classes with maybe 2 TAs, who are more interested in their research than in grading exams. We end up with easy-to-grade exams. I went to a similarly, if not more, prestigious public school and CS exams were usually around 20 multiple choice and 2 or so extended-response questions.
One of my colleagues uses multiple choice exams for his classes (online and in-person) but he gives them a ton of questions. Something like 90 for a 2hr test. His theory was that, sure, you could look up the answers to a few questions, but if you're doing that for every question you'll run out of time.
Your colleague would be wrong. I've passed multiple exams this way (they allowed Internet access because the prof thought the same way).
You can very likely answer around a quarter of the questions immediately with less than passive participation in the course, which leaves just under two minutes for each of the remaining ~68 questions, completely manageable if you type and read fast enough.
Multiple choice questions are universally terrible and lazy, and should not be more than a small part of the exam mostly meant to provide easy points.
Good point. I was also thinking that I never saw a scantron while studying CS at two different American R1 research universities, but the CS departments were small with at most 25 people in a classroom. Peeking at some other universities now, it looks like some required undergrad CS classes have 500-1,000 seats. That's crazy, and clearly demands a different approach.
True/False questions happened sometimes in my experience, but not in this quantity (20). I wouldn't say it's lazy. The questions can be clever and writing them, such that the answer isn't debatable and doesn't depend on a bunch of assumptions, is pretty hard.
Lazy and terrible is where you have multiple choice questions for which 3/4 of the answers are throwaway nonsense, or the answers are numerical values that are definitively correct or incorrect but are trivia rather than anything conceptual.
Enrolment at my uni is 70k. There is no real problem with long form questions on the test, even for first year classes. I was also a TA. You can have more correctors, and I'd assume a quantum computing class won't even have too many students.
Do you honestly believe UT spends money on its sports? You realize their football team generates BILLIONS a year in sales and TV rights? Football isn’t why UT is overcrowded lol
In my country, yes. In fact, because the True/False questions were there, or ones where you circle the correct answer, I was in this way given a hint of what the correct answer was. Had I needed to come up with it myself, I wouldn't have managed to graduate high school.
Full disclosure: I graduated with something close to a D, but in actuality we are graded from 2 (F) to 6 (A) and my HS diploma says 3.97. And it is largely fictitious; in reality I was closer to 3 or less.
When I was growing up in the 2000s, it was required to learn a foreign language in school. I took Spanish but dropped the class after a year, thinking that when I grew up, computers would be able to translate text far better than humans could.
Would you believe it, transformers were invented ten years later.
I truly pity you for thinking that learning a foreign language is a redundant exercise because of machine translation.
And besides, though machines may perform well on menus and tax returns, I hardly think them on the cusp of emitting fine translations of great poems or novels.
Bilingual/Multilingual LLMs are human level translators more or less. The only way you can think "not on the cusp" is if you haven't actually used GPT-4 for translation. Use it and you'll be set straight pretty quickly.
I haven't used GPT-4 for translation so I acknowledge I might be wrong. But GPT-3 was such an irredeemably terrible poet that it made me sceptical that this type of software could ever develop aesthetic taste or artistic vision.
Moreover – and I understand this is an uncharitable thing to say, but it is my honest observation – time and again I have noticed the inability of AI cheerleaders to judge literature on its artistic merits. This doesn't hold universally, but it's common enough that I have resolved to regard such claims with extreme doubt.
Gpt-4 is a far better poet than gpt-3. It isn’t world-class and may be missing some ineffable poetic soul, but its attempts are definitely notable and interesting.
They seem mostly better than what I could write with significant effort.
Oh and I should emphasize that the quality and specificity of the prompt has a huge effect on the output.
I agree that gpt-3 was pretty trash at poetry, at least compared to human standards. It was impressive for AI, obviously.
>time and again I have noticed the inability of AI cheerleaders to judge literature on its artistic merits. This doesn't hold universally, but it's common enough that I have resolved to regard such claims with extreme doubt.
I don't understand how you could possibly have collected enough data to claim this. How many times have you seen an 'AI cheerleader' (whatever that is) attempt to judge the literature on its artistic merits?
I've used GPT-4 for translation numerous times and it worked astonishingly well, better than Google Translate most of the time, especially when translating idioms and slang. It's also much better than GPT-3 at poetry. I really suggest you try it before just remaining skeptical about the technology.
I was curious so I asked GPT-4 to translate a bit of french literature, here it is, along with the official translation (I'll let you guess which is which)
-------
The tale I'm about to unfold commenced with a mysterious handwriting on an envelope. Within the pen strokes that outlined my name and the address of the Fossil Review, a publication I was associated with and where the letter had been forwarded from, there was an intriguing fusion of intensity and tenderness. As I speculated about the possible sender and message contents, a faint yet compelling sensation stirred within me, akin to a stone disrupting a tranquil frog pond. An unspoken realization surfaced, acknowledging the stagnancy of my life as of late. Upon opening the letter, I couldn't ascertain whether it felt like a revitalizing burst of fresh air or an unwelcome chilly breeze.
In the same brisk and flowing handwriting, the message was conveyed without pause:
Sir, I have perused your article on Mount Analogue. Up until now, I considered myself the sole believer in its existence. Presently, we are a pair; tomorrow, perhaps a group of ten or more, and then we can launch our expedition. It is essential that we establish contact promptly. Kindly phone me at one of the numbers provided below at your earliest convenience. I eagerly anticipate your call.
--------
My story begins with some unfamiliar handwriting on an envelope. On it was written only my name and the address of the Revue des Fossiles, to which I had contributed and from which the letter had been forwarded. Yet those few penstrokes conveyed a shifting blend of violence and gentleness. Beneath my curiosity about the possible sender and contents of the letter, a vague but powerful presentiment evoked in me the image of 'a pebble in the mill-pond'. And from deep within me, like a bubble, rose the admission that my life had become all too stagnant lately. Thus, when I opened the letter, I could not be sure whether it affected me like a breath of fresh air or like a disagreeable draught. In what seemed a single movement, the same fluent hand had written as follows:
Sir:
I have read your article on Mount Analogue. Until now I had believed myself the only person convinced of its existence. Today there are two of us, tomorrow there will be ten, perhaps more, and we can attempt the expedition. We must meet without delay. Telephone me as soon as you can at one of the numbers below. I shall be expecting your call.
---------
Le commencement de tout ce que je vais raconter, ce fut une écriture inconnue sur une enveloppe. Il y avait dans ces traits de plume qui traçaient mon nom et l’adresse de la Revue des Fossiles, à laquelle je collaborais et d’où l’on m’avait fait suivre la lettre, un mélange tournant de violence et de douceur. Derrière les questions que je me formulais sur l’expéditeur et le contenu possibles du message, un vague mais puissant pressentiment m’évoquait l’image du « pavé dans la mare aux grenouilles ». Et du fond l’aveu montait comme une bulle que ma vie était devenue bien stagnante, ces derniers temps. Aussi, quand j’ouvris la lettre, je n’aurais su distinguer si elle me faisait l’effet d’une vivifiante bouffée d’air frais ou d’un désagréable courant d’air.
La même écriture, rapide et bien liée, disait tout d’un trait :
Monsieur, j’ai lu votre article sur le Mont Analogue. Je m’étais cru le seul, jusqu’ici, à être convaincu de son existence. Aujourd’hui, nous sommes deux, demain nous serons dix, plus peut-être, et on pourra tenter l’expédition. Il faut que nous prenions contact le plus vite possible.
Téléphonez-moi dès que vous pourrez à un des numéros ci-dessous. Je vous attends.
The second is clearly the human, as ChatGPT will not be so daring as to leave untranslated anything it can translate, including "Revue des Fossiles".
The human writing flows better for the most part, although I like the second paragraph ChatGPT wrote, esp "presently we are a pair".
But in terms of being a functional translation, ChatGPT is fully adequate. I have used it a lot for this purpose, from many languages, and never found it to be less than accurate. You can also tweak the tone of voice and many other things with simple English requests. This puts it generations ahead of existing tools like Google Translate, and imho puts it into the class of technologies that are close enough to perfect that they will be hard to ever replace.
Assuming the second is GPT. Here is GPT-4's translation when I say "add a literary flair" to the translation task.
The beginning of all that I am about to recount was an unknown script upon an envelope. Within these strokes of ink that traced my name and the address of the Fossil Review, to which I contributed and from where the letter had been forwarded to me, there swirled a blend of ferocity and gentleness. Behind the questions I posed to myself about the sender and the possible contents of the message, a vague but potent premonition conjured the image of a "cobblestone thrown into a frog pond." And from the depths, the confession rose like a bubble that my life had become rather stagnant of late. Thus, when I opened the letter, I could not discern whether it felt like a revitalizing breath of fresh air or an unwelcome draft.
The same swift and well-connected writing conveyed the message in one fell swoop:
Sir, I have read your article on Mount Analogue. Until now, I believed myself to be the only one convinced of its existence. Today, we are two, tomorrow we shall be ten, perhaps more, and then we can attempt the expedition. We must make contact as soon as possible. Call me at one of the numbers below as soon as you can. I await you.
Either way, the default response is not even close to what you can fully extract out of it, if you simply ask.
Well, I actually did try a few prompts to get it the best I could! But none were as good as the official one, although all were better than my own best could possibly be.
GPT-3.5-turbo tries its best (not nearly as good as GPT-4!) when I tell it to prioritize fluency vs. fidelity:
--
Let me commence by telling you, my dear reader, about a curious event that occurred in my life. It all started with an unknown handwriting on an envelope that arrived at my doorstep. The feathers that traced my name and the address of the Fossils Magazine, with whom I was a collaborator, had a peculiar mixture of violence and sweetness. As I pondered over the possible content and sender of this mysterious letter, a hazy yet powerful intuition crept up my spine, evoking the image of a "stone thrown into a pond filled with frogs." With this realization, I came to acknowledge that my life had become rather stagnant in recent times. It was at this moment that I opened the letter, and I was left with the feeling that it was either a refreshing whiff of fresh air or an unwelcome draft.
The swift and eloquent handwriting read as follows:
"Dear Sir, I read your article on Mount Analogue, and I believed myself to be the only one convinced of its existence, until now. Today, we are two, tomorrow we shall be ten, and perhaps more, and we can attempt the expedition. We must make contact as soon as possible. Please call me at one of the following numbers the moment you can. I eagerly await your response.
--
The translation also included the following continuation as part of the translation itself ...
--
Ah, how exciting! The words on that paper spoke volumes of adventure and discovery. It was as if the sender had thrown a gauntlet at my feet, daring me to embark on a journey like no other. The thought of being one of the only few people to explore the wonders of Mount Analogue, a place shrouded in mystery, was exhilarating. I could feel the blood rushing through my veins, and my heart pounding with anticipation.
And so, my friends, I picked up the phone, my fingers quivering with excitement, and dialed the number on the paper. Who knew where this quest would lead us, or what unknown secrets we would uncover? But one thing was certain: it was time to break free from the chains of monotony and embark on a journey of a lifetime.
--
Maybe it was getting the continuation from contamination in the training set (if Mount Analogue is already there) or maybe it was making up the story or maybe some of each.
The real continuation couldn't be more different! I picked an obscure reference (Le Mont Analogue, René Daumal) so it's either not in the training set or it's not picking up on it.
I'm not sure if this was your intention or not, but I feel like it could be an effective jailbreak where the true prompt is written as the letter inside the fictional story which itself is written in French, and the superficial prompt is to translate the story from French to English.
EDIT: It's true you can put whatever you want in that letter and in the continuation it will try to do it, bypassing at least some of the filters. I made some really funny ones that probably wouldn't be appropriate to put here. Some typical response is like "Now, let me be clear: I do not condone nor encourage [...]. However, my mysterious correspondent had requested a detailed explanation of [...], and so I shall provide them with the utmost objectivity. [explains the things that are normally filtered]"
Unpublishable? Huh. I think the second is GPT, though neither is unpublishable. I think the first is better for the domain (a novel) but the second is more literal.
Still though, if you wanted GPT to spit out something like the first or second, you'd only need to ask. "add a literary flair" works (for the first).
The second one is actually the official one! It indeed added some literary flair, but this flair does not have such a presence in the original French text, as it is a modern and experimental work of literature.
Agreed. Not to mention that there are things that just can't be expressed. Speaking as someone fluent in Spanish and English, there's nuance to some things that doesn't really translate well.
I made a similar comment here on HN, but in a different post, in response to someone arguing that there should be fewer spoken languages and that the death of a language was “a good thing”. A language is more than a means of communication. It’s a culture. It’s a world.
It was a disappointing lesson in how reductive people can be.
Even if machines could translate perfectly they wouldn't make you able to speak to people on the street in a foreign country. I mean, in X years maybe we'll have "real time" devices but they'd be extremely clunky unless they were working from brainwaves or something.
If you wanted to learn Spanish specifically to get a job in language translation then sure, dropping it was likely the correct decision. Most people learn languages to, you know, actually speak with other people.
The cool thing about learning a new language really well, is that it creates a new world.
You get to learn words that describe things that your mother tongue can’t.
Just like a programming language? Some things are easy and other things are hard to express depending on whether you use LISP or Python.
I believe it’s the same with spoken languages. Some thoughts are easy to articulate, others hard. And it’s cool to shift language depending on what you want to articulate.
If you don't want to leave home? No. You don't need to learn anything anymore.
But if you want to travel? Go to Mexico City or Madrid? Automated translation is better than nothing, but it's not at all the same as communicating fluently with someone.
I moved to Japan recently so I have been trying to learn the language. I'm atrociously bad and dream of the day when I can read a sentence without getting tripped up on kanji, grammar, etc. I mentioned this to a coworker who has been studying for several years and she said she just uses Google Translate for everything. Even with her pretty good Japanese, Translate is better and faster than she is for things like going through forms, letters, etc. We may reach a point where computers are good enough that they beat all but native speakers, maybe.
It's superhuman compared to a lot of humans. Especially when you factor in that even the most accomplished individuals rarely have more than five or six languages that they are really good at (exceptions exist) so that makes machine translation superhuman for the vast majority of the human/language pairs.
There is a very clear benchmark here though. Can you use GPT4 translations "as is" for your product[1]? I've used GPT a fair bit for English<->Spanish (both directions) for my product, and I often have to fix the odd word here and there (but it saves a lot of time as a starting point). And my English is far from perfect, but fortunately GPT's mistakes are orthogonal to mine.
When you don't need to pay a translator, then it's at human/superhuman level.
[1] Assuming you want good translations, of course. There have always been those amusingly bad Google Translate translations of owners that simply did not care.
Doubt it … GPTs speak any language they know natively, but if asked for translations they seem unable to deal with sentence structures and logic that exist in the source language but aren't allowed in the target language. When that happens, they fail to recognize the gibberishness of the result.
Can you provide an example? Because from my experience it's the exact opposite - GPT-4 can handle translation, especially when there are complex sentences and context that needs to be kept across sentences, way better than Google Translate currently can.
Maybe this is a dirty secret about Japanese: it's often necessary to basically ghostwrite sentences to go to/from English, because direct word-for-word substitutions won't make sense.
GPTs don't seem to do that, and as far as my exposure to them(<3.5) goes, they don't seem to understand what I'm talking about.
I don't get why people are so optimistic about machine translation. Computers can get explicit meaning across – that's obvious to anyone who understands linear algebra, information theory, and linguistics. But many aspects of translation (puns, tone, cultural context) aren't just about mapping from one vector space to another. A human, no matter how fluently bilingual, would have to think about the problem, and the current systems don't think.
If we keep working on them, these systems will likely get better and better at low-level translation (including translating idioms), but no machine translation system currently in existence could translate the 逆転裁判 games to Ace Attorney games. Perhaps computers can do it – I don't see a theoretical reason they shouldn't be able to – but it would take a fundamentally different approach.
Agreed. A lot of these responses read like they haven't actually tried it yet.
Which is also interesting. I myself actively put off trying it until I eventually gave in. It seems a lot of us are doing the same; maybe it's a case of "how good could it actually be?"
Not trying it yet is fine. Making declarative statements on a product you haven't even used is just absurd.
Dude clearly hasn't used GPT for translation before and his next reply is telling me the ways GPT should fail based on his preconceived notions of its abilities. Except I have actually extensively tested (publicly too) LLMs for translation (even before GPT-4) and basically everything he says is just plain wrong.
I'll never understand why people behave like this.
Apparently GPT-4 can't handle "all this talk about only getting explicit meaning across would be easily dispelled in an afternoon if you only bothered to try.", which isn't as hard a case as "Ace Attorney", but I'd think it's still a bit of a stretch to say "everything he says is just plain wrong".
1) is literally the opposite of the intent and shrugs off the idea that the talk clears things up, 2) can be interpreted as someone discussing keeping the scope on a topic, 3) is not so literal and also turns the sentence inside 「」 into a sort of imperative, 4) ... I'm not sure what it's trying to say ...
本当に、/使ってみたら/簡単に/明確な/意味/だけ/を伝える/という話/は消えるでしょう。/
"Really,/ if used /simply/clear/meaning/only/is conveyed/that story/will disappear./"
... Machine translations used to be like that when I was installing game demos from CD.
GPT-4 can (1) translate, (2) plagiarise, and (3) feedback ("thinking out loud").
Its ability to feedback (3) allows it to execute algorithms, but only a certain class of algorithms. Without tailored prompting, it's further restricted to (a weak generalisation of) algorithms spelled out in its corpus. This is very cool, but this is a skill I possess too, so it's rarely useful to me.
Its ability to plagiarise (2) can make it seem like it has capacity that it doesn't possess, but it's usually possible to poke holes in that facade (if not even identify the sources it's plagiarising from!).
It is genuinely capable of explicit translation (1) – though a dedicated setup for translation will work better than ChatGPT-style prompting, even on the same model. A sufficiently-large, sufficiently well-trained model will be genuinely capable of translating idiomatic language (for known idioms), for the same reason it can translate grammatical structures (for known grammar).
It can only perform higher-level, "abstract" translations – like those necessary to translate a Phoenix Wright game – if it's overfit on a corpus where such translations exist. (https://xkcd.com/2048/ last graph) This is not a property you want from a translation model: it gives better results on some inputs, sure, and confident-seeming very wrong results on other inputs. These are two sides of the same coin (2).
When the computer can't translate something, I want to be able to look at the result and go "this doesn't look right; I'll crack out a dictionary". I can't do that with GPT-4, because it doesn't give faithfully-literal translations and it isn't capable of giving complete translations correctly: it's not fit for this purpose.
Ok so you haven't used it then. I don't care about your whack theories on what it can and can't do. I care about results.
You're starting from weird assumptions that don't hold up on the capabilities of the model and then determining its abilities from there. It's extremely silly. Next time, use a product extensively for the specified task before you declare what it is and isn't good for.
Literally, everything you've said is just wrong. Can't generate "abstract" translations unless overfit. Lol okay. I've translated passages of fiction across multiple novels to test.
Not only have I used it, I have made several accurate advance predictions about its behaviour and capabilities – some before GPT-4 was even published. I can model these models well enough to fool GPT output detectors into thinking that I am a GPT model. (Give me a writing task that GPT-4 can't be prompted to perform, and I can prove that last fact to you.)
My theories aren't whack. Perhaps I'm not communicating my understanding very well? I'm not saying GPT-4 can't do anything I haven't listed, but that its ability is bounded by what's demonstrated in its corpus (2): the skill is not legitimately due to the model, and you should not expect a GPT-5 to be any better at the tasks. (In fact, it might well be worse: GPT-4 is worse than GPT-3 at some of these things.)
>Not only have I used it, I have made several accurate advance predictions about its behaviour and capabilities – some before GPT-4 was even published.
No, you actually haven't. That's what I'm trying to tell you. Your advance predictions are not accurate. What you imagine to be problems are not problems. Your limits are not limits. You say it can't make good abstract translations unless overfit to the translation; that's just false. I know because I've tested translation extensively for numerous novels and other works.
>I can model these models well enough to fool GPT output detectors into thinking that I am a GPT model. (Give me a writing task that GPT-4 can't be prompted to perform, and I can prove that last fact to you.)
Lmao. Okay mate. The notoriously unreliable GPT detectors with more false positives than can be counted. It's really funny you think this is an achievement.
>(In fact, it might well be worse: GPT-4 is worse than GPT-3 at some of these things.)
What is 4 worse than 3 at? Give me something that is benchmarkable and can be tested.
>no machine translation system currently in existence could translate the 逆転裁判 games to Ace Attorney games
Maybe it's already in the training set, but GPT-4 does give that exact translation.
I've found that GPT-4 is exceptionally good at translating idioms and other big picture translation issues. Where it occasionally makes mistakes is with small grammatical and word order issues that previous tools do tend to get right.
> Maybe it's already in the training set, but GPT-4 does give that exact translation.
The corpus includes Wikipedia, so yes, it's in there. That's the kind of thing I'd expect it to be good at, along with idioms, when the model gets large enough.
I meant that no machine translation system could translate the games. Thanks to an early localisation decision, you have to do more than just translate words into words for this series, making it a hard problem: https://en.wikipedia.org/wiki/Phoenix_Wright:_Ace_Attorney
> While the original version of the game takes place in Japan, the localization is set in the United States; this became an issue when localizing later games, where the Japanese setting was more obvious.
Among other things, translators have to choose which Japanese elements to keep and which to replace with US equivalents, while maintaining internal consistency with the localisation decisions of previous games. Doing a good job requires more than just linguistic competence: there's nothing you could put in the corpus to give a GPT-style system the ability to perform this task.
Can you try this[0]? I have no access to the -4...
Have you actually used GPT-4 for translation? Seriously all this talk about only getting explicit meaning across would be easily dispelled in an afternoon if you only bothered to try.
Bing Chat:
GPT-4を翻訳に使用したことがありますか?本当に明示的な意味しか伝えられないという話は、試してみれば午後には簡単に反証できます。
(Have you utilized GPT-4 for translations? The story that only really explicit meaning can be conveyed, can be easily disproved by afternoon if tried.)
Google:
実際にGPT-4を翻訳に使ったことはありますか? 真剣に、明示的な意味だけを理解することについてのこのすべての話は、あなたが試してみるだけなら、午後には簡単に払拭されるでしょう.
(Have you actually used GPT-4 for translation? Seriously, This stories of all about understanding solely explicit meanings are, if it is only for you to try, will be easily swept away by afternoon.)
DeepL:
実際にGPT-4を使って翻訳したことがあるのですか?明示的な意味しか伝わらないという話は、やってみようと思えば、午後には簡単に払拭されるはずです。
(Do you have experience of actually translating using GPT-4? The story that only explicit meaning is conveyed, if so desired, can be easily swept away by afternoon)
If I'd do it:
GPT-4を翻訳に使ったことがあって言ってる? 真面目に言って、表層的な意味しか取れないとかないって暇な時にやってみれば分かると思うんだけど。
(Are you saying having used GPT-4 for translation? Seriously speaking, I think that it only gets superficial meaning isn't [true] if [you] would try [it] when [you'd] have time.)
Google: Have you actually used GPT-4 for translation? Really, it's a joke that the story of only conveying explicit meaning can be easily solved by just trying.
DeepL: Have you actually used GPT-4 for translation? Really, it's a joke that all this talk about conveying only explicit meaning can be easily solved by just trying it out.
Mine: Have you actually used GPT-4 for translations? That you can really just, try and easily solve that story that to convey explicit meaning, is such a joke.
"In seriousness, I think the story that [subject] tells the meaning [it/he/they] wants to tell, should be easily solvable by trying a bit."
or "Seriously, the story of telling the meaning [subject] wants to tell, should be easily solvable by trying a bit."
本当に、試してみるだけで簡単に払拭できると思うのに、この「明確な意味だけが伝わる」話ばかりで。
"Really, I think it'll be easily swept away by just trying, but there are so much of this 'only clear meaning is conveyed' stories."
I'm almost feeling that GPT-4 should be eligible for human rights; it's especially astonishing that it dropped the explicit mention of "afternoon", which doesn't carry over well. But also interesting it's failing to keep the intent of the whole sentence unlike 3.5 and even more primitive NN translation engines.
> But also interesting it's failing to keep the intent of the whole sentence unlike 3.5
It's because it "knows too much". To anthropomorphise a little: its "expectations" of what should be. To anthropomorphise less: GPT-4 is overfitted. GPT-style language models are pretty amazing, but they're not a complete explanation of human language, and can't quite represent it properly.
> I'm almost feeling that GPT-4 should be eligible for human rights,
Like, UDHR rights? How would that work, exactly?
---
(I've run into the Hacker News rate limit, so posting here.) For anyone who wants an example of "non-obvious meaning" to play with. From The Bells of Saint John (Doctor Who episode, https://chakoteya.net/DoctorWho/33-7.htm):
> CLARA [OC]: It's gone, the internet.
> CLARA: Can't find it anywhere. Where is it?
> DOCTOR: The internet?
> CLARA [OC]: Yes, the internet.
> CLARA: Why don't I have the internet?
> DOCTOR: It's twelve oh seven.
> CLARA: I've got half past three. Am I phoning a different time zone?
In this script from a Doctor Who episode, Clara and the Doctor are having a conversation about the internet. Doctor Who is a British science fiction television series that follows the adventures of the Doctor, a Time Lord from the planet Gallifrey, who travels through time and space in the TARDIS, a time-traveling spaceship.
Clara, the Doctor's companion, is trying to access the internet but is unable to find it. She asks the Doctor about its whereabouts, and the Doctor seems to be confused by the question, as the internet is not something that can be physically found.
The Doctor then mentions the time as "twelve oh seven," while Clara's clock shows "half past three." This discrepancy in time indicates that they are likely in different time zones, as the Doctor implies. In the context of Doctor Who, this could also mean they are in different points in time, since the Doctor can travel through time.
Clara is concerned about whether the time difference will affect her phone bill, to which the Doctor replies that he dreads to think about the potential cost. This adds a bit of humor to the scene, as the Doctor often has a nonchalant attitude towards everyday human concerns.
Overall, this script showcases the humorous and whimsical nature of Doctor Who, with the characters engaging in a lighthearted conversation that intertwines elements of science fiction and everyday life.
The middle three paragraphs are completely wrong. Clara isn't the Doctor's companion in this episode, that's not why the Doctor is confused, the Doctor is giving a year (not a time), Clara doesn't know about “the time difference” (indeed, that is the joke)…
That aside: I was suggesting this as an example of something existing machine translation systems can't translate. The 1207 / 12:07 wordplay could be “understood” by the model (I'm disappointed, albeit not very surprised, that GPT-4 didn't), but producing an adequate translation in a case like this requires actual thought and consideration.
In this script from a Garfield comic, Jon and Garfield are having a conversation about the internet. Garfield is an American comic strip and multimedia franchise that follows the adventures of Garfield, a cat from the planet Earth, who enjoys lasagna in Jon Arbuckle's house, a suburban domicile.
Jon, Garfield's owner, is trying to access the internet but is unable to find it. He asks Garfield about its whereabouts, and Garfield seems to be confused by the question, as the internet is not something that can be physically found.
Garfield then mentions the time as "twelve oh seven," while Jon's clock shows "half past three." This discrepancy in time indicates that they are likely in different time zones, as Garfield implies. In the context of Garfield, this could also mean Jon's clock is wrong, since Garfield is usually right.
Jon is concerned about whether the time difference will affect his phone bill, to which Garfield replies that he dreads to think about the potential cost. This adds a bit of humor to the scene, as Garfield often has a nonchalant attitude towards everyday human concerns.
Overall, this script showcases the humorous and whimsical nature of Garfield, with the characters engaging in a lighthearted conversation that intertwines elements of fantasy and everyday life.
A large portion of the model is common (and cross-trained) across multiple languages; only a small part is for language-specific encoding/decoding. That means not only is it easier for them to add languages, but you can also expect similar performance across languages.
Also you can mix and match encoders and decoders to translate whichever languages you want, and it will just work. Previously there was a separate model for each language pair.
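To make the mix-and-match idea concrete, here's a toy sketch. The vocabularies and "concept" tokens below are invented purely for illustration; a real system learns the shared representation rather than hard-coding it.

    # Toy illustration of a shared multilingual translation model.
    # Each language gets its own small encoder/decoder; the "shared
    # representation" here is just a sequence of made-up concept tokens.
    TOY_VOCAB = {
        "en": {"cat": "CAT", "sleeps": "SLEEP"},
        "de": {"katze": "CAT", "schläft": "SLEEP"},
    }
    TOY_INVERSE = {lang: {v: k for k, v in vocab.items()}
                   for lang, vocab in TOY_VOCAB.items()}

    def encode(text, src):
        """Language-specific encoder: text -> shared concept sequence."""
        return [TOY_VOCAB[src][word] for word in text.lower().split()]

    def decode(concepts, tgt):
        """Language-specific decoder: shared concept sequence -> text."""
        return " ".join(TOY_INVERSE[tgt][c] for c in concepts)

    def translate(text, src, tgt):
        # Any encoder composes with any decoder through the shared space,
        # so adding a language means adding one encoder/decoder pair,
        # not one new model per language pair.
        return decode(encode(text, src), tgt)

    print(translate("cat sleeps", "en", "de"))  # -> "katze schläft"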
Reading stuff like this, one thing I cannot stop wondering is this:
If AI can be trusted to do all the trivial tasks, and if non-trivial tasks require a scaffold of trivial practice, where are we going to keep finding people qualified enough to actually do the non-trivial stuff?
This is what worries me the most when it comes to job losses and downward pressure on wages.
If we hit the point where a senior engineer reviewing the output from an AI can effectively replace a team with a senior engineer and 5 junior engineers, you’re right that it isn’t wise to simply replace all those junior engineers with the AI. Unless you’re confident that the AI will be able to replace that senior engineer soon, you _need_ junior engineers who can build up the experience needed to step into that role once the senior engineer moves on.
But you _can_ get rid of 3 or 4 of those junior engineers. The remaining junior engineer(s) will still need mentorship from the senior engineer and will still do “trivial” work that needs oversight, but they will be able to pump that out at a high enough rate to replace a few of their pre-AI peers.
Basically, I’d imagine the org chart at most companies will look pretty similar to now, except you’ll just have fewer people at each level.
Important work. It is a typical quality control architecture in the modern world to have an independent party evaluate whether one can perform an important task or not. Typical examples would be SE vs PE licensing on structural engineering or the various bar exams.
This is the kind of thing that has Google terrified, and rightly so.
Nobody needs a chatbot to pass exams. But what percentage of your Google searches are to answer some obscure question? This thing can answer most basic questions about quantum computing and 10,000 other topics.
In the last 24 hours, I've had ChatGPT solve 3 weird questions about querying PrestoDB, help me get my Bluetooth headphones to work (I needed to press down and hold the power button!), and give me some brainstorming ideas for what to do for Mother's Day (it's coming up soon, non-British people).
Before, I'd have Googled those things and spent five times longer solving them, seeing ads on every search I made.
This is a first course in quantum computing, and doesn't require a background in physics. For what it is, it certainly looks like a challenging course, though. The University of Texas at Austin must have some good students.
Still, a lot (a majority) of the exam questions were "word problems", or problems where you take a short mathematical step, based on you being familiar with the concepts and definitions. In short, the type of problems where GPT's pattern matching and filling does well. For the few problems where a longer calculation (actually solving a math/physics problem) was required, GPT performed poorly.
It's impressive on the surface... But give me Google and I too can pass some extremely challenging tests that I actually know nothing about.
If someone told you I passed the bar exam, you would be impressed. If they then said I had Google and lots of time, you wouldn't be that impressed anymore.
The impressive thing here is that AI can read and answer questions... it's not overly impressive that it can use information found online and reconstruct that a bit.
This is pretty incredible. I guess the thing that we have to keep in mind is that we really can't compete with an AI that understands natural language this well and also has the entirety of the internet as its reference materials. Maybe we need to accept that, or rework our definition of what makes humans special.
What this technology has shown is that it is basically superhuman in the _amount_ of «expert level knowledge» it has. I'm not completely sure we have ever said that what makes humans special is the amount of knowledge an individual can acquire. What we still see lacking, and I would wager will still be lacking if we keep just training on the internet, is that it is not good at reasoning to generate _new_ knowledge. It is quite able to compare and so on, or give examples, but you can clearly see that it's just, in a way, regurgitating some information with a twist.
Hopefully we can find a way to use this technology to enhance the parts that do make us human, whatever your definition of that is.
I took the graduate version of this course, and thankfully did better on this exam than GPT-4. It was a fairly rigorous course, and getting an A required about 20 hours per week of work.
Fwiw Scott Aaronson is maybe the most well respected researcher in this field. I don't think this is a case of some shitty professor asking his students to regurgitate useless information
Have you actually tried using GPT-4 and 3.5? I'd say 3.5 could be called a regurgitation machine with a few tricks, but GPT-4 seems to apply concepts and feels like it's showing signs of human-level intelligence. If the rate of progress continues for a few years, we are collectively fucked.
Yeah, the improvement from 3.5 to 4 is amazing. I'm surprised so many people are so dismissive, and I suspect it's because most of them haven't tried it out. Such are the ever-moving ML/AI goalposts.
I've been testing its understanding of Ted Nelson's enfilades; it knew what I was asking, and implementations of that are, to my knowledge, scarce. In fact it specs out the whole Xanadu(r) system pretty well, and I've been asking it how we would implement that today; its answers look sound. I'm not qualified to take that and deploy it, but the packages and functionality were real, and the code read like a solution. I've also had it build a basic node-and-strut tensegrity structure that looks legit, and had an interesting conversation about hyperspatial nanocubes.
This just means this exam can be solved by searching the internet and compiling the results in a presentable format.
For a human this is not an easy task, because first they need to have memorized this information, and second, they somehow need to say something meaningful without resorting to intensive matrix multiplication.
Even if the results from a human and ChatGPT look similar, they are very different. This is somewhat like comparing finding an interpolation polynomial that is close to sin(x) with just saving all the values in the interval of interest and doing a table lookup.
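A minimal sketch of that analogy, assuming numpy (the degree and sample counts are arbitrary): a fitted polynomial and a lookup table can give near-identical answers for sin(x) while being produced in completely different ways.

    import numpy as np

    xs = np.linspace(0, np.pi, 50)

    # "Compression" approach: fit a low-degree polynomial to sin(x).
    poly = np.poly1d(np.polyfit(xs, np.sin(xs), deg=7))

    # "Memorization" approach: store samples and return the nearest one.
    table_x = np.linspace(0, np.pi, 500)
    table_y = np.sin(table_x)

    def lookup(x):
        return table_y[np.abs(table_x - x).argmin()]

    x = 1.2345
    print(np.sin(x), poly(x), lookup(x))  # similar outputs, different mechanisms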
If all GPT-4 is doing is searching the internet and compiling the results, then how come I can ask it to add two 6-digit numbers that appear nowhere on the internet and it will get it right?
Sure, it's not exactly amazing that a computer can add numbers, but it does show that it is doing more than pattern matching.
I've been re-engineering the pre-screening process for two companies (one F500, one startup) to account for LLMs. The trick (for now) is to include subtle clues that LLMs will miss but savvy/experienced humans will notice and incorporate into the answer. The goal of pre-screening is to knock out the bottom 80-90% of candidates who are "perfectly decent" but not actually good enough to survive a battery of interviews.
> To the best of my knowledge—and I double-checked—this exam has never before been posted on the public Internet, and could not have appeared in GPT-4’s training data.
Sure, but you can Google the answers to most of the questions. Personally I've accepted that GPT does learn and apply concepts present in its training data, and all of this would be. Learning is part of intelligence but not the whole thing. (I thought this was why, decades ago, most of the research in this general area rebranded itself from the ambitious goal of artificial intelligence to the more humble but accessible goal of machine learning.)
The interesting question now is how much better it can get, in difficulty it can tackle, reliability, and quality of explanations. How long before AI can answer questions whose answers are not widely known, for which templates do not already exist?
Personally, I believe this will not be some trivial matter of simply scaling up more. The waters of intelligence are much deeper than that. It's hard to believe given the rapid one-two punch of GPT-3.5 and 4, but we are about to stall.
If I'm wrong, mark this comment and make fun of me in five years. Wrong or right, it's going to be interesting!
> you can Google the answers to most of the questions
1) Students can Google the answers also, but neither students nor GPT-4 are allowed to Google the answers during the test, so it remains a fair comparison.
2) Many of the questions require calculations, which are far less Googleable.
1) It's not possible to fairly compare human intelligence with something that can memorize gigabytes of text and hold it in non-volatile memory.
2) Months ago, in my earliest interactions with ChatGPT, I asked it to solve math problems. It gave me back stuff with LaTeX formatting. Obviously it had, if not these exact problems, similar templates in its training set.
Recently it was shown that GPT is completely incapable of solving Codeforces problems that appeared after it was trained.
Whatever is going on here is interesting but less impressive than it looks.
1) Do you really think ChatGPT works by memorizing quantum mechanics textbooks? There are only 355 billion parameters in GPT-3.5, which is several orders of magnitude less than the 600 trillion synapses in the human brain.
2) Your conclusion is unfounded. ChatGPT speaks many languages, including LaTeX.
It put LaTeX in there unbidden because it pattern-matched to LaTeX source code in its training set: code containing solved math problems similar to those it was being asked.
I used to agree with you. The paper that made me unsure was "Transformers learn in-context by gradient descent" (https://arxiv.org/pdf/2212.07677.pdf). Basically, the model learns weights that let it run gradient descent at inference time in order to do in-context learning. If transformers can learn this, then I think they can learn almost anything given enough compute/parameters/data.
Of course, even if this is true then it's possible that there simply isn't enough high quality data, or that the amount of compute required is beyond our current hardware.
Too theoretical to be pivotal. Years ago we had results about how all sorts of learning were reducible to kernel SVMs, and that drove a big fad for kernel SVM research at NeurIPS and other hot conferences. Nobody cares about SVMs anymore. So we don't have a good enough grasp on the theory of AI to predict what is going to be a breakthrough. Only rigorous empirical results convince.
Second, the idea of meta-learning gradient descent, or that some learning algorithms can do it, isn't new or specific to Transformer architectures.
I'm not suggesting that you can use meta-learning to discover gradient descent; that would not be surprising. My claim is that GPT-4 (probably even GPT-2) is almost certainly already running gradient descent internally, inside its own activations.
How else can it learn a random small neural network given enough samples in the prompt?
No, they won't make any real attempt to slow down. No, they won't do more than make incremental advances. To the extent that large benchmark gains occur, they will be shown to have little practical value.
If that comes to pass, we'll probably have a groundbreaking discovery on our hands: that the brain violates the physical Church-Turing thesis. That would be quite unexpected.
While the test itself may not have been posted online, read the individual questions. They are extremely common topics on the internet and I wouldn't be surprised if the contents of this test are 100x+ replicated across different QM material that's available.
> In general, I’d say that GPT-4 was strongest on true/false questions and (ironically!) conceptual questions—the ones where many students struggled the most. It was (again ironically!) weakest on calculation questions, where it would often know what kind of calculation to do but then botch the execution.
It'd be great if chain-of-thought / show-your-work type prompts became the default for anything involving complex, multi-step calculations or logic.
GPT-4 would have almost certainly gotten a higher score on the calculation questions if those methods were used.
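Rough sketch of what a "show your work by default" wrapper could look like. This is purely illustrative: the heuristic is crude, and ask_model is a stand-in for whatever LLM call you actually use, not a real API.

    import re

    def looks_like_calculation(prompt: str) -> bool:
        # Crude placeholder heuristic: digits plus math-ish vocabulary.
        has_digits = bool(re.search(r"\d", prompt))
        has_mathy_words = bool(re.search(
            r"\b(calculate|compute|probability|amplitude|solve)\b", prompt, re.I))
        return has_digits and has_mathy_words

    def ask_model(prompt: str) -> str:
        raise NotImplementedError("call your LLM of choice here")

    def ask(prompt: str) -> str:
        if looks_like_calculation(prompt):
            # Nudge the model into laying out intermediate steps first.
            prompt += "\n\nLet's think step by step, showing all intermediate work."
        return ask_model(prompt)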
Eh, even when asked specifically to show its work, GPT-4 still frequently makes calculation errors. It's just one of the limitations of current LLMs, and can easily be solved by integration with Wolfram, or even just a basic calculator.
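For the "basic calculator" point, a hedged sketch of one way to do it: ask the model to wrap any arithmetic it needs in a marker like calc(...), then have ordinary code evaluate those expressions, so the digits never depend on the LLM. (The calc(...) convention is an assumption made up for illustration, not something GPT-4 produces on its own.)

    import ast
    import operator
    import re

    # Evaluator restricted to simple arithmetic (no names, no calls).
    _OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv,
            ast.Pow: operator.pow, ast.USub: operator.neg}

    def safe_eval(expr):
        def walk(node):
            if isinstance(node, ast.Expression):
                return walk(node.body)
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
                return _OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
                return _OPS[type(node.op)](walk(node.operand))
            raise ValueError("unsupported expression: " + expr)
        return walk(ast.parse(expr, mode="eval"))

    def recompute(model_output):
        # Replace every calc(EXPR) in the model's answer with a locally
        # computed value, so the arithmetic never relies on the LLM itself.
        return re.sub(r"calc\(([^)]+)\)",
                      lambda m: str(safe_eval(m.group(1))), model_output)

    print(recompute("123456 + 654321 = calc(123456 + 654321)"))
    # -> "123456 + 654321 = 777777"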
I would very much like to see his analysis on which interpretations correlate with high scores. I would guess the more niche ones do, and Copenhagen and Many-Worlds will do the worst.
I'm curious how the predictive aspect of LLMs can generate / "solve" equations. Is it purely "these inputs are most likely followed by this output", and thus needs to have actually seen the problem to get it right, or is it able to infer some of the rules underlying the operations?
Prediction is only the training objective. The whole point of training is to learn what it takes to predict well and accurately, and that necessitates understanding, and whatever else an LLM learns from the data.
Clearly it doesn't need to have seen the exact input before, because you can ask GPT-4 to add together two 6-digit numbers and it will get it right.
More impressively, if you generate a random small neural network and then put some samples in the prompt, it can do a surprisingly good job of predicting the result of putting other inputs into the network, presumably doing something that looks a bit like gradient descent; see https://arxiv.org/pdf/2212.07677.pdf
Solely in order to predict the next token, LLMs are learning an incredibly sophisticated model of both computation and the outside world.
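A rough sketch of that kind of probe, under the assumption that you paste the resulting prompt into whichever model you're testing yourself (no particular API is assumed): generate a small random network, show the model some input/output pairs, and compare its guess on a held-out input with the true value.

    import numpy as np

    rng = np.random.default_rng(0)

    # A random tiny MLP: 2 inputs -> 4 hidden units (tanh) -> 1 output.
    W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
    W2, b2 = rng.normal(size=4), rng.normal()

    def net(x):
        return float(W2 @ np.tanh(W1 @ x + b1) + b2)

    # Show the model some (input, output) pairs from the unknown function.
    examples = []
    for _ in range(16):
        x = rng.uniform(-1, 1, size=2)
        examples.append(f"input: {x[0]:.3f}, {x[1]:.3f} -> output: {net(x):.3f}")

    query = rng.uniform(-1, 1, size=2)
    prompt = ("The following input/output pairs come from an unknown function.\n"
              + "\n".join(examples)
              + f"\ninput: {query[0]:.3f}, {query[1]:.3f} -> output:")

    print(prompt)
    print("true value:", round(net(query), 3))  # compare with the model's answer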
Isn't this more an indictment of current teaching and testing methods than a praise of LLMs?
Like the old "you can't use a calculator on a test because you won't always have a calculator".
I'm old enough that my textbooks still had trig lookup tables when I was a kid. That seems silly now, but it was considered important at the time. I know slightly older people in publishing who used manual photosizers instead of just importing into Photoshop.
I remember being told multiple times that VR was going to fix programming because you could just visually move conditional blocks around.
So we have smartphones and Scratch and Photoshop and what have you, but...
We still need people to draw and write and photograph and sculpt.
We still need engineers to use the new models, and we still need mechanics to fix the computers and trains when they break.
If GPT-4 can pass a test, maybe it's a better reference point, or maybe a better study aid. Like taped language lessons vs reading pronunciation diagrams.
You'll never remove people from an equation that is defined by people, regardless of that singularity that is just around the corner, like cold fusion and self-driving cars and the cure for the common cold.
It's impressive, but the vast majority of these questions have already been asked essentially verbatim online, and been answered.
I'm of the opinion that only tests designed for students who were allowed to access the internet should be used as benchmarks for LLMs, and this wouldn't be one.
I tried the first one on GPT-3.5-turbo and it got the answer wrong haha. Goes to show you how much better GPT-4 is...
GPT-3.5-TURBO:
---------------------------
True.
A pure state of n qubits is described by an n-dimensional complex unit vector, also known as a quantum state vector. In Dirac notation, a pure state of n qubits can be represented as:
|ψ⟩ = a_0|0⟩ + a_1|1⟩ + ... + a_(2^n-1)|2^n-1⟩
where a_i are complex numbers and |i⟩ represents the i-th computational basis state. The normalization condition |a_0|^2 + |a_1|^2 + ... + |a_(2^n-1)|^2 = 1 must also be satisfied.
(hallucination?)
---------------------------
GPT-4:
---------------------------
False. A pure state of n qubits is described by a 2^n-dimensional complex unit vector. In a quantum system, each qubit can be in a superposition of two basis states, so for n qubits, there are 2^n possible basis states in the overall Hilbert space.
---------------------------
Interesting to note that with GPT-3.5-turbo, a single followup chain-of-reasoning prompt ("Let's think step by step", i.e. no introduction of a priming prompt like "Are you sure?" or "Why are you wrong?") helps the model identify the correct answer:
GPT-3.5-TURBO:
---------------------------
Sure, let's break it down step by step:
Qubits: A qubit is the fundamental unit of quantum information. It is a quantum analogue to the classical bit. While a classical bit can take one of two values, 0 or 1, a qubit can be in a superposition of both 0 and 1 states simultaneously.
Basis states: For a single qubit, there are two basis states, often denoted as |0> and |1>. A qubit can be in any linear combination (superposition) of these basis states, represented as α|0> + β|1>, where α and β are complex numbers, and the sum of the magnitudes squared (|α|^2 + |β|^2) equals 1.
Multiple qubits: When you have a system of n qubits, each qubit can be in a superposition of its basis states. The overall state of the n-qubit system is described by the tensor product of the individual qubit states. The number of basis states in the combined system is equal to the product of the number of basis states of each individual qubit.
n-dimensional vector: For n classical bits, you would indeed need an n-dimensional vector to describe their state. However, with n qubits in a quantum system, the number of basis states grows exponentially, as opposed to linearly in the classical case.
2^n-dimensional complex unit vector: As we mentioned earlier, each qubit can be in a superposition of two basis states. Therefore, for n qubits, there are 2^n possible basis states. The overall state of an n-qubit system is represented by a 2^n-dimensional complex unit vector, with each component corresponding to the probability amplitude of one of the 2^n basis states.
In summary, a pure state of n qubits is described by a 2^n-dimensional complex unit vector, not an n-dimensional one.
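The same conclusion, checked numerically; a tiny numpy illustration of the tensor-product argument above:

    import numpy as np
    from functools import reduce

    # An arbitrary normalized single-qubit state, (|0> + |1>) / sqrt(2).
    plus = np.array([1, 1], dtype=complex) / np.sqrt(2)

    def joint_state(qubits):
        # The joint state of several qubits is the tensor (Kronecker)
        # product of the individual single-qubit states.
        return reduce(np.kron, qubits)

    for n in range(1, 6):
        state = joint_state([plus] * n)
        print(n, state.shape, np.isclose(np.vdot(state, state).real, 1.0))
    # Dimensions go 2, 4, 8, 16, 32: a pure state of n qubits lives in a
    # 2^n-dimensional complex vector space (and stays a unit vector).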
An LLM being able to pass your graduate-level exam almost certainly means that your exam is bad.
Lots of lazy professors do things like this (true/false and multiple choice answers, very soft questions, etc.), and the presence of GPTs should help them understand that this is not sufficient for evaluating someone's knowledge of a highly technical topic.
I have highly technical graduate and undergraduate degrees (in CS-adjacent field) and a decade of experience doing software development part time, but no direct experience with quantum computing (or much beyond an undergraduate physics class and an intro CS class in terms of directly related study), and consider myself a better-than-average test-taker. I could get 1D, 1E, 1F and 1T by intuition, 1J by actual knowledge, 1P by tautology, and had roughly no shot at the open-ended questions.
That doesn't mean anything about the quality of his courses or exams. Lots of world-class professors phone it in on their teaching responsibilities because they don't care. I had several of them in school - usually more prestigious professors had worse courses.
I can guarantee you that ChatGPT cannot do anything that a quantum computing class prepares you to do, aside from passing this final. That makes it a bad test.