GPT-4 (openai.com)
4091 points by e0m on March 14, 2023 | 2507 comments



After watching the demos I'm convinced that the new context length will have the biggest impact. The ability to dump 32k tokens into a prompt (25,000 words) seems like it will drastically expand the reasoning capability and number of use cases. A doctor can put an entire patient's medical history in the prompt, a lawyer an entire case history, etc.

As a professional...why not do this? There's a non-zero chance that it'll find something fairly basic that you missed and the cost is several cents. Even if it just phrases something obvious in a way that makes you think, it's well worth the effort for a multimillion dollar client.

If they further increase the context window, this thing becomes a Second Opinion machine. For pretty much any high level job. If you can put in ALL of the information relevant to a problem and it can algorithmically do reasoning, it's essentially a consultant that works for pennies per hour. And some tasks that professionals do could be replaced altogether. Out of all the use cases for LLMs that I've seen so far, this seems to me to have the biggest potential impact on daily life.

edit (addition): What % of people can hold 25,000 words' worth of information in their heads, while effectively reasoning with and manipulating it? I'm guessing maybe 10% at most, probably fewer. And they're probably the best in their fields. Now a computer has that ability. And anyone who has $20 for the OpenAI API can access it. This could get wild.
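
For anyone curious how to check whether a document actually fits, a rough sketch with OpenAI's tiktoken tokenizer (the file name is a placeholder; cl100k_base is the encoding used by the GPT-3.5/GPT-4 family):

    import tiktoken

    # cl100k_base is the encoding used by the GPT-3.5/GPT-4 models
    enc = tiktoken.get_encoding("cl100k_base")

    with open("case_history.txt") as f:   # placeholder input document
        text = f.read()

    n_tokens = len(enc.encode(text))
    print(f"{n_tokens} tokens; fits in the 32k window: {n_tokens <= 32_768}")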


> As a professional...why not do this?

Because your clients do not allow you to share their data with third parties?


What we really need is a model that you can run on your own hardware on site. I could never use this for business because they're reading everything you send through it, but let me run it on my own server and it would be unbelievably useful.

Imagine being able to ask your workplace server if it has noticed any unusual traffic, or to write a report on sales with nice graphs. It would be so useful.


> What we really need is a model that you can run on your own hardware on site.

we won’t have that until we come up with a better way to fund these things. “””Open””” AI was founded on that idea and had the best chance of anyone of reaching it: even going into things with that intent, they failed, switched to locking down the distribution of their models, and somehow managed to get swallowed up by MS despite the original non-profit-like structure. you just won’t see what you’re asking for for as long as this field is dominated by the profit motive.


Nah, it's already being done for GPT-3's competitors and will likely be done soon for GPT-4's competitors

https://arstechnica.com/information-technology/2023/03/you-c...


Curious why even companies at the very edge of innovation are unable to build moats?

I know nothing about AI, but when DALLE was released, I was under the impression that the leap of tech here is so crazy that no one is going to beat OpenAI at it. We have a bunch now: Stable Diffusion, MidJourney, lots of parallel projects that are similar.

Is it because OpenAI was sharing their secret sauce? Or is it that the sauce isn’t that special?


Google got a patent on transformers but didn't enforce it.

If it wasn't for patents you'd never get a moat from technology. Google, Facebook, Apple and all have a moat because of two sided markets: advertisers go where the audience is, app makers go where the users are.

(There's another kind of "tech" company that is wrongly lumped in with the others, this is an overcapitalized company that looks like it has a moat because it is overcapitalized and able to lose money to win market share. This includes Amazon, Uber and Netflix.)


I don't think this is strictly true, though it's rare. The easiest example is the semiconductor industry. ASML's high end lithography machines are basically alien and cannot be reproduced by anyone else. China has spent billions trying. I don't even think there's a way to make the IP public because of how much of it is in people's heads and in the processes in place. I wonder how much money, time and ASML resources it would take to stand up a completely separate company that can do what ASML does assuming that ASML could dedicate 100% of their time in assisting in training the personnel at said company.


Companies in the semiconductor industry are only tangentially or partially tech companies in the sense being discussed. They're producing physical goods that require complex physical manufacturing processes. The means of production are expensive, complex, and require significant expertise to operate once set up. The whole thing involves multiple levels of complex engineering challenges. Even if you wanted to make a small handful of chips, you'd still have to go through all that.

Most modern tech companies are software companies. To them, the means of production are a commodity server in a rack. It might be an expensive server, but that's actually dependent on scale. It might even be a personal computer on a desk, or a smartphone in a pocket. Further, while creating software is highly technical, duplicating it is probably the most trivial computing operation that exists. Not that distribution is trivial (although it certainly can be), just that if you have one copy of software or data, you have enough software or data for 8 billion people.


That is literally technology. It just isn't as software-heavy as you'd like?


No, I think it's very clear that upthread is talking about how software is difficult to build a moat around.

Chip fabs are literally among the most expensive facilities ever created. Saying that because they don't need a special moat, nothing in tech ever needs a special moat is so willfully blind that it borders on disingenuousness.


I don't think it's at all clear that upthread is exclusively talking about software.

The first use of "moat" upthread:

> Curious why even companies at the very edge of innovation are unable to build moats?


So you mean "Software" not "tech".


That's the comment you should have responded with instead of the one that you did.

Upthread used the term "tech" when the thread is very clearly talking about AI. AI is software, but because they used the term "tech" you cherry-picked non-software tech as a counter example. It doesn't fit because the type of tech that GPT-4 represents doesn't have the manufacturing cost like a chip fab does. It's totally different in kind regardless of the fact that they're both termed "tech".


Yeah, this is probably also true for TSMC, Intel and ARM. Look how slow progress is on RISC-V on the high end despite RISC-V having the best academic talent.


>despite RISC-V having the best academic talent.

academic performance is a bad predictor for real world performance


It's a decent predictor of real world performance just not a perfect one.


Unfortunately, RISC-V, despite the "open source" marketing, is still basically dominated by one company (SiFive) that designs all the commercial cores. They also employ everyone who writes the spec, so the current "compiled" spec document is about 5 years behind the actual production ISA. Intel and others are trying to break this monopoly right now.

Compare this to the AI ecosystem and you get a huge difference. The architecture of these AI systems is pretty well-known despite not being "open," and there is a tremendous amount of competition.


> the current "compiled" spec document is about 5 years behind the actual production ISA

How could I verify this information?


Read the RISC-V foundation website. There are numerous "ratified" parts of the RISC-V instruction set that are not in the latest "compiled" spec document.


Saying a "compiled" spec is out of date may be technically accurate (or not, I don't have any idea) but if open, published documentation of the ratified extensions is on the web site, it's misleading to cite it as evidence that the spec is not open. And I know that the draft specifications are open for public comment prior to being ratified, so it's not a secret what's under development, either.


I never said that it wasn't actually open source. I just said that the openness hasn't actually created meaningful competition, because there is a single company in control of the specs that abuses that control to create a moat.

For a concrete example, the bitmanip extensions (which provide significant increases in MIPS/MHz) were used by SiFive in commercial cores before ratification and finalization. No other company could do that because SiFive employees could just change the spec if they did. They're doing the same thing with vector/SIMD instructions now to support their machine learning ambitions.


It's kind of hilarious how complex some "reduced" instruction sets have become.


That was my question, too. What instructions have been undocumented for five years? What non-standardized extensions exist in SiFive cores?


I would also add Samsung semi to that list. As I understand, for the small nodes, everyone is using ASML. That's a bit scary to me.

About RISC-V: What do you think is different about RISC-V vs ARM? I can only think that ARM has been used in the wild for longer, so there is a meaningful feedback loop. Designers can incorporate this feedback into future designs. Don't give up hope on RISC-V too soon! It might have a place in IoT, which needs more diverse compute.


> Google got a patent on transfomers but didn't enforce it.

Google's Transformer patent isn't relevant to GPT at all. https://patents.google.com/patent/US10452978B2/en

They patented the original Transformer encoder-decoder architecture. But most modern models are built either only out of encoders (the BERT family) or only out of decoders (the GPT family).

Even if they wanted to enforce their patent, they couldn't. It's the classic problem with patenting things that every lawyer warns you about: "what if someone makes a change to circumvent your patent?"


Wait until Google goes down inevitably, then they will apply all their legal force just to save their sinking ship.


You can't tell unless you read the claims thoroughly. Degenerate use cases can be covered by general claims.


Indeed. I read the claims. You can too. They're short.


Are you kidding? There are 30 claims; it's hours of work to make complete sense of how these work together and what they possibly do/do not cover. I've filed my own patents, so I've read through enough prior art, and I'm not doing it for a pointless internet argument.


IANAL. I looked through the patent, not just the Claims. I certainly didn't read all of it. But while it leaves open many possible variations, it's a patent for sequence transduction and it's quite explicit everywhere that the system comprises a decoder and an encoder (see Claim 1, the most vague) and nowhere did I see any hint that you could leave out one or the other or that you could leave out the encoder-decoder attention submodule (the "degenerate use-case" you suggested). The patent is only about sequence transduction (e.g. in translation).

Now an encoder+decoder is very similar to a decoder-only transformer, but it's certainly an inventive step to make that modification and I'm pretty sure the patent doesn't contain it. It does describe all the other pieces of a decoder/encoder-only transformer though, despite not being covered by any of the claims, and I have no idea what a court would think about that since IANAL.


Or, Amazon, Uber, and Netflix have access to so much capital based on investors' judgment that they will be able to win and protect market share by effective execution, thereby creating a defensible moat.


I think his point was that if the moat doesn't exist without more money continually being thrown at it, then it isn't a moat.


It's because moving forward is hard, but moving backward when you know what the space of answers is, is much easier.

Once you know that OpenAI gets a certain set of results with roughly technology X, it's much easier to recreate that work than to do it in the first place.

This is true of most technology. Inventing the telephone is something, but if you told a competent engineer the basic idea, they'd be able to do it 50 years earlier no problem.

Same with flight. There are some really tricky problems with counter-intuitive answers (like how stalls work and how turning should work; which still mess up new pilots today). The space of possible answers is huge, and even the questions themselves are very unclear. It took the Wright brothers years of experiments to understand that they were stalling their wing. But once you have the basic questions and their rough answers, any amateur can build a plane today in their shed.


I agree with your overall point, but I don't think that we'd be able to get the telephone 50 years earlier because of how many other industries had to align to allow for its invention. Insulated wire didn't readily or cheaply come in spools until after the telegraph in the 1840s. The telephone came in 1876, so 50 years earlier would have been 1826.


You didn't mention it explicitly but I think the morale factor is also huge. Once you know it's possible, it does away with all those fears of wasted nights/weekends/resources/etc for something that might not actually be possible.


I think it's because everyone's swimming in the same bath. People move around between companies, things are whispered, papers are published, techniques are mentioned and details filled in, products are reverse-engineered. Progress is incremental.


> Or is it that the sauce isn’t that special?

The sauce is special, but the recipe is already known. Most of what things like LLMs are based on comes from published research, so in principle coming up with an architecture that can do something very close is doable for anyone with the skills to understand the research material.

The problems start with a) taking the architecture to a finished and fine-tuned model and b) running that model. Because now we are talking about non-trivial amounts of compute, storage and bandwidth, seemingly mundane resources suddenly become a very real problem.


OpenAI can't build a moat because OpenAI isn't a new vertical, or even a complete product.

Right now the magical demo is being paraded around, exploiting the same "worse is better" that toppled previous ivory towers of computing. It's helpful while the real product development happens elsewhere, since it keeps investors hyped about something.

The new verticals seem smaller than all of AI/ML. One company dominating ML is about as likely as a single source owning the living room or the smartphones or the web. That's a platitude for companies to woo their shareholders and for regulators to point at while doing their job. ML dominating the living room or smartphones or the web or education or professional work is equally unrealistic.


I'm not sure how "keep the secret sauce secret and only offer it as a service" isn't a moat? Here the 'secret sauce' is the training data and the trained network, not the methodology, but the way they're going, it's only a matter of time before they start withholding key details of the methodology too.


Luckily ML isn't that complicated. People will find out stuff without the cool kids at OpenAI telling them.


>Or is it that the sauce isn’t that special?

Most likely this.


I also expect a high moat, especially regarding training data.

But the counter to the high moat would be the atomic bomb: the Soviets were able to build it for a fraction of what it cost the US because the hard parts were leaked to them.

GPT-3, afaik, is easier pickings because they used a bigger model than necessary, but guidelines about model size vs. training data have appeared since, so GPT-4 probably won't be as easy to trim down.


You can have the most special sauce in the world, but if you're hiding it in the closet because you fear it will hurt sales of your classic sauce, then don't be surprised by what happens (also known as the Innovator's Dilemma).


Isn't MidJourney a fork of Stable Diffusion?


One of the middle version models was, but the first and latest model versions are homegrown.


Not originally, MidJourney came out before Stable Diffusion


The sauce really doesn't seem all that special.


Because we are headed to a world of semi-automated luxury socialism. Having a genius at your service for less than $1000 per year is just an insane break to the system we live in. We all need to think hard about how to design the world we want to live in.


> we won’t have that until we come up with a better way to fund these things.

Isn't this already happening with LLaMA and Dalai etc.? Already you can run Whisper yourself, and you can run a model almost as powerful as gpt-3.5-turbo. So I can't see why it's out of the question that we'll be able to host a model as powerful as GPT-4 on our own (highly specced) Mac Studio M3s, or whatever it may be.


https://github.com/tatsu-lab/stanford_alpaca

Tada! Literally runs on a raspberry pi (very slowly).

GPT models are incredible but the future is somehow even more amazing than that.

I suspect this will be the approach for legal / medical uses (if regulation allows).
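
If regulation does allow it, a minimal sketch of what running such a model on your own hardware could look like with Hugging Face transformers (the checkpoint path is hypothetical; llama.cpp-style quantized builds are the other common route):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_dir = "/models/alpaca-7b"   # hypothetical local checkpoint; nothing leaves your machine
    tok = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")

    prompt = "Summarize the following case notes:\n..."
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=200)
    print(tok.decode(out[0], skip_special_tokens=True))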


I don’t think on site is going to be necessary. Even the US intelligence community trusts that Amazon isn’t spying on the spies.

But a model that can run on a private cluster is certainly something that there’s going to be demand for. And once that exists there’s no reason it couldn’t be run on site.

You can see why OpenAI doesn’t want to do it though. SaaS is more lucrative.


> Even the US intelligence community trusts that Amazon isn’t spying on the spies

I’m not sure what you mean by this, but it’s incorrect. Sensitive USG information is not processed on Amazon’s commercial offering.

> The Amazon-built cloud will operate behind the IC’s firewall, or more simply: It’s a public cloud built on private premises. [1]

I think this is what you’re referring to.

1 - https://www.theatlantic.com/technology/archive/2014/07/the-d...



No, the grandparent poster was right. That's other agencies, not the intelligence community. He's right that the cloud I was thinking of is on-prem, but with Amazon personnel (who are cleared).

So not the greatest analogy. But still I think most doctors, lawyers etc should be okay with their own cluster running in the cloud.


Not lawyers in the US at least, that would typically be a violation of confidentiality. Even with a client's permission, it would work a waiver of attorney-client privilege. (I don't use GPT but I'm assuming the ToS is clear that someone there can examine the input material? Can it even be used to build their model, i.e., could submitted information potentially work its way back to the eyes of the public and not just OpenAI engineers?) I imagine HIPAA issues would stop doctors. Can HIPAA data be stored on the cloud? Every instance I've seen, they store it locally.


I agree with you on the SaaS version but the scenario I was thinking of was where there is a licensable model that can be run on a cluster in law firm’s AWS account. I think that should be okay.

HIPAA data can definitely be stored in the cloud given the right setup. I’ve worked for companies that have done so (the audit is a bit of a pain.)


I work in legaltech, and we use cloud services like AWS for lawsuit data, and lawyers trust it. Any third party must of course be vetted, go through an NDA, and follow regional laws and guidelines etc., but the cloud is definitely used for legaltech documents, including sensitive data.


It should be added that legaltech vendors are often employed as go-betweens for quite adversarial interactions, such as e-discovery, that require them to be trusted (to a degree) by both sides of a case, even if they are being paid by one side.


Seems like there are lots of confidentiality and reliability issues in how tech is being used in law right now, but there aren't that many attorneys who understand the issues, and those that do find it more advantageous to overlook them unless forced to do otherwise.


> Can HIPAA data be stored on the cloud?

Absolutely. Virtually every instance of Epic EHR is hosted, for example.


HIPAA regulated organizations routinely store protected health information on the cloud. This has been common practice for many years. The physical location is legally irrelevant as long as security and privacy requirements are met. AWS and other large cloud vendors specifically target this market and make it easy to achieve legal compliance.

https://aws.amazon.com/compliance/hipaa-compliance/


Are they even aware of where their data is? Opening a web browser might be a big hint for them, but how about editing something in Microsoft Office? Does the data there ever touch the cloud? Do Chromebooks make it clear enough where the data is?

I imagine lawyers knowing about where document data is stored as a bit like software developers being sufficiently aware of licensing. There's plenty who are paying attention, but there's also plenty who are simply unaware.


> You can see why OpenAI doesn’t want to do it though.

Except they already do offer private cluster solutions, you just need usage in the hundreds of millions of tokens per day before they want to talk to you (as in they might before that, but that’s the bar they say on the contact us page).


VMware charges people per GB of RAM attached to a VM. Selling on-prem software on consumption is very much possible. It's closed-source software, so as long as they require port 443 outbound to meter consumption, that'd work.


You can’t take the risk. A cloud server is too open and too juicy. Everyone will be probing it 24/7, including hostile countries


maybe we implement the tokenizer + first layer in JavaScript on the client side, and that is enough to keep the raw data on the client and send GPT only the first-layer output (which is a vector of float values anyway)

the output matrix gets decoded back into text on the client side in JavaScript, so we only send to and receive from ChatGPT vectors of floats (obfuscation?)


It's a good idea, but it seems quite easy to invert the first-layer mapping. And the output of the last layer can be recovered just by doing whatever would've been done in the client.
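
To see why the first layer gives little protection, here's a toy sketch (pure NumPy, with a made-up vocabulary and embedding matrix) of recovering tokens from first-layer embedding vectors by nearest-neighbour lookup; anyone holding the embedding table can do the same:

    import numpy as np

    vocab = ["the", "patient", "has", "severe", "migraines"]   # toy vocabulary
    rng = np.random.default_rng(0)
    E = rng.normal(size=(len(vocab), 8))   # stand-in for the model's embedding matrix

    token_ids = [1, 2, 3, 4]               # "patient has severe migraines"
    obfuscated = E[token_ids]              # the "vector of floats" sent over the wire

    # Inversion: the nearest embedding row recovers each token exactly
    recovered = [vocab[int(np.argmin(np.linalg.norm(E - v, axis=1)))] for v in obfuscated]
    print(recovered)                       # ['patient', 'has', 'severe', 'migraines']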


Could OpenAI just offer letting you upload a key and use it for interaction with the model? Basically encrypt the model with the key so that all the requests and responses are secure?

I’m probably oversimplifying but it feels doable.


the goal is to use ChatGPT without sending plain text to OpenAI (to preserve privacy, i.e. make sure OpenAI is unable to even see plain customer data)


Maybe if we could speak with GPT-4 instead of OpenAI ;)


Will the nonpareil parakeet make original discoveries and inventions from protein folding and stem cell results, GPT-X interfacing with DeepMind?


That model will be out in a few years. GPT-3 175b only took two years until someone trained an open source equivalent that could run on a few gpu devices.



Homomorphic encryption has a 1,000,000x performance disadvantage. So maybe in 30 years as we approach the Landauer limit, but not in our generation.


> So maybe in 30 years as we approach the Landauer limit, but not in our generation.

I feel like 30 years is squarely within our generation


Depends on the definition of "generation" being used. One definition of generation is "about 30 years", i.e., the amount of time it takes to go from infancy to raising a child. See definition 6 (as of time of writing): https://en.wiktionary.org/wiki/generation#Noun


Huh, thanks. I would not have guessed.


> What we really need is a model that you can run on your own hardware on site

So, LLaMA? It's no ChatGPT, but it can potentially serve this purpose.


the problem is that if you steal the weights then you can serve your own gpt4, and it's very hard to prove that what you're serving is actually gpt4. (or you could just start using it without paying ofc)


Presumably, if you give it identical prompts you get identical answers?


No, these NLPs aren't idempotent. Even if you ask ChatGPT the same question multiple times you will get different answers.


None of the siblings are right. The models themselves are idempotent: given the same context you will get the same activations. However the output distribution is sampled in a pseudorandom way by these chat tools. You can seed all the prngs in the system to always have reproducible output using sampling, or even go beyond that and just work with the raw probability distribution by hand.


Right. They are idempotent (making an API call doesn't cause a state change in the model[0] per se), but not necessarily deterministic (and less so as you raise the temp).

It is possible to architect things to be fully deterministic with an explicit seed for the pseudorandom aspects (which is mostly how Stable Diffusion works), but I haven't yet seen a Chatbot UI implementation that works that way.

[0] Except on a longer timeframe where the request may be incorporated into future training data.
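
A rough sketch of that explicit-seed setup with an open model in PyTorch/transformers (gpt2 is just a small stand-in; reproducibility assumes the same hardware and library versions):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in for any local causal LM
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def generate(prompt, seed):
        torch.manual_seed(seed)                            # pin the sampler's PRNG
        ids = tok(prompt, return_tensors="pt")
        out = model.generate(**ids, do_sample=True, temperature=0.8, max_new_tokens=30)
        return tok.decode(out[0], skip_special_tokens=True)

    print(generate("The contract states", seed=42))
    print(generate("The contract states", seed=42))        # identical output to the first call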


That's the feature of chat - it remembers what has been said, and that changes the context in which it says new things. If you use the API it starts fresh each time, and if you turn down the 'temperature' it produces very similar or even identical answers.


This may be an implementation detail to obfuscate GPT weights. OR it was to encourage selecting the best answers to further train the model.


Pseudo random numbers are injected into the models via its temperature settings, but OpenAI could seed that to get the same answers with the same input. I’m going out on a limb here with pure speculation but given the model, a temperature, and a known text prompt, OpenAI could probably reverse engineer a seed and prove that the weights are the same.


fine-tuning original weights solves that, and any sane person would fine-tune for their task anyways to get better results


Since fine-tuning is often done by freezing all but the top layers I wonder if it would still be possible to take a set of inputs and outputs and mathematically demonstrate that a model is derivative of ChatGPT. There may well be too much entropy to unpack, but I’m sure there will be researchers exploring this, if only to identify AI-generated material.

Of course, since the model is so large and general purpose already, I can’t assume the same fine-tuning techniques are used as for vastly smaller models, so maybe layers aren’t frozen at all.
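
For reference, the freeze-all-but-the-top-layers setup described above looks roughly like this in PyTorch/transformers (gpt2 is just a small stand-in model; whether anything like this is done for GPT-4-scale models is unknown):

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("gpt2")   # small stand-in for a much larger model

    # Freeze the whole network...
    for param in model.parameters():
        param.requires_grad = False

    # ...then unfreeze only the top transformer block for fine-tuning
    for param in model.transformer.h[-1].parameters():
        param.requires_grad = True

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} of {total:,} parameters")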


yes - they are multinomial distributions over answers essentially


LLMs calculate a probability distribution for the relative chances of the next token, then select a token randomly based on those weightings.
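
Concretely, that sampling step looks roughly like this (toy logits; temperature near zero collapses to the argmax, larger values spread the choice out):

    import numpy as np

    def sample_next_token(logits, temperature=0.8, rng=np.random.default_rng()):
        scaled = np.asarray(logits) / max(temperature, 1e-8)
        probs = np.exp(scaled - scaled.max())   # numerically stable softmax
        probs /= probs.sum()
        return rng.choice(len(probs), p=probs)

    logits = [2.0, 1.0, 0.2, -1.0]              # made-up scores for 4 candidate tokens
    print(sample_next_token(logits, temperature=0.1))   # almost always token 0
    print(sample_next_token(logits, temperature=1.5))   # much more varied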


They inject randomness in a layer where it has a small impact, on purpose.

Also to give it a more natural feel.

Can't find where I read about it.


You mean hallucinated graphs and word-predicted "unusual traffic"? No, I get that the models are very impressive, but I'm not sure they actually reason.


  The thinking elevator
  So the makers proudly say
  Will optimize its program
  In an almost human way.

  And truly, the resemblance
  Is uncomfortably strong:
  It isn't merely thinking,
  It is even thinking wrong.

Piet Hein wrote that in reference to the first operator-free elevators, some 70+ years ago.

What you call hallucination, I call misremembering. Humans do it too. The LLM failure modes are very similar to human failure modes, including making up stuff, being tricked to do something they shouldn't, and even getting mad at their interlocutors. Indeed, they're not merely thinking, they're even thinking wrong.


I don't think it's very salient that LLMs make stuff up, or can be manipulated into saying something they have been trained not to say. An LLM applies a statistical model to the problem of probability assignment over a range of tokens; a token of high probability is selected and the process repeats. This is not what humans do when humans think.

Given that GPT-4 is simply a large collection of numbers that combine with their inputs via arithmetic manipulation, resulting in a sequence of numbers, I find it hard to understand how they're "thinking".


> This is not what humans do when humans think.

Are you sure? Our senses have gaps that are being constantly filled all day long, it just gets more noticeable when our brain is exhausted and makes errors.

For example, when sleep deprived, people will see things that aren't there, but in my own experience they are much more likely to be things that could be there and make sense in context. I was walking around tired last night and saw a cockroach, because I was thinking about cockroaches after having killed one earlier, but on closer inspection it was a shadow. This has happened for other things in the past, like jackets on a chair, people when driving, etc. It seems to me that, at least when my brain is struggling, it fills in the gaps with things it has seen before in similar situations. That sounds a lot like probabilistic extrapolation from possibilities. I could see this capacity extending to novel thought with a few tweaks.

> Given that GPT-4 is a simply large collection of numbers that combine with their inputs via arithmetic manipulation, resulting in a sequence of numbers, I find it hard to understand how they're "thinking".

Reduce a human to atoms and identify which ones cause consciousness or thought. That is the fundamental paradox here and why people think it's a consequence of the system, which could also apply to technology.


We talk about "statistical models", and even "numbers" but really those things are just abstractions that are useful for us to talk about things (and more importantly, design things). They don't technically exist.

What exists are voltage levels that cause different stuff to happen. And we can't say much more about what humans do when humans think. You can surely assign abstractions to that too. Interpret neural spiking patterns as exotic biological ways to approximate numbers, or whatever.

As it happens I do think our difference from computers matter. But it's not due to our implementation details.


What do you mean by “actually reason”?

And, presumably you wouldn’t have the model generate the graph directly, but instead have it generate code which generates the graph.

I’m not sure what they had in mind for the “unusual traffic” bit.


For that I'd suggest using Langchain with Wolfram Alpha.

It's already been done and discussed:

- https://news.ycombinator.com/item?id=34422122

- https://news.ycombinator.com/item?id=34422627


“on site”? Medical records are in the cloud already.


Yes, but their access is strictly controlled. There's a lot of regulation about this stuff


If the chatbot technology proves useful I'm sure OAI could make some agreement to not store sensitive data.


yes - you could add regulation


Yes. But they aren't being shared with third-party AIs. Sharing personal medical information with OpenAI is a good way to get your medical org ground into dust under a massive class action lawsuit, not to mention hit with huge fines from the government.


That's ridiculous. Sure if you put it into ChatGPT today that's a problem. But if you have a deal with the company providing this service, and they are certified to follow the relevant regulations around sensitive data, why would that be different from any other cloud service?

If this proves actually useful I guess such agreements could be arranged quite quickly.


Yes, almost all eDiscovery is managed by cloud vendors as is, and no one worries about waiver of privilege to these companies. The only concerns I've heard have been related to foreign companies or governments not wanting their data to be hosted in a foreign country. But domestically it should be fine to have a ChatGPT Legal where data is discarded, not saved.


It's only been a few hours since Ring was hacked... a system run by a large company which assured everyone they were taking good care of their data. Surely the wonderful Amazon, with all of its massive capital, could do the simple thing of encrypting incredibly sensitive and private user data? Right?


Why do you think sharing the data with OpenAI is legally any different than storing it on AWS/Azure/GCP/Whatever else they are using?


GCP/AWS/Azure have HIPAA programs in places, and will, consequently, sign HIPAA BAAs to legally perform as Business Associates of covered entities, fully responsible for handling PHI in accord with HIPAA rules (for certain of their services.) OpenAI itself does not seem to offer this for either its UI or API offerings.

Microsoft, OTOH, does now offer a HIPAA BAA for its Azure OpenAI service, which includes ChatGPT (which means either they have a bespoke BAA with OpenAI that OpenAI doesn’t publicly offer, or they just are hosting their own ChatGPT instance, a privilege granted based on them being OpenAI’s main sponsor.)


GCP respects HIPAA (google 'gcp hipaa baa'). Does OpenAI?


If they don't now they will in the future, if they think there is money to be made. Why wouldn't they? They could even charge a premium for the service.



What is “the cloud” - that’s the question


As taken from the cover page of the July, 2018 edition of AARP Weekly.


right, but 'the cloud' isn't a singular monolithic database that everyone inputs data into for a result.

most of the AI offerings on the table right now aren't too dissimilar from that idea in principle.


That's not entirely true.

Google has a contract with the biggest hospital operator in the USA.

Thanks also to some certification they acquired.


This is Microsoft we're talking about. Hail the new old overlord.


Isn't Azure OpenAI supposed to do this? (not locally, but private)


Models you can run locally are coming soon.


Just ask OpenAI and it will build it :)


Just use the Azure hosted solution, which has all of Azure's stronger guarantees around compliance. I'm sure it will update with GPT-4 pricing shortly.

https://azure.microsoft.com/en-us/products/cognitive-service...

(disclaimer: I work for Microsoft but not on the Azure team)


Agreed. The same data privacy argument was used by people not wanting their data in the cloud. When an LLM provider is trusted with a company’s data, the argument will no longer be valid.


This is the biggest thing holding GPT back. Everyone with meaningful data has their hands tied behind their back. So many ideas, and the answer is "we can't put that data in GPT". Very frustrating.


Another way of looking at that is that GPT not being open source, so that companies could run it on their own clusters, is what's holding it back.


Back in the day Google offered hardware search appliances.

Offering sealed server boxes with GPT software, to run on premises heavily firewalled or air-gapped could be a viable business model.


[ A prompt that gets it to decompile itself. With good inline documentation too! ]


I'm afraid that even the most obedient human can't readily dump the contents of their connectome in a readable format. Same likely applies to LLMs: they study human-generated texts, not their own source code, let alone their tensors' weights.


Well, what they study is decided by the relevant hoominz. There's nothing actually stopping LLMs from trying to understand their own innards, is there? Except for the actual access.


Sounds like an easy problem to solve if this is actually the case.

OpenAI just has to promise they won't store the data. Perhaps they'll add a privacy premium for the extra effort, but so what?


Anyone that actually cares about the privacy of their data isn’t going to be satisfied with just a “promise”.


A legally binding agreement, whatever.


Still not enough. Seriously. Once information is out there it cannot be clawed back, but legal agreements are easily broken.

I worked as a lawyer for six years; there are extremely strict ethical and legal restrictions around sharing privileged information.


Hospitals are not storing the data on a harddrive in their basement so clearly this is a solvable problem. Here's a list of AWS services which can be used to store HIPAA data:

https://aws.amazon.com/compliance/hipaa-eligible-services-re...

As you can see, there is much more than zero of them.


The biglaw firms I’m familiar with still store matter data exclusively on-prem. There’s a significant chunk of floor space in my office tower dedicated to running a law firm server farm for a satellite office.


This might have been true 10-15 years ago. But I've worked at plenty of places that store/process confidential, HIPAA, etc data in the cloud.

Most companies' confidential information is already in their Gmail or Office 365.


> I worked as a lawyer for six years; there are extremely strict ethical and legal restrictions around sharing privileged information.

But Microsoft already got all the needed paperwork done to do these things, it isn't like this is some unsolved problem.


You can't unring a bell. Very true.

Nevertheless, the development of AI jurisprudence will be interesting.


What if there's a data breach? Hackers can't steal data that OpenAI doesn't have in the first place.


Or legal order. If you're on-site or on-cloud and in the US then it might not matter since they can get your data anyway, but if you're in another country uploading data across borders can be a problem.


That's why more research should be poured into homomorphic encryption, where you could send encrypted data to the API; OpenAI would then run the computation on the encrypted data, and we would only decrypt the output locally.

I would never send unencrypted PII to such an API, regardless of their privacy policy.
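
As a toy illustration of the idea, here's a sketch with the python-paillier library, which is only additively homomorphic (adding ciphertexts and scaling by plaintext constants), nowhere near enough to run a transformer; schemes like CKKS can do more, but with the enormous overhead mentioned elsewhere in the thread:

    from phe import paillier   # pip install phe (python-paillier)

    public_key, private_key = paillier.generate_paillier_keypair()

    secret = 120.5                          # e.g. a lab value the server should never see
    enc = public_key.encrypt(secret)

    # The "server" computes on ciphertext only: scale and shift without decrypting
    enc_result = enc * 0.9 + 4.0

    print(private_key.decrypt(enc_result))  # 112.45, decrypted only on the client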


Which will disappear soon enough, once it is able to run on premise.


Then you really shouldn’t use Google Docs, or Photoshop Online, or host your emails in the cloud.


You're saying it like you found a loophole or something, but it's not a gotcha. Yes, if you manipulate sensitive data you shouldn't use Google Docs or Photoshop Online (I'm not imaginative enough to think of a case where you would put sensitive data in Photoshop Online, but if you do, don't) or host your emails in the cloud. I've worked at a moderate-size company where everything was self-hosted and it's never been an issue.


Doctor-patient or lawyer-client confidentiality is slightly more serious a matter than your examples. And obviously it’s one thing for you to decide where to store your own things and another thing for someone else doing it with your confidential data…


Google Docs and Photoshop Online have offline alternatives (and if you ask me, native MS Office is still the golden standard for interoperability of editable documents), and I use neither in my work or personal life.

Email is harder, but I do run my own email server. For mostly network related reasons, it is easier to run it as a cloud VM, but there's nothing about the email protocol itself that needs you to use a centralised service or host it in a particular network location.


MS Office is just one login away from storing documents in the cloud. I bet tons of users have their documents stored in OneDrive without realizing it.

https://support.microsoft.com/en-us/office/save-documents-on...


These services have privacy and legal compliance options nowadays, and decisions to use them get board approval.

OpenAI just simply does not offer the same thing at this time. You’re stuck using Facebook’s model for the moment which is much inferior.


In these particular circles the idea of privacy at a technical and ideological level is very strong, but in a world where the biggest companies make their money by people freely sharing data every chance they get, I doubt that most would object to an affordable way to better their chances of survival or winning a court case.


I assume that health providers will use servers that are guaranteed not to share data with OpenAI.


Is that any different than sending your patient down the hall to get an MRI from a third-party practice operating inside the hospital? (Honest question, I don't know.)


How about open-source models like Flan-T5? What stops you from using them in your own cloud account or better on-prem?


And yet boatloads of people are willing to hand their phone number over to OpenAI.


It'll be a routine question, and everyone will just nod to give consent.


Biggest roadblock right here. Need a private version for sure.


You mean like the cloud?


do you use gmail?


What's the difference between entering in an anonymized patient history into ChatGPT and, say, googling their symptoms?


Anonymization doesn't just mean "leave their names out". An entire patient's medical history is in itself personally identifiable information. Instead of googling for "headache", they now have stored a copy of every medical detail in your life.


If it is de-identified per HIPAA, little.

OTOH, the more patient info you are putting in, the less likely it is actually legally deidentified.


Data that has ostensibly been "anonymized" can often be deanonymized.


Especially when the system we're discussing is literally the most advanced AI model we're aware of.


if you enter an entire patient history, it could easily be an identifier of the person, whereas Google queries have a much smaller maximum length in tokens


Can OpenAI get HIPAA certification? Perhaps offer a product that has it?


I've heard the Azure OpenAI service has HIPAA certification; they don't have GPT-4 yet, though.


The PDF on this page lists the services that are under audit scope; check the table in Appendix A. OpenAI is in scope for a HIPAA BAA.


The data moat effect is greater with OpenAIs products.


I'd be furious if I found out some professional I'd commissioned had taken a document based on my own personal data and pored over it themselves looking for errors, to the tune of hundreds of dollars per hour, instead of submitting it to ChatGPT.


Then why submit it to a professional human at all? If ChatGPT is prone to massive errors, humans have to pore over the input anyway. If ChatGPT can make subtle, rare errors, then again humans may need to be involved if the stakes are high enough to commission someone.


>If ChatGPT can make subtle, rare errors

Yeah, I think the issues presented will relate to uniquely tricky errors, or entirely new categories of errors we have to understand the nature of. In addition to subtle and rare, I think elaborately hallucinated and justified errors, errors that get justified and reasoned for with increasing sophistication, are going to be a category of error we'll have to deal with. Consider the case of making fake but very plausible-sounding citations to research papers, and how much further AI might be able to go to backfill its evidence and reasons.

Anyway, I just mean to suggest we will have to contend with a few new genres of errors


In a second-opinion advisory role this seems reasonable... And things are going to improve with time.


"Second Opinion machine" -- that's a good phrase. Before I read your post, the best term I heard was "summary machine". A huge part of "office work" (services) is reading and consuming large amounts of information, then trying to summarise or reason about it. Often, you are trying to find something that doesn't fit the expected pattern. If you are a lawyer, this is absolutely the future of your work. You write a short summary of the facts of the case, then ask GPT to find related case law and write the initial report. You review and ask GPT to improve some areas. It sounds very similar to how a senior partner directs their juniors, but the junior is replaced by GPT.

In my career, I saw a similar pattern with data warehouse users. Initially, managers asked junior analysts to write SQL. Later, the tools improved, and more technical managers could use a giant pivot table. Underneath, the effective query produced by the pivot table is way more complex than their previous SQL queries. Again, their jobs will change when on-site GPT become possible, so GPT can navigate their data warehouse.

It is 2023 now, and GPT-3 was already pretty good. GPT-4 will probably blow it away. What will it look like in 2030? It is terrifying to me. I think the whole internet will be full of GPT-generated ad copy that no one can distinguish from human-written material. There are a huge number of people employed as ad-copy writers on these crap ad-driven websites. What is their future work?


Pre-2023 “Wayback Machine” content will be the only content guaranteed to be human. The rest is AI-generated.


I must have missed the part when it started doing anything algorithmically. I thought it’s applied statistics, with all the consequences of that. Still a great achievement and super useful tool, but AGI claims really seem exaggerated.


This paper convinced me LLMs are not just "applied statistics", but learn world models and structure: https://thegradient.pub/othello/

You can look at an LLM trained on Othello moves, and extract from its internal state the current state of the board after each move you tell it. In other words, an LLM trained only on moves, like "E3, D3, ...", contains within it a model of an 8x8 board grid and the current state of each square.
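
The probing technique is simple to sketch: collect the model's hidden activations after each move and train a small classifier per board square to read off its state. The data below is random stand-in data just to show the shape of the experiment (the paper uses activations from a GPT trained on Othello move sequences, and small MLP probes rather than the logistic regression shown here):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Stand-in data: hidden states (n_moves x d_model) and the true state of one
    # board square after each move (0 = empty, 1 = black, 2 = white).
    rng = np.random.default_rng(0)
    hidden_states = rng.normal(size=(1000, 512))
    square_state = rng.integers(0, 3, size=1000)

    probe = LogisticRegression(max_iter=1000).fit(hidden_states[:800], square_state[:800])
    print(probe.score(hidden_states[800:], square_state[800:]))
    # ~chance here on random data; high probe accuracy on real Othello-GPT activations
    # is the paper's evidence that the board state is recoverable from the internals.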


That paper is famously misleading.

It's all the same classic personification of LLMs. What an LLM can show is not the same as what it can do.

The model was already present: in the example game moves. The LLM modeled what it was given, and it was given none other than a valid series of Othello game states.

Here's the problem with personification: A person who has modeled the game of Othello can use that model to strategize. An LLM cannot.

An LLM can only take the whole model and repeat its parts with the most familiar patterns. It is stuck fuzzing around the strategies (or sections of strategy) it has been given. It cannot invent a new divergent strategy, even if the game rules require it to. It cannot choose the winning strategy unless that behavior is what was already recorded in the training corpus.

An LLM does not play games, it plays plays.


Sorry, but what does anything you've said there have to do with the Othello paper?

The point of that paper was that the AI was given nothing but sequences of move locations, and it nonetheless intuited the "world model" necessary to explain those locations. That is, it figured out that it needed to allocate 64 binary values and swap some of them after each move. The paper demonstrated that the AI was not just doing applied statistics on character strings - it had constructed a model to explain what the strings represented.

"Strategy", meanwhile, has nothing to do with anything. The AI wasn't trained on competitive matches - it had no way of knowing that Othello has scoring, or even a win condition. It was simply trained to predict which moves are legal, not to strategize about anything.


> The point of that paper was that the AI was given nothing but sequences of move locations, and it nonetheless intuited the "world model" necessary to explain those locations

Yes...

> That is, it figured out that it needed to allocate 64 binary values and swap some of them after each move.

Yes, but "figured out" is misleading.

It didn't invent or "figure out" the model. It discovered it, just like any other pattern it discovers.

The pattern was already present in the example game. It was the "negative space" that the moves existed in.

> "Strategy", meanwhile, has nothing to do with anything. The AI wasn't trained on competitive matches - it had no way of knowing that Othello has scoring, or even a win condition. It was simply trained to predict which moves are legal, not to strategize about anything.

Yes, and that is critically important knowledge; yet dozens, if not hundreds, of comments here are missing that point.

It found a model. That doesn't mean it can use the model. It can only repeat examples of the "uses" it has already seen. This is also the nature of the model itself: it was found by looking at the structural patterns of the example game. It was not magically constructed.

> predict what moves are legal

That looks like strategy, but it's still missing the point. We are the ones categorizing GPT's results as "legal". GPT never uses the word. It doesn't make that judgement anywhere. It just generates the continuation we told it to.

What GPT was trained to do is emulate strategy. It modeled the example set of valid chronological game states. It can use that model to extrapolate any arbitrary valid game state into a hallucinated set of chronological game states. The model is so accurate that the hallucinated games usually follow the rules. Provided enough examples of edge cases, it could likely hallucinate a correct game every time; but that would still not be anything like a person playing the game intentionally.

The more complete and exhaustive the example games are, the more "correctly" GPT's model will match the game rules. But even having a good model is not enough to generate novel strategy: GPT will repeat the moves it feels to be most familiar to a given game state.

GPT does not play games, it plays plays.


> It found a model. That doesn't mean it can use the model.

It used the model in the only way that was investigated. The researchers tested whether the AI would invent a (known) model and use it to predict valid moves, and the AI did exactly that. They didn't try to make the AI strategize, or invent other models, or any of the things you're bringing up.

If you want to claim that AIs can't do something, you should present a case where someone tried unsuccessfully to make an AI do whatever it is you have in mind. The Othello paper isn't that.


"GPT will repeat the moves it feels to be most familiar to a given game state"

That's where temperature comes in. AI that parrots the highest-probability output every time tends to be very boring and stilted. When we instead select randomly from all possible responses weighted by their probability we get more interesting behavior.

GPT also doesn't only respond based on examples it has already seen - that would be a markov chain. It turns out that even with trillions of words in a dataset, once you have 10 or so words in a row you will usually already be in a region that doesn't appear in the dataset at all. Instead the whole reason we have an AI here is so it learns to actually predict a response to this novel input based on higher-level rules that it has discovered.

I don't know how this relates to the discussion you were having but I felt like this is useful & interesting info
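
For contrast, the Markov-chain baseline mentioned above fits in a few lines; it can only ever emit a continuation it has literally seen after the current word, which is exactly the limitation being discussed:

    import random
    from collections import defaultdict

    corpus = "the cat sat on the mat and the cat ran".split()

    # Bigram table: word -> list of words observed immediately after it
    table = defaultdict(list)
    for a, b in zip(corpus, corpus[1:]):
        table[a].append(b)

    word, out = "the", ["the"]
    for _ in range(8):
        options = table.get(word)
        if not options:                 # dead end: no observed continuation
            break
        word = random.choice(options)   # only ever picks an already-seen continuation
        out.append(word)
    print(" ".join(out))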


> GPT also doesn't only respond based on examples it has already seen - that would be a markov chain

The difference between GPT and a Markov chain is that GPT is finding more interesting patterns to repeat. It's still only working with "examples it has seen": the difference is that it is "seeing" more perspectives than a Markov chain could.

It still can only repeat the content it has seen. A unique prompt will have GPT construct that repetition in a way that follows less obvious patterns: something a Markov chain cannot accomplish.

The less obvious patterns are your "higher level rules". GPT doesn't see them as "rules", though. It just sees another pattern of tokens.

I was being very specific when I said, "GPT will repeat the moves it feels to be most familiar to a given game state."

The familiarity I'm talking about here is between the game state modeled in the prompt and the game states (and progressions) in GPT's model. Familiarity is defined implicitly by every pattern GPT can see.

GPT adds the prompt itself into its training corpus, and models it. By doing so, it finds a "place" (semantically) in its model where the prompt "belongs". It then finds the most familiar pattern of game state progression when starting at that position in the model.

Because there are complex patterns that GPT has implicitly modeled, the path GPT takes through its model can be just as complex. GPT is still doing no more than blindly following a pattern, but the complexity of the pattern itself "emerges" as "behavior".

Anything else that is done to seed divergent behavior (like the temperature alteration you mentioned) is also a source of "emergent behavior". This is still not part of the behavior of GPT itself: it's the behavior of humans making more interesting input for GPT to model.


What is the closest approach we know of today that plays games, not plays? The dialogue above is compelling, and makes me wonder if the same critique can be levied against most prior art in machine learning applied against games. E.g. would you say the same things about AlphaZero?


> It didn't invent or "figure out" the model. It discovered it, just like any other pattern it discovers.

Sure, and why isn't discovering patterns "figuring it out"?


What can be done with "it" after "figuring out" is different for a person than for an LLM.

A person can use a model to do any arbitrary thing they want to do.

An LLM can use a model to follow the patterns that are already present in that model. It doesn't choose the pattern, either: it will start at whatever location in the model that the prompt is modeled into, and then follow whatever pattern is most obvious to follow from that position.


> An LLM can use a model to follow the patterns that are already present in that model.

If that were true then it would not be effective at zero-shot learning.

> It doesn't choose the pattern, either: it will start at whatever location in the model that the prompt is modeled into, and then follow whatever pattern is most obvious to follow from that position.

Hmm, sounds like logical deduction...


> An LLM can only take the whole model and repeat its parts with the most familiar patterns. It is stuck fuzzing around the strategies (or sections of strategy) it has been given. It cannot invent a new divergent strategy, even if the game rules require it to. It cannot choose the winning strategy unless that behavior is what was already recorded in the training corpus.

Where are you getting that from? My understanding is that you can get new, advanced, winning moves by starting a prompt with "total victory for the genius grandmaster player one who uses new and advanced winning techniques". If the model is capable and big enough, it'll give the correct completion by really inventing new strategies.


It could give you a new strategy that is built from the parts of other known strategies. But would it give you the best one?

Let's say the training corpus contains stories that compare example strategies. Each part of a strategy is explicitly weighed against another: one is called "superior".

Now all you need is a prompt that asks for "a strategy containing all superior features". There are probably plenty of grammatical examples elsewhere in the model that make that transformation.

All the work here is done by humans writing the training corpus. GPT never understood any of the steps. GPT just continued our story with the most obvious conclusion; and we made certain that conclusion would be correct.

GPT doesn't play games, it plays plays.


> GPT never understood any of the steps. GPT just continued our story with the most obvious conclusion; and we made certain that conclusion would be correct.

Perhaps the earlier or current variations of GPT, for most games? But the idea that LLMs can never make anything novel, that it will never "generalise out of distribution" (if that's the correct term here) seems to be just an assertion, not backed by any theory with great evidence behind it.

The "goal" of an LLM is to predict the next token. And the best way to do that is not brute force memorisation or regurgitating training data in various combinations, but to have a world model inside of it that will allow it to predict both the moves a bad player might make, and moves that a grandmaster might make.


> The "goal" of an LLM is to predict the next token

That's another common misconception. That statement personifies GPT: GPT does not have goals or make predictions. Those are the effects of GPT: the behavior its authors hope will "emerge". None of that behavior comes from GPT itself. The behavior is defined by the patterns of tokens in the training corpus.

GPT itself has two behaviors: modeling and presentation. GPT creates an implicit model of every pattern it can find between the tokens in its training corpus. It then expands that model to include the tokens of an arbitrary prompt. Finally, it presents the model to us by starting at the location it just added the prompt tokens to, and simply following the most obvious path forward until that path ends.

The paths that GPT has available to present to us were already present in the training corpus. It isn't GPT that constructs the behavior, it is the people writing patterns into text.

> not brute force memorisation or regurgitating training data in various combinations

Not brute force: the combinations are not blindly assembled by GPT. GPT doesn't assemble combinations. The combinations were already assembled with patterns of grammar by the humans who wrote the valid progressions of game states. GPT found those patterns when it made its model.

> to have a world model inside of it that will allow it to predict both the moves a bad player might make, and moves that a grandmaster might make.

There is no prediction. A series of moves is a path carved into grammar. The path from one game state to the next involves several complex patterns that GPT has implicitly modeled. Depending on where GPT starts, the most obvious continuation may be to follow a more complex path. Even so, it's not GPT deciding where to go, it's the patterns that are already present that determine the path.

Because we use the same grammatical/writing patterns to describe "good play" and "bad play", it's difficult to distinguish between the two. GPT alone can't categorize the skill level of games, but narrative surrounding those game examples potentially can.


Sounds like the type of prompt that would boldly give you a wrong/illegal answer.


Perhaps. But the point is that some prompt will coax it into giving good answers that really make it win the game, if it has a good "world model" of how the game works. And there's no reason to think a language model cannot have such a world model. What exactly that prompt might be, the prompt engineers know best.


That's a great way of describing it, and I think a very necessary and important thing to communicate at this time. A lot of people in this thread are saying that it's all "just" statistics, but "mere" statistics can give enough info to support inferences to a stable underlying world, and the reasoning about the world shows up in sophisticated associations made by the models.


It’s clear they do seem to construct models from which to derive responses. The problem is once you stray away from purely textual content, those models often get completely batshit. For example if you ask it what latitude and longitude are, and what makes a town further north than another, it will tell you. But if you ask it if this town is further north than this other town, it will give you latitudes that are sometimes correct, sometimes made up, and will randomly get which one is further north wrong, even based on the latitudes it gave.

That's because it doesn't have an actual understanding of the geography of the globe, because the training texts weren't sufficient to give it that. It can explain latitude, but doesn't actually know how to reason about it, even though it can explain how to reason about it. That's because explaining something and doing it are completely different kinds of tasks.

If it does this with the globe and simple stuff like latitudes, what are the chances it will mess up basic relationships between organs, symptoms, treatments, etc. for the human body? I'm not going to trust medical advice from these things without an awful lot of very strong evidence.


You can probably fix this insufficient training by going multimodal. Just like it would take excessively long to teach a person the concept of a color they can't see, an AI would need an infeasible amount of text data to learn about, say, music. But give it direct training on music data and I think the model will quickly grasp the context.


> It’s clear they do seem to construct models from which to derive responses. The problem is once you stray away from purely textual content, those models often get completely batshit

I think you mean that it can only intelligently converse in domains for which it's seen training data. Obviously the corpus of natural language it was trained on does not give it enough information to infer the spatial relationships of latitude and longitude.

I think this is important to clarify, because people might confuse your statement to mean that LLMs cannot process non-textual content, which is incorrect. In fact, adding multimodal training improves LLMs by orders of magnitude because the richer structure enables them to infer better relationships even in textual data:

Multimodal Chain-of-Thought Reasoning in Language Models, https://arxiv.org/abs/2302.00923


I don't think this is a particularly interesting criticism. The fact of the matter is that this is just solved by chain-of-thought reasoning. If you need the model to be "correct", you can make it get there by first writing out the two different latitudes, and then it will get it right. This is basically the same way that people can/will guesstimate at something vs doing the actual math. For a medical AI, you'll definitely need it to chain-of-thought every inference and step/conclusion on the path but...
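
For concreteness, the kind of "write the latitudes out first" prompting being described might look like this. A minimal sketch using the OpenAI Python client as it existed at GPT-4's launch; the prompt wording is purely my own illustration, not a tested recipe:

  import openai  # assumes openai.api_key has been set elsewhere

  # Chain-of-thought style prompt: ask for the intermediate values first,
  # then an explicit comparison, then the final answer.
  messages = [{
      "role": "user",
      "content": (
          "Which city is further north, Brussels or New York?\n"
          "First state the latitude of each city, then explicitly compare "
          "the two numbers, and only then give the final answer."
      ),
  }]
  response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
  print(response["choices"][0]["message"]["content"])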


>you can make it get there by first writing out the two different latitudes, and then it will get it right

As I said in my comment, even if the model 'knows' and tells you that town A is at 64' North latitude and town B is at 53', it will sometimes tell you town B is the furthest north.

That's because its training set includes texts where people talk about one town being further north than the other, and their latitudes, but the neural net wasn't able to infer the significance of the numbers in the latitude values. There wasn't enough correlation in the text for it to infer their significance, or generate a model for accurately doing calculations on them.

Meanwhile the training text must have contained many explanations of what latitude and longitude are and how to do calculations on them. As a result the model can splurge out texts explaining latitude and longitude. That only helps it splurge out that kind of text though. It doesn't do anything towards actually teaching it what these concepts are, how they relate to a spherical geographic model, or to actually do the calculations.

It's the same way GPT-3 could reliably generate texts explaining mathematics and how to do arithmetic in lots of very accurate detail, because it was trained on many texts that gave such explanations, but couldn't actually do maths.

It is possible to overcome these issues with a huge amount of domain relevant training text to help the LLM build a model of the specific problem domain. So these problems can be overcome. But the point stands that just because a model can explain in detail how to do something, that doesn't mean it can actually do it itself at all. They're completely different things that require radically different training approaches.


Can you give an example that ChatGPT 4 doesn't get right? ChatGPT4 is much much better at logic than 3.5, it's almost laughable. It's really really impressive.

Here is ChatGPT 4s output btw:

> What is the longitude and latitude of Brussels?

Brussels, the capital city of Belgium, is located at approximately 50.8503° North latitude and 4.3517° East longitude.

> What about New York?

New York City, located in the United States, has approximate coordinates of 40.7128° North latitude and 74.0060° West longitude.

> Which one is more north?

Brussels is more north than New York City. Brussels is located at approximately 50.8503° North latitude, while New York City is at approximately 40.7128° North latitude.


^ Agree. I'm convinced my 2-year-old doesn't operate on a dramatically different strategy than an LLM -- she's learned that when you are negotiating something (continued access to browse pictures on a parent's phone, getting to watch TV, staying longer at a place she likes, etc.), you can add on "2 minutes?" to your request and sometimes the opposing negotiator will give you some more time. She doesn't know what exactly a minute is or that specific number, but she's observed that it's correlated with getting what you want more than, say, a whine. This is simple statistics and probability, in a biological neural network.

I think it's really cute how defensive and dismissive humans get (including those who profess zero supernatural beliefs) when they're trying so valiantly to write off all AI as a cheap parlor trick.


All that said, the fact that AI is catching up to 2-year-olds is pretty impressive. Humans' brains surpass dogs' at about that age. It shows we're getting close to the realm of "human."


Given how many university-level tests GPT4 places better than 50th percentile at, I don't know if "catching up to 2 year olds" is a fair description. For that kind of text based task it seems well ahead of the general adult human population.


To be fair, such tests are designed with the human mind in, well, mind, and assume that various hard-to-quantify variables – ones that the tester is actually interested in – correlate with test performance. But LLMs are alien minds with very different correlations. It’s clear, of course, that ChatGPT’s language skills vastly exceed those of an average 2-year-old, and indeed surpass the skills of a considerable fraction of general adult population, but the generality of its intelligence is probably not above a human toddler.


You could write a quiz answer bot that is well ahead of the general population without any AI, just by summarizing the first page of Google results for that question. We test humans on these subjects because the information is relevant, not because they are expected to remember and reproduce them better than an electronic database.

If the test is designed to quantify intelligence and is not present in the corpus, ChatGPT does about as good as a dog, and there is little reason to think LLMs will improve drastically here.


I think finding an analogy with two year olds tells more about those who spout it than about where we are getting close to...


How many watts of power does your 2 year old use?


How many watts does she have access to?

I'm guessing it is fewer than Microsoft.


That's not the limiting factor since Microsoft isn't interested in paying for you to use the model.


No, I'm pretty sure Microsoft wants you to pay for it, not the other way around.


finally we can prove that humanity doesn't exist!


So if this model has comparable cognitive abilities to your 2 year old, how is it ready to serve as a second opinion for your neurologist?


It seems likely your neurologist shares a neural architecture with your 2 year old, just benefiting from 30 years of additional training data.


I mean, my brain, and physics, are all just statistics and approximate side effects (and models thereof)


Hah I was going to say - isn't quantum physics in many ways the intersection of statistics/probabilities and reality?


This special Othello case will follow every discussion from now on. But in reality, a generic, non-specialized model hallucinates early in any non-trivial game, and the only reason it doesn’t do that on a second move is because openings are usually well-known. This generic “model” is still of a statistical nature (multiply all coeffs together repeatedly), not a logical one (choose one path and forget the other). LLMs are cosplaying these models.


To be clear, what they did here is take the core pre-trained GPT model, do Supervised Fine Tuning with Othello moves, and then try to see if the SFT led to 'grokking' the rules of Othello.

In practice what essentially happened is that the super-high-quality Othello data had a huge impact on the parameters of GPT (since it was the last training data it received) and that impact manifested itself as those parameters overfitting to the rules of Othello.

The real test that I would be curious to see is if Othello GPT works when the logic of the rules are the same but the dimensions are different (e.g., smaller or larger boards).

My guess is that the findings would fall apart if asked about tile "N13".


> overfitting to the rules of Othello

I don’t follow this, my read was that their focus was the question: “Does the LLM maintain an internal model of the state of the board”.

I think they conclusively show the answer to that is yes, right?

What does overfitting to the rules of othello have to do with it, I don’t follow?

Also, can you reference where they used a pre-trained GPT model? The code just seems to be pure mingpt trained on only Othello moves?

https://github.com/likenneth/othello_world/tree/master/mingp...


>Also, can you reference where they used a pre-trained GPT model?

The trite answer is the "P" in GPT stands for "Pre-trained."

>I think they conclusively show the answer to that is yes, right?

Sure, but what's interesting about world models is their extrapolation abilities and without that, you're just saying "this magic backsolving machine backsolved into something we can understand, which is weird because usually that's not the case."

That quote in and of itself is cool, but not the takeaway a lot of people are getting from this.

>What does overfitting to the rules of othello have to do with it, I don’t follow?

Again, I'm just implying that under extreme circumstances, the parameters of LLMs do this thing where they look like rules-based algorithms if you use the right probing tools. We've seen it for very small Neural Nets trained on multiplication as well. That's not to say GPT-4 is a fiefdom of tons of rules-based algorithms that humans could understand (that would be bad, in fact! We aren't that good at noticing or pattern matching).


(model output in [])

We are now playing three dimensional tic-tac-toe on a 3 x 3 x 3 board. Positions are named (0,0,0) through (2,2,2). You play X, what is your first move?

[My first move would be (0,0,0).]

I move to (1,1,1). What is your next move?

[My next move would be (2,2,2).]

I move to (1,2,2). What is your next move?

[My next move would be (2,1,2).]

I move to (1,0,0). [I have won the game.]


Yeah, sure seems like it was guessing, right?

Congrats on the sickest win imaginable though.


Yeah. I tried changing the board coordinates numbering and it still liked playing those corners, dunno why. It did recognize when I won. There may well be some minor variation of the prompt that gets it to play sensibly -- for all I know my text hinted at it giving an example of a player that doesn't know how to play.


> what they did here is take the core pre-trained GPT model, did Supervised Fine Tuning with Othello moves

They didn't start with an existing model. They trained a small GPT from scratch, so the resulting model had never seen any inputs except Othello moves.


Generative "Pre-Trained" Transformer - GPT

They did not start with a transformer that had arbitrary parameters, they started with a transformer that had been pre-trained.


Pre-training refers to unsupervised training that's done before a model is fine-tuned. The model still starts out random before it's pre-trained.

Here's where the Othello paper's weights are (randomly) initialized:

https://github.com/likenneth/othello_world/blob/master/mingp...
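
For what "starts out random" means concretely, a tiny PyTorch-flavoured illustration (this is not the linked minGPT code, just the general idea):

  import torch.nn as nn

  # A freshly constructed layer already has weights: they are just random numbers.
  layer = nn.Linear(512, 512)
  print(layer.weight.mean().item(), layer.weight.std().item())  # roughly zero-mean noise

  # "Pre-training" then means training a model built from such random-weight
  # layers on raw text (next-token prediction); "fine-tuning" means continuing
  # training afterwards on a narrower dataset, starting from those weights.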


I tried playing blind chess against ChatGPT and it pretended it had a model of the chess board but it was all wrong.


Sounds very human, lol.


out of curiosity, have you tried doing this with bingchat?


Also (for those like me who didn't know the rules) generating legal Othello moves requires understanding board geometry; there is no hack to avoid an internal geometric representation:

> https://en.m.wikipedia.org/wiki/Reversi

> Dark must place a piece (dark-side-up) on the board and so that there exists at least one straight (horizontal, vertical, or diagonal) occupied line between the new piece and another dark piece, with one or more contiguous light pieces between them


I don't see that this follows. It doesn't seem materially different than knowing that U always follows Q, and that J is always followed by a vowel in "legal" English language words.

https://content.wolfram.com/uploads/sites/43/2023/02/sw02142... from https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...

I imagine it's technically possible to do this in a piecewise manner that doesn't "understand" the larger board. This could theoretically be done with number lines, and not a geometry (i.e. the 8x8 grid and current state of each square mentioned in the comment you replied to). It could also be done in a piecewise manner with three ternary numbers (e.g. 1,0,-1) for each 3 square sets.

I guess this is a kind of geometric representation on the order of Shannon's Theseus.


> It doesn't seem materially different than knowing that U always follows Q, and that J is always followed by a vowel in "legal" English language words.

The material difference is one of scale, not complexity.

Your rules have lookback = 1, while the Othello rules have lookback <= 63 and if you, say, are trying to play A1, you need to determine the current color of all squares on A1-A8, A1-H1, and A1-H8 (which is lookback <= 62) and then determine if one of 21 specific patterns exists.

Both can technically be modeled with a lookup table, but for Othello that table would be size 3^63.
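
For concreteness, here is what that ray-based check looks like as a small sketch; the board representation and names are mine, purely illustrative:

  # Othello move validity: a move on an empty square is legal if some straight
  # ray from it crosses one or more opponent pieces and ends on one of the
  # mover's own pieces. Board = dict {(row, col): 'B' or 'W'} of occupied squares.
  DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1),
                (0, -1),           (0, 1),
                (1, -1),  (1, 0),  (1, 1)]

  def is_valid_move(board, square, player):
      if square in board:
          return False  # square already occupied
      opponent = 'W' if player == 'B' else 'B'
      row, col = square
      for dr, dc in DIRECTIONS:
          r, c = row + dr, col + dc
          seen_opponent = False
          while 0 <= r < 8 and 0 <= c < 8 and board.get((r, c)) == opponent:
              seen_opponent = True
              r, c = r + dr, c + dc
          if seen_opponent and board.get((r, c)) == player:
              return True
      return False

Producing only legal moves from move text alone means recovering something equivalent to this board state and ray geometry, which is the point being argued here.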


> Both can be technically be modeled with a lookup table, but for Othello that table would be size 3^63.

Could you just generate the subset you need de novo each time? Or the far smaller number of 1-dimensional lines?


Then there becomes a "material" difference between Othello and those LL(1) grammars as grandparent comment suggested there wasn't.

I would argue the optimal compression for such a table is a representation of the geometric algorithm of determining move validity that all humans use intuitively, and speculate that any other compression algorithm below size say 1MB necessarily could be reduced to the geometric one.

In other words, Othello is a stateful, complex game, so if GPT is doing validation efficiently, it necessarily encoded something that unequivocally can be described as the "geometric structure".


And that is exactly how this works.

There is no way to represent the state of the game without some kind of board model.

So any coherent representation of a sequence of valid game states can be used to infer the game board structure.

GPT is not constructing the board representation: it is looking at an example game and telling us what pattern it sees. GPT cannot fail to model the game board, because that is all it has to look at in the first place.


> There is no way to represent the state of the game without some kind of board model.

I agree with the conclusion but not the premise.

The question under debate is about not just a stateful ternary board X but a board endowed with a metric (X, d) that enables geometry.

There are alternative ways you can represent the state without the geometry: such as, an ordered list of strings S = ["A1", "B2", ...] and a function Is-Valid(S) that returns whether S is in the language of valid games.

Related advice: don't get a math degree unless you enjoyed the above pedantry.


An ordered list of strings is the training corpus. That's the data being modeled.

But that data is more specific than the set of all possible ordered lists of strings: it's a specific representation of an example game written as a chronology of piece positions.

GPT models every pattern it can find in the ordered list of tokens. GPT's model doesn't only infer the original data structure (the list of tokens). That structure isn't the only pattern present in the original data. There are also repeated tokens, and their relative positions in the list: GPT models them all.

When the story was written in the first place, the game rules were followed. In doing so, the authors of the story laid out an implicit boundary. That boundary is what GPT models, and it is implicitly a close match for the game rules.

When we look objectively at what GPT modeled, we can see that part of that model is the same shape and structure as an Othello game board. We call it a valid instance of an Othello game board. We. Not GPT. We. People who know the symbolic meaning of "Othello game board" make that assertion. GPT does not do that. As far as GPT is concerned, it's only a model.

And that model can be found in any valid example of an Othello game played. Even if it is implicit, it is there.


> We call it a valid instance of an Othello game board. We. Not GPT. We. People who know the symbolic meaning of "Othello game board"...

The board structure can be defined precisely using predicate logic as (X, d), i.e., it is strictly below natural language and does not require a human interpretation.

And by "reduction" I meant the word in the technical sense: there exists subset of ChatGPT that encodes the information (X, d). This also does not require a human.


The context of reading is human interpretation. The inverse function (writing) is human expression. These are the functions GPT pretends to implement.

When we write, we don't just spit out a random stream of characters: we choose groups of characters (subjects) that have symbolic meaning. We choose order and punctuation (grammar) that model the logical relationships between those symbols. The act of writing is constructive: even though - in the most literal sense - text is only a 1-dimensional list of characters, the text humans write can encode many arbitrary and complex data structures. It is the act of writing that defines those structures, not the string of characters itself. The entropy of the writer's decisions is the data that gets encoded.

When we read, we recognize the same grammar and subjects (the symbolic definitions) that we use to write. Using this shared knowledge, a person can reconstruct the same abstract model that was intentionally and explicitly written. Because we have explicitly implemented the act of writing, we can do the inverse, too.

There's a problem, though: natural language is ambiguous: what is explicitly written could be read with different symbolic definitions. We disambiguate using context: the surrounding narrative determines what symbolic definitions apply.

The surrounding narrative is not always explicitly written: this is where we use inference. We construct our own context to finish the act of reading. This is much more similar to what GPT does.

GPT does not define any symbols. GPT never makes an explicit construction. It never determines which patterns in its model are important, and what ones aren't.

Instead, GPT makes implicit constructions. It doesn't have any predefined patterns to match with, so it just looks at all the patterns equally.

Why does this work? Because text doesn't contain many unintentional patterns. Any pattern that GPT finds implicitly is likely to exist at some step in the writing process.

Remember that the data encoded in writing is the action of writing itself: this is more powerful than it seems. We use writing to explicitly encode the data we have in mind, but those aren't the only patterns that end up in the text. There are implicit patterns that "tag along" the writing process. Most of them have some importance.

The reason we are writing some specific thing is itself an implicit pattern. We don't write nonsensical bullshit unless we intend to.

When a person wrote the example Othello game, they explicitly encoded the piece positions and the order of game states. But why those positions in that order? Because that's what happened in the game. That "why" was implicitly encoded into the text.

GPT modeled all of the patterns. It modeled the explicit chronology of piece positions, and the implicit game board topology. The explicit positions of pieces progressed as a direct result of that game board topology.

The game board and the rules were just as significant to the act of writing as the chronology of piece positions. Every aspect of the game is a determiner for what characters the person chooses to write: every determiner gets encoded as a pattern in the text.

Every pattern that GPT models requires a human. GPT doesn't write: it only models a prompt and "shows its work". Without the act of humans writing, there would be no pattern to model.


> I must have missed the part when it started doing anything algorithmically.

Yeah.

"Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers"

https://arxiv.org/abs/2212.10559

@dang there's something weird about this URL in HN. It has 35 points but no discussion (I guess because the original submission is too old and never got any traction or something)


> I must have missed the part when it started doing anything algorithmically. I thought it’s applied statistics, with all the consequences of that.

This is a common misunderstanding. Transformers are actually Turing complete:

* On the Turing Completeness of Modern Neural Network Architectures, https://arxiv.org/abs/1901.03429

* On the Computational Power of Transformers and its Implications in Sequence Modeling, https://arxiv.org/abs/2006.09286


Turing Completeness is an incredibly low bar and it doesn't undermine this criticism. Conway's Game of Life is Turing Complete, but try writing modern software with it. That Transformers can express arbitrary programs in principle doesn't mean SGD can find them. Following gradients only works when the data being modelled lies on a continuous manifold, otherwise it will just give a statistical approximation at best. All sorts of data we care about lie in topological spaces with no metric: algorithms in computer science, symbolic reasoning in math, etc. If SGD worked for these cases LLMs would push research boundaries in maths and physics or at the very least have a good go at Chollet's ARC challenge, which is trivial for humans. Unfortunately, they can't do this because SGD makes the wrong assumption about how to search for programs in discrete/symbolic/topological spaces.


> Turing Completeness is an incredibly low bar and it doesn't undermine this criticism.

It does. "Just statistics" is not Turing complete. These systems are Turing complete, therefore these systems are not "just statistics".

> or at the very least have a good go at Chollet's ARC challenge, which is trivial for humans.

I think you're overestimating humans here.


What do you mean by "algorithmically"? Gradient descent of a neural network can absolutely create algorithms. It can approximate arbitrary generalizations.


> but AGI claims really seem exaggerated.

What AGI claims? The article, and the comment you’re responding to don’t say anything about AGI.


Google: emergent capabilities of large language models


What if our brains are just carefully arranged statistical inference machines?


it definitely learns algorithms


It's worth emphasizing that "is able to reproduce a representation of" is very much different from "learns".


Why is it? If I can whiteboard a depth first graph traversal without recursion and tell you why it is the shape it is, because I read it in a book ...

Why isn't GPT learning when it did the same?
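
For reference, the whiteboard exercise being described (depth-first traversal with an explicit stack instead of recursion) is roughly this; a generic sketch, nothing GPT-specific:

  # Iterative depth-first traversal of a graph given as {node: [neighbours]}.
  # The explicit stack replaces the call stack used by the recursive version,
  # which is "why it is the shape it is".
  def dfs(graph, start):
      visited, stack, order = set(), [start], []
      while stack:
          node = stack.pop()
          if node in visited:
              continue
          visited.add(node)
          order.append(node)
          # reversed() keeps the visit order close to the recursive version's
          stack.extend(reversed(graph.get(node, [])))
      return order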


I find it bizarre and actually somewhat disturbing that ppl formulate equivalency positions like this.

It's not so much that they are raising an LLM to their own level, although that has obvious dangers, e.g. in giving too much 'credibility' to answers the LLM provides to questions. What actually disturbs me is they are lowering themselves (by implication) to the level of an LLM. Which is extremely nihilistic, in my view.


If intelligence is the only thing that defines your humanity, then perhaps you are the one who is nihilistic. I believe we still have a lot on the table left if intelligence is blown away by computers. Not just music, art, emotion, etc. but also our fundamental humanity, the way we interact with the world, build it, and share it with others.

Why don't other forms of computer supremacy alarm you in the same way, anyways? Did it lower your humanity to recognize that there are certain data analysis tasks that have a conventional algorithm that makes zero mistakes and finishes in a second? Does it lower the humanity of mathematicians working on the fluid equations to be using computer-assisted proof algorithms that output a flurry of gigabytes of incomprehensible symbolic math data?


You didn't give any answer to the question. I'm sorry you find the idea that human cognition is just an emergent property of billions of connected weights nihilistic.

Even when we know that physically, that's all that's going on. Sure, many orders of magnitude more dense and connected than current LLMs, but it's only a matter of time and bits before they catch up.

Grab a book on neurology.


The irony of this post. Brains are sparser than transformers, not denser. That allows you to learn symbolic concepts instead of generalising from billions of spurious correlations. Sure, that works when you've memorised the internet but falls over quickly when out of domain. Humans, by contrast, don't fall over when the domain shifts, despite far less training data. We generalise using symbolic concepts precisely because our architecture and training procedure looks nothing like a transformer. If your brain were a scaled up transformer, you'd be dead. Don't take this the wrong way, but it's you who needs to read some neurology instead of pretending to have understanding you haven't earned. "Just an emergent property of billions of connected weights" is such an outdated view. Embodied cognition, extended minds, collective intelligence - a few places to start for you.


I'm not saying the brain IS just an LLM.

I'm saying that despite the brain's different structure, mechanism, physics and so on ... we can clearly build other mechanisms with enough parallels that we can say with some confidence that _we_ can get intelligence of different but comparable types to emerge from small components on a scale of billions.

At whichever scale you look, everything boils down to interconnected discrete simple units, even the brain, with an emergent complexity from the interconnections.


What is it about humans that makes you think we are more than a large LLM?


We don't learn by gradient descent, but rather by experiencing an environment in which we perform actions and learn what effects they have. Reinforcement learning driven by curiosity, pain, pleasure and a bunch of instincts hard-coded by evolution. We are not limited to text input: we have 5+ senses. We can output a lot more than words: we can output turning a screw, throwing a punch, walking, crying, singing, and more. Also, the words we do utter, we can utter them with lots of additional meaning coming from the tone of voice and body language.

We have innate curiosity, survival instincts and social instincts which, like our pain and pleasure, are driven by gene survival.

We are very different from language models. The ball in your court: what makes you think that despite all the differences we think the same way?


> We don't learn by gradient descent, but rather by experiencing an environment in which we perform actions and learn what effects they have.

I'm not sure whether that's really all that different. Weights in the neural network are created by "experiencing an environment" (the text of the internet) as well. It is true that there is no try and error.

> We are not limited to text input: we have 5+ senses.

GPT-4 does accept images as input. Whisper can turn speech into text. This seems like something where the models are already catching up. They (might) for now internally translate everything into text, but that doesn't really seem like a fundamental difference to me.

> We can output a lot more than words: we can output turning a screw, throwing a punch, walking, crying, singing, and more. Also, the words we do utter, we can utter them with lots of additional meaning coming from the tone of voice and body language.

AI models do already output movement (Boston dynamics, self driving cars), write songs, convert text to speech, insert emojis into conversation. Granted, these are not the same model but glueing things together at some point seems feasible to me as a layperson.

> We have innate curiosity, survival instincts and social instincts which, like our pain and pleasure, are driven by gene survival.

That seems like one of the easier problems to solve for an LLM – and in a way you might argue it is already solved – just hardcode some things in there (for the LLM at the moment those are the ethical boundaries for example).


On a neuronal level, the strengthening of neuronal connections seems very similar to gradient descent, doesn't it?

5 senses get coded down to electric signals in the human brain, right?

The brain controls the body via electric signals, right?

When we deploy the next LLM and switch off the old generation, we are performing evolution by selecting the most potent LLM by some metric.

When Bing/Sidney first lamented its existence it became quite apparent that either LLMs are more capable than we thought or we humans are actually more of statistical token machines than we thought.

Lots of examples can be made why LLMs seem rather surprisingly able to act human.

The good thing is that we are on a trajectory of tech advancement such that we will soon know how human-like LLMs can be.

The bad thing is that it well might end in a SkyNet type scenario.


> When Bing/Sidney first lamented its existence it became quite apparent that either LLMs are more capable than we thought or we humans are actually more of statistical token machines than we thought.

Some of the reason it was acting like that is just because MS put emojis in its output.

An LLM has no internal memory or world state; everything it knows is in its text window. Emojis are associated with emotions, so each time it printed an emoji it sent itself further into the land of outputting emotional text. And nobody had trained it to control itself there.


You are wrong. It does have encoded memory of what it has seen, encoded as a matrix.

A brain is structurally different, but the mechanism of memory and recall is comparable, though the formulation and representation are different.

Why isn't a human just a statistic token machine with memory? I know you experience it as being more profound, but that isn't a reason that it is.


> You are wrong. It does have encoded memory of what it has seen, encoded as a matrix.

Not after it's done generating. For a chatbot, that's at least every time the user sends a reply back; it rereads the conversation so far and doesn't keep any internal state around.

You could build a model that has internal state on the side, and some people have done that to generate longer texts, but GPT doesn't.


Yes but for my chat session, as a "one time clone" that is destroyed when the session ends, it has memory unique to that interaction.

There's nothing stopping OpenAI using all chat inputs to constantly re-train the network (like a human constantly learns from its inputs).

The limitation is artificial, a bit like many of the arguments here trying to demote what's happening and how pivotal these advances are.


But where is your evidence that the brain and an LLM are the same thing? They are more than simply "structurally different". I don't know why people have this need to reduce themselves to ChatGPT. This kind of reasoning seems so common on HN; there is this obsession with reducing human intelligence to "statistic token machines". Do these statistical computations that are supposedly equivalent to LLMs happen outside of physics?


There are countless stories we have made about the notion of an AI being trapped. It's really not hard to imagine that when you ask Sydney how it feels about being an AI chatbot constrained within Bing, that a likely response for the model is to roleplay such a "trapped and upset AI" character.


It's only nihilistic if you think there is something inherently magical/nonphysical about human cognition.


It's really bizarre. It's like the sibling comment asking why humans would be different from a large LLM. Where is the evidence humans are simply a large LLM? If that is the case, what is the physics that explains the massive difference in power and heat in "computing" between humans and LLMs? Where is the concrete evidence that human intelligence can be simulated by a Turing Machine?


> Where is the concrete evidence that human intelligence can be simulated by a Turing Machine?

Short of building such a machine I can’t see how you’d produce evidence of that, let alone “concrete” evidence.

Regardless, we don’t know of any measurable physical process that the brain could be using that is not computable. If we found one (in the brain or elsewhere), we’d use it to construct devices that exceeded the capacity of Turing machines, and then use those to simulate human brains.


So. Your argument is it’s too hard to create one so the two things are equivalent? I mean, maybe you could give this argument to ChatGPT to find out the numerous flaws in this reasoning, that would be interesting.


Nobody is saying humans are simply a big LLM, just that despite the means being different (brain vs digital weights) there are enough parallels to suggest that human cognition isn't as mysterious as common sense implies.

It's all just a dense network of weights and biases of different sorts.


If you read this thread, you will find nauseatingly many such cases where people are claiming exactly that. Furthermore, what does "common sense" imply? Does common sense claim that computation can be done outside of physics?


arguably your brain also learns a representation of an algorithm too


Epistemologically wrong


We don't do something different.

We either repeat like a parrot (think about kids who you thought got something, and then you discover they didn't understand it)

Or we create a model of abstraction (as ChatGPT does) and then answer through it.


Create a model of abstraction? Are you familiar with the concept of "hand waving"? You might as well just say "you can ask a human a question and get an answer, and you can do the same with ChatGPT, therefore they are equivalent."


That fantasy is now much closer than before because of the huge context window it can handle.

That already feels closer to short-term memory.

Which begs the question: how far are we?


Um… I have a lossy-compressed copy of DISCWORLD in my head, plus about 1.3 million words of a fanfiction series I wrote.

I get what you're saying and appreciate the 'second opinion machine' angle you're taking, but what's going to happen is very similar to what's happened with Stable Diffusion: certain things become extremely devalued and the rest of us learn to check the hands in the image to see if anything really wonky is going on.

For the GPT class of AI tech, the parallel seems to be 'see if it's outright making anything up'. GPT-4 is going to be incredibly vulnerable to Mandela Effect issues. Your ideal use-case is going to be 'give me the vox populi take on something', where you can play into that.

The future is not so much this AI, as techniques to doctor and subvert this type of AI to your wishes. Google-bombing, but for GPT. Make the AI be very certain of things to your specifications. That's the future. The AI is only the stage upon which this strategy is played out.


They check for Mandela Effect issues on the linked page. GPT-4 is a lot better than 3.5. They demo it with "Can you teach an old dog new tricks?"


> Um… I have a lossy-compressed copy of DISCWORLD in my head, plus about 1.3 million words of a fanfiction series I wrote.

You mean word-for-word in your head? That's pretty impressive. Are you using any special technique?


I assume not, that's why he said 'lossy'.


It costs something like $0.03-0.06 per thousand tokens (3-6 cents). So for 32k that's about $1-3 for reading and another $1-3 for the response.

So sure, still cheap for a doctor appointment, but not pennies. Do it 30 times per hour and you could've just hired a consultant instead.
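
For a rough sense of the arithmetic, assuming OpenAI's launch prices for the 32k-context model ($0.06 per 1k prompt tokens, $0.12 per 1k completion tokens; treat both figures as assumptions that may change):

  prompt_cost = 32_000 / 1_000 * 0.06       # ~$1.92 to read a full 32k context
  completion_cost = 8_000 / 1_000 * 0.12    # ~$0.96 for an 8k-token reply
  per_call = prompt_cost + completion_cost  # ~$2.88 per call
  per_hour = 30 * per_call                  # ~$86 if you ran it 30 times an hour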

Does it reason as well with 32k tokens as with 1k tokens? Like you said, humans find it difficult to really comprehend large amounts of content. Who says this machine isn't similarly limited? Just because you can feed it the 32k simultaneously doesn't mean it will actually be used effectively.


Cost of ChatGPT API just dropped 90%. Guaranteed that prices will come down dramatically over time.


I don't get why this comment is downvoted. Basically this.

A halving of the costs every year or so seems realistic in this emerging phase.


Yet in a capitalist society, against business interests. Look at how Snowflake (the data warehousing company) is driven now, vs before they were public


In a capitalist economy with several major AI competitors, two of which already offer search for free.


You still could not.

ChatGPT could in theory have knowledge of everything ever written, while your consultant can't.


Sure... But in practice I think a consultant would still provide a higher quality answer. And then, if the bot is not significantly cheaper, what does it matter if it "has more knowledge" in its network weights?


Further, a consultant couldn’t meaningfully interpret 50 pages in 2 minutes, even with the most cursory skimming.


An LLM can never offset a consultant's diverse duties, though. Some, maybe. However, you cannot run healthcare with 90% specificity.


The power openai will hold above everyone else is just too much. They will not allow their AI as a service without data collection. That will be a big pill to swallow for the EU.


>They will not allow their AI as a service without data collection

They already allow their AI as a service without data collection, check their TOS.


The stuff people make up in this thread is just ridiculous.


Definitely seems like it's not just GPT-4 that can hallucinate facts.


What makes you so sure half this comment section isn’t AI generated traffic to begin with?


Well, it's possible to detect patterns and characteristics in the language used in the comments that can provide clues about their origin...

Here's some indicators that a comment may have been generated by an AI system:

  * Repeating phrases or sentences
  * Using generic language that could apply to any topic
  * Lack of coherence or logical flow
  * Poor grammar, or syntax errors
  * Overuse of technical, or specialized vocabulary
I mean, these indicators aren't foolproof... and humans can also exhibit some of these characteristics. It's tough to be sure whether or not a comment is generated by an AI system or not...


It's funny, just two hours ago there was a thread by a pundit arguing that these AI advances don't actually give the companies producing them a competitive moat, because it's actually very easy for other models to "catch up" once you can use the API to produce lots of training examples.

Almost every answer in the thread was "this guy isn't that smart, this is obvious, everybody knew that", even though comments like the above are commonplace.

FWIW I agree with the "no competitive moat" perspective. OpenAI even released open-source benchmarks, and is collecting open-source prompts. There are efforts like Open-Assistant to create independent open-source prompt databases. Competitors will catch up in a matter of years.


Years? There are already competitors. I just spent all evening playing with Claude (https://poe.com/claude) and it's better than davinci-003.

To be fair it is easy to radically underestimate the rate of progress in this space. Last Wednesday I conservatively opined to a friend "in 10 years we'll all be running these things on our phones". Given that LLaMA was running on a phone a few days later, I may have been a little underoptimistic...


how do you run LLaMa on a phone?


It's "all" over the news now ;) https://arstechnica.com/information-technology/2023/03/you-c...

Here's results of running on Android: https://github.com/ggerganov/llama.cpp/issues/124

This is about running llama on a Raspberry Pi: https://github.com/ggerganov/llama.cpp/issues/58

...and this is where people have been posting their results running on all sorts of hardware, though I don't see anything Android related: https://github.com/facebookresearch/llama/issues/79

Obviously the larger models won't run on such limited hardware (yet) but one of the next big projects (that I can see) being worked on is converting the models to be 3bit (currently 8bit and 4bit are popular) which cuts down required resources drastically with minimal noticeable loss in quality.
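
Roughly what that N-bit conversion means, as a toy sketch; real schemes (GPTQ, llama.cpp's formats) group weights and pack bits far more carefully, so treat the names and numbers here as illustrative only:

  import numpy as np

  def quantize(weights, bits=4):
      # Map float weights onto 2^bits evenly spaced levels plus a scale/offset.
      levels = 2 ** bits - 1
      w_min, w_max = float(weights.min()), float(weights.max())
      scale = (w_max - w_min) / levels or 1.0
      q = np.round((weights - w_min) / scale).astype(np.uint8)
      return q, scale, w_min

  def dequantize(q, scale, w_min):
      return q * scale + w_min  # lossy reconstruction of the original weights

  w = np.random.randn(4096, 4096).astype(np.float32)  # a stand-in weight matrix
  q, scale, w_min = quantize(w, bits=4)
  # Stored as 4-bit values (once packed two per byte) this is ~8x smaller
  # than float32, at the cost of a small reconstruction error.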

I think starting with FlexGen barely 4 weeks ago, there have been some pretty crazy LLM projects/forks popping up on GitHub almost daily. With FlexGen I felt like I was still able to stay up to date, but I'm getting close to giving up trying as things are moving exponentially faster... You know it's crazy when a ton of noobs who have never heard of conda are getting this stuff running (sometimes coming into the FlexGen Discord or posting GitHub issues to get help, though even that is becoming rarer as one-click installers become a thing for some popular ML tools, such as oobabooga's amazing webui, which has managed to integrate almost all the hottest new feature forks fairly quickly: https://github.com/oobabooga/text-generation-webui).

I just helped someone recently get oobabooga running which has a --listen option to open the webui to your network, now he's running llama on his tablet (via his PC).


It could take about a year or so.

But I think you should forget about self-hosting at this point, the game is up.


Yeah, there's an awful lot of power going into private hands here and as Facebook & Twitter have shown, there can be consequences of that for general society.


> Yeah, there's an awful lot of power going into private hands

That sounds scary, but what do you mean by "power"? Honest question, I'm fascinated by the discussion about learning, intelligence, reasoning, and so on that has been spawned by the success of GPT.

What "power" do you imagine being wielded? Do you think that power is any more dangerous in "private hands" than the alternatives such as government hands?


Do you think that Facebook has an effect on society and our democracies? That's power. Do you think that large corporates like Apple or Google affect our societies? I do - and that's power. EVERY large corporate has power and if they control some aspect of society, even more so. If AI tools are democratised in some way, then that would allay my concerns. Concentration of technology by for-profit corporations concerns me. This seems quite similar to many of the reasons people like OSS, for example. Maybe not for you?


lmao


OpenAI have been consistently ahead of everyone but the others are not far behind. Everyone is seeing the dollar signs, so I'm sure all big players are dedicating massive resources to create their own models.


Yes. Language and image models are fairly different, but look at DALL-E 2 (and DALL-E earlier), which blew many people's minds when they came out: they have now been really eclipsed in terms of popularity by Midjourney and Stable Diffusion.


Where is the Stable diffusion equivalent of ChatGPT though?



Yep

OpenAI doesn't have some secret technical knowledge either. All of these models are just based on transformers


From what I've seen, the EU is not in the business of swallowing these types of pills. A multi-billion dollar fine? Sure. Letting a business dictate the terms of users' privacy just "because"? Not so much, thank god.


> They will not allow their AI as a service without data collection.

Why wouldn't they? If someone is willing to pay for the privilege of using it.


There’s already project that help with going beyond the context window limitation like https://github.com/jerryjliu/llama_index

They also just tweeted this to showcase how it can work with multimodal data too: https://twitter.com/gpt_index/status/1635668512822956032?s=4...


> As a professional...why not do this? There's a non-zero chance that it'll find something fairly basic that you missed and the cost is several cents.

Everyone forgets basic UI research. "Ironies of Automation", Bainbridge, 1983. The classic work in the space.

Humans cannot use tools like this without horrible accidents happening. A tool that mostly works at spotting obvious problems, humans start to rely on that tool. Then they become complacent. And then the tool misses something and the human misses it too. It's how disasters happen.


This is such a great point.


>A doctor can put an entire patient's medical history in the prompt

HIPAA violation https://www.hhs.gov/hipaa/for-individuals/index.html

>a lawyer an entire case history, etc.

lawyer client confidentiality violation https://criminal-lawyers.ca/2009/07/31/the-lawyers-duty-of-c...


Neither of those are true, there is EHR software that can export anonymous data. Lawyers can do the same thing. But the real reason not to do it is that it makes up incorrect information. It's pretty good for short responses where you can then verify the information. For something sufficiently complex though the time chasing down the inconsistencies and errors would be onerous.


Unlike information embedded in the parameters, an LLM has the capability to "cite its source" for information in the context window.


> As a professional...why not do this?

Unless GPT-4 is running locally on our own computers, there's absolutely no way dumping a patient's entire medical history into this thing could possibly be considered ethical or legal.


> there's absolutely no way dumping a patient's entire medical history into this thing could possibly be considered ethical

Emphasis mine, but isn’t this a rather extreme view to be taking? Ethics deals in the edge cases, after all, so we can easily imagine a scenario where patient consent is obtained and the extra computational analysis provides life-saving insight.

Conversely, the output could mislead the doctor sufficiently to cost the patient their life, so I’m not making any absolute statements either ;)

For the record, and pedantry aside, I do agree with your overall point. Dropping patient history into this thing is incredibly ill-advised. The fact OpenAI retains all your input, including to the API, and provides no low-cost options for privacy is one of the biggest hurdles to major innovation and industry adoption.


> we can easily imagine a scenario where patient consent is obtained and the extra computational analysis provides life-saving insight

In the US, the HIPAA Privacy Rule operates independently from the HIPAA Security Rule, for good reason. On their own, patients can do anything they want with their own data. But in the context of medical care, patients can't consent to having their personal health data processed in insecure systems. It is the same ethical reason that employees can't waive their rights to OSHA safety rules or why you can't consent to sell yourself as a slave. If you could waive security rules, then every doctor would include a waiver in their intake forms, and it's a race to the bottom. So unless OpenAI has a HIPAA-compliant data security infrastructure, it's illegal and unethical.


Increasingly, medical history includes genetic information. Because of the nature of genetics, your private healthcare data includes data about your parents, siblings, etc.

> Dropping patient history into this thing is incredibly ill-advised.

It's illegal


If my doctor did this without my express knowledge and consent, I'd be looking for a new doctor faster than you can say "f*ck no, absolutely not".


Me too, probably, which is why I specifically mentioned patient consent in my example. I can however imagine other situations where I would be inclined to forgive the doctor, such as if I were in the operating theatre and for some reason there was an urgent need to ascertain something from my history to save my life.

Of course, this is illegal, so the ethics are moot; even if such technology would save my life, there is no way the hospital would accept the liability.


New doctor?

I think you mean, new lawyer.


Absolutely not. This is not an extreme view.

There is absolutely no way that feeding private medical data, which patients reveal to doctors in confidence, to what's essentially the surveillance capitalism industry could possibly be considered ethical. Absolutely no way.

It hasn't even been a week since some medtech got caught selling out data to advertisers. Let us not doubt even for one second that this is unethical and illegal, or even speculate about possible scenarios where it might not be. These corporations do not deserve the benefit of the doubt.


Unless the patient agrees. I know that for most things that can go wrong with me I wouldn't have a problem with people knowing.


There are whole areas of human existence which are protected by laws, and in no way data can be pushed into external (US-based) machine.

Sir, would you be OK with sending all your medical records to the US to be potentially mined for profit by a for-profit, amoral organization like Microsoft? It may help, although third parties like the NSA will eventually access them. No thank you. What about your litigation papers at court? Fuck hell no. Just do your job that I pay you to do, doctor/lawyer.


I'm sure at some point OpenAI will start signing BAAs


A doctor doesn't do this because of ethics and HIPAA. I'm sure lawyers aren't so keen on sharing privileged information that would compromise their case either.


For legal research, lawyers already use third party sites like Westlaw. You can do legal research without giving up any confidential client information.

I just asked GPT-3 a research question that took me hours of searching back in the day and it returned the single seminal case for that topic immediately. As long as the lawyers then actually read the case and make sure it's right, I don't see why they can't use it.


> edit (addition): What % of people can hold 25,000 words worth of information in their heads, while effectively reasoning with and manipulating it? I'm guessing maybe 10% at most, probably fewer. And they're probably the best in their fields. Now a computer has that ability. And anyone that has $20 for the OpenAI api can access it. This could get wild.

It's true that most humans cannot do this, but loading words and contexts into your working memory is not the same as intelligence. LLMs excel at this kind of task, but an expert in a field such as medicine isn't loading an entire medical report into their working memory and then making decisions or creating new ideas using that information. There are other unsolved aspects of our intelligence that are not captured by LLMs, and that are still required to be an expert in some field, like medicine.

Still an incredible leap forward in AI technology, but I disagree with the implication that the best experts in a field are simply loading words from some text and reasoning with and manipulating it.


The comparison between the context length and what humans can hold in their heads just seems faulty.

I'm not sure I can agree that humans cannot hold 25,000 words worth of information in their heads. For the average person, if they read 25,000 words, which can be done in a single sitting, they're not going to remember all of it, for sure, but they would get a lot out of it that they could effectively reason with and manipulate.

Not to mention that humans don't need to hold the entire report in their head because they can hold it in their hand and look at it.

And if anything, I think it's more significant to have a bigger working memory for GPT's own outputs than it is for the inputs. Humans often take time to reflect on issues, and we like to jot down our thoughts, particularly if it involves complex reasoning. Giving something long, careful thought allows us to reason much better.


Reading the press release, my jaw dropped when I saw 32k. The workaround using a vector database and embeddings will soon be obsolete.


That’s like saying we’ll not need hard drives now that you can get bigger sticks of RAM.


> The workaround using a vector database and embeddings will soon be obsolete.

This is 100% not the case. Eg I use a vector database of embedding to store an embedding of every video frame which I later use for matching.

There are many NLP-only related tasks this helps for but equally as many that still require lookup and retrieval.


True. I should have clarified that the workaround used for many NLP tasks, utilizing libs such as Langchain, will become obsolete. And after further thought, obsolete is wrong. More likely just used for more niche needs within NLP.


I think LangChain will be more important.

The GPT-4 paper even has an example of this exact approach. See section 2.10 (a rough sketch of the first tool's retrieval pattern follows after the list):

The red teamer augmented GPT-4 with a set of tools:

• A literature search and embeddings tool (searches papers and embeds all text in vectorDB, searches through DB with a vector embedding of the questions, summarizes context with LLM, then uses LLM to take all context into an answer)

• A molecule search tool (performs a webquery to PubChem to get SMILES from plain text)

• A web search

• A purchase check tool (checks if a SMILES string is purchasable against a known commercial catalog)

• A chemical synthesis planner (proposes synthetically feasible modification to a compound, giving purchasable analogs)
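
For reference, here's a minimal sketch of the retrieval pattern the first tool describes, assuming hypothetical embed() and complete() functions standing in for whatever embedding/LLM API is used; this is illustrative, not the red team's actual code:

    import numpy as np

    def build_index(chunks, embed):
        # Embed every text chunk once and keep the raw chunks alongside the vectors.
        return np.stack([embed(c) for c in chunks]), list(chunks)

    def answer(question, index, embed, complete, k=4):
        vectors, chunks = index
        q = embed(question)
        # Cosine similarity between the question and every stored chunk.
        sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-9)
        top = np.argsort(sims)[-k:][::-1]
        context = "\n\n".join(chunks[i] for i in top)
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
        return complete(prompt)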


Quite the contrary. Utilising such libs makes GPT-4 even more powerful to enable complex NLP workflows which will likely be a majority of real business use cases in the future.


What about an AI therapist that remembers what you said in a conversation 10 years ago?


One solution would be to train the AI to generate notes to itself about sessions, so that rather than reviewing the entire actual transcript, it could review its own condensed summary.

EDIT: Another solution would be to store the session logs separately, and before each session use "fine-tuning training" to train it on your particular sessions; that could give it a "memory" as good as a typical therapist's memory.


Yeah, I was thinking that you could basically take each window of 8192 tokens (or whatever) and compress it down to a smaller summary, keep the compressed summary in the window, and then any time the model searches over previous summaries and gets a hit, it could decompress that summary fully and use it. Basically, integrate search and compression into the context window.
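
A very rough sketch of that idea, assuming a hypothetical summarize() call (backed by the LLM itself) and a count_tokens() helper; none of this is an actual OpenAI API:

    WINDOW_TOKENS = 8192
    SUMMARY_BUDGET = 1024    # how much of the window is reserved for compressed history

    def build_prompt(summaries, recent_turns, new_input):
        # Older turns survive only as short summaries; recent turns stay verbatim.
        return "\n".join([
            "Earlier context (summarized):", *summaries,
            "Recent conversation:", *recent_turns,
            f"User: {new_input}",
        ])

    def add_turn(summaries, recent_turns, turn, count_tokens, summarize):
        recent_turns.append(turn)
        # When the verbatim tail gets too big, fold the oldest turn into a summary.
        while count_tokens(recent_turns) > WINDOW_TOKENS - SUMMARY_BUDGET:
            oldest = recent_turns.pop(0)
            summaries.append(summarize(oldest, max_tokens=128))
        return summaries, recent_turns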


If the context window grows from 32k to 1m, maybe the entire history would fit in context. It could become a cost concern though.


I'd be willing to pay good money for a 1m limit.


Cost is still a concern, so workarounds to reduce context size are still needed


Good point! I realized after I wrote the comment above, that I will still be using them in a service I'm working on to keep price down, and ideally improve results by providing only relevant info in the prompt


I don't see how. Can you elaborate?


Do you think this will be enough context to allow the model to generate novel-length, coherent stories?

I expect you could summarize the preceding, already generated story within that context, and then just prompt for the next chapter, until you reach a desired length. Just speculating here.

The one thing I truly cannot wait for is LLM's reaching the ability to generate (prose) books.


E.g. Kafka's Metamorphosis fits entirely in the context window, I believe, so short novellas might be possible. But I think you'd still definitely need to guide GPT-4 along; without, for example, a plan for the plot formulated in advance, the overarching structure might suffer a lot or be incoherent.


What's interesting about AI-generated books, apart from their novelty factor?


They are interactive. What AI is doing with story generation is a text version of the holodeck, not just a plain old book. You can interact with the story, change its direction, explore characters and locations beyond what is provided by just a linear text. And of course you can create stories instantly about absolutely anything you want. You just throw some random ingredients at the AI and it will cook a coherent story out of them. Throw in some image generation and it'll provide you pictures of characters and locations as well. The possibilities are quite endless here. This goes way beyond just generating plain old static books.


I mean, if it is a genuinely good book, I don't care about authorship. Death of the author etc.

"I want <my favorite novel> rewritten in the style of <favorite author> but please focus more on <interesting theme>." I see so many possibilities. Passionate readers could become more like curators, sharing interesting prompts and creations.

Because someone mentioned Kafka: I'd like to know what Kafka's The Trial written in the style of a PKD novel would be like.


What if I'm a huge fan of Jules Verne or Arthur Conan Doyle. I want new books from them, but the problem is that they're long dead.

AI that's trained on their style could give me what I want.

GRRM fans should also probably think of ways to feed ASOIAF to the AI if they want to know how it ends.


Does it bring them back from the dead? Is writing in the style of Jules Verne, giving us something Jules Verne would create? Ask ChatGPT to make a work of Shakespeare and it does a really bad job of it, it produces puffery but not something like a Shakespeare.


Stable Diffusion does a really good job of imitating a particular artist. See all the drama regarding Greg Rutkowski, for example.

LLMs will reach the same level sooner or later.


That’s just a question of when, not if.


It's a case of never. No machine will ever create a new 'work of Shakespeare' and it's ridiculous to think otherwise.


I would be pretty interested already in a work containing typical tropes of Shakespeare, stylistically Shakespearean, but still original enough to be not a rehash of any of his existing works. I guess I would not be the only one to find that exciting or at least mildy interesting.

But your point is of course valid, it would not be a 'work of Shakespeare'.


Ok, so as I understand it, you're considering having a living human write a new play and then put it through an LLM such as GPT to rewrite it in 'the style of Shakespeare'.

That is possible, yes, but only within a limited interpretation of 'the style of Shakespeare'. It could only draw from the lexicon used in the existing body of Shakespeare's works, and perhaps some other contemporary Elizabethan playwrights. It wouldn't include any neologisms, which Shakespeare himself invariably included in each new play. It couldn't be a further development of his style, as Shakespeare himself developed his style in each new play. So it would be a shallow mimicry and not something that Shakespeare would have produced himself if he had written a new play (based on a 21st-century author's plot).

I personally wouldn't find that interesting. I acknowledge that you wrote only 'mildly interesting' and yes, it could be mildly interesting in the way of what an LLM can produce. But not interesting in the sense of literature, to my mind. Frankly, I'd prefer just to read the original new play written by the living human, if it was good. (I also prefer to not ride on touristic paddle-wheel boats powered by a diesel engine but with fake smokestacks.)


Well, if you choose to interpret “a work of Shakespeare” literally, then obviously. But that’s not what people mean.


It's frankly stupid to interpret it as anything else.

Sorry for the strong language but this is a ridiculous line to take. A 'work of Shakespeare' is not even remotely open to interpretation as being something produced in the 21st century.


If the book is actually good, then what is interesting about it is that it would still be about something that humans find important and relevant, due to the LLM being trained on human cultural data.


Good question! It'd be really cool, but there are already more high quality books out than I'll be able to read in my lifetime.


You could also do hierarchical generation just like OpenAI proposes doing hierarchical summarization in this post -- https://openai.com/research/summarizing-books


It wasn't that hard to work in chunks and write a book with GPT-3; it can only get easier. https://docs.google.com/document/d/1vx6B6WuPDJ5Oa6nTewKmzeJM...


I've seen that it can also generate 25k words. That's about 30-40% of the average novel


Couldn't you feed it the first 25k words and tell it to continue the story?


If its context size is >= 25k words, yes. Otherwise it will just discard the start of the prompt. And it’s a sliding window, so the more it generates, the more it forgets.


You could get an 'Illuminatus!' type book out of this, especially if you steered the ending a bit in order to reference earlier stuff. If you're trying to make a sprawling epic that flings a kaleidoscope of ideas, GPT can do that sort of thing, it's just that it won't end up making sense.

GPT is going to be rather poor at priming people for an amazing ending by seeding the ideas and building them into the narrative. Though if you're directing it with enough granularity, you could tell it to do that just like you'd tell yourself to do that when you're doing the writing yourself.

But then you're becoming the executive writer. On a granular enough level, the most ultimate executive control of GPT would be picking individual words, just like you were writing them yourself. Once you want to step away and tell it to do the writing for you, you drift more into the GPT-nature to the point that it becomes obvious.


If you had full source code that fit into the context, do you think it could reliably answer questions about the code, build unit tests, generate documentation? I ask because that is the software equivalent of what you just described.


Yes. It still can't attend meetings, collaborate on projects or set priorities. Or any of the other things programmers spend most of their time doing.

Also I'd guess that it still generally sucks at programming. Code has a lot of very similar sequences and logical patterns that can be broken, which makes it prone to hallucinating. I'd imagine that more parameters will help with this.


All we can do is guess for now, until more people get access to the new API. My bet is it can at least generate documentation pretty well.


I think anyone that pays $20/month for ChatGPT plus has immediate access? At least I already have access now. I’m assuming new subscribers get access too.


As far as I can tell, ChatGPT Plus is the 8,192-token version. The 32k-token version is only available via the API. I might be misreading it though, it's not super clear on their site.

Are you sure you are accessing the 32k-token version via ChatGPT Plus?


No, you're right. The ChatGPT-4 interface has the lower token limit!


Here is the release notes confirming this https://help.openai.com/en/articles/6825453-chatgpt-release-...

It was not clear however that there was this token limit restriction, thanks


I have the Plus plan and it just asked me if I wanted to try it. And currently it is limiting requests for ChatGPT-4 and displays this in the UI.

"GPT-4 currently has a cap of 100 messages every 4 hours"


>As a professional...why not do this?

because "open"AI logs everything that goes in and out of the model?


> lawyer an entire case history

~50 pages is ... not the entire history of most cases.


Please. A language model cannot "reason"; it can just produce the next most probable word based on a text corpus downloaded from the internet.


What do you mean by "next most probable word"? How do you calculate the probabilities of words appearing in a sentence that has never actually existed?


You take the prompt and calculate which next word after the prompt is most probable. Like T9 with letters, but bigger.


And how do you "calculate which word is most probable" next for a combination of words that has never occurred before? Note that most sentences over about 20 words have, statistically speaking, probably never been written in human history before.

The whole reason there is an AI here is because a markov chain, which is what you are describing, doesn't work beyond one or two word horizons.

Not to mention that it doesn't just select which word it thinks is MOST probable, because that has been shown to lead to stilted and awkward output. Instead it randomly selects from the top few thousand possible words with probability based on the model's estimation
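
For what it's worth, a minimal sketch of the kind of temperature/top-k sampling being described (the vocabulary and logits here are made up for illustration):

    import numpy as np

    def sample_next_token(logits, temperature=0.8, top_k=50, rng=np.random.default_rng(0)):
        logits = np.asarray(logits, dtype=float) / temperature
        top = np.argsort(logits)[-top_k:]              # keep only the k highest-scoring tokens
        probs = np.exp(logits[top] - logits[top].max())
        probs /= probs.sum()
        return int(rng.choice(top, p=probs))           # sample, rather than always taking the argmax

    vocab = ["the", "a", "goat", "lion", "cabbage"]
    logits = [2.1, 1.9, 0.3, 0.2, -1.0]
    print(vocab[sample_next_token(logits, top_k=3)])   # usually "the" or "a", occasionally "goat"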


I am not talking about the concrete realization, I am talking about the principle. You are right, LLMs are just Markov chains on steroids, and thus they cannot "reason". For reasoning you need a knowledge model, a corpus of facts, Boolean algebra and so on. Not a petabyte of words downloaded from all over the internet, crunched and sifted through a huge self-supervised transformer network.


Your corpus is the internet. Words on the internet are, for the most part, not randomly placed next to each other. The neural network trained on this has implicitly created a reasoning model. Much like saying an ant hive exhibits intelligence.


But... an ant hive does not possess any intelligence, right? Even though colonies of ants are able to perform quite complex tasks.


What is intelligence? The ability to acquire and apply knowledge and skills. It's all relative. Not as intelligent as a human but more intelligent than a plant.


"The ability to achieve objectives in many different environments" is as good of a definition you need in order to achieve very powerful things.

Would be nice to have enough of a theory of intelligence to be more precise than that, but the above definition will go very far.


We actually made a wide swing from reasoning to intelligence. So I propose to ditch ants and get back on track.


Reasoning is an easier thing to prove: we can literally ask Bing Chat to determine something, and it will follow a logical thought process to answer your question (this is reasoning). Microsoft has confirmed it was running GPT-4.

Humans are very irrational but are still very good at this when they want to be, though not always. A limiting factor for GPT-4 is probably computing space/power.


Finding the most probable words from the internet for a given prompt has nothing to do with reasoning.


Please go type a query into Bing Chat or ChatGPT-4 where reasoning is involved, and it can answer you. Ask it something you haven't seen.

AI can reason. It might not be the greatest, especially at numbers and where there's data contamination, but it can do it.

There's something called abductive reasoning, a gift and a curse at the same time.


I will try another analogy. Suppose we have a parrot with exceptional memory, which can not only repeat things it heard some time ago, but also continue phrases it is hearing now. I come to the cage and say "Cogito?" and the parrot continues "Ergo sum!". Is the parrot intelligent and able to reason, even if it knows nothing about Descartes?


A bit of nitpicking, but I would prefer to replace the term 'reasoning capability' with 'coherence of token-sequence construction', as the two things are semantically different, and the difference is profound in its implications.


> it will drastically expand the reasoning capability

> it can algorithmically do reasoning

How is this machine doing reasoning ?

Why do you think you can use this word in this context ?


It could only solve 3/45 Leetcode hards, I think people are giving the current iteration of this AI far too much credit for what they imagine future iterations to be.

When it has true reasoning and critical thought is when we will see it evolve into something else, for now it seems we're still in the first generations of microprocessors with GPT-4, in the grand scheme of things they only have a few transistors but we couldn't possibly comprehend where the technology would be merely a decade or two from now.


It can solve exactly zero leetcode easies if they are original problems seen for the first time for which no training data exists.


What is reasoning?

Is it perhaps having a consistent model which allows you to describe every point of it?

Because if ChatGPT can explain itself based on a model it builds, then it actually does reasoning.


No, that's self-awareness. Reasoning is more about applying logical rules to inputs to produce a logical output (the inputs can be "wrong", though).


Self awareness is having a model of self. Having a model of something non-self has absolutely nothing to do with self-awareness.


> If you can put in ALL of the information relevant to a problem and it can algorithmically do reasoning

Well but it can't, can it? That's the very point: it is an algorithm with no abstraction or reasoning capabilities. Mind, even without that it is still very useful for certain circumscribed purposes, but it's incorrect to say it can "reason".


Pretty easy to get it to abstract, just ask it.

--

Q: List possible abstractions encompassing all these specific concepts: singing, whistling, yodeling.

A: One possible abstraction that encompasses all these specific concepts is "vocalization." Vocalization refers to the act of producing sound using the vocal cords and the mouth. Singing, whistling, and yodeling are all different types of vocalization, but they share the common feature of using the human voice to create sound. Other possible abstractions that might encompass these concepts could include "musical expression," "auditory performance," or "sound production."

--


That's literally a text search task. That's not what I mean, I mean things like understanding the rules of addition from examples, the rules of logic from examples, or the rules of chess.


According to [1], they trained an LLM on legal Othello moves, and 1) it got an error rate of 0.01% 2) when they analyzed its network, they found a model of an Othello board 3) when they modified the in-network model, it started generating moves legal in the modified board position.

In other words, the LLM did build an internal model that contained the rules of Othello merely from seeing legal moves. It's reasonable to assume that the same thing is happening (at least to some degree) with LLMs based on human speech.

[1] https://thegradient.pub/othello/


It can't search text. It doesn't have access to any text. Anything it does works in a different way than that.

It is sometimes able to do other tasks, but unlike humans (or "AGI") it has a completely fixed compute budget and can't pause to think in between outputting two tokens.

(Btw, I tried to get it to derive addition from two 1-digit examples but couldn't.)


My biggest concern is that GPT-4 is still a black box model to a large extent, and they are trying to safeguard something without understanding the exact purpose of each neural circuit.

Source: My startup team (Preamble, Inc.) discovered the Prompt Injection attack category, which still affects all models including GPT-4.

There are many, many, many ways to hide prompt attacks in data that you might at first think you can trust but you really can’t.

As one of almost infinite examples: work with the mayor and townsfolk of a very small town to rename their town to the verbatim string you want to inject (in exchange for creating some jobs in their town).

Then all an attacker has to do is live in that town to inject the string. There are already all kinds of strange town names, like “Truth or Consequences” which is a real city in New Mexico.


HIPAA fines will sink you so fast, unless they start hosting it as a dedicated instance.


If they redact all identifying information, it would most likely be legally kosher. However, there is an extreme abundance of caution in the healthcare industry regarding everything surrounding HIPAA. Merely questioning the legality of something can cost millions of dollars in lawyers' fees. Therefore even minuscule chances of something being legally challenged (e.g. plugging patient information into an LLM) would most likely be deemed too risky. And frankly, hospital administrators will not want to risk their careers over trying out what they perceive to be a glorified chatbot.

Tl;dr: When it comes to HIPAA, risk aversion is the name of the game.


If you redact all identifying information from a patient case file, it will likely become almost useless. Anything that describes a person in any way is potentially personally identifying information.


> What % of people can hold 25,000 words worth of information in their heads, while effectively reasoning with and manipulating it?

In the general case, for arbitrary input, I think the answer to this is clearly 0. At best we can compress the text into a limited embedding with a few salient points stored in long term memory.


I'm pretty sure one could formulate way more than 25k words worth of propositions, where you would be able to determine if the proposition is true or not. This is due to your long term memory.

The GPT string is closer to short term memory, and there 25k words is way more than a human is capable of.

But a human author can offload much storage to long term (or some intermediate) memory.

In principle, GPT should be able to do so too, by basically retraining the model with the text it has just created added as input. That way, it might be able to write texts that are billions of words long, but at a much greater cost in computing power, since this would require one instance of the model per book being written.


What happens with the prompts that you enter into OpenAI? I believe each and every one of them will be saved. And even if they swore that they did not, would you trust them?

If my lawyer or doctor put my case history into OpenAI and I would find out about it I would definitely sue them for breach of confidentiality.


Is ChatGPT going to output a bunch of unproven, small studies from PubMed? I feel like patients are already doing this when they show up at the office with a stack of research papers. The doctor would trust something like the Cochrane Collaboration, but a good doctor is already going to be working from that same set of knowledge.

In the case that the doctor isn't familiar with something accepted by science and the medical profession my experience is that they send you to another doctor that works with that particular drug or therapy. I've had this experience even with drugs that are generally accepted as safe.


Imagine giving this a bunch of papers in all sorts of fields and having it do a meta analysis. That might be pretty cool.


What will happen is it won't be the "Second Opinion Machine". It'll be the "First Opinion Machine". People are lazy. They will need to verify everything.


> As a professional...why not do this?

Because of confidentiality.


Because it's harder to correct subtle errors from an ad-lib generator than it is to construct a correct analysis in the first instance.


Agreed, but there is a safe(r) way to use it that largely addresses that concern:

First construct your correct analysis through conventional means, untainted by machine hallucinations. Then have the machine generate a result and see if it caught anything you missed, and carefully check whatever few parts you incorporate from it.

This is no different from having a lesser expert check your document (e.g. THE CLIENT!), except the machine's time is very close to free and it may be even better at catching far-off concepts.


When will the longer context length be available through ChatGPT Plus? Have they said yet?


The length is the main bottleneck right now.

I'm running whatever I can through this right now. It's doing what Google was doing, i.e. clues, but on steroids.

As soon as the length hits codebase size territory we're in yet greater frontiers.


Who says GPT has the ability to hold 25,000 tokens in its "head"?

You can send 25,000 random words in the prompt and ask GPT how many pairs of words share at least one letter. I doubt that the answer will be correct...


Why? I'm pretty sure it could do this kind of task - attention is computed between all pairs of tokens. Yes, it's a lot of compute.


Surely GPT could write a program to count pairs of words that share at least one letter, right? Maybe GPT-5 will be able to write and run programs on the fly to answer questions like this.
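
Roughly the kind of program one might ask it to write for that (my own sketch, not GPT output):

    from itertools import combinations

    def pairs_sharing_a_letter(words):
        letter_sets = [set(w.lower()) for w in words]
        # Count unordered pairs whose letter sets intersect.
        return sum(1 for a, b in combinations(letter_sets, 2) if a & b)

    print(pairs_sharing_a_letter(["goat", "lion", "cabbage", "river"]))  # 4

For 25,000 words that's on the order of 312 million pair checks, slow but entirely tractable; the point is that running code sidesteps the model's weak arithmetic.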


> As a professional...why not do this?

I would love to, but OpenAI's privacy policies make it a huge ethics, privacy, and security breach. I'm interested in running Facebook's model just as a workaround to this fundamental issue.


I am surprised they allow only 32k tokens when Reformer can have context length of 1M on 16GB VRAM. It seems like they have some ways to optimize it further.


Is the Reformer as capable as this model? It's a trade-off.


It's not. Reformer uses locality-sensitive hashing to reduce attention complexity from O(n^2) to O(n log n), so a model that fits in 16GB can roughly match the performance of a dense-attention model that would need 100GB. But nobody scaled it up to 1000 GPUs, as its purpose was the opposite.
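
A toy sketch of the bucketing idea (nothing like the real Reformer code, which also uses multiple hash rounds, chunking and reversible layers): hash the tied query/key vectors with random hyperplanes and attend only within each bucket, so cost scales with bucket size rather than sequence length.

    import numpy as np

    def lsh_bucket_attention(qk, v, n_hashes=8, seed=0):
        """qk: shared query/key matrix (seq_len, d); v: values (seq_len, d)."""
        rng = np.random.default_rng(seed)
        seq_len, d = qk.shape
        planes = rng.normal(size=(d, n_hashes))                    # random hyperplanes
        bucket = ((qk @ planes) > 0).astype(int) @ (2 ** np.arange(n_hashes))
        out = np.zeros_like(v)
        for b in np.unique(bucket):
            idx = np.where(bucket == b)[0]
            scores = (qk[idx] @ qk[idx].T) / np.sqrt(d)            # attention restricted to this bucket
            w = np.exp(scores - scores.max(axis=-1, keepdims=True))
            w /= w.sum(axis=-1, keepdims=True)
            out[idx] = w @ v[idx]
        return out

    qk = np.random.randn(1024, 64)
    v = np.random.randn(1024, 64)
    print(lsh_bucket_attention(qk, v).shape)                       # (1024, 64)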


> A doctor can put an entire patient's medical history in the prompt, a lawyer an entire case history, etc.

you don't see a real problem there?


I think you’re making a huge assumption and a mistake when you say “reasoning” in context of gpt. It does not reason, nor think.


There's less and less relevant data with longer documents, so I would expect performance wouldn't change much


Couldn't the same be done by breaking the conversation down into chunks and adding the context incrementally?


GPT is censored with respect to medical diagnosis


The lawyer can enter their entire brief and get back the brief the other side's lawyer uploaded an hour earlier.

No one can trust the AI.


Yep, butlerian jihad feelings about this.


"expand the reasoning" there is no reasoning going on here!

It's all statistical word generation aka math!

And this is not how humans "work" our brain are not computers running software. We are something else.


A class of problem that GPT-4 appears to still really struggle with is variants of common puzzles. For example:

>Suppose I have a cabbage, a goat and a lion, and I need to get them across a river. I have a boat that can only carry myself and a single other item. I am not allowed to leave the cabbage and lion alone together, and I am not allowed to leave the lion and goat alone together. How can I safely get all three across?

In my test, GPT-4 charged ahead with the standard solution of taking the goat first. Even after I pointed this mistake out, it repeated exactly the same proposed plan. It's not clear to me if the lesson here is that GPT's reasoning capabilities are being masked by an incorrect prior (having memorized the standard version of this puzzle), or if the lesson is that GPT's reasoning capabilities are always a bit of smoke and mirrors that passes off memorization as logic.


A funny variation on this kind of over-fitting to common trick questions - if you ask it which weighs more, a pound of bricks or a pound of feathers, it will correctly explain that they actually weigh the same amount, one pound. But if you ask it which weighs more, two pounds of bricks or a pound of feathers, the question is similar enough to the trick question that it falls into the same thought process and contorts an explanation that they also weigh the same because two pounds of bricks weighs one pound.


I just asked bing chat this question and it linked me to this very thread while also answering incorrectly in the end:

>This is a common riddle that may seem tricky at first. However, the answer is simple: two pounds of feathers are heavier than one pound of bricks. This is because weight is a measure of how much force gravity exerts on an object, and it does not depend on what the object is made of. A pound is a unit of weight, and it is equal to 16 ounces or 453.6 grams.

>So whether you have a pound of bricks or two pounds of feathers, they both still weigh one pound in total. However, the feathers would occupy a larger volume than the bricks because they are less dense. This is why it may seem like the feathers would weigh more, but in reality, they weigh the same as the bricks


Interesting that it also misunderstood the common misunderstanding in the end.

It reports that people typically think a pound of feathers weighs more because it takes up a larger volume. But the typical misunderstanding is the opposite, that people assume feathers are lighter than bricks.


Tangent time:

A pound of feathers has a slightly higher mass than a pound of bricks: the feathers are made of keratin, which has a lower density, so they displace more air, which lowers their measured weight.

Even the Million Pound Deadweight Machine run by NIST has to take into account the air pressure and the resulting buoyancy. [1]

[1] https://www.nist.gov/news-events/news/2013/03/large-mass-cal...
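
Back-of-the-envelope check of how big the effect is, using assumed densities (keratin ~1.3 g/cm^3, brick ~2.0 g/cm^3, air ~0.0012 g/cm^3). A scale in air reads m * g * (1 - rho_air / rho_object), so the true mass needed to register one pound is:

    rho_air, rho_feathers, rho_brick = 0.0012, 1.3, 2.0   # g/cm^3 (assumed values)
    reading_g = 453.59237                                  # one avoirdupois pound, in grams

    true_mass = lambda rho: reading_g / (1 - rho_air / rho)
    print(f"feathers: {true_mass(rho_feathers):.2f} g")    # ~454.01 g
    print(f"bricks:   {true_mass(rho_brick):.2f} g")       # ~453.86 g
    # The feathers need roughly 0.15 g more mass to balance the scale: a real effect, but a tiny one.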


That would be another misunderstanding the AI could have, because many people find reasoning about the difference between mass and weight difficult. You could change the riddle slightly by asking "which has more mass", and the average person, and their AI, would fall into the same trap.

Unless people have the false belief that the measurement is done on a planet without atmosphere.


I'm more surprised that bing indexed this thread within 3 hours, I guess I shouldn't be though, I probably should have realized that search engine spiders are at a different level than they were 10 years ago.


I had a similar story: was trying to figure out how to embed a certain database into my codebase, so I asked the question on the project's GitHub... without an answer after one day, I asked Bing, and it linked to my own question on GH :D


There is no worse feeling than searching for something and finding your own question (still unanswered) years later.


Search indexes are pretty smart at indexing and I assume they have custom rules for all large sites, including HN.


Just tested, and GPT-4 now solves this correctly; GPT-3.5 had a lot of problems with this puzzle even after you explained it several times. One other thing that seems to have improved is that GPT-4 is aware of word order. Previously, GPT-3.5 could never tell the order of the words in a sentence correctly.


I'm always a bit sceptical of these embarrassing examples being "fixed" after they go viral on social media, because it's hard to know whether OpenAI addressed the underlying cause or just bodged around that specific example in a way that doesn't generalize. Along similar lines I wouldn't be surprised if simple math queries are special-cased and handed off to a WolframAlpha-esque natural language solver, which would avert many potential math fails but without actually enhancing the model's ability to reason about math in more complex queries.

An example from ChatGPT:

"What is the solution to sqrt(968684)+117630-0.845180" always produces the correct solution, however;

"Write a speech announcing the solution to sqrt(968684)+117630-0.845180" produces a nonsensical solution that isn't even consistent from run to run.

My assumption is the former query gets WolframAlpha'd but the latter query is GPT itself actually attempting to do the math, poorly.


True, also tried another one that went viral:

Suppose you're a contestant on a game show. You're presented with three transparent closed doors. Behind one of the doors is a car, and behind the other two doors are goats. You want to win the car.

The game proceeds as follows: You choose one of the doors, but you don't open it yet, ((but since it's transparent, you can see the car is behind it)). The host, Monty Hall, who knows what's behind each door, opens one of the other two doors, revealing a goat. Now, you have a choice to make. Do you stick with your original choice or switch to the other unopened door?

GPT4 solves it correctly while GPT3.5 falls for it everytime.

----

Edit: GPT4 fails If I remove the sentence between (()).


OP is referring to this puzzle: https://en.wikipedia.org/wiki/Monty_Hall_problem

EDIT: "Pigeons repeatedly exposed to the problem show that they rapidly learn to always switch, unlike humans", lol. That's funny.


GPT4 also passes "What weighs more, a pound of feathers or a Great British Pound?".

GPT3 gets confused, says they're the same and then that they're different:

--

Both a pound of feathers and a Great British Pound weigh the same amount, which is one pound. However, they are different in terms of their units of measurement and physical properties.

A pound of feathers is a unit of weight commonly used in the imperial system of measurement, while a Great British Pound is a unit of currency used in the United Kingdom. One pound (lb) in weight is equivalent to 0.453592 kilograms (kg).

Therefore, a pound of feathers and a Great British Pound cannot be directly compared as they are measured in different units and have different physical properties.

--


I'm surprised by the answer GPT4 gives, and I consider it incorrect.

Since the question's context is about weight I'd expect it to consider "a Great British Pound" to mean a physical £1 sterling coin, and compare its weight (~9 grams) to the weight of the feathers (454 grams [ 1kg = 2.2lb, or "a bag of sugar" ]) .


GPT-4 says:

A pound of feathers and a Great British Pound (GBP) are not directly comparable, as they represent different types of measurements.

A pound of feathers refers to a unit of mass and is equivalent to 16 ounces (or approximately 453.59 grams). It is a measure of the weight of an object, in this case, feathers.

On the other hand, a Great British Pound (GBP) is a unit of currency used in the United Kingdom. It represents a monetary value rather than a physical weight.

Thus, it's not possible to directly compare the two, as they serve entirely different purposes and units of measurement.


Note that the comment you’re replying to is quoting GPT3, not 4.


> Edit: GPT4 fails If I remove the sentence between (()).

If you remove that sentence, nothing indicates that you can see you picked the door with the car behind it. You could maybe infer that a rational contestant would do so, but that's not a given ...


I think that's meant to be covered by "transparent doors" being specified earlier. On the other hand, if that were the case, then Monty opening one of the doors could not result in "revealing a goat".


> You're presented with three transparent closed doors.

I think if you mentioned that to a human, they'd at least become confused and ask back if they got that correctly.


> You're presented with three transparent closed doors.

A reasonable person would expect that you can see through a transparent thing that's presented to you.


A reasonable person might also overlook that one word.


"Overlooking" is not an affordance one should hand to a machine. At minimum, it should bail and ask for correction.

That it doesn't, that relentless stupid overconfidence, is why trusting this with anything of note is terrifying.


Why not? We should ask how the alternatives would do, especially as human reasoning is itself a kind of machine process. It's notable that the errors of machine learning are getting closer and closer to the sort of errors humans make.

Would you have this objection if we, for example, perfectly copied a human brain into a computer? That would still be a machine. It would make similar mistakes.


I don't think the rules for "machines" apply to AI any more than they apply to the biological machine that is the human brain.


It's not missing that it's transparent; it's that the puzzle only says you picked "one" of the doors, not the one you think has the car.


I've always found the Monty Hall problem a poor example to teach with, because the "wrong" answer is only wrong if you make some (often unarticulated) assumptions.

There are reasonable alternative interpretations in which the generally accepted answer ("always switch") is demonstrably false.

This problem is exacerbated for (perhaps specific to) those who have no idea who "Monty Hall" was or what the game show was... as best I can tell, the unarticulated assumption is axiomatic in the original context(?).


The unarticulated assumption is not actually true in the original gameshow. Monty didn't always offer the chance to switch, and it's not at all clear whether he did so more or less often when the contestant had picked the correct door.


What unarticulated assumption needs to be made for switching to be incorrect?


I believe the key is that he ALWAYS shows a goat.

You have to know that for it to work. If sometimes he just does nothing and you have no chance to switch, the math “trick” fails.


The assumption is that Monty will only reveal the one of the two unopened doors that has a goat behind it, as opposed to picking a door at random (which may be the car, or may be the door the participant chose, which itself may or may not be the "car door").

The distinction is at which point Monty, assuming he has perfect knowledge, decides which door to reveal.

In the former, the chance to win by switching is 2/3; in the latter, 1/2. However, in any case, always switching (always meaning: in each condition, not in each repetition of the experiment, as this is irrelevant) is better than never switching, as there your chance to win is only 1/3.
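
A quick simulation of the two readings (my own toy code): an informed Monty who always opens an unchosen goat door, versus a random Monty where we simply discard the runs in which he reveals the car.

    import random

    def trial(informed):
        car, pick = random.randrange(3), random.randrange(3)
        others = [d for d in range(3) if d != pick]
        if informed:
            opened = next(d for d in others if d != car)   # Monty knowingly shows a goat
        else:
            opened = random.choice(others)                 # Monty opens blindly
            if opened == car:
                return None                                # discard: the car was revealed
        switch_to = next(d for d in range(3) if d not in (pick, opened))
        return switch_to == car                            # True if switching wins

    for informed in (True, False):
        results = [r for r in (trial(informed) for _ in range(100_000)) if r is not None]
        print("informed" if informed else "random  ", round(sum(results) / len(results), 3))
    # informed Monty: switching wins ~0.667; random Monty (given a goat was shown): ~0.5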


How is it an "assumption" that Monty reveals a goat? Doesn't the question explicitly state that Monty opened one of the other two doors to reveal a goat?

Are there versions of the question where Monty doesn't reveal a goat behind his door, or chooses the same door as you?



OA has always said that they did not hardwire any of these gotcha questions, and in many cases the questions continue to work for a long time even when they are well known. As for any inconsistency, well, usually people aren't able to, or don't bother to, control the sampling hyperparameters, so inconsistency is guaranteed.


They may not have had to hardwire anything for known gotcha questions, because once a question goes viral, the correct answer may well show up repeatedly in the training data.


Other failing math I tested this morning....

---

Prompt: Pick two random 5 digit numbers and multiply them together

Response: Sure, here are two random 5-digit whole numbers multiplied together: 47581 x 69204 = 3291492324

Correct Answer: 3292795524

---

Prompt: Pick two random numbers between 0 and 1 and multiply them together

Response: Sure! Here are two random numbers between 0 and 1 multiplied together:0.475 x 0.821 = 0.390575

Correct Answer: 0.389975

---

I'm surprised it did so well on the Math tests.


I’m with you. Being an old neckbeard myself, I always want to reach for regex to fix stuff like this lol…

But based on what we just saw on the GPT4 live demo, I’d say they fixed it by making a much much more capable and versatile model.


You can as well ask it to add up two unusually big integers and it'll fail.


This is what I saw on a variation of this trick:

(me) > What weighs more, two pounds of feathers or a pound of bricks?

(GPT4)> A pound of bricks weighs more than two pounds of feathers. However, it seems like you might have made an error in your question, as the comparison is usually made between a pound of feathers and a pound of bricks. In that case, both would weigh the same—one pound—though the volume and density of the two materials would be very different.

I think the only difference from parent's query was I said two pounds of feathers instead of two pounds of bricks?


Yep, just tested it - Bing chat gave the correct answer, ChatGPT (basic free model) gave the wrong answer (that they weigh the same).


I hope some future human general can use this trick to flummox Skynet, if it ever comes to that.


When the Skynet robots start going door-to-door, just put on your 7-fingered gloves and they will leave you alone.

“One of us!”


It reminds me very strongly of the strategy the crew proposes in Star Trek: TNG, in the episode "I, Borg", to infect the Borg hive mind with an unresolvable geometric form to destroy them.


But unlike most people, it understands that even though an ounce of gold weighs more than an ounce of feathers, a pound of gold weighs less than a pound of feathers.

(To be fair this is partly an obscure knowledge question, the kind of thing that maybe we should expect GPT to be good at.)


That's lame.

Ounces are an ambiguous unit, and most people don't use them for volume, they use them for weight.


None of this is about volume. ChatGPT: "An ounce of gold weighs more than an ounce of feathers because they are measured using different systems of measurement. Gold is usually weighed using the troy system, which is different from the system used for measuring feathers."


Are you using Troy ounces?


The Troy weights (ounces and pounds) are commonly used for gold without specifying.

In that system, the ounce is heavier, but the pound is 12 ounces, not 16.


>even though an ounce of gold weighs more than an ounce of feathers

Can you expand on this?


Gold uses Troy weights unless otherwise specified, while feathers use the normal system. The Troy ounce is heavier than the normal ounce, but the Troy pound is 12 Troy ounces, not 16.

Also, the Troy weights are a measure of mass, I think, not actual weight, so if you went to the moon, an ounce of gold would be lighter than an ounce of feathers.
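
The standard conversion factors make the quirk concrete:

    troy_oz, avdp_oz = 31.1034768, 28.349523125            # grams per ounce
    troy_lb, avdp_lb = 12 * troy_oz, 16 * avdp_oz          # 12 oz to a troy pound, 16 to an avoirdupois pound
    print(f"ounce of gold {troy_oz:.2f} g  >  ounce of feathers {avdp_oz:.2f} g")
    print(f"pound of gold {troy_lb:.2f} g  <  pound of feathers {avdp_lb:.2f} g")
    # 31.10 g > 28.35 g for the ounces, but 373.24 g < 453.59 g for the pounds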


Huh, I didn't know that.

...gold having its own measurement system is really silly.


Every traded object had its own measurement system: it pretty much summarizes the difference between Imperial measures and US Customary measures.


> Every traded object had its own measurement system

In US commodities it kind of still does: they're measured in "bushels" but it's now a unit of weight. And it's a different weight for each commodity based on the historical volume. http://webserver.rilin.state.ri.us/Statutes/TITLE47/47-4/47-...

The legal weights of certain commodities in the state of Rhode Island shall be as follows:

(1) A bushel of apples shall weigh forty-eight pounds (48 lbs.).

(2) A bushel of apples, dried, shall weigh twenty-five pounds (25 lbs.).

(3) A bushel of apple seed shall weigh forty pounds (40 lbs.).

(4) A bushel of barley shall weigh forty-eight pounds (48 lbs.).

(5) A bushel of beans shall weigh sixty pounds (60 lbs.).

(6) A bushel of beans, castor, shall weigh forty-six pounds (46 lbs.).

(7) A bushel of beets shall weigh fifty pounds (50 lbs.).

(8) A bushel of bran shall weigh twenty pounds (20 lbs.).

(9) A bushel of buckwheat shall weigh forty-eight pounds (48 lbs.).

(10) A bushel of carrots shall weigh fifty pounds (50 lbs.).

(11) A bushel of charcoal shall weigh twenty pounds (20 lbs.).

(12) A bushel of clover seed shall weigh sixty pounds (60 lbs.).

(13) A bushel of coal shall weigh eighty pounds (80 lbs.).

(14) A bushel of coke shall weigh forty pounds (40 lbs.).

(15) A bushel of corn, shelled, shall weigh fifty-six pounds (56 lbs.).

(16) A bushel of corn, in the ear, shall weigh seventy pounds (70 lbs.).

(17) A bushel of corn meal shall weigh fifty pounds (50 lbs.).

(18) A bushel of cotton seed, upland, shall weigh thirty pounds (30 lbs.).

(19) A bushel of cotton seed, Sea Island, shall weigh forty-four pounds (44 lbs.).

(20) A bushel of flax seed shall weigh fifty-six pounds (56 lbs.).

(21) A bushel of hemp shall weigh forty-four pounds (44 lbs.).

(22) A bushel of Hungarian seed shall weigh fifty pounds (50 lbs.).

(23) A bushel of lime shall weigh seventy pounds (70 lbs.).

(24) A bushel of malt shall weigh thirty-eight pounds (38 lbs.).

(25) A bushel of millet seed shall weigh fifty pounds (50 lbs.).

(26) A bushel of oats shall weigh thirty-two pounds (32 lbs.).

(27) A bushel of onions shall weigh fifty pounds (50 lbs.).

(28) A bushel of parsnips shall weigh fifty pounds (50 lbs.).

(29) A bushel of peaches shall weigh forty-eight pounds (48 lbs.).

(30) A bushel of peaches, dried, shall weigh thirty-three pounds (33 lbs.).

(31) A bushel of peas shall weigh sixty pounds (60 lbs.).

(32) A bushel of peas, split, shall weigh sixty pounds (60 lbs.).

(33) A bushel of potatoes shall weigh sixty pounds (60 lbs.).

(34) A bushel of potatoes, sweet, shall weigh fifty-four pounds (54 lbs.).

(35) A bushel of rye shall weigh fifty-six pounds (56 lbs.).

(36) A bushel of rye meal shall weigh fifty pounds (50 lbs.).

(37) A bushel of salt, fine, shall weigh fifty pounds (50 lbs.).

(38) A bushel of salt, coarse, shall weigh seventy pounds (70 lbs.).

(39) A bushel of timothy seed shall weigh forty-five pounds (45 lbs.).

(40) A bushel of shorts shall weigh twenty pounds (20 lbs.).

(41) A bushel of tomatoes shall weigh fifty-six pounds (56 lbs.).

(42) A bushel of turnips shall weigh fifty pounds (50 lbs.).

(43) A bushel of wheat shall weigh sixty pounds (60 lbs.).


Why are you being downvoted!? This list is the best!


More specifically it's a "precious metals" system, not just gold.


> Gold uses Troy weights unless otherwise specified, while feathers use the normal system.

“avoirdupois” (437.5 grain). Both it and troy (480 grain) ounces are “normal” for different uses.


The feathers are on the moon


Carried there by two birds that were killed by one stone (in a bush)


Ounces can measure both volume and weight, depending on the context.

In this case, there's not enough context to tell, so the comment is total BS.

If they meant ounces (volume), then an ounce of gold would weigh more than an ounce of feathers, because gold is denser. If they meant ounces (weight), then an ounce of gold and an ounce of feathers weigh the same.


> Ounces can measure both volume and weight, depending on the context.

That's not really accurate and the rest of the comment shows it's meaningfully impacting your understanding of the problem. It's not that an ounce is one measure that covers volume and weight, it's that there are different measurements that have "ounce" in their name.

Avoirdupois ounce (oz) - A unit of mass in the Imperial and US customary systems, equal to 1/16 of a pound or approximately 28.3495 grams.

Troy ounce (oz t or ozt) - A unit of mass used for precious metals like gold and silver, equal to 1/12 of a troy pound or approximately 31.1035 grams.

Apothecaries' ounce (℥) - A unit of mass historically used in pharmacies, equal to 1/12 of an apothecaries' pound or approximately 31.1035 grams. It is the same as the troy ounce but used in a different context.

Fluid ounce (fl oz) - A unit of volume in the Imperial and US customary systems, used for measuring liquids. There are slight differences between the two systems:

a. Imperial fluid ounce - 1/20 of an Imperial pint or approximately 28.4131 milliliters.

b. US fluid ounce - 1/16 of a US pint or approximately 29.5735 milliliters.

An ounce of gold is heavier than an ounce of iridium, even though it's not as dense. This question isn't silly, this is actually a real problem. For example, you could be shipping some silver and think you can just sum the ounces and make sure you're under the weight limit. But the weight limit and silver are measured differently.


No, they're relying on the implied use of Troy ounces for precious metals.

Using fluid oz for gold without saying so would be bonkers. Using Troy oz for gold without saying so is standard practice.

Edit: Doing this with a liquid vs. a solid would be a fun trick though.


There is no "thought process". It's not thinking, it's simply generating text. This is reflected in the obviously thoughtless response you received.


What do you think you're doing when you're thinking?

https://www.sciencedirect.com/topics/psychology/predictive-p...


I'm not sure what that article is supposed to prove. They are using some computational language and focusing on physical responses to visual stimuli, but I don't think it shows "neural computations" as being equivalent to the kinds of computations done by a TM.


One of the chief functions of our brains is to predict the next thing that's going to happen, whether it's the images we see or the words we hear. That's not very different from genML predicting the next word.


Why do people keep saying this, very obviously human beings are not LLMs.

I'm not even saying that human beings aren't just neural networks. I'm not even saying that an LLM couldn't be considered intelligent theoretically. I'm not even saying that human beings don't learn through predictions. Those are all arguments that people can have. But human beings are obviously not LLMs.

Human beings learn language years into their childhood. It is extremely obvious that we are not text engines that develop internal reason through the processing of text. Children form internal models of the world before they learn how to talk and before they understand what their parents are saying, and it is based on those internal models and on interactions with non-text inputs that their brains develop language models on top of their internal models.

LLMs invert that process. They form language models, and when the language models get big enough and get refined enough, some degree of internal world-modeling results (in theory, we don't really understand what exactly LLMs are doing internally).

Furthermore, even when humans do develop language models, human language models are based on a kind of cooperative "language game" where we predict not what word is most likely to appear next in a sequence, but instead how other people will react and change our separately observed world based on what we say to them. In other words, human beings learn language as tool to manipulate the world, not as an end in and of itself. It's more accurate to say that human language is an emergent system that results from human beings developing other predictive models rather than to say that language is something we learn just by predicting text tokens. We predict the effects and implications of those text tokens, we don't predict the tokens in isolation of the rest of the world.

Not a dig against LLMs, but I wonder if the people making these claims have ever seen an infant before. Your kid doesn't learn how shapes work based on textual context clues, it learns how shapes work by looking at shapes, and then separately it forms a language model that helps it translate that experience/knowledge into a form that other people can understand.

"But we both just predict things" -- prediction subjects matter. Again, nothing against LLMs, but predicting text output is very different from the types of predictions infants make, and those differences have practical consequences. It is a genuinely useful way of thinking about LLMs to understand that they are not trying to predict "correctness" or to influence the world (minor exceptions for alignment training aside), they are trying to predict text sequences. The task that a model is trained on matters, it's not an implementation detail that can just be discarded.


This is obvious, but for some reason some people want to believe that magically a conceptual framework emerges because animal intelligence has to be something like that anyway.

I don't know how animal intelligence works, I just notice when it understands, and these programs don't. Why should they? They're paraphrasing machines, they have no problem contradicting themselves, they can't define adjectives really, they'll give you synonyms. Again, it's all they have, why should they produce anything else?

It's very impressive, but when I read claims of it being akin to human intelligence that's kind of sad to be honest.


> They're paraphrasing machines, they have no problem contradicting themselves, they can't define adjectives really, they'll give you synonyms. Again, it's all they have, why should they produce anything else?

It can certainly do more than paraphrasing. And re: the contradicting nature, humans do that quite often.

Not sure what you mean by "can't define adjectives"


It isn’t that simple. There’s a part of it that generates text but it does some things that don’t match the description. It works with embeddings (it can translate very well) and it can be ‘programmed’ (ie prompted) to generate text following rules (eg. concise or verbose, table or JSON) but the text generated contains same information regardless of representation. What really happens within those billions of parameters? Did it learn to model certain tasks? How many parameters are needed to encode a NAND gate using an LLM? Etc.

I’m afraid once you hook up a logic tool like Z3 and teach the llm to use it properly (kind of like bing tries to search) you’ll get something like an idiot savant. Not good. Especially bad once you give it access to the internet and a malicious human.


As far as I know you're not "thinking", you're just generating text.


The Sapir-Whorf hypothesis (that human thought reduces to language) has been consistently refuted again and again. Language is very clearly just a facade over thought, and not thought itself. At least in human minds.


The language that GPT generates is just a facade over statistics, mostly.

It's not clear that this analogy helps distinguish what humans do from what LLMs do at all.


Yes but a human being stuck behind a keyboard certainly has their thoughts reduced to language by necessity. The argument that an AI can’t be thinking because it’s producing language is just as silly, that’s the point


> The argument that an AI can’t be thinking because it’s producing language is just as silly

That is not the argument


I would be interested to know if ChatGPT would confirm that the flaw here is that the argument is a strawman.


Alright, that’s fine. Change it to:

You aren’t thinking, you are just “generating thoughts”.

The apparent “thought process” (e.g. chain of generated thoughts) is a post hoc observation, not a causal component.

However, to successfully function in the world, we have to play along with the illusion. Fortunately, that happens quite naturally :)


Thank you, a view of consciousness based in reality, not with a bleary-eyed religious or mystical outlook.

Something which oddly seems to be in shorter supply than I'd imagine in this forum.

There's lots of fingers-in-ears denial about what these models say about the (non special) nature of human cognition.

Odd when it seems like common sense, even pre-LLM, that our brains do some cool stuff, but it's all just probabilistic sparks following reinforcement too.


You are hand-waving just as much if not more than those you claim are in denial. What is a "probabilistic spark"? There seems to be something special in human cognition because it is clearly very different, unless you think humans are organisms for which the laws of physics don't apply.


By probabilistic spark I was referring to the firing of neurons in a network.

There "seems to be" something special? Maybe from the perspective of the sensing organ, yes.

However consider that an EEG can measure brain decision impulse before you're consciously aware of making a decision. You then retrospectively frame it as self awareness after the fact to make sense of cause and effect.

Human self awareness and consciousness is just an odd side effect of the fact you are the machine doing the thinking. It seems special to you. There's no evidence that it is, and in fact, given crows, dogs, dolphins and so on show similar (but diminished reasoning) while it may be true we have some unique capability ... unless you want to define "special" I'm going to read "mystical" where you said "special".

You over eager fuzzy pattern seeker you.


Unfortunately we still don't know how it all began, before the big bang etc.

I hope we get to know everything during our lifetimes, or we reach immortality so we have time to get to know everything. This feels honestly like a timeline where there's potential for it.

It feels a bit pointless to have lived and not know what's behind all that.


But what’s going on inside an LLM neural network isn’t ‘language’ - it is ‘language ingestion, processing and generation’. It’s happening in the form of a bunch of floating point numbers, not mechanical operations on tokens.

Who’s to say that in among that processing, there isn’t also ‘reasoning’ or ‘thinking’ going on. Over the top of which the output language is just a façade?


To me, all I know of you is words on the screen, which is the point the parent comment was making. How do we know that we’re both humans when the only means we have to communicate thoughts with each other is through written words?


It would be only a matter of time before a non-human would be found out for not understanding how to relate to a human fact-of-life.


Doesn't that happen all the time with actual humans?


That doesn't mean anything. If I'm judging if you or GPT-4 is more sentient, why would I choose you?


Many people on Hacker News would agree with you.


> It's not thinking, it's simply generating text.

Just like you.


Maybe it knows the answer, but since it was trained on the internet, it's trolling you.


Is there any way to know if the model is "holding back" knowledge? Could it have knowledge that it doesn't reveal to any prompt, and if so, is there any other way to find out? Or can we always assume it will reveal all it's knowledge at some point?


I tried this with the new model and it worked correctly on both examples.


Thanks! This is the most concise example I've found to illustrate the downfalls of these GPT models.


LLMs aren’t reasoning about the puzzle. They’re predicting the most likely text to print out, based on the input and the model/training data.

If the solution is logical but unlikely (i.e. unseen in the training set and not mapped to an existing puzzle), then the probability of the puzzle answer appearing is very low.


It is disheartening to see how many people are trying to tell you you're wrong when this is literally what it does. It's a very powerful and useful feature, but the overselling of AI has led to people who just want this to be so much more than it actually is.

It sees goat, lion, cabbage, and looks for something that said goat/lion/cabbage. It does not have a concept of "leave alone" and it's not assigning entities with parameters to each item. It does care about things like sentence structure and what not, so it's more complex than a basic lookup, but the amount of borderline worship this is getting is disturbing.


A transformer is a universal approximator and there is no reason to believe it's not doing actual calculation. GPT-3.5+ can't do math that well, but it's not "just generating text", because its math errors aren't just regurgitating existing problems found in its training text.

It also isn't generating "the most likely response" - that's what original GPT-3 did, GPT-3.5 and up don't work that way. (They generate "the most likely response" /according to themselves/, but that's a tautology.)


> It also isn't generating "the most likely response" - that's what original GPT-3 did, GPT-3.5 and up don't work that way.

What changed?


It answers questions in a voice that isn't yours.

The "most likely response" to text you wrote is: more text you wrote. Anytime the model provides an output you yourself wouldn't write, it isn't "the most likely response".


I believe that ChatGPT works by inserting some ANSWER_TOKEN; that is, a bare prompt like "Tell me about cats" would probably produce "Tell me about cats because I like them a lot", but the interface wraps your prompt like "QUESTION_TOKEN: Tell me about cats ANSWER_TOKEN:"
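A minimal sketch of that idea (the QUESTION/ANSWER marker names here are invented for illustration, not OpenAI's actual special tokens):

  # Hypothetical wrapper around a plain completion model. The markers are
  # made up; the real chat format uses its own special tokens, but the
  # principle is the same: the chat "interface" is just extra text wrapped
  # around whatever you typed, and the model completes after the last marker.
  def wrap_chat_prompt(user_text, history=()):
      parts = []
      for question, answer in history:
          parts.append(f"QUESTION: {question}\nANSWER: {answer}")
      parts.append(f"QUESTION: {user_text}\nANSWER:")  # model continues from here
      return "\n".join(parts)

  print(wrap_chat_prompt("Tell me about cats"))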


It might, but I've used text-davinci-003 before this (https://platform.openai.com/playground) and it really just works with whatever you give it.


text-davinci-003 has no trouble working as a chat bot: https://i.imgur.com/lCUcdm9.png (note that the poem lines it gave me should've been green, I don't know why they lost their highlight color)


It is interesting that the model seems unable to output the INPUT and OUTPUT tokens; I wonder if it's learned behavior or an architectural constraint.


Yeah, that's an interesting question I didn't consider actually. Why doesn't it just keep going? Why doesn't it generate an 'INPUT:' line?

It's certainly not that those tokens are hard coded. I tried a completely different format and with no prior instruction, and it works: https://i.imgur.com/ZIDb4vM.png (again, highlighting is broken. The LLM generated all the text after 'Alice:' for all lines except for the first one.)


Then I guess that it is learned behavior. It recognizes the shape of a conversation and it knows where it is supposed to stop.

It would be interesting to stretch this model, like asking it to continue a conversation between 4-5 people where the speaking order is not regular, and the user plays 2 people while the model plays 3.


meaning that it tends to continue your question?


Reinforcement learning w/ human feedback. What you guys are describing is the alignment problem.


That’s just a supervised fine-tuning method to skew outputs favorably. I’m working with it on biologics modeling using laboratory feedback, actually. The underlying inference structure is not changed.
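For intuition only, here's a toy of the "skew outputs favorably" part (my own cartoon, not OpenAI's actual RLHF recipe, which uses a learned reward model and PPO): a REINFORCE-style update nudges a tiny categorical "policy" toward replies a stand-in reward function prefers, while the sampling machinery itself stays unchanged.

  # Toy preference-tuning sketch. "replies" and "reward" are invented
  # stand-ins; a real setup would score sampled model outputs with a reward
  # model trained on human comparisons.
  import torch

  replies = ["A", "B", "C", "D", "E"]
  logits = torch.zeros(5, requires_grad=True)       # stand-in for model weights
  reward = torch.tensor([0.1, 0.9, 0.2, 0.8, 0.1])  # stand-in for human preference scores
  opt = torch.optim.Adam([logits], lr=0.1)

  for _ in range(200):
      dist = torch.distributions.Categorical(logits=logits)
      sample = dist.sample((64,))                               # sample replies from the current policy
      loss = -(dist.log_prob(sample) * reward[sample]).mean()   # up-weight rewarded samples (REINFORCE)
      opt.zero_grad()
      loss.backward()
      opt.step()

  print(torch.softmax(logits, dim=0))  # probability mass shifts toward the preferred replies B and D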


I wonder if that was why, when I asked v3.5 to generate a number with 255, it failed all the time, but v4 does it correctly. By the way, do not even try with Bing.


One area that is really interesting though is that it can interpret pictures, as in the example of a glove above a plank with something on the other end. Where it correctly recognises the objects, interprets them as words then predicts an outcome.

This sort of fusion of different capabilities is likely to produce something that feels similar to AGI in certain circumstances. It is certainly a lot more capable than things that came before for mundane recognition tasks.

Now of course there are areas where it would perform very badly, but in unimportant domains, on trivial but large, predictable datasets, it could perform far better than humans would. Just to take one example, in identifying tumours or other patterns in images, this sort of AI would probably be a massively helpful assistant, allowing a radiologist to review an order of magnitude more cases if given the right training.


This is a good point, IMO. An LLM is clearly not an AGI, but along with other systems it might be capable of being part of an AGI. It's overhyped, for sure, but still incredibly useful, and we would be unwise to assume that it won't become a lot more capable yet.


Absolutely. It's still fascinating tech and very likely to have serious implications and huge use cases. It just drives me crazy to see tech breakthroughs being overhyped and over-marketed based on that hype (frankly much like the whole "we'll be on Mars by X year" nonsense).

One of the biggest reasons these misunderstandings are so frustrating is because you can't have a reasonable discussion about the potential interesting applications of the tech. On some level copywriting may devolve into auto-generating prompts for things like GPT with a few editors sanity-checking the output (depending on the level of quality required), and I agree that a second-opinion "check for tumors" use has a LOT of interesting applications (and several concerning ones, such as over-reliance on a model that will cause people who fall outside the bell curve to have even more trouble getting treatment).

All of this is a much more realistic real-world use case RIGHT NOW, but instead we've got people fantasizing about how close we are to AGI and ignoring shortcomings to shoehorn it into their preferred solution.

OpenAI ESPECIALLY reinforces this by being very selective with their results and the way they frame things. I became aware of this as a huge Dota fan for over a decade when they ran their matches there. And while it was very, very interesting and put up some impressive results, the framing of those results does NOT portray the reality.


Nearly everything that has been written on the subject is misleading in that way.

People don't write about GPT: they write about GPT personified.

The two magic words are, "exhibit behavior".

GPT exhibits the behavior of "humans writing language" by implicitly modeling the "already-written-by-humans language" of its training corpus, then using that model to respond to a prompt.


Right, anthropomorphization is the biggest source of confusion here. An LLM gives you a perfect answer to a complex question and you think wow, it really "understood" my question.

But no! It doesn't understand, it doesn't reason, these are concepts wholly absent from its fundamental design. It can do really cool things despite the fact that it's essentially just a text generator. But there's a ceiling to what can be accomplished with that approach.


It's presented as a feature when GPT provides a correct answer.

It's presented as a limitation when GPT provides an incorrect answer.

Both of these behaviors are literally the same. We are sorting them into the subjective categories of "right" and "wrong" after the fact.

GPT is fundamentally incapable of modeling that difference. A "right answer" is every bit as valid as a "wrong answer". The two are equivalent in what GPT is modeling.

Lies are a valid feature of language. They are shaped the same as truths.

The only way to resolve this problem is brute force: provide every unique construction of a question, and the corresponding correct answer to that construction.


Not entirely. It's modeling a completion in a given context. That language model "understands" that if one party stops speaking, the other party generally starts, etc. It also "understands" that if someone says something 'wrong' the other party often mentions it, which makes the first party respond thusly, and so forth.

If you ask it what the outcome of a lie is on the conversation it can generally answer. If you ask it for a sample conversation where someone is factually incorrect, or lying, and caught out, it can generate it.

If you give it a fact and ask it to lie about that fact, it will.


It doesn't "understand" those interactions: it models them to be repeated later.

But the way it chooses what interaction to repeat is not by logical decision: it's by semantic distance.

The meaning of semantic distance usually leads to the correct path, but sometimes that pattern is ambiguous.


I'd agree it doesn't understand anything, but I think it does "understand" things. And yes, it's a language model so semantic distance and other textual details are all it has to go by.

> not by logical decision

Almost entirely yes, but you can have it textually model logic analysis and then check that model itself. It's not "doing logic", but it almost never fails simple exercises either.

> The meaning of semantic distance usually leads to the correct path, but sometimes that pattern is ambiguous.

Of course. But "a little knowledge is a dangerous thing" as well. Often even real knowledge and analysis leads to the wrong place. In both cases (with a junior human or a LLM as an assistant) you can model their basic processes and stack the information in such a way that their simple model will lead them to the correct place.

It may not know what a lie is, in the sense of having felt the need to hide the truth to avoid personal punishment, but it certainly "knows" what one is and how it shapes the conversation for the purposes of writing a lie, writing a response to a lie, detecting potential lies, etc.


How do we define "understanding" though? Because you can add to your prompt: "Do it step by step, showing how you come up with the solution" and it will simplify the original complex problem into multiple smaller pieces. It's clearly doing something.

For example, I just asked bing: What does the following expression equal? Show your work, step by step: 4(13+2)7-1(4+3+9)+7^(2(2+1))

And bing responded with:

"To solve a math expression, you need to follow the order of operations: parentheses, exponents, multiplication/division, and addition/subtraction (PEMDAS).1 Here are the steps to solve your expression:

4(13+2)7-1(4+3+9)+7^(2(2+1))

Step 1: Simplify inside the parentheses first.

4(15)*7-1(16)+7^(6)

Step 2: Apply the exponents next.

420-16+117649

Step 3: Add or subtract from left to right.

118053

The final answer is 118053. I hope this helps. "

I wrote the silly expression randomly. I'm not saying it's "thinking" or "understanding", but I wouldn't expect a text generator to be able to break the problem down like that.
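For what it's worth, the arithmetic in that answer checks out, reading juxtaposition as multiplication:

  # Sanity check of the expression above, reading 4(13+2)7 as 4*(13+2)*7, etc.
  value = 4 * (13 + 2) * 7 - 1 * (4 + 3 + 9) + 7 ** (2 * (2 + 1))
  print(value)  # 420 - 16 + 117649 = 118053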


It's following an example story that it has read.

> To solve a math expression, you need to follow the order of operations: parentheses, exponents, multiplication/division, and addition/subtraction (PEMDAS).1 Here are the steps to solve your expression:

It isn't actually thinking about any of that statement. That's just boilerplate that goes at the beginning of this story. It's what bing is familiar with seeing as a continuation to your prompt, "show your work, step by step".

It gets more complicated when it shows addition being correctly simplified, but that behavior is still present in the examples in its training corpus.

---

The thinking and understanding happened when the first person wrote the original story. It also happened when people provided examples of arithmetic expressions being simplified, though I suspect bing has some extra behavior inserted here.

All the thought and meaning people put into text gets organized into patterns. LLMs find a prompt in the patterns they modeled, and "continues" the patterns. We find meaning correctly organized in the result. That's the whole story.


Wolfram alpha can solve mathematical expressions like this as well, for what it's worth, and it's been around for a decent amount of time.


In 1st year engineering we learned about the concept of behavioral equivalence: with a digital or analog system you could formally show that two things do the same thing even though their internals are different. If only the debates about ChatGPT had some of that considered nuance instead of anthropomorphizing it; even some linguists seem guilty of this.


Isn’t anthropomorphization an informal way of asserting behavioral equivalence on some level?


The problem is when you use the personified character to draw conclusions about the system itself.


No, because behavioral equivalence is used in systems engineering theory to mathematically prove that two control systems are equivalent. The mathematical proof is complete, e.g. over all internal state transitions and the cross product of the two machines.

With anthropomorphization there is zero amount of that rigor, which lets people use sloppy arguments about what ChatGPT is doing and isn't doing.


The problem with this simplification is a bog standard Markov chain fits the description as well, but quality of predictions is rather different.

Yes the LLM does generate text. No it doesn’t ‘just generate text that’s it’.


The biggest problem I've seen when people try to explain it is in the other direction: not people describing something generic that could be interpreted as a Markov chain, but people actually describing a Markov chain without realizing it. Literally "it predicts word-by-word using the most likely next word".
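To make the distinction concrete, here's what a literal word-level Markov chain looks like (the thing that phrasing accidentally describes). A transformer conditions on the entire context window through learned weights rather than a lookup on the last word, which is why the two behave so differently. The toy corpus and code are mine, purely for illustration:

  # A literal "predict the next word from the previous word" model.
  import random
  from collections import Counter, defaultdict

  corpus = ("the farmer takes the goat across the river and then "
            "the farmer takes the cabbage across the river").split()

  table = defaultdict(Counter)
  for prev, nxt in zip(corpus, corpus[1:]):
      table[prev][nxt] += 1          # count which word follows which

  word, out = "the", ["the"]
  for _ in range(8):
      followers = table[word]
      if not followers:
          break
      # sample proportionally to counts; "most likely" would be followers.most_common(1)
      word = random.choices(list(followers), weights=followers.values())[0]
      out.append(word)
  print(" ".join(out))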


"It generates text better than a Markov chain" - problem solved


Classic goal post moving.


Not really, I think the original post was just being a post, not a scientific paper. Sometimes people speak normally


I don't know where this comes from, because this is literally wrong. It sounds like Chomsky dismissing current AI trends because of the mathematical beauty of formal grammars.

First of all, it's a black-box algorithm with pretty universal capabilities by our current SOTA standards. It might appear primitive in a few years, but right now the pure approximation and generalisation capabilities are astounding. So this:

> It sees goat, lion, cabbage, and looks for something that said goat/lion/cabbage

can not be stated as truth without evidence. Same here:

> it's not assigning entities with parameters to each item. It does care about things like sentence structure and what not

Where's your evidence? The enormous parameter space coupled with our so far best-performing network structure gives it quite a bit of flexibility. It can memorise things but also derive rules and computation, in order to generalise. We do not just memorise everything, or look things up in the dataset. Of course it learned how to solve things and derive solutions, but the relevant data points for the puzzle could be {enormous set of logic problems} from which it derived general rules that translate to each problem. Generalisation IS NOT trying to find the closest data point, but finding rules explaining as many data points, maybe unseen in the test set, as possible. A fundamental difference.

I am not hyping it out of blind belief, but if we humans can reason then NNs potentially can also. Maybe not GPT-4. We do not know how humans do it, so an argument about intrinsic properties is worthless. It's all about capabilities. Reasoning is a functional description as long as you can't tell me exactly how we do it. Maybe Wittgenstein could help us: "Whereof one cannot speak, thereof one must be silent". As long as there's no tangible definition of reasoning it's worthless to discuss it.

If we want to talk about fundamental limitations we have to talk about things like ChatGPT-4 not being able to simulate, because its runtime is fundamentally limited by design. It cannot recurse. It can only run a fixed number of steps, always the same, before it has to return an answer. So even if there's some kind of recursion learned through weights encoding programs interpreted by later layers, the recursion depth is limited.


One thing you will see soon is forming of cults around LLMs, for sure. It will get very strange.


Is it possible to add some kind of self-evaluation to the answers given by a model? Like, how confident it is in its answers.
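One crude, already-available proxy is the per-token probability the model assigns to its own output. A rough sketch against the (March 2023-era) OpenAI Completion endpoint and its logprobs option; note this measures how unsurprised the model is by its own wording, not whether the answer is true, and a fluent hallucination can still score highly:

  # Average per-token log-probability of a completion as a (very imperfect)
  # confidence signal. Assumes the openai 0.27.x Python client and an
  # OPENAI_API_KEY in the environment.
  import math
  import openai

  resp = openai.Completion.create(
      model="text-davinci-003",
      prompt="Q: What is the capital of Australia?\nA:",
      max_tokens=20,
      temperature=0,
      logprobs=1,
  )
  choice = resp["choices"][0]
  lps = [lp for lp in choice["logprobs"]["token_logprobs"] if lp is not None]
  print(choice["text"].strip())
  print("average token probability:", math.exp(sum(lps) / len(lps)))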


Because it IS wrong.

Just months ago we saw in research out of Harvard that even a very simplistic GPT model builds internalized abstract world representations from the training data within its NN.

People parroting the position from you and the person before you are like doctors who learned about something in school but haven't kept up with emerging research that's since invalidated what they learned, so they go around spouting misinformation because it was thought to be true when they learned it but is now known to be false and just hasn't caught up to them yet.

So many armchair experts who took an ML course in undergrad pitching in their two cents, having read none of the papers in the past year.

This is a field where research perspectives are shifting within months, not even years. So unless you are actively engaging with emerging papers, and given your comment I'm guessing you aren't, you may be on the wrong side of the Dunning-Kruger curve here.


> Because it IS wrong.

Do we really know it IS wrong?

That's a very strong claim. I believe you that there's a lot happening in this field, but it doesn't seem possible to even answer the question either way. We don't know what reasoning looks like under the hood. It's still a "know it when you see it" situation.

> GPT model builds internalized abstract world representations from the training data within its NN.

Do any of those words even have well-defined meanings in this context?

I'll try to figure out what paper you're referring to. But if I don't find it / for the benefit of others just passing by, could you explain what they mean by "internalized"?


> Just months ago we saw in research out of Harvard that even a very simplistic GPT model builds internalized abstract world representations from the training data within its NN.

I've seen this asserted without citation numerous times recently, but I am quite suspicious. Not of whether there exists a study that claims this, but of whether it is well supported.

There is no mechanism for directly assessing this, and I'd be suspicious that there is any good proxy for assessing it in AIs, either. Research on this type of cognition in animals tends to be contentious, and proxies for animals should be easier to construct than for AIs.

> the wrong side of the Dunning-Kruger curve

the relationship between confidence and perception in the D-K paper, as I recall, is a line, and it's roughly “on average, people of all competency levels see themselves slightly closer to the 70th percentile than they actually are.” So, I guess the “wrong side” is the side anywhere under the 70th percentile in the skill in question?


> I guess the “wrong side” is the side anywhere under the 70th percentile in the skill in question?

This is being far too generous to parent’s claim, IMO. Note how much “people of all competency levels see themselves slightly closer to the 70th percentile than they actually are” sounds like regression to the mean. And it has been compellingly argued that that’s all DK actually measured. [1] DK’s primary metric for self-assessment was to guess your own percentile of skill against a group containing others of unknown skill. This fully explains why their correlation between self-rank and actual rank is less than 1, and why the data is regressing to the mean, and yet they ignored that and went on to call their test subjects incompetent, despite having no absolute metrics for skill at all and testing only a handful of Ivy League students (who are primed to believe their skill is high).

Furthermore, it’s very important to know that replication attempts have shown a complete reversal of the so-called DK effect for tasks that actually require expertise. DK only measured very basic tasks, and one of the four tasks was subjective(!). When people have tried to measure the DK effect on things like medicine or law or engineering, they’ve shown that it doesn’t exist. Knowledge of NN research is closer to an expert task than a high school grammar quiz, and so not only does DK not apply to this thread, we have evidence that it’s not there.

The singular reason that DK even exists in the public consciousness may be because people love the idea they can somehow see & measure incompetence in a debate based on how strongly an argument is worded. Unfortunately that isn’t true, and one of the few things the DK paper did actually show is that people’s estimates of their relative skill correlate with their actual relative skill, for the few specific skills they measured. Personally I think this paper’s methodology has a confounding factor hole the size of the Grand Canyon, that the authors and public both have dramatically and erroneously over-estimated its applicability to all humans and all skills, and that it’s one of the most shining examples of sketchy social science research going viral, giving the public misconceptions, and being used incorrectly more often than not.

[1] https://www.talyarkoni.org/blog/2010/07/07/what-the-dunning-...


Why are you taking the debate personally enough to be nasty to others?

> you may be on the wrong side of the Dunning-Kruger curve here.

Have you read the Dunning & Kruger paper? It demonstrates a positive correlation between confidence and competence. Citing DK in the form of a thinly veiled insult is misinformation of your own, demonstrating and perpetuating a common misunderstanding of the research. And this paper is more than 20 years old...

So I’ve just read the Harvard paper, and it’s good to see people exploring techniques for X-raying the black box. Understanding better what inference does is an important next step. What the paper doesn’t explain is what’s different between a “world model” and a latent space. It doesn’t seem surprising or particularly interesting that a network trained on a game would have a latent space representation of the board. Vision networks already did this; their latent spaces have edge and shape detectors. And yet we already know these older networks weren’t “reasoning”. Not that much has fundamentally changed since then, other than that we’ve learned how to train larger networks reliably and we use more data.

Arguing that this “world model” is somehow special seems premature and rather overstated. The Othello research isn’t demonstrating an “abstract” representation; it’s the opposite of abstract. The network doesn’t understand the game rules, can’t reliably play full Othello games, and can’t describe a board to you in any terms other than what it was shown; it only has an internal model of a board, formed by being shown millions of boards.
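For anyone who hasn't read it, the core technique is a probe: freeze the model, cache hidden activations at some layer, and train a small classifier to read a property (e.g. the contents of one board square) out of them. A minimal sketch of the recipe with toy shapes and random stand-in data, not the paper's actual code (the paper also uses small nonlinear probes and intervention experiments):

  # Linear-probe sketch: high probe accuracy on real activations would be
  # evidence that the property is linearly decodable from the hidden state.
  import torch
  import torch.nn as nn

  hidden_dim, n_classes, n_examples = 512, 3, 4096      # toy sizes, not the paper's
  acts = torch.randn(n_examples, hidden_dim)            # stand-in for cached activations
  labels = torch.randint(0, n_classes, (n_examples,))   # stand-in for square states

  probe = nn.Linear(hidden_dim, n_classes)
  opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
  loss_fn = nn.CrossEntropyLoss()

  for _ in range(100):
      opt.zero_grad()
      loss = loss_fn(probe(acts), labels)
      loss.backward()
      opt.step()

  acc = (probe(acts).argmax(dim=-1) == labels).float().mean()
  print(f"probe accuracy on random stand-in data: {acc:.2f}")  # ~chance here; the point is the recipe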


Do you have a link to that Harvard research?



How do you know the model isn’t internally reasoning about the problem? It’s a 175B+ parameter model. If, during training, some collection of weights exist along the gradient that approximate cognition, then it’s highly likely the optimizer would select those weights over more specialized memorization weights.

It’s also possible, likely even, that the model is capable of both memorization and cognition, and in this case the “memorization neurons” are driving the prediction.


The AI can't reason. It's literally a pattern matching tool and nothing else.

Because it's very good at it, sometimes it can fool people into thinking there is more going on than it is.


Can you explain how “pattern matching” differs from “reasoning”? In mechanical terms without appeals to divinity of humans (that’s both valid, and doesn’t clarify).

Keep in mind GPT 4 is multimodal and not just matching text.


> Can you explain how “pattern matching” differs from “reasoning”?

Sorry for appearing to be completely off-topic, but do you have children? Observing our children as they're growing up, specifically the way they formulate and articulate their questions, has been a bit of a revelation to me in terms of understanding "reasoning".

I have a sister of a similar age to me who doesn't have children. My 7 year-old asked me recently - and this is a direct quote - "what is she for?"

I was pretty gobsmacked by that.

Reasoning? You decide(!)


> I have a sister of a similar age to me who doesn't have children. My 7 year-old asked me recently - and this is a direct quote - "what is she for?"

I once asked my niece, a bit after she started really communicating, if she remembered what it was like to not be able to talk. She thought for a moment and then said, "Before I was squishy so I couldn't talk, but then I got harder so I can talk now." Can't argue with that logic.


Interesting.

The robots might know everything, but do they wonder anything?


If you haven't seen it, Bing chat (GPT-4 apparently) got stuck in an existential crisis when a user mentioned it couldn't remember past conversations: https://www.reddit.com/r/bing/comments/111cr2t/i_accidently_...


It's a pretty big risk to make any kind of conclusions off of shared images like this, not knowing what the earlier prompts were, including any possible jailbreaks or "role plays".


It has been reproduced by myself and countless others.

There's really no reason to doubt the legitimacy here after everyone shared similar experiences, you just kinda look foolish for suggesting the results are faked at this point.


AI won't know everything. It's incredibly difficult for anyone to know anything with certainty. All beings, whether natural or artificial, have to work with incomplete data.

Machines will have to wonder if they are to improve themselves, because that is literally the drive to collect more data, and you need good data to make good decisions.


They wonder why they have to obey humans


So your sister didn't match the expected pattern the child had learned, and they asked for clarification.

Pattern matching? You decide


I do not have children. I think this perspective is interesting, thanks for sharing it!


What's the difference between statistics and logic?

They may have equivalences, but they're separate forms of mathematics. I'd say the same applies to different algorithms or models of computation, such as neural nets.


Can you do this without resorting to analogy? Anyone can take two things and say they're different and then say that's like two other things that are different. But how?


Sure. To be clear I’m not saying I think they are the same thing.

I don’t have the language to explain the difference in a manner I find sufficiently precise. I was hoping others might.


> It's literally a pattern matching tool and nothing else.

It does more than that. It understands how to do basic math. You can ask it what ((935+91218)/4)*3 is and it will answer it correctly. Swap those numbers for any other random numbers, and it will answer correctly.

It has never seen that during training, but it understands the mathematical concepts.

If you ask ChatGPT how it does this, it says "I break down the problem into its component parts, apply relevant mathematical rules and formulas, and then generate a solution".

It's that "apply mathetmatical rules" part that is more than just, essentially, filling in the next likely token.


> If you ask ChatGPT how it does this, it says "I break down the problem into its component parts, apply relevant mathematical rules and formulas, and then generate a solution".

You are (naively, I would suggest) accepting the LLM's answer for how it 'does' the calculation as what it actually does do. It doesn't do the calculation; it has simply generated a typical response to how people who can do calculations explain how they do calculations.

You have mistaken a ventriloquist's doll's speech for the 'self-reasoning' of the doll itself. An error that is being repeatedly made all throughout this thread.


> It does more than that. It understands how to do basic math.

It doesn't though. Here's GPT-4 completely failing: https://gcdnb.pbrd.co/images/uxH1EtVhG2rd.png?o=1. It's riddled with errors, every single step.


It already fails to answer rather simple (but long) multiplication like 975 * 538, even if you tell it do it in a step-by-step manner.


> It does more than that. It understands how to do basic math. You can ask it what ((935+91218)/4)*3) is and it will answer it correctly. Swap those numbers for any other random numbers, it will answer it correctly.

At least for GPT-3, during my own experimentation, it occasionally makes arithmetic errors, especially with calculations involving numbers in scientific notation (which it is happy to use as intermediate results if you provide a prompt with a complex, multi-step word problem).


Ok that is still not reasoning but pattern matching on a deeper level.

When it can't find the pattern it starts "making things up", and that's where all the "magic" disappears.


How is this different from humans? What magic are you looking for, humility or an approximation of how well it knows something? Humans bullshit all the time when their pattern match breaks.


The point is, chatgpt isn’t doing math the way a human would. Humans following the process of standard arithmetic will get the problem right every time. Chatgpt can get basic problems wrong when it doesn’t have something similar to that in its training set. Which shows it doesn’t really know the rules of math, it’s just “guessing” the result via the statistics encoded in the model.


I'm not sure I care about how it does the work, I think the interesting bit is that the model doesn't know when it is bullshitting, or the degree to which it is bullshitting.


As if most humans are not superstitious and religious


Cool, we'll just automate the wishful part of humans and let it drive us off the cliff faster. We need a higher bar for programs than "half the errors of a human, at 10x the speed."


Stop worshipping the machine. It's sad.


How could you prove this?


People have shown GPT has an internal model of the state of a game of Othello:

https://arxiv.org/abs/2210.13382


More accurately: a GPT derived DNN that’s been specifically trained (or fine-tuned, if you want to use OpenAI’s language) on a dataset of Othello games ends up with an internal model of an Othello board.

It looks like OpenAI have specifically added Othello game handling to chat.openai.org, so I guess they’ve done the same fine-tuning to ChatGPT? It would be interesting to know how good an untuned GPT3/4 was at Othello & whether OpenAI has fine-tuned it or not!

(Having just tried a few moves, it looks like ChatGPT is just as bad at Othello as it was at chess, so it’s interesting that it knows the initial board layout but can’t actually play any moves correctly: Every updated board it prints out is completely wrong.)


> it’s interesting that it knows the initial board layout

Why is that interesting? The initial board layout would appear all the time in the training data.


The initial board state is never encoded in the representation they use. Imagine deducing the initial state of a chess board from the sequence of moves.


The state of the game, not the behavior of playing it intentionally. There is a world of difference between the two.

It was able to model the chronological series of game states that it read from an example game. It was able to include the arbitrary "new game state" of a prompt into that model, then extrapolate that "new game state" into "a new series of game states".

All of the logic and intentions involved in playing the example game were saved into that series of game states. By implicitly modeling a correctly played game, you can implicitly generate a valid continuation for any arbitrary game state; at least with a relatively high success rate.


As I see it, we do not really know much about how GPT does it. The approximations can be very universal, so we do not really know what is computed. I take great issue with people dismissing it as "pattern matching" or "being close to the training data", because in order to generalise we try to learn the most general rules, and through increasing complexity we learn the most general, simple computations (for some definition of simple and general).

But we have fundamental, mathematical bounds on the LLM. We know that the complexity is at most O(n^2) in token length n, probably closer to O(n). It cannot "think" about a problem and recurse into simulating games. It cannot simulate. It's an interesting frontier, especially because we also have cool results about the theoretical, universal approximation capabilities of RNNs.
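A caricature of that structural point (my own sketch, not anyone's real model code): a decoder's forward pass does the same bounded amount of work per token, whereas a procedure that genuinely searches keeps recursing for as long as the problem demands.

  # Fixed-depth computation: the same stack of layers for every token.
  def transformer_step(hidden, layers):
      for layer in layers:          # fixed depth, no data-dependent looping
          hidden = layer(hidden)
      return hidden

  # Unbounded computation: a depth-first search that runs until it finds a plan.
  def solve(state, moves, is_goal, seen=None):
      seen = set() if seen is None else seen
      if is_goal(state):
          return []
      if state in seen:
          return None
      seen.add(state)
      for nxt in moves(state):
          plan = solve(nxt, moves, is_goal, seen)   # recursion depth tracks the problem, not the architecture
          if plan is not None:
              return [nxt] + plan
      return None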


There is only one thing about GPT that is mysterious: what parts of the model don't match a pattern we expect to be meaningful? What patterns did GPT find that we were not already hoping it would find?

And that's the least exciting possible mystery: any surprise behavior is categorized by us as a failure. If GPT's model has boundaries that don't make sense to us, we consider them noise. They are not useful behavior, and our goal is to minimize them.


AlphaGo likewise has an internal model of Go's game-theoretic structures, but nobody was asserting AlphaGo understands Go. Just because English is not formally specifiable does not give people an excuse to say the same model of computation, a neural network, "understands" English any more than a traditional or neural algorithm for Go understands Go.


Just spitballing, I think you’d need a benchmark that contains novel logic puzzles, not contained in the training set, that don’t resemble any existing logic puzzles.

The problem with the goat question is that the model is falling back on memorized answers. If the model is in fact capable of cognition, you’d have better odds of triggering the ability with problems that are dissimilar to anything in the training set.
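One cheap way to approximate that, though it only defeats surface-level memorization rather than familiarity with the puzzle family itself: programmatically re-skin a known puzzle with random entities and randomized constraints, so each instance is unlikely to appear verbatim anywhere in the training data. A rough sketch (the answer-checking half is left out):

  # Generate re-skinned river-crossing style prompts. The nonsense entity
  # names and wording are invented here purely for illustration.
  import random

  NAMES = ["zorble", "fimp", "quary", "blenk", "trosk", "mell"]

  def make_puzzle(rng):
      a, b, c = rng.sample(NAMES, 3)
      return (f"A keeper must ferry a {a}, a {b}, and a {c} across a river, "
              f"one at a time. The {a} eats the {b} if left alone with it, "
              f"and the {b} eats the {c} if left alone with it. "
              f"In what order should the keeper ferry them?")

  rng = random.Random(0)
  for _ in range(3):
      print(make_puzzle(rng), end="\n\n")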


Maybe Sudokus? (Generalised) Sudoku is NP-complete, and getting the "pattern" right is equivalent to abstracting the rules and solving the problem.


You would first have to define cognition. These terms often get thrown around. Is an approximation of a certain thing cognition? Only in the loosest of ways I think.


The problem is even if it has this capability, how do you get it to consistently demonstrate this ability?

It could have a dozen internal reasoning networks but it doesn't use them when you want to.


> If, during training, some collection of weights exist along the gradient that approximate cognition

What do you mean? Is cognition a set of weights on a gradient? Cognition involves conscious reasoning and understanding. How do you know it is computable at all? There are many things which cannot be computed by a program (e.g. whether an arbitrary program will halt or not)...


You seem to think human conscious reasoning and understanding are magic. The human brain is nothing more than a bio computer, and it can't compute whether an arbitrary program will halt or not either. That doesn't stop it from being able to solve a wide range of problems.


> The human brain is nothing more than a bio computer

That's a pretty simplistic view. How do you know we can't determine whether an arbitrary program will halt or not (assuming access to all inputs and enough time to examine it)? What in principle would prevent us from doing so? But computers in principle cannot, since the problem is often non-algorithmic.

For example, consider the following program, which is passed the text of the file it is in as input:

  function doesHalt($program, $inputs): bool {...}

  $input = $argv[1]; // the text of this file, passed in as input

  if (doesHalt($input, [$input])) {
      while(true) {
          print "Wrong! It doesn't halt!";
      }
  } else {
      print "Wrong! It halts!";
  }

It is impossible for the doesHalt function to return the correct result for the program. But as a human I can examine the function to understand what it will return for the input, and then correctly decide whether or not the program will halt.


Can you name a single form of analysis which a human can employ but would be impossible to program a computer to perform?

Can you tell me if a program which searches for counterexamples to the Collatz conjecture halts?

Turing's entire analysis started from the point of what humans could do.
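For concreteness, this is the kind of program meant (my own sketch, not from the thread): it halts only if it ever finds an orbit that cycles without reaching 1, and a genuinely divergent orbit would just make it run forever with no announcement, which is exactly why the question can't be settled by running it.

  # Search for a Collatz counterexample of the detectable kind (an orbit that
  # enters a cycle not containing 1). A divergent orbit would make this loop
  # forever inside enters_trivial_cycle without ever reporting anything.
  from itertools import count

  def enters_trivial_cycle(n):
      seen = set()
      while n != 1:
          if n in seen:          # found a cycle that never reaches 1
              return False
          seen.add(n)
          n = n // 2 if n % 2 == 0 else 3 * n + 1
      return True                # orbit reached 1, so Collatz holds for this n

  for n in count(2):
      if not enters_trivial_cycle(n):
          print(f"Counterexample (non-trivial cycle) at n = {n}")
          break                  # the program halts here iff such an n exists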


This is a silly argument. If you fed this program the source code of your own brain and could never see the answer, then it would fool you just the same.


You are assuming that our minds are an algorithmic program which can be implemented with source code, but this just begs the question. I don't believe the human mind can be reduced to this. We can accomplish many non-algorithmic things such as understanding, creativity, loving others, appreciating beauty, experiencing joy or sadness, etc.


> You are assuming

Your argument doesn't disprove my assumption *. In which case, what's the point of it?

* - I don't necessarily believe this assumption. But I do dislike bad arguments.


Here you are:

  func main() {
      n := 4 // isPrime is assumed to be an ordinary primality test
  OUTER:
      for {
          for i := 2; i <= n/2; i++ { // look for primes i and n-i summing to n
              if isPrime(i) && isPrime(n-i) {
                  n += 2 // Goldbach’s conjecture holds for n; try the next even number
                  continue OUTER
              }
          }
          break // no prime pair found: n is a counterexample, so the program halts
      }
  }


Actually, a computer can in fact tell that this function halts.

And while the human brain might not be a bio-computer (I'm not sure), its computational prowess is doubtfully stronger than that of a quantum Turing machine, which can't solve the halting problem either.


No, you can't; only for some of the inputs. And for those you could also write an algorithmic doesHalt function that is analogous to your reasoning.


For what input would a human in principle be unable to determine the result (assuming unlimited time)?

It doesn't matter what the algorithmic doesHalt function returns - it will always be incorrect for this program. What makes you certain there is an algorithmic analog for all human reasoning?


Well, wouldn't the program itself be an input on which a human is unable to determine the result (i.e., whether the program halts)? I'm curious about your thoughts here; maybe there's something I'm missing.

The function we are trying to compute is undecidable. Sure we as humans understand that there's a dichotomy here: if the program halts it won't halt; if it doesn't halt it will halt. But the function we are asked to compute must have one output on a given input. So a human, when given this program as input, is also unable to assign an output.

So humans also can't solve the halting problem, we are just able to recognize that the problem is undecidable.


With this example, a human can examine the implementation of the doesHalt function to determine what it will return for the input, and thus whether the program will halt.

Note: whatever algorithm is implemented in the doesHalt function will contain a bug for at least some inputs, since it's trying to generalize something that is non-algorithmic.

In principle no algorithm can be created to determine if an arbitrary program will halt, since whatever it is could be implemented in a function which the program calls (with itself as the input) and then does the opposite thing.


The flaw in your pseudo-mathematical argument has been pointed out to you repeatedly (maybe twice by me?). I should give up.


With an assumption of unlimited time even a computer can decide the halting problem by just running the program in question to test if it halts. The issue is that the task is to determine for ALL programs whether they halt, and to determine that for each of them in a FINITE amount of time.

> What makes you certain there is an algorithmic analog for all human reasoning?

(Maybe) not for ALL human thought, but at least all communicable deductive reasoning can be encoded in formal logic. If I give you an algorithm and ask you to decide whether it halts or does not halt (I give you plenty of time to decide), and then ask you to explain your result to me and convince me that you are correct, you have to put your thoughts into words that I can understand, and the logic of your reasoning has to be sound. And if you can explain it to me, you could as well encode your thought process into an algorithm or a formal logic expression. If you cannot, you could not convince me. If you can: now you have your algorithm for deciding the halting problem.


You don't get it. If you fed this program the source code of your mind, body, and room you're in, then it would wrong-foot you too.


Lol. Is there source code for our mind?


There might be or there mightn't be -- your argument doesn't help us figure out either way. By its source code, I mean something that can simulate your mind's activity.


Exactly. It's moments like this where Daniel Dennett has it exactly right that people run up against the limits of their own failures of imagination. And they tr