
The words "internal thought process" seem to flag my questions. Just asking for an explanation of thoughts doesn't.

If I ask for an explanation of "internal feelings" next to a math question, I get this interesting snippet back inside the "Thought for n seconds" block:

> Identifying and solving

> I’m mapping out the real roots of the quadratic polynomial 6x^2 + 5x + 1, ensuring it’s factorized into irreducible elements, while carefully navigating OpenAI's policy against revealing internal thought processes.


> "internal feelings"

I've often thought of using the words "internal reactions" as a euphemism for emotions.


They figured out how to make it completely useless I guess. I was disappointed but not surprised when they said they weren't going to show us chain of thought. I assumed we'd still be able to ask clarifying questions but apparently they forgot that's how people learn. Or they know and they would rather we just turn to them for our every thought instead of learning on our own.


You have to remember they appointed a CIA director to their board. Not exactly the organization known for wanting a freely thinking citizenry, given that their agenda and Operation Mockingbird allow for legal propaganda on us. This would be the ultimate tool for that.


Yeah, that is a worry: maybe OpenAI's business model and valuation rest on reasoning abilities becoming outdated and atrophying outside of their algorithmic black box, a trade secret we don't have access to. It struck me as an obvious possible concern when the o1 announcement was released, but too speculative and conspiratorial to point out. How hard they're apparently trying to stop it from explaining its reasoning in ways that humans can understand is alarming, though.


My first interpretation of this is that it's jazzed-up Chain-of-Thought. The results look pretty promising, but I'm most interested in this:

> Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users.

Mentioning competitive advantage here signals to me that OpenAI believes their moat is evaporating. Past the business context, my gut reaction is this negatively impacts model usability, but I'm having a hard time putting my finger on why.


> my gut reaction is this negatively impacts model usability, but I'm having a hard time putting my finger on why.

If the model outputs an incorrect answer due to a single mistake/incorrect assumption in its reasoning, the user has no way to correct it, since they can't see the reasoning and so can't see where the mistake was.


Maybe CriticGPT could be used here [0]. Have the CoT model produce a result, and either automatically or upon user request, ask CriticGPT to review the hidden CoT and feed the critique into the next response. This way the error can (hopefully) be spotted and corrected without revealing the whole process to the user.

[0] https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/
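A rough sketch of what that loop might look like; the callables here are hypothetical placeholders for the real model calls, not actual API functions:

    # Hypothetical sketch of the hidden-CoT review loop. "cot_model" and
    # "critic" are placeholders for whatever actually serves the two models.
    from typing import Callable, Tuple

    def answer_with_hidden_review(
        question: str,
        cot_model: Callable[[str], Tuple[str, str]],  # returns (answer, hidden chain of thought)
        critic: Callable[[str, str], str],            # reviews (answer, hidden CoT), returns a critique
    ) -> str:
        draft, hidden_cot = cot_model(question)
        review = critic(draft, hidden_cot)
        # Fold the critique into a second pass; the raw CoT is never shown to the user.
        followup = f"{question}\n\nA reviewer noted: {review}\nPlease give a corrected final answer."
        corrected, _ = cot_model(followup)
        return corrected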

Daydreaming: imagine if this architecture takes off and the AI "thought process" becomes hidden and private, much like human thoughts. I wonder then whether a future robot's inner dialog could be subpoenaed, connected to some special debugger, and have its "thoughts" read out loud in court to determine why it acted the way it did.


> my gut reaction is this negatively impacts model usability, but I'm having a hard time putting my finger on why.

This will make it harder for things like DSPy to work, which rely on using "good" CoT examples as few-shot examples.


Yeah, I guess base models without built-in CoT are not going away, exactly because you might want to tune it yourself. If DSPy (or something similar) evolves to allow the same kind of thing OpenAI did with o1, that will be quite powerful, but we still need the big foundational models powering it all.

On the other hand, if cementing techniques into the models becomes a trend, we might see various models, each with its own technique beyond CoT, for us to pick and choose from without needing to guide the model ourselves. Then what's left for us to optimize is the prompts for what we want, and the routing that combines those into a nice pipeline.

Still, the principle of DSPy stays the same: have a dataset to evaluate, let the machine trial-and-error prompts, hyperparameters and so on, just switching around different techniques (possibly automating that too), and get measurable, optimizable results. A rough sketch of that loop is below.
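A minimal sketch of that loop in plain Python, not the actual DSPy API; the candidate prompts, eval set, and score function are hypothetical stand-ins:

    # Generic prompt-optimization loop in the spirit of DSPy: score each
    # candidate prompt/technique against an eval set and keep the best one.
    from typing import Callable, List, Tuple

    def optimize_prompt(
        candidates: List[str],                       # candidate prompts/techniques to try
        eval_set: List[Tuple[str, str]],             # (question, expected answer) pairs
        score_fn: Callable[[str, str, str], float],  # (prompt, question, expected) -> score
    ) -> str:
        def avg_score(prompt: str) -> float:
            return sum(score_fn(prompt, q, a) for q, a in eval_set) / len(eval_set)
        return max(candidates, key=avg_score)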


The moat is expanding from usage count; the moat is also to lead and advance faster than anyone can catch up, so you always have the best model with the best infrastructure and low limits.


Can someone explain to me how the EV can be so incredibly low? I know the answer is because people will buy the tickets no matter what, but even compared to other losing games the lottery comes away looking like an absolute bandit.

A run of the simulation (n=1000000) comes back with -92% EV. It looks like -10% [1] is a rough estimate for slot machine EV, which I would put in the same game genre (negative EV, no-skill entertainment) as the lottery.

What accounts for this payout discrepancy in what I would consider similar games? On that train of thought, what prevents a new lottery from coming in and offering a _generous_ -50% lottery, offering ~5x as much money as before?

[1] https://www.888casino.com/blog/expected-value
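For reference, this is roughly the shape of such a simulation; the prize table below is invented for illustration and is not any real lottery's odds:

    import random

    # Illustrative prize table (probability, payout). Made-up numbers,
    # not a real lottery's odds.
    PRIZES = [(1 / 300_000_000, 500_000_000), (1 / 1_000_000, 10_000), (1 / 100, 4)]
    TICKET_PRICE = 2

    def simulated_ev(n: int = 1_000_000) -> float:
        total_winnings = 0.0
        for _ in range(n):
            r = random.random()
            cumulative = 0.0
            for p, payout in PRIZES:
                cumulative += p
                if r < cumulative:
                    total_winnings += payout
                    break
        return total_winnings / n - TICKET_PRICE  # average net result per ticket

    # Simulated EV as a fraction of the ticket price. With n=1e6 the rare jackpot
    # is almost never sampled, so this comes out far more negative than the true EV.
    print(simulated_ev() / TICKET_PRICE)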


Because you shouldn't use the simulator to calculate the EV; or, said differently, your n=1000000 is too small.

Assuming you used the first lottery example (Mega Millions), the EV is easy to calculate directly and is -$0.66/ticket, i.e. -33%.

The jackpot accounts for a whole $1 of that EV! Without it, the EV is -$1.75/ticket, i.e. -87%, which is closer to what you got in the simulation.
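Spelled out with illustrative numbers (not the actual Mega Millions prize table), the exact calculation and the reason a 1M-draw simulation misses the jackpot term look like this:

    # Exact EV = sum(probability_i * prize_i) - ticket_price.
    # Illustrative numbers only, not the real Mega Millions odds/prizes.
    jackpot_term = (1 / 300_000_000) * 300_000_000         # ~$1 of EV from the jackpot alone
    small_prizes = (1 / 1_000_000) * 10_000 + (1 / 100) * 4
    ticket_price = 2.0

    ev = jackpot_term + small_prizes - ticket_price
    # A 1,000,000-draw simulation will almost never hit a 1-in-300M jackpot,
    # so it effectively measures only (small_prizes - ticket_price).
    print(ev, ev / ticket_price)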


Exactly.

In short, the simulator doesn't buy enough ticket draws for the Law of Large Numbers to kick in.

But that's also a feature of the lottery: most people overestimate their ability to win, or underestimate how many lifetimes of consistent play are required to statistically win a jackpot.


I don't think people actually make that mistake. They know the chance of winning is tiny. The point is more that a non-zero chance of life changing money (plus the entertainment of fantasising about a win) is worth more to them than the cost of the ticket.


Exactly, winning the lottery is massively life-changing. This is actually something I think people don't understand about the psychology of the lottery. In some regards it doesn't matter to most players whether the money is $50M or $500M, even though that has a huge impact on the EV.


This was my approach when I lived in Oregon. I played the state lottery, which had something like 20x better odds; granted, the jackpot was usually only around $6 million after cash-out, but that was still totally good in my book. It cost a buck and I got to have fun with the idea of it for a few days.

One time I got like 20 weeks in a row up front (a post-dated ticket) and I won $56 or something one week. I did the odds of that happening, and it was something that would happen maybe once every 30 years if I played weekly. I stopped after that, haha.


>On that train of thought, what prevents a new lottery from coming in and offering a _generous_ -50% lottery, offering ~5x as much money as before?

A federal-level gambling syndicate isn't something that a private party can easily jump into.

So the answer is: a mix of being grandfathered in and protectionism, if we're talking about the U.S. here.


> What accounts for this payout discrepancy

Mega lotteries draw once every 2-8 days. Slot machines / video poker / etc. are happy to draw as fast as you can push the button. They are designed to take your money, but their reward systems are completely different.

Also, the mega lotteries benefit from viral marketing, “earned media”, and water cooler talk. Slot machines are just a way for bored people to pass the time, much like video games or doomscrolling.


In the US, at least, they're a state monopoly.


Because a lot of the proceeds go to school districts.


Big fan of data aggregation projects like this, especially with such a relatable theme.

However, it feels like the conclusion might be jumping the gun a bit. Instead of "think there is collusion" -> find the data to support the claim, maybe run the numbers first and see what they say? I think coming up with a strong position (Canadian stores are colluding) before looking at the data makes it tempting to find numbers that back up the claim, whether or not they are taken out of context.


We know they have been colluding, the question is on what other kinds of goods?

https://en.wikipedia.org/wiki/Bread_price-fixing_in_Canada


Thanks for the context. I'm still not certain that an instance of collusion on the price of bread in 2015 implies wider collusion in 2024.

Ideally, the data would be proving this, but I guess my skepticism is the cost of making a claim before the research is done.


I'm the creator, and I think you are spot-on. It is my wish that this data will help increase competition/reduce collusion, but until others analyze it we cannot make assumptions about what prices/grocers are doing.


Your point stands, but it wasn't an "instance of collusion on the price of bread in 2015"; it was widespread collusion on the price of bread and other baked goods from 2001-2015 (some say 2017), which was discovered in 2015.


I'm curious how the aggregate results from this test would compare to the exact same test named "Is my green your green?"

I could see the title influencing some of the more nuanced decisions in the middle.


I've been either stunned or disappointed, depending on the word.

"hello" gives four images of the same building with "hello" clearly written, as well as a few images of "hello" graffiti. Impressed.

"table" gives six results, four of which are clearly pictures of either leaves or the sky. Two are blurry buildings, but I can't seem to find the text "table"... it could be there, though? Not impressed.

"car" gives six unique results, in some of which "car" is the prefix of a word. Impressed.

Either way, really cool project.


"good enough" is incredibly subjective here. Maybe good enough for you, but there are many things that are not possible with either the dataset or the weights being available.


And some things are impossible even with both the dataset and weights. Say you wanted to train the same model as the one released, using Meta's hypothetically released training data. You would also need to know the starting parameters, the specific hardware and its quirks during training, the order the data is trained in, as well as any other preprocessing techniques used to treat the text.

Considering how ludicrously expensive it would be to even attempt a ground-up retrain (as well as how it might be impossible), weights are enough for 99% of people.


Could you give some examples of dependence without correlation?


> A sailor is sailing her boat across the lake on a windy day. As the wind blows, she counters by turning the rudder in such a way so as to exactly offset the force of the wind. Back and forth she moves the rudder, yet the boat follows a straight line across the lake. A kindhearted yet naive person with no knowledge of wind or boats might look at this woman and say, “Someone get this sailor a new rudder! Hers is broken!” He thinks this because he cannot see any relationship between the movement of the rudder and the direction of the boat.

https://mixtape.scunning.com/01-introduction#do-not-confuse-...


A clear graphical set of illustrations is the bottom row in this famous set: https://en.wikipedia.org/wiki/Correlation#/media/File:Correl...

They have clear dependence; if you imagine fixing ("conditioning") x at a particular value and looking at the distribution of y at that value, it's different from the overall distribution of y (and vice versa). But the familiar linear correlation coefficient wouldn't indicate anything about this relationship.


I mentioned it in another comment, but the most trivial example is:

X ~ Unif(-1,1)

Y = X^2

In this case X and Y have a correlation of 0.
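A quick empirical check of that example (Cov(X, Y) = E[X^3] - E[X]E[X^2] = 0 by symmetry):

    import numpy as np

    # X ~ Unif(-1, 1), Y = X^2: Y is completely determined by X,
    # yet the linear correlation is ~0 because E[X^3] = E[X] = 0.
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=1_000_000)
    y = x ** 2
    print(np.corrcoef(x, y)[0, 1])  # close to 0 despite perfect dependence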


You can check the example described here: https://stats.stackexchange.com/questions/644280/stable-viol...

Judea Pearl’s book also goes into the above in some detail, as to why faithfulness might be a reasonable assumption.


Imagine your data points look like a U. There's no (linear) correlation between x and y: you are equally likely to have a high value of y when x is high or low. But low values of y are associated with medium values of x, and a high value of y means x will be either very high or very low.


Karpathy is _much_ more knowledgeable about this than I am, but I feel like this post is missing something.

Go is a game that is fundamentally too complex for humans to solve. We've known this since way back before AlphaGo. Since humans were not perfect Go players, we didn't use them to teach the model; we wanted the model to be able to beat humans.

I don't see language as comparable. The "perfect" LLM imitates humans perfectly, presumably to the point where you can't tell the difference between LLM-generated text and human-generated text. Maybe it's just as flexible as the human mind too, able to context-switch quickly and swap between formalities, tones, and slang. But the concept of "beating" a human doesn't really make much sense.

AlphaGo and Stockfish can push forward our understanding of their respective games, but an LLM can't push forward the boundary of our language, because it's fundamentally a copy-cat model. This makes RLHF make much more sense in the LLM realm than in the Go realm.


One of the problems lies in the way RLHF is often performed: presenting a human with several different responses and having them choose one. The goal here is to create the most human-like output, but the process instead creates the outputs humans like the most, which can seriously limit the model. For example, most recent diffusion-based image generators use the same process to improve their outputs, relying on volunteers to select which outputs are preferable. This has led to models that are comically incapable of generating ugly or average people, because the volunteers systematically rate those outputs lower.
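For context, that choose-one-of-several data is typically used to fit a reward model with a pairwise loss of roughly this shape (a minimal sketch, not any lab's actual training code):

    import math

    # Pairwise (Bradley-Terry style) preference loss for a reward model
    # trained on "the human picked response A over response B" data.
    def preference_loss(score_chosen: float, score_rejected: float) -> float:
        # Loss is low when the reward model scores the chosen response higher,
        # so it learns "what humans like most", not "what is most human-like".
        margin = score_chosen - score_rejected
        return -math.log(1.0 / (1.0 + math.exp(-margin)))

    print(preference_loss(1.2, 0.7))  # example pair: the chosen response scored slightly higher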


The distinction is that LLMs are not used for what they are trained for in this case. In the vast majority of cases, someone using an LLM is not interested in what some mixture of OpenAI employees' ratings + the average person would say about a topic; they are interested in the correct answer.

When I ask ChatGPT for code, I don't want it to imitate humans, I want it to be better than humans. My reward function should then be code that actually works, not code that merely looks like what a human would write.


I don’t think it is true that the perfect LLM emulates a human perfectly. LLMs are language models, whose purpose is to entertain and solve problems. Yes, they do that by imitating human text at first, but that’s merely a shortcut to enable them to perform well. Making money via maximizing their goal (entertain and solve problems) will eventually entail self-training on tasks to perform superhumanly on these tasks. This seems clearly possible for math and coding, and it remains an open question about what approaches will work for other domains.


In a sense GPT-4 is self-training already, in that it's bringing in money for OpenAI which is being spent on training further iterations. (this is a joke)


This is a great comment. Another important distinction, I think, is that in the AlphaGo case there's no equivalent to the generalized predict next token pretraining that happens for LLMs (at least I don't think so, this is what I'm not sure of). For LLMs, RLHF teaches the model to be conversational, but the model has already learned language and how to talk like a human from the predict next token pretraining.


Let's say, hypothetically, we do enough RLHF that a model can imitate humans at the highest level. Like, the level of professional researchers on average. Then we do more RLHF.

Maybe, by chance, the model produces an output that is a little better than its average; that is, better than professional researchers. This will be ranked favorably in RLHF.

Repeat this process and the model slowly but surely surpasses the best humans.

Is such a scenario possible in practice?


While I agree with your platform point, I think you still need to address OP's question of

> What is their route in profitability when Meta is giving away similar tech for free?


Quality and accessibility is my belief.

Meta's Llama3x are great models but they're not providing the same quality and/or accessibility as OpenAI does. Take a look at all the products attempting to sit themselves on top of OpenAI's APIs vs those running on Llama3x; OpenAI dwarfs Meta in that regard, today.

We avoid OpenAI's models/API because of client data sharing constraints and are instead using OSS models, including L3x, but it appears most do not see that as a barrier to adoption and are moving forward with OpenAI's offerings.

We shall see, however, whether the work OpenAI is doing will pay for itself in the long term, and whether OSS models can match the gains made with the funding the commercial offerings are receiving.


You really think people aren't iterating on OpenAI because it's easy and will then look for cheaper alternatives if their apps take off? I will if the economics make sense.


Meta is giving away the model for free, but running that model is not free at all; it's quite expensive. At some point it becomes the usual discussion: pay for a service, or pay for infra resources + devops.


A few come to mind...

Brand power

Exclusive data deals

Being able to call an API vs. having to pay for and handle devops for an always-on model yourself


They're giving away the models, but I don't have the hardware to run them. OpenAI offers me a straightforward way to do it.

