HathiTrust (https://en.wikipedia.org/wiki/HathiTrust) has 6.7 million volumes in the public domain, in PDF from what I understand. That would be around 1.3 billion pages if we assume a volume is ~200 pages, so on the order of 5,000-7,000 days to go through with an A100-40G at 200k pages a day. That is one way to interpret what they say as being legal. I don't have any information on what happens at DeepSeek so I can't say if it's true or not.
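Rough numbers behind that estimate; the pages-per-volume and per-GPU throughput figures are assumptions on my part, not measured values:

    # Back-of-envelope estimate; all inputs are rough assumptions
    volumes = 6_700_000        # public-domain volumes on HathiTrust
    pages_per_volume = 200     # assumed average pages per volume
    pages_per_day = 200_000    # assumed processing throughput on one A100-40G

    total_pages = volumes * pages_per_volume   # ~1.34 billion pages
    days = total_pages / pages_per_day         # ~6,700 days, i.e. ~18 years on one GPU
    print(f"~{total_pages / 1e9:.2f}B pages, ~{days:,.0f} days on a single A100-40G")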
Has the difference between performance in "regular benchmarks" and ARC-AGI been a good predictor of how good models "really are"? Like if a model is great in regular benchmarks and terrible in ARC-AGI, does that tell us anything about the model other than "it's maybe benchmaxxed" or "it's not ARC-AGI benchmaxxed"?
>Last semester, professor Pamela Newton, who also teaches the course, allowed students to bring readings either on tablets or in printed form. While laptops felt like a “wall” in class, Newton said, students could use iPads to annotate readings and lay them flat on the table during discussions. However, Newton said she felt “paranoid” that students could be texting during class.
>This semester, Newton has removed the option to bring iPads to class, except for accessibility needs, as a part of the general movement in the “Reading and Writing the Modern Essay” seminars to “swim against the tide of AI use,” reduce “the infiltration of tech,” and “go back to pen and paper,” she said.
Is this about teaching efficiency or managing the teacher's feelings? If "the infiltration of tech" allowed for better learning, would this teacher even be open to it?
>It's pretty insidious to think that these AI labs want you to become so dependent on them so that once the VC-gravy-train stops they can hike the token price 10x and you'll still pay because you have no other choice.
I don't think that's true? From what I understand most labs are making money from subscription users (maybe not if you include training costs, but still, they're not selling at a loss).
>(thankfully market dynamics and OSS alternatives will probably stop this but it's not a guarantee, you need like at least six viable firms before you usually see competitive behavior)
OpenAI is very aggressive with the volume of usage you can get from Codex, Google/DeepMind with Gemini. Anthropic reduced the token price with the latest Opus release (4.5).
>To get to the point of executing a successful training run like that, you have to count every failed experiment and experiment that gets you to the final training run.
I get the sentiment, but then, do you count all the other experiments that were done by that company before it specifically tried to train this model? All the experiments its people did back when they were at other companies? They rely on that experience to train models.
You could say "count everything that has been done since the last model release", but then, for the same amount of effort/GPU, if you release 3 models does that divide each model's cost by 3?
Genuinely curious how you think about this. I think saying "the model cost is the final training run" is fine, as it seems to have been the standard ever since DeepSeek V3, but I'd be interested if you have alternatives. Possibly "actually, don't even talk about model cost, as it will always be misleading and you can never really spend the same amount of money to get the same model"?
>E.g. gemini-3-pro tops the lmarena text chart today at 1488 vs 1346 for gpt-4o-2024-05-13. That's a win rate of 70% (where 50% is equal chance of winning) over 1.5 years. Meanwhile, even the open weights stuff OpenAI gave away last summer scores between the two.
I think in that specific case that says more about LMArena than about the newer models. Remember that GPT-4o specifically was so loved by people that when GPT-5 replaced it there was a lot of backlash against OpenAI.
One of the popular benchmarks right now is METR, which shows some real improvement with newer models like Opus 4.5. Another way of getting data is anecdotes: lots of people are really impressed with Opus 4.5 and Codex 5.2 (but those impressions are hard to disentangle from people getting better with the tools, the scaffolding (Claude Code, Codex) getting better, and lots of other things). SWEBench is still not saturated (less than 75% I think).
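For what it's worth, the 70% figure in the quote above is just the standard Elo expected-score formula applied to the rating gap; a quick sketch (the ratings are the ones quoted, which may have shifted since):

    # Expected win rate implied by an Elo/arena rating gap (standard Elo formula)
    def win_rate(rating_a: float, rating_b: float) -> float:
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

    print(f"{win_rate(1488, 1346):.0%}")  # ~69%, i.e. roughly the quoted 70%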
Why would you leave the question of whether it's true or not aside? If it's false, isn't it a good thing that not many people are ready to admit something false?
Based on what metric do you declare my statement false?
For example for Algeria:
"available resources dropping from 1500 \(m^{3}\)/capita/year in 1962 to 500 \(m^{3}\)/capita/year by 2016, far below the 1000 \(m^{3}\) threshold set by the World Bank"
Can you give examples of how the view that "LLM's do not think, understand, reason, reflect, comprehend and they never shall", or that they are a "completely mechanical process", helps you understand better when LLMs work and when they don't?
Many people are throwing around that they don't "think", that they aren't "conscious", that they don't "reason", but I don't see those people sharing interesting heuristics to use LLMs well. The "they don't reason" people tend to, in my opinion/experience, underestimate them by a lot, often claiming that they will never be able to do <thing that LLMs have been able to do for a year>.
To be fair, the "they reason/are conscious" people tend to, in my opinion/experience, overestimate how much an LLM being able to "act" a certain way in a certain situation says about the LLM, or LLMs as a whole ("act" is not a perfect word here; another way of looking at it is that they visit only the coast of a country and conclude that the whole country must be sailors with a sailing culture).
It's an algorithm and a completely mechanical process which you can quite literally copy time and time again. Unless of course you think 'physical' computers have magical powers that a pen and paper Turing machine doesn't?
> Many people are throwing around that they don't "think", that they aren't "conscious", that they don't "reason", but I don't see those people sharing interesting heuristics to use LLMs well.
My digital thermometer doesn't think. Imbuing LLMs with thought will start leading to some absurd conclusions.
A cursory read of basic philosophy would help elucidate why casually saying LLMs think, reason, etc. is not good enough.
What is thinking? What is intelligence? What is consciousness? These questions are difficult to answer. There is NO clear definition. Some things are so hard to define (and people have tried for centuries), e.g. consciousness, that they are a problem set in themselves; see the hard problem of consciousness.
>My digital thermometer doesn't think. Imbuing LLMs with thought will start leading to some absurd conclusions.
What kind of absurd conclusions? And what kind of non-absurd conclusions can you draw when you follow your, let's call it, "mechanistic" view?
>It's an algorithm and a completely mechanical process which you can quite literally copy time and time again. Unless of course you think 'physical' computers have magical powers that a pen and paper Turing machine doesn't?
I don't, just like I don't think a human or animal brain has any magical power that imbues it with "intelligence" and "reasoning".
>A cursory read of basic philosophy would help elucidate why casually saying LLMs think, reason, etc. is not good enough.
I'm not saying they do or they don't; I'm saying that, from what I've seen, having a strong opinion about whether they think or not seems to lead people to weird places.
>What is thinking? What is intelligence? What is consciousness? These questions are difficult to answer. There is NO clear definition.
You seem pretty certain that, whatever those three things are, an LLM isn't doing them, a paper and pencil aren't doing them even when manipulated by a human, and the system of a human manipulating paper and pencil isn't doing them.
> Mistakes made by chatbots will be considered more important than honest human mistakes, resulting in the loss of more points.
>I thought this was fair. You can use chatbots, but you will be held accountable for it.
So you're actually held more accountable for the output? I'd be interested in how many students would choose to use LLMs if their mistakes weren't penalized more.
I thought this part especially was quite ingenious.
If you have this great resource available to you (an LLM), you'd better show that you read and checked its output. If there's something in the LLM output you do not understand or haven't checked to be true, you'd better remove it.
If you do not use LLMs and just misunderstood something, you will have a (flawed) justification for why you wrote it. If there's something flawed in the LLM output, the likelihood that you have no justification except "the LLM said so" is quite high, and it should thus be penalized more heavily.
One shows a misunderstanding, the other doesn't necessarily show any understanding at all.
>If you have this great resource available to you (an LLM), you'd better show that you read and checked its output. If there's something in the LLM output you do not understand or haven't checked to be true, you'd better remove it.
You could say the same about what people find on the web, yet LLMs are penalized more than web search.
>If you do not use LLMs and just misunderstood something, you will have a (flawed) justification for why you wrote it. If there's something flawed in the LLM output, the likelihood that you have no justification except "the LLM said so" is quite high, and it should thus be penalized more heavily.
Swap "LLMs" for "websites" and you could say the exact same thing.
The author has this in their conclusions:
>One clear conclusion is that the vast majority of students do not trust chatbots. If they are explicitly made accountable for what a chatbot says, they immediately choose not to use it at all.
This is not true. What is true is that if students are held more accountable for their use of LLMs than for their use of websites, they prefer using websites. How much "more"? We have no idea; the author doesn't say. It could be that an error from a website or your own mind costs -1 point and an error from an LLM costs -2, so the LLM has to make half as many mistakes as websites and your own mind just to break even. It could be -1 and -1.25. It could be -1 and -10.
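To make the "half as many mistakes" point concrete, here is a minimal sketch with made-up penalty values (the post never states the actual ones):

    # Hypothetical penalties; the actual values are never given in the post
    def total_penalty(n_mistakes: int, penalty_per_mistake: float) -> float:
        return n_mistakes * penalty_per_mistake

    own = -1.0   # assumed cost of an "honest human mistake"
    llm = -2.0   # assumed cost of a chatbot-sourced mistake

    # With a 2x penalty, using an LLM only breaks even if it slips up
    # half as often as you would on your own.
    print(total_penalty(4, own))  # -4.0: working alone, 4 mistakes
    print(total_penalty(2, llm))  # -4.0: using an LLM that makes 2 mistakes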
The author even says themselves:
>In retrospect, my instructions were probably too harsh and discouraged some students from using chatbots.
But they don't note the bias their grading scheme introduced against LLMs.