In what way did they "release" this? I can't find it on Hugging Face or Ollama, and they only seem to have a "try online" link in the article. "Self-sovereign intelligence", indeed.
They released it in the same sense OpenAI released GPT-4: there is an online demo you can chat with, and a form to get in touch with sales for API access.
Facebook trained the model on an Internet's worth of copyrighted material without any regard for licenses whatsoever - even if model weights are copyrightable, which is an open question, you're doing the exact same thing they did. Probably not a bulletproof legal defense though.
Yes, and there have already been court cases ruling that training AI on copyrighted data is fair use, because it's technically no different from any other form of art: everything is based on ideas seen elsewhere, and there are no new ideas anymore.
DeepThought-8B: 200,000 (based on 2020 census data)
Claude: 300-350,000
Gemini: 2.7M during peak times (strange definition of population!)
I followed up with DeepThought-8B: "what is the population of all of manhattan, and how does that square with only having 200,000 below CP" and it cut off its answer, but in the reasoning box it updated its guess to 400,000 by estimating as a fraction of land area.
Whether you bake the behaviour in or wrap it in an external loop, you need to train/tune for the expected behaviour. Generic models can do chain of thought if asked, but will be worse than a specialised one.
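For what it's worth, the "external loop" variant can be as thin as two calls to a stock chat API. A minimal sketch, assuming the OpenAI Python client (openai>=1.0) and an arbitrary model name, neither of which has anything to do with the model discussed here:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def reason_then_answer(question: str, model: str = "gpt-4o-mini") -> str:
        # Pass 1: elicit a chain of thought from a generic, untuned model.
        thoughts = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": f"Think step by step about this question: {question}"}],
        ).choices[0].message.content
        # Pass 2: feed the reasoning back and ask for only the final answer.
        return client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": f"Question: {question}\nReasoning so far: {thoughts}\n"
                                  "Give only the final answer."}],
        ).choices[0].message.content

The tuning the parent describes is what makes the intermediate "thoughts" actually useful rather than filler.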
Beam search just traverses different continuation paths and scores each path by its cumulative probability under the model. The paths with higher probabilities are kept and the ones with lower probabilities are pruned until the search terminates with an "answer". The marketing department calls it "reasoning" and "test-time compute" because the average consumer does not care whether it's beam search or something else.
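For anyone unfamiliar, here is a toy sketch of the idea. The next_token_probs(prefix) function is a stand-in for whatever the model's softmax output would be; nothing here is specific to any particular product:

    import math

    def beam_search(next_token_probs, beam_width=3, max_len=5, eos="<eos>"):
        # next_token_probs(prefix) returns {token: probability} for that prefix.
        beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
        for _ in range(max_len):
            candidates = []
            for seq, score in beams:
                if seq and seq[-1] == eos:        # finished paths carry over unchanged
                    candidates.append((seq, score))
                    continue
                for tok, p in next_token_probs(seq).items():
                    candidates.append((seq + [tok], score + math.log(p)))
            # keep only the highest-probability paths, prune the rest
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        return beams[0][0]  # best-scoring sequence found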
Your link seems to do a good job of explaining beam search, but it's a classic algorithm in state-space exploration, so most books on search algorithms and discrete optimization will have a section about it.
An algorithm that searches for the highest probability answer is not "reasoning"? "Search" has been a fundamental building block of GOFAI since the beginning. How do you define reasoning? Can you justify it being different from the last 70 years of thought on the topic?
Since you asked, I define reasoning as Cambridge does:
Reasoning:
"the process of thinking about something in order to make a decision"
Thinking:
"the activity of using your mind to consider something"
Mind:
"the part of a person that makes it possible for him or her to think, feel emotions, and understand things"
I conclude that "An algorithm that searches for the highest probability answer" is not described by "reasoning"
I also think that Cambridge's definition of Mind is incomplete: it lacks the creativity part, alongside cognition and emotions. But that's a vastly different topic.
>I conclude that "An algorithm that searches for the highest probability answer" is not described by "reasoning"
I recall reading about a theory of neurology where many thoughts (neural circuits) fired simultaneously and then one would "win" and the others got suppressed. The closest thing I can find right now is Global Workspace Theory.
The Chinese Room is a thought experiment: a room that appears to contain a "Chinese speaker", but when given a text to 'understand' it actually just looks the text up among a huge number of words inside until it finds a matching response, and then outputs that response as its reply.
The known upper bound for transformers' on-the-fly computation abilities is a complexity class called DLOGTIME-uniform TC^0.
There is a lot to unpack there, but if you take FO as first-order logic, closed under conjunction (∧), negation (¬) and universal quantification (∀), you will find that DLOGTIME-uniform TC^0 is equal to FO plus majority quantifiers (majority gates, on the circuit side).
So be careful about that distinction.
To help break the above down:
DLOGTIME = constructible by a RAM or TM in logarithmic time.
uniform = the whole circuit family is produced by one uniform procedure (here, a DLOGTIME machine), rather than being an arbitrary family with a hand-picked circuit for each input size.
TC^0 = constant-depth threshold circuits.
Even NP == SO-E, the second-order queries where the second-order quantifiers are only existentials.
DLOGTIME-uniform TC^0 is a WAY smaller class than most people realize, but anything that is an algorithm or a program is basically logic, with P being FO plus a least-fixed-point operator, or half a dozen other known mappings.
Transformers can figure out syntax, but if you dig into that DLOGTIME-uniform part, you will see that general semantic correctness isn't really an option... thus the need to leverage the pattern matching and retrieval from pre-training as much as possible.
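For reference, my understanding of the descriptive-complexity correspondences being leaned on here (the NL and P ones hold over ordered structures):

    \begin{align*}
    \text{DLOGTIME-uniform } \mathrm{TC}^0 &= \mathrm{FO}[M] && \text{(FO with majority quantifiers)} \\
    \mathrm{NP} &= \exists\mathrm{SO} && \text{(Fagin's theorem)} \\
    \mathrm{NL} &= \mathrm{FO}[\mathrm{TC}] && \text{(FO with transitive closure)} \\
    \mathrm{P} &= \mathrm{FO}[\mathrm{LFP}] && \text{(FO with least fixed point, Immerman--Vardi)}
    \end{align*}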
Thanks. If I'm reading this right, the limiting factor on the intelligence of current LLMs is not the network size, nor training data (size/quality) but rather the architecture? Do we know of a better one for complex computations / "reasoning"?
Given the name they gave it, someone with access should ask it for the “Answer to the Ultimate Question of Life, The Universe, and Everything”
If the answer is anything other than a simple “42”, I will be thoroughly disappointed. (The answer has to be just “42”, not a bunch of text about the Hitchhikers Guide to the Galaxy and all that.)
I asked it "Describe how a device for transportation of living beings would be able to fly while looking like a sphere" and it just never returned an output
The reasoning steps look reasonable and the interface is simple and beautiful, though Deepthought-8b fails to disambiguate "the ruliad" (the technical concept from Wolfram physics) from this company's name, Ruliad. Maybe that isn't in the training data, because it misunderstood the problem when asked "what is the simplest rule of the ruliad?" and went on to reason about the company's core principles. Cool release, waiting for the next update.
Beginning from the values for fundamental physical constants, is it possible to derive the laws of entropy ?
and then based on its response to that I asked it
Based on this analysis, can you identify and describe where the dividing line is between (a) the region where (microscopic/atomic) processes are reversible, and (b) the region where macroscopic processes are irreversible ?
That will be very common in the corpus, thus pattern matching and retrieval more than 'reasoning'.
Not that it isn't useful, just expected behavior for an LLM these days.
I would be interested in the output, because it is almost certainly the more common but fundamentally incorrect intro-class half-truth, the kind that is useful for the typical curriculum but nowhere near the frontier of current understanding.
It's not giving me a link, and its responses are multi-step and wordy. I can post some pullquotes...
Yes, it is possible to derive the laws of entropy from the fundamental physical constants. Entropy is a measure of the number of possible microscopic configurations of a system, and it is related to the fundamental physical constants through the Boltzmann constant (k_B) and the laws of thermodynamics.
The second law of thermodynamics states that the total entropy of an isolated system can only increase over time, or remain constant if the system is in equilibrium. This law can be derived from the fundamental physical constants and the microscopic behavior of particles in a system.
One way to derive the laws of entropy is by considering the statistical mechanics of a system. In statistical mechanics, entropy is related to the number of possible microscopic configurations (microstates) of a system, which can be described by the Boltzmann constant (k_B). The entropy (S) of a system is given by the Boltzmann formula:
S = k_B * ln(Ω)
where Ω is the number of possible microstates.
The fundamental physical constants, such as the speed of light (c), the gravitational constant (G), and the Planck constant (h), are related to the other constants through various equations. For example, the Boltzmann constant (k_B) can be expressed in terms of the Planck constant (h), the speed of light (c), and the gravitational constant (G):
k_B = (h * P / 2π) * (c^2 / G)
where P is the pressure.
By using these relationships and considering the microscopic behavior of particles in a system, it is possible to derive the laws of entropy from the fundamental physical constants. However, it is important to note that entropy is a macroscopic property, and its laws are not directly derived from the fundamental physical constants but rather from the microscopic behavior of particles and the laws of thermodynamics.
[Q: Based on this analysis, can you identify and describe where the dividing line is between (a) the region where (microscopic/atomic) processes are reversible, and (b) the region where macroscopic processes are irreversible ?]
The dividing line between the region where microscopic/atomic processes are reversible and the region where macroscopic processes are irreversible lies in the scale at which the processes occur and the interactions between particles.
At the microscopic level (a), processes are generally considered reversible because the interactions between individual particles can be easily reversed. For example, when two particles collide, they can exchange momentum and energy, and the process can be considered reversible. In this region, the system is in thermodynamic equilibrium, and the second law of thermodynamics does not apply.
However, as we move to the macroscopic level (b), the number of particles involved in the processes increases, and the interactions between them become more complex. In this region, the processes are considered irreversible because the energy and momentum exchanges between particles lead to an increase in entropy. This increase in entropy is a result of the second law of thermodynamics, which states that the total entropy of an isolated system can only increase over time or remain constant if the system is in equilibrium.
The dividing line between reversible and irreversible processes can be identified by examining the scale at which the system operates and the interactions between particles. At the microscopic level, processes are generally reversible, while at the macroscopic level, processes are irreversible due to the increase in entropy resulting from the interactions between a large number of particles.
Please don't call model weights source code. Code is something you can (usually) read and understand; if anything, weights are closer to a very obfuscated compiled binary - although even that can be analyzed by a good enough reverse engineer.
"Open Source" models are the latest in a long series of attempts to take advantage of FOSS's reputation without actually releasing something that adheres to the principles.
I would think truly open source means also including the initial training data, i.e. everything needed to build[/tweak/customize] the thing from the ground up.
Neat! But if we're sticking to the source code analogy, the matrix multiplier probably maps closer to a CPU or an interpreter - I wouldn't say every Java program is open source because you have an open-source JVM.
Yes, which is why if you're serious you will exclusively use the terminal to protect your computer from even the passing chance of being exposed to a so called "image file".
In all seriousness, compilation and model training are lossy processes and erase a lot of the context needed to understand the output (and with model training we don't fully understand it even with access to the training data). Images aren't necessarily derivative of anything, so the analogy breaks down here.
Often images are not only lossily compressed, but they are a composite of many images/layers/image transformations. These are lost when the image is flattened and then distributed.
I asked it 'find two primes whose sum is 123'... it has been in deep thought for 5 minutes, just looping and looping over seemingly repeated hallucinations of the right path. (BTW, ChatGPT immediately answers 61 and 62, lol... so much for intelligence.)
To find two prime numbers that sum to 123, let’s consider the properties of prime numbers and the nature of their sums:
1. Prime Numbers Greater Than 2 Are Odd: Except for 2, all prime numbers are odd.
2. Sum of Two Odd Numbers Is Even: Adding two odd numbers results in an even number.
3. Sum of an Even and an Odd Number Is Odd: Adding an even number (like 2) and an odd number results in an odd sum.
Given that 123 is an odd number, the only way two primes can sum to 123 is if one of them is 2 (the only even prime) and the other is an odd prime. Let’s test this possibility:
• 2 + x = 123 ⇒ x = 121
However, 121 is not a prime number (since 121 = 11 × 11).
Next, consider the possibility of adding two odd primes:
• Odd Prime + Odd Prime = Even Number
Since adding two odd primes yields an even number, it’s impossible for two odd primes to sum to 123 (an odd number).
Conclusion: There are no two prime numbers that sum to 123.
Answer: There are no two prime numbers whose sum is 123; such primes do not exist.
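For what it's worth, a quick brute-force check (a few lines of Python, not part of the model's output) confirms that conclusion, and also shows why ChatGPT's "61 and 62" doesn't work, since 62 isn't prime:

    def is_prime(n: int) -> bool:
        # trial division is plenty for numbers this small
        return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

    pairs = [(p, 123 - p) for p in range(2, 123) if is_prime(p) and is_prime(123 - p)]
    print(pairs)         # [] -- the only candidate is (2, 121), and 121 = 11 * 11
    print(is_prime(62))  # False, so 61 + 62 is not a pair of primes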
It looks nice, but my Chrome browser on Android has all sorts of trouble rendering the animated bits, so it ends up skipping frames as I navigate and click. Add to that, the model doesn't respond at all in my multiple attempts; it's a waste of time until that's remedied.
"Model A 13B", "Model B 20B" etc are pretty vapid claims. Which actual models? There are plenty of terrible high-param-count models from a year or two ago. The benchmark seems meaningless without saying what models are actually being compared against... And "13B" in particular is pretty sketchy: are they comparing it against Llama 2 13B? Even an untuned Llama 3.1 8B would destroy that in any benchmark.
It's a bit buried, but there's a link further down that takes you to an interface similar to ChatGPT. However, I have not been able to get any output from it; it seems to just think forever. I must be giving it the wrong prompts...
The user has provided a simple greeting 'what's up' and is expecting a response. This is a casual conversation starter, and I need to provide an appropriate reply.
2. analysis: The phrase 'what's up' is an informal greeting asking about someone's well-being or current state. It's commonly used in casual conversations with friends, family, or acquaintances.
3. contextualization: As an AI language model, I don't have personal experiences or emotions, so I need to provide a response that acknowledges the greeting without implying any personal state.
4. solution proposal: An appropriate response would be to acknowledge the greeting and offer a brief, friendly reply that doesn't require reciprocating any personal information.
5. creative thinking: I'll craft a response that is friendly, concise, and universally applicable: 'Not much, just hanging out and processing information. How about you?'
6. evaluation: This response acknowledges the greeting, mentions my 'state' (processing information) which is true for an AI, and turns the question back to the user, encouraging further conversation.
7. verification: The proposed response meets all the criteria: it's friendly, appropriate for the casual tone of 'what's up', and doesn't require any personal information from me as an AI.
8. conclusion drawing: The final response 'Not much, just hanging out and processing information. How about you?' is an appropriate and effective way to respond to the user's 'what's up'.
---
Not much, just hanging out and processing information. How about you?
Oh fuck me, it’s a fellow autistic. You’d think after all these years on the planet my brain would be able to immediately respond with “Fine. You?” but no, this is what my brain does before eventually replying.
A bit off-topic, but that comparison graph is a great example why you should buy your designer a cheap secondary screen. I was viewing it on my second monitor and had to lean in to make out the off-white bar for Model D on the light-grey background. Moved the window over to my main screen and it's clear as day, five nice shades of coffee on a light-gray background.
That's a pretty egregious mistake for a designer to make -- and that's not even mentioning the lack of accessibility. WebAIM's contrast checker says it's a 1:1 contrast ratio!
If someone is releasing a model that claims to have a level of reasoning, one would hope that their training dataset was scrutinized and monitored for unintended bias (as any statistical dataset is susceptible to: see overfitting). But if the graph on the announcement page is literally unreadable to seemingly anyone but the creator... that's damning proof that there is little empathy in the process, no?
I wouldn’t say it’s implied, but there’s a reason people put on nice clothes for an interview.
I’m looking at the graphs on my phone and I’m pretty sure that there are 5 graphs and 3 labels. And their 8B model doesn’t seem to be very good, looks like a 20B model beats it in every single benchmark.
The body text is also quite hard to read because the font has a tall x-height and line spacing is very tight.
This makes paragraphs look very dense, almost like it was set in uppercase only, because the lowercase letters don’t create a varying flow between lines for the eye to follow.
The model may be good, but the web design doesn’t win any prizes.
Also, is it standard practice to obfuscate which models you're benchmarking against? They're just labeled Model A-D, with sizes but no additional information.
Given the context, it appears they are not benchmarking against other models but comparing differently sized versions of the same model. The 8B one is just the one they decided to give a catchy name. The other ones are probably also just fine-tuned Llama models. But without information on the total compute budget (i.e. number of training tokens), this kind of plot is pretty useless anyway.
Sadly, I don't feel this is a mistake: the transparent ones are the two that beat the model in one or more categories. It feels more like a scam than an error; if not, please fix it.
1. Problem Understanding: The task is to count the number of 'r's in the word 'strawberry'. This requires identifying each occurrence of the letter 'r' in the given word and keeping a count.
2. Data Gathering: The word to analyze is 'strawberry'. I need to examine each character in this word.
3. Hypothesis Formation: Based on a quick glance, I hypothesize that there are 2 'r's in 'strawberry', but I need to verify this systematically.
4. Analysis: I will go through each letter of 'strawberry' one by one: s-t-r-a-w-b-e-r-r-y
5. Calculation: Counting the 'r's: 1) First 'r' at position 3, 2) Second 'r' at position 9. Total count: 2
6. Verification: Double-checking: s(0)-t(0)-r(1)-a(1)-w(1)-b(1)-e(1)-r(2)-r(2)-y(2). Confirmed, there are 2 'r's.
7. Conclusion Drawing: The final answer is that there are 2 'r's in the word 'strawberry'.
8. Implementation: To present the answer, I will state: 'There are 2 r's in the word "strawberry".' This directly answers the question in a clear and concise manner.
It's still not quite perfect at letter counting, but it handles questions like "which is heavier? 2kg of feathers or 1kg of lead" really well. The reasoning steps seem to prevent it from jumping to pattern-matched answers
Large models have no issues with this question at all. Even llama-70B can handle it without issues, and that is a lot smaller than GPT-4o. But for small models this is a challenging question. llama-8B gets it confidently wrong 4 out of 5 times. gemma-2-9B gets it wrong pretty much every time. qwen-coder-7B can handle it, so it's not impossible. It's just uncommon for small models to reliably get this question right, which is why I find it noteworthy that this model does.
This is a bad test for reasoning. The model is either trained on the spelling of its tokens or specific words, or not. It's not a thing reasoning would help with in any way. (OK, maybe accidentally, if it tried different splits it has been trained to spell, but that's not the reasoning part either.)
I know that this is a typical test of an LLM's ability to reason, but I wonder how much time could be saved by teaching an LLM how to recognise the type of problem that it's inherently bad at, bundling a Python interpreter, and asking it to write a short function to solve the problem. Are we not pushing a boulder uphill?
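For the letter-counting case in this thread, the function being suggested really is a one-liner; a trivial sketch:

    def count_letter(word: str, letter: str) -> int:
        # exact string manipulation -- no tokenization involved
        return word.count(letter)

    print(count_letter("strawberry", "r"))  # 3

The hard part, as the reply below notes, is getting the model to decide when to reach for it.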
> teaching an LLM how to recognise the type of problem that it's inherently bad at
Solving this is the actual hard part, and it is either adjacent to or even equivalent to solving the problem of LLMs hallucinating. ChatGPT already includes a Python interpreter tool which can be used if the context indicates it's appropriate.
I suppose my question is pointing to another, which is: can one make an LLM that doesn't hallucinate? Isn't that problem inherent to the way that LLMs work? Obviously we can try to clean the data so there isn't any nonsense fed into it, but that'll only get you so far with a probabilistic, stochastic system. As an LLM once told me "Some experts argue that hallucination is an innate limitation of LLMs, akin to confabulation in humans, where false memories are created without the intention to deceive". I'm not sure if I believe that though.
RAG and fine-tuning improve domain specificity and may reduce the problem to a level where you don't care, but it will always be there.
Clean data would help reduce the incidence, possibly to a more usable level, but also doesn't remove the problem.
Considering next-token prediction as serial runs of a multi-tape TM, with the previous output as the input, can help.
Especially if you consider Microsoft's BitNet b1.58, which requires full precision for training but can reduce weights to little more than their sign components for inference.
Unfortunately, all the paths I have for explaining this require graduate-level complexity theory and/or differential geometry. Or you relive the Brouwer–Hilbert controversy by trying the logic path.
I found the following transcript of the thought process for that question equally enlightening and funny:
Thought for 4 seconds
1. problem understanding: The task is to count the number of instances of the letter 'r' in the word 'strawerberry'. This requires identifying each occurrence of 'r' in the given word and keeping a count.
2. data gathering: The word to analyze is 'strawerberry'. I need to examine each character in this word.
3. hypothesis formation: Based on a quick glance, I hypothesize that there are 2 instances of 'r' in 'strawerberry'.
4. analysis: Let's break down the word into individual characters: s, t, r, a, w, e, r, b, e, r, r, y
5. calculation: Now, let's count the 'r's: 1) First 'r' at position 3, 2) Second 'r' at position 7, 3) Third 'r' at position 10. Total count: 3
6. verification: Double-checking the count: s(0), t(0), r(1), a(0), w(0), e(0), r(2), b(0), e(0), r(3), r(3), y(0). Confirmed: 3 instances of 'r'.
7. error checking: The initial hypothesis was incorrect. The actual count (3) differs from the initial guess (2). This highlights the importance of careful analysis rather than quick assumptions.
8. conclusion drawing: The letter 'r' appears 3 times in the word 'strawerberry'.
9. implementation: The final response should be the number 3, as this is the correct count of 'r' instances in 'strawerberry'.
>Alright, I need to figure out how many 'r's are in the word 'strawerberry'. Let's look at it carefully. I'll write it down: s-t-r-a-w-e-r-b-e-r-r-y. Now, I'll go through each letter one by one.
>Starting from the beginning: s, t, r – that's one 'r'. Then a, w, e, r – that's the second 'r'. Next, b, e, r – that's the third 'r'. And finally, another r before the y, making it the fourth 'r'. So, I count four 'r's in total.
Tokenization, still. LLMs don't see individual letters. It's like using Google Translate to ask a German speaker the same question and concluding that Germans can't count letters.
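To make that concrete, here's a small sketch assuming the tiktoken library; the exact split depends on the tokenizer, the point is just that the model sees multi-character chunks:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era tokenizer
    ids = enc.encode("strawberry")
    print([enc.decode([i]) for i in ids])       # chunks of characters, not letters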
Well, the model is based on llama-8b, which is quite bad at reasoning. Reasoning (or things that look and quack like reasoning) is more the domain of 70B+ models, and some of the newer 7B models.
The model is doing well on many reasoning tasks; what they are doing is a massive step up from llama-8b. But it still makes some silly mistakes. I bet if you did the same fine-tuning procedure with qwen-7B or llama-70B as a starting point, you would get a quite competent model.
Is there something more than prediction going on? I'd like to understand why there would be studies like this: https://arxiv.org/pdf/2307.02477 if it's all just prediction. Are there some algorithms that are called "reasoning", that are somewhere between prediction and full-blown reasoning?
My guess is the framing in that paper was chosen to be provocative. The authors surely understand deep neural networks neither reason nor recite as we understand it in humans.
In any case, the conclusion was decidedly against reasoning.
I found the following video from Sam Witteveen to be a useful introduction to a few of those models:
https://youtu.be/vN8jBxEKkVo