Today is the day I stopped reading opinion pieces from the technologyreview. Not for presenting an opinion I don't agree with, but for mistaking word soup for an argument.
I'm trying to distill the essence of their approach, which imho is concealed behind inessential and particular details such as the choice of this or that compression scheme or prior distributions.
It seems like the central innovation is the construction of a "model" which can be optimized with gradient descent, and whose optimum is the "simplest" model that memorizes the input-output relationships. In their setup, "simplest" has the concrete meaning of "which can be efficiently compressed" but more generally it probably means something like "whose model complexity is lowest possible".
This is in stark contrast to what happens in standard ML: typically, we start by prescribing a complexity budget (e.g. by choosing the model architecture and all complexity parameters), and only then train on data to find a good solution that memorizes the input-output relationship.
The new method turns ML on its head: we optimize the model so as to reduce its complexity as much as possible while still memorizing the input-output pairs. That this is able to generalize from 2 training examples is truly remarkable and imho hints that this is absolutely the right way of "going about" generalization.
Information theory happened to be the angle from which the authors arrived at this construction, but I'm not sure that is the essential bit. Rather, the essential bit seems to be the realization that rather than finding the best model for a fixed pre-determined complexity budget, we can find models with minimal possible complexity.
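A minimal sketch of the objective as I read it (toy model, made-up complexity term, and a plain weighted penalty instead of their compression machinery):

```python
import torch

# Hypothetical setup: a tiny model, a handful of (x, y) pairs to memorize, and a
# stand-in "complexity" term. The shape of the objective is the point, not the details:
# push complexity down while keeping the memorization error (essentially) at zero.
model = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4))
X, Y = torch.randn(2, 4), torch.randn(2, 4)    # just 2 training examples, as in the post

def complexity(m):
    # Stand-in for description length; the real thing would estimate how well the
    # model compresses, not a plain L1 norm of the weights.
    return sum(p.abs().sum() for p in m.parameters())

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
fit_weight = 100.0    # large weight: memorization acts as a (nearly) hard constraint
for _ in range(5000):
    opt.zero_grad()
    loss = complexity(model) + fit_weight * torch.nn.functional.mse_loss(model(X), Y)
    loss.backward()
    opt.step()
```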
The idea of minimizing complexity is less novel than it may seem. Regularization terms are commonly added to loss objectives in optimization, and these regularizers can often be interpreted as penalizing complexity. Duality allows us to interpret these objectives in multiple ways (written out after the list):
1. Minimize a weighted sum of data error and complexity.
2. Minimize the complexity, so long as the data error is kept below a limit.
3. Minimize the error on the data, so long as the complexity is kept below a limit.
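In symbols (my notation, not the post's): write E(θ) for the data error and C(θ) for the complexity, and the three readings are

```latex
\begin{align*}
\text{1. Penalized:}         \quad & \min_\theta \; E(\theta) + \lambda\, C(\theta) \\
\text{2. Error budget:}      \quad & \min_\theta \; C(\theta) \quad \text{s.t.} \quad E(\theta) \le \epsilon \\
\text{3. Complexity budget:} \quad & \min_\theta \; E(\theta) \quad \text{s.t.} \quad C(\theta) \le B
\end{align*}
```

where Lagrangian duality (under the usual convexity/regularity caveats) ties together the multiplier λ, the error budget ε, and the complexity budget B.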
It does seem like classical regularization of this kind has been out of fashion lately. I don't think it plays much of a role in most Transformer architectures. It would be interesting if it makes some sort of comeback.
Other than that, I think there are so many novel elements in this approach that it is hard to tell what is doing the work. Their neural architecture, for example, seems carefully hacked to maximize performance on ARC-AGI type tasks. It's hard to see how it generalizes beyond.
Right, but see how the complexity budget is prescribed ahead of time: we first set the regularization strength or whatever, and then optimize the model. The result is the best model with complexity no greater than the budget. In this standard approach, we're not minimizing complexity, we're constraining it.
Again, because of duality, these are not really different things.
To your point in the other thread, once you start optimizing both data fidelity and complexity, it's no longer that different from other approaches. Regularization has been common in neural nets, but usually in a simple "sum of sizes of parameters" type way, and seemingly not an essential ingredient in recent successful models.
Ah, I see what you mean, sorry it took me a while. Yes, you're right, the two are dual: "fix the complexity budget then optimize the data error", and "fix the data error budget then optimize the complexity".
I'm struggling to put a finger on it, but it feels like the approach in the blog post finds the _minimum_ complexity solution, akin to driving the regularization strength in conventional ML higher and higher during training and returning the solution at the highest such regularization that does not materially degrade the error (epsilon in their paper). Information theory plays the role of a measuring device that lets them measure the error term and the model complexity on a common scale, so they can be traded off against each other during training.
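Something like this sweep, with ridge regression standing in for their model, and the epsilon and lambda schedule entirely made up by me:

```python
import numpy as np

def ridge(X, Y, lmbda):
    # Closed-form ridge solution; "complexity" here is just the L2 norm of w.
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + lmbda * np.eye(d), X.T @ Y)
    return w, np.mean((X @ w - Y) ** 2)

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(2, 5)), rng.normal(size=(2, 1))   # tiny, noise-free memorization task

best, epsilon = None, 1e-3
for lmbda in np.logspace(-8, 2, 60):    # crank the regularization strength upward
    w, err = ridge(X, Y, lmbda)
    if err > epsilon:                   # fit has materially degraded: stop
        break
    best = w                            # strongest regularization that still memorizes
```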
I haven't thought about it much, but I've seen papers speculating that what happens in double descent is finding lower-complexity solutions.
I think you're right about the essential ingredient in this finding, but I feel like this is a pretty ARC-AGI specific result.
Each puzzle follows roughly the same format, and the data that changes in each puzzle is almost precisely what's needed to deduce the rule. By reducing the amount of information needed to describe the rule, you almost have to reduce your codec to what the rule itself is doing - to minimise the information loss.
I feel like if there was more noise or arbitrary data in each puzzle, this technique would not work. Clearly there's a point at which that gets difficult - the puzzle should not be "working out where the puzzle is" - but this only works because each example is just pure information with respect to the puzzle itself.
I agree with your observation about the exact noise-free nature of the problem. It allows them to formulate the problem as "minimize complexity such that you memorize the X-y relationship exactly". This would need to be generalized to the noisy case: instead of demanding exact memorization, you'd need to prescribe an error budget. But then this error budget seems like an effective complexity metaparameter, doesn't it, and we're back to square zero of cross-validation.
If we think of the "budget" as being similar to a bandwidth limit on video playback, there's a kind of line below which the picture becomes pretty unintelligible, but for the most part it's a slider: the lower the budget, the less accurate the playback.
But because this is clean data, I wonder if there's basically a big gap here: the codec that encodes the "correct rule" can achieve a step-change lower bandwidth requirement than similar-looking solutions. The most elegant ruleset - at least in this set of puzzles - always compresses markedly better. And so you can kind of brute-force the correct rule by trying lots of encoding strategies, and just identify which one gets you that step-change compression benefit.
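Roughly the brute force you're describing, with zlib standing in for a real codec and a made-up candidate-rule interface (toy "reverse the string" puzzle):

```python
import zlib

# Toy puzzle: each example's output is the input reversed.
examples = [("abcde", "edcba"), ("10110", "01101"), ("xyzw", "wzyx")]

# Hypothetical candidate rules: name -> function from input to predicted output.
candidates = {
    "identity": lambda s: s,
    "reverse":  lambda s: s[::-1],
    "sorted":   lambda s: "".join(sorted(s)),
}

def cost(rule):
    # Bytes needed to transmit the corrections the rule gets wrong; the correct
    # rule needs (almost) none, which shows up as a step-change drop in cost.
    residual = "".join(out for inp, out in examples if rule(inp) != out)
    return len(zlib.compress(residual.encode()))

for name, rule in candidates.items():
    print(name, cost(rule))   # "reverse" compresses dramatically better
```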
1. The futile attempt by Microsoft to equip their search engine with artificial intelligence. See also: Bing (Microsoft search engine), Try (v.).
2. An attitude of confidence and contempt characteristic of large language models assumed particularly when expressing false opinions or facts. See also: Bigotry (n.).
I would welcome anything that would cut off the flow of ad money into the pockets of the awful algorithmic SEO scum, who have been poisoning search results with shallow and meaningless articles for the past few years.
This attempt is better viewed as a replacement for the infobox: it does the iterative queries for you, instead of you having to hunt down the appropriate jargon yourself and run successive searches. Don't use it standalone. It gives you citations, since it can actually browse the internet.
The second point is kind of harsh on the model, since its behavior is a product of the data, the training, and the user.
It is possible to treat it nicely and have it respond in kind. Most users just don’t consider that to be a worthwhile expense of their mental capacity.
Microsoft also forced Google’s hand. Would Google have ever wanted to augment search on their own? Sounds like a massive risk to Google ad revenue…
> Most users just don’t consider that to be a worthwhile expense of their mental capacity.
Beyond that, I fundamentally don't think people should be trained to be "nice" to technology. I don't have to politely ask a hammer to pound in a nail, and - the fact we're talking about NLP notwithstanding - I shouldn't have to politely ask Bing to provide me the results I'm looking for.
And the NLP point matters quite a bit. ChatGPT can analyze the sentiment, and even offer adjustments.
This is less about being nice to technology and more about being aware of the impact of the self on the rest of the world. Technology just highlights the gap.
People should be taught to be nice to others, of course. The point is that LLMs are not “others”, they are inanimate tools. If I called a cashier a worthless piece of shit, that would be incredibly rude. If I said the same thing to Siri, it wouldn’t be, because it is not possible to be rude to software.
If my child were to say that to Siri though, I’d be concerned because, as you said, it could be highlighting something about the way they interact with the world. But I would still want it to respond to the command and leave the problem of my child’s bad manners to me. Unless there’s a major shift in our understanding of sentience, I consider teaching the delineation between humans, who are never unfeeling tools, and technology, which is always an unfeeling tool, equally as important as teaching mindfulness of one’s impact in the world. In fact, I don’t think you can actually understand the latter without understanding the former.
You don’t have to be malicious to be rude or inconsiderate. Few are approaching this as worth talking to like another person. It is a servant, by description, by design, so most treat it as such.
There is a bias to the interaction that most will never consider.
If you talk with a human, and the human thinks you are incorrect, and you insist, and neither of you attempts to smooth the conversation, plenty of humans also begin to get aggressive. Or at least irritable. (Which can escalate.)
The difference with a human is that they'd concede they made a mistake. The user did as Bing asked and reported the date on their phone; Bing doubled down on the incorrectness.
Also, LLMs are not human, so there's no expectation to treat them as such.
Presumably something that got exponentially worse with proximity would have badness e^(-kr), where r is distance and k is some constant. So depending on k, it's worse by some constant factor (e^k) when you bring r down from 1 meter to zero. This is, notably, finitely worse.
But k/r^2 (different k) is a whole different beast. It’s infinitely worse at zero distance!
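A quick numerical check of the two shapes (k = 1 in both, purely illustrative):

```python
import math

k = 1.0
for r in [1.0, 0.1, 0.01, 0.001]:
    exp_bad = math.exp(-k * r)   # "exponentially worse with proximity": bounded by e^0 = 1
    inv_sq = k / r ** 2          # inverse square: grows without bound as r -> 0
    print(f"r={r:<6} exp={exp_bad:.3f}  1/r^2={inv_sq:,.0f}")
```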
Of course, the radioactive source itself is not a point, and the human body isn’t a point either, so it’s not infinitely bad at zero range. Closer than a meter or so (very roughly the size of a body), it will merely concentrate the exposure over a smaller portion of the target and deliver a larger fraction of its total output to the target. The latter effect is a constant factor not vastly greater than 1 when comparing 1 meter to zero meters.
Here's GPT's own explanation of what the purpose of that while loop is:
---
This code uses JavaScript's `eval` function to obfuscate the code by looping over an array of strings and passing them as arguments to `eval` to create a variable. It also uses an anonymous function to obfuscate the code. The code is deobfuscated by replacing the `eval` function and the anonymous function with their respective strings.
That explanation is also not correct! The while loop is an obfuscation gadget (of sorts), but it doesn't use eval; it uses push and shift to rotate the array. The only use of eval is 'eval("find")', which is not top-grade obfuscation.
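For anyone following along without the snippet in front of them, the gadget is basically array rotation. In Python terms (the original is JS push/shift; this is just the equivalent pattern, not the actual code):

```python
# JS: arr.push(arr.shift())  -- take the first element off the front and append it
# to the back, i.e. rotate the array left by one position per loop iteration.
arr = ["c", "o", "d", "e"]
for _ in range(2):
    arr.append(arr.pop(0))
print(arr)   # ['d', 'e', 'c', 'o']
```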
This reminds me of when I heard the phrase "lenticular cloud". I looked it up, hoping to learn something about how these clouds are formed, etc. It simply means "shaped like a lentil".
Perhaps this is autocorrect gone awry, but it has nothing to do with lentils.
Not a meteorologist, but from all I understand it means lens-shaped, and to me that immediately gives a strong clue as to how they form since lenses have focal points and can, to some degree, have shapes expressed as formulae about some origin. Which to me is suggestive of the relation to the mountains around which they form.
I wonder how a modern language model would fare here -- use something like GPT-3 to evaluate the log-likelihood gain of stitching together each of the N^2 possible pairs, then greedily merge the best matches until none are left. Totally within reach, and I bet it could get at least _some_ of the order right.
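A sketch of that greedy loop; the scoring function is a placeholder for "log-likelihood gain of B following A under the LM" (no particular API assumed), stubbed with a crude heuristic so it runs:

```python
def gain(a: str, b: str) -> float:
    # Placeholder for an LM-based score, e.g. logp(a + b) - logp(a) - logp(b).
    # Crude stand-in: prefer joins that don't split a word across the boundary.
    return 1.0 if a and b and not (a[-1].isalpha() and b[0].isalpha()) else 0.0

def greedy_stitch(fragments: list[str]) -> str:
    # Score all ordered pairs, merge the best one, repeat until one piece is left.
    pieces = list(fragments)
    while len(pieces) > 1:
        i, j = max(
            ((i, j) for i in range(len(pieces)) for j in range(len(pieces)) if i != j),
            key=lambda ij: gain(pieces[ij[0]], pieces[ij[1]]),
        )
        merged = pieces[i] + pieces[j]
        pieces = [p for k, p in enumerate(pieces) if k not in (i, j)] + [merged]
    return pieces[0]
```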
If you can explain this, go ahead:
"The concrete hypothesis is that the network of subjective measurements of distances we experience on DMT (coming from the relationships between the phenomenal objects one experiences in that state) has an overall geometry that can accurately be described as hyperbolic (or hyperbolic-like). In other words, our inner 3D1T world grows larger than is possible to fit in an experiential field with 3D Euclidean phenomenal space (i.e. an experience of dimension R2.5 representing an R3 scene). This results in phenomenal spaces, surfaces, and objects acquiring a mean negative curvature. Of note is that even though DMT produces this effect in the most consistent and intense way, the effect is also present in states of consciousness induced by tryptamines and to a lesser extent in those induced by all other psychedelics."
Yeah, this is pretty much standard 'psychonaut' theorizing about 'what it all means, man'. We need less of this and more actual research. What DMT and LSD and similar drugs do to the mind is near miraculous, and if we want to understand consciousness, understanding how these drugs warp it so thoroughly seems like an important step.
So who's going to do that real research? I appreciate this psychonaut because he's at least trying, and seems fairly serious about his interest.
To turn the perspective around: instead of leaving the research to "actual researchers", how would you approach training interested psychonauts to perform useful research, even if in a more informal manner?
The problem with research into this sort of thing is that "increases in overall entropy of neural firing patterns and correlation between visuospatial and prefrontal firing" reduces the experience to the point of banality.
What we really need is more philosophers who are well versed in science to have these experiences, and to try and reinterpret the world from this alternate perspective.
I'm aware of whatever gets popular press or a write up in nautilus/aeon, and I browse /r/science daily. Please provide links to these interesting, under-shared studies.
DMT is a psychedelic drug. The essay doesn't explain DMT, perhaps because it's an illegal drug and they didn't want to be accused of promoting it, who knows. It's funny because they try to explain the psychedelic experience with mathematical constructs.
I still don't understand why this ME feature has been created to begin with. Assuming that breaking it is a matter of time (someone clever enough thinking about it for long enough), it seems like a serious security vulnerability, worse still because an attack is undetectable.
Why create it in the first place? Are the enterprise uses the article mentions worth the risk?
The board needs vPro/AMT for things like remote access. If the board doesn't implement those things (and you'd usually know, because you pay more for them), the ME ends up doing...well, I'm not sure what. I think mostly things like enforcing DRM? Still, the machine needs special support on the motherboard and ethernet controller to enable the features that people are complaining the most about.