I tried it in OpenAI's o1. If I give it minimaxir's original prompt, it writes the obvious loop, even if I include the postamble "Look for tricks that will make this function run as fast as possible in the common case".
However, if I then simply ask "What is the most probable result for this function to return?" it figures out the answer and a very good approximation of the probability (4.5e-5). From there it's easily able to rewrite the program to use the trick. So the creative step of spotting that this line of reasoning might be profitable seems missing for now, but 2025's models might solve this :-)
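For concreteness, the rewrite it produces amounts to something like the sketch below (my paraphrase, assuming the original task: one million random integers in [1, 100000], difference between the smallest and largest whose digits sum to 30; the extreme qualifying values are 3999 and 99930, so the answer is almost always 95931):

    def digit_sum(n):
        s = 0
        while n:
            s, n = s + n % 10, n // 10
        return s

    def max_min_diff(nums):
        # For values in [1, 100000], the smallest number whose digits sum
        # to 30 is 3999 and the largest is 99930, so with a million uniform
        # draws the answer is 99930 - 3999 = 95931 unless one of those two
        # endpoints happens to be missing (probability ~ e^-10 per endpoint).
        present = set(nums)
        if 3999 in present and 99930 in present:
            return 95931                   # the overwhelmingly common case
        # Rare fallback: the obvious scan.
        qualifying = [x for x in nums if digit_sum(x) == 30]
        return max(qualifying) - min(qualifying)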
The information on the creative step that you provided to o1 was also the key step, and it contained almost all of the difficulty. The hope is that 2025 models could eventually come up with solutions like this given enough time, but this is also a toy problem. The question is how much clever answers will cost for real-world, complex problems. At present it looks like: very much.
All I need to run the algorithm is the proportion of qualifying numbers in the input array and the number of samples. Then we can sample the min and max of the qualifying values and return their difference without having to sample many times, provided we can derive the joint min-max distribution conditional on the Bernoulli count.
In other words, the procedure can take any input array and qualifying criteria.
The joint distribution is relatively simple to derive. (This is related to the fact that the min and max of continuous uniforms on [0, 1] are Beta-distributed.)
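Roughly like this, as a sketch (parameter names are mine; it assumes each qualifying value is equally likely and models the qualifying count as Binomial):

    import numpy as np

    rng = np.random.default_rng()

    def sample_min_max_diff(n, qualifying, domain_size):
        """Sample (max - min) over the qualifying draws among n samples,
        without materializing the array. qualifying: sorted list of the
        values that meet the criterion, assumed equally likely."""
        m = len(qualifying)
        k = rng.binomial(n, m / domain_size)   # how many draws qualify
        if k == 0:
            return None                        # no qualifying number drawn
        if k == 1:
            return 0
        # Joint (min, max) of k iid U(0,1) via inverse-CDF sampling:
        u_max = rng.random() ** (1.0 / k)                        # P(max <= u) = u^k
        u_min = u_max * (1.0 - rng.random() ** (1.0 / (k - 1)))  # min given max
        # A monotone map from U(0,1) onto the sorted qualifying values
        # preserves order statistics, so the same u's index the discrete case.
        lo = qualifying[min(int(u_min * m), m - 1)]
        hi = qualifying[min(int(u_max * m), m - 1)]
        return hi - lo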
Sampling doesn't give you the actual answer for an actual array. If the program uses the array for multiple things, such as organizing the numbers after allocating the correct number of buckets, your method will cause logic errors and crashes.
The O(1) method based on statistics only works when the function making this calculation can hide the array (or lack of array) behind a curtain the entire time. If it has to take an array as input, or share its array as output, the facade crumbles.
The prompt is not "generate this many random numbers and then say max qualifying minus min qualifying". If it was, your method would give valid solutions. But the prompt starts with "Given a list".
In the article, we let ChatGPT generate the random numbers as a matter of convenience. But the timing results are only valid as long as it keeps that part intact and isolated. We have to be able to swap it out for any other source of random numbers. If it invents a method that can't do that, it has failed.
It still depends on how you read the problem. In a lot of the LLM solutions the array is not passed into the solving function but constructed inside it (instead of defining the function to take an input array and having a main function build the array and call it with that argument, as would be typical in Python). So I assume the LLM either did not read it that way or also failed this aspect of the code (which was never really mentioned). It is not clear whether we are given a specific array of integers, or whether the input is an array of random variables that we need to instantiate ourselves.
This gets to the old saw, "knowing what question to ask is the most important thing". To the extent that LLMs can answer questions better than formulate which ones to ask, they may be inherently limited. We will see.
But it does seem they are good (to the extent that they are good at anything) at identifying the questions first if you ask them. It does mean you need a good enough meta-question to start the chain of reasoning, but that is the key insight of the recent wave of "reasoning models": first ask the LLM to reformulate the problem and structure an approach, or multiple approaches, to it, then have a second pass actually carry that out.
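Schematically, something like the sketch below, where call_llm is just a placeholder for whatever completion client you use (not a real API):

    from typing import Callable

    def two_pass_solve(problem: str, call_llm: Callable[[str], str]) -> str:
        # call_llm is whatever single-turn completion function you have;
        # the prompts below are just illustrative wording.
        plan = call_llm(
            "Reformulate the following problem, list the key questions that "
            "need answering, and outline one or more approaches. Do not solve "
            "it yet.\n\n" + problem
        )
        return call_llm(
            "Using the plan below, pick the most promising approach and carry "
            "it out step by step.\n\nProblem:\n" + problem + "\n\nPlan:\n" + plan
        )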
Google search with less steps? Still a huge advancement, of course.
I wonder how much benefit we'd get from a meta-language for describing these problems precisely enough for LLMs to process into code, an even-higher-level language. Perhaps we could call it English?
Annual ridership is about 7m passengers (https://www.caltrain.com/media/34265) or 9m if we annualize the numbers in this article. At the same time, the Caltrain is losing more than $33m/year (https://www.caltrain.com/blog/2023/04/financial-future-caltr...) even though it received a huge $410m subsidy from the state government for the capital investment required to complete the electrification. So it's losing more than $5 per ride even ignoring capital costs. In fact, the farebox recovery ratio is only 25%, which is truly horrible even by public transport standards: https://www.caltrain.com/media/33996/download
The curmudgeon in me questions whether it really makes sense for Caltrain to exist at all, given that:
1. It is extremely far from covering its own costs
2. The subsidy provided by the state is regressive because it principally benefits the wealthy commuters along the line
3. It imposes substantial negative externalities on the South Bay, both in terms of noise & traffic delays at grade crossings
4. The line and stations occupy very valuable real estate that could be sold and put to better uses
If you only look at how profitable infrastructure is in the short term and only at first-order effects, you'll quickly realise you should never build any of it. That goes for power plants, roads, trains, subways, electrical conduits, and any other major project requiring a lot of initial investment.
1. Before the pandemic, farebox recovery was 73%, one of the highest in the country. Electrified, it probably would have gone even higher since at least some people avoided the train because it was too crowded at rush hour. Unfortunately it could take over a decade to return to that point if ever, but it shows that Caltrain can be fairly efficient given enough ridership.
2. There have been ongoing efforts to improve equity, including Go passes for students and low-income workers. Clipper START offers 50% off for low-income riders, and youths can now ride for $2/day. Side note: subsidies and programs like this decrease farebox recovery ratios, which is why transit covering its own costs is not necessarily the end goal. In fact, the decline of commuting by tech/knowledge workers (many of whom now work hybrid or fully remote) has meant that less-privileged workers have started making up a larger share of Caltrain ridership.
3. The new electric trains are a lot quieter. There are ongoing, long-term projects to grade-separate the remaining at-grade crossings (of the 113 crossings, 41 still need to be separated).
4. Arguably, the stations were directly responsible for the land being so valuable. Many peninsula downtowns and sometimes entire cities were built around the train stations, 100-150 years ago. Pre-pandemic, some of the most expensive apartments were the ones near the train station. Sure most of them also had the benefit of being close to downtown restaurants, but there were also big developments around Lawrence station which has basically zero local services aside from the Costco and the train station.
Of course this was mostly in the past, but downtown areas continue to be congested, and the only way we can keep building high-density housing is to have good public transit. In the long run, the train provides immense benefits to the cities that host it, and removing it would be shortsighted.
Should roads exist at all, given that the gas tax doesn't cover the costs of building and maintaining roads?
And furthermore since Bay Area roads are not toll roads, the equivalent of farebox recovery ratio is zero, since road users don't need to pay to drive on the road.
Have you considered that the real estate is valuable because there's a commuter train across the street?
iirc there are Japanese railroads that recoup their costs by owning real estate that appreciates in value once a train exists. In Los Angeles the great streetcar conspiracy was kind of the opposite: real estate developers built a train to make the suburbs appealing, and then once all the property was sold there was no reason to maintain the trains, so they were scrapped. Bait and switch, really.
Supercompilation doesn't depend on having a typed language - you can more or less derive a supercompiler mechanically from any operational semantics, typed or untyped.
My PhD was on supercompilation for Haskell (thanks for the cite in the repo :-))
Cool to see that people are still working on it, but I think that the main barrier to practical use of these techniques still remains unsolved. The problem is that it's just so easy for the supercompiler to go into some crazy exponential blowup of function unfolding, making the compilation step take impractically long.
Even if you avoid a literal exponential blowup you can easily end up generating tons of specializations that bloat your code cache but don't reveal any useful optimization opportunities, or that are infrequently used in practice. Similar performance problems also plague related techniques like trace-based JITs, even though trace JITing happens at runtime and thus has access to strictly more information about the frequency with which a trace might be used.
You can try to use annotations like the @extract proposed in the article to control these problems, but it can be hard to predict in advance when this is going to occur.
One interesting research direction might be to use deep reinforcement learning to try to guide the generalization/termination decisions, where the reward is based on A) whether the unfolding leads to a tie-back later on, and B) to what extent the unfolding allowed deforestation/beta reduction to take place.
Thanks for commenting! Yes, automatically predicting whether it's beneficial to specialize a function is not implemented. Finding suitable heuristics to automatically emit annotations like the `@extract` we already have is listed under the section "Further research".
This topic was brought up many times in the past. Most of the time it was suggested to use speculative approaches to detect blowups, but I haven't actually seen any attempts to predict them before supercompilation. For example, while writing Mazeppa, I saw many times that blowups occur because some function call gives insufficient information that is subsequently pattern-matched by another function, and since it cannot be properly pattern-matched at compile time, a lot of code duplication takes place.
I'm leaning towards a kind of "abstract interpretation pass" before supercompilation to approximate "good" and "bad" interactions between functions. After all, given that offline analyses exist even for harder problems such as detecting non-termination, why can't we understand when supercompilation will give us desirable results?
Maybe this is naive, but couldn't you use the same approach with profiles collected at runtime, like the PGO in LLVM/GCC? In my experience it can help quite a lot with guiding inlining, both by hand and via the compiler.
I don't think that's naive - it could definitely help ameliorate the issues. To my knowledge no one has tried this approach.
The other idea I had when I was working on this stuff was to do JIT supercompilation. Classically, supercompilers expand the whole graph at compile time, but it would be straightforward to delay the supercompilation process until the first time the code is actually executed, which helps tame the problem of generating lots of specializations that may never actually be executed in practice.
(Choosing a generalization heuristic becomes more important in the JIT setting because you always have access to the full info about all parameters to the function you are specializing, and naively you would end up e.g. specializing sigmoid(x)=1/(1+exp(-x)) for each value of x observed at runtime.)
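To make that concrete, here's a toy sketch of the kind of guard a runtime specializer might use (nothing to do with supercompilation proper; names and thresholds are made up): only specialize on an argument value once it has been seen often enough, and cap the total number of variants.

    import functools
    from collections import Counter

    def specialize_on_first_arg(max_variants=8, hot_threshold=100):
        """Toy decorator: pretend 'specializing' is just caching a partially
        applied closure. Only specialize values seen at least hot_threshold
        times, and never keep more than max_variants specializations."""
        def wrap(fn):
            seen = Counter()
            variants = {}

            @functools.wraps(fn)
            def dispatch(x, *rest):
                if x in variants:
                    return variants[x](*rest)
                seen[x] += 1
                if seen[x] >= hot_threshold and len(variants) < max_variants:
                    # In a real JIT this is where code would be generated
                    # with x baked in; here we just close over it.
                    variants[x] = functools.partial(fn, x)
                return fn(x, *rest)   # generic, unspecialized path
            return dispatch
        return wrap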
Do current JIT implementations already have something to control specialization? I have heard many times, for example, that JavaScript engines will specialize hot paths.
To the best of my knowledge (which isn't great), .NET has multiple compilation levels for functions, spending more compilation time on a method after it has been called a few times, but it starts with a pretty fast but naive compilation. It also, I believe, specialises everything all the time, but (probably because of the multi-level compilation) that doesn't seem to be a practical problem.
I just skimmed over LoRA+ and DoRA and I see no reason why these improvements could not go hand in hand. Actually, LoRA+ seems to be about efficient training, while DoRA seems to be about improving the ability to actually learn, making it significantly more robust. Although I still have questions about how the improvements of LoRA+ would be applied to the magnitude vector.
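Mechanically, combining them looks straightforward. Here's a rough PyTorch sketch of what I have in mind; the module and its init choices are mine, not either paper's reference code, and the learning rate for the magnitude vector is exactly the part I'm unsure about.

    import torch
    import torch.nn as nn

    class DoRALinear(nn.Module):
        """DoRA-style adapter around a frozen weight w0: the direction is
        updated through low-rank A/B factors, the magnitude is a separate
        per-column vector m."""
        def __init__(self, w0, rank=8):
            super().__init__()
            out_f, in_f = w0.shape
            self.w0 = nn.Parameter(w0, requires_grad=False)        # frozen
            self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # LoRA A
            self.B = nn.Parameter(torch.zeros(out_f, rank))        # LoRA B
            self.m = nn.Parameter(w0.norm(dim=0))                  # magnitude

        def forward(self, x):
            w = self.w0 + self.B @ self.A                  # updated direction
            w = self.m * (w / w.norm(dim=0, keepdim=True)) # rescale columns
            return x @ w.T

    layer = DoRALinear(torch.randn(64, 128))
    base_lr, loraplus_ratio = 1e-4, 16       # LoRA+: train B with a larger lr
    optimizer = torch.optim.AdamW([
        {"params": [layer.A], "lr": base_lr},
        {"params": [layer.B], "lr": base_lr * loraplus_ratio},
        # The open question: should m follow the A-style or B-style rate?
        {"params": [layer.m], "lr": base_lr},
    ])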
The two methods seem to be independent; I wonder if you can combine them for even better performance.
Interestingly both seem to indirectly modify the optimisation process, in my opinion effectively trying to fix a bad optimiser. Seems like we still have a long way to go after Adam...
> Seems like we still have a long way to go after Adam...
A preprint on arXiv suggests that Adam works better than SGD for training LLMs due to the issue of class imbalance [0]. It appears that scaling the gradient step helps with the training; for example, see another approach suggested in [1].
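For anyone skimming: "scaling the gradient step" here is roughly the difference between a plain SGD update and an Adam-style update that rescales each coordinate by a running estimate of the squared gradient. A minimal sketch (bias correction omitted):

    def sgd_step(w, g, lr=1e-3):
        return w - lr * g

    def adam_step(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        m = b1 * m + (1 - b1) * g          # running mean of the gradient
        v = b2 * v + (1 - b2) * g * g      # running mean of the squared gradient
        return w - lr * m / (v ** 0.5 + eps), m, v   # per-coordinate scaling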
Hee, cute, but predictable once the ... showed up. Honestly I think the ending was unnecessary; it destroys all subtlety.
Of course, this is not actually how latent space works. It's the AI's understanding of concepts, not the inherent nature of concepts; that's why every model has its own version of "latent space". Though the understanding of latent space in the story is internally consistent; given a superintelligent image generator, you could do prompt engineering like this.
That is a super interesting article, thank you for posting it!
But, I guess my question is more about why the abstraction of "protein chunks" doesn't fall apart when there are relatively significant "diffs" in the RNA sequence.
The most significant diffs between both vaccines occur in the untranslated regions located around the protein coding sequence and will never be present in the actual spike protein.
Regarding the protein coding region, because of the degeneracy/redundancy of the genetic code, all changes within it are synonymous and code for identical amino acids.
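A toy illustration of that degeneracy (made-up fragments, and only a handful of entries from the standard codon table): the two sequences below differ at the third "wobble" position of every codon, yet translate to the identical peptide.

    # A few entries from the standard codon table (RNA codons -> amino acids).
    CODON_TABLE = {
        "GCU": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
        "CGU": "Arg", "CGC": "Arg", "CGA": "Arg", "CGG": "Arg",
        "AAU": "Asn", "AAC": "Asn",
    }

    def translate(rna):
        return [CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna), 3)]

    # Two made-up fragments that differ at the third position of every codon...
    seq_a = "GCUCGUAAU"
    seq_b = "GCGCGCAAC"
    # ...yet encode the identical peptide.
    assert translate(seq_a) == translate(seq_b) == ["Ala", "Arg", "Asn"]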
That is a fascinating read (and the perfect level of depth in this field for me). How did you happen across it? Always looking to add a good source to my RSS feed list
Not just a hunch, actual US policy: "Operation Choke Point" was a 2013 initiative by the DoJ to hurt legal but unpopular businesses (e.g. payday lenders, tobacco, firearm dealers, and - yes - pornographers) by making it harder for banks to provide services to them.
I am curious, if such an operation exists, then does it stand to reason that perhaps they have agent provocateurs that post illicit content, then report it to the banks to disrupt the service?
> "Operation Choke Point" was a 2013 inititative by the DoJ to hurt legal but unpopular businesses (...) by making it harder for banks to provide services to them.
Honest question: can the US government stop anyone from launching their own banking/payment service?
Probably not. I think what would stop you would be the mountain of regulations required to be a payment provider, and tying into the networks required for ACH and other types of transfers. And then of course you need the relationships with all the other banks, and so you are really back at square one. Now you are in the same position Mastercard is in.