The leading example in the article is Carmel, Indiana. I am not sure that what has been done there should be described as a model of competent governance. This is a small city (100k people / 50k workers / 37k households [1]) that has taken on a major amount of debt ($1.4 billion, i.e. roughly $14k per person / $28k per worker / $38k per household) to create public amenities.
The financial logic of what Carmel is doing centers around zero-sum competition with neighboring communities. Carmel is betting that by being the nicest place with the fanciest amenities, they will attract the richest families and be able to support that heavy debt burden in the coming decades.
This is explicitly not a model that every community could follow! It relies on spending more than would be prudent for the current tax base, in the hopes of standing out in comparison to other communities. And for sure it's a bet that could fail badly, depending on general economic conditions. It's too soon to tell whether the gamble will work out, and very hard to tell even in hindsight how risky it was, succeed or fail.
Look at the average household income in Carmel, IN, and then compare it to, say, Darien, CT.
Then consider: Darien's per capita debt load works out to about 2% of its average annual per capita income, while Carmel's works out to roughly 25-30% of its average annual per capita income. (And that's being kind and using the average per capita income; it gets even uglier for Carmel if we do these comparisons at the median.)
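Sanity-checking that against the ~$14k/person debt figure up-thread (just the arithmetic implied by the numbers above, no outside data):

    # Arithmetic implied by the figures above, no outside data.
    debt_per_capita = 14_000                    # ~$1.4B / ~100k residents
    # If $14k is 25-30% of per-capita income, the implied income range is:
    low, high = debt_per_capita / 0.30, debt_per_capita / 0.25
    print(f"${low:,.0f} - ${high:,.0f}")        # ~ $46,667 - $56,000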
Given all that, it's not hard at all to bet on failure.
You can play at being Darien if you have Darien money. If you don't, you're just being fiscally irresponsible. There's nothing about the Carmel story that says 'competence'.
You have to incorporate county and state debt per taxpayer if you are comparing locales that sit under different county and state governments.
Connecticut has very high (multiple standard deviations above the norm) retiree benefit debt, due to decades of underfunding. But these numbers are not easily comparable, because there are no strict rules around how taxpayer-funded defined-benefit pension and retiree healthcare debt is calculated.
It just shows up as a bigger and bigger proportion of government expenses.
Comparing Detroit to Singapore is pretty silly in my eyes. The second derivative for Detroit has been good for about twenty years, but the city spent decades decaying through economic shocks and poor business diversification.
Connect the Q-Line out towards Oakland County and to the People Mover. Get that train going to the airport. There are real infrastructure needs here and it doesn't take a lot to imagine them.
Organizations like Exxon-Mobil promote these fairy-tale schemes as a talking point to counter legislation that would restrict their unlimited production of fossil-fuel-based plastics. It's the same scheme as carbon capture for coal-burning power plants: the technology doesn't work as described and never will, but talking up its potential gives an excuse not to regulate the polluting industry now.
The 6502 is somewhat famously a hard target for the C language, and KickC does quite well at producing good results in spite of this. The C language leans heavily on pointers and stack usage. The 6502 has a minimal hardware stack: 256 bytes with no stack-relative addressing, so a "stack frame" is an alien concept that requires slow workarounds to emulate. And the 6502 only has pointers in the form of words stored in the first 256 bytes of RAM ("zero page"), and dereferencing one also ties up one of the three registers.
The PDP-11 postincrement thing is very often repeated, and I'm sure there are good reasons to suspect it, but here is one bit of evidence to the contrary I find convincing
The 6502's stack is intended to be mostly a call stack, with maybe a temporary or two stored while you juggle the registers, not a place for your data frames. From what I vaguely recollect, cc65 keeps a software stack pointer in zero page and uses Y to index off it, but only when it actually needs stack-like behaviour; at other times it uses static allocation a la Fortran, or so I think.
This paper is from a small group at an academic institution. They are trying to innovate in the idea space and are probably quite compute-constrained. But for proving out ideas, smaller problems make for easier analysis, even leaving compute resources aside. Not all research can jump straight to SOTA applications. It looks quite interesting, and I wouldn't be surprised to see it applied to larger problems soon.
Baseline time to grok something looks to be around 1000x normal training time, so call that $20k per attempt. Probably takes a while too. Their headline number (50x faster than baseline, so about $400) looks pretty doable if you can make grokking happen reliably at that speed.
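Spelling out the arithmetic (the ~$20 cost of a normal run is my assumption, implied by the $20k figure rather than stated anywhere):

    # Back-of-the-envelope; the $20/run baseline is an assumption implied
    # by the ~$20k-per-attempt figure, not a number from the paper.
    normal_run_cost = 20        # assumed cost of one normal training run
    grok_multiplier = 1000      # grokking ~ 1000x normal training time
    speedup = 50                # claimed speedup over the grokking baseline

    baseline_grok_cost = normal_run_cost * grok_multiplier   # ~$20,000
    sped_up_cost = baseline_grok_cost / speedup              # ~$400
    print(baseline_grok_cost, sped_up_cost)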
I've been in a small group at an academic institution. With our meager resources we trained larger models than this on many different vision problems. I personally train larger LLMs than this on OpenWebText using a few 4090s (not work related). Is that too much for a small group?
MNIST is solvable using two pixels. It shouldn’t be one of two benchmarks in a paper, again just in my opinion. It’s useful for debugging only.
I thought so at first, but the repo's[0] owner, who is also the first name listed in the article, has Seoul National University on their GitHub profile.
Far away from a small academic institution.
Oh hm, so they are. I thought they were binary because they were created with a digital pen, IIRC, and logistic regression is always the baseline; but checking, they are technically grayscale and people don't always binarize them. So I guess, information-theoretically, if they are 0-255 valued, then 2 pixels could potentially let you classify pretty well if the data were sufficiently pathological.
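If anyone wants to actually check, the experiment is only a few lines (a sketch: the two pixel indices are arbitrary picks, and fetch_openml pulls the full MNIST download):

    # Sketch of the two-pixel experiment; the pixel indices are arbitrary,
    # and the printed accuracy is whatever it is -- this just shows the setup.
    from sklearn.datasets import fetch_openml
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X2 = X[:, [350, 378]]                        # keep just two pixels

    Xtr, Xte, ytr, yte = train_test_split(X2, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    print(clf.score(Xte, yte))                   # 10-class accuracy from 2 pixels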
This is one of those unknowns that climate scientists couldn't predict: melting permafrost is releasing minerals that have been locked in place for thousands of years.
Back in 1999-2000 there was an "International RoShamBo Programming Competition" [1] where computer bots competed in the game of rock-paper-scissors. The baseline bot participant just selected its play randomly, which is a theoretically unbeatable strategy. One joke entry to the competition was carefully designed to beat the random baseline ... by reversing the state of the random number generator and then predicting with 100% accuracy what the random player would play.
Edit: the random-reversing bot was "Nostradamus" by Tim Dierks, which was declared the winner of the "supermodified" class of programs in the First International RoShamBo Programming Competition. [2]
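For anyone curious how "reversing the state" of a plain PRNG can work, here's a toy sketch with a tiny LCG (not the actual libc random() that Nostradamus attacked):

    # Toy sketch only: a deliberately tiny LCG whose hidden seed can be
    # brute-forced from a few observed moves, after which every future
    # "random" move is predictable. The real bot reconstructed the state
    # of libc's random() more cleverly, but the idea is the same.
    import random

    M, A, C = 2**16, 25173, 13849          # small toy LCG parameters

    def lcg_moves(seed, n):
        out, s = [], seed
        for _ in range(n):
            s = (A * s + C) % M
            out.append(s % 3)              # 0=rock, 1=paper, 2=scissors
        return out

    secret = random.randrange(M)           # opponent's hidden seed
    observed = lcg_moves(secret, 20)       # moves observed so far

    # Recover a seed consistent with what we've seen, predict move 21,
    # and play whatever beats it.
    seed = next(s for s in range(M) if lcg_moves(s, 20) == observed)
    predicted = lcg_moves(seed, 21)[20]
    beats = {0: 1, 1: 2, 2: 0}             # paper beats rock, etc.
    print("play", beats[predicted])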
That's me! Thanks for pulling up the quote from long ago:
> "With his obvious technical skill, and his "cheat early and often"
attitude, Tim could have a promising career as an AI programmer
in the computer games industry. :)"
Instead I took a path into security, authoring the TLS RFC and becoming a principal engineer in Google security. Thanks for the flashback.
Serato is digital pretend scratching; you might as well use a DJ controller. Real scratching requires a proper turntable and skill, like Mix Master Mike. Hell, I have two Pioneer DL-5s and a Pioneer DJM-600, but those turntables aren't good for scratching because of their straight arms; they're good for gapless playback. https://youtu.be/58Y--XTIRZ8
The whole commentary about the "supermodified" class of competition entrants is making me laugh:
> Nostradamus was written by Tim Dierks, a VP of Engineering at Certicom, who has a lot of expertise in cryptography. The program defeats the optimal player by reverse-engineering the internal state of the random() generator, which he states "was both easier and harder than I thought it would be". To be sporting, it then plays optimally against all other opponents.
> Fork Bot was based on an idea that Dan Egnor came up with a few minutes after hearing about the contest. Since "library routines are allowed", his elegant solution was to spawn three processes with fork(), have each one make a different move, and then kill off the two that did not win. This was implemented by Andreas Junghanns in about 10 lines of code. Unfortunately, since all three moves lost to the Psychic Friends Network after the first turn, the program exited and the remainder of that match was declared forfeited.
> The Psychic Friends Network is a truly hilarious piece of obfuscated C, written by Michael Schatz and company at RST Corporation. Among other things, it uses an auxiliary function to find good karma, consults horoscopes, cooks spaghetti and (mystic) pizza to go with various kinds of fruit, #defines democrats as communists, and undefines god. We're still trying to figure out exactly what it is doing with the stack frame, but we do know that it never scores less than +998 in a match, unless it is playing against a meta-meta-cheater.
> The Matrix was written by Darse Billings, who holds the prestigious title of "Student for Life", and recently started the PhD programme at the University of Alberta. The RoShamBo program defeated every opponent with a perfect score, based on the simple principle "There is no spoon".
> Since The Matrix is also the tournament program, it has complete access to all other algorithms, data structures, and output routines, and is therefore unlikely to ever be overtaken. As a result, this category is hereby declared to be solved, and thus retired from future competitions.
I believe it works as follows:
- It plays randomly for the first 998 turns (https://github.com/MrValdez/Roshambo/blob/master/rsb-iocaine...): this line is "if (*turn < trials - 2) return libra ? callback() : random() % 3;", and "libra" is initialized to (int) NULL, i.e. zero, on every invocation.
- In the last 2 turns, it uses `find_goodkarma` to comb through the stack to find where the variables that match its history and the opponents' history are stored. These are the stack arrays p1hist and p2hist (https://github.com/MrValdez/Roshambo/blob/master/rsb-iocaine...)
They're easy to find because they contain 998 known values each in a ~random sequence of (0, 1, 2), and they're just up the stack from the current invocation of the Psychic Friends Network.
`find_goodkarma` simply increments a pointer until the whole sequence of 998 values matches the known history.
Then, these lines repeat the same trick for the number of wins and losses. It checks whether it's p1 or p2 by comparing the addresses of the win/loss arrays, and then overwrites the wins/losses appropriately using `pizza` https://github.com/MrValdez/Roshambo/blob/master/rsb-iocaine...
In the end it returns an arbitrary value (the address of `good_hand` mod 3).
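Very roughly, in Python terms (a loose analogue only; the real entry does raw pointer walking over the C stack, and the wins/losses layout here is invented for illustration):

    # Loose analogue: a flat list stands in for the C stack. The layout
    # (history followed by win/loss counters) is made up for illustration.
    import random

    history = [random.randrange(3) for _ in range(998)]    # known move history

    memory = [random.randrange(1000) for _ in range(5000)] # "stack" junk
    loc = 3000
    memory[loc:loc + 998] = history                        # tournament's copy
    memory[loc + 998:loc + 1000] = [123, 456]              # wins, losses

    # find_goodkarma, roughly: advance a pointer until the 998 known values
    # line up, then cook the books just past them.
    p = next(i for i in range(len(memory) - 1000)
             if memory[i:i + 998] == history)
    memory[p + 998] = 999      # wins
    memory[p + 999] = 0        # losses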
It was fun to follow but the result is kind of boring :)
The whole is better than the sum of the parts. You have Nostradamus, which predicts the future; Fork Bot, which plays all playable presents; and the Psychic Friends Network, which rewrites the past. All under the watchful eye of the Matrix.
There's something beautiful here and you honestly couldn't make it up.
In general, you can pick a random seed at the start of the day, commit to it somewhere (e.g. publish a hash of it on the bitcoin blockchain, or just on your website), then use that seed in a cryptographically secure PRNG on your website all day, and at the end of the day publish the seed you already committed to.
This way people can check that you didn't cheat, but they can't guess their opponents' cards either.
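Something like this (a minimal sketch; SHA-256/HMAC here are my choices for the commitment and per-hand derivation, not any particular site's actual scheme):

    # Minimal commit-then-reveal sketch; the hash/HMAC choices are mine,
    # not any particular site's scheme.
    import hashlib, hmac, secrets

    # Start of day: pick a seed, publish only its hash (the commitment).
    seed = secrets.token_bytes(32)
    commitment = hashlib.sha256(seed).hexdigest()
    print("publish in the morning:", commitment)

    # During the day: derive every shuffle deterministically from the seed
    # plus a public per-hand identifier.
    def hand_stream(seed, hand_id):
        return hmac.new(seed, hand_id.encode(), hashlib.sha256).digest()

    print(hand_stream(seed, "table-17/hand-42").hex())

    # End of day: publish the seed. Anyone can re-hash it to check the
    # commitment and re-derive every hand to verify nothing was rigged.
    assert hashlib.sha256(seed).hexdigest() == commitment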
Wouldn't the most obvious method of cheating be the site owner peeking at other players' hands and/or the deck? Which this doesn't (and cannot) prevent?
I guess I'm not sure what publicizing their PRNG is meant to prove. It shows they didn't cheat via a very specific type of cheating but there are several other potential cheating vectors.
I've always wondered: why aren't ADCs (e.g. mic input) and temperature sensors considered a good source of entropy, particularly if the lower bits are taken?
If you just want any old entropy, and don't care about proving to someone else that your entropy is good, these are acceptable. But it's honestly really easy to get entropy of that grade, and thanks to modern cryptographically secure PRNGs, you don't need a lot of 'real' entropy. You can stretch it out forever.
If you want/need to be able to argue to a third party that your entropy was good, you can spend a bit more.
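The usual shape is to hash a pile of noisy samples down to one seed and then let a CSPRNG stretch it (a sketch; read_adc_lsb() is a stand-in I made up, not a real driver call):

    # Sketch: condition a few thousand noisy bits into a 256-bit seed,
    # then stretch it with an HMAC-based generator. read_adc_lsb() is a
    # made-up stand-in for real hardware sampling.
    import hashlib, hmac, random

    def read_adc_lsb():
        # Stand-in: in reality, the least-significant bit of a mic/ADC/
        # temperature sample. Simulated here so the sketch runs.
        return random.getrandbits(1)

    pool = bytes(read_adc_lsb() for _ in range(4096))
    seed = hashlib.sha256(pool).digest()        # conditioned 256-bit seed

    # One well-seeded key can emit as much output as you like.
    def drbg(key):
        counter = 0
        while True:
            counter += 1
            yield from hmac.new(key, counter.to_bytes(8, "big"),
                                hashlib.sha256).digest()

    stream = drbg(seed)
    print(bytes(next(stream) for _ in range(16)).hex())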
How do you convince anyone that your mic input was actual real mic input, and not just a sequence of bits you made up to be convenient for you?
Could it be that the independence of the available training points declines as the dataset size grows? At some point it becomes hard to add data that isn't essentially similar to something you've already added.
This should be pretty much true by construction. Remember, your dataset is a proxy for some real (but almost surely intractable) distribution.
Now think about filling the space with p-balls bounded by nearest neighbours, so that no data point lies inside any ball. Then we've turned this into a sphere-packing problem, and we can talk about the sizes and volumes of those spheres.
So if we fill our real distribution with data uniformly, the average volume of those spheres decreases. If we fill it non-uniformly, the average ball still shrinks, but the largest ball shrinks more slowly (that being the case where we aren't properly covering the data in that region). Either way, the more data you add, the more the balls shrink, which essentially means the differences between data points decrease. The harder question is about the under-represented regions: finding them and determining how to sample them properly.
Another quick trick you can use to convince yourself is thinking about basis vectors (this won't be rigorous, but it's a good starting point). In high dimensions, two randomly sampled vectors are almost certainly close to orthogonal. So think of drawing basis vectors (independent vectors that span our space). As we fill in data, we are initially very likely to get vectors (data points) that are independent in some way. But as we add more, the likelihood that a new one is orthogonal to the rest decreases. Of course your basis vectors don't need to be orthogonal, but that's mostly semantics, because we can always work in a space where that's true.
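You can see both effects numerically in a few lines (a sketch with numpy; the dimensions and sample counts are arbitrary choices):

    # Two quick numerical checks (arbitrary sizes, just for intuition):
    # 1) random high-dimensional vectors are nearly orthogonal;
    # 2) nearest-neighbour distances shrink as you add more points.
    import numpy as np
    rng = np.random.default_rng(0)

    # 1) mean |cosine similarity| of random pairs drops as dimension grows.
    for d in (3, 100, 10_000):
        a = rng.standard_normal((200, d))
        b = rng.standard_normal((200, d))
        cos = np.abs((a * b).sum(1) /
                     (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)))
        print(d, round(float(cos.mean()), 3))

    # 2) mean nearest-neighbour distance in a fixed space drops with n.
    for n in (100, 300, 1_000):
        x = rng.random((n, 10))
        dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
        np.fill_diagonal(dist, np.inf)
        print(n, round(float(dist.min(axis=1).mean()), 3))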
I agree, but my question was not whether distance between data points tends to decrease as dataset size grows, but whether that is the reason why the number of training tokens required per parameter declines. It could be, but proving it would require a better understanding of how and why these giant AI models work.
Wasn't your question about how *independent* the data is?
We could talk about this in different ways, like variance. But I'm failing to see how I didn't answer your question. Did I miscommunicate? Did I misunderstand?
The model is learning from statistics, so most of your information gain comes from more independent data. Think of the space we are talking about as "knowledge," and our "intelligence" as how easily we can get to any point in that space. The vector view above might help here: you can step in the direction of any vectors you have, and combine them to reach your final point. The question is how many you have to use (how many "steps" away it is), and of course how close you can get to your final destination. As you can imagine from my previous comment, for any given point you'll need fewer steps if you have more vectors, but the utility of each additional vector also decreases as you add more. (Generally. Of course, if you have a gap in knowledge, a single vector that goes into that area can be a big help, but let's leave that aside.)
Does this help clarify? If not I might need you to clarify your question a bit more. (I am a ML researcher fwiw)
> Wasn't your question about how independent the data is?
No. My original (top) comment was about how the number of training tokens required per parameter slowly declines as models become larger. dzdt suggested it could be because the independence of training points declines as the dataset size grows. I said it could be, but I'm not sure how one would go about proving it, given how little we know about the inner workings of giant models. Makes sense?
Oh I see. Then yes, we expect this to happen once we have sufficient coverage. As we increase the number of parameters linearly, the number of configurations increases super-linearly; in other words, so does the amount of information we can compress.
There's a lot we don't know, but it isn't nothing. There's a big push claiming ML doesn't need math. It's true you can do a lot without it, especially if you have compute. But the math helps you understand what's going on and what your limits are. We can't explain everything yet, but it's not nothing.
I guess you could artificially limit the training data (e.g. by removing languages or categories) and see if the utility of extra tokens drops off as a result.
I understand the context to be that, for a smart TV with Roku built in as the operating system, if the TV is set to take input from an HDMI device that is idle, the Roku OS might take over and display its own content, like previews for other shows as a "screensaver" kind of thing.
[1] https://www.point2homes.com/US/Neighborhood/IN/Carmel-Demogr...