
The leading example in the article is Carmel, Indiana. I am not sure what has been done there should be described as a model of competent governance. This is a small city (100k people / 50k workers / 37k households [1]) that has taken on a major amount of debt ($1.4 billion, or $14k per person / $28k per worker / $38k per household) to create public amenities.

The financial logic of what Carmel is doing centers on zero-sum competition with neighboring communities. Carmel is betting that by being the nicest place with the fanciest amenities, it will attract the richest families and be able to support that heavy debt burden in the coming decades.

This is explicitly not a model that every community could follow! It relies on spending more than would be prudent for the current tax base, in the hope of standing out in comparison to other communities. And it's certainly a bet that could fail badly, depending on general economic conditions. It's too soon to tell if the gamble will work out, and very hard to tell even in hindsight how risky it was, succeed or fail.

[1] https://www.point2homes.com/US/Neighborhood/IN/Carmel-Demogr...


Look at the average household income in Carmel, IN, then compare it to, say, Darien, CT.

Then consider: Darien's per capita debt load works out to about 2% of its average annual per capita income, while Carmel's works out to roughly 25-30%. (And that's being kind by using the average per capita income. It gets even uglier for Carmel if we do these comparisons at the median.)

Given all that, it's not hard at all to bet on failure.

You can play at being Darien if you have Darien money. If you don't, you're just being fiscally irresponsible. There's nothing about the Carmel story that says 'competence'.


You have to incorporate county and state debt per taxpayer if you are comparing locales under different governments.

Connecticut has very high (multiple standard deviations above the norm) retiree benefit debt, due to decades of underfunding. But these numbers are not easily comparable, because there are no strict rules around how taxpayer-funded defined-benefit (DB) pension and retiree healthcare debt is calculated.

It just shows up as a bigger and bigger proportion of government expenses.


Didn't the guy also refer to Mike Duggan's work in Detroit?


Comparing Detroit to Singapore is pretty silly in my eyes. The second derivative for Detroit has been good for about twenty years, but the city has been decaying under adverse economic events and poor business diversification.

Connect the Q-Line out towards Oakland County and to the People Mover. Get that train going to the airport. There are real infrastructure needs here and it doesn't take a lot to imagine them.


Organizations like Exxon-Mobil promote these fairy-tale schemes as a talking point to counter legislation that would restrict their unlimited production of fossil-fuel-based plastics. It's the same scheme as carbon capture for coal-burning power plants: the technology doesn't work as described and never will, but talking up its potential gives an excuse not to regulate the polluting industry now.


cc65 aims for full C language support, which KickC does not have; but cc65's output will never look like hand-written assembly, which KickC's often does.


The 6502 is somewhat famously a hard target for the C language, and KickC does quite well at producing good results in spite of this. The C language leans heavily on pointers and stack usage. The 6502 has a minimal hardware stack: 256 bytes with no stack-relative addressing, so a "stack frame" is an alien concept that requires slow workarounds to emulate. And the 6502 can only dereference pointers stored as 16-bit words in the first 256 bytes of RAM ("zero page"), and even then it needs one of its two index registers (X or Y) to do so.
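
To make that concrete, here is a rough sketch (my own illustration, not actual cc65 or KickC output) of the work a 6502 C compiler has to do for a single pointer dereference:

    #include <stdio.h>

    /* One C expression, many 6502 ops. The commented assembly is a
       hand-sketch of what a 6502 C compiler must emit for *p, not
       actual cc65/KickC output:
           lda ptr        ; copy the 16-bit pointer into a
           sta zp         ; zero-page slot "zp"...
           lda ptr+1
           sta zp+1
           ldy #0
           lda (zp),y     ; ...because indirect-indexed addressing
                          ; through zero page is the only way the
                          ; 6502 can follow a 16-bit pointer */
    unsigned char deref(const unsigned char *p) {
        return *p;
    }

    int main(void) {
        unsigned char buf[1] = { 42 };
        printf("%u\n", deref(buf));
        return 0;
    }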


The way to do it is to make a C-like language that has types and operators that map easily onto the 6502 architecture.

After all, C was designed to map onto the PDP-11 architecture - things like postincrement.
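
For illustration, the folklore version of the claim (a sketch only, not verified history; see the reply below for evidence against it) is that the classic C copy loop maps almost one-to-one onto the PDP-11's autoincrement addressing mode:

    #include <stdio.h>

    /* The folklore: a copy loop written with postincrement, i.e.
           while ((*dst++ = *src++) != '\0');
       supposedly compiles on the PDP-11 to something close to
           loop: movb (r0)+, (r1)+
                 bne  loop
       because the machine has autoincrement addressing built in. */
    static void copy(char *dst, const char *src) {
        while ((*dst++ = *src++) != '\0')
            ;
    }

    int main(void) {
        char buf[16];
        copy(buf, "pdp-11");
        puts(buf);
        return 0;
    }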


The PDP-11 postincrement claim is very often repeated, and I'm sure there are good reasons to suspect it, but here is one bit of evidence to the contrary that I find convincing:

https://yarchive.net/comp/c.html



There's also Millfork, which is a "mid-level" language specifically for 8-bit CPUs:

https://github.com/KarolS/millfork


The 6502's stack is intended to be mostly a call stack, with maybe a temporary or two stored while you juggle the registers, not a place for your data frames. From what I vaguely recollect, cc65 keeps a software stack pointer in zero page and indexes it with Y, but only when it actually needs stack-like behaviour; at other times it uses static allocation a la Fortran, or so I think.
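
Here's a rough sketch of that software-stack idiom as I understand it (the general technique emulated in C, not actual cc65 internals):

    #include <stdio.h>

    /* The hardware stack is left for return addresses; data lives on
       a second stack whose 16-bit pointer sits in zero page and whose
       slots are reached via the Y register, roughly:
           ldy #offset
           lda (sp),y     ; read the local at "sp + offset"
       Emulated here with a plain array and an index. */
    static unsigned char ram[256];   /* pretend data-stack memory */
    static unsigned char sp = 255;   /* software stack pointer */

    static void push(unsigned char v) { ram[sp--] = v; }

    static unsigned char peek(unsigned char offset) {
        return ram[sp + 1 + offset]; /* like "lda (sp),y" */
    }

    int main(void) {
        push(42);                    /* two "locals" in this frame */
        push(7);
        printf("locals: %u %u\n", peek(0), peek(1));
        sp += 2;                     /* drop the frame */
        return 0;
    }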


This paper is from a small group at an academic institution. They are trying to innovate in the idea space and are probably quite compute constrained. But for proving out ideas, smaller problems can make for easier analysis, even leaving compute resources aside. Not all research can jump straight to SOTA applications. It looks quite interesting, and I wouldn't be surprised to see it applied soon to larger problems.


> They are trying to innovate in the idea space and are probably quite compute constrained.

Training a GPT-2-sized model costs ~$20 in compute nowadays: https://github.com/karpathy/llm.c/discussions/481


Baseline time to grok something looks to be around 1000x normal training time, so make that $20k per attempt. It probably takes a while, too. Their headline number (50x faster than baseline, so ~$400) looks pretty doable if you can make grokking happen reliably at that speed.


$20 per attempt. A paper typically comes after trying hundreds of things. That said, the final version of your idea could certainly try it.


I’ve been in a small group at an academic institution. With our meager resources we trained larger models than this on many different vision problems. I personally train larger LLMs than this on OpenWebText using a few 4090s (not work related). Is that too much for a small group?

MNIST is solvable using two pixels. It shouldn’t be one of two benchmarks in a paper, again just in my opinion. It’s useful for debugging only.


Again, a small academic group may not have the experience or know-how to realize these things.


I thought so at first, but the repo's[0] owner, and the first name listed in the article, has Seoul National University on their GitHub profile. That's far from a small academic institution.

[0]: https://github.com/ironjr/grokfast


It's a free world. Nothing stops you from applying their findings to bigger datasets. It would be a valuable contribution.


How can MNIST be solved using just two binary pixels when there are 10 classes, 0-9?


I'm also curious, but my understanding was that MNIST pixels are not binary, due to some postprocessing artifacts.


Oh hm, so they are. I thought they were binary because they used a digital pen to create them, IIRC, and logistic regression is always the baseline; but checking, they technically are grayscale and people don't always binarize them. So I guess information-theoretically, if they are 0-255 valued, then 2 pixels could potentially let you classify pretty well if sufficiently pathological (two 8-bit pixels jointly take 256 × 256 = 65,536 distinct values, far more than the 10 classes require).


> MNIST is solvable using two pixels.

Really? Do you have any details?

I agree it has no business being in a modern paper.


This is one of those unknowns that climate scientists couldn't predict: melting permafrost is releasing minerals that have been locked in place for thousands of years.


Back in 1999-2000 there was an "International RoShamBo Programming Competition" [1] where computer bots competed at rock-paper-scissors. The baseline bot just selected its play randomly, which is a theoretically unbeatable strategy. One joke entry to the competition was carefully designed to beat the random baseline anyway ... by reverse-engineering the state of the random number generator and then predicting with 100% accuracy what the random player would play.

Edit: the random-reversing bot was "Nostradamus" by Tim Dierks, which was declared the winner of the "supermodified" class of programs in the First International RoShamBo Programming Competition. [2]

[1] https://web.archive.org/web/20180719050311/http://webdocs.cs...

[2] https://groups.google.com/g/comp.ai.games/c/qvJqOLOg-oc
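
For flavor, here is a minimal toy sketch of the general idea (my own illustration using a brute-forced srand() seed; per the tournament notes, Nostradamus actually reverse-engineered random()'s internal state, which is harder):

    #include <stdio.h>
    #include <stdlib.h>

    #define OBSERVED 16

    int main(void) {
        /* The "random" opponent: seeded with a value we pretend
           not to know, playing rand() % 3 each turn. */
        unsigned secret = 12345u;
        int moves[OBSERVED];
        srand(secret);
        for (int i = 0; i < OBSERVED; i++)
            moves[i] = rand() % 3;

        /* The predictor: brute-force seeds until one replays the
           observed history exactly. */
        for (unsigned s = 0; s <= 1000000u; s++) {
            srand(s);
            int ok = 1;
            for (int i = 0; i < OBSERVED && ok; i++)
                ok = (rand() % 3 == moves[i]);
            if (ok) {
                /* rand() is now synchronized with the opponent:
                   its next output is their next move, so we can
                   counter it with certainty. */
                printf("recovered seed %u; next move is %d\n",
                       s, rand() % 3);
                return 0;
            }
        }
        puts("seed not found in search range");
        return 0;
    }

With 16 observed moves there are 3^16 ≈ 43 million possible sequences, so a wrong seed essentially never survives the check.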


That's me! Thanks for pulling up the quote from long ago:

> "With his obvious technical skill, and his "cheat early and often" attitude, Tim could have a promising career as an AI programmer in the computer games industry. :)"

Instead I took a path into security, authoring the TLS RFC and becoming a principal engineer in Google security. Thanks for the flashback.


You pulled a Kobayashi Maru when you got the chance. I bow to thee.


This makes me happy to see.


Can you share the source to Nostradamus?


Or a write-up on the algorithm used? How did knowledge of the PRNG impact the implementation?


I had to check out your GitHub after this awesome reply. Gotta love Hacker News.

I had a cool vision for “tag play” … I visualize mini RFID records on a turntable that tell Roku what to play.


DJs have timecoded vinyl records that do something like this, even allowing the DJ to scratch the mp3 that is being played.


Serato is digital pretend scratching. Might as well use a DJ controller. Real scratching requires a proper turntable and skill, like Mix Master Mike's. Hell, I have two Pioneer DL-5s and a Pioneer DJM-600, but these turntables aren't good for scratching because of their straight arms; they're good for gapless playback. https://youtu.be/58Y--XTIRZ8


Only on Hacker News!


you're a wizard bro. hell yea


So what you’re saying is, if you can’t beat em, join em? /s

I’m actually a bit relieved they have you on the team. Considering what they (Google) know about us all.


I submitted the optimally bad entrant the first year, cheesebot.

https://web.archive.org/web/20180719050236/http://webdocs.cs...


I can't believe they didn't say anything about your solution! How did it work?


Nice! How does Cheesebot work and why did it lose?


The whole commentary about the "supermodified" class of competition entrants is making me laugh:

> Nostradamus was written by Tim Dierks, a VP of Engineering at Certicom, who has a lot of expertise in cryptography. The program defeats the optimal player by reverse-engineering the internal state of the random() generator, which he states "was both easier and harder than I thought it would be". To be sporting, it then plays optimally against all other opponents.

> Fork Bot was based on an idea that Dan Egnor came up with a few minutes after hearing about the contest. Since "library routines are allowed", his elegant solution was to spawn three processes with fork(), have each one make a different move, and then kill off the two that did not win. This was implemented by Andreas Junghanns in about 10 lines of code. Unfortunately, since all three moves lost to the Psychic Friends Network after the first turn, the program exited and the remainder of that match was declared forfeited.

> The Psychic Friends Network is a truly hilarious piece of obfuscated C, written by Michael Schatz and company at RST Corporation. Among other things, it uses an auxiliary function to find good karma, consults horoscopes, cooks spaghetti and (mystic) pizza to go with various kinds of fruit, #defines democrats as communists, and undefines god. We're still trying to figure out exactly what it is doing with the stack frame, but we do know that it never scores less than +998 in a match, unless it is playing against a meta-meta-cheater.

> The Matrix was written by Darse Billings, who holds the prestigious title of "Student for Life", and recently started the PhD programme at the University of Alberta. The RoShamBo program defeated every opponent with a perfect score, based on the simple principle "There is no spoon".

> Since The Matrix is also the tournament program, it has complete access to all other algorithms, data structures, and output routines, and is therefore unlikely to ever be overtaken. As a result, this category is hereby declared to be solved, and thus retired from future competitions.


I was very curious about the Psychic Friends Network. One can find the code here (https://github.com/MrValdez/Roshambo/blob/master/rsb-iocaine...), and it's easy to deobfuscate substantially by running it through the C preprocessor.

I believe it works as follows:

- It plays randomly for the first 998 turns (https://github.com/MrValdez/Roshambo/blob/master/rsb-iocaine...): this line is "if (*turn < trials - 2) return libra ? callback() : random() % 3;", and "libra" is initialized to (int) NULL, i.e. zero, on every invocation.

- In the last 2 turns, it uses `find_goodkarma` to comb through the stack to find where the variables matching its own history and the opponent's history are stored. These are the stack arrays p1hist and p2hist (https://github.com/MrValdez/Roshambo/blob/master/rsb-iocaine...).

They're easy to find because they contain 998 known values each in a ~random sequence of (0, 1, 2), and they're just upwards of the stack from the current invocation of the Psychic Friends Network.

`find_goodkarma` simply increments a pointer until the whole sequence of 998 values matches the known history.

- Then, it rewrites the history to make itself win. These lines (https://github.com/MrValdez/Roshambo/blob/master/rsb-iocaine...) never get executed, then these lines (https://github.com/MrValdez/Roshambo/blob/master/rsb-iocaine...) tally up draws so far (libra), wins (cancer) and losses (scorpio).

This line makes sure its move is the opponents' move +1 mod 3, which is the winning move: https://github.com/MrValdez/Roshambo/blob/master/rsb-iocaine...

Then, these lines repeat the same trick for the number of wins and losses. It checks whether it's p1 or p2 by comparing the addresses of the win/loss arrays, and then overwrites the wins/losses appropriately using `pizza` https://github.com/MrValdez/Roshambo/blob/master/rsb-iocaine...

In the end it returns an arbitrary value (the address of `good_hand` mod 3).

It was fun to follow but the result is kind of boring :)
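
For anyone who wants the trick without the obfuscation, here is a stripped-down sketch of the stack-walking idea (my own simplification, not the actual PFN code; it is deliberate undefined behavior and will generally only survive with optimizations off):

    #include <stdio.h>
    #include <string.h>

    #define HIST_LEN 8

    /* The known move history the cheater searches for. Kept static
       (off the stack) so the scan can only match the caller's copy. */
    static const int expected[HIST_LEN] = {0, 1, 2, 0, 1, 2, 0, 1};

    static void cheat(void) {
        int marker = 0;  /* anchors us inside our own stack frame */
        unsigned char *p = (unsigned char *)&marker;

        /* Scan upward (toward the caller's frame on a typical
           downward-growing stack) for the known byte pattern. */
        for (size_t off = 0; off < 4096; off++) {
            if (memcmp(p + off, expected, sizeof expected) == 0) {
                int *hist = (int *)(p + off);
                for (int i = 0; i < HIST_LEN; i++)
                    hist[i] = 2;  /* rewrite history in our favor */
                printf("rewrote caller's history at offset %zu\n", off);
                return;
            }
        }
        puts("history not found");
    }

    int main(void) {
        int history[HIST_LEN] = {0, 1, 2, 0, 1, 2, 0, 1};
        cheat();
        printf("history[0] is now %d\n", history[0]);
        return 0;
    }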


so parallel universes lost to rewriting history

amazing


Thanks for the deep dive. Yes fun to follow.


The whole is better than the sum of the parts. You have Nostradamus, which predicts the future; Fork Bot, which plays all playable presents; and the Psychic Friends Network, which rewrites the past. All under the watchful eye of The Matrix.

There's something beautiful here and you honestly couldn't make it up.


I remember someone doing the same to an online poker site that had helpfully documented its PRNG in a laudable attempt at transparency.

(And the transparency got them an improvement in their security in the end.)


I'm surprised they don't use some form of hardware-based RNG. I assume there are many good reasons.

https://en.wikipedia.org/wiki/Hardware_random_number_generat...


They wanted to show that they didn't cheat.

In general, you can pick a random seed at the start of the day, commit to it somewhere (e.g. publish a hash of it on the Bitcoin blockchain, or just on your website), then use that seed in a cryptographically secure PRNG on your site all day, and at the end of the day publish the seed you already committed to.

This way people can check that you didn't cheat, but they can't guess their opponents' cards either.
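
A minimal sketch of that commit-reveal flow (illustration only; it uses OpenSSL's SHA256 and links with -lcrypto, and the seed string is a made-up placeholder):

    #include <openssl/sha.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Morning: pick a secret high-entropy seed and publish only
           its hash (the commitment). This value is a placeholder. */
        const char seed[] = "2024-06-07:example-secret-seed";
        unsigned char commitment[SHA256_DIGEST_LENGTH];
        SHA256((const unsigned char *)seed, strlen(seed), commitment);

        printf("published commitment: ");
        for (size_t i = 0; i < sizeof commitment; i++)
            printf("%02x", commitment[i]);
        putchar('\n');

        /* All day: the seed drives a cryptographically secure PRNG
           that shuffles every deck (not shown). */

        /* Night: publish the seed. Anyone can hash it, compare with
           the morning's commitment, and replay the shuffles. */
        printf("revealed seed: %s\n", seed);
        return 0;
    }

The commitment binds the site to the seed before any hand is dealt, so after-the-fact manipulation of the shuffle becomes detectable.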


Wouldn't the most obvious method of cheating be the site owner peeking at other players' hands and/or the deck? Which this doesn't (and cannot) prevent?

I guess I'm not sure what publicizing their PRNG is meant to prove. It shows they didn't cheat via a very specific type of cheating but there are several other potential cheating vectors.


> It shows they didn't cheat via a very specific type of cheating but there are several other potential cheating vectors.

Yes, and this is not the only anti-cheat method they had.


(The above was simplified. In a game of poker, you also want to make sure that when hands get folded, that no one learns what the cards were.)


> don't use some form of hardware based RNG

I've always wondered: why aren't ADCs (e.g. mic input) and temperature sensors considered a good source of entropy, particularly if the lower bits are taken?


If you just want any old entropy, and don't care about proving to someone else that your entropy is good, these are acceptable. But it's honestly really easy to get entropy of that grade, and thanks to modern cryptographically secure PRNGs, you don't need a lot of 'real' entropy. You can stretch it out forever.

If you want/need to be able to argue to a third party that your entropy was good, you can spend a bit more.

How do you convince anyone that your mic input was actual real mic input, and not just a sequence of bits you made up to be convenient for you?



Or, if you don't want to trust the source, https://drand.love/


So, it was using a pseudorandom generator?


Could it be that the independence of available training points declines as the dataset size grows? At some point it becomes hard to add data that isn't essentially similar to something you've already added.


Yes, could be. Not sure how or even if anyone could prove it, though.


This should be fairly safe to assume. Remember your dataset is some proxy for some real (but almost surely intractable) distribution.

Now think about filling the space with p-balls bounded by nearest points, so that no data point lies inside any ball. We've then turned this into a sphere-packing problem, and we can talk about the sizes and volumes of those spheres.

If we fill our real distribution with data uniformly, the average volume of those spheres decreases. If we fill it non-uniformly, the average ball still shrinks, but the largest ball shrinks more slowly (that case meaning we aren't properly covering the data in that region). Either way, the more data you add, the more the balls shrink; essentially, the difference between data points decreases. The harder question is about the under-represented regions: finding them and determining how to properly sample them.

Another quick trick you can use to convince yourself is thinking about basis vectors (this won't be rigorous, but it's a good starting point). In high dimensions, two randomly sampled vectors are almost certainly nearly orthogonal. So think of drawing basis vectors (independent vectors that span our space): as we fill in data, we are initially very likely to add vectors (data) that are independent in some way, but as we add more, the likelihood of orthogonality decreases. Of course your basis vectors don't need to be orthogonal, but that's semantics, because we can always work in a space where that's true.
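
A quick numeric check of that intuition (a toy experiment; uniform coordinates stand in for whatever the real data distribution is):

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Average absolute cosine similarity between random vector pairs:
       it shrinks toward 0 (near-orthogonality) as dimension grows,
       roughly like 1/sqrt(dim). */
    static double avg_abs_cosine(int dim, int pairs) {
        double total = 0.0;
        for (int p = 0; p < pairs; p++) {
            double dot = 0.0, na = 0.0, nb = 0.0;
            for (int i = 0; i < dim; i++) {
                double a = 2.0 * rand() / RAND_MAX - 1.0;
                double b = 2.0 * rand() / RAND_MAX - 1.0;
                dot += a * b;
                na += a * a;
                nb += b * b;
            }
            total += fabs(dot) / sqrt(na * nb);
        }
        return total / pairs;
    }

    int main(void) {
        srand(42);
        int dims[] = {2, 10, 100, 1000, 10000};
        for (size_t i = 0; i < sizeof dims / sizeof dims[0]; i++)
            printf("dim %5d: avg |cos| = %.4f\n",
                   dims[i], avg_abs_cosine(dims[i], 200));
        return 0;
    }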


I agree, but my question was not whether distance between data points tends to decrease as dataset size grows, but whether that is the reason why the number of training tokens required per parameter declines. It could be, but proving it would require a better understanding of how and why these giant AI models work.


Wasn't your question about how *independent* the data is?

We could talk about this in different ways, like variance. But I'm failing to see how I didn't answer your question. Did I miscommunicate? Did I misunderstand?

The model is learning from statistics, so most of your information gain comes from more independent data. Think of the space we are talking about as "knowledge," and our "intelligence" as how easy it is to get to any point in that space. The vector view above might help here: you can step in the direction of any vectors you have, combining them to reach your final point. The question is how many you have to use (how many "steps" away the destination is), and of course how close you can get to it. As you can imagine from my previous comment, for any given point you'll need fewer steps if you have more vectors, but the utility of each additional vector decreases as you add more. (Generally. If you have a gap in knowledge you can get a big help from a single vector that goes into that area, but let's leave that aside.)

Does this help clarify? If not, I might need you to clarify your question a bit more. (I am an ML researcher, fwiw.)


> Wasn't your question about how independent the data is?

No. My original (top) comment was about how the number of training tokens required per parameter slowly declines as models become larger. dzdt suggested it could be because the independence of training points declines as the dataset size grows. I said it could be, but I'm not sure how one would go about proving it, given how little we know about the inner working of giant models. Makes sense?

Otherwise, I agree with everything you wrote!


Oh, I see. It's because yes, we expect this to happen once we get to sufficient coverage. As we linearly increase the number of parameters, the number of configurations increases super-linearly; in other words, so does the information we can compress.

There's a lot we don't know, but what we know isn't nothing. There's a big push for the idea that ML doesn't need math. It's true you can do a lot without it, especially if you have compute. But the math helps you understand what's going on and what your limits are. We can't explain everything yet, but it's not nothing.


I guess you could artificially limit the training data (e.g. by removing languages or categories) and see if the utility of extra tokens drops off as a result.


I understand the context to be that for a smart TV with Roku built in as the operating system, if the TV is set to take input from an HDMI device which is idle, the Roku OS might take over and display its own content, like previews for other shows as a "screensaver" kind of thing.


The phone alert came here 1 hr 40 min after the quake. [shrug] Yes, pretty useless.

