
Great work! Hacker News still seems to have a deeply skeptical culture with regard to machine learning - not sure why. There's always someone saying it's "not novel" and it's "just doing x".

Overfitting is a known issue in machine learning, people. If you still think, in the year 2021, that all neural networks do is memorize the dataset completely, you might want to revisit the topic. It is one of the first concerns anyone training a deep model will have, and to assume this model is overfit _without_ providing specific examples is arguing in bad faith.

Sentdex has shown that his GAN is able to generalize various game logic like collision/friction with vehicles, and that it also learns aspects of rendering such as a proper reflection of the sun on the back of the car.

He also showed weak points where the model is incapable of handling some situations, even attempting the impossible task of "splitting a car in two" to try to resolve a head-on collision. Even though this is a failure case, it should at least give you some intuition that the GAN isn't just spitting out frames memorized from the dataset, because that never happens in the dataset.

You will need to apply a little more rigor before outright dismissing these weights as merely overfit.

@sentdex Have you considered a guided diffusion approach now that that's all the rage? It's all rather new still but I believe it could be applied to these concepts as well.




Heh, yeah, tough crowd I guess. The full code, models, and videos are all released and people are still skeptical.

I feel like 95%+ of papers don't do anything besides tell you what happened and you're just supposed to believe them. Drives me nuts. Not sure why all the hate when you could just see for yourself. I'd welcome someone who can actually prove the model just "memorized" every combo possible and didn't do any generalization. I imagine the original GameGAN researchers from NVIDIA would be interested too.

Interesting @ guided diffusion, not aware of its existence til now. We've had our heads down for a while. Will look into it, thanks!


> I feel like 95%+ of papers don't do anything besides tell you what happened and you're just supposed to believe them.

Honestly I think there's a big problem with page limits. My team recently had a pre-print that was well over 10 pages and we still didn't get everything in, and then when we submitted to NeurIPS we had to reduce it to 9! This seems to be a common problem and is why you should often check the different versions on arXiv. And we had more experiments and data to convey than in the pre-print. The problem is growing as we have to compare more things, and tables can easily take up a whole page. I think this exaggerates the ever-present problem of not explaining things in detail and expecting readers to be experts. Luckily most people share source code, which helps show all the tricks authors used, and blogging is becoming more common, which helps further.

> I'd welcome someone who can actually prove the model just "memorized" every combo possible

Honestly this would be impressive in and of itself.


There's the Hutter Prize [1] - memorizing is useful (and arguably intelligent) if it's compressed.

http://prize.hutter1.net/


Indeed. Novel, efficient program synthesis is still novel, efficient program synthesis even if it's a novel, efficient data compression codec you're synthesising.


>> The full code, models, and videos are all released and people are still skeptical.

If you're uncomfortable with criticism of your work you should definitely try publishing it, e.g. at a conference or journal. It will help you get comfortable with being criticised very quickly.


I think he’s pointing out that the “criticism” here is similar to that of a person criticizing a book they’ve never read or even flipped through.


Perhaps, but that kind of criticism should be the easiest to ignore. The OP expresses frustration at lay criticism, and I expect that even brief contact with academic criticism would make that frustration fade into irrelevance.


I've been learning about this stuff for about a year now. Your earlier experiments with learning to drive in GTA V were an inspiration for me - because they hit that perfect intersection of machine learning, accessibility in education, and just plain cool.

The pandemic hit and OpenAI had released DALL-E and CLIP. I was unemployed and bored with my Python skills, so I decided to just dive in. I found that a nice gentleman named Phil Wang on GitHub had been replicating the DALL-E effort and decided to start contributing!

You can find that work here:

https://github.com/lucidrains/DALLE-pytorch

and you'll find me here:

https://github.com/afiaka87

We have a few checkpoints available with Colab notebooks ready. There is also a research team with access to more compute who will eventually be able to perform a full replication study and match OpenAI's scale, and then some, because we are also working with another brilliant German team, https://github.com/CompVis/, who has provided us with what they call a "VQGAN" (if you're not familiar): a variational autoencoder for vision tokens with the neat trick from GAN-land of using a discriminator to produce fine details.

https://github.com/CompVis/taming-transformers

We use their pretrained VQGAN to convert an image into digits. We use another pretrained text tokenizer to convert words to digits. Both sets of digits go into a Transformer architecture, and a mask is applied so that the text tokens can't see the image tokens. The digits come out and we decode them back into text and image respectively. Then, a perceptual loss is computed. Rinse, wash, repeat. Slowly but surely, text predicts image without ever having been able to actually _see_ the image. Insanity.
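
To make that flow a bit more concrete, here's a tiny PyTorch sketch of the idea. This is not the actual DALLE-pytorch code: the vocabulary sizes, sequence lengths, and layer counts are made-up toy values, and the loss is simplified to plain cross-entropy over the predicted image tokens.

    import torch
    import torch.nn as nn

    TEXT_VOCAB, IMAGE_VOCAB = 10_000, 1024   # e.g. BPE text tokens + a VQGAN codebook
    TEXT_LEN, IMAGE_LEN, DIM = 64, 256, 512  # 256 image tokens ~ a 16x16 latent grid

    class ToyDalle(nn.Module):
        def __init__(self):
            super().__init__()
            self.text_emb = nn.Embedding(TEXT_VOCAB, DIM)
            self.image_emb = nn.Embedding(IMAGE_VOCAB, DIM)
            layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers=4)
            self.to_image_logits = nn.Linear(DIM, IMAGE_VOCAB)

        def forward(self, text_tokens, image_tokens):
            # One sequence: text tokens first, then image tokens.
            x = torch.cat([self.text_emb(text_tokens),
                           self.image_emb(image_tokens)], dim=1)
            n = x.shape[1]
            # Causal mask: each position may only attend to earlier positions,
            # so the text (which comes first) never sees the image tokens.
            mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
            h = self.transformer(x, mask=mask)
            # Predict each image token from the text plus the image tokens before it.
            logits = self.to_image_logits(h[:, TEXT_LEN - 1:-1])
            return nn.functional.cross_entropy(
                logits.reshape(-1, IMAGE_VOCAB), image_tokens.reshape(-1))

    model = ToyDalle()
    text = torch.randint(0, TEXT_VOCAB, (2, TEXT_LEN))
    image = torch.randint(0, IMAGE_VOCAB, (2, IMAGE_LEN))
    loss = model(text, image)   # "rinse, wash, repeat" = backprop this loss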

Anyway, taking a caption and making a neural network output an image from it has again hit that "perfect intersection of machine learning, accessibility in education, and just plain cool". I don't know if you could fit it into the format of your YouTube channel but perhaps it would be a good match?


FWIW I saw your video a couple of days ago via Reddit and I loved it a lot. Even sent a link to the video to a friend of mine because I think it was a very inspiring and interesting video.

I hope you don't let naysayers get to you :)


This is wild - thanks for putting the video together, it’s very cool.


One of the main problems with ML/NN is that it often works like magic, i.e. the trick works as long as the audience doesn't know the secret behind it. It's fascinating to a gullible audience, mundane bordering on boring to practitioners.

My Tiger repelling rock^^^^^^leopard detection model works great on all animal pictures ... until you feed it a sofa https://web.archive.org/web/20150703094328/http://rocknrolln...

>able to generalize various game logic like collision/friction with vehicles, and that it also learns aspects of rendering such as a proper reflection of the sun on the back of the car

It did none of that. What this model did is learn all the frames of video and their chronological order according to the input.

> impossible task of "splitting a car in two" to try and solve a head-on collision.

It played back both learned versions at once, like reporting the confidence of a round thing as 50% ball and 50% orange.


In the end, everything is boiling down to matrix math, so you can always make the argument that no neural network is impressive if you want.

The model's size is ~173MB, depending on settings. That's not much space to have memorized every single possible combination of events, nor was our data enough to cover that either.


Your original self driving GTA5 videos are what helped me come to understand machine learning in the first place (along with some of SethBling's MarI/O, and a bit of Tom7's learn/play-fun magic). I used your tech to make an AI that played Donkey Kong Country in the LSNES emulator shortly before Gym-Retro was released.

So, thanks a bunch, Sentdex. You are rad.


Hah, awesome! Any plans to apply GAN Theft Auto to something else? :o


Not offhand, but you've probably inspired a lot of creativity with this across the internet... and a lot of copycats. I'm looking forward to seeing what gets made.


>> The model's size is ~173MB, depending on settings. That's not much space to have memorized every single possible combination of events, nor was our data enough to cover that either.

The resolution of the images output by the model is very low (what is it exactly, btw?). It's not impossible that your model has memorised at least a large part of its data.

In fact, the simplest explanation of your model's output (as of much of deep neural networks for machine vision) is that it's a combination of memorisation and interpolation. There was a recent-ish paper by Pedro Domingos that proposed an explanation of deep learning as memorisation of exemplars similar to support vectors (if I understood it correctly - I only gave it a high-level read).

It's also difficult to see from your demonstration exactly what the relation between the output and the input images is. You're showing some very simple situations in the video (go left, go right), but is that all that was in the input?

For example, I'd like to see what happens when you try to drive the car over the barrier. Was that situation in the input? And if so, how is it modelled in the output?

Finally, how do you see this having real-world applications? I don't mean necessarily right now, but let's say in 30 years time. So far, you need a fully working game engine to model a tiny part of an entire game in very low resolution and very poor detail. Do you see this as somehow being extended to creating a whole novel game from scratch? If so, how?

Edit: on memorisation, it's not necessary to memorise events, only the differences between sets of pixels in different frames. For instance, most of the background and the road stays the same during most of the "game". Again, the resolution is so low that it's not unfathomable that the model has memorised the background and the small changes to it necessary to model the input. So, it interpolates, but can it extrapolate to unseen situations that are nevertheless predicted by the physics you suggest it has learned, like driving over the barrier?


Video frame resolution is pretty small...


> The model's size is ~173MB

That is impressive! Less than twice the size of ResNet-50 weights. Surely that is within an order of magnitude of an equivalent Unity or Godot game+models.


> My Tiger repelling rock^^^^^^leopard detection model works great on all animal pictures ... until you feed it a sofa

I'm sorry, how is this different from normal software engineering? There are dozens of unit/integration testing memes poking fun at exactly this. (It's also a mostly solvable problem in ML, btw, when you use out-of-distribution data: give your model a third end state that represents "neither".)
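
For anyone curious what that "neither" trick looks like in practice, here's a rough sketch. The class names are hypothetical and the backbone is just an off-the-shelf torchvision model; this is not code from any real leopard detector.

    import torch
    import torch.nn as nn
    from torchvision import models

    CLASSES = ["tiger", "leopard", "neither"]  # the third state catches sofas etc.

    model = models.resnet18()                  # any image backbone will do
    model.fc = nn.Linear(model.fc.in_features, len(CLASSES))

    def predict(image_batch):
        # image_batch: (N, 3, H, W) float tensor, normalized like the training data
        probs = model(image_batch).softmax(dim=-1)
        return [CLASSES[i] for i in probs.argmax(dim=-1).tolist()]

    # During training, the "neither" label is assigned to out-of-distribution
    # images (furniture, landscapes, random photos), so the model has somewhere
    # to put inputs that aren't big cats at all.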

> It did none of that. What this model did is learn all the frames of video and their chronological order according to the input.

A better explanation is that the network knows what frame to generate given the current frame (and the n previous frames) and the current user input. If it were memorizing, it would have to store an extremely large number of scenarios (the count grows exponentially, since any given frame has k possible actions leading to the next frame). If Sentdex can run the game for an arbitrary length and take arbitrary actions, then it is a far more reasonable explanation that the model is generating the frames rather than memorizing them. Apply Occam's Razor.

Edit: Sentdex said the model was ~173MB, so that is not large enough to memorize the gameplay.
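
To put rough numbers on that argument (the action count and frame resolution below are illustrative guesses, not the project's actual values):

    k = 4                  # distinct input combos per frame (left / right / both / none)
    n = 30 * 10            # frames in just ten seconds of 30 fps play
    sequences = k ** n
    print(f"~10^{len(str(sequences)) - 1} possible 10-second action sequences")

    frame_bytes = 80 * 48 * 3          # one tiny uncompressed RGB frame
    model_bytes = 173 * 1024 ** 2      # ~173MB of weights
    print(model_bytes // frame_bytes, "raw frames would even fit in the weights")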


Maybe I'm misinterpreting, but if you've ever seen a cat freak out about a cucumber (an entire video genre, apparently), ostensibly real intelligences make similar errors.

Beyond rote memorization, it looks like it could be explained by saying the model appears to have found a concept of consonance and dissonance that is bounded within the field of its inputs, and a networked grammar for interacting with the up/down/left/right inputs. Some people might find that technically trivial, but as a layman I am impressed.

The "magic" part is that the response of the network appears to be so complex relative to its inputs, but given the input is so limited from a controller, it's easy to attribute more meaning to it when it is working with a finitely bounded simulated model.

Generally I'd wonder, if the behaviour appears more complex than the stimuli, do we tend to attribute intent to it?


> Hacker News still seems to have a deeply skeptical culture with regard to machine learning

Is... that a bad thing? Skepticism is good. When it's about something as hyped as "deep learning", even more so.


>Is... that a bad thing?

Yes, when it is there for no valid reason, or ridiculous reasons. Skepticism is not a default position you can take like a toddler refusing to eat their vegetables. You need some informed (and non-fallacious) intelligent reasoning behind it. "I'm skeptical about this thing using X because X is so hyped these days" is not such reasoning.


Well, it kind of is. Blockchain has been hyped by charlatans as the cure to all the world's ills. That means when you read something about blockchain you should be especially suspicious.

Similarly, I've read too many people hyping up glorified chatbots as one step below AGI (see the :o reactions to GPT3), so I'm now extra skeptical about claims about machine learning.


"I'm skeptical about this thing using X to do Y because the burden of proof is on people claiming X does Y and historically they have failed to meet that burden"

I don't know what skepticism has to do with ridiculous toddlers - they are almost universally incapable of grasping the nuances of epistemology.


> Skepticism is good

There's skepticism and then there's being a non-expert in a field and talking with high confidence. How do you differentiate these? Conspiracy theorists use the same logic. You're right that skepticism is good, but it is easy to go overboard.


And then there are so-called experts who are charlatans as well. Don't ever forget that possibility.


Sure, but skepticism should decrease when a community of experts is saying the same thing. As an example, anti-vaxxers often claim skepticism and that they have done their own research. The reason we don't trust them is that we think doctors have greater expertise in the subject than they do (it is, either way, a matter of trusting someone). Unless you're a virologist, you probably don't have the expertise to actually verify vaccine claims.

So sure, you are right, but in the context of this discussion you're implying that the vast majority of ML researchers (myself included) are charlatans. I'm not sure what the meaningful difference here is. We're publishing results, people are actively reproducing them, and then some person on the internet that doesn't understand the subject comes along and says "you're full of shit." We can even disprove the claims being made (e.g. I've explained why the network can't be memorizing the game in another comment). That is literally happening in this thread (GAN Theft Auto is in fact a replication/extension effort). Is that meaningfully different from the anti-vaxxers?


I think it's a problem when it turns into being skeptical for the sake of it.

I haven't been on HN too long, but the top comment on most threads is a contrarian one (which I truly appreciate, because it provides a different POV). Sadly, because this is encouraged through high upvotes, the crowd tendency is to regress towards that approach, even when the rigour of the critique is lacking.


>Skepticism is good.

It can be, but it's certainly not an unmitigated good, especially when it leads to aspersions of fraud and conspiratorial thinking (e.g. rasz's comment thread below).


Skepticism is good when it targets bold claims with vague proof. This is not a bold claim (it's a video demo showing the process) and its proof is not vague (you can inspect the source). Skepticism over something like GPT-2 without more than sample output is good. Skepticism over GPT-2 with a workable demo and source is unhelpful.


Funny you mention GPT-2/3, which is by all accounts a glorified chatbot, but which has nevertheless been hyped as one step below AGI by many people.


Has anyone at OpenAI made that claim?



