Score matching with Langevin Sampling: a new contender to GANs

Fede_V · on Sept 3, 2020

Alexia is one of the most original researchers working on GANs: she invented the relativistic discriminator (a brilliant and obvious idea in hindsight: https://arxiv.org/abs/1807.00734) which is one of the easiest tweaks you can make to instantly boost your GAN results.

hyperbovine · on Sept 3, 2020

The better the idea, the more obvious it was in hindsight.

derefr · on Sept 3, 2020

There's gotta be a better word for this than "obvious."

The key thing about ideas that are "brilliant and obvious in hindsight" is that the world was already ready for them, and so nothing needed to change for them to happen; i.e., they don't have any prerequisites that aren't already in place. They "just" needed someone to actually notice that there were some pieces that could be fit together in a novel way.

There's no word I know of that captures this idea of "the world being ready for" the idea, though. Is the idea "incremental" in hindsight? "Elegant" in hindsight? "Free" in hindsight?

nmstoker · on Sept 4, 2020

Apposite?

jimmydorry · on Sept 4, 2020

Apposite: apt in the circumstances or in relation to something.

Neat, I learned a new word.

girvo · on Sept 4, 2020

What a brilliant word.

franklampard · on Sept 3, 2020

Intuitive?

derefr · on Sept 3, 2020

That's very close! But something that's intuitive in hindsight might still have required some hard work in development, rather than "just" a brilliant, novel idea.

Photolithography, for example, is intuitive, and far simpler as a technique for constructing circuit boards than what came before it; but etching circuits using light projected onto light-activated chemicals isn't one of these "the world was ready for it, someone just needed to do it" ideas. Someone thought of it, then needed to do a whole lot of work to get it to happen, finding the right chemicals, experimenting with projection technologies, etc. After the fact, the idea of photolithography is extremely intuitive; but it wasn't a better-term-for-"obvious in retrospect" idea.

fxtentacle · on Sept 3, 2020

In short, this is a super cool approach to replace the discriminator in GANs with something that doesn't need to be trained and provably converges to the correct result.

"256×256 images cannot be done reliably without 8 V100 GPUs or more!"

That's quite sad because that means this approach is far out of reach for any hobby researcher and for most universities.

alpineidyll3 · on Sept 3, 2020

I think there's a pretty high chance that tricks could be brought to bear to greatly reduce the costs of what's being done here (by OOMs). There's a lot of handles for tweaking.

In particular the basic crux of this approach is a Monte-Carlo annealing of the score function. In every field of science, Monte-Carlo sampling entails a huge pre-factor cost, and in every case it only reaches it's greatest potential with importance sampling, which hasn't been applied here.

This is basically the 'brute-force' version of some future approach which would replace the diffusion kernel with another process that allows one to avoid sampling huge volumes of function space which are irrelevant for your desired P(X). This would introduce dependence of the training process on the sampling pre-conditioner, but ultimately be required for highest performance.

I suspect that these authors are already thinking of how to use invertible flows to this effect.

salty_biscuits · on Sept 4, 2020

I've always had this idea that you could do a dumb version of annealing by just progressively increasing the size of samples used in the minibatches. SGD gives you noise on the gradient which should get smaller as you increase the batch size. If you just adjusted the batch size with some sort of MH accept/reject logic you should get a better optimzer for mode seeking than just vanilla SGD. Like all my ideas though it has probably been thought before or is just wrong :) too damn lazy to try it as well.

fxtentacle · on Sept 3, 2020

The way I read it, they actually need the randomness introduced from full Monte-Carlo sampling to make sure that they explore the data distribution sufficiently far in every direction. So it might be that for this approach, importance sampling would distort the results and re-introduce problems like the chance for divergence that GANs are fighting with.

alpineidyll3 · on Sept 4, 2020

The miracle of importance sampling is that it doesn't formally ever prevent you from visiting the rare regions, it just might take longer. What can occur is that you will get what appears to be a well-converged estimate of an answer but really you weren't fully ergodic. Such is the curse of dimensionality. No free lunch theorem, yadda yadda.

marmaduke · on Sept 3, 2020

I just checked (my university account with Dell) and a dual Quadro RTX8000 (so 96 GB GPU memory) is around 10k. A V100 is 32 GB so so a 6xRTX8000 could probably cut it at price of 30k. It’s not cheap but the price of a six month of a post doc in my country.

deeeeeplearning · on Sept 3, 2020

AWS has V100's available so most Universities with a decent budget should be able to swing this.

notsuoh · on Sept 3, 2020

Not even that, spot pricing on an 8 GPU instance (the 16xl I believe, the larger one has the same number of GPUs but more memory per GPU) is something like $6/hr. I use this for personal projects sometimes, get all data in S3, a good launch template, and then spin up a spot instance and be super efficient about training quickly. I've even run evals on a separate, cheaper, machine so the 16xl can spend all its time training. It's still not "cheap", but $50 for 8 hours of training on a machine like that with $64k of GPUs on board is really not bad.

bravura · on Sept 4, 2020

If you wrote a blog post about how to streamline this, I would read it and upvote it.

belval · on Sept 3, 2020

24$/hour for 8 V100 to be exact.

sudosysgen · on Sept 3, 2020

It's interesting how renting 8 GPUs costing 7000$ each costs about the same as renting the services of the average US worker.

smabie · on Sept 4, 2020

24/hour comes to a salary of $49,920, working 8 hours a day, 5 days a week, no vacation, no holidays.

Meanwhile the GPUs cost $56,000 total. So the numbers aren't really very far off.

sudosysgen · on Sept 4, 2020

They kind of are, these GPUs last multiple years. Anyways, it's just an interesting observation.

smabie · on Sept 4, 2020

I mean you also have to take into account electricity, which I assume is pretty expensive.

sudosysgen · on Sept 4, 2020

It's only about 2.5kW, so around 1-2$ per hour for the full system

fxtentacle · on Sept 3, 2020

Except that cloud-ified V100s are significantly less powerful than if you have direct access to the hardware. Last time I checked, in AWS they're actually external devices mapped in over GBit ethernet, which is significantly slower than the 8GB/s that PCIe x4 has.

nl · on Sept 4, 2020

This isn't true. AWS V100s are standard V100s.

I think you are confusing this with AWS Elastic Inference.

If you use AWS Elastic Inference, then you get networked attached devices. But these are Amazon's own (non-NVidia) devices and only used for inference, so it's not really comparable.

https://aws.amazon.com/machine-learning/elastic-inference/fa...

p1esk · on Sept 4, 2020

I routinely switch between AWS 8x V100 instances and on-premise 8x V100 servers and I observe no difference in speed (time per epoch).

Reelin · on Sept 4, 2020

Presumably that depends on maximum PCIe bandwidth consumption before your workload bottlenecks elsewhere? A 2018 benchmark (https://www.pugetsystems.com/labs/hpc/PCIe-X16-vs-X8-with-4-...) seems to indicate that x8 isn't generally a bottleneck for common (at the time) workloads. x8 is a far cry from the claimed gigabit ethernet though!

p1esk · on Sept 4, 2020

AWS is tricky in terms of how storage is provisioned - I don't remember details, but it's easy to put your datasets on storage that is connected to your GPU servers over 1Gb link. That could easily become a bottleneck. Datasets should live on Elastic Block Storage or something like that, over high speed links. Again, it's been a while since I looked into that, so I don't remember the details.

Reelin · on Sept 4, 2020

The earlier comment claimed that the GPUs (!!!) were located elsewhere on the network; I suspect that the scenario you describe is what they intended to refer to.

(IIRC AWS offers compute optimized instances with a volume that's guaranteed to be backed by blocks on a local NVMe drive.)

nl · on Sept 4, 2020

I think they are confused with AWS Elastic Inference. That is a different thing which does have network attached accelerators:

Amazon Elastic inference accelerators are GPU-powered hardware devices that are designed to work with any EC2 instance, Sagemaker instance, or ECS task to accelerate deep learning inference workloads at a low cost. When you launch an EC2 instance or an ECS task with Amazon Elastic Inference, an accelerator is provisioned and attached to the instance over the network.

https://aws.amazon.com/machine-learning/elastic-inference/fa...

marcinzm · on Sept 4, 2020

Citation? AWS GPU instances are cpu+gpu+memory in fixed ratio packages so I don't see why it'd be over ethernet.

twanvl · on Sept 4, 2020

This is not just replacing the discriminator. In this DSM-ALS method you learn a score function instead of a generator, and to generate a sample you need to evaluate the score function multiple times (it essentially gives you a direction to move in).

scoopertrooper · on Sept 4, 2020

Oracle clouds are offering 8 A100s for $3.05 an hour! I can only assume it's some horrible trap, but it sounds like crazy good value!

https://www.oracle.com/corporate/pressrelease/oracle-cloud-i...

rajnathani · on Sept 6, 2020

No, they've listed "$3.05" per GPU. For 8 A100s that would be 8x$3.05 = $24.4/hour.

scoopertrooper · on Sept 7, 2020

Ah missed that.

rjeli · on Sept 3, 2020

Reminds me of JEMs from Grathwohl et al last year [0]. They train a generative model by treating it an an Energy Based Model using stochastic langevin gradients. I’m curious as to how it relates to the models in this post.

[0] https://arxiv.org/abs/1912.03263

GistNoesis · on Sept 4, 2020

From the Pascal Vincent link : "Beware that this usage differs slightly from traditional statistics terminology where score usually refers to the derivative of the log likelihood with respect to parameters,whereas here we are talking about a score with respect to the data."

toxik · on Sept 3, 2020

I wish this essay wasn’t littered with emojis :(

aperrien · on Sept 3, 2020

I learned something new and valuable today. The emoji make no difference at all to me, as they do not affect the quality of what I have learned by any discernible measure.

notsuoh · on Sept 3, 2020

If Geoff Hinton or Francois Chollet add emojis to their writing, it is visually less pleasing in my opinion but I agree with you overall. When it's someone I don't know though, it makes me trust the writing less because I can't strictly verify the contents of what is being discussed easily for work like this, so it does make a difference because I guess it's less "professional" in a way.

sheikheddy · on Sept 4, 2020

I've noticed it more and more in the ML research community, I think it's mostly the influence of twitter and medium articles. For a blog it really is fine though and I'm comfortable with language evolving in an imprecise fashion as long as the emojis don't try to do much more than add flavor.

grp000 · on Sept 4, 2020

Most math heavy content is written very drily. Seeing emojis makes reading it feel more comfy and human. If someone maps the greek alphabet to emojis, you could enjoy the integration on a deeper, more intimate level. The math of emotion!

nl · on Sept 4, 2020

> I wish this essay wasn’t littered with emojis :(

I'm really hoping the inclusion of :( was meant ironically.

(Personally, I'm a fan of personality. Check out Darknet/YOLO creator Joseph Redmon's resume one day.)

toxik · on Sept 5, 2020

I hated the YOLO paper too, very distracting way to write.

And yes the smiley face is meant as an ironic illustration of why exactly it’s distracting. Seems it worked since I got downvoted.

pixelpoet · on Sept 4, 2020

> I wish this essay wasn’t littered with emojis :(

I wish people still used subjunctive mood (in this case, "weren't" instead of "wasn't"); we don't always get our personal preferences, and that's okay.

scott31 · on Sept 3, 2020

natch · on Sept 4, 2020

Someday the field will figure out that not all images of interest are squares. That will be a great advance. I realize some people have hacked up personal branches of projects to support non-square rectangles but it really needs to become mainstream or else we’re going to stay stuck in this “AI square winter.”

p1esk · on Sept 4, 2020

I don't get it - is there anything preventing you using non-square images today?

natch · on Sept 4, 2020

Just the massive inconvenience of having to track down and implement the changes necessary to add non-square image support on most platforms. So if you think that is a minor thing that can be dismissed as nothing, then no. Or one can always crop, or resize, both of which throw away information, or pad, for which the underlying behavior is undocumented. I just don’t think any of these are great options. Do you?

If the answer is “you can always write your own” that is true, but it’s just underlining my point that the problem is not yet solved.

p1esk · on Sept 4, 2020

support on most platforms

What do you mean by "platforms"?

natch · on Sept 5, 2020

Deep learning frameworks: Keras, Pytorch, Tensorflow, Core ML, etc.

p1esk · on Sept 8, 2020

You might be confused. There's nothing in any of those platforms that dictate you use square images.

natch · on Sept 8, 2020

It's possible I am confused, yes. But I'm just going by what is in all the documentation and tutorials I have encountered.

It may have been solved in a lab somewhere, but the solution hasn't made it out into code usable by mere mortals, as far as I can tell. You may be applying a very special meaning of the words "nothing" and "dictate" it's just that it's very well hidden how to do this.

I'm not alone. Here are examples of other people struggling with non-square images and not succeeding:

https://stats.stackexchange.com/questions/240690/non-square-...

https://github.com/tanakataiki/ssd_kerasV2/issues/10

https://github.com/allanzelener/YAD2K/issues/51

https://github.com/eriklindernoren/PyTorch-YOLOv3/issues/277

https://stackoverflow.com/questions/49893741/tensorflow-cnn-...

https://github.com/ml-hongkong/keras-transfer-learning-for-o...

p1esk · on Sept 8, 2020

You are confusing platforms with individual models. A platform, such as Pytorch or Tensorflow, does not care what shape is your input. You design and train a model for the whatever inputs you want. On the other hand, if someone trained a specific model architecture (e.g. YOLOv3 or Resnet-50) on some dataset of square images, then yes, this particular pre-trained model will expect the same input size and shape as what it was trained on. Does it make sense? If you take a beginner level course on deep learning (e.g. Stanford CS231n or FastAI course) you will immediately realize there's nothing (in any sense of the word) that prevents you from using any input shape to train your model.

If you don't want to learn the tools you use, you will need to find someone who will train a model on your images. If you're willing to pay for it you will find plenty of help. However, if you want to neither learn nor pay, then what exactly are you complaining about?

janhenr · on Sept 4, 2020

No, in production rectangle images are reshaped and resized to square ones.

ebalit · on Sept 4, 2020

If you want to be able to run inference in batches, you have to normalize to a given input size. It often means a square input size if you want to minimally distort both vertical and horizontal images.

You can also use variable input shape in production if you need to. It will be harder to optimize, especially for throughput, but it's not impossible.