Imitating Deep Learning Dynamics via Stochastic Differential Equations (arxiv.org)
122 points by jiayaozhang 44 days ago | 21 comments

You probably want to compress your figures - I think your line plots are stored in some vector format. The paper is 30MB and rendering chokes on those ultra dense figures (and the data resolution is not buying you any information). If the figures are vector format you should convert to png/jpeg etc.

Thanks for the heads-up! Yes, we use vector graphics in PDF format for all figures (output from matplotlib). I overlooked the size of the arXiv version, but my local compilation is around 14.5MB (pdflatex under TeX Live 2020). I will check what went wrong :)

You should keep the vector graphics for research papers; it is really handy that people can zoom in. The issue here is just how much information is in your plots. E.g. in Figure 3, each subplot contains a lot more information than is actually needed. You could use moving averages there to make the same point clearly. Or, if you wanted to, you could include the envelopes. That would produce smaller images. The other option is to just make the plots sparser. There's far more information than necessary there.
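To make the moving-average and envelope suggestions concrete, here is a minimal sketch in plain Python. The function names `moving_average` and `envelope` and the window/bucket sizes are my own placeholders, not anything from the paper's plotting code; the idea is only that a dense training curve can be reduced to far fewer points before it is handed to the plotting library.

```python
def moving_average(values, window):
    """Sliding mean over `window` consecutive samples.

    Returns len(values) - window + 1 points.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    total = sum(values[:window])
    out = [total / window]
    for i in range(window, len(values)):
        # Slide the window: add the new sample, drop the oldest one.
        total += values[i] - values[i - window]
        out.append(total / window)
    return out


def envelope(values, bucket):
    """Split the series into buckets of `bucket` samples and keep
    one (min, max) pair per bucket."""
    if bucket < 1:
        raise ValueError("bucket must be at least 1")
    return [
        (min(values[i:i + bucket]), max(values[i:i + bucket]))
        for i in range(0, len(values), bucket)
    ]
```

Plotting the smoothed curve as a line and the envelope as a shaded band keeps the figure in vector format while cutting the number of path segments by roughly the bucket size.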

It's weird, but images often sell a paper, so it is worth taking the time to learn how to make good images (or at least having one person in your group do so). That said, research comes first and is more important. Just don't underestimate the power of good images. I often find this video on colors helpful[0]

[0] https://www.youtube.com/watch?v=Qj1FK8n7WgY

Maybe pdf graphics would cost less? Keeping it scalable would be great

I'd say this is a problem with the PDF reader, not the paper.

Why is a PDF reader having trouble displaying 30MB of data?

It's even better than 30MB of images, because images have to be decoded, whereas vector graphics are just bytes at some point and you decide how to render them.

It's probably some accidentally quadratic behavior that struggles with 10000 vector objects.

Vector graphics are much more complicated than raster images. They have to be drawn sequentially (so as not to mess up draw and blending order), mostly on the CPU with the current software stack. And no, at 30MB you are looking at about 100000 paths, each of which consists of Bézier curves that need to be converted into polylines before drawing (there are algorithms that handle curves directly, but they are just as expensive). It's a vast amount of work.

Images are relatively well handled by GPUs, and with a precomputed mipmap they can be rescaled very quickly, unlike vector graphics, which need to be re-rendered each time the zoom level changes.
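For anyone wondering what "converted into polylines" involves, below is a rough sketch of flattening one cubic Bézier segment. It samples the curve at evenly spaced parameter values, which is a simplification I chose for brevity; real renderers typically subdivide adaptively based on curvature and a pixel tolerance. The names `cubic_bezier_point` and `flatten_cubic_bezier` are illustrative only.

```python
def cubic_bezier_point(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]
    using the Bernstein polynomial form."""
    u = 1.0 - t
    b0 = u * u * u
    b1 = 3.0 * u * u * t
    b2 = 3.0 * u * t * t
    b3 = t * t * t
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def flatten_cubic_bezier(p0, p1, p2, p3, segments=16):
    """Approximate the curve by a polyline with `segments` straight pieces
    (uniform sampling in t, so segments + 1 points are returned)."""
    if segments < 1:
        raise ValueError("segments must be at least 1")
    return [
        cubic_bezier_point(p0, p1, p2, p3, i / segments)
        for i in range(segments + 1)
    ]
```

Each curve turns into `segments + 1` points, so 100000 paths with several curves each quickly add up to millions of line segments that must be rasterized in order.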

Yeah, handling 100k paths on CPU means the problem is in the PDF reader, not the PDF. Although I must admit that calculating 100000 paths and getting polylines still shouldn't take very long.

Why wouldn't you be able to send an array of stuff defining the curves, and an array of stuff defining the draw order, to the GPU and just render it simply?

Hello HackerNews! Author here :)

TL;DR: We devise a linear SDE/ODE model to imitate the per-class feature (think logits) dynamics of neural net training, based on local elasticity (LE) [1]. We found that the emergence of LE implies linear separation of features from different classes as training progresses.

The drift matrix of our model has a relatively simple structure; with that estimated, we can simulate the SDE using the forward Euler method, whose results align reasonably well with genuine dynamics.
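As a generic illustration of the simulation step (not the paper's actual model), here is a minimal forward Euler sketch, usually called Euler-Maruyama in the SDE setting, for a linear SDE of the form dX = A X dt + sigma dW. The drift matrix `A`, the scalar noise level `sigma`, and the function name `simulate_linear_sde` are all assumptions made for this example; the paper estimates its drift matrix from real training runs and its noise term may be structured differently.

```python
import math
import random


def simulate_linear_sde(A, x0, sigma, dt, steps, seed=0):
    """Euler-Maruyama simulation of dX = A X dt + sigma dW.

    A      : drift matrix as a list of rows (assumed constant here)
    x0     : initial state as a list of floats
    sigma  : scalar noise level (an assumption; independent noise per coordinate)
    Returns the whole path as a list of states, including x0.
    """
    rng = random.Random(seed)
    n = len(x0)
    x = list(x0)
    path = [list(x)]
    for _ in range(steps):
        # Deterministic part: A @ x
        drift = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Brownian increment has standard deviation sqrt(dt)
        x = [
            x[i] + drift[i] * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            for i in range(n)
        ]
        path.append(list(x))
    return path
```

Setting `sigma=0` recovers the plain forward Euler scheme for the ODE dX/dt = A X, which is a handy sanity check before adding noise.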

Local elasticity models a phenomenon observed in DNN training: the effect of training on a sample is greater for samples from the same class and smaller for samples from different classes. For example, training on an image of a cat helps the model learn images of other cats better, but not so much images of, say, dogs.

Any comments/thoughts/questions are most welcome!

[0] https://arxiv.org/abs/2110.05960 [1] https://arxiv.org/abs/1910.06943

The paper is a bit over my head. Are there any findings with respect to the phenomenon of Deep Double Descent [0], or the more recent grokking phenomenon [1]?

[0] https://arxiv.org/abs/1912.02292 [1] https://mathai-iclr.github.io/papers/papers/MATHAI_29_paper....

I thought the main selling point of deep learning is that it finds non-linear connections in the data. Isn't it surprising that the implication is a linear separation of features?

With regard to the class conditioned regime they are experimenting with; this is merely attempting to explain more precisely _how_ deep nets are able to distinguish features between classes. We already know that they do; but we lack a detailed model of precisely why they do and my understanding is that many base assumptions made by e.g. statistics will not help you at all with neural networks (for instance, overparameterization leading to better performance on out-of-corpus rather than overfitting).

> many base assumptions made by e.g. statistics will not help you at all with neural networks

There has been a lot of hate against classic statistics on HN lately. I don't know why. Does it help to understand why and how NNs work? Not yet. But saying that it is utterly useless and won't provide any useful insights in the future sounds to me like telling the young Steve Jobs that dropping out of college and taking calligraphy classes instead is utterly useless. And still, I am writing this on an Apple product, which set the standards for digital typography...

I have actually seen a fair amount of respect for classical stats on HN; perhaps it is due to my attention filter.

I'm thinking of some wonderful posts describing where/when/why linear regression can offer performance which is very close to the best from a NN-- except that regression models train much faster on much less data AND are interpretable.

Some discussion about how a NN works well for data where there is a lot of (statistical) structure to the data-- two close pixels in an image are likely to have very similar color/luminosity (and if not, the difference is important to the model, i.e. an 'edge'). But that NN don't do as well in a domain where the different features of the data don't have such relationships, say an econometric model or many biological models or ...

I'm not an expert, but statistics may be the branch of mathematics we wind up using to solve the unknowns of machine learning. I have no hate for classic statistics, sorry if I gave that impression.

> I have no hate for classic statistics, sorry if I gave that impression.

You didn't, but a lot of HNlers do. Maybe I should rant on them, that's true.

> I thought the main selling point of deep learning that it finds non-linear connections in the data.

Agreed. Yet there are unknown feature spaces in which the deep nets separate the data linearly; the mapping learnt by the deep nets from the data to this space is likely highly nonlinear.

For example, a binary logistic regression can be thought of as mapping data to a 1D space ([0, 1]) and separating the two classes linearly at, say, 1/2.
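A tiny sketch of that logistic regression picture, with made-up weights (`w`, `b` and the function names are placeholders, not fitted values): the sigmoid squashes the linear score w·x + b into [0, 1], and thresholding at 1/2 is the same as checking the sign of the linear score.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """Map the input x to the 1D space [0, 1] via the linear score w.x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(score)


def predict_class(w, b, x):
    """Separate the two classes at 1/2, i.e. at w.x + b = 0 in input space."""
    return 1 if predict_proba(w, b, x) >= 0.5 else 0
```

In a deep net the analogue of `x` would be the learned features rather than the raw input, which is where the nonlinearity lives.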

Ok thanks for the example, so what you're saying is that this research tries to explain how it separates the data linearly starting from this unknown feature space?

This reminded me of Invariant Risk Minimization (IRM) (https://arxiv.org/pdf/1907.02893.pdf) due to the linear bound being sufficient to control the features.

Do you have any comments/insight into how you’d say they’re similar/different? Thanks!

Thanks for sharing the interesting IRM paper!

Will read and be back for discussion hopefully soon.

does it have direct application for training nets faster? i.e. can the ode be integrated faster than backprop?

> does it have direct application for training nets faster?

That's a great question! Unfortunately not yet, though we believe further studies may eventually get us there. We found that (at least for simple classification tasks) the features seem to have a two-stage behavior: a de-randomization stage that identifies the best directions in the feature space, and an amplification stage in which the features stretch along these directions.

We've been thinking of identifying a bound on the exit time of the first stage and examining how it depends on different hyper-parameters, dataset properties, etc., so that one may pinpoint how to reduce the time spent in the first stage, effectively making training faster.

> i.e. can the ode be integrated faster than backprop?

Also a good question. At this stage we need to estimate the model parameters (the drift matrix) from simulations on DNNs. As future work, we hope to explore whether we can pre-determine those parameters, so that a comparison with backprop might make more sense.
