
Another post of the "If it makes you feel any better" type: I'm a relatively established researcher in NLP, having worked with a variety of methods from theoretical to empirical, publishing in the top venues with decent frequency, and still I'm having a really hard time getting into the deep learning (DL) stuff.

I'm training a sequence-to-sequence model and have been tuning hyperparameters for the last 2-3 months. I'm making progress, but painfully slowly, for two reasons. One is the long time it takes to train and test models (I have a local Titan X and some Tesla K80s in a remote cluster, to which I can send models expecting a latency of 3-4 days of queue and a throughput of around 4 models running simultaneously on average - probably more than many people can get, but it still feels slow for this purpose). The other is the fact that hyperparameter optimization seems to be little more than blind guessing with some very rough rules of thumb. The randomness also doesn't help, as running the same model with different random seeds I have noticed that there is huge variance in accuracy. So sometimes I tweak a parameter and get improvements, but who knows if they are significant or just luck with the initialization. I would have to run every experiment with a bunch of seeds to be sure, but that would mean waiting even longer for results, and my research would be old before I got to state-of-the-art accuracy.
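To make the "bunch of seeds" idea concrete, here is a minimal sketch of comparing two hyperparameter settings over several seeds. Everything in it is an assumption for illustration: `train_and_evaluate` is a hypothetical stand-in that returns a made-up noisy accuracy instead of training anything, and the effect size and noise level are invented numbers.

```python
import random
import statistics

def train_and_evaluate(hidden_size, seed):
    """Hypothetical stand-in for a multi-day training run; returns a fake accuracy."""
    rng = random.Random(seed * 100003 + hidden_size)
    true_accuracy = 0.70 + 0.00002 * hidden_size   # assumed small real effect
    return true_accuracy + rng.gauss(0.0, 0.02)    # assumed seed-to-seed noise

def summarize(hidden_size, seeds):
    """Mean and sample standard deviation of accuracy over several seeds."""
    scores = [train_and_evaluate(hidden_size, s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)

seeds = list(range(10))
mean_a, std_a = summarize(400, seeds)
mean_b, std_b = summarize(600, seeds)

# Standard error of the difference between the two means.
stderr = (std_a ** 2 / len(seeds) + std_b ** 2 / len(seeds)) ** 0.5
gap = mean_b - mean_a
print(f"400 cells: {mean_a:.3f} +/- {std_a:.3f}")
print(f"600 cells: {mean_b:.3f} +/- {std_b:.3f}")
print(f"gap = {gap:.3f}, standard error = {stderr:.3f}")
```

If the gap is not clearly larger than its standard error, the "improvement" from the tweak could easily be luck with the initialization.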

Maybe I'm just not good at it and I'm a bit bitter, but my feeling is that this DL revolution is turning research in my area from a battle of brain power and ingenuity into a battle of GPU power and economic means (in fact my brain doesn't work much in this research project, as it spends most of the time waiting for results from some GPU - fortunately I have a lot of other non-DL research to do in parallel so the brain doesn't get bored). Along the same lines, I can't help but notice that most of the top DL NLP papers come from a very select few institutions with huge resources (even though there are heroic exceptions). This doesn't happen as much with non-DL papers.

Good thing that there is still plenty of non-DL research to do, and if DL takes over the whole empirical arena, I'm not bad at theoretical research...

> The randomness also doesn't help, as running the same model with different random seeds I have noticed that there is huge variance in accuracy.

Set all your random seeds to something predefined, such as 42. Even though the exact randomness is OS-specific, this will at least help you tell lucky runs apart from real hyperparameter improvements.
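A minimal sketch of what "set all your random seeds" could look like in Python. The helper name `set_all_seeds` is made up for illustration; it only covers the standard library and, if installed, NumPy's global generator. Whatever DL framework is in use has its own seeding call, which is not shown here.

```python
import random

def set_all_seeds(seed=42):
    """Seed the generators this sketch knows about (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np      # optional: only seeded if NumPy is installed
        np.random.seed(seed)
    except ImportError:
        pass
    # The DL framework's own seeding call would go here as well.

set_all_seeds(42)
first = [random.random() for _ in range(3)]
set_all_seeds(42)
second = [random.random() for _ in range(3)]
print(first == second)  # same seed, same sequence
```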

I already do that for reproducibility reasons, but I don't really think it takes luck out of the equation. 42 may be a great seed for a model with 400 cells per layer and a terrible seed for a model with 600 cells per layer, as the different layout will lead to a totally different distribution of the weights even if the seed remains the same.
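A toy illustration of the layout point, assuming the weights are drawn one after another from a single seeded stream and filled row by row (the function name and the tiny sizes are made up; real frameworks may fill matrices differently):

```python
import random

def init_matrix(n_in, n_out, seed=42):
    """Fill an n_in x n_out weight matrix row by row from one seeded stream."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]

def incoming_weights(matrix, unit):
    """Column `unit` of the matrix: the weights feeding one hidden unit."""
    return [row[unit] for row in matrix]

small = init_matrix(3, 4)   # stands in for the 400-cell layer
large = init_matrix(3, 6)   # stands in for the 600-cell layer

# The first row starts out identical, since both read the same stream...
print(small[0] == large[0][:4])
# ...but from the second row on the stream is "out of phase", so the same
# hidden unit ends up with different incoming weights in the two models.
print(incoming_weights(small, 0) == incoming_weights(large, 0))
```

So even with the seed fixed, the two models do not share a starting point in any meaningful sense.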

Indeed, but if performance is affected so much by the initialisation, then I would avoid random initialisation in the first place. There are various publications exploring different initialisation methods for various problems.

I'm afraid that I cannot go into more specific details right now, but you can get more stable training and faster convergence with a better initialisation strategy.

Couldn't you use an initialization pattern that includes all the weights of the smaller layer in the larger layer? This would keep the behavior of a subset of units exactly the same, at least at initialization time.
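One way such a pattern might be sketched is to give every hidden unit its own generator, derived from the base seed and the unit index, so that unit j gets the same incoming weights no matter how wide the layer is. The function name and the seed-mixing formula are assumptions for illustration, and this only covers one weight matrix with a fixed input size:

```python
import random

def init_nested(n_in, n_out, seed=42):
    """Return a list of columns; columns[j] holds the incoming weights of unit j."""
    columns = []
    for unit in range(n_out):
        # Hypothetical per-unit seed: depends on the unit index, not on n_out.
        rng = random.Random(seed * 1_000_003 + unit)
        columns.append([rng.uniform(-0.1, 0.1) for _ in range(n_in)])
    return columns

small = init_nested(3, 4)
large = init_nested(3, 6)

# The first 4 units of the wider layer start with exactly the smaller layer's weights.
print(large[:4] == small)
```

This only pins down the starting point; once training begins, the extra units change the gradients, so the shared units will drift apart between the two models.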

Maybe I'm missing something here but if the problem is that different starting positions return different answers when they should all converge to the same one -- doesn't that mean that there's a fundamental problem with the actions being taken?

Fixing the starting randomness seems like that old adage about the man with two clocks who doesn't know what time it is, so he throws one of them away.

There is a fundamental problem in general (not in my particular approach) which is that we don't know how to do non-convex optimization. There are many problems where it's just not possible with our current techniques to know if a minimum is local or global.

Of course, with better tuning one can obtain better optima in general (that's what the whole field is doing) and it's possible that I'm not applying the best techniques and I would get less randomness if I had a better model. But as far as I understand, even the best models can converge to different local optima depending on initialization.
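A one-dimensional toy example of the same effect, using plain gradient descent on a made-up non-convex function with two minima. The function, learning rate, step count and starting points are all chosen for illustration and have nothing to do with the actual model discussed above.

```python
def f(x):
    """Toy non-convex loss with one global and one local minimum."""
    return x ** 4 - 3 * x ** 2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain gradient descent from starting point x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

starts = [-2.0, -0.5, 0.5, 2.0]
results = [gradient_descent(x0) for x0 in starts]
for x0, x in zip(starts, results):
    print(f"start {x0:+.1f} -> x = {x:+.3f}, f(x) = {f(x):+.3f}")
```

The two negative starts settle near x = -1.30 (the global minimum), while the two positive starts settle near x = 1.13, a worse local minimum. Nothing in the final gradient tells the optimizer which of the two it has found.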
