
The RWKV model seems really cool. If you could get transformer-like performance with an RNN, the “hard coded” context length problem might go away. (That said, RNNs famously have infinite context in theory and very short context in reality.)

Is there a primer on what RWKV does differently? According to the GitHub page, the key seems to be multiple channels of state with different decay rates, giving, I assume, a combination of short- and long-term memory. But isn’t that what LSTMs were supposed to do too?
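To check my reading of that, the per-channel decay idea would be something like this toy sketch (my own simplification, not the actual RWKV recurrence; all the names here are made up):

    import torch

    # Toy per-channel exponentially decaying state, loosely inspired by the
    # "different decay rate per channel" description; NOT the real RWKV math.
    d = 8                                  # number of channels
    decay = torch.rand(d)                  # per-channel decay in (0, 1):
                                           # near 1 = long memory, near 0 = short memory
    state = torch.zeros(d)

    def step(state, x):
        # each channel forgets at its own rate and accumulates the new input
        return decay * state + (1 - decay) * x

    seq = torch.randn(16, d)               # toy sequence of 16 timesteps
    for x in seq:
        state = step(state, x)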



There's already research that tries to fix this problem with transformers in general, like Transformer-XL [1]. I'm a bit puzzled that I don't see much interest in getting a pre-trained model out that uses this architecture; it seems to give good results.

[1]: https://arxiv.org/abs/1901.02860


T5 uses relative positional encoding


My understanding is that RNNs aren't worse than Transformers per se, they are just slower to train and use the GPU much less efficiently, i.e. with a transformer much more can be run in parallel.
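Toy illustration of the difference (a sketch, not real training code):

    import torch

    T, d = 128, 64
    x = torch.randn(T, d)

    # RNN-style: step t depends on step t-1, so the loop is inherently sequential.
    W = torch.randn(d, d) * 0.01
    h = torch.zeros(d)
    for t in range(T):
        h = torch.tanh(x[t] + h @ W)

    # Attention-style: every position attends to every other position in one
    # batched matmul, which is what makes transformer training so parallel.
    attn = torch.softmax(x @ x.T / d ** 0.5, dim=-1)
    out = attn @ x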


Also slower to perform inference on. RNNs have to be much more sequential.


We also don't have evidence that they scale the way transformers do



> RNNs famously have infinite context in theory and very short context in reality.

Any sources to read more about this, please? It's the first I've heard of it.


Read about "RNN Vanishing Gradients". LSTMs help here, but see e.g. https://medium.com/analytics-vidhya/why-are-lstms-struggling... for the problems there.
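You can see the effect with a toy experiment like this (just a sketch, not from the article; exact numbers vary run to run):

    import torch

    # Backprop through many tanh RNN steps and compare gradient magnitudes:
    # the gradient w.r.t. the first input is typically orders of magnitude
    # smaller than the gradient w.r.t. the most recent input.
    T, d = 100, 16
    W = torch.randn(d, d) * 0.1
    xs = [torch.randn(d, requires_grad=True) for _ in range(T)]

    h = torch.zeros(d)
    for x in xs:
        h = torch.tanh(x + h @ W)

    h.sum().backward()
    print(xs[-1].grad.norm())   # recent input: healthy gradient
    print(xs[0].grad.norm())    # first input: vanishingly small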


My understanding is that an LSTM is a kind of RNN.


Yes it is. They were developed to fix the vanishing gradient problem.

The 1997 paper where they were introduced puts it like this:

Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

https://www.researchgate.net/publication/13853244_Long_Short...

Usually they aren't competitive with transformers on long-range understanding problems though.
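In code, the gating looks roughly like this (a textbook-style sketch, not any particular library's implementation); the additive update of the cell state is the "constant error carousel":

    import torch
    import torch.nn as nn

    class LSTMCell(nn.Module):
        def __init__(self, d_in, d_hidden):
            super().__init__()
            self.lin = nn.Linear(d_in + d_hidden, 4 * d_hidden)

        def forward(self, x, h, c):
            i, f, o, g = self.lin(torch.cat([x, h], dim=-1)).chunk(4, dim=-1)
            i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
            # additive cell-state path: gradients can flow through c without
            # being repeatedly squashed, which is what fights vanishing gradients
            c = f * c + i * torch.tanh(g)
            h = o * torch.tanh(c)
            return h, c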


I am not sure this article will answer your question, but Karpathy has an article about RNNs.

https://karpathy.github.io/2015/05/21/rnn-effectiveness


It doesn't touch on the "infinite context in theory and very short context in reality" piece, which is what I was asking about.


I can confirm it from what we’re seeing on a video prediction task. Future frames end up blurry. The first frame is sharp, but by frame 3 it’s only crisp when it’s very certain of its prediction. Any kind of rare movement, it goes “I kinda know what it roughly looks like” and smears fingerpaint all over the canvas.

The overall trajectory looks ok, so I’ll be more rigorously investigating whether it’s possible to squeeze more precise context out of it. For example, since the first frame is sharp, you could discard the other future frames and use that first frame as the last history entry (rolling completion window). If “the first frame is always sharp” is true, then it seems reasonable that you can generate N sharp frames with that technique, which might work better than predicting N all at once.
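Roughly what I have in mind, as a sketch (`model` is just a stand-in for whatever produces the future frames):

    # Predict N future frames, keep only the (sharp) first one, append it to
    # the history, and repeat: a rolling completion window.
    def rollout(model, history, n_frames):
        generated = []
        for _ in range(n_frames):
            future = model(history)          # model predicts several future frames
            first = future[0]                # only the first tends to stay sharp
            generated.append(first)
            history = history[1:] + [first]  # slide the window forward by one frame
        return generated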


You might also mess with your loss function to force it to "make up its mind", as right now the blurry mess likely minimizes the error from the actual frame (which isn't really what you want).
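For example, something like plain MSE plus an image-gradient term (just a sketch of the general idea; the weights and exact terms are up for grabs):

    import torch.nn.functional as F

    def sharpness_aware_loss(pred, target):
        # plain per-pixel MSE is minimized by the average of all plausible
        # frames, i.e. exactly the blur; the gradient-difference term
        # penalizes that averaging and rewards committing to edges
        mse = F.mse_loss(pred, target)
        dx = lambda im: im[..., :, 1:] - im[..., :, :-1]
        dy = lambda im: im[..., 1:, :] - im[..., :-1, :]
        gdl = F.l1_loss(dx(pred), dx(target)) + F.l1_loss(dy(pred), dy(target))
        return mse + gdl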


Exactly! That's precisely what I was trying to figure out how to do.

Got any ideas? There are discriminators, but after reading over prior work, it seems they help without being a particularly groundbreaking or effective solution.

I had two harebrained ideas in mind. One is to add YOLO-style object detection. The difference between a blurry mess and a recognizable object is the fact that it's a recognizable object, so minimizing the error w.r.t. YOLO might work. ("If there are more recognizable objects in the ground truth image than in the generated image, penalize the network." Rough sketch at the end of this comment.)

The other was to try to make some kind of physics-based prediction of the world: if it knows roughly where a street is, or where a wall is relative to an object, then it'll likely be less confused when generating objects. That idea is very nascent, but right now I'm attacking it by trying to get our RNN to predict an n-body simulation (two or three 2D circles that have a gravitational effect on each other, with bouncing when they collide). The RNN is surprisingly okay at that, even though it's only examining pixels, but it gets blurry. I was going to try to get it to spit out actual predictions of position, velocity, acceleration, and radius, in the hope that it'll make the connection "I know there's a ball flying along this trajectory, so obviously it should still be there 3 frames from now."

It seems like the more traditional solution is to add a loss term related to the optical flow of the image (displacement from the previous frame to current), or to do foreground/background segmentation masks and have it focus only on the foreground. Both of those feel like partial solutions though, and it feels like there should be some general way to “force it to make up its mind,” as you say. So if you have any oddball ideas (or professional solutions), I’d love to hear!
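For the YOLO-count idea above, I'm imagining something like this (rough sketch; `detector` is a hypothetical frozen, pretrained detector returning per-box confidences, and summing confidences is a soft stand-in for an actual object count, since counting isn't differentiable):

    import torch

    def object_count_penalty(detector, pred_frame, true_frame):
        pred_conf = detector(pred_frame)   # hypothetical: tensor of box confidences
        true_conf = detector(true_frame)
        gap = true_conf.sum() - pred_conf.sum()
        # only penalize when the prediction "contains fewer objects" than the ground truth
        return torch.relu(gap)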


Have you checked the RSSM approach in DreamerV1, V2, V3, and PlaNet? It uses a deterministic state (a GRU hidden state) and discrete stochastic latent states. The deterministic and stochastic (sampled) latent states together are used to predict the next state. I think the stochastic state might help with your problem a bit.
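Very roughly, the state update looks something like this (my own simplified sketch with made-up names; the real models also learn a posterior from observations and push gradients through the sample with a straight-through trick):

    import torch
    import torch.nn as nn

    class TinyRSSM(nn.Module):
        def __init__(self, d_det=200, d_stoch=32, n_classes=32, d_act=8):
            super().__init__()
            self.gru = nn.GRUCell(d_stoch * n_classes + d_act, d_det)
            self.prior_logits = nn.Linear(d_det, d_stoch * n_classes)
            self.d_stoch, self.n_classes = d_stoch, n_classes

        def step(self, h, z, action):
            # deterministic path: a GRU carries history forward
            h = self.gru(torch.cat([z.flatten(-2), action], dim=-1), h)
            # stochastic path: sample discrete latents from the prior over z
            logits = self.prior_logits(h).view(-1, self.d_stoch, self.n_classes)
            z = torch.distributions.OneHotCategorical(logits=logits).sample()
            return h, z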


Dear mystery HN’er, thank you so much. I hadn’t heard about RSSM, and your explanation was wonderfully helpful.

Much appreciated. Have a great weekend :)


Naive RNNs have vanishing gradients, but LSTMs and GRUs are much better in this respect.


While this is true, and was a major advantage of LSTMs/GRUs, they still suffer from vanishing gradients.

W.r.t. proteins, our sequences often surpass 1500 amino acids, and that is really tough for an LSTM to train on stably.



