
I'm glad you found it valuable! Both are good questions, and I haven't gone far enough in mapping the code to Elman's architecture to know the answer to the second.

For your first question, using three hidden layers makes it a little clearer what the network does. Each layer performs one step of the calculation. The first layer collects what is known from the current token together with what we knew after processing the previous token. The second layer decides whether the current token looks like program code, by checking whether it satisfies the decision rule. The third layer compares that decision with the decisions made for previous tokens.
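
To make that concrete, here is a minimal NumPy sketch of one recurrent step with the three roles described above. This is not the actual code from the post; the weights are random placeholders and the layer sizes are made up:

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_h = 16, 8                         # hypothetical embedding/hidden sizes

    W_in   = rng.normal(size=(d_h, d_in))     # layer 1: read the current token
    W_rec  = rng.normal(size=(d_h, d_h))      # layer 1: read the carried state
    W_rule = rng.normal(size=(d_h, d_h))      # layer 2: the decision rule
    W_cmp  = rng.normal(size=(d_h, 2 * d_h))  # layer 3: compare with the past

    def relu(x):
        return np.maximum(0.0, x)

    def step(x_t, h_prev):
        h1 = relu(W_in @ x_t + W_rec @ h_prev)           # collect current + past
        h2 = relu(W_rule @ h1)                           # does this look like code?
        h3 = relu(W_cmp @ np.concatenate([h2, h_prev]))  # reconcile with history
        return h3

    # run over a sequence of (random stand-in) token embeddings
    h = np.zeros(d_h)
    for x_t in rng.normal(size=(5, d_in)):
        h = step(x_t, h)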

I think this could be compressed into a single hidden layer, too. A ReLU should be good enough at capturing the non-linearity, so this should work.
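
Something like this, again just a sketch with made-up sizes, would fold the whole step into one ReLU layer that reads the token and the carried state together:

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_h = 16, 8                       # hypothetical sizes, as above
    W = rng.normal(size=(d_h, d_in + d_h))  # one weight matrix for token + state

    def relu(x):
        return np.maximum(0.0, x)

    def step(x_t, h_prev):
        # a single ReLU layer over the concatenated token and previous state
        return relu(W @ np.concatenate([x_t, h_prev]))

    h = np.zeros(d_h)
    for x_t in rng.normal(size=(5, d_in)):
        h = step(x_t, h)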


Ah, that makes sense. So two of the hidden layers act more as "memory" or "buffers", and the rule itself is implemented in just one layer, at least for a single token.


This area is covered by non-parametric statistics more generally. There are many other methods for non-parametrically estimating functions (ones that satisfy some regularity conditions). Tree-based methods are one such family, and the consensus still seems to be that they perform better than neural networks on tabular data. For example:

https://arxiv.org/abs/2106.03253
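
As a quick illustration of that kind of comparison (on synthetic data, with arbitrary hyperparameters; this is not the paper's benchmark):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import cross_val_score

    # a made-up tabular classification problem
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    for name, model in [
        ("gradient boosting", GradientBoostingClassifier(random_state=0)),
        ("MLP", MLPClassifier(max_iter=1000, random_state=0)),
    ]:
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: mean accuracy {scores.mean():.3f}")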


