Depends on personality I guess. That would be sooo unsatisfying to me. E.g. not wanting to accept that languages have exceptions "just because" is what got me interested in historical linguistics as a young lad.
Yep, we can see a lot of people here have had little experience in raising children. One kid will just seem to naturally say "I accept that", and another will be like "f you, I don't do what you tell me", even when they're born a year apart and raised in the same household. Nurture can moderate these behaviors, but nature is strong.
I was more referring to the children that compare themselves to other children, or differences between what they have/are allowed to do and what others can.
It's why I prefaced it with "It can".
Every child is different, but a big factor is whether each parent has dealt with the normal childhood stuff every parent can carry, plus anything extra. Latent reactivity can be modelled and passed on.
At least for Nix itself, that's pretty much it except via Dutch.
> The name Nix is derived from the Dutch word niks, meaning nothing; build actions do not see anything that has not been explicitly declared as an input
Also, I think the founder's username in various places is nixnut, which to an English-only speaker means someone crazy about Nix (a Nix fan). However, in Dutch 'niksnut' or 'nietsnut' loosely translates to 'bum'.
I have that forked, as well as a fork of funscii. Both have fixes in the main branch. I've added a fair amount of stuff beyond that in a branch of unscii.
It's the daydreaming/mind-wandering state that occurs when you're not focused on an external task. With all the stimuli of the modern world, I feel like we're being starved of crucial DMN time if we don't engineer conditions like the ones you describe.
It reminds me of how LLM hallucination is attributed to "I don't know" being underrepresented in training data, and it being a better strategy to guess on evaluations rather than admit not knowing.
Different reward function, but the same behaviour emerges.
We'll see that improve as people move on to synthetic training data, something that is possible now that we have sufficiently smart LLMs to create enough of it.
The idea is that you generate fake LLM transcripts using your classical training data. E.g. look at some training data and generate Q/A transcripts. Or generate random questions, RAG against your whole dataset and look for relevant stuff; if there is nothing there, train an "I don't know." reply.
A moderately sized LLM operating some tools to access more information behind the scenes, perform tests and correct its own errors can write transcripts simulating a much larger and smarter LLM.
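A rough sketch of that last step, assuming a toy keyword-overlap retriever as a stand-in for real RAG. The names `retrieve` and `make_example` and the 0.5 threshold are made up for illustration, not taken from any library:

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, corpus, threshold=0.5):
    """Toy stand-in for RAG: keep documents that cover at least
    `threshold` of the question's tokens (hypothetical cutoff)."""
    q = tokenize(question)
    if not q:
        return []
    return [doc for doc in corpus if len(q & tokenize(doc)) / len(q) >= threshold]

def make_example(question, corpus):
    """Build one training record; fall back to "I don't know." when
    retrieval finds nothing relevant."""
    hits = retrieve(question, corpus)
    if hits:
        # Assumption: a real pipeline would have an LLM write the answer
        # from the retrieved context; here it is left unfilled.
        return {"question": question, "context": hits, "answer": None}
    return {"question": question, "context": [], "answer": "I don't know."}

corpus = [
    "Nix is a purely functional package manager.",
    "The name Nix is derived from the Dutch word niks.",
]
known = make_example("What is Nix derived from?", corpus)
unknown = make_example("Who won the 1987 chess olympiad?", corpus)
```

Here `unknown` ends up with the "I don't know." answer because nothing in the corpus overlaps enough with the question. A real setup would swap the overlap score for embedding search, and picking the threshold is where most of the difficulty would sit.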
No FFN is blowing my mind. This is pretty much "Attention Is ACTUALLY All You Need". Reminds me of BERT Q&A which would return indices into the input context, but even that had a FFN. Really exciting work.
I guess this had always been bugging me. I get why you need activations/non-linearities, but do you really need the FFN in Transformers? People say that without it you can't do "knowledge/fact" lookups, but you still have the Value part of the attention, and if your question is "what is the capital of france" the LLM could presumably extract out "paris" from the value vector during attention computation instead of needing the FFN for that. Deleting the FFN is probably way worse in terms of scaling laws or storing information, but is it an actual architectural dead end (in the way that deleting the activation layer clearly would be, since it'd collapse everything to a linear function)?
> if your question is "what is the capital of france" the LLM could presumably extract out "paris" from the value vector during attention computation instead of needing the FFN for that.
But how do you get 'Paris' into the value vector in that case? The value vector is just the result of a matrix multiplication, and without a nonlinearity it can't perform a data-dependent transformation. Attention still acts as a nonlinear mixer of previous values, but your new output is still limited to the convex combination of previous values.
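To make the convex-combination point concrete, here is a minimal single-head sketch in plain Python. It assumes one query, no learned projections and no 1/sqrt(d) scaling, and the numbers are arbitrary:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Dot-product attention for one query, with no FFN afterwards."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, out

# Arbitrary toy keys and values (illustrative only).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[0.0, 2.0], [4.0, -1.0], [1.0, 3.0]]
weights, out = attend([0.5, -2.0], keys, values)
```

The weights are positive and sum to 1, so each coordinate of `out` stays between the smallest and largest corresponding coordinate of `values`. The mixing weights depend on the data, but the output cannot leave the convex hull of the value vectors.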
> But how do you get 'Paris' into the value vector in that case?
Ok wait I think I see what you mean. Although maybe it's not getting paris _into_ the value vector that's hard, but isolating the residual stream to _only_ that instead of things like other capitals.
So as a naive example maybe at the very first layer consuming your tokens: Q{France} would have high inner product with K{capital} and so our residual would now mostly contain V{capital}, which maybe contains embeddings of all the capitals of all countries. You need some way to filter out all the other stuff, but can't do that without a FFN + activation.
Just throwing in a relu by itself won't help since that would still work on all the elements uniformly, you need some way to put weight on "paris" while suppressing the others, i.e. mixing within the residual stream itself.
Although maybe if you really stretch it, somewhere in a deeper layer you could have 1-hot encoded values with a "gain" coefficient so that when you do the residual addition it's something like {<paris>, <tokyo>, <dc>} + 10000*{<1>, <0>, <0>} and then if you softmax that you get something with most of its mass on "Paris". But it seems like this would not be practical, or it's just shifting the issue to how the right 1-hot vector is chosen.
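A tiny numeric version of that gain trick, treating each candidate as a single scalar logit rather than an embedding (a simplification), with made-up base scores:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["paris", "tokyo", "dc"]
base = [1.2, 1.5, 0.9]       # hypothetical logits before the boost
one_hot = [1.0, 0.0, 0.0]    # assumed to already point at "paris"
gain = 10000.0

boosted = [b + gain * h for b, h in zip(base, one_hot)]
before = softmax(base)
after = softmax(boosted)
```

Before the boost the mass is spread out and "tokyo" is slightly ahead; after it, essentially all the mass sits on "paris". The sketch takes `one_hot` as given, which is exactly the part that is still unexplained.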
Makes sense that the agent can refine its search terms/strategy based on discovered context.
But it still has to enumerate synonyms to find things.
I would assume it's very domain dependent: code or technical docs would have more precise terminology that is better for fixed string search. On the other hand, medical or legal text can have many, many ways to say the same thing.
Yup, and good luck finding usable dictionaries. It's a lot of one-time handiwork to build it yourself, for which you need to find the right motivation (and time, and funding)
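For what it's worth, the mechanics are the easy part. A minimal sketch of synonym-expanded fixed-string search, assuming a tiny hand-built dictionary (`expand`, `search` and the entries are hypothetical):

```python
def expand(term, synonyms):
    """Return the term plus any hand-listed synonyms, lowercased."""
    key = term.lower()
    return {key, *(s.lower() for s in synonyms.get(key, []))}

def search(term, docs, synonyms):
    """Case-insensitive fixed-string search over the term and its synonyms."""
    variants = expand(term, synonyms)
    return [d for d in docs if any(v in d.lower() for v in variants)]

# Hand-built dictionary; this is the part that takes the real effort.
synonyms = {"heart attack": ["myocardial infarction", "cardiac infarction"]}

docs = [
    "Patient admitted with myocardial infarction.",
    "History of heart attack in 2019.",
    "No cardiac findings.",
]
plain = search("heart attack", docs, {})
expanded = search("heart attack", docs, synonyms)
```

`plain` only finds the second document, while `expanded` also picks up the first. Short abbreviations are left out of the dictionary on purpose, since naive substring matching would hit them inside unrelated words.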