
This is fascinating. The fact that blocks of ~7 layers work, and not fewer or more, really suggests there are emergent functional units in the transformer stack that we don't fully understand yet. Almost like "organs" in the network. Have you tried this on architectures other than Qwen, like Llama or Mistral? Curious whether the magic block size is architecture-dependent or whether 7 layers is some kind of universal constant.
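
Roughly the kind of sweep I'm imagining (untested sketch, not the author's actual setup; the model name, the toy probe task, and the mean-pooling choice are all placeholders):

    # Sweep block sizes: train a linear probe on mean-pooled hidden states
    # from a contiguous block of layers and watch where accuracy jumps.
    import torch
    from transformers import AutoModel, AutoTokenizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    model_name = "Qwen/Qwen2-1.5B"  # placeholder; swap in Llama/Mistral to compare
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()

    # Toy probe task (placeholder); real probing needs a proper dataset.
    texts = ["great film", "awful film", "loved it", "hated it",
             "a delight", "a disaster", "wonderful", "dreadful"]
    labels = [1, 0, 1, 0, 1, 0, 1, 0]

    def block_features(start, size):
        feats = []
        for t in texts:
            ids = tok(t, return_tensors="pt")
            with torch.no_grad():
                hs = model(**ids).hidden_states  # embeddings + one tensor per layer
            block = torch.stack(hs[start + 1 : start + 1 + size])  # skip embeddings
            feats.append(block.mean(dim=(0, 2)).squeeze(0))  # avg over layers + tokens
        return torch.stack(feats).numpy()

    # Does probe accuracy peak near size == 7? (You'd also slide `start`
    # across the stack, not just probe from the bottom.)
    for size in range(1, 12):
        X = block_features(start=0, size=size)
        Xtr, Xte, ytr, yte = train_test_split(X, labels, test_size=0.25,
                                              stratify=labels, random_state=0)
        acc = LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)
        print(f"block size {size}: probe accuracy {acc:.2f}")

If the accuracy curve peaks at the same block size across architectures, that would be much stronger evidence for a universal constant than a single-model result.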

I wouldn't be surprised if, even within the same model, the organ block size varied wildly depending on what you're looking for (i.e. which probes he's running).

But if certain sizes are common across models, that could also point to an architectural flaw: what looks like a universal constant might instead be a bound imposed by some inner working of the architecture, and perhaps one that could be improved upon.

