At this point, does anything actually preclude telling a model connected to a service running Python with all the requisite libraries (or access to PyPI / conda) to try "designing" another LLM, from architecture and choice of training data to writing actual training code?
Nothing, and I'm sure people are trying exactly that, but it's probably beyond the abilities of current LLMs to make progress in this area, since that kind of work sits mostly outside their training data.
Is it, though? Codex was trained mostly on Python code, and given the cutoff date, it has all the relevant ML bits. And if you ask it about things like transformers, self-attention, etc., it's clear that papers on those topics were part of its training set as well. The question is whether it can reason about them well enough to produce meaningful improvements.
FWIW, if you ask it how transformers could be improved, it basically gives a recap of the various relevant papers it knows about: Longformer, BigBird, etc. Amazingly, all the papers it cites actually exist, with the correct titles, author names, and dates. But then if you ask it for some ideas of its own:
> Can you think of other novel improvement ideas that haven't been investigated yet, to the best of your knowledge? What makes the most sense based on your understanding of the subject matter.
As an AI language model, my knowledge is derived from the text data I have been trained on, which includes research papers and articles published up to September 2021. While it's difficult to guarantee the novelty of any idea, here are a few suggestions for potential improvements that may not have been thoroughly investigated yet:
Context-Adaptive Layers: Instead of using the same set of layers for all input sequences, consider using context-adaptive layers that are dynamically selected or generated based on the input's characteristics. This could allow the model to adapt its processing strategy to different input types, such as formal vs. informal text, or narrative vs. expository text, which might improve its performance on diverse tasks.
Multi-modal Fusion: Develop techniques to better fuse information from multiple modalities, such as text, images, and audio, within the Transformer architecture. Improved multi-modal fusion could lead to enhanced performance on tasks that require understanding and generating content across different modalities, such as visual question answering or image captioning.
Meta-Learning for Architecture Search: Employ meta-learning techniques to search for the best Transformer architecture for a specific task or domain. By learning to optimize the architecture itself, the model might be able to discover more effective configurations or components that have not been considered before.
Transformer Modules for Specific Tasks: Design specialized Transformer modules tailored for specific NLP tasks, such as syntactic parsing, named entity recognition, or sentiment analysis. These task-specific modules could be combined with the generic Transformer architecture to create a more powerful and adaptable model.
Hierarchical Transformers: Explore hierarchical architectures where multiple layers of Transformers work together to process information at different levels of abstraction. This could help the model capture both low-level patterns, such as local dependencies, and high-level patterns, such as long-term dependencies and semantic relationships, more effectively.
Enhanced Tokenization Strategies: Investigate novel tokenization strategies that better capture the structure and meaning of the input text. This could involve using character-level, subword-level, or even syntax-based tokenization to better represent the input for the Transformer model.
Regularization Techniques: Develop novel regularization techniques specifically tailored for the Transformer architecture. This could help improve generalization and prevent overfitting, especially when training large-scale models with billions of parameters.
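To make the first suggestion a bit more concrete, here's a rough sketch of what "context-adaptive layers" might look like in PyTorch: a small gating network looks at a pooled summary of the input and mixes the outputs of a few candidate sublayers. Everything here (class names, sizes, the mean-pooling gate) is my own illustrative filler, not something the model spelled out, and it amounts to a soft mixture-of-experts over feed-forward blocks:

```python
# Rough sketch of the "context-adaptive layers" idea: a tiny gating network
# looks at a pooled summary of the input and mixes the outputs of several
# candidate feed-forward blocks. All names and sizes here are illustrative.
import torch
import torch.nn as nn

class ContextAdaptiveBlock(nn.Module):
    def __init__(self, d_model=256, n_variants=4):
        super().__init__()
        # A few alternative sublayers the block can choose between.
        self.variants = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_variants)
        )
        # Gating network: pooled input -> softmax weights over variants.
        self.gate = nn.Linear(d_model, n_variants)

    def forward(self, x):                       # x: (batch, seq, d_model)
        pooled = x.mean(dim=1)                  # crude summary of the sequence
        weights = torch.softmax(self.gate(pooled), dim=-1)          # (batch, n_variants)
        outs = torch.stack([v(x) for v in self.variants], dim=-1)   # (batch, seq, d, n)
        mixed = (outs * weights[:, None, None, :]).sum(dim=-1)
        return x + mixed                        # residual connection

x = torch.randn(2, 16, 256)
print(ContextAdaptiveBlock()(x).shape)          # torch.Size([2, 16, 256])
```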
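Similarly, the "hierarchical Transformers" suggestion more or less reduces to a local encoder over fixed-size chunks plus a global encoder over one pooled summary per chunk. Again, this is a toy sketch under my own assumptions (chunk size, depths, mean pooling), not the model's design:

```python
# Equally rough sketch of the "hierarchical Transformers" suggestion: a local
# encoder runs within fixed-size chunks, then a global encoder runs over one
# pooled vector per chunk. Chunk size, depths, and dims are all made up.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, d_model=256, chunk=32, nhead=4):
        super().__init__()
        self.chunk = chunk
        layer = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.local_enc = nn.TransformerEncoder(layer(), num_layers=2)   # within-chunk
        self.global_enc = nn.TransformerEncoder(layer(), num_layers=2)  # across chunks

    def forward(self, x):                     # x: (batch, seq, d_model), seq % chunk == 0
        b, s, d = x.shape
        chunks = x.view(b * s // self.chunk, self.chunk, d)
        local = self.local_enc(chunks)        # low-level, local dependencies
        summaries = local.mean(dim=1).view(b, s // self.chunk, d)
        return self.global_enc(summaries)     # high-level, long-range structure

x = torch.randn(2, 128, 256)
print(HierarchicalEncoder()(x).shape)         # torch.Size([2, 4, 256])
```

Neither sketch is hard to write down; the hard part, which was the original question, is whether the model could come up with something like this unprompted, justify it, and then write and run the training code to find out if it actually helps.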