For anyone following recent MoE work, some groups have been doing their own MoE adaptations, like this one, Sparsetral - this is pretty exciting as it's basically an MoE LoRA implementation that runs a 16x7B at 9.4B total parameters (the original paper introduced a model, Camelidae-8x34B, that ran at 38B total parameters, 35B activated parameters). For those interested, best to start here for discussion and links: https://www.reddit.com/r/LocalLLaMA/comments/1ajwijf/model_r...
> Shortly thereafter I got in contact with AMD and in early 2022 I have left Intel and signed a ZLUDA development contract with AMD. Once again I was asked for a far-reaching discretion: not to advertise the fact that AMD is evaluating ZLUDA and definitely not to make any commits to the public ZLUDA repo. After two years of development and some deliberation, AMD decided that there is no business case for running CUDA applications on AMD GPUs.
> One of the terms of my contract with AMD was that if AMD did not find it fit for further development, I could release it. Which brings us to today.
It's worth noting that while ZLUDA is a very cool project, it's probably not so relevant for ML. Also from the README:
> PyTorch received very little testing. ZLUDA's coverage of cuDNN APIs is very minimal (just enough to run ResNet-50) and realistically you won't get much running.
> However if you are interested in trying it out you need to build it from sources with the settings below. Default PyTorch does not ship PTX and uses bundled NCCL which also builds without PTX:
PyTorch has OOTB ROCm support, btw, and while there are some CUDA-only libraries I'd like to see supported (FA2 for RDNA, bitsandbytes, ctranslate2, FlashInfer, among others), I think sponsoring direct porting/upstreaming of compatibility for those libraries probably makes more sense. Also from the ZLUDA README:
> ZLUDA offers limited support for performance libraries (cuDNN, cuBLAS, cuSPARSE, cuFFT, OptiX, NCCL).
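On the ROCm point: PyTorch's ROCm builds reuse the torch.cuda namespace, so most CUDA-targeted Python code runs unchanged on AMD GPUs. A minimal sanity-check sketch, assuming a ROCm build of PyTorch is installed:

```python
import torch

# On ROCm builds, torch.version.hip is set and torch.version.cuda is None;
# the torch.cuda API is transparently backed by HIP.
print("HIP version:", torch.version.hip)
print("CUDA version:", torch.version.cuda)

if torch.cuda.is_available():
    device = torch.device("cuda")  # maps to the AMD GPU under ROCm
    x = torch.randn(1024, 1024, device=device)
    y = x @ x.T  # matmul runs on the GPU via rocBLAS
    print(y.shape, y.device)
```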
And you don't need any additional equipment at all. When I say trivial, I really do mean it - you can go to https://www.together.ai/pricing and see for yourself: a 10M-token, 3-epoch fine-tune on a 7B model will cost you about $10-15 right now. Upload your dataset, download your fine-tune weights (or serve via their infrastructure). This is only going to get easier (compare how difficult it was to run inference with local models last year to what you can do with plug-and-play solutions like Ollama, LM Studio, or Jan today).
Note also that tuning is a one-time outlay, and merges are even less resource intensive/easier to do.
To put things in perspective, compare that to how much cost and effort it would take to tune a model where you don't have the weights at all.
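To give a rough idea of what the hosted workflow looks like, here's a sketch against Together's HTTP API. The endpoint paths, field names, JSONL schema, and model identifier below are assumptions based on their docs at the time - check the current documentation before relying on any of it:

```python
import os
import requests

API_KEY = os.environ["TOGETHER_API_KEY"]
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
BASE = "https://api.together.xyz/v1"  # assumed base URL; confirm against current docs

# 1. Upload a JSONL training file (one example per line; the exact schema
#    expected by the service is an assumption here).
with open("train.jsonl", "rb") as f:
    upload = requests.post(
        f"{BASE}/files",
        headers=HEADERS,
        files={"file": f},
        data={"purpose": "fine-tune"},
    ).json()

# 2. Kick off a fine-tune job against a 7B base model (model name is illustrative).
job = requests.post(
    f"{BASE}/fine-tunes",
    headers=HEADERS,
    json={
        "model": "mistralai/Mistral-7B-v0.1",
        "training_file": upload["id"],
        "n_epochs": 3,
    },
).json()
print(job)

# 3. When the job finishes, download the resulting weights (or serve them
#    directly on their infrastructure).
```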
Fine-tuning - obtaining a dataset for your task (this in itself is not trivial), figuring out how the service you linked works (after figuring out that it exists at all), uploading the dataset, paying, downloading the weights - OK, now how do you load them into LM Studio?
It's all subjective, of course, but for me there's a considerable difficulty jump there.
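The LM Studio step is the one part the hosted service doesn't do for you: LM Studio loads GGUF files, so the downloaded HF-format weights need a conversion pass. A rough sketch using llama.cpp's converter - the script and binary names vary between llama.cpp versions, so treat them as placeholders:

```python
import subprocess

# Assumes https://github.com/ggerganov/llama.cpp is cloned locally and the
# fine-tuned weights (HF format) are in ./my-finetune. If the service hands
# you a LoRA adapter rather than merged weights, merge it into the base
# model first.

# 1. Convert the HF checkpoint to a GGUF file.
subprocess.run(
    ["python", "llama.cpp/convert-hf-to-gguf.py", "my-finetune",
     "--outfile", "my-finetune-f16.gguf"],
    check=True,
)

# 2. Optionally quantize so it fits in less (V)RAM.
subprocess.run(
    ["llama.cpp/quantize", "my-finetune-f16.gguf",
     "my-finetune-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)

# 3. Drop the resulting .gguf into LM Studio's models directory and it will
#    show up in the model picker.
```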
* Announcement: https://twitter.com/billyuchenlin/status/1749975138307825933
* Model Card: https://huggingface.co/snorkelai/Snorkel-Mistral-PairRM-DPO
* Response Re-Ranker: https://huggingface.co/llm-blender/PairRM
"We would also like to acknowledge contemporary work published independently on arXiv on 2024-01-18 by Meta & NYU (Yuan, et al) in a paper called Self-Rewarding Language Models, which proposes a similar general approach for creating alignment pairs from a larger set of candidate responses, but using the LLM as the reward model. While this may work for general-purpose models, our experience has shown that task-specific reward models guided by SMEs are necessary for most enterprise applications of LLMs for specific use cases, which is why we focus on the use of external reward models."
The naming of these models is getting ridiculous...
I'm an occasional visitor to huggingface, so I'm actually superficially familiar with the taxonomy. I just felt like, even if I tried to satirize it, I wouldn't be able to come up with a crazier name. And that's not even the end of the Cambrian explosion of LLMs.
As these models can be quite large and memory intensive, if you want to just give them a quick spin, huggingface.co/chat, chat.nbox.ai, and labs.pplx.ai all have Mixtral hosted atm.
Alternatively, you could also use your own UI/API token (API calls aren't trained on). Chatbot UI just got a major update and has nice things like folders and chat search: https://github.com/mckaywrigley/chatbot-ui
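If you'd rather use your own token, most of these providers expose an OpenAI-compatible endpoint, so pointing the standard client at Mixtral takes a few lines. A minimal sketch - the base URL and model name here are for Together; other providers will differ:

```python
from openai import OpenAI

# OpenAI-compatible endpoint; swap base_url/model for your provider of choice.
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Give me a one-line summary of MoE models."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```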
Just like the EU knows better about what chargers people should use than customers and engineers? Such wise bureaucrats!
There are quite a few attention extension techniques that have been published recently:
* Activation Beacons - up to 100X context length extension in as little as 72 A800 hours https://huggingface.co/papers/2401.03462
* Self-Extend - a no-training RoPE modification that can give "free" context extension with 100% passkey retrieval (works w/ SWA as well) https://huggingface.co/papers/2401.01325
* DistAttention/DistKV-LLM - KV cache segmentation for 2-19X context length at runtime https://huggingface.co/papers/2401.02669
* YaRN - aforementioned efficient RoPE extension https://huggingface.co/papers/2309.00071
You could imagine combining a few of these together to basically "solve" the context issue while largely training for shorter context length.
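To make the Self-Extend idea concrete: the trick is remapping relative positions so anything beyond the trained window gets bucketed into groups, which keeps RoPE distances within the range the model saw during training. A simplified sketch of that remapping - not the paper's exact formulation, and the window/group sizes are illustrative:

```python
def effective_distance(distance: int, neighbor_window: int = 512, group_size: int = 8) -> int:
    """Map a raw relative distance to the distance actually fed to RoPE.

    Nearby tokens keep their exact positions; distant tokens share coarse,
    grouped positions so the model never sees relative distances beyond
    what it was trained on.
    """
    if distance < neighbor_window:
        return distance
    # Bucket everything past the neighbor window into groups of `group_size`.
    return neighbor_window + (distance - neighbor_window) // group_size


# With these (illustrative) settings, a model trained on 4096 positions can
# address roughly 512 + (4096 - 512) * 8 ≈ 29k tokens of raw context.
for d in (10, 511, 512, 520, 5000, 20000):
    print(d, "->", effective_distance(d))
```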
There are of course some exciting new alternative architectures, notably Mamba https://huggingface.co/papers/2312.00752 and Megabyte https://huggingface.co/papers/2305.07185 that can efficiently process up to 1M tokens...
Large changes are best performed as a sequence of thoughtful, bite-sized steps, where you plan out the approach and overall design. Walk GPT through changes like you might with a junior dev. Ask for a refactor to prepare, then ask for the actual change. Spend the time to ask for code quality/structure improvements.