
Just an FYI, Mixtral is a Sparse Mixture of Experts that has 47B parameters for memory costs (but 13B active parameters per token). For those interested in reading more about how it works: https://arxiv.org/pdf/2401.04088.pdf
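As a rough sanity check on the 47B/13B figures, here is a back-of-the-envelope parameter count. The hyperparameters are assumed from the architecture table in the linked paper, and the breakdown (SwiGLU experts, grouped-query attention, untied embeddings) is a simplification rather than an exact accounting of the released checkpoint:

```python
# Back-of-the-envelope parameter count for a Mixtral-style sparse MoE.
# All hyperparameters below are assumptions taken from the paper's architecture table.
DIM = 4096
N_LAYERS = 32
HEAD_DIM = 128
HIDDEN_DIM = 14336
N_HEADS = 32
N_KV_HEADS = 8
VOCAB = 32000
N_EXPERTS = 8
TOP_K = 2


def mixtral_params(experts_counted: int) -> int:
    """Count parameters when `experts_counted` expert FFNs per layer are included."""
    attn = 2 * DIM * N_HEADS * HEAD_DIM       # q and o projections
    attn += 2 * DIM * N_KV_HEADS * HEAD_DIM   # k and v projections (grouped-query attention)
    expert_ffn = 3 * DIM * HIDDEN_DIM         # gate, up and down matrices of one expert
    router = DIM * N_EXPERTS                  # gating layer that scores the experts
    norms = 2 * DIM                           # two RMSNorm weight vectors per layer
    per_layer = attn + experts_counted * expert_ffn + router + norms
    embeddings = 2 * VOCAB * DIM              # input embedding + output head
    return N_LAYERS * per_layer + embeddings + DIM  # + final norm


total = mixtral_params(N_EXPERTS)   # everything that has to sit in memory
active = mixtral_params(TOP_K)      # what a single token actually passes through

print(f"total:  {total / 1e9:.1f}B")   # total:  46.7B
print(f"active: {active / 1e9:.1f}B")  # active: 12.9B
```

Only the expert FFNs are sparse; attention, embeddings, and the router are shared, which is why the active count is well above 2/8 of the total.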

For those interested in some of the recent MoE work going on, some groups have been doing their own MoE adaptations, like this one, Sparsetral - this is pretty exciting as it's basically an MoE LoRA implementation that runs a 16x7B at 9.4B total parameters (the original paper introduced a model, Camelidae-8x34B, that ran at 38B total parameters, 35B activated parameters). For those interested, best to start here for discussion and links: https://www.reddit.com/r/LocalLLaMA/comments/1ajwijf/model_r...

Well, you probably read an inaccurate headline about it. The project is called ZLUDA https://github.com/vosen/ZLUDA and it had a recent public update because of the opposite - AMD decided not to continue sponsoring work on it:

> Shortly thereafter I got in contact with AMD and in early 2022 I have left Intel and signed a ZLUDA development contract with AMD. Once again I was asked for a far-reaching discretion: not to advertise the fact that AMD is evaluating ZLUDA and definitely not to make any commits to the public ZLUDA repo. After two years of development and some deliberation, AMD decided that there is no business case for running CUDA applications on AMD GPUs.
>
> One of the terms of my contract with AMD was that if AMD did not find it fit for further development, I could release it. Which brings us to today.

It's worth noting that while ZLUDA is a very cool project, it's probably not so relevant for ML. Also from the README:

> PyTorch received very little testing. ZLUDA's coverage of cuDNN APIs is very minimal (just enough to run ResNet-50) and realistically you won't get much running.
>
> However if you are interested in trying it out you need to build it from sources with the settings below. Default PyTorch does not ship PTX and uses bundled NCCL which also builds without PTX:

PyTorch has OOTB ROCm support btw and while there are some CUDA-only libraries I'd like (FA2 for RDNA, bitsandbytes, ctranslate2, FlashInfer among others), I think sponsoring direct porting/upstreaming compatibility of the libraries probably makes more sense. Also from the ZLUDA README:

> ZLUDA offers limited support for performance libraries (cuDNN, cuBLAS, cuSPARSE, cuFFT, OptiX, NCCL).

It is trivial to fine tune any model (whether a base model or an aligned model) to your preferred output preferences as long as you have access to the model weights.

Not trivial for the general public at all, and furthermore, you need much more memory for finetuning than for inference, often making it infeasible for many machine/model combinations.
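To put rough numbers on the memory gap, here is a sketch using common rules of thumb. The bytes-per-parameter figures are assumptions (fp16 weights for inference; fp16 weights and gradients plus fp32 master weights and two Adam moments for full fine-tuning), and it ignores activations and the KV cache entirely:

```python
# Rule-of-thumb memory estimate for weights and optimizer state only.
# The bytes-per-parameter values are assumptions, not measurements.
BYTES_INFERENCE_FP16 = 2    # fp16/bf16 weights
BYTES_FULL_FINETUNE = 16    # 2 (weights) + 2 (grads) + 4 (fp32 master) + 4 + 4 (Adam m, v)


def state_gib(n_params: float, bytes_per_param: float) -> float:
    """Memory in GiB needed to hold the per-parameter state."""
    return n_params * bytes_per_param / 2**30


n_params = 7e9  # a 7B model
inference_gib = state_gib(n_params, BYTES_INFERENCE_FP16)
finetune_gib = state_gib(n_params, BYTES_FULL_FINETUNE)

print(f"fp16 inference:   {inference_gib:.1f} GiB")
print(f"full fine-tuning: {finetune_gib:.1f} GiB")
```

Under these assumptions full fine-tuning needs about 8x the memory of fp16 inference. Parameter-efficient methods such as LoRA, or quantized inference, shift both numbers down considerably.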

If you are running a local LLM already (which no one in the "general public" is) then the bar is really not that much higher for fine-tuning (either for an individual or community member to do).

And you don't need any additional equipment at all. When I say trivial, I really do mean it - you can go to https://www.together.ai/pricing and see for yourself - a 10M token 3 epoch fine tune on a 7B model will cost you about $10-15 right now. Upload your dataset, download your fine tune weights (or serve via their infrastructure). This is only going to get easier (compare how difficult it was to inference local models last year to what you can do with plug and play solutions like Ollama, LM Studio, or Jan today).
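For a sense of scale, the quoted figures imply a per-token rate that is easy to back out. This sketch uses only the numbers above (10M tokens, 3 epochs, $10-15); it is not a reading of the actual pricing page, which may have minimums or different tiers:

```python
# Back out the implied price per million trained tokens from the quoted estimate.
dataset_tokens = 10_000_000
epochs = 3
quoted_cost_usd = (10.0, 15.0)  # low and high end of the "$10-15" estimate

trained_tokens = dataset_tokens * epochs  # tokens actually processed during training
low, high = (cost / (trained_tokens / 1e6) for cost in quoted_cost_usd)

print(f"{trained_tokens:,} trained tokens")
print(f"${low:.2f}-${high:.2f} per 1M trained tokens")
```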

Note also that tuning is a one-time outlay, and merges are even less resource intensive/easier to do.

To put things in perspective, tell me how much cost and effort it would be to tune a model where you don't have the weights at all in comparison.

Running a local LLM - downloading LM Studio, installing on Windows, using the search function to search for a popular LLM, click "download", click the button to load the model, chat.

Fine-tuning - obtaining a dataset for your task (this in itself is not trivial), figuring out how the service you linked works (after figuring out that it exists at all), uploading the dataset, paying, downloading the weights - OK, now how do you load them into LM Studio?

It's all subjective, of course, but for me there's a considerable difficulty jump there.

A new 7B model, Snorkel-Mistral-PairRM-DPO, using a similar self-rewarding pipeline was just released:

* Announcement: https://twitter.com/billyuchenlin/status/1749975138307825933

* Model Card: https://huggingface.co/snorkelai/Snorkel-Mistral-PairRM-DPO

* Response Re-Ranker: https://huggingface.co/llm-blender/PairRM

"We would also like to acknowledge contemporary work published independently on arXiv on 2024-01-18 by Meta & NYU (Yuan, et al) in a paper called Self-Rewarding Language Models, which proposes a similar general approach for creating alignment pairs from a larger set of candidate responses, but using the LLM as the reward model. While this may work for general-purpose models, our experience has shown that task-specific reward models guided by SMEs are necessary for most enterprise applications of LLMs for specific use cases, which is why we focus on the use of external reward models."


The naming of these models is getting ridiculous...

I kind of disagree. It's not "user friendly" but it is very descriptive. They are codenames after all. Take "dolphin-2.6-mistral-7b-dpo-laser" for instance: with a little LLM background knowledge, just from the name you know it is a 7-billion-parameter model based on Mistral, with a filtered dataset to remove alignment and bias (dolphin), version 2.6 and using the techniques described in the Direct Preference Optimization (https://arxiv.org/pdf/2305.18290.pdf) and Laser (https://arxiv.org/pdf/2312.13558.pdf) papers to improve its output.
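To show how mechanical that decoding is, here is a toy parser. The tag vocabularies and the `parse_model_name` function are hypothetical and made up for this example; real model names follow no formal spec, so treat this as an illustration rather than a tool:

```python
import re

# Hypothetical tag vocabularies, for illustration only.
KNOWN_BASES = {"mistral", "mixtral", "llama"}
KNOWN_TECHNIQUES = {"dpo", "laser", "sft", "lora"}


def parse_model_name(name: str) -> dict:
    """Split a hyphenated model name into the parts a reader would infer."""
    parts = {"dataset": None, "version": None, "base": None, "size": None, "techniques": []}
    for token in name.lower().split("-"):
        if re.fullmatch(r"\d+(\.\d+)?b", token):      # e.g. "7b", "1.3b"
            parts["size"] = token
        elif re.fullmatch(r"v?\d+(\.\d+)*", token):   # e.g. "2.6", "v2"
            parts["version"] = token
        elif token in KNOWN_BASES:
            parts["base"] = token
        elif token in KNOWN_TECHNIQUES:
            parts["techniques"].append(token)
        elif parts["dataset"] is None:                # first unknown token: dataset/family
            parts["dataset"] = token
    return parts


info = parse_model_name("dolphin-2.6-mistral-7b-dpo-laser")
print(info)
# {'dataset': 'dolphin', 'version': '2.6', 'base': 'mistral', 'size': '7b', 'techniques': ['dpo', 'laser']}
```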

And here I was thinking they were somehow using the first three words from my Bitcoin wallet.

Thank you for a great and informative explanation despite my somewhat ignorant take.

I'm an occasional visitor to huggingface, so I'm actually superficially familiar with the taxonomy. I just felt like, even if I tried to satirize it, I wouldn't be able to come up with a crazier name. And that's not even the end of the Cambrian explosion of LLMs.

A bit like the User-Agent string.

Thanks for this, and the link you provided below for GGUF files! I just cleared my schedule this afternoon to kick the tires.

I assume this doesn’t yet run on llama.cpp?

It is based on Mistral which llama.cpp supports, so I assume it does run (you might need to convert to GGUF format and quantize it).

For macOS and Linux, Ollama is probably the easiest way to try Mixtral (and a large number of models) locally. LM Studio is also nice and available for Mac, Windows, and Linux.

As these models can be quite large and memory intensive, if you want to just give it a quick spin, huggingface.co/chat, chat.nbox.ai, and labs.pplx.ai all have Mixtral hosted atm.

You can make a privacy request for OpenAI to not train on your data here: https://privacy.openai.com/

Alternatively, you could also use your own UI/API token (API calls aren't trained on). Chatbot UI just got a major update released and has nice things like folders, and chat search: https://github.com/mckaywrigley/chatbot-ui

It should be opt out by default: not opt in.

Until people decide to boycott services where privacy is opt-in, nothing will change.

Maybe the EU should fine them like a billion dollars a month until it becomes opt-in.

Because the EU knows better about these kinds of things than the people who build and use these?

Just like the EU knows better about what chargers people should use than customers and engineers? Such wise bureaucrats!

That comment was immediately downvoted so there isn’t much hope.

ChatGPT's web search is interminably slow and I've added to my custom prompt to not do web searches unless explicitly asked. However, I'd give Perplexity.ai a try - I've found it to be incredibly fast and useful (funnily enough, they largely also use Bing for search retrieval results) and if you pay for Pro (which I do now), you can also use GPT-4 for generating responses.

My understanding is that Mistral uses a regular 4K RoPE that "extends" the window size with SWA. This is based on looking at the results of Nous Research's Yarn-Mistral extension: https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k and Self-Extend, both of which only apply to RoPE models.
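For anyone unfamiliar with SWA (sliding window attention), here is a minimal sketch of the mask it implies. The off-by-one convention (each token sees itself plus the previous `window - 1` tokens) is an assumption and varies between implementations; the 4096-token window and 32 layers are assumed from the Mistral 7B paper:

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def theoretical_reach(n_layers: int, window: int) -> int:
    """Upper bound on how far information can propagate through stacked SWA layers."""
    return n_layers * window


mask = sliding_window_causal_mask(seq_len=5, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

print(theoretical_reach(n_layers=32, window=4096))  # 131072
```

Each layer only looks back one window, but stacking layers lets information hop further back, which is how a 4K window can "extend" to a much longer effective span.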

Quite a few attention extension techniques have been published recently:

* Activation Beacons - up to 100X context length extension in as little as 72 A800 hours https://huggingface.co/papers/2401.03462

* Self-Extend - a no-training RoPE modification that can give "free" context extension with 100% passkey retrieval (works w/ SWA as well) https://huggingface.co/papers/2401.01325

* DistAttention/DistKV-LLM - KV cache segmentation for 2-19X context length at runtime https://huggingface.co/papers/2401.02669

* YaRN - aforementioned efficient RoPE extension https://huggingface.co/papers/2309.00071
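To give a flavor of how a no-training approach like Self-Extend can work, here is a sketch of its relative-position remapping, based on my reading of the paper's pseudocode. The function name and the example group size/window are my own choices, and this covers only the position mapping, not the attention merge itself:

```python
def self_extend_rel_pos(q_pos: int, k_pos: int, group: int, neighbor_window: int) -> int:
    """Relative position fed to RoPE for a (query, key) pair.

    Nearby keys keep their exact distance; distant keys use floor-divided
    ("grouped") positions, shifted so the two regimes line up at the boundary.
    """
    dist = q_pos - k_pos
    if dist < 0:
        raise ValueError("causal attention: key must not be ahead of the query")
    if dist < neighbor_window:
        return dist                                   # normal attention
    shift = neighbor_window - neighbor_window // group
    return q_pos // group + shift - k_pos // group    # grouped attention


# Assumed example settings: group size 4, neighbor window 8.
print(self_extend_rel_pos(10, 9, group=4, neighbor_window=8))   # 1
print(self_extend_rel_pos(100, 0, group=4, neighbor_window=8))  # 31
```

The point is that a distance of 100 gets mapped to 31, so the model never sees relative positions beyond what it was trained on, at the cost of coarser position information for far-away tokens.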

You could imagine combining a few of these together to basically "solve" the context issue while largely training for shorter context length.

There are of course some exciting new alternative architectures, notably Mamba https://huggingface.co/papers/2312.00752 and Megabyte https://huggingface.co/papers/2305.07185 that can efficiently process up to 1M tokens...

Yes, I think "predecessor" is a more apt term than competitor. AGP was introduced in 1997 and was pretty ubiquitous for almost a decade. While the first PCIe graphics cards started coming out around 2004, AGP cards continued to be released for several more years until they were phased out completely.

Sorry for going a little off-topic - have you by any chance tested the 7940HS with Stable Diffusion? How does it perform?

I haven't but I know there are writeups I've seen online that should be easy to find of people using SD w/ APUs.

I've given Aider and Mentat a go multiple times and for existing projects I've found those tools to easily make a mess of my code base (especially larger projects). Checkpoints aren't so useful if you have to keep rolling back and re-prompting, especially once it starts making massive (slow token output) changes. I'm always using `gpt-4` so I feel like there will need to be an upgrade to the model capabilities before it can be reliably useful. I have tried Bloop, Copilot, Cody, and Cursor (w/ a preference towards the latter two), but inevitably, I end up with a chat window open a fair amount - while I know things will get better, I also find that LLM code generation for me is currently most useful on very specific bounded tasks, and that the pain of giving `gpt-4` free rein on my codebase is, in practice, worse atm.

There is a bit of a learning curve to figuring out the most effective ways to collaboratively code with GPT, either through aider or other UXs. My best piece of advice is taken from aider's tips list and applies broadly to coding with LLMs or solo:

Large changes are best performed as a sequence of thoughtful bite sized steps, where you plan out the approach and overall design. Walk GPT through changes like you might with a junior dev. Ask for a refactor to prepare, then ask for the actual change. Spend the time to ask for code quality/structure improvements.

