Brave Leo now uses Mixtral 8x7B as default

rhdunn · on Jan 27, 2024

If you want to run Mixtral 8x7B locally you can use llama.cpp (including with any of the supporting libraries/interfaces such as text-generation-webui) with https://huggingface.co/TheBloke/Nous-Hermes-2-Mixtral-8x7B-S....

The smallest quantized version (2bit) needs 20GB of RAM (which can be offloaded onto the VRAM of a decent 4090 GPU). The 4bit quantized versions are the largest models that can just about fit onto a 32GB system (29GB-31B). The 6bit (41GB) and 8bit (52GB) models need a 64GB system. You would need multiple GPUs with shared memory if you wanted to offload the higher precision models to VRAM.

I've experimented with the 7B and 13B models, but haven't experimented with these models yet, nor other larger models.

viraptor · on Jan 27, 2024

And if you want better performance when talking about code, you can try the dolphin-mixtral fine tuning https://huggingface.co/TheBloke/dolphin-2.7-mixtral-8x7b-GGU...

tarruda · on Jan 27, 2024

> You would need multiple GPUs with shared memory if you wanted to offload the higher precision models to VRAM.

Or just a powerful apple silicon machine? I've tried dolphin mixtral 4bit on a 36gb ram MacBook m3, and inference is super fast.

mkesper · on Jan 27, 2024

Or a Linux machine with a Ryzen using the internal GPU and the unified RAM (scroll down at llama.cpp and look for ROCm).

jacooper · on Jan 27, 2024

Wait ROCm support Ryzen APUs and still doesn't support dedicatedly GPUs like the 6700XT?!

jacooper · on Jan 27, 2024

Supports* dedicated*

trissi1996 · on Jan 27, 2024

While not being officially supported, rocm runs just fine on my 6700XT, i just have to set an env var(export HSA_OVERRIDE_GFX_VERSION=10.3.0)

jacooper · on Jan 28, 2024

Really? Does everything run? Even AI stuff? Do you have any links where I can read more about that?

trissi1996 · on Jan 29, 2024

Everything I've tried to get running, worked quite smoothly. Although I only tried LLMs via llama-cpp and stable diffusion via ComfyUI. I don't see any reason why other AI stuff wouldn't work as long as it supports rocm.

Also I only tried it on linux, AFAIK windows is a lot more difficult to get running, if it works at all...

With llama-cpp, I successfully tried various LLMs(e.g. LLAMA 13B, Mixtral etc) with very solid performance. Even for models that don't fit in VRAM completely, performance can be surprisingly solid, as long as you compile with AVX extensions. (and your CPU supports those)

Stable Diffusion via ComfyUI also works very well. However, be aware of VRAM limitations with the larger SDXL variants, especially when running a heavy desktop environment.

Regarding setup guides/links, there isn't a good centralized resource sadly, so some tinkering is needed. Unlike some of those CUDA 1-click solutions, ROCm requires more manual setup, especially for the models only unofficially supported.

Here are a couple of links that might be helpful:

https://old.reddit.com/r/LocalLLaMA/comments/18ourt4/my_setu...

https://old.reddit.com/r/StableDiffusion/comments/ww436j/how...

https://rentry.org/eq3hg

In general the r/localllama & r/StableDiffusion subreddits are good places to search for info.

KeplerBoy · on Jan 27, 2024

Or a jetson orin agx (~2k$). Probably the cheapest way to get an Nvidia GPU with 64 GB of RAM.

fsiefken · on Jan 27, 2024

I wonder what would be the cheapest way to run an LLM, with the latest Ryzen integrated graphics and 64G Ram or the Jetson AGX Orin 64. https://www.nvidia.com/en-us/autonomous-machines/embedded-sy...

KeplerBoy · on Jan 27, 2024

The Ryzen is a lot cheaper, but most likely also a fair bit slower. You'd be looking at a 200$ CPU, 200$ Motherboard + 200$ of ddr5 ram. Throw in a case, nvme drive and power supply and you're still below $1k and those numbers are quite generous estimates, you could do it a lot cheaper by going AM4 with DDR4 ram.

robterrell · on Jan 27, 2024

Have you tried this yourself? Curious to know how well this works for an LLM home lab.

kkielhofner · on Jan 27, 2024

I’ve worked with Jetson going back to the TK1 and I highly recommend you do not do this.

Nvidia has significant dominance in the AI space because of their work on software and the overall platform.

With the Jetson line being the sole exception. Use it for what it’s for - a targeted build for an embedded/specific application requiring small size and low power.

The software is a mess. Support for Jetson (generally) is a far afterthought or not considered at all around projects at Nvidia and the broader ecosystem. When it is supported at all it lags behind significantly, using ancient distros (Jetpack), etc. To make matters worse the user base is so (relatively) tiny there are bugs and strange behavior everywhere.

Just don’t do it.

dchichkov · on Jan 28, 2024

This is a bit surprising to hear. Current Jetpack 6 is Ubuntu 22.04 - this is the current Ubuntu LTS release. There's nothing ancient about it, no? I'm pretty sure, if I go and check versions of CUDA, PyTorch, Tensorflow - it'd be also relatively recent.

I'd suggest checking what examples are available, see what community is doing, see if what you need had already been tried - https://www.jetson-ai-lab.com

From what I've seen, mainstream LLM libraries like VLLM, llamacpp that use CUDA under the hood tend to work out-of-the-box. And there are tutorials available: https://www.jetson-ai-lab.com/tutorial_text-generation.html. I think that TensorFlow/Pytorch are also well maintained, although I've not checked recently.

kkielhofner · on Jan 28, 2024

I think this perspective comes from a lack of historical experience and hands-on experience overall.

Nvidia more broadly has very impressive support for their GPUs. If you look at the support lifecycles for their Jetson hardware over time it's significantly worse. I encourage you to look at what support lifecycles have looked like, with the most "egregious" example being dropping of support for the Jetson Nano in from what I recall was within a couple of years.

Another consideration - Jetson is optimized for power efficiency/form-factor and on a per $ basis CUDA performance is terrible. The power efficiency and form-factor come at significant cost. See this discussion from one of my projects[0]. I evaluated the use of WIS on an Orin Nano that I have and it was nearly 10x slower than a GTX 1070 which is seven years old and is still supported by the latest drivers and CUDA 12 on whatever OS you want.

Nvidia knows what they're doing in terms of productization and the Jetson line should not be seen as some kind of secret hack/unlock for getting CUDA performance with gobs of RAM. In the case of LLMs I wouldn't be surprised at all if CPU beats it and at that point pickup 256GB of RAM or whatever for equivalent cost.

In the end what do I care what people use, I'm offering the perspective and experience of someone who has actually used the Jetson line for many years and frequently struggled with all of these issues and more.

[0] - https://github.com/toverainc/willow-inference-server/discuss...

dchichkov · on Jan 28, 2024

I've used Jetson for a few projects as a hobby. Made an I2S Sodar array with a TX2. And some robotics projects with a Jetson AGX Xavier that I got to evaluate and then to work on. And a few both, professional and toy projects with versions of Jetson Nano kit and Xavier. But this was between 2017 and 2021 or so.

About a year back, I took that very early version of AGX Xavier, that got released years ago. It wasn't even the version that was officially released. Yet I was able to refresh it to newer Ubuntu without any issues.

Wheels are often not pre-built for aarch64, yes. If you want to compile directly on Nano, disk performance is very important. Sometimes you get I/O bound.

Orin Nano being that slow in [0], it looks like you've been trying it in Aug 2023. It maybe worth re-evaluating on the latest Jetpack, it had transitioned to CUDA 12.2, TensorRT 8.6, cuDNN 8.9. I would expect that recent popularity of ASR/TTS pipelines and LLMs was not completely missed by Jetpack maintainers (there are some tutorials here - https://www.jetson-ai-lab.com ). And recently released JetPack could be optimized a lot more for these workflows.

And your project is very cool! I'd suggest sharing it and your performance numbers (!) with the maintainers of: https://developer.nvidia.com/embedded/community/jetson-proje...

kkielhofner · on Jan 29, 2024

> I've used Jetson for a few projects as a hobby. Made an I2S Sodar array with a TX2. And some robotics projects with a Jetson AGX Xavier that I got to evaluate and then to work on. And a few both, professional and toy projects with versions of Jetson Nano kit and Xavier. But this was between 2017 and 2021 or so.

Nice! I'm sorry if I seemed dismissive or even disrespectful, in my experience Jetson certainly has it's place (why I've been using them for years) but compared to "bring your distro, apt-get/.run Nvidia driver" they can be a serious shock for casual users. Then they see the performance...

> Orin Nano being that slow in [0], it looks like you've been trying it in Aug 2023. It maybe worth re-evaluating on the latest Jetpack, it had transitioned to CUDA 12.2, TensorRT 8.6, cuDNN 8.9. I would expect that recent popularity of ASR/TTS pipelines and LLMs was not completely missed by Jetpack maintainers (there are some tutorials here - https://www.jetson-ai-lab.com ). And recently released JetPack could be optimized a lot more for these workflows.

Interestingly WIS was recently bumped to CUDA 12.2, etc and the performance improvements were very marginal. WIS uses Ctranslate2 under the hood (same as faster-whisper) which offers among the best Whisper performance overall but doesn't benefit much from changes in these underlying libraries. In the end even if it somehow magically doubled performance (it doesn't and won't) that still places the latest generation ~$600 Jetson board 5x slower than an ancient yet still fully officially supported ~$100 GPU. Power and form-factor is an issue but for the voice assistant use case a Jetson board barely doing realtime with Whisper medium is unacceptable to me and the vast majority of our users. Our goal is sub one second voice command sessions from end of speech, to command execution, to TTS response and Jetson just can't provide that at any cost.

I'm glad there are community resources for Jetson platforms (which I'm aware of) but their existence underscores my point - you'll notice when perusing through there are often various hoops to jump through whereas anything else is basically "install driver, container toolkit, docker run" and it just works and works performantly. Basically CUDA x86_64 and discrete GPUs is native/expected/developed for, Jetson is almost always a bit of an edge case with rough edges (relatively) all over the place.

> And your project is very cool!

Thanks! In terms of your suggestion I certainly might but in the meantime, overall (based on my Jetson experience) as I said in that discussion I'm very reluctant to officially support the Jetson line with WIS. I'm almost certain it will blow-back on the project and cause support headaches for us while all the while providing a sub-optimal user experience.

barkingcat · on Jan 28, 2024

I have a Jetson as well, and you are sorely mistaken. Just reading the doc pages everything seems nice and well, but Nvidia deprecates these little boards like no other. No support after you've bought the thing, and everything is kept frozen. (ie no new python, no new python dependencies, etc) What they aren't telling you is that specific sub-versions within each jetson/orin family board have differing support (ie not what they say on that website you are reading), and it's up to you to figure it out.

I've gotten my Jetson to work well using Yocto to build my own linux distros with correct updated dependencies, libraries and updated jetpack, but it's not for the faint of heart, and that's a whole other ball of yarn. It also takes a few hours to generate a new build every time I need to update some dependency that depends on other dependencies (Yocto maintenance is a full time job in many embedded development shops - you're basically authoring your own distribution).

Treat these devices as what they are: embedded target boards for fixed industrial development (for example, to go into a robot or a car - once that design is finished, Nvidia will expect you to NEVER update any part of the system with an embedded jetson or orin system for years, until you replace the whole thing with their newest model that you buy off the shelf again).

This is standard fare in embedded and robotics space. Do not use these boards for any kind of rapidly moving software development, because it's the wrong tool for the job.

kkielhofner · on Jan 29, 2024

+1

Software for Jetson boards should be viewed as firmware for these embedded/industrial devices. They get installed in a robot, MRI machine, etc with a specific bespoke application targeting what they came with and are never touched again -or- supported by some large commercial firm with the skillsets you describe.

I was as firm/absolute in my original reply as I was because anyone who thinks life with Jetson is similar to life with a discrete Nvidia GPU on x86_64 will be in for a huge shock and 95% of the time it will end up on their shelf in a year or two.

It's one thing when it's the latest random ARM SBC you bought for $50 with no vendor support, it's another thing entirely when you're spending > $600 (or $2000 as this started!!!!) on a Jetson.

KeplerBoy · on Jan 28, 2024

Yes, it's all rather recent in my experience. You get CUDA 12 and the newest Pytorch.

kkielhofner · on Jan 28, 2024

For now. Check back in a couple of years.

qrios · on Jan 27, 2024

According to this article [1] it looks like there is no complex preparation needs to run the inference on a Jetson system. Should work with Mixtral too.

[1] https://www.hackster.io/pjdecarlo/llama-2-llms-w-nvidia-jets...

KeplerBoy · on Jan 27, 2024

I haven't tried it for LLMs yet, i use it for real time RF processing, but I actually have one of them on my desk and they are fun little devices.

Maybe I will try to get a 32 GB+ LLM running one of those days.

stavros · on Jan 27, 2024

What? I can do this? Runs to the PC

EDIT: I cannot, I need to install ROCm to compile with it, and then install something called hipBLAS, and who knows what else.

mkesper · on Jan 28, 2024

Well, yes, you need to install ROCm and depdendencies. Have a look at https://rocm.docs.amd.com/projects/install-on-linux/en/lates... Debian trixie (not yet released) has most dependencies as packages. Or you can try a docker container https://rocm.docs.amd.com/projects/install-on-linux/en/lates...

stavros · on Jan 28, 2024

I'll try that, thanks!

assbuttbuttass · on Jan 27, 2024

OpenCL should also work on AMD cards, and is way easier to install

brucethemoose2 · on Jan 27, 2024

It is dead slow on integrated graphics, unfortunately.

stavros · on Jan 27, 2024

Does that let me use unified memory on the GPU, though? Or is it just so I can use my CPU memory?

EDIT: Oh, no, I have an nVidia GPU, AMD CPU.

mkesper · on Jan 28, 2024

I bet your AMD CPU has an internal GPU, too. That's what you can use with the unified memory.

dimask · on Jan 27, 2024

How much RAM are you able to set aside for a ryzen igpu?

zaat · on Jan 28, 2024

I think my motherboard allow me to dedicate 12. I didn't see any improvement using CPU + ROCm compared to CPU alone. Using CPU alone I can get 4.2 - 5 Tokens/s, with ROCm I can get 4.5 - 5.2 T/s. With CPU + RTX 2070 8GB I get 6.2-7 T/s.

ode · on Jan 28, 2024

How fast is it with a setup like this?

pennaMan · on Jan 27, 2024

I can run 4bit on a beat up 1070 ti. GP talks about higher precision models

sp332 · on Jan 27, 2024

You wouldn’t be able to fit the whole model into 8GB VRAM. It’s faster than not using a GPU at all, but most of it would still be computed on the CPU.

baq · on Jan 27, 2024

IME ollama ran mixtral on a 1070 fast enough.

dimask · on Jan 27, 2024

Though it most probably does not run in on the 1070 but rather on the cpu. It cannot fit on a 1070, it is not about speed, a 1070 cannot run it period.

Dkuku · on Jan 28, 2024

In llama.cpp You can offload some of the layers to gpu with -ngl X. Where x is the number of layers

asimpleusecase · on Jan 27, 2024

Did you do anything special to make that work? Is it useful? Or just a toy?

windexh8er · on Jan 27, 2024

I have a 14" MBP with an M1 Max and 64GB. The M3 won't really make a difference, but the RAM, since unified, is huge. I can run most models on this machine with realtime performance compared to a Ryzen 7735HS and 64GB (DDR5). Now I'm not saying the Ryzen setup should be good, but the M1 architecture just makes it a much better option. I could add an eGPU to the Ryzen system and it could likely do better, but would also exceed the price point and portability.

paulmd · on Jan 28, 2024

it's not just that it's huge and unified - ryzen APUs obviously can have 2x32GB SODIMMs put in them and they support unified memory too.

the difference is the bandwidth and the computational power of the APU. M1 Max is roughly similar to a PS5 in terms of overall system design (shader configuration and bandwidth) plus has dedicated AI inference units already (which won't be added to consoles until PS5 Pro launches with RDNA 3.5). It is far more bandwidth than you can get out of a socketed-memory laptop system.

https://twitter.com/Locuza_/status/1450271726827413508

To support that level of performance in a socketed-memory system you will need an extra layer of caching added to the processor to supplement the bandwidth - and maybe still need to go to quad-channel. Those products are Strix and Strix Halo and should be hitting the market over the next year or two but the reality is that the M1 Max was an absurdly powerful laptop, far more potent than even the first-gen 5nm laptops for x86 let alone the other junk you could buy in 2020.

This is the problem with the discourse around apple silicon for the last few years: yeah, they're expensive, but even a loaded-out x86 laptop doesn't get you the same capabilities. Even if the x86 is competitive in some particular benchmark on iso-node you are probably spending more power to do it, and the x86 product comes years after the apple product, and still has a much weaker gpu and less bandwidth (which doesn't just matter for GPU, it matters for compiling and JIT too).

It is incredibly silly to look back on the discourse in 2020-2023 around apple silicon, a lot of reviewers made extremely silly claims about how "even 7nm x86 processors were already competitive with apple silicon" and as the ecosystems have matured it is obvious that even 5nm processors are not quite competitive yet. And they dumped on the SPEC tests and Geekbench that measured this properly, in favor of dumb things like cinebench R23 and so on (it's always cinebench used for this dumb shit tbh, CB R13/R15 were hugely misleading at the zen1 launch too). Let alone things like, you know, compiling or JVM/node workloads...)

(similarly, gotta love the vibe a few years ago of: "threadripper vs mac pro" - did you know that a 64C threadripper with 256GB RAM is actually cheaper than a mac pro loaded out with 2TB!? waow, who knew systems with an order of magnitude less capacity would be cheaper!? https://youtu.be/BH291DQRIOg )

brucethemoose2 · on Jan 27, 2024

I've had less luck with Mixtral, but I run Yi 34B finetunes for general personal use, including quick queries for work.

Its kinda like GPT 3.5, with no internet access and slightly less reliable responses, but unrestrained, much faster and with a huge (up to 75K on my Nvidia 3090) usable context.

Mixtral is extremely fast though, at least at a batch size of 1.

simonw · on Jan 27, 2024

Which Yi 34B finetunes are you using that have a 75,000 token length?

brucethemoose2 · on Jan 27, 2024

All of the Yi 200K finetunes should support it, but you have to be careful because some degrade the base model's quite excellent long context performance more than others. The very strong Bagel 34B DPO model, for instance, basically doesn't work at long context.

Nous Capybara is a popular one. I personally use my own merge of many models, and you can look through the constituent models to see if any interest you: https://huggingface.co/brucethemoose/Yi-34B-200K-DARE-megame...

You can't really use llama.cpp for super long context btw, its just too slow and vram inefficient at the moment.

tarruda · on Jan 28, 2024

Nothing special other than llama.cpp, which is an inference engine optimized for apple silicon.

I heard you can simply install ollama app which uses llama.cpp under the hoods, but has a more user friendly experience.

EarthLaunch · on Jan 27, 2024

I've been using it for 'easy' queries like syntax/parameter questions, in place of ChatGPT 4. It's great for that. I am using a ~48GB version.

bearjaws · on Jan 27, 2024

2bit is pretty damn terrible, I don't recommend it for anything serious.

ignoramous · on Jan 27, 2024

At that level of quantization / distillation, smaller models like phi-2 (q&a) and wavecoder-6.7b (code-gen) might be preferable over QLoRAd ones: https://huggingface.co/microsoft/phi-2

> 2bit is pretty damn terrible

Wait till you go hybrid [0] or even 1bit [1]

[0] https://github.com/efeslab/Atom

[1] https://github.com/IST-DASLab/qmoe

jodleif · on Jan 27, 2024

I prefer koboldcpp over llama.cpp. It’s easy to spilt between gpu/cpu on models larger than VRAM

sp332 · on Jan 27, 2024

Llama.cpp has --n-gpu-layers that lets you set how much of the model to put on the GPU.

DrSiemer · on Jan 27, 2024

Runs in Oobabooga textUi as well, if you add the llama.cpp extension. Easier interface imo, plus fun stuff like coqui and whisper integration.

rhdunn · on Jan 27, 2024

That's interesting. It also looks like koboldcpp works better with long interactions, as it only processes changed tokens. I'm using llama.cpp with text-generation-webui and its OpenAI compatible API. I'll have to look to see if I can use koboldcpp with it.

sp332 · on Jan 27, 2024

Llama.cpp has an interactive mode, but I don't think text-generation-webui uses it. https://github.com/ggerganov/llama.cpp/blob/master/examples/...

jodleif · on Jan 27, 2024

Indeed. Koboldcpp works fine with other UIs than the bundled one.

magicalhippo · on Jan 27, 2024

I've got an aging 2080Ti and Ryzen 3800X with 96GB RAM, any point in trying to mess with the GPU or?

Haven't really been able to justify upgrading to a 4090 or similar given I play so few new games these days.

htsh · on Jan 27, 2024

Yes, offloading some layers to the GPU and VRAM should still help. And 11gb isn't bad.

If you're on linux or wsl2, I would run oobabooga with --verbose. Load a GGUF, start with a small number of GPU layers and creep up, keeping an eye on VRAM usage.

If you're on windows, you can try out LM Studio and fiddle with layers while you monitor VRAM usage, though windows may be doing some weird stuff sharing ram.

Would be curious to see the diffs. Specifically if there's a complexity tax in offloading that makes the CPU-alone faster but in my experience with a 3060 and a mobile 3080, offloading what I can makes a big diff.

macNchz · on Jan 27, 2024

> Specifically if there's a complexity tax in offloading that makes the CPU-alone faster

Anecdotal, but I played with a bunch of models recently on a machine with a 16GB AMD GPU and 64GB of system memory/12 core CPU. I found offloading to significantly speed things up when dealing with large models, but there was seemingly an inflection point as I tested models that approached the limits of the system, where offloading did seem to significantly slow things down vs just running on the CPU.

baq · on Jan 27, 2024

I had only cuda installed and it took 2 ollama shell commands in WSL2 from quite literally 0 local LLM experience to running mixtral fast enough on a 1070 and 12700k. Go for it.

attentive · on Jan 27, 2024

kobold bundles and runs llama.cpp. So it should be fairly the same with convenient defaults.

DreamGen · on Jan 27, 2024

When talking about memory requirements one also needs to mention the sequence length. In case of Mixtral, which supports 32000 tokens, this can be a significant chunk of the memory used.

chpatrick · on Jan 27, 2024

`ollama run mixtral:8x7b-instruct-v0.1-q3_K_L` works fast on my 3090 locally

beck5 · on Jan 27, 2024

Dumb question, but how can a 32 bit number be converted to 2 bits and still be useful? It seems like magic.

regularfry · on Jan 27, 2024

Mixtral and others are often distributed as 16-bit floats, so that chops the problem in half immediately, but then it turns out that LLMs only have about four bits per parameter of actual information stored. There's a lot of redundancy. The ideal quantisation scheme would only throw away useless data, but no quantisation scheme is perfect so they inevitably harm the model somehow.

You've then got to remember that one thing neural networks are very, very good at is being noise tolerant. In some senses that's all they are - noise correction systems. The inaccuracies introduced by quantisation are "just" a sort of noise, so it's not surprising that they aren't fatal. It just raises the noise floor and gives the model more ways to be wrong.

Finally the thing to know is that these quantisation schemes don't do a naive "chop each number down to two bits", not exactly. Simplifying a bit, for each parameter in this example they'd try to find a mapping from a two-bit index into a four element lookup table of higher-precision values such that the information destroyed by replacing the original parameter by the lookup value is minimised. That mapping is calculated across small blocks of parameters, rather than across the entire model, so it can preserve local detail. The lookup table gets stored per block, which throws the compression ratio off a little.

DougBTX · on Jan 27, 2024

Nice graphs here: https://github.com/ggerganov/llama.cpp/pull/1684

So for example, 2 bit version of the 30B is much worse than the original, but still better than the 13B model.

Also, there are lots of extra details, eg, not all of the weights are 2 bit, and even the 2 bit weights are higher than that overall as groups of quantised weights share scale factors stored elsewhere.

beefield · on Jan 27, 2024

I think of it with this kind of analogy: the original image is stored with 32 bit color scheme. You can reduce the color scheme to 16 bit accuracy and still figure out pretty well what the image is about. 2 bit is stretching this to a bit far, basically either pixel is white or it is black, but even if you lose lots of nuances in the image, in many images even that gives you some idea whats going on in the image.

DougBTX · on Jan 27, 2024

That’s an interesting question, I wonder if there is an analogy in quantisation to image dithering?

hnfong · on Jan 27, 2024

This blog post might shed some light on the matter. If I'm understanding it correctly, it claims there are emergent features on the LLM weights that make it easier to "compress" the floats into smaller bits without losing much precision.

https://timdettmers.com/2022/08/17/llm-int8-and-emergent-fea...

Note that 2 bit quantization is generally regarded as too aggressive. Generally 4bits+ achieves a good tradeoff, see eg. https://arxiv.org/abs/2212.09720

brucethemoose2 · on Jan 27, 2024

Its not really 2 bits.

Modern quantization schemes are almost like lossy compression algorithms, and llms in particular are very "sparse" and amenable to compression.

ttoinou · on Jan 27, 2024

All the 32 bits weren't necessarily used, and it's the whole network itself that has to be useful. It's a tradeoff. We started with very good precision to test the new method, now we can optimize some parts of it

Const-me · on Jan 27, 2024

Here’s an example of a custom 4 bits/weight codec for ML weights:

https://github.com/Const-me/Cgml/blob/master/Readme.md#bcml1...

llama.cpp does it slightly differently but still, AFAIK their quantized data formats are conceptually similar to my codec.

sp332 · on Jan 27, 2024

The extra precision is more useful for training. Once the network is optimized, it's a statistical model and only needs enough precision to make good guesses. In fact, one of the big papers on this also pointed out that you can drop about 40% of the weights completely. I think people generally skip that part because sparse matrix operations are slower, so it doesn’t help here.

viraptor · on Jan 27, 2024

For models with dropped weights, the keyword is "distilled". For example ssd-1b is a 50% size version of Stable Diffusion XL (https://huggingface.co/segmind/SSD-1B)

sp332 · on Jan 27, 2024

That’s crazy, I’ve never seen one that dropped whole layers from a pre-trained model. I guess that avoids the sparse matrix math.

EVa5I7bHFq9mnYK · on Jan 27, 2024

Faraday.dev has it in its selection of models now. Good for us clueless Windows folks. Runs decently fast with 16gb mobile 3080 gpu. Results seem better than any other free option.

thriw63748 · on Jan 27, 2024

Why not normal RAM? Ryzen 5600 with 128GB DDR4 is perfectly fine to run mixtral 8bit, and costs less than $1000.

GPUs are only needed if you can not wait 5 minutes for an answer, or for training.

snowfield · on Jan 27, 2024

Or if you want multiple sessions at the same time. Or if you want to do anything else with your machine while it's running.

But realistically, 5 minutes is too long. It should be conversational, and for that you need at least 5 tokens per second. Which your Ryzen just can't do.

MPSimmons · on Jan 27, 2024

>It should be conversational, and for that you need at least 5 tokens per second.

To be fair, a lot of people are using this for non-interactive work, like batching document analysis or offline processing of user generated content.

Diti · on Jan 27, 2024

This particular thread we are commenting on is about Dolphin Mixtral, which is mostly used for offline code completion (à là Microsoft GitHub Copilot). You don’t want to have to wait 5 minutes at every keystroke to get code suggestions.

Gracana · on Jan 27, 2024

In my experience, it takes some experimentation to figure out a good prompt. I don’t think I would have gotten very far off I had to wait that long for each result.

irusensei · on Jan 27, 2024

Why not both? Llama.cpp allows layering GGUF models between GPU and CPU memory.

dragonwriter · on Jan 27, 2024

> GPUs are only needed if you can not wait 5 minutes for an answer

Yeah, but that's generally true (or at least, “5 minutes for an answer is very suboptimal”, even if “can’t” isn’t quite true) for interactive use cases, which are... a lot of LLM use cases.

juliangoldsmith · on Jan 27, 2024

Not sure why you're getting downvoted. It performs decent enough on my Ryzen 3600X with 64GB of RAM. It definitely wouldn't be usable for production or fine-tuning, but it's fine for experimenting.

brucethemoose2 · on Jan 27, 2024

> perfectly fine

Only for very short context and responses.

Beyond that, the performance is painful.

rhdunn · on Jan 27, 2024

That was what I was referring to with the 32/64 GB systems.

SkyMarshal · on Jan 27, 2024

What's the bandwidth between the Ryzen and that DDR4?

MuffinFlavored · on Jan 27, 2024

What differences would I measurably notice running the 2-bit version vs the 4-bit version vs the 6-bit vs the 8-bit?

zaat · on Jan 27, 2024

Using TheBloke nous-hermes-2-mixtral-8x7b-dpo.Q5_K_M.gguf, llam.cpp with oobabooga, on Ryzen 9 7950x and RTX 2070 super, offloading 4 layers to the GPU.

27,631MiB system memory used.

7358/8192MiB usage on the GPU.

I'm getting 6.32 tokens/second.

You: You are Hunter S. Thompson, write a 450 words essay on the LHC project. The writing must be in Thompson's style

AI: Well, buckle up, kiddos, because we're about to dive into the twisted world of particle physics and the Large Hadron Collider, or as I like to call it, the Doomsday Machine. This beast of a thing is buried deep beneath the Swiss countryside, a 17-mile-long ring of death and destruction that makes the atom bomb look like a goddamn firecracker.

Now, let me tell you, when the Swiss government first proposed this monstrosity, I thought they'd gone completely batshit crazy. But then again, this is the land of Toblerone and Heidi, so maybe I shouldn't be too surprised. The LHC is essentially a giant, underground racetrack for subatomic particles, where they're accelerated to nearly the speed of light before being smashed into each other like a couple of drunken frat boys at a kegger.

The goal of this madness? To recreate the conditions that existed just after the Big Bang, when the universe was nothing more than a seething cauldron of energy and chaos. And if that doesn't sound like a recipe for disaster, I don't know what does.

But let's not forget the real reason we're all here: the Higgs boson, or as it's more commonly known, the God Particle. This little bugger is what gives other particles their mass, and without it, we'd all be floating around like a bunch of cosmic tumbleweeds. The LHC was designed to find this elusive particle, and in 2012, they finally did it. Sort of.

You see, the Higgs boson they found was a bit of a disappointment. It was too light, too wimpy, and didn't behave quite the way they expected it to. So now, the physicists are back at the drawing board, trying to figure out what the hell went wrong. And in the meantime, the LHC continues to churn out more and more data, like some kind of unholy particle-smashing machine.

But let's not forget the dark side of the LHC. There are those who believe that this thing could very well be the end of the world as we know it. They fear that the LHC could create a black hole, or even a miniature Big Bang, right here on Earth. And while the scientists assure us that these fears are unfounded, I can't help but wonder if they're just whistling past the graveyard.

So, what have we learned from all this? Well, for one thing, we've discovered that the universe is a much stranger place than we ever imagined. We've also learned that even the most brilliant minds can be blinded by their own ambition, and that sometimes, the pursuit of knowledge can lead us down some very dark paths.

In the end, the LHC is a testament to the insatiable curiosity of the human race, and our unquenchable thirst for knowledge. It's a monument to the power of science, and the potential consequences of playing God. And as we continue to push the boundaries of what's possible, we must always remember that with great power comes great responsibility. Or, as I like to say, "When the going gets weird, the weird turn pro."

tacoman · on Jan 27, 2024

I am using the exact same model. Ryzen 5600G w/32GB and an Nvidia P40 w/24GB VRAM

20/33 layers offloaded to GPU, 4K context. Uses 25GB system RAM and all 24GB VRAM. 5-7 tokens per second.

zaat · on Jan 27, 2024

Context is set to 32768, I didn't change it I guess that's the model's default.

Thanks for making me feel better about investing in tht motherboard + CPU + RAM upgrade and deferring the GPU upgrade.

attentive · on Jan 28, 2024

and Groq does 485.08 T/s on mixtral 8x7B-32k

I am not sure local models have any future other than POC/research. Depends on the cost of course.

tome · on Jan 31, 2024

(Groqster here) For anyone who wants to try it, you can go to https://chat.groq.com/ and choose Mixtral from the drop-down menu. Also, feel free to ask me any questions about Groq hardware or service.

wolverine876 · on Jan 27, 2024

Kudos to Brave (for this and other privacy features):

Unlinkable subscription: If you sign up for Leo Premium, you’re issued unlinkable tokens that validate your subscription when using Leo. This means that Brave can never connect your purchase details with your usage of the product, an extra step that ensures your activity is private to you and only you. The email you used to create your account is unlinkable to your day-to-day use of Leo, making this a uniquely private credentialing experience.

quinncom · on Jan 27, 2024

This is very cool, and something I’d like to integrate in my own apps. Does anybody know how this works exactly, not using foreign keys?

luke-stanley · on Jan 27, 2024

I could guess, an "anonymous payment credential service" could do something like this: 1. User completes payment for the paid for service, 2. To track the payment entitlement, a random, unique ID is generated by the service for the user, that is not related to any of their data. 3. This ID is saved in a database as a valid payment key. 4. The database records IDs in shuffled batches, or with semi-random fuzzy / low resolution timestamps to prevent correlation between payment time and ID generation. 5. Each ID has an entitlement limit or usage stopping point, ensuring it's only valid for the subscribed period.

Another way might be Zero-Knowledge Proofs (ZKPs), but that might be more complex. They might even use their BAT crypto stuff for this somehow, I suppose. Whatever solution, would need a fundamental solution for how to avoid correlation, I think.

syntaxing · on Jan 27, 2024

Interesting, I must have missed the first Leo announcement. I really like how privacy conscious it is. They don’t store any chat record which is what I want.

Dwedit · on Jan 27, 2024

There is no way to confirm that claim, just like there is no way to confirm that a VPN service is "no log".

lolinder · on Jan 27, 2024

Yes, at some point if you're going to interface with other humans you will eventually just have to trust their word.

For some people's threat models that isn't good enough, but for the vast majority of people—people who aren't being pursued by state intelligence agencies but who are squeamish about how much data a company like Google collects—a pinky promise from Brave or Mullvad is good enough.

wolverine876 · on Jan 27, 2024

> For some people's threat models that isn't good enough, but for the vast majority of people—people who aren't being pursued by state intelligence agencies but who are squeamish about how much data a company like Google collects—a pinky promise from Brave or Mullvad is good enough.

Who are you to say it's good enough (and ridicule people who disagree)? We don't have too much evidence of it, because they have very few options and of course most people are not informed and lack the expertise to understand the issues (a good situation for regulation). At one point lots of people used lead paint and were fine with it; they would have told us.

> Yes, at some point if you're going to interface with other humans you will eventually just have to trust their word.

There's technology, such as the authorization tokens used by Brave, that reduces that risk. Of course, no risk can be complete eliminated but that doesn't mean we shouldn't reduce it.

lolinder · on Jan 27, 2024

> say it's good enough (and ridicule people who disagree)?

I'm not ridiculing anyone, I explicitly say that for some people's threat models it isn't good enough.

wolverine876 · on Jan 27, 2024

You said that people who agreed with you are,

> people who aren't being pursued by state intelligence agencies

That implies that the only valid reason to disagree is if you are pursued by state intelligence agencies. Obviously it's ridiculous to think that you are.

salad-tycoon · on Jan 27, 2024

Well, these days a lot of people are actually pursued. You know like parents, social activists, various religious movements. It’s not just only the Dr. Evils of the world anymore.

bcye · on Jan 27, 2024

I would like to think GDPR ensures this pinky promise is good enough

Erratic6576 · on Jan 27, 2024

You gotta trust them by their word

Tanath · on Jan 27, 2024

It's foolish to trust Brave. https://www.spacebar.news/p/stop-using-brave-browser

firtoz · on Jan 27, 2024

What are good API providers that serve mixtral? I know only octo ai which seems decent but will be good to know alternatives too

jeanloolz · on Jan 27, 2024

The creators of the model actually have their own platform where you can access this model and others via API: https://console.mistral.ai/

attentive · on Jan 28, 2024

I just discovered Groq, which does 485.08 T/s on mixtral 8x7B-32k

No idea on pricing but supposedly one can email to api@groq.com

frozenport · on Jan 31, 2024

I think you can try it online at chat.groq.com

tome · on Jan 31, 2024

(Groqster here). Yes, you can select Mixtral from the dropdown menu. If anyone has any questions about Groq let me know and I'll do my best to answer!

Tiberium · on Jan 27, 2024

OpenRouter is generally a good option (already mentioned), the best part is that you have a unified API for all LLMs, and the pricing is the same as with the providers themselves. Although for OpenAI/Anthropic models they were forced (by the respective companies) to enable filtering for inputs/outputs.

petercooper · on Jan 27, 2024

Both already mentioned, but I am using Anyscale Endpoints with great success, very fast and will work on ten jobs at a go out of the box. Together.ai also seems to work fine in my initial tests, but haven't tried it at scale yet.

mark_l_watson · on Jan 27, 2024

I have used both Mistral’s commercial APIs and also AnyScale’s commercial APIs for mixtral-8-7b- both providers are easy to use.

I also run a 3 bit quantization of mixtral-8-7b on my M2 Pro 32G memory system and it is fairly quick.

It is great having multiple options.

htsh · on Jan 27, 2024

openrouter, fireworks, together.

we use openrouter but have had some inconsistency with speed. i hear fireworks is faster, swapping it out soon.

tome · on Jan 31, 2024

I work for Groq and we serve the fastest available version of Mixtral (by far) and we also have a web chat app. I'll refrain from linking it because it has already been linked and I don't want to spam, but I'm available to answer any questions people have about Groq's hardware and service.

bearjaws · on Jan 27, 2024

Together.ai seems to be the best, incredibly fast.

T_S_ · on Jan 31, 2024

Not so sure about that. Check out https://github.com/ray-project/llmperf-leaderboard

And try mixtral on chat.groq.com

RupertEisenhart · on Jan 27, 2024

These guys are much faster than openrouter, and their llama2 runs faster than 3.5-turbo. Amazing work.

firejake308 · on Jan 27, 2024

I personally like Anyscale Endpoints

Philpax · on Jan 27, 2024

I've had good experiences with Together, and they have very competitive pricing.

jug · on Jan 27, 2024

You also have Replicate: https://replicate.com/mistralai/mixtral-8x7b-instruct-v0.1

frozenport · on Jan 31, 2024

I've been running the version on poe and chat.groq.com for the last week.

Much better than llama 70b.

charcircuit · on Jan 27, 2024

It's interesting that they made it so you can ask LLM queries right from the omnibar. I wonder if they eventually will come up with some heuristic to determine if thr query should be sent directly to an LLM or if the query should use the default search provider.

m3kw9 · on Jan 27, 2024

If you have used gpt4 and then use mistral, it’s like looking at a Retina display and then have to go back to a low res screen. You are always thinking “but GPT4 could do this though”

mpalmer · on Jan 27, 2024

Have you used mixtral?

kristianpaul · on Jan 27, 2024

I run Mixtral locally using ollama

emmanueloga_ · on Jan 28, 2024

Does anyone know of a good chrome extension for AI page summarization? I tried a bunch of the top Google search hits, they work fine but are really bloated with superfluous features.

Terretta · on Jan 29, 2024

See Kagi's Universal Summarizer

https://kagi.com/summarizer/index.html

https://help.kagi.com/kagi/api/summarizer.html

"Alternatively use Kagi Search browser extension (Chrome/Firefox) and you can use the most advanced Muriel model right from the extension."

andai · on Jan 27, 2024

Asked Mistral 8x7B for an essay on ham. It started telling me about Hamlet.

Erratic6576 · on Jan 27, 2024

It must start from the beginning. Pig > piglet. Ham > Hamlet

andai · on Jan 30, 2024

Would make sense if it was the first token. But it's the last, presumably with a "end of user message" separator! (Or perhaps not? I don't know.)

fifteen1506 · on Jan 27, 2024

Just checking: PDF summarization is not yet implemented, right?

_aaed · on Jan 27, 2024

The Kagi browser extension can do that, if you're a subscriber

fifteen1506 · on Jan 28, 2024

Ask a PDF? I thought it was only the $25 a month plan.

_aaed · on Jan 28, 2024

No, it's just text, like so: https://i.imgur.com/3NMzyDf.png

finikytou · on Jan 27, 2024

quick question I have 24GB VRAM and I need to close everything to run MIXTRAL at 4 bit quant with bitsandbyte. there is no way to run it at 3,5 on windows?

davikr · on Jan 27, 2024

It's nice using Brave because you have Chromium's better performance, without having to worry about Manifest V2 dying and taking adblocking down with it. I have uBlock Origin enabled, but it has barely caught anything that slipped past the browser filters.

croes · on Jan 27, 2024

Brave"s support of Manifest V3 is totally dependent on Google and Chrome

>Brave will support uBO and uMatrix so long as Google doesn’t remove underlying V2 code paths (which seem to be needed for Chrome for enterprise support, so should stay in the Chromium open source)

https://twitter.com/BrendanEich/status/1534893414579249152

steve_rambo · on Jan 27, 2024

It doesn't really matter much in practice. The built-in adblocker (which does not rely on extension APIs) has excellent performance, about as effective as ublock origin, and supports the same filter lists.

croes · on Jan 27, 2024

It does matter if you use other extensions that require Manifest V2

guipsp · on Jan 27, 2024

The built-in adblocker still relies on a large amount of extension APIs.

davikr · on Jan 27, 2024

Yeah, but the Brave adblocker is built-in, it's not an extension.

croes · on Jan 27, 2024

There are more extensions than just ad blockers

Diti · on Jan 27, 2024

If by performance you mean browser performance, you have more performance with Firefox nowadays. https://news.ycombinator.com/item?id=36770883

soundnote · on Feb 7, 2024

One genuine performance edge Firefox has is that it can handle giant YouTube comments sections better than Chromium browsers in part because it accepts larger maximum page size.

FoodWThrow · on Jan 27, 2024

I use firefox because it has the most hassle-free hardware decoding in linux. However, everything basically feels better with Brave, even with the same amount of plug-ins.

kristianp · on Jan 27, 2024

I found the snap update notifications too annoying on Ubuntu, so I tried the ppa. But it the video plugin would crash. So back to Chrome for me.

1xdevnet · on Jan 28, 2024

Curious whether you've tried with the new (non-PPA) repo directly from Mozilla as of v122 [1]. I think the old PPA was also Mozilla, so I don't know what may have changed aside from being more publicly acknowledged. Might be worth a try?

I don't have an Ubuntu VM at-hand but on Debian bookworm it installed fine, and (after tweaking one line in profiles.ini to point to my old ESR profile) it loaded and played Widevine-protected videos without any issues.

[1]: https://support.mozilla.org/en-US/kb/install-firefox-linux#w...

kristianp · on Jan 29, 2024

Thanks, I'll give that a try.

jacooper · on Jan 27, 2024

While still not being as secure as chromium and still not supporting many advanced features like WebGpu.

danShumway · on Jan 28, 2024

This has come up a few times, but as far as I know Gorhill is planning to full-on drop support for Chromium browsers outside of uBO Lite. Does Brave keeping Manifest V2 matter if developers stop maintaining the Chromium version of those extensions?

I also still can't really find if Brave has an extension store or if it's connecting to Chrome's. If it's the latter, then it seems like V2 extensions are going away for Brave regardless of what API decisions it makes, because short of sideloading them the addons won't be on the Chrome web store anymore and won't be getting updates. I assume Brave supports sideloading extensions, and maybe developers would maintain extensions for Chromium that can't actually be used in Chrome? Although that seems a little optimistic. But not having a way to search for V2 extensions or get ratings/reviews seems like it will still be a problem for users.

Departed7405 · on Jan 28, 2024

Brave indeed use the Chrome Web Store.

srid · on Jan 27, 2024

> I have uBlock Origin enabled, but it has barely caught anything that slipped past the browser filters.

I have been completely satisfied with Brave's builtin ad-blocker. Does uBlock Origin catch anything in particular that the builtin one does not?

rozab · on Jan 27, 2024

I couldn't see Brave being able to counter, say, the recent YouTube anti-adblock push as quickly as the open source community did. I could see that kind of stunt becoming more common as Google tries to nail the coffin shut and deny their competitors this USP

SparkyMcUnicorn · on Jan 27, 2024

Brave's ad blocking is compatible (and uses) the uBlock Origin lists, so you shouldn't really notice a difference between having uBO enabled or disabled in brave.

charcircuit · on Jan 27, 2024

MV3 doesn't prevent adblockers from existing.

rpastuszak · on Jan 27, 2024

It makes them almost useless in practice.

HeatrayEnjoyer · on Jan 27, 2024

Because the filter list is capped, right? Is there a reason the Brave team cannot just remove or increase the cap?

gkbrk · on Jan 27, 2024

Not just because of the filter list cap. It also reduces ad blockers to static filter lists instead of powerful dynamic filters.

MV3 makes it impossible for ad-blockers to inspect requests with code and then allow/deny dynamically.

charcircuit · on Jan 27, 2024

>It also reduces ad blockers to static filter lists instead of powerful dynamic filters.

This is very outdated information and borderline misinformation by representing it as how it currently works. It allows for 30,000 dynamic rules and 5,000 session rules (session rules only persist until the browser is closed).

>MV3 makes it impossible for ad-blockers to inspect requests with code and then allow/deny dynamically.

Giving this ability to extensions can slow down the browser for the user. These ads can still be blocked through other means.

danShumway · on Jan 28, 2024

> It allows for 30,000 dynamic rules

That is not what we mean by dynamic filters. From https://developer.chrome.com/blog/improvements-to-content-fi...

> However, to support more frequent updates and user-defined rules, extensions can add rules dynamically too, without their developers having to upload a new version of the extension to the Chrome Web Store.

What Chrome is talking about is the ability to specify rules at runtime. What critics of Manifest V3 are talking about is not the ability to dynamically add rules (although that can be an issue), it is the ability to add dynamic rules -- ie rules that analyze and rewrite requests in the style of the blockingWebRequest permission.

It's a little deceptive to claim that the concerns here are outdated and to point to vague terminology that sounds like it's correcting the problem, but on actual inspection turns out to be entirely separate functionality from what the GP was talking about. It's almost like the Chrome team deliberately decided to call these "dynamic rules" so they could claim that Chrome supported them and muddy the issue, even though Chromes "dynamic rules" have nothing to do with support for a blockingWebRequest API. But I don't want to be conspiratorial.

> Giving this ability to extensions can slow down the browser for the user. These ads can still be blocked through other means.

This is the debate; most of the adblocking community disagrees with your assertion. Chrome has been saying this for ages, but saying it doesn't make it true.

uBO maintains a list of some common features that are already not possible to support in Chrome (https://github.com/gorhill/uBlock/wiki/uBlock-Origin-works-b...) and has written about features that are not able to be supported via Chrome's current V3 API (https://github.com/uBlockOrigin/uBOL-home/wiki/Frequently-as...). Of particular note are filtering for large media elements (I use this a lot on mobile Firefox, it's great for reducing page size), and top-level filtering of domains/fonts.

Chrome could of course add support for some of this -- Chrome could implement more filter controls for analyzing headers and request sizes, but as far as I know that support hasn't been added yet. And even if that support does eventually get added, this all glosses over the bigger issue that devs have been saying from the beginning, which is that it's ridiculous to make every innovation in adblocking dependent on Chrome explicitly adding new APIs to support each individual use-case. The model that Chrome is moving towards is one where every new kind of filtering metric that devs want to use requires asking Google for permission to use it and waiting for Chrome to implement it.

charcircuit · on Jan 28, 2024

>That is not what we mean by dynamic filters.

My point was that "reduces ad blockers to static filter lists" is not true as it is not a static list of rules.

>What critics of Manifest V3 are talking about is not the ability to dynamically add rules

Yes, they are. Due to early versions of declarativeNetRequest only having a static list of rules there are a large amount of people who believe this to still be true. On this site I have encountered many people who make a claim about it being impossible to update rules without an update to the extension through the store.

>It's a little deceptive to claim that the concerns here are outdated and to point to vague terminology that sounds like it's correcting the problem, but on actual inspection turns out to be entirely separate functionality from what the GP was talking about.

Dynamic rules do correct the problem of a static list. The second half of my comment points out that there was a trade off made between functionality and performance which is why "dynamic filters" are no longer possible.

>This is the debate; most of the adblocking community disagrees with this assertion.

Considering MV3 lets extensions turn every page into a blank document or inject random scripts I don't see how it can be possible for ads to somehow force themselves to be shown to the user.

>common features that are already not possible to support in Chrome

>CNAME-uncloaking

The relevant issue in the bug tracker is https://bugs.chromium.org/p/chromium/issues/detail?id=115104... and the engineer working on it does not have enough bandwidth.

>Browser launch

This is no longer a problem with declarativeNetRequest since the rules for blocking network requests is part of the browser itself.

>as written about features that are not able to be supported via Chrome's current V3 API

Some of these are self inflected by the author like only updating rules with store updates and, some of these are due to the lack of maturity of MV3, some of these are due to trade offs that are being made ecosystem wide.

danShumway · on Jan 28, 2024

> My point was that "reduces ad blockers to static filter lists" is not true as it is not a static list of rules.

That's not what we mean by static either. What we mean by static is that extensions lose the ability to dynamically analyze requests and block them using on-the-fly logic.

> Yes, they are. Due to early versions of declarativeNetRequest only having a static list of rules there are a large amount of people who believe this to still be true.

Very obviously that is not what gkbrk meant by "powerful dynamic filters", considering that gkbrk clarified that they were referring to the ability to dynamically modify requests literally one sentence later. Also very obviously that is not what gkbrk meant because gkbrk literally said in the first sentence that they weren't talking about filter list caps.

When you replied to gkbrk saying that dynamic filters were supported you were not correcting misinformation, you were saying that gkbrk was wrong -- and they're not, they're correct -- dynamic filtering of the kind they are talking about is not supported and the lack of that API meaningfully constrains adblockers (a point which, quite frankly, you don't even disagree with given that you are characterizing this as a "tradeoff").

Did you mean to reply to a different person who wasn't present?

> Considering MV3 lets extensions turn every page into a blank document or inject random scripts I don't see how it can be possible for ads to somehow force themselves to be shown to the user.

"Just inject scripts into the webpage" is a wild take from someone who is trying to claim that Manifest V3 improves privacy, security, or performance. Nobody wants addons to be putting all of their functionality into the page itself, there are myriads of problems with that approach. If you're defending Manifest V3 and you're suggesting that addons should just completely destroy the separation between extension code and page code, then... I mean, that's just not a serious suggestion.

It's good that extensions can insert code into pages, that is sometimes necessary and extensions should have that ability. I think uBO full-on uses that ability for some features. But running code that way is not the preferred way to handle functionality or request blocking, and its something that should be done sparingly and carefully.

If the upshot of Manifest V3 in Chrome is to encourage developers to start pushing tons of extension code into the page itself when they could have before kept that code separate -- then that's a failure of Manifest V3; injecting tons of extra scripts into pages that wouldn't otherwise need to be there will make performance, security, and privacy worse for end users.

> The relevant issue in the bug tracker is https://bugs.chromium.org/p/chromium/issues/detail?id=115104... and the engineer working on it does not have enough bandwidth.

So it's not supported. I'm sorry, you want me to pretend that developer concerns have been addressed because there's an open issue where the developer is overworked and does not have the bandwidth to address it? This issue has been open for 3 years! You could not show a better example of the intrinsic problems with Manifest V3 and the intrinsic problems of extension devs needing to ask Chrome devs for permission to innovate on adblockers if you tried.

This is exactly the problem of Manifest V3. There's need for adblockers to be able to innovate, and when that innovation is dependent on Chrome adding specific APIs, then the issues sit open for 3 years. Chrome is proving before V3 even becomes a requirement that they are not capable of keeping pace with adblocker innovation and that this whack-a-mole approach to adding individual capabilities into declarativeNetRequests is unworkable and bad for addons.

If Google can't keep pace with requests during a period when extensions are still able to use V2, then they're not going to magically get better at handling feature requests when every extension is using V3. If Google's approach to addon APIs is to overwork developers and spread them thin between features, it is reasonable to conclude that either Google does not see addon support as a priority worth investing sufficient developer resources into, or that Google's model for browser development is just inherently not capable of handling API requests at a reasonable speed.

---

And I think that's the broader point. This is not a healthy relationship for Chrome to have with extension developers, and Chrome is proving in real time that it doesn't have the resources or ability to support developer requests.

But to address the other specific APIs you bring up:

> This is no longer a problem with declarativeNetRequest since the rules for blocking network requests is part of the browser itself.

I'm sorry, the inability of an extension to load before the browser starts sending requests is a non-issue to you? Sure, you can pull the blocklist out so the static rules still take effect before the extension loads, but that is far from the only issue in having extensions load asynchronously from their pages.

To be fair, you are right that declarativeNetRequest makes this slightly better, but only because prior to declarativeNetRequest Chrome already exhibited this behavior so at least now there's one way to make sure that requests get filtered from browser launch, as limited as it may be. As opposed to Firefox, where this has never been a problem for either MV2 or MV3 extensions, because Firefox doesn't make the wild decision to start loading pages before the user's addons have initialized.

> Some of these are self inflected by the author like only updating rules with store updates

I actually don't think that's the main problem with most of these. Dynamic user rules, fonts, and noscript rules are blocked by rule scoping around top-level document URLs, not because of store updates. uBO Lite does avoid making network requests to update lists, but that is one paragraph out of an entire page describing limitations that make useful features difficult or impossible to build under MV3.

Additionally, if you dig into the actual details of what Gorhill is saying about updating extension lists, you'll find that this is not just about whether you can load list updates, it's also about when it happens and the fact that (as mentioned above) uBO Lite can't guarantee that it will be running when a request is made, and can't guarantee that those rule updates will take effect before the request is sent.

> and, some of these are due to the lack of maturity of MV3

Well, it's a good thing that Google is waiting for MV3 to become mature and it's a good thing that MV2 won't stop working in June, right? /s

> some of these are due to trade offs that are being made ecosystem wide.

Sure, other apps other than adblockers will also be affected. I maintain hobby Firefox extensions that are impossible to ship on Chrome for Manifest V3. Do you expect "we didn't just break adblockers, we broke lots of other addons too" to be an argument that makes anyone more sympathetic to Google?

charcircuit · on Jan 28, 2024

>That's not what we mean by static either.

What is a static filter list other than a list of filters that does not change?

>When you replied to gkbrk saying that dynamic filters were supported you were not correcting misinformation

I was saying that the filter list was dynamic and not static.

>"Just inject scripts into the webpage" is a wild take

I was pointing out the upper bounds of what was possible and not giving recommendation on how it should be done.

>So it's not supported.

That is right.

>you want me to pretend that developer concerns have been addressed

No, I want people interested in this feature to track the bug or contribute the fix themselves since it is an open source project.

>it is reasonable to conclude that either Google does not see addon support as a priority worth investing sufficient developer resources into

That is correct in my understanding. This means that the open source community needs to dedicate their own resources if they want to accelerate progress here.

>the inability of an extension to load before the browser starts sending requests is a non-issue to you

Assuming no events are dropped then I don't see it as an issue.

>doesn't make the wild decision to start loading pages before the user's addons have initialized.

It doesn't sound that wild if it speeds up launch time by not blocking on third party extensions having to load.

>it's also about when it happens and the fact that (as mentioned above) uBO Lite can't guarantee that it will be running when a request is made

This is not an issue as the extension does not need to be running for the browser to apply the rules. The extension is only needed to update the rules.

>Well, it's a good thing that Google is waiting for MV3 to become mature and it's a good thing that MV2 won't stop working in June, right? /s

Considering that the deadline keeps being pushed back, I think they recognize that it has low maturity.

gkbrk · on Jan 28, 2024

> What is a static filter list other than a list of filters that does not change?

A request is being made. At the time of the request, you go through a list of rules and decide if you're going to block or allow it.

Once the request reaches the filter, you are working with a static list because you cannot make decisions using the content or circumstances of that particular request. You cannot add or remove anything to your list after looking at the request.

A dynamic filter on the other hand could

- Resolve hostnames with DNS to do CNAME uncloaking

- Decode encoded requests so the website cannot send both real data and analytics/tracking data through a single URL. Being able to make dynamic decisions allows the ad blocker to inspect request headers and request bodies.

- Modify responses. The web page could be getting ads from a JSON endpoint with the rest of the page. You can use filterResponseData and change the response dynamically to strip the ads return an empty JSON list for example.

danShumway · on Jan 28, 2024

Exactly. Or in other words:

> It also reduces ad blockers to static filter lists instead of powerful dynamic filters. MV3 makes it impossible for ad-blockers to inspect requests with code and then allow/deny dynamically.

I'm assuming the best here, but it is extremely clear what you meant by a "static filter" in your comment. You meant a filter where it is "impossible for ad-blockers to inspect requests with code and then allow/deny dynamically."

It's a little frustrating to have to pretend that any of this is ambiguous.

danShumway · on Jan 28, 2024

> No, I want people interested in this feature to track the bug or contribute the fix themselves since it is an open source project.

> That is correct in my understanding. This means that the open source community needs to dedicate their own resources if they want to accelerate progress here.

OR... hear me out: we could recognize that Chrome, a browser owned by one of if not the most powerful tech companies in the entire world, does not need charity, and that it's not a community owned project (if it was, MV2 would not be getting deprecated), and that the community has no say in how the project gets updated or maintained and is fact very regularly treated with hostility by the project, and that instead of trying to fix an issue that Google probably isn't even interested in getting help with, we could instead go use and contribute to Open Source browsers that aren't transparently trying to kill adblockers.

This is wild. You went from "MV3 is fine, there's just a lot of misinformation" to "MV3 is trying, but the devs are overworked" to "nobody is allowed to complain about this unless they're giving hecking Google free labor."

Google made a decision to deprecate a working API, developers correctly pointed out that the API hobbles adblockers, Google has not fixed the API but is launching it anyway, and your answer is that the open source community needs to dedicate their own resources. This problem is not our fault, we didn't make it.

How about Google not completely rewrite an API out of the blue if they don't have the internal resources necessary to do so? Is there some hiring freeze on developers, has Google lost the ability to hire and pay people? Google creates a problem, and you're blaming the Open Source community for not fixing Google's problem for them, a problem that exists entirely of Google's own free will, because Google ignored the Open Source community that was desperately trying to stop Google from making the problem.

The Open Source community has a solution to this; don't deprecate blockingWebRequest. Like most community feedback, Google isn't interested in hearing that.

I would suggest that if Open Source developers need to contribute somewhere, they could try contributing to Firefox, a browser that has none of these problems because somehow a company that is so small its entire corporate profits would barely register in Google's spreadsheets is still better at building extension APIs than Google is.

----

> Assuming no events are dropped then I don't see it as an issue.

It's an issue because extensions sometimes like to respond to events when they happen. If the browser stopped allowing synchronous responses to click events and I said, "well as long as they eventually trigger there's no issue" I would be rightly laughed out of the room.

The issue is that a user can update filters and they have no guarantee that the filters will be updated before the requests that they mean to block are fired off. If an adblocking extension does decide to update filter rules on launch, they have no guarantee that the update will take effect before the page loads.

> It doesn't sound that wild if it speeds up launch time by not blocking on third party extensions having to load.

Not loading the extension at all would speed up launch time even more, Google should just get rid of extensions entirely /s

People use extensions as a way to increase security and to build reliable features into the browser. There are many use-cases from tab-control to feature toggles where having tabs and pages load before extensions breaks functionality. It is wild to suggest that breaking functionality in userspace is preferable to adding an extra second to a browser launch.

For many people, adblocking is a security issue. To suggest that the browser should occasionally just send requests anyway before an extension has time to update and apply rules is like suggesting that a desktop should start loading and displaying user settings before it checks the user's login password.

> This is not an issue as the extension does not need to be running for the browser to apply the rules.

See above.

----

> Considering that the deadline keeps being pushed back, I think they recognize that it has low maturity.

This is once again a pretty big shift from "MV3 doesn't stop adblockers" to "MV3 isn't ready yet and they should keep pushing it back." It's also optimistic given that Google has given no indication yet that they're going to push back the June deadline.

But most importantly, it also doesn't change anything about my point -- we are seeing that Google does not have the resources to keep pace with adblocking innovations. That's not going to change. This is not a question of MV3 becoming ready and then everything becoming fine: "ready" is a moving target, and Google is not moving fast enough to keep pace with it.

What these issues demonstrate is that the declarativeNetRequest API is poorly designed -- it is poorly designed because it greatly increases the support requirements from Google to a level where the company is unable to meet them. And when Google is unable to meet those support requirements, user security and privacy suffers as a result.

This is an intrinsic problem that is baked into the design of declarativeNetRequests, and it's not going away even if Google does manage to close a couple of bugs. Because more of these issues are coming every time that adblockers need to innovate. If an API increases support burdens to the point where the dominant browser owned by the most powerful tech company has to come crawling to the Open Source community for help triaging 3-year-old issues, then it's a bad API.

And underlying all of the criticism of MV3 is that idea -- that it is not tenable for extension developers to have to request support for every individual feature that adblockers need. A better API would be a more flexible API that allows extension developers to innovate without asking permission. blockingWebRequest is an imperfect API and it has issues, but it allows extension developers to innovate without filing support tickets for every new feature and then waiting 3 years for a response because the API is flexible enough that extension developers can use it in creative ways that the authors may not have fully anticipated or known about.

charcircuit · on Jan 27, 2024

>Because the filter list is capped, right?

The limits are 300,000 static rules [1] + 30,000 dynamic rules [2] + 5,000 session rules [3]. For reference easylist is about 35k rules. The Chrome team has been constantly tweaking these limits themselves and Brave could set their own limits if they wish. The API is designed such that extensions can query to see how many rules they can use.

[1] https://source.chromium.org/chromium/chromium/src/+/main:ext...

[2] https://source.chromium.org/chromium/chromium/src/+/main:out...

[3] https://source.chromium.org/chromium/chromium/src/+/main:out...

danShumway · on Jan 28, 2024

Note that "dynamic rules" as specified here are not the same as what GP was talking about. The API does not support dynamic rules in the way that Firefox/V2 users use that term.

charcircuit · on Jan 28, 2024

GP did not mention dynamic rules.

danShumway · on Jan 28, 2024

> It also reduces ad blockers to static filter lists instead of powerful dynamic filters.

Just to make sure, did you actually not know what comment I was referring to, or are you just upset that I incorrectly referred to it as a GP comment?

In either case, it remains true that Chrome's current plans for V3 do limit the ability of extensions like uBO to dynamically filter requests.

charcircuit · on Jan 27, 2024

That is a baseless statement. It doesn't make them useless as they can still block ads.

deely3 · on Jan 27, 2024

Is this submarine comment?

thebruce87m · on Jan 27, 2024

What is the definition of a submarine comment? Google fails and ChatGPT says:

> A "submarine comment" on social media refers to a comment that is made on an old post or thread, long after the conversation has died down. This term derives from the idea of a submarine which remains submerged and out of sight for long periods before suddenly surfacing. In the context of social media, it's when someone delves deep into someone else's posts or timeline, finds an old post, and leaves a comment, bringing the old post back to attention. This can sometimes surprise the original poster and other participants, as the conversation was thought to have been concluded.

Which doesn’t make sense in this context

mcbuilder · on Jan 27, 2024

I think GP is trying to coin a term for stealth marketing Hacker News comments, except the analogy doesn't really make sense.

deely3 · on Jan 27, 2024

Definition: https://paulgraham.com/submarine.html

freediver · on Jan 27, 2024

Seen on HN first!

deely3 · on Jan 27, 2024

Definition: https://paulgraham.com/submarine.html