I also wondered the same and checked the model configs. They are using a bigger vocab size, and the intermediate size of the fully connected layer seems to be bigger too.
Or save a bookmark in your browser and edit its destination to be this JavaScript bookmarklet, to let you load the archive.is version of whatever URL you're currently on without even needing to remember the domain or type anything:
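A minimal sketch of such a bookmarklet (assuming archive.is's /newest/ endpoint, which redirects to the latest snapshot of the given URL):

```javascript
// Hypothetical sketch - paste this single line as the bookmark's destination:
//
//   javascript:location.href='https://archive.is/newest/'+location.href
//
// The helper below just shows the URL construction outside a browser.
function archiveUrl(current) {
  // archive.is/newest/<url> redirects to the most recent snapshot of <url>
  return 'https://archive.is/newest/' + current;
}
console.log(archiveUrl('https://example.com/some-page'));
```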
(The archive.is one takes you there in the same tab, while the Wayback Machine one opens a new tab - personally I use the former when I can't load a page, so I don't need that tab kept open, and I use the Wayback Machine for comparing the current version of a page to old ones. It should be fairly self-explanatory how to swap one URL for the other if you prefer it the other way around.)
Or try this more complicated version of the Wayback Machine one, which, if you click it while on an empty tab, will instead give you an alert with a text field in which to type or paste whatever URL you want to look up:
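A sketch of that variant (assumptions: it falls back to prompt() when the current tab isn't an http(s) page, and opens the result in a new tab as described above):

```javascript
// Hypothetical sketch - as a bookmark destination it would be one line:
//
//   javascript:(function(){var u=location.href;if(u.indexOf('http')!==0){u=prompt('URL to look up:');}if(u){window.open('https://web.archive.org/web/'+u);}})();
//
// The helper below shows the target-URL logic on its own, with the prompt
// injected as a function so it can run outside a browser.
function waybackUrl(current, promptFn) {
  // Use the current page's URL if we're on a real page; otherwise ask the user.
  var u = current.indexOf('http') === 0 ? current : promptFn('URL to look up:');
  return u ? 'https://web.archive.org/web/' + u : null;
}
```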
I'm building upon insights from this paper (https://arxiv.org/pdf/2403.03950.pdf) and believe that classification can sometimes outperform regression, even when dealing with continuous output values. This is particularly true in scenarios where the output is noisy and may assume various values (multi-modal). By treating the problem as classification over discrete bins, we can obtain an approximate distribution over those bins, rather than settling for the single, averaged value that regression would yield. This approach not only facilitates sampling but may also lead to more favorable loss landscapes. The linked paper provides more details on this idea.
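The binning idea above can be sketched as follows (a minimal illustration, not the paper's method; the ranges, bin count, and function names are all made up for the example):

```python
# Hypothetical sketch: treat a continuous target as a class over discrete bins,
# then recover a point estimate (or sample) from the predicted bin distribution.
import numpy as np

def to_bin(y, lo=0.0, hi=10.0, n_bins=20):
    """Map a continuous target in [lo, hi] to one of n_bins class indices."""
    edges = np.linspace(lo, hi, n_bins + 1)
    return int(np.clip(np.digitize(y, edges) - 1, 0, n_bins - 1))

def expected_value(probs, lo=0.0, hi=10.0):
    """Collapse predicted bin probabilities back to a scalar via bin centers."""
    n_bins = len(probs)
    centers = lo + (np.arange(n_bins) + 0.5) * (hi - lo) / n_bins
    return float(np.sum(np.asarray(probs) * centers))
```

Unlike plain regression, the predicted `probs` vector can represent a multi-modal distribution, and you can sample from it instead of always taking the mean.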
Isn't it a given that classification would "outperform" regression, assuming n_classes < n_possible_continuous_labels?
Turning a regression problem into a classification problem bins the data, which gives you more examples per label and simplifies the problem, at the cost of the granularity you can predict.
(It depends on what you mean by "outperform" since metrics for classification and regression aren't always comparable, but I think I'm following the meaning of your comment overall)
Not really, since they compute a mathematical function over blocks and don't need a single if statement. They map learned data + input -> output as a pure function.
The model says 8x7B, so it's roughly a 56B model. What are the GPU memory requirements to run it with a 512 context size? Are there any feasible quantized versions of it available? I want to know if my 16GB VRAM GPU can run this model.
Thanks
18.14GB in 2-bit, which is still too high for your GPU, and most likely borders on unusable in terms of quality. You could probably split it between CPU and GPU, if you don't mind the slowdown.