Fork of Facebook’s LLaMa model to run on CPU (github.com/markasoftware)
246 points by __anon-2023__ on March 8, 2023 | 170 comments



The thing I like the most about the current AI wave is the pressure it is putting on computing hardware. Yes, mobile phones with long battery lives are cool and all that, but most of the cool things I like are locked behind huge computational requirements.


Agree. I work in robotics and we never have enough compute. I want to see us get to the point where the most advanced robot ever has all the compute it needs onboard, and that means huge growth in compute density and efficiency is needed.


That's genuinely surprising.

What sort of on-board compute do you typically have today?


A common example from my robotics experience (mainly mobile robots) has been getting something powerful enough to run our image recognition / sensor interpretation. We often have several microcontrollers (think: Arduino equivalents running C++ or C) which run all the motor control etc., and a high-level system (it used to often be a Raspberry Pi, now more often an NVIDIA Jetson Nano) listening to all of those and using most of its computing power on some kind of sensor data, usually image recognition or processing ToF camera/lidar/radar data. We often have to optimise hard to get a couple of cycles or "frames" per second with these, which really puts limitations on how robots respond (a 250ms delay is veeeery noticeable, especially if it's in obstacle avoidance, which is relatively common).


Limiting ourselves to the onboard compute available on mobile robots is one thing, but even for fixed-installation robots, e.g. an arm in a factory where space and power aren't limited, we're very much still limited by compute capacity. Trying to use robots to do something as simple as folding clothes still cannot be done at a reasonable speed. Yeah, on a personal level, just buck up and spend the 20 minutes folding your clothes, or hire a maid to do it for you, but the complexity of automating clothes-folding with a robot is a stand-in for other tasks in industry that we still can't automate, and have to hire humans for, because the complexity is still too high for our current computing power.

Researchers at UC Berkeley published an algorithm they named SpeedFolding in October of last year. Watch https://youtu.be/UTMT2WAUlRw?t=511 and then realize that the linked excerpt is sped up 9x.

If we had 9x faster compute we could have laundry-folding robots, which is one thing, but that amount of compute would enable robots to do tons more tasks in industry.


Robotics is a double whammy: you have compute problems, but you also have actuation problems.

Getting robots to move quickly is easy; getting them to move quickly to exactly where you want them, or with exactly as much force... that is much, much more difficult. Double for mobile robots where you don't have a good energy source. If cost is an issue that is another dimension -- powerful and accurate actuators are extremely expensive.


I don't work in the field but just to kind of put it into perspective, a 12v 100A LiFePO4 battery has 1200 Watts capacity and weighs 30 pounds. A typical gaming PC (which to be fair, is more willing to trade power for performance) consumes about 600 Watts per hour. Problem for a Tesla? Not so much. Problem for a lightweight drone? Definitely.


ahhh the units in this post are making my eye twitch.


I would be mad too, if my gaming PC was demanding 300 Watts of power but it took half an hour to ramp up ;)


I know Watts per hour is not the right way to phrase it but I feel it helps for those that don't know. Also, I just don't like saying Amp. Hours :)


Watts per hour implies watts/hour. Watt-hour implies a number of watts multiplied by a length of time. Also known as energy. Watts are power. Watt hours are energy. Two different things. Watts/hour is nothing.


Watt/hour is speed of power. It doesn't make any sense.


No, you misunderstand what it means in that case. Watt-hours are comparable to joules. 1 watt-hour = 3600 joules.


watt * hour = energy (one watt-hour is 3600 joules)

watt-hour: you cannot subtract time from power, it doesn't make any sense.

watt/hour: rate of change of power over time, something very weird.
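To put actual numbers on the example upthread (a quick sketch in Python; the 12 V / 100 Ah battery and ~600 W PC draw are the figures from that comment):

    # Rough energy math for the battery example upthread (figures taken from that comment).
    battery_voltage_v = 12.0      # volts
    battery_capacity_ah = 100.0   # amp-hours

    energy_wh = battery_voltage_v * battery_capacity_ah   # watt-hours: energy, not power
    energy_j = energy_wh * 3600                            # 1 Wh = 3600 J

    pc_draw_w = 600.0             # watts: a steady power draw, not "watts per hour"
    runtime_h = energy_wh / pc_draw_w

    print(f"{energy_wh:.0f} Wh = {energy_j / 1e6:.2f} MJ; runs a {pc_draw_w:.0f} W load for {runtime_h:.1f} h")
    # -> 1200 Wh = 4.32 MJ; runs a 600 W load for 2.0 h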


The NVIDIA Jetson boards are popular, but even with a full desktop processor and a state-of-the-art GPU, you can easily drown them in data from a LIDAR sensor or a few cameras. Especially since robots may also need fast response times.

There is another reply to your comment that shares a lot of what I have experienced. You have so many pieces of code that need to run, and a good handful of them are working on something like LIDAR point clouds with a million 3D points in them, plus some cameras running several different image recognition and segmentation algorithms, and you want fast cycle times; it just all adds up. Every serious robot I have ever worked on is maxing out its system, even ones at Google X with a full desktop CPU, a high-end NVIDIA graphics card, and a couple of secondary ARM CPUs.


Thanks! :)

That definitely helps me understand why the footage of the robots in this video had to be sped up: https://youtu.be/Ybk8hxKeMYQ


Crazy to me that as soon as one GPU wave is dying (crypto), another one is picking up the slack.


Which is a good thing. So glad all that GPU compute is being used on cool stuff rather than running SHA-256 18 quintillion times


Definitely a good thing, but FYI it hasn't been profitable/feasible to mine Bitcoin (SHA-256) on GPUs for many, many years, as ASIC-based miners have completely taken over. I've talked about it plenty on HN, but any way you slice it, crypto is an unbelievable waste of resources in every possible way regardless.

What really (finally) more or less killed GPU mining was the Ethereum move to PoS (Proof of Stake).


They can’t both be cool?


Nothing cool in throwing away lots of resources for no reason. In fact there's substantial heating involved.


One is a contest to waste the most resources, one has potential to actually have useful results


Why do you think Bitcoin does not have useful results?


Bitcoin has a use, but there are other options for consensus algorithms that don't waste as much energy as the citizens of a medium-sized country and fill the same use case (and other expanded use cases). Why not just do that?


Strictly speaking, the comment you're replying to doesn't say which of the two is a contest to waste resources and which has potential to have useful results.


I assumed that this was doubling down on atleastoptimal’s view.


Can you cite a useful result? I can't, but I don't think that some people getting richer is useful.


https://www.elliptic.co/blog/live-updates-millions-in-crypto...

https://www.cnbc.com/2022/03/23/ukrainian-flees-to-poland-wi...

In general, the purpose of Bitcoin is not to get rich, but to have a currency that is universally accepted and not tied to a political party’s fiscal decisions.


In general, so is the Federal Reserve.


Sure: I can buy servers and domains anonymously and buy drugs online, which are illegal in my country.


Well, I agree with you there.


Only if you really like big numbers for the sake of them. Otherwise, one is just straight up snake oil[0], and the other… is kinda hard to tell yet, because while I'm really impressed, I don't know if it's {a toy, a tool, the first sign of a major transformation}.

[0] did you know the original snake oil contains more omega-3 and therefore improves cognitive function when compared to lard? I did not. But you can get omega-3 elsewhere, and the people who made the term synonymous with fraud didn't use those snakes, so…


> running SHA-256 18 quintillion times

or games. People could have been studying or doing something more important than wasting time and energy. I get that it is entertainment, but so are board games, and those don't require mining rare-earth minerals or putting pressure on the grid, as you can always play board games by candlelight.


Or TV shows or movies. People could have been studying or doing something more important than wasting time and energy.

Or going outside. People could have been studying or doing something more important than wasting time and energy.

Or not being locked in the education facility. People could have been studying or doing something more important than wasting time and energy.


I'm of two minds on this. I'm not a gamer so part of me thinks gaming is a complete waste of time and resources. Then again, the same could be said about almost any hobby/pastime.

That said, gaming is what gave us GPUs (which were developed for gaming over the course of decades), so that we can now utilize them for more interesting and "productive" applications.

So, for me, in the end I'm happy the PC gaming industry and user base has been pushing GPU capability.


Be careful. The gaming industry has successfully conditioned people into believing they need a $1500 GPU with the TDP of a microwave so they can play the next unfinished-at-release AAA title.


Monopoly by candlelight, just the future I had always envisioned.


Why are you wasting candlestick on playing games? It's such a waste! Don't you know bees died to make that candle?

The most ecologically friendly thing you can do is go to sleep. If you want to play games, do it while the sun is out!

(/s, just in case)


If more people played Monopoly, they would have realised the Western economy is at the stage where a few players have bought all the properties and utilities.


One day we'll find out that all of the VR, crypto, and maybe now AI bubbles were nothing but conspiracies being driven by big-GPU to keep their share price up.


Speaking for myself, I have already gotten more use out of 2 weeks of chatgpt than I have out of 16 years of Bitcoin


Which is just what I'd expect to read on an influential tech forum if grandparent was in fact right.


14. First Bitcoin was mined 14 years ago. And Bitcoins have not been mined with GPUs since 2013.


Your pedantry reinforces the point, rather than diminishing it.


VR has been a godsend for forcing hardware, OS and driver developers to actually pay attention to jitter and max latency. If crypto means we get nice fast pretty games and fancy AI then I’m for it. :)


The universe was a hoax invented by a GPU company


DLSS and DLAA were, at least; FSR proved that.


Charlie Stross (cstross on here) had a fun blog post[1] about this phenomenon just a week and a half ago.

> As for what you should look to invest in?

> I'm sure it's just a coincidence that training neural networks and mining cryptocurrencies are both applications that benefit from very large arrays of GPUs. [...]

> If I was a VC I'd be hiring complexity theory nerds to figure out what areas of research are promising once you have Yottaflops of numerical processing power available, then I'd be placing bets on the GPU manufacturers going there

[1]: https://www.antipope.org/charlie/blog-static/2023/02/place-y...


Gee, it's almost as if GPUs are useful.


Indeed, except for processing graphics.


For most of the 20th century, the bulk of the energy humanity was able to extract was used for industrialization. Now it seems that a vast bulk of the energy being extracted will go towards computation.


I doubt it, frankly. Computation consumes a lot of energy, true, but it is dwarfed by how much energy we use in transportation and food production. Energy use per capita in most of the Global North is about 75,000 kWh per year:

https://ourworldindata.org/grapher/per-capita-energy-use

That's like the average person running 27 NVIDIA A100s at max capacity at all times!


Yeah, but every time we discover a new interesting thing to do with computers, the requirements go up by several orders of magnitude. How many more orders of magnitude of energy can we spend on food production from here, given current projections that world population will peak around 2100?


The singularity and fusion power are probably interlinked because of this. Once one happens, the other will too, in either order.


Doubt it, unless you include solar as fusion, but then we've already got it.


GPU conspiracy or just the side-effect of the decline of Intel?


There's an economics theory of "supply creates its own demand". We wanted to do ML, GPUs were around for games, we repurposed them for ML, and ML architectures that benefit from GPUs won the "hardware lottery" (an influential paper from Sara Hooker, in case you are unaware).


Just imagine if Bitcoin, GPT and Half-Life had come out at the same time.


Bitcoin miners don't use GPUs.


They did initially.


[flagged]


It'll die with the environment it helped destroy, that's for sure.


No matter how many times Bitcoin halving events happen, it doesn't make it any more useful.

Interest rates have far more impact on crypto than Bitcoin events.

Currently, crypto market trades closely with NASDAQ. Hopefully regulation will put an end to this.


What we will get is specialized hardware, with not-so-open APIs anyway.

With a bunch of people trailing behind with "it kind of works" open alternatives.


It sounds like you are complaining about capitalism :-)

It's not so bad. Nvidia could come and say, "hey, I'm going to lock down your GPU so that you can only use it to render polygons in my whitelisted list of video-games, and then you pay us $$$$$$ to buy our 'datacenter' thingy for anything else." But if they do it, people will go and buy the competitor's product.

And yes, probably their 4090s are being bought by some rich kids with their parents' money, but I reckon most of them are sales to professionals, people who would justify their purchase decision with more than playing first-person shooters. I, for example, play videogames with my gf, and we have equivalent GPUs. Hers is AMD and costs less than mine, even if it does the same, but I went for Nvidia so that PhysX was available and I could use PyTorch and Numba+GPU and even C++ CUDA. The moment Nvidia locks that down, I'll have to switch to AMD.


> hey, I'm going to lock down your GPU so that you can only use it to render polygons in my whitelisted list of video-games

You just described gaming consoles.


As you will find on my comment history, I am perfectly fine with commercial products and APIs.

Good luck with AMD.


Johns Hopkins is working on organoids that will replace silicon GPUs for AI.


Here is an article from JHU on that topic - https://hub.jhu.edu/2023/02/28/organoid-intelligence-biocomp...


Reminds me of this Choose Your Own Adventure book from 1984. It was about how PCs had organic AI components and each was unique, and you happened to get your hands on a super intelligent one.

https://www.goodreads.com/en/book/show/755062


As is one of the YouTubers I follow.

Meatcubator: https://youtu.be/Z_ZGq8Tah0k

Growing human brain cells: https://youtu.be/V2YDApNRK3g


If they can get past these new ethics committees...


I'm sure there are rat and mouse brain cells free for the taking from almost any pharmaceutical testing lab.

If organics is the only factor, I don't know why those wouldn't perform as well as human or ape brain cells.


Unlike Stable Diffusion, I don't stumble upon people who actually use it. Are there examples of the output this can generate? What happens once you manage to run the model?


I've been playing around with LLMs recently and it's definitely interesting stuff. I've mostly focused on roleplay/MUD applications and it's not quuitteee there, but it's pretty good, and its idiosyncrasies are often hilarious.

(when fed the leaked Bing prompt, my AI decided it was Australian and started tossing in random shit like "but here in Australia, we'd call it limey green" when asked about chartreuse, I assume because the codename for Bing Chat is 'Sydney')


This is very new, give it a few days. Here's one from Shawn: https://twitter.com/theshawwn/status/1632595934839177216


I have been using similar LLMs to help draft fictional stories. The community fine-tuned models are geared towards SFW and/or NSFW story completion.

See https://github.com/KoboldAI/KoboldAI-Client to read more about current popular models.

https://koboldai.net/ is a way to run some of these models in the "cloud". There's no account required and the prompts are run on other people's hardware, with priority weighting based on how much compute you have used or donated. There's an anonymous api key and there's no expectation that the output can't be logged.

The models that run on hardware locally are very basic in the quality of output. Here's an example of a 6B output used to try to emulate chatgpt. https://mobile.twitter.com/Knaikk/status/1629711223863345154 The model was finetuned on story completion so it's not meaningfully comparable.

It's less popular because the hardware required for great output is still above top-of-the-line consumer specs. 24 GB of VRAM is closer to a bare minimum to get meaningful output, and fine-tuning is still out of reach. There's some development around using services like RunPod.


On /g/ there's always a very active AI chatbot general that focuses on these models.


What's /g/?


https://4chan.org/g/

(enter at your own risk? I think it's kinda safe but still 4chan)


Better to direct people to the catalog: https://boards.4channel.org/g/catalog

Browsing page by page is not a good idea.

Search for "aicg" or visit https://boards.4channel.org/g/catalog#s=aicg to see the AI Chatbot General thread (a new one is created every time the previous one hits the reply limit).


Message boards like this are so unreadable unless you have hours upon hours of spare time.


We just need some better GUIs

Stable Diffusion was in the same place as this in the same time frame after the model got released. It's only been a few days.


Pretty sure you wouldn’t see anyone using it commercially as IIRC it’s only public due to a leak.


It's not a leak, it's a shortcut.

You can download it from Facebook, but it's behind an "apply for access" form. Magnet links floating around are just a workaround for that form.

That said, commercial use is forbidden by the license specified in the form: https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z...


I wasn't looking for a commercial use, but it's an interesting point. Would it be possible to prove that someone is using it commercially?

1) Spin it up on a cluster in Belarus

2) ???

3) Profit?


4) Lukashenka grabs your profits and kicks you out ;-)


Why would he do that? A more likely scenario would be "gets his share as an apartment in London".


That's the oldest authoritarian trick in the book; pretty much any successful business in Russia met the same fate, for example. They even tried it with nginx.


It could be watermarked


how?



That emoji attack is hilarious, who would have thought that twitter-speak is watermark resistant.


cool, thanks


I've used LLMs a lot for filling out details in my D&D worlds. Both OpenAI products but also the open-source GPT-J from EleutherAI. Things like writing the text of some books for players to read, though I have to curate just like people do with Stable Diffusion. Also used it to write songs; it's surprisingly good at taking things like chord progressions written in notation and rolling with variations on them.


It's useless before the model gets instruction and preference tuning. It won't even follow a simple ask; it will just assume it is a list of questions and generate more, or continue with slightly related comments.

FB trained a LLaMA-I (instruction-tuned) variant just for sport, to show they can, but I don't think it got released.


You have to prompt it correctly, non-instruction-aligned models don't behave like agent simulators by default.


Surely it would work with a format like:

User: <question or task>

Assistant:


Useless!? C'mon.


it's not that useless, you just have to prompt it the right way (usually by offering an example of the kind of output you want)
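For example, a hypothetical few-shot prompt for a base (non-instruction-tuned) model might look like the sketch below; the model then tends to keep continuing the pattern rather than ignoring the request (the task and examples here are made up):

    # A hypothetical few-shot prompt for a base model (task and examples invented for illustration).
    prompt = """Translate English to French.

    English: The cat sat on the mat.
    French: Le chat s'est assis sur le tapis.

    English: Where is the train station?
    French: Où est la gare ?

    English: I would like a coffee, please.
    French:"""
    # Feed `prompt` to the model and let it complete the final line.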


So, you need to know how to tune it.


It's still useful, but you need to know how to use it.


Try giving it a simple request instead ;)


0.35 words/s on my 11th gen i5 with 7B model (framework laptop)

not so bad!


How long did you have to wait for it to load? On my machine it's been running for 15mins, I'm still waiting for a prompt...


You get the full answer after completion, so it’s normal if you don’t get an output immediately

I computed the speed as speed = number of words / total run time.
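i.e. something along these lines (a trivial sketch; generate_text is a made-up stand-in for whatever actually produces the output):

    # Trivial words-per-second measurement (generate_text is a hypothetical stand-in).
    import time

    def generate_text(prompt):
        time.sleep(2)                  # pretend the model is thinking
        return "several words of sample output from the model"

    start = time.time()
    output = generate_text("Once upon a time")
    elapsed = time.time() - start
    print(f"{len(output.split()) / elapsed:.2f} words/s")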


How much RAM do you both have?


32 GB of RAM and 64 GB of NVMe swap


Would it be possible to run the 65B one like this as well? Is the bottleneck just the RAM, or would I need an absurd number of CPUs as well?

It's not that hard to create a consumer-grade desktop with 256GB in 2023.


I don't know about this fork specifically, but in general yes absolutely.

Even without enough ram, you can stream model weights from disk and run at [size of model/disk read speed] seconds per token.

I'm doing that on a small GPU with this code, but it should be easy to get this working with the CPU as compute instead (and at least with my disk/CPU, I'm not even sure it would run any slower; I think disk read would probably still be the bottleneck).

A lack of an absurd number of CPUs just means it's slow, not impossible.

https://github.com/gmorenz/llama/tree/ssd
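As a back-of-the-envelope check on that [size of model / disk read speed] figure (assumed numbers, not measurements: a ~13 GB fp16 7B checkpoint and a ~3 GB/s NVMe drive):

    # Rough estimate of streaming-weights-from-disk inference speed (assumed figures).
    model_size_gb = 13.0     # ~7B params at fp16 (2 bytes/param), roughly
    disk_read_gbps = 3.0     # sequential read speed of a decent NVMe SSD, roughly

    seconds_per_token = model_size_gb / disk_read_gbps   # every token touches all the weights once
    print(f"~{seconds_per_token:.1f} s/token if disk read is the bottleneck")
    # -> ~4.3 s/token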


Yeah, I find this area fascinating. Like, it's very cool to run a 7B params model locally, but it must feel like a toy when compared to ChatGPT, for example.

However, the 65B-parameter model, according to the benchmarks, is such a beast that you might be able to do some things with it that are not possible on ChatGPT (despite all of ChatGPT's quality-of-life features). Amazing times.


You don't need 256 GB. A pair of the new 48GB DDR5 sticks along with a pair of 32GB sticks should work in a consumer DDR5 motherboard to fit the weights. It does burst when initially loading, so a fast disk with about the same swap size as RAM seems necessary. It took about 25 mins to generate a single 500-character response using a 5800X & 32 GB DDR4, but I was not able to get it to run on more than 1 thread with the 7B model.
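As a rough rule of thumb for sizing RAM (a sketch that only counts the weights themselves, assuming the listed precision and ignoring activation/KV-cache and runtime overhead):

    # Approximate memory needed just to hold the weights at various precisions.
    params_billion = {"7B": 7, "13B": 13, "33B": 33, "65B": 65}
    bytes_per_param = {"fp16": 2, "int8": 1, "int4": 0.5}

    for name, b in params_billion.items():
        row = ", ".join(f"{prec}: {b * bpp:.0f} GB" for prec, bpp in bytes_per_param.items())
        print(f"{name}: {row}")
    # e.g. 65B at fp16 is ~130 GB of weights alone, before any runtime overhead.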


All current Ryzen CPUs do not work with 48GB DDR5, right? That means if you want to go beyond 128GB you can get an old X399 board (there are some reports of people getting 256GB to work) or more recent Threadripper boards.


Current Ryzen CPUs do not work with either 24GB or 48GB DDR5.


Follow up: https://github.com/facebookresearch/llama/issues/79#issuecom... claims 65B was able to fit in 128 GB by unsharding & merging the weights into a single file instead of the multiple .pth files, with 172 GB max swap file usage, & appears to stream to the GPU.


Why? Is it a limitation of the model or just something with the configuration that you couldn't figure out for this test?


I tried mark's OMP_NUM_THREADS suggestion (https://news.ycombinator.com/item?id=35018559), did not see an obvious change to make it parallel, and the huggingface patch (https://github.com/huggingface/transformers/pull/21955) is supposed to allow streaming from RAM to the GPU once it gets in. So, for me it was not worth the effort to keep working on the CPU version, as even a best-case ~30X speedup will still take around a minute to run the 7B.
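For reference, the knobs involved are an environment variable plus a PyTorch call; a minimal sketch (the thread count is just an example value):

    # Minimal sketch of giving PyTorch more CPU threads (example values only).
    import os
    os.environ["OMP_NUM_THREADS"] = "16"   # set before heavy libraries spin up their thread pools

    import torch
    torch.set_num_threads(16)              # intra-op parallelism for CPU ops
    print(torch.get_num_threads())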


I wonder if we will start to see complex prune functions and tools start to pop up.

So before you start a task, you sort of describe the domain, and the model is separated into the third most useful and relevant to that topic/query and the two-thirds most distant from that realm. Then either just that one-third is used in a detached fashion, or it works as two layers of cache, one in RAM and one on disk.


Wondering how difficult this would be to get running on an M1 Max?


I got one token every 8 minutes or so.


Using which model? On a pretty mid-range 11th-gen i5 I'm getting 0.35 tokens/s, using the 7B model. Haven't tried the bigger models.


Is that good? Not good?


A token is approximately 4 characters. So, four characters per 8 minutes is pretty slow. This comment would take 1224 minutes to generate, if I was an AI.


Usually you want tokens per second, not seconds per token. So it's a bad sign.


Another commenter posted a fork that does it: https://news.ycombinator.com/item?id=35067469

Per the readme it looks like there are a few bugs to figure out, in case anyone here is a pytorch expert.


Would it not be possible to run on both GPU and CPU at the same time, in whatever proportion the hardware is available?

Most gaming desktops have a solid GPU but not enough VRAM. Pity having the GPU idle here.


> 1. Create a conda environment

Uh-oh, bad start.


Why is it a bad start?

It could be venv as well, I suppose, I haven't used conda.


Conda is gonna work much, much, much better for these kinds of applications, as that's what it's mostly used for, i.e. scientific/numerical computing with C/C++ dependencies.


Conda is an abomination that will download 4 gigs of unnecessary shite and carelessly dump it onto your system, thereby ruining your existing configuration in the process.

Use it in a container or a VM unless you enjoy re-installing your system from scratch.

Or better still, don't use it at all and let it wither away: these kinds of braindead projects need to be put down with extreme prejudice.


The download size is large but conda doesn't ruin any existing configuration unless you explicitly tell it to be your native python environment. Conda is set up as a self-contained set of independent environments. Why would your system care what's inside the Anaconda directory unless you explicitly add it to your PATH/bash?


I haven't touched that steaming pile of shite in a looong while, so - who knows - they might have managed to minimize the amount of havoc they wreak on their users' systems.

But ... I seem to recall ... Conda tries to install GPU drivers, does it not? ... Is that not the case anymore?

Because if it still does, your theory about "Why would your system care" and all that doesn't really hold water.


I use miniconda on Linux but it's never attempted to install graphics drivers on Windows.


Chill. Use Miniconda3 as a light alternative. Conda is unnecessary. I agree nobody should ever use Conda unless they are extreme noobs. We all have to start somewhere.


Since this is pytorch it should run on cpu anyway. What am I missing?


Reading the patch: https://github.com/facebookresearch/llama/compare/main...mar...

Looks like this is just tweaking some defaults and commenting out some code that enables cuda. It also switches to something called gloo, which I'm not familiar with. Seems like an alternate backend.
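I haven't diffed the fork line by line, but a CPU-only tweak of this kind typically boils down to something like the sketch below (not the exact patch; gloo is PyTorch's CPU-capable distributed backend, whereas NCCL requires CUDA, and the loader call is a placeholder):

    # Sketch of the kind of change needed to run a CUDA-only PyTorch script on CPU (not the actual patch).
    import os
    import torch
    import torch.distributed as dist

    # Single-process "distributed" setup so the model-parallel code paths still initialize.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", init_method="env://", rank=0, world_size=1)  # gloo runs on CPU

    # Default to plain float32 CPU tensors instead of torch.cuda.HalfTensor.
    torch.set_default_tensor_type(torch.FloatTensor)

    device = torch.device("cpu")
    # model = load_model(...).to(device)   # hypothetical loader; .cuda() calls removed or replaced with .to(device)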


you don't actually need to switch to gloo, I just have no idea what I'm doing.


Lol, all my best work has been when I don’t know what I’m doing and it’s refreshing to see someone moving the ball forward and feeling the same way. Kudos


Gloo is a collective communications library for distributed computation (think along the lines of MPI).


I guess the simple fact that it didn't before his patch?


Usually you just trivially have the model run on cpu or gpu by simply writing .cpu() at specific places, so he's wondering why this isn't the case here.


that's literally all I did (plus switching the tensor type). I'd imagine people are posting and upvoting this not because it's actually interesting code but rather just because it runs unexpectedly fast on consumer CPUs and it's not something they considered feasible before.


That is vastly underestimating how tricky it is to make novel pieces of software run. There is a huge fringe of people who know how to click things but not use the terminal and a large fringe of people who know how to run "./execute.bat" but not how to write syntactically correct Python.

But a lot of those people want to play with LLMs.


How are you getting this to run fast? I'm on a top of the line M1 MBP and getting 1 token every 8 minutes.


Try switching all the .cuda() calls to .to("mps"). I got a 100x speedup on a different language model on a MacBook M1 Air.

https://pytorch.org/docs/stable/notes/mps.html
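In recent PyTorch the Apple-GPU path is exposed as a device string ("mps") rather than a tensor method, so the change looks roughly like this (a sketch; the model/tensor names are placeholders):

    # Sketch of moving work to Apple's GPU via PyTorch's MPS backend (placeholder names).
    import torch

    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

    x = torch.randn(4, 4).to(device)   # instead of x.cuda()
    # model = model.to(device)         # a hypothetical model object is moved the same way
    y = x @ x                          # runs on the Apple GPU when device is "mps"
    print(y.device)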



Probably PyTorch is very optimized for x86. It's likely using lots of SIMD and whatnot. I'm sure it's possible to get similar performance on M1 Macs, but not with the current version of PyTorch.

Do you have enough RAM (not swapping to disk)?


Same experience for me, looks like it is only using one cpu core instead of all of them.


Or better yet, define a device = 'cpu', and use tensor.to(device).


If someone else wrote this comment, would you find it useful?


Can it beat Mark Zuckerberg at congressional-deposition wordsmithing, though?


Someone must have trained an LLM for that for sure.


Oh yes “”” Hackernews senator: “”Someone on the internet said meta aka Facebook is not considered a real data native, clean coder and high IQ company unless your new language model exceeds the elegance and slipperiness of mark Zuckerbergs (you) language output in senate hearings. he is smoother than a lake in the metaverse.“”

Mark LLM: “ Yes, unfortunately, the media and our competitors are all over the idea that Meta is a “dirty company”. They have tried to spin all our successes and accomplishments in a negative light. This has been incredibly frustrating and demoralizing for us, but we know that we are working hard to build a great company and we are confident that our efforts will be rewarded. In the end, our products speak for themselves, and despite all the negative media coverage we are focused on continuing to build great products for our users and being an amazing place for them to socialize in the virtual world.”


I have to say "he is smoother than a lake in the metaverse" is presumably accidental, based on the quality of the rest of that text, but it has to be one of the wittiest phrases I've seen LLMs come out with to date.


I opened the twitch AI seinfeld stream once and stumbled into a conversation that went something to the effect of:

George: I really like that orange sweater

Jerry: Yeah, I just found black so depressing

George: Orange is such a great color! Orange is the new black.

...


That was my prompt, I am hackernews senator. People do sometimes ask how many A100s it takes to run me.


Would running on a CPU be more or less power efficient than running on a GPU at the same words-per-second rate?


Less.


What’s the rough idea of how this is possible? I thought you need the parrelism of a gpu


Inference puts less pressure on parallelism than training does.


Could this fit into GitHub Codespaces's top VM?


The 65-billion-parameter model is 160 GB, so no, unless you request larger storage from GitHub. 7 billion and 13 billion should fit, though.


How long does it take to infer one token on an average CPU?


I tested on a decidedly above average CPU, and got several words per second on the 7B model. I'd guess maybe one word per second on a more average one?


Cool so we're back to the days of 2400 baud modems


More like 300 baud. At 300 baud (30 cps) you can still read it as it arrives.


Simulating a slow typist


From the readme: On a Ryzen 7900X, the 7B model is able to infer several words per second, quite a lot better than you'd expect!


I have a friend who owns a MacBook Pro M1 Max. What kind of performance can I get?



MPS = Metal Performance Shaders, for those out of the loop.



