For those wondering why the M2 Ultra is so fast, or the M1 & M2 series in general: it's because inference's main bottleneck is memory bandwidth, not compute power. And the M2 Ultra has a bandwidth of 800 GB/s, which is about 8 times that of an average modern desktop CPU (dual-channel DDR5-6400 offers about 102 GB/s).
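To make the bandwidth bottleneck concrete, here is a rough back-of-envelope sketch (my own illustrative numbers, not a benchmark): a memory-bound decoder has to stream roughly the full weight set from RAM for every generated token, so peak single-stream speed is about bandwidth divided by model size.

    # Back-of-envelope only: assumes decoding is purely memory-bandwidth bound,
    # i.e. each generated token reads the whole (quantized) weight set once.
    def tokens_per_second(bandwidth_gb_s: float, params_billion: float, bytes_per_param: float) -> float:
        model_bytes = params_billion * 1e9 * bytes_per_param
        return bandwidth_gb_s * 1e9 / model_bytes

    # Llama 2 7B at roughly 4.5 bits/param ~= 0.56 bytes/param (illustrative)
    for name, bw in [("M2 Ultra", 800), ("dual-channel DDR5-6400", 102)]:
        print(f"{name}: ~{tokens_per_second(bw, 7, 0.56):.0f} tok/s ceiling")

Real numbers come in below these ceilings, but the ratio between the two machines is roughly what you see in practice.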
This high bandwidth is really a result of Apple having designed a unified memory architecture for the M1 and M2 chips. Typically on a laptop or desktop, the CPU and GPU have distinct memory systems: high-bandwidth (but relatively low-capacity) graphics memory, and relatively low-bandwidth (but high-capacity) CPU memory. Apple decided to simplify that and instead implemented a single high-bandwidth memory system shared by the CPU and GPU. The only downside is that such high-bandwidth memory has to be tightly integrated into the M2 package, so the maximum capacity is limited. For example, whether you spend $5,600 (the cheapest Mac Studio with an M2 Ultra and 192 GB) or $10k+ (a maxed-out Mac Pro), you will only ever get 192 GB of RAM max. For that amount, a PC could get 1,024 GB of RAM (5x more!). But on the other hand, if your workload, like inference, doesn't need more than 192 GB, then that's great. Personally I think Apple made the right tradeoff here: 800 GB/s of memory bandwidth on a general-purpose CPU, on a single socket, has never been done before (to my knowledge).
I agree, GPUs can still generate more tokens per second per dollar, but what's new and great about the high-end M1 and M2 is Apple offering this much memory bandwidth on a general-purpose CPU, making it immediately available to all software running on the CPU.
Looking at prices, the 4090 is about 3/4 the price of a base-model M2 Ultra Mac Studio, which has 64 GB of RAM. With the rest of the PC to go with the graphics card, it's about 7/8ths the price of an Ultra. Then there's compatibility: do you want your software to be CUDA compatible or Metal compatible? If you're writing the software, maybe you want both!
The budget option is to go with a used 3090, which still has greater memory bandwidth than the M2 Ultra.
It fully depends on the workload. The 4090 itself draws around 450 W at load, and the M2 Ultra peaks around 300 W. If your workload is >1.5x faster on Nvidia hardware, then its per-prompt energy efficiency probably beats the M2 Ultra's.
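A quick sketch of that breakeven (the 10-second prompt time is made up; only the ratio matters): per-prompt energy is just power times time, so the two draw level at exactly a 1.5x speedup.

    # Illustrative per-prompt energy comparison; real power draw varies with workload.
    def joules_per_prompt(watts: float, seconds: float) -> float:
        return watts * seconds

    baseline_s = 10.0  # hypothetical M2 Ultra time per prompt
    for speedup in (1.0, 1.5, 2.0):
        nv = joules_per_prompt(450, baseline_s / speedup)
        m2 = joules_per_prompt(300, baseline_s)
        print(f"{speedup:.1f}x faster: 4090 ~{nv:.0f} J vs M2 Ultra ~{m2:.0f} J per prompt")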
The 800 GB/s bandwidth is amazing. I ordered a maxed-out Mac mini with 32 GB RAM and 200 GB/s bandwidth. For the LLMs I want to run right now, that is sufficient for my needs, although I did consider over-buying and getting an M2 Ultra. I also pay Google for Colab, and as long as I don't over-use it, I can almost always get an A100. My strategy is to split my work as appropriate between the Mac mini when I get it in a week or two, and Colab. I used to run on Lambda Labs, also excellent, but setup time was non-negligible.
I have an M2 Pro 16GB — anybody on Apple Silicon can download DiffusionBee.app and immediately be generating images (from text prompts) with its default model/engine... drag-and-drop.
Incredible what a desktop Mac mini can accomplish, even with the limitations of a single $1,000 computer that costs less than a single Nvidia 4090.
----
For comparison: the SSD in the M2 Mini is faster than a MacPro5,1's RAM!
I've noticed that M series Macs have extremely fast disk drives that the OS uses as swap quite efficiently. I've frequently used all my RAM on my Mac and barely noticed any slowdown when it starts swapping.
As was already noted in other comments, the M2 Ultra's bandwidth is not that special next to high-end GPUs (most recent ones exceed 700 GB/s), and that bandwidth has to be shared with the CPU. So technically, if you keep doing work on the CPU, there is less bandwidth available at any given time. A 4090 + 13900K has almost 1,100 GB/s combined; not that it matters for most use cases.
For regular CPU tasks the added bandwidth doesn't seem to make a difference as far as I can tell; at least, Apple Silicon isn't winning in any scenario where it doesn't have a specialized block on the chip for the task. So what's the point (besides overpaying for memory)?
And to "win" this, what makes the difference is the total VRAM available at once, not really the bandwidth; that's just because the task has been parallelized as much as possible. Even then, it required optimizing for the architecture with parallelization, and it is absolutely not cost competitive with the PC used as a reference. If you really need to maximize GPU VRAM in a single workstation (without going to server/cloud solutions) you could build a machine with multiple RTX A4000 SFF cards (1 slot, 20 GB each). It would get more expensive than the maxed-out M2 Ultra, but at that point the M2 Ultra loses so badly in FLOPS that you really have to look for situations where you would want more VRAM (up to 144 GB available to the M2 Ultra's GPU vs. 80 GB for four single-slot cards) but wouldn't want to run the model faster or longer on a dedicated server rack that could have even more VRAM available (and be shared with other people).
Realistically, Nvidia knows how to put more RAM in their GPUs; it just doesn't make sense to scale VRAM faster than compute power for most workloads. You need a balance that makes sense.
As an analogy, it's like coming up with a truck that can carry 150 t at once but can only do so at a third the speed of regular trucks. In most cases you'll actually want to run three regular trucks even though it's going to be less efficient (it will still cost less and be faster overall), unless you really don't have a choice; at that point you're in "special convoy" territory (like for wind turbine blades), and it's going to cause a lot of headaches on top of being slow and expensive.
Apple markets this as an incredible innovation when in fact it is not only irrelevant for most workloads usually thrown at workstations (mobile or not), but I would argue that running the workloads where it would actually make a difference is a bad idea on a single-user workstation. For most things that actually matter in a single-user workstation/prosumer/enthusiast system, Apple Silicon loses quite hard, especially when it comes to GPU performance: viewport performance, near real-time 3D rendering (before sending to a render farm for the final detailed render), games, etc.
And this is the Ultra version of the chip, which is out of reach for most people (it makes the 4090 look not that overpriced, which is quite funny). If you go down to the M2 Max, suddenly the bandwidth is 400 GB/s, and not only is that unimpressive, it is even worse than an Intel A770M laptop GPU (512 GB/s) while still having less raw power and costing way more. The further you go down the Apple Silicon roster, the worse it gets. Apple Silicon is not competitive at the high-end workstation level, and it is absurdly overpriced at almost every level.
The reason they have this architecture (which isn't very good for most traditional computer applications) isn't that they went out of their way to engineer something great. Nope. It's that they basically scaled up a mobile architecture that was like this from the get-go (power and space constraints, plus no need for that much RAM or for it to be upgradeable). And that is only because Apple is currently run by a Scrooge who figured he could get even more money out of their silicon division if they sold SKUs with binned parts and controlled the RAM supply and pricing.
If Apple had actually done useful engineering, they would have figured out a way to scale the GPU/VRAM combo independently and a way to package and sell it efficiently. It makes no sense to scale VRAM past a certain point: why would you want to load a 3D model/view/whatever if you cannot compute it fast enough? As for the CPU, existing memory interfaces were fast enough for most things, and the "benefit" is nonexistent in most cases.
They went about it in the worst way possible, with a cost-reduction-above-all approach while jacking the price up to 11. It's the laziest approach they could take, and they even dumped all the unnecessary cost directly onto the consumer (low yields on large-die chips and RAM soldered next to the chip, for lack of dedicated GPU SKUs). Even if the consumer is willing to absorb the cost, they still get bad scaling and uncompetitive performance...
I just don't get how Apple gets away with it, with people like you falling for marketing bullshit that is just a spin on what are actually weaknesses...
Interesting feedback, but the focus on cost is an old debate, and your points seem more about 3D than AI. A lot of developers also want a better user experience and reliability. My reply is that the M1 and M2 are just the beginning, as Apple is investing billions of dollars in R&D. Also, no PC laptop can beat the Apple Silicon architecture today at a lower cost than high-end PC laptops while offering more battery life. Pro servers are the next step; the M2 Ultra is just a preview. Two weeks ago I saw an engineer demoing Llama 2 on a recent AMD laptop, and he was complaining how slow it was compared to a Mac. Again, Apple is leading on laptops now; servers are next. The M3 will remove the memory limitation.
For applications that aren't latency sensitive, my company has found that inference on Apple hardware is far less expensive than the competition once you factor in electricity usage. I wish they would make a cloud offering.
I'm so happy MacMiniColo is still around. I had projects hosted there when they were just getting started. They've stayed true to their mission. I love that their website still feels like Wordpress 3.0.
They aren't accepting new Mac minis for colocation, which means if you want to go from, say, 8 GB to 16 GB of RAM, that's an extra $90 a month for the privilege.
I have a fiber connection, the space, and the experience to run a Mac mini colo, if people are really interested. I would want to do it as a co-operative that helps with the expense of the physical location.
We went with a M2 Studio with maxed out RAM because we simply cannot get reliable GPU availability with cloud providers and for $6000 (with tax) we can have the equivalent VRAM of ~2 80GB GPUs instead of paying $5/hr for the pleasure.
You need to pay for dedicated because they’re generally unavailable in the moment. So it’s more like 45 days, if we’re only talking about a single GPU—but we’re talking about ~2x.
Thanks! Yeah, I opted for dual 3090s for my workstation (keeping the full LLM in VRAM is critical) and was wondering what the lift was for the M2.
OP implied that there were workloads where it outcompetes renting in terms of cost. I was hoping that was true for something other than a single-user interactive session (which can be done a lot cheaper).
Apple really should license the M chip IP to someone to make a server chip out of it, or do it themselves. It's money on the table for them and would not cannibalize their Mac business at all. It's a very nice core.
Apple Silicon is great for low-power desktops and laptops, but it doesn't actually have groundbreaking performance relative to what we've got in the server space. If you dropped the M2 Ultra from the $4,000 Mac Studio into a server, it would perform about the same as a $1,500 AMD 7950X3D-based server (a common budget server setup with ECC) on CPU tasks. Stick a common GPU in there and you're running circles around the M2 on GPU tasks.
Apple Silicon is great at really low-power work, but if you dial desktop or server GPU power limits down, those parts also become quite efficient. The marginal cost of electricity is cheaper than buying more hardware, so Nvidia and others run their parts deep into the diminishing-returns part of the curve to maximize performance at the expense of power efficiency.
What Apple Silicon brings to the table is not simply performance but a large amount of unified memory that can be used by the GPU (which is needed for inference of large deep neural networks like LLMs).
A top-of-the-line Mac Studio will give you 192 GB of unified RAM for less than $7,000. Meanwhile, an H100 from NVIDIA with 80 GB of VRAM will cost you something like $30,000...
> IDK about ML part, but equivalent performing Ryzen mini pc cost me 3x less over m1 macbook
When running an ML workload, the Nvidia A100 has massive GPU compute resources and a large amount of GPU-local high-bandwidth memory, so it's ideal, but it's nowhere near low cost.
A consumer Ryzen chip is inexpensive, but lacks in both memory bandwidth and GPU resources.
The M2 Ultra has access to way more RAM than consumer GPUs, many times the memory bandwidth of a Ryzen (800GB/s vs Ryzen 7 1800X at 40 GB/s) with a large amount of local GPU resources.
Even stepping up to a Threadripper Pro would only get you a quarter of the memory bandwidth, and those aren't exactly cheap either.
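For reference, those bandwidth figures fall out of a simple formula: each 64-bit DDR channel moves 8 bytes per transfer, so peak bandwidth is channels x transfer rate x 8 (theoretical peaks; sustained numbers are lower).

    # Theoretical peak DRAM bandwidth in GB/s.
    def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: int) -> float:
        return channels * mega_transfers_per_s * 8 / 1000

    print(peak_bandwidth_gb_s(2, 3200))  # dual-channel DDR4-3200:     ~51 GB/s
    print(peak_bandwidth_gb_s(2, 6400))  # dual-channel DDR5-6400:     ~102 GB/s
    print(peak_bandwidth_gb_s(8, 3200))  # 8-channel Threadripper Pro: ~205 GB/s, about a quarter of 800 GB/s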
It's easy to outperform Apple Silicon on pure power, but what about efficiency & heat? (like FLOPS/watt or whatever). Does anything else come close yet?
That matters a lot in a laptop, but not so much in a 1U rack? Not that datacenters love heat, but the competition isn't extra hot, it's just hotter than Apple Silicon?
A 1U server is a small machine, but a few hundred or more of them in a constrained space is a different story. At larger scales, moving electricity in and heat out are usually the defining factors.
I guess it depends on what you're doing with them? If you're running them 24/7 to train or model something, the energy costs might add up. Even if you're not, having more efficient chips might mean data centers don't need such complex cooling equipment.
It’s sad that racking and maintaining your own physical hardware is becoming such a lost art… I appreciate the up-front simplicity of cloud offerings as much as anyone, but there’s something to be said for owning your own hardware and avoiding the continual rent payments you’re sending the cloud providers.
The wisdom is that cloud providers are better at infra than you, and that the economies of scale make it better to piggy back on what they’re doing, but… AWS is the most profitable part of Amazon for a reason. They’re overcharging you.
When you look at the cost of the hardware plus hosting, yes, it certainly looks and feels that way.
But if you've dealt with corporate IT, and had to deal with 3-6 month lead times on getting hardware, or the politics of getting your hands on hardware to get stuff done, then AWS is cheap. It gives you velocity.
If your company is large enough that it can offer the elasticity of resources that Amazon offers, or even a quarter of it, and you have an IT org that will let it happen, then yes, AWS is a waste.
But with AWS... when a project dies, you can wipe its costs out, people won't hold onto hardware so they have hardware for the next project, etc...
Trust me: I've been in IT, and I can spec and build rack systems. I'm a software dev, and I've been one for most of my career.
For 90%+ of orgs... they don't have the maturity and skills to handle that type of infra without substantially distracting from their primary business.
I find that at AWS I'm always wasting engineer time optimizing dumb things. Do you know how much a TB of RAM costs? Or 10 TB of blazing-fast NVMe? Less than $5K. How much does that cost at AWS?! And that's not even considering bandwidth, which AWS overcharges for so much. Yet I waste time.
Also, maintaining servers is not hard at a proper data center. It is often more hands off than the migrations cloud providers force on their customers.
Yeah. Also, to be honest, we still have process and approvals to get through to spin up AWS stuff.
It's not like process disappears just because you're not on your own hardware. Infra is still its own team with its own budget, poking and prodding at every damn turn over every little thing until they reject your request; then you escalate and have a four-week battle over needing the space.
If you're already paying the price of being on prem anyway, which is really the inability to provision and de-provision infra quickly, then there's little point to the cloud, unless you just have no infra to begin with (small companies).
I'm in a small firm now. I can't imagine having an approval process to spin up a few instances to run my tests and spin them down after. That'd be silly.
Really? At my old company each division had its own budget and account and then you’d be an iam member of an account and spin up services under that account, but there’s no central authority to send a request to. There were tools to analyze underused services across all accounts (like EC2 instances constantly under 2% cpu load meaning they were flagged for downsizing if possible).
Getting good results on AWS/GCP is neither easy nor simple: it’s a different set of headaches.
It’s still a win for a lot of use cases and I still do it quite often, but the meme that it’s this “click and you’ve just hired the best ops team in the world to work for you” and so the 50-500% markup is actually a bargain is horseshit. A Bizon box in your living room fucks AWS up on flops/$ on most instance types and pays for itself in 30 days.
It is one of the best ops teams on Earth: but they’re working for you like the Google search team is working for the user.
The problem with using a cloud provider is that you still need to know what you're doing.
Your application isn't going to magically become HA/DR. You still have to make it that way, from your application design/coding up through the deployment.
I mean, if you're not storing your session IDs in a data store that's reachable by all the nodes behind your load balancer then no amount of infrastructure is going to save you.
A great illustration of this is the GitHub outage a few years back. They had a fairly well distributed application layer, but the database topology at the time didn't consider the failure mode, even though the application layer did.
That’s a realistic scenario no matter whether you’re bare metal, building out your own cloud, or using someone else’s. No amount of AWS/GCP/Azure/et al marketing changes that.
Back in the day, Google was an innovator by using lots of cheap commodity servers instead of a few expensive ones and just accepting failures as a fact of life. I wonder 25% seriously whether there's an opportunity for a similar mad genius move to pay for business class fiber at a half dozen remote employee's homes across the country and just have a good replication/failover strategy. 24/7 on call isn't that big of a deal if you just have to go into your basement to swap a drive. Going to be on vacation? Don't be the primary site while you're out.
More a realist. If you have the scale to go on prem. Do it.
Most firms don't. Or don't have the skills.
Also, the cloud can help an IT project recover from errors. Let's say I'm about to buy $500k of hardware to set up some storage. I gather my requirements, architect it, do my design work, and then buy the hardware. I have to over-provision a bit because of reality and human error... But when the requirements shift two months into my project and I've already ordered the hardware... I may be hosed.
This isn't hypothetical, this is what happens. Things evolve and shift. The cloud allows for more agility. If your firm is large enough, or has its stuff together enough, go for it on-prem.
I've got 20+ years on prem... I've seen it fail all over. I've seen cloud be a mess too. But if you told me to clean up one, I'd take the cloud.
For me, I prefer hosted/cloud, preferably managed.
I’m quite capable of setting up whole server stacks. I did it for years, but I stopped, some time ago, and consider myself to be, for want of a better word, incompetent at being a modern admin.
I think I’d screw the pooch, so I prefer that someone who does it every day, handle it.
But I write Swift code, every day, so I’m not incompetent at everything.
The question I've often heard asked when deciding on build vs buy (which can apply to cloud vs. bare metal) is:
Are we in the business of building, maintaining and operating <thing to build> or do we want to buy that as a service instead and focus on our actual core business?
There's more to the cost of building and operating than just the hard costs.
Retaining good modern IT talent is getting harder and harder, and I'm not even talking about salaries. You need a whole department, including strong leaders who can hire, train, and lead the right people, etc.
This is something most companies wouldn't even know where to start with.
You can throw a bunch of boxes in a closet and it'll work. A surprisingly large amount of the early Internet was "a spare box under my desk."
The problems start when they become part of your critical path and you're on vacation and nobody knows WTF is happening.
I mean, it's a risk. If you're OK with that risk then go for it.
It's really about the politics of your office.
If everyone is OK with the idea that the box is in some closet somewhere that's fine. I've been part of a bunch of startups where we were running infrastructure on spare hardware. Sure it's not HA, but we didn't need it...or it was at least HA enough for what we needed.
Yeah, to be clear I wouldn't advocate this route for your core product. But running CI workers? Sure. Especially macs, which have onerous usage restrictions in the cloud that negate most of the elasticity benefits you might otherwise see.
Now your business depends on that one person, congratulations. Hope they don’t realize that their skills are better suited to working at AWS/Azure etc. for 2x the money. Which they are.
Sure, like any other high-level fictional situation you can surely come up with many valid fictional counter-points of your own, but cloud hosting is popular for a reason.
And I think in most cases companies want to focus their employees and efforts on their core business, and if that doesn't include setting up and maintaining hardware in the long-term, then you don't build, you buy.
I'm a former build engineer and used to do everything on prem and I gotta say I miss it (not being a build engineer, the on prem experience). Since those days pretty much every company I've worked at moved their CI/CD to the cloud and I gotta say it feels so much slower, even when working from home.
Twice I remember switching from an in-house Jenkins/TeamCity/whatever type of CI to Azure DevOps, and the things I remember most were how much longer builds took to complete and the massively longer time to download a build from Azure versus from within the office. Even when working from home, the on-prem stuff was faster.
The thing is, the build/devops teams seem to be about the same size in both cases. It's just kind of worse in pretty much every case when we do CI in the cloud.
Notes
- My experiences are largely for game development so the build times and artifact sizes can be quite large.
- I've only ever had CI/CD experience with Azure, I've not tried other cloud providers
- Since this is game development and we're using CI downtime is more acceptable than other cases. That said, I don't remember much downtime when I was working as a build engineer. I have seen periods of 1-2 hours of downtime once in a blue moon but then again I've seen that with Azure. In both cases it wasn't so much the setup but a build script deployment issue.
Also being able to cool off in the rack room when it's a hot day is always a treat :)
Sometimes the flexibility and time savings is worth the added upfront cost. Similar to how companies like to hire consultants or lease office space. Being able to walk away is better for short term, because companies value profits in the short term.
The snark is getting a bit tiring. No, we had plenty of hardware failures, but we had redundancy. Our availability wasn't dependent on how fast I could drive to the data center at 3am. Drive dies, who cares, there are plenty of hot spares, we'll deal with it during business hours. Server dies, who cares, there are lots of them. We have remote hands too, so simple hardware replacement is something you can get cheap onsite labor to do.
If your operations halt until some poor sysadmin has to drive to the colo, you are absolutely doing it wrong.
Where does this stop? Do you produce your own electricity? Farm your own food? Make your own silverware and shoes? Sometimes it's just easier to outsource the things you don't want to (or aren't good at) doing yourself.
If I wanted to host a website, sure, I can build a server out of parts and negotiate with my ISP and get a business pipe and handle all caching and such. Or like I can pay a provider $5/mo and get better performance and reliability with no management overhead. Yeah, maybe over 5 years I'd save more money doing it myself... but it's not worth the time.
If I wanted to generate a photo or a dozen, or a few paragraphs of text, that's like a few cents worth of cloud AI. Maybe low single-digit dollars. Or I could spend thousands on fat GPUs or a Macbook, spend forever training it, and still end up with a sub-par result.
AWS is profitable not just because they're overcharging you but because they are providing a hugely useful service for millions of businesses that don't want to deal with that infrastructure themselves, any more than they'd want to manage their own plumbing or electrical grid or roads and bridges leading to their office. DIY makes sense if you're doing it as a hobby or if your scale is so big that you would incur significant savings to in-house it, but for millions of small and medium businesses, it's just not the most practical approach. Nothing wrong with that.
I mean, it's like saying development is such a lost art... why hire a dev if you can learn to code yourself? Sure, but not everyone wants to, can, or has time.
I hate to say this but it has gotten to the point where I'm starting to farm some of my own food (I'm starting to get fed up with produce quality issues in my hometown).
Still haven't started on the silverware or shoes yet.
I do agree with you though. If you are a non-tech company or a company that lacks the human resources you might as well go with the cloud.
I have a base-model m1 Mac Mini and it's a beast. I'm using it as my build/deploy server and also as a back-end server (for running jobs) for the prototype I'm working on. I also do development on it when I want to use my big monitor rather than my laptop. And I listen to music and run Cookie Clicker at the same time while doing development.
Got three databases up and running too. It's a beast. I'd definitely consider self-hosting with a few Mac Minis; that would be fun, and they're really cute, sleek devices too. I paid $650 for it and consider it a great deal. I definitely should've gotten it with more than 8 GB of RAM, but I got it to try it out and haven't yet really needed to upgrade to a unit with more memory.
Interestingly enough, I was actually discussing this with a friend (who works in enterprise IT) the other day. Basically, rack servers are purpose-built for the task, with hot-swappable components, redundant power/storage, multiple NICs, ECC, remote management, and so on. They come with enterprise support and can be easily maintained in the field.
Meanwhile a Mini cluster is literally a bunch of mini pcs in a rack, and idk if Apple even supports this kind of industrial use. While it's a quality product the Mini isn't really designed for the datacenter.
> and idk if Apple even supports this kind of industrial use. While it's a quality product the Mini isn't really designed for the datacenter.
I think they know of it and tacitly approve of this use case, as evidenced by the Mac Mini having the same form factor for ages. They’re well aware that a lot of people use Minis (and Studios now) in data centers, and that the Mini footprint is sort of “standardized” at this point.
They actually had a Mac Mini Server as well for a bit. It made sense because it had a second hard drive instead of an optical drive and came with Mac OS X Server, back when that was a standalone $499 product: https://support.apple.com/kb/SP586
(Not sure what differentiates the later model Mac Mini Servers from the regular Mac Minis, since Mac OS X Server just became a $19 App Store purchase, and optical drives were no longer a thing in Mac Minis)
I have one of the 2009 Mac mini Servers running Ubuntu 22.04 LTS like a champ. It’s still a great machine. Upgrading the HDDs to SSDs was a bit of a chore, but doable.
They discontinued the Mac mini Server line in October 2014, which was still sold with two drives instead of one. Configurable to order with SSDs by that time.
There was also a "server" model Mini but it was very short lived and was basically a regular Mini with the "Server" software pre-installed, something that you could just throw in via the App Store with one click anyway.
It had a five year run and saw four different hardware models. It included two hard drives instead of either one hard drive and an optical drive, or just one hard drive (after they ditched ODDs).
Mac OS X Server was its own operating system originally. It was still the same core OS, but had a ton of additional servers built in. Non-exhaustively, they included IPSec VPN, email, calendaring, wiki, SMB and AFP file shares (including support to act as a Time Machine backup destination), LDAP, DNS, and software update caching before it came to macOS proper. The Server app released via the App Store was a shadow of Mac OS X Server.
These were quite popular in small professional offices like law firms.
I saw it save the ass of a client who had one: they got robbed and all their desktop computers, mostly iMacs, were stolen. The Mini was more or less lost among the wiring in the network closet and was overlooked, so everything had backups.
For applications that aren't latency sensitive, I run inference on a free 4 core Ampere server from Oracle. Once you ditch the "fast" prerequisite, a lot of hardware becomes viable.
For folks curious what this is, this seems to be a caching optimization for saving time on parallel streams of text. The benchmark is incidental. Most individual users likely have just one un-batchable conversation going at once with llama-cpp, and I think it’s unclear whether this PR improves that case much.
Also note that the demo video is sped up to fit inside GitHub attachment limits. Your observed speed may vary. :)
I've been curious about using LLMs for large-scale refactoring. Prompts like
anywhere you find `FooBarBaz(blip, kap)` replace it with `new newThing(blip).bump(kap)`
I don't know how reliable it is, but if you can easily run this on commodity hardware it could replace most IDE refactoring tools. Obviously IDE refactoring is more reliable today, but this could be made simple and flexible, and possibly just as reliable as IDEs.
But also it could enable some interesting things that you could never do with an IDE refactoring tool.
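As a sketch of what that could look like, assuming a llama.cpp-style server already running on localhost:8080 with a /completion endpoint (the endpoint shape, the prompt wording, and the src/*.java paths are all illustrative, not a tested tool):

    # Hypothetical prompt-driven refactor over a project folder.
    import json
    import pathlib
    import urllib.request

    PROMPT = (
        "Rewrite the following code, replacing every call of the form "
        "`FooBarBaz(blip, kap)` with `new newThing(blip).bump(kap)`. "
        "Return only the rewritten code.\n\n{code}"
    )

    def refactor(source: str) -> str:
        body = json.dumps({"prompt": PROMPT.format(code=source), "n_predict": 2048}).encode()
        req = urllib.request.Request("http://localhost:8080/completion", data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["content"]

    for path in pathlib.Path("src").rglob("*.java"):  # assumed project layout
        path.write_text(refactor(path.read_text()))

In practice you'd want to diff the output and run the tests before committing anything the model touched.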
I already do things like "write a script to replace `FooBarBaz(blip, kap)` with `new newThing(blip).bump(kap)` in a project folder"
I'm more comfortable with that because I find it usually takes two or three prompts to get it right
e.g. A couple of hours ago I prompted it to help me do a diff of two commits ignoring all whitespace, just to check whether there were any other changes. The first response didn't ignore newlines, the second was a multiline script, and the third gave me what I actually wanted:
diff -w <(git show 0bb2c8579efe775de883e0182db48989bfa324f2:"path/to/file"|tr -d '\n') <(git show 6c71efc17497ad7c90b9c7b690075ec031c13c69:"path/to/file"|tr -d '\n')
I think that an LLM could be an amazing interface or translation layer for this sort of thing, but I would argue that the underlying operations of refactoring or something similar should remain very much like a function with discrete inputs and outputs.
I believe that the application of multiple streams in parallel is a natural evolution of using a single stream. I've used some local models for help in creative writing, and some of the most productive results I got were from running the same prompt and sequence of interactions dozens and dozens of times. Although in that case, I was personally going through each result line by line, I can certainly imagine fully automated tools that leverage the range of responses to a given prompt.
Speed-wise, on a single stream, I have no need for it to generate text faster than I can read.
However, for scripts that try to use hundreds/thousands of invocations to solve some problem (eg. "write me a whole book"), the parallelism will be great (but obviously the script has to be written with that in mind).
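A minimal sketch of that kind of fan-out, again assuming a llama.cpp-style /completion endpoint on localhost:8080 (endpoint, field names, and worker count are assumptions to adapt to whatever you run locally):

    # Fire the same prompt many times in parallel so batched decoding can be exploited.
    import concurrent.futures
    import json
    import urllib.request

    def complete(prompt: str) -> str:
        body = json.dumps({"prompt": prompt, "n_predict": 256}).encode()
        req = urllib.request.Request("http://localhost:8080/completion", data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["content"]

    prompt = "Outline chapter 3 of the book."
    with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
        drafts = list(pool.map(complete, [prompt] * 32))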
Is it possible that in a few years time, only Mac silicon and PCs with high-end GPUs will be required to run "In-home LLMs" affordably?
If we get closer to either "AGI" (whatever the hell that is) or at least a reasonably useful AutoGen/BabyAGI-like system that becomes popular to use at home, those machines will be the only ones capable of running advanced LLMs without having to pay OpenAI, Microsoft, Amazon/AWS, etc. inordinate sums of money to do what consumers will someday deem a utility.
> Is it possible that in a few years time, only Mac silicon and PCs with high-end GPUs will be required to run "In-home LLMs" affordably?
No. They work well on the apple chips thanks to the integrated memory and the large size of the models. I know of no reason why an x86 chip could not be designed in a similar way if desired. IANAChipDesigner but I have worked for one of them.
FWIW, while Apple silicon can _run_ huge models thanks to the unified memory (not to be confused with shared memory), the inference is pretty slow compared to dedicated GPUs, so it's a tradeoff. The significance of this PR is that inference speed can—at least in certain applications—be sped up using parallel decoding.
Do these implementations use the Neural Engine? I saw there was a Stable Diffusion implementation using the Neural Engine, and I found that my MacBook noticeably did not run hot, as opposed to during an average Teams call.
It doesn't. You need to generate models for use on the neural engine, which apple did for Stable Diffusion, but this is just taking advantage of lots of fast RAM and lots and lots of threads, if I understand it correctly.
It uses Metal acceleration and takes advantage of the unified memory architecture, meaning it's basically a GPU with 192 GB of VRAM. Trading space (VRAM) for time (FLOPS), it can beat the performance of an RTX 4080 here.
Encoder only transformers (like BERT) can be made to run on neural engine with CoreML. Efficient inference with autoregressive encoder-decoder and decoder only transformers (aka LLMs) needs KV-caching, which currently can't be efficiently implemented with CoreML (and thus neural engine). So, for now it's GPU only, with Metal.
You can do autoregressive decoding with KV caching on the Neural Engine. You have to make a bit of a trade off and use fixed size inputs [1] but the speed up over no caching is meaningful.
There's a Whisper (Encoder-Decoder) [2] implementation if you want to see it in practice. Shameless plug, but I have a repo [3] where I'm working on autoregressive text generation on the Neural Engine. I'm running gpt2-xl (1.5B params) locally with KV caching at 120ms/token (vs. 450ms without caching). Will push an update soon.
Without quantization you can't go much higher than 1.5B params on M1's Neural Engine. M2 seems to have a higher ceiling but I haven't measured. I'm optimistic (but have not tried) that the new runtime quantization added to CoreML this year will allow for larger (and maybe faster) models on both.
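For anyone curious what the fixed-size-input constraint looks like in practice, here is a rough coremltools sketch with a toy stand-in decoder (the model itself is a placeholder; the point is just the fixed (1, 128) input shape and letting Core ML schedule work on the Neural Engine):

    # Sketch: export a decoder with a fixed sequence length for Core ML / ANE.
    import coremltools as ct
    import numpy as np
    import torch
    import torch.nn as nn

    class TinyDecoder(nn.Module):
        """Toy stand-in for a real decoder; always takes 128 token ids."""
        def __init__(self, vocab=32000, dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)
            self.head = nn.Linear(dim, vocab)

        def forward(self, input_ids):
            return self.head(self.embed(input_ids))

    traced = torch.jit.trace(TinyDecoder().eval(), torch.zeros(1, 128, dtype=torch.int64))
    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="input_ids", shape=(1, 128), dtype=np.int32)],
        convert_to="mlprogram",
        compute_units=ct.ComputeUnit.ALL,  # allow placement on CPU/GPU/Neural Engine
    )
    mlmodel.save("tiny_decoder_fixed128.mlpackage")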
Autoregressive transformer models are usually memory bound, whereas SD is compute bound, so perhaps the difference lies here. Also the reason why SD runs so much faster on the GPU than on the CPU.
M1 has (fast) unified memory between GPU and CPU, so something being memory bound ought not to have much bearing on whether it belongs on CPU or GPU… at least in theory. I’m a total noob here though so I may be wrong.
They're quite good at generating scaffolds and ideas (Mistral specifically).
You can use them for trivial nlp tasks ("between 0 and 1 how similar are these two sentences? Respond with an explanation.") and because it's a small model, you just run it 4 or 5 times and take an average pretty quickly.
With these improvements llama.cpp/ggml is really becoming a pretty competitive serving stack even for large scale cloud hosted AI. I wonder how ggerganov finds the time to do all this, does anyone know if he's being sponsored?
IDK, I can see a future. It’s a one-man (for now) business, so minimal costs to consider. If he can swing consulting using the .cpp projects as advertising, that sounds like a good business.
Additionally, I can imagine companies investing and paying for the open source work to expand access to their licensed models. Use the same interface as people use LLAMA but upgrade to BetterModel, fully compatible.
Additionally, I could believe this is simply a build up to a future Acquihire, which is the most lucrative way to be hired.
Anyone here wanting to try an M1 Mac mini with 16 GB in the cloud for free this month, just send me an email (click handle). Now that we've moved on to the M2, I've got a handful of M1s available FREE to try. You can also try an M2 Pro, Max, or Ultra, but for that you'll need to subscribe. https://www.macweb.com/macinthecloud
I'm waiting for someone to comment "use the page title/why did you change the title/etc.". It's frustrating when you find something important on a page and type that as the title, and then the post gets flagged because it violates HN rules.
For comparison, this is the actual title of the page, but do you think this would increase people's awareness about the fascinating fact I highlighted in the title?
Yeah, I really hate the "no changing titles" rule. I can understand something like "don't sensationalize", but many articles just have poor titles that lack context. What does that accomplish aside from discouraging readership and discussion?
The rule is officially "don't editorialize" which I would interpret (perhaps incorrectly) as allowing a little leeway in surfacing a buried lede so long as it's presented in neutral language.
Something like "Amazing Llama 2 7B performance on M2 Ultra" would obviously fail that test, but the current title of "M2 Ultra can run 128 streams of Llama 2 7B in parallel" seems to follow the spirit of the rule, at least as I read it.
It allows for plenty of leeway, and in my experience alternative titles are accepted and will stand unless they are significantly worse than the original. It happens even with major announcements with hundreds of votes. @dang isn’t some mindless robot who must always enforce one way of doing things. The instructions are, as the page title suggests, guidelines.
> If the title includes the name of the site, please take it out, because the site name will be displayed after the link.
> If the title contains a gratuitous number or number + adjective, we'd appreciate it if you'd crop it. E.g. translate "10 Ways To Do X" to "How To Do X," and "14 Amazing Ys" to "Ys." Exception: when the number is meaningful, e.g. "The 5 Platonic Solids."
> Otherwise please use the original title, unless it is misleading or linkbait; don't editorialize.
Emphasis on "Otherwise please use the original title". I think dang is wonderful and we're lucky to have him, but even in my short year or two here, I've seen enough instances of title-policing (not necessarily from him) that discourage me not just from changing titles but sometimes from posting things altogether if the title isn't good enough originally.
> It's frustrating when you find something important on a page and type that as the title, and then the post gets flagged because it violates HN rules.
If that happens to you a lot, consider that perhaps other HN users disagreed with your assessment of what was important on the page and felt misled when the content didn't primarily match the title.
Anecdotally, I see alternative titles as being well accepted when the true title is subpar. Especially relevant when the matter concerns GitHub issues (which this is).
I’ve heard that there are some issues of gaming performance on M1/2 Ultra specifically (due to it being just 2x M2 Max in the same package), however my M2 Max MacBook absolutely runs Dota 2, and runs it very, very well. Like 180-200fps average well.
Are these uncensored or decensored? If RLHF removes intelligence at any rate, I wouldn't expect that intelligence to come back with a tune that lets it say curse words and talk about religion.
> Most of these models (for example, Alpaca, Vicuna, WizardLM, MPT-7B-Chat, Wizard-Vicuna, GPT4-X-Vicuna) have some sort of embedded alignment
> The reason these models are aligned is that they are trained with data that was generated by ChatGPT, which itself is aligned by an alignment team at OpenAI.
Most of those are fine-tunes of the base model. The fine-tuning data is 'aligned'. The uncensored fine-tune training data is edited to remove the "I can't help you with that" responses.
Yes, as stated in my earlier comment: there's an alignment tax, and then almost certainly an un-alignment tax, on top of that, compared to the raw, unaligned/uncensored, models.
That is interesting, but the "un-training" shown by EricH is simply re-running some fine-tuning on the same public base model with the refusals removed, and it is expensive to do that, too.