For those wondering why the M2 Ultra is so fast, or the M1 & M2 series in general: it's because inference's main bottleneck is memory bandwidth, not compute power. And the M2 Ultra has a bandwidth of 800 GB/s, which is about 8 times that of an average modern desktop CPU (dual-channel DDR5-6400 offers about 102 GB/s).
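To make the bandwidth bottleneck concrete, here is a rough back-of-envelope sketch (my own illustrative numbers, not a benchmark): a memory-bound decoder has to stream roughly the full weight set from RAM for every generated token, so peak single-stream speed is about bandwidth divided by model size.

    # Back-of-envelope only: assumes decoding is purely memory-bandwidth bound,
    # i.e. each generated token reads the whole (quantized) weight set once.
    def tokens_per_second(bandwidth_gb_s: float, params_billion: float, bytes_per_param: float) -> float:
        model_bytes = params_billion * 1e9 * bytes_per_param
        return bandwidth_gb_s * 1e9 / model_bytes

    # Llama 2 7B at roughly 4.5 bits/param ~= 0.56 bytes/param (illustrative)
    for name, bw in [("M2 Ultra", 800), ("dual-channel DDR5-6400", 102)]:
        print(f"{name}: ~{tokens_per_second(bw, 7, 0.56):.0f} tok/s ceiling")

Real numbers come in below these ceilings, but the ratio between the two machines is roughly what you see in practice.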
This high bandwidth is really a result of Apple having designed a unified memory architecture for the M1 and M2 chips. Typically on a laptop or desktop, the CPU and GPU have distinct memory systems: high-bandwidth (but relatively low-capacity) graphics memory, and relatively low-bandwidth (but high-capacity) CPU memory. Apple decided to simplify that and instead implemented a single high-bandwidth memory system shared by the CPU and GPU. The only downside is that such high-bandwidth memory has to be tightly integrated into the M2 package, so the maximum capacity is limited. For example, whether you spend $5,600 (the cheapest Mac Studio with an M2 Ultra and 192 GB) or $10k+ (a maxed-out Mac Pro), you will only ever get 192 GB of RAM max. For that amount, a PC could get 1,024 GB of RAM (5x more!). But on the other hand, if your workload, like inference, doesn't need more than 192 GB, then that's great. Personally I think Apple made the right tradeoff here: 800 GB/s of memory bandwidth on a general-purpose CPU, on a single socket, has never been done before (to my knowledge).
I agree, GPUs can still generate more tokens per second per dollar, but what's new and great about the high-end M1 and M2 is Apple offering this much memory bandwidth on a general-purpose CPU, making it immediately available to all software running on the CPU.
Looking at prices, the 4090 is about 3/4 the price of a base-model M2 Ultra Mac Studio, which has 64 GB of RAM. With the rest of the PC to go with the graphics card, it's about 7/8ths the price of an Ultra. Then there's compatibility: do you want your software to be CUDA compatible or Metal compatible? If you're writing the software, maybe you want both!
The budget option is to go with a used 3090, which still has greater memory bandwidth than the M2 Ultra.
It fully depends on the workload. The 4090 itself draws around 450 W at load, and the M2 Ultra peaks around 300 W. If your workload is >1.5x faster on Nvidia hardware, then its per-prompt energy efficiency probably beats the M2 Ultra's.
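A quick sketch of that breakeven (the 10-second prompt time is made up; only the ratio matters): per-prompt energy is just power times time, so the two draw level at exactly a 1.5x speedup.

    # Illustrative per-prompt energy comparison; real power draw varies with workload.
    def joules_per_prompt(watts: float, seconds: float) -> float:
        return watts * seconds

    baseline_s = 10.0  # hypothetical M2 Ultra time per prompt
    for speedup in (1.0, 1.5, 2.0):
        nv = joules_per_prompt(450, baseline_s / speedup)
        m2 = joules_per_prompt(300, baseline_s)
        print(f"{speedup:.1f}x faster: 4090 ~{nv:.0f} J vs M2 Ultra ~{m2:.0f} J per prompt")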
The 800 GB/s bandwidth is amazing. I ordered a maxed-out Mac mini with 32 GB RAM and 200 GB/s bandwidth. For the LLMs I want to run right now, that is sufficient for my needs, although I did consider over-buying and getting an M2 Ultra. I also pay Google for Colab, and as long as I don't over-use it, I can almost always get an A100. My strategy is to split my work as appropriate between the Mac mini when I get it in a week or two, and Colab. I used to run on Lambda Labs, also excellent, but setup time was non-negligible.
I have an M2 Pro 16GB — anybody on Apple Silicon can download DiffusionBee.app and immediately be generating images (from text prompts) with its default model/engine... drag-and-drop.
Incredible what a desktop Mac mini can accomplish, even with the limitations of a single $1,000 computer that costs less than a single Nvidia 4090.
----
For comparison: the SSD in the M2 Mini is faster than a MacPro5,1's RAM!
I've noticed that M series Macs have extremely fast disk drives that the OS uses as swap quite efficiently. I've frequently used all my RAM on my Mac and barely noticed any slowdown when it starts swapping.
As was already noted in other comments, the M2 Ultra's bandwidth is not that special next to high-end GPUs (most recent ones exceed 700 GB/s), and that bandwidth has to be shared with the CPU. So technically, if you keep doing work on the CPU, there is less bandwidth available at any given time. A 4090 + 13900K has almost 1,100 GB/s combined; not that it matters for most use cases.
For regular CPU tasks the added bandwidth doesn't seem to make a difference as far as I can tell; at least, Apple Silicon isn't winning in any scenario where it doesn't have a specialized block on the chip for the task. So what's the point (besides overpaying for memory)?
And to "win" this, what makes the difference is the total VRAM available at once, not really the bandwidth; that's just because the task has been parallelized as much as possible. Even then, it required optimizing for the architecture with parallelization, and it is absolutely not cost competitive with the PC used as a reference. If you really need to maximize GPU VRAM in a single workstation (without going to server/cloud solutions) you could build a machine with multiple RTX A4000 SFF cards (1 slot, 20 GB each). It would get more expensive than the maxed-out M2 Ultra, but at that point the M2 Ultra loses so badly in FLOPS that you really have to look for situations where you would want more VRAM (up to 144 GB available to the M2 Ultra's GPU vs. 80 GB for four single-slot cards) but wouldn't want to run the model faster or longer on a dedicated server rack that could have even more VRAM available (and be shared with other people).
Realistically, Nvidia knows how to put more RAM in their GPUs; it just doesn't make sense to scale VRAM faster than compute power for most workloads. You need a balance that makes sense.
As an analogy, it's like coming up with a truck that can carry 150 t at once but can only do so at a third the speed of regular trucks. In most cases you'll actually want to run three regular trucks even though it's going to be less efficient (it will still cost less and be faster overall), unless you really don't have a choice; at that point you're in "special convoy" territory (like for wind turbine blades), and it's going to cause a lot of headaches on top of being slow and expensive.
Apple markets this as an incredible innovation when in fact it is not only irrelevant for most workloads usually thrown at workstations (mobile or not), but I would argue that running the workloads where it would actually make a difference is a bad idea on a single-user workstation. For most things that actually matter in a single-user workstation/prosumer/enthusiast system, Apple Silicon loses quite hard, especially when it comes to GPU performance: viewport performance, near real-time 3D rendering (before sending to a render farm for the final detailed render), games, etc.
And this is the Ultra version of the chip, which is out of reach for most people (it makes the 4090 look not that overpriced, which is quite funny). If you go down to the M2 Max, suddenly the bandwidth is 400 GB/s, and not only is that unimpressive, it is even worse than an Intel A770M laptop GPU (512 GB/s) while still having less raw power and costing way more. The further you go down the Apple Silicon roster, the worse it gets. Apple Silicon is not competitive at the high-end workstation level, and it is absurdly overpriced at almost every level.
The reason they have this architecture (which isn't very good for most traditional computer applications) isn't that they went out of their way to engineer something great. Nope. It's that they basically scaled up a mobile architecture that was like this from the get-go (power and space constraints, plus no need for that much RAM or for it to be upgradeable). And that is only because Apple is currently run by a Scrooge who figured he could get even more money out of their silicon division if they sold SKUs with binned parts and controlled the RAM supply and pricing.
If Apple had actually done useful engineering, they would have figured out a way to scale the GPU/VRAM combo independently and a way to package and sell it efficiently. It makes no sense to scale VRAM past a certain point: why would you want to load a 3D model/view/whatever if you cannot compute it fast enough? As for the CPU, existing memory interfaces were fast enough for most things, and the "benefit" is nonexistent in most cases.
They went about it in the worst way possible, with a cost-reduction-above-all approach while jacking the price up to 11. It's the laziest approach they could take, and they even dumped all the unnecessary cost directly onto the consumer (low yields on large-die chips and RAM soldered next to the chip, for lack of dedicated GPU SKUs). Even if the consumer is willing to absorb the cost, they still get bad scaling and uncompetitive performance...
I just don't get how Apple gets away with it, with people like you falling for marketing bullshit that is just a spin on what are actually weaknesses...
Interesting feedback, but the focus on cost is an old debate, and your points seem more about 3D than AI. A lot of developers also want a better user experience and reliability. My reply is that the M1 and M2 are just the beginning, as Apple is investing billions of dollars in R&D. Also, no PC laptop can beat the Apple Silicon architecture today at a lower cost than high-end PC laptops while offering more battery life. Pro servers are the next step; the M2 Ultra is just a preview. Two weeks ago I saw an engineer demoing Llama 2 on a recent AMD laptop, and he was complaining how slow it was compared to a Mac. Again, Apple is leading on laptops now; servers are next. The M3 will remove the memory limitation.
For applications that aren't latency sensitive, my company has found that inference on Apple hardware is far less expensive than the competition once you factor in electricity usage. I wish they would make a cloud offering.
I'm so happy MacMiniColo is still around. I had projects hosted there when they were just getting started. They've stayed true to their mission. I love that their website still feels like Wordpress 3.0.
They aren't accepting new Mac minis for colocation, which means if you want to go from, say, 8 GB to 16 GB of RAM, that's an extra $90 a month for the privilege.
I have a fiber connection, the space, and the experience to run a Mac mini colo, if people are really interested. I would want to do it as a co-operative that helps with the expense of the physical location.
We went with a M2 Studio with maxed out RAM because we simply cannot get reliable GPU availability with cloud providers and for $6000 (with tax) we can have the equivalent VRAM of ~2 80GB GPUs instead of paying $5/hr for the pleasure.
You need to pay for dedicated because they’re generally unavailable in the moment. So it’s more like 45 days, if we’re only talking about a single GPU—but we’re talking about ~2x.
Thanks! Yeah, I opted for dual 3090s for my workstation (keeping the full LLM in VRAM is critical) and was wondering what the lift was for the M2.
OP implied that there were workloads where it outcompetes renting in terms of cost. I was hoping that was true for something other than a single-user interactive session (which can be done a lot cheaper).
Apple really should license the M chip IP to someone to make a server chip out of it, or do it themselves. It's money on the table for them and would not cannibalize their Mac business at all. It's a very nice core.
Apple Silicon is great for low-power desktops and laptops, but it doesn't actually have groundbreaking performance relative to what we've got in the server space. If you dropped the M2 Ultra from the $4,000 Mac Studio into a server, it would perform about the same as a $1,500 AMD 7950X3D-based server (a common budget server setup with ECC) on CPU tasks. Stick a common GPU in there and you're running circles around the M2 on GPU tasks.
Apple Silicon is great at really low-power work, but if you dial desktop or server GPU power limits down, those parts also become quite efficient. The marginal cost of electricity is cheaper than buying more hardware, so Nvidia and others run their parts deep into the diminishing-returns part of the curve to maximize performance at the expense of power efficiency.
What Apple Silicon brings to the table is not simply performance but a large amount of unified memory that can be used by the GPU (which is needed for inference of large deep neural networks like LLMs).
A top-of-the-line Mac Studio will give you 192 GB of unified RAM for less than $7,000. Meanwhile, an H100 from NVIDIA with 80 GB of VRAM will cost you something like $30,000...
> IDK about ML part, but equivalent performing Ryzen mini pc cost me 3x less over m1 macbook
When running an ML workload, the Nvidia A100 has massive GPU compute resources and a large amount of GPU-local high-bandwidth memory, so it's ideal, but it's nowhere near low cost.
A consumer Ryzen chip is inexpensive, but lacks in both memory bandwidth and GPU resources.
The M2 Ultra has access to way more RAM than consumer GPUs, many times the memory bandwidth of a Ryzen (800GB/s vs Ryzen 7 1800X at 40 GB/s) with a large amount of local GPU resources.
Even stepping up to a Threadripper Pro would only get you a quarter of the memory bandwidth, and those aren't exactly cheap either.
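For reference, those bandwidth figures fall out of a simple formula: each 64-bit DDR channel moves 8 bytes per transfer, so peak bandwidth is channels x transfer rate x 8 (theoretical peaks; sustained numbers are lower).

    # Theoretical peak DRAM bandwidth in GB/s.
    def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: int) -> float:
        return channels * mega_transfers_per_s * 8 / 1000

    print(peak_bandwidth_gb_s(2, 3200))  # dual-channel DDR4-3200:     ~51 GB/s
    print(peak_bandwidth_gb_s(2, 6400))  # dual-channel DDR5-6400:     ~102 GB/s
    print(peak_bandwidth_gb_s(8, 3200))  # 8-channel Threadripper Pro: ~205 GB/s, about a quarter of 800 GB/s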
It's easy to outperform Apple Silicon on pure power, but what about efficiency & heat? (like FLOPS/watt or whatever). Does anything else come close yet?
That matters a lot in a laptop, but not so much in a 1U rack? Not that datacenters love heat, but the competition isn't extra hot, it's just hotter than Apple Silicon?
A 1U server is a small machine, but a few hundred or more of them in a constrained space is a different story. At larger scales, moving electricity in and heat out are usually the defining factors.
I guess it depends on what you're doing with them? If you're running them 24/7 to train or model something, the energy costs might add up. Even if you're not, having more efficient chips might mean data centers don't need such complex cooling equipment.
It’s sad that racking and maintaining your own physical hardware is becoming such a lost art… I appreciate the up-front simplicity of cloud offerings as much as anyone, but there’s something to be said for owning your own hardware and avoiding the continual rent payments you’re sending the cloud providers.
The wisdom is that cloud providers are better at infra than you, and that the economies of scale make it better to piggy back on what they’re doing, but… AWS is the most profitable part of Amazon for a reason. They’re overcharging you.
When you look at the cost of the hardware plus hosting, yes, it certainly looks and feels that way.
But if you've dealt with corporate IT, and had to deal with 3-6 month lead times on getting hardware, or the politics of getting your hands on hardware to get stuff done, then AWS is cheap. It gives you velocity.
If your company is large enough that it can offer the elasticity of resources that Amazon offers, or even a quarter of it, and you have an IT org that will let it happen, then yes, AWS is a waste.
But with AWS... when a project dies, you can wipe its costs out, people won't hold onto hardware so they have hardware for the next project, etc...
Trust me: I've been in IT, and I can spec and build rack systems. I'm a software dev, and I've been one for most of my career.
For 90%+ of orgs... they don't have the maturity and skills to handle that type of infra without substantially distracting from their primary business.
I find that at AWS I'm always wasting engineer time optimizing dumb things. Do you know how much a TB of RAM costs? Or 10 TB of blazing-fast NVMe? Less than $5K. How much does that cost at AWS?! And that's not even considering bandwidth, which AWS overcharges for so much. Yet I waste time.
Also, maintaining servers is not hard at a proper data center. It is often more hands off than the migrations cloud providers force on their customers.
Yeah. Also, to be honest, we still have process and approvals to get through to spin up AWS stuff.
It's not like process disappears just because you're not on your own hardware. Infra is still its own team with its own budget, poking and prodding at every damn turn over every little thing until they reject your request; then you escalate and have a four-week battle over needing the space.
If you're already paying the price of being on prem anyway, which is really the inability to provision and de-provision infra quickly, then there's little point to the cloud, unless you just have no infra to begin with (small companies).
I'm in a small firm now. I can't imagine having an approval process to spin up a few instances to run my tests and spin them down after. That'd be silly.
Really? At my old company each division had its own budget and account and then you’d be an iam member of an account and spin up services under that account, but there’s no central authority to send a request to. There were tools to analyze underused services across all accounts (like EC2 instances constantly under 2% cpu load meaning they were flagged for downsizing if possible).
Getting good results on AWS/GCP is neither easy nor simple: it’s a different set of headaches.
It’s still a win for a lot of use cases and I still do it quite often, but the meme that it’s this “click and you’ve just hired the best ops team in the world to work for you” and so the 50-500% markup is actually a bargain is horseshit. A Bizon box in your living room fucks AWS up on flops/$ on most instance types and pays for itself in 30 days.
It is one of the best ops teams on Earth: but they’re working for you like the Google search team is working for the user.
The problem with using a cloud provider is that you still need to know what you're doing.
Your application isn't going to magically become HA/DR. You still have to make it that way, from your application design/coding up through the deployment.
I mean, if you're not storing your session IDs in a data store that's reachable by all the nodes behind your load balancer then no amount of infrastructure is going to save you.
A great illustration of this is the GitHub outage a few years back. They had a fairly well distributed application layer, but the database topology at the time didn't consider the failure mode, even though the application layer did.
That’s a realistic scenario no matter whether you’re bare metal, building out your own cloud, or using someone else’s. No amount of AWS/GCP/Azure/et al marketing changes that.
Back in the day, Google was an innovator by using lots of cheap commodity servers instead of a few expensive ones and just accepting failures as a fact of life. I wonder 25% seriously whether there's an opportunity for a similar mad genius move to pay for business class fiber at a half dozen remote employee's homes across the country and just have a good replication/failover strategy. 24/7 on call isn't that big of a deal if you just have to go into your basement to swap a drive. Going to be on vacation? Don't be the primary site while you're out.
More a realist. If you have the scale to go on prem. Do it.
Most firms don't. Or don't have the skills.
Also, the cloud can help an IT project recover from errors. Let's say I'm about to buy $500k of hardware to set up some storage. I gather my requirements, architect it, do my design work, and then buy the hardware. I have to over-provision a bit because of reality and human error... But when the requirements shift two months into my project and I've already ordered the hardware... I may be hosed.
This isn't hypothetical, this is what happens. Things evolve and shift. The cloud allows for more agility. If your firm is large enough, or has its stuff together enough, go for it on-prem.
I've got 20+ years on prem... I've seen it fail all over. I've seen cloud be a mess too. But if you told me to clean up one, I'd take the cloud.
For me, I prefer hosted/cloud, preferably managed.
I’m quite capable of setting up whole server stacks. I did it for years, but I stopped, some time ago, and consider myself to be, for want of a better word, incompetent at being a modern admin.
I think I’d screw the pooch, so I prefer that someone who does it every day, handle it.
But I write Swift code, every day, so I’m not incompetent at everything.
The question I've often heard asked when deciding on build vs buy (which can apply to cloud vs. bare metal) is:
Are we in the business of building, maintaining and operating <thing to build> or do we want to buy that as a service instead and focus on our actual core business?
There's more to the cost of building and operating than just the hard costs.
Retaining good modern IT talent is getting harder and harder, and I'm not even talking about salaries. You need a whole department, including strong leaders who can hire, train, and lead the right people, etc.
This is something most companies wouldn't even know where to start with.
You can throw a bunch of boxes in a closet and it'll work. A surprisingly large amount of the early Internet was "a spare box under my desk."
The problems start when they become part of your critical path and you're on vacation and nobody knows WTF is happening.
I mean, it's a risk. If you're OK with that risk then go for it.
It's really about the politics of your office.
If everyone is OK with the idea that the box is in some closet somewhere that's fine. I've been part of a bunch of startups where we were running infrastructure on spare hardware. Sure it's not HA, but we didn't need it...or it was at least HA enough for what we needed.
Yeah, to be clear I wouldn't advocate this route for your core product. But running CI workers? Sure. Especially macs, which have onerous usage restrictions in the cloud that negate most of the elasticity benefits you might otherwise see.
Now your business depends on that one person, congratulations. Hope they don’t realize that their skills are better suited to working at AWS/Azure etc. for 2x the money. Which they are.
Sure, like any other high-level fictional situation you can surely come up with many valid fictional counter-points of your own, but cloud hosting is popular for a reason.
And I think in most cases companies want to focus their employees and efforts on their core business, and if that doesn't include setting up and maintaining hardware in the long-term, then you don't build, you buy.
I'm a former build engineer and used to do everything on prem and I gotta say I miss it (not being a build engineer, the on prem experience). Since those days pretty much every company I've worked at moved their CI/CD to the cloud and I gotta say it feels so much slower, even when working from home.
Twice I remember switching from an in-house Jenkins/TeamCity/whatever type of CI to Azure DevOps, and the things I remember most were how much longer builds took to complete and the massively longer time to download a build from Azure versus from within the office. Even when working from home, the on-prem stuff was faster.
The thing is, the build/devops teams seem to be about the same size in both cases. It's just kind of worse in pretty much every case when we do CI in the cloud.
Notes
- My experiences are largely for game development so the build times and artifact sizes can be quite large.
- I've only ever had CI/CD experience with Azure, I've not tried other cloud providers
- Since this is game development and we're using CI downtime is more acceptable than other cases. That said, I don't remember much downtime when I was working as a build engineer. I have seen periods of 1-2 hours of downtime once in a blue moon but then again I've seen that with Azure. In both cases it wasn't so much the setup but a build script deployment issue.
Also being able to cool off in the rack room when it's a hot day is always a treat :)
Sometimes the flexibility and time savings is worth the added upfront cost. Similar to how companies like to hire consultants or lease office space. Being able to walk away is better for short term, because companies value profits in the short term.
The snark is getting a bit tiring. No, we had plenty of hardware failures, but we had redundancy. Our availability wasn't dependent on how fast I could drive to the data center at 3am. Drive dies, who cares, there are plenty of hot spares, we'll deal with it during business hours. Server dies, who cares, there are lots of them. We have remote hands too, so simple hardware replacement is something you can get cheap onsite labor to do.
If your operations halt until some poor sysadmin has to drive to the colo, you are absolutely doing it wrong.
Where does this stop? Do you produce your own electricity? Farm your own food? Make your own silverware and shoes? Sometimes it's just easier to outsource the things you don't want to (or aren't good at) doing yourself.
If I wanted to host a website, sure, I can build a server out of parts and negotiate with my ISP and get a business pipe and handle all caching and such. Or like I can pay a provider $5/mo and get better performance and reliability with no management overhead. Yeah, maybe over 5 years I'd save more money doing it myself... but it's not worth the time.
If I wanted to generate a photo or a dozen, or a few paragraphs of text, that's like a few cents worth of cloud AI. Maybe low single-digit dollars. Or I could spend thousands on fat GPUs or a Macbook, spend forever training it, and still end up with a sub-par result.
AWS is profitable not just because they're overcharging you but because they are providing a hugely useful service for millions of businesses that don't want to deal with that infrastructure themselves, any more than they'd want to manage their own plumbing or electrical grid or roads and bridges leading to their office. DIY makes sense if you're doing it as a hobby or if your scale is so big that you would incur significant savings to in-house it, but for millions of small and medium businesses, it's just not the most practical approach. Nothing wrong with that.
I mean, it's like saying development is such a lost art... why hire a dev if you can learn to code yourself? Sure, but not everyone wants to, can, or has time.
I hate to say this but it has gotten to the point where I'm starting to farm some of my own food (I'm starting to get fed up with produce quality issues in my hometown).
Still haven't started on the silverware or shoes yet.
I do agree with you though. If you are a non-tech company or a company that lacks the human resources you might as well go with the cloud.
I have a base-model m1 Mac Mini and it's a beast. I'm using it as my build/deploy server and also as a back-end server (for running jobs) for the prototype I'm working on. I also do development on it when I want to use my big monitor rather than my laptop. And I listen to music and run Cookie Clicker at the same time while doing development.
Got three databases up and running too. It's a beast. I'd definitely consider self-hosting with a few Mac Minis; that would be fun, and they're really cute, sleek devices too. I paid $650 for it and consider it a great deal. I definitely should've gotten it with more than 8 GB of RAM, but I got it to try it out and haven't yet really needed to upgrade to a unit with more memory.
Interestingly enough, I was actually discussing this with a friend (who works in enterprise IT) the other day. Basically, rack servers are purpose-built for the task, with hot-swappable components, redundant power/storage, multiple NICs, ECC, remote management, and so on. They come with enterprise support and can be easily maintained in the field.
Meanwhile a Mini cluster is literally a bunch of mini pcs in a rack, and idk if Apple even supports this kind of industrial use. While it's a quality product the Mini isn't really designed for the datacenter.
> and idk if Apple even supports this kind of industrial use. While it's a quality product the Mini isn't really designed for the datacenter.
I think they know of it and tacitly approve of this use case, as evidenced by the Mac Mini having the same form factor for ages. They’re well aware that a lot of people use Minis (and Studios now) in data centers, and that the Mini footprint is sort of “standardized” at this point.
They actually had a Mac Mini Server as well for a bit. It made sense because it had a second hard drive instead of an optical drive and came with Mac OS X Server, back when that was a standalone $499 product: https://support.apple.com/kb/SP586
(Not sure what differentiates the later model Mac Mini Servers from the regular Mac Minis, since Mac OS X Server just became a $19 App Store purchase, and optical drives were no longer a thing in Mac Minis)
I have one of the 2009 Mac mini Servers running Ubuntu 22.04 LTS like a champ. It’s still a great machine. Upgrading the HDDs to SSDs was a bit of a chore, but doable.
They discontinued the Mac mini Server line in October 2014, which was still sold with two drives instead of one. Configurable to order with SSDs by that time.
There was also a "server" model Mini but it was very short lived and was basically a regular Mini with the "Server" software pre-installed, something that you could just throw in via the App Store with one click anyway.
It had a five year run and saw four different hardware models. It included two hard drives instead of either one hard drive and an optical drive, or just one hard drive (after they ditched ODDs).
Mac OS X Server was its own operating system originally. It was still the same core OS, but had a ton of additional servers built in. Non-exhaustively, they included IPSec VPN, email, calendaring, wiki, SMB and AFP file shares (including support to act as a Time Machine backup destination), LDAP, DNS, and software update caching before it came to macOS proper. The Server app released via the App Store was a shadow of Mac OS X Server.
These were quite popular in small professional offices like law firms.
I saw it save the ass of a client who had one: they got robbed and all their desktop computers, mostly iMacs, were stolen. The Mini was more or less lost among the wiring in the network closet and was overlooked, so everything had backups.
For applications that aren't latency sensitive, I run inference on a free 4 core Ampere server from Oracle. Once you ditch the "fast" prerequisite, a lot of hardware becomes viable.
For folks curious what this is, this seems to be a caching optimization for saving time on parallel streams of text. The benchmark is incidental. Most individual users likely have just one un-batchable conversation going at once with llama-cpp, and I think it’s unclear whether this PR improves that case much.
Also note that the demo video is sped up to fit inside GitHub attachment limits. Your observed speed may vary. :)
I've been curious about using LLMs for large-scale refactoring. Prompts like
anywhere you find `FooBarBaz(blip, kap)` replace it with `new newThing(blip).bump(kap)`
I don't know how reliable it is, but if you can easily run this on commodity hardware it could replace most IDE refactoring tools. Obviously IDE refactoring is more reliable today, but this could be made simple and flexible, and possibly just as reliable as IDEs.
But also it could enable some interesting things that you could never do with an IDE refactoring tool.
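As a sketch of what that could look like, assuming a llama.cpp-style server already running on localhost:8080 with a /completion endpoint (the endpoint shape, the prompt wording, and the src/*.java paths are all illustrative, not a tested tool):

    # Hypothetical prompt-driven refactor over a project folder.
    import json
    import pathlib
    import urllib.request

    PROMPT = (
        "Rewrite the following code, replacing every call of the form "
        "`FooBarBaz(blip, kap)` with `new newThing(blip).bump(kap)`. "
        "Return only the rewritten code.\n\n{code}"
    )

    def refactor(source: str) -> str:
        body = json.dumps({"prompt": PROMPT.format(code=source), "n_predict": 2048}).encode()
        req = urllib.request.Request("http://localhost:8080/completion", data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["content"]

    for path in pathlib.Path("src").rglob("*.java"):  # assumed project layout
        path.write_text(refactor(path.read_text()))

In practice you'd want to diff the output and run the tests before committing anything the model touched.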
I already do things like "write a script to replace `FooBarBaz(blip, kap)` with `new newThing(blip).bump(kap)` in a project folder"
I'm more comfortable with that because I find it usually takes two or three prompts to get it right
e.g. A couple of hours ago I prompted it to help me do a diff of two commits ignoring all whitespace, just to check whether there were any other changes. The first response didn't ignore newlines, the second was a multiline script, and the third gave me what I actually wanted:
diff -w <(git show 0bb2c8579efe775de883e0182db48989bfa324f2:"path/to/file"|tr -d '\n') <(git show 6c71efc17497ad7c90b9c7b690075ec031c13c69:"path/to/file"|tr -d '\n')
I think that an LLM could be an amazing interface or translation layer for this sort of thing, but I would argue that the underlying operations of refactoring or something similar should remain very much like a function with discrete inputs and outputs.
I believe that the application of multiple streams in parallel is a natural evolution of using a single stream. I've used some local models for help in creative writing, and some of the most productive results I got were from running the same prompt and sequence of interactions dozens and dozens of times. Although in that case, I was personally going through each result line by line, I can certainly imagine fully automated tools that leverage the range of responses to a given prompt.
Speed-wise, on a single stream, I have no need for it to generate text faster than I can read.
However, for scripts that try to use hundreds/thousands of invocations to solve some problem (eg. "write me a whole book"), the parallelism will be great (but obviously the script has to be written with that in mind).
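A minimal sketch of that kind of fan-out, again assuming a llama.cpp-style /completion endpoint on localhost:8080 (endpoint, field names, and worker count are assumptions to adapt to whatever you run locally):

    # Fire the same prompt many times in parallel so batched decoding can be exploited.
    import concurrent.futures
    import json
    import urllib.request

    def complete(prompt: str) -> str:
        body = json.dumps({"prompt": prompt, "n_predict": 256}).encode()
        req = urllib.request.Request("http://localhost:8080/completion", data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["content"]

    prompt = "Outline chapter 3 of the book."
    with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
        drafts = list(pool.map(complete, [prompt] * 32))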
Is it possible that in a few years time, only Mac silicon and PCs with high-end GPUs will be required to run "In-home LLMs" affordably?
If we get closer to either "AGI" (whatever the hell that is) or at least a reasonably useful AutoGen/BabyAGI-like system that becomes popular to use at home, those machines will be the only ones capable of running advanced LLMs without having to pay OpenAI, Microsoft, Amazon/AWS, etc. inordinate sums of money to do what consumers will someday deem a utility.
> Is it possible that in a few years time, only Mac silicon and PCs with high-end GPUs will be required to run "In-home LLMs" affordably?
No. They work well on the apple chips thanks to the integrated memory and the large size of the models. I know of no reason why an x86 chip could not be designed in a similar way if desired. IANAChipDesigner but I have worked for one of them.
FWIW, while Apple silicon can _run_ huge models thanks to the unified memory (not to be confused with shared memory), the inference is pretty slow compared to dedicated GPUs, so it's a tradeoff. The significance of this PR is that inference speed can—at least in certain applications—be sped up using parallel decoding.
Do these implementations use the Neural Engine? I saw there was a Stable Diffusion implementation using the Neural Engine, and I found that my MacBook noticeably did not run hot, as opposed to during an average Teams call.
It doesn't. You need to generate models for use on the neural engine, which apple did for Stable Diffusion, but this is just taking advantage of lots of fast RAM and lots and lots of threads, if I understand it correctly.
It uses Metal acceleration and takes advantage of the unified memory architecture, meaning it's basically a GPU with 192 GB of VRAM. Trading space (VRAM) for time (FLOPS), it can beat the performance of an RTX 4080 here.
Encoder only transformers (like BERT) can be made to run on neural engine with CoreML. Efficient inference with autoregressive encoder-decoder and decoder only transformers (aka LLMs) needs KV-caching, which currently can't be efficiently implemented with CoreML (and thus neural engine). So, for now it's GPU only, with Metal.
You can do autoregressive decoding with KV caching on the Neural Engine. You have to make a bit of a trade off and use fixed size inputs [1] but the speed up over no caching is meaningful.
There's a Whisper (Encoder-Decoder) [2] implementation if you want to see it in practice. Shameless plug, but I have a repo [3] where I'm working on autoregressive text generation on the Neural Engine. I'm running gpt2-xl (1.5B params) locally with KV caching at 120ms/token (vs. 450ms without caching). Will push an update soon.
Without quantization you can't go much higher than 1.5B params on M1's Neural Engine. M2 seems to have a higher ceiling but I haven't measured. I'm optimistic (but have not tried) that the new runtime quantization added to CoreML this year will allow for larger (and maybe faster) models on both.
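For anyone curious what the fixed-size-input constraint looks like in practice, here is a rough coremltools sketch with a toy stand-in decoder (the model itself is a placeholder; the point is just the fixed (1, 128) input shape and letting Core ML schedule work on the Neural Engine):

    # Sketch: export a decoder with a fixed sequence length for Core ML / ANE.
    import coremltools as ct
    import numpy as np
    import torch
    import torch.nn as nn

    class TinyDecoder(nn.Module):
        """Toy stand-in for a real decoder; always takes 128 token ids."""
        def __init__(self, vocab=32000, dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)
            self.head = nn.Linear(dim, vocab)

        def forward(self, input_ids):
            return self.head(self.embed(input_ids))

    traced = torch.jit.trace(TinyDecoder().eval(), torch.zeros(1, 128, dtype=torch.int64))
    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="input_ids", shape=(1, 128), dtype=np.int32)],
        convert_to="mlprogram",
        compute_units=ct.ComputeUnit.ALL,  # allow placement on CPU/GPU/Neural Engine
    )
    mlmodel.save("tiny_decoder_fixed128.mlpackage")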
Autoregressive transformer models are usually memory bound, whereas SD is compute bound, so perhaps the difference lies here. Also the reason why SD runs so much faster on the GPU than on the CPU.
M1 has (fast) unified memory between GPU and CPU, so something being memory bound ought not to have much bearing on whether it belongs on CPU or GPU… at least in theory. I’m a total noob here though so I may be wrong.
They're quite good at generating scaffolds and ideas (Mistral specifically).
You can use them for trivial nlp tasks ("between 0 and 1 how similar are these two sentences? Respond with an explanation.") and because it's a small model, you just run it 4 or 5 times and take an average pretty quickly.
With these improvements llama.cpp/ggml is really becoming a pretty competitive serving stack even for large scale cloud hosted AI. I wonder how ggerganov finds the time to do all this, does anyone know if he's being sponsored?
IDK, I can see a future. It’s a one-man (for now) business, so minimal costs to consider. If he can swing consulting using the .cpp projects as advertising, that sounds like a good business.
Additionally, I can imagine companies investing and paying for the open source work to expand access to their licensed models. Use the same interface as people use LLAMA but upgrade to BetterModel, fully compatible.
Additionally, I could believe this is simply a build up to a future Acquihire, which is the most lucrative way to be hired.
Anyone here wanting to try an M1 Mac mini with 16 GB in the cloud for free this month, just send me an email (click handle). Now that we've moved on to the M2, I've got a handful of M1s available FREE to try. You can also try an M2 Pro, Max, or Ultra, but for that you'll need to subscribe. https://www.macweb.com/macinthecloud
I'm waiting for someone to comment "use the page title/why did you change the title/etc.". It's frustrating when you find something important on a page and type that as the title, and then the post gets flagged because it violates HN rules.
For comparison, this is the actual title of the page, but do you think this would increase people's awareness about the fascinating fact I highlighted in the title?
Yeah, I really hate the "no changing titles" rule. I can understand something like "don't sensationalize", but many articles just have poor titles that lack context. What does that accomplish aside from discouraging readership and discussion?
The rule is officially "don't editorialize" which I would interpret (perhaps incorrectly) as allowing a little leeway in surfacing a buried lede so long as it's presented in neutral language.
Something like "Amazing Llama 2 7B performance on M2 Ultra" would obviously fail that test, but the current title of "M2 Ultra can run 128 streams of Llama 2 7B in parallel" seems to follow the spirit of the rule, at least as I read it.
It allows for plenty of leeway, and in my experience alternative titles are accepted and will stand unless they are significantly worse than the original. It happens even with major announcements with hundreds of votes. @dang isn’t some mindless robot who must always enforce one way of doing things. The instructions are, as the page title suggests, guidelines.
> If the title includes the name of the site, please take it out, because the site name will be displayed after the link.
> If the title contains a gratuitous number or number + adjective, we'd appreciate it if you'd crop it. E.g. translate "10 Ways To Do X" to "How To Do X," and "14 Amazing Ys" to "Ys." Exception: when the number is meaningful, e.g. "The 5 Platonic Solids."
> Otherwise please use the original title, unless it is misleading or linkbait; don't editorialize.
Emphasis on "Otherwise please use the original title". I think dang is wonderful and we're lucky to have him, but even in my short year or two here, I've seen enough instances of title-policing (not necessarily from him) that discourage me not just from changing titles but sometimes from posting things altogether if the title isn't good enough originally.
> It's frustrating when you find something important on a page and type that as the title, and then the post gets flagged because it violates HN rules.
If that happens to you a lot, consider that perhaps other HN users disagreed with your assessment of what was important on the page and felt misled when the content didn't primarily match the title.
Anecdotally, I see alternative titles as being well accepted when the true title is subpar. Especially relevant when the matter concerns GitHub issues (which this is).
I’ve heard that there are some issues of gaming performance on M1/2 Ultra specifically (due to it being just 2x M2 Max in the same package), however my M2 Max MacBook absolutely runs Dota 2, and runs it very, very well. Like 180-200fps average well.
Are these uncensored or decensored? If RLHF removes intelligence at any rate, I wouldn't expect that intelligence to come back with a tune that lets it say curse words and talk about religion.
> Most of these models (for example, Alpaca, Vicuna, WizardLM, MPT-7B-Chat, Wizard-Vicuna, GPT4-X-Vicuna) have some sort of embedded alignment
> The reason these models are aligned is that they are trained with data that was generated by ChatGPT, which itself is aligned by an alignment team at OpenAI.
Most of those are fine-tunes of the base model. The fine-tuning data is 'aligned'. The uncensored fine-tune training data is edited to remove the "I can't help you with that" responses.
Yes, as stated in my earlier comment: there's an alignment tax, and then almost certainly an un-alignment tax, on top of that, compared to the raw, unaligned/uncensored, models.
That is interesting, but the "un-training" shown by EricH is simply re-running some fine-tuning on the same public base model with the refusals removed, and it is expensive to do that, too.