Why libvirt supports only 14 PCIe hotplugged devices on x86-64 (dottedmag.net)
249 points by andreyvit 11 months ago | 68 comments

> It already supports a number of obscure options (you can make QEMU claim to support a CPU feature regardless of whether the host CPU supports it, really?), so adding one more would fit in just fine.

> Nope. “there are no plans to address it further or fix it in an upcoming release”.


I could see that being the response of an individual open-source developer working for free. But that was IBM saying that, and people pay big bucks to IBM to fix things like this.

It's a bug filed against RHEL 7 originally, by someone working at Red Hat, who suggested that we add the qemu disable-io feature to libvirt. There was no customer case behind either the original RHEL 7 bug or this cloned RHEL 8 bug, so we simply didn't think it was important to implement this, and 5 years after the original bug was filed, with no customer coming along and no one having done the work upstream, the bug was auto-closed.

However if someone came along and did the work upstream to fix it, I'm sure that would be accepted.

Or if a customer turned up who wanted this, that would also be implemented.

WONTFIX sounds so final.

Though, reading the closing comment, this is really CLOSED-WONTFIXYET, as in no plans.

Maybe it'd be nice to introduce a WONTFIXYET. Might be useful to fossick among features abandoned that someday become feasible.

Yeah WONTFIX implies will not accept patches in every project I've worked on.

Lucky they're not the ones accepting patches. If upstream adopts it they'd have to disable support in their build, which seems unlikely.

I wouldn't read WONTFIX like that for anything downstream (i.e. a distribution bug tracker for projects they package). It's "we won't do it", not "no one else can". WONTFIX only means WONTFIX if that's in the one true source of the project, i.e. upstream.

RH usually have a lot of not-upstream patches though (at least they do for the kernel, grub, and qemu, so I assume the same for libvirt) which complicates things a bit.

> Lucky they're not the ones accepting patches. If upstream adopts it they'd have to disable support in their build, which seems unlikely.

Upstream libvirt would probably be Red Hat employees. :) It would not be disabled in the RHEL build, but note that Red Hat does disable a lot of features of QEMU in RHEL. About this:

> RH usually have a lot of not-upstream patches though (at least they do for the kernel, grub, and qemu, so I assume the same for libvirt) which complicates things a bit.

Most of the patches in kernel and QEMU are backports from upstream.

For QEMU there are 20-30 not-upstream patches and they're mostly configuration. For the kernel it's a bit more but not many, and Libvirt probably has even fewer.

We usually use "Open, P4" (0 is highest. 2 is standard) for unscheduled work.

Some people strongly dislike "open" for anything not being worked on or having plans to work on. And I've come to agree with that. You just do not want it showing up in backlogs at all.

I expect that if it comes from one of their $X million support contracts, their answer will be very different.

You’d like to think so but often even at that level you’re paying for it not to be your fault.

Now if your CTO golfs with an exec at IBM, you might get somewhere.

At Red Hat, this would usually mostly be a PM+dev+QE decision, not C level. If C level got involved, that meant massive customer impairment. There are simply too many open bugs to spend time on something that no customer is apparently interested in, but if a customer came along with a big burning problem and there was no way for them to architect around it, they'd usually fix something like this.

When I worked at Red Hat (virtualization-related, but not libvirt) we had a case where a customer would be running a configuration which was... unwise. We still spent a lot of time figuring out how to help them and shipped a patch that made their life easier.

The people working there are not evil, they don't intentionally not fix things, but a bug that is seen as a minor limitation and hasn't popped up in any customer case in 5+ years simply won't get fixed. There are hundreds if not thousands of other bugs that are more urgent and only so many developers, QE engineers, docs people to work on this. Even if someone wrote a patch for it, it may not get merged due to how expensive Red Hat's process for shipping a change is.

Ooo, I gotta start playing golf. That might be the way I can get some support for my Gmail account.

"If posting to Hacker News doesn't get your support query fixed, book 9 holes with the board members."

I really need to get off gmail as soon as possible.

Obvious solution is to get friendly with the package maintainer, find a security vulnerability in the anti-feature you need, lodge a security bug, get a CVE, and ask real nicely for an upgrade to include the feature^H^H^H^H^H^H^H fix you need.

On a serious note, I deal with bug fix / feature requests regularly in Z-Stream (kernel) in an almost weekly cadence, getting the patch upstream, getting the request to backport, making an adequate test of the feature.

Source: Worked in product security for little red, now big blue.

Having worked for an organization with one of said $X million support contracts with IBM, it often is not.

What is the point of the support contract then?

There are a lot of various points to support contracts. Support contracts are not “fix any bug” or “implement any feature”, but…

- With a support contract, there are SLAs to respond to incidents / bugs within a timely manner, and

- You have a clear escalation path to talk to engineers, rather than customer service, and

- While you can’t dictate what features / bugs will get fixed, you do have some weight for prioritization.

If you have no support contract, then you may not be able to talk to engineers at all, your bug reports may get completely ignored (not even looked at), etc.

Yeah but if the response to your bug report is "WONTFIX", it doesn't much matter if you got that reply in a timely manner. It still does you absolutely no good.

I don’t understand what kind of point you’re making here.

Are you saying that support contracts are completely worthless because some bugs are closed WONTFIX?

B2B generally does not run on the “let’s screw our customers as much as possible” model. Of course, some do—companies like IBM and Oracle are famously extractive, and cloud providers are trying their best to bait you into getting locked into their cloud.

But in a typical B2B scenario, the support contract is the entry price for having real people read your bug reports and respond in a timely fashion. That’s the starting point, and from there, the bug will get fixed, or you’ll get connected to a “customer support engineer” or someone that will tell you that you’re using the product wrong, or you’ll be given a workaround. Without the support contract, you don’t get the workarounds, you don’t get the fixes, and you don’t get the contact with engineers. You just get to figure it out on your own. Yeah, a percentage of bugs get closed WONTFIX. That’s normal. Yeah, the contract may only require a response and not a fix. The actual practice is that you get some bugs fixed, and some not, and that’s a lot better than your bug reports going straight into the trash.

In my experience (not with IBM), support contracts get you issues resolved. You get a certain number of incidents per contract, and each one gets resolved. You ask for a feature or bug fix, and they implement it. That’s what you are paying for.

Now Red Hat would have no obligation to upstream or maintain the patch, even to projects they own. But you ask for a bug fix under a support contract, they should fix the bug. Even if it’s just a patch for that one customer only.

To be the provider of a support contract and then just turn around and say “nah, won’t fix” in response to an official customer service contract request… I’ve never, ever heard of that in my professional career.

Sometimes customers come up with feature requests that are reasonable on the surface but would cost millions to implement and maintain. In that case, workarounds are one way to resolve the issue.

> Even if it’s just a patch for that one customer only.

Red Hat does not do one-off patches. If it's fixed in the product, it's fixed for everyone (including upstream).

> In that case, workarounds are one way to resolve the issue.

Yea I always check "why" and if the feature request is not already reasonably well motivated, I'll reach out to the customer and ask.

In my experience, quite frequently a non-trivial feature or change request can be solved as well, if not better, by a simple but different change instead.

Alternatively it allows me to see that three customers are asking for nearly the same thing, even though the feature requests make them sound quite different.

For Red Hat, there are bugs and there are customer cases. The two are not the same, but they are linked internally. Customer cases don't deal with "X functionality doesn't work in libvirt", but rather a higher level issue that the customer can't resolve due to the underlying bug. Here a workaround the customer can live with is a perfectly acceptable solution.

In my time there I never saw a bug closed as WONTFIX unless the customer case was resolved in a satisfactory manner. Red Hat people are very aware who's paying the bills. I have seen badly managed escalations, but I've never seen anyone taking customer problems lightly. However, bugs that have no customer case attached to them carry very little weight unless an engineer, a PM, etc. says that this is really bad and will cause problems in the future.

This is part of where the disconnect comes from. Red Hat prides itself in being an open source company, but it is first and foremost a customer-oriented company. Sure, often individual people will take time they have left and go fix something for the community, but if it's something more involved, it will need PM and management support. Benefiting from and also providing their customers with open source is simply part of the model that has worked for Red Hat, but nobody should be under the illusion that this is done for the greater good. Red Hat is a company and companies serve the purpose of making money for their shareholders. If this happens to align with the interests of the open source community, that's awesome, but will not always be the case. Over the past few years there have been numerous instances of that unfortunate reality.

Without knowing the specifics about libvirt's funding, if a project needs to be truly community-driven, the community must come up with a model that doesn't involve Red Hat paying a large portion of the salaries of the people involved, or it will be subject to Red Hat's business interests.

For the ability to point a finger at someone else. Otherwise you would be pointing at yourself.

> it doesn't much matter if you got that reply in a timely manner

Compare "I can't do that" with "They wouldn't do that".

> I guess SeaBIOS can't figure out how to assign I/O space to all devices that want some, and so it simply gives up?

It appears that the VM works fine with some devices but refuses to boot with others (esp. e100).

So the answer might be more nuanced than it seems?

The devices for which it works fine are those that don't request any I/O space.

As Red Hat becomes more commercial it's imperative we don't let them be stewards of open source anymore. Too many times things have been bent to their corporate strategy.

For example they took ownership of X11 only so they can let it die in favor of their preferred Wayland. While Wayland is not bad, it's not covering everything.

But anyway I don't really care anymore, I'm less and less invested in the Linux ecosystem. It's too commercial now, I just stick with the BSDs <3

Because, if you read the report, such an option is not needed:

- if you disable IO port allocation and plug in a card that requires it, that card cannot possibly work

- if you don't disable it but use only cards that don't require IO ports, you might get an error in your dmesg but the card will still work just fine

So, why would you need to specify this option in the first place?

> So if you wish to have more than 14 PCIe slots in your VM, you’ll have to use QEMU directly.

No need, libvirt can pass arbitrary options to QEMU.


Back before libvirt made it trivial, I used QEMU/KVM directly to map PCI devices to VMs. It's a little tricky because you must first unmap the device from the host/hypervisor, and you need to unmap the whole bus that the device is on. So if there are other PCI devices on the same bus as that device you want to map, they must all go along, which is often impossible for things like the USB controller for your keyboard/mouse.

These days, instead of crafting a custom script to launch QEMU/KVM for PCI mapping, it's just a few clicks in virt-manager. Note that the first time you launch a VM with a mapped PCI device, the launch will often fail with an error, but it will work on a subsequent retry and thereafter.

Also, I've tinkered with lots of VMs over the past 15 years and I've NEVER had a need for more than 14 buses. Hopefully I never will.

Author here. Thanks, I'll give it a whirl!

Not sure it will work though: I need to add an option to a `pcie-root-port` command-line argument managed by libvirt.

I can try skipping creating `pcie-root-port`s by libvirt completely, and add them manually using options passthrough, but I'm not sure the rest of libvirt won't throw a fit when it finds other devices that refer to these (unknown to libvirt) PCIe slots.

I'm curious to know more about the VM host machine that they plugged 15 e1000 cards into to test this limitation. And even more curious about the non-test environment in which somebody ran into this limitation.

I can only imagine trying to passthrough 20 nvme devices to a guest, but it seems like a very weird configuration.

> but it seems like a very weird configuration

On IaaS providers, you get "local scratch NVMe" presented to the guest as individual fixed-sized disks — presumably because they're being IOMMU-pass-through'ed from the host (or a JBOD direct-attached to the host.)

The sizes for these disks were standardized several generations ago, so they're at least presented to the guest as 375G slices (I'm guessing they might actually be partitions of a larger disk nowadays.) To get "decent" amounts of local scratch storage for e.g. a serverless data-warehouse instance, you need "all you can get" of these small volumes — which on at least AWS and GCP, is 24 of them (equalling ~9TB.)

And that's just one guest. The host might have several such guests.

(To be clear, neither AWS nor GCP is likely to be using libvirt anywhere in their stack. This is just to demonstrate the use-case.)

A serverless data warehouse instance sounds like an oxymoron

"Serverless" is a jargon term, with a specific meaning — basically "all state is canonically durable in some external system, usually one rooted in a SAN-based managed object store like S3; there are no servers that keep durable state that must be managed, only object-store bills to pay and spot instances temporarily spun up to fetch and process the canonically-at-rest state."

(This kind of architecture is actually "serverless", but in a possibly-arcane sense to someone who doesn't admin these sorts of systems: it's "serverless" in that your QoS isn't bounded by any "scaling factor" proportional to some number of running servers. You don't have to think about how many "servers" — or "instances" or "pods" or "containers" or whatever-else — you have running. Especially, as a customer of a "serverless" SaaS, you will only get billed for the workloads you actually run, rather than for the underlying SaaS-backend-side servers that are being temporarily reserved to run those workloads.)

Snowflake and BigQuery are examples of serverless data warehouse systems. You do a query; servers get reserved from a pool (or spun up if the pool is empty); those servers stream your at-rest data from its canonically-at-rest storage form to answer your query.

In a serverless data warehouse, as long as you still have the same server spun up and serving your queries, it'll have the data it streamed to serve your previous queries in its local disk and memory caches, making further queries on the same data "hot." The more local scratch NVMe you give these instances, the more stuff they can keep "hot" in a session to accelerate follow-on queries or looping-over-the-dataset subqueries.

what does "canonically durable" and "canonically-at-rest storage" mean?

Most database systems are canonically-online: the state lives on the instances, and you make backups of it, but these are never more canonical than what’s on the local online storage of the cluster (and usually less-so, because it’s offset back in time by at least a few seconds, if not hours.)

When a cluster-node permanent-faults (say, its DC burns down), you lose at least a few seconds of what you — and your customers — thought of as committed data.

In a canonically-at-rest DBMS, the only state that matters is the state in the object store (or other external, highly-replicated durable-storage abstraction.) The nodes are ephemeral caches in front of the canonical at-rest data; and all writes must be pushed down to the at-rest representation before any other nodes in the cluster can see them, and before the write returns as successful to the client.
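A minimal Python sketch of that write path, in case it helps. `ObjectStore` and `Node` are hypothetical names, and the in-memory dict only stands in for a real durable store. Cache invalidation between nodes is deliberately left out.

```python
class ObjectStore:
    """Stand-in for an external durable store (hypothetical; a dict here)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        self._objects[key] = value

    def get(self, key):
        return self._objects[key]


class Node:
    """A compute node whose local state is only a cache of the store."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def write(self, key, value):
        # Push the write down to the durable store first...
        self.store.put(key, value)
        # ...and only then keep a local copy and report success.
        self.cache[key] = value
        return True

    def read(self, key):
        # Serve from the local cache, falling back to the canonical copy.
        if key not in self.cache:
            self.cache[key] = self.store.get(key)
        return self.cache[key]


if __name__ == "__main__":
    store = ObjectStore()
    first, second = Node(store), Node(store)
    first.write("row", "committed")
    del first  # the node goes away; nothing canonical lived on it
    print(second.read("row"))
```

Losing a node in this model loses only its cache, because a write is not acknowledged until the store has it.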

Not stored in memory.

...the use case of "our architecture's idiotic limitations made it hit hypervisor limitations" ?

That definitely wouldn't be the first time.

Probably not normal partitions but nvme namespaces instead since that will also allow them to balance iops and such so that one customer doesn't affect another as much.

These are emulated `e1000` devices, not pass-through

If I'm not wrong, the pre-allocation of I/O ranges in PCIe bridges is needed only if you intend to hot-plug devices that were not present in the first enumeration.. but in VMs the hardware is known from the start and the PCIe enumeration can assign I/O ranges only if devices underneath actually need them... is there a reason why hot-plugging is needed in VMs?

Cloud customers love it when they can just attach stuff to their VMs without having to recreate them or even reboot them.

Isn't the cloud notoriously worse about hotplugging anything than on-prem systems are? For example, vSphere supports hot adding CPUs and RAM to VMs, but Azure doesn't.

Seems unsurprising. On Azure, if it goes wrong, the various tenants aren't all working at the same company.

> is needed only if you intend to hot-plug devices that were not present in the first enumeration

Correct. I regularly use VMs with more than 14 statically configured PCI devices using QEMU with libvirt without having to resort to qemu:cmdline.

Author here.

Have you got it working with PCI or PCIe? PCI devices attached to the top-level bus do not request I/O ports unless they need to, and if they do, they request only a small slice.

QEMU also allows one to put 8 static PCIe devices into a single "multifunction PCIe device", so it requests 4K I/O ports per 8 devices, giving a bigger headroom. The downside, of course, is that all these 8 devices lose individual hotpluggability, and can only be added/removed en masse.

The biggest problem is hotplug slots, each taking 4K I/O ports unless told otherwise in a way libvirt does not support as I described in the article.
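A back-of-the-envelope sketch of the arithmetic in this subthread, in Python. The 4K-per-slot figure and the 8-devices-per-multifunction-slot figure come from the comments above, and the 64K total follows from x86 port addresses being 16-bit. `RESERVED_WINDOWS = 2` is purely an assumption chosen so the result matches the observed limit of 14; it does not explain where those two windows go.

```python
IO_PORT_SPACE = 64 * 1024  # x86 port addresses are 16-bit
WINDOW = 4 * 1024          # I/O ports taken per hotplug slot (per the thread)
RESERVED_WINDOWS = 2       # ASSUMPTION: windows not available to hotplug slots


def max_hotplug_slots(space=IO_PORT_SPACE, window=WINDOW,
                      reserved=RESERVED_WINDOWS):
    """Number of hotplug slots that can each get a full I/O window."""
    return space // window - reserved


def windows_for_multifunction(devices, functions_per_slot=8):
    """I/O windows needed when static devices share multifunction slots."""
    return -(-devices // functions_per_slot)  # ceiling division


if __name__ == "__main__":
    print(max_hotplug_slots())
    print(windows_for_multifunction(20))
```

The second function shows why multifunction grouping gives more headroom: twenty static devices need three windows instead of twenty, at the cost of individual hotpluggability.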

There are 14 hardware based PCIe devices (mixture of NVME drives and NIC VFs) along with various other emulated PCI devices (virtio block, serial, etc.).

I have not tried to hot [un]plug devices with this configuration. It looks as though I’m likely to be disappointed if I try. Thanks for the explanation.

Author here. As correctly guessed in other comments: cloud infrastructure.

To make public IPs and volumes hotpluggable without a guest agent running inside every VM, one has to manage them in a way that the guest OS will handle hotplug using regular mechanisms. For volumes it's PCIe storage hotplug, for public IPs it's PCIe network card hotplug.

If a VM is used as a Kubernetes worker, a couple of dozen volumes and public IPs attached is not an unlikely situation.

It’s not a common use-case but I could see it being useful for sharing hardware that requires exclusive access like GPUs/ML accelerators.

Currently if you need GPUs they come with the instance itself meaning you need to boot your VM from scratch, do the work and then shut it down to relinquish the GPU.

With hot-plug you could have continuously running VMs that only attach/detach GPUs as needed, no longer taking the overhead of a full cold boot/shutdown every time.

adding NVMe emulated storage would be one

Devices passed from the host to the guest?

Hot plugging refers to adding devices to the VM while the VM is running. Passing host devices through is commonly accomplished without hotplug.

Right. I should have been more clear that you can also hotplug a host device and pass it in. Admittedly, this is typically USB and not PCIe. And the PCMCIA days are over...

I ran into this on FreeNAS which uses Bhyve. Not sure if it's FreeNAS' way of doing things, but adding a virtual disk using VirtIO creates a separate SATA controller.

I tried forwarding quad NVMe's and couldn't get it working until I discovered I was hitting this limitation between the existing disks and VirtIO network card.

Are you sure that's the same issue? Bhyve doesn't share an awful lot of code with KVM/Qemu.

Perhaps I am slightly misremembering and it was incidental to the NVMe's, but it did fail due to this 14 PCIe device limit because the virtual disks did not share a controller, and I had to change to using Bhyve's AHCI driver for some disks to get the VM running again.

I even did a test adding one disk at a time until the VM stopped booting.

Would like to hear more about why i/o ports stayed fixed and "usage decreased over time." USB/TB devices must not use them, right?

They stayed fixed because they were fixed devices in a simple computer. Basic keyboard support, legacy interrupt controller, legacy timers, VGA… stuff that still to this day to an extent makes a PC actually “a PC”, and that may even still be used to various extents in early stages of the boot chain.

In early computers, most device resources were fixed, especially critical stuff like keyboard and interrupt controller. Sometimes devices were jumpered, but even then you’d have the choice between a few well known ranges.

There wasn’t any configuration/negotiation protocol in the early days, it was literally defined by how the wires were connected, and a few fixed logic gates. For compatibility, it had to stay that way. x86 PCs have a lot of legacy cruft.

But early boot is also where use of that legacy stuff usually ends with modern operating systems. Pretty much all these devices have been replaced by additional modern variants that are now mostly just using regular MMIO as well (I don’t think x86 I/O ports have relevant advantages, would appreciate to be told otherwise). For devices that are supposed to work on other machines than PCs (nowadays that mostly means ARM stuff), it can even get in the way, since they don’t know about this weird I/O port address space.

So modern OSes of course prefer the newer device variants (most are also decades old by now) of keyboard, interrupt controller, etc., and since those don’t tend to use I/O ports for the aforementioned reasons, modern OSes don’t use them too much in general anymore.

> I don’t think x86 I/O ports have relevant advantages, would appreciate to be told otherwise

It's far outside the mainstream, but the x86 task state segment allows user level tasks to do i/o on specific ports, with single port granularity. You can map memory for a task only at a page level, so you could potentially allow user-space drivers finer grained access to devices. Of course, more or less nothing uses this.
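To illustrate the granularity point, here is a small Python model of the I/O permission bitmap that the task state segment refers to: one bit per port, where a clear bit means the task may access that port. This is only a sketch of the lookup logic; the class and method names are made up for illustration, and the real check is done by the CPU, not by software.

```python
class IOPermissionBitmap:
    """Toy model of the x86 I/O permission bitmap: one bit per port, 0 = allowed."""

    PORTS = 65536  # 16-bit port address space

    def __init__(self):
        # Start with every bit set, i.e. every port denied.
        self.bits = bytearray(b"\xff" * (self.PORTS // 8))

    def allow(self, port):
        """Clear the bit for a single port."""
        self.bits[port // 8] &= ~(1 << (port % 8)) & 0xFF

    def permitted(self, port, width=1):
        """A `width`-byte access is allowed only if every covered port is."""
        if port < 0 or port + width > self.PORTS:
            return False
        return all(
            ((self.bits[p // 8] >> (p % 8)) & 1) == 0
            for p in range(port, port + width)
        )


if __name__ == "__main__":
    bitmap = IOPermissionBitmap()
    bitmap.allow(0x60)
    print(bitmap.permitted(0x60), bitmap.permitted(0x61))
```

Compare that with memory mapping, where the smallest unit you can hand to a task is a whole page.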

> For devices that are supposed to work on other machines than PCs (nowadays that mostly means ARM stuff), it can even get in the way, since they don’t know about this weird I/O port address space.

PCI host bridges are supposed to offer a way to interact with I/O ports if it's not something natural for the CPU. Whether or not that happens regularly, I'm not really sure.

> Whether or not that happens regularly, I'm not really sure.

All older machines (e.g. PowerPC Macs) mapped the I/O ports to an area of the "regular" address space. They probably still do it for legacy reasons. I think only s390 got rid completely of I/O ports because they never implemented PCI, only PCIe.

Because it's used mostly for commands, so each device used very few ports. 128 bytes is already a pretty large size for the I/O port area of a PCI device, and a lot of them fit in 64k.

Author here. Your guess is right.

A lot of hardware has migrated from using I/O ports to memory-mapped I/O, and instead of fixed I/O addresses ACPI or a similar mechanism provides the OS with the directory of memory addresses to talk to.

For example, instead of PS/2 keyboard/mouse at I/O ports 0x0060-0x0064, ACPI provides the OS with the memory address to talk to a USB controller, and the USB controller does not use I/O ports at all.

Have a look at a list of the most common I/O ports: https://wiki.osdev.org/I/O_Ports#The_list

Most of this hardware is gone. The easiest way to see them at all is to boot a VM in QEMU and specifically ask for these ancient devices to be present.

Yeah, I remember them. Never really understood what they did besides needing to be set for ISA cards, etc.

But this doesn't answer the question why 14 and not 16. There's a diff of two there...

Author here. D'oh, you're right. Added it to QEMU section.

Amazing, thanks, that closes the loop!
