USBs and IOMMUs and moving disk mounts … oh my!

In my last post I commented about my USB situation, which I hoped would be quickly resolved with a USB expansion card. I picked up this model from Inateck; a 5-port USB A device which would allow me to connect my USB switcher for keyboard & webcam, a headset for gaming, as well as gamepads and wheels.

It arrived with customary Amazon quickness, and I added it to the machine. Slightly annoyingly, it (like the other cards I looked at) required a power connection from the PSU, as the PCIe slots can’t supply the 5V required. I was out of SATA connectors so this meant running a Molex cable through the case (urgh).

I brought the machine back online and found the new USB device sitting happily in its own IOMMU group, so I bound it to VFIO, rebooted the server, then passed it through to the Windows machine and started it up.
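For anyone wanting to check this outside the Unraid web UI, here is a minimal sketch of how the groups can be listed. It assumes a Linux host that exposes groups under /sys/kernel/iommu_groups, with one numbered directory per group and a devices folder of PCI addresses inside each. The function name list_iommu_groups is my own, not anything Unraid provides.

```python
from pathlib import Path


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_number: [pci_address, ...]} as read from sysfs.

    Assumes the usual Linux layout: <root>/<group>/devices/<pci address>.
    """
    groups = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return groups  # IOMMU disabled, or not a Linux host
    for group_dir in root_path.iterdir():
        devices_dir = group_dir / "devices"
        # Skip anything that is not a numbered group directory.
        if not group_dir.name.isdigit() or not devices_dir.is_dir():
            continue
        groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    # Sort by group number so the output is stable between runs.
    return dict(sorted(groups.items()))


if __name__ == "__main__":
    for group, devices in list_iommu_groups().items():
        print(f"group {group}: {', '.join(devices)}")
```

A card "sitting happily in its own IOMMU group" would show up here as a group containing only that card's address.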

That’s where my troubles began.

The VM booted up with no video. I shut it down, removed the USB device, and powered it on again – everything worked fine. As it turns out I started thinking along the right lines, but in a strictly hardware sense rather than a software-defined one.

I removed the card and put it in another slot, then tried again. The same problem occurred. Thinking that I had a bad card and not really being sure what to try next, I removed the card and booted Unraid again to go off in search of a solution. Had I bought a bad card? Was it a bad chipset that Unraid didn’t like?

All of my questions would be replaced by a huge WTF, as Unraid booted and told me that my 1TB NVMe had disappeared. What made it especially WTF-y is that the disk was there, but unmounted and presenting at a different mount point. What should have been nvme1n1 was showing as nvme2n2.
After a reboot didn’t clear this weird gremlin, I stopped the array and assigned the disk using its new mount point. This worked fine briefly, then the new disk mount point went offline. Unknown to me at the time was that this had coincided with my VM booting up.

I shut the machine down, took it to the kitchen island, stripped it open and moved the physical NVMe card to another slot, and then reassembled it all.

Another boot-up went fine – the disks all appeared where they should. This was good! Then it wasn’t. Suddenly the three disks I have attached to my SAS breakout card all went offline, showing billions of read/writes. One of the disks was entirely disabled. I freaked out slightly, shut down the array, then posted for help on /r/Unraid on Reddit.

After a spell I powered the array back on and, finding one disk still disabled, decided to try rebuilding the array after the machine had sat stable and error free for a while. This was a 15+ hour job, so I took this opportunity to sulk off to the lounge for the night.

Thankfully somebody found my post and realized the exact same thing had happened to them, and told me the issue and the fix.

As it turns out, and this is really important to know, when you add a new piece of hardware, the IOMMU groups can change for all hardware in the system. It’s a bit like all the house numbers in the street changing when someone else moves in, with the added confusion of some of the family members swapping families.
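One way to catch this before booting a VM is to take a snapshot of what lives at each PCI address before installing new hardware, then compare it afterwards. The sketch below assumes the passthrough configuration refers to devices by PCI address (which this post doesn't confirm for every setup) and that /sys/bus/pci/devices exposes a vendor and device file per address. The function names and the IDs in the usage example are made up for illustration.

```python
from pathlib import Path


def snapshot_pci(root="/sys/bus/pci/devices"):
    """Return {pci_address: "vendor:device"} for every PCI device in sysfs."""
    snapshot = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return snapshot  # not a Linux host, or sysfs not mounted
    for dev in sorted(root_path.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip()
            device = (dev / "device").read_text().strip()
        except OSError:
            continue  # entry without readable ID files; skip it
        snapshot[dev.name] = f"{vendor}:{device}"
    return snapshot


def moved_devices(before, after):
    """List (address, old_id, new_id) for addresses whose occupant changed.

    new_id is None when nothing sits at that address any more.
    """
    changes = []
    for address, old_id in sorted(before.items()):
        new_id = after.get(address)
        if new_id != old_id:
            changes.append((address, old_id, new_id))
    return changes
```

A usage example with invented IDs: if the address that used to hold the NVMe now holds the new USB card, any VM still pointing at that address will grab the wrong device.

```python
before = {"0000:04:00.0": "0x1111:0xaaaa"}   # hypothetical NVMe
after = {"0000:04:00.0": "0x2222:0xbbbb",    # hypothetical USB card
         "0000:05:00.0": "0x1111:0xaaaa"}    # NVMe moved here
print(moved_devices(before, after))
```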

Essentially after my USB shenanigans, whenever I powered on the Windows 10 VM, it was trying to address hardware it believed was certain things at certain addresses, when in fact it was now addressing totally different hardware, and the entire machine freaked out.
Ironically this is not entirely dissimilar to something I once did as a child when gifted a 400MHz Dell machine from my dad’s work. It had some IRQ conflicts on the motherboard which meant you had to attach the PS/2 mouse to a PS/2 > Serial converter, which was then plugged into a Serial > Parallel converter and the whole thing went where the printer should have done.
Naturally I decided I could fix it, and broke it so badly that when it was returned to the IT Manager, he couldn’t fix it either.

Anyway, once I’d realized this, things started to click into place.

Once the array rebuild had finished I shut the machine down and remounted the USB card (and also remounted the whole CPU heatsink assembly one last time; it’s now idling at 40C, which is about where I think it should be). I then blew away the VM configuration and remapped all the hardware.

It was a little unhappy at first; I had to unmount the video card entirely, go in via VNC to uninstall the drivers, give it a few reboots and then reattach the card, but it’s now working well – I’m writing this post on it, on the right keyboard talking wirelessly to the USB switcher, itself wired into the new USB card being passed through to the VM.

A success! Had I realized the IOMMU shifting sands situation, none of this would have happened, but it’s a great learning experience! After all, this was a project PC …
