Issues with GTX960 on CentOS7 using bhyve PCI passthru (FreeBSD 11-RC2)

soralx at cydem.org soralx at cydem.org
Fri Jan 13 08:17:39 UTC 2017


> >  First, `nvidia-smi -q` output diff [0] is interesting. It suggests
> > that the card may be in some incompletely initialized state: notice
> > the "Unknown Error" instead of real UUID, and the P8 power state.
> > Could it be that the driver doesn't put the card's BIOS in the right
> > state?  
> 
>   That is extremely likely. bhyve itself doesn't have a BIOS, though 
> bhyve/UEFI could be modified to handle option ROMs (see 
> http://awilliam.github.io/presentations/KVM-Forum-2014/#/)

Hm, interesting. I wonder if a card that's not designed for use
with UEFI is destined not to work well/at_all with bhyve...
I'll read the presentation later.

> > -    GPU UUID                        : GPU-f6c71b8e-f6c8-5a42-260d-1164720bf4f2
> > +    GPU UUID                        : Unknown Error  
> 
>   That implies some type of h/w access isn't working, either MMIO 
> registers or response from a DMA command.

I have a feeling it's something to do with DMA that's
not getting configured correctly for data transfers,
and returns wrong data (or good data to wrong location).

> > -    Board ID                        : 0x100
> > +    Board ID                        : 0x4  
> 
>   The same ?

I'm quite sure it was the same card.

> >              PCIe Generation
> >                  Max                 : 2
> > -                Current             : 2
> > +                Current             : 1  
> 
>   bhyve's emulated PCI hostbridge only advertises gen-1 - that could be 
> easily changed to gen2. That could make a difference for some of the 
> clock issues below
>   (source is pci_emul.c:pci_emul_add_pciecap())

I doubt the generation number matters. But yeah,
wouldn't hurt to change it to '2'.
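
For reference, a minimal sketch of how the generation and lane count
are packed into the PCIe Link Capabilities / Link Status registers
(bits 3:0 are the link speed, bits 9:4 the link width). The helper
names and the example values are assumptions made for illustration;
they are not copied from pci_emul.c.

```python
# Field layout of the PCI Express Link Capabilities / Link Status registers:
#   bits 3:0 = link speed (1 = 2.5 GT/s "gen1", 2 = 5 GT/s "gen2")
#   bits 9:4 = link width (number of lanes)

def decode_link(reg):
    """Return (generation, lanes) from a Link Capabilities or Link Status value."""
    speed = reg & 0xF
    width = (reg >> 4) & 0x3F
    return speed, width

def encode_link(gen, lanes, other_bits=0):
    """Pack generation and lane count, preserving unrelated bits (hypothetical helper)."""
    return (other_bits & ~0x3FF) | ((lanes & 0x3F) << 4) | (gen & 0xF)

# Illustrative values only, not taken from bhyve's source.
gen1_x1 = encode_link(1, 1)
gen2_x16 = encode_link(2, 16)
print(hex(gen1_x1), decode_link(gen1_x1))
print(hex(gen2_x16), decode_link(gen2_x16))
```

Changing the advertised generation would then amount to changing the
low four bits of whatever value the emulated capability reports.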


> >              Link Width
> >                  Max                 : 16x
> >                  Current             : 16x  
>   That's a bit unexpected since the hostbridge only advertises 1x, but 
> the driver is probably exporting the host value here.

Yeah, nVidia is known to like talking directly to the card
in its own, proprietary way.

> > -    Performance State               : P0
> > +    Performance State               : P8  
> 
>   Not sure what's happening here.

Driver not kicking the card's BIOS into the right mode
to switch to dynamic power state selection?

> >      Clocks
> > -        Graphics                    : 625 MHz
> > -        SM                          : 1251 MHz
> > -        Memory                      : 1304 MHz
> > -        Video                       : 540 MHz
> > +        Graphics                    : 405 MHz
> > +        SM                          : 810 MHz
> > +        Memory                      : 324 MHz
> > +        Video                       : 405 MHz  
> 
>   This may be related to the gen1 vs gen2 issue above.

I doubt it's related to PCIe gen. Most likely it's because the
card seems to remain in P8 (low power) mode, according to
the same SMI tool. But the frequencies don't look right
anyway; well, I didn't bother to look up what P8 is supposed
to run at.

> > When rebooting, I get this:
> > nvidia-modeset: ERROR: GPU:0: Idling display engine timed out:
> > 0x0000857d:0:0:0x00000040  
> 
>   This may be DMA not working.

Yes, I strongly suspect DMA too, especially when it comes
to DRI stuff.

>   A general issue with PCI passthrough is that often MMIO from the
> guest works, since that is just VT-x remapping, but DMA doesn't work
> due to issues with IOMMU programming (or incorrect mappings being
> used). This gives a device that partially works in that registers can
> be read, but data transfer doesn't work.

Didn't we verify that the BARs are programmed correctly?
So you're saying that bhyve has a bug in that it doesn't
program the IOMMU right to match guest's memory-mapped
address regions to host's addresses?
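
To make the failure mode concrete, here is a toy model (not bhyve
code) of the two translation paths: the CPU-side table used for guest
loads/stores, including MMIO, and the device-side IOMMU table used for
DMA. All page numbers are made up, and the missing IOMMU entry is a
hypothetical fault injected for illustration.

```python
PAGE = 4096

class Fault(Exception):
    """Raised when an address has no entry in the table being consulted."""

def translate(table, gpa):
    """Translate a guest-physical address using a page-granular mapping table."""
    page, offset = divmod(gpa, PAGE)
    if page not in table:
        raise Fault(f"no mapping for guest page {page:#x}")
    return table[page] * PAGE + offset

# CPU-side table (EPT under VT-x): guest page -> host page.
ept = {0x100: 0x8000, 0x101: 0x8001}
# Device-side table (IOMMU): consulted when the passed-through device does DMA.
# Hypothetical bug: guest page 0x101 was never entered here.
iommu = {0x100: 0x8000}

addr = 0x101 * PAGE + 0x10
print("CPU access ->", hex(translate(ept, addr)))
try:
    translate(iommu, addr)
except Fault as e:
    print("DMA fault:", e)
```

In this model register access succeeds while DMA to the same guest
address fails, which is the "partially working device" pattern
described above.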

> > Jan 11 11:34:49 fbsd12tst kernel: nvidia-modeset: ERROR: GPU:0:
> > Display engine push buffer channel allocation failed
> > Jan 11 11:34:49 fbsd12tst kernel: nvidia-modeset: ERROR: GPU:0:
> > Failed to allocate display engine core DMA push buffer
> 
>   Not sure what's happening with those.
> 
>   Would it be possible to try the nouveau driver ? At least the source 
> is available, so it may be easier to determine what is broken.

I could, but for now I'd like to focus more on the AMD card
(which also has an open-source driver).

> > BTW, is it [generally] safe to decrease the BAR base address further?
> > My workstation has a CPU with just 36 address bits...  
>   Yes. The only potential conflict is with the top of guest RAM, and 36 
> bits is a lot of RAM :)

64G of RAM isn't that much these days, how incredible is that :)
But you're saying there's nothing else in between the top of
the guest's RAM and the BAR base? In that case it's nothing to
worry about at all, as a guest will always have less RAM than
the host's CPU can address.
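
The arithmetic, as a quick sketch. The function names and the example
RAM top, BAR base, and BAR size are hypothetical, and the guest's
memory holes are ignored for simplicity.

```python
GIB = 1 << 30
MIB = 1 << 20

def addressable_gib(phys_bits):
    """Size of the physical address space in GiB for a given address width."""
    return (1 << phys_bits) // GIB

def bar_fits(ram_top, bar_base, bar_size, phys_bits):
    """True if the BAR lies above guest RAM and below the CPU's address limit."""
    return ram_top <= bar_base and bar_base + bar_size <= (1 << phys_bits)

print(addressable_gib(36))  # 64

# Hypothetical layout: guest RAM ends at 8 GiB, 256 MiB BAR.
print(bar_fits(8 * GIB, 32 * GIB, 256 * MIB, 36))  # True: below the 64 GiB limit
print(bar_fits(8 * GIB, 64 * GIB, 256 * MIB, 36))  # False: beyond 36 address bits
```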

> later,
> Peter.

-- 
[SorAlx]  ridin' VN2000 Classic LT

