Patchwork Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2

Submitter Leo Yan
Date March 16, 2019, 4:56 a.m.
Message ID <20190316045632.GA5330@leoy-ThinkPad-X240s>
Permalink /patch/750061/
State New

Comments

Leo Yan - March 16, 2019, 4:56 a.m.
Hi Robin,

On Fri, Mar 15, 2019 at 12:54:10PM +0000, Robin Murphy wrote:
> Hi Leo,
> 
> Sorry for the delay - I'm on holiday this week, but since I've made the
> mistake of glancing at my inbox I should probably save you from wasting any
> more time...

Sorry for disturbing you on holiday, and I appreciate your help.  There's
no rush to reply.

> On 2019-03-15 11:03 am, Auger Eric wrote:
> > Hi Leo,
> > 
> > + Jean-Philippe
> > 
> > On 3/15/19 10:37 AM, Leo Yan wrote:
> > > Hi Eric, Robin,
> > > 
> > > On Wed, Mar 13, 2019 at 11:24:25AM +0100, Auger Eric wrote:
> > > 
> > > [...]
> > > 
> > > > > If the NIC supports MSIs they logically are used. This can be easily
> > > > > checked on host by issuing "cat /proc/interrupts | grep vfio". Can you
> > > > > check whether the guest received any interrupt? I remember that Robin
> > > > > said in the past that on Juno, the MSI doorbell was in the PCI host
> > > > > bridge window and possibly transactions towards the doorbell could not
> > > > > reach it since considered as peer to peer.
> > > > 
> > > > I found back Robin's explanation. It was not related to MSI IOVA being
> > > > within the PCI host bridge window but RAM GPA colliding with host PCI
> > > > config space?
> > > > 
> > > > "MSI doorbells integral to PCIe root complexes (and thus untranslatable)
> > > > typically have a programmable address, so could be anywhere. In the more
> > > > general category of "special hardware addresses", QEMU's default ARM
> > > > guest memory map puts RAM starting at 0x40000000; on the ARM Juno
> > > > platform, that happens to be where PCI config space starts; as Juno's
> > > > PCIe doesn't support ACS, peer-to-peer or anything clever, if you assign
> > > > the PCI bus to a guest (all of it, given the lack of ACS), the root
> > > > complex just sees the guest's attempts to DMA to "memory" as the device
> > > > attempting to access config space and aborts them."
> > > 
> > > Below is some follow-up investigation on my side:
> > > 
> > > Firstly, I must admit that I don't fully understand the paragraph
> > > above; so based on the description I am wondering if I can use INTx
> > > mode and, with luck, avoid this hardware pitfall.
> > 
> > The problem above is that during the assignment process, the virtualizer
> > maps the whole guest RAM through the IOMMU (+ the MSI doorbell on ARM) to
> > allow the device, programmed with GPAs, to access the whole guest RAM.
> > Unfortunately if the device emits a DMA request with 0x40000000 IOVA
> > address, this IOVA is interpreted by the Juno RC as a transaction
> > towards the PCIe config space. So this DMA request will not go beyond
> > the RC, will never reach the IOMMU and will never reach the guest RAM.
> > So globally the device is not able to reach part of the guest RAM.
> > That's how I interpret the above statement. Then I don't know the
> > details of the collision, I don't have access to this HW. I don't know
> > either if this problem still exists on the r2 HW.

Thanks a lot for rephrasing, Eric :)

> The short answer is that if you want PCI passthrough to work on Juno, the
> guest memory map has to look like a Juno.
> 
> The PCIe root complex uses an internal lookup table to generate appropriate
> AXI attributes for outgoing PCIe transactions; unfortunately this has no
> notion of 'default' attributes, so addresses *must* match one of the
> programmed windows in order to be valid. From memory, EDK2 sets up a 2GB
> window covering the lower DRAM bank, an 8GB window covering the upper DRAM
> bank, and a 1MB (or thereabouts) window covering the GICv2m region with
> Device attributes.

I checked the kernel's memblock info; it gives the result below:

root@debian:~# cat /sys/kernel/debug/memblock/memory
   0: 0x0000000080000000..0x00000000feffffff
   1: 0x0000000880000000..0x00000009ffffffff

So I think the lower 2GB DRAM window is: [0x8000_0000..0xfeff_ffff]
and the high DRAM window is [0x8_8000_0000..0x9_ffff_ffff].

BTW, I am now using U-Boot rather than UEFI, so I'm not sure whether U-Boot
has programmed the memory windows for PCIe.  Could you point me to the
registers that UEFI sets, so that I can check the corresponding
configuration in U-Boot?

> Any PCIe transactions to addresses not within one of
> those windows will be aborted by the RC without ever going out to the AXI
> side where the SMMU lies (and I think anything matching the config space or
> I/O space windows or a region claimed by a BAR will be aborted even earlier
> as a peer-to-peer attempt regardless of the AXI Translation Table setup).
> 
> You could potentially modify the firmware to change the window
> configuration, but the alignment restrictions make it awkward. I've only
> ever tested passthrough on Juno using kvmtool, which IIRC already has guest
> RAM in an appropriate place (and is trivially easy to hack if not) - I don't
> remember if I ever actually tried guest MSI with that.

I made several attempts with kvmtool to tweak the memory regions, but had
no luck.  Since the host uses [0x8000_0000..0xfeff_ffff] as the first
valid memory window for PCIe, I tried to move all memory/IO regions into
this window with the changes below, but that did not work either:

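Anyway, I very much appreciate the suggestions; they are sufficient for me
to dig further into memory-related information (e.g. PCIe configuration,
IOMMU, etc.) and I will keep you posted if I make any progress.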

Thanks,
Leo Yan
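
To make the window rule from the quoted explanation concrete, here is a
minimal sketch in C of the check the root complex effectively applies to a
device's DMA address.  The window table is an assumption pieced together
from this thread (a 2GB window and an 8GB window, "from memory", placed at
the DRAM bank bases shown by memblock); the real values live in the
firmware's root complex setup and should be read from there.  The struct,
array and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One inbound window of the root complex's AXI translation table. */
struct rc_window {
	uint64_t base;
	uint64_t size;
};

/*
 * ASSUMED layout: window sizes as recalled in the thread, bases taken
 * from the memblock dump.  The GICv2m window is left out because its
 * address is not given in the thread.
 */
static const struct rc_window juno_windows[] = {
	{ 0x0000000080000000ULL, 2ULL << 30 },	/* lower DRAM bank, 2GB */
	{ 0x0000000880000000ULL, 8ULL << 30 },	/* upper DRAM bank, 8GB */
};

/* True if [iova, iova + len) lies entirely inside a single window. */
static bool dma_range_reaches_smmu(uint64_t iova, uint64_t len)
{
	size_t n = sizeof(juno_windows) / sizeof(juno_windows[0]);

	for (size_t i = 0; i < n; i++) {
		const struct rc_window *w = &juno_windows[i];

		/* Written to avoid overflow in iova + len. */
		if (len != 0 && iova >= w->base && len <= w->size &&
		    iova - w->base <= w->size - len)
			return true;
	}
	return false;
}
```

Under these assumptions, a page at 0x40000000 (QEMU's default guest RAM
base, as mentioned above) is rejected, while up to 2GB of guest RAM starting
at 0x80000000 passes.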
Robin Murphy - March 18, 2019, 12:25 p.m.
On 16/03/2019 04:56, Leo Yan wrote:
> Hi Robin,
> 
> On Fri, Mar 15, 2019 at 12:54:10PM +0000, Robin Murphy wrote:
>> Hi Leo,
>>
>> Sorry for the delay - I'm on holiday this week, but since I've made the
>> mistake of glancing at my inbox I should probably save you from wasting any
>> more time...
> 
> Sorry for disturbing you on holiday, and I appreciate your help.  There's
> no rush to reply.
> 
>> On 2019-03-15 11:03 am, Auger Eric wrote:
>>> Hi Leo,
>>>
>>> + Jean-Philippe
>>>
>>> On 3/15/19 10:37 AM, Leo Yan wrote:
>>>> Hi Eric, Robin,
>>>>
>>>> On Wed, Mar 13, 2019 at 11:24:25AM +0100, Auger Eric wrote:
>>>>
>>>> [...]
>>>>
>>>>>> If the NIC supports MSIs they logically are used. This can be easily
>>>>>> checked on host by issuing "cat /proc/interrupts | grep vfio". Can you
>>>>>> check whether the guest received any interrupt? I remember that Robin
>>>>>> said in the past that on Juno, the MSI doorbell was in the PCI host
>>>>>> bridge window and possibly transactions towards the doorbell could not
>>>>>> reach it since considered as peer to peer.
>>>>>
>>>>> I found back Robin's explanation. It was not related to MSI IOVA being
>>>>> within the PCI host bridge window but RAM GPA colliding with host PCI
>>>>> config space?
>>>>>
>>>>> "MSI doorbells integral to PCIe root complexes (and thus untranslatable)
>>>>> typically have a programmable address, so could be anywhere. In the more
>>>>> general category of "special hardware addresses", QEMU's default ARM
>>>>> guest memory map puts RAM starting at 0x40000000; on the ARM Juno
>>>>> platform, that happens to be where PCI config space starts; as Juno's
>>>>> PCIe doesn't support ACS, peer-to-peer or anything clever, if you assign
>>>>> the PCI bus to a guest (all of it, given the lack of ACS), the root
>>>>> complex just sees the guest's attempts to DMA to "memory" as the device
>>>>> attempting to access config space and aborts them."
>>>>
>>>> Below is some follow-up investigation on my side:
>>>>
>>>> Firstly, I must admit that I don't fully understand the paragraph
>>>> above; so based on the description I am wondering if I can use INTx
>>>> mode and, with luck, avoid this hardware pitfall.
>>>
>>> The problem above is that during the assignment process, the virtualizer
>>> maps the whole guest RAM through the IOMMU (+ the MSI doorbell on ARM) to
>>> allow the device, programmed with GPAs, to access the whole guest RAM.
>>> Unfortunately if the device emits a DMA request with 0x40000000 IOVA
>>> address, this IOVA is interpreted by the Juno RC as a transaction
>>> towards the PCIe config space. So this DMA request will not go beyond
>>> the RC, will never reach the IOMMU and will never reach the guest RAM.
>>> So globally the device is not able to reach part of the guest RAM.
>>> That's how I interpret the above statement. Then I don't know the
>>> details of the collision, I don't have access to this HW. I don't know
>>> either if this problem still exists on the r2 HW.
> 
> Thanks a lot for rephrasing, Eric :)
> 
>> The short answer is that if you want PCI passthrough to work on Juno, the
>> guest memory map has to look like a Juno.
>>
>> The PCIe root complex uses an internal lookup table to generate appropriate
>> AXI attributes for outgoing PCIe transactions; unfortunately this has no
>> notion of 'default' attributes, so addresses *must* match one of the
>> programmed windows in order to be valid. From memory, EDK2 sets up a 2GB
>> window covering the lower DRAM bank, an 8GB window covering the upper DRAM
>> bank, and a 1MB (or thereabouts) window covering the GICv2m region with
>> Device attributes.
> 
> I checked the kernel's memblock info; it gives the result below:
> 
> root@debian:~# cat /sys/kernel/debug/memblock/memory
>     0: 0x0000000080000000..0x00000000feffffff
>     1: 0x0000000880000000..0x00000009ffffffff
> 
> So I think the lower 2GB DRAM window is: [0x8000_0000..0xfeff_ffff]
> and the high DRAM window is [0x8_8000_0000..0x9_ffff_ffff].
> 
> BTW, I am now using U-Boot rather than UEFI, so I'm not sure whether U-Boot
> has programmed the memory windows for PCIe.  Could you point me to the
> registers that UEFI sets, so that I can check the corresponding
> configuration in U-Boot?

U-Boot does the same thing[1] - you can confirm that by whether PCIe 
works at all on the host ;)

>> Any PCIe transactions to addresses not within one of
>> those windows will be aborted by the RC without ever going out to the AXI
>> side where the SMMU lies (and I think anything matching the config space or
>> I/O space windows or a region claimed by a BAR will be aborted even earlier
>> as a peer-to-peer attempt regardless of the AXI Translation Table setup).
>>
>> You could potentially modify the firmware to change the window
>> configuration, but the alignment restrictions make it awkward. I've only
>> ever tested passthrough on Juno using kvmtool, which IIRC already has guest
>> RAM in an appropriate place (and is trivially easy to hack if not) - I don't
>> remember if I ever actually tried guest MSI with that.
> 
> I made several attempts with kvmtool to tweak the memory regions, but had
> no luck.  Since the host uses [0x8000_0000..0xfeff_ffff] as the first
> valid memory window for PCIe, I tried to move all memory/IO regions into
> this window with the changes below, but that did not work either:
> 
> diff --git a/arm/include/arm-common/kvm-arch.h b/arm/include/arm-common/kvm-arch.h
> index b9d486d..43f78b1 100644
> --- a/arm/include/arm-common/kvm-arch.h
> +++ b/arm/include/arm-common/kvm-arch.h
> @@ -7,10 +7,10 @@
> 
>   #include "arm-common/gic.h"
> 
> -#define ARM_IOPORT_AREA                _AC(0x0000000000000000, UL)
> -#define ARM_MMIO_AREA          _AC(0x0000000000010000, UL)
> -#define ARM_AXI_AREA           _AC(0x0000000040000000, UL)
> -#define ARM_MEMORY_AREA                _AC(0x0000000080000000, UL)
> +#define ARM_IOPORT_AREA                _AC(0x0000000080000000, UL)
> +#define ARM_MMIO_AREA          _AC(0x0000000080010000, UL)
> +#define ARM_AXI_AREA           _AC(0x0000000088000000, UL)
> +#define ARM_MEMORY_AREA                _AC(0x0000000090000000, UL)
> 
> Anyway, I very much appreciate the suggestions; they are sufficient for me
> to dig further into memory-related information (e.g. PCIe configuration,
> IOMMU, etc.) and I will keep you posted if I make any progress.

None of those should need to change (all the MMIO emulation stuff is 
irrelevant to PCIe DMA anyway) - provided you don't give the guest more 
than 2GB of RAM, passthrough with legacy INTx ought to work 
out-of-the-box. For MSIs to get through, you'll further need to change 
the host kernel to place its software MSI region[2] within any of the 
host bridge windows as well.

Robin.

[1] 
http://git.denx.de/?p=u-boot.git;a=blob;f=board/armltd/vexpress64/pcie.c;h=0608a5a88b941cdd362e9f231250a981aebab357;hb=HEAD#l95
[2] MSI_IOVA_BASE in drivers/iommu/arm-smmu.c
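
The two conditions in the reply above (guest RAM within the 2GB window,
software MSI region inside a host bridge window) can be sketched in C as
follows.  This is only an illustration under stated assumptions: the
MSI_IOVA_BASE and MSI_IOVA_LENGTH values are what I believe
drivers/iommu/arm-smmu.c defined at the time and should be checked against
the actual tree; the lower window is taken as 2GB at 0x80000000; guest RAM
is assumed to start at kvmtool's ARM_MEMORY_AREA (0x80000000).  The helper
names are hypothetical, and the "must not overlap guest RAM" condition is my
own addition, on the reasoning that guest RAM and the MSI region share one
IOVA space.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMED: believed to match drivers/iommu/arm-smmu.c; verify in-tree. */
#define MSI_IOVA_BASE		0x8000000ULL
#define MSI_IOVA_LENGTH		0x100000ULL

/* ASSUMED: lower host bridge window, 2GB at the lower DRAM bank. */
#define LOWER_WINDOW_BASE	0x80000000ULL
#define LOWER_WINDOW_SIZE	(2ULL << 30)

/* kvmtool's unmodified ARM_MEMORY_AREA, as shown in the diff. */
#define GUEST_RAM_BASE		0x80000000ULL

/* True if [base, base + len) fits inside the lower window. */
static bool in_lower_window(uint64_t base, uint64_t len)
{
	return base >= LOWER_WINDOW_BASE && len <= LOWER_WINDOW_SIZE &&
	       base - LOWER_WINDOW_BASE <= LOWER_WINDOW_SIZE - len;
}

/*
 * True if an MSI region at msi_base is inside the window and does not
 * overlap guest RAM of guest_ram_size bytes.
 */
static bool msi_base_is_usable(uint64_t msi_base, uint64_t guest_ram_size)
{
	uint64_t ram_end = GUEST_RAM_BASE + guest_ram_size;	/* exclusive */
	uint64_t msi_end = msi_base + MSI_IOVA_LENGTH;		/* exclusive */
	bool overlaps_ram = msi_base < ram_end && GUEST_RAM_BASE < msi_end;

	return in_lower_window(msi_base, MSI_IOVA_LENGTH) && !overlaps_ram;
}
```

With these numbers the assumed default base 0x8000000 is not usable, whereas
a base at the top of the window (for example 0xFFF00000) is usable as long
as the guest has at most 2GB minus 1MB of RAM.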
Leo Yan - March 19, 2019, 1:33 a.m.
Hi Robin,

On Mon, Mar 18, 2019 at 12:25:33PM +0000, Robin Murphy wrote:

[...]

> > diff --git a/arm/include/arm-common/kvm-arch.h b/arm/include/arm-common/kvm-arch.h
> > index b9d486d..43f78b1 100644
> > --- a/arm/include/arm-common/kvm-arch.h
> > +++ b/arm/include/arm-common/kvm-arch.h
> > @@ -7,10 +7,10 @@
> > 
> >   #include "arm-common/gic.h"
> > 
> > -#define ARM_IOPORT_AREA                _AC(0x0000000000000000, UL)
> > -#define ARM_MMIO_AREA          _AC(0x0000000000010000, UL)
> > -#define ARM_AXI_AREA           _AC(0x0000000040000000, UL)
> > -#define ARM_MEMORY_AREA                _AC(0x0000000080000000, UL)
> > +#define ARM_IOPORT_AREA                _AC(0x0000000080000000, UL)
> > +#define ARM_MMIO_AREA          _AC(0x0000000080010000, UL)
> > +#define ARM_AXI_AREA           _AC(0x0000000088000000, UL)
> > +#define ARM_MEMORY_AREA                _AC(0x0000000090000000, UL)
> > 
> > Anyway, I very much appreciate the suggestions; they are sufficient for me
> > to dig further into memory-related information (e.g. PCIe configuration,
> > IOMMU, etc.) and I will keep you posted if I make any progress.
> 
> None of those should need to change (all the MMIO emulation stuff is
> irrelevant to PCIe DMA anyway) - provided you don't give the guest more than
> 2GB of RAM, passthrough with legacy INTx ought to work out-of-the-box. For
> MSIs to get through, you'll further need to change the host kernel to place
> its software MSI region[2] within any of the host bridge windows as well.

From dumping the PCI configuration, I can see that after launching the
guest with kvmtool, the host receives the first interrupt (I checked that
vfio_intx_handler() is invoked once) and then PCI_COMMAND_INTX_DISABLE is
set in the PCI command register to disable the interrupt line.  So it is
very likely that the interrupt is not forwarded properly and the guest
never receives it.

Luckily, I found that the flow below does forward interrupts from host to
guest, once I set "sky2.disable_msi=1" on both the host and guest kernel
command lines:

    host                    guest

  INTx mode               INTx mode

So far, it still does not work if I set "sky2.disable_msi=1" only on the
host kernel command line; with this configuration it follows the flow
below, which fails to forward interrupts properly from host to guest:

    host                    guest

  INTx mode               msi enable
                          msi disable
                          Switch back to INTx mode

I am so happy that I can now use pure INTx mode on the Juno board to enable
the NIC, and I successfully pinged my router from the guest OS :)

I will look into the issue in the second scenario; and if I have more
time I will look into MSI mode as well (I confirmed that MSI mode works
in the host OS but fails in the guest OS).

Many thanks to you and Eric for the help!

Thanks,
Leo Yan
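
For anyone repeating the config-space check described above, here is a small
C sketch that reports whether INTx is masked in a raw dump of a device's
configuration space.  The register offset (0x04) and bit value (0x400, bit
10) are the standard PCI command register definitions; the helper names and
the idea of working on a byte buffer are assumptions for illustration.  As
far as I understand, vfio-pci also uses this bit to mask a level-triggered
INTx after forwarding it, until userspace unmasks it, so seeing the bit set
once is not by itself proof that forwarding failed.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND			0x04	/* 16-bit command register */
#define PCI_COMMAND_INTX_DISABLE	0x400	/* bit 10: INTx emulation disable */

/* Config space is little-endian; read a 16-bit field at byte offset off. */
static uint16_t cfg_read16(const uint8_t *cfg, unsigned int off)
{
	return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

/* True if the device is currently prevented from asserting INTx. */
static bool intx_masked(const uint8_t *cfg)
{
	return (cfg_read16(cfg, PCI_COMMAND) & PCI_COMMAND_INTX_DISABLE) != 0;
}
```

The buffer passed in needs to cover at least the first 6 bytes of the
device's configuration space.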

Patch

diff --git a/arm/include/arm-common/kvm-arch.h b/arm/include/arm-common/kvm-arch.h
index b9d486d..43f78b1 100644
--- a/arm/include/arm-common/kvm-arch.h
+++ b/arm/include/arm-common/kvm-arch.h
@@ -7,10 +7,10 @@ 

 #include "arm-common/gic.h"

-#define ARM_IOPORT_AREA                _AC(0x0000000000000000, UL)
-#define ARM_MMIO_AREA          _AC(0x0000000000010000, UL)
-#define ARM_AXI_AREA           _AC(0x0000000040000000, UL)
-#define ARM_MEMORY_AREA                _AC(0x0000000080000000, UL)
+#define ARM_IOPORT_AREA                _AC(0x0000000080000000, UL)
+#define ARM_MMIO_AREA          _AC(0x0000000080010000, UL)
+#define ARM_AXI_AREA           _AC(0x0000000088000000, UL)
+#define ARM_MEMORY_AREA                _AC(0x0000000090000000, UL)
