Patchwork [v10,4/6] VT-d: introduce update_irte to update irte safely

login
register
mail settings
Submitter Chao Gao
Date March 15, 2017, 5:11 a.m.
Message ID <1489554682-6126-5-git-send-email-chao.gao@intel.com>
Download mbox | patch
Permalink /patch/198427/
State New
Headers show

Comments

Chao Gao - March 15, 2017, 5:11 a.m.
We used structure assignment to update irte which was not safe when an
interrupt happened during update. It is better to update IRTE atomically
through cmpxchg16b(). When cmpxchg16b is not supported, two 64-bit write
operations can update IRTE safely when only the high qword or the low qword is
intented to be updated.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
v10:
- rename copy_irte_to_irt to update_irte
- remove copy_from_to_irt
- change commmit message and add some comments to illustrate on which
condition update_irte() is safe.

 xen/drivers/passthrough/vtd/intremap.c | 36 ++++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)
Jan Beulich - March 15, 2017, 4:48 p.m.
>>> On 15.03.17 at 06:11, <chao.gao@intel.com> wrote:
> +static void update_irte(struct iremap_entry *entry,
> +                        const struct iremap_entry *new_ire)
> +{
> +    if ( cpu_has_cx16 )
> +    {
> +        __uint128_t ret;
> +        struct iremap_entry old_ire;
> +
> +        old_ire = *entry;
> +        ret = cmpxchg16b(entry, &old_ire, new_ire);
> +
> +        /*
> +         * In the above, we use cmpxchg16 to atomically update the 128-bit
> +         * IRTE, and the hardware cannot update the IRTE behind us, so
> +         * the return value of cmpxchg16 should be the same as old_ire.
> +         * This ASSERT validate it.
> +         */
> +        ASSERT(ret == old_ire.val);
> +    }
> +    else
> +    {
> +        /*
> +         * The following method to update IRTE is safe on condition that
> +         * only the high qword or the low qword is to be updated.
> +         * If entire IRTE is to be updated, callers should make sure the
> +         * IRTE is not in use.
> +         */
> +        entry->lo = new_ire->lo;
> +        entry->hi = new_ire->hi;

How is this any better than structure assignment? Furthermore
the comment here partially contradicts the commit message. I
guess callers need to be given a way (another function parameter?)
to signal the function whether the unsafe variant is okay to use.
You should then add a suitable BUG_ON() in the else path here.

Jan
Chao Gao - March 15, 2017, 10:39 p.m.
On Wed, Mar 15, 2017 at 10:48:25AM -0600, Jan Beulich wrote:
>>>> On 15.03.17 at 06:11, <chao.gao@intel.com> wrote:
>> +        /*
>> +         * The following method to update IRTE is safe on condition that
>> +         * only the high qword or the low qword is to be updated.
>> +         * If entire IRTE is to be updated, callers should make sure the
>> +         * IRTE is not in use.
>> +         */
>> +        entry->lo = new_ire->lo;
>> +        entry->hi = new_ire->hi;
>
>How is this any better than structure assignment? Furthermore

Indeed, not better. when using structure assignment, the assembly code is
48 8b 06                mov    (%rsi),%rax                    
48 8b 56 08             mov    0x8(%rsi),%rdx                 
48 89 07                mov    %rax,(%rdi)                    
48 89 57 08             mov    %rdx,0x8(%rdi)
Using the code above, the assembly code is
48 8b 06                mov    (%rsi),%rax                  
48 89 07                mov    %rax,(%rdi)                    
48 8b 46 08             mov    0x8(%rsi),%rax                 
48 89 47 08             mov    %rax,0x8(%rdi)

I thought structure assignment maybe ultilize memcpy considering structure
of a big size, so I made this change. I will change this back. Although
that, this patch is trying to make the change safer when cmpxchg16() is
supported. 

>the comment here partially contradicts the commit message. I

Yes.

>guess callers need to be given a way (another function parameter?)
>to signal the function whether the unsafe variant is okay to use.

This means we need to add the new parameter to iommu ops for only
IOAPIC/MSI know the entry they want to change is masked. Is there
any another reasonable and correct solution? How about...

>You should then add a suitable BUG_ON() in the else path here.

just add a BUG_ON() like this
BUG_ON( (entry->hi != new_ire->hi) && (entry->lo != new_ire->lo) );
Adding this BUG_ON() means update_irte() can't be used for initializing
or clearing IRTE which are not bugs.

Thanks
Chao
>
>Jan
>
Jan Beulich - March 16, 2017, 10:29 a.m.
>>> On 15.03.17 at 23:39, <chao.gao@intel.com> wrote:
> On Wed, Mar 15, 2017 at 10:48:25AM -0600, Jan Beulich wrote:
>>>>> On 15.03.17 at 06:11, <chao.gao@intel.com> wrote:
>>> +        /*
>>> +         * The following method to update IRTE is safe on condition that
>>> +         * only the high qword or the low qword is to be updated.
>>> +         * If entire IRTE is to be updated, callers should make sure the
>>> +         * IRTE is not in use.
>>> +         */
>>> +        entry->lo = new_ire->lo;
>>> +        entry->hi = new_ire->hi;
>>
>>How is this any better than structure assignment? Furthermore
> 
> Indeed, not better. when using structure assignment, the assembly code is
> 48 8b 06                mov    (%rsi),%rax                    
> 48 8b 56 08             mov    0x8(%rsi),%rdx                 
> 48 89 07                mov    %rax,(%rdi)                    
> 48 89 57 08             mov    %rdx,0x8(%rdi)
> Using the code above, the assembly code is
> 48 8b 06                mov    (%rsi),%rax                  
> 48 89 07                mov    %rax,(%rdi)                    
> 48 8b 46 08             mov    0x8(%rsi),%rax                 
> 48 89 47 08             mov    %rax,0x8(%rdi)
> 
> I thought structure assignment maybe ultilize memcpy considering structure
> of a big size, so I made this change. I will change this back. Although
> that, this patch is trying to make the change safer when cmpxchg16() is
> supported. 

Perhaps you've really meant to use write_atomic()?

>>the comment here partially contradicts the commit message. I
> 
> Yes.
> 
>>guess callers need to be given a way (another function parameter?)
>>to signal the function whether the unsafe variant is okay to use.
> 
> This means we need to add the new parameter to iommu ops for only
> IOAPIC/MSI know the entry they want to change is masked. Is there
> any another reasonable and correct solution?

Well, users you convert in this patch must be okay to use the
non-atomic variant. The PI user(s) know(s) that cmpxchg16b is
available, so could always request the safe variant. No need for
a new parameter higher up in the call trees afaics.

> How about...
> 
>>You should then add a suitable BUG_ON() in the else path here.
> 
> just add a BUG_ON() like this
> BUG_ON( (entry->hi != new_ire->hi) && (entry->lo != new_ire->lo) );
> Adding this BUG_ON() means update_irte() can't be used for initializing
> or clearing IRTE which are not bugs.

Yes, that's an option too, albeit then I'd suggest (pseudo code)

    if ( high_up_to_date )
        update_low;
    else if ( low_up_to_date )
        update_high;
    else
       BUG();

But you'll want to have the okay from Kevin as the maintainer for
something like this.

Jan
Chao Gao - March 17, 2017, 1:52 a.m.
On Thu, Mar 16, 2017 at 04:29:29AM -0600, Jan Beulich wrote:
>>>> On 15.03.17 at 23:39, <chao.gao@intel.com> wrote:
>> On Wed, Mar 15, 2017 at 10:48:25AM -0600, Jan Beulich wrote:
>>>>>> On 15.03.17 at 06:11, <chao.gao@intel.com> wrote:
>>>> +        /*
>>>> +         * The following method to update IRTE is safe on condition that
>>>> +         * only the high qword or the low qword is to be updated.
>>>> +         * If entire IRTE is to be updated, callers should make sure the
>>>> +         * IRTE is not in use.
>>>> +         */
>>>> +        entry->lo = new_ire->lo;
>>>> +        entry->hi = new_ire->hi;
>>>
>>>How is this any better than structure assignment? Furthermore
>> 
>> Indeed, not better. when using structure assignment, the assembly code is
>> 48 8b 06                mov    (%rsi),%rax                    
>> 48 8b 56 08             mov    0x8(%rsi),%rdx                 
>> 48 89 07                mov    %rax,(%rdi)                    
>> 48 89 57 08             mov    %rdx,0x8(%rdi)
>> Using the code above, the assembly code is
>> 48 8b 06                mov    (%rsi),%rax                  
>> 48 89 07                mov    %rax,(%rdi)                    
>> 48 8b 46 08             mov    0x8(%rsi),%rax                 
>> 48 89 47 08             mov    %rax,0x8(%rdi)
>> 
>> I thought structure assignment maybe ultilize memcpy considering structure
>> of a big size, so I made this change. I will change this back. Although
>> that, this patch is trying to make the change safer when cmpxchg16() is
>> supported. 
>
>Perhaps you've really meant to use write_atomic()?

I don't understand what you mean. But I think write_atomic may be not related
to the problem how to update a 16 byte memory atomically if cmpxchg16() is not
supported.

>
>>>the comment here partially contradicts the commit message. I
>> 
>> Yes.
>> 
>>>guess callers need to be given a way (another function parameter?)
>>>to signal the function whether the unsafe variant is okay to use.
>> 
>> This means we need to add the new parameter to iommu ops for only
>> IOAPIC/MSI know the entry they want to change is masked. Is there
>> any another reasonable and correct solution?
>
>Well, users you convert in this patch must be okay to use the
>non-atomic variant. The PI user(s) know(s) that cmpxchg16b is
>available, so could always request the safe variant. No need for
>a new parameter higher up in the call trees afaics.
>
>> How about...
>> 
>>>You should then add a suitable BUG_ON() in the else path here.
>> 
>> just add a BUG_ON() like this
>> BUG_ON( (entry->hi != new_ire->hi) && (entry->lo != new_ire->lo) );
>> Adding this BUG_ON() means update_irte() can't be used for initializing
>> or clearing IRTE which are not bugs.
>
>Yes, that's an option too, albeit then I'd suggest (pseudo code)
>
>    if ( high_up_to_date )
>        update_low;
>    else if ( low_up_to_date )
>        update_high;
>    else
>       BUG();
>
>But you'll want to have the okay from Kevin as the maintainer for
>something like this.

ok. I will wait for comments of Kevin.

Thank,
Chao
>
>Jan
>
Jan Beulich - March 17, 2017, 9:08 a.m.
>>> On 17.03.17 at 02:52, <chao.gao@intel.com> wrote:
> On Thu, Mar 16, 2017 at 04:29:29AM -0600, Jan Beulich wrote:
>>>>> On 15.03.17 at 23:39, <chao.gao@intel.com> wrote:
>>> On Wed, Mar 15, 2017 at 10:48:25AM -0600, Jan Beulich wrote:
>>>>>>> On 15.03.17 at 06:11, <chao.gao@intel.com> wrote:
>>>>> +        /*
>>>>> +         * The following method to update IRTE is safe on condition that
>>>>> +         * only the high qword or the low qword is to be updated.
>>>>> +         * If entire IRTE is to be updated, callers should make sure the
>>>>> +         * IRTE is not in use.
>>>>> +         */
>>>>> +        entry->lo = new_ire->lo;
>>>>> +        entry->hi = new_ire->hi;
>>>>
>>>>How is this any better than structure assignment? Furthermore
>>> 
>>> Indeed, not better. when using structure assignment, the assembly code is
>>> 48 8b 06                mov    (%rsi),%rax                    
>>> 48 8b 56 08             mov    0x8(%rsi),%rdx                 
>>> 48 89 07                mov    %rax,(%rdi)                    
>>> 48 89 57 08             mov    %rdx,0x8(%rdi)
>>> Using the code above, the assembly code is
>>> 48 8b 06                mov    (%rsi),%rax                  
>>> 48 89 07                mov    %rax,(%rdi)                    
>>> 48 8b 46 08             mov    0x8(%rsi),%rax                 
>>> 48 89 47 08             mov    %rax,0x8(%rdi)
>>> 
>>> I thought structure assignment maybe ultilize memcpy considering structure
>>> of a big size, so I made this change. I will change this back. Although
>>> that, this patch is trying to make the change safer when cmpxchg16() is
>>> supported. 
>>
>>Perhaps you've really meant to use write_atomic()?
> 
> I don't understand what you mean. But I think write_atomic may be not related
> to the problem how to update a 16 byte memory atomically if cmpxchg16() is not
> supported.

Of course not, but you were concerned about memcpy() being
called by the compiler. I.e. I did assume that when doing the
2x8-byte update you would want to have them carried out as
two 8-byte writes, instead of possibly being broken up further
by the compiler.

Jan
Tian, Kevin - March 22, 2017, 6:26 a.m.
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, March 16, 2017 6:29 PM
> 
> >>> On 15.03.17 at 23:39, <chao.gao@intel.com> wrote:
> > On Wed, Mar 15, 2017 at 10:48:25AM -0600, Jan Beulich wrote:
> >>>>> On 15.03.17 at 06:11, <chao.gao@intel.com> wrote:
> >>> +        /*
> >>> +         * The following method to update IRTE is safe on condition that
> >>> +         * only the high qword or the low qword is to be updated.
> >>> +         * If entire IRTE is to be updated, callers should make sure the
> >>> +         * IRTE is not in use.
> >>> +         */
> >>> +        entry->lo = new_ire->lo;
> >>> +        entry->hi = new_ire->hi;
> >>
> >>How is this any better than structure assignment? Furthermore
> >
> > Indeed, not better. when using structure assignment, the assembly code is
> > 48 8b 06                mov    (%rsi),%rax
> > 48 8b 56 08             mov    0x8(%rsi),%rdx
> > 48 89 07                mov    %rax,(%rdi)
> > 48 89 57 08             mov    %rdx,0x8(%rdi)
> > Using the code above, the assembly code is
> > 48 8b 06                mov    (%rsi),%rax
> > 48 89 07                mov    %rax,(%rdi)
> > 48 8b 46 08             mov    0x8(%rsi),%rax
> > 48 89 47 08             mov    %rax,0x8(%rdi)
> >
> > I thought structure assignment maybe ultilize memcpy considering
> > structure of a big size, so I made this change. I will change this
> > back. Although that, this patch is trying to make the change safer
> > when cmpxchg16() is supported.
> 
> Perhaps you've really meant to use write_atomic()?
> 
> >>the comment here partially contradicts the commit message. I
> >
> > Yes.
> >
> >>guess callers need to be given a way (another function parameter?) to
> >>signal the function whether the unsafe variant is okay to use.
> >
> > This means we need to add the new parameter to iommu ops for only
> > IOAPIC/MSI know the entry they want to change is masked. Is there any
> > another reasonable and correct solution?
> 
> Well, users you convert in this patch must be okay to use the non-atomic
> variant. The PI user(s) know(s) that cmpxchg16b is available, so could always
> request the safe variant. No need for a new parameter higher up in the call
> trees afaics.
> 
> > How about...
> >
> >>You should then add a suitable BUG_ON() in the else path here.
> >
> > just add a BUG_ON() like this
> > BUG_ON( (entry->hi != new_ire->hi) && (entry->lo != new_ire->lo) );
> > Adding this BUG_ON() means update_irte() can't be used for
> > initializing or clearing IRTE which are not bugs.
> 
> Yes, that's an option too, albeit then I'd suggest (pseudo code)
> 
>     if ( high_up_to_date )
>         update_low;
>     else if ( low_up_to_date )
>         update_high;
>     else
>        BUG();
> 
> But you'll want to have the okay from Kevin as the maintainer for something
> like this.
> 
> Jan


We'll have an internal discussion about exact requirement of update
atomicity here and then get back with a proposal.

Thanks
Kevin
Tian, Kevin - March 24, 2017, 8:44 a.m.
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, March 16, 2017 6:29 PM
> 
> >>> On 15.03.17 at 23:39, <chao.gao@intel.com> wrote:
> > On Wed, Mar 15, 2017 at 10:48:25AM -0600, Jan Beulich wrote:
> >>>>> On 15.03.17 at 06:11, <chao.gao@intel.com> wrote:
> >>> +        /*
> >>> +         * The following method to update IRTE is safe on condition that
> >>> +         * only the high qword or the low qword is to be updated.
> >>> +         * If entire IRTE is to be updated, callers should make sure the
> >>> +         * IRTE is not in use.
> >>> +         */
> >>> +        entry->lo = new_ire->lo;
> >>> +        entry->hi = new_ire->hi;
> >>
> >>How is this any better than structure assignment? Furthermore
> >
> > Indeed, not better. when using structure assignment, the assembly code is
> > 48 8b 06                mov    (%rsi),%rax
> > 48 8b 56 08             mov    0x8(%rsi),%rdx
> > 48 89 07                mov    %rax,(%rdi)
> > 48 89 57 08             mov    %rdx,0x8(%rdi)
> > Using the code above, the assembly code is
> > 48 8b 06                mov    (%rsi),%rax
> > 48 89 07                mov    %rax,(%rdi)
> > 48 8b 46 08             mov    0x8(%rsi),%rax
> > 48 89 47 08             mov    %rax,0x8(%rdi)
> >
> > I thought structure assignment maybe ultilize memcpy considering
> > structure of a big size, so I made this change. I will change this
> > back. Although that, this patch is trying to make the change safer
> > when cmpxchg16() is supported.
> 
> Perhaps you've really meant to use write_atomic()?
> 
> >>the comment here partially contradicts the commit message. I
> >
> > Yes.
> >
> >>guess callers need to be given a way (another function parameter?) to
> >>signal the function whether the unsafe variant is okay to use.
> >
> > This means we need to add the new parameter to iommu ops for only
> > IOAPIC/MSI know the entry they want to change is masked. Is there any
> > another reasonable and correct solution?
> 
> Well, users you convert in this patch must be okay to use the non-atomic
> variant. The PI user(s) know(s) that cmpxchg16b is available, so could always
> request the safe variant. No need for a new parameter higher up in the call
> trees afaics.
> 
> > How about...
> >
> >>You should then add a suitable BUG_ON() in the else path here.
> >
> > just add a BUG_ON() like this
> > BUG_ON( (entry->hi != new_ire->hi) && (entry->lo != new_ire->lo) );
> > Adding this BUG_ON() means update_irte() can't be used for
> > initializing or clearing IRTE which are not bugs.
> 
> Yes, that's an option too, albeit then I'd suggest (pseudo code)
> 
>     if ( high_up_to_date )
>         update_low;
>     else if ( low_up_to_date )
>         update_high;
>     else
>        BUG();
> 
> But you'll want to have the okay from Kevin as the maintainer for something
> like this.
> 

Just had a discussion with Chao. Here is a summary:

- PI implies availability of cmpxchg16b, so can always request atomicity

- The problem is really about remap mode. cmpxchg16b might be not
available on those platforms:

	* for init/clear operations, atomicity is not a requirement. We
can have high/low parts updated separately;

	* for reprogram operations (e.g. irq balance), atomicity
is a requirement. But in reality only lower 64bit should be touched.

Then:

- for remapping IRTE
	* do update_irte(high) and update_irte(low) separately
when IRTE is newly allocated
	* Then only do update_irte(low) in case an IRTE is found

- for posting IRTE
	* always do update_irte(high&low)

Then we can use the pseudo code you suggested.

Thanks
Kevin

Patch

diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index 7d4c7e1..342b45f 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -169,6 +169,38 @@  bool_t __init iommu_supports_eim(void)
     return 1;
 }
 
+static void update_irte(struct iremap_entry *entry,
+                        const struct iremap_entry *new_ire)
+{
+    if ( cpu_has_cx16 )
+    {
+        __uint128_t ret;
+        struct iremap_entry old_ire;
+
+        old_ire = *entry;
+        ret = cmpxchg16b(entry, &old_ire, new_ire);
+
+        /*
+         * In the above, we use cmpxchg16 to atomically update the 128-bit
+         * IRTE, and the hardware cannot update the IRTE behind us, so
+         * the return value of cmpxchg16 should be the same as old_ire.
+         * This ASSERT validate it.
+         */
+        ASSERT(ret == old_ire.val);
+    }
+    else
+    {
+        /*
+         * The following method to update IRTE is safe on condition that
+         * only the high qword or the low qword is to be updated.
+         * If entire IRTE is to be updated, callers should make sure the
+         * IRTE is not in use.
+         */
+        entry->lo = new_ire->lo;
+        entry->hi = new_ire->hi;
+    }
+}
+
 /* Mark specified intr remap entry as free */
 static void free_remap_entry(struct iommu *iommu, int index)
 {
@@ -353,7 +385,7 @@  static int ioapic_rte_to_remap_entry(struct iommu *iommu,
         remap_rte->format = 1;    /* indicate remap format */
     }
 
-    *iremap_entry = new_ire;
+    update_irte(iremap_entry, &new_ire);
     iommu_flush_cache_entry(iremap_entry, sizeof(*iremap_entry));
     iommu_flush_iec_index(iommu, 0, index);
 
@@ -640,7 +672,7 @@  static int msi_msg_to_remap_entry(
     remap_rte->address_hi = 0;
     remap_rte->data = index - i;
 
-    *iremap_entry = new_ire;
+    update_irte(iremap_entry, &new_ire);
     iommu_flush_cache_entry(iremap_entry, sizeof(*iremap_entry));
     iommu_flush_iec_index(iommu, 0, index);