Patchwork [RFC] hvm/vpt: fix inconsistent views of vIOAPIC in vmx_intr_assist()

Submitter Chao Gao
Date April 7, 2017, 3:24 a.m.
Message ID <1491535456-12682-1-git-send-email-chao.gao@intel.com>
Permalink /patch/218633/
State New

Comments

Chao Gao - April 7, 2017, 3:24 a.m.
When injecting a periodic timer interrupt in vmx_intr_assist(), multiple read
operations are performed during one event delivery, which can lead to
inconsistent views of the vIOAPIC. For example, if a periodic timer interrupt
comes from the PIT, the corresponding RTE is accessed in pt_update_irq() when
setting the corresponding bit in the vIRR. When that function returns, the RTE
is accessed again to get the vector that was set in the vIRR. Between the two
accesses, the content of the RTE may have been changed by another vCPU, since
no protection mechanism is in use. This can trigger the assertion failure in
vmx_intr_assist(). Also, after a periodic timer interrupt has been injected
successfully, pt_irq_posted() decreases the count (pending_intr_nr); but if
the corresponding RTE has been changed in the meantime, pt_irq_posted() will
not decrease the count, resulting in one extra timer interrupt.

More discussion and history can be found in:
1. https://lists.xenproject.org/archives/html/xen-devel/2017-03/msg00906.html
2. https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg01019.html

This patch simply uses the irq_lock to avoid these inconsistent views of the
vIOAPIC. To avoid deadlock, locked variants of several functions are provided.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 xen/arch/x86/hvm/irq.c      | 22 ++++++++++++++++------
 xen/arch/x86/hvm/svm/intr.c | 21 +++++++++++++++------
 xen/arch/x86/hvm/vmx/intr.c |  9 +++++++++
 xen/arch/x86/hvm/vpic.c     | 20 ++++++++++++++------
 xen/arch/x86/hvm/vpt.c      |  4 ++--
 xen/include/xen/hvm/irq.h   |  4 ++++
 6 files changed, 60 insertions(+), 20 deletions(-)
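The locked/unlocked function split the patch introduces can be sketched in isolation. This is a minimal model, not Xen code: the `irq_state` structure, the boolean stand-in for the spinlock, and the function names are all illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the per-domain IRQ state guarded by irq_lock. */
struct irq_state {
    bool irq_lock;               /* models the (non-recursive) spinlock */
    bool asserted;
    unsigned int gsi_assert_count;
};

static void model_spin_lock(struct irq_state *s)
{
    assert(!s->irq_lock);        /* a recursive acquire would deadlock */
    s->irq_lock = true;
}

static void model_spin_unlock(struct irq_state *s)
{
    assert(s->irq_lock);
    s->irq_lock = false;
}

/*
 * _locked variant: the caller must already hold irq_lock.  This mirrors
 * how the patch's hvm_isa_irq_assert_locked() replaces taking the lock
 * with ASSERT(spin_is_locked(...)).
 */
static void irq_assert_wire_locked(struct irq_state *s)
{
    assert(s->irq_lock);
    if (!s->asserted) {
        s->asserted = true;
        s->gsi_assert_count++;
    }
}

/* Unlocked wrapper for external callers: lock, run the locked body, unlock. */
static void irq_assert_wire(struct irq_state *s)
{
    model_spin_lock(s);
    irq_assert_wire_locked(s);
    model_spin_unlock(s);
}
```

Because the lock is non-recursive, code that already holds irq_lock (as pt_update_irq() does once vmx_intr_assist() takes the lock) must call the `_locked` variant; calling the wrapper there would self-deadlock, which is the deadlock the description refers to.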
Chao Gao - April 11, 2017, 8:01 p.m.
On Wed, Apr 12, 2017 at 02:45:36AM +0000, Xuquan (Quan Xu) wrote:
>On April 07, 2017 11:24 AM, Chao Gao wrote:
>>When injecting periodic timer interrupt in vmx_intr_assist(), multiple read
>>operation is operated during one event delivery and incur to inconsistent
>>views of vIOAPIC. For example, if a periodic timer interrupt is from PIT, when
>>set the corresponding bit in vIRR, the corresponding RTE is accessed in
>>pt_update_irq(). When this function returns, it accesses the RTE again to get
>>the vector it set in vIRR.  Between the two accesses, the content of RTE may
>>have been changed by another CPU for no protection method in use. This case
>>can incur the assertion failure in vmx_intr_assist().  Also after a periodic
>>timer interrupt is injected successfully, pt_irq_posted() will decrease the
>>count (pending_intr_nr). But if the corresponding RTE has been changed,
>>pt_irq_posted() will not decrease the count, resulting one more timer
>>interrupt.
>>
>>More discussion and history can be found in
>>1.https://lists.xenproject.org/archives/html/xen-devel/2017-03/msg00906.ht
>>ml
>>2.https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg01019.ht
>>ml
>>
>>This patch simply uses irq_lock to avoid these inconsistent views of vIOAPIC.
>
>To vLAPIC, is it in a similar situation?

For the vLAPIC, there is no such situation. The irq field in the periodic_time
structure has different meanings in the vLAPIC and vIOAPIC cases. In the vLAPIC
case, it is the interrupt vector. In the vIOAPIC case, it is the IRQ number;
accessing the IOAPIC RTE is needed to get the interrupt vector and deliver the
interrupt.

Thanks
Chao
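The distinction Chao describes, a vector used directly in the vLAPIC case versus an IRQ number that needs an RTE lookup in the vIOAPIC case, can be modeled as follows. The structures and the RTE table here are illustrative simplifications, not Xen's actual definitions.

```c
#include <assert.h>

enum pt_source { PT_SRC_LAPIC, PT_SRC_IOAPIC };

/* Illustrative stand-in for the relevant fields of struct periodic_time. */
struct pt_model {
    enum pt_source source;
    unsigned int irq;   /* vector (vLAPIC case) or GSI number (vIOAPIC case) */
};

/* Illustrative RTE table: maps a GSI to its currently programmed vector. */
static unsigned int rte_vector[24];

/*
 * Resolving the delivery vector is trivial for the vLAPIC case, but the
 * vIOAPIC case must read the RTE -- the access that can race with a
 * guest rewriting the RTE on another vCPU.
 */
static unsigned int pt_resolve_vector(const struct pt_model *pt)
{
    if (pt->source == PT_SRC_LAPIC)
        return pt->irq;             /* irq IS the vector */
    return rte_vector[pt->irq];     /* irq is a GSI; RTE lookup needed */
}
```

Two calls to `pt_resolve_vector()` for a vIOAPIC-routed timer can disagree if the RTE changes between them, which is exactly the inconsistency the patch guards against; the vLAPIC path has no such second lookup.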
Chao Gao - April 11, 2017, 9:22 p.m.
On Wed, Apr 12, 2017 at 02:45:36AM +0000, Xuquan (Quan Xu) wrote:
>On April 07, 2017 11:24 AM, Chao Gao wrote:
>>When injecting periodic timer interrupt in vmx_intr_assist(), multiple read
>>operation is operated during one event delivery and incur to inconsistent
>>views of vIOAPIC. For example, if a periodic timer interrupt is from PIT, when
>>set the corresponding bit in vIRR, the corresponding RTE is accessed in
>>pt_update_irq(). When this function returns, it accesses the RTE again to get
>>the vector it set in vIRR.  Between the two accesses, the content of RTE may
>>have been changed by another CPU for no protection method in use. This case
>>can incur the assertion failure in vmx_intr_assist().  Also after a periodic
>>timer interrupt is injected successfully, pt_irq_posted() will decrease the
>>count (pending_intr_nr). But if the corresponding RTE has been changed,
>>pt_irq_posted() will not decrease the count, resulting one more timer
>>interrupt.
>>
>>More discussion and history can be found in
>>1.https://lists.xenproject.org/archives/html/xen-devel/2017-03/msg00906.ht
>>ml
>>2.https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg01019.ht
>>ml
>>
>>+    /*
>>+     * Avoid the vIOAPIC RTE being changed by another vCPU.
>
>This comment seems inconsistent with the patch description, which said ' by another CPU ', instead of 'by another vCPU'..
>

[Sorry, I missed this in the previous email.]
Yes. It should be 'by another vCPU'.

Thanks
Chao
Xuquan (Euler) - April 12, 2017, 2:45 a.m.
On April 07, 2017 11:24 AM, Chao Gao wrote:
>When injecting periodic timer interrupt in vmx_intr_assist(), multiple read
>operation is operated during one event delivery and incur to inconsistent
>views of vIOAPIC. For example, if a periodic timer interrupt is from PIT, when
>set the corresponding bit in vIRR, the corresponding RTE is accessed in
>pt_update_irq(). When this function returns, it accesses the RTE again to get
>the vector it set in vIRR.  Between the two accesses, the content of RTE may
>have been changed by another CPU for no protection method in use. This case
>can incur the assertion failure in vmx_intr_assist().  Also after a periodic
>timer interrupt is injected successfully, pt_irq_posted() will decrease the
>count (pending_intr_nr). But if the corresponding RTE has been changed,
>pt_irq_posted() will not decrease the count, resulting one more timer
>interrupt.
>
>More discussion and history can be found in
>1.https://lists.xenproject.org/archives/html/xen-devel/2017-03/msg00906.ht
>ml
>2.https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg01019.ht
>ml
>
>This patch simply uses irq_lock to avoid these inconsistent views of vIOAPIC.

To vLAPIC, is it in a similar situation?


>To fix the deadlock, provide locked version of several functions.
>
>Signed-off-by: Chao Gao <chao.gao@intel.com>
>---
> xen/arch/x86/hvm/irq.c      | 22 ++++++++++++++++------
> xen/arch/x86/hvm/svm/intr.c | 21 +++++++++++++++------
>xen/arch/x86/hvm/vmx/intr.c |  9 +++++++++
> xen/arch/x86/hvm/vpic.c     | 20 ++++++++++++++------
> xen/arch/x86/hvm/vpt.c      |  4 ++--
> xen/include/xen/hvm/irq.h   |  4 ++++
> 6 files changed, 60 insertions(+), 20 deletions(-)
>
>diff --git a/xen/arch/x86/hvm/irq.c b/xen/arch/x86/hvm/irq.c index
>8625584..a1af61e 100644
>--- a/xen/arch/x86/hvm/irq.c
>+++ b/xen/arch/x86/hvm/irq.c
>@@ -126,37 +126,47 @@ void hvm_pci_intx_deassert(
>     spin_unlock(&d->arch.hvm_domain.irq_lock);
> }
>
>-void hvm_isa_irq_assert(
>+void hvm_isa_irq_assert_locked(
>     struct domain *d, unsigned int isa_irq)  {
>     struct hvm_irq *hvm_irq = hvm_domain_irq(d);
>     unsigned int gsi = hvm_isa_irq_to_gsi(isa_irq);
>
>     ASSERT(isa_irq <= 15);
>-
>-    spin_lock(&d->arch.hvm_domain.irq_lock);
>+    ASSERT(spin_is_locked(&d->arch.hvm_domain.irq_lock));
>
>     if ( !__test_and_set_bit(isa_irq, &hvm_irq->isa_irq.i) &&
>          (hvm_irq->gsi_assert_count[gsi]++ == 0) )
>         assert_irq(d, gsi, isa_irq);
>+}
>
>+void hvm_isa_irq_assert(
>+    struct domain *d, unsigned int isa_irq) {
>+    spin_lock(&d->arch.hvm_domain.irq_lock);
>+    hvm_isa_irq_assert_locked(d, isa_irq);
>     spin_unlock(&d->arch.hvm_domain.irq_lock);
> }
>
>-void hvm_isa_irq_deassert(
>+void hvm_isa_irq_deassert_locked(
>     struct domain *d, unsigned int isa_irq)  {
>     struct hvm_irq *hvm_irq = hvm_domain_irq(d);
>     unsigned int gsi = hvm_isa_irq_to_gsi(isa_irq);
>
>     ASSERT(isa_irq <= 15);
>-
>-    spin_lock(&d->arch.hvm_domain.irq_lock);
>+    ASSERT(spin_is_locked(&d->arch.hvm_domain.irq_lock));
>
>     if ( __test_and_clear_bit(isa_irq, &hvm_irq->isa_irq.i) &&
>          (--hvm_irq->gsi_assert_count[gsi] == 0) )
>         deassert_irq(d, isa_irq);
>+}
>
>+void hvm_isa_irq_deassert(
>+    struct domain *d, unsigned int isa_irq) {
>+    spin_lock(&d->arch.hvm_domain.irq_lock);
>+    hvm_isa_irq_deassert_locked(d, isa_irq);
>     spin_unlock(&d->arch.hvm_domain.irq_lock);
> }
>
>diff --git a/xen/arch/x86/hvm/svm/intr.c b/xen/arch/x86/hvm/svm/intr.c
>index 8511ff0..d2a318c 100644
>--- a/xen/arch/x86/hvm/svm/intr.c
>+++ b/xen/arch/x86/hvm/svm/intr.c
>@@ -137,18 +137,25 @@ void svm_intr_assist(void)
>     struct hvm_intack intack;
>     enum hvm_intblk intblk;
>
>+    /*
>+     * Avoid the vIOAPIC RTE being changed by another vCPU.
>+     * Otherwise, pt_update_irq() may return a wrong vector which is not
>in
>+     * vIRR and pt_irq_posted() may not recognize a timer interrupt has
>been
>+     * injected.
>+     */
>+    spin_lock(&v->domain->arch.hvm_domain.irq_lock);
>     /* Crank the handle on interrupt state. */
>     pt_update_irq(v);
>
>     do {
>         intack = hvm_vcpu_has_pending_irq(v);
>         if ( likely(intack.source == hvm_intsrc_none) )
>-            return;
>+            goto out;
>
>         intblk = hvm_interrupt_blocked(v, intack);
>         if ( intblk == hvm_intblk_svm_gif ) {
>             ASSERT(nestedhvm_enabled(v->domain));
>-            return;
>+            goto out;
>         }
>
>         /* Interrupts for the nested guest are already @@ -165,13 +172,13
>@@ void svm_intr_assist(void)
>             switch (rc) {
>             case NSVM_INTR_NOTINTERCEPTED:
>                 /* Inject interrupt into 2nd level guest directly. */
>-                break;
>+                goto out;
>             case NSVM_INTR_NOTHANDLED:
>             case NSVM_INTR_FORCEVMEXIT:
>-                return;
>+                goto out;
>             case NSVM_INTR_MASKED:
>                 /* Guest already enabled an interrupt window. */
>-                return;
>+                goto out;
>             default:
>                 panic("%s: nestedsvm_vcpu_interrupt can't handle
>value %#x",
>                     __func__, rc);
>@@ -195,7 +202,7 @@ void svm_intr_assist(void)
>         if ( unlikely(vmcb->eventinj.fields.v) || intblk )
>         {
>             svm_enable_intr_window(v, intack);
>-            return;
>+            goto out;
>         }
>
>         intack = hvm_vcpu_ack_pending_irq(v, intack); @@ -216,6 +223,8
>@@ void svm_intr_assist(void)
>     intack = hvm_vcpu_has_pending_irq(v);
>     if ( unlikely(intack.source != hvm_intsrc_none) )
>         svm_enable_intr_window(v, intack);
>+ out:
>+    spin_unlock(&v->domain->arch.hvm_domain.irq_lock);
> }
>
> /*
>diff --git a/xen/arch/x86/hvm/vmx/intr.c b/xen/arch/x86/hvm/vmx/intr.c
>index 1eb9f38..8f3d1a7 100644
>--- a/xen/arch/x86/hvm/vmx/intr.c
>+++ b/xen/arch/x86/hvm/vmx/intr.c
>@@ -234,6 +234,14 @@ void vmx_intr_assist(void)
>         return;
>     }
>
>+    /*
>+     * Avoid the vIOAPIC RTE being changed by another vCPU.

This comment seems inconsistent with the patch description, which says 'by another CPU' instead of 'by another vCPU'.


-Quan



>+     * Otherwise, pt_update_irq() may return a wrong vector which is not
>in
>+     * vIRR and pt_irq_posted() may not recognize a timer interrupt has
>been
>+     * injected.
>+     */
>+    spin_lock(&v->domain->arch.hvm_domain.irq_lock);
>+
>     /* Crank the handle on interrupt state. */
>     if ( is_hvm_vcpu(v) )
>         pt_vector = pt_update_irq(v);
>@@ -396,6 +404,7 @@ void vmx_intr_assist(void)
>     }
>
>  out:
>+    spin_unlock(&v->domain->arch.hvm_domain.irq_lock);
>     if ( !nestedhvm_vcpu_in_guestmode(v) &&
>          !cpu_has_vmx_virtual_intr_delivery &&
>          cpu_has_vmx_tpr_shadow )
>diff --git a/xen/arch/x86/hvm/vpic.c b/xen/arch/x86/hvm/vpic.c index
>e160bbd..8d02e52 100644
>--- a/xen/arch/x86/hvm/vpic.c
>+++ b/xen/arch/x86/hvm/vpic.c
>@@ -153,14 +153,13 @@ static void __vpic_intack(struct hvm_hw_vpic
>*vpic, int irq)
>     vpic_update_int_output(vpic);
> }
>
>-static int vpic_intack(struct hvm_hw_vpic *vpic)
>+static int vpic_intack_locked(struct hvm_hw_vpic *vpic)
> {
>     int irq = -1;
>
>-    vpic_lock(vpic);
>-
>+    ASSERT(vpic_is_locked(vpic));
>     if ( !vpic->int_output )
>-        goto out;
>+        return irq;
>
>     irq = vpic_get_highest_priority_irq(vpic);
>     BUG_ON(irq < 0);
>@@ -175,7 +174,15 @@ static int vpic_intack(struct hvm_hw_vpic *vpic)
>         irq += 8;
>     }
>
>- out:
>+    return irq;
>+}
>+
>+static int vpic_intack(struct hvm_hw_vpic *vpic) {
>+    int irq;
>+
>+    vpic_lock(vpic);
>+    irq = vpic_intack_locked(vpic);
>     vpic_unlock(vpic);
>     return irq;
> }
>@@ -487,13 +494,14 @@ int vpic_ack_pending_irq(struct vcpu *v)
>     struct hvm_hw_vpic *vpic = &v->domain->arch.hvm_domain.vpic[0];
>
>     ASSERT(has_vpic(v->domain));
>+    ASSERT(vpic_is_locked(vpic));
>
>     TRACE_2D(TRC_HVM_EMUL_PIC_PEND_IRQ_CALL,
>vlapic_accept_pic_intr(v),
>              vpic->int_output);
>     if ( !vlapic_accept_pic_intr(v) || !vpic->int_output )
>         return -1;
>
>-    irq = vpic_intack(vpic);
>+    irq = vpic_intack_locked(vpic);
>     if ( irq == -1 )
>         return -1;
>
>diff --git a/xen/arch/x86/hvm/vpt.c b/xen/arch/x86/hvm/vpt.c index
>e3f2039..d048f26 100644
>--- a/xen/arch/x86/hvm/vpt.c
>+++ b/xen/arch/x86/hvm/vpt.c
>@@ -296,8 +296,8 @@ int pt_update_irq(struct vcpu *v)
>         vlapic_set_irq(vcpu_vlapic(v), irq, 0);
>     else
>     {
>-        hvm_isa_irq_deassert(v->domain, irq);
>-        hvm_isa_irq_assert(v->domain, irq);
>+        hvm_isa_irq_deassert_locked(v->domain, irq);
>+        hvm_isa_irq_assert_locked(v->domain, irq);
>     }
>
>     /*
>diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h index
>f041252..a8f36f8 100644
>--- a/xen/include/xen/hvm/irq.h
>+++ b/xen/include/xen/hvm/irq.h
>@@ -119,8 +119,12 @@ void hvm_pci_intx_deassert(
> /* Modify state of an ISA device's IRQ wire. */  void hvm_isa_irq_assert(
>     struct domain *d, unsigned int isa_irq);
>+void hvm_isa_irq_assert_locked(
>+    struct domain *d, unsigned int isa_irq);
> void hvm_isa_irq_deassert(
>     struct domain *d, unsigned int isa_irq);
>+void hvm_isa_irq_deassert_locked(
>+    struct domain *d, unsigned int isa_irq);
>
> int hvm_set_pci_link_route(struct domain *d, u8 link, u8 isa_irq);
>
>--
>1.8.3.1
Chao Gao - April 21, 2017, 2:42 a.m.
PING...

Please give some comments. I think this bug should be fixed in Xen 4.9.

On Fri, Apr 07, 2017 at 11:24:16AM +0800, Chao Gao wrote:
>When injecting periodic timer interrupt in vmx_intr_assist(), multiple read
>operation is operated during one event delivery and incur to inconsistent
>views of vIOAPIC. For example, if a periodic timer interrupt is from PIT, when
>set the corresponding bit in vIRR, the corresponding RTE is accessed in
>pt_update_irq(). When this function returns, it accesses the RTE again to get
>the vector it set in vIRR.  Between the two accesses, the content of RTE may
>have been changed by another CPU for no protection method in use. This case
>can incur the assertion failure in vmx_intr_assist().  Also after a periodic
>timer interrupt is injected successfully, pt_irq_posted() will decrease the
>count (pending_intr_nr). But if the corresponding RTE has been changed,
>pt_irq_posted() will not decrease the count, resulting one more timer
>interrupt.
>
>More discussion and history can be found in
>1.https://lists.xenproject.org/archives/html/xen-devel/2017-03/msg00906.html
>2.https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg01019.html
>
>This patch simply uses irq_lock to avoid these inconsistent views of vIOAPIC.
>To fix the deadlock, provide locked version of several functions.
>
>Signed-off-by: Chao Gao <chao.gao@intel.com>
>---
> xen/arch/x86/hvm/irq.c      | 22 ++++++++++++++++------
> xen/arch/x86/hvm/svm/intr.c | 21 +++++++++++++++------
> xen/arch/x86/hvm/vmx/intr.c |  9 +++++++++
> xen/arch/x86/hvm/vpic.c     | 20 ++++++++++++++------
> xen/arch/x86/hvm/vpt.c      |  4 ++--
> xen/include/xen/hvm/irq.h   |  4 ++++
> 6 files changed, 60 insertions(+), 20 deletions(-)
>
>diff --git a/xen/arch/x86/hvm/irq.c b/xen/arch/x86/hvm/irq.c
>index 8625584..a1af61e 100644
>--- a/xen/arch/x86/hvm/irq.c
>+++ b/xen/arch/x86/hvm/irq.c
>@@ -126,37 +126,47 @@ void hvm_pci_intx_deassert(
>     spin_unlock(&d->arch.hvm_domain.irq_lock);
> }
> 
>-void hvm_isa_irq_assert(
>+void hvm_isa_irq_assert_locked(
>     struct domain *d, unsigned int isa_irq)
> {
>     struct hvm_irq *hvm_irq = hvm_domain_irq(d);
>     unsigned int gsi = hvm_isa_irq_to_gsi(isa_irq);
> 
>     ASSERT(isa_irq <= 15);
>-
>-    spin_lock(&d->arch.hvm_domain.irq_lock);
>+    ASSERT(spin_is_locked(&d->arch.hvm_domain.irq_lock));
> 
>     if ( !__test_and_set_bit(isa_irq, &hvm_irq->isa_irq.i) &&
>          (hvm_irq->gsi_assert_count[gsi]++ == 0) )
>         assert_irq(d, gsi, isa_irq);
>+}
> 
>+void hvm_isa_irq_assert(
>+    struct domain *d, unsigned int isa_irq)
>+{
>+    spin_lock(&d->arch.hvm_domain.irq_lock);
>+    hvm_isa_irq_assert_locked(d, isa_irq);
>     spin_unlock(&d->arch.hvm_domain.irq_lock);
> }
> 
>-void hvm_isa_irq_deassert(
>+void hvm_isa_irq_deassert_locked(
>     struct domain *d, unsigned int isa_irq)
> {
>     struct hvm_irq *hvm_irq = hvm_domain_irq(d);
>     unsigned int gsi = hvm_isa_irq_to_gsi(isa_irq);
> 
>     ASSERT(isa_irq <= 15);
>-
>-    spin_lock(&d->arch.hvm_domain.irq_lock);
>+    ASSERT(spin_is_locked(&d->arch.hvm_domain.irq_lock));
> 
>     if ( __test_and_clear_bit(isa_irq, &hvm_irq->isa_irq.i) &&
>          (--hvm_irq->gsi_assert_count[gsi] == 0) )
>         deassert_irq(d, isa_irq);
>+}
> 
>+void hvm_isa_irq_deassert(
>+    struct domain *d, unsigned int isa_irq)
>+{
>+    spin_lock(&d->arch.hvm_domain.irq_lock);
>+    hvm_isa_irq_deassert_locked(d, isa_irq);
>     spin_unlock(&d->arch.hvm_domain.irq_lock);
> }
> 
>diff --git a/xen/arch/x86/hvm/svm/intr.c b/xen/arch/x86/hvm/svm/intr.c
>index 8511ff0..d2a318c 100644
>--- a/xen/arch/x86/hvm/svm/intr.c
>+++ b/xen/arch/x86/hvm/svm/intr.c
>@@ -137,18 +137,25 @@ void svm_intr_assist(void)
>     struct hvm_intack intack;
>     enum hvm_intblk intblk;
> 
>+    /*
>+     * Avoid the vIOAPIC RTE being changed by another vCPU.
>+     * Otherwise, pt_update_irq() may return a wrong vector which is not in
>+     * vIRR and pt_irq_posted() may not recognize a timer interrupt has been
>+     * injected.
>+     */
>+    spin_lock(&v->domain->arch.hvm_domain.irq_lock);
>     /* Crank the handle on interrupt state. */
>     pt_update_irq(v);
> 
>     do {
>         intack = hvm_vcpu_has_pending_irq(v);
>         if ( likely(intack.source == hvm_intsrc_none) )
>-            return;
>+            goto out;
> 
>         intblk = hvm_interrupt_blocked(v, intack);
>         if ( intblk == hvm_intblk_svm_gif ) {
>             ASSERT(nestedhvm_enabled(v->domain));
>-            return;
>+            goto out;
>         }
> 
>         /* Interrupts for the nested guest are already
>@@ -165,13 +172,13 @@ void svm_intr_assist(void)
>             switch (rc) {
>             case NSVM_INTR_NOTINTERCEPTED:
>                 /* Inject interrupt into 2nd level guest directly. */
>-                break;	
>+                goto out;
>             case NSVM_INTR_NOTHANDLED:
>             case NSVM_INTR_FORCEVMEXIT:
>-                return;
>+                goto out;
>             case NSVM_INTR_MASKED:
>                 /* Guest already enabled an interrupt window. */
>-                return;
>+                goto out;
>             default:
>                 panic("%s: nestedsvm_vcpu_interrupt can't handle value %#x",
>                     __func__, rc);
>@@ -195,7 +202,7 @@ void svm_intr_assist(void)
>         if ( unlikely(vmcb->eventinj.fields.v) || intblk )
>         {
>             svm_enable_intr_window(v, intack);
>-            return;
>+            goto out;
>         }
> 
>         intack = hvm_vcpu_ack_pending_irq(v, intack);
>@@ -216,6 +223,8 @@ void svm_intr_assist(void)
>     intack = hvm_vcpu_has_pending_irq(v);
>     if ( unlikely(intack.source != hvm_intsrc_none) )
>         svm_enable_intr_window(v, intack);
>+ out:
>+    spin_unlock(&v->domain->arch.hvm_domain.irq_lock);
> }
> 
> /*
>diff --git a/xen/arch/x86/hvm/vmx/intr.c b/xen/arch/x86/hvm/vmx/intr.c
>index 1eb9f38..8f3d1a7 100644
>--- a/xen/arch/x86/hvm/vmx/intr.c
>+++ b/xen/arch/x86/hvm/vmx/intr.c
>@@ -234,6 +234,14 @@ void vmx_intr_assist(void)
>         return;
>     }
> 
>+    /*
>+     * Avoid the vIOAPIC RTE being changed by another vCPU.
>+     * Otherwise, pt_update_irq() may return a wrong vector which is not in
>+     * vIRR and pt_irq_posted() may not recognize a timer interrupt has been
>+     * injected.
>+     */
>+    spin_lock(&v->domain->arch.hvm_domain.irq_lock);
>+
>     /* Crank the handle on interrupt state. */
>     if ( is_hvm_vcpu(v) )
>         pt_vector = pt_update_irq(v);
>@@ -396,6 +404,7 @@ void vmx_intr_assist(void)
>     }
> 
>  out:
>+    spin_unlock(&v->domain->arch.hvm_domain.irq_lock);
>     if ( !nestedhvm_vcpu_in_guestmode(v) &&
>          !cpu_has_vmx_virtual_intr_delivery &&
>          cpu_has_vmx_tpr_shadow )
>diff --git a/xen/arch/x86/hvm/vpic.c b/xen/arch/x86/hvm/vpic.c
>index e160bbd..8d02e52 100644
>--- a/xen/arch/x86/hvm/vpic.c
>+++ b/xen/arch/x86/hvm/vpic.c
>@@ -153,14 +153,13 @@ static void __vpic_intack(struct hvm_hw_vpic *vpic, int irq)
>     vpic_update_int_output(vpic);
> }
> 
>-static int vpic_intack(struct hvm_hw_vpic *vpic)
>+static int vpic_intack_locked(struct hvm_hw_vpic *vpic)
> {
>     int irq = -1;
> 
>-    vpic_lock(vpic);
>-
>+    ASSERT(vpic_is_locked(vpic));
>     if ( !vpic->int_output )
>-        goto out;
>+        return irq;
> 
>     irq = vpic_get_highest_priority_irq(vpic);
>     BUG_ON(irq < 0);
>@@ -175,7 +174,15 @@ static int vpic_intack(struct hvm_hw_vpic *vpic)
>         irq += 8;
>     }
> 
>- out:
>+    return irq;
>+}
>+
>+static int vpic_intack(struct hvm_hw_vpic *vpic)
>+{
>+    int irq;
>+
>+    vpic_lock(vpic);
>+    irq = vpic_intack_locked(vpic);
>     vpic_unlock(vpic);
>     return irq;
> }
>@@ -487,13 +494,14 @@ int vpic_ack_pending_irq(struct vcpu *v)
>     struct hvm_hw_vpic *vpic = &v->domain->arch.hvm_domain.vpic[0];
> 
>     ASSERT(has_vpic(v->domain));
>+    ASSERT(vpic_is_locked(vpic));
> 
>     TRACE_2D(TRC_HVM_EMUL_PIC_PEND_IRQ_CALL, vlapic_accept_pic_intr(v),
>              vpic->int_output);
>     if ( !vlapic_accept_pic_intr(v) || !vpic->int_output )
>         return -1;
> 
>-    irq = vpic_intack(vpic);
>+    irq = vpic_intack_locked(vpic);
>     if ( irq == -1 )
>         return -1;
> 
>diff --git a/xen/arch/x86/hvm/vpt.c b/xen/arch/x86/hvm/vpt.c
>index e3f2039..d048f26 100644
>--- a/xen/arch/x86/hvm/vpt.c
>+++ b/xen/arch/x86/hvm/vpt.c
>@@ -296,8 +296,8 @@ int pt_update_irq(struct vcpu *v)
>         vlapic_set_irq(vcpu_vlapic(v), irq, 0);
>     else
>     {
>-        hvm_isa_irq_deassert(v->domain, irq);
>-        hvm_isa_irq_assert(v->domain, irq);
>+        hvm_isa_irq_deassert_locked(v->domain, irq);
>+        hvm_isa_irq_assert_locked(v->domain, irq);
>     }
> 
>     /*
>diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
>index f041252..a8f36f8 100644
>--- a/xen/include/xen/hvm/irq.h
>+++ b/xen/include/xen/hvm/irq.h
>@@ -119,8 +119,12 @@ void hvm_pci_intx_deassert(
> /* Modify state of an ISA device's IRQ wire. */
> void hvm_isa_irq_assert(
>     struct domain *d, unsigned int isa_irq);
>+void hvm_isa_irq_assert_locked(
>+    struct domain *d, unsigned int isa_irq);
> void hvm_isa_irq_deassert(
>     struct domain *d, unsigned int isa_irq);
>+void hvm_isa_irq_deassert_locked(
>+    struct domain *d, unsigned int isa_irq);
> 
> int hvm_set_pci_link_route(struct domain *d, u8 link, u8 isa_irq);
> 
>-- 
>1.8.3.1
>
Chao Gao - April 21, 2017, 4:22 a.m.
Kevin, thanks for your reply.

On Fri, Apr 21, 2017 at 06:56:39PM +0800, Tian, Kevin wrote:
>> From: Gao, Chao
>> Sent: Friday, April 7, 2017 11:24 AM
>> 
>> When injecting periodic timer interrupt in vmx_intr_assist(), multiple read
>> operation is operated during one event delivery and incur to inconsistent
>
>operation is operated -> operations are done
>

will fix.

>> @@ -487,13 +494,14 @@ int vpic_ack_pending_irq(struct vcpu *v)
>>      struct hvm_hw_vpic *vpic = &v->domain->arch.hvm_domain.vpic[0];
>> 
>>      ASSERT(has_vpic(v->domain));
>> +    ASSERT(vpic_is_locked(vpic));
>> 
>>      TRACE_2D(TRC_HVM_EMUL_PIC_PEND_IRQ_CALL,
>> vlapic_accept_pic_intr(v),
>>               vpic->int_output);
>>      if ( !vlapic_accept_pic_intr(v) || !vpic->int_output )
>>          return -1;
>> 
>> -    irq = vpic_intack(vpic);
>> +    irq = vpic_intack_locked(vpic);
>>      if ( irq == -1 )
>>          return -1;
>> 
>
>hvm_vcpu_ack_pending_irq is also invoked in nvmx_intr_intercept,
>where you also need surround it with spin_lock/unlock otherwise
>above ASSERT will be triggered.
>

nvmx_intr_intercept() is currently only called from a region protected by the
irq_lock, so I think the ASSERT won't be triggered. But I also need to add an
ASSERT in nvmx_intr_intercept().

Thanks
Chao
Chao Gao - April 21, 2017, 7:11 a.m.
On Fri, Apr 21, 2017 at 10:00:16PM +0800, Tian, Kevin wrote:
>btw I'm thinking whether adding lock in vmx_intr_assist is
>necessary... Can you think whether below option may
>work?
>
>Do lock protection within pt_update_irq. Make sure vector
>not changed between setting vIRR and finding pt_irq_vector.
>Within the lock also saves the vector to earliest_pt.
>
>Then vmx_intr_assist will get the exact vector programmed
>to vIRR from pt_irq_vector. Then in pt_intr_post, compare
>saved vector to intack.vector, instead of re-searching
>pt_irq_vector again.
>
>This way even RTE is changed after pt_update_irq it won't
>impact the whole flow here, since new vector hasn't got
>chance to hit vIRR. then we don't need adding lock in 
>vmx_intr_assist at all.

I think it will work and is better than this patch.

Thanks
Chao
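Kevin's alternative, resolving the vector once under the lock in pt_update_irq() and having pt_intr_post() compare the cached vector instead of re-reading the RTE, could be sketched as below. The names (`pt_update_irq_model`, `pt_intr_post_model`, `cached_vector`) and the RTE table are hypothetical; this models the proposal, not actual Xen code.

```c
#include <assert.h>

/* Illustrative vIOAPIC RTE table: GSI -> programmed vector. */
static unsigned int rte_vector[24];

struct pt_state {
    unsigned int gsi;
    unsigned int cached_vector;   /* vector saved at injection time */
    unsigned int pending_intr_nr;
};

/*
 * Injection path: read the RTE once (in Xen, this read would be done
 * under the lock) and remember the vector actually set in the vIRR.
 */
static unsigned int pt_update_irq_model(struct pt_state *pt)
{
    pt->cached_vector = rte_vector[pt->gsi];  /* single RTE read */
    pt->pending_intr_nr++;
    return pt->cached_vector;
}

/*
 * Post-injection path: compare against the cached vector rather than
 * re-searching the RTE, so a concurrent RTE rewrite cannot cause a
 * missed decrement of pending_intr_nr.
 */
static void pt_intr_post_model(struct pt_state *pt, unsigned int intack_vector)
{
    if (intack_vector == pt->cached_vector && pt->pending_intr_nr > 0)
        pt->pending_intr_nr--;
}
```

Under this scheme an RTE change after injection is harmless: the new vector has not reached the vIRR yet, and the comparison uses the saved vector, so no lock is needed across the whole of vmx_intr_assist().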
Jan Beulich - April 21, 2017, 9:54 a.m.
>>> On 21.04.17 at 04:42, <chao.gao@intel.com> wrote:
> PING...
> 
> Please give some comments. I think this bug should be fixed in Xen 4.9.

I'm sorry, you've posted this as an RFC, which sits behind 80 other
RFCs which want looking at, which in turn sit behind some 30 non-RFCs
for 4.10. If you intend something for 4.9, you should make this explicit.

Jan
Tian, Kevin - April 21, 2017, 10:56 a.m.
> From: Gao, Chao
> Sent: Friday, April 7, 2017 11:24 AM
> 
> When injecting periodic timer interrupt in vmx_intr_assist(), multiple read
> operation is operated during one event delivery and incur to inconsistent

operation is operated -> operations are done

> views of vIOAPIC. For example, if a periodic timer interrupt is from PIT, when
> set the corresponding bit in vIRR, the corresponding RTE is accessed in
> pt_update_irq(). When this function returns, it accesses the RTE again to get
> the vector it set in vIRR.  Between the two accesses, the content of RTE may
> have been changed by another CPU for no protection method in use. This
> case
> can incur the assertion failure in vmx_intr_assist().  Also after a periodic
> timer interrupt is injected successfully, pt_irq_posted() will decrease the
> count (pending_intr_nr). But if the corresponding RTE has been changed,
> pt_irq_posted() will not decrease the count, resulting one more timer
> interrupt.
> 
> More discussion and history can be found in
> 1.https://lists.xenproject.org/archives/html/xen-devel/2017-
> 03/msg00906.html
> 2.https://lists.xenproject.org/archives/html/xen-devel/2017-
> 01/msg01019.html
> 
> This patch simply uses irq_lock to avoid these inconsistent views of vIOAPIC.
> To fix the deadlock, provide locked version of several functions.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
>  xen/arch/x86/hvm/irq.c      | 22 ++++++++++++++++------
>  xen/arch/x86/hvm/svm/intr.c | 21 +++++++++++++++------
>  xen/arch/x86/hvm/vmx/intr.c |  9 +++++++++
>  xen/arch/x86/hvm/vpic.c     | 20 ++++++++++++++------
>  xen/arch/x86/hvm/vpt.c      |  4 ++--
>  xen/include/xen/hvm/irq.h   |  4 ++++
>  6 files changed, 60 insertions(+), 20 deletions(-)
> 

[...]
> diff --git a/xen/arch/x86/hvm/vpic.c b/xen/arch/x86/hvm/vpic.c
> index e160bbd..8d02e52 100644
> --- a/xen/arch/x86/hvm/vpic.c
> +++ b/xen/arch/x86/hvm/vpic.c
> @@ -153,14 +153,13 @@ static void __vpic_intack(struct hvm_hw_vpic *vpic,
> int irq)
>      vpic_update_int_output(vpic);
>  }
> 
> -static int vpic_intack(struct hvm_hw_vpic *vpic)
> +static int vpic_intack_locked(struct hvm_hw_vpic *vpic)
>  {
>      int irq = -1;
> 
> -    vpic_lock(vpic);
> -
> +    ASSERT(vpic_is_locked(vpic));
>      if ( !vpic->int_output )
> -        goto out;
> +        return irq;
> 
>      irq = vpic_get_highest_priority_irq(vpic);
>      BUG_ON(irq < 0);
> @@ -175,7 +174,15 @@ static int vpic_intack(struct hvm_hw_vpic *vpic)
>          irq += 8;
>      }
> 
> - out:
> +    return irq;
> +}
> +
> +static int vpic_intack(struct hvm_hw_vpic *vpic)
> +{
> +    int irq;
> +
> +    vpic_lock(vpic);
> +    irq = vpic_intack_locked(vpic);
>      vpic_unlock(vpic);
>      return irq;
>  }
> @@ -487,13 +494,14 @@ int vpic_ack_pending_irq(struct vcpu *v)
>      struct hvm_hw_vpic *vpic = &v->domain->arch.hvm_domain.vpic[0];
> 
>      ASSERT(has_vpic(v->domain));
> +    ASSERT(vpic_is_locked(vpic));
> 
>      TRACE_2D(TRC_HVM_EMUL_PIC_PEND_IRQ_CALL,
> vlapic_accept_pic_intr(v),
>               vpic->int_output);
>      if ( !vlapic_accept_pic_intr(v) || !vpic->int_output )
>          return -1;
> 
> -    irq = vpic_intack(vpic);
> +    irq = vpic_intack_locked(vpic);
>      if ( irq == -1 )
>          return -1;
> 

hvm_vcpu_ack_pending_irq is also invoked in nvmx_intr_intercept,
where you also need surround it with spin_lock/unlock otherwise
above ASSERT will be triggered.


Thanks
Kevin
Tian, Kevin - April 21, 2017, 2 p.m.
> From: Gao, Chao
> Sent: Friday, April 21, 2017 12:23 PM
> 
> >> @@ -487,13 +494,14 @@ int vpic_ack_pending_irq(struct vcpu *v)
> >>      struct hvm_hw_vpic *vpic = &v->domain->arch.hvm_domain.vpic[0];
> >>
> >>      ASSERT(has_vpic(v->domain));
> >> +    ASSERT(vpic_is_locked(vpic));
> >>
> >>      TRACE_2D(TRC_HVM_EMUL_PIC_PEND_IRQ_CALL,
> >> vlapic_accept_pic_intr(v),
> >>               vpic->int_output);
> >>      if ( !vlapic_accept_pic_intr(v) || !vpic->int_output )
> >>          return -1;
> >>
> >> -    irq = vpic_intack(vpic);
> >> +    irq = vpic_intack_locked(vpic);
> >>      if ( irq == -1 )
> >>          return -1;
> >>
> >
> >hvm_vcpu_ack_pending_irq is also invoked in nvmx_intr_intercept,
> >where you also need surround it with spin_lock/unlock otherwise
> >above ASSERT will be triggered.
> >
> 
> nvmx_intr_intercept(), currently, is only called from region
> protected by irq_lock. So i think the ASSERT won't be triggered.
> But I also need add an ASSERT in nvmx_intr_intercept().
> 

You're right. I overlooked it.

btw I'm thinking whether adding lock in vmx_intr_assist is
necessary... Can you think whether below option may
work?

Do lock protection within pt_update_irq. Make sure vector
not changed between setting vIRR and finding pt_irq_vector.
Within the lock also saves the vector to earliest_pt.

Then vmx_intr_assist will get the exact vector programmed
to vIRR from pt_irq_vector. Then in pt_intr_post, compare
saved vector to intack.vector, instead of re-searching
pt_irq_vector again.

This way even RTE is changed after pt_update_irq it won't
impact the whole flow here, since new vector hasn't got
chance to hit vIRR. then we don't need adding lock in 
vmx_intr_assist at all.

Thanks
Kevin

Patch

diff --git a/xen/arch/x86/hvm/irq.c b/xen/arch/x86/hvm/irq.c
index 8625584..a1af61e 100644
--- a/xen/arch/x86/hvm/irq.c
+++ b/xen/arch/x86/hvm/irq.c
@@ -126,37 +126,47 @@  void hvm_pci_intx_deassert(
     spin_unlock(&d->arch.hvm_domain.irq_lock);
 }
 
-void hvm_isa_irq_assert(
+void hvm_isa_irq_assert_locked(
     struct domain *d, unsigned int isa_irq)
 {
     struct hvm_irq *hvm_irq = hvm_domain_irq(d);
     unsigned int gsi = hvm_isa_irq_to_gsi(isa_irq);
 
     ASSERT(isa_irq <= 15);
-
-    spin_lock(&d->arch.hvm_domain.irq_lock);
+    ASSERT(spin_is_locked(&d->arch.hvm_domain.irq_lock));
 
     if ( !__test_and_set_bit(isa_irq, &hvm_irq->isa_irq.i) &&
          (hvm_irq->gsi_assert_count[gsi]++ == 0) )
         assert_irq(d, gsi, isa_irq);
+}
 
+void hvm_isa_irq_assert(
+    struct domain *d, unsigned int isa_irq)
+{
+    spin_lock(&d->arch.hvm_domain.irq_lock);
+    hvm_isa_irq_assert_locked(d, isa_irq);
     spin_unlock(&d->arch.hvm_domain.irq_lock);
 }
 
-void hvm_isa_irq_deassert(
+void hvm_isa_irq_deassert_locked(
     struct domain *d, unsigned int isa_irq)
 {
     struct hvm_irq *hvm_irq = hvm_domain_irq(d);
     unsigned int gsi = hvm_isa_irq_to_gsi(isa_irq);
 
     ASSERT(isa_irq <= 15);
-
-    spin_lock(&d->arch.hvm_domain.irq_lock);
+    ASSERT(spin_is_locked(&d->arch.hvm_domain.irq_lock));
 
     if ( __test_and_clear_bit(isa_irq, &hvm_irq->isa_irq.i) &&
          (--hvm_irq->gsi_assert_count[gsi] == 0) )
         deassert_irq(d, isa_irq);
+}
 
+void hvm_isa_irq_deassert(
+    struct domain *d, unsigned int isa_irq)
+{
+    spin_lock(&d->arch.hvm_domain.irq_lock);
+    hvm_isa_irq_deassert_locked(d, isa_irq);
     spin_unlock(&d->arch.hvm_domain.irq_lock);
 }
 
diff --git a/xen/arch/x86/hvm/svm/intr.c b/xen/arch/x86/hvm/svm/intr.c
index 8511ff0..d2a318c 100644
--- a/xen/arch/x86/hvm/svm/intr.c
+++ b/xen/arch/x86/hvm/svm/intr.c
@@ -137,18 +137,25 @@  void svm_intr_assist(void)
     struct hvm_intack intack;
     enum hvm_intblk intblk;
 
+    /*
+     * Avoid the vIOAPIC RTE being changed by another vCPU.
+     * Otherwise, pt_update_irq() may return a wrong vector which is not in
+     * vIRR and pt_irq_posted() may not recognize a timer interrupt has been
+     * injected.
+     */
+    spin_lock(&v->domain->arch.hvm_domain.irq_lock);
     /* Crank the handle on interrupt state. */
     pt_update_irq(v);
 
     do {
         intack = hvm_vcpu_has_pending_irq(v);
         if ( likely(intack.source == hvm_intsrc_none) )
-            return;
+            goto out;
 
         intblk = hvm_interrupt_blocked(v, intack);
         if ( intblk == hvm_intblk_svm_gif ) {
             ASSERT(nestedhvm_enabled(v->domain));
-            return;
+            goto out;
         }
 
         /* Interrupts for the nested guest are already
@@ -165,13 +172,13 @@  void svm_intr_assist(void)
             switch (rc) {
             case NSVM_INTR_NOTINTERCEPTED:
                 /* Inject interrupt into 2nd level guest directly. */
-                break;	
+                goto out;
             case NSVM_INTR_NOTHANDLED:
             case NSVM_INTR_FORCEVMEXIT:
-                return;
+                goto out;
             case NSVM_INTR_MASKED:
                 /* Guest already enabled an interrupt window. */
-                return;
+                goto out;
             default:
                 panic("%s: nestedsvm_vcpu_interrupt can't handle value %#x",
                     __func__, rc);
@@ -195,7 +202,7 @@  void svm_intr_assist(void)
         if ( unlikely(vmcb->eventinj.fields.v) || intblk )
         {
             svm_enable_intr_window(v, intack);
-            return;
+            goto out;
         }
 
         intack = hvm_vcpu_ack_pending_irq(v, intack);
@@ -216,6 +223,8 @@  void svm_intr_assist(void)
     intack = hvm_vcpu_has_pending_irq(v);
     if ( unlikely(intack.source != hvm_intsrc_none) )
         svm_enable_intr_window(v, intack);
+ out:
+    spin_unlock(&v->domain->arch.hvm_domain.irq_lock);
 }
 
 /*
diff --git a/xen/arch/x86/hvm/vmx/intr.c b/xen/arch/x86/hvm/vmx/intr.c
index 1eb9f38..8f3d1a7 100644
--- a/xen/arch/x86/hvm/vmx/intr.c
+++ b/xen/arch/x86/hvm/vmx/intr.c
@@ -234,6 +234,14 @@  void vmx_intr_assist(void)
         return;
     }
 
+    /*
+     * Avoid the vIOAPIC RTE being changed by another vCPU.
+     * Otherwise, pt_update_irq() may return a wrong vector which is not in
+     * vIRR and pt_irq_posted() may not recognize a timer interrupt has been
+     * injected.
+     */
+    spin_lock(&v->domain->arch.hvm_domain.irq_lock);
+
     /* Crank the handle on interrupt state. */
     if ( is_hvm_vcpu(v) )
         pt_vector = pt_update_irq(v);
@@ -396,6 +404,7 @@  void vmx_intr_assist(void)
     }
 
  out:
+    spin_unlock(&v->domain->arch.hvm_domain.irq_lock);
     if ( !nestedhvm_vcpu_in_guestmode(v) &&
          !cpu_has_vmx_virtual_intr_delivery &&
          cpu_has_vmx_tpr_shadow )
diff --git a/xen/arch/x86/hvm/vpic.c b/xen/arch/x86/hvm/vpic.c
index e160bbd..8d02e52 100644
--- a/xen/arch/x86/hvm/vpic.c
+++ b/xen/arch/x86/hvm/vpic.c
@@ -153,14 +153,13 @@  static void __vpic_intack(struct hvm_hw_vpic *vpic, int irq)
     vpic_update_int_output(vpic);
 }
 
-static int vpic_intack(struct hvm_hw_vpic *vpic)
+static int vpic_intack_locked(struct hvm_hw_vpic *vpic)
 {
     int irq = -1;
 
-    vpic_lock(vpic);
-
+    ASSERT(vpic_is_locked(vpic));
     if ( !vpic->int_output )
-        goto out;
+        return irq;
 
     irq = vpic_get_highest_priority_irq(vpic);
     BUG_ON(irq < 0);
@@ -175,7 +174,15 @@  static int vpic_intack(struct hvm_hw_vpic *vpic)
         irq += 8;
     }
 
- out:
+    return irq;
+}
+
+static int vpic_intack(struct hvm_hw_vpic *vpic)
+{
+    int irq;
+
+    vpic_lock(vpic);
+    irq = vpic_intack_locked(vpic);
     vpic_unlock(vpic);
     return irq;
 }
@@ -487,13 +494,14 @@  int vpic_ack_pending_irq(struct vcpu *v)
     struct hvm_hw_vpic *vpic = &v->domain->arch.hvm_domain.vpic[0];
 
     ASSERT(has_vpic(v->domain));
+    ASSERT(vpic_is_locked(vpic));
 
     TRACE_2D(TRC_HVM_EMUL_PIC_PEND_IRQ_CALL, vlapic_accept_pic_intr(v),
              vpic->int_output);
     if ( !vlapic_accept_pic_intr(v) || !vpic->int_output )
         return -1;
 
-    irq = vpic_intack(vpic);
+    irq = vpic_intack_locked(vpic);
     if ( irq == -1 )
         return -1;
 
diff --git a/xen/arch/x86/hvm/vpt.c b/xen/arch/x86/hvm/vpt.c
index e3f2039..d048f26 100644
--- a/xen/arch/x86/hvm/vpt.c
+++ b/xen/arch/x86/hvm/vpt.c
@@ -296,8 +296,8 @@  int pt_update_irq(struct vcpu *v)
         vlapic_set_irq(vcpu_vlapic(v), irq, 0);
     else
     {
-        hvm_isa_irq_deassert(v->domain, irq);
-        hvm_isa_irq_assert(v->domain, irq);
+        hvm_isa_irq_deassert_locked(v->domain, irq);
+        hvm_isa_irq_assert_locked(v->domain, irq);
     }
 
     /*
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index f041252..a8f36f8 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -119,8 +119,12 @@  void hvm_pci_intx_deassert(
 /* Modify state of an ISA device's IRQ wire. */
 void hvm_isa_irq_assert(
     struct domain *d, unsigned int isa_irq);
+void hvm_isa_irq_assert_locked(
+    struct domain *d, unsigned int isa_irq);
 void hvm_isa_irq_deassert(
     struct domain *d, unsigned int isa_irq);
+void hvm_isa_irq_deassert_locked(
+    struct domain *d, unsigned int isa_irq);
 
 int hvm_set_pci_link_route(struct domain *d, u8 link, u8 isa_irq);