Patchwork [v2,1/2] VMX: fix VMCS race on context-switch paths

login
register
mail settings
Submitter Jan Beulich
Date Nov. 7, 2017, 3:16 p.m.
Message ID <5A01DC76020000780018CEFA@prv-mh.provo.novell.com>
Download mbox | patch
Permalink /patch/379967/
State New
Headers show

Comments

Jan Beulich - Nov. 7, 2017, 3:16 p.m.
>>> On 07.11.17 at 09:07, <JBeulich@suse.com> wrote:
>>>> On 02.11.17 at 20:46, <igor.druzhinin@citrix.com> wrote:
>>> Any ideas about the root cause of the fault and suggestions how to reproduce 
> it
>>> would be welcome. Does this crash really has something to do with PML? I 
> doubt
>>> because the original environment may hardly be called PML-heavy.
> 
> Well, PML-heaviness doesn't matter. It's the mere fact that PML
> is enabled on the vCPU being destroyed.
> 
>> So we finally have complete understanding of what's going on:
>> 
>> Some vCPU has just migrated to another pCPU and we switched to idle but
>> per_cpu(curr_vcpu) on the current pCPU is still pointing to it - this is
>> how the current logic works. While we're in idle we're issuing
>> vcpu_destroy() for some other domain which eventually calls
>> vmx_vcpu_disable_pml() and trashes VMCS pointer on the current pCPU. At
>> this moment we get a TLB flush IPI from that same vCPU which is now
>> context switching on another pCPU - it appears to clean TLB after
>> itself. This vCPU is already marked is_running=1 by the scheduler. In
>> the IPI handler we enter __sync_local_execstate() and trying to call
>> vmx_ctxt_switch_from() for the migrated vCPU which is supposed to call
>> vmcs_reload() but doesn't do it because is_running==1. The next VMWRITE
>> crashes the hypervisor.
>> 
>> So the state transition diagram might look like:
>> pCPU1: vCPUx -> migrate to pCPU2 -> idle -> RCU callbacks ->
> 
> I'm not really clear about who/what is "idle" here: pCPU1,
> pCPU2, or yet something else? If vCPUx migrated to pCPU2,
> wouldn't it be put back into runnable state right away, and
> hence pCPU2 can't be idle at this point? Yet for pCPU1 I don't
> think its idleness would matter much, i.e. the situation could
> also arise without it becoming idle afaics. pCPU1 making it
> anywhere softirqs are being processed would suffice.
> 
>> vcpu_destroy() -> vmx_vcpu_disable_pml() -> vmcs_clear()
>> pCPU2: context switch into vCPUx -> is_running = 1 -> TLB flush
>> pCPU1: IPI handler -> context switch out of vCPUx -> VMWRITE -> CRASH!
>> 
>> We can basically just fix the condition around vmcs_reload() call but
>> I'm not completely sure that it's the right way to do - I don't think
>> leaving per_cpu(curr_vcpu) pointing to a migrated vCPU is a good idea
>> (maybe we need to clean it). What are your thoughts?
> 
> per_cpu(curr_vcpu) can only validly be written inside
> __context_switch(), hence the only way to achieve this would
> be to force __context_switch() to be called earlier than out of
> the TLB flush IPI handler, perhaps like in the (untested!) patch
> below. Two questions then remain:
> - Should we perhaps rather do this in an arch-independent way
>   (i.e. ahead of the call to vcpu_destroy() in common code)?
> - This deals with only a special case of the more general "TLB
>   flush behind the back of a vmx_vmcs_enter() /
>   vmx_vmcs_exit() section" - does this need dealing with in a
>   more general way? Here I'm thinking of introducing a
>   FLUSH_STATE flag to be passed to flush_mask() instead of
>   the current flush_tlb_mask() in context_switch() and
>   sync_vcpu_execstate(). This could at the same time be used
>   for a small performance optimization: At least for HAP vCPU-s
>   I don't think we really need the TLB part of the flushes here.

Btw., for this second aspect below is what I have in mind.

Jan

x86: make CPU state flush requests explicit

Having this be an implied side effect of a TLB flush is not very nice:
It could (at least in theory) lead to unintended state flushes (see e.g.
https://lists.xenproject.org/archives/html/xen-devel/2017-11/msg00187.html 
for context). Introduce a flag to be used in the two places actually
wanting the state flushed, and conditionalize the
__sync_local_execstate() invocation in the IPI handler accordingly.

At the same time also conditionalize the flush_area_local() invocations,
to short-circuit the function ending up as a no-op anyway.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I first thought we could also suppress the TLB flush part in the context
switch cases for HAP vCPU-s, but the per-domain mappings require that to
happen.

Patch

--- unstable.orig/xen/arch/x86/domain.c
+++ unstable/xen/arch/x86/domain.c
@@ -1699,7 +1699,7 @@  void context_switch(struct vcpu *prev, s
                   !cpumask_empty(&dirty_mask)) )
     {
         /* Other cpus call __sync_local_execstate from flush ipi handler. */
-        flush_tlb_mask(&dirty_mask);
+        flush_mask(&dirty_mask, FLUSH_TLB | FLUSH_STATE);
     }
 
     if ( prev != next )
@@ -1808,7 +1808,7 @@  void sync_vcpu_execstate(struct vcpu *v)
         sync_local_execstate();
 
     /* Other cpus call __sync_local_execstate from flush ipi handler. */
-    flush_tlb_mask(v->vcpu_dirty_cpumask);
+    flush_mask(v->vcpu_dirty_cpumask, FLUSH_TLB | FLUSH_STATE);
 }
 
 static int relinquish_memory(
--- unstable.orig/xen/arch/x86/smp.c
+++ unstable/xen/arch/x86/smp.c
@@ -207,9 +207,10 @@  void invalidate_interrupt(struct cpu_use
     unsigned int flags = flush_flags;
     ack_APIC_irq();
     perfc_incr(ipis);
-    if ( __sync_local_execstate() )
+    if ( (flags & FLUSH_STATE) && __sync_local_execstate() )
         flags &= ~(FLUSH_TLB | FLUSH_TLB_GLOBAL);
-    flush_area_local(flush_va, flags);
+    if ( flags & ~(FLUSH_STATE | FLUSH_ORDER_MASK) )
+        flush_area_local(flush_va, flags);
     cpumask_clear_cpu(smp_processor_id(), &flush_cpumask);
 }
 
@@ -219,7 +220,8 @@  void flush_area_mask(const cpumask_t *ma
 
     ASSERT(local_irq_is_enabled());
 
-    if ( cpumask_test_cpu(cpu, mask) )
+    if ( (flags & ~(FLUSH_STATE | FLUSH_ORDER_MASK)) &&
+         cpumask_test_cpu(cpu, mask) )
         flags = flush_area_local(va, flags);
 
     if ( (flags & ~FLUSH_ORDER_MASK) &&
--- unstable.orig/xen/include/asm-x86/flushtlb.h
+++ unstable/xen/include/asm-x86/flushtlb.h
@@ -101,6 +101,8 @@  void write_cr3(unsigned long cr3);
 #define FLUSH_CACHE      0x400
  /* VA for the flush has a valid mapping */
 #define FLUSH_VA_VALID   0x800
+ /* Flush CPU state */
+#define FLUSH_STATE      0x1000
 
 /* Flush local TLBs/caches. */
 unsigned int flush_area_local(const void *va, unsigned int flags);