Patchwork [v2,1/2] VMX: fix VMCS race on context-switch paths

Submitter Jan Beulich
Date Nov. 7, 2017, 2:55 p.m.
Message ID <5A01D776020000780018CEBD@prv-mh.provo.novell.com>
Permalink /patch/379925/
State New

Comments

Jan Beulich - Nov. 7, 2017, 2:55 p.m.
>>> On 07.11.17 at 15:24, <igor.druzhinin@citrix.com> wrote:
> On 07/11/17 08:07, Jan Beulich wrote:
>> --- unstable.orig/xen/arch/x86/domain.c
>> +++ unstable/xen/arch/x86/domain.c
>> @@ -379,6 +379,14 @@ int vcpu_initialise(struct vcpu *v)
>>  
>>  void vcpu_destroy(struct vcpu *v)
>>  {
>> +    /*
>> +     * Flush all state for this vCPU before fully tearing it down. This is
>> +     * particularly important for HVM ones on VMX, so that this flushing of
>> +     * state won't happen from the TLB flush IPI handler behind the back of
>> +     * a vmx_vmcs_enter() / vmx_vmcs_exit() section.
>> +     */
>> +    sync_vcpu_execstate(v);
>> +
>>      xfree(v->arch.vm_event);
>>      v->arch.vm_event = NULL;
> 
> I don't think this is going to fix the problem since vCPU we are
> currently destroying has nothing to do with the vCPUx that actually
> caused the problem by its migration. We still are going to call
> vmx_vcpu_disable_pml() which loads and cleans VMCS on the current pCPU.

Oh, right, wrong vCPU. This should be better:

In that case the question then is whether (rather than generalizing
it, as mentioned for the earlier version) this wouldn't better go into
vmx_vcpu_destroy(), assuming anything called earlier from
hvm_vcpu_destroy() isn't susceptible to the problem (i.e. doesn't
play with VMCSes).

Jan
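
[For illustration, the placement suggested above might look as follows.
The body of vmx_vcpu_destroy() shown here is abbreviated and assumed,
not quoted from the tree; of its callees, only vmx_vcpu_disable_pml()
was named in this thread.]

```diff
--- unstable.orig/xen/arch/x86/hvm/vmx/vmx.c
+++ unstable/xen/arch/x86/hvm/vmx/vmx.c
@@ static void vmx_vcpu_destroy(struct vcpu *v)
 static void vmx_vcpu_destroy(struct vcpu *v)
 {
+    /*
+     * Flush all state for the vCPU previously having run on the current
+     * CPU, so that no flushing happens from the TLB flush IPI handler
+     * behind the back of a vmx_vmcs_enter() / vmx_vmcs_exit() section.
+     */
+    sync_local_execstate();
+
     vmx_vcpu_disable_pml(v);
     ...
```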
Igor Druzhinin - Nov. 7, 2017, 3:52 p.m.
On 07/11/17 14:55, Jan Beulich wrote:
>>>> On 07.11.17 at 15:24, <igor.druzhinin@citrix.com> wrote:
>> On 07/11/17 08:07, Jan Beulich wrote:
>>> --- unstable.orig/xen/arch/x86/domain.c
>>> +++ unstable/xen/arch/x86/domain.c
>>> @@ -379,6 +379,14 @@ int vcpu_initialise(struct vcpu *v)
>>>  
>>>  void vcpu_destroy(struct vcpu *v)
>>>  {
>>> +    /*
>>> +     * Flush all state for this vCPU before fully tearing it down. This is
>>> +     * particularly important for HVM ones on VMX, so that this flushing of
>>> +     * state won't happen from the TLB flush IPI handler behind the back of
>>> +     * a vmx_vmcs_enter() / vmx_vmcs_exit() section.
>>> +     */
>>> +    sync_vcpu_execstate(v);
>>> +
>>>      xfree(v->arch.vm_event);
>>>      v->arch.vm_event = NULL;
>>
>> I don't think this is going to fix the problem since vCPU we are
>> currently destroying has nothing to do with the vCPUx that actually
>> caused the problem by its migration. We still are going to call
>> vmx_vcpu_disable_pml() which loads and cleans VMCS on the current pCPU.
> 
> Oh, right, wrong vCPU. This should be better:
> 
> --- unstable.orig/xen/arch/x86/domain.c
> +++ unstable/xen/arch/x86/domain.c
> @@ -379,6 +379,14 @@ int vcpu_initialise(struct vcpu *v)
>  
>  void vcpu_destroy(struct vcpu *v)
>  {
> +    /*
> +     * Flush all state for the vCPU previously having run on the current CPU.
> +     * This is in particular relevant for HVM ones on VMX, so that this
> +     * flushing of state won't happen from the TLB flush IPI handler behind
> +     * the back of a vmx_vmcs_enter() / vmx_vmcs_exit() section.
> +     */
> +    sync_local_execstate();
> +
>      xfree(v->arch.vm_event);
>      v->arch.vm_event = NULL;
>  
> In that case the question then is whether (rather than generalizing
> is, as mentioned for the earlier version) this wouldn't better go into
> vmx_vcpu_destroy(), assuming anything called earlier from
> hvm_vcpu_destroy() isn't susceptible to the problem (i.e. doesn't
> play with VMCSes).

Ah, ok. Does this also apply to the previous issue? May I revert that
change to test it?

There is one thing that I'm worried about with this approach:

At this place we just sync the idle context because we know that we are
going to deal with VMCS later. But what about other potential cases
(perhaps some softirqs) in which we are accessing a vCPU data structure
that is currently shared between different pCPUs. Maybe we'd better sync
the context as soon as possible after we switched to idle from a
migrated vCPU.

Igor

> 
> Jan
>
Jan Beulich - Nov. 7, 2017, 4:31 p.m.
>>> On 07.11.17 at 16:52, <igor.druzhinin@citrix.com> wrote:
> On 07/11/17 14:55, Jan Beulich wrote:
>>>>> On 07.11.17 at 15:24, <igor.druzhinin@citrix.com> wrote:
>>> On 07/11/17 08:07, Jan Beulich wrote:
>>>> --- unstable.orig/xen/arch/x86/domain.c
>>>> +++ unstable/xen/arch/x86/domain.c
>>>> @@ -379,6 +379,14 @@ int vcpu_initialise(struct vcpu *v)
>>>>  
>>>>  void vcpu_destroy(struct vcpu *v)
>>>>  {
>>>> +    /*
>>>> +     * Flush all state for this vCPU before fully tearing it down. This is
>>>> +     * particularly important for HVM ones on VMX, so that this flushing of
>>>> +     * state won't happen from the TLB flush IPI handler behind the back of
>>>> +     * a vmx_vmcs_enter() / vmx_vmcs_exit() section.
>>>> +     */
>>>> +    sync_vcpu_execstate(v);
>>>> +
>>>>      xfree(v->arch.vm_event);
>>>>      v->arch.vm_event = NULL;
>>>
>>> I don't think this is going to fix the problem since vCPU we are
>>> currently destroying has nothing to do with the vCPUx that actually
>>> caused the problem by its migration. We still are going to call
>>> vmx_vcpu_disable_pml() which loads and cleans VMCS on the current pCPU.
>> 
>> Oh, right, wrong vCPU. This should be better:
>> 
>> --- unstable.orig/xen/arch/x86/domain.c
>> +++ unstable/xen/arch/x86/domain.c
>> @@ -379,6 +379,14 @@ int vcpu_initialise(struct vcpu *v)
>>  
>>  void vcpu_destroy(struct vcpu *v)
>>  {
>> +    /*
>> +     * Flush all state for the vCPU previously having run on the current CPU.
>> +     * This is in particular relevant for HVM ones on VMX, so that this
>> +     * flushing of state won't happen from the TLB flush IPI handler behind
>> +     * the back of a vmx_vmcs_enter() / vmx_vmcs_exit() section.
>> +     */
>> +    sync_local_execstate();
>> +
>>      xfree(v->arch.vm_event);
>>      v->arch.vm_event = NULL;
>>  
>> In that case the question then is whether (rather than generalizing
>> is, as mentioned for the earlier version) this wouldn't better go into
>> vmx_vcpu_destroy(), assuming anything called earlier from
>> hvm_vcpu_destroy() isn't susceptible to the problem (i.e. doesn't
>> play with VMCSes).
> 
> Ah, ok. Does this also apply to the previous issue? May I revert that
> change to test it?

Feel free to try it, but I had checked that previous patch earlier
today, and right now I don't think the two issues are related.

> There is one things that I'm worrying about with this approach:
> 
> At this place we just sync the idle context because we know that we are
> going to deal with VMCS later. But what about other potential cases
> (perhaps some softirqs) in which we are accessing a vCPU data structure
> that is currently shared between different pCPUs. Maybe we'd better sync
> the context as soon as possible after we switched to idle from a
> migrated vCPU.

Well, yes, I had pointed out in the earlier reply that this is just to
deal with the specific case here. Whether to sync earlier after a
migration I'm not really sure about - the way it's written right now,
it is meant to deal with migration across CPUs. If we did sync earlier,
this would perhaps belong in scheduler code (and hence cover ARM as
well), but so far I haven't been able to find a good place to
put it.

George, Dario, do you have any thoughts both on the general
idea as well as where to put the necessary code?

Jan
Jan Beulich - Nov. 9, 2017, 10:05 a.m.
>>> On 07.11.17 at 16:52, <igor.druzhinin@citrix.com> wrote:
> There is one things that I'm worrying about with this approach:
> 
> At this place we just sync the idle context because we know that we are
> going to deal with VMCS later. But what about other potential cases
> (perhaps some softirqs) in which we are accessing a vCPU data structure
> that is currently shared between different pCPUs. Maybe we'd better sync
> the context as soon as possible after we switched to idle from a
> migrated vCPU.

Short of feedback from the scheduler maintainers I've looked
into this some more. Calling sync_vcpu_execstate() out of
vcpu_move_locked() would seem feasible, but there are a number
of other places where ->processor of a vCPU is being updated,
and where the vCPU was not (obviously) put to sleep already:

- credit1's csched_runq_steal()
- credit2's migrate()
- csched2_schedule()
- null's vcpu_assign() when called out of null_schedule()
- rt_schedule()

I don't think it is reasonable to call sync_vcpu_execstate() in all of
those places, and hence VMX's special VMCS management may
indeed better be taken care of by VMX-specific code (i.e. as
previously indicated the sync_local_execstate() invocation moved
from vcpu_destroy() - where my most recent patch draft had put
it - to vmx_vcpu_destroy()). And we'd have to live with other
VMX code paths having similar asynchronous behavior needing to
similarly take care of the requirement.

Jan
Dario Faggioli - Nov. 9, 2017, 10:36 a.m.
On Thu, 2017-11-09 at 03:05 -0700, Jan Beulich wrote:
> > > > On 07.11.17 at 16:52, <igor.druzhinin@citrix.com> wrote:
> > 
> > There is one things that I'm worrying about with this approach:
> > 
> > At this place we just sync the idle context because we know that we
> > are
> > going to deal with VMCS later. But what about other potential cases
> > (perhaps some softirqs) in which we are accessing a vCPU data
> > structure
> > that is currently shared between different pCPUs. Maybe we'd better
> > sync
> > the context as soon as possible after we switched to idle from a
> > migrated vCPU.
> 
> Short of feedback from the scheduler maintainers I've looked
> into this some more.
>
Sorry, as you may have seen by the other email, I was looking into this
today.

> Calling sync_vcpu_execstate() out of
> vcpu_move_locked() would seem feasible, but there are a number
> of other places where ->processor of a vCPU is being updated,
> and where the vCPU was not (obviously) put to sleep already:
> 
> - credit1's csched_runq_steal()
> - credit2's migrate()
> - csched2_schedule()
> - null's vcpu_assign() when called out of null_schedule()
> - rt_schedule()
> 
> I don't think it is reasonable to call sync_vcpu_execstate() in all
> of
> those places, 
>
Yes, I agree.

> and hence VMX'es special VMCS management may
> indeed better be taken care of by VMX-specific code (i.e. as
> previously indicated the sync_local_execstate() invocation moved
> from vcpu_destroy() - where my most recent patch draft had put
> it - to vmx_vcpu_destroy()). 
>
I was still trying to form an opinion about the issue, and was leaning
toward suggesting exactly the same.

In fact, the point of lazy context switch is exactly that: trying to
avoid syncing the state. Of course, that requires that we identify all
the places and occasions where having the state out of sync may be a
problem, and sync it there.

What seems to me to be happening here is as follows:

 a. a pCPU becomes idle
 b. we do the lazy switch, i.e., the context of the previously running 
    vCPU stays on the pCPU
 c. *something* happens which wants to either play with or alter the 
    context currently loaded on the pCPU (in this case it's the VMX 
    bits of the context, but it could be other parts of it too)

Well, I'm afraid I only see two solutions:
1) we get rid of lazy context switch;
2) whatever it is that is happening at point c above, it needs to be 
   aware that we use lazy context switch, and make sure to sync the 
   context before playing with or altering it;

> And we'd have to live with other
> VMX code paths having similar asynchronous behavior needing to
> similarly take care of the requirement.
> 
Exactly.

And in fact, in this thread, migration of vCPUs between pCPUs was
mentioned, and treating that in a special way was being considered.

But it looks to me that something very similar may, at least in theory,
happen any time a lazy context switch occurs, no matter whether the
pCPU has become idle because the previously running vCPU wants to move,
or because it blocked for whatever other reason.

Regards,
Dario
Jan Beulich - Nov. 9, 2017, 12:58 p.m.
>>> On 09.11.17 at 11:36, <raistlin@linux.it> wrote:
> Well, I'm afraid I only see two solutions:
> 1) we get rid of lazy context switch;
> 2) whatever it is that is happening at point c above, it needs to be 
>    aware that we use lazy context switch, and make sure to sync the 
>    context before playing with or altering it;

3) Better centralize the updating of v->processor, so that it becomes
reasonable to sync state there. Igor's idea of flushing state once it
is known (or at least pretty certain) that the vCPU won't run on the
prior pCPU next time it gets scheduled is certainly a reasonable one.
It just doesn't fit well with how the individual schedulers currently
behave.

Jan

Patch

--- unstable.orig/xen/arch/x86/domain.c
+++ unstable/xen/arch/x86/domain.c
@@ -379,6 +379,14 @@  int vcpu_initialise(struct vcpu *v)
 
 void vcpu_destroy(struct vcpu *v)
 {
+    /*
+     * Flush all state for the vCPU previously having run on the current CPU.
+     * This is in particular relevant for HVM ones on VMX, so that this
+     * flushing of state won't happen from the TLB flush IPI handler behind
+     * the back of a vmx_vmcs_enter() / vmx_vmcs_exit() section.
+     */
+    sync_local_execstate();
+
     xfree(v->arch.vm_event);
     v->arch.vm_event = NULL;