Patchwork x86emul: handle address wrapping for VMASKMOVP{S, D}

login
register
mail settings
Submitter Jan Beulich
Date Oct. 10, 2017, 10:43 a.m.
Message ID <59DCC06B020000780018452D@prv-mh.provo.novell.com>
Download mbox | patch
Permalink /patch/357723/
State New
Headers show

Comments

Jan Beulich - Oct. 10, 2017, 10:43 a.m.
I failed to recognize the need to mirror the changes done by 7869e2bafe
("x86emul/fuzz: add rudimentary limit checking") into the earlier
written but later committed 2fe43d333f ("x86emul: support remaining AVX
insns"): Behavior here is the same as for multi-part reads or writes.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
There's another issue here, but I'll first have to think about possible
(preferably non-intrusive) solutions: An access crossing a page
boundary and having
- a set mask bit corresponding to an element fully living in the first
  page,
- one or more clear mask bits corresponding to the initial elements on
  the second page,
- another higher mask bit being set
would result in a wrong CR2 value to be reported in case the access to
the second page would cause a fault (it would point to the start of the
page instead of the element being accessed). Neither splitting the
access here into multiple ones nor uniformly passing a byte or word
enable mask into ->write() seem very desirable.
Andrew Cooper - Oct. 10, 2017, 12:43 p.m.
On 10/10/17 11:43, Jan Beulich wrote:
> I failed to recognize the need to mirror the changes done by 7869e2bafe
> ("x86emul/fuzz: add rudimentary limit checking") into the earlier
> written but later committed 2fe43d333f ("x86emul: support remaining AVX
> insns"): Behavior here is the same as for multi-part reads or writes.
>
> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

> ---
> There's another issue here, but I'll first have to think about possible
> (preferably non-intrusive) solutions: An access crossing a page
> boundary and having
> - a set mask bit corresponding to an element fully living in the first
>   page,
> - one or more clear mask bits corresponding to the initial elements on
>   the second page,
> - another higher mask bit being set
> would result in a wrong CR2 value to be reported in case the access to
> the second page would cause a fault (it would point to the start of the
> page instead of the element being accessed). Neither splitting the
> access here into multiple ones nor uniformly passing a byte or word
> enable mask into ->write() seem very desirable.

Is this just supposition, or have you confirmed what really happens on
hardware?

I expect that the mask operations turn into multi-part operations, which
means their behaviour on a straddled fault is implementation defined
(and behaves differently between Atoms and Xeons).

One option we could do is to have a variation of the "Implement
hvmemul_write() using real mappings" logic where we pull mappings into
the vmap individually, but that would require some part of the code to
convert ea + mask => linear address of each unit, so the eventual
mapping can be constructed piece-wise.

~Andrew
Jan Beulich - Oct. 11, 2017, 7:44 a.m.
>>> On 10.10.17 at 14:43, <andrew.cooper3@citrix.com> wrote:
> On 10/10/17 11:43, Jan Beulich wrote:
>> There's another issue here, but I'll first have to think about possible
>> (preferably non-intrusive) solutions: An access crossing a page
>> boundary and having
>> - a set mask bit corresponding to an element fully living in the first
>>   page,
>> - one or more clear mask bits corresponding to the initial elements on
>>   the second page,
>> - another higher mask bit being set
>> would result in a wrong CR2 value to be reported in case the access to
>> the second page would cause a fault (it would point to the start of the
>> page instead of the element being accessed). Neither splitting the
>> access here into multiple ones nor uniformly passing a byte or word
>> enable mask into ->write() seem very desirable.
> 
> Is this just supposition, or have you confirmed what really happens on
> hardware?

I had done some experiments already at the time I wrote this.
I've now done more. If the fault occurs on the high page, the CR2
value (on Haswell) depends on whether there was an enabled
access to the low page. If so, CR2 holds the address of the last
byte of the overall range (even if the highest mask bit is clear). If
not, CR2 holds the address of the lowest byte actually being
accessed.

> I expect that the mask operations turn into multi-part operations, which
> means their behaviour on a straddled fault is implementation defined
> (and behaves differently between Atoms and Xeons).
> 
> One option we could do is to have a variation of the "Implement
> hvmemul_write() using real mappings" logic where we pull mappings into
> the vmap individually, but that would require some part of the code to
> convert ea + mask => linear address of each unit, so the eventual
> mapping can be constructed piece-wise.

I certainly have no Atom to play with - on the Haswell, the write
to the first page does not take effect when the access to the
second page faults. Hence splitting the accesses (as suggested
as an option above) clearly would not be valid.

On my Dinar (AMD Fam15), otoh, CR2 always points at the
lowest byte of the actually accessed field(s) on the page causing
the fault (i.e. the behavior I had implied in my original remark to
the patch).

Perhaps to avoid passing byte/word enables into ->read and
->write() (not sure why originally I had thought of the latter
only) we could extend struct pagefault_info to include an
"offset from start of accessed range" field, to be passed into
x86_emul_pagefault() in addition to the current arguments?

Jan
Andrew Cooper - Oct. 11, 2017, 11:23 a.m.
On 11/10/17 08:44, Jan Beulich wrote:
>>>> On 10.10.17 at 14:43, <andrew.cooper3@citrix.com> wrote:
>> On 10/10/17 11:43, Jan Beulich wrote:
>>> There's another issue here, but I'll first have to think about possible
>>> (preferably non-intrusive) solutions: An access crossing a page
>>> boundary and having
>>> - a set mask bit corresponding to an element fully living in the first
>>>   page,
>>> - one or more clear mask bits corresponding to the initial elements on
>>>   the second page,
>>> - another higher mask bit being set
>>> would result in a wrong CR2 value to be reported in case the access to
>>> the second page would cause a fault (it would point to the start of the
>>> page instead of the element being accessed). Neither splitting the
>>> access here into multiple ones nor uniformly passing a byte or word
>>> enable mask into ->write() seem very desirable.
>> Is this just supposition, or have you confirmed what really happens on
>> hardware?
> I had done some experiments already at the time I wrote this.
> I've now done more. If the fault occurs on the high page, the CR2
> value (on Haswell) depends on whether there was an enabled
> access to the low page. If so, CR2 holds the address of the last
> byte of the overall range (even if the highest mask bit is clear). If
> not, CR2 holds the address of the lowest byte actually being
> accessed.

The answer appears to be "its complicated".

>> I expect that the mask operations turn into multi-part operations, which
>> means their behaviour on a straddled fault is implementation defined
>> (and behaves differently between Atoms and Xeons).
>>
>> One option we could do is to have a variation of the "Implement
>> hvmemul_write() using real mappings" logic where we pull mappings into
>> the vmap individually, but that would require some part of the code to
>> convert ea + mask => linear address of each unit, so the eventual
>> mapping can be constructed piece-wise.
> I certainly have no Atom to play with - on the Haswell, the write
> to the first page does not take effect when the access to the
> second page faults. Hence splitting the accesses (as suggested
> as an option above) clearly would not be valid.

Not so.  If the behaviour is implementation defined, splitting the
accesses in the emulator would be valid, even if it differs from
hardware behaviour.

The problem here is working out whether the behaviour here is
implementation defined, or whether it defined as having atomic
properties.  I am not aware of any statement in any of the vendor
manuals which confirms the atomic properties, or declares the behaviour
as implementation defined.

>
> On my Dinar (AMD Fam15), otoh, CR2 always points at the
> lowest byte of the actually accessed field(s) on the page causing
> the fault (i.e. the behavior I had implied in my original remark to
> the patch).

This is certainly more logical behaviour.

>
> Perhaps to avoid passing byte/word enables into ->read and
> ->write() (not sure why originally I had thought of the latter
> only) we could extend struct pagefault_info to include an
> "offset from start of accessed range" field, to be passed into
> x86_emul_pagefault() in addition to the current arguments?

An interesting question would be whether an access with no mask bits set
in the first page faults if the first page is not present.  This would
help identify whether a pagewalk is done for start of the access, or
whether the first page is omitted entirely when the mask permits.  (I
can see the latter behaviour being faster, but perhaps more complicated
in silicon, so I'm having a hard time judging is likely to happen).


At the end of the day, this is clearly a corner case which behaves
differently in different circumstances, which gives us some flexibility
in how we fix it.  I'd opt for atomic behaviour wherever possible,
because it is clearly how at least one piece of hardware actually
behaves, and it is the safer option to take e.g. for emulation caused by
introspection.  So long as CR2 points to the correct 4k frame, demand
paging will work inside the guest, so I don't think we should worry too
much about exactly where it points.

~Andrew
Jan Beulich - Oct. 11, 2017, 11:37 a.m.
>>> On 11.10.17 at 13:23, <andrew.cooper3@citrix.com> wrote:
> On 11/10/17 08:44, Jan Beulich wrote:
>>>>> On 10.10.17 at 14:43, <andrew.cooper3@citrix.com> wrote:
>>> On 10/10/17 11:43, Jan Beulich wrote:
>>>> There's another issue here, but I'll first have to think about possible
>>>> (preferably non-intrusive) solutions: An access crossing a page
>>>> boundary and having
>>>> - a set mask bit corresponding to an element fully living in the first
>>>>   page,
>>>> - one or more clear mask bits corresponding to the initial elements on
>>>>   the second page,
>>>> - another higher mask bit being set
>>>> would result in a wrong CR2 value to be reported in case the access to
>>>> the second page would cause a fault (it would point to the start of the
>>>> page instead of the element being accessed). Neither splitting the
>>>> access here into multiple ones nor uniformly passing a byte or word
>>>> enable mask into ->write() seem very desirable.
>>> Is this just supposition, or have you confirmed what really happens on
>>> hardware?
>> I had done some experiments already at the time I wrote this.
>> I've now done more. If the fault occurs on the high page, the CR2
>> value (on Haswell) depends on whether there was an enabled
>> access to the low page. If so, CR2 holds the address of the last
>> byte of the overall range (even if the highest mask bit is clear). If
>> not, CR2 holds the address of the lowest byte actually being
>> accessed.
> 
> The answer appears to be "its complicated".
> 
>>> I expect that the mask operations turn into multi-part operations, which
>>> means their behaviour on a straddled fault is implementation defined
>>> (and behaves differently between Atoms and Xeons).
>>>
>>> One option we could do is to have a variation of the "Implement
>>> hvmemul_write() using real mappings" logic where we pull mappings into
>>> the vmap individually, but that would require some part of the code to
>>> convert ea + mask => linear address of each unit, so the eventual
>>> mapping can be constructed piece-wise.
>> I certainly have no Atom to play with - on the Haswell, the write
>> to the first page does not take effect when the access to the
>> second page faults. Hence splitting the accesses (as suggested
>> as an option above) clearly would not be valid.
> 
> Not so.  If the behaviour is implementation defined, splitting the
> accesses in the emulator would be valid, even if it differs from
> hardware behaviour.
> 
> The problem here is working out whether the behaviour here is
> implementation defined, or whether it defined as having atomic
> properties.  I am not aware of any statement in any of the vendor
> manuals which confirms the atomic properties, or declares the behaviour
> as implementation defined.

As much as other instructions rarely have any such information.
In fact, the absence of an "implementation defined" statement
normally kind of implies there's nothing implementation defined
there. In the case here, vendors agreeing that no partial writes
are being done, I would consider it wrong for the emulator to
do so unless there was an explicit statement allowing for this.
The text, however, only says that the order of accesses is
implementation defined.

>> Perhaps to avoid passing byte/word enables into ->read and
>> ->write() (not sure why originally I had thought of the latter
>> only) we could extend struct pagefault_info to include an
>> "offset from start of accessed range" field, to be passed into
>> x86_emul_pagefault() in addition to the current arguments?
> 
> An interesting question would be whether an access with no mask bits set
> in the first page faults if the first page is not present.

It doesn't - that's what I had tried in my first round of experiments.

>  This would
> help identify whether a pagewalk is done for start of the access, or
> whether the first page is omitted entirely when the mask permits.

Equally well the page walk may be done, but its results discarded.

> At the end of the day, this is clearly a corner case which behaves
> differently in different circumstances, which gives us some flexibility
> in how we fix it.  I'd opt for atomic behaviour wherever possible,
> because it is clearly how at least one piece of hardware actually
> behaves, and it is the safer option to take e.g. for emulation caused by
> introspection.  So long as CR2 points to the correct 4k frame, demand
> paging will work inside the guest, so I don't think we should worry too
> much about exactly where it points.

Well, for the moment (i.e. 4.10) I'm certainly not meaning to fix this
beyond the patch already submitted, but since AVX-512 allows such
masked accesses all the time I will want to get this right before
adding AVX-512 support (or else we'd spread the problem).

Jan

Patch

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -7887,7 +7887,7 @@  x86_emulate(
             switch ( d & SrcMask )
             {
             case SrcMem:
-                rc = ops->read(ea.mem.seg, ea.mem.off + first_byte,
+                rc = ops->read(ea.mem.seg, truncate_ea(ea.mem.off + first_byte),
                                (void *)mmvalp + first_byte, op_bytes,
                                ctxt);
                 if ( rc != X86EMUL_OKAY )
@@ -7970,7 +7970,7 @@  x86_emulate(
         else
         {
             fail_if(!ops->write);
-            rc = ops->write(dst.mem.seg, dst.mem.off + first_byte,
+            rc = ops->write(dst.mem.seg, truncate_ea(dst.mem.off + first_byte),
                             !state->simd_size ? &dst.val
                                               : (void *)mmvalp + first_byte,
                             dst.bytes, ctxt);