Patchwork [01/12] KVM: arm64: Document PV-time interface

login
register
mail settings
Submitter Steven Price
Date Nov. 28, 2018, 2:45 p.m.
Message ID <20181128144527.44710-2-steven.price@arm.com>
Download mbox | patch
Permalink /patch/667163/
State New
Headers show

Comments

Steven Price - Nov. 28, 2018, 2:45 p.m.
We introduce a paravirtualization interface for KVM/arm64 based on the
"Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.

User space can specify a reserved area of memory for the guest and
inform KVM to populate the memory with information on stolen time and
Live Physical Time (LPT) that can be used to derive a stable
counter/timer for a guest subject to migration between hosts with
different counter frequencies.

A hypercall interface is provided for the guest to interrogate the
hypervisor's support for this interface and the location of the shared
memory structures.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 Documentation/virtual/kvm/arm/pvtime.txt | 169 +++++++++++++++++++++++
 1 file changed, 169 insertions(+)
 create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
Andrew Jones - Dec. 3, 2018, 1:50 p.m.
On Wed, Nov 28, 2018 at 02:45:16PM +0000, Steven Price wrote:
> We introduce a paravirtualization interface for KVM/arm64 based on the
> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
> 
> User space can specify a reserved area of memory for the guest and
> inform KVM to populate the memory with information on stolen time and
> Live Physical Time (LPT) that can be used to derive a stable
> counter/timer for a guest subject to migration between hosts with
> different counter frequencies.
> 
> A hypercall interface is provided for the guest to interrogate the
> hypervisor's support for this interface and the location of the shared
> memory structures.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  Documentation/virtual/kvm/arm/pvtime.txt | 169 +++++++++++++++++++++++
>  1 file changed, 169 insertions(+)
>  create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
> 
> diff --git a/Documentation/virtual/kvm/arm/pvtime.txt b/Documentation/virtual/kvm/arm/pvtime.txt
> new file mode 100644
> index 000000000000..1870b904075b
> --- /dev/null
> +++ b/Documentation/virtual/kvm/arm/pvtime.txt
> @@ -0,0 +1,169 @@
> +Paravirtualized time support for arm64
> +======================================
> +
> +Arm specification DEN0057/A defined a standard for paravirtualised time
> +support for Aarch64 guests:
> +
> +https://developer.arm.com/docs/den0057/a
> +
> +KVM/Arm64 implements this specification by providing some hypervisor service
> +calls to support a paravirtualized guest obtaining a view of the amount of
> +time stolen from its execution and a concept of Live Physical Time (LPT) which
> +represents time during which the guest is running and works across migrations.
> +
> +Three new SMCCC compatible hypercalls are defined:
> +
> +PV_FEATURES 0xC5000020
> +PV_TIME_LPT 0xC5000021
> +PV_TIME_ST  0xC5000022
> +
> +These are only available in the SMC64/HVC64 calling convention as
> +paravirtualized time is not available to 32 bit Arm guests.
> +
> +PV_FEATURES
> +    Function ID:  (uint32)  : 0xC5000020
> +    PV_func_id:   (uint32)  : Either PV_TIME_LPT or PV_TIME_ST
> +    Return value: (int32)   : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant
> +                              PV-time feature is supported by the hypervisor.

What happens when a new guest tries this hypercall on an old host? Why not
return a bitmap of all supported features so the guest doesn't have to
do multiple hypercalls to build its bitmap?

> +
> +PV_TIME_LPT
> +    Function ID:  (uint32)  : 0xC5000021
> +    Flags:        (uint32)  : Bit[0]: Request migration interrupts
> +                                      (not currently supported by KVM)
> +    Return value: (int64)   : IPA of the shared live physical time data
> +                              structure or negative error code on failure:
> +                              NOT_SUPPORTED (-1)
> +                              INVALID_PARAMETERS (-2)
> +
> +PV_TIME_ST
> +    Function ID:  (uint32)  : 0xC5000022
> +    Return value: (int64)   : IPA of the stolen time data structure for this
> +                              (V)CPU. On failure:
> +                              NOT_SUPPORTED (-1)
> +
> +Live Physical Time
> +------------------
> +
> +The structure pointed to by the PV_TIME_LPT hypercall is as follows:
> +
> +  Field           | Byte Length | Byte Offset | Description
> +  --------------- | ----------- | ----------- | -------------------------------
> +  Revision        |      4      |      0      | Must be 0 for this revision
> +  Attributes      |      4      |      4      | Must be 0
> +  sequence_number |      8      |      8      | Bit 0: reserved
> +                  |             |             | Bits 1:63 number of migrations
> +  scale_mult      |      8      |      16     | Multiplier to scale from native
> +                  |             |             | frequency to PV frequency
> +  shift           |      4      |      24     | Shift applied before multiplier
> +  Reserved        |      4      |      28     | Must be 0
> +  Fn              |      8      |      32     | Native frequency
> +  Fpv             |      8      |      40     | Paravirtualized frequency seen
> +                  |             |             | by guest
> +  div_by_fpv_mult |      8      |      48     | Multiplier to implement fast
> +                  |             |             | divide by Fpv

Here's kvmclock's struct

 struct pvclock_vcpu_time_info {
        u32   version; 
        u32   pad0;
        u64   tsc_timestamp;
        u64   system_time;
        u32   tsc_to_system_mul;
        s8    tsc_shift;
        u8    flags;
        u8    pad[2];
 }

 Revision	 =>
 Attributes	 =>
 sequence_number => version
 scale_mult	 => tsc_to_system_mul	(this is reversed, but OK)
 shift		 => tsc_shift           (also reversed)
 Reserved	 =>
 Fn		 => (pvclock doesn't have, but does have system_time)
 Fpv		 =>
 div_by_fpv_mult =>

I haven't thought about this enough yet to be sure kvmclock's fields
are sufficient, but several look close - although the 'tsc' naming
isn't nice. Also, the pvclock struct could be extended by adding an
'extended' flag to 'flags', and then appending more fields.

> +
> +Where scale_mult is defined as 2^(64-shift) * Fpv / Fn
> +
> +The structure will be updated by the hypervisor whenever the guest is migrated
> +to a new host. It will be present within a reserved region of the normal
> +memory given to the guest. The guest should not attempt to write into this
> +memory.
> +
> +Stolen Time
> +-----------
> +
> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
> +
> +  Field       | Byte Length | Byte Offset | Description
> +  ----------- | ----------- | ----------- | --------------------------
> +  Revision    |      4      |      0      | Must be 0 for version 0.1
> +  Attributes  |      4      |      4      | Must be 0
> +  Stolen time |      8      |      8      | Stolen time in unsigned
> +              |             |             | nanoseconds indicating how
> +              |             |             | much time this VCPU thread
> +              |             |             | was involuntarily not
> +              |             |             | running on a physical CPU.
> +
> +The structure will be updated by the hypervisor periodically as time is stolen
> +from the VCPU. It will be present within a reserved region of the normal
> +memory given to the guest. The guest should not attempt to write into this
> +memory. There is a structure by VCPU of the guest.
> +
> +User space interface
> +====================
> +
> +User space can request that KVM provide the paravirtualized time interface to
> +a guest by creating a KVM_DEV_TYPE_ARM_PV_TIME device, for example:
> +
> +    struct kvm_create_device pvtime_device = {
> +            .type = KVM_DEV_TYPE_ARM_PV_TIME,
> +            .attr = 0,
> +            .flags = 0,
> +    };
> +
> +    pvtime_fd = ioctl(vm_fd, KVM_CREATE_DEVICE, &pvtime_device);
> +
> +The guest IPA of the structures must be given to KVM. This is the address of
> +the LPT structure and the base address of an array of stolen time structures
> +(one for each VCPU). For example:
> +
> +    struct kvm_device_attr lpt_base = {
> +            .group = KVM_DEV_ARM_PV_TIME_PADDR,
> +            .attr = KVM_DEV_ARM_PV_TIME_LPT,
> +            .addr = (u64)(unsigned long)&lpt_paddr
> +    };
> +    struct kvm_device_attr st_base = {
> +            .group = KVM_DEV_ARM_PV_TIME_PADDR,
> +            .attr = KVM_DEV_ARM_PV_TIME_ST,
> +            .addr = (u64)(unsigned long)&st_paddr
> +    };
> +
> +    ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &lpt_base);
> +    ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &st_base);
> +
> +The paravirtualized frequency of the guest can also be set. By default this
> +will be the counter frequency of the host. However when migrating a guest from
> +another host, this must be manually set to ensure that the guest sees the same
> +frequency.
> +
> +    u32 frequency;
> +
> +    struct kvm_device_attr lpt_freq = {
> +            .group = KVM_DEV_ARM_PV_TIME_FREQUENCY,
> +            .attr = KVM_DEV_ARM_PV_TIME_LPT,
> +            .addr = (u64)(unsigned long)&frequency
> +    };
> +
> +    ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &lpt_freq);
> +
> +For migration (or save/restore) of a guest it is necessary to save the contents
> +of the shared pages and later restore them. KVM_DEV_ARM_PV_TIME_STATE_SIZE
> +provides the size of this data and KVM_DEV_ARM_PV_TIME_STATE allows the state
> +to be read/written. The state for stolen time and LPT are accessed separately.

I guess there's no harm in userspace accessing these separately, but I
hope we'll be trying to put all the PV structures on one shared page
for the guest.

Thanks,
drew

> +It is also necessary for the physical address and frequency to be set
> +identically when restoring. The kernel will update the structure on first run
> +of the vCPU(s) to contain the new coefficients.
> +
> +    void *save_state(int fd, u64 attr, u32 *size) {
> +        struct kvm_device_attr get_size = {
> +                .group = KVM_DEV_ARM_PV_TIME_STATE_SIZE,
> +                .attr = attr,
> +                .addr = (u64)(unsigned long)size
> +        };
> +
> +        ioctl(fd, KVM_GET_DEVICE_ATTR, get_size);
> +
> +        void *buffer = malloc(*size);
> +
> +        struct kvm_device_attr get_state = {
> +                .group = KVM_DEV_ARM_PV_TIME_STATE,
> +                .attr = attr,
> +                .addr = (u64)(unsigned long)size
> +        };
> +
> +        ioctl(fd, KVM_GET_DEVICE_ATTR, buffer);
> +    }
> +
> +    void *lpt_state = save_state(pvtime_fd, KVM_DEV_ARM_PV_TIME_LPT, &lpt_size);
> +    void *st_state = save_state(pvtime_fd, KVM_DEV_ARM_PV_TIME_ST, &st_size);
> +
> -- 
> 2.19.2
> 
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Marc Zyngier - Dec. 3, 2018, 2:18 p.m.
Hi Drew,

On 03/12/2018 13:50, Andrew Jones wrote:
> On Wed, Nov 28, 2018 at 02:45:16PM +0000, Steven Price wrote:
>> We introduce a paravirtualization interface for KVM/arm64 based on the
>> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
>>
>> User space can specify a reserved area of memory for the guest and
>> inform KVM to populate the memory with information on stolen time and
>> Live Physical Time (LPT) that can be used to derive a stable
>> counter/timer for a guest subject to migration between hosts with
>> different counter frequencies.
>>
>> A hypercall interface is provided for the guest to interrogate the
>> hypervisor's support for this interface and the location of the shared
>> memory structures.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  Documentation/virtual/kvm/arm/pvtime.txt | 169 +++++++++++++++++++++++
>>  1 file changed, 169 insertions(+)
>>  create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
>>
>> diff --git a/Documentation/virtual/kvm/arm/pvtime.txt b/Documentation/virtual/kvm/arm/pvtime.txt
>> new file mode 100644
>> index 000000000000..1870b904075b
>> --- /dev/null
>> +++ b/Documentation/virtual/kvm/arm/pvtime.txt
>> @@ -0,0 +1,169 @@
>> +Paravirtualized time support for arm64
>> +======================================
>> +
>> +Arm specification DEN0057/A defined a standard for paravirtualised time
>> +support for Aarch64 guests:
>> +
>> +https://developer.arm.com/docs/den0057/a
>> +
>> +KVM/Arm64 implements this specification by providing some hypervisor service
>> +calls to support a paravirtualized guest obtaining a view of the amount of
>> +time stolen from its execution and a concept of Live Physical Time (LPT) which
>> +represents time during which the guest is running and works across migrations.
>> +
>> +Three new SMCCC compatible hypercalls are defined:
>> +
>> +PV_FEATURES 0xC5000020
>> +PV_TIME_LPT 0xC5000021
>> +PV_TIME_ST  0xC5000022
>> +
>> +These are only available in the SMC64/HVC64 calling convention as
>> +paravirtualized time is not available to 32 bit Arm guests.
>> +
>> +PV_FEATURES
>> +    Function ID:  (uint32)  : 0xC5000020
>> +    PV_func_id:   (uint32)  : Either PV_TIME_LPT or PV_TIME_ST
>> +    Return value: (int32)   : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant
>> +                              PV-time feature is supported by the hypervisor.
> 
> What happens when a new guest tries this hypercall on an old host? Why not
> return a bitmap of all supported features so the guest doesn't have to
> do multiple hypercalls to build its bitmap?

A guest doesn't call this directly. It first checks that the service is
implemented, and only if it is, it can call it. It is the exact same
mechanism we use for other hypercalls (such as the Spectre mitigation
services). If the guest bypasses the discovery mechanism, tough.

> 
>> +
>> +PV_TIME_LPT
>> +    Function ID:  (uint32)  : 0xC5000021
>> +    Flags:        (uint32)  : Bit[0]: Request migration interrupts
>> +                                      (not currently supported by KVM)
>> +    Return value: (int64)   : IPA of the shared live physical time data
>> +                              structure or negative error code on failure:
>> +                              NOT_SUPPORTED (-1)
>> +                              INVALID_PARAMETERS (-2)
>> +
>> +PV_TIME_ST
>> +    Function ID:  (uint32)  : 0xC5000022
>> +    Return value: (int64)   : IPA of the stolen time data structure for this
>> +                              (V)CPU. On failure:
>> +                              NOT_SUPPORTED (-1)
>> +
>> +Live Physical Time
>> +------------------
>> +
>> +The structure pointed to by the PV_TIME_LPT hypercall is as follows:
>> +
>> +  Field           | Byte Length | Byte Offset | Description
>> +  --------------- | ----------- | ----------- | -------------------------------
>> +  Revision        |      4      |      0      | Must be 0 for this revision
>> +  Attributes      |      4      |      4      | Must be 0
>> +  sequence_number |      8      |      8      | Bit 0: reserved
>> +                  |             |             | Bits 1:63 number of migrations
>> +  scale_mult      |      8      |      16     | Multiplier to scale from native
>> +                  |             |             | frequency to PV frequency
>> +  shift           |      4      |      24     | Shift applied before multiplier
>> +  Reserved        |      4      |      28     | Must be 0
>> +  Fn              |      8      |      32     | Native frequency
>> +  Fpv             |      8      |      40     | Paravirtualized frequency seen
>> +                  |             |             | by guest
>> +  div_by_fpv_mult |      8      |      48     | Multiplier to implement fast
>> +                  |             |             | divide by Fpv
> 
> Here's kvmclock's struct
> 
>  struct pvclock_vcpu_time_info {
>         u32   version; 
>         u32   pad0;
>         u64   tsc_timestamp;
>         u64   system_time;
>         u32   tsc_to_system_mul;
>         s8    tsc_shift;
>         u8    flags;
>         u8    pad[2];
>  }
> 
>  Revision	 =>
>  Attributes	 =>
>  sequence_number => version
>  scale_mult	 => tsc_to_system_mul	(this is reversed, but OK)
>  shift		 => tsc_shift           (also reversed)
>  Reserved	 =>
>  Fn		 => (pvclock doesn't have, but does have system_time)
>  Fpv		 =>
>  div_by_fpv_mult =>
> 
> I haven't thought about this enough yet to be sure kvmclock's fields
> are sufficient, but several look close - although the 'tsc' naming
> isn't nice. Also, the pvclock struct could be extended by adding an
> 'extended' flag to 'flags', and then appending more fields.

One important thing is that this is not a KVM PV time implementation.
This is something generic for the ARM architecture. Any ARM hypervisor
should be able to implement this.

Thanks,

	M.
Andrew Jones - Dec. 3, 2018, 3:16 p.m.
On Mon, Dec 03, 2018 at 02:18:25PM +0000, Marc Zyngier wrote:
> On 03/12/2018 13:50, Andrew Jones wrote:
> > On Wed, Nov 28, 2018 at 02:45:16PM +0000, Steven Price wrote:
> >> +The structure pointed to by the PV_TIME_LPT hypercall is as follows:
> >> +
> >> +  Field           | Byte Length | Byte Offset | Description
> >> +  --------------- | ----------- | ----------- | -------------------------------
> >> +  Revision        |      4      |      0      | Must be 0 for this revision
> >> +  Attributes      |      4      |      4      | Must be 0
> >> +  sequence_number |      8      |      8      | Bit 0: reserved
> >> +                  |             |             | Bits 1:63 number of migrations
> >> +  scale_mult      |      8      |      16     | Multiplier to scale from native
> >> +                  |             |             | frequency to PV frequency
> >> +  shift           |      4      |      24     | Shift applied before multiplier
> >> +  Reserved        |      4      |      28     | Must be 0
> >> +  Fn              |      8      |      32     | Native frequency
> >> +  Fpv             |      8      |      40     | Paravirtualized frequency seen
> >> +                  |             |             | by guest
> >> +  div_by_fpv_mult |      8      |      48     | Multiplier to implement fast
> >> +                  |             |             | divide by Fpv
> > 
> > Here's kvmclock's struct
> > 
> >  struct pvclock_vcpu_time_info {
> >         u32   version; 
> >         u32   pad0;
> >         u64   tsc_timestamp;
> >         u64   system_time;
> >         u32   tsc_to_system_mul;
> >         s8    tsc_shift;
> >         u8    flags;
> >         u8    pad[2];
> >  }
> > 
> >  Revision	 =>
> >  Attributes	 =>
> >  sequence_number => version
> >  scale_mult	 => tsc_to_system_mul	(this is reversed, but OK)
> >  shift		 => tsc_shift           (also reversed)
> >  Reserved	 =>
> >  Fn		 => (pvclock doesn't have, but does have system_time)
> >  Fpv		 =>
> >  div_by_fpv_mult =>
> > 
> > I haven't thought about this enough yet to be sure kvmclock's fields
> > are sufficient, but several look close - although the 'tsc' naming
> > isn't nice. Also, the pvclock struct could be extended by adding an
> > 'extended' flag to 'flags', and then appending more fields.
> 
> One important thing is that this is not a KVM PV time implementation.
> This is something generic for the ARM architecture. Any ARM hypervisor
> should be able to implement this.
>

I agree with that, but xen also uses the same pvclock structure on x86.

Thanks,
drew
Marc Zyngier - Dec. 3, 2018, 3:23 p.m.
On 03/12/2018 15:16, Andrew Jones wrote:
> On Mon, Dec 03, 2018 at 02:18:25PM +0000, Marc Zyngier wrote:
>> On 03/12/2018 13:50, Andrew Jones wrote:
>>> On Wed, Nov 28, 2018 at 02:45:16PM +0000, Steven Price wrote:
>>>> +The structure pointed to by the PV_TIME_LPT hypercall is as follows:
>>>> +
>>>> +  Field           | Byte Length | Byte Offset | Description
>>>> +  --------------- | ----------- | ----------- | -------------------------------
>>>> +  Revision        |      4      |      0      | Must be 0 for this revision
>>>> +  Attributes      |      4      |      4      | Must be 0
>>>> +  sequence_number |      8      |      8      | Bit 0: reserved
>>>> +                  |             |             | Bits 1:63 number of migrations
>>>> +  scale_mult      |      8      |      16     | Multiplier to scale from native
>>>> +                  |             |             | frequency to PV frequency
>>>> +  shift           |      4      |      24     | Shift applied before multiplier
>>>> +  Reserved        |      4      |      28     | Must be 0
>>>> +  Fn              |      8      |      32     | Native frequency
>>>> +  Fpv             |      8      |      40     | Paravirtualized frequency seen
>>>> +                  |             |             | by guest
>>>> +  div_by_fpv_mult |      8      |      48     | Multiplier to implement fast
>>>> +                  |             |             | divide by Fpv
>>>
>>> Here's kvmclock's struct
>>>
>>>  struct pvclock_vcpu_time_info {
>>>         u32   version; 
>>>         u32   pad0;
>>>         u64   tsc_timestamp;
>>>         u64   system_time;
>>>         u32   tsc_to_system_mul;
>>>         s8    tsc_shift;
>>>         u8    flags;
>>>         u8    pad[2];
>>>  }
>>>
>>>  Revision	 =>
>>>  Attributes	 =>
>>>  sequence_number => version
>>>  scale_mult	 => tsc_to_system_mul	(this is reversed, but OK)
>>>  shift		 => tsc_shift           (also reversed)
>>>  Reserved	 =>
>>>  Fn		 => (pvclock doesn't have, but does have system_time)
>>>  Fpv		 =>
>>>  div_by_fpv_mult =>
>>>
>>> I haven't thought about this enough yet to be sure kvmclock's fields
>>> are sufficient, but several look close - although the 'tsc' naming
>>> isn't nice. Also, the pvclock struct could be extended by adding an
>>> 'extended' flag to 'flags', and then appending more fields.
>>
>> One important thing is that this is not a KVM PV time implementation.
>> This is something generic for the ARM architecture. Any ARM hypervisor
>> should be able to implement this.
>>
> 
> I agree with that, but xen also uses the same pvclock structure on x86.

Which makes sense.

Each architecture has a way to express its PV time based on its own
requirements, and I'd expect Xen on arm64 to use the ARM data structure
(/me eyes the Xen/ARM maintainer...). I thought that what you were
advocating was to use the x86 layout on ARM, which didn't make complete
sense to me.

Thanks,

	M.
Andrew Jones - Dec. 3, 2018, 3:52 p.m.
On Mon, Dec 03, 2018 at 03:23:41PM +0000, Marc Zyngier wrote:
> On 03/12/2018 15:16, Andrew Jones wrote:
> > On Mon, Dec 03, 2018 at 02:18:25PM +0000, Marc Zyngier wrote:
> >> On 03/12/2018 13:50, Andrew Jones wrote:
> >>> On Wed, Nov 28, 2018 at 02:45:16PM +0000, Steven Price wrote:
> >>>> +The structure pointed to by the PV_TIME_LPT hypercall is as follows:
> >>>> +
> >>>> +  Field           | Byte Length | Byte Offset | Description
> >>>> +  --------------- | ----------- | ----------- | -------------------------------
> >>>> +  Revision        |      4      |      0      | Must be 0 for this revision
> >>>> +  Attributes      |      4      |      4      | Must be 0
> >>>> +  sequence_number |      8      |      8      | Bit 0: reserved
> >>>> +                  |             |             | Bits 1:63 number of migrations
> >>>> +  scale_mult      |      8      |      16     | Multiplier to scale from native
> >>>> +                  |             |             | frequency to PV frequency
> >>>> +  shift           |      4      |      24     | Shift applied before multiplier
> >>>> +  Reserved        |      4      |      28     | Must be 0
> >>>> +  Fn              |      8      |      32     | Native frequency
> >>>> +  Fpv             |      8      |      40     | Paravirtualized frequency seen
> >>>> +                  |             |             | by guest
> >>>> +  div_by_fpv_mult |      8      |      48     | Multiplier to implement fast
> >>>> +                  |             |             | divide by Fpv
> >>>
> >>> Here's kvmclock's struct
> >>>
> >>>  struct pvclock_vcpu_time_info {
> >>>         u32   version; 
> >>>         u32   pad0;
> >>>         u64   tsc_timestamp;
> >>>         u64   system_time;
> >>>         u32   tsc_to_system_mul;
> >>>         s8    tsc_shift;
> >>>         u8    flags;
> >>>         u8    pad[2];
> >>>  }
> >>>
> >>>  Revision	 =>
> >>>  Attributes	 =>
> >>>  sequence_number => version
> >>>  scale_mult	 => tsc_to_system_mul	(this is reversed, but OK)
> >>>  shift		 => tsc_shift           (also reversed)
> >>>  Reserved	 =>
> >>>  Fn		 => (pvclock doesn't have, but does have system_time)
> >>>  Fpv		 =>
> >>>  div_by_fpv_mult =>
> >>>
> >>> I haven't thought about this enough yet to be sure kvmclock's fields
> >>> are sufficient, but several look close - although the 'tsc' naming
> >>> isn't nice. Also, the pvclock struct could be extended by adding an
> >>> 'extended' flag to 'flags', and then appending more fields.
> >>
> >> One important thing is that this is not a KVM PV time implementation.
> >> This is something generic for the ARM architecture. Any ARM hypervisor
> >> should be able to implement this.
> >>
> > 
> > I agree with that, but xen also uses the same pvclock structure on x86.
> 
> Which makes sense.
> 
> Each architecture has a way to express its PV time based on its own
> requirements, and I'd expect Xen on arm64 to use the ARM data structure
> (/me eyes the Xen/ARM maintainer...). I thought that what you were
> advocating was to use the x86 layout on ARM, which didn't make complete
> sense to me.

I sort of was. I was wondering if pvclock shouldn't be improved in a
backward compatible way by using the flags field to extend it. Field names
could change and new fields can be appended. Then, when x86 kvm and xen
are updated to use the extended pvclock struct, there may be more
opportunity for code sharing across architectures - perhaps much of
kvmclock could be shared. Of course I'm just thinking out loud, and doing
more of the out loud part than the thinking...

Hmm, I do see that arm-xen has already adopted the x86 layouts without
modification and also has a TODO saying the struct definitions should put
somewhere arch neutral.

Thanks,
drew
Steven Price - Dec. 5, 2018, 12:32 p.m.
On 03/12/2018 13:50, Andrew Jones wrote:
> On Wed, Nov 28, 2018 at 02:45:16PM +0000, Steven Price wrote:
>> We introduce a paravirtualization interface for KVM/arm64 based on the
>> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
>>
>> User space can specify a reserved area of memory for the guest and
>> inform KVM to populate the memory with information on stolen time and
>> Live Physical Time (LPT) that can be used to derive a stable
>> counter/timer for a guest subject to migration between hosts with
>> different counter frequencies.
>>
>> A hypercall interface is provided for the guest to interrogate the
>> hypervisor's support for this interface and the location of the shared
>> memory structures.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  Documentation/virtual/kvm/arm/pvtime.txt | 169 +++++++++++++++++++++++
>>  1 file changed, 169 insertions(+)
>>  create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
>>
>> diff --git a/Documentation/virtual/kvm/arm/pvtime.txt b/Documentation/virtual/kvm/arm/pvtime.txt
>> new file mode 100644
>> index 000000000000..1870b904075b
>> --- /dev/null
>> +++ b/Documentation/virtual/kvm/arm/pvtime.txt
>> @@ -0,0 +1,169 @@
>> +Paravirtualized time support for arm64
>> +======================================
>> +
>> +Arm specification DEN0057/A defined a standard for paravirtualised time
>> +support for Aarch64 guests:
>> +
>> +https://developer.arm.com/docs/den0057/a
>> +
>> +KVM/Arm64 implements this specification by providing some hypervisor service
>> +calls to support a paravirtualized guest obtaining a view of the amount of
>> +time stolen from its execution and a concept of Live Physical Time (LPT) which
>> +represents time during which the guest is running and works across migrations.
>> +
>> +Three new SMCCC compatible hypercalls are defined:
>> +
>> +PV_FEATURES 0xC5000020
>> +PV_TIME_LPT 0xC5000021
>> +PV_TIME_ST  0xC5000022
>> +
>> +These are only available in the SMC64/HVC64 calling convention as
>> +paravirtualized time is not available to 32 bit Arm guests.
>> +
>> +PV_FEATURES
>> +    Function ID:  (uint32)  : 0xC5000020
>> +    PV_func_id:   (uint32)  : Either PV_TIME_LPT or PV_TIME_ST
>> +    Return value: (int32)   : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant
>> +                              PV-time feature is supported by the hypervisor.
> 
> What happens when a new guest tries this hypercall on an old host? Why not
> return a bitmap of all supported features so the guest doesn't have to
> do multiple hypercalls to build its bitmap?

As Marc pointed out - a guest shouldn't (the discover mechanism requires
using SMCCC ARCH_FEATURES). However if a guest was to ignore this it
would still receive the default NOT_SUPPORTED (-1) response.

The design of PV_FEATURES matches the existing SMCCC ARCH_FEATURES so
does require multiple calls. It would be possible to return a bitmap but
that would limit the room for expansion. It's not exactly a performance
critical path ;)

>> +
>> +PV_TIME_LPT
>> +    Function ID:  (uint32)  : 0xC5000021
>> +    Flags:        (uint32)  : Bit[0]: Request migration interrupts
>> +                                      (not currently supported by KVM)
>> +    Return value: (int64)   : IPA of the shared live physical time data
>> +                              structure or negative error code on failure:
>> +                              NOT_SUPPORTED (-1)
>> +                              INVALID_PARAMETERS (-2)
>> +
>> +PV_TIME_ST
>> +    Function ID:  (uint32)  : 0xC5000022
>> +    Return value: (int64)   : IPA of the stolen time data structure for this
>> +                              (V)CPU. On failure:
>> +                              NOT_SUPPORTED (-1)
>> +
>> +Live Physical Time
>> +------------------
>> +
>> +The structure pointed to by the PV_TIME_LPT hypercall is as follows:
>> +
>> +  Field           | Byte Length | Byte Offset | Description
>> +  --------------- | ----------- | ----------- | -------------------------------
>> +  Revision        |      4      |      0      | Must be 0 for this revision
>> +  Attributes      |      4      |      4      | Must be 0
>> +  sequence_number |      8      |      8      | Bit 0: reserved
>> +                  |             |             | Bits 1:63 number of migrations
>> +  scale_mult      |      8      |      16     | Multiplier to scale from native
>> +                  |             |             | frequency to PV frequency
>> +  shift           |      4      |      24     | Shift applied before multiplier
>> +  Reserved        |      4      |      28     | Must be 0
>> +  Fn              |      8      |      32     | Native frequency
>> +  Fpv             |      8      |      40     | Paravirtualized frequency seen
>> +                  |             |             | by guest
>> +  div_by_fpv_mult |      8      |      48     | Multiplier to implement fast
>> +                  |             |             | divide by Fpv
> 
> Here's kvmclock's struct
> 
>  struct pvclock_vcpu_time_info {
>         u32   version; 
>         u32   pad0;
>         u64   tsc_timestamp;
>         u64   system_time;
>         u32   tsc_to_system_mul;
>         s8    tsc_shift;
>         u8    flags;
>         u8    pad[2];
>  }
> 
>  Revision	 =>
>  Attributes	 =>
>  sequence_number => version
>  scale_mult	 => tsc_to_system_mul	(this is reversed, but OK)
>  shift		 => tsc_shift           (also reversed)
>  Reserved	 =>
>  Fn		 => (pvclock doesn't have, but does have system_time)
>  Fpv		 =>
>  div_by_fpv_mult =>
> 
> I haven't thought about this enough yet to be sure kvmclock's fields
> are sufficient, but several look close - although the 'tsc' naming
> isn't nice. Also, the pvclock struct could be extended by adding an
> 'extended' flag to 'flags', and then appending more fields.

While it may be possible to try to map the fields onto the existing
structure it's not going to be a great fit, and I don't really see the
point. Unless the meaning is going to exactly match x86 then it would be
the same structure in name only.

>> +
>> +Where scale_mult is defined as 2^(64-shift) * Fpv / Fn
>> +
>> +The structure will be updated by the hypervisor whenever the guest is migrated
>> +to a new host. It will be present within a reserved region of the normal
>> +memory given to the guest. The guest should not attempt to write into this
>> +memory.
>> +
>> +Stolen Time
>> +-----------
>> +
>> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
>> +
>> +  Field       | Byte Length | Byte Offset | Description
>> +  ----------- | ----------- | ----------- | --------------------------
>> +  Revision    |      4      |      0      | Must be 0 for version 0.1
>> +  Attributes  |      4      |      4      | Must be 0
>> +  Stolen time |      8      |      8      | Stolen time in unsigned
>> +              |             |             | nanoseconds indicating how
>> +              |             |             | much time this VCPU thread
>> +              |             |             | was involuntarily not
>> +              |             |             | running on a physical CPU.
>> +
>> +The structure will be updated by the hypervisor periodically as time is stolen
>> +from the VCPU. It will be present within a reserved region of the normal
>> +memory given to the guest. The guest should not attempt to write into this
>> +memory. There is a structure by VCPU of the guest.
>> +
>> +User space interface
>> +====================
>> +
>> +User space can request that KVM provide the paravirtualized time interface to
>> +a guest by creating a KVM_DEV_TYPE_ARM_PV_TIME device, for example:
>> +
>> +    struct kvm_create_device pvtime_device = {
>> +            .type = KVM_DEV_TYPE_ARM_PV_TIME,
>> +            .attr = 0,
>> +            .flags = 0,
>> +    };
>> +
>> +    pvtime_fd = ioctl(vm_fd, KVM_CREATE_DEVICE, &pvtime_device);
>> +
>> +The guest IPA of the structures must be given to KVM. This is the address of
>> +the LPT structure and the base address of an array of stolen time structures
>> +(one for each VCPU). For example:
>> +
>> +    struct kvm_device_attr lpt_base = {
>> +            .group = KVM_DEV_ARM_PV_TIME_PADDR,
>> +            .attr = KVM_DEV_ARM_PV_TIME_LPT,
>> +            .addr = (u64)(unsigned long)&lpt_paddr
>> +    };
>> +    struct kvm_device_attr st_base = {
>> +            .group = KVM_DEV_ARM_PV_TIME_PADDR,
>> +            .attr = KVM_DEV_ARM_PV_TIME_ST,
>> +            .addr = (u64)(unsigned long)&st_paddr
>> +    };
>> +
>> +    ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &lpt_base);
>> +    ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &st_base);
>> +
>> +The paravirtualized frequency of the guest can also be set. By default this
>> +will be the counter frequency of the host. However when migrating a guest from
>> +another host, this must be manually set to ensure that the guest sees the same
>> +frequency.
>> +
>> +    u32 frequency;
>> +
>> +    struct kvm_device_attr lpt_freq = {
>> +            .group = KVM_DEV_ARM_PV_TIME_FREQUENCY,
>> +            .attr = KVM_DEV_ARM_PV_TIME_LPT,
>> +            .addr = (u64)(unsigned long)&frequency
>> +    };
>> +
>> +    ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &lpt_freq);
>> +
>> +For migration (or save/restore) of a guest it is necessary to save the contents
>> +of the shared pages and later restore them. KVM_DEV_ARM_PV_TIME_STATE_SIZE
>> +provides the size of this data and KVM_DEV_ARM_PV_TIME_STATE allows the state
>> +to be read/written. The state for stolen time and LPT are accessed separately.
> 
> I guess there's no harm in userspace accessing these separately, but I
> hope we'll be trying to put all the PV structures on one shared page
> for the guest.

Note that it's possible to only just one of the PV mechanisms. In
particular if the hardware in the future supports clock scaling then we
won't need LPT any more so you would just want to expose stolen time.

The API allows the hypervisor to place the structure(s) arbitrarily in
physical memory - the requirement is just to present them at the correct
IPA to the guest. The implementation in this series uses one page for
LPT and (at least) one page for stolen time.

Thanks for the review,

Steve

> Thanks,
> drew
> 
>> +It is also necessary for the physical address and frequency to be set
>> +identically when restoring. The kernel will update the structure on first run
>> +of the vCPU(s) to contain the new coefficients.
>> +
>> +    void *save_state(int fd, u64 attr, u32 *size) {
>> +        struct kvm_device_attr get_size = {
>> +                .group = KVM_DEV_ARM_PV_TIME_STATE_SIZE,
>> +                .attr = attr,
>> +                .addr = (u64)(unsigned long)size
>> +        };
>> +
>> +        ioctl(fd, KVM_GET_DEVICE_ATTR, get_size);
>> +
>> +        void *buffer = malloc(*size);
>> +
>> +        struct kvm_device_attr get_state = {
>> +                .group = KVM_DEV_ARM_PV_TIME_STATE,
>> +                .attr = attr,
>> +                .addr = (u64)(unsigned long)size
>> +        };
>> +
>> +        ioctl(fd, KVM_GET_DEVICE_ATTR, buffer);
>> +    }
>> +
>> +    void *lpt_state = save_state(pvtime_fd, KVM_DEV_ARM_PV_TIME_LPT, &lpt_size);
>> +    void *st_state = save_state(pvtime_fd, KVM_DEV_ARM_PV_TIME_ST, &st_size);
>> +
>> -- 
>> 2.19.2
>>
>> _______________________________________________
>> kvmarm mailing list
>> kvmarm@lists.cs.columbia.edu
>> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>

Patch

diff --git a/Documentation/virtual/kvm/arm/pvtime.txt b/Documentation/virtual/kvm/arm/pvtime.txt
new file mode 100644
index 000000000000..1870b904075b
--- /dev/null
+++ b/Documentation/virtual/kvm/arm/pvtime.txt
@@ -0,0 +1,169 @@ 
+Paravirtualized time support for arm64
+======================================
+
+Arm specification DEN0057/A defined a standard for paravirtualised time
+support for Aarch64 guests:
+
+https://developer.arm.com/docs/den0057/a
+
+KVM/Arm64 implements this specification by providing some hypervisor service
+calls to support a paravirtualized guest obtaining a view of the amount of
+time stolen from its execution and a concept of Live Physical Time (LPT) which
+represents time during which the guest is running and works across migrations.
+
+Three new SMCCC compatible hypercalls are defined:
+
+PV_FEATURES 0xC5000020
+PV_TIME_LPT 0xC5000021
+PV_TIME_ST  0xC5000022
+
+These are only available in the SMC64/HVC64 calling convention as
+paravirtualized time is not available to 32 bit Arm guests.
+
+PV_FEATURES
+    Function ID:  (uint32)  : 0xC5000020
+    PV_func_id:   (uint32)  : Either PV_TIME_LPT or PV_TIME_ST
+    Return value: (int32)   : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant
+                              PV-time feature is supported by the hypervisor.
+
+PV_TIME_LPT
+    Function ID:  (uint32)  : 0xC5000021
+    Flags:        (uint32)  : Bit[0]: Request migration interrupts
+                                      (not currently supported by KVM)
+    Return value: (int64)   : IPA of the shared live physical time data
+                              structure or negative error code on failure:
+                              NOT_SUPPORTED (-1)
+                              INVALID_PARAMETERS (-2)
+
+PV_TIME_ST
+    Function ID:  (uint32)  : 0xC5000022
+    Return value: (int64)   : IPA of the stolen time data structure for this
+                              (V)CPU. On failure:
+                              NOT_SUPPORTED (-1)
+
+Live Physical Time
+------------------
+
+The structure pointed to by the PV_TIME_LPT hypercall is as follows:
+
+  Field           | Byte Length | Byte Offset | Description
+  --------------- | ----------- | ----------- | -------------------------------
+  Revision        |      4      |      0      | Must be 0 for this revision
+  Attributes      |      4      |      4      | Must be 0
+  sequence_number |      8      |      8      | Bit 0: reserved
+                  |             |             | Bits 1:63 number of migrations
+  scale_mult      |      8      |      16     | Multiplier to scale from native
+                  |             |             | frequency to PV frequency
+  shift           |      4      |      24     | Shift applied before multiplier
+  Reserved        |      4      |      28     | Must be 0
+  Fn              |      8      |      32     | Native frequency
+  Fpv             |      8      |      40     | Paravirtualized frequency seen
+                  |             |             | by guest
+  div_by_fpv_mult |      8      |      48     | Multiplier to implement fast
+                  |             |             | divide by Fpv
+
+Where scale_mult is defined as 2^(64-shift) * Fpv / Fn
+
+The structure will be updated by the hypervisor whenever the guest is migrated
+to a new host. It will be present within a reserved region of the normal
+memory given to the guest. The guest should not attempt to write into this
+memory.
+
+Stolen Time
+-----------
+
+The structure pointed to by the PV_TIME_ST hypercall is as follows:
+
+  Field       | Byte Length | Byte Offset | Description
+  ----------- | ----------- | ----------- | --------------------------
+  Revision    |      4      |      0      | Must be 0 for version 0.1
+  Attributes  |      4      |      4      | Must be 0
+  Stolen time |      8      |      8      | Stolen time in unsigned
+              |             |             | nanoseconds indicating how
+              |             |             | much time this VCPU thread
+              |             |             | was involuntarily not
+              |             |             | running on a physical CPU.
+
+The structure will be updated by the hypervisor periodically as time is stolen
+from the VCPU. It will be present within a reserved region of the normal
+memory given to the guest. The guest should not attempt to write into this
+memory. There is a structure by VCPU of the guest.
+
+User space interface
+====================
+
+User space can request that KVM provide the paravirtualized time interface to
+a guest by creating a KVM_DEV_TYPE_ARM_PV_TIME device, for example:
+
+    struct kvm_create_device pvtime_device = {
+            .type = KVM_DEV_TYPE_ARM_PV_TIME,
+            .attr = 0,
+            .flags = 0,
+    };
+
+    pvtime_fd = ioctl(vm_fd, KVM_CREATE_DEVICE, &pvtime_device);
+
+The guest IPA of the structures must be given to KVM. This is the address of
+the LPT structure and the base address of an array of stolen time structures
+(one for each VCPU). For example:
+
+    struct kvm_device_attr lpt_base = {
+            .group = KVM_DEV_ARM_PV_TIME_PADDR,
+            .attr = KVM_DEV_ARM_PV_TIME_LPT,
+            .addr = (u64)(unsigned long)&lpt_paddr
+    };
+    struct kvm_device_attr st_base = {
+            .group = KVM_DEV_ARM_PV_TIME_PADDR,
+            .attr = KVM_DEV_ARM_PV_TIME_ST,
+            .addr = (u64)(unsigned long)&st_paddr
+    };
+
+    ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &lpt_base);
+    ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &st_base);
+
+The paravirtualized frequency of the guest can also be set. By default this
+will be the counter frequency of the host. However when migrating a guest from
+another host, this must be manually set to ensure that the guest sees the same
+frequency.
+
+    u32 frequency;
+
+    struct kvm_device_attr lpt_freq = {
+            .group = KVM_DEV_ARM_PV_TIME_FREQUENCY,
+            .attr = KVM_DEV_ARM_PV_TIME_LPT,
+            .addr = (u64)(unsigned long)&frequency
+    };
+
+    ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &lpt_freq);
+
+For migration (or save/restore) of a guest it is necessary to save the contents
+of the shared pages and later restore them. KVM_DEV_ARM_PV_TIME_STATE_SIZE
+provides the size of this data and KVM_DEV_ARM_PV_TIME_STATE allows the state
+to be read/written. The state for stolen time and LPT are accessed separately.
+It is also necessary for the physical address and frequency to be set
+identically when restoring. The kernel will update the structure on first run
+of the vCPU(s) to contain the new coefficients.
+
+    void *save_state(int fd, u64 attr, u32 *size) {
+        struct kvm_device_attr get_size = {
+                .group = KVM_DEV_ARM_PV_TIME_STATE_SIZE,
+                .attr = attr,
+                .addr = (u64)(unsigned long)size
+        };
+
+        ioctl(fd, KVM_GET_DEVICE_ATTR, get_size);
+
+        void *buffer = malloc(*size);
+
+        struct kvm_device_attr get_state = {
+                .group = KVM_DEV_ARM_PV_TIME_STATE,
+                .attr = attr,
+                .addr = (u64)(unsigned long)size
+        };
+
+        ioctl(fd, KVM_GET_DEVICE_ATTR, buffer);
+    }
+
+    void *lpt_state = save_state(pvtime_fd, KVM_DEV_ARM_PV_TIME_LPT, &lpt_size);
+    void *st_state = save_state(pvtime_fd, KVM_DEV_ARM_PV_TIME_ST, &st_size);
+