Patchwork [v1,5/5] s390: do not call memory_region_allocate_system_memory() multiple times

Submitter Igor Mammedov
Date April 15, 2019, 1:27 p.m.
Message ID <1555334842-195718-6-git-send-email-imammedo@redhat.com>
Permalink /patch/773143/
State New

Comments

Igor Mammedov - April 15, 2019, 1:27 p.m.
s390 was trying to work around the limited memslot size by abusing
memory_region_allocate_system_memory(), which breaks the API contract
that the function may be called only once.

s390 should have used memory aliases to fragment initial memory into
smaller chunks to satisfy KVM's memslot limitation. But it's a bit
late now, since the allocated pieces are transferred in the migration
stream separately, so it's not possible to just replace the broken
layout with a correct one. The previous patch made MemoryRegion
aliases migratable, and this patch switches to using them to split the
big initial RAM chunk into smaller pieces of up to KVM_SLOT_MAX_BYTES
each and registers the aliases for migration.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
I don't have access to a suitable system to test it, so I've simulated
it with smaller chunks on an x86 host. Ping-pong migration between old
and new QEMU worked fine. The KVM part should be fine, as memslots
use mapped MemoryRegions (in this case they would be aliases) as
far as I know, but if someone could test it on a big enough host that
would be nice.
---
 hw/s390x/s390-virtio-ccw.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)
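
To make the intended layout concrete, here is a standalone sketch of the
chunking arithmetic (plain C, not QEMU code; per the comment in the patch the
actual KVM_SLOT_MAX_BYTES sits just under 8 TiB, rounded to exactly 8 TiB here
for illustration):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define KVM_SLOT_MAX_BYTES (8ULL << 40)  /* assumed: the "< 8 TB" memslot cap */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

int main(void)
{
    uint64_t mem_size = 9ULL << 40;      /* e.g. the "-m 9T" case */
    uint64_t chunk, offset = 0;
    unsigned int number = 0;

    /* same loop shape as s390_memory_init(): carve the one big RAMBlock
     * into alias-sized chunks */
    while (mem_size) {
        chunk = MIN(mem_size, KVM_SLOT_MAX_BYTES);
        printf("alias %u: guest offset 0x%011" PRIx64 ", size 0x%011" PRIx64 "\n",
               number, offset, chunk);
        mem_size -= chunk;
        offset += chunk;
        number++;
    }
    return 0;
}

For "-m 9T" this yields two aliases, [0, 8 TiB) and [8 TiB, 9 TiB), each of
which should become its own KVM memslot.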
Christian Borntraeger - April 16, 2019, 11:01 a.m.
This crashes a simple -kernel -initrd example on s390x.

#0  0x000003ff94e3e47c in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x000003ff94e23d18 in __GI_abort () at abort.c:79
#2  0x000003ff94e365e6 in __assert_fail_base
    (fmt=0x3ff94f60ca6 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x14aac70 "new_block", file=file@entry=0x13b1168 "/home/cborntra/REPOS/qemu/exec.c", line=line@entry=2041, function=function@entry=0x13b0c8a <__PRETTY_FUNCTION__.34656> "qemu_ram_set_idstr") at assert.c:92
#3  0x000003ff94e36664 in __GI___assert_fail
    (assertion=assertion@entry=0x14aac70 "new_block", file=file@entry=0x13b1168 "/home/cborntra/REPOS/qemu/exec.c", line=line@entry=2041, function=function@entry=0x13b0c8a <__PRETTY_FUNCTION__.34656> "qemu_ram_set_idstr") at assert.c:101
#4  0x000000000102e062 in qemu_ram_set_idstr (new_block=new_block@entry=0x0, name=<optimized out>, dev=dev@entry=0x0) at /home/cborntra/REPOS/qemu/exec.c:2041
#5  0x00000000011f5b0a in vmstate_register_ram (mr=0x2cd2dd0, mr@entry=<error reading variable: value has been optimized out>, dev=dev@entry=0x0) at /home/cborntra/REPOS/qemu/migration/savevm.c:2828
#6  0x00000000011f5b5a in vmstate_register_ram_global (mr=<error reading variable: value has been optimized out>) at /home/cborntra/REPOS/qemu/migration/savevm.c:2841
#7  0x000000000110d2ce in s390_memory_init (mem_size=<optimized out>) at /home/cborntra/REPOS/qemu/hw/s390x/s390-virtio-ccw.c:186
#8  0x000000000110d2ce in ccw_init (machine=0x2a96770) at /home/cborntra/REPOS/qemu/hw/s390x/s390-virtio-ccw.c:266
#9  0x00000000011b342c in machine_run_board_init (machine=0x2a96770) at /home/cborntra/REPOS/qemu/hw/core/machine.c:1030
#10 0x0000000001026fee in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /home/cborntra/REPOS/qemu/vl.c:4479



On 15.04.19 15:27, Igor Mammedov wrote:
> [...]
Christian Borntraeger - April 16, 2019, 11:02 a.m.
On 16.04.19 13:01, Christian Borntraeger wrote:
> This crashes a simple -kernel -initrd example on s390x.
> [...]

Sorry, I forgot to also apply patch 4. With that, the crash is gone. I will have a look at the patch.
Christian Borntraeger - April 16, 2019, 11:09 a.m.
This fails with more than 8 TB, e.g. "-m 9T".

[pid 231065] ioctl(10, KVM_SET_USER_MEMORY_REGION, {slot=0, flags=0, guest_phys_addr=0, memory_size=0, userspace_addr=0x3ffc8500000}) = 0
[pid 231065] ioctl(10, KVM_SET_USER_MEMORY_REGION, {slot=0, flags=0, guest_phys_addr=0, memory_size=9895604649984, userspace_addr=0x3ffc8500000}) = -1 EINVAL (Invalid argument)

It seems that the 2nd memslot gets the full size (and not 9 TB minus the size of the first slot).


On 15.04.19 15:27, Igor Mammedov wrote:
> [...]
Igor Mammedov - April 17, 2019, 2:30 p.m.
On Tue, 16 Apr 2019 13:09:08 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> This fails with more than 8 TB, e.g. "-m 9T".
> 
> [pid 231065] ioctl(10, KVM_SET_USER_MEMORY_REGION, {slot=0, flags=0, guest_phys_addr=0, memory_size=0, userspace_addr=0x3ffc8500000}) = 0
> [pid 231065] ioctl(10, KVM_SET_USER_MEMORY_REGION, {slot=0, flags=0, guest_phys_addr=0, memory_size=9895604649984, userspace_addr=0x3ffc8500000}) = -1 EINVAL (Invalid argument)
> 
> It seems that the 2nd memslot gets the full size (and not 9 TB minus the size of the first slot).

I'm able to simulate the issue on an s390 host with KVM enabled; it looks like
memory region aliases are broken on the s390 host (aliasing works as expected on
x86, where it is used for splitting RAM into low and high memory).
I'll try to debug and find out where it goes off on a tangent.

> On 15.04.19 15:27, Igor Mammedov wrote:
> > [...]
Igor Mammedov - April 18, 2019, 9:38 a.m.
On Tue, 16 Apr 2019 13:09:08 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> This fails with more than 8 TB, e.g. "-m 9T".
> 
> [pid 231065] ioctl(10, KVM_SET_USER_MEMORY_REGION, {slot=0, flags=0, guest_phys_addr=0, memory_size=0, userspace_addr=0x3ffc8500000}) = 0
> [pid 231065] ioctl(10, KVM_SET_USER_MEMORY_REGION, {slot=0, flags=0, guest_phys_addr=0, memory_size=9895604649984, userspace_addr=0x3ffc8500000}) = -1 EINVAL (Invalid argument)
> 
> It seems that the 2nd memslot gets the full size (and not 9 TB minus the size of the first slot).

It turns out the MemoryRegion is rendered correctly into 2 parts (one per
alias), but a follow-up flatview_simplify() collapses the adjacent ranges back
into one big one.

I see 2 ways to approach it:
 1. 'improve' the memory region API so we could disable merging for
    a specific memory region (i.e. the RAM-providing memory region).
    (I don't particularly like the idea of twisting this API to serve a
    KVM-specific purpose.)

 2. hide the KVMism in KVM code. Move KVM_SLOT_MAX_BYTES out of the s390
    machine code and handle splitting a big chunk into several (of up to
    KVM_SLOT_MAX_BYTES each) in kvm_set_phys_mem(); see the sketch below.
    We could add KVMState::max_slot_size, which is set only by s390,
    so it won't affect other targets.
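
A rough sketch of what (2) could look like (assumed names:
KVMState::max_slot_size and a hypothetical kvm_register_one_slot() helper
standing in for the existing slot add/remove logic; simplified and untested):

/* Sketch only: split an incoming section into max_slot_size pieces inside
 * the KVM memory listener, so board code no longer needs to know about
 * the limit. kvm_register_one_slot() is a hypothetical helper. */
static void kvm_set_phys_mem(KVMMemoryListener *kml,
                             MemoryRegionSection *section, bool add)
{
    hwaddr start = section->offset_within_address_space;
    hwaddr size = int128_get64(section->size);
    /* only s390 would set max_slot_size; 0 means "no limit" */
    hwaddr max = kvm_state->max_slot_size ? kvm_state->max_slot_size : size;

    while (size) {
        hwaddr slot_size = MIN(size, max);

        kvm_register_one_slot(kml, start, slot_size, add);
        start += slot_size;
        size -= slot_size;
    }
}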

Paolo,
 I'd like to get your opinion/suggestion on which direction I should look into.

> On 15.04.19 15:27, Igor Mammedov wrote:
> > [...]
David Hildenbrand - April 18, 2019, 11:24 a.m.
On 18.04.19 11:38, Igor Mammedov wrote:
> On Tue, 16 Apr 2019 13:09:08 +0200
> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
>> [...]
> 
> It turns out the MemoryRegion is rendered correctly into 2 parts (one per
> alias), but a follow-up flatview_simplify() collapses the adjacent ranges back
> into one big one.

That sounds dangerous. Imagine doing that at runtime (e.g. hotplugging a
DIMM): the KVM memory slot would temporarily be deleted to insert the new,
bigger one. The guest would crash. This could happen if the backing memory
of two DIMMs were, by pure luck, allocated side by side in user space.
Igor Mammedov - April 18, 2019, 12:01 p.m.
On Thu, 18 Apr 2019 13:24:43 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 18.04.19 11:38, Igor Mammedov wrote:
> > [...]
> 
> That sounds dangerous. Imagine doing that at runtime (e.g. hotplugging a
> DIMM): the KVM memory slot would temporarily be deleted to insert the new,
> bigger one. The guest would crash. This could happen if the backing memory
> of two DIMMs were, by pure luck, allocated side by side in user space.
> 

I'm not sure I fully get your concerns, but if you look at can_merge(),
it ensures that merged ranges belong to the same MemoryRegion.

It's hard for me to say whether flatview_simplify() works as designed;
the MemoryRegion code is quite complicated, so I'd defer to Paolo's
opinion.
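
For reference, the merge condition and the coalescing pass look roughly like
this (a simplified paraphrase of memory.c, not the exact upstream code):

/* Simplified paraphrase, not the exact upstream code: two FlatRanges merge
 * only if they are adjacent in guest address space, resolve to the same
 * terminal MemoryRegion (aliases are resolved while rendering the
 * FlatView), and are contiguous in their offsets within that region. */
static bool can_merge(FlatRange *r1, FlatRange *r2)
{
    return int128_eq(addrrange_end(r1->addr), r2->addr.start)
        && r1->mr == r2->mr
        && r1->offset_in_region + int128_get64(r1->addr.size)
           == r2->offset_in_region
        && r1->dirty_log_mask == r2->dirty_log_mask
        && r1->readonly == r2->readonly;
}

/* flatview_simplify() then walks the sorted ranges and folds every
 * can_merge() pair into one bigger range -- which is how the 8 TiB and
 * 1 TiB alias ranges end up rendered as a single 9 TiB section. */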
David Hildenbrand - April 18, 2019, 12:06 p.m.
On 18.04.19 14:01, Igor Mammedov wrote:
> On Thu, 18 Apr 2019 13:24:43 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> [...]
> 
> I'm not sure I fully get your concerns, but if you look at can_merge(),
> it ensures that merged ranges belong to the same MemoryRegion.
> 
> It's hard for me to say whether flatview_simplify() works as designed;
> the MemoryRegion code is quite complicated, so I'd defer to Paolo's
> opinion.
> 

What I had in mind:

We have the memory region for memory devices (m->device_memory).

Assume the first DIMM is created, allocating memory in the user space
process at:

[0x100000000 .. 0x200000000]. It is placed at offset 0 in m->device_memory.

The guest starts to run, and a second DIMM is hotplugged. Memory in the user
space process is allocated (by pure luck) at:

[0x200000000 .. 0x300000000]. It is placed at offset 0x100000000 in
m->device_memory.

Without looking at the code, I could imagine that both might be merged
into a single memory slot. That is my concern. Maybe it is not valid.
Igor Mammedov - April 18, 2019, 2:56 p.m.
On Thu, 18 Apr 2019 14:06:25 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 18.04.19 14:01, Igor Mammedov wrote:
> > [...]
> 
> What I had in mind:
> 
> We have the memory region for memory devices (m->device_memory).
> 
> Assume the first DIMM is created, allocating memory in the user space
> process at:
> 
> [0x100000000 .. 0x200000000]. It is placed at offset 0 in m->device_memory.
> 
> The guest starts to run, and a second DIMM is hotplugged. Memory in the user
> space process is allocated (by pure luck) at:
> 
> [0x200000000 .. 0x300000000]. It is placed at offset 0x100000000 in
> m->device_memory.
> 
> Without looking at the code, I could imagine that both might be merged
> into a single memory slot. That is my concern. Maybe it is not valid.
It's not. As far as I can see, ranges are merged only if they belong to
the same 'mr'. So two DIMMs will result in 2 memory sections -> 2 KVMSlots.
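
A toy model of that rule (standalone C, not QEMU code) shows why two DIMMs
stay separate while the two s390 aliases collapse into one range:

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

/* each flat range remembers which backing region it resolves to (mr),
 * where it sits in guest address space (gpa), and its offset within
 * the backing region */
struct range { int mr; uint64_t gpa, size, offset; };

static bool can_merge(const struct range *a, const struct range *b)
{
    return a->mr == b->mr
        && a->gpa + a->size == b->gpa
        && a->offset + a->size == b->offset;
}

int main(void)
{
    /* two hotplugged DIMMs: adjacent in guest address space, but each
     * backed by its own RAM MemoryRegion (mr 0 vs mr 1) */
    struct range dimm0 = { 0, 0x100000000ULL, 0x100000000ULL, 0 };
    struct range dimm1 = { 1, 0x200000000ULL, 0x100000000ULL, 0 };

    /* two s390 aliases into the same "s390.whole.ram" region (mr 2),
     * contiguous in both guest address space and region offset */
    struct range alias0 = { 2, 0, 8ULL << 40, 0 };
    struct range alias1 = { 2, 8ULL << 40, 1ULL << 40, 8ULL << 40 };

    printf("dimm0 + dimm1 merge:   %s\n", can_merge(&dimm0, &dimm1) ? "yes" : "no");
    printf("alias0 + alias1 merge: %s\n", can_merge(&alias0, &alias1) ? "yes" : "no");
    return 0;
}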
David Hildenbrand - April 18, 2019, 3:01 p.m.
On 18.04.19 16:56, Igor Mammedov wrote:
> On Thu, 18 Apr 2019 14:06:25 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> [...]
> It's not. As far as I can see, ranges are merged only if they belong to
> the same 'mr'. So two DIMMs will result in 2 memory sections -> 2 KVMSlots.

Okay, so a shared "parent memory region" is not enough to result in a merge;
only aliases of the same region are merged.

Patch

diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
index d11069b..12ca3a9 100644
--- a/hw/s390x/s390-virtio-ccw.c
+++ b/hw/s390x/s390-virtio-ccw.c
@@ -161,20 +161,30 @@  static void virtio_ccw_register_hcalls(void)
 static void s390_memory_init(ram_addr_t mem_size)
 {
     MemoryRegion *sysmem = get_system_memory();
+    MemoryRegion *ram = g_new(MemoryRegion, 1);
     ram_addr_t chunk, offset = 0;
     unsigned int number = 0;
     gchar *name;
 
     /* allocate RAM for core */
+    memory_region_allocate_system_memory(ram, NULL, "s390.whole.ram", mem_size);
+    /*
+     * memory_region_allocate_system_memory() registers allocated RAM for
+     * migration, however for compat reasons the RAM should be passed over
+     * as RAMBlocks of the size upto KVM_SLOT_MAX_BYTES. So unregister just
+     * allocated RAM so it won't be migrated directly. Aliases will take
+     * of segmenting RAM into legacy chunks.
+     */
+    vmstate_unregister_ram(ram, NULL);
     name = g_strdup_printf("s390.ram");
     while (mem_size) {
-        MemoryRegion *ram = g_new(MemoryRegion, 1);
-        uint64_t size = mem_size;
+        MemoryRegion *alias = g_new(MemoryRegion, 1);
 
         /* KVM does not allow memslots >= 8 TB */
-        chunk = MIN(size, KVM_SLOT_MAX_BYTES);
-        memory_region_allocate_system_memory(ram, NULL, name, chunk);
-        memory_region_add_subregion(sysmem, offset, ram);
+        chunk = MIN(mem_size, KVM_SLOT_MAX_BYTES);
+        memory_region_init_alias(alias, NULL, name, ram, offset, chunk);
+        vmstate_register_ram_global(alias);
+        memory_region_add_subregion(sysmem, offset, alias);
         mem_size -= chunk;
         offset += chunk;
         g_free(name);