Patchwork [1/2,RESEND,v7] resource: add the new I/O resource descriptor 'IORES_DESC_RESERVED'

login
register
mail settings
Submitter Lianbo Jiang
Date Nov. 24, 2018, 5:12 a.m.
Message ID <20181124051223.19994-2-lijiang@redhat.com>
Download mbox | patch
Permalink /patch/664109/
State New
Headers show

Comments

Lianbo Jiang - Nov. 24, 2018, 5:12 a.m.
The upstream kernel can not accurately add the e820 reserved type to
kdump krenel e820 table.

Kdump uses walk_iomem_res_desc() to iterate io resources, then adds
the matched resource ranges to the e820 table for kdump kernel. But,
when convert the e820 type to the iores descriptor, several e820
types are converted to 'IORES_DESC_NONE' in this function e820_type
_to_iores_desc(). So the walk_iomem_res_desc() will get the redundant
types(such as E820_TYPE_RAM/E820_TYPE_UNUSABLE/E820_TYPE_KERN) when
walk through io resources with the descriptor 'IORES_DESC_NONE'.

This patch adds the new I/O resource descriptor 'IORES_DESC_RESERVED'
for the iomem resources search interfaces. It is helpful to exactly
match the reserved resource ranges when walking through iomem resources.

Furthermore, in order to make it still work after the new descriptor
is added, these codes originally related to 'IORES_DESC_NONE' have
been updated.

Suggested-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
---
 arch/ia64/kernel/efi.c | 4 ++++
 arch/x86/kernel/e820.c | 2 +-
 arch/x86/mm/ioremap.c  | 9 ++++++++-
 include/linux/ioport.h | 1 +
 kernel/resource.c      | 6 +++---
 5 files changed, 17 insertions(+), 5 deletions(-)
Dave Hansen - Nov. 26, 2018, 8:52 p.m.
On 11/23/18 9:12 PM, Lianbo Jiang wrote:
> The upstream kernel can not accurately add the e820 reserved type to> kdump krenel e820 table.

	^ kernel

> Kdump uses walk_iomem_res_desc() to iterate io resources, then adds
> the matched resource ranges to the e820 table for kdump kernel. But,
> when convert the e820 type to the iores descriptor, several e820
> types are converted to 'IORES_DESC_NONE' in this function e820_type
> _to_iores_desc(). So the walk_iomem_res_desc() will get the redundant
> types(such as E820_TYPE_RAM/E820_TYPE_UNUSABLE/E820_TYPE_KERN) when
> walk through io resources with the descriptor 'IORES_DESC_NONE'.

Let me see if I understand.

When doing kexec, the old kernel makes a new e820 table for the new
kernel.  The old kernel constructs the new e820 table from
'iomem_resource'.  But, when creating the 'iomem_resource' tree,
reserved areas like E820_TYPE_RESERVED are not properly passed through.
 This causes problems like described in the next patch.

> This patch adds the new I/O resource descriptor 'IORES_DESC_RESERVED'
> for the iomem resources search interfaces. It is helpful to exactly
> match the reserved resource ranges when walking through iomem resources.

It's more than that, though.  You're specifically storing the reserved
area(s) when we see them come through the firmware.

> Furthermore, in order to make it still work after the new descriptor
> is added, these codes originally related to 'IORES_DESC_NONE' have
> been updated.

It would be nice to explain why they needed to be updated and what
breaks if they are not.

> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> index 5378d10f1d31..91b6112e7489 100644
> --- a/arch/x86/mm/ioremap.c
> +++ b/arch/x86/mm/ioremap.c
> @@ -83,7 +83,14 @@ static bool __ioremap_check_ram(struct resource *res)
>  
>  static int __ioremap_check_desc_other(struct resource *res)
>  {
> -	return (res->desc != IORES_DESC_NONE);
> +	/*
> +	 * The E820_TYPE_RESERVED was converted to the IORES_DESC_NONE
> +	 * before the new IORES_DESC_RESERVED is added, so it contained
> +	 * the e820 reserved type. In order to make it still work for
> +	 * SEV, here keep it the same as before.
> +	 */
> +	return ((res->desc != IORES_DESC_NONE) ||
> +		(res->desc != IORES_DESC_RESERVED));
>  }

After reading the changelog and the comment, I still have no idea why
this hunk is here.  Could you try to explain a bit more?
Lianbo Jiang - Nov. 27, 2018, 10:04 a.m.
在 2018年11月27日 04:52, Dave Hansen 写道:
> On 11/23/18 9:12 PM, Lianbo Jiang wrote:
>> The upstream kernel can not accurately add the e820 reserved type to> kdump krenel e820 table.
> 
> 	^ kernel
> 
>> Kdump uses walk_iomem_res_desc() to iterate io resources, then adds
>> the matched resource ranges to the e820 table for kdump kernel. But,
>> when convert the e820 type to the iores descriptor, several e820
>> types are converted to 'IORES_DESC_NONE' in this function e820_type
>> _to_iores_desc(). So the walk_iomem_res_desc() will get the redundant
>> types(such as E820_TYPE_RAM/E820_TYPE_UNUSABLE/E820_TYPE_KERN) when
>> walk through io resources with the descriptor 'IORES_DESC_NONE'.
> 
> Let me see if I understand.
> 
> When doing kexec, the old kernel makes a new e820 table for the new
> kernel.  The old kernel constructs the new e820 table from
> 'iomem_resource'.  But, when creating the 'iomem_resource' tree,
> reserved areas like E820_TYPE_RESERVED are not properly passed through.
>  This causes problems like described in the next patch.
> 

When doing kexec_file_load, the old kernel need to pass the e820 reserved ranges
to the kdump kernel. But kernel can not exactly match the e820 reserved ranges when
walking through the iomem resources with the descriptor 'IORES_DESC_NONE', because
the descriptor 'IORES_DESC_NONE' contains four e820 types, such as 'E820_TYPE_RAM',
'E820_TYPE_UNUSABLE', 'E820_TYPE_RESERVED_KERN', 'E820_TYPE_RESERVED'. It may pass
these four types to the kdump kernel, this is not desired result.

Please refer to this function e820_type_to_iores_desc(). (arch/x86/kernel/e820.c)

>> This patch adds the new I/O resource descriptor 'IORES_DESC_RESERVED'
>> for the iomem resources search interfaces. It is helpful to exactly
>> match the reserved resource ranges when walking through iomem resources.
> 
> It's more than that, though.  You're specifically storing the reserved
> area(s) when we see them come through the firmware.
> 
>> Furthermore, in order to make it still work after the new descriptor
>> is added, these codes originally related to 'IORES_DESC_NONE' have
>> been updated.
> 
> It would be nice to explain why they needed to be updated and what
> breaks if they are not.
> 

If these code paths have not been updated, the reserved regions descriptor is still marked
as the 'IORES_DESC_NONE' instead of 'IORES_DESC_RESERVED', this is not correct and it is 
also easily confused.

And also cause some errors, such as the following function __ioremap_check_desc_other().

>> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
>> index 5378d10f1d31..91b6112e7489 100644
>> --- a/arch/x86/mm/ioremap.c
>> +++ b/arch/x86/mm/ioremap.c
>> @@ -83,7 +83,14 @@ static bool __ioremap_check_ram(struct resource *res)
>>  
>>  static int __ioremap_check_desc_other(struct resource *res)
>>  {
>> -	return (res->desc != IORES_DESC_NONE);
>> +	/*
>> +	 * The E820_TYPE_RESERVED was converted to the IORES_DESC_NONE
>> +	 * before the new IORES_DESC_RESERVED is added, so it contained
>> +	 * the e820 reserved type. In order to make it still work for
>> +	 * SEV, here keep it the same as before.
>> +	 */
>> +	return ((res->desc != IORES_DESC_NONE) ||
>> +		(res->desc != IORES_DESC_RESERVED));
>>  }
> 
> After reading the changelog and the comment, I still have no idea why
> this hunk is here.  Could you try to explain a bit more?
> 
> 

Thanks for your comment.

What happens if we don't modify this functions __ioremap_check_desc_other()?
-When SEV is active, it might map these reserved regions with the encryption mask. That is incorrect.
Therefore, keep it the same as before in order to make it still work for the SEV case.

**BTW**:
In this function __ioremap_check_desc_other(), it should use the '&&' operator instead of
the '||' operator. I made a mistake in this function(sorry for this), i will improve this
function in patch v8 and post again like this:

static int __ioremap_check_desc_other(struct resource *res)
{
    return ((res->desc != IORES_DESC_NONE) &&
            (res->desc != IORES_DESC_RESERVED));
}


Regards,
Lianbo
Dave Hansen - Nov. 27, 2018, 3:34 p.m.
On 11/27/18 2:04 AM, lijiang wrote:
> What happens if we don't modify this functions
> __ioremap_check_desc_other()? -When SEV is active, it might map these
> reserved regions with the encryption mask. That is incorrect. 

This is missing another sentence or two to "connect the dots".

SEV uses data that comes from the e820 table to tell whether or not the
memory should be encrypted?  If we don't reflect these reserved areas in
the e820 table, the SEV code will set up encrypted mappings for device
memory, for instance?
Lianbo Jiang - Nov. 28, 2018, 3:51 a.m.
在 2018年11月27日 23:34, Dave Hansen 写道:
> On 11/27/18 2:04 AM, lijiang wrote:
>> What happens if we don't modify this functions
>> __ioremap_check_desc_other()? -When SEV is active, it might map these
>> reserved regions with the encryption mask. That is incorrect. 
> 
> This is missing another sentence or two to "connect the dots".
> 
> SEV uses data that comes from the e820 table to tell whether or not the
> memory should be encrypted?  If we don't reflect these reserved areas in
> the e820 table, the SEV code will set up encrypted mappings for device
> memory, for instance?
> 

For the convenience of description, here copy some code fragments and put them at the end.
Please refer to these codes.

When the SEV is active, the page being mapped will determine whether to use the memory
encryption attribute for mapping based on the 'mem_flags.desc_other'.

But the value of 'mem_flags.desc_other' is set according to the returned result of this
function __ioremap_check_desc_other().

Originally, the 'E820_TYPE_RESERVED' type is converted to the descriptor 'IORES_DESC_NONE',
therefore, for the e820 reserved type, the value of 'mem_flags.desc_other' is equal to
'false'.

But now, the 'E820_TYPE_RESERVED' type is converted to the descriptor 'IORES_DESC_RESERVED'
instead of 'IORES_DESC_NONE', it has been changed and the value of 'mem_flags.desc_other'
is equal to 'true'. This is wrong.

So that would be nice to keep it the same as before, that is to say, this function has to
be improved like this.
static int __ioremap_check_desc_other(struct resource *res)
{
    return ((res->desc != IORES_DESC_NONE) &&
            (res->desc != IORES_DESC_RESERVED));
}


Please look at the following function __ioremap_caller(), and also list the call trace path.

Call trace path:
   __ioremap_caller()->
__ioremap_check_mem()->
       walk_mem_res()->
__ioremap_res_check()-> 
__ioremap_check_desc_other()

static void __iomem *__ioremap_caller(resource_size_t phys_addr,
                unsigned long size, enum page_cache_mode pcm,
                void *caller, bool encrypted)
{
        unsigned long offset, vaddr;
        resource_size_t last_addr;
        const resource_size_t unaligned_phys_addr = phys_addr;
        const unsigned long unaligned_size = size;
        struct ioremap_mem_flags mem_flags;

	/* skip some codes...*/

        __ioremap_check_mem(phys_addr, size, &mem_flags);

	/* skip some codes...*/

	/*
         * If the page being mapped is in memory and SEV is active then
         * make sure the memory encryption attribute is enabled in the
         * resulting mapping.
         */
        prot = PAGE_KERNEL_IO;
        if ((sev_active() && mem_flags.desc_other) || encrypted)
                prot = pgprot_encrypted(prot);

	/* skip some codes...*/
}

Thanks.
Lianbo
Dave Hansen - Nov. 28, 2018, 4:02 p.m.
On 11/27/18 7:51 PM, lijiang wrote:
> But now, the 'E820_TYPE_RESERVED' type is converted to the descriptor 'IORES_DESC_RESERVED'
> instead of 'IORES_DESC_NONE', it has been changed and the value of 'mem_flags.desc_other'
> is equal to 'true'. This is wrong.

Thanks for the improved description of the problem.  Seems to all make
sense now.

Could you include this in the changelog for the next version?
Lianbo Jiang - Nov. 29, 2018, 2:14 a.m.
在 2018年11月29日 00:02, Dave Hansen 写道:
> On 11/27/18 7:51 PM, lijiang wrote:
>> But now, the 'E820_TYPE_RESERVED' type is converted to the descriptor 'IORES_DESC_RESERVED'
>> instead of 'IORES_DESC_NONE', it has been changed and the value of 'mem_flags.desc_other'
>> is equal to 'true'. This is wrong.
> 
> Thanks for the improved description of the problem.  Seems to all make
> sense now.
> 
> Could you include this in the changelog for the next version?
> 

Ok, i will include these changes in patch v8. Also thanks for your advice.

Regards,
Lianbo

Patch

diff --git a/arch/ia64/kernel/efi.c b/arch/ia64/kernel/efi.c
index 8f106638913c..1841e9b4db30 100644
--- a/arch/ia64/kernel/efi.c
+++ b/arch/ia64/kernel/efi.c
@@ -1231,6 +1231,10 @@  efi_initialize_iomem_resources(struct resource *code_resource,
 				break;
 
 			case EFI_RESERVED_TYPE:
+				name = "reserved";
+				desc = IORES_DESC_RESERVED;
+				break;
+
 			case EFI_RUNTIME_SERVICES_CODE:
 			case EFI_RUNTIME_SERVICES_DATA:
 			case EFI_ACPI_RECLAIM_MEMORY:
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 50895c2f937d..57fafdafb860 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1048,10 +1048,10 @@  static unsigned long __init e820_type_to_iores_desc(struct e820_entry *entry)
 	case E820_TYPE_NVS:		return IORES_DESC_ACPI_NV_STORAGE;
 	case E820_TYPE_PMEM:		return IORES_DESC_PERSISTENT_MEMORY;
 	case E820_TYPE_PRAM:		return IORES_DESC_PERSISTENT_MEMORY_LEGACY;
+	case E820_TYPE_RESERVED:	return IORES_DESC_RESERVED;
 	case E820_TYPE_RESERVED_KERN:	/* Fall-through: */
 	case E820_TYPE_RAM:		/* Fall-through: */
 	case E820_TYPE_UNUSABLE:	/* Fall-through: */
-	case E820_TYPE_RESERVED:	/* Fall-through: */
 	default:			return IORES_DESC_NONE;
 	}
 }
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 5378d10f1d31..91b6112e7489 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -83,7 +83,14 @@  static bool __ioremap_check_ram(struct resource *res)
 
 static int __ioremap_check_desc_other(struct resource *res)
 {
-	return (res->desc != IORES_DESC_NONE);
+	/*
+	 * The E820_TYPE_RESERVED was converted to the IORES_DESC_NONE
+	 * before the new IORES_DESC_RESERVED is added, so it contained
+	 * the e820 reserved type. In order to make it still work for
+	 * SEV, here keep it the same as before.
+	 */
+	return ((res->desc != IORES_DESC_NONE) ||
+		(res->desc != IORES_DESC_RESERVED));
 }
 
 static int __ioremap_res_check(struct resource *res, void *arg)
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index da0ebaec25f0..6ed59de48bd5 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -133,6 +133,7 @@  enum {
 	IORES_DESC_PERSISTENT_MEMORY_LEGACY	= 5,
 	IORES_DESC_DEVICE_PRIVATE_MEMORY	= 6,
 	IORES_DESC_DEVICE_PUBLIC_MEMORY		= 7,
+	IORES_DESC_RESERVED			= 8,
 };
 
 /* helpers to define resources */
diff --git a/kernel/resource.c b/kernel/resource.c
index b0fbf685c77a..f34a632c4169 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -994,7 +994,7 @@  __reserve_region_with_split(struct resource *root, resource_size_t start,
 	res->start = start;
 	res->end = end;
 	res->flags = type | IORESOURCE_BUSY;
-	res->desc = IORES_DESC_NONE;
+	res->desc = IORES_DESC_RESERVED;
 
 	while (1) {
 
@@ -1029,7 +1029,7 @@  __reserve_region_with_split(struct resource *root, resource_size_t start,
 				next_res->start = conflict->end + 1;
 				next_res->end = end;
 				next_res->flags = type | IORESOURCE_BUSY;
-				next_res->desc = IORES_DESC_NONE;
+				next_res->desc = IORES_DESC_RESERVED;
 			}
 		} else {
 			res->start = conflict->end + 1;
@@ -1477,7 +1477,7 @@  static int __init reserve_setup(char *str)
 			res->start = io_start;
 			res->end = io_start + io_num - 1;
 			res->flags |= IORESOURCE_BUSY;
-			res->desc = IORES_DESC_NONE;
+			res->desc = IORES_DESC_RESERVED;
 			res->child = NULL;
 			if (request_resource(parent, res) == 0)
 				reserved = x+1;