Patchwork x86/MCE/AMD: Fix the thresholding machinery initialization order

login
register
mail settings
Submitter Borislav Petkov
Date Nov. 27, 2018, 7:25 p.m.
Message ID <20181127192535.GQ26395@zn.tnic>
Download mbox | patch
Permalink /patch/666351/
State New
Headers show

Comments

Borislav Petkov - Nov. 27, 2018, 7:25 p.m.
Currently, the code sets up the thresholding interrupt vector and only
then goes about initializing the thresholding banks. Which is wrong,
because an early thresholding interrupt would cause a NULL pointer
dereference when accessing those banks and prevent the machine from
booting.

Therefore, set the thresholding interrupt vector only *after* having
initialized the banks successfully.

Fixes: 18807ddb7f88 ("x86/mce/AMD: Reset Threshold Limit after logging error")
Reported-by: Rafał Miłecki <rafal@milecki.pl>
Reported-by: John Clemens <clemej@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Cc: linux-edac@vger.kernel.org
Cc: stable@vger.kernel.org
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Link: https://lkml.kernel.org/r/20181127101700.2964-1-zajec5@gmail.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=201291
---
 arch/x86/kernel/cpu/mcheck/mce_amd.c | 19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)
Rafał Miłecki - Nov. 27, 2018, 8:38 p.m.
On 27.11.2018 20:25, Borislav Petkov wrote:
> Currently, the code sets up the thresholding interrupt vector and only
> then goes about initializing the thresholding banks. Which is wrong,
> because an early thresholding interrupt would cause a NULL pointer
> dereference when accessing those banks and prevent the machine from
> booting.
> 
> Therefore, set the thresholding interrupt vector only *after* having
> initialized the banks successfully.
> 
> Fixes: 18807ddb7f88 ("x86/mce/AMD: Reset Threshold Limit after logging error")
> Reported-by: Rafał Miłecki <rafal@milecki.pl>
> Reported-by: John Clemens <clemej@gmail.com>
> Signed-off-by: Borislav Petkov <bp@suse.de>
> Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
> Cc: linux-edac@vger.kernel.org
> Cc: stable@vger.kernel.org
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: x86@kernel.org
> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
> Link: https://lkml.kernel.org/r/20181127101700.2964-1-zajec5@gmail.com
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=201291

Tested on top of 4.14.83 and 4.20-rc4. It fixed both kernels for me!

Tested-by: Rafał Miłecki <rafal@milecki.pl>


John: thanks a lot for bisecting this issue down and finding a
workaround (mce=off). It allowed me to boot recent Linux & debug it
further.

Borislav: thank you for looking at my patch and coming with a nicer fix
SO quickly!
John Clemens - Nov. 28, 2018, 3:06 a.m.
On 11/27/18 3:38 PM, Rafał Miłecki wrote:
> On 27.11.2018 20:25, Borislav Petkov wrote:
>> Currently, the code sets up the thresholding interrupt vector and only
>> then goes about initializing the thresholding banks. Which is wrong,
>> because an early thresholding interrupt would cause a NULL pointer
>> dereference when accessing those banks and prevent the machine from
>> booting.
>>
>> Therefore, set the thresholding interrupt vector only *after* having
>> initialized the banks successfully.
>>
>> Fixes: 18807ddb7f88 ("x86/mce/AMD: Reset Threshold Limit after logging 
>> error")
>> Reported-by: Rafał Miłecki <rafal@milecki.pl>
>> Reported-by: John Clemens <clemej@gmail.com>
>> Signed-off-by: Borislav Petkov <bp@suse.de>
>> Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
>> Cc: linux-edac@vger.kernel.org
>> Cc: stable@vger.kernel.org
>> Cc: Tony Luck <tony.luck@intel.com>
>> Cc: x86@kernel.org
>> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
>> Link: https://lkml.kernel.org/r/20181127101700.2964-1-zajec5@gmail.com
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=201291
> 
> Tested on top of 4.14.83 and 4.20-rc4. It fixed both kernels for me!
> 
> Tested-by: Rafał Miłecki <rafal@milecki.pl>

Not that anyone needs further verification at this point, but tested 
working on top of 4.19.5, on the HP EliteBook 745 G5, BIOS Q81 v01.03.01 
(same as bugzilla report).

Tested-by: John Clemens <john@deater.net>

> John: thanks a lot for bisecting this issue down and finding a
> workaround (mce=off). It allowed me to boot recent Linux & debug it
> further.

Thank you for taking it the rest of the way, life got in the way after 
the bisect.  I'm glad you were able to pick it up.

> Borislav: thank you for looking at my patch and coming with a nicer fix
> SO quickly!

My thanks as well, to all.

john.c
Borislav Petkov - Nov. 28, 2018, 9:09 a.m.
On Tue, Nov 27, 2018 at 10:06:55PM -0500, John Clemens wrote:
> Not that anyone needs further verification at this point, but tested working
> on top of 4.19.5, on the HP EliteBook 745 G5, BIOS Q81 v01.03.01 (same as
> bugzilla report).
> 
> Tested-by: John Clemens <john@deater.net>

Thanks for testing!

Patch

diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index dd33c357548f..e12454e21b8a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -56,7 +56,7 @@ 
 /* Threshold LVT offset is at MSR0xC0000410[15:12] */
 #define SMCA_THR_LVT_OFF	0xF000
 
-static bool thresholding_en;
+static bool thresholding_irq_en;
 
 static const char * const th_names[] = {
 	"load_store",
@@ -534,9 +534,8 @@  prepare_threshold_block(unsigned int bank, unsigned int block, u32 addr,
 
 set_offset:
 	offset = setup_APIC_mce_threshold(offset, new);
-
-	if ((offset == new) && (mce_threshold_vector != amd_threshold_interrupt))
-		mce_threshold_vector = amd_threshold_interrupt;
+	if (offset == new)
+		thresholding_irq_en = true;
 
 done:
 	mce_threshold_block_init(&b, offset);
@@ -1357,9 +1356,6 @@  int mce_threshold_remove_device(unsigned int cpu)
 {
 	unsigned int bank;
 
-	if (!thresholding_en)
-		return 0;
-
 	for (bank = 0; bank < mca_cfg.banks; ++bank) {
 		if (!(per_cpu(bank_map, cpu) & (1 << bank)))
 			continue;
@@ -1377,9 +1373,6 @@  int mce_threshold_create_device(unsigned int cpu)
 	struct threshold_bank **bp;
 	int err = 0;
 
-	if (!thresholding_en)
-		return 0;
-
 	bp = per_cpu(threshold_banks, cpu);
 	if (bp)
 		return 0;
@@ -1408,9 +1401,6 @@  static __init int threshold_init_device(void)
 {
 	unsigned lcpu = 0;
 
-	if (mce_threshold_vector == amd_threshold_interrupt)
-		thresholding_en = true;
-
 	/* to hit CPUs online before the notifier is up */
 	for_each_online_cpu(lcpu) {
 		int err = mce_threshold_create_device(lcpu);
@@ -1419,6 +1409,9 @@  static __init int threshold_init_device(void)
 			return err;
 	}
 
+	if (thresholding_irq_en)
+		mce_threshold_vector = amd_threshold_interrupt;
+
 	return 0;
 }
 /*