Patchwork kvm: mmu: Fix flushing in kvm_zap_gfn_range

login
register
mail settings
Submitter Ben Gardon
Date March 12, 2019, 4:17 p.m.
Message ID <20190312161727.55146-1-bgardon@google.com>
Download mbox | patch
Permalink /patch/747313/
State New
Headers show

Comments

Ben Gardon - March 12, 2019, 4:17 p.m.
zap_direct_gfn_range contains an optimization to use gfn-limited TLB
flushes, if enabled. If using these limited flushes,
zap_direct_gfn_range passes lock_flush_tlb=false to
slot_handle_level_range which creates a race when the function unlocks
to call cond_resched. See an example of this race below:

CPU 0			CPU 1				CPU 3
// zap_direct_gfn_range
mmu_lock()
// *ptep == pte_1
*ptep = 0
if (lock_flush_tlb)
	flush_tlbs()
mmu_unlock()
			// In invalidate range
			// MMU notifier
			mmu_lock()
			if (pte != 0)
				*ptep = 0
				flush = true
			if (flush)
				flush_remote_tlbs()
			mmu_unlock()
			return
			// Host MM reallocates
			// page previously
			// backing guest memory.
							// Guest accesses
							// invalid page
							// through pte_1
							// in its TLB!!

Avoid the above race by passing lock_flush_tlb=true from kvm_zap_gfn_range
and replacing kvm_flush_remote_tlbs with kvm_flush_remote_tlbs_with_address
in slot_handle_level_range. When range based flushes are not enabled
kvm_flush_remote_tlbs_with_address falls back to kvm_flush_remote_tlbs.

This does change the behavior of many other functions that indirectly use
slot_handle_level_range, iff the range based flushes are enabled. The
only potential problem I see with this is that kvm->tlbs_dirty will be
cleared less often, however the only caller of slot_handle_level_range that
checks tlbs_dirty is kvm_mmu_notifier_invalidate_range_start which
checks it and does a kvm_flush_remote_tlbs after calling
kvm_unmap_hva_range anyway.

Tested: Ran all kvm-unit-tests on a Intel Haswell machine with and
	without this patch. The patch introduced no new failures.

Signed-off-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Junaid Shadid <junaids@google.com>
---
 arch/x86/kvm/mmu.c | 17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)
Christopherson, Sean J - March 12, 2019, 4:56 p.m.
On Tue, Mar 12, 2019 at 09:17:27AM -0700, Ben Gardon wrote:
> zap_direct_gfn_range contains an optimization to use gfn-limited TLB
> flushes, if enabled. If using these limited flushes,
> zap_direct_gfn_range passes lock_flush_tlb=false to
> slot_handle_level_range which creates a race when the function unlocks
> to call cond_resched. See an example of this race below:
> 
> CPU 0			CPU 1				CPU 3
> // zap_direct_gfn_range
> mmu_lock()
> // *ptep == pte_1
> *ptep = 0
> if (lock_flush_tlb)
> 	flush_tlbs()
> mmu_unlock()
> 			// In invalidate range
> 			// MMU notifier
> 			mmu_lock()
> 			if (pte != 0)
> 				*ptep = 0
> 				flush = true
> 			if (flush)
> 				flush_remote_tlbs()
> 			mmu_unlock()
> 			return
> 			// Host MM reallocates
> 			// page previously
> 			// backing guest memory.
> 							// Guest accesses
> 							// invalid page
> 							// through pte_1
> 							// in its TLB!!
> 
> Avoid the above race by passing lock_flush_tlb=true from kvm_zap_gfn_range
> and replacing kvm_flush_remote_tlbs with kvm_flush_remote_tlbs_with_address
> in slot_handle_level_range. When range based flushes are not enabled
> kvm_flush_remote_tlbs_with_address falls back to kvm_flush_remote_tlbs.
> 
> This does change the behavior of many other functions that indirectly use
> slot_handle_level_range, iff the range based flushes are enabled. The
> only potential problem I see with this is that kvm->tlbs_dirty will be
> cleared less often, however the only caller of slot_handle_level_range that
> checks tlbs_dirty is kvm_mmu_notifier_invalidate_range_start which
> checks it and does a kvm_flush_remote_tlbs after calling
> kvm_unmap_hva_range anyway.
> 
> Tested: Ran all kvm-unit-tests on a Intel Haswell machine with and
> 	without this patch. The patch introduced no new failures.
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> Reviewed-by: Junaid Shadid <junaids@google.com>
> ---
>  arch/x86/kvm/mmu.c | 17 ++++++-----------
>  1 file changed, 6 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 8d43b7c0f56fd..2e6abfa47a2e4 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -5501,7 +5501,9 @@ slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
>  
>  		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
>  			if (flush && lock_flush_tlb) {
> -				kvm_flush_remote_tlbs(kvm);
> +				kvm_flush_remote_tlbs_with_address(kvm,
> +						start_gfn,
> +						iterator.gfn - start_gfn + 1);
>  				flush = false;
>  			}
>  			cond_resched_lock(&kvm->mmu_lock);
> @@ -5509,7 +5511,8 @@ slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
>  	}
>  
>  	if (flush && lock_flush_tlb) {
> -		kvm_flush_remote_tlbs(kvm);
> +		kvm_flush_remote_tlbs_with_address(kvm, start_gfn,
> +						   end_gfn - start_gfn + 1);
>  		flush = false;
>  	}
>  
> @@ -5660,13 +5663,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  {
>  	struct kvm_memslots *slots;
>  	struct kvm_memory_slot *memslot;
> -	bool flush_tlb = true;
>  	bool flush = false;

'flush' is no longer used.  It's set by the slot_handle_level_range()
call, but never used.  And I'm pretty sure that removing 'flush' makes
this part of the patch a full revert of commit 71883a62fcd6 ("KVM/MMU:
Flush tlb directly in the kvm_zap_gfn_range()").

From a stable backporting perspective, I think it would make sense to
first revert the aforementioned commit, and then introduce the new
logic.  That'd allow the pure fix to be backported to stable kernels
without potentially introducing new issues due to changing the
behavior of slot_handle_level_range().

>  	int i;
>  
> -	if (kvm_available_flush_tlb_with_range())
> -		flush_tlb = false;
> -
>  	spin_lock(&kvm->mmu_lock);
>  	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
>  		slots = __kvm_memslots(kvm, i);
> @@ -5681,14 +5680,10 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  			flush |= slot_handle_level_range(kvm, memslot,
>  					kvm_zap_rmapp, PT_PAGE_TABLE_LEVEL,
>  					PT_MAX_HUGEPAGE_LEVEL, start,
> -					end - 1, flush_tlb);
> +					end - 1, true);
>  		}
>  	}
>  
> -	if (flush)
> -		kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
> -				gfn_end - gfn_start + 1);
> -
>  	spin_unlock(&kvm->mmu_lock);
>  }
>  
> -- 
> 2.21.0.360.g471c308f928-goog
>
Ben Gardon - March 13, 2019, 12:25 a.m.
Good suggestion Sean, thank you. I've split this change into a revert
of 71883a62fcd6 and another patch with the new logic, and sent them
both out as a v2 patch set. I put the "Cc: stable@vger.kernel.org" tag
on the revert as well, I hope that was correct.

On Tue, Mar 12, 2019 at 9:57 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Tue, Mar 12, 2019 at 09:17:27AM -0700, Ben Gardon wrote:
> > zap_direct_gfn_range contains an optimization to use gfn-limited TLB
> > flushes, if enabled. If using these limited flushes,
> > zap_direct_gfn_range passes lock_flush_tlb=false to
> > slot_handle_level_range which creates a race when the function unlocks
> > to call cond_resched. See an example of this race below:
> >
> > CPU 0                 CPU 1                           CPU 3
> > // zap_direct_gfn_range
> > mmu_lock()
> > // *ptep == pte_1
> > *ptep = 0
> > if (lock_flush_tlb)
> >       flush_tlbs()
> > mmu_unlock()
> >                       // In invalidate range
> >                       // MMU notifier
> >                       mmu_lock()
> >                       if (pte != 0)
> >                               *ptep = 0
> >                               flush = true
> >                       if (flush)
> >                               flush_remote_tlbs()
> >                       mmu_unlock()
> >                       return
> >                       // Host MM reallocates
> >                       // page previously
> >                       // backing guest memory.
> >                                                       // Guest accesses
> >                                                       // invalid page
> >                                                       // through pte_1
> >                                                       // in its TLB!!
> >
> > Avoid the above race by passing lock_flush_tlb=true from kvm_zap_gfn_range
> > and replacing kvm_flush_remote_tlbs with kvm_flush_remote_tlbs_with_address
> > in slot_handle_level_range. When range based flushes are not enabled
> > kvm_flush_remote_tlbs_with_address falls back to kvm_flush_remote_tlbs.
> >
> > This does change the behavior of many other functions that indirectly use
> > slot_handle_level_range, iff the range based flushes are enabled. The
> > only potential problem I see with this is that kvm->tlbs_dirty will be
> > cleared less often, however the only caller of slot_handle_level_range that
> > checks tlbs_dirty is kvm_mmu_notifier_invalidate_range_start which
> > checks it and does a kvm_flush_remote_tlbs after calling
> > kvm_unmap_hva_range anyway.
> >
> > Tested: Ran all kvm-unit-tests on a Intel Haswell machine with and
> >       without this patch. The patch introduced no new failures.
> >
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > Reviewed-by: Junaid Shadid <junaids@google.com>
> > ---
> >  arch/x86/kvm/mmu.c | 17 ++++++-----------
> >  1 file changed, 6 insertions(+), 11 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 8d43b7c0f56fd..2e6abfa47a2e4 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -5501,7 +5501,9 @@ slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
> >
> >               if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> >                       if (flush && lock_flush_tlb) {
> > -                             kvm_flush_remote_tlbs(kvm);
> > +                             kvm_flush_remote_tlbs_with_address(kvm,
> > +                                             start_gfn,
> > +                                             iterator.gfn - start_gfn + 1);
> >                               flush = false;
> >                       }
> >                       cond_resched_lock(&kvm->mmu_lock);
> > @@ -5509,7 +5511,8 @@ slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
> >       }
> >
> >       if (flush && lock_flush_tlb) {
> > -             kvm_flush_remote_tlbs(kvm);
> > +             kvm_flush_remote_tlbs_with_address(kvm, start_gfn,
> > +                                                end_gfn - start_gfn + 1);
> >               flush = false;
> >       }
> >
> > @@ -5660,13 +5663,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> >  {
> >       struct kvm_memslots *slots;
> >       struct kvm_memory_slot *memslot;
> > -     bool flush_tlb = true;
> >       bool flush = false;
>
> 'flush' is no longer used.  It's set by the slot_handle_level_range()
> call, but never used.  And I'm pretty sure that removing 'flush' makes
> this part of the patch a full revert of commit 71883a62fcd6 ("KVM/MMU:
> Flush tlb directly in the kvm_zap_gfn_range()").
>
> From a stable backporting perspective, I think it would make sense to
> first revert the aforementioned commit, and then introduce the new
> logic.  That'd allow the pure fix to be backported to stable kernels
> without potentially introducing new issues due to changing the
> behavior of slot_handle_level_range().
>
> >       int i;
> >
> > -     if (kvm_available_flush_tlb_with_range())
> > -             flush_tlb = false;
> > -
> >       spin_lock(&kvm->mmu_lock);
> >       for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> >               slots = __kvm_memslots(kvm, i);
> > @@ -5681,14 +5680,10 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> >                       flush |= slot_handle_level_range(kvm, memslot,
> >                                       kvm_zap_rmapp, PT_PAGE_TABLE_LEVEL,
> >                                       PT_MAX_HUGEPAGE_LEVEL, start,
> > -                                     end - 1, flush_tlb);
> > +                                     end - 1, true);
> >               }
> >       }
> >
> > -     if (flush)
> > -             kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
> > -                             gfn_end - gfn_start + 1);
> > -
> >       spin_unlock(&kvm->mmu_lock);
> >  }
> >
> > --
> > 2.21.0.360.g471c308f928-goog
> >

Patch

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8d43b7c0f56fd..2e6abfa47a2e4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5501,7 +5501,9 @@  slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
 
 		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
 			if (flush && lock_flush_tlb) {
-				kvm_flush_remote_tlbs(kvm);
+				kvm_flush_remote_tlbs_with_address(kvm,
+						start_gfn,
+						iterator.gfn - start_gfn + 1);
 				flush = false;
 			}
 			cond_resched_lock(&kvm->mmu_lock);
@@ -5509,7 +5511,8 @@  slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	}
 
 	if (flush && lock_flush_tlb) {
-		kvm_flush_remote_tlbs(kvm);
+		kvm_flush_remote_tlbs_with_address(kvm, start_gfn,
+						   end_gfn - start_gfn + 1);
 		flush = false;
 	}
 
@@ -5660,13 +5663,9 @@  void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 {
 	struct kvm_memslots *slots;
 	struct kvm_memory_slot *memslot;
-	bool flush_tlb = true;
 	bool flush = false;
 	int i;
 
-	if (kvm_available_flush_tlb_with_range())
-		flush_tlb = false;
-
 	spin_lock(&kvm->mmu_lock);
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
 		slots = __kvm_memslots(kvm, i);
@@ -5681,14 +5680,10 @@  void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 			flush |= slot_handle_level_range(kvm, memslot,
 					kvm_zap_rmapp, PT_PAGE_TABLE_LEVEL,
 					PT_MAX_HUGEPAGE_LEVEL, start,
-					end - 1, flush_tlb);
+					end - 1, true);
 		}
 	}
 
-	if (flush)
-		kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
-				gfn_end - gfn_start + 1);
-
 	spin_unlock(&kvm->mmu_lock);
 }