Patchwork [v2,2/2] locking/rwsem: Optimize down_read_trylock()

login
register
mail settings
Submitter Waiman Long
Date Feb. 11, 2019, 7:31 p.m.
Message ID <1549913486-16799-3-git-send-email-longman@redhat.com>
Download mbox | patch
Permalink /patch/723291/
State New
Headers show

Comments

Waiman Long - Feb. 11, 2019, 7:31 p.m.
Modify __down_read_trylock() to make it generate slightly better code
(smaller and maybe a tiny bit faster).

Before this patch, down_read_trylock:

   0x0000000000000000 <+0>:     callq  0x5 <down_read_trylock+5>
   0x0000000000000005 <+5>:     jmp    0x18 <down_read_trylock+24>
   0x0000000000000007 <+7>:     lea    0x1(%rdx),%rcx
   0x000000000000000b <+11>:    mov    %rdx,%rax
   0x000000000000000e <+14>:    lock cmpxchg %rcx,(%rdi)
   0x0000000000000013 <+19>:    cmp    %rax,%rdx
   0x0000000000000016 <+22>:    je     0x23 <down_read_trylock+35>
   0x0000000000000018 <+24>:    mov    (%rdi),%rdx
   0x000000000000001b <+27>:    test   %rdx,%rdx
   0x000000000000001e <+30>:    jns    0x7 <down_read_trylock+7>
   0x0000000000000020 <+32>:    xor    %eax,%eax
   0x0000000000000022 <+34>:    retq
   0x0000000000000023 <+35>:    mov    %gs:0x0,%rax
   0x000000000000002c <+44>:    or     $0x3,%rax
   0x0000000000000030 <+48>:    mov    %rax,0x20(%rdi)
   0x0000000000000034 <+52>:    mov    $0x1,%eax
   0x0000000000000039 <+57>:    retq

After patch, down_read_trylock:

   0x0000000000000000 <+0>:     callq  0x5 <down_read_trylock+5>
   0x0000000000000005 <+5>:     mov    (%rdi),%rax
   0x0000000000000008 <+8>:     test   %rax,%rax
   0x000000000000000b <+11>:    js     0x2f <down_read_trylock+47>
   0x000000000000000d <+13>:    lea    0x1(%rax),%rdx
   0x0000000000000011 <+17>:    lock cmpxchg %rdx,(%rdi)
   0x0000000000000016 <+22>:    jne    0x8 <down_read_trylock+8>
   0x0000000000000018 <+24>:    mov    %gs:0x0,%rax
   0x0000000000000021 <+33>:    or     $0x3,%rax
   0x0000000000000025 <+37>:    mov    %rax,0x20(%rdi)
   0x0000000000000029 <+41>:    mov    $0x1,%eax
   0x000000000000002e <+46>:    retq
   0x000000000000002f <+47>:    xor    %eax,%eax
   0x0000000000000031 <+49>:    retq

By using a rwsem microbenchmark, the down_read_trylock() rate on a
x86-64 system before and after the patch were:

                 Before Patch    After Patch
   # of Threads     rlock           rlock
   ------------     -----           -----
        1           27,787          28,259
        2            8,359           9,234

On a ARM64 system, the performance results were:

                 Before Patch    After Patch
   # of Threads     rlock           rlock
   ------------     -----           -----
        1           24,155          25,000
        2            6,820           8,699

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
Peter Zijlstra - Feb. 12, 2019, 1:24 p.m.
On Mon, Feb 11, 2019 at 02:31:26PM -0500, Waiman Long wrote:
> Modify __down_read_trylock() to make it generate slightly better code
> (smaller and maybe a tiny bit faster).
> 
> Before this patch, down_read_trylock:
> 
>    0x0000000000000000 <+0>:     callq  0x5 <down_read_trylock+5>
>    0x0000000000000005 <+5>:     jmp    0x18 <down_read_trylock+24>
>    0x0000000000000007 <+7>:     lea    0x1(%rdx),%rcx
>    0x000000000000000b <+11>:    mov    %rdx,%rax
>    0x000000000000000e <+14>:    lock cmpxchg %rcx,(%rdi)
>    0x0000000000000013 <+19>:    cmp    %rax,%rdx
>    0x0000000000000016 <+22>:    je     0x23 <down_read_trylock+35>
>    0x0000000000000018 <+24>:    mov    (%rdi),%rdx
>    0x000000000000001b <+27>:    test   %rdx,%rdx
>    0x000000000000001e <+30>:    jns    0x7 <down_read_trylock+7>
>    0x0000000000000020 <+32>:    xor    %eax,%eax
>    0x0000000000000022 <+34>:    retq
>    0x0000000000000023 <+35>:    mov    %gs:0x0,%rax
>    0x000000000000002c <+44>:    or     $0x3,%rax
>    0x0000000000000030 <+48>:    mov    %rax,0x20(%rdi)
>    0x0000000000000034 <+52>:    mov    $0x1,%eax
>    0x0000000000000039 <+57>:    retq
> 
> After patch, down_read_trylock:
> 
>    0x0000000000000000 <+0>:     callq  0x5 <down_read_trylock+5>
>    0x0000000000000005 <+5>:     mov    (%rdi),%rax
>    0x0000000000000008 <+8>:     test   %rax,%rax
>    0x000000000000000b <+11>:    js     0x2f <down_read_trylock+47>
>    0x000000000000000d <+13>:    lea    0x1(%rax),%rdx
>    0x0000000000000011 <+17>:    lock cmpxchg %rdx,(%rdi)
>    0x0000000000000016 <+22>:    jne    0x8 <down_read_trylock+8>
>    0x0000000000000018 <+24>:    mov    %gs:0x0,%rax
>    0x0000000000000021 <+33>:    or     $0x3,%rax
>    0x0000000000000025 <+37>:    mov    %rax,0x20(%rdi)
>    0x0000000000000029 <+41>:    mov    $0x1,%eax
>    0x000000000000002e <+46>:    retq
>    0x000000000000002f <+47>:    xor    %eax,%eax
>    0x0000000000000031 <+49>:    retq
> 
> By using a rwsem microbenchmark, the down_read_trylock() rate on a
> x86-64 system before and after the patch were:
> 
>                  Before Patch    After Patch
>    # of Threads     rlock           rlock
>    ------------     -----           -----
>         1           27,787          28,259
>         2            8,359           9,234

From 1/2:

1        29,201  30,143  29,458    28,615  30,172  29,201
2         6,807  13,299   1,171     7,725  15,025   1,804

> 
> On a ARM64 system, the performance results were:
> 
>                  Before Patch    After Patch
>    # of Threads     rlock           rlock
>    ------------     -----           -----
>         1           24,155          25,000
>         2            6,820           8,699
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/locking/rwsem.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
> index 067e265..028bc33 100644
> --- a/kernel/locking/rwsem.h
> +++ b/kernel/locking/rwsem.h
> @@ -175,11 +175,11 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
>  
>  static inline int __down_read_trylock(struct rw_semaphore *sem)
>  {
> -	long tmp;
> +	long tmp = atomic_long_read(&sem->count);
>  
> -	while ((tmp = atomic_long_read(&sem->count)) >= 0) {
> -		if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
> -				   tmp + RWSEM_ACTIVE_READ_BIAS)) {
> +	while (tmp >= 0) {
> +		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
> +					tmp + RWSEM_ACTIVE_READ_BIAS)) {
>  			return 1;
>  		}
>  	}
> -- 
> 1.8.3.1
>
Peter Zijlstra - Feb. 12, 2019, 1:25 p.m.
On Tue, Feb 12, 2019 at 02:24:04PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 11, 2019 at 02:31:26PM -0500, Waiman Long wrote:
> > Modify __down_read_trylock() to make it generate slightly better code
> > (smaller and maybe a tiny bit faster).
> > 
> > Before this patch, down_read_trylock:
> > 
> >    0x0000000000000000 <+0>:     callq  0x5 <down_read_trylock+5>
> >    0x0000000000000005 <+5>:     jmp    0x18 <down_read_trylock+24>
> >    0x0000000000000007 <+7>:     lea    0x1(%rdx),%rcx
> >    0x000000000000000b <+11>:    mov    %rdx,%rax
> >    0x000000000000000e <+14>:    lock cmpxchg %rcx,(%rdi)
> >    0x0000000000000013 <+19>:    cmp    %rax,%rdx
> >    0x0000000000000016 <+22>:    je     0x23 <down_read_trylock+35>
> >    0x0000000000000018 <+24>:    mov    (%rdi),%rdx
> >    0x000000000000001b <+27>:    test   %rdx,%rdx
> >    0x000000000000001e <+30>:    jns    0x7 <down_read_trylock+7>
> >    0x0000000000000020 <+32>:    xor    %eax,%eax
> >    0x0000000000000022 <+34>:    retq
> >    0x0000000000000023 <+35>:    mov    %gs:0x0,%rax
> >    0x000000000000002c <+44>:    or     $0x3,%rax
> >    0x0000000000000030 <+48>:    mov    %rax,0x20(%rdi)
> >    0x0000000000000034 <+52>:    mov    $0x1,%eax
> >    0x0000000000000039 <+57>:    retq
> > 
> > After patch, down_read_trylock:
> > 
> >    0x0000000000000000 <+0>:     callq  0x5 <down_read_trylock+5>
> >    0x0000000000000005 <+5>:     mov    (%rdi),%rax
> >    0x0000000000000008 <+8>:     test   %rax,%rax
> >    0x000000000000000b <+11>:    js     0x2f <down_read_trylock+47>
> >    0x000000000000000d <+13>:    lea    0x1(%rax),%rdx
> >    0x0000000000000011 <+17>:    lock cmpxchg %rdx,(%rdi)
> >    0x0000000000000016 <+22>:    jne    0x8 <down_read_trylock+8>
> >    0x0000000000000018 <+24>:    mov    %gs:0x0,%rax
> >    0x0000000000000021 <+33>:    or     $0x3,%rax
> >    0x0000000000000025 <+37>:    mov    %rax,0x20(%rdi)
> >    0x0000000000000029 <+41>:    mov    $0x1,%eax
> >    0x000000000000002e <+46>:    retq
> >    0x000000000000002f <+47>:    xor    %eax,%eax
> >    0x0000000000000031 <+49>:    retq
> > 
> > By using a rwsem microbenchmark, the down_read_trylock() rate on a
> > x86-64 system before and after the patch were:
> > 
> >                  Before Patch    After Patch
> >    # of Threads     rlock           rlock
> >    ------------     -----           -----
> >         1           27,787          28,259
> >         2            8,359           9,234
> 
> From 1/2:
> 
> 1        29,201  30,143  29,458    28,615  30,172  29,201
> 2         6,807  13,299   1,171     7,725  15,025   1,804

Argh, fat fingered and send before I was done typing.

What I wanted to say was; those rlock numbers don't match up. What
gives?

The before _this_ patch number of 27k787 should be the same as the after
first patch number of 30k172.
Waiman Long - Feb. 12, 2019, 6:36 p.m.
On 02/12/2019 08:25 AM, Peter Zijlstra wrote:
> On Tue, Feb 12, 2019 at 02:24:04PM +0100, Peter Zijlstra wrote:
>> On Mon, Feb 11, 2019 at 02:31:26PM -0500, Waiman Long wrote:
>>> Modify __down_read_trylock() to make it generate slightly better code
>>> (smaller and maybe a tiny bit faster).
>>>
>>> Before this patch, down_read_trylock:
>>>
>>>    0x0000000000000000 <+0>:     callq  0x5 <down_read_trylock+5>
>>>    0x0000000000000005 <+5>:     jmp    0x18 <down_read_trylock+24>
>>>    0x0000000000000007 <+7>:     lea    0x1(%rdx),%rcx
>>>    0x000000000000000b <+11>:    mov    %rdx,%rax
>>>    0x000000000000000e <+14>:    lock cmpxchg %rcx,(%rdi)
>>>    0x0000000000000013 <+19>:    cmp    %rax,%rdx
>>>    0x0000000000000016 <+22>:    je     0x23 <down_read_trylock+35>
>>>    0x0000000000000018 <+24>:    mov    (%rdi),%rdx
>>>    0x000000000000001b <+27>:    test   %rdx,%rdx
>>>    0x000000000000001e <+30>:    jns    0x7 <down_read_trylock+7>
>>>    0x0000000000000020 <+32>:    xor    %eax,%eax
>>>    0x0000000000000022 <+34>:    retq
>>>    0x0000000000000023 <+35>:    mov    %gs:0x0,%rax
>>>    0x000000000000002c <+44>:    or     $0x3,%rax
>>>    0x0000000000000030 <+48>:    mov    %rax,0x20(%rdi)
>>>    0x0000000000000034 <+52>:    mov    $0x1,%eax
>>>    0x0000000000000039 <+57>:    retq
>>>
>>> After patch, down_read_trylock:
>>>
>>>    0x0000000000000000 <+0>:     callq  0x5 <down_read_trylock+5>
>>>    0x0000000000000005 <+5>:     mov    (%rdi),%rax
>>>    0x0000000000000008 <+8>:     test   %rax,%rax
>>>    0x000000000000000b <+11>:    js     0x2f <down_read_trylock+47>
>>>    0x000000000000000d <+13>:    lea    0x1(%rax),%rdx
>>>    0x0000000000000011 <+17>:    lock cmpxchg %rdx,(%rdi)
>>>    0x0000000000000016 <+22>:    jne    0x8 <down_read_trylock+8>
>>>    0x0000000000000018 <+24>:    mov    %gs:0x0,%rax
>>>    0x0000000000000021 <+33>:    or     $0x3,%rax
>>>    0x0000000000000025 <+37>:    mov    %rax,0x20(%rdi)
>>>    0x0000000000000029 <+41>:    mov    $0x1,%eax
>>>    0x000000000000002e <+46>:    retq
>>>    0x000000000000002f <+47>:    xor    %eax,%eax
>>>    0x0000000000000031 <+49>:    retq
>>>
>>> By using a rwsem microbenchmark, the down_read_trylock() rate on a
>>> x86-64 system before and after the patch were:
>>>
>>>                  Before Patch    After Patch
>>>    # of Threads     rlock           rlock
>>>    ------------     -----           -----
>>>         1           27,787          28,259
>>>         2            8,359           9,234
>> From 1/2:
>>
>> 1        29,201  30,143  29,458    28,615  30,172  29,201
>> 2         6,807  13,299   1,171     7,725  15,025   1,804
> Argh, fat fingered and send before I was done typing.
>
> What I wanted to say was; those rlock numbers don't match up. What
> gives?
>
> The before _this_ patch number of 27k787 should be the same as the after
> first patch number of 30k172.

The rlock number in patch 1 refers to down_read() which uses xadd. The
number here in patch 2 refers specifically to down_read_trylock() which
uses cmpxchg() as this patch changes only __down_read_tryulock(). So the
performance data differ.

Cheers,
Longman
Waiman Long - Feb. 12, 2019, 6:38 p.m.
On 02/12/2019 01:36 PM, Waiman Long wrote:
> On 02/12/2019 08:25 AM, Peter Zijlstra wrote:
>> On Tue, Feb 12, 2019 at 02:24:04PM +0100, Peter Zijlstra wrote:
>>> On Mon, Feb 11, 2019 at 02:31:26PM -0500, Waiman Long wrote:
>>>> Modify __down_read_trylock() to make it generate slightly better code
>>>> (smaller and maybe a tiny bit faster).
>>>>
>>>> Before this patch, down_read_trylock:
>>>>
>>>>    0x0000000000000000 <+0>:     callq  0x5 <down_read_trylock+5>
>>>>    0x0000000000000005 <+5>:     jmp    0x18 <down_read_trylock+24>
>>>>    0x0000000000000007 <+7>:     lea    0x1(%rdx),%rcx
>>>>    0x000000000000000b <+11>:    mov    %rdx,%rax
>>>>    0x000000000000000e <+14>:    lock cmpxchg %rcx,(%rdi)
>>>>    0x0000000000000013 <+19>:    cmp    %rax,%rdx
>>>>    0x0000000000000016 <+22>:    je     0x23 <down_read_trylock+35>
>>>>    0x0000000000000018 <+24>:    mov    (%rdi),%rdx
>>>>    0x000000000000001b <+27>:    test   %rdx,%rdx
>>>>    0x000000000000001e <+30>:    jns    0x7 <down_read_trylock+7>
>>>>    0x0000000000000020 <+32>:    xor    %eax,%eax
>>>>    0x0000000000000022 <+34>:    retq
>>>>    0x0000000000000023 <+35>:    mov    %gs:0x0,%rax
>>>>    0x000000000000002c <+44>:    or     $0x3,%rax
>>>>    0x0000000000000030 <+48>:    mov    %rax,0x20(%rdi)
>>>>    0x0000000000000034 <+52>:    mov    $0x1,%eax
>>>>    0x0000000000000039 <+57>:    retq
>>>>
>>>> After patch, down_read_trylock:
>>>>
>>>>    0x0000000000000000 <+0>:     callq  0x5 <down_read_trylock+5>
>>>>    0x0000000000000005 <+5>:     mov    (%rdi),%rax
>>>>    0x0000000000000008 <+8>:     test   %rax,%rax
>>>>    0x000000000000000b <+11>:    js     0x2f <down_read_trylock+47>
>>>>    0x000000000000000d <+13>:    lea    0x1(%rax),%rdx
>>>>    0x0000000000000011 <+17>:    lock cmpxchg %rdx,(%rdi)
>>>>    0x0000000000000016 <+22>:    jne    0x8 <down_read_trylock+8>
>>>>    0x0000000000000018 <+24>:    mov    %gs:0x0,%rax
>>>>    0x0000000000000021 <+33>:    or     $0x3,%rax
>>>>    0x0000000000000025 <+37>:    mov    %rax,0x20(%rdi)
>>>>    0x0000000000000029 <+41>:    mov    $0x1,%eax
>>>>    0x000000000000002e <+46>:    retq
>>>>    0x000000000000002f <+47>:    xor    %eax,%eax
>>>>    0x0000000000000031 <+49>:    retq
>>>>
>>>> By using a rwsem microbenchmark, the down_read_trylock() rate on a
>>>> x86-64 system before and after the patch were:
>>>>
>>>>                  Before Patch    After Patch
>>>>    # of Threads     rlock           rlock
>>>>    ------------     -----           -----
>>>>         1           27,787          28,259
>>>>         2            8,359           9,234
>>> From 1/2:
>>>
>>> 1        29,201  30,143  29,458    28,615  30,172  29,201
>>> 2         6,807  13,299   1,171     7,725  15,025   1,804
>> Argh, fat fingered and send before I was done typing.
>>
>> What I wanted to say was; those rlock numbers don't match up. What
>> gives?
>>
>> The before _this_ patch number of 27k787 should be the same as the after
>> first patch number of 30k172.
> The rlock number in patch 1 refers to down_read() which uses xadd. The
> number here in patch 2 refers specifically to down_read_trylock() which
> uses cmpxchg() as this patch changes only __down_read_tryulock(). So the
> performance data differ.

You can see that the performance is worse if we use cmpxchg for
down_read instead of using xadd.

Cheers,
Longman
Linus Torvalds - Feb. 12, 2019, 7:58 p.m.
On Mon, Feb 11, 2019 at 11:31 AM Waiman Long <longman@redhat.com> wrote:
>
> Modify __down_read_trylock() to make it generate slightly better code
> (smaller and maybe a tiny bit faster).

This looks good, but I would ask you to try one slightly different approach.

Instead of this:

>        long tmp = atomic_long_read(&sem->count);
>
>        while (tmp >= 0) {
>                if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
>                                        tmp + RWSEM_ACTIVE_READ_BIAS)) {
>                        return 1;
>                }
>        }

try doing this instead:

        long tmp = 0;

        do {
                if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
                                        tmp + RWSEM_ACTIVE_READ_BIAS)) {
                        return 1;
        } while (tmp >= 0);
        return 0;

because especially when it comes to locking, it's usually better to
just *guess* that the lock is unlocked, than it is to actually read
from the line to see what the state is.

Often - but certainly not always - the lock is the first access to the
target cacheline, and assuming the trylock is successful (which I
think is the case we want to optimize for), we're much better off
causing that first access to be a read-for-ownership, rather than a
read-for-sharing.

Because if you first read from the line, and then do a cmpxchg, and if
the line was not in the cache, your cache coherency protocol will
generally go through two states: first shared (for the initial read)
and then exclusive-dirty (for the cmpxchg).

Now, this is obviously very micro-architecture dependent, and in fact
the microarchitecture could even see the "predict fallthrough to a
cmpxchg with the same address" and turn the first read into a
read-for-ownership, but we've done this at some point before, and the
"guess unlocked" was actually the one that performed better.

Of course, the downside is that it might be worse when the guess is
incorrect - either because of a nested read lock or due to an actual
conflict with a write, but on the whole those *should* be the rare
cases, and not the cases where we necessarily optimize for latency of
the operation.

Hmm?

                    Linus
Waiman Long - Feb. 12, 2019, 9:21 p.m.
On 02/12/2019 02:58 PM, Linus Torvalds wrote:
> On Mon, Feb 11, 2019 at 11:31 AM Waiman Long <longman@redhat.com> wrote:
>> Modify __down_read_trylock() to make it generate slightly better code
>> (smaller and maybe a tiny bit faster).
> This looks good, but I would ask you to try one slightly different approach.
>
> Instead of this:
>
>>        long tmp = atomic_long_read(&sem->count);
>>
>>        while (tmp >= 0) {
>>                if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
>>                                        tmp + RWSEM_ACTIVE_READ_BIAS)) {
>>                        return 1;
>>                }
>>        }
> try doing this instead:
>
>         long tmp = 0;
>
>         do {
>                 if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
>                                         tmp + RWSEM_ACTIVE_READ_BIAS)) {
>                         return 1;
>         } while (tmp >= 0);
>         return 0;
>
> because especially when it comes to locking, it's usually better to
> just *guess* that the lock is unlocked, than it is to actually read
> from the line to see what the state is.
>
> Often - but certainly not always - the lock is the first access to the
> target cacheline, and assuming the trylock is successful (which I
> think is the case we want to optimize for), we're much better off
> causing that first access to be a read-for-ownership, rather than a
> read-for-sharing.
>
> Because if you first read from the line, and then do a cmpxchg, and if
> the line was not in the cache, your cache coherency protocol will
> generally go through two states: first shared (for the initial read)
> and then exclusive-dirty (for the cmpxchg).
>
> Now, this is obviously very micro-architecture dependent, and in fact
> the microarchitecture could even see the "predict fallthrough to a
> cmpxchg with the same address" and turn the first read into a
> read-for-ownership, but we've done this at some point before, and the
> "guess unlocked" was actually the one that performed better.
>
> Of course, the downside is that it might be worse when the guess is
> incorrect - either because of a nested read lock or due to an actual
> conflict with a write, but on the whole those *should* be the rare
> cases, and not the cases where we necessarily optimize for latency of
> the operation.
>
> Hmm?

I looked at the assembly code in arch/x86/include/asm/rwsem.h. For both
trylocks (read & write), the count is read first before attempting to
lock it. We did the same for all trylock functions in other locks.
Depending on how the trylock is used and how contended the lock is, it
may help or hurt performance. Changing down_read_trylock to do an
unconditional cmpxchg will change the performance profile of existing
code. So I would prefer keeping the current code.

I do notice now that the generic down_write_trylock() code is doing an
unconditional compxchg. So I wonder if we should change it to read the
lock first like other trylocks or just leave it as it is.

Cheers,
Longman
Ingo Molnar - Feb. 13, 2019, 7:45 a.m.
* Waiman Long <longman@redhat.com> wrote:

> I looked at the assembly code in arch/x86/include/asm/rwsem.h. For both
> trylocks (read & write), the count is read first before attempting to
> lock it. We did the same for all trylock functions in other locks.
> Depending on how the trylock is used and how contended the lock is, it
> may help or hurt performance. Changing down_read_trylock to do an
> unconditional cmpxchg will change the performance profile of existing
> code. So I would prefer keeping the current code.
> 
> I do notice now that the generic down_write_trylock() code is doing an
> unconditional compxchg. So I wonder if we should change it to read the
> lock first like other trylocks or just leave it as it is.

No, I think we should instead move the other trylocks to the 
try-for-ownership model as well, like Linus suggested.

That's the general assumption we make in locking primitives, that we 
optimize for the common, expected case - which would be that the trylock 
succeeds, and I don't see why trylock primitives should be different.

In fact I can see more ways for read-for-sharing to perform suboptimally 
on larger systems.

Thanks,

	Ingo
Waiman Long - Feb. 13, 2019, 3:33 p.m.
On 02/13/2019 02:45 AM, Ingo Molnar wrote:
> * Waiman Long <longman@redhat.com> wrote:
>
>> I looked at the assembly code in arch/x86/include/asm/rwsem.h. For both
>> trylocks (read & write), the count is read first before attempting to
>> lock it. We did the same for all trylock functions in other locks.
>> Depending on how the trylock is used and how contended the lock is, it
>> may help or hurt performance. Changing down_read_trylock to do an
>> unconditional cmpxchg will change the performance profile of existing
>> code. So I would prefer keeping the current code.
>>
>> I do notice now that the generic down_write_trylock() code is doing an
>> unconditional compxchg. So I wonder if we should change it to read the
>> lock first like other trylocks or just leave it as it is.
> No, I think we should instead move the other trylocks to the 
> try-for-ownership model as well, like Linus suggested.
>
> That's the general assumption we make in locking primitives, that we 
> optimize for the common, expected case - which would be that the trylock 
> succeeds, and I don't see why trylock primitives should be different.
>
> In fact I can see more ways for read-for-sharing to perform suboptimally 
> on larger systems.

I don't mind changing to the try-for-ownership model for rwsem and
mutex. I do have some concern to do that for spinlock. Some of the lock
slowpath code do optimistic trylock. Making them unconditional cmpxchg
will impact lock contention performance.

I will update this rwsem patch to make the change while I am working on
it. For other locks, I will suggest we go slow and carefully evaluate
the performance implication before we make the changes.

Cheers,
Longman

Patch

diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 067e265..028bc33 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -175,11 +175,11 @@  static inline int __down_read_killable(struct rw_semaphore *sem)
 
 static inline int __down_read_trylock(struct rw_semaphore *sem)
 {
-	long tmp;
+	long tmp = atomic_long_read(&sem->count);
 
-	while ((tmp = atomic_long_read(&sem->count)) >= 0) {
-		if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
-				   tmp + RWSEM_ACTIVE_READ_BIAS)) {
+	while (tmp >= 0) {
+		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+					tmp + RWSEM_ACTIVE_READ_BIAS)) {
 			return 1;
 		}
 	}