Patchwork [RFC,v3,3/6] sched/idle: Add a generic poll before enter real idle path

login
register
mail settings
Submitter Quan Xu
Date Nov. 16, 2017, 9:12 a.m.
Message ID <46086489-5a01-16e1-9314-70ae53c01952@gmail.com>
Download mbox | patch
Permalink /patch/385019/
State New
Headers show

Comments

Quan Xu - Nov. 16, 2017, 9:12 a.m.
On 2017-11-16 06:03, Thomas Gleixner wrote:
> On Wed, 15 Nov 2017, Peter Zijlstra wrote:
>
>> On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote:
>>> From: Yang Zhang <yang.zhang.wz@gmail.com>
>>>
>>> Implement a generic idle poll which resembles the functionality
>>> found in arch/. Provide weak arch_cpu_idle_poll function which
>>> can be overridden by the architecture code if needed.
>> No, we want less of those magic hooks, not more.
>>
>>> Interrupts arrive which may not cause a reschedule in idle loops.
>>> In KVM guest, this costs several VM-exit/VM-entry cycles, VM-entry
>>> for interrupts and VM-exit immediately. Also this becomes more
>>> expensive than bare metal. Add a generic idle poll before enter
>>> real idle path. When a reschedule event is pending, we can bypass
>>> the real idle path.
>> Why not do a HV specific idle driver?
> If I understand the problem correctly then he wants to avoid the heavy
> lifting in tick_nohz_idle_enter() in the first place, but there is already
> an interesting quirk there which makes it exit early.  See commit
> 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this commit
> looks similar. But lets not proliferate that. I'd rather see that go away.

agreed.

Even we can get more benifit than commit 3c5d92a0cfb5 ("nohz: Introduce 
arch_needs_cpu")
in kvm guest. I won't proliferate that..

> But the irq_timings stuff is heading into the same direction, with a more
> complex prediction logic which should tell you pretty good how long that
> idle period is going to be and in case of an interrupt heavy workload this
> would skip the extra work of stopping and restarting the tick and provide a
> very good input into a polling decision.


interesting. I have tested with IRQ_TIMINGS related code, which seems 
not working so far.
Also I'd like to help as much as I can.
> This can be handled either in a HV specific idle driver or even in the
> generic core code. If the interrupt does not arrive then you can assume
> within the predicted time then you can assume that the flood stopped and
> invoke halt or whatever.
>
> That avoids all of that 'tunable and tweakable' x86 specific hackery and
> utilizes common functionality which is mostly there already.
here is some sample code. Poll for a while before enter halt in 
cpuidle_enter_state()
If I get a reschedule event, then don't try to enter halt.  (I hope this 
is the right direction as Peter mentioned in another email)





thanks,

Quan
Alibaba Cloud
Daniel Lezcano - Nov. 16, 2017, 9:45 a.m.
On 16/11/2017 10:12, Quan Xu wrote:
> 
> 
> On 2017-11-16 06:03, Thomas Gleixner wrote:
>> On Wed, 15 Nov 2017, Peter Zijlstra wrote:
>>
>>> On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote:
>>>> From: Yang Zhang <yang.zhang.wz@gmail.com>
>>>>
>>>> Implement a generic idle poll which resembles the functionality
>>>> found in arch/. Provide weak arch_cpu_idle_poll function which
>>>> can be overridden by the architecture code if needed.
>>> No, we want less of those magic hooks, not more.
>>>
>>>> Interrupts arrive which may not cause a reschedule in idle loops.
>>>> In KVM guest, this costs several VM-exit/VM-entry cycles, VM-entry
>>>> for interrupts and VM-exit immediately. Also this becomes more
>>>> expensive than bare metal. Add a generic idle poll before enter
>>>> real idle path. When a reschedule event is pending, we can bypass
>>>> the real idle path.
>>> Why not do a HV specific idle driver?
>> If I understand the problem correctly then he wants to avoid the heavy
>> lifting in tick_nohz_idle_enter() in the first place, but there is
>> already
>> an interesting quirk there which makes it exit early.  See commit
>> 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this
>> commit
>> looks similar. But lets not proliferate that. I'd rather see that go
>> away.
> 
> agreed.
> 
> Even we can get more benifit than commit 3c5d92a0cfb5 ("nohz: Introduce
> arch_needs_cpu")
> in kvm guest. I won't proliferate that..
> 
>> But the irq_timings stuff is heading into the same direction, with a more
>> complex prediction logic which should tell you pretty good how long that
>> idle period is going to be and in case of an interrupt heavy workload
>> this
>> would skip the extra work of stopping and restarting the tick and
>> provide a
>> very good input into a polling decision.
> 
> 
> interesting. I have tested with IRQ_TIMINGS related code, which seems
> not working so far.

I don't know how you tested it, can you elaborate what you meant by
"seems not working so far" ?

There are still some work to do to be more efficient. The prediction
based on the irq timings is all right if the interrupts have a simple
periodicity. But as soon as there is a pattern, the current code can't
handle it properly and does bad predictions.

I'm working on a self-learning pattern detection which is too heavy for
the kernel, and with it we should be able to detect properly the
patterns and re-ajust the period if it changes. I'm in the process of
making it suitable for kernel code (both math and perf).

One improvement which can be done right now and which can help you is
the interrupts rate on the CPU. It is possible to compute it and that
will give an accurate information for the polling decision.
Thomas Gleixner - Nov. 16, 2017, 9:53 a.m.
On Thu, 16 Nov 2017, Quan Xu wrote:
> On 2017-11-16 06:03, Thomas Gleixner wrote:
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev,
> struct cpuidle_driver *drv,
>                 target_state = &drv->states[index];
>         }
> 
> +#ifdef CONFIG_PARAVIRT
> +       paravirt_idle_poll();
> +
> +       if (need_resched())
> +               return -EBUSY;
> +#endif

That's just plain wrong. We don't want to see any of this PARAVIRT crap in
anything outside the architecture/hypervisor interfacing code which really
needs it.

The problem can and must be solved at the generic level in the first place
to gather the data which can be used to make such decisions.

How that information is used might be either completely generic or requires
system specific variants. But as long as we don't have any information at
all we cannot discuss that.

Please sit down and write up which data needs to be considered to make
decisions about probabilistic polling. Then we need to compare and contrast
that with the data which is necessary to make power/idle state decisions.

I would be very surprised if this data would not overlap by at least 90%.

Thanks,

	tglx
Quan Xu - Nov. 17, 2017, 11:23 a.m.
On 2017-11-16 17:53, Thomas Gleixner wrote:
> On Thu, 16 Nov 2017, Quan Xu wrote:
>> On 2017-11-16 06:03, Thomas Gleixner wrote:
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev,
>> struct cpuidle_driver *drv,
>>                  target_state = &drv->states[index];
>>          }
>>
>> +#ifdef CONFIG_PARAVIRT
>> +       paravirt_idle_poll();
>> +
>> +       if (need_resched())
>> +               return -EBUSY;
>> +#endif
> That's just plain wrong. We don't want to see any of this PARAVIRT crap in
> anything outside the architecture/hypervisor interfacing code which really
> needs it.
>
> The problem can and must be solved at the generic level in the first place
> to gather the data which can be used to make such decisions.
>
> How that information is used might be either completely generic or requires
> system specific variants. But as long as we don't have any information at
> all we cannot discuss that.
>
> Please sit down and write up which data needs to be considered to make
> decisions about probabilistic polling. Then we need to compare and contrast
> that with the data which is necessary to make power/idle state decisions.
>
> I would be very surprised if this data would not overlap by at least 90%.
>

Peter, tglx
Thanks for your comments..

rethink of this patch set,

1. which data needs to considerd to make decisions about probabilistic 
polling

I really need to write up which data needs to considerd to make
decisions about probabilistic polling. At last several months,
I always focused on the data _from idle to reschedule_, then to bypass
the idle loops. unfortunately, this makes me touch scheduler/idle/nohz
code inevitably.

with tglx's suggestion, the data which is necessary to make power/idle
state decisions, is the last idle state's residency time. IIUC this data
is duration from idle to wakeup, which maybe by reschedule irq or other irq.

I also test that the reschedule irq overlap by more than 90% (trace the
need_resched status after cpuidle_idle_call), when I run ctxsw/netperf for
one minute.

as the overlap, I think I can input the last idle state's residency time
to make decisions about probabilistic polling, as @dev->last_residency does.
it is much easier to get data.


2. do a HV specific idle driver (function)

so far, power management is not exposed to guest.. idle is simple for 
KVM guest,
calling "sti" / "hlt"(cpuidle_idle_call() --> default_idle_call())..
thanks Xen guys, who has implemented the paravirt framework. I can 
implement it
as easy as following:

              --- a/arch/x86/kernel/kvm.c
              +++ b/arch/x86/kernel/kvm.c
              @@ -465,6 +465,12 @@ static void __init 
kvm_apf_trap_init(void)
                      update_intr_gate(X86_TRAP_PF, async_page_fault);
               }

              +static __cpuidle void kvm_safe_halt(void)
              +{
          +        /* 1. POLL, if need_resched() --> return */
          +
              +        asm volatile("sti; hlt": : :"memory"); /* 2. halt */
              +
          +        /* 3. get the last idle state's residency time */
              +
          +        /* 4. update poll duration based on last idle state's 
residency time */
              +}
              +
               void __init kvm_guest_init(void)
               {
                      int i;
              @@ -490,6 +496,8 @@ void __init kvm_guest_init(void)
                      if (kvmclock_vsyscall)
                              kvm_setup_vsyscall_timeinfo();

              +       pv_irq_ops.safe_halt = kvm_safe_halt;
              +
               #ifdef CONFIG_SMP




then, I am no need to introduce a new pvops, and never modify 
schedule/idle/nohz code again.
also I can narrow all of the code down in arch/x86/kernel/kvm.c.

If this is in the right direction, I will send a new patch set next week..

thanks,

Quan
Alibaba Cloud
Thomas Gleixner - Nov. 17, 2017, 11:36 a.m.
On Fri, 17 Nov 2017, Quan Xu wrote:
> On 2017-11-16 17:53, Thomas Gleixner wrote:
> > That's just plain wrong. We don't want to see any of this PARAVIRT crap in
> > anything outside the architecture/hypervisor interfacing code which really
> > needs it.
> > 
> > The problem can and must be solved at the generic level in the first place
> > to gather the data which can be used to make such decisions.
> > 
> > How that information is used might be either completely generic or requires
> > system specific variants. But as long as we don't have any information at
> > all we cannot discuss that.
> > 
> > Please sit down and write up which data needs to be considered to make
> > decisions about probabilistic polling. Then we need to compare and contrast
> > that with the data which is necessary to make power/idle state decisions.
> > 
> > I would be very surprised if this data would not overlap by at least 90%.
> > 
> 1. which data needs to considerd to make decisions about probabilistic polling
> 
> I really need to write up which data needs to considerd to make
> decisions about probabilistic polling. At last several months,
> I always focused on the data _from idle to reschedule_, then to bypass
> the idle loops. unfortunately, this makes me touch scheduler/idle/nohz
> code inevitably.
> 
> with tglx's suggestion, the data which is necessary to make power/idle
> state decisions, is the last idle state's residency time. IIUC this data
> is duration from idle to wakeup, which maybe by reschedule irq or other irq.

That's part of the picture, but not complete.

> I also test that the reschedule irq overlap by more than 90% (trace the
> need_resched status after cpuidle_idle_call), when I run ctxsw/netperf for
> one minute.
> 
> as the overlap, I think I can input the last idle state's residency time
> to make decisions about probabilistic polling, as @dev->last_residency does.
> it is much easier to get data.

That's only true for your particular use case.

> 
> 2. do a HV specific idle driver (function)
> 
> so far, power management is not exposed to guest.. idle is simple for KVM
> guest,
> calling "sti" / "hlt"(cpuidle_idle_call() --> default_idle_call())..
> thanks Xen guys, who has implemented the paravirt framework. I can implement
> it
> as easy as following:
> 
>              --- a/arch/x86/kernel/kvm.c

Your email client is using a very strange formatting. 

>              +++ b/arch/x86/kernel/kvm.c
>              @@ -465,6 +465,12 @@ static void __init kvm_apf_trap_init(void)
>                      update_intr_gate(X86_TRAP_PF, async_page_fault);
>               }
> 
>              +static __cpuidle void kvm_safe_halt(void)
>              +{
>          +        /* 1. POLL, if need_resched() --> return */
>          +
>              +        asm volatile("sti; hlt": : :"memory"); /* 2. halt */
>              +
>          +        /* 3. get the last idle state's residency time */
>              +
>          +        /* 4. update poll duration based on last idle state's
> residency time */
>              +}
>              +
>               void __init kvm_guest_init(void)
>               {
>                      int i;
>              @@ -490,6 +496,8 @@ void __init kvm_guest_init(void)
>                      if (kvmclock_vsyscall)
>                              kvm_setup_vsyscall_timeinfo();
> 
>              +       pv_irq_ops.safe_halt = kvm_safe_halt;
>              +
>               #ifdef CONFIG_SMP
> 
> 
> then, I am no need to introduce a new pvops, and never modify
> schedule/idle/nohz code again.
> also I can narrow all of the code down in arch/x86/kernel/kvm.c.
> 
> If this is in the right direction, I will send a new patch set next week..

This is definitely better than what you proposed so far and implementing it
as a prove of concept seems to be worthwhile.

But I doubt that this is the final solution. It's not generic and not
necessarily suitable for all use case scenarios.

Thanks,

	tglx
Quan Xu - Nov. 17, 2017, 12:21 p.m.
On 2017-11-17 19:36, Thomas Gleixner wrote:
> On Fri, 17 Nov 2017, Quan Xu wrote:
>> On 2017-11-16 17:53, Thomas Gleixner wrote:
>>> That's just plain wrong. We don't want to see any of this PARAVIRT crap in
>>> anything outside the architecture/hypervisor interfacing code which really
>>> needs it.
>>>
>>> The problem can and must be solved at the generic level in the first place
>>> to gather the data which can be used to make such decisions.
>>>
>>> How that information is used might be either completely generic or requires
>>> system specific variants. But as long as we don't have any information at
>>> all we cannot discuss that.
>>>
>>> Please sit down and write up which data needs to be considered to make
>>> decisions about probabilistic polling. Then we need to compare and contrast
>>> that with the data which is necessary to make power/idle state decisions.
>>>
>>> I would be very surprised if this data would not overlap by at least 90%.
>>>
>> 1. which data needs to considerd to make decisions about probabilistic polling
>>
>> I really need to write up which data needs to considerd to make
>> decisions about probabilistic polling. At last several months,
>> I always focused on the data _from idle to reschedule_, then to bypass
>> the idle loops. unfortunately, this makes me touch scheduler/idle/nohz
>> code inevitably.
>>
>> with tglx's suggestion, the data which is necessary to make power/idle
>> state decisions, is the last idle state's residency time. IIUC this data
>> is duration from idle to wakeup, which maybe by reschedule irq or other irq.
> That's part of the picture, but not complete.

tglx, could you share more? I am very curious about it..

>> I also test that the reschedule irq overlap by more than 90% (trace the
>> need_resched status after cpuidle_idle_call), when I run ctxsw/netperf for
>> one minute.
>>
>> as the overlap, I think I can input the last idle state's residency time
>> to make decisions about probabilistic polling, as @dev->last_residency does.
>> it is much easier to get data.
> That's only true for your particular use case.
>
>> 2. do a HV specific idle driver (function)
>>
>> so far, power management is not exposed to guest.. idle is simple for KVM
>> guest,
>> calling "sti" / "hlt"(cpuidle_idle_call() --> default_idle_call())..
>> thanks Xen guys, who has implemented the paravirt framework. I can implement
>> it
>> as easy as following:
>>
>>               --- a/arch/x86/kernel/kvm.c
> Your email client is using a very strange formatting.

my bad, I insert space to highlight these code.

> This is definitely better than what you proposed so far and implementing it
> as a prove of concept seems to be worthwhile.
>
> But I doubt that this is the final solution. It's not generic and not
> necessarily suitable for all use case scenarios.
>
>
yes, I am exhausted :):)


could you tell me the gap to be generic and necessarily suitable for
all use case scenarios? as lack of irq/idle predictors?

  I really want to upstream it for all of public cloud users/providers..

as kvm host has a similar one, is it possible to upstream with following 
conditions? :
     1). add a QEMU configuration, whether enable or not, by default 
disable.
     2). add some "TODO" comments near the code.
     3). ...


anyway, thanks for your help..

Quan
  Alibaba Cloud
Quan Xu - Nov. 20, 2017, 7:05 a.m.
On 2017-11-16 17:45, Daniel Lezcano wrote:
> On 16/11/2017 10:12, Quan Xu wrote:
>>
>> On 2017-11-16 06:03, Thomas Gleixner wrote:
>>> On Wed, 15 Nov 2017, Peter Zijlstra wrote:
>>>
>>>> On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote:
>>>>> From: Yang Zhang <yang.zhang.wz@gmail.com>
>>>>>
>>>>> Implement a generic idle poll which resembles the functionality
>>>>> found in arch/. Provide weak arch_cpu_idle_poll function which
>>>>> can be overridden by the architecture code if needed.
>>>> No, we want less of those magic hooks, not more.
>>>>
>>>>> Interrupts arrive which may not cause a reschedule in idle loops.
>>>>> In KVM guest, this costs several VM-exit/VM-entry cycles, VM-entry
>>>>> for interrupts and VM-exit immediately. Also this becomes more
>>>>> expensive than bare metal. Add a generic idle poll before enter
>>>>> real idle path. When a reschedule event is pending, we can bypass
>>>>> the real idle path.
>>>> Why not do a HV specific idle driver?
>>> If I understand the problem correctly then he wants to avoid the heavy
>>> lifting in tick_nohz_idle_enter() in the first place, but there is
>>> already
>>> an interesting quirk there which makes it exit early.  See commit
>>> 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this
>>> commit
>>> looks similar. But lets not proliferate that. I'd rather see that go
>>> away.
>> agreed.
>>
>> Even we can get more benifit than commit 3c5d92a0cfb5 ("nohz: Introduce
>> arch_needs_cpu")
>> in kvm guest. I won't proliferate that..
>>
>>> But the irq_timings stuff is heading into the same direction, with a more
>>> complex prediction logic which should tell you pretty good how long that
>>> idle period is going to be and in case of an interrupt heavy workload
>>> this
>>> would skip the extra work of stopping and restarting the tick and
>>> provide a
>>> very good input into a polling decision.
>>
>> interesting. I have tested with IRQ_TIMINGS related code, which seems
>> not working so far.
> I don't know how you tested it, can you elaborate what you meant by
> "seems not working so far" ?

Daniel, I tried to enable IRQ_TIMINGS* manually. used 
irq_timings_next_event()
to return estimation of the earliest interrupt. However I got a constant.

> There are still some work to do to be more efficient. The prediction
> based on the irq timings is all right if the interrupts have a simple
> periodicity. But as soon as there is a pattern, the current code can't
> handle it properly and does bad predictions.
>
> I'm working on a self-learning pattern detection which is too heavy for
> the kernel, and with it we should be able to detect properly the
> patterns and re-ajust the period if it changes. I'm in the process of
> making it suitable for kernel code (both math and perf).
>
> One improvement which can be done right now and which can help you is
> the interrupts rate on the CPU. It is possible to compute it and that
> will give an accurate information for the polling decision.
>
>
As tglx said, talk to each other / work together to make it usable for 
all use cases.
could you share how to enable it to get the interrupts rate on the CPU? 
I can try it
in cloud scenario. of course, I'd like to work with you to improve it.

Quan
Alibaba Cloud
Daniel Lezcano - Nov. 20, 2017, 6:01 p.m.
On 20/11/2017 08:05, Quan Xu wrote:

[ ... ]

>>>> But the irq_timings stuff is heading into the same direction, with a
>>>> more
>>>> complex prediction logic which should tell you pretty good how long
>>>> that
>>>> idle period is going to be and in case of an interrupt heavy workload
>>>> this
>>>> would skip the extra work of stopping and restarting the tick and
>>>> provide a
>>>> very good input into a polling decision.
>>>
>>> interesting. I have tested with IRQ_TIMINGS related code, which seems
>>> not working so far.
>> I don't know how you tested it, can you elaborate what you meant by
>> "seems not working so far" ?
> 
> Daniel, I tried to enable IRQ_TIMINGS* manually. used
> irq_timings_next_event()
> to return estimation of the earliest interrupt. However I got a constant.

The irq timings gives you an indication of the next interrupt deadline.

This information is a piece of the puzzle, you need to combine it with
the next timer expiration, and the next scheduling event. Then take the
earliest event in a timeline basis.

Using the trivial scheme above will work well with workload like videos
or mp3 but will fail as soon as the interrupts are not coming in a
regular basis and this is where the pattern recognition algorithm must act.

>> There are still some work to do to be more efficient. The prediction
>> based on the irq timings is all right if the interrupts have a simple
>> periodicity. But as soon as there is a pattern, the current code can't
>> handle it properly and does bad predictions.
>>
>> I'm working on a self-learning pattern detection which is too heavy for
>> the kernel, and with it we should be able to detect properly the
>> patterns and re-ajust the period if it changes. I'm in the process of
>> making it suitable for kernel code (both math and perf).
>>
>> One improvement which can be done right now and which can help you is
>> the interrupts rate on the CPU. It is possible to compute it and that
>> will give an accurate information for the polling decision.
>>
>>
> As tglx said, talk to each other / work together to make it usable for
> all use cases.
> could you share how to enable it to get the interrupts rate on the CPU?
> I can try it
> in cloud scenario. of course, I'd like to work with you to improve it.

Sure, I will be glad if we can collaborate. I have some draft code but
before sharing it I would like we define what is the rate and what kind
of information we expect to infer from it. From my point of view it is a
value indicating the interrupt period per CPU, a short value indicates a
high number of interrupts on the CPU.

This value must decay with the time, the question here is what decay
function we apply to the rate from the last timestamp ?

Patch

--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -210,6 +210,13 @@  int cpuidle_enter_state(struct cpuidle_device *dev, 
struct cpuidle_driver *drv,
                 target_state = &drv->states[index];
         }

+#ifdef CONFIG_PARAVIRT
+       paravirt_idle_poll();
+
+       if (need_resched())
+               return -EBUSY;
+#endif
+
         /* Take note of the planned idle state. */
         sched_idle_set_state(target_state);