Patchwork [1/3] udc: net2280: Fix net2280_disable

login
register
mail settings
Submitter Guido Kiener
Date Feb. 4, 2019, 6:04 p.m.
Message ID <20190204180422.5095-1-guido@kiener-muenchen.de>
Download mbox | patch
Permalink /patch/717589/
State New
Headers show

Comments

Guido Kiener - Feb. 4, 2019, 6:04 p.m.
From: Guido Kiener <guido.kiener@rohde-schwarz.com>

A reset e.g. calling ep_reset_338x() can happen while endpoints
are enabled. The ep_reset_338x() sets ep->desc = NULL to mark
endpoint being invalid. A subsequent call of net2280_disable will
fail and return -EINVAL to parent function usb_ep_disable(),
which will fail, too, and do not set the member ep->enabled = false.

See:
https://elixir.bootlin.com/linux/v5.0-rc5/source/drivers/usb/gadget/udc/core.c#L139

This fix ignores dp->desc and allows net2280_disable() to succeed.
Subsequent calls to usb_ep_enable()/usb_ep_disable() succeeds.

Signed-off-by: Guido Kiener <guido.kiener@rohde-schwarz.com>
---
 drivers/usb/gadget/udc/net2280.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
Alan Stern - Feb. 4, 2019, 6:37 p.m.
On Mon, 4 Feb 2019, Guido Kiener wrote:

> From: Guido Kiener <guido.kiener@rohde-schwarz.com>
> 
> A reset e.g. calling ep_reset_338x() can happen while endpoints
> are enabled.

How can this happen?  That routine is called from only two places.  
One is in net2280_disable(), so after the endpoint has already been
disabled.  The other is in usb_reinit_338x() via usb_reinit() via 
stop_activity(), which disconnects the gadget driver and thereby
disables all the endpoints.

> The ep_reset_338x() sets ep->desc = NULL to mark
> endpoint being invalid. A subsequent call of net2280_disable will
> fail and return -EINVAL to parent function usb_ep_disable(),
> which will fail, too, and do not set the member ep->enabled = false.

So maybe a better fix (assuming there really is a problem) is to make
usb_ep_disable() clear ep->enabled always.  After all, the only 
reasonable way for usb_ep_disable() to fail is if the endpoint isn't 
enabled in the first place.

Alan Stern

> See:
> https://elixir.bootlin.com/linux/v5.0-rc5/source/drivers/usb/gadget/udc/core.c#L139
> 
> This fix ignores dp->desc and allows net2280_disable() to succeed.
> Subsequent calls to usb_ep_enable()/usb_ep_disable() succeeds.
> 
> Signed-off-by: Guido Kiener <guido.kiener@rohde-schwarz.com>
> ---
>  drivers/usb/gadget/udc/net2280.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/usb/gadget/udc/net2280.c b/drivers/usb/gadget/udc/net2280.c
> index e7dae5379e04..7154f00dea40 100644
> --- a/drivers/usb/gadget/udc/net2280.c
> +++ b/drivers/usb/gadget/udc/net2280.c
> @@ -516,8 +516,8 @@ static int net2280_disable(struct usb_ep *_ep)
>  	unsigned long		flags;
>  
>  	ep = container_of(_ep, struct net2280_ep, ep);
> -	if (!_ep || !ep->desc || _ep->name == ep0name) {
> -		pr_err("%s: Invalid ep=%p or ep->desc\n", __func__, _ep);
> +	if (!_ep || _ep->name == ep0name) {
> +		pr_err("%s: Invalid ep=%p\n", __func__, _ep);
>  		return -EINVAL;
>  	}
>  	spin_lock_irqsave(&ep->dev->lock, flags);
>
Guido Kiener - Feb. 5, 2019, 3:21 p.m.
Zitat von Alan Stern <stern@rowland.harvard.edu>:

> On Mon, 4 Feb 2019, Guido Kiener wrote:
>
>> From: Guido Kiener <guido.kiener@rohde-schwarz.com>
>>
>> A reset e.g. calling ep_reset_338x() can happen while endpoints
>> are enabled.
>
> How can this happen?  That routine is called from only two places.
> One is in net2280_disable(), so after the endpoint has already been
> disabled.  The other is in usb_reinit_338x() via usb_reinit() via
> stop_activity(), which disconnects the gadget driver and thereby
> disables all the endpoints.
>
Yes, the routine is called when I stop my application. See callstack:
Call Trace:
   net2280_disable: Invalid ep=00000000341c7996 or ep->desc
   usb_ep_disable+0x24/0xa0 [udc_core]
   disableEndpoints+0x1c/0x80 [rsusbtmc]
   tmc_func_disable+0x29/0xa0 [rsusbtmc]
   reset_config+0x3e/0xa0 [libcomposite]
   composite_disconnect+0x36/0x60 [libcomposite]
   net2280_pullup+0x9e/0xf0 [net2280]
   usb_gadget_disconnect+0x35/0xc0 [udc_core]
   usb_gadget_remove_driver+0x29/0xa0 [udc_core]
   usb_gadget_unregister_driver+0xc1/0xf0 [udc_core]
   usb_composite_unregister+0x12/0x20 [libcomposite]
   dev_release+0x3f/0x80 [rsusbtmc]
   __fput+0x102/0x230
   ____fput+0xe/0x10
   task_work_run+0x90/0xc0
   exit_to_usermode_loop+0xf2/0x100
   do_syscall_64+0xfb/0x120
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

When I restart my application then the endpoints remains disabled.
A workaround is to reload our (legacy) gadget driver, libcomposite
and net2280.

>> The ep_reset_338x() sets ep->desc = NULL to mark
>> endpoint being invalid. A subsequent call of net2280_disable will
>> fail and return -EINVAL to parent function usb_ep_disable(),
>> which will fail, too, and do not set the member ep->enabled = false.
>
> So maybe a better fix (assuming there really is a problem) is to make
> usb_ep_disable() clear ep->enabled always.  After all, the only
> reasonable way for usb_ep_disable() to fail is if the endpoint isn't
> enabled in the first place.
>
I had the same idea. However the root cause is the net2280 driver.
We have not seen the problem with other function drivers.
The usb_ep_disable() implementation looks reasonable to me. When the
ret = ep->ops->disable(ep); function fails for any other fictitious
reason (e.g. conflicts, order of endpoints..) then it might be better to
set ep->enabled = false when the function driver really succeeds.

Regards,

Guido
Alan Stern - Feb. 5, 2019, 3:52 p.m.
On Tue, 5 Feb 2019 guido@kiener-muenchen.de wrote:

> Zitat von Alan Stern <stern@rowland.harvard.edu>:
> 
> > On Mon, 4 Feb 2019, Guido Kiener wrote:
> >
> >> From: Guido Kiener <guido.kiener@rohde-schwarz.com>
> >>
> >> A reset e.g. calling ep_reset_338x() can happen while endpoints
> >> are enabled.
> >
> > How can this happen?  That routine is called from only two places.
> > One is in net2280_disable(), so after the endpoint has already been
> > disabled.  The other is in usb_reinit_338x() via usb_reinit() via
> > stop_activity(), which disconnects the gadget driver and thereby
> > disables all the endpoints.
> >
> Yes, the routine is called when I stop my application. See callstack:
> Call Trace:
>    net2280_disable: Invalid ep=00000000341c7996 or ep->desc
>    usb_ep_disable+0x24/0xa0 [udc_core]
>    disableEndpoints+0x1c/0x80 [rsusbtmc]
>    tmc_func_disable+0x29/0xa0 [rsusbtmc]
>    reset_config+0x3e/0xa0 [libcomposite]
>    composite_disconnect+0x36/0x60 [libcomposite]
>    net2280_pullup+0x9e/0xf0 [net2280]
>    usb_gadget_disconnect+0x35/0xc0 [udc_core]
>    usb_gadget_remove_driver+0x29/0xa0 [udc_core]
>    usb_gadget_unregister_driver+0xc1/0xf0 [udc_core]
>    usb_composite_unregister+0x12/0x20 [libcomposite]
>    dev_release+0x3f/0x80 [rsusbtmc]
>    __fput+0x102/0x230
>    ____fput+0xe/0x10
>    task_work_run+0x90/0xc0
>    exit_to_usermode_loop+0xf2/0x100
>    do_syscall_64+0xfb/0x120
>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> When I restart my application then the endpoints remains disabled.
> A workaround is to reload our (legacy) gadget driver, libcomposite
> and net2280.
> 
> >> The ep_reset_338x() sets ep->desc = NULL to mark
> >> endpoint being invalid. A subsequent call of net2280_disable will
> >> fail and return -EINVAL to parent function usb_ep_disable(),
> >> which will fail, too, and do not set the member ep->enabled = false.
> >
> > So maybe a better fix (assuming there really is a problem) is to make
> > usb_ep_disable() clear ep->enabled always.  After all, the only
> > reasonable way for usb_ep_disable() to fail is if the endpoint isn't
> > enabled in the first place.
> >
> I had the same idea. However the root cause is the net2280 driver.
> We have not seen the problem with other function drivers.
> The usb_ep_disable() implementation looks reasonable to me. When the
> ret = ep->ops->disable(ep); function fails for any other fictitious
> reason (e.g. conflicts, order of endpoints..) then it might be better to
> set ep->enabled = false when the function driver really succeeds.

All right, I withdraw the objection.

Acked-by: Alan Stern <stern@rowland.harvard.edu>

Alan Stern
Guido Kiener - Feb. 19, 2019, 2:45 p.m.
Hi Alan,

My last proposal "udc: net2280: Fix overrun of OUT messages" is still
under investigation.

During the random stress tests I found a new rare problem:

When a request must be dequeued with net2280_dequeue() e.g. due
to a device clear action and the same request is finished by the
function scan_dma_completions():

https://elixir.bootlin.com/linux/v5.0-rc7/source/drivers/usb/gadget/udc/net2280.c#L1269

then the following search loop does not find the request and the function
returns the error -EINVAL without restoring the state ep->stopped = stopped!

https://elixir.bootlin.com/linux/v5.0-rc7/source/drivers/usb/gadget/udc/net2280.c#L1269

Thus the endpoint keeps blocked and does not receive any data.
When we insert ep->stopped = stopped here:

	if (&req->req != _req) {
		ep->stopped = stopped; // <<<<<----
		spin_unlock_irqrestore(&ep->dev->lock, flags);
		dev_err(&ep->dev->pdev->dev, "%s: Request mismatch\n",
								__func__);
		return -EINVAL;
	}

then the driver continues as expected, although the driver issues the
false error message "Request mismatch".

What do you think? It's labor-intensive to fix the error message here,
or shall we leave it as it is, and only insert the "ep->stopped = stopped;"

Regards,

Guido
Alan Stern - Feb. 22, 2019, 9:52 p.m.
On Tue, 19 Feb 2019 guido@kiener-muenchen.de wrote:

> Hi Alan,
> 
> My last proposal "udc: net2280: Fix overrun of OUT messages" is still
> under investigation.
> 
> During the random stress tests I found a new rare problem:
> 
> When a request must be dequeued with net2280_dequeue() e.g. due
> to a device clear action and the same request is finished by the
> function scan_dma_completions():
> 
> https://elixir.bootlin.com/linux/v5.0-rc7/source/drivers/usb/gadget/udc/net2280.c#L1269
> 
> then the following search loop does not find the request and the function
> returns the error -EINVAL without restoring the state ep->stopped = stopped!
> 
> https://elixir.bootlin.com/linux/v5.0-rc7/source/drivers/usb/gadget/udc/net2280.c#L1269
> 
> Thus the endpoint keeps blocked and does not receive any data.
> When we insert ep->stopped = stopped here:
> 
> 	if (&req->req != _req) {
> 		ep->stopped = stopped; // <<<<<----
> 		spin_unlock_irqrestore(&ep->dev->lock, flags);
> 		dev_err(&ep->dev->pdev->dev, "%s: Request mismatch\n",
> 								__func__);
> 		return -EINVAL;
> 	}
> 
> then the driver continues as expected, although the driver issues the
> false error message "Request mismatch".
> 
> What do you think? It's labor-intensive to fix the error message here,
> or shall we leave it as it is, and only insert the "ep->stopped = stopped;"

Yes, this is a bug.  I think the dev_err() should be changed to 
ep_dbg().  It's not a real error, after all -- it's just a race.

Do that and insert the "ep->stopped = stopped;" line.  Also, consider 
changing the -EINVAL return code here to -EIDRM.  This is analogous to 
what usb_unlink_urb() does on the host side.

Alan Stern
Guido Kiener - Feb. 27, 2019, 1:26 p.m.
Zitat von Alan Stern <stern@rowland.harvard.edu>:

> On Tue, 19 Feb 2019 guido@kiener-muenchen.de wrote:
>
>> Hi Alan,
>>
>> My last proposal "udc: net2280: Fix overrun of OUT messages" is still
>> under investigation.
>>
>> During the random stress tests I found a new rare problem:
>>
>> When a request must be dequeued with net2280_dequeue() e.g. due
>> to a device clear action and the same request is finished by the
>> function scan_dma_completions():
>>
>> https://elixir.bootlin.com/linux/v5.0-rc7/source/drivers/usb/gadget/udc/net2280.c#L1269
>>
>> then the following search loop does not find the request and the function
>> returns the error -EINVAL without restoring the state ep->stopped = stopped!
>>
>> https://elixir.bootlin.com/linux/v5.0-rc7/source/drivers/usb/gadget/udc/net2280.c#L1269
>>
>> Thus the endpoint keeps blocked and does not receive any data.
>> When we insert ep->stopped = stopped here:
>>
>> 	if (&req->req != _req) {
>> 		ep->stopped = stopped; // <<<<<----
>> 		spin_unlock_irqrestore(&ep->dev->lock, flags);
>> 		dev_err(&ep->dev->pdev->dev, "%s: Request mismatch\n",
>> 								__func__);
>> 		return -EINVAL;
>> 	}
>>
>> then the driver continues as expected, although the driver issues the
>> false error message "Request mismatch".
>>
>> What do you think? It's labor-intensive to fix the error message here,
>> or shall we leave it as it is, and only insert the "ep->stopped = stopped;"
>
> Yes, this is a bug.  I think the dev_err() should be changed to
> ep_dbg().  It's not a real error, after all -- it's just a race.

I agree. ep_dbg() is better here. I see the same bug in the sibling driver
net2272_dequeue():
https://elixir.bootlin.com/linux/v5.0-rc8/source/drivers/usb/gadget/udc/net2272.c#L948

The bug is quite obvious to me, however I'm not aware that we use this driver
in any of our instruments. I cannot test it. What do you recommend?
- Shall I fix it within the same commit.
- Shall I fix it with another commit.
- Don't care about it

> Do that and insert the "ep->stopped = stopped;" line.  Also, consider
> changing the -EINVAL return code here to -EIDRM.  This is analogous to
> what usb_unlink_urb() does on the host side.

EIDRM sounds better. However I see that all udc drivers calling the
<udc>_dequeue() function returns -EINVAL when a request cannot be dequeued.
I do not dare to break this tradition.

BTW we tested my last proposal:
https://patchwork.kernel.org/patch/10796161/

... I see that we can simplify this patch when we just insert the
stop_out_nacking(ep) between line 909 and 910:

https://elixir.bootlin.com/linux/v5.0-rc5/source/drivers/usb/gadget/udc/net2280.c#L909

and remove the stop_out_naking() call in start_queue()...

I will add this improvement to the new patch sequence.

Guido
Alan Stern - March 7, 2019, 4:11 p.m.
On Wed, 27 Feb 2019 guido@kiener-muenchen.de wrote:

> Zitat von Alan Stern <stern@rowland.harvard.edu>:
> 
> > On Tue, 19 Feb 2019 guido@kiener-muenchen.de wrote:
> >
> >> Hi Alan,
> >>
> >> My last proposal "udc: net2280: Fix overrun of OUT messages" is still
> >> under investigation.
> >>
> >> During the random stress tests I found a new rare problem:
> >>
> >> When a request must be dequeued with net2280_dequeue() e.g. due
> >> to a device clear action and the same request is finished by the
> >> function scan_dma_completions():
> >>
> >> https://elixir.bootlin.com/linux/v5.0-rc7/source/drivers/usb/gadget/udc/net2280.c#L1269
> >>
> >> then the following search loop does not find the request and the function
> >> returns the error -EINVAL without restoring the state ep->stopped = stopped!
> >>
> >> https://elixir.bootlin.com/linux/v5.0-rc7/source/drivers/usb/gadget/udc/net2280.c#L1269
> >>
> >> Thus the endpoint keeps blocked and does not receive any data.
> >> When we insert ep->stopped = stopped here:
> >>
> >> 	if (&req->req != _req) {
> >> 		ep->stopped = stopped; // <<<<<----
> >> 		spin_unlock_irqrestore(&ep->dev->lock, flags);
> >> 		dev_err(&ep->dev->pdev->dev, "%s: Request mismatch\n",
> >> 								__func__);
> >> 		return -EINVAL;
> >> 	}
> >>
> >> then the driver continues as expected, although the driver issues the
> >> false error message "Request mismatch".
> >>
> >> What do you think? It's labor-intensive to fix the error message here,
> >> or shall we leave it as it is, and only insert the "ep->stopped = stopped;"
> >
> > Yes, this is a bug.  I think the dev_err() should be changed to
> > ep_dbg().  It's not a real error, after all -- it's just a race.
> 
> I agree. ep_dbg() is better here. I see the same bug in the sibling driver
> net2272_dequeue():
> https://elixir.bootlin.com/linux/v5.0-rc8/source/drivers/usb/gadget/udc/net2272.c#L948
> 
> The bug is quite obvious to me, however I'm not aware that we use this driver
> in any of our instruments. I cannot test it. What do you recommend?
> - Shall I fix it within the same commit.
> - Shall I fix it with another commit.
> - Don't care about it

You can fix it in a separate commit, and in the commit log say that it 
was only compile-tested because you don't have the right hardware for a 
full test.

> > Do that and insert the "ep->stopped = stopped;" line.  Also, consider
> > changing the -EINVAL return code here to -EIDRM.  This is analogous to
> > what usb_unlink_urb() does on the host side.
> 
> EIDRM sounds better. However I see that all udc drivers calling the
> <udc>_dequeue() function returns -EINVAL when a request cannot be dequeued.
> I do not dare to break this tradition.

I guess it's too late to change this now.  A pity...

Alan Stern

> BTW we tested my last proposal:
> https://patchwork.kernel.org/patch/10796161/
> 
> ... I see that we can simplify this patch when we just insert the
> stop_out_nacking(ep) between line 909 and 910:
> 
> https://elixir.bootlin.com/linux/v5.0-rc5/source/drivers/usb/gadget/udc/net2280.c#L909
> 
> and remove the stop_out_naking() call in start_queue()...
> 
> I will add this improvement to the new patch sequence.
> 
> Guido

Patch

diff --git a/drivers/usb/gadget/udc/net2280.c b/drivers/usb/gadget/udc/net2280.c
index e7dae5379e04..7154f00dea40 100644
--- a/drivers/usb/gadget/udc/net2280.c
+++ b/drivers/usb/gadget/udc/net2280.c
@@ -516,8 +516,8 @@  static int net2280_disable(struct usb_ep *_ep)
 	unsigned long		flags;
 
 	ep = container_of(_ep, struct net2280_ep, ep);
-	if (!_ep || !ep->desc || _ep->name == ep0name) {
-		pr_err("%s: Invalid ep=%p or ep->desc\n", __func__, _ep);
+	if (!_ep || _ep->name == ep0name) {
+		pr_err("%s: Invalid ep=%p\n", __func__, _ep);
 		return -EINVAL;
 	}
 	spin_lock_irqsave(&ep->dev->lock, flags);