Patchwork [v7,01/10] usb: gadget: udc: Add timer support for usb requests

login
register
mail settings
Submitter Anurag Kumar Vulisha
Date Dec. 1, 2018, 11:13 a.m.
Message ID <1543662811-5194-2-git-send-email-anurag.kumar.vulisha@xilinx.com>
Download mbox | patch
Permalink /patch/669739/
State New
Headers show

Comments

Anurag Kumar Vulisha - Dec. 1, 2018, 11:13 a.m.
In some corner cases the gadget controller may get out of sync
with host and may get into hang state, thus creating a dealock.
For example when bulk streams are enabled for an endpoint, there
can be a condition where the gadget controller waits for the host
to issue prime transaction and the host controller waits for the
gadget to issue ERDY. This condition could create a deadlock.

To avoid such potential deadlocks, a timer is started after queuing
any request for the endpoint in usb_ep_queue(). The gadget driver
is expected to stop the timer if a valid event is found (ex: stream
event for stream capable endpoints). If no valid event is found, the
timer expires after the programmed timeout value and a timeout
callback function registered would be called. This callback function
dequeues the request and re-queues it again, doing so makes the
controller restart the transfer, thus avoiding deadlocks.

This kind of behaviour is observed in dwc3 controller and expected
to be generic issue with other controllers supporting bulk streams.

Signed-off-by: Anurag Kumar Vulisha <anurag.kumar.vulisha@xilinx.com>
---
 Changes in v7:
	1. Added usb_ep_dequeue() & usb_ep_queue() logic into the timeout
	   handler as suggested by "Felipe Balbi"
	2. Created a usb_ep_queue_timeout() & __usb_ep_queue() functions

 Changes in v6:
	1. This patch is newly added in this series to add timer into udc/core.c
---
 drivers/usb/gadget/udc/core.c | 119 +++++++++++++++++++++++++++++++++++++-----
 include/linux/usb/gadget.h    |  16 ++++++
 2 files changed, 121 insertions(+), 14 deletions(-)
Alan Stern - Dec. 2, 2018, 4:36 p.m.
On Sat, 1 Dec 2018, Anurag Kumar Vulisha wrote:

> In some corner cases the gadget controller may get out of sync
> with host and may get into hang state, thus creating a dealock.
> For example when bulk streams are enabled for an endpoint, there
> can be a condition where the gadget controller waits for the host
> to issue prime transaction and the host controller waits for the
> gadget to issue ERDY. This condition could create a deadlock.
> 
> To avoid such potential deadlocks, a timer is started after queuing
> any request for the endpoint in usb_ep_queue(). The gadget driver
> is expected to stop the timer if a valid event is found (ex: stream
> event for stream capable endpoints). If no valid event is found, the
> timer expires after the programmed timeout value and a timeout
> callback function registered would be called. This callback function
> dequeues the request and re-queues it again, doing so makes the
> controller restart the transfer, thus avoiding deadlocks.
> 
> This kind of behaviour is observed in dwc3 controller and expected
> to be generic issue with other controllers supporting bulk streams.

I find this whole approach rather dubious.

First of all, if some sort of deadlock causes a transfer to fail to 
complete, the host is expected to cancel and restart it.  Not the 
gadget.

Second, if a request timer expires and the request is cancelled, the 
gadget driver's completion handler will be called.  This is not what 
you want if the UDC core is going to resubmit the request 
automatically.

Third, if a request timer expires and the timer handler calls
usb_ep_dequeue() followed immediately by usb_ep_queue_timeout(), the
resubmit will probably fail because the dequeue won't have completed
yet.

Fourth, the patch contains a race between the timer expiring and the 
request completing.

Alan Stern
Anurag Kumar Vulisha - Dec. 3, 2018, 10:23 a.m.
Hi Alan,

>-----Original Message-----
>From: Alan Stern [mailto:stern@rowland.harvard.edu]
>Sent: Sunday, December 02, 2018 10:06 PM
>To: Anurag Kumar Vulisha <anuragku@xilinx.com>
>Cc: Felipe Balbi <balbi@kernel.org>; Greg Kroah-Hartman
><gregkh@linuxfoundation.org>; Shuah Khan <shuah@kernel.org>; Johan Hovold
><johan@kernel.org>; Jaejoong Kim <climbbb.kim@gmail.com>; Benjamin
>Herrenschmidt <benh@kernel.crashing.org>; Roger Quadros <rogerq@ti.com>; Manu
>Gautam <mgautam@codeaurora.org>; martin.petersen@oracle.com; Bart Van
>Assche <bvanassche@acm.org>; Mike Christie <mchristi@redhat.com>; Matthew
>Wilcox <willy@infradead.org>; Colin Ian King <colin.king@canonical.com>; linux-
>usb@vger.kernel.org; linux-kernel@vger.kernel.org; v.anuragkumar@gmail.com;
>Thinh Nguyen <thinhn@synopsys.com>; Tejas Joglekar
><tejas.joglekar@synopsys.com>; Ajay Yugalkishore Pandey <APANDEY@xilinx.com>
>Subject: Re: [PATCH v7 01/10] usb: gadget: udc: Add timer support for usb requests
>
>On Sat, 1 Dec 2018, Anurag Kumar Vulisha wrote:
>
>> In some corner cases the gadget controller may get out of sync
>> with host and may get into hang state, thus creating a dealock.
>> For example when bulk streams are enabled for an endpoint, there
>> can be a condition where the gadget controller waits for the host
>> to issue prime transaction and the host controller waits for the
>> gadget to issue ERDY. This condition could create a deadlock.
>>
>> To avoid such potential deadlocks, a timer is started after queuing
>> any request for the endpoint in usb_ep_queue(). The gadget driver
>> is expected to stop the timer if a valid event is found (ex: stream
>> event for stream capable endpoints). If no valid event is found, the
>> timer expires after the programmed timeout value and a timeout
>> callback function registered would be called. This callback function
>> dequeues the request and re-queues it again, doing so makes the
>> controller restart the transfer, thus avoiding deadlocks.
>>
>> This kind of behaviour is observed in dwc3 controller and expected
>> to be generic issue with other controllers supporting bulk streams.
>
>I find this whole approach rather dubious.
>
>First of all, if some sort of deadlock causes a transfer to fail to
>complete, the host is expected to cancel and restart it.  Not the
>gadget.
>

Thanks for spending your time in reviewing this patch.  The deadlock
is  a very rare case scenario and is happening because both the gadget
controller & host controllers get out of sync and are stuck waiting for the
relevant event. For example this issue is observed in stream protocol where
the gadget controller is waiting on Host controller to issue PRIME transaction
and  Host controller is waiting on gadget to issue ERDY transaction. Since
the stream protocol is gadget driven, the host may not proceed further until it
receives a valid Start Stream (ERDY) transaction from gadget. Since the gadget
controller driver is aware that the controller is stuck , makes it responsible
to recover the controller from hang condition by restarting the transfer (which
triggers the controller FSM to issue ERDY to host).

>Second, if a request timer expires and the request is cancelled, the
>gadget driver's completion handler will be called.  This is not what
>you want if the UDC core is going to resubmit the request
>automatically.
>
>Third, if a request timer expires and the timer handler calls
>usb_ep_dequeue() followed immediately by usb_ep_queue_timeout(), the
>resubmit will probably fail because the dequeue won't have completed
>yet.
>
>Fourth, the patch contains a race between the timer expiring and the
>request completing.

Thanks for correcting, I agree with you on all the above 3 cases that the
resubmission of the request should only be done from the class driver and 
the udc core should simply dequeue the request on timeout. I am not sure
why I haven't seen any issue while testing on this patch series. I will modify
the code to handle the resubmitting of requests properly.

Best Regards,
Anurag Kumar Vulisha
Alan Stern - Dec. 3, 2018, 2:51 p.m.
On Mon, 3 Dec 2018, Anurag Kumar Vulisha wrote:

> >First of all, if some sort of deadlock causes a transfer to fail to
> >complete, the host is expected to cancel and restart it.  Not the
> >gadget.
> >
> 
> Thanks for spending your time in reviewing this patch.  The deadlock
> is  a very rare case scenario and is happening because both the gadget
> controller & host controllers get out of sync and are stuck waiting for the
> relevant event. For example this issue is observed in stream protocol where
> the gadget controller is waiting on Host controller to issue PRIME transaction
> and  Host controller is waiting on gadget to issue ERDY transaction. Since
> the stream protocol is gadget driven, the host may not proceed further until it
> receives a valid Start Stream (ERDY) transaction from gadget.

That's not entirely true.  Can't the host cancel the transfer and then 
restart it?

> Since the gadget
> controller driver is aware that the controller is stuck , makes it responsible
> to recover the controller from hang condition by restarting the transfer (which
> triggers the controller FSM to issue ERDY to host).

Isn't there a cleaner way to recover than by cancelling the request and 
resubmitting it?

> >Second, if a request timer expires and the request is cancelled, the
> >gadget driver's completion handler will be called.  This is not what
> >you want if the UDC core is going to resubmit the request
> >automatically.
> >
> >Third, if a request timer expires and the timer handler calls
> >usb_ep_dequeue() followed immediately by usb_ep_queue_timeout(), the
> >resubmit will probably fail because the dequeue won't have completed
> >yet.
> >
> >Fourth, the patch contains a race between the timer expiring and the
> >request completing.
> 
> Thanks for correcting, I agree with you on all the above 3 cases that the
> resubmission of the request should only be done from the class driver and 
> the udc core should simply dequeue the request on timeout. I am not sure
> why I haven't seen any issue while testing on this patch series. I will modify
> the code to handle the resubmitting of requests properly.

How can the gadget driver know what timeout to use?  The host is
allowed to be as slow as it wants; the gadget driver doesn't have any
way to tell when the host wants to start the transfer.

Alan Stern
Anurag Kumar Vulisha - Dec. 3, 2018, 4:05 p.m.
Hi Alan,

>-----Original Message-----
>From: Alan Stern [mailto:stern@rowland.harvard.edu]
>Sent: Monday, December 03, 2018 8:21 PM
>To: Anurag Kumar Vulisha <anuragku@xilinx.com>
>Cc: Felipe Balbi <balbi@kernel.org>; Greg Kroah-Hartman
><gregkh@linuxfoundation.org>; Shuah Khan <shuah@kernel.org>; Johan Hovold
><johan@kernel.org>; Jaejoong Kim <climbbb.kim@gmail.com>; Benjamin
>Herrenschmidt <benh@kernel.crashing.org>; Roger Quadros <rogerq@ti.com>; Manu
>Gautam <mgautam@codeaurora.org>; martin.petersen@oracle.com; Bart Van
>Assche <bvanassche@acm.org>; Mike Christie <mchristi@redhat.com>; Matthew
>Wilcox <willy@infradead.org>; Colin Ian King <colin.king@canonical.com>; linux-
>usb@vger.kernel.org; linux-kernel@vger.kernel.org; v.anuragkumar@gmail.com;
>Thinh Nguyen <thinhn@synopsys.com>; Tejas Joglekar
><tejas.joglekar@synopsys.com>; Ajay Yugalkishore Pandey <APANDEY@xilinx.com>
>Subject: RE: [PATCH v7 01/10] usb: gadget: udc: Add timer support for usb requests
>
>On Mon, 3 Dec 2018, Anurag Kumar Vulisha wrote:
>
>> >First of all, if some sort of deadlock causes a transfer to fail to
>> >complete, the host is expected to cancel and restart it.  Not the
>> >gadget.
>> >
>>
>> Thanks for spending your time in reviewing this patch.  The deadlock
>> is  a very rare case scenario and is happening because both the gadget
>> controller & host controllers get out of sync and are stuck waiting for the
>> relevant event. For example this issue is observed in stream protocol where
>> the gadget controller is waiting on Host controller to issue PRIME transaction
>> and  Host controller is waiting on gadget to issue ERDY transaction. Since
>> the stream protocol is gadget driven, the host may not proceed further until it
>> receives a valid Start Stream (ERDY) transaction from gadget.
>
>That's not entirely true.  Can't the host cancel the transfer and then
>restart it?
>

Yes the host can cancel the transfer. This issue originated from the endpoints using bulk
streaming protocol and may not occur with normal endpoints. AFAIK bulk streaming is
gadget driven, where the gadget is allowed to select which stream id transfer the host
should work on . Since the host doesn't have control on when the transfer would be
selected by gadget, it may wait for longer timeouts before cancelling the transfer. 

>> Since the gadget
>> controller driver is aware that the controller is stuck , makes it responsible
>> to recover the controller from hang condition by restarting the transfer (which
>> triggers the controller FSM to issue ERDY to host).
>
>Isn't there a cleaner way to recover than by cancelling the request and
>resubmitting it?
>

dequeuing the request issues the stop transfer command to the controller, which
cleans all the hardware resource allocated for that endpoint. This also resets the
hardware FSMs for that endpoint . So, when re-queuing of the transfer happens
the controller starts allocating hardware resources again, thus avoiding the probability
of entering into the issue. I am not sure of other controllers, but for dwc3, issuing
the stop transfer is the only way to handle this issue. 

@Felipe:  Can you please provide your suggestion on this.  
  
>> >Second, if a request timer expires and the request is cancelled, the
>> >gadget driver's completion handler will be called.  This is not what
>> >you want if the UDC core is going to resubmit the request
>> >automatically.
>> >
>> >Third, if a request timer expires and the timer handler calls
>> >usb_ep_dequeue() followed immediately by usb_ep_queue_timeout(), the
>> >resubmit will probably fail because the dequeue won't have completed
>> >yet.
>> >
>> >Fourth, the patch contains a race between the timer expiring and the
>> >request completing.
>>
>> Thanks for correcting, I agree with you on all the above 3 cases that the
>> resubmission of the request should only be done from the class driver and
>> the udc core should simply dequeue the request on timeout. I am not sure
>> why I haven't seen any issue while testing on this patch series. I will modify
>> the code to handle the resubmitting of requests properly.
>
>How can the gadget driver know what timeout to use?  The host is
>allowed to be as slow as it wants; the gadget driver doesn't have any
>way to tell when the host wants to start the transfer.

Yes , I agree with you that the timeout may vary from usage to usage. This timeout
should be decided by the class driver which queues the request. As discussed above
this issue was observed in streaming protocol , which  is very much faster than normal
BOT modes and it works on super speed .  More over the gadget controller decides
the selection of the stream id on which the host should work , so taking all these into
consideration I kept 50ms timeout for stream transfers, so that the performance may
not get decreased.

Thanks,
Anurag Kumar Vulisha
Alan Stern - Dec. 3, 2018, 11:08 p.m.
On Mon, 3 Dec 2018, Anurag Kumar Vulisha wrote:

> >On Mon, 3 Dec 2018, Anurag Kumar Vulisha wrote:
> >
> >> >First of all, if some sort of deadlock causes a transfer to fail to
> >> >complete, the host is expected to cancel and restart it.  Not the
> >> >gadget.
> >> >
> >>
> >> Thanks for spending your time in reviewing this patch.  The deadlock
> >> is  a very rare case scenario and is happening because both the gadget
> >> controller & host controllers get out of sync and are stuck waiting for the
> >> relevant event. For example this issue is observed in stream protocol where
> >> the gadget controller is waiting on Host controller to issue PRIME transaction
> >> and  Host controller is waiting on gadget to issue ERDY transaction. Since
> >> the stream protocol is gadget driven, the host may not proceed further until it
> >> receives a valid Start Stream (ERDY) transaction from gadget.
> >
> >That's not entirely true.  Can't the host cancel the transfer and then
> >restart it?
> >
> 
> Yes the host can cancel the transfer. This issue originated from the endpoints using bulk
> streaming protocol and may not occur with normal endpoints. AFAIK bulk streaming is
> gadget driven, where the gadget is allowed to select which stream id transfer the host
> should work on . Since the host doesn't have control on when the transfer would be
> selected by gadget, it may wait for longer timeouts before cancelling the transfer. 

You're missing the point.  Although the device selects which stream ID
gets transferred, the _host_ decides whether a stream transfer should
occur in the first place.  No matter how many ERDY packets the device
controller tries to send, no transfer will occur until the host wants
to do it.

In this sense, stream transfers (like all other USB interactions except
wakeup requests) are host-driven.

> >> Since the gadget
> >> controller driver is aware that the controller is stuck , makes it responsible
> >> to recover the controller from hang condition by restarting the transfer (which
> >> triggers the controller FSM to issue ERDY to host).
> >
> >Isn't there a cleaner way to recover than by cancelling the request and
> >resubmitting it?
> >
> 
> dequeuing the request issues the stop transfer command to the controller, which
> cleans all the hardware resource allocated for that endpoint. This also resets the
> hardware FSMs for that endpoint . So, when re-queuing of the transfer happens
> the controller starts allocating hardware resources again, thus avoiding the probability
> of entering into the issue. I am not sure of other controllers, but for dwc3, issuing
> the stop transfer is the only way to handle this issue. 

Again you're missing the point.  Can't the controller driver issue the
Stop Transfer command but still consider the request to be in progress
(i.e., don't dequeue the request) so that the gadget driver's
completion callback isn't invoked and the request does not need to be
explicitly resubmitted?

> @Felipe:  Can you please provide your suggestion on this.  

> >How can the gadget driver know what timeout to use?  The host is
> >allowed to be as slow as it wants; the gadget driver doesn't have any
> >way to tell when the host wants to start the transfer.
> 
> Yes , I agree with you that the timeout may vary from usage to usage. This timeout
> should be decided by the class driver which queues the request. As discussed above
> this issue was observed in streaming protocol , which  is very much faster than normal
> BOT modes and it works on super speed .

Although USB mass storage is currently the only user of the stream 
protocol, that may not be true in the future.  You should think in more 
general terms.  A timeout which is appropriate for mass storage may not 
be appropriate for some other application.

Alan Stern

>  More over the gadget controller decides
> the selection of the stream id on which the host should work , so taking all these into
> consideration I kept 50ms timeout for stream transfers, so that the performance may
> not get decreased.
> 
> Thanks,
> Anurag Kumar Vulisha
Anurag Kumar Vulisha - Dec. 4, 2018, 4:18 p.m.
Hi Alan,

>-----Original Message-----
>From: Alan Stern [mailto:stern@rowland.harvard.edu]
>Sent: Tuesday, December 04, 2018 4:38 AM
>To: Anurag Kumar Vulisha <anuragku@xilinx.com>
>Cc: Felipe Balbi <balbi@kernel.org>; Greg Kroah-Hartman
><gregkh@linuxfoundation.org>; Shuah Khan <shuah@kernel.org>; Johan Hovold
><johan@kernel.org>; Jaejoong Kim <climbbb.kim@gmail.com>; Benjamin
>Herrenschmidt <benh@kernel.crashing.org>; Roger Quadros <rogerq@ti.com>; Manu
>Gautam <mgautam@codeaurora.org>; martin.petersen@oracle.com; Bart Van
>Assche <bvanassche@acm.org>; Mike Christie <mchristi@redhat.com>; Matthew
>Wilcox <willy@infradead.org>; Colin Ian King <colin.king@canonical.com>; linux-
>usb@vger.kernel.org; linux-kernel@vger.kernel.org; v.anuragkumar@gmail.com;
>Thinh Nguyen <thinhn@synopsys.com>; Tejas Joglekar
><tejas.joglekar@synopsys.com>; Ajay Yugalkishore Pandey <APANDEY@xilinx.com>
>Subject: RE: [PATCH v7 01/10] usb: gadget: udc: Add timer support for usb requests
>
>On Mon, 3 Dec 2018, Anurag Kumar Vulisha wrote:
>
>> >On Mon, 3 Dec 2018, Anurag Kumar Vulisha wrote:
>> >
>> >> >First of all, if some sort of deadlock causes a transfer to fail to
>> >> >complete, the host is expected to cancel and restart it.  Not the
>> >> >gadget.
>> >> >
>> >>
>> >> Thanks for spending your time in reviewing this patch.  The deadlock
>> >> is  a very rare case scenario and is happening because both the gadget
>> >> controller & host controllers get out of sync and are stuck waiting for the
>> >> relevant event. For example this issue is observed in stream protocol where
>> >> the gadget controller is waiting on Host controller to issue PRIME transaction
>> >> and  Host controller is waiting on gadget to issue ERDY transaction. Since
>> >> the stream protocol is gadget driven, the host may not proceed further until it
>> >> receives a valid Start Stream (ERDY) transaction from gadget.
>> >
>> >That's not entirely true.  Can't the host cancel the transfer and then
>> >restart it?
>> >
>>
>> Yes the host can cancel the transfer. This issue originated from the endpoints using
>bulk
>> streaming protocol and may not occur with normal endpoints. AFAIK bulk streaming
>is
>> gadget driven, where the gadget is allowed to select which stream id transfer the
>host
>> should work on . Since the host doesn't have control on when the transfer would be
>> selected by gadget, it may wait for longer timeouts before cancelling the transfer.
>
>You're missing the point.  Although the device selects which stream ID
>gets transferred, the _host_ decides whether a stream transfer should
>occur in the first place.  No matter how many ERDY packets the device
>controller tries to send, no transfer will occur until the host wants
>to do it.
>
>In this sense, stream transfers (like all other USB interactions except
>wakeup requests) are host-driven.
>

I agree with you that without host willing to transfer, the transfer wouldn't have
started and also agree the host always has the capability of detecting the hang. But
we are not sure on what host platform and operating system the gadget would be
tested and also not sure whether the host software is capable of handling the deadlock.
Even if the host has a timer , it would be very long and waiting for the timer to get
timedout  would reduce the performance. In this patch we are giving the user the
flexibility to opt the timeout value, so that the deadlock can be avoided without
effecting the performance. So I think adding the timer in gadget is more easier solution
than handling in host. I have seen this workaround working for both linux & windows.

>> >> Since the gadget
>> >> controller driver is aware that the controller is stuck , makes it responsible
>> >> to recover the controller from hang condition by restarting the transfer (which
>> >> triggers the controller FSM to issue ERDY to host).
>> >
>> >Isn't there a cleaner way to recover than by cancelling the request and
>> >resubmitting it?
>> >
>>
>> dequeuing the request issues the stop transfer command to the controller, which
>> cleans all the hardware resource allocated for that endpoint. This also resets the
>> hardware FSMs for that endpoint . So, when re-queuing of the transfer happens
>> the controller starts allocating hardware resources again, thus avoiding the
>probability
>> of entering into the issue. I am not sure of other controllers, but for dwc3, issuing
>> the stop transfer is the only way to handle this issue.
>
>Again you're missing the point.  Can't the controller driver issue the
>Stop Transfer command but still consider the request to be in progress
>(i.e., don't dequeue the request) so that the gadget driver's
>completion callback isn't invoked and the request does not need to be
>explicitly resubmitted?
>

Yes , what you are saying is correct. My initial patches were following the
same approach that you have suggested. We switched to the dequeue
approach because there can be different gadgets which may enter into
this issue and it would be better to follow a generic approach rather than
having every controller driver implementing their own workaround.
Please find the conservation in this link  https://patchwork.kernel.org/patch/10640145/
   
>> @Felipe:  Can you please provide your suggestion on this.
>
>> >How can the gadget driver know what timeout to use?  The host is
>> >allowed to be as slow as it wants; the gadget driver doesn't have any
>> >way to tell when the host wants to start the transfer.
>>
>> Yes , I agree with you that the timeout may vary from usage to usage. This timeout
>> should be decided by the class driver which queues the request. As discussed above
>> this issue was observed in streaming protocol , which  is very much faster than
>normal
>> BOT modes and it works on super speed .
>
>Although USB mass storage is currently the only user of the stream
>protocol, that may not be true in the future.  You should think in more
>general terms.  A timeout which is appropriate for mass storage may not
>be appropriate for some other application.
>

You are correct , this timeout may not be ideal. Since  I was not able to reproduce this issue
on non-stream capable transfers , at  present the only user of usb_ep_queue_timeout() API
is the streaming gadget.  As we are not aware of the optimal timeout value,  instead of
hardcoding the timeout value we can add module_param() for it.  So that the user can configure
timeout based on their use case. Please let me know your suggestion on this.

Thanks,
Anurag Kumar Vulisha 

>>  More over the gadget controller decides
>> the selection of the stream id on which the host should work , so taking all these
>into
>> consideration I kept 50ms timeout for stream transfers, so that the performance
>may
>> not get decreased.
>>
>> Thanks,
>> Anurag Kumar Vulisha
Alan Stern - Dec. 4, 2018, 4:46 p.m.
On Tue, 4 Dec 2018, Anurag Kumar Vulisha wrote:

> >> Yes the host can cancel the transfer. This issue originated from the endpoints using
> >bulk
> >> streaming protocol and may not occur with normal endpoints. AFAIK bulk streaming
> >is
> >> gadget driven, where the gadget is allowed to select which stream id transfer the
> >host
> >> should work on . Since the host doesn't have control on when the transfer would be
> >> selected by gadget, it may wait for longer timeouts before cancelling the transfer.
> >
> >You're missing the point.  Although the device selects which stream ID
> >gets transferred, the _host_ decides whether a stream transfer should
> >occur in the first place.  No matter how many ERDY packets the device
> >controller tries to send, no transfer will occur until the host wants
> >to do it.
> >
> >In this sense, stream transfers (like all other USB interactions except
> >wakeup requests) are host-driven.
> >
> 
> I agree with you that without host willing to transfer, the transfer wouldn't have
> started and also agree the host always has the capability of detecting the hang. But
> we are not sure on what host platform and operating system the gadget would be
> tested and also not sure whether the host software is capable of handling the deadlock.
> Even if the host has a timer , it would be very long and waiting for the timer to get
> timedout  would reduce the performance. In this patch we are giving the user the
> flexibility to opt the timeout value, so that the deadlock can be avoided without
> effecting the performance. So I think adding the timer in gadget is more easier solution
> than handling in host. I have seen this workaround working for both linux & windows.

Is there any way for the device controller driver to detect the problem 
without relying on a timeout?

In fact, is this problem a known bug in the existing device controller
hardware?  Have the controller manufacturers published a suggested
scheme for working around it?

> >> >> Since the gadget
> >> >> controller driver is aware that the controller is stuck , makes it responsible
> >> >> to recover the controller from hang condition by restarting the transfer (which
> >> >> triggers the controller FSM to issue ERDY to host).
> >> >
> >> >Isn't there a cleaner way to recover than by cancelling the request and
> >> >resubmitting it?
> >> >
> >>
> >> dequeuing the request issues the stop transfer command to the controller, which
> >> cleans all the hardware resource allocated for that endpoint. This also resets the
> >> hardware FSMs for that endpoint . So, when re-queuing of the transfer happens
> >> the controller starts allocating hardware resources again, thus avoiding the
> >probability
> >> of entering into the issue. I am not sure of other controllers, but for dwc3, issuing
> >> the stop transfer is the only way to handle this issue.
> >
> >Again you're missing the point.  Can't the controller driver issue the
> >Stop Transfer command but still consider the request to be in progress
> >(i.e., don't dequeue the request) so that the gadget driver's
> >completion callback isn't invoked and the request does not need to be
> >explicitly resubmitted?
> >
> 
> Yes , what you are saying is correct. My initial patches were following the
> same approach that you have suggested. We switched to the dequeue
> approach because there can be different gadgets which may enter into
> this issue and it would be better to follow a generic approach rather than
> having every controller driver implementing their own workaround.
> Please find the conservation in this link  https://patchwork.kernel.org/patch/10640145/

At this point, it seems that the generic approach will be messier than 
having every controller driver implement its own fix.  At least, that's 
how it appears to me.

> >> @Felipe:  Can you please provide your suggestion on this.
> >
> >> >How can the gadget driver know what timeout to use?  The host is
> >> >allowed to be as slow as it wants; the gadget driver doesn't have any
> >> >way to tell when the host wants to start the transfer.
> >>
> >> Yes , I agree with you that the timeout may vary from usage to usage. This timeout
> >> should be decided by the class driver which queues the request. As discussed above
> >> this issue was observed in streaming protocol , which  is very much faster than
> >normal
> >> BOT modes and it works on super speed .
> >
> >Although USB mass storage is currently the only user of the stream
> >protocol, that may not be true in the future.  You should think in more
> >general terms.  A timeout which is appropriate for mass storage may not
> >be appropriate for some other application.
> >
> 
> You are correct , this timeout may not be ideal. Since  I was not able to reproduce this issue
> on non-stream capable transfers , at  present the only user of usb_ep_queue_timeout() API
> is the streaming gadget.  As we are not aware of the optimal timeout value,  instead of
> hardcoding the timeout value we can add module_param() for it.  So that the user can configure
> timeout based on their use case. Please let me know your suggestion on this.

Ideally it would not be necessary to rely on a timeout at all.

Also, maintainers dislike module parameters.  It would be better not to 
add one.

Alan Stern
Anurag Kumar Vulisha - Dec. 4, 2018, 7:07 p.m.
Hi Alan,

>-----Original Message-----
>From: Alan Stern [mailto:stern@rowland.harvard.edu]
>Sent: Tuesday, December 04, 2018 10:17 PM
>To: Anurag Kumar Vulisha <anuragku@xilinx.com>
>Cc: Felipe Balbi <balbi@kernel.org>; Greg Kroah-Hartman
><gregkh@linuxfoundation.org>; Shuah Khan <shuah@kernel.org>; Johan Hovold
><johan@kernel.org>; Jaejoong Kim <climbbb.kim@gmail.com>; Benjamin
>Herrenschmidt <benh@kernel.crashing.org>; Roger Quadros <rogerq@ti.com>; Manu
>Gautam <mgautam@codeaurora.org>; martin.petersen@oracle.com; Bart Van
>Assche <bvanassche@acm.org>; Mike Christie <mchristi@redhat.com>; Matthew
>Wilcox <willy@infradead.org>; Colin Ian King <colin.king@canonical.com>; linux-
>usb@vger.kernel.org; linux-kernel@vger.kernel.org; v.anuragkumar@gmail.com;
>Thinh Nguyen <thinhn@synopsys.com>; Tejas Joglekar
><tejas.joglekar@synopsys.com>; Ajay Yugalkishore Pandey <APANDEY@xilinx.com>
>Subject: RE: [PATCH v7 01/10] usb: gadget: udc: Add timer support for usb requests
>
>On Tue, 4 Dec 2018, Anurag Kumar Vulisha wrote:
>
>> >> Yes the host can cancel the transfer. This issue originated from
>> >> the endpoints using
>> >bulk
>> >> streaming protocol and may not occur with normal endpoints. AFAIK
>> >> bulk streaming
>> >is
>> >> gadget driven, where the gadget is allowed to select which stream
>> >> id transfer the
>> >host
>> >> should work on . Since the host doesn't have control on when the
>> >> transfer would be selected by gadget, it may wait for longer timeouts before
>cancelling the transfer.
>> >
>> >You're missing the point.  Although the device selects which stream
>> >ID gets transferred, the _host_ decides whether a stream transfer
>> >should occur in the first place.  No matter how many ERDY packets the
>> >device controller tries to send, no transfer will occur until the
>> >host wants to do it.
>> >
>> >In this sense, stream transfers (like all other USB interactions
>> >except wakeup requests) are host-driven.
>> >
>>
>> I agree with you that without host willing to transfer, the transfer
>> wouldn't have started and also agree the host always has the
>> capability of detecting the hang. But we are not sure on what host
>> platform and operating system the gadget would be tested and also not sure
>whether the host software is capable of handling the deadlock.
>> Even if the host has a timer , it would be very long and waiting for
>> the timer to get timedout  would reduce the performance. In this patch
>> we are giving the user the flexibility to opt the timeout value, so
>> that the deadlock can be avoided without effecting the performance. So
>> I think adding the timer in gadget is more easier solution than handling in host. I
>have seen this workaround working for both linux & windows.
>
>Is there any way for the device controller driver to detect the problem without relying
>on a timeout?
>
>In fact, is this problem a known bug in the existing device controller hardware?  Have
>the controller manufacturers published a suggested scheme for working around it?
>

Yes, this problem is mentioned in dwc3 controller data book and the workaround
mentioned  is the same that we are doing , to implement timeout and if no valid
stream event is found , after timeout issue end transfer followed by start transfer.

>> >> >> Since the gadget
>> >> >> controller driver is aware that the controller is stuck , makes
>> >> >> it responsible to recover the controller from hang condition by
>> >> >> restarting the transfer (which triggers the controller FSM to issue ERDY to
>host).
>> >> >
>> >> >Isn't there a cleaner way to recover than by cancelling the
>> >> >request and resubmitting it?
>> >> >
>> >>
>> >> dequeuing the request issues the stop transfer command to the
>> >> controller, which cleans all the hardware resource allocated for
>> >> that endpoint. This also resets the hardware FSMs for that endpoint
>> >> . So, when re-queuing of the transfer happens the controller starts
>> >> allocating hardware resources again, thus avoiding the
>> >probability
>> >> of entering into the issue. I am not sure of other controllers, but
>> >> for dwc3, issuing the stop transfer is the only way to handle this issue.
>> >
>> >Again you're missing the point.  Can't the controller driver issue
>> >the Stop Transfer command but still consider the request to be in
>> >progress (i.e., don't dequeue the request) so that the gadget
>> >driver's completion callback isn't invoked and the request does not
>> >need to be explicitly resubmitted?
>> >
>>
>> Yes , what you are saying is correct. My initial patches were
>> following the same approach that you have suggested. We switched to
>> the dequeue approach because there can be different gadgets which may
>> enter into this issue and it would be better to follow a generic
>> approach rather than having every controller driver implementing their own
>workaround.
>> Please find the conservation in this link
>> https://patchwork.kernel.org/patch/10640145/
>
>At this point, it seems that the generic approach will be messier than having every
>controller driver implement its own fix.  At least, that's how it appears to me.

With the dequeue approach there is an advantage(not related to this issue) that the
class driver can have control of the timed out transfer whether to requeue it or consider
it as an error (based on implementation). This approach gives more flexibility to the class
drivers. This may be out of context of this issue but wanted to point out that we may lose
this advantage on switching to older implementation.
  
>
>> >> @Felipe:  Can you please provide your suggestion on this.
>> >
>> >> >How can the gadget driver know what timeout to use?  The host is
>> >> >allowed to be as slow as it wants; the gadget driver doesn't have
>> >> >any way to tell when the host wants to start the transfer.
>> >>
>> >> Yes , I agree with you that the timeout may vary from usage to
>> >> usage. This timeout should be decided by the class driver which
>> >> queues the request. As discussed above this issue was observed in
>> >> streaming protocol , which  is very much faster than
>> >normal
>> >> BOT modes and it works on super speed .
>> >
>> >Although USB mass storage is currently the only user of the stream
>> >protocol, that may not be true in the future.  You should think in
>> >more general terms.  A timeout which is appropriate for mass storage
>> >may not be appropriate for some other application.
>> >
>>
>> You are correct , this timeout may not be ideal. Since  I was not able
>> to reproduce this issue on non-stream capable transfers , at  present
>> the only user of usb_ep_queue_timeout() API is the streaming gadget.
>> As we are not aware of the optimal timeout value,  instead of
>> hardcoding the timeout value we can add module_param() for it.  So that the user
>can configure timeout based on their use case. Please let me know your suggestion
>on this.
>
>Ideally it would not be necessary to rely on a timeout at all.
>
>Also, maintainers dislike module parameters.  It would be better not to add one.

Okay. I would be happy if any alternative for this issue is present but unfortunately
I am not able to figure out any alternative other than timers. If not module_params()
we can add an configfs entry in stream gadget to update the timeout. Please provide
your opinion on this approach. 

Thanks,
Anurag Kumar Vulisha
Alan Stern - Dec. 4, 2018, 7:28 p.m.
On Tue, 4 Dec 2018, Anurag Kumar Vulisha wrote:

> >Is there any way for the device controller driver to detect the problem without relying
> >on a timeout?
> >
> >In fact, is this problem a known bug in the existing device controller hardware?  Have
> >the controller manufacturers published a suggested scheme for working around it?
> >
> 
> Yes, this problem is mentioned in dwc3 controller data book and the workaround
> mentioned  is the same that we are doing , to implement timeout and if no valid
> stream event is found , after timeout issue end transfer followed by start transfer.

Okay.  But this implies that the problem is confined to DWC3 and
doesn't affect other types of controllers, which would mean modifying 
the UDC core would be inappropriate.

Does the data book suggest a value for the timeout?

> >At this point, it seems that the generic approach will be messier than having every
> >controller driver implement its own fix.  At least, that's how it appears to me.

(Especially if dwc3 is the only driver affected.)

> With the dequeue approach there is an advantage(not related to this issue) that the
> class driver can have control of the timed out transfer whether to requeue it or consider
> it as an error (based on implementation). This approach gives more flexibility to the class
> drivers. This may be out of context of this issue but wanted to point out that we may lose
> this advantage on switching to older implementation.

Class drivers can cancel and requeue requests at any time.  There's no 
extra flexibility.

> >Ideally it would not be necessary to rely on a timeout at all.
> >
> >Also, maintainers dislike module parameters.  It would be better not to add one.
> 
> Okay. I would be happy if any alternative for this issue is present but unfortunately
> I am not able to figure out any alternative other than timers. If not module_params()
> we can add an configfs entry in stream gadget to update the timeout. Please provide
> your opinion on this approach. 

Since the purpose of the timeout is to detect a deadlock caused by a
hardware bug, I suggest a fixed and relatively short timeout value such
as one second.  Cancelling and requeuing a few requests at 1-second
intervals shouldn't add very much overhead.

Alan Stern
Anurag Kumar Vulisha - Dec. 5, 2018, 3:43 p.m.
Hi Alan,

>-----Original Message-----
>From: Alan Stern [mailto:stern@rowland.harvard.edu]
>Sent: Wednesday, December 05, 2018 12:59 AM
>To: Anurag Kumar Vulisha <anuragku@xilinx.com>
>Cc: Felipe Balbi <balbi@kernel.org>; Greg Kroah-Hartman
><gregkh@linuxfoundation.org>; Shuah Khan <shuah@kernel.org>; Johan Hovold
><johan@kernel.org>; Jaejoong Kim <climbbb.kim@gmail.com>; Benjamin
>Herrenschmidt <benh@kernel.crashing.org>; Roger Quadros <rogerq@ti.com>; Manu
>Gautam <mgautam@codeaurora.org>; martin.petersen@oracle.com; Bart Van
>Assche <bvanassche@acm.org>; Mike Christie <mchristi@redhat.com>; Matthew
>Wilcox <willy@infradead.org>; Colin Ian King <colin.king@canonical.com>; linux-
>usb@vger.kernel.org; linux-kernel@vger.kernel.org; v.anuragkumar@gmail.com;
>Thinh Nguyen <thinhn@synopsys.com>; Tejas Joglekar
><tejas.joglekar@synopsys.com>; Ajay Yugalkishore Pandey <APANDEY@xilinx.com>
>Subject: RE: [PATCH v7 01/10] usb: gadget: udc: Add timer support for usb requests
>
>On Tue, 4 Dec 2018, Anurag Kumar Vulisha wrote:
>
>> >Is there any way for the device controller driver to detect the problem without
>relying
>> >on a timeout?
>> >
>> >In fact, is this problem a known bug in the existing device controller hardware?
>Have
>> >the controller manufacturers published a suggested scheme for working around it?
>> >
>>
>> Yes, this problem is mentioned in dwc3 controller data book and the workaround
>> mentioned  is the same that we are doing , to implement timeout and if no valid
>> stream event is found , after timeout issue end transfer followed by start transfer.
>
>Okay.  But this implies that the problem is confined to DWC3 and
>doesn't affect other types of controllers, which would mean modifying
>the UDC core would be inappropriate.
>

Both host & gadget are equally responsible for this issue and it may potentially occur
with other controllers also(which are incapable of handling this situation) . Because of
this reason I would not call this issue as a dwc3 alone bug. One thing is that, with dwc3 the
issue is easily reproduced. Because of these reasons I feel that it would be better to have
a generic udc timers rather than having driver specific workaround. We had this same
discussion earlier in this mailing list https://lkml.org/lkml/2018/10/9/638

>Does the data book suggest a value for the timeout?
>

No, the databook doesn't mention about the timeout value

>> >At this point, it seems that the generic approach will be messier than having every
>> >controller driver implement its own fix.  At least, that's how it appears to me.
>
>(Especially if dwc3 is the only driver affected.)
>

As discussed above, the issue may happen with other gadgets too. As I got divide opinions
on this implementation and both the implementations looks fine to me, I am little confused
on which should be implemented.

@Felipe: Do you agree with Alan's implementation? Please let us know your suggestion
on this.

>> With the dequeue approach there is an advantage(not related to this issue) that the
>> class driver can have control of the timed out transfer whether to requeue it or
>consider
>> it as an error (based on implementation). This approach gives more flexibility to the
>class
>> drivers. This may be out of context of this issue but wanted to point out that we
>may lose
>> this advantage on switching to older implementation.
>
>Class drivers can cancel and requeue requests at any time.  There's no
>extra flexibility.
>

I agree with you, but the class drivers have to implement their own logic instead of
having a generic logic to implement the timeouts. 

>> >Ideally it would not be necessary to rely on a timeout at all.
>> >
>> >Also, maintainers dislike module parameters.  It would be better not to add one.
>>
>> Okay. I would be happy if any alternative for this issue is present but unfortunately
>> I am not able to figure out any alternative other than timers. If not
>module_params()
>> we can add an configfs entry in stream gadget to update the timeout. Please
>provide
>> your opinion on this approach.
>
>Since the purpose of the timeout is to detect a deadlock caused by a
>hardware bug, I suggest a fixed and relatively short timeout value such
>as one second.  Cancelling and requeuing a few requests at 1-second
>intervals shouldn't add very much overhead.
>

Thanks for suggesting. 1 sec timeout should be fair enough. I will change this
in next series 

Best Regards,
Anurag Kumar Vulisha
Felipe Balbi - Dec. 7, 2018, 6:05 a.m.
hi,

Anurag Kumar Vulisha <anuragku@xilinx.com> writes:
>>Does the data book suggest a value for the timeout?
>>
>
> No, the databook doesn't mention about the timeout value
>
>>> >At this point, it seems that the generic approach will be messier than having every
>>> >controller driver implement its own fix.  At least, that's how it appears to me.

Why, if the UDC implementation will, anyway, be a timer?

>>(Especially if dwc3 is the only driver affected.)
>
> As discussed above, the issue may happen with other gadgets too. As I got divide opinions
> on this implementation and both the implementations looks fine to me, I am little confused
> on which should be implemented.
>
> @Felipe: Do you agree with Alan's implementation? Please let us know your suggestion
> on this.

I still think a generic timer is a better solution since it has other uses.

>>> >Ideally it would not be necessary to rely on a timeout at all.
>>> >
>>> >Also, maintainers dislike module parameters.  It would be better not to add one.
>>>
>>> Okay. I would be happy if any alternative for this issue is present but unfortunately
>>> I am not able to figure out any alternative other than timers. If not
>>module_params()
>>> we can add an configfs entry in stream gadget to update the timeout. Please
>>provide
>>> your opinion on this approach.
>>
>>Since the purpose of the timeout is to detect a deadlock caused by a
>>hardware bug, I suggest a fixed and relatively short timeout value such
>>as one second.  Cancelling and requeuing a few requests at 1-second
>>intervals shouldn't add very much overhead.

I wouldn't call this a HW bug though. This is just how the UDC
behaves. There are N streams and host can move data in any stream at any
time. This means that host & gadget _can_ disagree on what stream to
start next.

One way to avoid this would be to never pre-start any streams and always
rely on XferNotReady, but that would mean greatly reduced throughput for
streams.
Alan Stern - Dec. 7, 2018, 5:09 p.m.
On Fri, 7 Dec 2018, Felipe Balbi wrote:

> 
> hi,
> 
> Anurag Kumar Vulisha <anuragku@xilinx.com> writes:
> >>Does the data book suggest a value for the timeout?
> >>
> >
> > No, the databook doesn't mention about the timeout value
> >
> >>> >At this point, it seems that the generic approach will be messier than having every
> >>> >controller driver implement its own fix.  At least, that's how it appears to me.
> 
> Why, if the UDC implementation will, anyway, be a timer?

It's mostly a question of what happens when the timer expires.  (After
all, starting a timer is just as easy to do in a UDC driver as it is in
the core.)  When the timer expires, what can the core do?

Don't say it can cancel the offending request and resubmit it.  That
leads to the sort of trouble we discussed earlier in this thread.  In
particular, we don't want the class driver's completion routine to be
called when the cancel occurs.  Furthermore, this leads to a race:  
Suppose the class driver decides to cancel the request at the same time
as the core does a cancel and resubmit.  Then the class driver's cancel
could get lost and the request would remain on the UDC's queue.

What you really want to do is issue the appropriate stop and restart
commands to the hardware, while leaving the request logically active
the entire time.  The UDC core can't do this, but a UDC driver can.

> >>(Especially if dwc3 is the only driver affected.)
> >
> > As discussed above, the issue may happen with other gadgets too. As I got divide opinions
> > on this implementation and both the implementations looks fine to me, I am little confused
> > on which should be implemented.
> >
> > @Felipe: Do you agree with Alan's implementation? Please let us know your suggestion
> > on this.
> 
> I still think a generic timer is a better solution since it has other uses.

Putting a struct timer into struct usb_request is okay with me, but I 
wouldn't go any farther than that.

> >>Since the purpose of the timeout is to detect a deadlock caused by a
> >>hardware bug, I suggest a fixed and relatively short timeout value such
> >>as one second.  Cancelling and requeuing a few requests at 1-second
> >>intervals shouldn't add very much overhead.
> 
> I wouldn't call this a HW bug though. This is just how the UDC
> behaves. There are N streams and host can move data in any stream at any
> time. This means that host & gadget _can_ disagree on what stream to
> start next.

But the USB 3 spec says what should happen when the host and gadget 
disagree in this way, doesn't it?  And it doesn't say that they should 
deadlock.  :-)  Or have I misread the spec?

> One way to avoid this would be to never pre-start any streams and always
> rely on XferNotReady, but that would mean greatly reduced throughput for
> streams.

It would be good if there was some way to actively detect the problem 
instead of passively waiting for a timer to expire.

Alan Stern
Anurag Kumar Vulisha - Dec. 12, 2018, 3:11 p.m.
Hi Felipe,

>-----Original Message-----
>From: Alan Stern [mailto:stern@rowland.harvard.edu]
>Sent: Friday, December 07, 2018 10:40 PM
>To: Felipe Balbi <balbi@kernel.org>
>Cc: Anurag Kumar Vulisha <anuragku@xilinx.com>; Greg Kroah-Hartman
><gregkh@linuxfoundation.org>; Shuah Khan <shuah@kernel.org>; Johan Hovold
><johan@kernel.org>; Jaejoong Kim <climbbb.kim@gmail.com>; Benjamin
>Herrenschmidt <benh@kernel.crashing.org>; Roger Quadros <rogerq@ti.com>; Manu
>Gautam <mgautam@codeaurora.org>; martin.petersen@oracle.com; Bart Van
>Assche <bvanassche@acm.org>; Mike Christie <mchristi@redhat.com>; Matthew
>Wilcox <willy@infradead.org>; Colin Ian King <colin.king@canonical.com>; linux-
>usb@vger.kernel.org; linux-kernel@vger.kernel.org; v.anuragkumar@gmail.com;
>Thinh Nguyen <thinhn@synopsys.com>; Tejas Joglekar
><tejas.joglekar@synopsys.com>; Ajay Yugalkishore Pandey <APANDEY@xilinx.com>
>Subject: RE: [PATCH v7 01/10] usb: gadget: udc: Add timer support for usb requests
>
>On Fri, 7 Dec 2018, Felipe Balbi wrote:
>
>>
>> hi,
>>
>> Anurag Kumar Vulisha <anuragku@xilinx.com> writes:
>> >>Does the data book suggest a value for the timeout?
>> >>
>> >
>> > No, the databook doesn't mention about the timeout value
>> >
>> >>> >At this point, it seems that the generic approach will be messier than having
>every
>> >>> >controller driver implement its own fix.  At least, that's how it appears to me.
>>
>> Why, if the UDC implementation will, anyway, be a timer?
>
>It's mostly a question of what happens when the timer expires.  (After
>all, starting a timer is just as easy to do in a UDC driver as it is in
>the core.)  When the timer expires, what can the core do?
>
>Don't say it can cancel the offending request and resubmit it.  That
>leads to the sort of trouble we discussed earlier in this thread.  In
>particular, we don't want the class driver's completion routine to be
>called when the cancel occurs.  Furthermore, this leads to a race:
>Suppose the class driver decides to cancel the request at the same time
>as the core does a cancel and resubmit.  Then the class driver's cancel
>could get lost and the request would remain on the UDC's queue.
>
>What you really want to do is issue the appropriate stop and restart
>commands to the hardware, while leaving the request logically active
>the entire time.  The UDC core can't do this, but a UDC driver can.
>

I agree with Alan's comment as it looks like there may be some corner cases
where the issue may occur with dequeue approach. Are you okay if the timeout
handler gets moved to the dwc3 driver (the timers still added into udc layer)? 
Please let us know your suggestion on this

Thanks,
Anurag Kumar Vulisha

>> >>(Especially if dwc3 is the only driver affected.)
>> >
>> > As discussed above, the issue may happen with other gadgets too. As I got divide
>opinions
>> > on this implementation and both the implementations looks fine to me, I am little
>confused
>> > on which should be implemented.
>> >
>> > @Felipe: Do you agree with Alan's implementation? Please let us know your
>suggestion
>> > on this.
>>
>> I still think a generic timer is a better solution since it has other uses.
>
>Putting a struct timer into struct usb_request is okay with me, but I
>wouldn't go any farther than that.
>
>> >>Since the purpose of the timeout is to detect a deadlock caused by a
>> >>hardware bug, I suggest a fixed and relatively short timeout value such
>> >>as one second.  Cancelling and requeuing a few requests at 1-second
>> >>intervals shouldn't add very much overhead.
>>
>> I wouldn't call this a HW bug though. This is just how the UDC
>> behaves. There are N streams and host can move data in any stream at any
>> time. This means that host & gadget _can_ disagree on what stream to
>> start next.
>
>But the USB 3 spec says what should happen when the host and gadget
>disagree in this way, doesn't it?  And it doesn't say that they should
>deadlock.  :-)  Or have I misread the spec?
>
>> One way to avoid this would be to never pre-start any streams and always
>> rely on XferNotReady, but that would mean greatly reduced throughput for
>> streams.
>
>It would be good if there was some way to actively detect the problem
>instead of passively waiting for a timer to expire.
>
>Alan Stern

Patch

diff --git a/drivers/usb/gadget/udc/core.c b/drivers/usb/gadget/udc/core.c
index 87d6b12..daeb9bf 100644
--- a/drivers/usb/gadget/udc/core.c
+++ b/drivers/usb/gadget/udc/core.c
@@ -52,6 +52,25 @@  static int udc_bind_to_driver(struct usb_udc *udc,
 /* ------------------------------------------------------------------------- */
 
 /**
+ * usb_request_timeout - callback function for endpoint stream timeout timer
+ * @arg: pointer to struct timer_list
+ *
+ * This function gets called only when bulk streams are enabled in the endpoint
+ * and only after req->req_timeout_timer has expired. This timer gets expired
+ * only when the gadget controller failed to find a valid stream event for this
+ * request. On timer expiry, this function dequeues the request and requeues it
+ * again to restart the transfer.
+ */
+static void usb_request_timeout(struct timer_list *arg)
+{
+	struct usb_request *req = from_timer(req, arg, req_timeout_timer);
+	struct usb_ep *ep = req->ep;
+
+	usb_ep_dequeue(ep, req);
+	usb_ep_queue_timeout(ep, req, GFP_ATOMIC, req->timeout_ms);
+}
+
+/**
  * usb_ep_set_maxpacket_limit - set maximum packet size limit for endpoint
  * @ep:the endpoint being configured
  * @maxpacket_limit:value of maximum packet size limit
@@ -190,6 +209,47 @@  void usb_ep_free_request(struct usb_ep *ep,
 EXPORT_SYMBOL_GPL(usb_ep_free_request);
 
 /**
+ * __usb_ep_queue - queues (submits) an I/O request to an endpoint.
+ * @ep:the endpoint associated with the request
+ * @req:the request being submitted
+ * @gfp_flags: GFP_* flags to use in case the lower level driver couldn't
+ *	pre-allocate all necessary memory with the request.
+ * @timeout: The timeout value in msecs used by the usb_request timer.
+ *
+ * This should only be called from usb_ep_queue() or usb_ep_queue_timeout().
+ * This function queues the requests to the controller driver and starts the
+ * timer if the timeout value is not zero.
+ */
+static int __usb_ep_queue(struct usb_ep *ep, struct usb_request *req,
+			  gfp_t gfp_flags, const unsigned int timeout)
+{
+	int ret = 0;
+
+	if (WARN_ON_ONCE(!ep->enabled && ep->address)) {
+		ret = -ESHUTDOWN;
+		goto out;
+	}
+
+	ret = ep->ops->queue(ep, req, gfp_flags);
+
+	if (timeout != 0) {
+		timer_setup(&req->req_timeout_timer, usb_request_timeout, 0);
+		req->req_timeout_timer.expires = jiffies +
+				msecs_to_jiffies(timeout);
+		mod_timer(&req->req_timeout_timer,
+			  req->req_timeout_timer.expires);
+		req->timeout_ms = timeout;
+	}
+
+	/* Assign the ep to req for future usage */
+	req->ep = ep;
+out:
+	trace_usb_ep_queue(ep, req, ret);
+
+	return ret;
+}
+
+/**
  * usb_ep_queue - queues (submits) an I/O request to an endpoint.
  * @ep:the endpoint associated with the request
  * @req:the request being submitted
@@ -260,23 +320,45 @@  EXPORT_SYMBOL_GPL(usb_ep_free_request);
 int usb_ep_queue(struct usb_ep *ep,
 			       struct usb_request *req, gfp_t gfp_flags)
 {
-	int ret = 0;
-
-	if (WARN_ON_ONCE(!ep->enabled && ep->address)) {
-		ret = -ESHUTDOWN;
-		goto out;
-	}
-
-	ret = ep->ops->queue(ep, req, gfp_flags);
-
-out:
-	trace_usb_ep_queue(ep, req, ret);
-
-	return ret;
+	return __usb_ep_queue(ep, req, gfp_flags, 0);
 }
 EXPORT_SYMBOL_GPL(usb_ep_queue);
 
 /**
+ * usb_ep_queue_timeout - queues (submits) an I/O request to an endpoint.
+ * @ep:the endpoint associated with the request
+ * @req:the request being submitted
+ * @gfp_flags: GFP_* flags to use in case the lower level driver couldn't
+ *	pre-allocate all necessary memory with the request.
+ * @timeout: The timeout value used by the timer present in usb_request.
+ *
+ * This functions starts the timer for the requests queued to the controller.
+ * This routine does the same as usb_ep_queue() but takes an extra timeout
+ * argument which is used for setting the timeout value for the timer. There
+ * can be some corner case where the endpoint may go out of sync with the host
+ * and enter into deadlock situation. To avoid such potential deadlocks a timer
+ * is started at the time of queuing the request. This timer should be stopped
+ * by the controller driver on valid conditions otherwise the timer gets
+ * timedout and the handler is called which handles the deadlock.
+ *
+ * For example, when streams are enabled the host and gadget can go out sync,the
+ * gadget may wait until the host issues prime transaction and the host may wait
+ * until gadget issues a ERDY. This behaviour may create a deadlock situation.
+ * To avoid such a deadlock, when request is queued to an endpoint, the timer
+ * present in usb_request is started. If a valid stream event is found the
+ * gadget driver stops the timer. If no valid stream event is found, the timer
+ * keeps running until expired and the timeout handler registered to the timer
+ * usb_request_timeout() gets called, which dequeues the request and requeues
+ * the request to avoid the deadlock condition.
+ */
+int usb_ep_queue_timeout(struct usb_ep *ep, struct usb_request *req,
+			 gfp_t gfp_flags, const unsigned int timeout)
+{
+	return __usb_ep_queue(ep, req, gfp_flags, timeout);
+}
+EXPORT_SYMBOL_GPL(usb_ep_queue_timeout);
+
+/**
  * usb_ep_dequeue - dequeues (cancels, unlinks) an I/O request from an endpoint
  * @ep:the endpoint associated with the request
  * @req:the request being canceled
@@ -291,6 +373,8 @@  EXPORT_SYMBOL_GPL(usb_ep_queue);
  * restrictions prevent drivers from supporting configuration changes,
  * even to configuration zero (a "chapter 9" requirement).
  *
+ * If a timer is started in usb_ep_queue(), it would be removed.
+ *
  * This routine may be called in interrupt context.
  */
 int usb_ep_dequeue(struct usb_ep *ep, struct usb_request *req)
@@ -300,6 +384,9 @@  int usb_ep_dequeue(struct usb_ep *ep, struct usb_request *req)
 	ret = ep->ops->dequeue(ep, req);
 	trace_usb_ep_dequeue(ep, req, ret);
 
+	if (timer_pending(&req->req_timeout_timer))
+		del_timer(&req->req_timeout_timer);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(usb_ep_dequeue);
@@ -883,7 +970,8 @@  EXPORT_SYMBOL_GPL(usb_gadget_unmap_request);
  * Context: in_interrupt()
  *
  * This is called by device controller drivers in order to return the
- * completed request back to the gadget layer.
+ * completed request back to the gadget layer. If a timer is started
+ * in usb_ep_queue(), it would be removed.
  */
 void usb_gadget_giveback_request(struct usb_ep *ep,
 		struct usb_request *req)
@@ -891,6 +979,9 @@  void usb_gadget_giveback_request(struct usb_ep *ep,
 	if (likely(req->status == 0))
 		usb_led_activity(USB_LED_EVENT_GADGET);
 
+	if (timer_pending(&req->req_timeout_timer))
+		del_timer(&req->req_timeout_timer);
+
 	trace_usb_gadget_giveback_request(ep, req, 0);
 
 	req->complete(ep, req);
diff --git a/include/linux/usb/gadget.h b/include/linux/usb/gadget.h
index e5cd84a..5b2516b 100644
--- a/include/linux/usb/gadget.h
+++ b/include/linux/usb/gadget.h
@@ -61,6 +61,13 @@  struct usb_ep;
  *	invalidated by the error may first be dequeued.
  * @context: For use by the completion callback
  * @list: For use by the gadget driver.
+ * @req_timeout_timer: Some endpoints may go out of sync with host and
+ *	enter into deadlock. For example, stream capable endpoints may enter
+ *	into deadlock where the host waits on gadget to issue ERDY and gadget
+ *	waits for host to issue prime transaction. To avoid such deadlock this
+ *	timer is used.
+ * @timeout_ms: timeout value in msecs used by the req_timeout_timer.
+ * @ep: pointer to the endpoint for which this request is queued.
  * @status: Reports completion code, zero or a negative errno.
  *	Normally, faults block the transfer queue from advancing until
  *	the completion callback returns.
@@ -111,6 +118,9 @@  struct usb_request {
 					struct usb_request *req);
 	void			*context;
 	struct list_head	list;
+	unsigned		timeout_ms;
+	struct timer_list	req_timeout_timer;
+	struct usb_ep		*ep;
 
 	int			status;
 	unsigned		actual;
@@ -243,6 +253,8 @@  int usb_ep_disable(struct usb_ep *ep);
 struct usb_request *usb_ep_alloc_request(struct usb_ep *ep, gfp_t gfp_flags);
 void usb_ep_free_request(struct usb_ep *ep, struct usb_request *req);
 int usb_ep_queue(struct usb_ep *ep, struct usb_request *req, gfp_t gfp_flags);
+int usb_ep_queue_timeout(struct usb_ep *ep, struct usb_request *req,
+			 gfp_t gfp_flags, const unsigned int timeout);
 int usb_ep_dequeue(struct usb_ep *ep, struct usb_request *req);
 int usb_ep_set_halt(struct usb_ep *ep);
 int usb_ep_clear_halt(struct usb_ep *ep);
@@ -266,6 +278,10 @@  static inline void usb_ep_free_request(struct usb_ep *ep,
 static inline int usb_ep_queue(struct usb_ep *ep, struct usb_request *req,
 		gfp_t gfp_flags)
 { return 0; }
+static inline int usb_ep_queue_timeout(struct usb_ep *ep,
+				       struct usb_request *req, gfp_t gfp_flags,
+				       const unsigned int timeout)
+{ return 0; }
 static inline int usb_ep_dequeue(struct usb_ep *ep, struct usb_request *req)
 { return 0; }
 static inline int usb_ep_set_halt(struct usb_ep *ep)