Patchwork [3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary

login
register
mail settings
Submitter Dexuan Cui
Date March 4, 2019, 9:34 p.m.
Message ID <20190304213357.16652-4-decui@microsoft.com>
Download mbox | patch
Permalink /patch/741247/
State New
Headers show

Comments

Dexuan Cui - March 4, 2019, 9:34 p.m.
When we hot-remove a device, usually the host sends us a PCI_EJECT message,
and a PCI_BUS_RELATIONS message with bus_rel->device_count == 0. But when
we do the quick hot-add/hot-remove test, the host may not send us the
PCI_EJECT message, if the guest has not fully finished the initialization
by sending the PCI_RESOURCES_ASSIGNED* message to the host, so it's
potentially unsafe to only depend on the pci_destroy_slot() in
hv_eject_device_work(), though create_root_hv_pci_bus() ->
hv_pci_assign_slots() is not called in this case. Note: in this case, the
host still sends the guest a PCI_BUS_RELATIONS message with
bus_rel->device_count == 0.

And, in the quick hot-add/hot-remove test, we can have such a race: before
pci_devices_present_work() -> new_pcichild_device() adds the new device
into hbus->children, we may have already received the PCI_EJECT message,
and hence the taklet handler hv_pci_onchannelcallback() may fail to find
the "hpdev" by get_pcichild_wslot(hbus, dev_message->wslot.slot), so
hv_pci_eject_device() is NOT called; later create_root_hv_pci_bus() ->
hv_pci_assign_slots() creates the slot, and the PCI_BUS_RELATIONS message
with bus_rel->device_count == 0 removes the device from hbus->children, and
we end up being unable to remove the slot in hv_pci_remove() ->
hv_pci_remove_slots().

The patch removes the slot in pci_devices_present_work() when the device
is removed. This can address the above race. Note 1:
pci_devices_present_work() and hv_eject_device_work() run in the
singled-threaded hbus->wq, so there is not a double-remove issue for the
slot. Note 2: we can't offload hv_pci_eject_device() from
hv_pci_onchannelcallback() to the workqueue, because we need
hv_pci_onchannelcallback() synchronously call hv_pci_eject_device() to
poll the channel's ringbuffer to work around the
"hangs in hv_compose_msi_msg()" issue: see
commit de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()")

Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: stable@vger.kernel.org
---
 drivers/pci/controller/pci-hyperv.c | 4 ++++
 1 file changed, 4 insertions(+)
Michael Kelley - March 20, 2019, 9:44 p.m.
From: Dexuan Cui <decui@microsoft.com>  Sent: Monday, March 4, 2019 1:35 PM
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index b489412e3502..82acd6155adf 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1776,6 +1776,10 @@ static void pci_devices_present_work(struct work_struct *work)
>  		hpdev = list_first_entry(&removed, struct hv_pci_dev,
>  					 list_entry);
>  		list_del(&hpdev->list_entry);
> +
> +		if (hpdev->pci_slot)
> +			pci_destroy_slot(hpdev->pci_slot);

The code is inconsistent in whether hpdev->pci_slot is set to NULL after calling
pci_destory_slot().  Patch 2 in this series does set it to NULL, but this code does not.
And the code in hv_eject_device_work() does not set it to NULL.

It looks like all the places that test the value of hpdev->pci_slot or call
pci_destroy_slot() are serialized, so it looks like it really doesn't matter.  But when
the code is inconsistent about setting to NULL, it always makes me wonder if there
is a reason.

Michael


> +
>  		put_pcichild(hpdev);
>  	}
> 
> --
> 2.19.1
Dexuan Cui - March 21, 2019, 12:35 a.m.
> From: Michael Kelley <mikelley@microsoft.com>
> Sent: Wednesday, March 20, 2019 2:44 PM
> To: Dexuan Cui <decui@microsoft.com>; lorenzo.pieralisi@arm.com;
> bhelgaas@google.com; linux-pci@vger.kernel.org; KY Srinivasan
> > ...
> > diff --git a/drivers/pci/controller/pci-hyperv.c
> > @@ -1776,6 +1776,10 @@ static void pci_devices_present_work(struct
> work_struct *work)
> >  		hpdev = list_first_entry(&removed, struct hv_pci_dev,
> >  					 list_entry);
> >  		list_del(&hpdev->list_entry);
> > +
> > +		if (hpdev->pci_slot)
> > +			pci_destroy_slot(hpdev->pci_slot);
> 
> The code is inconsistent in whether hpdev->pci_slot is set to NULL after calling
> pci_destory_slot(). 
Here, in pci_devices_present_work(), it's unnecessary to set it to NULL,
Because:
1) the "hpdev" is removed from hbus->children and it can not be seen
elsewhere;
2) the "hpdev" struct is freed in the below put_pcichild():

     while (!list_empty(&removed)) {
                hpdev = list_first_entry(&removed, struct hv_pci_dev,
                                         list_entry);
                list_del(&hpdev->list_entry);

                if (hpdev->pci_slot)
                        pci_destroy_slot(hpdev->pci_slot);

                put_pcichild(hpdev);
        }

> Patch 2 in this series does set it to NULL, but this code does not.
In Patch2, i.e. in the code path hv_pci_remove() -> hv_pci_remove_slots(),
we must set hpdev->pci_slot to NULL, otherwise, later, due to
hv_pci_remove() -> hv_pci_bus_exit() ->
hv_pci_devices_present() with the zero "relations", we'll double-free the
"hpdev" struct in pci_devices_present_work() -- see the above.

> And the code in hv_eject_device_work() does not set it to NULL.
It's unnecessary to set hpdev->pci_slot to NULL in hv_eject_device_work(),
Because in hv_eject_device_work(): 
1) the "hpdev" is removed from hbus->children and it can not be seen
elsewhere;
2) the "hpdev" struct is freed at the end of hv_eject_device_work() with my
first patch: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work().
 
> It looks like all the places that test the value of hpdev->pci_slot or call
> pci_destroy_slot() are serialized, so it looks like it really doesn't matter.  But
> when
> the code is inconsistent about setting to NULL, it always makes me wonder if
> there
> is a reason.
> 
> Michael

Thanks,
-- Dexuan
Dexuan Cui - March 21, 2019, 12:42 a.m.
> From: linux-hyperv-owner@vger.kernel.org
> <linux-hyperv-owner@vger.kernel.org> On Behalf Of Dexuan Cui
> > ...
> > Patch 2 in this series does set it to NULL, but this code does not.
> In Patch2, i.e. in the code path hv_pci_remove() -> hv_pci_remove_slots(),
> we must set hpdev->pci_slot to NULL, otherwise, later, due to
> hv_pci_remove() -> hv_pci_bus_exit() ->
> hv_pci_devices_present() with the zero "relations", we'll double-free the
> "hpdev" struct in pci_devices_present_work() -- see the above.
A minor correction: I meant we'll call pci_destroy_slot(hpdev->pci_slot)
twice, not "double-free hpdev".

Thanks,
-- Dexuan
Michael Kelley - March 26, 2019, 5:50 p.m.
From: Dexuan Cui <decui@microsoft.com>  Sent: Wednesday, March 20, 2019 5:36 PM
> 
> > From: Michael Kelley <mikelley@microsoft.com>
> > > ...
> > > diff --git a/drivers/pci/controller/pci-hyperv.c
> > > @@ -1776,6 +1776,10 @@ static void pci_devices_present_work(struct
> > work_struct *work)
> > >  		hpdev = list_first_entry(&removed, struct hv_pci_dev,
> > >  					 list_entry);
> > >  		list_del(&hpdev->list_entry);
> > > +
> > > +		if (hpdev->pci_slot)
> > > +			pci_destroy_slot(hpdev->pci_slot);
> >
> > The code is inconsistent in whether hpdev->pci_slot is set to NULL after calling
> > pci_destory_slot().
> Here, in pci_devices_present_work(), it's unnecessary to set it to NULL,
> Because:
> 1) the "hpdev" is removed from hbus->children and it can not be seen
> elsewhere;
> 2) the "hpdev" struct is freed in the below put_pcichild():
> 
>      while (!list_empty(&removed)) {
>                 hpdev = list_first_entry(&removed, struct hv_pci_dev,
>                                          list_entry);
>                 list_del(&hpdev->list_entry);
> 
>                 if (hpdev->pci_slot)
>                         pci_destroy_slot(hpdev->pci_slot);
> 
>                 put_pcichild(hpdev);
>         }
> 
> > Patch 2 in this series does set it to NULL, but this code does not.
> In Patch2, i.e. in the code path hv_pci_remove() -> hv_pci_remove_slots(),
> we must set hpdev->pci_slot to NULL, otherwise, later, due to
> hv_pci_remove() -> hv_pci_bus_exit() ->
> hv_pci_devices_present() with the zero "relations", we'll double-free the
> "hpdev" struct in pci_devices_present_work() -- see the above.
> 
> > And the code in hv_eject_device_work() does not set it to NULL.
> It's unnecessary to set hpdev->pci_slot to NULL in hv_eject_device_work(),
> Because in hv_eject_device_work():
> 1) the "hpdev" is removed from hbus->children and it can not be seen
> elsewhere;
> 2) the "hpdev" struct is freed at the end of hv_eject_device_work() with my
> first patch: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work().
> 
> > It looks like all the places that test the value of hpdev->pci_slot or call
> > pci_destroy_slot() are serialized, so it looks like it really doesn't matter.  But
> > when
> > the code is inconsistent about setting to NULL, it always makes me wonder if
> > there
> > is a reason.
> >
> > Michael
> 

Reviewed-by:  Michael Kelley <mikelley@microsoft.com>
Lorenzo Pieralisi - March 26, 2019, 7:54 p.m.
On Mon, Mar 04, 2019 at 09:34:49PM +0000, Dexuan Cui wrote:
> When we hot-remove a device, usually the host sends us a PCI_EJECT message,
> and a PCI_BUS_RELATIONS message with bus_rel->device_count == 0. But when
> we do the quick hot-add/hot-remove test, the host may not send us the
> PCI_EJECT message, if the guest has not fully finished the initialization
> by sending the PCI_RESOURCES_ASSIGNED* message to the host, so it's
> potentially unsafe to only depend on the pci_destroy_slot() in
> hv_eject_device_work(), though create_root_hv_pci_bus() ->
> hv_pci_assign_slots() is not called in this case. Note: in this case, the
> host still sends the guest a PCI_BUS_RELATIONS message with
> bus_rel->device_count == 0.
> 
> And, in the quick hot-add/hot-remove test, we can have such a race: before
> pci_devices_present_work() -> new_pcichild_device() adds the new device
> into hbus->children, we may have already received the PCI_EJECT message,
> and hence the taklet handler hv_pci_onchannelcallback() may fail to find
> the "hpdev" by get_pcichild_wslot(hbus, dev_message->wslot.slot), so
> hv_pci_eject_device() is NOT called; later create_root_hv_pci_bus() ->
> hv_pci_assign_slots() creates the slot, and the PCI_BUS_RELATIONS message
> with bus_rel->device_count == 0 removes the device from hbus->children, and
> we end up being unable to remove the slot in hv_pci_remove() ->
> hv_pci_remove_slots().
> 
> The patch removes the slot in pci_devices_present_work() when the device
> is removed. This can address the above race. Note 1:
> pci_devices_present_work() and hv_eject_device_work() run in the
> singled-threaded hbus->wq, so there is not a double-remove issue for the
> slot. Note 2: we can't offload hv_pci_eject_device() from
> hv_pci_onchannelcallback() to the workqueue, because we need
> hv_pci_onchannelcallback() synchronously call hv_pci_eject_device() to
> poll the channel's ringbuffer to work around the
> "hangs in hv_compose_msi_msg()" issue: see
> commit de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()")

This commit log is unreadable, sorry. Indentation, punctuation and
formatting are just a mess, try to read it, you will notice by
yourself.

I basically reformatted it completely and pushed the series to
pci/controller-fixes but that's the last time I do it since I am not an
editor, next time I won't merge it.

More importantly, these patches are marked for stable, given the series
of fixes that triggered this series please ensure it was tested
thoroughly because it is honestly complicate to understand and I do not
want to backport further fixes to stable kernels on top of this.

Please have a look and report back.

Thanks,
Lorenzo

> Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Cc: stable@vger.kernel.org
> ---
>  drivers/pci/controller/pci-hyperv.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index b489412e3502..82acd6155adf 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1776,6 +1776,10 @@ static void pci_devices_present_work(struct work_struct *work)
>  		hpdev = list_first_entry(&removed, struct hv_pci_dev,
>  					 list_entry);
>  		list_del(&hpdev->list_entry);
> +
> +		if (hpdev->pci_slot)
> +			pci_destroy_slot(hpdev->pci_slot);
> +
>  		put_pcichild(hpdev);
>  	}
>  
> -- 
> 2.19.1
>
Dexuan Cui - March 27, 2019, 12:22 a.m.
> From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Sent: Tuesday, March 26, 2019 12:55 PM
> On Mon, Mar 04, 2019 at 09:34:49PM +0000, Dexuan Cui wrote:
> > When we hot-remove a device, usually the host sends us a PCI_EJECT
> message,
> > and a PCI_BUS_RELATIONS message with bus_rel->device_count == 0. But
> when
> > we do the quick hot-add/hot-remove test, the host may not send us the
> > PCI_EJECT message, if the guest has not fully finished the initialization
> > by sending the PCI_RESOURCES_ASSIGNED* message to the host, so it's
> > potentially unsafe to only depend on the pci_destroy_slot() in
> > hv_eject_device_work(), though create_root_hv_pci_bus() ->
> > hv_pci_assign_slots() is not called in this case. Note: in this case, the
> > host still sends the guest a PCI_BUS_RELATIONS message with
> > bus_rel->device_count == 0.
> >
> > And, in the quick hot-add/hot-remove test, we can have such a race: before
> > pci_devices_present_work() -> new_pcichild_device() adds the new device
> > into hbus->children, we may have already received the PCI_EJECT message,
> > and hence the taklet handler hv_pci_onchannelcallback() may fail to find
> > the "hpdev" by get_pcichild_wslot(hbus, dev_message->wslot.slot), so
> > hv_pci_eject_device() is NOT called; later create_root_hv_pci_bus() ->
> > hv_pci_assign_slots() creates the slot, and the PCI_BUS_RELATIONS message
> > with bus_rel->device_count == 0 removes the device from hbus->children,
> and
> > we end up being unable to remove the slot in hv_pci_remove() ->
> > hv_pci_remove_slots().
> >
> > The patch removes the slot in pci_devices_present_work() when the device
> > is removed. This can address the above race. Note 1:
> > pci_devices_present_work() and hv_eject_device_work() run in the
> > singled-threaded hbus->wq, so there is not a double-remove issue for the
> > slot. Note 2: we can't offload hv_pci_eject_device() from
> > hv_pci_onchannelcallback() to the workqueue, because we need
> > hv_pci_onchannelcallback() synchronously call hv_pci_eject_device() to
> > poll the channel's ringbuffer to work around the
> > "hangs in hv_compose_msi_msg()" issue: see
> > commit de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in
> hv_compose_msi_msg()")
> 
> This commit log is unreadable, sorry. Indentation, punctuation and
> formatting are just a mess, try to read it, you will notice by
> yourself.
> 
> I basically reformatted it completely and pushed the series to
> pci/controller-fixes but that's the last time I do it since I am not an
> editor, next time I won't merge it.

Hi Lorenzo,
Thank you for helping improve my changelogs! I did learn a lot after
carefully comparing the improved version with my original version. :-)

I'll try my best to write a good changelog for my future patches.

> More importantly, these patches are marked for stable, given the series
> of fixes that triggered this series please ensure it was tested
> thoroughly because it is honestly complicate to understand and I do not
> want to backport further fixes to stable kernels on top of this.

I did the hot-add/hot-remove test in a loop for several thousand times,
and the patchset worked as expected and didn't show any issue.

> Please have a look and report back.
> 
> Thanks,
> Lorenzo

Thanks again!

Thanks,
-- Dexuan

Patch

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index b489412e3502..82acd6155adf 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1776,6 +1776,10 @@  static void pci_devices_present_work(struct work_struct *work)
 		hpdev = list_first_entry(&removed, struct hv_pci_dev,
 					 list_entry);
 		list_del(&hpdev->list_entry);
+
+		if (hpdev->pci_slot)
+			pci_destroy_slot(hpdev->pci_slot);
+
 		put_pcichild(hpdev);
 	}