Patchwork [net-next,RFC,4/8] net: core: add recycle capabilities on skbs via page_pool API

Submitter Jesper Dangaard Brouer
Date Dec. 6, 2018, 11:25 p.m.
Message ID <154413874729.21735.10644578158550468689.stgit@firesoul>
Permalink /patch/674815/
State New

Comments

Jesper Dangaard Brouer - Dec. 6, 2018, 11:25 p.m.
From: Ilias Apalodimas <ilias.apalodimas@linaro.org>

This patch changes struct sk_buff, and is thus by definition
controversial.

Place a new member 'mem_info' of type struct xdp_mem_info just after
the (flag) members head_frag and pfmemalloc, and not in between
headers_start/end, to ensure skb_copy() and pskb_copy() work as-is.
Copying mem_info during skb_clone() is required.  This makes sure that
pages are correctly freed or recycled during the altered
skb_free_head() invocation.

The 'mem_info' name is chosen as this is not strictly tied to XDP,
although the XDP return infrastructure is used.  As a future plan, we
could introduce a __u8 flags member in xdp_mem_info and move the
head_frag and pfmemalloc bits into it.

Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/skbuff.h |    6 +++++-
 include/net/xdp.h      |    1 +
 net/core/skbuff.c      |    7 +++++++
 net/core/xdp.c         |    6 ++++++
 4 files changed, 19 insertions(+), 1 deletion(-)
David Miller - Dec. 8, 2018, 7:15 a.m.
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Fri, 07 Dec 2018 00:25:47 +0100

> @@ -744,6 +745,10 @@ struct sk_buff {
>  				head_frag:1,
>  				xmit_more:1,
>  				pfmemalloc:1;
> +	/* TODO: Future idea, extend mem_info with __u8 flags, and
> +	 * move bits head_frag and pfmemalloc there.
> +	 */
> +	struct xdp_mem_info	mem_info;

This is 4 bytes right?

I guess I can live with this.

Please do some microbenchmarks to make sure this doesn't show any
obvious regressions.

Thanks.
Ilias Apalodimas - Dec. 8, 2018, 7:54 a.m.
On Fri, Dec 07, 2018 at 11:15:14PM -0800, David Miller wrote:
> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Fri, 07 Dec 2018 00:25:47 +0100
> 
> > @@ -744,6 +745,10 @@ struct sk_buff {
> >  				head_frag:1,
> >  				xmit_more:1,
> >  				pfmemalloc:1;
> > +	/* TODO: Future idea, extend mem_info with __u8 flags, and
> > +	 * move bits head_frag and pfmemalloc there.
> > +	 */
> > +	struct xdp_mem_info	mem_info;
> 
> This is 4 bytes right?

With this patchset yes

> 
> I guess I can live with this.

Great news!

> 
> Please do some microbenchmarks to make sure this doesn't show any
> obvious regressions.

Will do

Thanks
/Ilias
Florian Westphal - Dec. 8, 2018, 9:57 a.m.
Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> From: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> 
> This patch is changing struct sk_buff, and is thus per-definition
> controversial.
> 
> Place a new member 'mem_info' of type struct xdp_mem_info, just after
> members (flags) head_frag and pfmemalloc, And not in between
> headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
> Copying mem_info during skb_clone() is required.  This makes sure that
> pages are correctly freed or recycled during the altered
> skb_free_head() invocation.

I read this to mean that this 'info' isn't accessed/needed until skb
is freed.  Any reason its not added at the end?

This would avoid moving other fields that are probably accessed
more frequently during processing.
Jesper Dangaard Brouer - Dec. 8, 2018, 11:36 a.m.
On Sat, 8 Dec 2018 10:57:58 +0100
Florian Westphal <fw@strlen.de> wrote:

> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > From: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> > 
> > This patch is changing struct sk_buff, and is thus per-definition
> > controversial.
> > 
> > Place a new member 'mem_info' of type struct xdp_mem_info, just after
> > members (flags) head_frag and pfmemalloc, And not in between
> > headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
> > Copying mem_info during skb_clone() is required.  This makes sure that
> > pages are correctly freed or recycled during the altered
> > skb_free_head() invocation.  
> 
> I read this to mean that this 'info' isn't accessed/needed until skb
> is freed.  Any reason its not added at the end?
>
> This would avoid moving other fields that are probably accessed
> more frequently during processing.

It is placed here because it is close to skb->head_frag, which is also
used when the SKB is freed.  That is the reason, both because I think
we can move the skb->head_frag bit into mem_info, and due to cache-line
accesses (e.g. skb->cloned and the destructor pointer are also read,
and they are on that cache-line).

The annoying part is actually that, depending on the kernel config
options CONFIG_XFRM, CONFIG_NF_CONNTRACK and CONFIG_BRIDGE_NETFILTER,
there may be a cache-line split, where mem_info gets moved into the
next cache-line.

I do like the idea of moving it to the end, if we also move
skb->head_frag into mem_info, as skb->head (on the end cache-line) is
also accessed during free.  But given we still need to read the
cache-line containing skb->{cloned,destructor}, I don't think it will
be a win.  I think the skb->cloned value is read during the lifetime of
the SKB.
Eric Dumazet - Dec. 8, 2018, 12:29 p.m.
On 12/08/2018 01:57 AM, Florian Westphal wrote:
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>> From: Ilias Apalodimas <ilias.apalodimas@linaro.org>
>>
>> This patch is changing struct sk_buff, and is thus per-definition
>> controversial.
>>
>> Place a new member 'mem_info' of type struct xdp_mem_info, just after
>> members (flags) head_frag and pfmemalloc, And not in between
>> headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
>> Copying mem_info during skb_clone() is required.  This makes sure that
>> pages are correctly freed or recycled during the altered
>> skb_free_head() invocation.
> 
> I read this to mean that this 'info' isn't accessed/needed until skb
> is freed.  Any reason its not added at the end?
> 
> This would avoid moving other fields that are probably accessed
> more frequently during processing.
> 

But I do not get why the patch is needed.

Adding extra cost for each skb destruction is costly.

I thought XDP was all about _not_ having skbs.

Please let's not slow down the non-XDP stack only to make XDP more appealing.
Eric Dumazet - Dec. 8, 2018, 12:34 p.m.
On 12/08/2018 04:29 AM, Eric Dumazet wrote:
> 

> But I do not get why the patch is needed.
> 
> Adding extra cost for each skb destruction is costly.
> 
> I though XDP was all about _not_ having skbs.
> 
> Please let's do not slow down the non XDP stack only to make XDP more appealing.
> 

I have a similar concern with napi_consume_skb():

This added a cost for locally generated TCP traffic, since most TX packets are fast clones.
Jesper Dangaard Brouer - Dec. 8, 2018, 1:45 p.m.
On Sat, 8 Dec 2018 04:29:17 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On 12/08/2018 01:57 AM, Florian Westphal wrote:
> > Jesper Dangaard Brouer <brouer@redhat.com> wrote:  
> >> From: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> >>
> >> This patch is changing struct sk_buff, and is thus per-definition
> >> controversial.
> >>
> >> Place a new member 'mem_info' of type struct xdp_mem_info, just after
> >> members (flags) head_frag and pfmemalloc, And not in between
> >> headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
> >> Copying mem_info during skb_clone() is required.  This makes sure that
> >> pages are correctly freed or recycled during the altered
> >> skb_free_head() invocation.  
> > 
> > I read this to mean that this 'info' isn't accessed/needed until skb
> > is freed.  Any reason its not added at the end?
> > 
> > This would avoid moving other fields that are probably accessed
> > more frequently during processing.
> >   
> 
> But I do not get why the patch is needed.
> 
> Adding extra cost for each skb destruction is costly.
> 
> I though XDP was all about _not_ having skbs.
> 
> Please let's do not slow down the non XDP stack only to make XDP more
> appealing.

In general this work is about better cooperation between XDP and the
netstack.  The patch is needed because the page_pool keeps pages
DMA-mapped and needs a return hook.  I doubt that the extra
compare-and-branch will affect your use-case on super-scalar CPUs.  We
(Ilias and I) are actually testing this stuff on low-end ARM64 in-order
execution CPUs, which is actually nice, as the performance effects of
our code changes are not hidden by speculative execution units.  I'm
enforcing (and Ilias agrees) that we do benchmark-driven development.
I actually invite people to monitor our progress here[1].  So, trust
me, I am as concerned as you about any performance regression, and am
vigilantly measuring this stuff.  (This is more than you can say about
a lot of the other stuff that gets accepted on this list.)


[1] https://github.com/xdp-project/xdp-project/blob/master/areas/arm64/
Ilias Apalodimas - Dec. 8, 2018, 2:57 p.m.
Hi Eric, 
> >> This patch is changing struct sk_buff, and is thus per-definition
> >> controversial.
> >>
> >> Place a new member 'mem_info' of type struct xdp_mem_info, just after
> >> members (flags) head_frag and pfmemalloc, And not in between
> >> headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
> >> Copying mem_info during skb_clone() is required.  This makes sure that
> >> pages are correctly freed or recycled during the altered
> >> skb_free_head() invocation.
> > 
> > I read this to mean that this 'info' isn't accessed/needed until skb
> > is freed.  Any reason its not added at the end?
> > 
> > This would avoid moving other fields that are probably accessed
> > more frequently during processing.
> > 
> 
> But I do not get why the patch is needed.
> 
> Adding extra cost for each skb destruction is costly.
> 
> I though XDP was all about _not_ having skbs.

You hit the only point I don't personally like in the code, xdp info in
the skb. Considering the benefits I am convinced this is ok, and here's why:

> Please let's do not slow down the non XDP stack only to make XDP more appealing.

We are not trying to do that, on the contrary. There is nothing in the
patchset about speeding up XDP and slowing down everything else. The
patchset speeds up the mvneta driver on the default network stack. The
only change that was needed was to adapt the driver to using the
page_pool API. The speed improvements we are seeing on specific
workloads (i.e. 256b < packet < 400b) are almost 3x.

Lots of high speed drivers are doing similar recycling tricks themselves
(and there's no common code; everyone is doing something similar
though). All we are trying to do is provide a unified API to make that
easier for the rest. Another advantage is that if some drivers switch to
the API, adding XDP functionality to them is pretty trivial.

Moreover, our tests have only been performed on systems without an SMMU,
or with the SMMU disabled. On a platform I have access to, enabling or
disabling the SMMU has some performance impact. By keeping the buffers
mapped we believe this impact will be substantially less (I'll come back
with results once I have them on this).

I do understand your point, but in my head the potential advantages
outweigh it by a lot.

I have other concerns about the patchset though, like how much memory it
is 'ok' to keep mapped, keeping in mind we are using the streaming DMA
API. Are we going to affect anyone else negatively by doing so?

Thanks
/Ilias
Andrew Lunn - Dec. 8, 2018, 5:07 p.m.
> I got other concerns on the patchset though. Like how much memory is
> it 'ok' to keep mapped keeping in mind we are using the streaming
> DMA API. Are we going to affect anyone else negatively by doing so ?

For mvneta, you can expect the target to have between 512 Mbyte and
3 GByte. You can take a look at the DT files to get a better feel for
this. All of it should be low enough that it is DMA'able.

These SoCs are often used for WiFi access points and NAS boxes. So I
guess the big memory user on these boxes is the block cache for a NAS
device. So if you want to see the impact of the driver using more
memory, see if disk performance is affected.

	Andrew
Eric Dumazet - Dec. 8, 2018, 7:26 p.m.
On 12/08/2018 06:57 AM, Ilias Apalodimas wrote:
> Hi Eric, 
>>>> This patch is changing struct sk_buff, and is thus per-definition
>>>> controversial.
>>>>
>>>> Place a new member 'mem_info' of type struct xdp_mem_info, just after
>>>> members (flags) head_frag and pfmemalloc, And not in between
>>>> headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
>>>> Copying mem_info during skb_clone() is required.  This makes sure that
>>>> pages are correctly freed or recycled during the altered
>>>> skb_free_head() invocation.
>>>
>>> I read this to mean that this 'info' isn't accessed/needed until skb
>>> is freed.  Any reason its not added at the end?
>>>
>>> This would avoid moving other fields that are probably accessed
>>> more frequently during processing.
>>>
>>
>> But I do not get why the patch is needed.
>>
>> Adding extra cost for each skb destruction is costly.
>>
>> I though XDP was all about _not_ having skbs.
> 
> You hit the only point i don't personally like in the code, xdp info in the 
> skb. Considering the benefits i am convinced this is ok and here's why:
> 
>> Please let's do not slow down the non XDP stack only to make XDP more appealing.
> 
> We are not trying to do that, on the contrary. The patchset has nothing towards
> speeding XDP and slowing down anything else. The patchset speeds up the 
> mvneta driver on the default network stack. The only change that was needed was
> to adapt the driver to using the page_pool API. The speed improvements we are
> seeing on specific workloads (i.e 256b < packet < 400b) are almost 3x.
> 
> Lots of high speed drivers are doing similar recycling tricks themselves (and
> there's no common code, everyone is doing something similar though). All we are
> trying to do is provide a unified API to make that easier for the rest. Another
> advantage is that if the some drivers switch to the API, adding XDP
> functionality on them is pretty trivial.
> 
> Moreover our tests are only performed on systems without or with SMMU disabled.
> On a platform i have access to, enabling and disabling the SMMU has some
> performance impact. By keeping the buffers mapped we believe this impact
> will be substantially less (i'll come back with results once i have them on
> this).
> 
> I do understand your point, but the potential advantages on my head
> overwight that by a lot.
> 
> I got other concerns on the patchset though. Like how much memory is it 'ok' to
> keep mapped keeping in mind we are using the streaming DMA API. Are we going to
> affect anyone else negatively by doing so ?
> 


I want to make sure you guys thought about splice() stuff, and skb_try_coalesce(),
and GRO, and skb cloning, and ...

I do not see how an essential property of page fragments could be
attached to the sk_buff without adding extra code all over the place.

Page recycling in drivers is fine, but should operate on pages that the driver owns.

Messing with skb->head does not seem needed for performance, since
skb->head can be regular slab-allocated memory.
David Miller - Dec. 8, 2018, 8:10 p.m.
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Sat, 8 Dec 2018 12:36:10 +0100

> The annoying part is actually that depending on the kernel config
> options CONFIG_XFRM, CONFIG_NF_CONNTRACK and CONFIG_BRIDGE_NETFILTER,
> whether there is a cache-line split, where mem_info gets moved into the
> next cacheline.

Note that Florian Westphal's work (trying to help MP-TCP) would
eliminate this variability.
Jesper Dangaard Brouer - Dec. 8, 2018, 8:11 p.m.
On Sat, 8 Dec 2018 11:26:56 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On 12/08/2018 06:57 AM, Ilias Apalodimas wrote:
> > Hi Eric,   
> >>>> This patch is changing struct sk_buff, and is thus per-definition
> >>>> controversial.
> >>>>
> >>>> Place a new member 'mem_info' of type struct xdp_mem_info, just after
> >>>> members (flags) head_frag and pfmemalloc, And not in between
> >>>> headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
> >>>> Copying mem_info during skb_clone() is required.  This makes sure that
> >>>> pages are correctly freed or recycled during the altered
> >>>> skb_free_head() invocation.  
> >>>
> >>> I read this to mean that this 'info' isn't accessed/needed until skb
> >>> is freed.  Any reason its not added at the end?
> >>>
> >>> This would avoid moving other fields that are probably accessed
> >>> more frequently during processing.
> >>>  
> >>
> >> But I do not get why the patch is needed.
> >>
> >> Adding extra cost for each skb destruction is costly.
> >>
> >> I though XDP was all about _not_ having skbs.  
> > 
> > You hit the only point i don't personally like in the code, xdp info in the 
> > skb. Considering the benefits i am convinced this is ok and here's why:
> >   
> >> Please let's do not slow down the non XDP stack only to make XDP more appealing.  
> > 
> > We are not trying to do that, on the contrary. The patchset has nothing towards
> > speeding XDP and slowing down anything else. The patchset speeds up the 
> > mvneta driver on the default network stack. The only change that was needed was
> > to adapt the driver to using the page_pool API. The speed improvements we are
> > seeing on specific workloads (i.e 256b < packet < 400b) are almost 3x.
> > 
> > Lots of high speed drivers are doing similar recycling tricks themselves (and
> > there's no common code, everyone is doing something similar though). All we are
> > trying to do is provide a unified API to make that easier for the rest. Another
> > advantage is that if the some drivers switch to the API, adding XDP
> > functionality on them is pretty trivial.
> > 
> > Moreover our tests are only performed on systems without or with SMMU disabled.
> > On a platform i have access to, enabling and disabling the SMMU has some
> > performance impact. By keeping the buffers mapped we believe this impact
> > will be substantially less (i'll come back with results once i have them on
> > this).
> > 
> > I do understand your point, but the potential advantages on my head
> > overwight that by a lot.
> > 
> > I got other concerns on the patchset though. Like how much memory is it 'ok' to
> > keep mapped keeping in mind we are using the streaming DMA API. Are we going to
> > affect anyone else negatively by doing so ?
> >   
>  
> I want to make sure you guys thought about splice() stuff, and
> skb_try_coalesce(), and GRO, and skb cloning, and ...

Thanks for the pointers. To Ilias: we need to double check the
skb_try_coalesce() code path, as it does look like we don't handle it
correctly.

> I do not see how an essential property of page fragments would be
> attached to sk_buff, without adding extra code all over the places.
> 
> Page recycling in drivers is fine, but should operate on pages that
> the driver owns.
> 

I think we have addressed this.  We are only recycling pages that the
driver owns.  In the case of fragments, we release the DMA mapping and
don't recycle the page; instead the normal code path is taken (which is
what is missing in the skb_try_coalesce() case).
Ilias Apalodimas - Dec. 8, 2018, 8:14 p.m.
On Sat, Dec 08, 2018 at 09:11:53PM +0100, Jesper Dangaard Brouer wrote:
> >  
> > I want to make sure you guys thought about splice() stuff, and
> > skb_try_coalesce(), and GRO, and skb cloning, and ...
> 
> Thanks for the pointers. To Ilias, we need to double check skb_try_coalesce()
> code path, as it does look like we don't handle this correctly.
> 

Noted. We did check skb cloning, but indeed not this one.

/Ilias
David Miller - Dec. 8, 2018, 8:21 p.m.
From: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Date: Sat, 8 Dec 2018 16:57:28 +0200

> The patchset speeds up the mvneta driver on the default network
> stack. The only change that was needed was to adapt the driver to
> using the page_pool API. The speed improvements we are seeing on
> specific workloads (i.e 256b < packet < 400b) are almost 3x.
> 
> Lots of high speed drivers are doing similar recycling tricks themselves (and
> there's no common code, everyone is doing something similar though). All we are
> trying to do is provide a unified API to make that easier for the rest. Another
> advantage is that if the some drivers switch to the API, adding XDP
> functionality on them is pretty trivial.

Yeah this is a very important point moving forward.

Jesse Brandeberg brought the following up to me at LPC and I'd like to
develop it further.

Right now we tell driver authors to write a new driver as SKB based,
and once they've done all of that work we tell them to basically
shoe-horn XDP support into that somewhat different framework.

Instead, the model should be the other way around, because with a raw
meta-data free set of data buffers we can always construct an SKB or
pass it to XDP.

So drivers should be targeting some raw data buffer kind of interface
which takes care of all of this stuff.  If the buffers get wrapped
into an SKB and get pushed into the traditional networking stack, the
driver shouldn't know or care.  Likewise if it ends up being processed
with XDP, it should not need to know or care.

All of those details should be behind a common layer.  Then we can
control:

1) Buffer handling, recycling, "fast paths"

2) Statistics

3) XDP feature sets

We can consolidate behavior and semantics across all of the drivers
if we do this.  No more talk about "supporting all XDP features",
and the inconsistencies we have because of that.

The whole common statistics discussion could be resolved with this
common layer as well.

We'd be able to control and properly optimize everything.
Ilias Apalodimas - Dec. 8, 2018, 8:29 p.m.
On Sat, Dec 08, 2018 at 12:21:10PM -0800, David Miller wrote:
> From: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> Date: Sat, 8 Dec 2018 16:57:28 +0200
> 
> > The patchset speeds up the mvneta driver on the default network
> > stack. The only change that was needed was to adapt the driver to
> > using the page_pool API. The speed improvements we are seeing on
> > specific workloads (i.e 256b < packet < 400b) are almost 3x.
> > 
> > Lots of high speed drivers are doing similar recycling tricks themselves (and
> > there's no common code, everyone is doing something similar though). All we are
> > trying to do is provide a unified API to make that easier for the rest. Another
> > advantage is that if the some drivers switch to the API, adding XDP
> > functionality on them is pretty trivial.
> 
> Yeah this is a very important point moving forward.
> 
> Jesse Brandeberg brought the following up to me at LPC and I'd like to
> develop it further.
> 
> Right now we tell driver authors to write a new driver as SKB based,
> and once they've done all of that work we tell them to basically
> shoe-horn XDP support into that somewhat different framework.
> 
> Instead, the model should be the other way around, because with a raw
> meta-data free set of data buffers we can always construct an SKB or
> pass it to XDP.

Yeah, exactly, and it gets even worse. If the driver writer doesn't go
through the 'proper' path, i.e. allocate buffers and use build_skb, you
end up having to rewrite dma/memory management for the normal stack. So
it's more than 'shoe-horning' XDP in, it's re-writing and re-testing the
whole thing. The API also offers dma mapping capabilities (configurable),
so you remove potential nasty bugs there as well.

> 
> So drivers should be targetting some raw data buffer kind of interface
> which takes care of all of this stuff.  If the buffers get wrapped
> into an SKB and get pushed into the traditional networking stack, the
> driver shouldn't know or care.  Likewise if it ends up being processed
> with XDP, it should not need to know or care.
> 
> All of those details should be behind a common layer.  Then we can
> control:
> 
> 1) Buffer handling, recycling, "fast paths"
> 
> 2) Statistics
> 
> 3) XDP feature sets
> 
> We can consolidate behavior and semantics across all of the drivers
> if we do this.  No more talk about "supporting all XDP features",
> and the inconsistencies we have because of that.
> 
> The whole common statistics discussion could be resolved with this
> common layer as well.
> 
> We'd be able to control and properly optimize everything.


/Ilias
Willy Tarreau - Dec. 8, 2018, 9:06 p.m.
On Sat, Dec 08, 2018 at 10:14:47PM +0200, Ilias Apalodimas wrote:
> On Sat, Dec 08, 2018 at 09:11:53PM +0100, Jesper Dangaard Brouer wrote:
> > >  
> > > I want to make sure you guys thought about splice() stuff, and
> > > skb_try_coalesce(), and GRO, and skb cloning, and ...
> > 
> > Thanks for the pointers. To Ilias, we need to double check skb_try_coalesce()
> > code path, as it does look like we don't handle this correctly.
> > 
> 
> Noted. We did check skb cloning, but not this one indeed

I'll try to run some tests on my clearfog later, but for now I'm still
100% busy until haproxy 1.9 gets released.

Cheers,
Willy
Eric Dumazet - Dec. 8, 2018, 10:36 p.m.
On 12/08/2018 12:14 PM, Ilias Apalodimas wrote:
> On Sat, Dec 08, 2018 at 09:11:53PM +0100, Jesper Dangaard Brouer wrote:
>>>  
>>> I want to make sure you guys thought about splice() stuff, and
>>> skb_try_coalesce(), and GRO, and skb cloning, and ...
>>
>> Thanks for the pointers. To Ilias, we need to double check skb_try_coalesce()
>> code path, as it does look like we don't handle this correctly.
>>
> 
> Noted. We did check skb cloning, but not this one indeed

There is also TCP RX zero copy to consider ( tcp_zerocopy_receive() )
but more generally all paths that can grab pages from skbs.
Ilias Apalodimas - Dec. 10, 2018, 7:54 a.m.
Hi Willy,
> 
> I'll try to run some tests on my clearfog later, but for now I'm still
> 100% busy until haproxy 1.9 gets released.

Great. Keep in mind there's multiple hardware this driver sits on. If
your clearfog hardware uses the software buffer manager, then you can
test. We haven't done any changes to the 'hardware buffer manager'
portion of the code though.

Test-wise, you should see similar numbers for 64b-256b packets, since
the driver internally recycles buffers for packets of that size. The
tests we did with 256b-512b packets showed significant increases.


Thanks!
/Ilias
Saeed Mahameed - Dec. 10, 2018, 9:51 a.m.
On Sat, Dec 8, 2018 at 12:21 PM David Miller <davem@davemloft.net> wrote:
>
> From: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> Date: Sat, 8 Dec 2018 16:57:28 +0200
>
> > The patchset speeds up the mvneta driver on the default network
> > stack. The only change that was needed was to adapt the driver to
> > using the page_pool API. The speed improvements we are seeing on
> > specific workloads (i.e 256b < packet < 400b) are almost 3x.
> >
> > Lots of high speed drivers are doing similar recycling tricks themselves (and
> > there's no common code, everyone is doing something similar though). All we are
> > trying to do is provide a unified API to make that easier for the rest. Another
> > advantage is that if the some drivers switch to the API, adding XDP
> > functionality on them is pretty trivial.
>
> Yeah this is a very important point moving forward.
>
> Jesse Brandeberg brought the following up to me at LPC and I'd like to
> develop it further.
>
> Right now we tell driver authors to write a new driver as SKB based,
> and once they've done all of that work we tell them to basically
> shoe-horn XDP support into that somewhat different framework.
>
> Instead, the model should be the other way around, because with a raw
> meta-data free set of data buffers we can always construct an SKB or
> pass it to XDP.
>

Yep. One concern is how to integrate the BTF approach to let the stack
translate hw-specific metadata from that raw buffer when populating
stack-generated skbs. We will need a middle format that both the driver
and the stack can understand; userspace xdp programs will get the format
from the driver and then compile themselves against it, but we can't do
this in the kernel!

> So drivers should be targetting some raw data buffer kind of interface
> which takes care of all of this stuff.  If the buffers get wrapped
> into an SKB and get pushed into the traditional networking stack, the
> driver shouldn't know or care.  Likewise if it ends up being processed
> with XDP, it should not need to know or care.
>

We need XDP-path exclusive features, like this patch's page pool, to
have an advantage over the conventional way.
Otherwise it is always faster and more appealing to build skbs at the
driver level.

We also need to consider legacy hardware that might not be able to
fully support such raw buffers (for example, the ability to preserve
headroom per packet).

> All of those details should be behind a common layer.  Then we can
> control:
>
> 1) Buffer handling, recycling, "fast paths"
>
> 2) Statistics
>
> 3) XDP feature sets
>
> We can consolidate behavior and semantics across all of the drivers
> if we do this.  No more talk about "supporting all XDP features",
> and the inconsistencies we have because of that.
>
> The whole common statistics discussion could be resolved with this
> common layer as well.
>
> We'd be able to control and properly optimize everything.

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7dcfb5591dc3..95dac0ba6947 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -40,6 +40,7 @@ 
 #include <linux/in6.h>
 #include <linux/if_packet.h>
 #include <net/flow.h>
+#include <net/xdp.h>
 
 /* The interface for checksum offload between the stack and networking drivers
  * is as follows...
@@ -744,6 +745,10 @@  struct sk_buff {
 				head_frag:1,
 				xmit_more:1,
 				pfmemalloc:1;
+	/* TODO: Future idea, extend mem_info with __u8 flags, and
+	 * move bits head_frag and pfmemalloc there.
+	 */
+	struct xdp_mem_info	mem_info;
 
 	/* fields enclosed in headers_start/headers_end are copied
 	 * using a single memcpy() in __copy_skb_header()
@@ -827,7 +832,6 @@  struct sk_buff {
 #ifdef CONFIG_NETWORK_SECMARK
 	__u32		secmark;
 #endif
-
 	union {
 		__u32		mark;
 		__u32		reserved_tailroom;
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 5c33b9e0efab..4a0ca7a3d5e5 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -128,6 +128,7 @@  struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 void xdp_return_frame(struct xdp_frame *xdpf);
 void xdp_return_frame_rx_napi(struct xdp_frame *xdpf);
 void xdp_return_buff(struct xdp_buff *xdp);
+void xdp_return_skb_page(void *data, struct xdp_mem_info *mem_info);
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 		     struct net_device *dev, u32 queue_index);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b4ee5c8b928f..71aca186e44c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -70,6 +70,7 @@ 
 #include <net/checksum.h>
 #include <net/ip6_checksum.h>
 #include <net/xfrm.h>
+#include <net/page_pool.h>
 
 #include <linux/uaccess.h>
 #include <trace/events/skb.h>
@@ -544,6 +545,11 @@  static void skb_free_head(struct sk_buff *skb)
 {
 	unsigned char *head = skb->head;
 
+	if (skb->mem_info.type == MEM_TYPE_PAGE_POOL) {
+		xdp_return_skb_page(head, &skb->mem_info);
+		return;
+	}
+
 	if (skb->head_frag)
 		skb_free_frag(head);
 	else
@@ -859,6 +865,7 @@  static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
 	n->nohdr = 0;
 	n->peeked = 0;
 	C(pfmemalloc);
+	C(mem_info);
 	n->destructor = NULL;
 	C(tail);
 	C(end);
diff --git a/net/core/xdp.c b/net/core/xdp.c
index e79526314864..1703be4c2611 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -379,6 +379,12 @@  void xdp_return_buff(struct xdp_buff *xdp)
 }
 EXPORT_SYMBOL_GPL(xdp_return_buff);
 
+void xdp_return_skb_page(void *data, struct xdp_mem_info *mem_info)
+{
+	__xdp_return(data, mem_info, false, 0);
+}
+EXPORT_SYMBOL(xdp_return_skb_page);
+
 int xdp_attachment_query(struct xdp_attachment_info *info,
 			 struct netdev_bpf *bpf)
 {