muen/linux.git
5 years agobpf: Add bpf_current_task_under_cgroup helper
Sargun Dhillon [Fri, 12 Aug 2016 15:56:52 +0000 (08:56 -0700)]
bpf: Add bpf_current_task_under_cgroup helper

This adds a bpf helper that's similar to the skb_in_cgroup helper to check
whether the probe is currently executing in the context of a specific
subset of the cgroupsv2 hierarchy. It does this based on membership test
for a cgroup arraymap. It is invalid to call this in an interrupt, and
it'll return an error. The helper is primarily to be used in debugging
activities for containers, where you may have multiple programs running in
a given top-level "container".

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agocgroup: Add task_under_cgroup_hierarchy cgroup inline function to headers
Sargun Dhillon [Fri, 12 Aug 2016 15:56:40 +0000 (08:56 -0700)]
cgroup: Add task_under_cgroup_hierarchy cgroup inline function to headers

This commit adds an inline function to cgroup.h to check whether a given
task is under a given cgroup hierarchy. This is to avoid having to put
ifdefs in .c files to gate access to cgroups. When cgroups are disabled
this always returns true.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge tag 'batadv-next-for-davem-20160812' of git://git.open-mesh.org/linux-merge
David S. Miller [Sat, 13 Aug 2016 03:55:41 +0000 (20:55 -0700)]
Merge tag 'batadv-next-for-davem-20160812' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This feature patchset includes the following changes (mostly
chronological order):

 - bump version strings, by Simon Wunderlich

 - kerneldoc clean up, by Sven Eckelmann

 - enable RTNL automatic loading and according documentation
   changes, by Sven Eckelmann (2 patches)

 - fix/improve interface removal and associated locking, by
   Sven Eckelmann (3 patches)

 - clean up unused variables, by Linus Luessing

 - implement Gateway selection code for B.A.T.M.A.N. V by
   Antonio Quartulli (4 patches)

 - rewrite TQ comparison by Markus Pargmann

 - fix Cocinelle warnings on bool vs integers (by Fenguang Wu/Intels
   kbuild test robot) and bitwise arithmetic operations (by Linus
   Luessing)

 - rewrite packet creation for forwarding for readability and to avoid
   reference count mistakes, by Linus Luessing

 - use kmem_cache for translation table, which results in more efficient
   storing of translation table entries, by Sven Eckelmann

 - rewrite/clarify reference handling for send_skb_unicast, by Sven
   Eckelmann

 - fix debug messages when updating routes, by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'sfc-SFN8000-support-improvements'
David S. Miller [Sat, 13 Aug 2016 03:42:20 +0000 (20:42 -0700)]
Merge branch 'sfc-SFN8000-support-improvements'

Bert Kenward says:

====================
sfc: SFN8000 support improvements

This series improves support for the recently released SFN8000 series
of adapters. Specifically, it retrieves interrupt moderation timer
settings directly from the adapter and uses those settings. It also
uses a new event queue initialisation interface, allowing specification
of a performance objective rather than enabling individual flags.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agosfc: get timer configuration from adapter
Bert Kenward [Thu, 11 Aug 2016 12:02:36 +0000 (13:02 +0100)]
sfc: get timer configuration from adapter

On SFN8000 series adapters the MC provides a method to get the timer
quantum and the maximum timer setting. We revert to the old values if the
new call is unavailable.

Signed-off-by: Bert Kenward <bkenward@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agosfc: set interrupt moderation via MCDI
Bert Kenward [Thu, 11 Aug 2016 12:02:09 +0000 (13:02 +0100)]
sfc: set interrupt moderation via MCDI

SFN8000-series NICs require a new method of setting interrupt moderation,
via MCDI. This is indicated by a workaround flag. This new MCDI command
takes an explicit time value rather than a number of ticks. It therefore
makes sense to also store the moderation values in terms of time, since
that is what the ethtool interface is interested in.

Signed-off-by: Bert Kenward <bkenward@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agosfc: use new performance based event queue init
Bert Kenward [Thu, 11 Aug 2016 12:01:54 +0000 (13:01 +0100)]
sfc: use new performance based event queue init

Rather than explicitly specifying flags we can now specify a desired
performance target to the firmware, ie higher throughput or lower latency.
For now we use the default "auto" configuration.

Signed-off-by: Bert Kenward <bkenward@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agosfc: retrieve second word of datapath capabilities
Bert Kenward [Thu, 11 Aug 2016 12:01:35 +0000 (13:01 +0100)]
sfc: retrieve second word of datapath capabilities

Signed-off-by: Bert Kenward <bkenward@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agosfc: allow asynchronous MCDI without completion function
Bert Kenward [Thu, 11 Aug 2016 12:01:21 +0000 (13:01 +0100)]
sfc: allow asynchronous MCDI without completion function

Signed-off-by: Bert Kenward <bkenward@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agosfc: update MCDI protocol headers
Bert Kenward [Thu, 11 Aug 2016 12:01:01 +0000 (13:01 +0100)]
sfc: update MCDI protocol headers

Signed-off-by: Bert Kenward <bkenward@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: mediatek: enhance the locking using the lightweight ones
Sean Wang [Thu, 11 Aug 2016 09:51:00 +0000 (17:51 +0800)]
net: ethernet: mediatek: enhance the locking using the lightweight ones

Since these critical sections protected by page_lock are all entered
from the user context or bottom half context, they can be replaced
with the spin_lock() or spin_lock_bh instead of spin_lock_irqsave().

Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Acked-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ena: Add a driver for Amazon Elastic Network Adapters (ENA)
Netanel Belgazal [Wed, 10 Aug 2016 11:03:22 +0000 (14:03 +0300)]
net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)

This is a driver for the ENA family of networking devices.

Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'xilinx-gmiitorgmii-converter'
David S. Miller [Fri, 12 Aug 2016 23:57:20 +0000 (16:57 -0700)]
Merge branch 'xilinx-gmiitorgmii-converter'

Kedareswara rao Appana says:

====================
net: phy: Add xilinx gmiitorgmii converter support

The Gigabit Media Independent Interface (GMII) to Reduced Gigabit Media
Independent Interface (RGMII) core provides the RGMII between RGMII-compliant
Ethernet physical media devices (PHY) and the Gigabit Ethernet controller.
This core can be used in all three modes of operation(10/100/1000 Mb/s).
The Management Data Input/Output (MDIO) interface is used to configure the
Speed of operation. This core can switch dynamically between the three
Different speed modes by configuring the conveter register through mdio write.

The conveter sits b/w the MAC and external phy like below

MACB <==> GMII2RGMII <==> RGMII_PHY

        MDIO    <========> GMII2RGMII
MCAB <=======>
                <========> RGMII

Using MAC MDIO bus we can access both the converter and the external PHY.
We need to program the line speed of the converter during run time based
On the external phy negotiated speed.

This patch series does the below
---> Add mask for Control register 10Mbps speed.
---> Add support for xilinx gmiitorgmii converter.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: Add gmiitorgmii converter support
Appana Durga Kedareswara Rao [Wed, 10 Aug 2016 05:50:08 +0000 (11:20 +0530)]
net: phy: Add gmiitorgmii converter support

This patch adds support for gmiitorgmii converter.

The GMII to RGMII IP core provides the Reduced Gigabit Media
Independent Interface (RGMII) between Ethernet physical media
Devices and the Gigabit Ethernet controller. This core can
Switch dynamically between the three different speed modes of
Operation by configuring the converter register through mdio write.

MDIO interface is used to set operating speed of Ethernet MAC.

This converter sits between the MAC and the external phy
MAC <==> GMII2RGMII <==> RGMII_PHY

Signed-off-by: Kedareswara rao Appana <appanad@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoDocumentation: DT: net: Add Xilinx gmiitorgmii converter device tree binding document...
Appana Durga Kedareswara Rao [Wed, 10 Aug 2016 05:50:07 +0000 (11:20 +0530)]
Documentation: DT: net: Add Xilinx gmiitorgmii converter device tree binding documentation

Device-tree binding documentation for xilinx gmiitorgmii converter.

Signed-off-by: Kedareswara rao Appana <appanad@xilinx.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: Add mask for Control register 10Mbps speed
Appana Durga Kedareswara Rao [Wed, 10 Aug 2016 05:50:06 +0000 (11:20 +0530)]
net: Add mask for Control register 10Mbps speed

This patch adds mask for the Control register
10Mbps speed.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Kedareswara rao Appana <appanad@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: renesas: sh_eth: use new api ethtool_{get|set}_link_ksettings
Philippe Reynes [Tue, 9 Aug 2016 22:04:49 +0000 (00:04 +0200)]
net: ethernet: renesas: sh_eth: use new api ethtool_{get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Tested-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: renesas: sh_eth: use phydev from struct net_device
Philippe Reynes [Tue, 9 Aug 2016 22:04:48 +0000 (00:04 +0200)]
net: ethernet: renesas: sh_eth: use phydev from struct net_device

The private structure contain a pointer to phydev, but the structure
net_device already contain such pointer. So we can remove the pointer
phy_dev in the private structure, and update the driver to use the
one contained in struct net_device.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Tested-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agosamples/bpf: fix bpf_perf_event_output prototype
Adam Barth [Wed, 10 Aug 2016 16:45:39 +0000 (09:45 -0700)]
samples/bpf: fix bpf_perf_event_output prototype

The commit 555c8a8623a3 ("bpf: avoid stack copy and use skb ctx for event output")
started using 20 of initially reserved upper 32-bits of 'flags' argument
in bpf_perf_event_output(). Adjust corresponding prototype in samples/bpf/bpf_helpers.h

Signed-off-by: Adam Barth <arb@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: macb: Add 64 bit addressing support for GEM
Harini Katakam [Tue, 9 Aug 2016 07:45:53 +0000 (13:15 +0530)]
net: macb: Add 64 bit addressing support for GEM

This patch adds support for 64 bit addressing and BDs.
-> Enable 64 bit addressing in DMACFG register.
-> Set DMA mask when design config register shows support for 64 bit addr.
-> Add new BD words for higher address when 64 bit DMA support is present.
-> Add and update TBQPH and RBQPH for MSB of BD pointers.
-> Change extraction and updation of buffer addresses to use
64 bit address.
-> In gem_rx extract address in one place insted of two and use a
separate flag for RXUSED.

Signed-off-by: Harini Katakam <harinik@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoqed*: Add support for ethtool link_ksettings callbacks.
Sudarsana Reddy Kalluru [Tue, 9 Aug 2016 07:51:23 +0000 (03:51 -0400)]
qed*: Add support for ethtool link_ksettings callbacks.

This patch adds the driver implementation for ethtool link_ksettings
callbacks. qed driver now defines/uses the qed specific masks for
representing link capability values. qede driver maps these values to
to new link modes defined by the kernel implementation of link_ksettings.

Please consider applying this to 'net-next' branch.

Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'cpsw-refactor'
David S. Miller [Thu, 11 Aug 2016 00:27:41 +0000 (17:27 -0700)]
Merge branch 'cpsw-refactor'

Ivan Khoronzhuk says:

====================
net: ethernet: ti: cpsw: split driver data and per ndev data

In dual_emac mode the driver can handle 2 network devices. Each of them can use
its own private data and common data/resources. This patchset splits common driver
data/resources and private per net device data.
It leads to:
- reduce memory usage
- increase code readability
- allows add a bunch of simplification
- create prerequisites to add multi-channel support,
  when channels are shared between net devices

Doesn't have bad impact on performance.
v2: https://lkml.org/lkml/2016/8/6/108

Since v2:
- removed patch:
  net: ethernet: ti: cpsw: fix int dbg message
- replaced patch:
  "net: ethernet: ti: cpsw: remove redundant check in napi poll"
  on "net: ethernet: ti: cpsw: remove intr dbg msg from poll handlers"
- removed macro "cpsw_get_slave_ndev"
- corrected some commits

Since v1:
- added several patch improvements
- avoided variable reordering in structures
- removed static variable for common function
- split big patch on several patches:
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: ti: cpsw: move ale, cpts and drivers params under cpsw_common
Ivan Khoronzhuk [Tue, 9 Aug 2016 23:22:44 +0000 (02:22 +0300)]
net: ethernet: ti: cpsw: move ale, cpts and drivers params under cpsw_common

The ale, cpts, version, rx_packet_max, bus_freq, interrupt pacing
parameters are common per net device that uses the same h/w. So,
move them to common driver structure.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Reviewed-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: ti: cpsw: move napi struct to cpsw_common
Ivan Khoronzhuk [Tue, 9 Aug 2016 23:22:43 +0000 (02:22 +0300)]
net: ethernet: ti: cpsw: move napi struct to cpsw_common

The napi structs are common for both net devices in dual_emac
mode, In order to not hold duplicate links to them, move to
cpsw_common.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Reviewed-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: ti: cpsw: move platform data and slaves info to cpsw_common
Ivan Khoronzhuk [Tue, 9 Aug 2016 23:22:42 +0000 (02:22 +0300)]
net: ethernet: ti: cpsw: move platform data and slaves info to cpsw_common

These data are common for net devs in dual_emac mode. No need to hold
it for every priv instance, so move them under cpsw_common.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Reviewed-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet; ethernet: ti: cpsw: move irq stuff under cpsw_common
Ivan Khoronzhuk [Tue, 9 Aug 2016 23:22:41 +0000 (02:22 +0300)]
net; ethernet: ti: cpsw: move irq stuff under cpsw_common

The irq data are common for net devs in dual_emac mode. So no need to
hold these data in every priv struct, move them under cpsw_common.
Also delete irq_num var, as after optimization it's not needed.
Correct number of irqs to 2, as anyway, driver is using only 2,
at least for now.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Reviewed-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: ti: cpsw: move cpdma resources to cpsw_common
Ivan Khoronzhuk [Tue, 9 Aug 2016 23:22:40 +0000 (02:22 +0300)]
net: ethernet: ti: cpsw: move cpdma resources to cpsw_common

Every net device private struct holds links to shared cpdma resources.
No need to save and every time synchronize these resources per net dev.
So, move it to common driver struct.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Reviewed-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: ti: cpsw: move links on h/w registers to cpsw_common
Ivan Khoronzhuk [Tue, 9 Aug 2016 23:22:39 +0000 (02:22 +0300)]
net: ethernet: ti: cpsw: move links on h/w registers to cpsw_common

The pointers on h/w registers are common for every cpsw_private
instance, so no need to hold them for every ndev.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Reviewed-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: ti: cpsw: replace pdev on dev
Ivan Khoronzhuk [Tue, 9 Aug 2016 23:22:38 +0000 (02:22 +0300)]
net: ethernet: ti: cpsw: replace pdev on dev

No need to hold pdev link when only dev is needed.
This allows to simplify a bunch of cpsw->pdev->dev now and farther.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Reviewed-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: ti: cpsw: create common struct to hold shared driver data
Ivan Khoronzhuk [Tue, 9 Aug 2016 23:22:37 +0000 (02:22 +0300)]
net: ethernet: ti: cpsw: create common struct to hold shared driver data

This patch simply create holder for common data and as a start moves
pdev var to it.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Reviewed-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: ti: cpsw: don't check slave num in runtime
Ivan Khoronzhuk [Tue, 9 Aug 2016 23:22:36 +0000 (02:22 +0300)]
net: ethernet: ti: cpsw: don't check slave num in runtime

No need to check const slave num in runtime for every packet,
and ndev for slaves w/o ndev is anyway NULL. So remove redundant
check and macro.

Reviewed-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: ti: cpsw: remove clk var from priv
Ivan Khoronzhuk [Tue, 9 Aug 2016 23:22:35 +0000 (02:22 +0300)]
net: ethernet: ti: cpsw: remove clk var from priv

There is no need to hold link to clk, it's used only once
while probe.

Reviewed-by: Mugunthan V N <mugunthanvnm@ti.com>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: ti: cpsw: remove priv from cpsw_get_slave_port() parameters list
Ivan Khoronzhuk [Tue, 9 Aug 2016 23:22:34 +0000 (02:22 +0300)]
net: ethernet: ti: cpsw: remove priv from cpsw_get_slave_port() parameters list

There is no need in priv here.

Reviewed-by: Mugunthan V N <mugunthanvnm@ti.com>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: ti: cpsw: remove intr dbg msg from poll handlers
Ivan Khoronzhuk [Tue, 9 Aug 2016 23:22:33 +0000 (02:22 +0300)]
net: ethernet: ti: cpsw: remove intr dbg msg from poll handlers

At poll handler no possibility to figure out which network device is
handling packets, as cpdma channels are common for both network
devices in dual_emac mode. Currently, the messages are printed only
for one device, in fact, there is two. This print msg is incorrect
and seems is not very useful, so drop it from poll handler.

Reviewed-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: ti: cpsw: simplify submit routine
Ivan Khoronzhuk [Tue, 9 Aug 2016 23:22:32 +0000 (02:22 +0300)]
net: ethernet: ti: cpsw: simplify submit routine

As second net dev is created only in case of dual_emac mode, port
number can be figured out in simpler way. Also no need to pass
redundant ndev struct.

Reviewed-by: Mugunthan V N <mugunthanvnm@ti.com>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agorps: Inspect PPTP encapsulated by GRE to get flow hash
Gao Feng [Tue, 9 Aug 2016 04:38:24 +0000 (12:38 +0800)]
rps: Inspect PPTP encapsulated by GRE to get flow hash

The PPTP is encapsulated by GRE header with that GRE_VERSION bits
must contain one. But current GRE RPS needs the GRE_VERSION must be
zero. So RPS does not work for PPTP traffic.

In my test environment, there are four MIPS cores, and all traffic
are passed through by PPTP. As a result, only one core is 100% busy
while other three cores are very idle. After this patch, the usage
of four cores are balanced well.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Reviewed-by: Philip Prindeville <philipp@redfish-solutions.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'qdisc-hashtable'
David S. Miller [Thu, 11 Aug 2016 00:19:07 +0000 (17:19 -0700)]
Merge branch 'qdisc-hashtable'

Jiri Kosina says:

====================
Convert qdisc linked list into a hashtable

This is a respin of the v6 of the original patch [1], split into two-patch
series as requested by davem; first patch fixes all symbol conflicts
that'd happen once netdevice.h starts to include hashtable.h, the second
one performs the actual switch to hashtable.

I've preserved Cong's Reviewed-by:, as code-wise this series is identical
to the original v6 of the patch.

[1] lkml.kernel.org/r/alpine.LNX.2.00.1608011220580.22028@cbobk.fhfr.pm
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: convert qdisc linked list to hashtable
Jiri Kosina [Wed, 10 Aug 2016 09:05:15 +0000 (11:05 +0200)]
net: sched: convert qdisc linked list to hashtable

Convert the per-device linked list into a hashtable. The primary
motivation for this change is that currently, we're not tracking all the
qdiscs in hierarchy (e.g. excluding default qdiscs), as the lookup
performed over the linked list by qdisc_match_from_root() is rather
expensive.

The ultimate goal is to get rid of hidden qdiscs completely, which will
bring much more determinism in user experience.

Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: resolve symbol conflicts with generic hashtable.h
Jiri Kosina [Wed, 10 Aug 2016 09:03:35 +0000 (11:03 +0200)]
net: resolve symbol conflicts with generic hashtable.h

This is a preparatory patch for converting qdisc linked list into a
hashtable. As we'll need to include hashtable.h in netdevice.h, we first
have to make sure that this will not introduce symbol conflicts for any of
the netdevice.h users.

Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoravb: use proper names for suspend/resume functions
Niklas Söderlund [Wed, 10 Aug 2016 11:09:49 +0000 (13:09 +0200)]
ravb: use proper names for suspend/resume functions

The patch 'ravb: add sleep PM suspend/resume support' used incorrect
function names containing 'runtime' for the suspend and resume
functions.

Reported-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ipconfig: fix use after free
Uwe Kleine-König [Wed, 10 Aug 2016 09:44:17 +0000 (11:44 +0200)]
net: ipconfig: fix use after free

ic_close_devs() calls kfree() for all devices's ic_device. Since commit
2647cffb2bc6 ("net: ipconfig: Support using "delayed" DHCP replies")
the active device's ic_device is still used however to print the
ipconfig summary which results in an oops if the memory is already
changed. So delay freeing until after the autoconfig results are
reported.

Fixes: 2647cffb2bc6 ("net: ipconfig: Support using "delayed" DHCP replies")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoravb: add sleep PM suspend/resume support
Niklas Söderlund [Wed, 3 Aug 2016 13:56:47 +0000 (15:56 +0200)]
ravb: add sleep PM suspend/resume support

The interface would not function after the system had been woken up
after have been suspended (echo mem > /sys/power/state) cycle. The
reason for this is that all device registers have been reset to its
default values. This patch adds sleep suspend and resume functions that
detached the interface at suspend and restore the registers and reattach
the interface at resume.

Only the registers that are only configured at probe time needs to be
explicitly restored by the resume handler. All other registers are
reconfigured by either reopening the device in the resume handler (if
the device was running when the system was suspended) or when the
interface is opened by a user at a later time.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: dsa: b53: constify b53_io_ops structures
Julia Lawall [Tue, 9 Aug 2016 17:09:45 +0000 (19:09 +0200)]
net: dsa: b53: constify b53_io_ops structures

The b53_io_ops structures are never modified, so declare them as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: Remove fib_local variable
David Ahern [Tue, 9 Aug 2016 13:51:06 +0000 (06:51 -0700)]
net: Remove fib_local variable

After commit 0ddcf43d5d4a ("ipv4: FIB Local/MAIN table collapse")
fib_local is set but not used. Remove it.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoppp: build ifname using unit identifier for rtnl based devices
Guillaume Nault [Tue, 9 Aug 2016 13:12:26 +0000 (15:12 +0200)]
ppp: build ifname using unit identifier for rtnl based devices

Userspace programs generally need to know the name of the ppp devices
they create. Both ioctl and rtnl interfaces use the ppp<suffix> sheme
to name them. But although the suffix used by the ioctl interface can
be known by userspace (it's the PPP unit identifier returned by the
PPPIOCGUNIT ioctl), the one used by the rtnl is only known by the
kernel.

This patch brings more consistency between ioctl and rtnl based ppp
devices by generating device names using the PPP unit identifer as
suffix in both cases. This way, userspace can always infer the name of
the devices they create.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobatman-adv: Fix consistency of update route messages
Sven Eckelmann [Wed, 29 Jun 2016 21:45:57 +0000 (23:45 +0200)]
batman-adv: Fix consistency of update route messages

The debug messages of _batadv_update_route were printed before the actual
route change is done. At this point it is not really known which
curr_router will be replaced. Thus the messages could print the wrong
operation.

Printing the debug messages after the operation was done avoids this
problem.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
5 years agobatman-adv: Use bitwise instead of arithmetic operator for flags
Linus Lüssing [Mon, 11 Jul 2016 09:16:36 +0000 (11:16 +0200)]
batman-adv: Use bitwise instead of arithmetic operator for flags

This silences the following coccinelle warning:

"WARNING: sum of probable bitmasks, consider |"

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
5 years agobatman-adv: Remove orig_node reference handling from send_skb_unicast
Sven Eckelmann [Mon, 27 Jun 2016 06:15:42 +0000 (08:15 +0200)]
batman-adv: Remove orig_node reference handling from send_skb_unicast

The function batadv_send_skb_unicast is not acquiring a reference for an
orig_node nor removing it from any datastructure. It still reduces the
reference counter for an object which is still in the hands of the caller.

This is confusing and can lead in the future to problems in the reference
handling of the caller function.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
5 years agobatman-adv: use kmem_cache for translation table
Sven Eckelmann [Sat, 25 Jun 2016 14:44:06 +0000 (16:44 +0200)]
batman-adv: use kmem_cache for translation table

The translation table (global, local) is usually the part of batman-adv
which has the most dynamical allocated objects. Most of them
(tt_local_entry, tt_global_entry, tt_orig_list_entry, tt_change_node,
tt_req_node, tt_roam_node) are equally sized. So it makes sense to have
them allocated from a kmem_cache for each type.

This approach allowed a small wireless router (TP-Link TL-841NDv8; SLUB
allocator) to store 34% more translation table entries compared to the
current implementation.

[1] https://open-mesh.org/projects/batman-adv/wiki/Kmalloc-kmem-cache-tests

Reported-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
5 years agobatman-adv: Introduce forward packet creation helper
Linus Lüssing [Mon, 20 Jun 2016 19:39:54 +0000 (21:39 +0200)]
batman-adv: Introduce forward packet creation helper

This patch abstracts the forward packet creation into the new function
batadv_forw_packet_alloc().

The queue counting and interface reference counters are now handled
internally within batadv_forw_packet_alloc() and its
batadv_forw_packet_free() counterpart. This should reduce the risk of
having reference/queue counting bugs again and should increase
code readibility.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
5 years agobatman-adv: fix boolreturn.cocci warnings
kbuild test robot [Wed, 6 Jul 2016 02:49:29 +0000 (10:49 +0800)]
batman-adv: fix boolreturn.cocci warnings

net/batman-adv/bridge_loop_avoidance.c:1105:9-10: WARNING: return of 0/1 in function 'batadv_bla_process_claim' with return type bool

 Return statements in functions returning bool should use
 true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
5 years agobatman-adv: iv_ogm, Reduce code duplication
Markus Pargmann [Sun, 3 Jul 2016 09:07:14 +0000 (11:07 +0200)]
batman-adv: iv_ogm, Reduce code duplication

The difference between tq1 and tq2 are calculated the same way in two
separate functions.

This patch moves the common code to a separate function
'batadv_iv_ogm_neigh_diff' which handles everything necessary. The other
two functions can then handle errors and use the difference directly.

Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
[sven@narfation.org: rebased on current version, initialize return variable
in batadv_iv_ogm_neigh_diff, add kerneldoc, convert to bool return type]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
5 years agobatman-adv: disable sysfs knobs when GW-mode is not implemented
Antonio Quartulli [Sun, 3 Jul 2016 10:46:35 +0000 (12:46 +0200)]
batman-adv: disable sysfs knobs when GW-mode is not implemented

Now that the GW-mode code is algorithm specific, batman-adv expects the
routing algorithm to implement some APIs to make it work.

However, such APIs are not mandatory, therefore we might have algorithms
not providing them. In this case all the sysfs knobs related to GW-mode
should be deactivated to make sure that settings injected by the user
for this feature are rejected.

Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
5 years agobatman-adv: B.A.T.M.A.N. V - implement GW selection logic
Antonio Quartulli [Sun, 3 Jul 2016 10:46:34 +0000 (12:46 +0200)]
batman-adv: B.A.T.M.A.N. V - implement GW selection logic

Since the GW selection logic has been made routing protocol specific
it is now possible for B.A.T.M.A.N V to have its own mechanism by
providing the API implementation.

Implement the GW specific API in the B.A.T.M.A.N. V protocol in
order to provide a working GW selection mechanism.

Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
5 years agobatman-adv: make GW election code protocol specific
Antonio Quartulli [Sun, 3 Jul 2016 10:46:33 +0000 (12:46 +0200)]
batman-adv: make GW election code protocol specific

Each routing protocol may have its own specific logic about
gateway election which is potentially based on the metric being
used.

Create two GW specific API functions and move the current election
logic in the B.A.T.M.A.N. IV specific code.

Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
5 years agobatman-adv: make the GW selection class algorithm specific
Antonio Quartulli [Sun, 3 Jul 2016 10:46:32 +0000 (12:46 +0200)]
batman-adv: make the GW selection class algorithm specific

The B.A.T.M.A.N. V algorithm uses a different metric compared to its
predecessor and for this reason the logic used to compute the best
Gateway is also changed. This means that the GW selection class
fed to this logic has a semantics that depends on the algorithm being
used.

Make the parsing and printing routine of the GW selection class
routing algorithm specific. Each algorithm can now parse (and print)
this value independently.

If no API is provided by any algorithm, the default is to use the
current mechanism of considering such value like an integer between
1 and 255.

Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
5 years agobatman-adv: Remove unused primary_if and bat_priv variables
Linus Lüssing [Tue, 14 Jun 2016 20:56:50 +0000 (22:56 +0200)]
batman-adv: Remove unused primary_if and bat_priv variables

Fixes: ef0a937f7a14 ("batman-adv: consider outgoing interface in OGM sending")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
5 years agobatman-adv: Avoid sysfs name collision for netns moves
Sven Eckelmann [Mon, 13 Jun 2016 05:41:32 +0000 (07:41 +0200)]
batman-adv: Avoid sysfs name collision for netns moves

The kobject_put is only removing the sysfs entry and corresponding entries
when its reference counter becomes zero. This tends to lead to collisions
when a device is moved between two different network namespaces because
some of the sysfs files have to be removed first and then added again to
the already moved sysfs entry.

    WARNING: CPU: 0 PID: 290 at lib/kobject.c:240 kobject_add_internal+0x5ec/0x8a0
    kobject_add_internal failed for batman_adv with -EEXIST, don't try to register things with the same name in the same directory.

But the caller of kobject_put can already remove the sysfs entry before it
does the kobject_put. This removal is done even when the reference counter
is not yet zero and thus avoids the problem.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
5 years agobatman-adv: Revert "postpone sysfs removal when unregistering"
Sven Eckelmann [Mon, 13 Jun 2016 05:41:31 +0000 (07:41 +0200)]
batman-adv: Revert "postpone sysfs removal when unregistering"

Postponing the removal of the interface breaks the expected behavior of
NETDEV_UNREGISTER and NETDEV_PRE_TYPE_CHANGE. This is especially
problematic when an interface is removed and added in quick succession.

This reverts commit 5bc44dc8458c ("batman-adv: postpone sysfs removal when
unregistering").

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
5 years agobatman-adv: Modify mesh_iface outside sysfs context
Sven Eckelmann [Mon, 13 Jun 2016 05:41:30 +0000 (07:41 +0200)]
batman-adv: Modify mesh_iface outside sysfs context

The legacy sysfs interface to modify interfaces belonging to batman-adv
is run inside a region holding s_lock. And to add a net_device, it has
to also get the rtnl_lock. This is exactly the other way around than in
other virtual net_devices and conflicts with netdevice notifier which
executes inside rtnl_lock.

The inverted lock situation is currently solved by executing the removal
of netdevices via workqueue. The workqueue isn't executed inside
rtnl_lock and thus can independently get the s_lock and the rtnl_lock.

But this workaround fails when the netdevice notifier creates events in
quick succession and the earlier triggered removal of a net_device isn't
processed in the workqueue before the adding of the new netdevice (with
same name) event is issued.

Instead the legacy sysfs interface store events have to be enqueued in
a workqueue to loose the s_lock. The worker is then free to get the
required locks and the deadlock is avoided.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
5 years agobatman-adv: Use rtnl link in device creation example
Sven Eckelmann [Fri, 10 Jun 2016 21:00:56 +0000 (23:00 +0200)]
batman-adv: Use rtnl link in device creation example

The standard kernel API to add new virtual interfaces and attach other
interfaces to it is rtnl-link. batman-adv supports it since v3.10. This
functionality should be used instead of the legacy batman-adv-only sysfs
interface.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
5 years agobatman-adv: Define module rtnl link name
Sven Eckelmann [Fri, 10 Jun 2016 21:00:55 +0000 (23:00 +0200)]
batman-adv: Define module rtnl link name

The batman-adv module can automatically be loaded when operations over the
rtnl link are triggered. This requires only the correct rtnl link name in
the module header.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
5 years agobatman-adv: Document optional batadv_algo_ops
Sven Eckelmann [Tue, 7 Jun 2016 20:44:53 +0000 (22:44 +0200)]
batman-adv: Document optional batadv_algo_ops

Some operations in batadv_algo_ops are optional and marked as such in the
kerneldoc. But some of them miss the "(optional)" in their kerneldoc. These
have to also be marked to give an implementor of an algorithm the correct
background information without looking in the code calling these function
pointers.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
5 years agobatman-adv: Start new development cycle
Simon Wunderlich [Tue, 9 Aug 2016 05:50:46 +0000 (07:50 +0200)]
batman-adv: Start new development cycle

Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
5 years agoRDS: add __printf format attribute to error reporting functions
Nicolas Iooss [Fri, 5 Aug 2016 20:11:12 +0000 (22:11 +0200)]
RDS: add __printf format attribute to error reporting functions

This is helpful to detect at compile-time errors related to format
strings.

Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMicrosemi VSC 8531/41 PHY Driver
Raju Lakkaraju [Fri, 5 Aug 2016 12:24:21 +0000 (17:54 +0530)]
Microsemi VSC 8531/41 PHY Driver

Hello,

I added all review comments and re-sending for review.

>From a5017f5878a92d2acec86a6a29b1498c457cb73a Mon Sep 17 00:00:00 2001
From: Nagaraju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Date: Wed, 3 Aug 2016 18:28:24 +0530
Subject: [PATCH v2] net: phy: Add drivers for Microsemi PHYs

Signed-off-by: Nagaraju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/fsl: use of_property_read_bool
Julia Lawall [Fri, 5 Aug 2016 11:24:12 +0000 (13:24 +0200)]
net/fsl: use of_property_read_bool

Use of_property_read_bool to check for the existence of a property.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression e1,e2,x;
@@
- if (of_get_property(e1,e2,NULL))
- x = true;
- else
- x = false;
+ x = of_property_read_bool(e1,e2);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agohv_netvsc: Add handler for physical link speed change
Haiyang Zhang [Thu, 4 Aug 2016 17:42:15 +0000 (10:42 -0700)]
hv_netvsc: Add handler for physical link speed change

On Hyper-V host 2016 and later, VMs gets an event message of the physical
link speed when vSwitch is changed. This patch handles this message, so
the updated link speed can be reported by ethtool.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agohv_netvsc: Add query for initial physical link speed
Haiyang Zhang [Thu, 4 Aug 2016 17:42:14 +0000 (10:42 -0700)]
hv_netvsc: Add query for initial physical link speed

The physical link speed value will be reported by ethtool command.
The real speed is available from Windows 2016 host or later.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: ti: cpdma: remove used_desc counter
Grygorii Strashko [Thu, 4 Aug 2016 15:20:51 +0000 (18:20 +0300)]
net: ethernet: ti: cpdma: remove used_desc counter

The struct cpdma_desc_pool->used_desc field can be safely removed from
CPDMA driver (and hot patch) because used_descs counter is used just
for pool consistency check at CPDMA deinitialization and now this
check can be re-implemnted using gen_pool_size(pool->gen_pool) !=
gen_pool_avail(pool->gen_pool).
More over, this will allow to get rid of warnings in
cpdma_desc_pool_destro()-> WARN_ON(pool->used_desc) which may happen
because the used_descs is used unprotected, since CPDMA has been
switched to use genalloc, and may get wrong values on SMP.

Hence, remove used_desc from struct cpdma_desc_pool.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/sched/sch_hfsc.c: remove unused cl_myfadj
Michal Soltys [Tue, 2 Aug 2016 22:44:55 +0000 (00:44 +0200)]
net/sched/sch_hfsc.c: remove unused cl_myfadj

The code using this variable has been commented out in the past as it
was causing issues in upperlimited link-sharing scenarios.

Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/sched/sch_hfsc.c: keep fsc and virtual times in sync; fix an old bug
Michal Soltys [Tue, 2 Aug 2016 22:44:54 +0000 (00:44 +0200)]
net/sched/sch_hfsc.c: keep fsc and virtual times in sync; fix an old bug

This patch simplifies how we update fsc and calculate vt from it - while
keeping the expected functionality identical with how hfsc behaves
curently. It also fixes a certain issue introduced with
a very old patch.

The idea is, that instead of correcting cl_vt before fsc curve update
(rtsc_min) and correcting cl_vt after calculation (rtsc_y2x) to keep
cl_vt local to the current period - we can simply rely on virtual times
and curve values always being in sync - analogously to how rsc and usc
function, except that we use virtual time here.

Why hasn't it been done since the beginning this way ? The likely scenario
(basing on the code trying to correct curves whenever possible) was to
keep the virtual times as small as possible - as they have tendency to
"gallop" forward whenever their siblings and other fair sharing
subtrees are idling. On top of that, current code is subtly bugged, so
cumulative time (without any corrections) is always kept and used in
init_vf() when a new backlog period begins (using cl_cvtoff).

Is cumulative value safe ? Generally yes, though corner cases are easy
to create. For example consider:

1gbit interface
some 100kbit leaf, everything else idle

With current tick (64ns) 1s is 15625000 ticks, but the leaf is alone and
it's virtual time, so in reality it's 10000 times more. ITOW 38 bits are
needed to hold 1 second. 54 - 1 day, 59 - 1 month, 63 - 1 year (all
logarithms rounded up). It's getting somewhat dangerous, but also
requires setup excusing this kind of values not mentioning permanently
backlogged class for a year. In near most extreme case (10gbit, 10kbit
leaf), we have "enough" to hold ~13.6 days in 64 bits.

Well, the issue remains mostly theoretical and cl_cvtoff has been
working fine for all those years. Sensible configuration are de-facto
immune to this issue, and not so sensible can solve it with a cronjob
and its period inversely proportional to the insanity of such setup =)

Now let's explain the subtle bug mentioned earlier.

The issue is related to how offsets are kept and how we calculate
virtual times and update fair service curve(s). The issue itself is
subtle, but easy to observe with long m1 segments. It was introduced in
rather old patch:

Commit 99296150c7: "[NET_SCHED]: O(1) children vtoff adjustment
in HFSC scheduler"

(available in git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git)

Originally when a new backlog period was started, cl_vtoff of each
sibling was updated with cl_cvtmax from past period - naturally moving
all cl_vt to proper starting point. That patch adjusted it so cumulative
offset is kept in the parent, and there is no need for traversing the
list (as any subsequent child activation derives new vt from already
active sibling(s)).

But with this change, cl_vtoff (of each sibling) is no longer persistent
across the inactivity periods, as it's calculated from parent's
cl_cvtoff on a new backlog period, conflicting with the following curve
correction from the previous period:

if (cl->cl_virtual.x == vt) {
        cl->cl_virtual.x -= cl->cl_vtoff;
cl->cl_vtoff = 0;
}

This essentially tries to keep curve as if it was local to the period
and resets cl_vtoff (cumulative vt offset of the class) to 0 when
possible (read: when we have an intersection or if a new curve is below
the old one). But then it's recalculated from cl_cvtoff on next active
period.  Then rtsc_min() call preceding the above if() doesn't really
do what we expect it to do in such scenario - as it calculates the
minimum of corrected curve (from the previous backlog period) and the
new uncorrected curve (with offset derived from cl_cvtoff).

Example:

tc class add dev $ife parent 1:0 classid 1:1  hfsc ls m2 100mbit ul m2 100mbit
tc class add dev $ife parent 1:1 classid 1:10 hfsc ls m1 80mbit d 10s m2 20mbit
tc class add dev $ife parent 1:1 classid 1:11 hfsc ls m2 20mbit

start B, keep it backlogged, let it run 6s (30s worth of vt as A is idle)
pause B briefly to force cl_cvtoff update in parent (whole 1:1 going idle)
start A, let it run 10s
pause A briefly to force rtsc_min()

At this point we would expect A to continue at 20mbit after a brief
moment of 80mbit. But instead A will use 80mbit for full 10s again. It's
the effect of first correcting A (during 'start A'), and then - after
unpausing - calculating rtsc_min() from old corrected and new uncorrected
curve.

The patch fixes this bug and keepis vt and fsc in sync (virtual times
are cumulative, not local to the backlog period).

Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoqed: Use DEFINE_SPINLOCK() for spinlock
Wei Yongjun [Tue, 2 Aug 2016 13:49:00 +0000 (13:49 +0000)]
qed: Use DEFINE_SPINLOCK() for spinlock

spinlock can be initialized automatically with DEFINE_SPINLOCK()
rather than explicitly calling spin_lock_init().

Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Acked-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/multicast: should not send source list records when have filter mode change
Hangbin Liu [Tue, 2 Aug 2016 10:02:57 +0000 (18:02 +0800)]
net/multicast: should not send source list records when have filter mode change

Based on RFC3376 5.1 and RFC3810 6.1

   If the per-interface listening change that triggers the new report is
   a filter mode change, then the next [Robustness Variable] State
   Change Reports will include a Filter Mode Change Record.  This
   applies even if any number of source list changes occur in that
   period.

   Old State         New State         State Change Record Sent
   ---------         ---------         ------------------------
   INCLUDE (A)       EXCLUDE (B)       TO_EX (B)
   EXCLUDE (A)       INCLUDE (B)       TO_IN (B)

So we should not send source-list change if there is a filter-mode change.

Here are two scenarios:
1. Group deleted and filter mode is EXCLUDE, which means we need send a
   TO_IN { }.
2. Not group deleted, but has pcm->crcount, which means we need send a
   normal filter-mode-change.

At the same time, if the type is ALLOW or BLOCK, and have psf->sf_crcount,
we stop add records and decrease sf_crcount directly

Reference: https://www.ietf.org/mail-archive/web/magma/current/msg01274.html

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: marvell: mvneta: use new api ethtool_{get|set}_link_ksettings
Philippe Reynes [Sat, 30 Jul 2016 15:42:12 +0000 (17:42 +0200)]
net: ethernet: marvell: mvneta: use new api ethtool_{get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move the mvneta driver to new api {get|set}_link_ksettings.

We use the generic function phy_ethtool_get_link_ksettings,
and update old mvneta_ethtool_set_settings to the new api.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: marvell: mvneta: use phydev from struct net_device
Philippe Reynes [Sat, 30 Jul 2016 15:42:11 +0000 (17:42 +0200)]
net: ethernet: marvell: mvneta: use phydev from struct net_device

The private structure contain a pointer to phydev, but the structure
net_device already contain such pointer. So we can remove the pointer
phy_dev in the private structure, and update the driver to use the
one contained in struct net_device.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: greth: use phy_ethtool_{get|set}_link_ksettings
Philippe Reynes [Sat, 30 Jul 2016 14:22:06 +0000 (16:22 +0200)]
net: ethernet: greth: use phy_ethtool_{get|set}_link_ksettings

There are two generics functions phy_ethtool_{get|set}_link_ksettings,
so we can use them instead of defining the same code in the driver.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: greth: use phydev from struct net_device
Philippe Reynes [Sat, 30 Jul 2016 14:22:05 +0000 (16:22 +0200)]
net: ethernet: greth: use phydev from struct net_device

The private structure contain a pointer to phydev, but the structure
net_device already contain such pointer. So we can remove the pointer
phy in the private structure, and update the driver to use the
one contained in struct net_device.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: octeon: use phy_ethtool_{get|set}_link_ksettings
Philippe Reynes [Sat, 30 Jul 2016 14:14:44 +0000 (16:14 +0200)]
net: ethernet: octeon: use phy_ethtool_{get|set}_link_ksettings

There are two generics functions phy_ethtool_{get|set}_link_ksettings,
so we can use them instead of defining the same code in the driver.

There was a check on CAP_NET_ADMIN in cvm_oct_set_settings, but this
check is already done in dev_ethtool, so no need to repeat it before
calling the generic function.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: octeon: use phydev from struct net_device
Philippe Reynes [Sat, 30 Jul 2016 14:14:43 +0000 (16:14 +0200)]
net: ethernet: octeon: use phydev from struct net_device

The private structure contain a pointer to phydev, but the structure
net_device already contain such pointer. So we can remove the pointer
phydev in the private structure, and update the driver to use the
one contained in struct net_device.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'bna-next'
David S. Miller [Mon, 8 Aug 2016 22:41:28 +0000 (15:41 -0700)]
Merge branch 'bna-next'

Ivan Vecera says:

====================
bna: remove useless global variables

The set removes useless global bnad_list as well as bnad->entry that track
a list of driver instances but it is not used anywhere. The associated
bnad_list_mutex is removed as well but as it is also used to protect
bna_id increment it is necessary to convert bna_id to atomic_t.
====================

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
5 years agobna: remove global bnad_list_mutex
Ivan Vecera [Fri, 29 Jul 2016 17:52:58 +0000 (19:52 +0200)]
bna: remove global bnad_list_mutex

Remove global bnad_list_mutex as it is not used anymore. This makes
bnad_add_to_list() and bnad_remove_from_list() empty so remove them too.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobna: change type of bna_id to atomic_t
Ivan Vecera [Fri, 29 Jul 2016 17:52:57 +0000 (19:52 +0200)]
bna: change type of bna_id to atomic_t

Change type of bna_id to atomic_t. The bnad_list_mutex is used to prevent
a race when bna_id is incremented. After the change the mutex can be
removed in the next step.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobna: remove useless linked list
Ivan Vecera [Fri, 29 Jul 2016 17:52:56 +0000 (19:52 +0200)]
bna: remove useless linked list

Remove global variable bnad_list and bnad->list_entry that are used
as list of bna driver instances. It is not necessary and useless.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'ipconfig-improve-dhcp-timeouts'
David S. Miller [Mon, 8 Aug 2016 22:40:06 +0000 (15:40 -0700)]
Merge branch 'ipconfig-improve-dhcp-timeouts'

Uwe Kleine-König says:

====================
net: ipconfig: improve DHCP timeout handling

this series teaches the ipconfig code to handle a DHCP reply on eth0 even if a
request on eth1 was already sent out.
This is a follow fix to 2513dfb83fc7 ("ipconfig: handle case of delayed DHCP
server") that dropped a late reply.

This makes it possible at all to work with slow DHCP servers at all in some
configurations and improves boot speed in general.

The first patch is not really necessary, it only helps decoding debug messages
when there is more than one device.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ipconfig: drop inter-device timeout
Uwe Kleine-König [Fri, 29 Jul 2016 09:30:39 +0000 (11:30 +0200)]
net: ipconfig: drop inter-device timeout

Now that ipconfig learned to handle "delayed replies" in the previous
commit, there is no reason any more to delay sending a first request per
device.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ipconfig: Support using "delayed" DHCP replies
Uwe Kleine-König [Fri, 29 Jul 2016 09:30:38 +0000 (11:30 +0200)]
net: ipconfig: Support using "delayed" DHCP replies

The dhcp code only waits 1s between sending DHCP requests on different
devices and only accepts an answer for the device that sent out the last
request. Only the timeout at the end of a loop is increased iteratively
which favours only the last device. This makes it impossible to work
with a dhcp server that takes little more than 1s connected to a device
that is not the last one.

Instead of also increasing the inter-device timeout, teach the code to
handle delayed replies.

To accomplish that, make *ic_dev track the current ic_device instead of
the current net_device and adapt all users accordingly. The relevant
change then is to reset d to ic_dev on a reply to assert that the
followup request goes through the right device.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ipconfig: Add device name to debug messages
Uwe Kleine-König [Fri, 29 Jul 2016 09:30:37 +0000 (11:30 +0200)]
net: ipconfig: Add device name to debug messages

This simplifies understanding what happens when there is more than one
device.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'be2net-next'
David S. Miller [Mon, 8 Aug 2016 22:38:27 +0000 (15:38 -0700)]
Merge branch 'be2net-next'

Sathya Perla says:

====================
be2net: patch set

Patch 1 fixes the driver to workaournd a bug in the Lancer FW in the
vlan-config cmd processing. The FW in some cases clears the vlan-promisc
setting even if it cannot apply the vlan filter. The driver has no means
of knowing if the vlan-promisc setting has been cleared or not. This
patch now explicitly clears the vlan-promisc setting via the RX-Filter cmd
and then tries to program the vlan-list.

Patch 2 fixes the failure path in the vlan vid add code.
The driver currently removes a new vid from the adapter->vids[] array if
be_vid_config() returns an error, which occurs when there is an error in
HW/FW. This is wrong. After the HW/FW error is recovered from, we need the
complete vids[] array to re-program the vlan list.

Patch 3 fixes the ndo_set_rx_mode() path to avoid unnecessary multicast
list updates to the FW. Each time the ndo_set_rx_mode() routine is called,
the driver programs the multicast list in the adapter without checking
if there are any changes to the list. This leads to a flood of RX_FILTER
cmds when a number of vlan interfaces are configured over the device,
as the ndo_ gets called for each vlan interface. To avoid this, we now
use __dev_mc_sync() and __dev_uc_sync() API, but only to detect if there
is a change in the mc/uc lists. Now that we use this API, the code has to
be-designed to issue these API calls for each invocation of the ndo_ call.

Patch 4 replaces polling with sleeping in the FW completion path.
The ndo_set_rx_mode() and ndo_add/del_vxlan_port() calls may be called with
BHs disabled. The driver currently issues the required cmds to the FW in
these contexts and polls on completions from the FW, while BHs remain
disabled.  This can cause either packet loss or packet reception to be
delayed on that CPU.  This patch defers processing of the above cmds to a
separate workqueue. With this change, FW cmds are now issued only in process
context. Now that the FW cmds are issued only in process context, they can
sleep waiting for a completion instead of polling.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobe2net: replace polling with sleeping in the FW completion path
Sathya Perla [Wed, 27 Jul 2016 09:26:18 +0000 (05:26 -0400)]
be2net: replace polling with sleeping in the FW completion path

The ndo_set_rx_mode() and ndo_add/del_vxlan_port() calls may be called with
BHs disabled. The driver currently issues the required cmds to the FW in
these contexts and polls on completions from the FW, while BHs remain
disabled.  This can cause either packet loss or packet reception to be
delayed on that CPU.

This patch defers processing of the above cmds to a separate workqueue.
With this change, FW cmds are now issued only in process context.
Now that the FW cmds are issued only in process context, they can sleep
waiting for a completion instead of polling. All the spin_lock_bh(mcc_lock)
calls are now replaced with mutex calls.

Also a new rx_filter_lock is now needed to protect the RX filtering fields
like vids[] between be_vlan_add/rem_vid() and __be_set_rx_mode() contexts.

Signed-off-by: Sathya Perla <sathya.perla@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobe2net: Avoid unnecessary firmware updates of multicast list
Sriharsha Basavapatna [Wed, 27 Jul 2016 09:26:17 +0000 (05:26 -0400)]
be2net: Avoid unnecessary firmware updates of multicast list

Eachtime the ndo_set_rx_mode() routine is called, the driver programs the
multicast list in the adapter without checking if there are any changes to
the list. This leads to a flood of RX_FILTER cmds when a number of vlan
interfaces are configured over the device, as the ndo_ gets
called for each vlan interface. To avoid this, we now use __dev_mc_sync()
and __dev_uc_sync() API, but only to detect if there is a change in the
mc/uc lists. Now that we use this API, the code has to be-designed to
issue these API calls for each invocation of the be_set_rx_mode() call.

Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Sathya Perla <sathya.perla@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobe2net: do not remove vids from driver table if be_vid_config() fails.
Sathya Perla [Wed, 27 Jul 2016 09:26:16 +0000 (05:26 -0400)]
be2net: do not remove vids from driver table if be_vid_config() fails.

The driver currently removes a new vid from the adapter->vids[] array if
be_vid_config() returns an error, which occurs when there is an error in
HW/FW. This is wrong. After the HW/FW error is recovered from, we need the
complete vids[] array to re-program the vlan list.

Signed-off-by: Sathya Perla <sathya.perla@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobe2net: clear vlan-promisc setting before programming the vlan list
Somnath Kotur [Wed, 27 Jul 2016 09:26:15 +0000 (05:26 -0400)]
be2net: clear vlan-promisc setting before programming the vlan list

The Lancer FW has a bug due to which in some cases vlan-promisc setting
is cleared eventhough the vlan-list programming did not succeed (via
VLAN_CONFIG) cmd. The driver has no way of knowing if the vlan-promisc
mode was cleared or not when this cmd fails. To work around this issue,
this patch first explicitly clears the vlan-promisc mode via RX_FILTER
cmd and then tries to program the vlan list.
Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoneigh: allow admin to set NUD_STALE
Julian Anastasov [Wed, 27 Jul 2016 06:56:50 +0000 (09:56 +0300)]
neigh: allow admin to set NUD_STALE

Admin should be able to set any state. Currently, this fails
when lladdr is not changed and state is changed from
NUD_CONNECTED to NUD_STALE:

ip neigh add 192.168.8.1 lladdr 00:11:22:33:44:55 nud perm dev wlan0
ip neigh show to 192.168.8.1
192.168.8.1 dev wlan0 lladdr 00:11:22:33:44:55 PERMANENT
ip neigh change 192.168.8.1 lladdr 00:11:22:33:44:55 nud stale dev wlan0
ip neigh show to 192.168.8.1
192.168.8.1 dev wlan0 lladdr 00:11:22:33:44:55 PERMANENT

Problem may be from 2.1.X days.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Chunhui He <hchunhui@mail.ustc.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agosctp: use event->chunk when it's valid
Xin Long [Sun, 7 Aug 2016 06:15:13 +0000 (14:15 +0800)]
sctp: use event->chunk when it's valid

Commit 52253db924d1 ("sctp: also point GSO head_skb to the sk when
it's available") used event->chunk->head_skb to get the head_skb in
sctp_ulpevent_set_owner().

But at that moment, the event->chunk was NULL, as it cloned the skb
in sctp_ulpevent_make_rcvmsg(). Therefore, that patch didn't really
work.

This patch is to move the event->chunk initialization before calling
sctp_ulpevent_receive_data() so that it uses event->chunk when it's
valid.

Fixes: 52253db924d1 ("sctp: also point GSO head_skb to the sk when it's available")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: vxlan: lwt: Fix vxlan local traffic.
pravin shelar [Sat, 6 Aug 2016 00:45:37 +0000 (17:45 -0700)]
net: vxlan: lwt: Fix vxlan local traffic.

vxlan driver has bypass for local vxlan traffic, but that
depends on information about all VNIs on local system in
vxlan driver. This is not available in case of LWT.
Therefore following patch disable encap bypass for LWT
vxlan traffic.

Fixes: ee122c79d42 ("vxlan: Flow based tunneling").
Reported-by: Jakub Libosvar <jlibosva@redhat.com>
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: vxlan: lwt: Use source ip address during route lookup.
pravin shelar [Sat, 6 Aug 2016 00:45:36 +0000 (17:45 -0700)]
net: vxlan: lwt: Use source ip address during route lookup.

LWT user can specify destination as well as source ip address
for given tunnel endpoint. But vxlan is ignoring given source
ip address. Following patch uses both ip address to route the
tunnel packet. This consistent with other LWT implementations,
like GENEVE and GRE.

Fixes: ee122c79d42 ("vxlan: Flow based tunneling").
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'bpf-csum-complete'
David S. Miller [Mon, 8 Aug 2016 20:11:44 +0000 (13:11 -0700)]
Merge branch 'bpf-csum-complete'

Daniel Borkmann says:

====================
Few BPF helper related checksum fixes

The set contains three fixes with regards to CHECKSUM_COMPLETE
and BPF helper functions. For details please see individual
patches.

Thanks!

v1 -> v2:
  - Fixed make htmldocs issue reported by kbuild bot.
  - Rest as is.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobpf: fix checksum for vlan push/pop helper
Daniel Borkmann [Thu, 4 Aug 2016 22:11:13 +0000 (00:11 +0200)]
bpf: fix checksum for vlan push/pop helper

When having skbs on ingress with CHECKSUM_COMPLETE, tc BPF programs don't
push rcsum of mac header back in and after BPF run back pull out again as
opposed to some other subsystems (ovs, for example).

For cases like q-in-q, meaning when a vlan tag for offloading is already
present and we're about to push another one, then skb_vlan_push() pushes the
inner one into the skb, increasing mac header and skb_postpush_rcsum()'ing
the 4 bytes vlan header diff. Likewise, for the reverse operation in
skb_vlan_pop() for the case where vlan header needs to be pulled out of the
skb, we're decreasing the mac header and skb_postpull_rcsum()'ing the 4 bytes
rcsum of the vlan header that was removed.

However mangling the rcsum here will lead to hw csum failure for BPF case,
since we're pulling or pushing data that was not part of the current rcsum.
Changing tc BPF programs in general to push/pull rcsum around BPF_PROG_RUN()
is also not really an option since current behaviour is ABI by now, but apart
from that would also mean to do quite a bit of useless work in the sense that
usually 12 bytes need to be rcsum pushed/pulled also when we don't need to
touch this vlan related corner case. One way to fix it would be to push the
necessary rcsum fixup down into vlan helpers that are (mostly) slow-path
anyway.

Fixes: 4e10df9a60d9 ("bpf: introduce bpf_skb_vlan_push/pop() helpers")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobpf: fix checksum fixups on bpf_skb_store_bytes
Daniel Borkmann [Thu, 4 Aug 2016 22:11:12 +0000 (00:11 +0200)]
bpf: fix checksum fixups on bpf_skb_store_bytes

bpf_skb_store_bytes() invocations above L2 header need BPF_F_RECOMPUTE_CSUM
flag for updates, so that CHECKSUM_COMPLETE will be fixed up along the way.
Where we ran into an issue with bpf_skb_store_bytes() is when we did a
single-byte update on the IPv6 hoplimit despite using BPF_F_RECOMPUTE_CSUM
flag; simple ping via ICMPv6 triggered a hw csum failure as a result. The
underlying issue has been tracked down to a buffer alignment issue.

Meaning, that csum_partial() computations via skb_postpull_rcsum() and
skb_postpush_rcsum() pair invoked had a wrong result since they operated on
an odd address for the hoplimit, while other computations were done on an
even address. This mix doesn't work as-is with skb_postpull_rcsum(),
skb_postpush_rcsum() pair as it always expects at least half-word alignment
of input buffers, which is normally the case. Thus, instead of these helpers
using csum_sub() and (implicitly) csum_add(), we need to use csum_block_sub(),
csum_block_add(), respectively. For unaligned offsets, they rotate the sum
to align it to a half-word boundary again, otherwise they work the same as
csum_sub() and csum_add().

Adding __skb_postpull_rcsum(), __skb_postpush_rcsum() variants that take the
offset as an input and adapting bpf_skb_store_bytes() to them fixes the hw
csum failures again. The skb_postpull_rcsum(), skb_postpush_rcsum() helpers
use a 0 constant for offset so that the compiler optimizes the offset & 1
test away and generates the same code as with csum_sub()/_add().

Fixes: 608cd71a9c7c ("tc: bpf: generalize pedit action")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>