muen/linux.git
3 years agotipc: fix hanging poll() for stream sockets
Parthasarathy Bhuvaragan [Thu, 28 Dec 2017 11:03:06 +0000 (12:03 +0100)]
tipc: fix hanging poll() for stream sockets

In commit 42b531de17d2f6 ("tipc: Fix missing connection request
handling"), we replaced unconditional wakeup() with condtional
wakeup for clients with flags POLLIN | POLLRDNORM | POLLRDBAND.

This breaks the applications which do a connect followed by poll
with POLLOUT flag. These applications are not woken when the
connection is ESTABLISHED and hence sleep forever.

In this commit, we fix it by including the POLLOUT event for
sockets in TIPC_CONNECTING state.

Fixes: 42b531de17d2f6 ("tipc: Fix missing connection request handling")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
David S. Miller [Thu, 28 Dec 2017 01:35:03 +0000 (20:35 -0500)]
Merge git://git./pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2017-12-28

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Two small fixes for bpftool. Fix otherwise broken output if any of
   the system calls failed when listing maps in json format and instead
   of bailing out, skip maps or progs that disappeared between fetching
   next id and getting an fd for that id, both from Jakub.

2) Small fix in BPF selftests to respect LLC passed from command line
   when testing for -mcpu=probe presence, from Quentin.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agosctp: Replace use of sockets_allocated with specified macro.
Tonghao Zhang [Fri, 22 Dec 2017 18:15:20 +0000 (10:15 -0800)]
sctp: Replace use of sockets_allocated with specified macro.

The patch(180d8cd942ce) replaces all uses of struct sock fields'
memory_pressure, memory_allocated, sockets_allocated, and sysctl_mem
to accessor macros. But the sockets_allocated field of sctp sock is
not replaced at all. Then replace it now for unifying the code.

Fixes: 180d8cd942ce ("foundations of per-cgroup memory pressure controlling.")
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agobnx2x: Improve reliability in case of nested PCI errors
Guilherme G. Piccoli [Fri, 22 Dec 2017 15:01:39 +0000 (13:01 -0200)]
bnx2x: Improve reliability in case of nested PCI errors

While in recovery process of PCI error (called EEH on PowerPC arch),
another PCI transaction could be corrupted causing a situation of
nested PCI errors. Also, this scenario could be reproduced with
error injection mechanisms (for debug purposes).

We observe that in case of nested PCI errors, bnx2x might attempt to
initialize its shmem and cause a kernel crash due to bad addresses
read from MCP. Multiple different stack traces were observed depending
on the point the second PCI error happens.

This patch avoids the crashes by:

 * failing PCI recovery in case of nested errors (since multiple
 PCI errors in a row are not expected to lead to a functional
 adapter anyway), and by,

 * preventing access to adapter FW when MCP is failed (we mark it as
 failed when shmem cannot get initialized properly).

Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Acked-by: Shahed Shaikh <Shahed.Shaikh@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'tg3-fixes'
David S. Miller [Wed, 27 Dec 2017 16:09:14 +0000 (11:09 -0500)]
Merge branch 'tg3-fixes'

Siva Reddy Kallam says:

====================
tg3: update on copyright and couple of fixes

First patch:
Update copyright

Second patch:
Add workaround to restrict 5762 MRRS

Third patch:
Add PHY reset in change MTU path for 5720
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotg3: Enable PHY reset in MTU change path for 5720
Siva Reddy Kallam [Fri, 22 Dec 2017 10:35:29 +0000 (16:05 +0530)]
tg3: Enable PHY reset in MTU change path for 5720

A customer noticed RX path hang when MTU is changed on the fly while
running heavy traffic with NCSI enabled for 5717 and 5719. Since 5720
belongs to same ASIC family, we observed same issue and same fix
could solve this problem for 5720.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotg3: Add workaround to restrict 5762 MRRS to 2048
Siva Reddy Kallam [Fri, 22 Dec 2017 10:35:28 +0000 (16:05 +0530)]
tg3: Add workaround to restrict 5762 MRRS to 2048

One of AMD based server with 5762 hangs with jumbo frame traffic.
This AMD platform has southbridge limitation which is restricting MRRS
to 4000. As a work around, driver to restricts the MRRS to 2048 for
this particular 5762 NX1 card.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotg3: Update copyright
Siva Reddy Kallam [Fri, 22 Dec 2017 10:35:27 +0000 (16:05 +0530)]
tg3: Update copyright

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
David S. Miller [Wed, 27 Dec 2017 15:58:23 +0000 (10:58 -0500)]
Merge branch 'master' of git://git./linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2017-12-22

1) Check for valid id proto in validate_tmpl(), otherwise
   we may trigger a warning in xfrm_state_fini().
   From Cong Wang.

2) Fix a typo on XFRMA_OUTPUT_MARK policy attribute.
   From Michal Kubecek.

3) Verify the state is valid when encap_type < 0,
   otherwise we may crash on IPsec GRO .
   From Aviv Heller.

4) Fix stack-out-of-bounds read on socket policy lookup.
   We access the flowi of the wrong address family in the
   IPv4 mapped IPv6 case, fix this by catching address
   family missmatches before we do the lookup.

5) fix xfrm_do_migrate() with AEAD to copy the geniv
   field too. Otherwise the state is not fully initialized
   and migration fails. From Antony Antony.

6) Fix stack-out-of-bounds with misconfigured transport
   mode policies. Our policy template validation is not
   strict enough. It is possible to configure policies
   with transport mode template where the address family
   of the template does not match the selectors address
   family. Fix this by refusing such a configuration,
   address family can not change on transport mode.

7) Fix a policy reference leak when reusing pcpu xdst
   entry. From Florian Westphal.

8) Reinject transport-mode packets through tasklet,
   otherwise it is possible to reate a recursion
   loop. From Herbert Xu.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: fec: unmap the xmit buffer that are not transferred by DMA
Fugang Duan [Fri, 22 Dec 2017 09:12:09 +0000 (17:12 +0800)]
net: fec: unmap the xmit buffer that are not transferred by DMA

The enet IP only support 32 bit, it will use swiotlb buffer to do dma
mapping when xmit buffer DMA memory address is bigger than 4G in i.MX
platform. After stress suspend/resume test, it will print out:

log:
[12826.352864] fec 5b040000.ethernet: swiotlb buffer is full (sz: 191 bytes)
[12826.359676] DMA: Out of SW-IOMMU space for 191 bytes at device 5b040000.ethernet
[12826.367110] fec 5b040000.ethernet eth0: Tx DMA memory map failed

The issue is that the ready xmit buffers that are dma mapped but DMA still
don't copy them into fifo, once MAC restart, these DMA buffers are not unmapped.
So it should check the dma mapping buffer and unmap them.

Signed-off-by: Fugang Duan <fugang.duan@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotipc: fix tipc_mon_delete() oops in tipc_enable_bearer() error path
Tommi Rantala [Fri, 22 Dec 2017 07:35:17 +0000 (09:35 +0200)]
tipc: fix tipc_mon_delete() oops in tipc_enable_bearer() error path

Calling tipc_mon_delete() before the monitor has been created will oops.
This can happen in tipc_enable_bearer() error path if tipc_disc_create()
fails.

[   48.589074] BUG: unable to handle kernel paging request at 0000000000001008
[   48.590266] IP: tipc_mon_delete+0xea/0x270 [tipc]
[   48.591223] PGD 1e60c5067 P4D 1e60c5067 PUD 1eb0cf067 PMD 0
[   48.592230] Oops: 0000 [#1] SMP KASAN
[   48.595610] CPU: 5 PID: 1199 Comm: tipc Tainted: G    B            4.15.0-rc4-pc64-dirty #5
[   48.597176] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
[   48.598489] RIP: 0010:tipc_mon_delete+0xea/0x270 [tipc]
[   48.599347] RSP: 0018:ffff8801d827f668 EFLAGS: 00010282
[   48.600705] RAX: ffff8801ee813f00 RBX: 0000000000000204 RCX: 0000000000000000
[   48.602183] RDX: 1ffffffff1de6a75 RSI: 0000000000000297 RDI: 0000000000000297
[   48.604373] RBP: 0000000000000000 R08: 0000000000000000 R09: fffffbfff1dd1533
[   48.605607] R10: ffffffff8eafbb05 R11: fffffbfff1dd1534 R12: 0000000000000050
[   48.607082] R13: dead000000000200 R14: ffffffff8e73f310 R15: 0000000000001020
[   48.608228] FS:  00007fc686484800(0000) GS:ffff8801f5540000(0000) knlGS:0000000000000000
[   48.610189] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   48.611459] CR2: 0000000000001008 CR3: 00000001dda70002 CR4: 00000000003606e0
[   48.612759] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   48.613831] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   48.615038] Call Trace:
[   48.615635]  tipc_enable_bearer+0x415/0x5e0 [tipc]
[   48.620623]  tipc_nl_bearer_enable+0x1ab/0x200 [tipc]
[   48.625118]  genl_family_rcv_msg+0x36b/0x570
[   48.631233]  genl_rcv_msg+0x5a/0xa0
[   48.631867]  netlink_rcv_skb+0x1cc/0x220
[   48.636373]  genl_rcv+0x24/0x40
[   48.637306]  netlink_unicast+0x29c/0x350
[   48.639664]  netlink_sendmsg+0x439/0x590
[   48.642014]  SYSC_sendto+0x199/0x250
[   48.649912]  do_syscall_64+0xfd/0x2c0
[   48.650651]  entry_SYSCALL64_slow_path+0x25/0x25
[   48.651843] RIP: 0033:0x7fc6859848e3
[   48.652539] RSP: 002b:00007ffd25dff938 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[   48.654003] RAX: ffffffffffffffda RBX: 00007ffd25dff990 RCX: 00007fc6859848e3
[   48.655303] RDX: 0000000000000054 RSI: 00007ffd25dff990 RDI: 0000000000000003
[   48.656512] RBP: 00007ffd25dff980 R08: 00007fc685c35fc0 R09: 000000000000000c
[   48.657697] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000d13010
[   48.658840] R13: 00007ffd25e009c0 R14: 0000000000000000 R15: 0000000000000000
[   48.662972] RIP: tipc_mon_delete+0xea/0x270 [tipc] RSP: ffff8801d827f668
[   48.664073] CR2: 0000000000001008
[   48.664576] ---[ end trace e811818d54d5ce88 ]---

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotipc: error path leak fixes in tipc_enable_bearer()
Tommi Rantala [Fri, 22 Dec 2017 07:35:16 +0000 (09:35 +0200)]
tipc: error path leak fixes in tipc_enable_bearer()

Fix memory leak in tipc_enable_bearer() if enable_media() fails, and
cleanup with bearer_disable() if tipc_mon_create() fails.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoRDS: Check cmsg_len before dereferencing CMSG_DATA
Avinash Repaka [Fri, 22 Dec 2017 04:17:04 +0000 (20:17 -0800)]
RDS: Check cmsg_len before dereferencing CMSG_DATA

RDS currently doesn't check if the length of the control message is
large enough to hold the required data, before dereferencing the control
message data. This results in following crash:

BUG: KASAN: stack-out-of-bounds in rds_rdma_bytes net/rds/send.c:1013
[inline]
BUG: KASAN: stack-out-of-bounds in rds_sendmsg+0x1f02/0x1f90
net/rds/send.c:1066
Read of size 8 at addr ffff8801c928fb70 by task syzkaller455006/3157

CPU: 0 PID: 3157 Comm: syzkaller455006 Not tainted 4.15.0-rc3+ #161
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 print_address_description+0x73/0x250 mm/kasan/report.c:252
 kasan_report_error mm/kasan/report.c:351 [inline]
 kasan_report+0x25b/0x340 mm/kasan/report.c:409
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
 rds_rdma_bytes net/rds/send.c:1013 [inline]
 rds_sendmsg+0x1f02/0x1f90 net/rds/send.c:1066
 sock_sendmsg_nosec net/socket.c:628 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:638
 ___sys_sendmsg+0x320/0x8b0 net/socket.c:2018
 __sys_sendmmsg+0x1ee/0x620 net/socket.c:2108
 SYSC_sendmmsg net/socket.c:2139 [inline]
 SyS_sendmmsg+0x35/0x60 net/socket.c:2134
 entry_SYSCALL_64_fastpath+0x1f/0x96
RIP: 0033:0x43fe49
RSP: 002b:00007fffbe244ad8 EFLAGS: 00000217 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fe49
RDX: 0000000000000001 RSI: 000000002020c000 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000217 R12: 00000000004017b0
R13: 0000000000401840 R14: 0000000000000000 R15: 0000000000000000

To fix this, we verify that the cmsg_len is large enough to hold the
data to be read, before proceeding further.

Reported-by: syzbot <syzkaller-bugs@googlegroups.com>
Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotcp: Avoid preprocessor directives in tracepoint macro args
Mat Martineau [Thu, 21 Dec 2017 18:29:09 +0000 (10:29 -0800)]
tcp: Avoid preprocessor directives in tracepoint macro args

Using a preprocessor directive to check for CONFIG_IPV6 in the middle of
a DECLARE_EVENT_CLASS macro's arg list causes sparse to report a series
of errors:

./include/trace/events/tcp.h:68:1: error: directive in argument list
./include/trace/events/tcp.h:75:1: error: directive in argument list
./include/trace/events/tcp.h:144:1: error: directive in argument list
./include/trace/events/tcp.h:151:1: error: directive in argument list
./include/trace/events/tcp.h:216:1: error: directive in argument list
./include/trace/events/tcp.h:223:1: error: directive in argument list
./include/trace/events/tcp.h:274:1: error: directive in argument list
./include/trace/events/tcp.h:281:1: error: directive in argument list

Once sparse finds an error, it stops printing warnings for the file it
is checking. This masks any sparse warnings that would normally be
reported for the core TCP code.

Instead, handle the preprocessor conditionals in a couple of auxiliary
macros. This also has the benefit of reducing duplicate code.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotipc: fix memory leak of group member when peer node is lost
Jon Maloy [Thu, 21 Dec 2017 13:36:34 +0000 (14:36 +0100)]
tipc: fix memory leak of group member when peer node is lost

When a group member receives a member WITHDRAW event, this might have
two reasons: either the peer member is leaving the group, or the link
to the member's node has been lost.

In the latter case we need to issue a DOWN event to the user right away,
and let function tipc_group_filter_msg() perform delete of the member
item. However, in this case we miss to change the state of the member
item to MBR_LEAVING, so the member item is not deleted, and we have a
memory leak.

We now separate better between the four sub-cases of a WITHRAW event
and make sure that each case is handled correctly.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: sched: fix possible null pointer deref in tcf_block_put
Jiri Pirko [Thu, 21 Dec 2017 12:13:59 +0000 (13:13 +0100)]
net: sched: fix possible null pointer deref in tcf_block_put

We need to check block for being null in both tcf_block_put and
tcf_block_put_ext.

Fixes: 343723dd51ef ("net: sched: fix clsact init error path")
Reported-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotipc: base group replicast ack counter on number of actual receivers
Jon Maloy [Thu, 21 Dec 2017 12:07:11 +0000 (13:07 +0100)]
tipc: base group replicast ack counter on number of actual receivers

In commit 2f487712b893 ("tipc: guarantee that group broadcast doesn't
bypass group unicast") we introduced a mechanism that requires the first
(replicated) broadcast sent after a unicast to be acknowledged by all
receivers before permitting sending of the next (true) broadcast.

The counter for keeping track of the number of acknowledges to expect
is based on the tipc_group::member_cnt variable. But this misses that
some of the known members may not be ready for reception, and will never
acknowledge the message, either because they haven't fully joined the
group or because they are leaving the group. Such members are identified
by not fulfilling the condition tested for in the function
tipc_group_is_enabled().

We now set the counter for the actual number of acks to receive at the
moment the message is sent, by just counting the number of recipients
satisfying the tipc_group_is_enabled() test.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet_sched: fix a missing rcu barrier in mini_qdisc_pair_swap()
Cong Wang [Thu, 21 Dec 2017 07:26:24 +0000 (23:26 -0800)]
net_sched: fix a missing rcu barrier in mini_qdisc_pair_swap()

The rcu_barrier_bh() in mini_qdisc_pair_swap() is to wait for
flying RCU callback installed by a previous mini_qdisc_pair_swap(),
however we miss it on the tp_head==NULL path, which leads to that
the RCU callback still uses miniq_old->rcu after it is freed together
with qdisc in qdisc_graft(). So just add it on that path too.

Fixes: 46209401f8f6 ("net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath ")
Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Tested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: phy: micrel: ksz9031: reconfigure autoneg after phy autoneg workaround
Grygorii Strashko [Thu, 21 Dec 2017 00:45:10 +0000 (18:45 -0600)]
net: phy: micrel: ksz9031: reconfigure autoneg after phy autoneg workaround

Under some circumstances driver will perform PHY reset in
ksz9031_read_status() to fix autoneg failure case (idle error count =
0xFF). When this happens ksz9031 will not detect link status change any
more when connecting to Netgear 1G switch (link can be recovered sometimes by
restarting netdevice "ifconfig down up"). Reproduced with TI am572x board
equipped with ksz9031 PHY while connecting to Netgear 1G switch.

Fix the issue by reconfiguring autonegotiation after PHY reset in
ksz9031_read_status().

Fixes: d2fd719bcb0e ("net/phy: micrel: Add workaround for bad autoneg")
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoip6_gre: fix device features for ioctl setup
Alexey Kodanev [Wed, 20 Dec 2017 16:36:03 +0000 (19:36 +0300)]
ip6_gre: fix device features for ioctl setup

When ip6gre is created using ioctl, its features, such as
scatter-gather, GSO and tx-checksumming will be turned off:

  # ip -f inet6 tunnel add gre6 mode ip6gre remote fd00::1
  # ethtool -k gre6 (truncated output)
    tx-checksumming: off
    scatter-gather: off
    tcp-segmentation-offload: off
    generic-segmentation-offload: off [requested on]

But when netlink is used, they will be enabled:
  # ip link add gre6 type ip6gre remote fd00::1
  # ethtool -k gre6 (truncated output)
    tx-checksumming: on
    scatter-gather: on
    tcp-segmentation-offload: on
    generic-segmentation-offload: on

This results in a loss of performance when gre6 is created via ioctl.
The issue was found with LTP/gre tests.

Fix it by moving the setup of device features to a separate function
and invoke it with ndo_init callback because both netlink and ioctl
will eventually call it via register_netdevice():

   register_netdevice()
       - ndo_init() callback -> ip6gre_tunnel_init() or ip6gre_tap_init()
           - ip6gre_tunnel_init_common()
                - ip6gre_tnl_init_features()

The moved code also contains two minor style fixes:
  * removed needless tab from GRE6_FEATURES on NETIF_F_HIGHDMA line.
  * fixed the issue reported by checkpatch: "Unnecessary parentheses around
    'nt->encap.type == TUNNEL_ENCAP_NONE'"

Fixes: ac4eb009e477 ("ip6gre: Add support for basic offloads offloads excluding GSO")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agophylink: ensure AN is enabled
Russell King [Wed, 20 Dec 2017 23:21:34 +0000 (23:21 +0000)]
phylink: ensure AN is enabled

Ensure that we mark AN as enabled at boot time, rather than leaving
it disabled.  This is noticable if your SFP module is fiber, and
it supports faster speeds than 1G with 2.5G support in place.

Fixes: 9525ae83959b ("phylink: add phylink infrastructure")
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agophylink: ensure the PHY interface mode is appropriately set
Russell King [Wed, 20 Dec 2017 23:21:28 +0000 (23:21 +0000)]
phylink: ensure the PHY interface mode is appropriately set

When setting the ethtool settings, ensure that the validated PHY
interface mode is propagated to the current link settings, so that
2500BaseX can be selected.

Fixes: 9525ae83959b ("phylink: add phylink infrastructure")
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'bpf-bpftool-various-fixes'
Daniel Borkmann [Sat, 23 Dec 2017 00:09:53 +0000 (01:09 +0100)]
Merge branch 'bpf-bpftool-various-fixes'

Jakub Kicinski says:

====================
Two small fixes here to listing maps and programs.  The loop for showing
maps is written slightly differently to programs which was missed in JSON
output support, and output would be broken if any of the system calls
failed.  Second fix is in very unlikely case that program or map disappears
after we get its ID we should just skip over that object instead of failing.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
3 years agotools: bpftool: protect against races with disappearing objects
Jakub Kicinski [Fri, 22 Dec 2017 19:36:06 +0000 (11:36 -0800)]
tools: bpftool: protect against races with disappearing objects

On program/map show we may get an ID of an object from GETNEXT,
but the object may disappear before we call GET_FD_BY_ID.  If
that happens, ignore the object and continue.

Fixes: 71bb428fe2c1 ("tools: bpf: add bpftool")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
3 years agotools: bpftool: maps: close json array on error paths of show
Jakub Kicinski [Fri, 22 Dec 2017 19:36:05 +0000 (11:36 -0800)]
tools: bpftool: maps: close json array on error paths of show

We can't return from the middle of do_show(), because
json_array will not be closed.  Break out of the loop.
Note that the error handling after the loop depends on
errno, so no need to set err.

Fixes: 831a0aafe5c3 ("tools: bpftool: add JSON output for `bpftool map *` commands")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
3 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Thu, 21 Dec 2017 23:57:30 +0000 (15:57 -0800)]
Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller"
 "What's a holiday weekend without some networking bug fixes? [1]

   1) Fix some eBPF JIT bugs wrt. SKB pointers across helper function
      calls, from Daniel Borkmann.

   2) Fix regression from errata limiting change to marvell PHY driver,
      from Zhao Qiang.

   3) Fix u16 overflow in SCTP, from Xin Long.

   4) Fix potential memory leak during bridge newlink, from Nikolay
      Aleksandrov.

   5) Fix BPF selftest build on s390, from Hendrik Brueckner.

   6) Don't append to cfg80211 automatically generated certs file,
      always write new ones from scratch. From Thierry Reding.

   7) Fix sleep in atomic in mac80211 hwsim, from Jia-Ju Bai.

   8) Fix hang on tg3 MTU change with certain chips, from Brian King.

   9) Add stall detection to arc emac driver and reset chip when this
      happens, from Alexander Kochetkov.

  10) Fix MTU limitng in GRE tunnel drivers, from Xin Long.

  11) Fix stmmac timestamping bug due to mis-shifting of field. From
      Fredrik Hallenberg.

  12) Fix metrics match when deleting an ipv4 route. The kernel sets
      some internal metrics bits which the user isn't going to set when
      it makes the delete request. From Phil Sutter.

  13) mvneta driver loop over RX queues limits on "txq_number" :-) Fix
      from Yelena Krivosheev.

  14) Fix double free and memory corruption in get_net_ns_by_id, from
      Eric W. Biederman.

  15) Flush ipv4 FIB tables in the reverse order. Some tables can share
      their actual backing data, in particular this happens for the MAIN
      and LOCAL tables. We have to kill the LOCAL table first, because
      it uses MAIN's backing memory. Fix from Ido Schimmel.

  16) Several eBPF verifier value tracking fixes, from Edward Cree, Jann
      Horn, and Alexei Starovoitov.

  17) Make changes to ipv6 autoflowlabel sysctl really propagate to
      sockets, unless the socket has set the per-socket value
      explicitly. From Shaohua Li.

  18) Fix leaks and double callback invocations of zerocopy SKBs, from
      Willem de Bruijn"

[1] Is this a trick question? "Relaxing"? "Quiet"? "Fine"? - Linus.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (77 commits)
  skbuff: skb_copy_ubufs must release uarg even without user frags
  skbuff: orphan frags before zerocopy clone
  net: reevalulate autoflowlabel setting after sysctl setting
  openvswitch: Fix pop_vlan action for double tagged frames
  ipv6: Honor specified parameters in fibmatch lookup
  bpf: do not allow root to mangle valid pointers
  selftests/bpf: add tests for recent bugfixes
  bpf: fix integer overflows
  bpf: don't prune branches when a scalar is replaced with a pointer
  bpf: force strict alignment checks for stack pointers
  bpf: fix missing error return in check_stack_boundary()
  bpf: fix 32-bit ALU op verification
  bpf: fix incorrect tracking of register size truncation
  bpf: fix incorrect sign extension in check_alu_op()
  bpf/verifier: fix bounds calculation on BPF_RSH
  ipv4: Fix use-after-free when flushing FIB tables
  s390/qeth: fix error handling in checksum cmd callback
  tipc: remove joining group member from congested list
  selftests: net: Adding config fragment CONFIG_NUMA=y
  nfp: bpf: keep track of the offloaded program
  ...

3 years agoselftests/bpf: fix Makefile for passing LLC to the command line
Quentin Monnet [Thu, 21 Dec 2017 16:52:50 +0000 (08:52 -0800)]
selftests/bpf: fix Makefile for passing LLC to the command line

Makefile has a LLC variable that is initialised to "llc", but can
theoretically be overridden from the command line ("make LLC=llc-6.0").
However, this fails because for LLVM probe check, "llc" is called
directly. Use the $(LLC) variable instead to fix this.

Fixes: 22c8852624fc ("bpf: improve selftests and add tests for meta pointer")
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
3 years agoMerge branch 'net-zerocopy-fixes'
David S. Miller [Thu, 21 Dec 2017 20:00:59 +0000 (15:00 -0500)]
Merge branch 'net-zerocopy-fixes'

Saeed Mahameed says:

===================
Mellanox, mlx5 fixes 2017-12-19

The follwoing series includes some fixes for mlx5 core and etherent
driver.

Please pull and let me know if there is any problem.

This series doesn't introduce any conflict with the ongoing mlx5 for-next
submission.

For -stable:

kernels >= v4.7.y
    ("net/mlx5e: Fix possible deadlock of VXLAN lock")
    ("net/mlx5e: Add refcount to VXLAN structure")
    ("net/mlx5e: Prevent possible races in VXLAN control flow")
    ("net/mlx5e: Fix features check of IPv6 traffic")

kernels >= v4.9.y
    ("net/mlx5: Fix error flow in CREATE_QP command")
    ("net/mlx5: Fix rate limit packet pacing naming and struct")

kernels >= v4.13.y
    ("net/mlx5: FPGA, return -EINVAL if size is zero")

kernels >= v4.14.y
    ("Revert "mlx5: move affinity hints assignments to generic code")

All above patches apply and compile with no issues on corresponding -stable.
===================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoskbuff: skb_copy_ubufs must release uarg even without user frags
Willem de Bruijn [Wed, 20 Dec 2017 22:37:50 +0000 (17:37 -0500)]
skbuff: skb_copy_ubufs must release uarg even without user frags

skb_copy_ubufs creates a private copy of frags[] to release its hold
on user frags, then calls uarg->callback to notify the owner.

Call uarg->callback even when no frags exist. This edge case can
happen when zerocopy_sg_from_iter finds enough room in skb_headlen
to copy all the data.

Fixes: 3ece782693c4 ("sock: skb_copy_ubufs support for compound pages")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoskbuff: orphan frags before zerocopy clone
Willem de Bruijn [Wed, 20 Dec 2017 22:37:49 +0000 (17:37 -0500)]
skbuff: orphan frags before zerocopy clone

Call skb_zerocopy_clone after skb_orphan_frags, to avoid duplicate
calls to skb_uarg(skb)->callback for the same data.

skb_zerocopy_clone associates skb_shinfo(skb)->uarg from frag_skb
with each segment. This is only safe for uargs that do refcounting,
which is those that pass skb_orphan_frags without dropping their
shared frags. For others, skb_orphan_frags drops the user frags and
sets the uarg to NULL, after which sock_zerocopy_clone has no effect.

Qemu hangs were reported due to duplicate vhost_net_zerocopy_callback
calls for the same data causing the vhost_net_ubuf_ref_>refcount to
drop below zero.

Link: http://lkml.kernel.org/r/<CAF=yD-LWyCD4Y0aJ9O0e_CHLR+3JOeKicRRTEVCPxgw4XOcqGQ@mail.gmail.com>
Fixes: 1f8b977ab32d ("sock: enable MSG_ZEROCOPY")
Reported-by: Andreas Hartmann <andihartmann@01019freenet.de>
Reported-by: David Hill <dhill@redhat.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'for-linus' of git://git.kernel.dk/linux-block
Linus Torvalds [Thu, 21 Dec 2017 19:13:37 +0000 (11:13 -0800)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "It's been a few weeks, so here's a small collection of fixes that
  should go into the current series.

  This contains:

   - NVMe pull request from Christoph, with a few important fixes.

   - kyber hang fix from Omar.

   - A blk-throttl fix from Shaohua, fixing a case where we double
     charge a bio.

   - Two call_single_data alignment fixes from me, fixing up some
     unfortunate changes that went into 4.14 without being properly
     reviewed on the block side (since nobody was CC'ed on the
     patch...).

   - A bounce buffer fix in two parts, one from me and one from Ming.

   - Revert bdi debug error handling patch. It's causing boot issues for
     some folks, and a week down the line, we're still no closer to a
     fix. Revert this patch for now until it's figured out, then we can
     retry for 4.16"

* 'for-linus' of git://git.kernel.dk/linux-block:
  Revert "bdi: add error handle for bdi_debug_register"
  null_blk: unalign call_single_data
  block: unalign call_single_data in struct request
  block-throttle: avoid double charge
  block: fix blk_rq_append_bio
  block: don't let passthrough IO go into .make_request_fn()
  nvme: setup streams after initializing namespace head
  nvme: check hw sectors before setting chunk sectors
  nvme: call blk_integrity_unregister after queue is cleaned up
  nvme-fc: remove double put reference if admin connect fails
  nvme: set discard_alignment to zero
  kyber: fix another domain token wait queue hang

3 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Thu, 21 Dec 2017 18:44:13 +0000 (10:44 -0800)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "ARM fixes:
   - A bug in handling of SPE state for non-vhe systems
   - A fix for a crash on system shutdown
   - Three timer fixes, introduced by the timer optimizations for v4.15

  x86 fixes:
   - fix for a WARN that was introduced in 4.15
   - fix for SMM when guest uses PCID
   - fixes for several bugs found by syzkaller

  ... and a dozen papercut fixes for the kvm_stat tool"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
  tools/kvm_stat: sort '-f help' output
  kvm: x86: fix RSM when PCID is non-zero
  KVM: Fix stack-out-of-bounds read in write_mmio
  KVM: arm/arm64: Fix timer enable flow
  KVM: arm/arm64: Properly handle arch-timer IRQs after vtimer_save_state
  KVM: arm/arm64: timer: Don't set irq as forwarded if no usable GIC
  KVM: arm/arm64: Fix HYP unmapping going off limits
  arm64: kvm: Prevent restoring stale PMSCR_EL1 for vcpu
  KVM/x86: Check input paging mode when cs.l is set
  tools/kvm_stat: add line for totals
  tools/kvm_stat: stop ignoring unhandled arguments
  tools/kvm_stat: suppress usage information on command line errors
  tools/kvm_stat: handle invalid regular expressions
  tools/kvm_stat: add hint on '-f help' to man page
  tools/kvm_stat: fix child trace events accounting
  tools/kvm_stat: fix extra handling of 'help' with fields filter
  tools/kvm_stat: fix missing field update after filter change
  tools/kvm_stat: fix drilldown in events-by-guests mode
  tools/kvm_stat: fix command line option '-g'
  kvm: x86: fix WARN due to uninitialized guest FPU state
  ...

3 years agonet: reevalulate autoflowlabel setting after sysctl setting
Shaohua Li [Wed, 20 Dec 2017 20:10:21 +0000 (12:10 -0800)]
net: reevalulate autoflowlabel setting after sysctl setting

sysctl.ip6.auto_flowlabels is default 1. In our hosts, we set it to 2.
If sockopt doesn't set autoflowlabel, outcome packets from the hosts are
supposed to not include flowlabel. This is true for normal packet, but
not for reset packet.

The reason is ipv6_pinfo.autoflowlabel is set in sock creation. Later if
we change sysctl.ip6.auto_flowlabels, the ipv6_pinfo.autoflowlabel isn't
changed, so the sock will keep the old behavior in terms of auto
flowlabel. Reset packet is suffering from this problem, because reset
packet is sent from a special control socket, which is created at boot
time. Since sysctl.ipv6.auto_flowlabels is 1 by default, the control
socket will always have its ipv6_pinfo.autoflowlabel set, even after
user set sysctl.ipv6.auto_flowlabels to 1, so reset packset will always
have flowlabel. Normal sock created before sysctl setting suffers from
the same issue. We can't even turn off autoflowlabel unless we kill all
socks in the hosts.

To fix this, if IPV6_AUTOFLOWLABEL sockopt is used, we use the
autoflowlabel setting from user, otherwise we always call
ip6_default_np_autolabel() which has the new settings of sysctl.

Note, this changes behavior a little bit. Before commit 42240901f7c4
(ipv6: Implement different admin modes for automatic flow labels), the
autoflowlabel behavior of a sock isn't sticky, eg, if sysctl changes,
existing connection will change autoflowlabel behavior. After that
commit, autoflowlabel behavior is sticky in the whole life of the sock.
With this patch, the behavior isn't sticky again.

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Tom Herbert <tom@quantonium.net>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoopenvswitch: Fix pop_vlan action for double tagged frames
Eric Garver [Wed, 20 Dec 2017 20:09:22 +0000 (15:09 -0500)]
openvswitch: Fix pop_vlan action for double tagged frames

skb_vlan_pop() expects skb->protocol to be a valid TPID for double
tagged frames. So set skb->protocol to the TPID and let skb_vlan_pop()
shift the true ethertype into position for us.

Fixes: 5108bbaddc37 ("openvswitch: add processing of L3 packets")
Signed-off-by: Eric Garver <e@erig.me>
Reviewed-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoRevert "bdi: add error handle for bdi_debug_register"
Jens Axboe [Thu, 21 Dec 2017 17:01:30 +0000 (10:01 -0700)]
Revert "bdi: add error handle for bdi_debug_register"

This reverts commit a0747a859ef6d3cc5b6cd50eb694499b78dd0025.

It breaks some booting for some users, and more than a week
into this, there's still no good fix. Revert this commit
for now until a solution has been found.

Reported-by: Laura Abbott <labbott@redhat.com>
Reported-by: Bruno Wolff III <bruno@wolff.to>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoipv6: Honor specified parameters in fibmatch lookup
Ido Schimmel [Wed, 20 Dec 2017 10:28:25 +0000 (12:28 +0200)]
ipv6: Honor specified parameters in fibmatch lookup

Currently, parameters such as oif and source address are not taken into
account during fibmatch lookup. Example (IPv4 for reference) before
patch:

$ ip -4 route show
192.0.2.0/24 dev dummy0 proto kernel scope link src 192.0.2.1
198.51.100.0/24 dev dummy1 proto kernel scope link src 198.51.100.1

$ ip -6 route show
2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
2001:db8:2::/64 dev dummy1 proto kernel metric 256 pref medium
fe80::/64 dev dummy0 proto kernel metric 256 pref medium
fe80::/64 dev dummy1 proto kernel metric 256 pref medium

$ ip -4 route get fibmatch 192.0.2.2 oif dummy0
192.0.2.0/24 dev dummy0 proto kernel scope link src 192.0.2.1
$ ip -4 route get fibmatch 192.0.2.2 oif dummy1
RTNETLINK answers: No route to host

$ ip -6 route get fibmatch 2001:db8:1::2 oif dummy0
2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
$ ip -6 route get fibmatch 2001:db8:1::2 oif dummy1
2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium

After:

$ ip -6 route get fibmatch 2001:db8:1::2 oif dummy0
2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
$ ip -6 route get fibmatch 2001:db8:1::2 oif dummy1
RTNETLINK answers: Network is unreachable

The problem stems from the fact that the necessary route lookup flags
are not set based on these parameters.

Instead of duplicating the same logic for fibmatch, we can simply
resolve the original route from its copy and dump it instead.

Fixes: 18c3a61c4264 ("net: ipv6: RTM_GETROUTE: return matched fib result when requested")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotools/kvm_stat: sort '-f help' output
Stefan Raspl [Thu, 21 Dec 2017 12:03:27 +0000 (13:03 +0100)]
tools/kvm_stat: sort '-f help' output

Sort the fields returned by specifying '-f help' on the command line.
While at it, simplify the code a bit, indent the output and eliminate an
extra blank line at the beginning.

Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 years agokvm: x86: fix RSM when PCID is non-zero
Paolo Bonzini [Wed, 20 Dec 2017 23:49:14 +0000 (00:49 +0100)]
kvm: x86: fix RSM when PCID is non-zero

rsm_load_state_64() and rsm_enter_protected_mode() load CR3, then
CR4 & ~PCIDE, then CR0, then CR4.

However, setting CR4.PCIDE fails if CR3[11:0] != 0.  It's probably easier
in the long run to replace rsm_enter_protected_mode() with an emulator
callback that sets all the special registers (like KVM_SET_SREGS would
do).  For now, set the PCID field of CR3 only after CR4.PCIDE is 1.

Reported-by: Laszlo Ersek <lersek@redhat.com>
Tested-by: Laszlo Ersek <lersek@redhat.com>
Fixes: 660a5d517aaab9187f93854425c4c63f4a09195c
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
David S. Miller [Thu, 21 Dec 2017 04:10:29 +0000 (23:10 -0500)]
Merge git://git./pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2017-12-21

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix multiple security issues in the BPF verifier mostly related
   to the value and min/max bounds tracking rework in 4.14. Issues
   range from incorrect bounds calculation in some BPF_RSH cases,
   to improper sign extension and reg size handling on 32 bit
   ALU ops, missing strict alignment checks on stack pointers, and
   several others that got fixed, from Jann, Alexei and Edward.

2) Fix various build failures in BPF selftests on sparc64. More
   specifically, librt needed to be added to the libs to link
   against and few format string fixups for sizeof, from David.

3) Fix one last remaining issue from BPF selftest build that was
   still occuring on s390x from the asm/bpf_perf_event.h include
   which could not find the asm/ptrace.h copy, from Hendrik.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agobpf: do not allow root to mangle valid pointers
Alexei Starovoitov [Tue, 19 Dec 2017 04:15:20 +0000 (20:15 -0800)]
bpf: do not allow root to mangle valid pointers

Do not allow root to convert valid pointers into unknown scalars.
In particular disallow:
 ptr &= reg
 ptr <<= reg
 ptr += ptr
and explicitly allow:
 ptr -= ptr
since pkt_end - pkt == length

1.
This minimizes amount of address leaks root can do.
In the future may need to further tighten the leaks with kptr_restrict.

2.
If program has such pointer math it's likely a user mistake and
when verifier complains about it right away instead of many instructions
later on invalid memory access it's easier for users to fix their progs.

3.
when register holding a pointer cannot change to scalar it allows JITs to
optimize better. Like 32-bit archs could use single register for pointers
instead of a pair required to hold 64-bit scalars.

4.
reduces architecture dependent behavior. Since code:
r1 = r10;
r1 &= 0xff;
if (r1 ...)
will behave differently arm64 vs x64 and offloaded vs native.

A significant chunk of ptr mangling was allowed by
commit f1174f77b50c ("bpf/verifier: rework value tracking")
yet some of it was allowed even earlier.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
3 years agoMerge branch 'bpf-verifier-sec-fixes'
Daniel Borkmann [Thu, 21 Dec 2017 01:15:42 +0000 (02:15 +0100)]
Merge branch 'bpf-verifier-sec-fixes'

Alexei Starovoitov says:

====================
This patch set addresses a set of security vulnerabilities
in bpf verifier logic discovered by Jann Horn.
All of the patches are candidates for 4.14 stable.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
3 years agoselftests/bpf: add tests for recent bugfixes
Jann Horn [Tue, 19 Dec 2017 04:12:01 +0000 (20:12 -0800)]
selftests/bpf: add tests for recent bugfixes

These tests should cover the following cases:

 - MOV with both zero-extended and sign-extended immediates
 - implicit truncation of register contents via ALU32/MOV32
 - implicit 32-bit truncation of ALU32 output
 - oversized register source operand for ALU32 shift
 - right-shift of a number that could be positive or negative
 - map access where adding the operation size to the offset causes signed
   32-bit overflow
 - direct stack access at a ~4GiB offset

Also remove the F_LOAD_WITH_STRICT_ALIGNMENT flag from a bunch of tests
that should fail independent of what flags userspace passes.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
3 years agobpf: fix integer overflows
Alexei Starovoitov [Tue, 19 Dec 2017 04:12:00 +0000 (20:12 -0800)]
bpf: fix integer overflows

There were various issues related to the limited size of integers used in
the verifier:
 - `off + size` overflow in __check_map_access()
 - `off + reg->off` overflow in check_mem_access()
 - `off + reg->var_off.value` overflow or 32-bit truncation of
   `reg->var_off.value` in check_mem_access()
 - 32-bit truncation in check_stack_boundary()

Make sure that any integer math cannot overflow by not allowing
pointer math with large values.

Also reduce the scope of "scalar op scalar" tracking.

Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
3 years agobpf: don't prune branches when a scalar is replaced with a pointer
Jann Horn [Tue, 19 Dec 2017 04:11:59 +0000 (20:11 -0800)]
bpf: don't prune branches when a scalar is replaced with a pointer

This could be made safe by passing through a reference to env and checking
for env->allow_ptr_leaks, but it would only work one way and is probably
not worth the hassle - not doing it will not directly lead to program
rejection.

Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
3 years agobpf: force strict alignment checks for stack pointers
Jann Horn [Tue, 19 Dec 2017 04:11:58 +0000 (20:11 -0800)]
bpf: force strict alignment checks for stack pointers

Force strict alignment checks for stack pointers because the tracking of
stack spills relies on it; unaligned stack accesses can lead to corruption
of spilled registers, which is exploitable.

Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
3 years agobpf: fix missing error return in check_stack_boundary()
Jann Horn [Tue, 19 Dec 2017 04:11:57 +0000 (20:11 -0800)]
bpf: fix missing error return in check_stack_boundary()

Prevent indirect stack accesses at non-constant addresses, which would
permit reading and corrupting spilled pointers.

Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
3 years agobpf: fix 32-bit ALU op verification
Jann Horn [Tue, 19 Dec 2017 04:11:56 +0000 (20:11 -0800)]
bpf: fix 32-bit ALU op verification

32-bit ALU ops operate on 32-bit values and have 32-bit outputs.
Adjust the verifier accordingly.

Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
3 years agobpf: fix incorrect tracking of register size truncation
Jann Horn [Tue, 19 Dec 2017 04:11:55 +0000 (20:11 -0800)]
bpf: fix incorrect tracking of register size truncation

Properly handle register truncation to a smaller size.

The old code first mirrors the clearing of the high 32 bits in the bitwise
tristate representation, which is correct. But then, it computes the new
arithmetic bounds as the intersection between the old arithmetic bounds and
the bounds resulting from the bitwise tristate representation. Therefore,
when coerce_reg_to_32() is called on a number with bounds
[0xffff'fff8, 0x1'0000'0007], the verifier computes
[0xffff'fff8, 0xffff'ffff] as bounds of the truncated number.
This is incorrect: The truncated number could also be in the range [0, 7],
and no meaningful arithmetic bounds can be computed in that case apart from
the obvious [0, 0xffff'ffff].

Starting with v4.14, this is exploitable by unprivileged users as long as
the unprivileged_bpf_disabled sysctl isn't set.

Debian assigned CVE-2017-16996 for this issue.

v2:
 - flip the mask during arithmetic bounds calculation (Ben Hutchings)
v3:
 - add CVE number (Ben Hutchings)

Fixes: b03c9f9fdc37 ("bpf/verifier: track signed and unsigned min/max values")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
3 years agobpf: fix incorrect sign extension in check_alu_op()
Jann Horn [Tue, 19 Dec 2017 04:11:54 +0000 (20:11 -0800)]
bpf: fix incorrect sign extension in check_alu_op()

Distinguish between
BPF_ALU64|BPF_MOV|BPF_K (load 32-bit immediate, sign-extended to 64-bit)
and BPF_ALU|BPF_MOV|BPF_K (load 32-bit immediate, zero-padded to 64-bit);
only perform sign extension in the first case.

Starting with v4.14, this is exploitable by unprivileged users as long as
the unprivileged_bpf_disabled sysctl isn't set.

Debian assigned CVE-2017-16995 for this issue.

v3:
 - add CVE number (Ben Hutchings)

Fixes: 484611357c19 ("bpf: allow access into map value arrays")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
3 years agobpf/verifier: fix bounds calculation on BPF_RSH
Edward Cree [Tue, 19 Dec 2017 04:11:53 +0000 (20:11 -0800)]
bpf/verifier: fix bounds calculation on BPF_RSH

Incorrect signed bounds were being computed.
If the old upper signed bound was positive and the old lower signed bound was
negative, this could cause the new upper signed bound to be too low,
leading to security issues.

Fixes: b03c9f9fdc37 ("bpf/verifier: track signed and unsigned min/max values")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
[jannh@google.com: changed description to reflect bug impact]
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
3 years agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Thu, 21 Dec 2017 00:52:01 +0000 (16:52 -0800)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "Two simple fixes: one for sparse warnings that were introduced by the
  merge window conversion to blist_flags_t and the other to fix dropped
  I/O during reset in aacraid"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: aacraid: Fix I/O drop during reset
  scsi: core: Use blist_flags_t consistently

3 years agoMerge branch 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm
Linus Torvalds [Thu, 21 Dec 2017 00:47:14 +0000 (16:47 -0800)]
Merge branch 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm

Pull ARM fix from Russell King:
 "Just one fix for a problem in the csum_partial_copy_from_user()
  implementation when software PAN is enabled"

* 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm:
  ARM: 8731/1: Fix csum_partial_copy_from_user() stack mismatch

3 years agoMerge tag 'acpi-4.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael...
Linus Torvalds [Wed, 20 Dec 2017 21:44:21 +0000 (13:44 -0800)]
Merge tag 'acpi-4.15-rc5' of git://git./linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
 "These fix a recently introduced issue in the ACPI CPPC driver and an
  obscure error hanling bug in the APEI code.

  Specifics:

   - Fix an error handling issue in the ACPI APEI implementation of the
     >read callback in struct pstore_info (Takashi Iwai).

   - Fix a possible out-of-bounds arrar read in the ACPI CPPC driver
     (Colin Ian King)"

* tag 'acpi-4.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: APEI / ERST: Fix missing error handling in erst_reader()
  ACPI: CPPC: remove initial assignment of pcc_ss_data

3 years agoMerge tag 'pm-4.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Linus Torvalds [Wed, 20 Dec 2017 21:41:40 +0000 (13:41 -0800)]
Merge tag 'pm-4.15-rc5' of git://git./linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix a regression in the ondemand and conservative cpufreq
  governors that was introduced during the 4.13 cycle, a recent
  regression in the imx6q cpufreq driver and a regression in the PCI
  handling of hibernation from the 4.14 cycle.

  Specifics:

   - Fix an issue in the PCI handling of the "thaw" transition during
     hibernation (after creating an image), introduced by a bug fix from
     the 4.13 cycle and exposed by recent changes in the IRQ subsystem,
     that caused pci_restore_state() to be called for devices in
     low-power states in some cases which is incorrect and breaks MSI
     management on some systems (Rafael Wysocki).

   - Fix a recent regression in the imx6q cpufreq driver that broke
     speed grading on i.MX6 QuadPlus by omitting checks causing invalid
     operating performance points (OPPs) to be disabled on that SoC as
     appropriate (Lucas Stach).

   - Fix a regression introduced during the 4.14 cycle in the ondemand
     and conservative cpufreq governors that causes the sampling
     interval used by them to be shorter than the tick period in some
     cases which leads to incorrect decisions (Rafael Wysocki)"

* tag 'pm-4.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: governor: Ensure sufficiently large sampling intervals
  cpufreq: imx6q: fix speed grading regression on i.MX6 QuadPlus
  PCI / PM: Force devices to D0 in pci_pm_thaw_noirq()

3 years agoMerge tag 'spi-fix-v4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brooni...
Linus Torvalds [Wed, 20 Dec 2017 21:38:00 +0000 (13:38 -0800)]
Merge tag 'spi-fix-v4.15-rc4' of git://git./linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
 "A bunch of really small fixes here, all driver specific and mostly in
  error handling and remove paths.

  The most important fixes are for the a3700 clock configuration and a
  fix for a nasty stall which could potentially cause data corruption
  with the xilinx driver"

* tag 'spi-fix-v4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: atmel: fixed spin_lock usage inside atmel_spi_remove
  spi: sun4i: disable clocks in the remove function
  spi: rspi: Do not set SPCR_SPE in qspi_set_config_register()
  spi: Fix double "when"
  spi: a3700: Fix clk prescaling for coefficient over 15
  spi: xilinx: Detect stall with Unknown commands
  spi: imx: Update device tree binding documentation

3 years agoMerge tag 'mfd-fixes-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd
Linus Torvalds [Wed, 20 Dec 2017 21:35:10 +0000 (13:35 -0800)]
Merge tag 'mfd-fixes-4.15' of git://git./linux/kernel/git/lee/mfd

Pull MDF bugfixes from Lee Jones:

  - Fix message timing issues and report correct state when an error
    occurs in cros_ec_spi

  - Reorder enums used for Power Management in rtsx_pci

  - Use correct OF helper for obtaining child nodes in twl4030-audio and
    twl6040

* tag 'mfd-fixes-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd:
  mfd: Fix RTS5227 (and others) powermanagement
  mfd: cros ec: spi: Fix "in progress" error signaling
  mfd: twl6040: Fix child-node lookup
  mfd: twl4030-audio: Fix sibling-node lookup
  mfd: cros ec: spi: Don't send first message too soon

3 years agoMerge tag 'sound-4.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai...
Linus Torvalds [Wed, 20 Dec 2017 21:03:20 +0000 (13:03 -0800)]
Merge tag 'sound-4.15-rc5' of git://git./linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "All stable fixes here:

   - a regression fix of USB-audio for the previous hardening patch

   - a potential UAF fix in rawmidi

   - HD-audio and USB-audio quirks, the missing new ID"

* tag 'sound-4.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: usb-audio: Fix the missing ctl name suffix at parsing SU
  ALSA: hda/realtek - Fix Dell AIO LineOut issue
  ALSA: rawmidi: Avoid racy info ioctl via ctl device
  ALSA: hda - Add vendor id for Cannonlake HDMI codec
  ALSA: usb-audio: Add native DSD support for Esoteric D-05X

3 years agonull_blk: unalign call_single_data
Jens Axboe [Wed, 20 Dec 2017 20:14:42 +0000 (13:14 -0700)]
null_blk: unalign call_single_data

Commit 966a967116e6 randomly added alignment to this structure, but
it's actually detrimental to performance of null_blk. Test case:

Running on both the home and remote node shows a ~5% degradation
in performance.

While in there, move blk_status_t to the hole after the integer tag
in the nullb_cmd structure. After this patch, we shrink the size
from 192 to 152 bytes.

Fixes: 966a967116e69 ("smp: Avoid using two cache lines for struct call_single_data")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: unalign call_single_data in struct request
Jens Axboe [Wed, 20 Dec 2017 20:13:58 +0000 (13:13 -0700)]
block: unalign call_single_data in struct request

A previous change blindly added massive alignment to the
call_single_data structure in struct request. This ballooned it in size
from 296 to 320 bytes on my setup, for no valid reason at all.

Use the unaligned struct __call_single_data variant instead.

Fixes: 966a967116e69 ("smp: Avoid using two cache lines for struct call_single_data")
Cc: stable@vger.kernel.org # v4.14
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoipv4: Fix use-after-free when flushing FIB tables
Ido Schimmel [Wed, 20 Dec 2017 17:34:19 +0000 (19:34 +0200)]
ipv4: Fix use-after-free when flushing FIB tables

Since commit 0ddcf43d5d4a ("ipv4: FIB Local/MAIN table collapse") the
local table uses the same trie allocated for the main table when custom
rules are not in use.

When a net namespace is dismantled, the main table is flushed and freed
(via an RCU callback) before the local table. In case the callback is
invoked before the local table is iterated, a use-after-free can occur.

Fix this by iterating over the FIB tables in reverse order, so that the
main table is always freed after the local table.

v3: Reworded comment according to Alex's suggestion.
v2: Add a comment to make the fix more explicit per Dave's and Alex's
feedback.

Fixes: 0ddcf43d5d4a ("ipv4: FIB Local/MAIN table collapse")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agos390/qeth: fix error handling in checksum cmd callback
Julian Wiedmann [Wed, 20 Dec 2017 17:07:18 +0000 (18:07 +0100)]
s390/qeth: fix error handling in checksum cmd callback

Make sure to check both return code fields before processing the
response. Otherwise we risk operating on invalid data.

Fixes: c9475369bd2b ("s390/qeth: rework RX/TX checksum offload")
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotipc: remove joining group member from congested list
Jon Maloy [Wed, 20 Dec 2017 10:03:15 +0000 (11:03 +0100)]
tipc: remove joining group member from congested list

When we receive a JOIN message from a peer member, the message may
contain an advertised window value ADV_IDLE that permits removing the
member in question from the tipc_group::congested list. However, since
the removal has been made conditional on that the advertised window is
*not* ADV_IDLE, we miss this case. This has the effect that a sender
sometimes may enter a state of permanent, false, broadcast congestion.

We fix this by unconditinally removing the member from the congested
list before calling tipc_member_update(), which might potentially sort
it into the list again.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoselftests: net: Adding config fragment CONFIG_NUMA=y
Naresh Kamboju [Wed, 20 Dec 2017 07:20:22 +0000 (12:50 +0530)]
selftests: net: Adding config fragment CONFIG_NUMA=y

kernel config fragement CONFIG_NUMA=y is need for reuseport_bpf_numa.

Signed-off-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge tag 'mlx5-fixes-2017-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git...
David S. Miller [Wed, 20 Dec 2017 18:41:05 +0000 (13:41 -0500)]
Merge tag 'mlx5-fixes-2017-12-19' of git://git./linux/kernel/git/saeed/linux

Saeed Mahameed says:

===================
Mellanox, mlx5 fixes 2017-12-19

The follwoing series includes some fixes for mlx5 core and etherent
driver.

Please pull and let me know if there is any problem.

This series doesn't introduce any conflict with the ongoing mlx5 for-next
submission.

For -stable:

kernels >= v4.7.y
    ("net/mlx5e: Fix possible deadlock of VXLAN lock")
    ("net/mlx5e: Add refcount to VXLAN structure")
    ("net/mlx5e: Prevent possible races in VXLAN control flow")
    ("net/mlx5e: Fix features check of IPv6 traffic")

kernels >= v4.9.y
    ("net/mlx5: Fix error flow in CREATE_QP command")
    ("net/mlx5: Fix rate limit packet pacing naming and struct")

kernels >= v4.13.y
    ("net/mlx5: FPGA, return -EINVAL if size is zero")

kernels >= v4.14.y
    ("Revert "mlx5: move affinity hints assignments to generic code")

All above patches apply and compile with no issues on corresponding -stable.
===================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoblock-throttle: avoid double charge
Shaohua Li [Wed, 20 Dec 2017 18:10:17 +0000 (11:10 -0700)]
block-throttle: avoid double charge

If a bio is throttled and split after throttling, the bio could be
resubmited and enters the throttling again. This will cause part of the
bio to be charged multiple times. If the cgroup has an IO limit, the
double charge will significantly harm the performance. The bio split
becomes quite common after arbitrary bio size change.

To fix this, we always set the BIO_THROTTLED flag if a bio is throttled.
If the bio is cloned/split, we copy the flag to new bio too to avoid a
double charge. However, cloned bio could be directed to a new disk,
keeping the flag be a problem. The observation is we always set new disk
for the bio in this case, so we can clear the flag in bio_set_dev().

This issue exists for a long time, arbitrary bio size change just makes
it worse, so this should go into stable at least since v4.2.

V1-> V2: Not add extra field in bio based on discussion with Tejun

Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: stable@vger.kernel.org
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoMerge branch 'cls_bpf-fix-offload-state-tracking-with-block-callbacks'
David S. Miller [Wed, 20 Dec 2017 18:08:19 +0000 (13:08 -0500)]
Merge branch 'cls_bpf-fix-offload-state-tracking-with-block-callbacks'

Jakub Kicinski says:

===================
cls_bpf: fix offload state tracking with block callbacks

After introduction of block callbacks classifiers can no longer track
offload state.  cls_bpf used to do that in an attempt to move common
code from drivers to the core.  Remove that functionality and fix
drivers.

The user-visible bug this is fixing is that trying to offload a second
filter would trigger a spurious DESTROY and in turn disable the already
installed one.
===================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonfp: bpf: keep track of the offloaded program
Jakub Kicinski [Tue, 19 Dec 2017 21:32:14 +0000 (13:32 -0800)]
nfp: bpf: keep track of the offloaded program

After TC offloads were converted to callbacks we have no choice
but keep track of the offloaded filter in the driver.

The check for nn->dp.bpf_offload_xdp was a stop gap solution
to make sure failed TC offload won't disable XDP, it's no longer
necessary.  nfp_net_bpf_offload() will return -EBUSY on
TC vs XDP conflicts.

Fixes: 3f7889c4c79b ("net: sched: cls_bpf: call block callbacks for offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agocls_bpf: fix offload assumptions after callback conversion
Jakub Kicinski [Tue, 19 Dec 2017 21:32:13 +0000 (13:32 -0800)]
cls_bpf: fix offload assumptions after callback conversion

cls_bpf used to take care of tracking what offload state a filter
is in, i.e. it would track if offload request succeeded or not.
This information would then be used to issue correct requests to
the driver, e.g. requests for statistics only on offloaded filters,
removing only filters which were offloaded, using add instead of
replace if previous filter was not added etc.

This tracking of offload state no longer functions with the new
callback infrastructure.  There could be multiple entities trying
to offload the same filter.

Throw out all the tracking and corresponding commands and simply
pass to the drivers both old and new bpf program.  Drivers will
have to deal with offload state tracking by themselves.

Fixes: 3f7889c4c79b ("net: sched: cls_bpf: call block callbacks for offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: Fix double free and memory corruption in get_net_ns_by_id()
Eric W. Biederman [Tue, 19 Dec 2017 17:27:56 +0000 (11:27 -0600)]
net: Fix double free and memory corruption in get_net_ns_by_id()

(I can trivially verify that that idr_remove in cleanup_net happens
 after the network namespace count has dropped to zero --EWB)

Function get_net_ns_by_id() does not check for net::count
after it has found a peer in netns_ids idr.

It may dereference a peer, after its count has already been
finaly decremented. This leads to double free and memory
corruption:

put_net(peer)                                   rtnl_lock()
atomic_dec_and_test(&peer->count) [count=0]     ...
__put_net(peer)                                 get_net_ns_by_id(net, id)
  spin_lock(&cleanup_list_lock)
  list_add(&net->cleanup_list, &cleanup_list)
  spin_unlock(&cleanup_list_lock)
queue_work()                                      peer = idr_find(&net->netns_ids, id)
  |                                               get_net(peer) [count=1]
  |                                               ...
  |                                               (use after final put)
  v                                               ...
  cleanup_net()                                   ...
    spin_lock(&cleanup_list_lock)                 ...
    list_replace_init(&cleanup_list, ..)          ...
    spin_unlock(&cleanup_list_lock)               ...
    ...                                           ...
    ...                                           put_net(peer)
    ...                                             atomic_dec_and_test(&peer->count) [count=0]
    ...                                               spin_lock(&cleanup_list_lock)
    ...                                               list_add(&net->cleanup_list, &cleanup_list)
    ...                                               spin_unlock(&cleanup_list_lock)
    ...                                             queue_work()
    ...                                           rtnl_unlock()
    rtnl_lock()                                   ...
    for_each_net(tmp) {                           ...
      id = __peernet2id(tmp, peer)                ...
      spin_lock_irq(&tmp->nsid_lock)              ...
      idr_remove(&tmp->netns_ids, id)             ...
      ...                                         ...
      net_drop_ns()                               ...
net_free(peer)                            ...
    }                                             ...
  |
  v
  cleanup_net()
    ...
    (Second free of peer)

Also, put_net() on the right cpu may reorder with left's cpu
list_replace_init(&cleanup_list, ..), and then cleanup_list
will be corrupted.

Since cleanup_net() is executed in worker thread, while
put_net(peer) can happen everywhere, there should be
enough time for concurrent get_net_ns_by_id() to pick
the peer up, and the race does not seem to be unlikely.
The patch fixes the problem in standard way.

(Also, there is possible problem in peernet2id_alloc(), which requires
check for net::count under nsid_lock and maybe_get_net(peer), but
in current stable kernel it's used under rtnl_lock() and it has to be
safe. Openswitch begun to use peernet2id_alloc(), and possibly it should
be fixed too. While this is not in stable kernel yet, so I'll send
a separate message to netdev@ later).

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Fixes: 0c7aecd4bde4 "netns: add rtnl cmd to add and get peer netns ids"
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'mvneta-fixes'
David S. Miller [Wed, 20 Dec 2017 17:24:12 +0000 (12:24 -0500)]
Merge branch 'mvneta-fixes'

Gregory CLEMENT says:

====================
Few mvneta fixes

here it is a small series of fixes found on the mvneta driver. They
had been already used in the vendor kernel and are now ported to
mainline.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: mvneta: eliminate wrong call to handle rx descriptor error
Yelena Krivosheev [Tue, 19 Dec 2017 16:59:47 +0000 (17:59 +0100)]
net: mvneta: eliminate wrong call to handle rx descriptor error

There are few reasons in mvneta_rx_swbm() function when received packet
is dropped. mvneta_rx_error() should be called only if error bit [16]
is set in rx descriptor.

[gregory.clement@free-electrons.com: add fixes tag]
Cc: stable@vger.kernel.org
Fixes: dc35a10f68d3 ("net: mvneta: bm: add support for hardware buffer management")
Signed-off-by: Yelena Krivosheev <yelena@marvell.com>
Tested-by: Dmitri Epshtein <dima@marvell.com>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: mvneta: use proper rxq_number in loop on rx queues
Yelena Krivosheev [Tue, 19 Dec 2017 16:59:46 +0000 (17:59 +0100)]
net: mvneta: use proper rxq_number in loop on rx queues

When adding the RX queue association with each CPU, a typo was made in
the mvneta_cleanup_rxqs() function. This patch fixes it.

[gregory.clement@free-electrons.com: add commit log and fixes tag]
Cc: stable@vger.kernel.org
Fixes: 2dcf75e2793c ("net: mvneta: Associate RX queues with each CPU")
Signed-off-by: Yelena Krivosheev <yelena@marvell.com>
Tested-by: Dmitri Epshtein <dima@marvell.com>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: mvneta: clear interface link status on port disable
Yelena Krivosheev [Tue, 19 Dec 2017 16:59:45 +0000 (17:59 +0100)]
net: mvneta: clear interface link status on port disable

When port connect to PHY in polling mode (with poll interval 1 sec),
port and phy link status must be synchronize in order don't loss link
change event.

[gregory.clement@free-electrons.com: add fixes tag]
Cc: <stable@vger.kernel.org>
Fixes: c5aff18204da ("net: mvneta: driver for Marvell Armada 370/XP network unit")
Signed-off-by: Yelena Krivosheev <yelena@marvell.com>
Tested-by: Dmitri Epshtein <dima@marvell.com>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'acpi-cppc'
Rafael J. Wysocki [Wed, 20 Dec 2017 14:51:26 +0000 (15:51 +0100)]
Merge branch 'acpi-cppc'

* acpi-cppc:
  ACPI: CPPC: remove initial assignment of pcc_ss_data

3 years agoMerge branch 'pm-pci'
Rafael J. Wysocki [Wed, 20 Dec 2017 14:12:40 +0000 (15:12 +0100)]
Merge branch 'pm-pci'

* pm-pci:
  PCI / PM: Force devices to D0 in pci_pm_thaw_noirq()

3 years agoDo not hash userspace addresses in fault handlers
Kees Cook [Tue, 19 Dec 2017 21:52:23 +0000 (13:52 -0800)]
Do not hash userspace addresses in fault handlers

The hashing of %p was designed to restrict kernel addresses. There is
no reason to hash the userspace values seen during a segfault report,
so switch these to %px. (Some architectures already use %lx.)

Fixes: ad67b74d2469d9b8 ("printk: hash addresses printed with %p")
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agobpf: Fix tools and testing build.
David Miller [Tue, 19 Dec 2017 20:22:03 +0000 (15:22 -0500)]
bpf: Fix tools and testing build.

I'm getting various build failures on sparc64.  The key is
usually that the userland tools get built 32-bit.

1) clock_gettime() is in librt, so that must be added to the link
   libraries.

2) "sizeof(x)" must be printed with "%Z" printf prefix.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
3 years agonet/mlx5: Stay in polling mode when command EQ destroy fails
Moshe Shemesh [Mon, 4 Dec 2017 13:23:51 +0000 (15:23 +0200)]
net/mlx5: Stay in polling mode when command EQ destroy fails

During unload, on mlx5_stop_eqs we move command interface from events
mode to polling mode, but if command interface EQ destroy fail we move
back to events mode.
That's wrong since even if we fail to destroy command interface EQ, we
do release its irq, so no interrupts will be received.

Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
3 years agonet/mlx5: Cleanup IRQs in case of unload failure
Moshe Shemesh [Tue, 21 Nov 2017 13:15:51 +0000 (15:15 +0200)]
net/mlx5: Cleanup IRQs in case of unload failure

When mlx5_stop_eqs fails to destroy any of the eqs it returns with an error.
In such failure flow the function will return without
releasing all EQs irqs and then pci_free_irq_vectors will fail.
Fix by only warn on destroy EQ failure and continue to release other
EQs and their irqs.

It fixes the following kernel trace:
kernel: kernel BUG at drivers/pci/msi.c:352!
...
...
kernel: Call Trace:
kernel: pci_disable_msix+0xd3/0x100
kernel: pci_free_irq_vectors+0xe/0x20
kernel: mlx5_load_one.isra.17+0x9f5/0xec0 [mlx5_core]

Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
3 years agonet/mlx5: Fix steering memory leak
Maor Gottlieb [Tue, 5 Dec 2017 11:45:21 +0000 (13:45 +0200)]
net/mlx5: Fix steering memory leak

Flow steering priority and namespace are software only objects that
didn't have the proper destructors and were not freed during steering
cleanup.

Fix it by adding destructor functions for these objects.

Fixes: bd71b08ec2ee ("net/mlx5: Support multiple updates of steering rules in parallel")
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
3 years agonet/mlx5e: Prevent possible races in VXLAN control flow
Gal Pressman [Mon, 4 Dec 2017 07:57:43 +0000 (09:57 +0200)]
net/mlx5e: Prevent possible races in VXLAN control flow

When calling add/remove VXLAN port, a lock must be held in order to
prevent race scenarios when more than one add/remove happens at the
same time.
Fix by holding our state_lock (mutex) as done by all other parts of the
driver.
Note that the spinlock protecting the radix-tree is still needed in
order to synchronize radix-tree access from softirq context.

Fixes: b3f63c3d5e2c ("net/mlx5e: Add netdev support for VXLAN tunneling")
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
3 years agonet/mlx5e: Add refcount to VXLAN structure
Gal Pressman [Sun, 3 Dec 2017 11:58:50 +0000 (13:58 +0200)]
net/mlx5e: Add refcount to VXLAN structure

A refcount mechanism must be implemented in order to prevent unwanted
scenarios such as:
- Open an IPv4 VXLAN interface
- Open an IPv6 VXLAN interface (different socket)
- Remove one of the interfaces

With current implementation, the UDP port will be removed from our VXLAN
database and turn off the offloads for the other interface, which is
still active.
The reference count mechanism will only allow UDP port removals once all
consumers are gone.

Fixes: b3f63c3d5e2c ("net/mlx5e: Add netdev support for VXLAN tunneling")
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
3 years agonet/mlx5e: Fix possible deadlock of VXLAN lock
Gal Pressman [Thu, 23 Nov 2017 11:52:28 +0000 (13:52 +0200)]
net/mlx5e: Fix possible deadlock of VXLAN lock

mlx5e_vxlan_lookup_port is called both from mlx5e_add_vxlan_port (user
context) and mlx5e_features_check (softirq), but the lock acquired does
not disable bottom half and might result in deadlock. Fix it by simply
replacing spin_lock() with spin_lock_bh().
While at it, replace all unnecessary spin_lock_irq() to spin_lock_bh().

lockdep's WARNING: inconsistent lock state
[  654.028136] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[  654.028229] swapper/5/0 [HC0[0]:SC1[9]:HE1:SE0] takes:
[  654.028321]  (&(&vxlan_db->lock)->rlock){+.?.}, at: [<ffffffffa06e7f0e>] mlx5e_vxlan_lookup_port+0x1e/0x50 [mlx5_core]
[  654.028528] {SOFTIRQ-ON-W} state was registered at:
[  654.028607]   _raw_spin_lock+0x3c/0x70
[  654.028689]   mlx5e_vxlan_lookup_port+0x1e/0x50 [mlx5_core]
[  654.028794]   mlx5e_vxlan_add_port+0x2e/0x120 [mlx5_core]
[  654.028878]   process_one_work+0x1e9/0x640
[  654.028942]   worker_thread+0x4a/0x3f0
[  654.029002]   kthread+0x141/0x180
[  654.029056]   ret_from_fork+0x24/0x30
[  654.029114] irq event stamp: 579088
[  654.029174] hardirqs last  enabled at (579088): [<ffffffff818f475a>] ip6_finish_output2+0x49a/0x8c0
[  654.029309] hardirqs last disabled at (579087): [<ffffffff818f470e>] ip6_finish_output2+0x44e/0x8c0
[  654.029446] softirqs last  enabled at (579030): [<ffffffff810b3b3d>] irq_enter+0x6d/0x80
[  654.029567] softirqs last disabled at (579031): [<ffffffff810b3c05>] irq_exit+0xb5/0xc0
[  654.029684] other info that might help us debug this:
[  654.029781]  Possible unsafe locking scenario:

[  654.029868]        CPU0
[  654.029908]        ----
[  654.029947]   lock(&(&vxlan_db->lock)->rlock);
[  654.030045]   <Interrupt>
[  654.030090]     lock(&(&vxlan_db->lock)->rlock);
[  654.030162]
 *** DEADLOCK ***

Fixes: b3f63c3d5e2c ("net/mlx5e: Add netdev support for VXLAN tunneling")
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
3 years agonet/mlx5: Fix error flow in CREATE_QP command
Moni Shoua [Mon, 4 Dec 2017 06:59:25 +0000 (08:59 +0200)]
net/mlx5: Fix error flow in CREATE_QP command

In error flow, when DESTROY_QP command should be executed, the wrong
mailbox was set with data, not the one that is written to hardware,
Fix that.

Fixes: 09a7d9eca1a6 '{net,IB}/mlx5: QP/XRCD commands via mlx5 ifc'
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
3 years agonet/mlx5: Fix misspelling in the error message and comment
Eugenia Emantayev [Thu, 16 Nov 2017 12:57:48 +0000 (14:57 +0200)]
net/mlx5: Fix misspelling in the error message and comment

Fix misspelling in word syndrome.

Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
3 years agonet/mlx5e: Fix defaulting RX ring size when not needed
Eugenia Emantayev [Tue, 14 Nov 2017 07:44:55 +0000 (09:44 +0200)]
net/mlx5e: Fix defaulting RX ring size when not needed

Fixes the bug when turning on/off CQE compression mechanism
resets the RX rings size to default value when it is not
needed.

Fixes: 2fc4bfb7250d ("net/mlx5e: Dynamic RQ type infrastructure")
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
3 years agonet/mlx5e: Fix features check of IPv6 traffic
Gal Pressman [Tue, 21 Nov 2017 15:49:36 +0000 (17:49 +0200)]
net/mlx5e: Fix features check of IPv6 traffic

The assumption that the next header field contains the transport
protocol is wrong for IPv6 packets with extension headers.
Instead, we should look the inner-most next header field in the buffer.
This will fix TSO offload for tunnels over IPv6 with extension headers.

Performance testing: 19.25x improvement, cool!
Measuring bandwidth of 16 threads TCP traffic over IPv6 GRE tap.
CPU: Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
NIC: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
TSO: Enabled
Before: 4,926.24  Mbps
Now   : 94,827.91 Mbps

Fixes: b3f63c3d5e2c ("net/mlx5e: Add netdev support for VXLAN tunneling")
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
3 years agonet/mlx5e: Fix ETS BW check
Huy Nguyen [Thu, 26 Oct 2017 14:56:34 +0000 (09:56 -0500)]
net/mlx5e: Fix ETS BW check

Fix bug that allows ets bw sum to be 0% when ets tc type exists.

Fixes: 08fb1dacdd76 ('net/mlx5e: Support DCBNL IEEE ETS')
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
3 years agonet/mlx5: Fix rate limit packet pacing naming and struct
Eran Ben Elisha [Mon, 13 Nov 2017 08:11:27 +0000 (10:11 +0200)]
net/mlx5: Fix rate limit packet pacing naming and struct

In mlx5_ifc, struct size was not complete, and thus driver was sending
garbage after the last defined field. Fixed it by adding reserved field
to complete the struct size.

In addition, rename all set_rate_limit to set_pp_rate_limit to be
compliant with the Firmware <-> Driver definition.

Fixes: 7486216b3a0b ("{net,IB}/mlx5: mlx5_ifc updates")
Fixes: 1466cc5b23d1 ("net/mlx5: Rate limit tables support")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
3 years agoRevert "mlx5: move affinity hints assignments to generic code"
Saeed Mahameed [Fri, 10 Nov 2017 06:59:52 +0000 (15:59 +0900)]
Revert "mlx5: move affinity hints assignments to generic code"

Before the offending commit, mlx5 core did the IRQ affinity itself,
and it seems that the new generic code have some drawbacks and one
of them is the lack for user ability to modify irq affinity after
the initial affinity values got assigned.

The issue is still being discussed and a solution in the new generic code
is required, until then we need to revert this patch.

This fixes the following issue:
echo <new affinity> > /proc/irq/<x>/smp_affinity
fails with  -EIO

This reverts commit a435393acafbf0ecff4deb3e3cb554b34f0d0664.
Note: kept mlx5_get_vector_affinity in include/linux/mlx5/driver.h since
it is used in mlx5_ib driver.

Fixes: a435393acafb ("mlx5: move affinity hints assignments to generic code")
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jes Sorensen <jsorensen@fb.com>
Reported-by: Jes Sorensen <jsorensen@fb.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
3 years agonet/mlx5: FPGA, return -EINVAL if size is zero
Kamal Heib [Sun, 29 Oct 2017 02:03:37 +0000 (04:03 +0200)]
net/mlx5: FPGA, return -EINVAL if size is zero

Currently, if a size of zero is passed to
mlx5_fpga_mem_{read|write}_i2c()
the "err" return value will not be initialized, which triggers gcc
warnings:

[..]/mlx5/core/fpga/sdk.c:87 mlx5_fpga_mem_read_i2c() error:
uninitialized symbol 'err'.
[..]/mlx5/core/fpga/sdk.c:115 mlx5_fpga_mem_write_i2c() error:
uninitialized symbol 'err'.

fix that.

Fixes: a9956d35d199 ('net/mlx5: FPGA, Add SBU infrastructure')
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
3 years agoipv4: fib: Fix metrics match when deleting a route
Phil Sutter [Tue, 19 Dec 2017 14:17:13 +0000 (15:17 +0100)]
ipv4: fib: Fix metrics match when deleting a route

The recently added fib_metrics_match() causes a regression for routes
with both RTAX_FEATURES and RTAX_CC_ALGO if the latter has
TCP_CONG_NEEDS_ECN flag set:

| # ip link add d0 type dummy
| # ip link set d0 up
| # ip route add 172.29.29.0/24 dev d0 features ecn congctl dctcp
| # ip route del 172.29.29.0/24 dev d0 features ecn congctl dctcp
| RTNETLINK answers: No such process

During route insertion, fib_convert_metrics() detects that the given CC
algo requires ECN and hence sets DST_FEATURE_ECN_CA bit in
RTAX_FEATURES.

During route deletion though, fib_metrics_match() compares stored
RTAX_FEATURES value with that from userspace (which obviously has no
knowledge about DST_FEATURE_ECN_CA) and fails.

Fixes: 5f9ae3d9e7e4a ("ipv4: do metrics match when looking up and deleting a route")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: stmmac: Fix bad RX timestamp extraction
Fredrik Hallenberg [Mon, 18 Dec 2017 22:34:00 +0000 (23:34 +0100)]
net: stmmac: Fix bad RX timestamp extraction

As noted in dwmac4_wrback_get_rx_timestamp_status the timestamp is found
in the context descriptor following the current descriptor. However the
current code looks for the context descriptor in the current
descriptor, which will always fail.

Signed-off-by: Fredrik Hallenberg <megahallon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: stmmac: Fix TX timestamp calculation
Fredrik Hallenberg [Mon, 18 Dec 2017 22:33:59 +0000 (23:33 +0100)]
net: stmmac: Fix TX timestamp calculation

When using GMAC4 the value written in PTP_SSIR should be shifted however
the shifted value is also used in subsequent calculations which results
in a bad timestamp value.

Signed-off-by: Fredrik Hallenberg <megahallon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotipc: fix list sorting bug in function tipc_group_update_member()
Jon Maloy [Mon, 18 Dec 2017 19:03:05 +0000 (20:03 +0100)]
tipc: fix list sorting bug in function tipc_group_update_member()

When, during a join operation, or during message transmission, a group
member needs to be added to the group's 'congested' list, we sort it
into the list in ascending order, according to its current advertised
window size. However, we miss the case when the member is already on
that list. This will have the result that the member, after the window
size has been decremented, might be at the wrong position in that list.
This again may have the effect that we during broadcast and multicast
transmissions miss the fact that a destination is not yet ready for
reception, and we end up sending anyway. From this point on, the
behavior during the remaining session is unpredictable, e.g., with
underflowing window sizes.

We now correct this bug by unconditionally removing the member from
the list before (re-)sorting it in.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoip6_tunnel: get the min mtu properly in ip6_tnl_xmit
Xin Long [Mon, 18 Dec 2017 06:26:21 +0000 (14:26 +0800)]
ip6_tunnel: get the min mtu properly in ip6_tnl_xmit

Now it's using IPV6_MIN_MTU as the min mtu in ip6_tnl_xmit, but
IPV6_MIN_MTU actually only works when the inner packet is ipv6.

With IPV6_MIN_MTU for ipv4 packets, the new pmtu for inner dst
couldn't be set less than 1280. It would cause tx_err and the
packet to be dropped when the outer dst pmtu is close to 1280.

Jianlin found it by running ipv4 traffic with the topo:

  (client) gre6 <---> eth1 (route) eth2 <---> gre6 (server)

After changing eth2 mtu to 1300, the performance became very
low, or the connection was even broken. The issue also affects
ip4ip6 and ip6ip6 tunnels.

So if the inner packet is ipv4, 576 should be considered as the
min mtu.

Note that for ip4ip6 and ip6ip6 tunnels, the inner packet can
only be ipv4 or ipv6, but for gre6 tunnel, it may also be ARP.
This patch using 576 as the min mtu for non-ipv6 packet works
for all those cases.

Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoip6_gre: remove the incorrect mtu limit for ipgre tap
Xin Long [Mon, 18 Dec 2017 06:25:09 +0000 (14:25 +0800)]
ip6_gre: remove the incorrect mtu limit for ipgre tap

The same fix as the patch "ip_gre: remove the incorrect mtu limit for
ipgre tap" is also needed for ip6_gre.

Fixes: 61e84623ace3 ("net: centralize net_device min/max MTU checking")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoip_gre: remove the incorrect mtu limit for ipgre tap
Xin Long [Mon, 18 Dec 2017 06:24:35 +0000 (14:24 +0800)]
ip_gre: remove the incorrect mtu limit for ipgre tap

ipgre tap driver calls ether_setup(), after commit 61e84623ace3
("net: centralize net_device min/max MTU checking"), the range
of mtu is [min_mtu, max_mtu], which is [68, 1500] by default.

It causes the dev mtu of the ipgre tap device to not be greater
than 1500, this limit value is not correct for ipgre tap device.

Besides, it's .change_mtu already does the right check. So this
patch is just to set max_mtu as 0, and leave the check to it's
.change_mtu.

Fixes: 61e84623ace3 ("net: centralize net_device min/max MTU checking")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agovxlan: update skb dst pmtu on tx path
Xin Long [Mon, 18 Dec 2017 06:20:56 +0000 (14:20 +0800)]
vxlan: update skb dst pmtu on tx path

Unlike ip tunnels, now vxlan doesn't do any pmtu update for
upper dst pmtu, even if it doesn't match the lower dst pmtu
any more.

The problem can be reproduced when reducing the vxlan lower
dev's pmtu when running netperf. In jianlin's testing, the
performance went to 1/7 of the previous.

This patch is to update the upper dst pmtu to match the lower
dst pmtu on tx path so that packets can be sent out even when
lower dev's pmtu has been changed.

It also works for metadata dst.

Note that this patch doesn't process any pmtu icmp packet.
But even in the future, the support for pmtu icmp packets
process of udp tunnels will also needs this.

The same thing will be done for geneve in another patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: arc_emac: restart stalled EMAC
Alexander Kochetkov [Tue, 19 Dec 2017 11:03:57 +0000 (14:03 +0300)]
net: arc_emac: restart stalled EMAC

Under certain conditions EMAC stop reception of incoming packets and
continuously increment R_MISS register instead of saving data into
provided buffer. The commit implement workaround for such situation.
Then the stall detected EMAC will be restarted.

On device the stall looks like the device lost it's dynamic IP address.
ifconfig shows that interface error counter rapidly increments.
At the same time on the DHCP server we can see continues DHCP-requests
from device.

In real network stalls happen really rarely. To make them frequent the
broadcast storm[1] should be simulated. For simulation it is necessary
to make following connections:
    1. connect radxarock to 1st port of switch
    2. connect some PC to 2nd port of switch
    3. connect two other free ports together using standard ethernet cable,
       in order to make a switching loop.

After that, is necessary to make a broadcast storm. For example, running on
PC 'ping' to some IP address triggers ARP-request storm. After some
time (~10sec), EMAC on rk3188 will stall.

Observed and tested on rk3188 radxarock.

[1] https://en.wikipedia.org/wiki/Broadcast_radiation

Signed-off-by: Alexander Kochetkov <al.kochet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>