commits
Pull x86 fix from Thomas Gleixner:
"Just the missing compat entry for the new pread/writev2"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Use compat version for preadv2 and pwritev2
Pull networking fixes from David Miller:
1) Fix mvneta/bm dependencies, from Arnd Bergmann.
2) RX completion hw bug workaround in bnxt_en, from Michael Chan.
3) Kernel pointer leak in nf_conntrack, from Linus.
4) Hoplimit route attribute limits not enforced properly, from Paolo
Abeni.
5) qlcnic driver NULL deref fix from Dan Carpenter.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
arm64: bpf: jit JMP_JSET_{X,K}
net/route: enforce hoplimit max value
nf_conntrack: avoid kernel pointer value leak in slab name
drivers: net: xgene: fix register offset
drivers: net: xgene: fix statistics counters race condition
drivers: net: xgene: fix ununiform latency across queues
drivers: net: xgene: fix sharing of irqs
drivers: net: xgene: fix IPv4 forward crash
xen-netback: fix extra_info handling in xenvif_tx_err()
net: mvneta: bm: fix dependencies again
bnxt_en: Add workaround to detect bad opaque in rx completion (part 2)
bnxt_en: Add workaround to detect bad opaque in rx completion (part 1)
qlcnic: potential NULL dereference in qlcnic_83xx_get_minidump_template()
Similar to preadv and pwritev, preadv2 and pwritev2 need compat entries
in the 32-bit syscall table.
This bug was found by strace test suite.
Fixes: 4babf2c5efb7 ("x86: wire up preadv2 and pwritev2")
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20160511084817.GA29823@altlinux.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull vfs fixes from Al Viro:
"Overlayfs fixes from Miklos, assorted fixes from me.
Stable fodder of varying severity, all sat in -next for a while"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ovl: ignore permissions on underlying lookup
vfs: add lookup_hash() helper
vfs: rename: check backing inode being equal
vfs: add vfs_select_inode() helper
get_rock_ridge_filename(): handle malformed NM entries
ecryptfs: fix handling of directory opening
atomic_open(): fix the handling of create_error
fix the copy vs. map logics in blk_rq_map_user_iov()
do_splice_to(): cap the size before passing to ->splice_read()
Original implementation commit e54bcde3d69d ("arm64: eBPF JIT compiler")
had the relevant code paths, but due to an oversight always fail jiting.
As a result, we had been falling back to BPF interpreter whenever a BPF
program has JMP_JSET_{X,K} instructions.
With this fix, we confirm that the corresponding tests in lib/test_bpf
continue to pass, and also jited.
...
[ 2.784553] test_bpf: #30 JSET jited:1 188 192 197 PASS
[ 2.791373] test_bpf: #31 tcpdump port 22 jited:1 325 677 625 PASS
[ 2.808800] test_bpf: #32 tcpdump complex jited:1 323 731 991 PASS
...
[ 3.190759] test_bpf: #237 JMP_JSET_K: if (0x3 & 0x2) return 1 jited:1 110 PASS
[ 3.192524] test_bpf: #238 JMP_JSET_K: if (0x3 & 0xffffffff) return 1 jited:1 98 PASS
[ 3.211014] test_bpf: #249 JMP_JSET_X: if (0x3 & 0x2) return 1 jited:1 120 PASS
[ 3.212973] test_bpf: #250 JMP_JSET_X: if (0x3 & 0xffffffff) return 1 jited:1 89 PASS
...
Fixes: e54bcde3d69d ("arm64: eBPF JIT compiler")
Signed-off-by: Zi Shen Lim <zlim.lnx@gmail.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull SCSI fixes from James Bottomley:
"This is a couple of small fixes: one is a potential uninitialised
error variable in the alua code, potentially causing spurious failures
and the other is a problem caused by the conversion of SCSI to
hostwide tags which resulted in the qla1280 driver always failing in
host initialisation"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
qla1280: Don't allocate 512kb of host tags
scsi_dh_alua: uninitialized variable in alua_rtpg()
Pull cgroup fixes from Tejun Heo:
"During v4.6-rc1 cgroup namespace support was merged. There is an
issue where it's impossible to tell whether a given cgroup mount point
is bind mounted or namespaced. Serge has been working on the issue
but it took longer than expected to resolve, so the late pull request.
Given that it's a completely new feature and the patches don't touch
anything else, the risk seems acceptable. However, if this is too
late, an alternative is plugging new cgroup ns creation for v4.6 and
retrying for v4.7"
* 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix compile warning
kernfs: kernfs_sop_show_path: don't return 0 after seq_dentry call
cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
kernfs_path_from_node_locked: don't overwrite nlen
Currently, when creating or updating a route, no check is performed
in both ipv4 and ipv6 code to the hoplimit value.
The caller can i.e. set hoplimit to 256, and when such route will
be used, packets will be sent with hoplimit/ttl equal to 0.
This commit adds checks for the RTAX_HOPLIMIT value, in both ipv4
ipv6 route code, substituting any value greater than 255 with 255.
This is consistent with what is currently done for ADVMSS and MTU
in the ipv4 code.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"Hopefully the last round of fixes this release, fingers crossed :)
1) Initialize static nf_conntrack_locks_all_lock properly, from
Florian Westphal.
2) Need to cancel pending work when destroying IDLETIMER entries,
from Liping Zhang.
3) Fix TX param usage when sending TSO over iwlwifi devices, from
Emmanuel Grumbach.
4) NFACCT quota params not validated properly, from Phil Turnbull.
5) Resolve more glibc vs. kernel header conflicts, from Mikko
Tapeli.
6) Missing IRQ free in ravb_close(), from Geert Uytterhoeven.
7) Fix infoleak in x25, from Kangjie Lu.
8) Similarly in thunderx driver, from Heinrich Schuchardt.
9) tc_ife.h uapi header not exported properly, from Jamal Hadi Salim.
10) Don't reenable PHY interreupts if device is in polling mode, from
Shaohui Xie.
11) Packet scheduler actions late binding was not being handled
properly at all, from Jamal Hadi Salim.
12) Fix binding of conntrack entries to helpers in openvswitch, from
Joe Stringer"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (21 commits)
gre: do not keep the GRE header around in collect medata mode
openvswitch: Fix cached ct with helper.
net sched: ife action fix late binding
net sched: skbedit action fix late binding
net sched: simple action fix late binding
net sched: mirred action fix late binding
net sched: ipt action fix late binding
net sched: vlan action fix late binding
net: phylib: fix interrupts re-enablement in phy_start
tcp: refresh skb timestamp at retransmit time
net: nps_enet: bug fix - handle lost tx interrupts
net: nps_enet: Tx handler synchronization
export tc ife uapi header
net: thunderx: avoid exposing kernel stack
net: fix a kernel infoleak in x25 module
ravb: Add missing free_irq() call to ravb_close()
uapi glibc compat: fix compile errors when glibc net/if.h included before linux/if.h
netfilter: nfnetlink_acct: validate NFACCT_QUOTA parameter
iwlwifi: mvm: don't override the rate with the AMSDU len
netfilter: IDLETIMER: fix race condition when destroy the target
...
The qla1280 driver sets the scsi_host_template's can_queue field to 0xfffff
which results in an allocation failure when allocating the block layer tags
for the driver's queues. This was introduced with the change for host wide
tags in commit 64d513ac31b - "scsi: use host wide tags by default".
Reduce can_queue to MAX_OUTSTANDING_COMMANDS (512) to solve the allocation
error.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Fixes: 64d513ac31b - "scsi: use host wide tags by default"
Cc: stable@vger.kernel.org # v4.4
Cc: Laura Abbott <labbott@redhat.com>
Cc: Michael Reed <mdr@sgi.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Lee Duncan <lduncan@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: James Bottomley <jejb@linux.vnet.ibm.com>
Pull workqueue fix from Tejun Heo:
"CPU hotplug callbacks can invoke DOWN_FAILED w/o preceding
DOWN_PREPARE which can trigger a WARN_ON() in workqueue.
The bug has been there for a very long time. It only triggers if CPU
down fails at a specific point and I don't think it has adverse
effects other than the warning messages. The fix is very low impact"
* 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix rebind bound workers warning
commit 4f41fc59620f ("cgroup, kernfs: make mountinfo
show properly scoped path for cgroup namespaces")
added the following compile warning:
kernel/cgroup.c: In function ‘cgroup_show_path’:
kernel/cgroup.c:1634:15: warning: unused variable ‘ret’ [-Wunused-variable]
int len = 0, ret = 0;
^
fix it.
Fixes: 4f41fc59620f ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces")
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Payloads of NM entries are not supposed to contain NUL. When we run
into such, only the part prior to the first NUL goes into the
concatenation (i.e. the directory entry name being encoded by a bunch
of NM entries). We do stop when the amount collected so far + the
claimed amount in the current NM entry exceed 254. So far, so good,
but what we return as the total length is the sum of *claimed*
sizes, not the actual amount collected. And that can grow pretty
large - not unlimited, since you'd need to put CE entries in
between to be able to get more than the maximum that could be
contained in one isofs directory entry / continuation chunk and
we are stop once we'd encountered 32 CEs, but you can get about 8Kb
easily. And that's what will be passed to readdir callback as the
name length. 8Kb __copy_to_user() from a buffer allocated by
__get_free_page()
Cc: stable@vger.kernel.org # 0.98pl6+ (yes, really)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Generally permission checking is not necessary when overlayfs looks up a
dentry on one of the underlying layers, since search permission on base
directory was already checked in ovl_permission().
More specifically using lookup_one_len() causes a problem when the lower
directory lacks search permission for a specific user while the upper
directory does have search permission. Since lookups are cached, this
causes inconsistency in behavior: success depends on who did the first
lookup.
So instead use lookup_hash() which doesn't do the permission check.
Reported-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The slab name ends up being visible in the directory structure under
/sys, and even if you don't have access rights to the file you can see
the filenames.
Just use a 64-bit counter instead of the pointer to the 'net' structure
to generate a unique name.
This code will go away in 4.7 when the conntrack code moves to a single
kmemcache, but this is the backportable simple solution to avoiding
leaking kernel pointers to user space.
Fixes: 5b3501faa874 ("netfilter: nf_conntrack: per netns nf_conntrack_cachep")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
The x86 exception table sorting was changed in commit 29934b0fb8ff
("x86/extable: use generic search and sort routines") to use the arch
independent code in lib/extable.c. However, the patch was mangled
somehow on its way into the kernel from the last version posted at [1].
The committed version kind of attempted to incorporate the changes of
commit 548acf19234d ("x86/mm: Expand the exception table logic to allow
new handling options") as in _completely_ _ignoring_ the x86 specific
'handler' member of struct exception_table_entry. This effectively
broke the sorting as entries will only partly be swapped now.
Fortunately, the x86 Kconfig selects BUILDTIME_EXTABLE_SORT, so the
exception table doesn't need to be sorted at runtime. However, in case
that ever changes, we better not break the exception table sorting just
because of that.
[ Ard Biesheuvel points out that BUILDTIME_EXTABLE_SORT applies to the
core image only, but we still rely on the sorting routines for modules
in that case - Linus ]
Fix this by providing a swap_ex_entry_fixup() macro that takes care of
the 'handler' member.
[1] https://lkml.org/lkml/2016/1/27/232
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Fixes: 29934b0fb8f ("x86/extable: use generic search and sort routines")
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For ipgre interface in collect metadata mode, it doesn't make sense for the
interface to be of ARPHRD_IPGRE type. The outer header of received packets
is not needed, as all the information from it is present in metadata_dst. We
already don't set ipgre_header_ops for collect metadata interfaces, which is
the only consumer of mac_header pointing to the outer IP header.
Just set the interface type to ARPHRD_NONE in collect metadata mode for
ipgre (not gretap, that still correctly stays ARPHRD_ETHER) and reset
mac_header.
Fixes: a64b04d86d14 ("gre: do not assign header_ops in collect metadata mode")
Fixes: 2e15ea390e6f4 ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's possible to use "err" without initializing it. If it happens to be
a 2 which is SCSI_DH_RETRY then that could cause a bug. Bart Van Assche
pointed out that we should probably re-initialize it for every iteration
through the retry loop.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: James Bottomley <jejb@linux.vnet.ibm.com>
Pull scheduler fix from Ingo Molnar:
"This is a revert to fix an interactivity problem.
The proper fixes for the problems that the reverted commit exposed are
now in sched/core (consisting of 3 patches), but were too risky for
v4.6 and will arrive in the v4.7 merge window"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "sched/fair: Fix fairness issue on migration"
------------[ cut here ]------------
WARNING: CPU: 0 PID: 16 at kernel/workqueue.c:4559 rebind_workers+0x1c0/0x1d0
Modules linked in:
CPU: 0 PID: 16 Comm: cpuhp/0 Not tainted 4.6.0-rc4+ #31
Hardware name: IBM IBM System x3550 M4 Server -[7914IUW]-/00Y8603, BIOS -[D7E128FUS-1.40]- 07/23/2013
0000000000000000 ffff881037babb58 ffffffff8139d885 0000000000000010
0000000000000000 0000000000000000 0000000000000000 ffff881037babba8
ffffffff8108505d ffff881037ba0000 000011cf3e7d6e60 0000000000000046
Call Trace:
dump_stack+0x89/0xd4
__warn+0xfd/0x120
warn_slowpath_null+0x1d/0x20
rebind_workers+0x1c0/0x1d0
workqueue_cpu_up_callback+0xf5/0x1d0
notifier_call_chain+0x64/0x90
? trace_hardirqs_on_caller+0xf2/0x220
? notify_prepare+0x80/0x80
__raw_notifier_call_chain+0xe/0x10
__cpu_notify+0x35/0x50
notify_down_prepare+0x5e/0x80
? notify_prepare+0x80/0x80
cpuhp_invoke_callback+0x73/0x330
? __schedule+0x33e/0x8a0
cpuhp_down_callbacks+0x51/0xc0
cpuhp_thread_fun+0xc1/0xf0
smpboot_thread_fn+0x159/0x2a0
? smpboot_create_threads+0x80/0x80
kthread+0xef/0x110
? wait_for_completion+0xf0/0x120
? schedule_tail+0x35/0xf0
ret_from_fork+0x22/0x50
? __init_kthread_worker+0x70/0x70
---[ end trace eb12ae47d2382d8f ]---
notify_down_prepare: attempt to take down CPU 0 failed
This bug can be reproduced by below config w/ nohz_full= all cpus:
CONFIG_BOOTPARAM_HOTPLUG_CPU0=y
CONFIG_DEBUG_HOTPLUG_CPU0=y
CONFIG_NO_HZ_FULL=y
As Thomas pointed out:
| If a down prepare callback fails, then DOWN_FAILED is invoked for all
| callbacks which have successfully executed DOWN_PREPARE.
|
| But, workqueue has actually two notifiers. One which handles
| UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE.
|
| Now look at the priorities of those callbacks:
|
| CPU_PRI_WORKQUEUE_UP = 5
| CPU_PRI_WORKQUEUE_DOWN = -5
|
| So the call order on DOWN_PREPARE is:
|
| CB 1
| CB ...
| CB workqueue_up() -> Ignores DOWN_PREPARE
| CB ...
| CB X ---> Fails
|
| So we call up to CB X with DOWN_FAILED
|
| CB 1
| CB ...
| CB workqueue_up() -> Handles DOWN_FAILED
| CB ...
| CB X-1
|
| So the problem is that the workqueue stuff handles DOWN_FAILED in the up
| callback, while it should do it in the down callback. Which is not a good idea
| either because it wants to be called early on rollback...
|
| Brilliant stuff, isn't it? The hotplug rework will solve this problem because
| the callbacks become symetric, but for the existing mess, we need some
| workaround in the workqueue code.
The boot CPU handles housekeeping duty(unbound timers, workqueues,
timekeeping, ...) on behalf of full dynticks CPUs. It must remain
online when nohz full is enabled. There is a priority set to every
notifier_blocks:
workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down
So tick_nohz_cpu_down callback failed when down prepare cpu 0, and
notifier_blocks behind tick_nohz_cpu_down will not be called any
more, which leads to workers are actually not unbound. Then hotplug
state machine will fallback to undo and online cpu 0 again. Workers
will be rebound unconditionally even if they are not unbound and
trigger the warning in this progress.
This patch fix it by catching !DISASSOCIATED to avoid rebind bound
workers.
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Cc: stable@vger.kernel.org
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Our caller expects 0 on success, not >0.
This fixes a bug in the patch
cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
where /sys does not show up in mountinfo, breaking criu.
Thanks for catching this, Andrei.
Reported-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
First of all, trying to open them r/w is idiocy; it's guaranteed to fail.
Moreover, assigning ->f_pos and assuming that everything will work is
blatantly broken - try that with e.g. tmpfs as underlying layer and watch
the fireworks. There may be a non-trivial amount of state associated with
current IO position, well beyond the numeric offset. Using the single
struct file associated with underlying inode is really not a good idea;
we ought to open one for each ecryptfs directory struct file.
Additionally, file_operations both for directories and non-directories are
full of pointless methods; non-directories should *not* have ->iterate(),
directories should not have ->flush(), ->fasync() and ->splice_read().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Overlayfs needs lookup without inode_permission() and already has the name
hash (in form of dentry->d_name on overlayfs dentry). It also doesn't
support filesystems with d_op->d_hash() so basically it only needs
the actual hashed lookup from lookup_one_len_unlocked()
So add a new helper that does unlocked lookup of a hashed name.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Iyappan Subramanian says:
====================
drivers: net: xgene: Bug fixes
This patch set addresses the following bug fixes that were found during testing.
1. IPv4 forward test crash
- drivers: net: xgene: fix IPv4 forward crash
2. Sharing of irqs
- drivers: net: xgene: fix sharing of irqs
3. Ununiform latency across queues
- drivers: net: xgene: fix ununiform latency across queues
4. Fix statistics counters race condition
- drivers: net: xgene: fix statistics counters race condition
5. Correcting register offset and field lengths
- drivers: net: xgene: fix register offset
v2: Address review comments from v1
- Defer TSO fix, and reposting all other patches from v1
v1:
- Initial version
====================
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull spi fixes from Mark Brown:
"A bunch of small driver specific fixes that have come up, none of them
remarkable in themselves. One fixes a regression introduced in the
merge window and another two are targetted at stable"
* tag 'spi-fix-v4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: pxa2xx: Do not detect number of enabled chip selects on Intel SPT
spi: spi-ti-qspi: Handle truncated frames properly
spi: spi-ti-qspi: Fix FLEN and WLEN settings if bits_per_word is overridden
spi: omap2-mcspi: Undo broken fix for dma transfer of vmalloced buffer
spi: spi-fsl-dspi: Fix cs_change handling in message transfer
When using conntrack helpers from OVS, a common configuration is to
perform a lookup without specifying a helper, then go through a
firewalling policy, only to decide to attach a helper afterwards.
In this case, the initial lookup will cause a ct entry to be attached to
the skb, then the later commit with helper should attach the helper and
confirm the connection. However, the helper attachment has been missing.
If the user has enabled automatic helper attachment, then this issue
will be masked as it will be applied in init_conntrack(). It is also
masked if the action is executed from ovs_packet_cmd_execute() as that
will construct a fresh skb.
This patch fixes the issue by making an explicit call to try to assign
the helper if there is a discrepancy between the action's helper and the
current skb->nfct.
Fixes: cae3a2627520 ("openvswitch: Allow attaching helpers to ct action")
Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull perf fixes from Ingo Molnar:
"An uncharacteristically large number of bugs popped up in the last
week:
- various tooling fixes, two crashes and build problems
- two Intel PT fixes
- an KNL uncore driver fix
- an Intel PMU driver fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf stat: Fallback to user only counters when perf_event_paranoid > 1
perf evsel: Handle EACCESS + perf_event_paranoid=2 in fallback()
perf evsel: Improve EPERM error handling in open_strerror()
tools lib traceevent: Do not reassign parg after collapse_tree()
perf probe: Check if dwarf_getlocations() is available
perf dwarf: Guard !x86_64 definitions under #ifdef else clause
perf tools: Use readdir() instead of deprecated readdir_r()
perf thread_map: Use readdir() instead of deprecated readdir_r()
perf script: Use readdir() instead of deprecated readdir_r()
perf tools: Use readdir() instead of deprecated readdir_r()
perf/core: Disable the event on a truncated AUX record
perf/x86/intel/pt: Generate PMI in the STOP region as well
perf/x86: Fix undefined shift on 32-bit kernels
perf/x86/msr: Fix SMI overflow
perf/x86/intel/uncore: Fix CHA registers configuration procedure for Knights Landing platform
perf diff: Fix duplicated output column
Mike reported that this recent commit:
3a47d5124a95 ("sched/fair: Fix fairness issue on migration")
... broke interactivity and the signal starvation test.
We have a proper fix series in the works but ran out of time for
v4.6, so revert the commit.
Reported-by: Mike Galbraith <efault@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
with the following backtrace:
[ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[ 601.347574] Tainted: G O 4.4.5-1-storage+ #6
[ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 601.348142] kworker/u129:5 D ffff880803077988 0 1636 2 0x00000000
[ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[ 601.348999] ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[ 601.349662] ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[ 601.350333] ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[ 601.350965] Call Trace:
[ 601.351203] [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[ 601.351444] [<ffffffff815b01d5>] schedule+0x35/0x80
[ 601.351709] [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
[ 601.351958] [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
[ 601.352208] [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
[ 601.352446] [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[ 601.352688] [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
[ 601.352951] [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[ 601.353196] [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
[ 601.353440] [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
[ 601.353689] [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
[ 601.353958] [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
[ 601.354200] [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
[ 601.354441] [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
[ 601.354688] [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
[ 601.354932] [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
[ 601.355193] [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
[ 601.355432] [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
[ 601.355679] [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
[ 601.355925] [<ffffffff81198379>] vfs_write+0xa9/0x1a0
[ 601.356164] [<ffffffff811c59d8>] kernel_write+0x38/0x50
The underlying device is a null_blk, with default parameters:
queue_mode = MQ
submit_queues = 1
Verification that nullb0 has something inflight:
root@pserver8:~# cat /sys/block/nullb0/inflight
0 1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
ffff8838038e2400
...
During debug it became clear that stalled request is always inserted in
the rq_list from the following path:
save_stack_trace_tsk + 34
blk_mq_insert_requests + 231
blk_mq_flush_plug_list + 281
blk_flush_plug_list + 199
wait_on_page_bit + 192
__filemap_fdatawait_range + 228
filemap_fdatawait_range + 20
filemap_write_and_wait_range + 63
blkdev_fsync + 27
vfs_fsync_range + 73
blkdev_write_iter + 202
__vfs_write + 170
vfs_write + 169
kernel_write + 56
So blk_flush_plug_list() was called with from_schedule == true.
If from_schedule is true, that means that finally blk_mq_insert_requests()
offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
i.e. it calls kblockd_schedule_delayed_work_on().
That means, that we race with another CPU, which is about to execute
__blk_mq_run_hw_queue() work.
Further debugging shows the following traces from different CPUs:
CPU#0 CPU#1
---------------------------------- -------------------------------
reqeust A inserted
STORE hctx->ctx_map[0] bit marked
kblockd_schedule...() returns 1
<schedule to kblockd workqueue>
request B inserted
STORE hctx->ctx_map[1] bit marked
kblockd_schedule...() returns 0
*** WORK PENDING bit is cleared ***
flush_busy_ctxs() is executed, but
bit 1, set by CPU#1, is not observed
As a result request B pended forever.
This behaviour can be explained by speculative LOAD of hctx->ctx_map on
CPU#0, which is reordered with clear of PENDING bit and executed _before_
actual STORE of bit 1 on CPU#1.
The proper fix is an explicit full barrier <mfence>, which guarantees
that clear of PENDING bit is to be executed before all possible
speculative LOADS or STORES inside actual work function.
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Michael Wang <yun.wang@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Patch summary:
When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.
Short explanation (courtesy of mkerrisk):
If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point. Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.
Long version:
When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b. The root dentry field of the mountinfo
entry will show '/a/b'.
cat > /tmp/do1 << EOF
mount -t cgroup -o freezer freezer /mnt
grep freezer /proc/self/mountinfo
EOF
unshare -Gm bash /tmp/do1
> 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
> 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer
The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':
grep freezer /proc/self/cgroup
9:freezer:/
If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:
mount --bind /sys/fs/cgroup/freezer/a/b /mnt
130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.
Example (by mkerrisk):
First, a little script to save some typing and verbiage:
echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'
Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:
2653
2653 # Our shell
14254 # cat(1)
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):
Look at the state of the /proc files:
/proc/self/cgroup: 10:freezer:/
mountinfo: / /sys/fs/cgroup/freezer
The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.
However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:
# propagating to other mountns
/proc/self/cgroup: 7:freezer:/
mountinfo: /a/b /mnt/freezer
The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.
With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace. So the same steps as above:
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: /../.. /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: / /mnt/freezer
cgroup.clone_children freezer.parent_freezing freezer.state tasks
cgroup.procs freezer.self_freezing notify_on_release
3164
2653 # First shell that placed in this cgroup
3164 # Shell started by 'unshare'
14197 # cat(1)
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
* if we have a hashed negative dentry and either CREAT|EXCL on
r/o filesystem, or CREAT|TRUNC on r/o filesystem, or CREAT|EXCL
with failing may_o_create(), we should fail with EROFS or the
error may_o_create() has returned, but not ENOENT. Which is what
the current code ends up returning.
* if we have CREAT|TRUNC hitting a regular file on a read-only
filesystem, we can't fail with EROFS here. At the very least,
not until we'd done follow_managed() - we might have a writable
file (or a device, for that matter) bound on top of that one.
Moreover, the code downstream will see that O_TRUNC and attempt
to grab the write access (*after* following possible mount), so
if we really should fail with EROFS, it will happen. No need
to do that inside atomic_open().
The real logics is much simpler than what the current code is
trying to do - if we decided to go for simple lookup, ended
up with a negative dentry *and* had create_error set, fail with
create_error. No matter whether we'd got that negative dentry
from lookup_real() or had found it in dcache.
Cc: stable@vger.kernel.org # v3.6+
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If a file is renamed to a hardlink of itself POSIX specifies that rename(2)
should do nothing and return success.
This condition is checked in vfs_rename(). However it won't detect hard
links on overlayfs where these are given separate inodes on the overlayfs
layer.
Overlayfs itself detects this condition and returns success without doing
anything, but then vfs_rename() will proceed as if this was a successful
rename (detach_mounts(), d_move()).
The correct thing to do is to detect this condition before even calling
into overlayfs. This patch does this by calling vfs_select_inode() to get
the underlying inodes.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.2+
Patch 562abd39 "xen-netback: support multiple extra info fragments
passed from frontend" contained a mistake which can result in an in-
correct number of responses being generated when handling errors
encountered when processing packets containing extra info fragments.
This patch fixes the problem.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reported-by: Jan Beulich <JBeulich@suse.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes SG_RX_DV_GATE_REG_0_ADDR register offset
and ring state field lengths.
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Tested-by: Toan Le <toanle@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull KVM fixes from Paolo Bonzini:
"Two small x86 patches, improving "make kvmconfig" and fixing an
objtool warning for CONFIG_PROFILE_ALL_BRANCHES"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvmconfig: add more virtio drivers
x86/kvm: Add stack frame dependency to fastop() inline asm
Jamal Hadi Salim says:
====================
Some actions were broken in allowing for late binding of actions.
Late binding workflow is as follows:
a) create an action and provide all necessary parameters for it
Optionally provide an index or let the kernel give you one.
Example:
sudo tc actions add action police rate 1kbit burst 90k drop index 1
b) later on bind to the pre-created action from a filter definition
by merely specifying the index.
Example:
sudo tc filter add dev lo parent ffff: protocol ip prio 8 \
u32 match ip src 127.0.0.8/32 flowid 1:8 action police index 1
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull device mapper fix from Mike Snitzer:
"Fix for earlier 4.6-rc4 stable@ commit that introduced improper use of
write lock in cmd_read_lock() -- due to cut-n-paste gone awry (and
sparse didn't catch it)"
* tag 'dm-4.6-fix-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm cache metadata: fix cmd_read_lock() acquiring write lock
Pull ARM SoC fixes from Arnd Bergmann:
"Three more bug fixes for ARM SoCs this week:
- The Atmel sama5d2 was registering the wrong NFC device type
- On Atmel sam9x5, the power management controller had an incorrect
register area size
- On ARM64 Allwinner machine was not secting the generic irqchip
code, causing build errors in some configurations"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: dts: at91: sam9x5: Fix the memory range assigned to the PMC
arm64/sunxi: 4.6-rc1: Add dependency on generic irq chip
ARM: dts: at91: sama5d2: use "atmel,sama5d3-nfc" compatible for nfc
Pull perf/urgent fixes from Arnaldo Carvalho de Melo:
- Fallback to usermode-only counters when perf_event_paranoid > 1, which
is the case now (Arnaldo Carvalho de Melo)
- Do not reassign parg after collapse_tree() in libtraceevent, which
may cause tool crashes (Steven Rostedt)
- Fix the build on Fedora Rawhide, where readdir_r() is deprecated and
also wrt -Werror=unused-const-variable= + x86_32_regoffset_table on
!x86_64 (Arnaldo Carvalho de Melo)
- Fix the build on Ubuntu 12.04.5, where dwarf_getlocations() isn't
available, i.e. libdw-dev < 0.157 (Arnaldo Carvalho de Melo)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull PCI fixes from Bjorn Helgaas:
"Since v4.5, we've WARNed during resume if a PCI device, including a
Thunderbolt device, was added while we were suspended. A change we
merged for v4.6-rc1 turned that warning into a system hang. These
enumeration patches from Lukas Wunner fix this issue:
- Fix BUG on device attach failure
- Do not treat EPROBE_DEFER as device attach failure"
* tag 'pci-v4.6-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: Do not treat EPROBE_DEFER as device attach failure
PCI: Fix BUG on device attach failure
Pull IPMI updates from Corey Minyard:
"Just some minor fixes, nothing big"
* tag 'for-linus-4.6' of git://git.code.sf.net/p/openipmi/linux-ipmi:
ipmi: do not probe ACPI devices if si_tryacpi is unset
ipmi_si: Avoid a wrong long timeout on transaction done
ipmi_si: Fix module parameter doc names
ipmi_ssif: Fix logic around alert handling
We've calculated @len to be the bytes we need for '/..' entries from
@kn_from to the common ancestor, and calculated @nlen to be the extra
bytes we need to get from the common ancestor to @kn_to. We use them
as such at the end. But in the loop copying the actual entries, we
overwrite @nlen. Use a temporary variable for that instead.
Without this, the return length, when the buffer is large enough, is
wrong. (When the buffer is NULL or too small, the returned value is
correct. The buffer contents are also correct.)
Interestingly, no callers of this function are affected by this as of
yet. However the upcoming cgroup_show_path() will be.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.2+
I tried to fix this before, but my previous fix was incomplete
and we can still get the same link error in randconfig builds
because of the way that Kconfig treats the
default y if MVNETA=y && MVNETA_BM_ENABLE
line that does not actually trigger when MVNETA_BM_ENABLE=m,
unlike I intended.
Changing the line to use MVNETA_BM_ENABLE!=n however has
the desired effect and hopefully makes all configurations
work as expected.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 019ded3aa7c9 ("net: mvneta: bm: clarify dependencies")
Acked-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the race condition on updating the statistics
counters by moving the counters to the ring structure.
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Tested-by: Toan Le <toanle@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"make defconfig kvmconfig" is supposed to end up with usable kernel for
KVM guest. In practice, it won't work for e.g. Hetzner VPS (KVM-based)
unless you add these options.
Signed-off-by: Andrey Utkin <andrey_utkin@fastmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are use cases when chip select should be triggered between transfers
in single SPI message. Current implementation does this only on last
transfer in message ignoring cs_change value provided in current transfer.
Signed-off-by: Andrey Vostrikov <andrey.vostrikov@cogentembedded.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
This reverts commit 3525e0aac91c4de5d20b1f22a6c6e2b39db3cc96.
The DMA transfer for RX buffer was not handled correctly in this change.
The actual transfer length for DMA RX can be less than xfer->len in the
specific condition and the last words will be filled after the DMA
completion, but the commit doesn't consider it and the dmaengine is
started with rx_sg mapped by spi core.
The solution for this at least requires more lines than this commit
has inserted. So revert it for now.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Certain Intel Sunrisepoint PCH variants report zero chip selects in SPI
capabilities register even they have one per port. Detection in
pxa2xx_spi_probe() sets master->num_chipselect to 0 leading to -EINVAL
from spi_register_master() where chip select count is validated.
Fix this by not using SPI capabilities register on Sunrisepoint. They don't
have more than one chip select so use the default value 1 instead of
detection.
Fixes: 8b136baa5892 ("spi: pxa2xx: Detect number of enabled Intel LPSS SPI chip select signals")
Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Cc: stable@vger.kernel.org
We clamp frame_len_words to a maximum of 4096, but do not actually
limit the number of words written or read through the DATA registers
or the length added to spi_message::actual_length. This results in
silent data corruption for commands longer than this maximum.
Recalculate the length of each transfer, taking frame_len_words into
account. Use this length in qspi_{read,write}_msg(), and to increment
spi_message::actual_length.
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Mark Brown <broonie@kernel.org>
Cc: stable@vger.kernel.org
If phy was suspended and is starting, current driver always enable
phy's interrupts, if phy works in polling, phy can raise unexpected
interrupt which will not be handled, the interrupt will block system
enter suspend again. So interrupts should only be re-enabled if phy
works in interrupt.
Signed-off-by: Shaohui Xie <Shaohui.Xie@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The process below was broken and is fixed with this patch.
//add an ife action and give it an instance id of 1
sudo tc actions add action ife encode \
type 0xDEAD allow mark dst 02:15:15:15:15:15 index 1
//create a filter which binds to ife action id 1
sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:11 action ife index 1
Message before fix was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull char/misc fixes from Greg KH:
"Here are some small char/misc driver fixes for 4.6-rc4. Full details
are in the shortlog, nothing major here.
These have all been in linux-next for a while with no reported issues"
* tag 'char-misc-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
lkdtm: do not leak free page on kmalloc failure
lkdtm: fix memory leak of base
lkdtm: fix memory leak of val
extcon: palmas: Drop stray IRQF_EARLY_RESUME flag
Commit 9567366fefdd ("dm cache metadata: fix READ_LOCK macros and
cleanup WRITE_LOCK macros") uses down_write() instead of down_read() in
cmd_read_lock(), yet up_read() is used to release the lock in
READ_UNLOCK(). Fix it.
Fixes: 9567366fefdd ("dm cache metadata: fix READ_LOCK macros and cleanup WRITE_LOCK macros")
Cc: stable@vger.kernel.org
Signed-off-by: Ahmed Samy <f.fallen45@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull networking fixes from David Miller:
1) Fix mvneta/bm dependencies, from Arnd Bergmann.
2) RX completion hw bug workaround in bnxt_en, from Michael Chan.
3) Kernel pointer leak in nf_conntrack, from Linus.
4) Hoplimit route attribute limits not enforced properly, from Paolo
Abeni.
5) qlcnic driver NULL deref fix from Dan Carpenter.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
arm64: bpf: jit JMP_JSET_{X,K}
net/route: enforce hoplimit max value
nf_conntrack: avoid kernel pointer value leak in slab name
drivers: net: xgene: fix register offset
drivers: net: xgene: fix statistics counters race condition
drivers: net: xgene: fix ununiform latency across queues
drivers: net: xgene: fix sharing of irqs
drivers: net: xgene: fix IPv4 forward crash
xen-netback: fix extra_info handling in xenvif_tx_err()
net: mvneta: bm: fix dependencies again
bnxt_en: Add workaround to detect bad opaque in rx completion (part 2)
bnxt_en: Add workaround to detect bad opaque in rx completion (part 1)
qlcnic: potential NULL dereference in qlcnic_83xx_get_minidump_template()
Similar to preadv and pwritev, preadv2 and pwritev2 need compat entries
in the 32-bit syscall table.
This bug was found by strace test suite.
Fixes: 4babf2c5efb7 ("x86: wire up preadv2 and pwritev2")
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20160511084817.GA29823@altlinux.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull vfs fixes from Al Viro:
"Overlayfs fixes from Miklos, assorted fixes from me.
Stable fodder of varying severity, all sat in -next for a while"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ovl: ignore permissions on underlying lookup
vfs: add lookup_hash() helper
vfs: rename: check backing inode being equal
vfs: add vfs_select_inode() helper
get_rock_ridge_filename(): handle malformed NM entries
ecryptfs: fix handling of directory opening
atomic_open(): fix the handling of create_error
fix the copy vs. map logics in blk_rq_map_user_iov()
do_splice_to(): cap the size before passing to ->splice_read()
Original implementation commit e54bcde3d69d ("arm64: eBPF JIT compiler")
had the relevant code paths, but due to an oversight always fail jiting.
As a result, we had been falling back to BPF interpreter whenever a BPF
program has JMP_JSET_{X,K} instructions.
With this fix, we confirm that the corresponding tests in lib/test_bpf
continue to pass, and also jited.
...
[ 2.784553] test_bpf: #30 JSET jited:1 188 192 197 PASS
[ 2.791373] test_bpf: #31 tcpdump port 22 jited:1 325 677 625 PASS
[ 2.808800] test_bpf: #32 tcpdump complex jited:1 323 731 991 PASS
...
[ 3.190759] test_bpf: #237 JMP_JSET_K: if (0x3 & 0x2) return 1 jited:1 110 PASS
[ 3.192524] test_bpf: #238 JMP_JSET_K: if (0x3 & 0xffffffff) return 1 jited:1 98 PASS
[ 3.211014] test_bpf: #249 JMP_JSET_X: if (0x3 & 0x2) return 1 jited:1 120 PASS
[ 3.212973] test_bpf: #250 JMP_JSET_X: if (0x3 & 0xffffffff) return 1 jited:1 89 PASS
...
Fixes: e54bcde3d69d ("arm64: eBPF JIT compiler")
Signed-off-by: Zi Shen Lim <zlim.lnx@gmail.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull SCSI fixes from James Bottomley:
"This is a couple of small fixes: one is a potential uninitialised
error variable in the alua code, potentially causing spurious failures
and the other is a problem caused by the conversion of SCSI to
hostwide tags which resulted in the qla1280 driver always failing in
host initialisation"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
qla1280: Don't allocate 512kb of host tags
scsi_dh_alua: uninitialized variable in alua_rtpg()
Pull cgroup fixes from Tejun Heo:
"During v4.6-rc1 cgroup namespace support was merged. There is an
issue where it's impossible to tell whether a given cgroup mount point
is bind mounted or namespaced. Serge has been working on the issue
but it took longer than expected to resolve, so the late pull request.
Given that it's a completely new feature and the patches don't touch
anything else, the risk seems acceptable. However, if this is too
late, an alternative is plugging new cgroup ns creation for v4.6 and
retrying for v4.7"
* 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix compile warning
kernfs: kernfs_sop_show_path: don't return 0 after seq_dentry call
cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
kernfs_path_from_node_locked: don't overwrite nlen
Currently, when creating or updating a route, no check is performed
in both ipv4 and ipv6 code to the hoplimit value.
The caller can i.e. set hoplimit to 256, and when such route will
be used, packets will be sent with hoplimit/ttl equal to 0.
This commit adds checks for the RTAX_HOPLIMIT value, in both ipv4
ipv6 route code, substituting any value greater than 255 with 255.
This is consistent with what is currently done for ADVMSS and MTU
in the ipv4 code.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"Hopefully the last round of fixes this release, fingers crossed :)
1) Initialize static nf_conntrack_locks_all_lock properly, from
Florian Westphal.
2) Need to cancel pending work when destroying IDLETIMER entries,
from Liping Zhang.
3) Fix TX param usage when sending TSO over iwlwifi devices, from
Emmanuel Grumbach.
4) NFACCT quota params not validated properly, from Phil Turnbull.
5) Resolve more glibc vs. kernel header conflicts, from Mikko
Tapeli.
6) Missing IRQ free in ravb_close(), from Geert Uytterhoeven.
7) Fix infoleak in x25, from Kangjie Lu.
8) Similarly in thunderx driver, from Heinrich Schuchardt.
9) tc_ife.h uapi header not exported properly, from Jamal Hadi Salim.
10) Don't reenable PHY interreupts if device is in polling mode, from
Shaohui Xie.
11) Packet scheduler actions late binding was not being handled
properly at all, from Jamal Hadi Salim.
12) Fix binding of conntrack entries to helpers in openvswitch, from
Joe Stringer"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (21 commits)
gre: do not keep the GRE header around in collect medata mode
openvswitch: Fix cached ct with helper.
net sched: ife action fix late binding
net sched: skbedit action fix late binding
net sched: simple action fix late binding
net sched: mirred action fix late binding
net sched: ipt action fix late binding
net sched: vlan action fix late binding
net: phylib: fix interrupts re-enablement in phy_start
tcp: refresh skb timestamp at retransmit time
net: nps_enet: bug fix - handle lost tx interrupts
net: nps_enet: Tx handler synchronization
export tc ife uapi header
net: thunderx: avoid exposing kernel stack
net: fix a kernel infoleak in x25 module
ravb: Add missing free_irq() call to ravb_close()
uapi glibc compat: fix compile errors when glibc net/if.h included before linux/if.h
netfilter: nfnetlink_acct: validate NFACCT_QUOTA parameter
iwlwifi: mvm: don't override the rate with the AMSDU len
netfilter: IDLETIMER: fix race condition when destroy the target
...
The qla1280 driver sets the scsi_host_template's can_queue field to 0xfffff
which results in an allocation failure when allocating the block layer tags
for the driver's queues. This was introduced with the change for host wide
tags in commit 64d513ac31b - "scsi: use host wide tags by default".
Reduce can_queue to MAX_OUTSTANDING_COMMANDS (512) to solve the allocation
error.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Fixes: 64d513ac31b - "scsi: use host wide tags by default"
Cc: stable@vger.kernel.org # v4.4
Cc: Laura Abbott <labbott@redhat.com>
Cc: Michael Reed <mdr@sgi.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Lee Duncan <lduncan@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: James Bottomley <jejb@linux.vnet.ibm.com>
Pull workqueue fix from Tejun Heo:
"CPU hotplug callbacks can invoke DOWN_FAILED w/o preceding
DOWN_PREPARE which can trigger a WARN_ON() in workqueue.
The bug has been there for a very long time. It only triggers if CPU
down fails at a specific point and I don't think it has adverse
effects other than the warning messages. The fix is very low impact"
* 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix rebind bound workers warning
commit 4f41fc59620f ("cgroup, kernfs: make mountinfo
show properly scoped path for cgroup namespaces")
added the following compile warning:
kernel/cgroup.c: In function ‘cgroup_show_path’:
kernel/cgroup.c:1634:15: warning: unused variable ‘ret’ [-Wunused-variable]
int len = 0, ret = 0;
^
fix it.
Fixes: 4f41fc59620f ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces")
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Payloads of NM entries are not supposed to contain NUL. When we run
into such, only the part prior to the first NUL goes into the
concatenation (i.e. the directory entry name being encoded by a bunch
of NM entries). We do stop when the amount collected so far + the
claimed amount in the current NM entry exceed 254. So far, so good,
but what we return as the total length is the sum of *claimed*
sizes, not the actual amount collected. And that can grow pretty
large - not unlimited, since you'd need to put CE entries in
between to be able to get more than the maximum that could be
contained in one isofs directory entry / continuation chunk and
we are stop once we'd encountered 32 CEs, but you can get about 8Kb
easily. And that's what will be passed to readdir callback as the
name length. 8Kb __copy_to_user() from a buffer allocated by
__get_free_page()
Cc: stable@vger.kernel.org # 0.98pl6+ (yes, really)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Generally permission checking is not necessary when overlayfs looks up a
dentry on one of the underlying layers, since search permission on base
directory was already checked in ovl_permission().
More specifically using lookup_one_len() causes a problem when the lower
directory lacks search permission for a specific user while the upper
directory does have search permission. Since lookups are cached, this
causes inconsistency in behavior: success depends on who did the first
lookup.
So instead use lookup_hash() which doesn't do the permission check.
Reported-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The slab name ends up being visible in the directory structure under
/sys, and even if you don't have access rights to the file you can see
the filenames.
Just use a 64-bit counter instead of the pointer to the 'net' structure
to generate a unique name.
This code will go away in 4.7 when the conntrack code moves to a single
kmemcache, but this is the backportable simple solution to avoiding
leaking kernel pointers to user space.
Fixes: 5b3501faa874 ("netfilter: nf_conntrack: per netns nf_conntrack_cachep")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
The x86 exception table sorting was changed in commit 29934b0fb8ff
("x86/extable: use generic search and sort routines") to use the arch
independent code in lib/extable.c. However, the patch was mangled
somehow on its way into the kernel from the last version posted at [1].
The committed version kind of attempted to incorporate the changes of
commit 548acf19234d ("x86/mm: Expand the exception table logic to allow
new handling options") as in _completely_ _ignoring_ the x86 specific
'handler' member of struct exception_table_entry. This effectively
broke the sorting as entries will only partly be swapped now.
Fortunately, the x86 Kconfig selects BUILDTIME_EXTABLE_SORT, so the
exception table doesn't need to be sorted at runtime. However, in case
that ever changes, we better not break the exception table sorting just
because of that.
[ Ard Biesheuvel points out that BUILDTIME_EXTABLE_SORT applies to the
core image only, but we still rely on the sorting routines for modules
in that case - Linus ]
Fix this by providing a swap_ex_entry_fixup() macro that takes care of
the 'handler' member.
[1] https://lkml.org/lkml/2016/1/27/232
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Fixes: 29934b0fb8f ("x86/extable: use generic search and sort routines")
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For ipgre interface in collect metadata mode, it doesn't make sense for the
interface to be of ARPHRD_IPGRE type. The outer header of received packets
is not needed, as all the information from it is present in metadata_dst. We
already don't set ipgre_header_ops for collect metadata interfaces, which is
the only consumer of mac_header pointing to the outer IP header.
Just set the interface type to ARPHRD_NONE in collect metadata mode for
ipgre (not gretap, that still correctly stays ARPHRD_ETHER) and reset
mac_header.
Fixes: a64b04d86d14 ("gre: do not assign header_ops in collect metadata mode")
Fixes: 2e15ea390e6f4 ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's possible to use "err" without initializing it. If it happens to be
a 2 which is SCSI_DH_RETRY then that could cause a bug. Bart Van Assche
pointed out that we should probably re-initialize it for every iteration
through the retry loop.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: James Bottomley <jejb@linux.vnet.ibm.com>
Pull scheduler fix from Ingo Molnar:
"This is a revert to fix an interactivity problem.
The proper fixes for the problems that the reverted commit exposed are
now in sched/core (consisting of 3 patches), but were too risky for
v4.6 and will arrive in the v4.7 merge window"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "sched/fair: Fix fairness issue on migration"
------------[ cut here ]------------
WARNING: CPU: 0 PID: 16 at kernel/workqueue.c:4559 rebind_workers+0x1c0/0x1d0
Modules linked in:
CPU: 0 PID: 16 Comm: cpuhp/0 Not tainted 4.6.0-rc4+ #31
Hardware name: IBM IBM System x3550 M4 Server -[7914IUW]-/00Y8603, BIOS -[D7E128FUS-1.40]- 07/23/2013
0000000000000000 ffff881037babb58 ffffffff8139d885 0000000000000010
0000000000000000 0000000000000000 0000000000000000 ffff881037babba8
ffffffff8108505d ffff881037ba0000 000011cf3e7d6e60 0000000000000046
Call Trace:
dump_stack+0x89/0xd4
__warn+0xfd/0x120
warn_slowpath_null+0x1d/0x20
rebind_workers+0x1c0/0x1d0
workqueue_cpu_up_callback+0xf5/0x1d0
notifier_call_chain+0x64/0x90
? trace_hardirqs_on_caller+0xf2/0x220
? notify_prepare+0x80/0x80
__raw_notifier_call_chain+0xe/0x10
__cpu_notify+0x35/0x50
notify_down_prepare+0x5e/0x80
? notify_prepare+0x80/0x80
cpuhp_invoke_callback+0x73/0x330
? __schedule+0x33e/0x8a0
cpuhp_down_callbacks+0x51/0xc0
cpuhp_thread_fun+0xc1/0xf0
smpboot_thread_fn+0x159/0x2a0
? smpboot_create_threads+0x80/0x80
kthread+0xef/0x110
? wait_for_completion+0xf0/0x120
? schedule_tail+0x35/0xf0
ret_from_fork+0x22/0x50
? __init_kthread_worker+0x70/0x70
---[ end trace eb12ae47d2382d8f ]---
notify_down_prepare: attempt to take down CPU 0 failed
This bug can be reproduced by below config w/ nohz_full= all cpus:
CONFIG_BOOTPARAM_HOTPLUG_CPU0=y
CONFIG_DEBUG_HOTPLUG_CPU0=y
CONFIG_NO_HZ_FULL=y
As Thomas pointed out:
| If a down prepare callback fails, then DOWN_FAILED is invoked for all
| callbacks which have successfully executed DOWN_PREPARE.
|
| But, workqueue has actually two notifiers. One which handles
| UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE.
|
| Now look at the priorities of those callbacks:
|
| CPU_PRI_WORKQUEUE_UP = 5
| CPU_PRI_WORKQUEUE_DOWN = -5
|
| So the call order on DOWN_PREPARE is:
|
| CB 1
| CB ...
| CB workqueue_up() -> Ignores DOWN_PREPARE
| CB ...
| CB X ---> Fails
|
| So we call up to CB X with DOWN_FAILED
|
| CB 1
| CB ...
| CB workqueue_up() -> Handles DOWN_FAILED
| CB ...
| CB X-1
|
| So the problem is that the workqueue stuff handles DOWN_FAILED in the up
| callback, while it should do it in the down callback. Which is not a good idea
| either because it wants to be called early on rollback...
|
| Brilliant stuff, isn't it? The hotplug rework will solve this problem because
| the callbacks become symetric, but for the existing mess, we need some
| workaround in the workqueue code.
The boot CPU handles housekeeping duty(unbound timers, workqueues,
timekeeping, ...) on behalf of full dynticks CPUs. It must remain
online when nohz full is enabled. There is a priority set to every
notifier_blocks:
workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down
So tick_nohz_cpu_down callback failed when down prepare cpu 0, and
notifier_blocks behind tick_nohz_cpu_down will not be called any
more, which leads to workers are actually not unbound. Then hotplug
state machine will fallback to undo and online cpu 0 again. Workers
will be rebound unconditionally even if they are not unbound and
trigger the warning in this progress.
This patch fix it by catching !DISASSOCIATED to avoid rebind bound
workers.
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Cc: stable@vger.kernel.org
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Our caller expects 0 on success, not >0.
This fixes a bug in the patch
cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
where /sys does not show up in mountinfo, breaking criu.
Thanks for catching this, Andrei.
Reported-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
First of all, trying to open them r/w is idiocy; it's guaranteed to fail.
Moreover, assigning ->f_pos and assuming that everything will work is
blatantly broken - try that with e.g. tmpfs as underlying layer and watch
the fireworks. There may be a non-trivial amount of state associated with
current IO position, well beyond the numeric offset. Using the single
struct file associated with underlying inode is really not a good idea;
we ought to open one for each ecryptfs directory struct file.
Additionally, file_operations both for directories and non-directories are
full of pointless methods; non-directories should *not* have ->iterate(),
directories should not have ->flush(), ->fasync() and ->splice_read().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Overlayfs needs lookup without inode_permission() and already has the name
hash (in form of dentry->d_name on overlayfs dentry). It also doesn't
support filesystems with d_op->d_hash() so basically it only needs
the actual hashed lookup from lookup_one_len_unlocked()
So add a new helper that does unlocked lookup of a hashed name.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Iyappan Subramanian says:
====================
drivers: net: xgene: Bug fixes
This patch set addresses the following bug fixes that were found during testing.
1. IPv4 forward test crash
- drivers: net: xgene: fix IPv4 forward crash
2. Sharing of irqs
- drivers: net: xgene: fix sharing of irqs
3. Ununiform latency across queues
- drivers: net: xgene: fix ununiform latency across queues
4. Fix statistics counters race condition
- drivers: net: xgene: fix statistics counters race condition
5. Correcting register offset and field lengths
- drivers: net: xgene: fix register offset
v2: Address review comments from v1
- Defer TSO fix, and reposting all other patches from v1
v1:
- Initial version
====================
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull spi fixes from Mark Brown:
"A bunch of small driver specific fixes that have come up, none of them
remarkable in themselves. One fixes a regression introduced in the
merge window and another two are targetted at stable"
* tag 'spi-fix-v4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: pxa2xx: Do not detect number of enabled chip selects on Intel SPT
spi: spi-ti-qspi: Handle truncated frames properly
spi: spi-ti-qspi: Fix FLEN and WLEN settings if bits_per_word is overridden
spi: omap2-mcspi: Undo broken fix for dma transfer of vmalloced buffer
spi: spi-fsl-dspi: Fix cs_change handling in message transfer
When using conntrack helpers from OVS, a common configuration is to
perform a lookup without specifying a helper, then go through a
firewalling policy, only to decide to attach a helper afterwards.
In this case, the initial lookup will cause a ct entry to be attached to
the skb, then the later commit with helper should attach the helper and
confirm the connection. However, the helper attachment has been missing.
If the user has enabled automatic helper attachment, then this issue
will be masked as it will be applied in init_conntrack(). It is also
masked if the action is executed from ovs_packet_cmd_execute() as that
will construct a fresh skb.
This patch fixes the issue by making an explicit call to try to assign
the helper if there is a discrepancy between the action's helper and the
current skb->nfct.
Fixes: cae3a2627520 ("openvswitch: Allow attaching helpers to ct action")
Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull perf fixes from Ingo Molnar:
"An uncharacteristically large number of bugs popped up in the last
week:
- various tooling fixes, two crashes and build problems
- two Intel PT fixes
- an KNL uncore driver fix
- an Intel PMU driver fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf stat: Fallback to user only counters when perf_event_paranoid > 1
perf evsel: Handle EACCESS + perf_event_paranoid=2 in fallback()
perf evsel: Improve EPERM error handling in open_strerror()
tools lib traceevent: Do not reassign parg after collapse_tree()
perf probe: Check if dwarf_getlocations() is available
perf dwarf: Guard !x86_64 definitions under #ifdef else clause
perf tools: Use readdir() instead of deprecated readdir_r()
perf thread_map: Use readdir() instead of deprecated readdir_r()
perf script: Use readdir() instead of deprecated readdir_r()
perf tools: Use readdir() instead of deprecated readdir_r()
perf/core: Disable the event on a truncated AUX record
perf/x86/intel/pt: Generate PMI in the STOP region as well
perf/x86: Fix undefined shift on 32-bit kernels
perf/x86/msr: Fix SMI overflow
perf/x86/intel/uncore: Fix CHA registers configuration procedure for Knights Landing platform
perf diff: Fix duplicated output column
Mike reported that this recent commit:
3a47d5124a95 ("sched/fair: Fix fairness issue on migration")
... broke interactivity and the signal starvation test.
We have a proper fix series in the works but ran out of time for
v4.6, so revert the commit.
Reported-by: Mike Galbraith <efault@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
with the following backtrace:
[ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[ 601.347574] Tainted: G O 4.4.5-1-storage+ #6
[ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 601.348142] kworker/u129:5 D ffff880803077988 0 1636 2 0x00000000
[ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[ 601.348999] ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[ 601.349662] ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[ 601.350333] ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[ 601.350965] Call Trace:
[ 601.351203] [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[ 601.351444] [<ffffffff815b01d5>] schedule+0x35/0x80
[ 601.351709] [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
[ 601.351958] [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
[ 601.352208] [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
[ 601.352446] [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[ 601.352688] [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
[ 601.352951] [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[ 601.353196] [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
[ 601.353440] [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
[ 601.353689] [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
[ 601.353958] [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
[ 601.354200] [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
[ 601.354441] [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
[ 601.354688] [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
[ 601.354932] [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
[ 601.355193] [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
[ 601.355432] [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
[ 601.355679] [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
[ 601.355925] [<ffffffff81198379>] vfs_write+0xa9/0x1a0
[ 601.356164] [<ffffffff811c59d8>] kernel_write+0x38/0x50
The underlying device is a null_blk, with default parameters:
queue_mode = MQ
submit_queues = 1
Verification that nullb0 has something inflight:
root@pserver8:~# cat /sys/block/nullb0/inflight
0 1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
ffff8838038e2400
...
During debug it became clear that stalled request is always inserted in
the rq_list from the following path:
save_stack_trace_tsk + 34
blk_mq_insert_requests + 231
blk_mq_flush_plug_list + 281
blk_flush_plug_list + 199
wait_on_page_bit + 192
__filemap_fdatawait_range + 228
filemap_fdatawait_range + 20
filemap_write_and_wait_range + 63
blkdev_fsync + 27
vfs_fsync_range + 73
blkdev_write_iter + 202
__vfs_write + 170
vfs_write + 169
kernel_write + 56
So blk_flush_plug_list() was called with from_schedule == true.
If from_schedule is true, that means that finally blk_mq_insert_requests()
offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
i.e. it calls kblockd_schedule_delayed_work_on().
That means, that we race with another CPU, which is about to execute
__blk_mq_run_hw_queue() work.
Further debugging shows the following traces from different CPUs:
CPU#0 CPU#1
---------------------------------- -------------------------------
reqeust A inserted
STORE hctx->ctx_map[0] bit marked
kblockd_schedule...() returns 1
<schedule to kblockd workqueue>
request B inserted
STORE hctx->ctx_map[1] bit marked
kblockd_schedule...() returns 0
*** WORK PENDING bit is cleared ***
flush_busy_ctxs() is executed, but
bit 1, set by CPU#1, is not observed
As a result request B pended forever.
This behaviour can be explained by speculative LOAD of hctx->ctx_map on
CPU#0, which is reordered with clear of PENDING bit and executed _before_
actual STORE of bit 1 on CPU#1.
The proper fix is an explicit full barrier <mfence>, which guarantees
that clear of PENDING bit is to be executed before all possible
speculative LOADS or STORES inside actual work function.
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Michael Wang <yun.wang@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Patch summary:
When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.
Short explanation (courtesy of mkerrisk):
If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point. Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.
Long version:
When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b. The root dentry field of the mountinfo
entry will show '/a/b'.
cat > /tmp/do1 << EOF
mount -t cgroup -o freezer freezer /mnt
grep freezer /proc/self/mountinfo
EOF
unshare -Gm bash /tmp/do1
> 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
> 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer
The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':
grep freezer /proc/self/cgroup
9:freezer:/
If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:
mount --bind /sys/fs/cgroup/freezer/a/b /mnt
130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.
Example (by mkerrisk):
First, a little script to save some typing and verbiage:
echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'
Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:
2653
2653 # Our shell
14254 # cat(1)
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):
Look at the state of the /proc files:
/proc/self/cgroup: 10:freezer:/
mountinfo: / /sys/fs/cgroup/freezer
The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.
However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:
# propagating to other mountns
/proc/self/cgroup: 7:freezer:/
mountinfo: /a/b /mnt/freezer
The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.
With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace. So the same steps as above:
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: /../.. /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: / /mnt/freezer
cgroup.clone_children freezer.parent_freezing freezer.state tasks
cgroup.procs freezer.self_freezing notify_on_release
3164
2653 # First shell that placed in this cgroup
3164 # Shell started by 'unshare'
14197 # cat(1)
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
* if we have a hashed negative dentry and either CREAT|EXCL on
r/o filesystem, or CREAT|TRUNC on r/o filesystem, or CREAT|EXCL
with failing may_o_create(), we should fail with EROFS or the
error may_o_create() has returned, but not ENOENT. Which is what
the current code ends up returning.
* if we have CREAT|TRUNC hitting a regular file on a read-only
filesystem, we can't fail with EROFS here. At the very least,
not until we'd done follow_managed() - we might have a writable
file (or a device, for that matter) bound on top of that one.
Moreover, the code downstream will see that O_TRUNC and attempt
to grab the write access (*after* following possible mount), so
if we really should fail with EROFS, it will happen. No need
to do that inside atomic_open().
The real logics is much simpler than what the current code is
trying to do - if we decided to go for simple lookup, ended
up with a negative dentry *and* had create_error set, fail with
create_error. No matter whether we'd got that negative dentry
from lookup_real() or had found it in dcache.
Cc: stable@vger.kernel.org # v3.6+
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If a file is renamed to a hardlink of itself POSIX specifies that rename(2)
should do nothing and return success.
This condition is checked in vfs_rename(). However it won't detect hard
links on overlayfs where these are given separate inodes on the overlayfs
layer.
Overlayfs itself detects this condition and returns success without doing
anything, but then vfs_rename() will proceed as if this was a successful
rename (detach_mounts(), d_move()).
The correct thing to do is to detect this condition before even calling
into overlayfs. This patch does this by calling vfs_select_inode() to get
the underlying inodes.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.2+
Patch 562abd39 "xen-netback: support multiple extra info fragments
passed from frontend" contained a mistake which can result in an in-
correct number of responses being generated when handling errors
encountered when processing packets containing extra info fragments.
This patch fixes the problem.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reported-by: Jan Beulich <JBeulich@suse.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull KVM fixes from Paolo Bonzini:
"Two small x86 patches, improving "make kvmconfig" and fixing an
objtool warning for CONFIG_PROFILE_ALL_BRANCHES"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvmconfig: add more virtio drivers
x86/kvm: Add stack frame dependency to fastop() inline asm
Jamal Hadi Salim says:
====================
Some actions were broken in allowing for late binding of actions.
Late binding workflow is as follows:
a) create an action and provide all necessary parameters for it
Optionally provide an index or let the kernel give you one.
Example:
sudo tc actions add action police rate 1kbit burst 90k drop index 1
b) later on bind to the pre-created action from a filter definition
by merely specifying the index.
Example:
sudo tc filter add dev lo parent ffff: protocol ip prio 8 \
u32 match ip src 127.0.0.8/32 flowid 1:8 action police index 1
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull device mapper fix from Mike Snitzer:
"Fix for earlier 4.6-rc4 stable@ commit that introduced improper use of
write lock in cmd_read_lock() -- due to cut-n-paste gone awry (and
sparse didn't catch it)"
* tag 'dm-4.6-fix-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm cache metadata: fix cmd_read_lock() acquiring write lock
Pull ARM SoC fixes from Arnd Bergmann:
"Three more bug fixes for ARM SoCs this week:
- The Atmel sama5d2 was registering the wrong NFC device type
- On Atmel sam9x5, the power management controller had an incorrect
register area size
- On ARM64 Allwinner machine was not secting the generic irqchip
code, causing build errors in some configurations"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: dts: at91: sam9x5: Fix the memory range assigned to the PMC
arm64/sunxi: 4.6-rc1: Add dependency on generic irq chip
ARM: dts: at91: sama5d2: use "atmel,sama5d3-nfc" compatible for nfc
Pull perf/urgent fixes from Arnaldo Carvalho de Melo:
- Fallback to usermode-only counters when perf_event_paranoid > 1, which
is the case now (Arnaldo Carvalho de Melo)
- Do not reassign parg after collapse_tree() in libtraceevent, which
may cause tool crashes (Steven Rostedt)
- Fix the build on Fedora Rawhide, where readdir_r() is deprecated and
also wrt -Werror=unused-const-variable= + x86_32_regoffset_table on
!x86_64 (Arnaldo Carvalho de Melo)
- Fix the build on Ubuntu 12.04.5, where dwarf_getlocations() isn't
available, i.e. libdw-dev < 0.157 (Arnaldo Carvalho de Melo)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull PCI fixes from Bjorn Helgaas:
"Since v4.5, we've WARNed during resume if a PCI device, including a
Thunderbolt device, was added while we were suspended. A change we
merged for v4.6-rc1 turned that warning into a system hang. These
enumeration patches from Lukas Wunner fix this issue:
- Fix BUG on device attach failure
- Do not treat EPROBE_DEFER as device attach failure"
* tag 'pci-v4.6-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: Do not treat EPROBE_DEFER as device attach failure
PCI: Fix BUG on device attach failure
Pull IPMI updates from Corey Minyard:
"Just some minor fixes, nothing big"
* tag 'for-linus-4.6' of git://git.code.sf.net/p/openipmi/linux-ipmi:
ipmi: do not probe ACPI devices if si_tryacpi is unset
ipmi_si: Avoid a wrong long timeout on transaction done
ipmi_si: Fix module parameter doc names
ipmi_ssif: Fix logic around alert handling
We've calculated @len to be the bytes we need for '/..' entries from
@kn_from to the common ancestor, and calculated @nlen to be the extra
bytes we need to get from the common ancestor to @kn_to. We use them
as such at the end. But in the loop copying the actual entries, we
overwrite @nlen. Use a temporary variable for that instead.
Without this, the return length, when the buffer is large enough, is
wrong. (When the buffer is NULL or too small, the returned value is
correct. The buffer contents are also correct.)
Interestingly, no callers of this function are affected by this as of
yet. However the upcoming cgroup_show_path() will be.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
I tried to fix this before, but my previous fix was incomplete
and we can still get the same link error in randconfig builds
because of the way that Kconfig treats the
default y if MVNETA=y && MVNETA_BM_ENABLE
line that does not actually trigger when MVNETA_BM_ENABLE=m,
unlike I intended.
Changing the line to use MVNETA_BM_ENABLE!=n however has
the desired effect and hopefully makes all configurations
work as expected.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 019ded3aa7c9 ("net: mvneta: bm: clarify dependencies")
Acked-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are use cases when chip select should be triggered between transfers
in single SPI message. Current implementation does this only on last
transfer in message ignoring cs_change value provided in current transfer.
Signed-off-by: Andrey Vostrikov <andrey.vostrikov@cogentembedded.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
This reverts commit 3525e0aac91c4de5d20b1f22a6c6e2b39db3cc96.
The DMA transfer for RX buffer was not handled correctly in this change.
The actual transfer length for DMA RX can be less than xfer->len in the
specific condition and the last words will be filled after the DMA
completion, but the commit doesn't consider it and the dmaengine is
started with rx_sg mapped by spi core.
The solution for this at least requires more lines than this commit
has inserted. So revert it for now.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Certain Intel Sunrisepoint PCH variants report zero chip selects in SPI
capabilities register even they have one per port. Detection in
pxa2xx_spi_probe() sets master->num_chipselect to 0 leading to -EINVAL
from spi_register_master() where chip select count is validated.
Fix this by not using SPI capabilities register on Sunrisepoint. They don't
have more than one chip select so use the default value 1 instead of
detection.
Fixes: 8b136baa5892 ("spi: pxa2xx: Detect number of enabled Intel LPSS SPI chip select signals")
Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Cc: stable@vger.kernel.org
We clamp frame_len_words to a maximum of 4096, but do not actually
limit the number of words written or read through the DATA registers
or the length added to spi_message::actual_length. This results in
silent data corruption for commands longer than this maximum.
Recalculate the length of each transfer, taking frame_len_words into
account. Use this length in qspi_{read,write}_msg(), and to increment
spi_message::actual_length.
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Mark Brown <broonie@kernel.org>
Cc: stable@vger.kernel.org
If phy was suspended and is starting, current driver always enable
phy's interrupts, if phy works in polling, phy can raise unexpected
interrupt which will not be handled, the interrupt will block system
enter suspend again. So interrupts should only be re-enabled if phy
works in interrupt.
Signed-off-by: Shaohui Xie <Shaohui.Xie@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The process below was broken and is fixed with this patch.
//add an ife action and give it an instance id of 1
sudo tc actions add action ife encode \
type 0xDEAD allow mark dst 02:15:15:15:15:15 index 1
//create a filter which binds to ife action id 1
sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:11 action ife index 1
Message before fix was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull char/misc fixes from Greg KH:
"Here are some small char/misc driver fixes for 4.6-rc4. Full details
are in the shortlog, nothing major here.
These have all been in linux-next for a while with no reported issues"
* tag 'char-misc-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
lkdtm: do not leak free page on kmalloc failure
lkdtm: fix memory leak of base
lkdtm: fix memory leak of val
extcon: palmas: Drop stray IRQF_EARLY_RESUME flag
Commit 9567366fefdd ("dm cache metadata: fix READ_LOCK macros and
cleanup WRITE_LOCK macros") uses down_write() instead of down_read() in
cmd_read_lock(), yet up_read() is used to release the lock in
READ_UNLOCK(). Fix it.
Fixes: 9567366fefdd ("dm cache metadata: fix READ_LOCK macros and cleanup WRITE_LOCK macros")
Cc: stable@vger.kernel.org
Signed-off-by: Ahmed Samy <f.fallen45@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>