commits

A ~10% regression has been reported for UnixBench's execl throughput
test by Aaron Lu and Ye Xiaolong:

https://lkml.org/lkml/2018/10/30/765

That test is pretty simple, it does a "recursive" execve() syscall on the
same binary. Starting from the syscall, this sequence is possible:

do_execve()
do_execveat_common()
__do_execve_file()
sched_exec()
select_task_rq_fair() <==| Task already enqueued
find_idlest_cpu()
find_idlest_group()
capacity_spare_wake() <==| Functions not called from
cpu_util_wake() | the wakeup path

which means we can end up calling cpu_util_wake() not only from the
"wakeup path", as its name would suggest. Indeed, the task doing an
execve() syscall is already enqueued on the CPU we want to get the
cpu_util_wake() for.

The estimated utilization for a CPU computed in cpu_util_wake() was
written under the assumption that function can be called only from the
wakeup path. If instead the task is already enqueued, we end up with a
utilization which does not remove the current task's contribution from
the estimated utilization of the CPU.
This will wrongly assume a reduced spare capacity on the current CPU and
increase the chances to migrate the task on execve.

The regression is tracked down to:

commit d519329f72a6 ("sched/fair: Update util_est only on util_avg updates")

because in that patch we turn on by default the UTIL_EST sched feature.
However, the real issue is introduced by:

commit f9be3e5961c5 ("sched/fair: Use util_est in LB and WU paths")

Let's fix this by ensuring to always discount the task estimated
utilization from the CPU's estimated utilization when the task is also
the current one. The same benchmark of the bug report, executed on a
dual socket 40 CPUs Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine,
reports these "Execl Throughput" figures (higher the better):

mainline : 48136.5 lps
mainline+fix : 55376.5 lps

which correspond to a 15% speedup.

Moreover, since {cpu_util,capacity_spare}_wake() are not really only
used from the wakeup path, let's remove this ambiguity by using a better
matching name: {cpu_util,capacity_spare}_without().

Since we are at that, let's also improve the existing documentation.

Reported-by: Aaron Lu <aaron.lu@intel.com>
Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: f9be3e5961c5 (sched/fair: Use util_est in LB and WU paths)
Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/
Signed-off-by: Ingo Molnar <mingo@kernel.org>

7y ago

Michal Hocko

c63ae43b

mm, page_alloc: check for max order in hot path

Konstantin has noticed that kvmalloc might trigger the following
warning:

WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
[...]
Call Trace:
fragmentation_index+0x76/0x90
compaction_suitable+0x4f/0xf0
shrink_node+0x295/0x310
node_reclaim+0x205/0x250
get_page_from_freelist+0x649/0xad0
__alloc_pages_nodemask+0x12a/0x2a0
kmalloc_large_node+0x47/0x90
__kmalloc_node+0x22b/0x2e0
kvmalloc_node+0x3e/0x70
xt_alloc_table_info+0x3a/0x80 [x_tables]
do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
nf_setsockopt+0x44/0x60
SyS_setsockopt+0x6f/0xc0
do_syscall_64+0x67/0x120
entry_SYSCALL_64_after_hwframe+0x3d/0xa2

the problem is that we only check for an out of bound order in the slow
path and the node reclaim might happen from the fast path already. This
is fixable by making sure that kvmalloc doesn't ever use kmalloc for
requests that are larger than KMALLOC_MAX_SIZE but this also shows that
the code is rather fragile. A recent UBSAN report just underlines that
by the following report

UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
shift exponent 51 is too large for 32-bit type 'int'
CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0xd2/0x148 lib/dump_stack.c:113
ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
__ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
__zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
zone_watermark_fast mm/page_alloc.c:3216 [inline]
get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
__alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
alloc_pages include/linux/gfp.h:509 [inline]
__get_free_pages+0x12/0x60 mm/page_alloc.c:4414
dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
__blkdev_driver_ioctl block/ioctl.c:303 [inline]
blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
block_ioctl+0x105/0x150 fs/block_dev.c:1883
vfs_ioctl fs/ioctl.c:46 [inline]
do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
__do_sys_ioctl fs/ioctl.c:709 [inline]
__se_sys_ioctl fs/ioctl.c:707 [inline]
__x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Note that this is not a kvmalloc path. It is just that the fast path
really depends on having sanitzed order as well. Therefore move the
order check to the fast path.

Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-by: Kyungtae Kim <kt0755@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Byoungyoung Lee <lifeasageek@gmail.com>
Cc: "Dae R. Jeong" <threeearcat@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7y ago

Masayoshi Mizuma

af31b04b

tools/testing/nvdimm: Fix the array size for dimm devices.

7y ago

Linus Torvalds

743a4863

Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

7y ago

Kan Liang

4d47d640

perf/x86/intel/uncore: Support CoffeeLake 8th CBOX

7y ago

Yi Wang

e1ff516a

sched/fair: Fix a comment in task_numa_fault()

7y ago

Uwe Kleine-König

6f4d29df

scripts/spdxcheck.py: make python3 compliant

7y ago

Linus Torvalds

65102238

Linux 4.20-rc1 v4.20-rc1

7y ago

Linus Torvalds

cfaa9f02

Merge branch 'spectre' of git://git.armlinux.org.uk/~rmk/linux-arm

7y ago

Ard Biesheuvel

63eb322d

efi: Permit calling efi_mem_reserve_persistent() from atomic context

7y ago

Kan Liang

c10a8de0

perf/x86/intel/uncore: Add more IMC PCI IDs for KabyLake and CoffeeLake CPUs

7y ago

Valentin Schneider

40fa3780

sched/core: Take the hotplug lock in sched_init_smp()

When running on linux-next (8c60c36d0b8c ("Add linux-next specific files
for 20181019")) + CONFIG_PROVE_LOCKING=y on a big.LITTLE system (e.g.
Juno or HiKey960), we get the following report:

[ 0.748225] Call trace:
[ 0.750685] lockdep_assert_cpus_held+0x30/0x40
[ 0.755236] static_key_enable_cpuslocked+0x20/0xc8
[ 0.760137] build_sched_domains+0x1034/0x1108
[ 0.764601] sched_init_domains+0x68/0x90
[ 0.768628] sched_init_smp+0x30/0x80
[ 0.772309] kernel_init_freeable+0x278/0x51c
[ 0.776685] kernel_init+0x10/0x108
[ 0.780190] ret_from_fork+0x10/0x18

The static_key in question is 'sched_asym_cpucapacity' introduced by
commit:

df054e8445a4 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations")

In this particular case, we enable it because smp_prepare_cpus() will
end up fetching the capacity-dmips-mhz entry from the devicetree,
so we already have some asymmetry detected when entering sched_init_smp().

This didn't get detected in tip/sched/core because we were missing:

commit cb538267ea1e ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations")

Calls to build_sched_domains() post sched_init_smp() will hold the
hotplug lock, it just so happens that this very first call is a
special case. As stated by a comment in sched_init_smp(), "There's no
userspace yet to cause hotplug operations" so this is a harmless
warning.

However, to both respect the semantics of underlying
callees and make lockdep happy, take the hotplug lock in
sched_init_smp(). This also satisfies the comment atop
sched_init_domains() that says "Callers must hold the hotplug lock".

Reported-by: Sudeep Holla <sudeep.holla@arm.com>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar.Eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: morten.rasmussen@arm.com
Cc: quentin.perret@arm.com
Link: http://lkml.kernel.org/r/1540301851-3048-1-git-send-email-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

7y ago

Yufen Yu

1a413646

tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset

7y ago

Linus Torvalds

42bd06e9

Merge tag 'tags/upstream-4.20-rc1' of git://git.infradead.org/linux-ubifs

7y ago

Linus Torvalds

1ce80e0f

Merge tag 'fsnotify_for_v4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

7y ago

Russell King

383fb3ee

ARM: spectre-v2: per-CPU vtables to work around big.Little systems

7y ago

Ard Biesheuvel

eff89628

efi/arm: Defer persistent reservations until after paging_init()

7y ago

Linus Torvalds

ccda4af0

Linux 4.20-rc2 v4.20-rc2

7y ago

Peter Zijlstra

993f0b05

sched/topology: Fix off by one bug

7y ago

Arnd Bergmann

1c23b410

lib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn

7y ago

Linus Torvalds

4710e789

Merge tag 'nfs-for-4.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

7y ago

Ding Xiang

84db119f

ubifs: Remove unneeded semicolon

7y ago

Linus Torvalds

e6a2562f

Merge tag 'gfs2-4.20.fixes3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

7y ago

Amir Goldstein

b469e7e4

fanotify: fix handling of events on child sub-directory

7y ago

Russell King

e209950f

ARM: add PROC_VTABLE and PROC_TABLE macros

7y ago

Ard Biesheuvel

72a58a63

efi/arm/libstub: Pack FDT after populating it

7y ago

Linus Torvalds

7a3765ed

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:
"One last pull request before heading to Vancouver for LPC, here we have:

1) Don't forget to free VSI contexts during ice driver unload, from
Victor Raj.

2) Don't forget napi delete calls during device remove in ice driver,
from Dave Ertman.

3) Don't request VLAN tag insertion of ibmvnic device when SKB
doesn't have VLAN tags at all.

4) IPV4 frag handling code has to accomodate the situation where two
threads try to insert the same fragment into the hash table at the
same time. From Eric Dumazet.

5) Relatedly, don't flow separate on protocol ports for fragmented
frames, also from Eric Dumazet.

6) Memory leaks in qed driver, from Denis Bolotin.

7) Correct valid MTU range in smsc95xx driver, from Stefan Wahren.

8) Validate cls_flower nested policies properly, from Jakub Kicinski.

9) Clearing of stats counters in mc88e6xxx driver doesn't retain
important bits in the G1_STATS_OP register causing the chip to
hang. Fix from Andrew Lunn"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
act_mirred: clear skb->tstamp on redirect
net: dsa: mv88e6xxx: Fix clearing of stats counters
tipc: fix link re-establish failure
net: sched: cls_flower: validate nested enc_opts_policy to avoid warning
net: mvneta: correct typo
flow_dissector: do not dissect l4 ports for fragments
net: qualcomm: rmnet: Fix incorrect assignment of real_dev
net: aquantia: allow rx checksum offload configuration
net: aquantia: invalid checksumm offload implementation
net: aquantia: fixed enable unicast on 32 macvlan
net: aquantia: fix potential IOMMU fault after driver unbind
net: aquantia: synchronized flow control between mac/phy
net: smsc95xx: Fix MTU range
net: stmmac: Fix RX packet size > 8191
qed: Fix potential memory corruption
qed: Fix SPQ entries not returned to pool in error flows
qed: Fix blocking/unlimited SPQ entries leak
qed: Fix memory/entry leak in qed_init_sp_request()
inet: frags: better deal with smp races
net: hns3: bugfix for not checking return value
...

7y ago

Muchun Song

a68d7508

sched/rt: Update comment in pick_next_task_rt()

7y ago

Janne Huttunen

13c9aaf7

mm/vmstat.c: fix NUMA statistics updates

7y ago

Linus Torvalds

35e74524

Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

7y ago

Colin Ian King

d3787af2

NFS: fix spelling mistake, EACCESS -> EACCES

7y ago

Sascha Hauer

e453fa60

Documentation: ubifs: Add authentication whitepaper

7y ago

Linus Torvalds

32e2524a

Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

7y ago

Andreas Gruenbacher

c26b5aa8

gfs2: Fix iomap buffer head reference counting bug

7y ago

Linus Torvalds

b00d2092

Merge tag 'compiler-attributes-for-linus-v4.20-rc2' of https://github.com/ojeda/linux

7y ago

Russell King

945aceb1

ARM: clean up per-processor check_bugs method call

7y ago

Ard Biesheuvel

33412b86

efi/arm: Revert deferred unmap of early memmap mapping

7y ago

Linus Torvalds

e12e00e3

Merge tag 'kbuild-fixes-v4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

7y ago

Eric Dumazet

7236ead1

act_mirred: clear skb->tstamp on redirect

7y ago

Linus Torvalds

b59dfdae

i2c-hid: properly terminate i2c_hid_dmi_desc_override_table[] array

7y ago

Mike Rapoport

78179556

mm/gup.c: fix follow_page_mask() kerneldoc comment

7y ago

Linus Torvalds

04578e84

Merge tag 'ntb-4.20' of git://github.com/jonmason/ntb

7y ago

Thomas Gleixner

bff9a107

Merge branch 'clockevents/4.20-rc1' of https://git.linaro.org/people/daniel.lezcano/linux into timers/urgent

7y ago

Paul Burton

c3be6577

SUNRPC: Use atomic(64)_t for seq_send(64)

The seq_send & seq_send64 fields in struct krb5_ctx are used as
atomically incrementing counters. This is implemented using cmpxchg() &
cmpxchg64() to implement what amount to custom versions of
atomic_fetch_inc() & atomic64_fetch_inc().

Besides the duplication, using cmpxchg64() has another major drawback in
that some 32 bit architectures don't provide it. As such commit
571ed1fd2390 ("SUNRPC: Replace krb5_seq_lock with a lockless scheme")
resulted in build failures for some architectures.

Change seq_send to be an atomic_t and seq_send64 to be an atomic64_t,
then use atomic(64)_* functions to manipulate the values. The atomic64_t
type & associated functions are provided even on architectures which
lack real 64 bit atomic memory access via CONFIG_GENERIC_ATOMIC64 which
uses spinlocks to serialize access. This fixes the build failures for
architectures lacking cmpxchg64().

A potential alternative that was raised would be to provide cmpxchg64()
on the 32 bit architectures that currently lack it, using spinlocks.
However this would provide a version of cmpxchg64() with semantics a
little different to the implementations on architectures with real 64
bit atomics - the spinlock-based implementation would only work if all
access to the memory used with cmpxchg64() is *always* performed using
cmpxchg64(). That is not currently a requirement for users of
cmpxchg64(), and making it one seems questionable. As such avoiding
cmpxchg64() outside of architecture-specific code seems best,
particularly in cases where atomic64_t seems like a better fit anyway.

The CONFIG_GENERIC_ATOMIC64 implementation of atomic64_* functions will
use spinlocks & so faces the same issue, but with the key difference
that the memory backing an atomic64_t ought to always be accessed via
the atomic64_* functions anyway making the issue moot.

Signed-off-by: Paul Burton <paul.burton@mips.com>
Fixes: 571ed1fd2390 ("SUNRPC: Replace krb5_seq_lock with a lockless scheme")
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: linux-nfs@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>