commits

tl;dr:

Several SGX users reported seeing the following message on NUMA systems:

sgx: [Firmware Bug]: Unable to map EPC section to online node. Fallback to the NUMA node 0.

This turned out to be the memblock code mistakenly throwing away SGX
memory.

=== Full Changelog ===

The 'max_pfn' variable represents the highest known RAM address. It can
be used, for instance, to quickly determine for which physical addresses
there is mem_map[] space allocated. The numa_meminfo code makes an
effort to throw out ("trim") all memory blocks which are above 'max_pfn'.

SGX memory is not considered RAM (it is marked as "Reserved" in the
e820) and is not taken into account by max_pfn. Despite this, SGX memory
areas have NUMA affinity and are enumerated in the ACPI SRAT table. The
existing SGX code uses the numa_meminfo mechanism to look up the NUMA
affinity for its memory areas.

In cases where SGX memory was above max_pfn (usually just the one EPC
section in the last highest NUMA node), the numa_memblock is truncated
at 'max_pfn', which is below the SGX memory. When the SGX code tries to
look up the affinity of this memory, it fails and produces an error message:

sgx: [Firmware Bug]: Unable to map EPC section to online node. Fallback to the NUMA node 0.

and assigns the memory to NUMA node 0.

Instead of silently truncating the memory block at 'max_pfn' and
dropping the SGX memory, add the truncated portion to
'numa_reserved_meminfo'. This allows the SGX code to later determine
the NUMA affinity of its 'Reserved' area.

Before, numa_meminfo looked like this (from 'crash'):

blk = { start = 0x0, end = 0x2080000000, nid = 0x0 }
{ start = 0x2080000000, end = 0x4000000000, nid = 0x1 }

numa_reserved_meminfo is empty.

With this, numa_meminfo looks like this:

blk = { start = 0x0, end = 0x2080000000, nid = 0x0 }
{ start = 0x2080000000, end = 0x4000000000, nid = 0x1 }

and numa_reserved_meminfo has an entry for node 1's SGX memory:

blk = { start = 0x4000000000, end = 0x4080000000, nid = 0x1 }

[ daveh: completely rewrote/reworked changelog ]

Fixes: 5d30f92e7631 ("x86/NUMA: Provide a range-to-target_node lookup facility")
Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20210617194657.0A99CB22@viggo.jf.intel.com

4y ago

Marc Zyngier

382e6e17

irqchip/gic-v3: Workaround inconsistent PMR setting on NMI entry

4y ago

Linus Torvalds

e4e45343

Merge tag 'perf-tools-fixes-for-v5.13-2021-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux

4y ago

Linus Torvalds

913ec3c2

Merge tag 'perf-tools-fixes-for-v5.13-2021-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux

4y ago

Athira Rajeev

60b7ed54

powerpc/perf: Fix crash in perf_instruction_pointer() when ppmu is not set

4y ago

Kai Huang

4692bc77

x86/sgx: Add missing xa_destroy() when virtual EPC is destroyed

4y ago

Linus Torvalds

d07f6ca9

Linux 5.13-rc2 v5.13-rc2

4y ago

Linus Torvalds

960f0716

Merge tag 'nfs-for-5.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

4y ago

Arnaldo Carvalho de Melo

36524112

tools headers cpufeatures: Sync with the kernel sources

4y ago

Linus Torvalds

d9403d30

Merge tag 'riscv-for-linus-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

4y ago

Arnaldo Carvalho de Melo

1792a59e

tools headers UAPI: Sync linux/in.h copy with the kernel sources

4y ago

Michael Ellerman

478036c4

powerpc: Fix initrd corruption with relative jump labels

Commit b0b3b2c78ec0 ("powerpc: Switch to relative jump labels") switched
us to using relative jump labels. That involves changing the code,
target and key members in struct jump_entry to be relative to the
address of the jump_entry, rather than absolute addresses.

We have two static inlines that create a struct jump_entry,
arch_static_branch() and arch_static_branch_jump(), as well as an asm
macro ARCH_STATIC_BRANCH, which is used by the pseries-only hypervisor
tracing code.

Unfortunately we missed updating the key to be a relative reference in
ARCH_STATIC_BRANCH.

That causes a pseries kernel to have a handful of jump_entry structs
with bad key values. Instead of being a relative reference they instead
hold the full address of the key.

However the code doesn't expect that, it still adds the key value to the
address of the jump_entry (see jump_entry_key()) expecting to get a
pointer to a key somewhere in kernel data.

The table of jump_entry structs sits in rodata, which comes after the
kernel text. In a typical build this will be somewhere around 15MB. The
address of the key will be somewhere in data, typically around 20MB.
Adding the two values together gets us a pointer somewhere around 45MB.

We then call static_key_set_entries() with that bad pointer and modify
some members of the struct static_key we think we are pointing at.

A pseries kernel is typically ~30MB in size, so writing to ~45MB won't
corrupt the kernel itself. However if we're booting with an initrd,
depending on the size and exact location of the initrd, we can corrupt
the initrd. Depending on how exactly we corrupt the initrd it can either
cause the system to not boot, or just corrupt one of the files in the
initrd.

The fix is simply to make the key value relative to the jump_entry
struct in the ARCH_STATIC_BRANCH macro.

Fixes: b0b3b2c78ec0 ("powerpc: Switch to relative jump labels")
Reported-by: Anastasia Kovaleva <a.kovaleva@yadro.com>
Reported-by: Roman Bolshakov <r.bolshakov@yadro.com>
Reported-by: Greg Kurz <groug@kaod.org>
Reported-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Daniel Axtens <dja@axtens.net>
Tested-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210614131440.312360-1-mpe@ellerman.id.au

4y ago

Thomas Gleixner

efa16550

x86/fpu: Reset state for all signal restore failures

4y ago

Linus Torvalds

28183dbf

Merge tag 'driver-core-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

4y ago

Linus Torvalds

331a6edb

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

4y ago

Trond Myklebust

c3aba897

NFSv4: Fix second deadlock in nfs4_evict_inode()

4y ago

Leo Yan

197eecb6

perf session: Correct buffer copying when peeking events

4y ago

Linus Torvalds

e14c779a

Merge tag 's390-5.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

4y ago

David Abdurachmanov

7ede12b0

riscv: dts: fu740: fix cache-controller interrupts

4y ago

Arnaldo Carvalho de Melo

17d27fc3

tools headers UAPI: Sync asm-generic/unistd.h with the kernel original

4y ago

Michael Ellerman

e41d6c3f

powerpc/signal64: Copy siginfo before changing regs->nip

4y ago

Thomas Gleixner

510b80a6

x86/pkru: Write hardware init value to PKRU when xstate is init

4y ago

Linus Torvalds

6942d81a

Merge tag 'staging-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

4y ago

Tudor Ambarus

bb4031b8

clk: Skip clk provider registration when np is NULL

4y ago

Linus Torvalds

8ecfa36c

Merge tag 'riscv-for-linus-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

4y ago

Ming Lei

1e0d4e62

scsi: core: Only put parent device if host state differs from SHOST_CREATED

4y ago

Trond Myklebust

dfe1fe75

NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()

4y ago

Eric W. Biederman

06af8679

coredump: Limit what can interrupt coredumps

4y ago

Linus Torvalds

9ed13a17

Merge tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.13-rc7, including fixes from wireless, bpf,
bluetooth, netfilter and can.

Current release - regressions:

- mlxsw: spectrum_qdisc: Pass handle, not band number to find_class()
to fix modifying offloaded qdiscs

- lantiq: net: fix duplicated skb in rx descriptor ring

- rtnetlink: fix regression in bridge VLAN configuration, empty info
is not an error, bot-generated "fix" was not needed

- libbpf: s/rx/tx/ typo on umem->rx_ring_setup_done to fix umem
creation

Current release - new code bugs:

- ethtool: fix NULL pointer dereference during module EEPROM dump via
the new netlink API

- mlx5e: don't update netdev RQs with PTP-RQ, the special purpose
queue should not be visible to the stack

- mlx5e: select special PTP queue only for SKBTX_HW_TSTAMP skbs

- mlx5e: verify dev is present in get devlink port ndo, avoid a panic

Previous releases - regressions:

- neighbour: allow NUD_NOARP entries to be force GCed

- further fixes for fallout from reorg of WiFi locking (staging:
rtl8723bs, mac80211, cfg80211)

- skbuff: fix incorrect msg_zerocopy copy notifications

- mac80211: fix NULL ptr deref for injected rate info

- Revert "net/mlx5: Arm only EQs with EQEs" it may cause missed IRQs

Previous releases - always broken:

- bpf: more speculative execution fixes

- netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-local

- udp: fix race between close() and udp_abort() resulting in a panic

- fix out of bounds when parsing TCP options before packets are
validated (in netfilter: synproxy, tc: sch_cake and mptcp)

- mptcp: improve operation under memory pressure, add missing
wake-ups

- mptcp: fix double-lock/soft lookup in subflow_error_report()

- bridge: fix races (null pointer deref and UAF) in vlan tunnel
egress

- ena: fix DMA mapping function issues in XDP

- rds: fix memory leak in rds_recvmsg

Misc:

- vrf: allow larger MTUs

- icmp: don't send out ICMP messages with a source address of 0.0.0.0

- cdc_ncm: switch to eth%d interface naming"

* tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (139 commits)
net: ethernet: fix potential use-after-free in ec_bhf_remove
selftests/net: Add icmp.sh for testing ICMP dummy address responses
icmp: don't send out ICMP messages with a source address of 0.0.0.0
net: ll_temac: Avoid ndo_start_xmit returning NETDEV_TX_BUSY
net: ll_temac: Fix TX BD buffer overwrite
net: ll_temac: Add memory-barriers for TX BD access
net: ll_temac: Make sure to free skb when it is completely used
MAINTAINERS: add Guvenc as SMC maintainer
bnxt_en: Call bnxt_ethtool_free() in bnxt_init_one() error path
bnxt_en: Fix TQM fastpath ring backing store computation
bnxt_en: Rediscover PHY capabilities after firmware reset
cxgb4: fix wrong shift.
mac80211: handle various extensible elements correctly
mac80211: reset profile_periodicity/ema_ap
cfg80211: avoid double free of PMSR request
cfg80211: make certificate generation more robust
mac80211: minstrel_ht: fix sample time check
net: qed: Fix memcpy() overflow of qed_dcbx_params()
net: cdc_eem: fix tx fixup skb leak
net: hamradio: fix memory leak in mkiss_close
...

4y ago

Harald Freudenberger

e73a99f3

s390/ap: Fix hanging ioctl caused by wrong msg counter

4y ago

Jisheng Zhang

3a02764c

riscv: Ensure BPF_JIT_REGION_START aligned with PMD size

Andreas reported commit fc8504765ec5 ("riscv: bpf: Avoid breaking W^X")
breaks booting with one kind of defconfig, I reproduced a kernel panic
with the defconfig:

[ 0.138553] Unable to handle kernel paging request at virtual address ffffffff81201220
[ 0.139159] Oops [#1]
[ 0.139303] Modules linked in:
[ 0.139601] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.13.0-rc5-default+ #1
[ 0.139934] Hardware name: riscv-virtio,qemu (DT)
[ 0.140193] epc : __memset+0xc4/0xfc
[ 0.140416] ra : skb_flow_dissector_init+0x1e/0x82
[ 0.140609] epc : ffffffff8029806c ra : ffffffff8033be78 sp : ffffffe001647da0
[ 0.140878] gp : ffffffff81134b08 tp : ffffffe001654380 t0 : ffffffff81201158
[ 0.141156] t1 : 0000000000000002 t2 : 0000000000000154 s0 : ffffffe001647dd0
[ 0.141424] s1 : ffffffff80a43250 a0 : ffffffff81201220 a1 : 0000000000000000
[ 0.141654] a2 : 000000000000003c a3 : ffffffff81201258 a4 : 0000000000000064
[ 0.141893] a5 : ffffffff8029806c a6 : 0000000000000040 a7 : ffffffffffffffff
[ 0.142126] s2 : ffffffff81201220 s3 : 0000000000000009 s4 : ffffffff81135088
[ 0.142353] s5 : ffffffff81135038 s6 : ffffffff8080ce80 s7 : ffffffff80800438
[ 0.142584] s8 : ffffffff80bc6578 s9 : 0000000000000008 s10: ffffffff806000ac
[ 0.142810] s11: 0000000000000000 t3 : fffffffffffffffc t4 : 0000000000000000
[ 0.143042] t5 : 0000000000000155 t6 : 00000000000003ff
[ 0.143220] status: 0000000000000120 badaddr: ffffffff81201220 cause: 000000000000000f
[ 0.143560] [<ffffffff8029806c>] __memset+0xc4/0xfc
[ 0.143859] [<ffffffff8061e984>] init_default_flow_dissectors+0x22/0x60
[ 0.144092] [<ffffffff800010fc>] do_one_initcall+0x3e/0x168
[ 0.144278] [<ffffffff80600df0>] kernel_init_freeable+0x1c8/0x224
[ 0.144479] [<ffffffff804868a8>] kernel_init+0x12/0x110
[ 0.144658] [<ffffffff800022de>] ret_from_exception+0x0/0xc
[ 0.145124] ---[ end trace f1e9643daa46d591 ]---

After some investigation, I think I found the root cause: commit
2bfc6cd81bd ("move kernel mapping outside of linear mapping") moves
BPF JIT region after the kernel:

| #define BPF_JIT_REGION_START PFN_ALIGN((unsigned long)&_end)

The &_end is unlikely aligned with PMD size, so the front bpf jit
region sits with part of kernel .data section in one PMD size mapping.
But kernel is mapped in PMD SIZE, when bpf_jit_binary_lock_ro() is
called to make the first bpf jit prog ROX, we will make part of kernel
.data section RO too, so when we write to, for example memset the
.data section, MMU will trigger a store page fault.

To fix the issue, we need to ensure the BPF JIT region is PMD size
aligned. This patch acchieve this goal by restoring the BPF JIT region
to original position, I.E the 128MB before kernel .text section. The
modification to kasan_init.c is inspired by Alexandre.

Fixes: fc8504765ec5 ("riscv: bpf: Avoid breaking W^X")
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>

4y ago

Arnaldo Carvalho de Melo

ef83f9ef

perf beauty: Update copy of linux/socket.h with the kernel sources

4y ago

Christophe Leroy

8e11d62e

powerpc/mem: Add back missing header to fix 'no previous prototype' error

4y ago

Thomas Gleixner

12f7764a

x86/process: Check PF_KTHREAD and not current->mm for kernel threads

4y ago

Linus Torvalds

4a668429

Merge tag 'usb-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

4y ago

Greg Kroah-Hartman

ba9c25d9

Merge tag 'iio-fixes-5.13a' of https://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-linus

4y ago

Saravana Kannan

28ec344b

usb: typec: tcpm: Don't block probing of consumers of "connector" nodes

4y ago

Feng Tang

2e302543

mm: relocate 'write_protect_seq' in struct mm_struct

0day robot reported a 9.2% regression for will-it-scale mmap1 test
case[1], caused by commit 57efa1fe5957 ("mm/gup: prevent gup_fast from
racing with COW during fork").

Further debug shows the regression is due to that commit changes the
offset of hot fields 'mmap_lock' inside structure 'mm_struct', thus some
cache alignment changes.

From the perf data, the contention for 'mmap_lock' is very severe and
takes around 95% cpu cycles, and it is a rw_semaphore

struct rw_semaphore {
atomic_long_t count; /* 8 bytes */
atomic_long_t owner; /* 8 bytes */
struct optimistic_spin_queue osq; /* spinner MCS lock */
...

Before commit 57efa1fe5957 adds the 'write_protect_seq', it happens to
have a very optimal cache alignment layout, as Linus explained:

"and before the addition of the 'write_protect_seq' field, the
mmap_sem was at offset 120 in 'struct mm_struct'.

Which meant that count and owner were in two different cachelines,
and then when you have contention and spend time in
rwsem_down_write_slowpath(), this is probably *exactly* the kind
of layout you want.

Because first the rwsem_write_trylock() will do a cmpxchg on the
first cacheline (for the optimistic fast-path), and then in the
case of contention, rwsem_down_write_slowpath() will just access
the second cacheline.

Which is probably just optimal for a load that spends a lot of
time contended - new waiters touch that first cacheline, and then
they queue themselves up on the second cacheline."

After the commit, the rw_semaphore is at offset 128, which means the
'count' and 'owner' fields are now in the same cacheline, and causes
more cache bouncing.

Currently there are 3 "#ifdef CONFIG_XXX" before 'mmap_lock' which will
affect its offset:

CONFIG_MMU
CONFIG_MEMBARRIER
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES

The layout above is on 64 bits system with 0day's default kernel config
(similar to RHEL-8.3's config), in which all these 3 options are 'y'.
And the layout can vary with different kernel configs.

Relayouting a structure is usually a double-edged sword, as sometimes it
can helps one case, but hurt other cases. For this case, one solution
is, as the newly added 'write_protect_seq' is a 4 bytes long seqcount_t
(when CONFIG_DEBUG_LOCK_ALLOC=n), placing it into an existing 4 bytes
hole in 'mm_struct' will not change other fields' alignment, while
restoring the regression.

Link: https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/ [1]
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>