commits

Currently mm_iommu_do_alloc() is called in 2 cases:
- VFIO_IOMMU_SPAPR_REGISTER_MEMORY ioctl() for normal memory:
	this locks &mem_list_mutex and then locks mm::mmap_sem
	several times when adjusting locked_vm or pinning pages;
- vfio_pci_nvgpu_regops::mmap() for GPU memory:
	this is called with mm::mmap_sem held already and it locks
	&mem_list_mutex.

So one can craft a userspace program that issues the ioctl and the mmap
from 2 threads concurrently and causes a deadlock, which lockdep warns
about (below).

We have not hit this yet because QEMU constructs the machine in a single
thread.

This moves the overlap check next to where the new entry is added and
reduces the amount of time spent with &mem_list_mutex held.

This moves locked_vm adjustment from under &mem_list_mutex.

This relies on mm_iommu_adjust_locked_vm() doing nothing when entries==0.

This is one of the lockdep warnings:

======================================================
WARNING: possible circular locking dependency detected
5.1.0-rc2-le_nv2_aikATfstn1-p1 #363 Not tainted
------------------------------------------------------
qemu-system-ppc/8038 is trying to acquire lock:
000000002ec6c453 (mem_list_mutex){+.+.}, at: mm_iommu_do_alloc+0x70/0x490

but task is already holding lock:
00000000fd7da97f (&mm->mmap_sem){++++}, at: vm_mmap_pgoff+0xf0/0x160

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&mm->mmap_sem){++++}:
lock_acquire+0xf8/0x260
down_write+0x44/0xa0
mm_iommu_adjust_locked_vm.part.1+0x4c/0x190
mm_iommu_do_alloc+0x310/0x490
tce_iommu_ioctl.part.9+0xb84/0x1150 [vfio_iommu_spapr_tce]
vfio_fops_unl_ioctl+0x94/0x430 [vfio]
do_vfs_ioctl+0xe4/0x930
ksys_ioctl+0xc4/0x110
sys_ioctl+0x28/0x80
system_call+0x5c/0x70

-> #0 (mem_list_mutex){+.+.}:
__lock_acquire+0x1484/0x1900
lock_acquire+0xf8/0x260
__mutex_lock+0x88/0xa70
mm_iommu_do_alloc+0x70/0x490
vfio_pci_nvgpu_mmap+0xc0/0x130 [vfio_pci]
vfio_pci_mmap+0x198/0x2a0 [vfio_pci]
vfio_device_fops_mmap+0x44/0x70 [vfio]
mmap_region+0x5d4/0x770
do_mmap+0x42c/0x650
vm_mmap_pgoff+0x124/0x160
ksys_mmap_pgoff+0xdc/0x2f0
sys_mmap+0x40/0x80
system_call+0x5c/0x70

other info that might help us debug this:

Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&mm->mmap_sem);
                               lock(mem_list_mutex);
                               lock(&mm->mmap_sem);
  lock(mem_list_mutex);

*** DEADLOCK ***

1 lock held by qemu-system-ppc/8038:
#0: 00000000fd7da97f (&mm->mmap_sem){++++}, at: vm_mmap_pgoff+0xf0/0x160

Fixes: c10c21efa4bc ("powerpc/vfio/iommu/kvm: Do not pin device memory", 2018-12-19)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

6y ago

Russell King

50362162

ARM: fix function graph tracer and unwinder dependencies

6y ago

Linus Torvalds

72a6e35d

Merge tag 'dmaengine-fix-5.1-rc7' of git://git.infradead.org/users/vkoul/slave-dma

6y ago

Lijun Ou

2557fabd

RDMA/hns: Bugfix for mapping user db

6y ago

Stefan Bühler

fb775faa

io_uring: fix poll full SQ detection

6y ago

Michael Ellerman

8adddf34

powerpc/mm/radix: Make Radix require HUGETLB_PAGE

Joel reported weird crashes using skiroot_defconfig; in his case we
jumped into an NX page:

kernel tried to execute exec-protected page (c000000002bff4f0) - exploit attempt? (uid: 0)
BUG: Unable to handle kernel instruction fetch
Faulting instruction address: 0xc000000002bff4f0

Looking at the disassembly, we had simply branched to that address:

c000000000c001bc 49fff335 bl c000000002bff4f0

But that didn't match the original kernel image:

c000000000c001bc 4bfff335 bl c000000000bff4f0 <kobject_get+0x8>

When STRICT_KERNEL_RWX is enabled, and we're using the radix MMU, we
call radix__change_memory_range() late in boot to change page
protections. We do that both to mark rodata read only and also to mark
init text no-execute. That involves walking the kernel page tables,
and clearing _PAGE_WRITE or _PAGE_EXEC respectively.

With radix we may use hugepages for the linear mapping, so the code in
radix__change_memory_range() uses e.g. pmd_huge() to test if it has
found a huge mapping, and if so it stops the page table walk and
changes the PMD permissions.

However if the kernel is built without HUGETLBFS support, pmd_huge()
is just a #define that always returns 0. That causes the code in
radix__change_memory_range() to incorrectly interpret the PMD value as
a pointer to a PTE page rather than as a PTE at the PMD level.

We can see this using `dv` in xmon which also uses pmd_huge():

0:mon> dv c000000000000000
pgd @ 0xc000000001740000
pgdp @ 0xc000000001740000 = 0x80000000ffffb009
pudp @ 0xc0000000ffffb000 = 0x80000000ffffa009
pmdp @ 0xc0000000ffffa000 = 0xc00000000000018f <- this is a PTE
ptep @ 0xc000000000000100 = 0xa64bb17da64ab07d <- kernel text

The end result is we treat the value at 0xc000000000000100 as a PTE
and clear _PAGE_WRITE or _PAGE_EXEC, potentially corrupting the code
at that address.

In Joel's specific case we cleared the sign bit in the offset of the
branch, causing a backward branch to turn into a forward branch which
caused us to branch into a non-executable page. However the exact
nature of the crash depends on kernel version, compiler version, and
other factors.

We need to fix radix__change_memory_range() to not use accessors that
depend on HUGETLBFS, but we also have radix memory hotplug code that
uses pmd_huge() etc. that will also need fixing. So for now just
disallow the broken combination of Radix with HUGETLBFS disabled.

The only defconfig we have that is affected is skiroot_defconfig, so
turn on HUGETLBFS there so that it still gets Radix.

Fixes: 566ca99af026 ("powerpc/mm/radix: Add dummy radix_enabled()")
Cc: stable@vger.kernel.org # v4.7+
Reported-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

6y ago

Linus Torvalds

9e98c678

Linux 5.1-rc1 v5.1-rc1

6y ago

Linus Torvalds

25cce03b

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

6y ago

Shun-Chih Yu

5bb5c3a3

dmaengine: mediatek-cqdma: fix wrong register usage in mtk_cqdma_start

6y ago

Jason Gunthorpe

67f269b3

RDMA/ucontext: Fix regression with disassociate

6y ago

Stefan Bühler

0d7bae69

io_uring: fix race condition when sq threads goes sleeping

6y ago

Michael Ellerman

cf7cf697

powerpc/mm: Define MAX_PHYSMEM_BITS for all 64-bit configs

6y ago

Linus Torvalds

28d747f2

Merge tag 'kbuild-v5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

6y ago

Linus Torvalds

037904a2

Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

6y ago

Lucas Stach

3a349763

Input: synaptics-rmi4 - write config register values to the right offset

6y ago

Achim Dahlhoff

6e7da747

dmaengine: sh: rcar-dmac: Fix glitch in dmaengine_tx_status

6y ago

Jason Gunthorpe

d5e560d3

RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages

6y ago

Stefan Bühler

e523a29c

io_uring: fix race condition reading SQ entries

6y ago

Nicholas Piggin

7100e870

powerpc/64s/radix: Fix radix segment exception handling

6y ago

Linus Torvalds

80b98e92

Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

6y ago

Masahiro Yamada

c71bb9f8

kconfig: remove stale lxdialog/.gitignore

6y ago

Linus Torvalds

15d4e26b

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

6y ago

Qian Cai

0d02113b

x86/mm: Fix a crash with kmemleak_scan()

6y ago

Pan Bian

bce1a784

Input: synaptics-rmi4 - fix possible double free

6y ago

Dirk Behme

907bd68a

dmaengine: sh: rcar-dmac: With cyclic DMA residue 0 is valid

6y ago

Jason Gunthorpe

c660133c

RDMA/mlx5: Do not allow the user to write to the clock page

6y ago

Jens Axboe

35fa71a0

io_uring: fail io_uring_register(2) on a dying io_uring instance

6y ago

Christophe Leroy

dd9a994f

powerpc/vdso32: fix CLOCK_MONOTONIC on PPC64

6y ago

Linus Torvalds

69ebf9a1

Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

6y ago

Rasmus Villemoes

2e905c7a

x86/asm: Remove unused __constant_c_x_memset() macro and inlines

7y ago

Masahiro Yamada

037fc336

kbuild: force all architectures except um to include mandatory-y

6y ago

Linus Torvalds

50849916

Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

6y ago

Xie XiuQi

a860fa7b

sched/numa: Fix a possible divide-by-zero

6y ago

Borislav Petkov

36f0c423

x86/boot: Disable RSDP parsing temporarily

6y ago

Jacky Bai

f06eba72

Input: snvs_pwrkey - make it depend on ARCH_MXC

6y ago

Stefan Wahren

f1473847

dmaengine: bcm2835: Avoid GFP_KERNEL in device_prep_slave_sg

6y ago

Guy Levi

7249c8ea

IB/mlx5: Fix scatter to CQE in DCT QP creation

6y ago

Linus Torvalds

085b7755

Linux 5.1-rc6 v5.1-rc6

6y ago

Christophe Leroy

fd427103

powerpc/32: Fix early boot failure with RTAS built-in

6y ago

Linus Torvalds

c5b5138c

Merge tag 'for-linus-5.1b-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

6y ago

kbuild test robot

c634dc6b

perf/x86/intel: Make dev_attr_allow_tsx_force_abort static

6y ago

Rasmus Villemoes

88ca66d8

x86/asm: Remove dead __GNUC__ conditionals

7y ago

Masahiro Yamada

7cbbbb8b

kbuild: warn redundant generic-y

6y ago

Linus Torvalds

baf76f0c

slip: make slhc_free() silently accept an error pointer

6y ago

Harry Pan

82c99f7a

perf/x86/intel: Update KBL Package C-state events to also include PC8/PC9/PC10 counters

6y ago

Linus Torvalds

cd8dead0

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:
"Just the usual assortment of small'ish fixes:

1) Conntrack timeout is sometimes not initialized properly, from
Alexander Potapenko.

2) Add a reasonable range limit to tcp_min_rtt_wlen to avoid
undefined behavior. From ZhangXiaoxu.

3) des1 field of descriptor in stmmac driver is initialized with the
wrong variable. From Yue Haibing.

4) Increase mlxsw pci sw reset timeout a little bit more, from Ido
Schimmel.

5) Match IOT2000 stmmac devices more accurately, from Su Bao Cheng.

6) Fallback refcount fix in TLS code, from Jakub Kicinski.

7) Fix max MTU check when using XDP in mlx5, from Maxim Mikityanskiy.

8) Fix recursive locking in team driver, from Hangbin Liu.

9) Fix tls_set_device_offload_rx() deadlock, from Jakub Kicinski.

10) Don't use napi_alloc_frag() outside of softirq context in the
socionext driver, from Ilias Apalodimas.

11) MAC address increment overflow in ncsi, from Tao Ren.

12) Fix a regression in 8K/1M pool switching of RDS, from Zhu Yanjun.

13) ipv4_link_failure has to validate the headers that are actually
there because RAW sockets can pass in arbitrary garbage, from Eric
Dumazet"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
ipv4: add sanity checks in ipv4_link_failure()
net/rose: fix unbound loop in rose_loopback_timer()
rxrpc: fix race condition in rxrpc_input_packet()
net: rds: exchange of 8K and 1M pool
net: vrf: Fix operation not supported when set vrf mac
net/ncsi: handle overflow when incrementing mac address
net: socionext: replace napi_alloc_frag with the netdev variant on init
net: atheros: fix spelling mistake "underun" -> "underrun"
spi: ST ST95HF NFC: declare missing of table
spi: Micrel eth switch: declare missing of table
net: stmmac: move stmmac_check_ether_addr() to driver probe
netfilter: fix nf_l4proto_log_invalid to log invalid packets
netfilter: never get/set skb->tstamp
netfilter: ebtables: CONFIG_COMPAT: drop a bogus WARN_ON
Documentation: decnet: remove reference to CONFIG_DECNET_ROUTE_FWMARK
dt-bindings: add an explanation for internal phy-mode
net/tls: don't leak IV and record seq when offload fails
net/tls: avoid potential deadlock in tls_set_device_offload_rx()
selftests/net: correct the return value for run_afpackettests
team: fix possible recursive locking when add slaves
...