commits

The SGX driver maintains a single global free page counter,
sgx_nr_free_pages, that reflects the number of free pages available
across all NUMA nodes. Correspondingly, a list of free pages is
associated with each NUMA node and sgx_nr_free_pages is updated
every time a page is added or removed from any of the free page
lists. The main usage of sgx_nr_free_pages is by the reclaimer
that runs when it (sgx_nr_free_pages) goes below a watermark
to ensure that there are always some free pages available to, for
example, support efficient page faults.

With sgx_nr_free_pages accessed and modified from a few places
it is essential to ensure that these accesses are done safely but
this is not the case. sgx_nr_free_pages is read without any
protection and updated with inconsistent protection by any one
of the spin locks associated with the individual NUMA nodes.
For example:

CPU_A CPU_B
----- -----
spin_lock(&nodeA->lock); spin_lock(&nodeB->lock);
... ...
sgx_nr_free_pages--; /* NOT SAFE */ sgx_nr_free_pages--;

spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock);

Since sgx_nr_free_pages may be protected by different spin locks
while being modified from different CPUs, the following scenario
is possible:

CPU_A CPU_B
----- -----
{sgx_nr_free_pages = 100}
spin_lock(&nodeA->lock); spin_lock(&nodeB->lock);
sgx_nr_free_pages--; sgx_nr_free_pages--;
/* LOAD sgx_nr_free_pages = 100 */ /* LOAD sgx_nr_free_pages = 100 */
/* sgx_nr_free_pages-- */ /* sgx_nr_free_pages-- */
/* STORE sgx_nr_free_pages = 99 */ /* STORE sgx_nr_free_pages = 99 */
spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock);

In the above scenario, sgx_nr_free_pages is decremented from two CPUs
but instead of sgx_nr_free_pages ending with a value that is two less
than it started with, it was only decremented by one while the number
of free pages were actually reduced by two. The consequence of
sgx_nr_free_pages not being protected is that its value may not
accurately reflect the actual number of free pages on the system,
impacting the availability of free pages in support of many flows.

The problematic scenario is when the reclaimer does not run because it
believes there to be sufficient free pages while any attempt to allocate
a page fails because there are no free pages available. In the SGX driver
the reclaimer's watermark is only 32 pages so after encountering the
above example scenario 32 times a user space hang is possible when there
are no more free pages because of repeated page faults caused by no
free pages made available.

The following flow was encountered:
asm_exc_page_fault
...
sgx_vma_fault()
sgx_encl_load_page()
sgx_encl_eldu() // Encrypted page needs to be loaded from backing
// storage into newly allocated SGX memory page
sgx_alloc_epc_page() // Allocate a page of SGX memory
__sgx_alloc_epc_page() // Fails, no free SGX memory
...
if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer
wake_up(&ksgxd_waitq);
return -EBUSY; // Return -EBUSY giving reclaimer time to run
return -EBUSY;
return -EBUSY;
return VM_FAULT_NOPAGE;

The reclaimer is triggered in above flow with the following code:

static bool sgx_should_reclaim(unsigned long watermark)
{
return sgx_nr_free_pages < watermark &&
!list_empty(&sgx_active_page_list);
}

In the problematic scenario there were no free pages available yet the
value of sgx_nr_free_pages was above the watermark. The allocation of
SGX memory thus always failed because of a lack of free pages while no
free pages were made available because the reclaimer is never started
because of sgx_nr_free_pages' incorrect value. The consequence was that
user space kept encountering VM_FAULT_NOPAGE that caused the same
address to be accessed repeatedly with the same result.

Change the global free page counter to an atomic type that
ensures simultaneous updates are done safely. While doing so, move
the updating of the variable outside of the spin lock critical
section to which it does not belong.

Cc: stable@vger.kernel.org
Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/a95a40743bbd3f795b465f30922dde7f1ea9e0eb.1637004094.git.reinette.chatre@intel.com

4y ago

Linus Torvalds

75603b14

Merge tag 'powerpc-5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

4y ago

Song Liu

f3fd84a3

x86/perf: Fix snapshot_branch_stack warning in VM

4y ago

Borislav Petkov

8d48bf82

x86/boot: Pull up cmdline preparation and early param parsing

4y ago

Geert Uytterhoeven

61eb495c

pstore/blk: Use "%lu" to format unsigned long

4y ago

Cédric Le Goater

8e80a73f

powerpc/xive: Change IRQ domain to a tree domain

4y ago

Alexander Antonov

bdc0feee

perf/x86/intel/uncore: Fix IIO event constraints for Snowridge

4y ago

Linus Torvalds

fa55b7dc

Linux 5.16-rc1 v5.16-rc1

4y ago

Linus Torvalds

923dcc5e

Merge branch 'akpm' (patches from Andrew)

4y ago

Christophe Leroy

1e35eba4

powerpc/8xx: Fix pinned TLBs with CONFIG_STRICT_KERNEL_RWX

4y ago

Alexander Antonov

3866ae31

perf/x86/intel/uncore: Fix IIO event constraints for Skylake Server

4y ago

Gustavo A. R. Silva

dee2b702

kconfig: Add support for -Wimplicit-fallthrough

4y ago

Linus Torvalds

61564e7b

Merge tag 'block-5.16-2021-11-19' of git://git.kernel.dk/linux-block

4y ago

David Hildenbrand

c1e63117

proc/vmcore: fix clearing user buffer by properly using clear_user()

To clear a user buffer we cannot simply use memset, we have to use
clear_user(). With a virtio-mem device that registers a vmcore_cb and
has some logically unplugged memory inside an added Linux memory block,
I can easily trigger a BUG by copying the vmcore via "cp":

systemd[1]: Starting Kdump Vmcore Save Service...
kdump[420]: Kdump is using the default log level(3).
kdump[453]: saving to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
kdump[458]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
kdump[465]: saving vmcore-dmesg.txt complete
kdump[467]: saving vmcore
BUG: unable to handle page fault for address: 00007f2374e01000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0003) - permissions violation
PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 800000007048f867
Oops: 0003 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86
Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81
RSP: 0018:ffffc9000073be08 EFLAGS: 00010212
RAX: 0000000000001000 RBX: 00000000002fd000 RCX: 00007f2374e01000
RDX: 0000000000000001 RSI: 00000000ffffdfff RDI: 00007f2374e01008
RBP: 0000000000001000 R08: 0000000000000000 R09: ffffc9000073bc50
R10: ffffc9000073bc48 R11: ffffffff829461a8 R12: 000000000000f000
R13: 00007f2374e01000 R14: 0000000000000000 R15: ffff88807bd421e8
FS: 00007f2374e12140(0000) GS:ffff88807f000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2374e01000 CR3: 000000007a4aa000 CR4: 0000000000350eb0
Call Trace:
read_vmcore+0x236/0x2c0
proc_reg_read+0x55/0xa0
vfs_read+0x95/0x190
ksys_read+0x4f/0xc0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

Some x86-64 CPUs have a CPU feature called "Supervisor Mode Access
Prevention (SMAP)", which is used to detect wrong access from the kernel
to user buffers like this: SMAP triggers a permissions violation on
wrong access. In the x86-64 variant of clear_user(), SMAP is properly
handled via clac()+stac().

To fix, properly use clear_user() when we're dealing with a user buffer.

Link: https://lkml.kernel.org/r/20211112092750.6921-1-david@redhat.com
Fixes: 997c136f518c ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4y ago

Christophe Leroy

5499802b

powerpc/signal32: Fix sigset_t copy

4y ago

Alexander Antonov

e324234e

perf/x86/intel/uncore: Fix filter_tid mask for CHA events on Skylake Server

4y ago

Linus Torvalds

ce49bfc8

Merge tag 'xfs-5.16-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

4y ago

Linus Torvalds

b100274c

Merge tag 'pinctrl-v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

4y ago

Ming Lei

2b504bd4

blk-mq: don't insert FUA request with data into scheduler queue

4y ago

Ard Biesheuvel

825c43f5

kmap_local: don't assume kmap PTEs are linear arrays in memory

4y ago

Christophe Leroy

5b548609

powerpc/book3e: Fix TLBCAM preset at boot

4y ago

Linus Torvalds

c3b68c27

Merge tag 'for-5.16/parisc-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux

4y ago

Darrick J. Wong

4a6b35b3

xfs: sync xfs_btree_split macros with userspace libxfs

4y ago

Linus Torvalds

6b38e2fb

Merge tag 's390-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

4y ago

Bjorn Andersson

62209e80

pinctrl: qcom: sm8350: Correct UFS and SDC offsets

4y ago

Yu Kuai

15c30104

blk-cgroup: fix missing put device in error path from blkg_conf_pref()

4y ago

SeongJae Park

d78f3853

mm/damon/dbgfs: fix missed use of damon_dbgfs_lock

4y ago

Alexey Kardashevskiy

ad397602

powerpc/pseries/ddw: Do not try direct mapping with persistent memory and one window

4y ago

Linus Torvalds

24318ae8

Merge tag 'sh-for-5.16' of git://git.libc.org/linux-sh

4y ago

Sven Schnelle

3ec18fc7

parisc/entry: fix trace test in syscall exit path

4y ago

Eric Sandeen

29f11fce

xfs: #ifdef out perag code for userspace

4y ago

Linus Torvalds

b38bfc74

Merge tag '5.16-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

4y ago

Heiko Carstens

890e3dc8

ftrace/samples: add s390 support for ftrace direct multi sample

4y ago

Arnd Bergmann

293083f8

pinctrl: tegra194: remove duplicate initializer again

An earlier bugfix removed a duplicate field initializer in
a macro, but it seems that this came back with the following
update:

drivers/pinctrl/tegra/pinctrl-tegra194.c:1341:28: error: initialized field overwritten [-Werror=override-init]
1341 | .drv_reg = ((r)), \
| ^
drivers/pinctrl/tegra/pinctrl-tegra194.c:1392:41: note: in expansion of macro 'DRV_PINGROUP_ENTRY_Y'
1392 | #define drive_touch_clk_pcc4 DRV_PINGROUP_ENTRY_Y(0x2004, 12, 5, 20, 5, -1, -1, -1, -1, 1)
| ^~~~~~~~~~~~~~~~~~~~
drivers/pinctrl/tegra/pinctrl-tegra194.c:1631:17: note: in expansion of macro 'drive_touch_clk_pcc4'
1631 | drive_##pg_name, \
| ^~~~~~
drivers/pinctrl/tegra/pinctrl-tegra194.c:1636:9: note: in expansion of macro 'PINGROUP'
1636 | PINGROUP(touch_clk_pcc4, GP, TOUCH, RSVD2, RSVD3, 0x2000, 1, Y, -1, -1, 6, 8, -1, 10, 11, 12, N, -1, -1, N, "vddio_ao"),
| ^~~~~~~~
drivers/pinctrl/tegra/pinctrl-tegra194.c:1341:28: note: (near initialization for 'tegra194_groups[0].drv_reg')
1341 | .drv_reg = ((r)), \
| ^
drivers/pinctrl/tegra/pinctrl-tegra194.c:1392:41: note: in expansion of macro 'DRV_PINGROUP_ENTRY_Y'
1392 | #define drive_touch_clk_pcc4 DRV_PINGROUP_ENTRY_Y(0x2004, 12, 5, 20, 5, -1, -1, -1, -1, 1)
| ^~~~~~~~~~~~~~~~~~~~
drivers/pinctrl/tegra/pinctrl-tegra194.c:1631:17: note: in expansion of macro 'drive_touch_clk_pcc4'
1631 | drive_##pg_name, \
| ^~~~~~
drivers/pinctrl/tegra/pinctrl-tegra194.c:1636:9: note: in expansion of macro 'PINGROUP'
1636 | PINGROUP(touch_clk_pcc4, GP, TOUCH, RSVD2, RSVD3, 0x2000, 1, Y, -1, -1, 6, 8, -1, 10, 11, 12, N, -1, -1, N, "vddio_ao"),
| ^~~~~~~~

Remove it again.

Fixes: 613c0826081b ("pinctrl: tegra: Add pinmux support for Tegra194")
Fixes: 92cadf68e50a ("pinctrl: tegra: pinctrl-tegra194: Do not initialise field twice")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20211104133645.1186968-1-arnd@kernel.org
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

4y ago

Ming Lei

245a489e

block: avoid to quiesce queue in elevator_init_mq

4y ago

SeongJae Park

db7a347b

mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation

4y ago

Alexey Kardashevskiy

fb4ee2b3

powerpc/pseries/ddw: simplify enable_ddw()

4y ago

Linus Torvalds

6ea45c57

Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm

4y ago

Geert Uytterhoeven

8518e694

sh: pgtable-3level: Fix cast to pointer from integer of different size

4y ago

John David Anglin

38860b2c

parisc: Flush kernel data mapping in set_pte_at() when installing pte for user page

4y ago

Yang Guang

5b068aad

xfs: use swap() to make dabtree code cleaner

4y ago

Linus Torvalds

a90af8f1

Merge tag 'libata-5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata

4y ago

Paulo Alcantara

8ae87bbe

cifs: introduce cifs_ses_mark_for_reconnect() helper

4y ago

Heiko Carstens

503e4510

ftrace/samples: add missing Kconfig option for ftrace direct multi sample

4y ago

Jonathan Corbet

a3143f78

Remove unused header <linux/sdb.h>

4y ago

Kees Cook

d1faacbf

Revert "mark pstore-blk as broken"

4y ago

Kees Cook

cab71f74

kasan: test: silence intentional read overflow warnings

As done in commit d73dad4eb5ad ("kasan: test: bypass __alloc_size
checks") for __write_overflow warnings, also silence some more cases
that trip the __read_overflow warnings seen in 5.16-rc1[1]:

In file included from include/linux/string.h:253,
from include/linux/bitmap.h:10,
from include/linux/cpumask.h:12,
from include/linux/mm_types_task.h:14,
from include/linux/mm_types.h:5,
from include/linux/page-flags.h:13,
from arch/arm64/include/asm/mte.h:14,
from arch/arm64/include/asm/pgtable.h:12,
from include/linux/pgtable.h:6,
from include/linux/kasan.h:29,
from lib/test_kasan.c:10:
In function 'memcmp',
inlined from 'kasan_memcmp' at lib/test_kasan.c:897:2:
include/linux/fortify-string.h:263:25: error: call to '__read_overflow' declared with attribute error: detected read beyond size of object (1st parameter)
263 | __read_overflow();
| ^~~~~~~~~~~~~~~~~
In function 'memchr',
inlined from 'kasan_memchr' at lib/test_kasan.c:872:2:
include/linux/fortify-string.h:277:17: error: call to '__read_overflow' declared with attribute error: detected read beyond size of object (1st parameter)
277 | __read_overflow();
| ^~~~~~~~~~~~~~~~~

[1] http://kisskb.ellerman.id.au/kisskb/buildresult/14660585/log/

Link: https://lkml.kernel.org/r/20211116004111.3171781-1-keescook@chromium.org
Fixes: d73dad4eb5ad ("kasan: test: bypass __alloc_size checks")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>