commits

So technically there is nothing wrong with adding a pinned page to the
swap cache, but the pinning obviously means that the page can't actually
be freed right now anyway, so it's a bit pointless.

However, the real problem is not with it being a bit pointless: the real
issue is that after we've added it to the swap cache, we'll try to unmap
the page. That will succeed, because the code in mm/rmap.c doesn't know
or care about pinned pages.

Even the unmapping isn't fatal per se, since the page will stay around
in memory due to the pinning, and we do hold the connection to it using
the swap cache. But when we then touch it next and take a page fault,
the logic in do_swap_page() will map it back into the process as a
possibly read-only page, and we'll then break the page association on
the next COW fault.

Honestly, this issue could have been fixed in any of those other places:
(a) we could refuse to unmap a pinned page (which makes conceptual
sense), or (b) we could make sure to re-map a pinned page writably in
do_swap_page(), or (c) we could just make do_wp_page() not COW the
pinned page (which was what we historically did before that "mm:
do_wp_page() simplification" commit).

But while all of them are equally valid models for breaking this chain,
not putting pinned pages into the swap cache in the first place is the
simplest one by far.

It's also the safest one: the reason why do_wp_page() was changed in the
first place was that getting the "can I re-use this page" wrong is so
fraught with errors. If you do it wrong, you end up with an incorrectly
shared page.

As a result, using "page_maybe_dma_pinned()" in either do_wp_page() or
do_swap_page() would be a serious bug since it is only a (very good)
heuristic. Re-using the page requires a hard black-and-white rule with
no room for ambiguity.

In contrast, saying "this page is very likely dma pinned, so let's not
add it to the swap cache and try to unmap it" is an obviously safe thing
to do, and even if the heuristic very rarely yields a false positive, no
harm is done.
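
The asymmetry described above can be sketched with a toy model. This is
not kernel code: the Page class, the PIN_BIAS threshold and try_reclaim()
are hypothetical names chosen for illustration, and the refcount test only
loosely imitates what a "maybe pinned" heuristic might look like.

```python
from dataclasses import dataclass

# Hypothetical threshold for this sketch: a refcount at or above it is
# treated as "probably pinned". The real kernel check is not shown here.
PIN_BIAS = 1 << 10


@dataclass
class Page:
    refcount: int
    in_swap_cache: bool = False
    mapped: bool = True


def maybe_dma_pinned(page: Page) -> bool:
    """Heuristic only: a heavily shared but unpinned page also matches."""
    return page.refcount >= PIN_BIAS


def try_reclaim(page: Page) -> bool:
    """Toy reclaim step; returns True if the page was queued for swap-out."""
    if maybe_dma_pinned(page):
        # Worst case (false positive): the page simply stays in memory.
        return False
    page.in_swap_cache = True
    page.mapped = False
    return True


ordinary = Page(refcount=2)
probably_pinned = Page(refcount=PIN_BIAS + 1)
print(try_reclaim(ordinary), try_reclaim(probably_pinned))
```

In this model a false positive only costs a missed reclaim opportunity,
whereas using the same heuristic to decide page re-use would need it to be
exact.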

Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
Reported-and-tested-by: Martin Raiber <martin@urbackup.org>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5y ago

Al Viro

d36a1dd9

dump_common_audit_data(): fix racy accesses to ->d_name

5y ago

Ariel Marcovitch

2225a8dd

powerpc: Fix alignment bug within the init sections

5y ago

Namhyung Kim

a1bf2305

perf stat: Take cgroups into account for shadow stats

Currently perf stat doesn't consider cgroups when collecting shadow stats
and metrics, so counter values from different cgroups are saved in the
same slot. This results in incorrect numbers when those cgroups run
different workloads.

For example, let's look at the scenario below: cgroups A and C run the
same workload, which burns a CPU, while cgroup B runs a light workload.

$ perf stat -a -e cycles,instructions --for-each-cgroup A,B,C sleep 1

 Performance counter stats for 'system wide':

     3,958,116,522      cycles          A
     6,722,650,929      instructions    A    #    2.53  insn per cycle
         1,132,741      cycles          B
           571,743      instructions    B    #    0.00  insn per cycle
     4,007,799,935      cycles          C
     6,793,181,523      instructions    C    #    2.56  insn per cycle

       1.001050869 seconds time elapsed

When I run 'perf stat' with a single workload, it usually shows an IPC of
around 1.7. We can verify that for cgroup A
(6,722,650,929 / 3,958,116,522 = 1.698).

But in this case, since cgroups are ignored, the cycles are averaged, so
the lower averaged value was used for the IPC calculation, which resulted
in around 2.5.

avg cycle: (3958116522 + 1132741 + 4007799935) / 3 = 2655683066
IPC (A) : 6722650929 / 2655683066 = 2.531
IPC (B) : 571743 / 2655683066 = 0.0002
IPC (C) : 6793181523 / 2655683066 = 2.557
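
The arithmetic above can be reproduced with a short script. It is only a
sketch of the calculation, not of perf's implementation; the function names
are made up for this example, and the counter values are copied from the
first 'perf stat' output.

```python
# Counter values copied from the first 'perf stat' output above.
counters = {
    "A": {"cycles": 3_958_116_522, "instructions": 6_722_650_929},
    "B": {"cycles": 1_132_741, "instructions": 571_743},
    "C": {"cycles": 4_007_799_935, "instructions": 6_793_181_523},
}


def ipc_shared_slot(counters):
    """IPC when every cgroup's cycles share one slot and get averaged."""
    avg_cycles = sum(c["cycles"] for c in counters.values()) / len(counters)
    return {name: c["instructions"] / avg_cycles for name, c in counters.items()}


def ipc_per_cgroup(counters):
    """IPC when each cgroup's instructions are divided by its own cycles."""
    return {name: c["instructions"] / c["cycles"] for name, c in counters.items()}


for name in counters:
    shared = ipc_shared_slot(counters)[name]
    own = ipc_per_cgroup(counters)[name]
    print(f"{name}: shared-slot IPC {shared:.4f}, per-cgroup IPC {own:.4f}")
```

The per-cgroup figure for A comes out near the 1.7 seen with a single
workload, while the shared-slot figure is the inflated value around 2.5.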

We can simply compare cgroup pointers in the evsel and it'll be NULL
when cgroups are not specified. With this patch, I can see correct
numbers like below:

$ perf stat -a -e cycles,instructions --for-each-cgroup A,B,C sleep 1

 Performance counter stats for 'system wide':

     4,171,051,687      cycles          A
     7,219,793,922      instructions    A    #    1.73  insn per cycle
         1,051,189      cycles          B
           583,102      instructions    B    #    0.55  insn per cycle
     4,171,124,710      cycles          C
     7,192,944,580      instructions    C    #    1.72  insn per cycle

       1.007909814 seconds time elapsed

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20210115071139.257042-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

5y ago

Linus Torvalds

0da0a8a0

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

5y ago

Christoph Hellwig

a959a978

iov_iter: fix the uaccess area in copy_compat_iovec_from_user

5y ago

Nathan Chancellor

3ce47d95

powerpc: Handle .text.{hot,unlikely}.* in linker script

5y ago

Namhyung Kim

3ff1e718

perf stat: Introduce struct runtime_stat_data

5y ago

Linus Torvalds

54c6247d

Merge tag 'block-5.11-2021-01-16' of git://git.kernel.dk/linux-block

5y ago

Lukas Bulwahn

be255335

scsi: sd: Remove obsolete variable in sd_remove()

5y ago

Al Viro

a0a6df9a

umount(2): move the flag validity checks first

5y ago

Christophe Leroy

98bf2d3f

powerpc/32s: Fix RTAS machine check with VMAP stack

5y ago

Ian Rogers

66dd86b2

libperf tests: Fail when failing to get a tracepoint id

5y ago

Linus Torvalds

11c0239a

Merge tag 'io_uring-5.11-2021-01-16' of git://git.kernel.dk/linux-block

5y ago

Jens Axboe

b4f66425

Merge tag 'nvme-5.11-2021-01-14' of git://git.infradead.org/nvme into block-5.11

5y ago

Ewan D. Milne

e5cc9002

scsi: sd: Suppress spurious errors when WRITE SAME is being disabled

5y ago

Linus Torvalds

5c8fe583

Linux 5.11-rc1 v5.11-rc1

5y ago

Linus Torvalds

e71ba945

Linux 5.11-rc2 v5.11-rc2

5y ago

Ian Rogers

bba2ea17

libperf tests: If a test fails return non-zero

5y ago

Linus Torvalds

acda701b

Merge tag 'riscv-for-linus-5.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:
"There are a few more fixes than a normal rc4, largely due to the
bubble introduced by the holiday break:

- return -ENOSYS for syscall number -1, which previously returned an
uninitialized value.

- ensure of_clk_init() has been called in time_init(), without which
clock drivers may not be initialized.

- fix sifive,uart0 driver to properly display the baud rate. A fix to
initialize MPIE that allows interrupts to be processed during
system calls.

- avoid erroneously beginning to trace IRQs when interrupts are disabled,
which at least triggers spurious lockdep failures.

- workaround for a warning related to calling smp_processor_id()
while preemptible. The warning itself is spurious on currently
available systems.

- properly include the generic time VDSO calls. A fix to our kasan
address mapping. A fix to the HiFive Unleashed device tree, which
allows the Ethernet PHY to be properly initialized by Linux (as
opposed to relying on the bootloader).

- defconfig update to include SiFive's GPIO driver, which is present
on the HiFive Unleashed and necessary to initialize the PHY.

- avoid allocating memory while initializing reserved memory.

- avoid allocating the last 4K of memory, as pointers there alias
with syscall errors.

There are also two cleanups that should have no functional effect but
do fix build warnings:

- drop a duplicated definition of PAGE_KERNEL_EXEC.

- properly declare the asm register SP shim.

- cleanup the rv32 memory size Kconfig entry, to reflect the actual
size of memory available"

* tag 'riscv-for-linus-5.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
RISC-V: Fix maximum allowed phsyical memory for RV32
RISC-V: Set current memblock limit
RISC-V: Do not allocate memblock while iterating reserved memblocks
riscv: stacktrace: Move register keyword to beginning of declaration
riscv: defconfig: enable gpio support for HiFive Unleashed
dts: phy: add GPIO number and active state used for phy reset
dts: phy: fix missing mdio device and probe failure of vsc8541-01 device
riscv: Fix KASAN memory mapping.
riscv: Fixup CONFIG_GENERIC_TIME_VSYSCALL
riscv: cacheinfo: Fix using smp_processor_id() in preemptible
riscv: Trace irq on only interrupt is enabled
riscv: Drop a duplicated PAGE_KERNEL_EXEC
riscv: Enable interrupts during syscalls with M-Mode
riscv: Fix sifive serial driver
riscv: Fix kernel time_init()
riscv: return -ENOSYS for syscall -1

5y ago

Jens Axboe

a8d13dbc

io_uring: ensure finish_wait() is always called in __io_uring_task_cancel()

5y ago

Coly Li

5342fd42

bcache: set bcache device into read-only mode for BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET

5y ago

Sagi Grimberg

5ab25a32

nvme: don't intialize hwmon for discovery controllers

5y ago

Dinghao Liu

3b01d7ea

scsi: scsi_debug: Fix memleak in scsi_debug_init()

5y ago

Linus Torvalds

14e3e989

proc mountinfo: make splice available again

5y ago

Linus Torvalds

3516bd72

Merge tag 's390-5.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

5y ago

Ian Rogers

be82fddc

libperf tests: Avoid uninitialized variable warning

5y ago

Linus Torvalds

9348b73c

mm: don't play games with pinned pages in clear_page_refs

5y ago

Atish Patra

e5577937

RISC-V: Fix maximum allowed phsyical memory for RV32

5y ago

Marcelo Diop-Gonzalez

f010505b

io_uring: flush timeouts that should already have expired

5y ago

Coly Li

b16671e8

bcache: introduce BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE for large bucket

When the large bucket feature was added, BCH_FEATURE_INCOMPAT_LARGE_BUCKET
was introduced into the incompat feature set. It used bucket_size_hi
(which was added at the tail of struct cache_sb_disk) to extend the
current 16-bit bucket size to 32 bits, together with the existing
bucket_size in struct cache_sb_disk.

This is not a good idea; there are two obvious problems:
- The bucket size is always a power of 2. If log2(bucket size) is stored
in the existing bucket_size of struct cache_sb_disk, it is unnecessary to
add bucket_size_hi.
- The macro csum_set() assumes d[SB_JOURNAL_BUCKETS] is the last member in
struct cache_sb_disk. bucket_size_hi was added after d[], which makes
csum_set() calculate an unexpected super block checksum.

To fix the above problems, this patch introduces a new incompat feature
bit, BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE. When this bit is set,
bucket_size in struct cache_sb_disk stores the order of the power-of-2
bucket size value. When the user specifies a bucket size larger than 32768
sectors, BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE is set in the
incompat feature set, and bucket_size stores log2(bucket size) rather
than the real bucket size value.
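
A minimal sketch of that encoding follows. It assumes a 16-bit bucket_size
field as described above; the flag value, constant names and both helper
functions are hypothetical and are not taken from the bcache sources.

```python
from typing import Tuple

# Hypothetical bit value standing in for
# BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE in this sketch.
LOG_LARGE_BUCKET_FLAG = 0x1
# Largest bucket size (in sectors) still stored directly, per the text above.
MAX_DIRECT_SECTORS = 32768


def encode_bucket_size(sectors: int) -> Tuple[int, int]:
    """Return (bucket_size field, incompat feature bits) for a bucket size."""
    if sectors <= 0 or sectors & (sectors - 1):
        raise ValueError("bucket size must be a power of 2")
    if sectors > MAX_DIRECT_SECTORS:
        # Store the order (log2) instead of the value itself.
        return sectors.bit_length() - 1, LOG_LARGE_BUCKET_FLAG
    return sectors, 0


def decode_bucket_size(field: int, features: int) -> int:
    """Recover the bucket size in sectors from the stored field."""
    if features & LOG_LARGE_BUCKET_FLAG:
        return 1 << field
    return field


field, features = encode_bucket_size(1 << 20)
print(field, features, decode_bucket_size(field, features))
```

Storing the order keeps even very large bucket sizes within the existing
field, so no extra member after d[] is needed.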

The obsolete BCH_FEATURE_INCOMPAT_LARGE_BUCKET won't be used anymore.
It is renamed to BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET and is still
recognized only by the kernel driver, for legacy compatibility. The
previous bucket_size_hi is renamed to obso_bucket_size_hi in struct
cache_sb_disk and is not used in bcache-tools anymore.

For cache devices created with the BCH_FEATURE_INCOMPAT_LARGE_BUCKET
feature, bcache-tools and the kernel driver still recognize the feature
string and display it as "obso_large_bucket".

With this change, the unnecessary extension of the bcache on-disk super
block is avoided, and csum_set() generates the expected checksum as
well.

Fixes: ffa470327572 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5y ago

Sagi Grimberg

ca1ff67d

nvme-tcp: fix possible data corruption with bio merges

5y ago

Colin Ian King

39718fe7

scsi: mpt3sas: Fix spelling mistake in Kconfig "compatiblity" -> "compatibility"

5y ago

Linus Torvalds

52cd5f9c

Merge tag 'ntb-5.11' of git://github.com/jonmason/ntb

5y ago

Linus Torvalds

d9296a7b

Merge tag 'pm-5.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

5y ago

Heiko Carstens

129975e7

s390/Kconfig: sort config S390 select list once again

5y ago

Namhyung Kim

a042a82d

perf test: Fix shadow stat test for non-bash shells

5y ago

Linus Torvalds

29a951df

mm: fix clear_refs_write locking

5y ago

Atish Patra

abb8e86b

RISC-V: Set current memblock limit

5y ago

Pavel Begunkov

06585c49

io_uring: do sqo disable on install_fd error

5y ago

Coly Li

1dfc0686

bcache: check unsupported feature sets for bcache register

5y ago

Sagi Grimberg

ada83177

nvme-tcp: Fix warning with CONFIG_DEBUG_PREEMPT

5y ago

Nilesh Javali

d50c7986

scsi: qedi: Correct max length of CHAP secret

5y ago

Linus Torvalds

33c148a4

Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

5y ago

Dave Jiang

75b6f648

ntb: intel: add Intel NTB LTR vendor support for gen4 NTB

5y ago

Linus Torvalds

eda809ae

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"This is a load of driver fixes (12 ufs, 1 mpt3sas, 1 cxgbi).

The two big core fixes are for power management ("block: Do not accept
any requests while suspended" and "block: Fix a race in the runtime
power management code") which finally sort out the resume problems
we've occasionally been having.

To make the resume fix, there are seven necessary precursors which
effectively rename REQ_PREEMPT to REQ_PM, so every "special" request
in block is automatically a power management exempt one.

All of the non-PM preempt cases are removed except for the one in the
SCSI Parallel Interface (spi) domain validation which is a genuine
case where we have to run requests at high priority to validate the
bus so this becomes an autopm get/put protected request"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (22 commits)
scsi: cxgb4i: Fix TLS dependency
scsi: ufs: Un-inline ufshcd_vops_device_reset function
scsi: ufs: Re-enable WriteBooster after device reset
scsi: ufs-mediatek: Use correct path to fix compile error
scsi: mpt3sas: Signedness bug in _base_get_diag_triggers()
scsi: block: Do not accept any requests while suspended
scsi: block: Remove RQF_PREEMPT and BLK_MQ_REQ_PREEMPT
scsi: core: Only process PM requests if rpm_status != RPM_ACTIVE
scsi: scsi_transport_spi: Set RQF_PM for domain validation commands
scsi: ide: Mark power management requests with RQF_PM instead of RQF_PREEMPT
scsi: ide: Do not set the RQF_PREEMPT flag for sense requests
scsi: block: Introduce BLK_MQ_REQ_PM
scsi: block: Fix a race in the runtime power management code
scsi: ufs-pci: Enable UFSHCD_CAP_RPM_AUTOSUSPEND for Intel controllers
scsi: ufs-pci: Fix recovery from hibernate exit errors for Intel controllers
scsi: ufs-pci: Ensure UFS device is in PowerDown mode for suspend-to-disk ->poweroff()
scsi: ufs-pci: Fix restore from S4 for Intel controllers
scsi: ufs-mediatek: Keep VCC always-on for specific devices
scsi: ufs: Allow regulators being always-on
scsi: ufs: Clear UAC for RPMB after ufshcd resets
...