Linux kernel
============
There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.
In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``. The formatted documentation can also be read online at:
https://www.kernel.org/doc/html/latest/
There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.
Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.
code
Clone this repository
https://tangled.org/tjh.dev/kernel
git@gordian.tjh.dev:tjh.dev/kernel
For self-hosted knots, clone URLs may differ based on your setup.
Yes, staying withing 80 columns is certainly still _preferred_. But
it's not the hard limit that the checkpatch warnings imply, and other
concerns can most certainly dominate.
Increase the default limit to 100 characters. Not because 100
characters is some hard limit either, but that's certainly a "what are
you doing" kind of value and less likely to be about the occasional
slightly longer lines.
Miscellanea:
- to avoid unnecessary whitespace changes in files, checkpatch will no
longer emit a warning about line length when scanning files unless
--strict is also used
- Add a bit to coding-style about alignment to open parenthesis
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 fixes from Thomas Gleixner:
"A pile of x86 fixes:
- Prevent a memory leak in ioperm which was caused by the stupid
assumption that the exit cleanup is always called for current,
which is not the case when fork fails after taking a reference on
the ioperm bitmap.
- Fix an arithmething overflow in the DMA code on 32bit systems
- Fill gaps in the xstate copy with defaults instead of leaving them
uninitialized
- Revert: "Make __X32_SYSCALL_BIT be unsigned long" as it turned out
that existing user space fails to build"
* tag 'x86-urgent-2020-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ioperm: Prevent a memory leak when fork fails
x86/dma: Fix max PFN arithmetic overflow on 32 bit systems
copy_xstate_to_kernel(): don't leave parts of destination uninitialized
x86/syscalls: Revert "x86/syscalls: Make __X32_SYSCALL_BIT be unsigned long"
Pull scheduler fix from Thomas Gleixner:
"A single scheduler fix preventing a crash in NUMA balancing.
The current->mm check is not reliable as the mm might be temporary due
to use_mm() in a kthread. Check for PF_KTHREAD explictly"
* tag 'sched-urgent-2020-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Don't NUMA balance for kthreads
Pick up FPU register dump fixes from Al Viro.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull networking fixes from David Miller:
"Another week, another set of bug fixes:
1) Fix pskb_pull length in __xfrm_transport_prep(), from Xin Long.
2) Fix double xfrm_state put in esp{4,6}_gro_receive(), also from Xin
Long.
3) Re-arm discovery timer properly in mac80211 mesh code, from Linus
Lüssing.
4) Prevent buffer overflows in nf_conntrack_pptp debug code, from
Pablo Neira Ayuso.
5) Fix race in ktls code between tls_sw_recvmsg() and
tls_decrypt_done(), from Vinay Kumar Yadav.
6) Fix crashes on TCP fallback in MPTCP code, from Paolo Abeni.
7) More validation is necessary of untrusted GSO packets coming from
virtualization devices, from Willem de Bruijn.
8) Fix endianness of bnxt_en firmware message length accesses, from
Edwin Peer.
9) Fix infinite loop in sch_fq_pie, from Davide Caratti.
10) Fix lockdep splat in DSA by setting lockless TX in netdev features
for slave ports, from Vladimir Oltean.
11) Fix suspend/resume crashes in mlx5, from Mark Bloch.
12) Fix use after free in bpf fmod_ret, from Alexei Starovoitov.
13) ARP retransmit timer guard uses wrong offset, from Hongbin Liu.
14) Fix leak in inetdev_init(), from Yang Yingliang.
15) Don't try to use inet hash and unhash in l2tp code, results in
crashes. From Eric Dumazet"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (77 commits)
l2tp: add sk_family checks to l2tp_validate_socket
l2tp: do not use inet_hash()/inet_unhash()
net: qrtr: Allocate workqueue before kernel_bind
mptcp: remove msk from the token container at destruction time.
mptcp: fix race between MP_JOIN and close
mptcp: fix unblocking connect()
net/sched: act_ct: add nat mangle action only for NAT-conntrack
devinet: fix memleak in inetdev_init()
virtio_vsock: Fix race condition in virtio_transport_recv_pkt
drivers/net/ibmvnic: Update VNIC protocol version reporting
NFC: st21nfca: add missed kfree_skb() in an error path
neigh: fix ARP retransmit timer guard
bpf, selftests: Add a verifier test for assigning 32bit reg states to 64bit ones
bpf, selftests: Verifier bounds tests need to be updated
bpf: Fix a verifier issue when assigning 32bit reg states to 64bit ones
bpf: Fix use-after-free in fmod_ret check
net/mlx5e: replace EINVAL in mlx5e_flower_parse_meta()
net/mlx5e: Fix MLX5_TC_CT dependencies
net/mlx5e: Properly set default values when disabling adaptive moderation
net/mlx5e: Fix arch depending casting issue in FEC
...
Stefano reported a crash with using SQPOLL with io_uring:
BUG: kernel NULL pointer dereference, address: 00000000000003b0
CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11
RIP: 0010:task_numa_work+0x4f/0x2c0
Call Trace:
task_work_run+0x68/0xa0
io_sq_thread+0x252/0x3d0
kthread+0xf9/0x130
ret_from_fork+0x35/0x40
which is task_numa_work() oopsing on current->mm being NULL.
The task work is queued by task_tick_numa(), which checks if current->mm is
NULL at the time of the call. But this state isn't necessarily persistent,
if the kthread is using use_mm() to temporarily adopt the mm of a task.
Change the task_tick_numa() check to exclude kernel threads in general,
as it doesn't make sense to attempt ot balance for kthreads anyway.
Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/865de121-8190-5d30-ece5-3b097dc74431@kernel.dk
In the copy_process() routine called by _do_fork(), failure to allocate
a PID (or further along in the function) will trigger an invocation to
exit_thread(). This is done to clean up from an earlier call to
copy_thread_tls(). Naturally, the child task is passed into exit_thread(),
however during the process, io_bitmap_exit() nullifies the parent's
io_bitmap rather than the child's.
As copy_thread_tls() has been called ahead of the failure, the reference
count on the calling thread's io_bitmap is incremented as we would expect.
However, io_bitmap_exit() doesn't accept any arguments, and thus assumes
it should trash the current thread's io_bitmap reference rather than the
child's. This is pretty sneaky in practice, because in all instances but
this one, exit_thread() is called with respect to the current task and
everything works out.
A determined attacker can issue an appropriate ioctl (i.e. KDENABIO) to
get a bitmap allocated, and force a clone3() syscall to fail by passing
in a zeroed clone_args structure. The kernel handles the erroneous struct
and the buggy code path is followed, and even though the parent's reference
to the io_bitmap is trashed, the child still holds a reference and thus
the structure will never be freed.
Fix this by tweaking io_bitmap_exit() and its subroutines to accept a
task_struct argument which to operate on.
Fixes: ea5f1cd7ab49 ("x86/ioperm: Remove bitmap if all permissions dropped")
Signed-off-by: Jay Lang <jaytlang@mit.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable#@vger.kernel.org
Link: https://lkml.kernel.org/r/20200524162742.253727-1-jaytlang@mit.edu
copy the corresponding pieces of init_fpstate into the gaps instead.
Cc: stable@kernel.org
Tested-by: Alexander Potapenko <glider@google.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull powerpc fixes from Michael Ellerman:
- a fix for the recent change to how we restore non-volatile GPRs,
which broke our emulation of reading from the DSCR (Data Stream
Control Register).
- a fix for the recent rewrite of interrupt/syscall exit in C, we need
to exclude KCOV from that code, otherwise it can lead to
unrecoverable faults.
Thanks to Daniel Axtens.
* tag 'powerpc-5.7-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Disable sanitisers for C syscall/interrupt entry/exit code
powerpc/64s: Fix restore of NV GPRs after facility unavailable exception
syzbot was able to trigger a crash after using an ISDN socket
and fool l2tp.
Fix this by making sure the UDP socket is of the proper family.
BUG: KASAN: slab-out-of-bounds in setup_udp_tunnel_sock+0x465/0x540 net/ipv4/udp_tunnel.c:78
Write of size 1 at addr ffff88808ed0c590 by task syz-executor.5/3018
CPU: 0 PID: 3018 Comm: syz-executor.5 Not tainted 5.7.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x188/0x20d lib/dump_stack.c:118
print_address_description.constprop.0.cold+0xd3/0x413 mm/kasan/report.c:382
__kasan_report.cold+0x20/0x38 mm/kasan/report.c:511
kasan_report+0x33/0x50 mm/kasan/common.c:625
setup_udp_tunnel_sock+0x465/0x540 net/ipv4/udp_tunnel.c:78
l2tp_tunnel_register+0xb15/0xdd0 net/l2tp/l2tp_core.c:1523
l2tp_nl_cmd_tunnel_create+0x4b2/0xa60 net/l2tp/l2tp_netlink.c:249
genl_family_rcv_msg_doit net/netlink/genetlink.c:673 [inline]
genl_family_rcv_msg net/netlink/genetlink.c:718 [inline]
genl_rcv_msg+0x627/0xdf0 net/netlink/genetlink.c:735
netlink_rcv_skb+0x15a/0x410 net/netlink/af_netlink.c:2469
genl_rcv+0x24/0x40 net/netlink/genetlink.c:746
netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329
netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:672
____sys_sendmsg+0x6e6/0x810 net/socket.c:2352
___sys_sendmsg+0x100/0x170 net/socket.c:2406
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2439
do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x45ca29
Code: 0d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007effe76edc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004fe1c0 RCX: 000000000045ca29
RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005
RBP: 000000000078bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 000000000000094e R14: 00000000004d5d00 R15: 00007effe76ee6d4
Allocated by task 3018:
save_stack+0x1b/0x40 mm/kasan/common.c:49
set_track mm/kasan/common.c:57 [inline]
__kasan_kmalloc mm/kasan/common.c:495 [inline]
__kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:468
__do_kmalloc mm/slab.c:3656 [inline]
__kmalloc+0x161/0x7a0 mm/slab.c:3665
kmalloc include/linux/slab.h:560 [inline]
sk_prot_alloc+0x223/0x2f0 net/core/sock.c:1612
sk_alloc+0x36/0x1100 net/core/sock.c:1666
data_sock_create drivers/isdn/mISDN/socket.c:600 [inline]
mISDN_sock_create+0x272/0x400 drivers/isdn/mISDN/socket.c:796
__sock_create+0x3cb/0x730 net/socket.c:1428
sock_create net/socket.c:1479 [inline]
__sys_socket+0xef/0x200 net/socket.c:1521
__do_sys_socket net/socket.c:1530 [inline]
__se_sys_socket net/socket.c:1528 [inline]
__x64_sys_socket+0x6f/0xb0 net/socket.c:1528
do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
entry_SYSCALL_64_after_hwframe+0x49/0xb3
Freed by task 2484:
save_stack+0x1b/0x40 mm/kasan/common.c:49
set_track mm/kasan/common.c:57 [inline]
kasan_set_free_info mm/kasan/common.c:317 [inline]
__kasan_slab_free+0xf7/0x140 mm/kasan/common.c:456
__cache_free mm/slab.c:3426 [inline]
kfree+0x109/0x2b0 mm/slab.c:3757
kvfree+0x42/0x50 mm/util.c:603
__free_fdtable+0x2d/0x70 fs/file.c:31
put_files_struct fs/file.c:420 [inline]
put_files_struct+0x248/0x2e0 fs/file.c:413
exit_files+0x7e/0xa0 fs/file.c:445
do_exit+0xb04/0x2dd0 kernel/exit.c:791
do_group_exit+0x125/0x340 kernel/exit.c:894
get_signal+0x47b/0x24e0 kernel/signal.c:2739
do_signal+0x81/0x2240 arch/x86/kernel/signal.c:784
exit_to_usermode_loop+0x26c/0x360 arch/x86/entry/common.c:161
prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
syscall_return_slowpath arch/x86/entry/common.c:279 [inline]
do_syscall_64+0x6b1/0x7d0 arch/x86/entry/common.c:305
entry_SYSCALL_64_after_hwframe+0x49/0xb3
The buggy address belongs to the object at ffff88808ed0c000
which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 1424 bytes inside of
2048-byte region [ffff88808ed0c000, ffff88808ed0c800)
The buggy address belongs to the page:
page:ffffea00023b4300 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0xfffe0000000200(slab)
raw: 00fffe0000000200 ffffea0002838208 ffffea00015ba288 ffff8880aa000e00
raw: 0000000000000000 ffff88808ed0c000 0000000100000001 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88808ed0c480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff88808ed0c500: 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88808ed0c580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff88808ed0c600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88808ed0c680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
Fixes: 6b9f34239b00 ("l2tp: fix races in tunnel creation")
Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: James Chapman <jchapman@katalix.com>
Cc: Guillaume Nault <gnault@redhat.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Although not exactly identical, unthrottle_cfs_rq() and enqueue_task_fair()
are quite close and follow the same sequence for enqueuing an entity in the
cfs hierarchy. Modify unthrottle_cfs_rq() to use the same pattern as
enqueue_task_fair(). This fixes a problem already faced with the latter and
add an optimization in the last for_each_sched_entity loop.
Fixes: fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
Reported-by Tao Zhou <zohooouoto@zoho.com.cn>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Link: https://lkml.kernel.org/r/20200513135528.4742-1-vincent.guittot@linaro.org
The intermediate result of the old term (4UL * 1024 * 1024 * 1024) is
4 294 967 296 or 0x100000000 which is no problem on 64 bit systems.
The patch does not change the later overall result of 0x100000 for
MAX_DMA32_PFN (after it has been shifted by PAGE_SHIFT). The new
calculation yields the same result, but does not require 64 bit
arithmetic.
On 32 bit systems the old calculation suffers from an arithmetic
overflow in that intermediate term in braces: 4UL aka unsigned long int
is 4 byte wide and an arithmetic overflow happens (the 0x100000000 does
not fit in 4 bytes), the in braces result is truncated to zero, the
following right shift does not alter that, so MAX_DMA32_PFN evaluates to
0 on 32 bit systems.
That wrong value is a problem in a comparision against MAX_DMA32_PFN in
the init code for swiotlb in pci_swiotlb_detect_4gb() to decide if
swiotlb should be active. That comparison yields the opposite result,
when compiling on 32 bit systems.
This was not possible before
1b7e03ef7570 ("x86, NUMA: Enable emulation on 32bit too")
when that MAX_DMA32_PFN was first made visible to x86_32 (and which
landed in v3.0).
In practice this wasn't a problem, unless CONFIG_SWIOTLB is active on
x86-32.
However if one has set CONFIG_IOMMU_INTEL, since
c5a5dc4cbbf4 ("iommu/vt-d: Don't switch off swiotlb if bounce page is used")
there's a dependency on CONFIG_SWIOTLB, which was not necessarily
active before. That landed in v5.4, where we noticed it in the fli4l
Linux distribution. We have CONFIG_IOMMU_INTEL active on both 32 and 64
bit kernel configs there (I could not find out why, so let's just say
historical reasons).
The effect is at boot time 64 MiB (default size) were allocated for
bounce buffers now, which is a noticeable amount of memory on small
systems like pcengines ALIX 2D3 with 256 MiB memory, which are still
frequently used as home routers.
We noticed this effect when migrating from kernel v4.19 (LTS) to v5.4
(LTS) in fli4l and got that kernel messages for example:
Linux version 5.4.22 (buildroot@buildroot) (gcc version 7.3.0 (Buildroot 2018.02.8)) #1 SMP Mon Nov 26 23:40:00 CET 2018
…
Memory: 183484K/261756K available (4594K kernel code, 393K rwdata, 1660K rodata, 536K init, 456K bss , 78272K reserved, 0K cma-reserved, 0K highmem)
…
PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
software IO TLB: mapped [mem 0x0bb78000-0x0fb78000] (64MB)
The initial analysis and the suggested fix was done by user 'sourcejedi'
at stackoverflow and explicitly marked as GPLv2 for inclusion in the
Linux kernel:
https://unix.stackexchange.com/a/520525/50007
The new calculation, which does not suffer from that overflow, is the
same as for arch/mips now as suggested by Robin Murphy.
The fix was tested by fli4l users on round about two dozen different
systems, including both 32 and 64 bit archs, bare metal and virtualized
machines.
[ bp: Massage commit message. ]
Fixes: 1b7e03ef7570 ("x86, NUMA: Enable emulation on 32bit too")
Reported-by: Alan Jenkins <alan.christopher.jenkins@gmail.com>
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Alexander Dahl <post@lespocky.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
Link: https://unix.stackexchange.com/q/520065/50007
Link: https://web.nettworks.org/bugs/browse/FFL-2560
Link: https://lkml.kernel.org/r/20200526175749.20742-1-post@lespocky.de
cpy and set really should be size_t; we won't get an overflow on that,
since sysctl_nr_open can't be set above ~(size_t)0 / sizeof(void *),
so nr that would've managed to overflow size_t on that multiplication
won't get anywhere near copy_fdtable() - we'll fail with EMFILE
before that.
Cc: stable@kernel.org # v2.6.25+
Fixes: 9cfe015aa424 (get rid of NR_OPEN and introduce a sysctl_nr_open)
Reported-by: Thiago Macieira <thiago.macieira@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>