Linux kernel
============
There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.
In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``. The formatted documentation can also be read online at:
https://www.kernel.org/doc/html/latest/
There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.
Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.
code
Clone this repository
https://tangled.org/tjh.dev/kernel
git@gordian.tjh.dev:tjh.dev/kernel
For self-hosted knots, clone URLs may differ based on your setup.
Merge vm fixes from Andrew Morton:
"5 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/sparse: fix kernel crash with pfn_section_valid check
mm: fork: fix kernel_stack memcg stats for various stack implementations
hugetlb_cgroup: fix illegal access to memory
drivers/base/memory.c: indicate all memory blocks as removable
mm/swapfile.c: move inode_lock out of claim_swapfile
Pull timer fix from Thomas Gleixner:
"A single fix for the Hyper-V clocksource driver to make sched clock
actually return nanoseconds and not the virtual clock value which
increments at 10e7 HZ (100ns)"
* tag 'timers-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/hyper-v: Make sched clock return nanoseconds correctly
Fix the crash like this:
BUG: Kernel NULL pointer dereference on read at 0x00000000
Faulting instruction address: 0xc000000000c3447c
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
...
NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
LR [c000000000088354] vmemmap_free+0x144/0x320
Call Trace:
section_deactivate+0x220/0x240
__remove_pages+0x118/0x170
arch_remove_memory+0x3c/0x150
memunmap_pages+0x1cc/0x2f0
devm_action_release+0x30/0x50
release_nodes+0x2f8/0x3e0
device_release_driver_internal+0x168/0x270
unbind_store+0x130/0x170
drv_attr_store+0x44/0x60
sysfs_kf_write+0x68/0x80
kernfs_fop_write+0x100/0x290
__vfs_write+0x3c/0x70
vfs_write+0xcc/0x240
ksys_write+0x7c/0x140
system_call+0x5c/0x68
The crash is due to NULL dereference at
test_bit(idx, ms->usage->subsection_map);
due to ms->usage = NULL in pfn_section_valid()
With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in
SPARSEMEM|!VMEMMAP case") section_mem_map is set to NULL after
depopulate_section_mem(). This was done so that pfn_page() can work
correctly with kernel config that disables SPARSEMEM_VMEMMAP. With that
config pfn_to_page does
__section_mem_map_addr(__sec) + __pfn;
where
static inline struct page *__section_mem_map_addr(struct mem_section *section)
{
unsigned long map = section->section_mem_map;
map &= SECTION_MAP_MASK;
return (struct page *)map;
}
Now with SPASEMEM_VMEMAP enabled, mem_section->usage->subsection_map is
used to check the pfn validity (pfn_valid()). Since section_deactivate
release mem_section->usage if a section is fully deactivated,
pfn_valid() check after a subsection_deactivate cause a kernel crash.
static inline int pfn_valid(unsigned long pfn)
{
...
return early_section(ms) || pfn_section_valid(ms, pfn);
}
where
static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
{
int idx = subsection_map_index(pfn);
return test_bit(idx, ms->usage->subsection_map);
}
Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is
freed. For architectures like ppc64 where large pages are used for
vmmemap mapping (16MB), a specific vmemmap mapping can cover multiple
sections. Hence before a vmemmap mapping page can be freed, the kernel
needs to make sure there are no valid sections within that mapping.
Clearing the section valid bit before depopulate_section_memap enables
this.
[aneesh.kumar@linux.ibm.com: add comment]
Link: http://lkml.kernel.org/r/20200326133235.343616-1-aneesh.kumar@linux.ibm.comLink: http://lkml.kernel.org/r/20200325031914.107660-1-aneesh.kumar@linux.ibm.com
Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull irq fix from Thomas Gleixner:
"A single bugfix to prevent reference leaks in irq affinity notifiers"
* tag 'irq-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Fix reference leaks on irq affinity notifiers
The sched clock read functions return the HV clock (100ns granularity)
without converting it to nanoseconds.
Add the missing conversion.
Fixes: bd00cd52d5be ("clocksource/drivers/hyperv: Add Hyper-V specific sched clock function")
Signed-off-by: Yubo Xie <yuboxie@microsoft.com>
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200327021159.31429-1-Tianyu.Lan@microsoft.com
Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the
space for task stacks can be allocated using __vmalloc_node_range(),
alloc_pages_node() and kmem_cache_alloc_node().
In the first and the second cases page->mem_cgroup pointer is set, but
in the third it's not: memcg membership of a slab page should be
determined using the memcg_from_slab_page() function, which looks at
page->slab_cache->memcg_params.memcg . In this case, using
mod_memcg_page_state() (as in account_kernel_stack()) is incorrect:
page->mem_cgroup pointer is NULL even for pages charged to a non-root
memory cgroup.
It can lead to kernel_stack per-memcg counters permanently showing 0 on
some architectures (depending on the configuration).
In order to fix it, let's introduce a mod_memcg_obj_state() helper,
which takes a pointer to a kernel object as a first argument, uses
mem_cgroup_from_obj() to get a RCU-protected memcg pointer and calls
mod_memcg_state(). It allows to handle all possible configurations
(CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without
spilling any memcg/kmem specifics into fork.c .
Note: This is a special version of the patch created for stable
backports. It contains code from the following two patches:
- mm: memcg/slab: introduce mem_cgroup_from_obj()
- mm: fork: fix kernel_stack memcg stats for various stack implementations
[guro@fb.com: introduce mem_cgroup_from_obj()]
Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com
Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
1) Fix memory leak in vti6, from Torsten Hilbrich.
2) Fix double free in xfrm_policy_timer, from YueHaibing.
3) NL80211_ATTR_CHANNEL_WIDTH attribute is put with wrong type, from
Johannes Berg.
4) Wrong allocation failure check in qlcnic driver, from Xu Wang.
5) Get ks8851-ml IO operations right, for real this time, from Marek
Vasut.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (22 commits)
r8169: fix PHY driver check on platforms w/o module softdeps
net: ks8851-ml: Fix IO operations, again
mlxsw: spectrum_mr: Fix list iteration in error path
qlcnic: Fix bad kzalloc null test
mac80211: set IEEE80211_TX_CTRL_PORT_CTRL_PROTO for nl80211 TX
mac80211: mark station unauthorized before key removal
mac80211: Check port authorization in the ieee80211_tx_dequeue() case
cfg80211: Do not warn on same channel at the end of CSA
mac80211: drop data frames without key on encrypted links
ieee80211: fix HE SPR size calculation
nl80211: fix NL80211_ATTR_CHANNEL_WIDTH attribute type
xfrm: policy: Fix doulbe free in xfrm_policy_timer
bpf: Explicitly memset some bpf info structures declared on the stack
bpf: Explicitly memset the bpf_attr structure
bpf: Sanitize the bpf_struct_ops tcp-cc name
vti6: Fix memory leak of skb if input policy check fails
esp: remove the skb from the chain when it's enqueued in cryptd_wq
ipv6: xfrm6_tunnel.c: Use built-in RCU list checking
xfrm: add the missing verify_sec_ctx_len check in xfrm_add_acquire
xfrm: fix uctx len check in verify_sec_ctx_len
...
The handling of notify->work did not properly maintain notify->kref in two
cases:
1) where the work was already scheduled, another irq_set_affinity_locked()
would get the ref and (no-op-ly) schedule the work. Thus when
irq_affinity_notify() ran, it would drop the original ref but not the
additional one.
2) when cancelling the (old) work in irq_set_affinity_notifier(), if there
was outstanding work a ref had been got for it but was never put.
Fix both by checking the return values of the work handling functions
(schedule_work() for (1) and cancel_work_sync() for (2)) and put the
extra ref if the return value indicates preexisting work.
Fixes: cd7eab44e994 ("genirq: Add IRQ affinity notifiers")
Fixes: 59c39840f5ab ("genirq: Prevent use-after-free and work list corruption")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ben Hutchings <ben@decadent.org.uk>
Link: https://lkml.kernel.org/r/24f5983f-2ab5-e83a-44ee-a45b5f9300f5@solarflare.com
Pull arm64 fixes from Will Deacon:
- Fix panic() when it occurs during secondary CPU startup
- Fix "kpti=off" when KASLR is enabled
- Fix howler in compat syscall table for vDSO clock_getres() fallback
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: compat: Fix syscall number of compat_clock_getres
arm64: kpti: Fix "kpti=off" when KASLR is enabled
arm64: smp: fix crash_smp_send_stop() behaviour
arm64: smp: fix smp_send_stop() behaviour
This appears to be a mistake in commit faced7e0806cf ("mm: hugetlb
controller for cgroups v2").
Essentially that commit does a hugetlb_cgroup_from_counter assuming that
page_counter_try_charge has initialized counter.
But if that has failed then it seems will not initialize counter, so
hugetlb_cgroup_from_counter(counter) ends up pointing to random memory,
causing kasan to complain.
The solution is to simply use 'h_cg', instead of
hugetlb_cgroup_from_counter(counter), since that is a reference to the
hugetlb_cgroup anyway. After this change kasan ceases to complain.
Fixes: faced7e0806cf ("mm: hugetlb controller for cgroups v2")
Reported-by: syzbot+cac0c4e204952cf449b1@syzkaller.appspotmail.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Giuseppe Scrivano <gscrivan@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/20200313223920.124230-1-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull i2c fixes from Wolfram Sang:
"Three more driver bugfixes, and two doc improvements fixing build
warnings while we are here"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: pca-platform: Use platform_irq_get_optional
i2c: st: fix missing struct parameter description
i2c: nvidia-gpu: Handle timeout correctly in gpu_i2c_check_status()
i2c: fix a doc warning
i2c: hix5hd2: add missed clk_disable_unprepare in remove
Daniel Borkmann says:
====================
pull-request: bpf 2020-03-27
The following pull-request contains BPF updates for your *net* tree.
We've added 3 non-merge commits during the last 4 day(s) which contain
a total of 4 files changed, 25 insertions(+), 20 deletions(-).
The main changes are:
1) Explicitly memset the bpf_attr structure on bpf() syscall to avoid
having to rely on compiler to do so. Issues have been noticed on
some compilers with padding and other oddities where the request was
then unexpectedly rejected, from Greg Kroah-Hartman.
2) Sanitize the bpf_struct_ops TCP congestion control name in order to
avoid problematic characters such as whitespaces, from Martin KaFai Lau.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull char/misc driver fixes from Greg KH:
"Here are some small different driver fixes for 5.6-rc7:
- binderfs fix, yet again
- slimbus new device id added
- hwtracing bugfixes for reported issues and a new device id
All of these have been in linux-next with no reported issues"
* tag 'char-misc-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
intel_th: pci: Add Elkhart Lake CPU support
intel_th: Fix user-visible error codes
intel_th: msu: Fix the unexpected state warning
stm class: sys-t: Fix the use of time_after()
slimbus: ngd: add v2.1.0 compatible
binderfs: use refcount for binder control devices too