Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

mseal sysmap: kernel config and header change

Patch series "mseal system mappings", v9.

As discussed during mseal() upstream process [1], mseal() protects the
VMAs of a given virtual memory range against modifications, such as the
read/write (RW) and no-execute (NX) bits. For complete descriptions of
memory sealing, please see mseal.rst [2].

The mseal() is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped.

The system mappings are readonly only, memory sealing can protect them
from ever changing to writable or unmmap/remapped as different attributes.

System mappings such as vdso, vvar, vvar_vclock, vectors (arm
compat-mode), sigpage (arm compat-mode), are created by the kernel during
program initialization, and could be sealed after creation.

Unlike the aforementioned mappings, the uprobe mapping is not established
during program startup. However, its lifetime is the same as the
process's lifetime [3]. It could be sealed from creation.

The vsyscall on x86-64 uses a special address (0xffffffffff600000), which
is outside the mm managed range. This means mprotect, munmap, and mremap
won't work on the vsyscall. Since sealing doesn't enhance the vsyscall's
security, it is skipped in this patch. If we ever seal the vsyscall, it
is probably only for decorative purpose, i.e. showing the 'sl' flag in
the /proc/pid/smaps. For this patch, it is ignored.

It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
alter the system mappings during restore operations. UML(User Mode Linux)
and gVisor, rr are also known to change the vdso/vvar mappings.
Consequently, this feature cannot be universally enabled across all
systems. As such, CONFIG_MSEAL_SYSTEM_MAPPINGS is disabled by default.

To support mseal of system mappings, architectures must define
CONFIG_ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS and update their special
mappings calls to pass mseal flag. Additionally, architectures must
confirm they do not unmap/remap system mappings during the process
lifetime. The existence of this flag for an architecture implies that it
does not require the remapping of thest system mappings during process
lifetime, so sealing these mappings is safe from a kernel perspective.

This version covers x86-64 and arm64 archiecture as minimum viable feature.

While no specific CPU hardware features are required for enable this
feature on an archiecture, memory sealing requires a 64-bit kernel. Other
architectures can choose whether or not to adopt this feature. Currently,
I'm not aware of any instances in the kernel code that actively
munmap/mremap a system mapping without a request from userspace. The PPC
does call munmap when _install_special_mapping fails for vdso; however,
it's uncertain if this will ever fail for PPC - this needs to be
investigated by PPC in the future [4]. The UML kernel can add this
support when KUnit tests require it [5].

In this version, we've improved the handling of system mapping sealing
from previous versions, instead of modifying the _install_special_mapping
function itself, which would affect all architectures, we now call
_install_special_mapping with a sealing flag only within the specific
architecture that requires it. This targeted approach offers two key
advantages: 1) It limits the code change's impact to the necessary
architectures, and 2) It aligns with the software architecture by keeping
the core memory management within the mm layer, while delegating the
decision of sealing system mappings to the individual architecture, which
is particularly relevant since 32-bit architectures never require sealing.

Prior to this patch series, we explored sealing special mappings from
userspace using glibc's dynamic linker. This approach revealed several
issues:

- The PT_LOAD header may report an incorrect length for vdso, (smaller
than its actual size). The dynamic linker, which relies on PT_LOAD
information to determine mapping size, would then split and partially
seal the vdso mapping. Since each architecture has its own vdso/vvar
code, fixing this in the kernel would require going through each
archiecture. Our initial goal was to enable sealing readonly mappings,
e.g. .text, across all architectures, sealing vdso from kernel since
creation appears to be simpler than sealing vdso at glibc.

- The [vvar] mapping header only contains address information, not
length information. Similar issues might exist for other special
mappings.

- Mappings like uprobe are not covered by the dynamic linker, and there
is no effective solution for them.

This feature's security enhancements will benefit ChromeOS, Android, and
other high security systems.

Testing:
This feature was tested on ChromeOS and Android for both x86-64 and ARM64.
- Enable sealing and verify vdso/vvar, sigpage, vector are sealed properly,
i.e. "sl" shown in the smaps for those mappings, and mremap is blocked.
- Passing various automation tests (e.g. pre-checkin) on ChromeOS and
Android to ensure the sealing doesn't affect the functionality of
Chromebook and Android phone.

I also tested the feature on Ubuntu on x86-64:
- With config disabled, vdso/vvar is not sealed,
- with config enabled, vdso/vvar is sealed, and booting up Ubuntu is OK,
normal operations such as browsing the web, open/edit doc are OK.

Link: https://lore.kernel.org/all/20240415163527.626541-1-jeffxu@chromium.org/ [1]
Link: Documentation/userspace-api/mseal.rst [2]
Link: https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/ [3]
Link: https://lore.kernel.org/all/CABi2SkV6JJwJeviDLsq9N4ONvQ=EFANsiWkgiEOjyT9TQSt+HA@mail.gmail.com/ [4]
Link: https://lore.kernel.org/all/202502251035.239B85A93@keescook/ [5]


This patch (of 7):

Provide infrastructure to mseal system mappings. Establish two kernel
configs (CONFIG_MSEAL_SYSTEM_MAPPINGS,
ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS) and VM_SEALED_SYSMAP macro for future
patches.

Link: https://lkml.kernel.org/r/20250305021711.3867874-1-jeffxu@google.com
Link: https://lkml.kernel.org/r/20250305021711.3867874-2-jeffxu@google.com
Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Cc: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Berg <benjamin@sipsolutions.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Elliot Hughes <enh@google.com>
Cc: Florian Faineli <f.fainelli@gmail.com>
Cc: Greg Ungerer <gerg@kernel.org>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: Linus Waleij <linus.walleij@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Mike Rapoport <mike.rapoport@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Röttger <sroettger@google.com>
Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

authored by

Jeff Xu and committed by
Andrew Morton
5796d396 02d9e1a2

+53
+10
include/linux/mm.h
··· 4236 4236 int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status); 4237 4237 int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status); 4238 4238 4239 + 4240 + /* 4241 + * mseal of userspace process's system mappings. 4242 + */ 4243 + #ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS 4244 + #define VM_SEALED_SYSMAP VM_SEALED 4245 + #else 4246 + #define VM_SEALED_SYSMAP VM_NONE 4247 + #endif 4248 + 4239 4249 #endif /* _LINUX_MM_H */
+22
init/Kconfig
··· 1888 1888 config ARCH_HAS_MEMBARRIER_SYNC_CORE 1889 1889 bool 1890 1890 1891 + config ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS 1892 + bool 1893 + help 1894 + Control MSEAL_SYSTEM_MAPPINGS access based on architecture. 1895 + 1896 + A 64-bit kernel is required for the memory sealing feature. 1897 + No specific hardware features from the CPU are needed. 1898 + 1899 + To enable this feature, the architecture needs to update their 1900 + special mappings calls to include the sealing flag and confirm 1901 + that it doesn't unmap/remap system mappings during the life 1902 + time of the process. The existence of this flag for an architecture 1903 + implies that it does not require the remapping of the system 1904 + mappings during process lifetime, so sealing these mappings is safe 1905 + from a kernel perspective. 1906 + 1907 + After the architecture enables this, a distribution can set 1908 + CONFIG_MSEAL_SYSTEM_MAPPING to manage access to the feature. 1909 + 1910 + For complete descriptions of memory sealing, please see 1911 + Documentation/userspace-api/mseal.rst 1912 + 1891 1913 config HAVE_PERF_EVENTS 1892 1914 bool 1893 1915 help
+21
security/Kconfig
··· 51 51 52 52 endchoice 53 53 54 + config MSEAL_SYSTEM_MAPPINGS 55 + bool "mseal system mappings" 56 + depends on 64BIT 57 + depends on ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS 58 + depends on !CHECKPOINT_RESTORE 59 + help 60 + Apply mseal on system mappings. 61 + The system mappings includes vdso, vvar, vvar_vclock, 62 + vectors (arm compat-mode), sigpage (arm compat-mode), uprobes. 63 + 64 + A 64-bit kernel is required for the memory sealing feature. 65 + No specific hardware features from the CPU are needed. 66 + 67 + WARNING: This feature breaks programs which rely on relocating 68 + or unmapping system mappings. Known broken software at the time 69 + of writing includes CHECKPOINT_RESTORE, UML, gVisor, rr. Therefore 70 + this config can't be enabled universally. 71 + 72 + For complete descriptions of memory sealing, please see 73 + Documentation/userspace-api/mseal.rst 74 + 54 75 config SECURITY 55 76 bool "Enable different security models" 56 77 depends on SYSFS