Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

shm: add sealing API

If two processes share a common memory region, they usually want some
guarantees to allow safe access. This often includes:
- one side cannot overwrite data while the other reads it
- one side cannot shrink the buffer while the other accesses it
- one side cannot grow the buffer beyond previously set boundaries

If there is a trust relationship between both parties, there is no need
for policy enforcement. However, if there is no trust relationship (e.g.,
for general-purpose IPC), sharing memory regions is highly fragile and
often not possible without local copies. Consider the following two
use cases:

1) A graphics client wants to share its rendering-buffer with a
graphics-server. The memory-region is allocated by the client for
read/write access and a second FD is passed to the server. While
scanning out from the memory region, the server has no guarantee that
the client doesn't shrink the buffer at any time, requiring rather
cumbersome SIGBUS handling.
2) A process wants to perform an RPC on another process. To avoid huge
bandwidth consumption, zero-copy is preferred. After a message is
assembled in memory and an FD is passed to the remote side, both sides
want to be sure that neither modifies this shared copy anymore. The
source may have put sensitive data into the message without a separate
copy and the target may want to parse the message inline, to avoid a
local copy.

While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
ways to achieve most of this, the first one is disproportionately ugly to
use in libraries and the latter two are broken/racy or even disabled due
to denial-of-service attacks.

This patch introduces the concept of SEALING. If you seal a file, a
specific set of operations is blocked on that file forever. Unlike locks,
seals can only be set, never removed. Hence, once you have verified that a
specific set of seals is set, you are guaranteed that no one can perform
the blocked operations on this file anymore.

An initial set of SEALS is introduced by this patch:
- SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
in size. This affects ftruncate() and open(O_TRUNC).
- GROW: If SEAL_GROW is set, the file in question cannot be increased
in size. This affects ftruncate(), fallocate() and write().
- WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
are possible. This affects fallocate(PUNCH_HOLE), mmap() and
write().
- SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
This basically prevents the F_ADD_SEALS operation on a file and
can be set to prevent others from adding further seals that you
don't want.
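
As a rough illustration of the seal semantics listed above, the following
userspace sketch models which operations each seal refuses. It is not kernel
code: the enum, the function name seal_blocks() and its size parameters are
hypothetical, and only the F_SEAL_* values are taken from this patch.
Growth through fallocate() is left out for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Seal bits as added to include/uapi/linux/fcntl.h by this patch;
 * guarded so the sketch also compiles where <fcntl.h> provides them. */
#ifndef F_SEAL_SEAL
#define F_SEAL_SEAL   0x0001
#define F_SEAL_SHRINK 0x0002
#define F_SEAL_GROW   0x0004
#define F_SEAL_WRITE  0x0008
#endif

/* Hypothetical operation kinds, for illustration only. */
enum seal_op {
	OP_RESIZE,     /* ftruncate() / open(O_TRUNC): size becomes new_end */
	OP_WRITE,      /* write(): data is written up to offset new_end */
	OP_PUNCH_HOLE, /* fallocate(PUNCH_HOLE) */
	OP_ADD_SEALS,  /* fcntl(F_ADD_SEALS) */
};

/* Returns true if `op` would be refused under `seals`.
 * `size` is the current file size; `new_end` is the size after a resize,
 * or the end offset of a write. */
static bool seal_blocks(unsigned int seals, enum seal_op op,
			uint64_t size, uint64_t new_end)
{
	switch (op) {
	case OP_RESIZE:
		return (new_end < size && (seals & F_SEAL_SHRINK)) ||
		       (new_end > size && (seals & F_SEAL_GROW));
	case OP_WRITE:
		return (seals & F_SEAL_WRITE) ||
		       ((seals & F_SEAL_GROW) && new_end > size);
	case OP_PUNCH_HOLE:
		return (seals & F_SEAL_WRITE) != 0;
	case OP_ADD_SEALS:
		return (seals & F_SEAL_SEAL) != 0;
	}
	return false;
}
```

Note that SEAL_WRITE alone does not refuse a resize in this model, which
matches the "besides resizing" wording above.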

The described use-cases can easily use these seals to provide safe use
without any trust-relationship:

1) The graphics server can verify that a passed file-descriptor has
SEAL_SHRINK set. This allows safe scanout, while the client is
allowed to increase buffer size for window-resizing on-the-fly.
Concurrent writes are explicitly allowed.
2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
process can modify the data while the other side parses it.
Furthermore, it guarantees that even with writable FDs passed to the
peer, it cannot increase the size to hit memory-limits of the source
process (in case the file-storage is accounted to the source).
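
The receiving side of both use cases could be sketched as follows. This is
an assumption-laden example, not part of the patch: the helper names
seals_cover() and check_received_fd(), the SEALS_* macros and the choice of
-EPERM for "seals missing" are all hypothetical. Only fcntl(F_GET_SEALS) and
the F_SEAL_* bits come from the API described here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc exposes F_GET_SEALS and F_SEAL_* under this */
#endif
#include <errno.h>
#include <fcntl.h>

/* Fallbacks for older headers; values are those added to
 * include/uapi/linux/fcntl.h by this patch (F_LINUX_SPECIFIC_BASE is 1024). */
#ifndef F_GET_SEALS
#define F_ADD_SEALS   (1024 + 9)
#define F_GET_SEALS   (1024 + 10)
#define F_SEAL_SEAL   0x0001
#define F_SEAL_SHRINK 0x0002
#define F_SEAL_GROW   0x0004
#define F_SEAL_WRITE  0x0008
#endif

/* Seal sets matching the two use cases (hypothetical names). */
#define SEALS_SCANOUT (F_SEAL_SHRINK)
#define SEALS_MESSAGE (F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE)

/* Non-zero if every bit in `required` is present in `seals`. */
static int seals_cover(unsigned int seals, unsigned int required)
{
	return (seals & required) == required;
}

/* Returns 0 if `fd` carries all `required` seals, -EPERM if some are
 * missing, or -errno if the seals cannot be queried (for example because
 * the underlying file does not support sealing). */
static int check_received_fd(int fd, unsigned int required)
{
	int seals = fcntl(fd, F_GET_SEALS);

	if (seals < 0)
		return -errno;
	return seals_cover((unsigned int)seals, required) ? 0 : -EPERM;
}
```

A graphics server would call check_received_fd(fd, SEALS_SCANOUT) before
scanning out, while an IPC peer would use SEALS_MESSAGE before parsing a
message in place.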

The new API is an extension to fcntl(), adding two new commands:
- F_GET_SEALS: Return a bitset describing the seals on the file. This
can be called on any FD if the underlying file supports
sealing.
- F_ADD_SEALS: Change the seals of a given file. This requires WRITE
access to the file and F_SEAL_SEAL may not already be set.
Furthermore, the underlying file must support sealing and
there may not be any existing shared mapping of that file.
Otherwise, EBADF/EPERM is returned.
The given seals are _added_ to the existing set of seals
on the file. You cannot remove seals again.
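
From the sending side, the two commands might be used roughly like this.
The sketch assumes `fd` is a writable descriptor for a file that supports
sealing and has no shared mapping; how such a file is created is outside
this patch. The helper names add_seals() and seal_for_sharing() are
hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc exposes F_ADD_SEALS and F_SEAL_* under this */
#endif
#include <errno.h>
#include <fcntl.h>

/* Fallbacks for older headers; values are those added to
 * include/uapi/linux/fcntl.h by this patch (F_LINUX_SPECIFIC_BASE is 1024). */
#ifndef F_ADD_SEALS
#define F_ADD_SEALS   (1024 + 9)
#define F_GET_SEALS   (1024 + 10)
#define F_SEAL_SEAL   0x0001
#define F_SEAL_SHRINK 0x0002
#define F_SEAL_GROW   0x0004
#define F_SEAL_WRITE  0x0008
#endif

/* Adds `seals` to the file behind `fd`; returns 0 or -errno. */
static int add_seals(int fd, int seals)
{
	if (fcntl(fd, F_ADD_SEALS, seals) < 0)
		return -errno;
	return 0;
}

/* Freezes size and content before the FD is handed to a peer, then
 * reads the seals back to confirm that they are in place. */
static int seal_for_sharing(int fd)
{
	const int want = F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE;
	int ret, seals;

	ret = add_seals(fd, want);
	if (ret < 0)
		return ret;

	seals = fcntl(fd, F_GET_SEALS);
	if (seals < 0)
		return -errno;
	return (seals & want) == want ? 0 : -EPERM;
}
```

Because seals are only ever added, calling seal_for_sharing() twice on the
same file is harmless as long as F_SEAL_SEAL has not been set in between.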

The fcntl() handler is currently specific to shmem and disabled on all
files. A file needs to explicitly support sealing for this interface to
work. A separate syscall is added in a follow-up, which creates files that
support sealing. There is no intention to support this on other
file-systems. Semantics are unclear for non-volatile files and we lack any
use-case right now. Therefore, the implementation is specific to shmem.
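
The "disabled on all files" behaviour follows from every shmem inode
starting out with F_SEAL_SEAL set. The userspace model below mirrors the
checks in shmem_add_seals() to show the effect; the struct and function
names are hypothetical, the mapping check is omitted, and the idea that a
sealing-capable file starts with no seals is an assumption about the
follow-up syscall, not something this patch implements.

```c
#include <errno.h>

/* Seal bits as added to include/uapi/linux/fcntl.h by this patch. */
#ifndef F_SEAL_SEAL
#define F_SEAL_SEAL   0x0001
#define F_SEAL_SHRINK 0x0002
#define F_SEAL_GROW   0x0004
#define F_SEAL_WRITE  0x0008
#endif

#define MODEL_ALL_SEALS \
	(F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE)

/* Hypothetical stand-in for the per-inode seal state. */
struct model_file {
	unsigned int seals;
	int writable; /* non-zero if the descriptor was opened for writing */
};

/* With sealing_enabled == 0 the file starts with F_SEAL_SEAL, as every
 * shmem inode does after this patch. With sealing_enabled != 0 it starts
 * with no seals (assumed behaviour of a sealing-capable file). */
static struct model_file model_new_file(int sealing_enabled)
{
	struct model_file f = {
		.seals = sealing_enabled ? 0 : F_SEAL_SEAL,
		.writable = 1,
	};
	return f;
}

/* Same order of checks as shmem_add_seals(), minus the mapping handling. */
static int model_add_seals(struct model_file *f, unsigned int seals)
{
	if (!f->writable)
		return -EPERM;
	if (seals & ~(unsigned int)MODEL_ALL_SEALS)
		return -EINVAL;
	if (f->seals & F_SEAL_SEAL)
		return -EPERM;

	f->seals |= seals; /* bits are only OR-ed in, never cleared */
	return 0;
}
```

In this model an ordinary file rejects every F_ADD_SEALS request with
-EPERM, and a sealing-capable file behaves the same way once F_SEAL_SEAL
has been added.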

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ryan Lortie <desrt@desrt.ca>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <zonque@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by David Herrmann, committed by Linus Torvalds
40e041a2 4bb5f5d9

+180
+5
fs/fcntl.c
···
 #include <linux/rcupdate.h>
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
+#include <linux/shmem_fs.h>

 #include <asm/poll.h>
 #include <asm/siginfo.h>
···
 	case F_SETPIPE_SZ:
 	case F_GETPIPE_SZ:
 		err = pipe_fcntl(filp, cmd, arg);
+		break;
+	case F_ADD_SEALS:
+	case F_GET_SEALS:
+		err = shmem_fcntl(filp, cmd, arg);
 		break;
 	default:
 		break;
+17
include/linux/shmem_fs.h
···
 #ifndef __SHMEM_FS_H
 #define __SHMEM_FS_H

+#include <linux/file.h>
 #include <linux/swap.h>
 #include <linux/mempolicy.h>
 #include <linux/pagemap.h>
···

 struct shmem_inode_info {
 	spinlock_t		lock;
+	unsigned int		seals;		/* shmem seals */
 	unsigned long		flags;
 	unsigned long		alloced;	/* data pages alloced to file */
 	union {
···
 	return shmem_read_mapping_page_gfp(mapping, index,
 					mapping_gfp_mask(mapping));
 }
+
+#ifdef CONFIG_TMPFS
+
+extern int shmem_add_seals(struct file *file, unsigned int seals);
+extern int shmem_get_seals(struct file *file);
+extern long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
+
+#else
+
+static inline long shmem_fcntl(struct file *f, unsigned int c, unsigned long a)
+{
+	return -EINVAL;
+}
+
+#endif

 #endif
+15
include/uapi/linux/fcntl.h
···
 #define F_GETPIPE_SZ	(F_LINUX_SPECIFIC_BASE + 8)

 /*
+ * Set/Get seals
+ */
+#define F_ADD_SEALS	(F_LINUX_SPECIFIC_BASE + 9)
+#define F_GET_SEALS	(F_LINUX_SPECIFIC_BASE + 10)
+
+/*
+ * Types of seals
+ */
+#define F_SEAL_SEAL	0x0001	/* prevent further seals from being set */
+#define F_SEAL_SHRINK	0x0002	/* prevent file from shrinking */
+#define F_SEAL_GROW	0x0004	/* prevent file from growing */
+#define F_SEAL_WRITE	0x0008	/* prevent writes */
+/* (1U << 31) is reserved for signed error codes */
+
+/*
  * Types of directory notifications that may be requested.
  */
 #define DN_ACCESS	0x00000001	/* File accessed */
+143
mm/shmem.c
···
 #include <linux/highmem.h>
 #include <linux/seq_file.h>
 #include <linux/magic.h>
+#include <linux/fcntl.h>

 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
···
 static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
+	struct shmem_inode_info *info = SHMEM_I(inode);
 	int error;

 	error = inode_change_ok(inode, attr);
···
 	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
 		loff_t oldsize = inode->i_size;
 		loff_t newsize = attr->ia_size;
+
+		/* protected by i_mutex */
+		if ((newsize < oldsize && (info->seals & F_SEAL_SHRINK)) ||
+		    (newsize > oldsize && (info->seals & F_SEAL_GROW)))
+			return -EPERM;

 		if (newsize != oldsize) {
 			error = shmem_reacct_size(SHMEM_I(inode)->flags,
···
 		info = SHMEM_I(inode);
 		memset(info, 0, (char *)inode - (char *)info);
 		spin_lock_init(&info->lock);
+		info->seals = F_SEAL_SEAL;
 		info->flags = flags & VM_NORESERVE;
 		INIT_LIST_HEAD(&info->swaplist);
 		simple_xattrs_init(&info->xattrs);
···
 			struct page **pagep, void **fsdata)
 {
 	struct inode *inode = mapping->host;
+	struct shmem_inode_info *info = SHMEM_I(inode);
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+
+	/* i_mutex is held by caller */
+	if (unlikely(info->seals)) {
+		if (info->seals & F_SEAL_WRITE)
+			return -EPERM;
+		if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
+			return -EPERM;
+	}
+
 	return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
 }

···
 	return offset;
 }

+static int shmem_wait_for_pins(struct address_space *mapping)
+{
+	return 0;
+}
+
+#define F_ALL_SEALS (F_SEAL_SEAL | \
+		     F_SEAL_SHRINK | \
+		     F_SEAL_GROW | \
+		     F_SEAL_WRITE)
+
+int shmem_add_seals(struct file *file, unsigned int seals)
+{
+	struct inode *inode = file_inode(file);
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	int error;
+
+	/*
+	 * SEALING
+	 * Sealing allows multiple parties to share a shmem-file but restrict
+	 * access to a specific subset of file operations. Seals can only be
+	 * added, but never removed. This way, mutually untrusted parties can
+	 * share common memory regions with a well-defined policy. A malicious
+	 * peer can thus never perform unwanted operations on a shared object.
+	 *
+	 * Seals are only supported on special shmem-files and always affect
+	 * the whole underlying inode. Once a seal is set, it may prevent some
+	 * kinds of access to the file. Currently, the following seals are
+	 * defined:
+	 *   SEAL_SEAL: Prevent further seals from being set on this file
+	 *   SEAL_SHRINK: Prevent the file from shrinking
+	 *   SEAL_GROW: Prevent the file from growing
+	 *   SEAL_WRITE: Prevent write access to the file
+	 *
+	 * As we don't require any trust relationship between two parties, we
+	 * must prevent seals from being removed. Therefore, sealing a file
+	 * only adds a given set of seals to the file, it never touches
+	 * existing seals. Furthermore, the "setting seals"-operation can be
+	 * sealed itself, which basically prevents any further seal from being
+	 * added.
+	 *
+	 * Semantics of sealing are only defined on volatile files. Only
+	 * anonymous shmem files support sealing. More importantly, seals are
+	 * never written to disk. Therefore, there's no plan to support it on
+	 * other file types.
+	 */
+
+	if (file->f_op != &shmem_file_operations)
+		return -EINVAL;
+	if (!(file->f_mode & FMODE_WRITE))
+		return -EPERM;
+	if (seals & ~(unsigned int)F_ALL_SEALS)
+		return -EINVAL;
+
+	mutex_lock(&inode->i_mutex);
+
+	if (info->seals & F_SEAL_SEAL) {
+		error = -EPERM;
+		goto unlock;
+	}
+
+	if ((seals & F_SEAL_WRITE) && !(info->seals & F_SEAL_WRITE)) {
+		error = mapping_deny_writable(file->f_mapping);
+		if (error)
+			goto unlock;
+
+		error = shmem_wait_for_pins(file->f_mapping);
+		if (error) {
+			mapping_allow_writable(file->f_mapping);
+			goto unlock;
+		}
+	}
+
+	info->seals |= seals;
+	error = 0;
+
+unlock:
+	mutex_unlock(&inode->i_mutex);
+	return error;
+}
+EXPORT_SYMBOL_GPL(shmem_add_seals);
+
+int shmem_get_seals(struct file *file)
+{
+	if (file->f_op != &shmem_file_operations)
+		return -EINVAL;
+
+	return SHMEM_I(file_inode(file))->seals;
+}
+EXPORT_SYMBOL_GPL(shmem_get_seals);
+
+long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	long error;
+
+	switch (cmd) {
+	case F_ADD_SEALS:
+		/* disallow upper 32bit */
+		if (arg > UINT_MAX)
+			return -EINVAL;
+
+		error = shmem_add_seals(file, arg);
+		break;
+	case F_GET_SEALS:
+		error = shmem_get_seals(file);
+		break;
+	default:
+		error = -EINVAL;
+		break;
+	}
+
+	return error;
+}
+
 static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 							 loff_t len)
 {
 	struct inode *inode = file_inode(file);
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_falloc shmem_falloc;
 	pgoff_t start, index, end;
 	int error;
···
 		loff_t unmap_start = round_up(offset, PAGE_SIZE);
 		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
 		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
+
+		/* protected by i_mutex */
+		if (info->seals & F_SEAL_WRITE) {
+			error = -EPERM;
+			goto out;
+		}

 		shmem_falloc.waitq = &shmem_falloc_waitq;
 		shmem_falloc.start = unmap_start >> PAGE_SHIFT;
···
 	error = inode_newsize_ok(inode, offset + len);
 	if (error)
 		goto out;
+
+	if ((info->seals & F_SEAL_GROW) && offset + len > inode->i_size) {
+		error = -EPERM;
+		goto out;
+	}

 	start = offset >> PAGE_CACHE_SHIFT;
 	end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;