Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

bpf: Don't use rcu_users to refcount in task kfuncs

A series of prior patches added some kfuncs that allow struct
task_struct * objects to be used as kptrs. These kfuncs leveraged the
'refcount_t rcu_users' field of the task for performing refcounting.
This field was used instead of 'refcount_t usage', as we wanted to
leverage the safety provided by RCU for ensuring a task's lifetime.

A struct task_struct is refcounted by two different refcount_t fields:

1. p->usage:     The "true" refcount field which tracks a task's
                 lifetime. The task is freed as soon as this refcount
                 drops to 0.

2. p->rcu_users: An "RCU users" refcount field which is statically
initialized to 2, and is co-located in a union with
a struct rcu_head field (p->rcu). p->rcu_users
essentially encapsulates a single p->usage
refcount, and when p->rcu_users goes to 0, an RCU
callback is scheduled on the struct rcu_head which
decrements the p->usage refcount.

Our logic was that by using p->rcu_users, we would be able to use RCU to
safely issue refcount_inc_not_zero() on a task's rcu_users field to
determine whether a task could still be acquired, or was exiting.
Unfortunately, this does not work due to p->rcu_users and p->rcu sharing
a union. When p->rcu_users goes to 0, an RCU callback is scheduled to
drop a single p->usage refcount, and because the fields share a union,
the refcount immediately becomes nonzero again after the callback is
scheduled.

If we were to split the fields out of the union, this wouldn't be a
problem. Doing so should also be rather non-controversial, as there are
a number of places in struct task_struct that have padding which we
could use to avoid growing the structure by splitting up the fields.

For now, so as to fix the kfuncs to be correct, this patch instead
updates bpf_task_acquire() and bpf_task_release() to use the p->usage
field for refcounting via the get_task_struct() and put_task_struct()
functions. Because we can no longer rely on RCU, the change also guts
the bpf_task_acquire_not_zero() and bpf_task_kptr_get() functions
pending a resolution on the above problem.

In addition, the patch fixes the kfunc and rcu_read_lock selftests to
expect this new behavior.

Fixes: 90660309b0c7 ("bpf: Add kfuncs for storing struct task_struct * as a kptr")
Fixes: fca1aa75518c ("bpf: Handle MEM_RCU type properly")
Reported-by: Matus Jokay <matus.jokay@stuba.sk>
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221206210538.597606-1-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Authored by David Vernet, committed by Alexei Starovoitov
commit 156ed20d (parent 235d2ef2)

3 files changed: +60 -30
kernel/bpf/helpers.c (+48 -28)

···
  */
 struct task_struct *bpf_task_acquire(struct task_struct *p)
 {
-	refcount_inc(&p->rcu_users);
-	return p;
+	return get_task_struct(p);
 }
 
 /**
···
  */
 struct task_struct *bpf_task_acquire_not_zero(struct task_struct *p)
 {
-	if (!refcount_inc_not_zero(&p->rcu_users))
-		return NULL;
-	return p;
+	/* For the time being this function returns NULL, as it's not currently
+	 * possible to safely acquire a reference to a task with RCU protection
+	 * using get_task_struct() and put_task_struct(). This is due to the
+	 * slightly odd mechanics of p->rcu_users, and how task RCU protection
+	 * works.
+	 *
+	 * A struct task_struct is refcounted by two different refcount_t
+	 * fields:
+	 *
+	 * 1. p->usage:     The "true" refcount field which tracks a task's
+	 *                  lifetime. The task is freed as soon as this
+	 *                  refcount drops to 0.
+	 *
+	 * 2. p->rcu_users: An "RCU users" refcount field which is statically
+	 *                  initialized to 2, and is co-located in a union with
+	 *                  a struct rcu_head field (p->rcu). p->rcu_users
+	 *                  essentially encapsulates a single p->usage
+	 *                  refcount, and when p->rcu_users goes to 0, an RCU
+	 *                  callback is scheduled on the struct rcu_head which
+	 *                  decrements the p->usage refcount.
+	 *
+	 * There are two important implications to this task refcounting logic
+	 * described above. The first is that
+	 * refcount_inc_not_zero(&p->rcu_users) cannot be used anywhere, as
+	 * after the refcount goes to 0, the RCU callback being scheduled will
+	 * cause the memory backing the refcount to again be nonzero due to the
+	 * fields sharing a union. The other is that we can't rely on RCU to
+	 * guarantee that a task is valid in a BPF program. This is because a
+	 * task could have already transitioned to being in the TASK_DEAD
+	 * state, had its rcu_users refcount go to 0, and its rcu callback
+	 * invoked in which it drops its single p->usage reference. At this
+	 * point the task will be freed as soon as the last p->usage reference
+	 * goes to 0, without waiting for another RCU gp to elapse. The only
+	 * way that a BPF program can guarantee that a task is valid in this
+	 * scenario is to hold a p->usage refcount itself.
+	 *
+	 * Until we're able to resolve this issue, either by pulling
+	 * p->rcu_users and p->rcu out of the union, or by getting rid of
+	 * p->usage and just using p->rcu_users for refcounting, we'll just
+	 * return NULL here.
+	 */
+	return NULL;
 }
 
 /**
···
  */
 struct task_struct *bpf_task_kptr_get(struct task_struct **pp)
 {
-	struct task_struct *p;
-
-	rcu_read_lock();
-	p = READ_ONCE(*pp);
-
-	/* Another context could remove the task from the map and release it at
-	 * any time, including after we've done the lookup above. This is safe
-	 * because we're in an RCU read region, so the task is guaranteed to
-	 * remain valid until at least the rcu_read_unlock() below.
+	/* We must return NULL here until we have clarity on how to properly
+	 * leverage RCU for ensuring a task's lifetime. See the comment above
+	 * in bpf_task_acquire_not_zero() for more details.
 	 */
-	if (p && !refcount_inc_not_zero(&p->rcu_users))
-		/* If the task had been removed from the map and freed as
-		 * described above, refcount_inc_not_zero() will return false.
-		 * The task will be freed at some point after the current RCU
-		 * gp has ended, so just return NULL to the user.
-		 */
-		p = NULL;
-	rcu_read_unlock();
-
-	return p;
+	return NULL;
 }
 
 /**
  * bpf_task_release - Release the reference acquired on a struct task_struct *.
- * If this kfunc is invoked in an RCU read region, the task_struct is
- * guaranteed to not be freed until the current grace period has ended, even if
- * its refcount drops to 0.
  * @p: The task on which a reference is being released.
  */
 void bpf_task_release(struct task_struct *p)
···
 	if (!p)
 		return;
 
-	put_task_struct_rcu_user(p);
+	put_task_struct(p);
 }
 
 #ifdef CONFIG_CGROUPS
tools/testing/selftests/bpf/progs/rcu_read_lock.c (+5)

···
 	/* acquire a reference which can be used outside rcu read lock region */
 	gparent = bpf_task_acquire_not_zero(gparent);
 	if (!gparent)
+		/* Until we resolve the issues with using task->rcu_users, we
+		 * expect bpf_task_acquire_not_zero() to return a NULL task.
+		 * See the comment at the definition of
+		 * bpf_task_acquire_not_zero() for more details.
+		 */
 		goto out;
 
 	(void)bpf_task_storage_get(&map_a, gparent, 0, 0);
tools/testing/selftests/bpf/progs/task_kfunc_success.c (+7 -2)

···
 	}
 
 	kptr = bpf_task_kptr_get(&v->task);
-	if (!kptr) {
+	if (kptr) {
+		/* Until we resolve the issues with using task->rcu_users, we
+		 * expect bpf_task_kptr_get() to return a NULL task. See the
+		 * comment at the definition of bpf_task_acquire_not_zero() for
+		 * more details.
+		 */
+		bpf_task_release(kptr);
 		err = 3;
 		return 0;
 	}
 
-	bpf_task_release(kptr);
 
 	return 0;
 }