Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'smp-urgent-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull CPU hotplug fix from Thomas Gleixner:
"A fix for the CPU hotplug and cpusets interaction:

cpusets delegate the hotplug work to a workqueue to prevent a lock
order inversion vs. the CPU hotplug lock. The work is not flushed
before the hotplug operation returns which creates user visible
inconsistent state. Prevent this by flushing the work after dropping
CPU hotplug lock and before releasing the outer mutex which serializes
the CPU hotplug related sysfs interface operations"

* tag 'smp-urgent-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Cure the cpusets trainwreck

+49
kernel/cpu.c
@@ -32,6 +32,7 @@
 #include <linux/relay.h>
 #include <linux/slab.h>
 #include <linux/percpu-rwsem.h>
+#include <linux/cpuset.h>
 
 #include <trace/events/power.h>
 #define CREATE_TRACE_POINTS
@@ -874,6 +873,52 @@
 	kthread_unpark(this_cpu_read(cpuhp_state.thread));
 }
 
+/*
+ * Serialize hotplug trainwrecks outside of the cpu_hotplug_lock
+ * protected region.
+ *
+ * The operation is still serialized against concurrent CPU hotplug via
+ * cpu_add_remove_lock, i.e. CPU map protection. But it is _not_
+ * serialized against other hotplug related activity like adding or
+ * removing of state callbacks and state instances, which invoke either the
+ * startup or the teardown callback of the affected state.
+ *
+ * This is required for subsystems which are unfixable vs. CPU hotplug and
+ * evade lock inversion problems by scheduling work which has to be
+ * completed _before_ cpu_up()/_cpu_down() returns.
+ *
+ * Don't even think about adding anything to this for any new code or even
+ * drivers. It's only purpose is to keep existing lock order trainwrecks
+ * working.
+ *
+ * For cpu_down() there might be valid reasons to finish cleanups which are
+ * not required to be done under cpu_hotplug_lock, but that's a different
+ * story and would be not invoked via this.
+ */
+static void cpu_up_down_serialize_trainwrecks(bool tasks_frozen)
+{
+	/*
+	 * cpusets delegate hotplug operations to a worker to "solve" the
+	 * lock order problems. Wait for the worker, but only if tasks are
+	 * _not_ frozen (suspend, hibernate) as that would wait forever.
+	 *
+	 * The wait is required because otherwise the hotplug operation
+	 * returns with inconsistent state, which could even be observed in
+	 * user space when a new CPU is brought up. The CPU plug uevent
+	 * would be delivered and user space reacting on it would fail to
+	 * move tasks to the newly plugged CPU up to the point where the
+	 * work has finished because up to that point the newly plugged CPU
+	 * is not assignable in cpusets/cgroups. On unplug that's not
+	 * necessarily a visible issue, but it is still inconsistent state,
+	 * which is the real problem which needs to be "fixed". This can't
+	 * prevent the transient state between scheduling the work and
+	 * returning from waiting for it.
+	 */
+	if (!tasks_frozen)
+		cpuset_wait_for_hotplug();
+}
+
 #ifdef CONFIG_HOTPLUG_CPU
 #ifndef arch_clear_mm_cpumask_cpu
 #define arch_clear_mm_cpumask_cpu(cpu, mm) cpumask_clear_cpu(cpu, mm_cpumask(mm))
@@ -1155,5 +1108,6 @@
 	 */
 	lockup_detector_cleanup();
 	arch_smt_update();
+	cpu_up_down_serialize_trainwrecks(tasks_frozen);
 	return ret;
 }
@@ -1350,5 +1302,6 @@
 out:
 	cpus_write_unlock();
 	arch_smt_update();
+	cpu_up_down_serialize_trainwrecks(tasks_frozen);
 	return ret;
 }