Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

drm/sched: Document race condition in drm_sched_fini()

In drm_sched_fini() all entities are marked as stopped - without taking
the appropriate lock, because that would deadlock. That means that
drm_sched_fini() and drm_sched_entity_push_job() can race against each
other.
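
The inversion can be illustrated with a small user-space model. This is only a sketch: every name in it (toy_lockdep, toy_acquire() and so on) is hypothetical and none of it is kernel API. It assumes that the push path takes the entity lock before the run-queue lock, which is what the FIXME comment in this patch implies, and it records the order in which locks are nested so that a later, reversed nesting can be flagged.

```c
#include <stdbool.h>

/* Hypothetical toy model of lock ordering; not the kernel's spinlock API. */
enum toy_lock { TOY_ENTITY_LOCK = 0, TOY_RQ_LOCK = 1, TOY_NR_LOCKS };

struct toy_lockdep {
	bool held[TOY_NR_LOCKS];
	/* order_seen[a][b] is true once b was taken while a was held. */
	bool order_seen[TOY_NR_LOCKS][TOY_NR_LOCKS];
};

/* Returns false if taking 'l' now would reverse a recorded nesting. */
static bool toy_acquire(struct toy_lockdep *ld, enum toy_lock l)
{
	for (int h = 0; h < TOY_NR_LOCKS; h++) {
		if (!ld->held[h])
			continue;
		if (ld->order_seen[l][h])
			return false; /* some path already took h while holding l */
		ld->order_seen[h][l] = true;
	}
	ld->held[l] = true;
	return true;
}

static void toy_release(struct toy_lockdep *ld, enum toy_lock l)
{
	ld->held[l] = false;
}

/* Assumed push path: entity lock first, then run-queue lock. */
static bool toy_push_job_path(struct toy_lockdep *ld)
{
	bool ok = toy_acquire(ld, TOY_ENTITY_LOCK);

	ok = ok && toy_acquire(ld, TOY_RQ_LOCK);
	toy_release(ld, TOY_RQ_LOCK);
	toy_release(ld, TOY_ENTITY_LOCK);
	return ok;
}

/* A fini path that took the entity lock while holding the run-queue lock. */
static bool toy_fini_path_with_entity_lock(struct toy_lockdep *ld)
{
	bool ok = toy_acquire(ld, TOY_RQ_LOCK);

	ok = ok && toy_acquire(ld, TOY_ENTITY_LOCK);
	toy_release(ld, TOY_ENTITY_LOCK);
	toy_release(ld, TOY_RQ_LOCK);
	return ok;
}
```

In this model, running toy_push_job_path() and then toy_fini_path_with_entity_lock() on the same zero-initialized toy_lockdep makes the second call return false, because the two paths nest the locks in opposite orders.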

This should most likely be fixed by establishing the rule that all
entities associated with a scheduler must be torn down first. Then,
however, the locking should be removed from drm_sched_fini() altogether
with an appropriate comment.
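
A rough user-space sketch of what such a lifetime rule could look like follows. It is not taken from the scheduler code: the toy_sched and toy_entity types, the entity counter and the -EBUSY return value are all assumptions made for illustration. The idea is that the scheduler teardown refuses to run while entities are still attached, so it never has to touch entity state and needs no entity lock.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for a scheduler and its entities. */
struct toy_sched {
	unsigned int nr_entities; /* entities currently attached */
	int torn_down;            /* set once teardown succeeded */
};

struct toy_entity {
	struct toy_sched *sched;  /* NULL once the entity is torn down */
};

static void toy_entity_init(struct toy_entity *e, struct toy_sched *s)
{
	e->sched = s;
	s->nr_entities++;
}

static void toy_entity_fini(struct toy_entity *e)
{
	if (!e->sched)
		return; /* already torn down */
	e->sched->nr_entities--;
	e->sched = NULL;
}

/*
 * Enforces the rule "entities first, scheduler last": teardown fails
 * while any entity is attached instead of marking entities as stopped.
 */
static int toy_sched_fini(struct toy_sched *s)
{
	if (s->nr_entities)
		return -EBUSY;
	s->torn_down = 1;
	return 0;
}
```

With this shape, calling toy_sched_fini() while an entity is still attached returns -EBUSY, and it succeeds only after toy_entity_fini() has run for every entity. A real implementation would also have to decide how to report or handle a violation of the rule, which this sketch does not address.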

Reported-by: James Flowers <bold.zone2373@fastmail.com>
Link: https://lore.kernel.org/dri-devel/20250720235748.2798-1-bold.zone2373@fastmail.com/
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250813085654.102504-2-phasta@kernel.org

 drivers/gpu/drm/scheduler/sched_main.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1424,6 +1424,22 @@
  * Prevents reinsertion and marks job_queue as idle,
  * it will be removed from the rq in drm_sched_entity_fini()
  * eventually
+ *
+ * FIXME:
+ * This lacks the proper spin_lock(&s_entity->lock) and
+ * is, therefore, a race condition. Most notably, it
+ * can race with drm_sched_entity_push_job(). The lock
+ * cannot be taken here, however, because this would
+ * lead to lock inversion -> deadlock.
+ *
+ * The best solution probably is to enforce the life
+ * time rule of all entities having to be torn down
+ * before their scheduler. Then, however, locking could
+ * be dropped alltogether from this function.
+ *
+ * For now, this remains a potential race in all
+ * drivers that keep entities alive for longer than
+ * the scheduler.
  */
 s_entity->stopped = true;
 spin_unlock(&rq->lock);