Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

drm/v3d: Add job to pending list if the reset was skipped

When a CL/CSD job times out, we check if the GPU has made any progress
since the last timeout. If so, instead of resetting the hardware, we skip
the reset and let the timer get rearmed. This gives long-running jobs a
chance to complete.

However, when `timedout_job()` is called, the job in question is removed
from the pending list, which means it won't be automatically freed through
`free_job()`. Consequently, when we skip the reset and keep the job
running, the job won't be freed when it finally completes.

This situation leads to a memory leak, as exposed in [1] and [2].

Similarly to commit 704d3d60fec4 ("drm/etnaviv: don't block scheduler when
GPU is still active"), this patch ensures the job is put back on the
pending list when extending the timeout.

Cc: stable@vger.kernel.org # 6.0
Reported-by: Daivik Bhatia <dtgs1208@gmail.com>
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12227 [1]
Closes: https://github.com/raspberrypi/linux/issues/6817 [2]
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Acked-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Link: https://lore.kernel.org/r/20250430210643.57924-1-mcanal@igalia.com
Signed-off-by: Maíra Canal <mcanal@igalia.com>

+21 -7
drivers/gpu/drm/v3d/v3d_sched.c
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -744,11 +744,16 @@
 	return DRM_GPU_SCHED_STAT_NOMINAL;
 }
 
-/* If the current address or return address have changed, then the GPU
- * has probably made progress and we should delay the reset. This
- * could fail if the GPU got in an infinite loop in the CL, but that
- * is pretty unlikely outside of an i-g-t testcase.
- */
+static void
+v3d_sched_skip_reset(struct drm_sched_job *sched_job)
+{
+	struct drm_gpu_scheduler *sched = sched_job->sched;
+
+	spin_lock(&sched->job_list_lock);
+	list_add(&sched_job->list, &sched->pending_list);
+	spin_unlock(&sched->job_list_lock);
+}
+
 static enum drm_gpu_sched_stat
 v3d_cl_job_timedout(struct drm_sched_job *sched_job, enum v3d_queue q,
 		    u32 *timedout_ctca, u32 *timedout_ctra)
@@ -758,8 +763,15 @@
 	u32 ctca = V3D_CORE_READ(0, V3D_CLE_CTNCA(q));
 	u32 ctra = V3D_CORE_READ(0, V3D_CLE_CTNRA(q));
 
+	/* If the current address or return address have changed, then the GPU
+	 * has probably made progress and we should delay the reset. This
+	 * could fail if the GPU got in an infinite loop in the CL, but that
+	 * is pretty unlikely outside of an i-g-t testcase.
+	 */
 	if (*timedout_ctca != ctca || *timedout_ctra != ctra) {
 		*timedout_ctca = ctca;
 		*timedout_ctra = ctra;
+
+		v3d_sched_skip_reset(sched_job);
 		return DRM_GPU_SCHED_STAT_NOMINAL;
 	}
@@ -800,11 +812,13 @@
 	struct v3d_dev *v3d = job->base.v3d;
 	u32 batches = V3D_CORE_READ(0, V3D_CSD_CURRENT_CFG4(v3d->ver));
 
-	/* If we've made progress, skip reset and let the timer get
-	 * rearmed.
+	/* If we've made progress, skip reset, add the job to the pending
+	 * list, and let the timer get rearmed.
 	 */
 	if (job->timedout_batches != batches) {
 		job->timedout_batches = batches;
+
+		v3d_sched_skip_reset(sched_job);
 		return DRM_GPU_SCHED_STAT_NOMINAL;
 	}