Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

drm/i915/pmu: Fix synchronization of PMU callback with reset

Since the PMU callback runs in irq context, it synchronizes with gt
reset using the reset count. However, the callback can read the reset
count before the reset flow has updated it, which can corrupt the
busyness stats.

In addition to the reset count, check whether the reset bit
(I915_RESET_BACKOFF) is set before capturing busyness.

Also, save the previous stats only when they are about to be updated.

v2:
- The 2 reset counts captured in the PMU callback can end up being the
same if they were captured right after the count is incremented in the
reset flow. This can lead to a bad busyness state. Ensure that reset
is not in progress when the initial reset count is captured.

Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211108211057.68783-1-umesh.nerlige.ramappa@intel.com

Authored by Umesh Nerlige Ramappa and committed by John Harrison
2a67b18e 95d35838

+11 -6
drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1183,15 +1183,20 @@
 	u64 total, gt_stamp_saved;
 	unsigned long flags;
 	u32 reset_count;
+	bool in_reset;
 
 	spin_lock_irqsave(&guc->timestamp.lock, flags);
 
 	/*
-	 * If a reset happened, we risk reading partially updated
-	 * engine busyness from GuC, so we just use the driver stored
-	 * copy of busyness. Synchronize with gt reset using reset_count.
+	 * If a reset happened, we risk reading partially updated engine
+	 * busyness from GuC, so we just use the driver stored copy of busyness.
+	 * Synchronize with gt reset using reset_count and the
+	 * I915_RESET_BACKOFF flag. Note that reset flow updates the reset_count
+	 * after I915_RESET_BACKOFF flag, so ensure that the reset_count is
+	 * usable by checking the flag afterwards.
 	 */
 	reset_count = i915_reset_count(gpu_error);
+	in_reset = test_bit(I915_RESET_BACKOFF, &gt->reset.flags);
 
 	*now = ktime_get();
 
@@ -1201,9 +1206,9 @@
 	 * start_gt_clk is derived from GuC state. To get a consistent
 	 * view of activity, we query the GuC state only if gt is awake.
 	 */
-	stats_saved = *stats;
-	gt_stamp_saved = guc->timestamp.gt_stamp;
-	if (intel_gt_pm_get_if_awake(gt)) {
+	if (intel_gt_pm_get_if_awake(gt) && !in_reset) {
+		stats_saved = *stats;
+		gt_stamp_saved = guc->timestamp.gt_stamp;
 		guc_update_engine_gt_clks(engine);
 		guc_update_pm_timestamp(guc, engine, now);
 		intel_gt_pm_put_async(gt);
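
The ordering described in the commit message (snapshot the reset count, then test the reset flag, and touch the stats only when no reset is in flight) can be sketched as a small user-space model. This is an illustration, not the driver code: `struct gt_model`, `gt_model_init()`, `reset_begin()`, `reset_finish()` and `sample_busyness()` are hypothetical names, C11 atomics stand in for the kernel's bit operations, and the model assumes the reset flow bumps the count before clearing the flag. The restore on a count mismatch is likewise an assumption drawn from the commit message's mention of two captured reset counts.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-threaded model; none of these names exist in i915. */
struct gt_model {
	atomic_bool in_backoff;   /* stands in for I915_RESET_BACKOFF */
	atomic_uint reset_count;  /* stands in for i915_reset_count() */
	uint64_t hw_busy;         /* value reported by firmware; unreliable during reset */
	uint64_t saved_busy;      /* driver-side copy handed to the PMU */
};

static void gt_model_init(struct gt_model *gt, uint64_t busy)
{
	atomic_init(&gt->in_backoff, false);
	atomic_init(&gt->reset_count, 0);
	gt->hw_busy = busy;
	gt->saved_busy = 0;
}

/* Reset flow: the flag is raised first ... */
static void reset_begin(struct gt_model *gt)
{
	atomic_store(&gt->in_backoff, true);
}

/* ... and the count is bumped later (assumed: before the flag is cleared). */
static void reset_finish(struct gt_model *gt)
{
	atomic_fetch_add(&gt->reset_count, 1);
	atomic_store(&gt->in_backoff, false);
}

/* Sampler: read the count first, then the flag, mirroring the patch. */
static uint64_t sample_busyness(struct gt_model *gt)
{
	unsigned int count;
	bool in_reset;

	assert(gt);
	count = atomic_load(&gt->reset_count);
	in_reset = atomic_load(&gt->in_backoff);

	if (!in_reset) {
		/* Saved only because the stats are about to be updated. */
		uint64_t prev = gt->saved_busy;

		gt->saved_busy = gt->hw_busy;

		/* A reset completed meanwhile: fall back to the old copy. */
		if (atomic_load(&gt->reset_count) != count)
			gt->saved_busy = prev;
	}

	return gt->saved_busy;
}
```

While a reset is in flight the sampler in this model returns the stored copy and never reads `hw_busy`; once `reset_finish()` has run, the next sample picks up the fresh value. The real callback additionally holds `guc->timestamp.lock` and requires the gt to be awake, which this sketch leaves out.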