xfs: flush inodegc workqueue tasks before cancel

The xfs_inodegc_stop() helper performs a high level flush of pending
work on the percpu queues and then runs a cancel_work_sync() on each
of the percpu work tasks to ensure all work has completed before
returning. While cancel_work_sync() waits for wq tasks to complete,
it does not guarantee work tasks have started. This means that the
_stop() helper can queue and instantly cancel a wq task without
having completed the associated work. This can be observed by
tracepoint inspection of a simple "rm -f <file>; fsfreeze -f <mnt>"
test:

xfs_destroy_inode: ... ino 0x83 ...
xfs_inode_set_need_inactive: ... ino 0x83 ...
xfs_inodegc_stop: ...
...
xfs_inodegc_start: ...
xfs_inodegc_worker: ...
xfs_inode_inactivating: ... ino 0x83 ...

The first few lines show that the inode is removed and need inactive
state set, but the inactivation work has not completed before the
inodegc mechanism stops. The inactivation doesn't actually occur
until the fs is unfrozen and the gc mechanism starts back up. Note
that this test requires fsfreeze to reproduce because xfs_freeze
indirectly invokes xfs_fs_statfs(), which calls xfs_inodegc_flush().

When this occurs, the workqueue try_to_grab_pending() logic first
tries to steal the pending bit, which does not succeed because the
bit has been set by queue_work_on(). Subsequently, it checks for
association of a pool workqueue from the work item under the pool
lock. This association is set at the point a work item is queued and
cleared when dequeued for processing. If the association exists, the
work item is removed from the queue and cancel_work_sync() returns
true. If the pwq association is cleared, the remove attempt assumes
the task is busy and retries (eventually returning false to the
caller after waiting for the work task to complete).

To avoid this race, we can flush each work item explicitly before
cancel. However, since the _queue_all() already schedules each
underlying work item, the workqueue level helpers are sufficient to
achieve the same ordering effect. E.g., the inodegc enabled flag
prevents scheduling any further work in the _stop() case. Use the
drain_workqueue() helper in this particular case to make the intent
a bit more self explanatory.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

authored by Brian Foster and committed by Darrick J. Wong 6191cf3a a8e422af

+4 -18
+4 -18
fs/xfs/xfs_icache.c
··· 1852 1852 } 1853 1853 1854 1854 /* 1855 - * Force all currently queued inode inactivation work to run immediately, and 1856 - * wait for the work to finish. Two pass - queue all the work first pass, wait 1857 - * for it in a second pass. 1855 + * Force all currently queued inode inactivation work to run immediately and 1856 + * wait for the work to finish. 1858 1857 */ 1859 1858 void 1860 1859 xfs_inodegc_flush( 1861 1860 struct xfs_mount *mp) 1862 1861 { 1863 - struct xfs_inodegc *gc; 1864 - int cpu; 1865 - 1866 1862 if (!xfs_is_inodegc_enabled(mp)) 1867 1863 return; 1868 1864 1869 1865 trace_xfs_inodegc_flush(mp, __return_address); 1870 1866 1871 1867 xfs_inodegc_queue_all(mp); 1872 - 1873 - for_each_online_cpu(cpu) { 1874 - gc = per_cpu_ptr(mp->m_inodegc, cpu); 1875 - flush_work(&gc->work); 1876 - } 1868 + flush_workqueue(mp->m_inodegc_wq); 1877 1869 } 1878 1870 1879 1871 /* ··· 1876 1884 xfs_inodegc_stop( 1877 1885 struct xfs_mount *mp) 1878 1886 { 1879 - struct xfs_inodegc *gc; 1880 - int cpu; 1881 - 1882 1887 if (!xfs_clear_inodegc_enabled(mp)) 1883 1888 return; 1884 1889 1885 1890 xfs_inodegc_queue_all(mp); 1891 + drain_workqueue(mp->m_inodegc_wq); 1886 1892 1887 - for_each_online_cpu(cpu) { 1888 - gc = per_cpu_ptr(mp->m_inodegc, cpu); 1889 - cancel_work_sync(&gc->work); 1890 - } 1891 1893 trace_xfs_inodegc_stop(mp, __return_address); 1892 1894 } 1893 1895