xfs: limit iclog tail updates

From the department of "generic/482 keeps on giving", we bring you
another tail update race condition:

iclog:
	S1			C1
	+-----------------------+-----------------------+
		  S2		EOIC

Two checkpoints in a single iclog. One is complete, the other just
contains the start record and overruns into a new iclog.

Timeline:

Before S1:	Cache flush, log tail = X
At S1:		Metadata stable, write start record and checkpoint
At C1:		Write commit record, set NEED_FUA
		Single iclog checkpoint, so no need for NEED_FLUSH
		Log tail still = X, so no need for NEED_FLUSH

After C1,
Before S2:	Cache flush, log tail = X
At S2:		Metadata stable, write start record and checkpoint
After S2:	Log tail moves to X+1
At EOIC:	End of iclog, more journal data to write
		Releases iclog
		Not a commit iclog, so no need for NEED_FLUSH
		Writes log tail X+1 into iclog.

At this point, the iclog has tail X+1 and NEED_FUA set. There has
been no cache flush for the metadata between X and X+1, and the
iclog writes the new tail permanently to the log. This is sufficient
to violate on-disk metadata/journal ordering.

We have two options here. The first is to detect this case in some
manner and ensure that the partial checkpoint write sets NEED_FLUSH
when the iclog is already marked NEED_FUA and the log tail changes.
This seems somewhat fragile and quite complex to get right, and
reading the code would not make it obvious what underlying problem
it is addressing.

The second option seems much cleaner to me, because it is derived
directly from the requirements of the C1 commit record in the iclog.
That is, when we write this commit record to the iclog, we've
guaranteed that the metadata/data ordering is correct for tail
update purposes. Hence if we only write the log tail into the iclog
for the *first* commit record rather than the log tail at the last
release, we guarantee that the log tail does not move past where the
first commit record in the log expects it to be.

IOWs, taking the first option means that replay of C1 becomes
dependent on future operations doing the right thing, not just the
C1 checkpoint itself doing the right thing. This makes log recovery
almost impossible to reason about because now we have to take into
account what might or might not have happened in the future when
looking at checkpoints in the log rather than just having to
reconstruct the past...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

authored by Dave Chinner and committed by Darrick J. Wong 9d110014 b2ae3a9e

Changed files (+37 -13):
fs/xfs/xfs_log.c
···
 STATIC void
 xlog_verify_tail_lsn(
 	struct xlog		*log,
-	struct xlog_in_core	*iclog,
-	xfs_lsn_t		tail_lsn);
+	struct xlog_in_core	*iclog);
 #else
 #define xlog_verify_dest_ptr(a,b)
 #define xlog_verify_grant_tail(a)
 #define xlog_verify_iclog(a,b,c)
-#define xlog_verify_tail_lsn(a,b,c)
+#define xlog_verify_tail_lsn(a,b)
 #endif

 STATIC int
···

 /*
  * Flush iclog to disk if this is the last reference to the given iclog and the
- * it is in the WANT_SYNC state.  If the caller passes in a non-zero
- * @old_tail_lsn and the current log tail does not match, there may be metadata
- * on disk that must be persisted before this iclog is written.  To satisfy that
- * requirement, set the XLOG_ICL_NEED_FLUSH flag as a condition for writing this
- * iclog with the new log tail value.
+ * it is in the WANT_SYNC state.
+ *
+ * If the caller passes in a non-zero @old_tail_lsn and the current log tail
+ * does not match, there may be metadata on disk that must be persisted before
+ * this iclog is written. To satisfy that requirement, set the
+ * XLOG_ICL_NEED_FLUSH flag as a condition for writing this iclog with the new
+ * log tail value.
+ *
+ * If XLOG_ICL_NEED_FUA is already set on the iclog, we need to ensure that the
+ * log tail is updated correctly. NEED_FUA indicates that the iclog will be
+ * written to stable storage, and implies that a commit record is contained
+ * within the iclog. We need to ensure that the log tail does not move beyond
+ * the tail that the first commit record in the iclog ordered against, otherwise
+ * correct recovery of that checkpoint becomes dependent on future operations
+ * performed on this iclog.
+ *
+ * Hence if NEED_FUA is set and the current iclog tail lsn is empty, write the
+ * current tail into iclog. Once the iclog tail is set, future operations must
+ * not modify it, otherwise they potentially violate ordering constraints for
+ * the checkpoint commit that wrote the initial tail lsn value. The tail lsn in
+ * the iclog will get zeroed on activation of the iclog after sync, so we
+ * always capture the tail lsn on the iclog on the first NEED_FUA release
+ * regardless of the number of active reference counts on this iclog.
  */
+
 int
 xlog_state_release_iclog(
 	struct xlog		*log,
···
 		if (old_tail_lsn && tail_lsn != old_tail_lsn)
 			iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
+
+		if ((iclog->ic_flags & XLOG_ICL_NEED_FUA) &&
+		    !iclog->ic_header.h_tail_lsn)
+			iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
 	}
···
 	}

 	iclog->ic_state = XLOG_STATE_SYNCING;
-	iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
-	xlog_verify_tail_lsn(log, iclog, tail_lsn);
+	if (!iclog->ic_header.h_tail_lsn)
+		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
+	xlog_verify_tail_lsn(log, iclog);
 	trace_xlog_iclog_syncing(iclog, _RET_IP_);

 	spin_unlock(&log->l_icloglock);
···
 	memset(iclog->ic_header.h_cycle_data, 0,
 	       sizeof(iclog->ic_header.h_cycle_data));
 	iclog->ic_header.h_lsn = 0;
+	iclog->ic_header.h_tail_lsn = 0;
 }
···
 STATIC void
 xlog_verify_tail_lsn(
 	struct xlog		*log,
-	struct xlog_in_core	*iclog,
-	xfs_lsn_t		tail_lsn)
+	struct xlog_in_core	*iclog)
 {
-	int	blocks;
+	xfs_lsn_t	tail_lsn = be64_to_cpu(iclog->ic_header.h_tail_lsn);
+	int		blocks;

 	if (CYCLE_LSN(tail_lsn) == log->l_prev_cycle) {
 		blocks =
···