Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path

On the umount path, jbd2_journal_destroy() writes the latest transaction
ID (->j_tail_sequence) to be used at the next mount.

The bug is that ->j_tail_sequence does not hold the latest transaction
ID in some cases. So, at the next mount, newly assigned IDs can conflict
with remaining (not yet overwritten) transactions in the journal.

mount (id=10)
write transaction (id=11)
write transaction (id=12)
umount (id=10) <= the bug doesn't write latest ID

mount (id=10)
write transaction (id=11)
crash

mount
[recovery process]
transaction (id=11)
transaction (id=12) <= valid transaction ID, but an old commit;
                       it must not be replayed

As shown above, this bug can cause recovery failure or FS
corruption.
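
The scenario above can be reproduced with a small userspace model. This is
only a sketch, not jbd2 code: struct toy_journal, the toy_* helpers, the
fixed-size log array and the "expected ID = superblock ID + 1" recovery rule
are all assumptions made for illustration, following the numbering used in
the example above.

```c
#include <assert.h>

#define TOY_LOG_SLOTS 8

/* Hypothetical model of a journal; not the kernel's journal_t. */
struct toy_journal {
	unsigned int sb_id;              /* ID stored in the superblock */
	unsigned int last_id;            /* last transaction ID handed out */
	unsigned int next_slot;          /* next log slot to write */
	unsigned int log[TOY_LOG_SLOTS]; /* commit IDs left on disk, 0 = unused */
};

/* Mount: continue numbering from the superblock ID, log from slot 0. */
static void toy_mount(struct toy_journal *j)
{
	j->last_id = j->sb_id;
	j->next_slot = 0;
}

/* Commit one transaction; old slot contents are simply overwritten. */
static void toy_commit(struct toy_journal *j)
{
	assert(j->next_slot < TOY_LOG_SLOTS);
	j->log[j->next_slot++] = ++j->last_id;
}

/* Umount: the buggy path (write_latest_id == 0) leaves sb_id stale. */
static void toy_umount(struct toy_journal *j, int write_latest_id)
{
	if (write_latest_id)
		j->sb_id = j->last_id;
}

/* Recovery: replay slots while their IDs continue the expected sequence. */
static unsigned int toy_recover(const struct toy_journal *j)
{
	unsigned int expected = j->sb_id + 1;
	unsigned int replayed = 0;

	for (unsigned int i = 0;
	     i < TOY_LOG_SLOTS && j->log[i] == expected; i++) {
		expected++;
		replayed++;
	}
	return replayed;
}

/* Run the scenario and return how many transactions recovery replays. */
static unsigned int toy_scenario(int write_latest_id)
{
	struct toy_journal j = { .sb_id = 10 };

	toy_mount(&j);
	toy_commit(&j);                  /* id=11 */
	toy_commit(&j);                  /* id=12 */
	toy_umount(&j, write_latest_id);

	toy_mount(&j);
	toy_commit(&j);                  /* id=11 if stale, id=13 otherwise */
	/* crash: nothing more reaches the disk */

	return toy_recover(&j);
}
```

In this model toy_scenario(0) replays two transactions, the second being the
stale id=12 from the first session, while toy_scenario(1) replays only the
one transaction written before the crash.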

So why doesn't ->j_tail_sequence point to the latest ID?

Because if checkpoint transactions were reclaimed under memory pressure
(i.e. bdev_try_to_free_page()), then ->j_tail_sequence is not updated.
(Another case is when __jbd2_journal_clean_checkpoint_list() is called
with an empty transaction.)

So in the above cases, ->j_tail_sequence does not point to the latest
transaction ID on the umount path. In addition, REQ_FLUSH for the
checkpoint is not issued either.

So, to fix this problem with minimal changes, this patch updates
->j_tail_sequence and issues REQ_FLUSH. (With more complex changes,
some optimizations would be possible, for example to avoid an
unnecessary REQ_FLUSH.)
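
Why the flush matters can be sketched with a toy disk that has a volatile
write cache. This is an illustrative model only: struct toy_disk, the TOY_*
flags and the toy_* helpers are hypothetical names. The only real semantics
assumed are that a FUA write itself reaches stable media, and that a flush
request first drains previously cached writes.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_REQ_FLUSH 0x1u /* hypothetical: drain the volatile cache first */
#define TOY_REQ_FUA   0x2u /* hypothetical: this write goes to stable media */

enum { TOY_BLK_DATA = 0, TOY_BLK_JOURNAL_SB = 1, TOY_NBLOCKS = 2 };

/* Hypothetical disk with a volatile write cache in front of stable media. */
struct toy_disk {
	int stable[TOY_NBLOCKS];
	int cache[TOY_NBLOCKS];
	bool dirty[TOY_NBLOCKS];
};

/* Move every cached write to stable media. */
static void toy_flush(struct toy_disk *d)
{
	for (int i = 0; i < TOY_NBLOCKS; i++) {
		if (d->dirty[i]) {
			d->stable[i] = d->cache[i];
			d->dirty[i] = false;
		}
	}
}

static void toy_write(struct toy_disk *d, int blk, int val, unsigned int flags)
{
	if (flags & TOY_REQ_FLUSH)
		toy_flush(d);
	if (flags & TOY_REQ_FUA) {
		d->stable[blk] = val;   /* bypasses the cache */
		d->dirty[blk] = false;
	} else {
		d->cache[blk] = val;    /* may be lost on power failure */
		d->dirty[blk] = true;
	}
}

/* Power loss: whatever was only in the cache is gone. */
static void toy_power_loss(struct toy_disk *d)
{
	for (int i = 0; i < TOY_NBLOCKS; i++)
		d->dirty[i] = false;
}

/*
 * Checkpoint one data block, mark the journal empty with the given flags,
 * then lose power. Returns true if the checkpointed data is on stable
 * media whenever the superblock says the journal is empty.
 */
static bool toy_umount_is_safe(unsigned int sb_write_flags)
{
	struct toy_disk d = {0};

	toy_write(&d, TOY_BLK_DATA, 42, 0);
	toy_write(&d, TOY_BLK_JOURNAL_SB, 1, sb_write_flags);
	toy_power_loss(&d);

	return d.stable[TOY_BLK_JOURNAL_SB] == 1 &&
	       d.stable[TOY_BLK_DATA] == 42;
}
```

In this model toy_umount_is_safe(TOY_REQ_FUA) returns false, because the
superblock says "empty" while the checkpointed block was still in the cache.
With TOY_REQ_FLUSH | TOY_REQ_FUA it returns true.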

BTW,

	journal->j_tail_sequence =
		++journal->j_transaction_sequence;

The increment of ->j_transaction_sequence seems to be unnecessary, but
ext3 does this.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

Authored by OGAWA Hirofumi and committed by Theodore Ts'o
c0a2ad9b 2d90c160

+12 -5
fs/jbd2/journal.c
···
 /**
  * jbd2_mark_journal_empty() - Mark on disk journal as empty.
  * @journal: The journal to update.
+ * @write_op: With which operation should we write the journal sb
  *
  * Update a journal's dynamic superblock fields to show that journal is empty.
  * Write updated superblock to disk waiting for IO to complete.
  */
-static void jbd2_mark_journal_empty(journal_t *journal)
+static void jbd2_mark_journal_empty(journal_t *journal, int write_op)
 {
 	journal_superblock_t *sb = journal->j_superblock;

···
 	sb->s_start = cpu_to_be32(0);
 	read_unlock(&journal->j_state_lock);

-	jbd2_write_superblock(journal, WRITE_FUA);
+	jbd2_write_superblock(journal, write_op);

 	/* Log is no longer empty */
 	write_lock(&journal->j_state_lock);
···
 	if (journal->j_sb_buffer) {
 		if (!is_journal_aborted(journal)) {
 			mutex_lock(&journal->j_checkpoint_mutex);
-			jbd2_mark_journal_empty(journal);
+
+			write_lock(&journal->j_state_lock);
+			journal->j_tail_sequence =
+				++journal->j_transaction_sequence;
+			write_unlock(&journal->j_state_lock);
+
+			jbd2_mark_journal_empty(journal, WRITE_FLUSH_FUA);
 			mutex_unlock(&journal->j_checkpoint_mutex);
 		} else
 			err = -EIO;
···
 	 * the magic code for a fully-recovered superblock. Any future
 	 * commits of data to the journal will restore the current
 	 * s_start value. */
-	jbd2_mark_journal_empty(journal);
+	jbd2_mark_journal_empty(journal, WRITE_FUA);
 	mutex_unlock(&journal->j_checkpoint_mutex);
 	write_lock(&journal->j_state_lock);
 	J_ASSERT(!journal->j_running_transaction);
···
 	if (write) {
 		/* Lock to make assertions happy... */
 		mutex_lock(&journal->j_checkpoint_mutex);
-		jbd2_mark_journal_empty(journal);
+		jbd2_mark_journal_empty(journal, WRITE_FUA);
 		mutex_unlock(&journal->j_checkpoint_mutex);
 	}