Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

jbd: Issue cache flush after checkpointing

When we reach cleanup_journal_tail(), there is no guarantee that
checkpointed buffers are on a stable storage - especially if buffers were
written out by log_do_checkpoint(), they are likely to be only in disk's
caches. Thus when we update journal superblock, effectively removing old
transaction from journal, this write of superblock can get to stable storage
before those checkpointed buffers which can result in filesystem corruption
after a crash.

A similar problem can happen if we replay the journal and wipe it before
flushing disk's caches.

Thus we must unconditionally issue a cache flush before we update journal
superblock in these cases. The fix is slightly complicated by the fact that we
have to get log tail before we issue cache flush but we can store it in the
journal superblock only after the cache flush. Otherwise we risk races where
new tail is written before appropriate cache flush is finished.

I managed to reproduce the corruption using somewhat tweaked Chris Mason's
barrier-test scheduler. Also this should fix occasional reports of 'Bit already
freed' filesystem errors which are totally unreproducible but inspection of
several fs images I've gathered over time points to a problem like this.

CC: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>

Jan Kara 353b67d8 e4e11180

+26 -5
+22 -5
fs/jbd/checkpoint.c
··· 453 453 * 454 454 * Return <0 on error, 0 on success, 1 if there was nothing to clean up. 455 455 * 456 - * Called with the journal lock held. 457 - * 458 456 * This is the only part of the journaling code which really needs to be 459 457 * aware of transaction aborts. Checkpointing involves writing to the 460 458 * main filesystem area rather than to the journal, so it can proceed ··· 470 472 if (is_journal_aborted(journal)) 471 473 return 1; 472 474 473 - /* OK, work out the oldest transaction remaining in the log, and 475 + /* 476 + * OK, work out the oldest transaction remaining in the log, and 474 477 * the log block it starts at. 475 478 * 476 479 * If the log is now empty, we need to work out which is the 477 480 * next transaction ID we will write, and where it will 478 - * start. */ 479 - 481 + * start. 482 + */ 480 483 spin_lock(&journal->j_state_lock); 481 484 spin_lock(&journal->j_list_lock); 482 485 transaction = journal->j_checkpoint_transactions; ··· 503 504 spin_unlock(&journal->j_state_lock); 504 505 return 1; 505 506 } 507 + spin_unlock(&journal->j_state_lock); 506 508 509 + /* 510 + * We need to make sure that any blocks that were recently written out 511 + * --- perhaps by log_do_checkpoint() --- are flushed out before we 512 + * drop the transactions from the journal. It's unlikely this will be 513 + * necessary, especially with an appropriately sized journal, but we 514 + * need this to guarantee correctness. Fortunately 515 + * cleanup_journal_tail() doesn't get called all that often. 516 + */ 517 + if (journal->j_flags & JFS_BARRIER) 518 + blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL, NULL); 519 + 520 + spin_lock(&journal->j_state_lock); 521 + if (!tid_gt(first_tid, journal->j_tail_sequence)) { 522 + spin_unlock(&journal->j_state_lock); 523 + /* Someone else cleaned up journal so return 0 */ 524 + return 0; 525 + } 507 526 /* OK, update the superblock to recover the freed space. 508 527 * Physical blocks come first: have we wrapped beyond the end of 509 528 * the log? */
+4
fs/jbd/recovery.c
··· 20 20 #include <linux/fs.h> 21 21 #include <linux/jbd.h> 22 22 #include <linux/errno.h> 23 + #include <linux/blkdev.h> 23 24 #endif 24 25 25 26 /* ··· 264 263 err2 = sync_blockdev(journal->j_fs_dev); 265 264 if (!err) 266 265 err = err2; 266 + /* Flush disk caches to get replayed data on the permanent storage */ 267 + if (journal->j_flags & JFS_BARRIER) 268 + blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL, NULL); 267 269 268 270 return err; 269 271 }