Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: vmscan: move dirty pages out of the way until they're flushed

We noticed a performance regression when moving hadoop workloads from
3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout
activity initiated by kswapd as well as frequent bursts of allocation
stalls and direct reclaim scans. Even lowering the dirty ratios to the
equivalent of less than 1% of memory would not eliminate the issue,
suggesting that dirty pages concentrate where the scanner is looking.

This can be traced back to recent efforts at thrash avoidance. Where
3.10 would not detect refaulting pages and continuously supply clean
cache to the inactive list, a thrashing workload on 4.0+ will detect and
activate refaulting pages right away, distilling used-once pages on the
inactive list much more effectively. This is by design, and it makes
sense for clean cache. But for the most part our workload's cache
faults are refaults and its use-once cache is from streaming writes. We
end up with most of the inactive list dirty, and we don't go after the
active cache as long as we have use-once pages around.

But waiting for writes to avoid reclaiming clean cache that *might*
refault is a bad trade-off. Even if the refaults happen, reads are
faster than writes. Before getting bogged down on writeback, reclaim
should first look at *all* cache in the system, even active cache.

To accomplish this, activate pages that are dirty or under writeback
when they reach the end of the inactive LRU. The pages are marked for
immediate reclaim, meaning they'll get moved back to the inactive LRU
tail as soon as they're written back and become reclaimable. But in the
meantime, by reducing the inactive list to only immediately reclaimable
pages, we allow the scanner to deactivate and refill the inactive list
with clean cache from the active list tail to guarantee forward
progress.

[hannes@cmpxchg.org: update comment]
Link: http://lkml.kernel.org/r/20170202191957.22872-8-hannes@cmpxchg.org
Link: http://lkml.kernel.org/r/20170123181641.23938-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by Johannes Weiner, committed by Linus Torvalds
c55e8d03 4eda4823

+24 -7
+7
include/linux/mm_inline.h
@@ -50,6 +50,13 @@
 	list_add(&page->lru, &lruvec->lists[lru]);
 }
 
+static __always_inline void add_page_to_lru_list_tail(struct page *page,
+				struct lruvec *lruvec, enum lru_list lru)
+{
+	update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
+	list_add_tail(&page->lru, &lruvec->lists[lru]);
+}
+
 static __always_inline void del_page_from_lru_list(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
+5 -4
mm/swap.c
@@ -209,9 +209,10 @@
 {
 	int *pgmoved = arg;
 
-	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
-		enum lru_list lru = page_lru_base_type(page);
-		list_move_tail(&page->lru, &lruvec->lists[lru]);
+	if (PageLRU(page) && !PageUnevictable(page)) {
+		del_page_from_lru_list(page, lruvec, page_lru(page));
+		ClearPageActive(page);
+		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
 		(*pgmoved)++;
 	}
 }
@@ -235,7 +236,7 @@
  */
 void rotate_reclaimable_page(struct page *page)
 {
-	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
+	if (!PageLocked(page) && !PageDirty(page) &&
 	    !PageUnevictable(page) && PageLRU(page)) {
 		struct pagevec *pvec;
 		unsigned long flags;
+12 -3
mm/vmscan.c
@@ -1056,6 +1056,15 @@
 		 * throttling so we could easily OOM just because too many
 		 * pages are in writeback and there is nothing else to
 		 * reclaim. Wait for the writeback to complete.
+		 *
+		 * In cases 1) and 2) we activate the pages to get them out of
+		 * the way while we continue scanning for clean pages on the
+		 * inactive list and refilling from the active list. The
+		 * observation here is that waiting for disk writes is more
+		 * expensive than potentially causing reloads down the line.
+		 * Since they're marked for immediate reclaim, they won't put
+		 * memory pressure on the cache working set any longer than it
+		 * takes to write them to disk.
 		 */
 		if (PageWriteback(page)) {
 			/* Case 1 above */
@@ -1063,7 +1072,7 @@
 			    PageReclaim(page) &&
 			    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
 				nr_immediate++;
-				goto keep_locked;
+				goto activate_locked;
 
 			/* Case 2 above */
 			} else if (sane_reclaim(sc) ||
@@ -1081,7 +1090,7 @@
 				 */
 				SetPageReclaim(page);
 				nr_writeback++;
-				goto keep_locked;
+				goto activate_locked;
 
 			/* Case 3 above */
 			} else {
@@ -1174,7 +1183,7 @@
 			inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
 			SetPageReclaim(page);
 
-			goto keep_locked;
+			goto activate_locked;
 		}
 
 		if (references == PAGEREF_RECLAIM_CLEAN)