Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

memcg: use new logic for page stat accounting

Currently, per-memcg page stats are recorded in per-page_cgroup flags by
duplicating the page's status into those flags. The reason is that memcg
has a feature to move a page from one group to another, so there is a
race between "move" and "page stat accounting".

Under the current logic, assume CPU-A and CPU-B: CPU-A does "move" and
CPU-B does "page stat accounting".

When CPU-A goes 1st,

	CPU-A				CPU-B
					update "struct page" info.
	move_lock_mem_cgroup(memcg)
	see pc->flags
	copy page stat to new group
	overwrite pc->mem_cgroup.
	move_unlock_mem_cgroup(memcg)
					move_lock_mem_cgroup(mem)
					set pc->flags
					update page stat accounting
					move_unlock_mem_cgroup(mem)

Stat accounting is guarded by move_lock_mem_cgroup(), so the "move"
logic (CPU-A) doesn't see in-flight changes to "struct page" information.

But it's costly to keep the same information in both 'struct page' and
'struct page_cgroup'. And there is a potential problem.

For example, assume we have PG_dirty accounting in memcg.
PG_* is a flag of struct page.
PCG_* is a flag of struct page_cgroup.
(This is just an example; the same problem can be found in any
kind of page stat accounting.)

	CPU-A				CPU-B
	TestSet PG_dirty
	(delay)				TestClear PG_dirty
					if (TestClear(PCG_dirty))
						memcg->nr_dirty--
	if (TestSet(PCG_dirty))
		memcg->nr_dirty++

Here, memcg->nr_dirty ends up at +1, which is wrong. This race was
reported by Greg Thelen <gthelen@google.com>. At present, only
FILE_MAPPED is supported and, fortunately, it's serialized by the page
table lock, so this is not a real bug, _now_.

Since this potential problem comes from having duplicated information in
struct page and struct page_cgroup, we may be able to fix it by using
the original 'struct page' information alone. But then we have a problem
with "move account".

Assume we use only PG_dirty.

	CPU-A				CPU-B
	TestSet PG_dirty
	(delay)				move_lock_mem_cgroup()
					if (PageDirty(page))
						new_memcg->nr_dirty++
					pc->mem_cgroup = new_memcg;
					move_unlock_mem_cgroup()
	move_lock_mem_cgroup()
	memcg = pc->mem_cgroup
	new_memcg->nr_dirty++

The accounting information may be double-counted. This was the original
reason to have the PCG_xxx flags, but it seems PCG_xxx has its own
problem, as shown above.

I think we need a bigger lock, as:

	move_lock_mem_cgroup(page)
	TestSetPageDirty(page)
	update page stats (without any checks)
	move_unlock_mem_cgroup(page)

This fixes both problems, and we don't have to duplicate the page flag
in page_cgroup. Please note: move_lock_mem_cgroup() is taken only when
there is a possibility of "account move" in the system, so on most paths
the status update will go without atomic locks.

This patch introduces mem_cgroup_begin_update_page_stat() and
mem_cgroup_end_update_page_stat(); both should be called when modifying
'struct page' information that memcg takes care of, as:

	mem_cgroup_begin_update_page_stat()
	modify page information
	mem_cgroup_update_page_stat()
	=> never checks any 'struct page' info, just updates counters.
	mem_cgroup_end_update_page_stat()

This patch is slow because we need to call begin_update_page_stat()/
end_update_page_stat() regardless of whether the accounted state changes
or not. A following patch adds an easy optimization to reduce the cost.

[akpm@linux-foundation.org: s/lock/locked/]
[hughd@google.com: fix deadlock by avoiding stat lock when anon]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Greg Thelen <gthelen@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by KAMEZAWA Hiroyuki, committed by Linus Torvalds
89c06bd5 312734c0

101 insertions(+), 24 deletions(-)

include/linux/memcontrol.h: 35 insertions(+)

--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -141,6 +141,31 @@
 	return false;
 }
 
+void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked,
+					unsigned long *flags);
+
+static inline void mem_cgroup_begin_update_page_stat(struct page *page,
+					bool *locked, unsigned long *flags)
+{
+	if (mem_cgroup_disabled())
+		return;
+	rcu_read_lock();
+	*locked = false;
+	return __mem_cgroup_begin_update_page_stat(page, locked, flags);
+}
+
+void __mem_cgroup_end_update_page_stat(struct page *page,
+					unsigned long *flags);
+static inline void mem_cgroup_end_update_page_stat(struct page *page,
+					bool *locked, unsigned long *flags)
+{
+	if (mem_cgroup_disabled())
+		return;
+	if (*locked)
+		__mem_cgroup_end_update_page_stat(page, flags);
+	rcu_read_unlock();
+}
+
 void mem_cgroup_update_page_stat(struct page *page,
 				 enum mem_cgroup_page_stat_item idx,
 				 int val);
@@ -338,6 +363,16 @@
 
 static inline void
 mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
+{
+}
+
+static inline void mem_cgroup_begin_update_page_stat(struct page *page,
+					bool *locked, unsigned long *flags)
+{
+}
+
+static inline void mem_cgroup_end_update_page_stat(struct page *page,
+					bool *locked, unsigned long *flags)
 {
 }
 
mm/memcontrol.c: 42 insertions(+), 20 deletions(-)

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1910,32 +1910,59 @@
  * If there is, we take a lock.
  */
 
+void __mem_cgroup_begin_update_page_stat(struct page *page,
+				bool *locked, unsigned long *flags)
+{
+	struct mem_cgroup *memcg;
+	struct page_cgroup *pc;
+
+	pc = lookup_page_cgroup(page);
+again:
+	memcg = pc->mem_cgroup;
+	if (unlikely(!memcg || !PageCgroupUsed(pc)))
+		return;
+	/*
+	 * If this memory cgroup is not under account moving, we don't
+	 * need to take move_lock_page_cgroup(). Because we already hold
+	 * rcu_read_lock(), any calls to move_account will be delayed until
+	 * rcu_read_unlock() if mem_cgroup_stealed() == true.
+	 */
+	if (!mem_cgroup_stealed(memcg))
+		return;
+
+	move_lock_mem_cgroup(memcg, flags);
+	if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
+		move_unlock_mem_cgroup(memcg, flags);
+		goto again;
+	}
+	*locked = true;
+}
+
+void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
+{
+	struct page_cgroup *pc = lookup_page_cgroup(page);
+
+	/*
+	 * It's guaranteed that pc->mem_cgroup never changes while
+	 * lock is held because a routine modifies pc->mem_cgroup
+	 * should take move_lock_page_cgroup().
+	 */
+	move_unlock_mem_cgroup(pc->mem_cgroup, flags);
+}
+
 void mem_cgroup_update_page_stat(struct page *page,
 				 enum mem_cgroup_page_stat_item idx, int val)
 {
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
-	bool need_unlock = false;
 	unsigned long uninitialized_var(flags);
 
 	if (mem_cgroup_disabled())
 		return;
-again:
-	rcu_read_lock();
+
 	memcg = pc->mem_cgroup;
 	if (unlikely(!memcg || !PageCgroupUsed(pc)))
-		goto out;
-	/* pc->mem_cgroup is unstable ? */
-	if (unlikely(mem_cgroup_stealed(memcg))) {
-		/* take a lock against to access pc->mem_cgroup */
-		move_lock_mem_cgroup(memcg, &flags);
-		if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
-			move_unlock_mem_cgroup(memcg, &flags);
-			rcu_read_unlock();
-			goto again;
-		}
-		need_unlock = true;
-	}
+		return;
 
 	switch (idx) {
 	case MEMCG_NR_FILE_MAPPED:
@@ -1950,11 +1977,6 @@
 	}
 
 	this_cpu_add(memcg->stat->count[idx], val);
-
-out:
-	if (unlikely(need_unlock))
-		move_unlock_mem_cgroup(memcg, &flags);
-	rcu_read_unlock();
 }
 
 /*
mm/rmap.c: 24 insertions(+), 4 deletions(-)

--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1148,10 +1148,15 @@
  */
 void page_add_file_rmap(struct page *page)
 {
+	bool locked;
+	unsigned long flags;
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 
 /**
@@ -1162,9 +1167,21 @@
  */
 void page_remove_rmap(struct page *page)
 {
+	bool anon = PageAnon(page);
+	bool locked;
+	unsigned long flags;
+
+	/*
+	 * The anon case has no mem_cgroup page_stat to update; but may
+	 * uncharge_page() below, where the lock ordering can deadlock if
+	 * we hold the lock against page_stat move: so avoid it on anon.
+	 */
+	if (!anon)
+		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
 	/* page still mapped by someone else? */
 	if (!atomic_add_negative(-1, &page->_mapcount))
-		return;
+		goto out;
 
 	/*
 	 * Now that the last pte has gone, s390 must transfer dirty
@@ -1173,7 +1190,7 @@
 	 * not if it's in swapcache - there might be another pte slot
 	 * containing the swap entry, but page not yet written to swap.
 	 */
-	if ((!PageAnon(page) || PageSwapCache(page)) &&
+	if ((!anon || PageSwapCache(page)) &&
 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
 		set_page_dirty(page);
 	/*
@@ -1181,8 +1198,8 @@
 	 * and not charged by memcg for now.
 	 */
 	if (unlikely(PageHuge(page)))
-		return;
-	if (PageAnon(page)) {
+		goto out;
+	if (anon) {
 		mem_cgroup_uncharge_page(page);
 		if (!PageTransHuge(page))
 			__dec_zone_page_state(page, NR_ANON_PAGES);
@@ -1202,6 +1219,9 @@
 	 * Leaving it set also helps swapoff to reinstate ptes
 	 * faster for those pages still in swapcache.
 	 */
+out:
+	if (!anon)
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 
 /*