memcg: memory hotplug fix for notifier callback

Fixes for memcg/memory hotplug.

While memory hotplug allocates and frees the memmap, page_cgroup does not
free its table at OFFLINE when the table was allocated via bootmem, because
freeing bootmem requires special care.

So, if page_cgroup was allocated from bootmem and the memmap is later freed
and reallocated by memory hotplug, page_cgroup->page == page is no longer
true.
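
To illustrate with a minimal user-space model (simplified names, not the
kernel structures): the per-section page_cgroup table keeps a back-pointer
into the memmap, and that back-pointer goes stale once the memmap is freed
and reallocated:

    #include <stdio.h>
    #include <stdlib.h>

    struct page { int flags; };                    /* stand-in for a memmap entry */
    struct page_cgroup { struct page *page; };     /* back-pointer, set at init */

    int main(void)
    {
        struct page *memmap = calloc(4, sizeof(*memmap));
        struct page_cgroup pc = { .page = &memmap[0] }; /* set at first ONLINE */

        /* OFFLINE frees the memmap; a bootmem page_cgroup table survives. */
        free(memmap);
        /* The next ONLINE allocates a fresh memmap, usually elsewhere. */
        memmap = calloc(4, sizeof(*memmap));

        /* The invariant the patch now checks: base->page == pfn_to_page(pfn). */
        if (pc.page != &memmap[0])
            puts("page_cgroup->page is stale; re-initialize the table");

        free(memmap);
        return 0;
    }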

But the current MEM_ONLINE handler neither checks for this nor updates
page_cgroup->page when no new page_cgroup allocation is needed. (This went
unnoticed because the memmap is not freed when SPARSEMEM_VMEMMAP=y.)
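
In outline, the fix (condensed from the mm/page_cgroup.c hunk below) takes
the early return only when the memmap address is unchanged:

    if (!section->page_cgroup) {
        /* first ONLINE of this section: allocate the table, as before */
    } else {
        base = section->page_cgroup + pfn;
        table_size = 0;                 /* reuse the existing table */
        /* early return only if the memmap did not move */
        if (base->page == pfn_to_page(pfn))
            return 0;
        /* otherwise fall through and re-initialize every entry */
    }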

I also noticed that MEM_ONLINE can be called against only part of a section,
so freeing page_cgroup at CANCEL_ONLINE would free entries that are still in
use. Don't roll back at CANCEL.
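
A rough user-space sketch of the granularity mismatch (the section size here
is illustrative): page_cgroup tables cover whole SPARSEMEM sections, while an
online request, and hence its CANCEL, may cover only part of one:

    #include <stdio.h>

    #define PAGES_PER_SECTION 32768UL  /* illustrative: 128MB sections, 4KB pages */

    int main(void)
    {
        unsigned long start_pfn = PAGES_PER_SECTION / 2;   /* mid-section */
        unsigned long nr_pages  = PAGES_PER_SECTION / 4;   /* partial online */

        /*
         * The section's page_cgroup table also serves pages outside
         * [start_pfn, start_pfn + nr_pages); freeing it on CANCEL_ONLINE
         * would free entries that are still in use.
         */
        printf("CANCEL covers only %lu of %lu pages in section %lu\n",
               nr_pages, PAGES_PER_SECTION, start_pfn / PAGES_PER_SECTION);
        return 0;
    }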

Furthermore, the memory hotplug notifier chain is currently stopped by slub,
because its callback sets NOTIFY_STOP_MASK in the return value even on
success. As a result, page_cgroup's callback, which has lower priority than
slub's, is never called.

I think this slub behavior is not intentional (a bug), and this patch fixes it.
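
The reason a successful return stops the chain: at the time of this patch,
notifier_from_errno() in include/linux/notifier.h had no special case for
err == 0, so even err == 0 produced a value with NOTIFY_STOP_MASK set. A
user-space sketch, modeling my reading of that era's header:

    #include <stdio.h>

    #define NOTIFY_OK        0x0001
    #define NOTIFY_STOP_MASK 0x8000

    /* notifier_from_errno() as it read at the time: no err == 0 special case. */
    static int notifier_from_errno(int err)
    {
        return NOTIFY_STOP_MASK | (NOTIFY_OK - err);
    }

    int main(void)
    {
        int ret = notifier_from_errno(0);       /* slub's success path */

        printf("ret = %#x, stops the chain: %s\n",
               ret, (ret & NOTIFY_STOP_MASK) ? "yes" : "no");
        /*
         * ret = 0x8001: the chain stops after slub's callback, so the
         * lower-priority page_cgroup callback never runs -- hence the
         * "if (ret) ... else ret = NOTIFY_OK;" added by this patch.
         */
        return 0;
    }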

Another approach to page_cgroup allocation could be considered:
- free page_cgroup at OFFLINE even if it came from bootmem, and remove the
  special-case handling.
But it requires more changes.

Addresses http://bugzilla.kernel.org/show_bug.cgi?id=12041

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Tested-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by KAMEZAWA Hiroyuki and committed by Linus Torvalds dc19f9db b29acbdc

+33 -16

mm/page_cgroup.c  +29 -14
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -107,19 +107,29 @@
 
 	section = __pfn_to_section(pfn);
 
-	if (section->page_cgroup)
-		return 0;
-
-	nid = page_to_nid(pfn_to_page(pfn));
-
-	table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
-	if (slab_is_available()) {
-		base = kmalloc_node(table_size, GFP_KERNEL, nid);
-		if (!base)
-			base = vmalloc_node(table_size, nid);
-	} else {
-		base = __alloc_bootmem_node_nopanic(NODE_DATA(nid), table_size,
+	if (!section->page_cgroup) {
+		nid = page_to_nid(pfn_to_page(pfn));
+		table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
+		if (slab_is_available()) {
+			base = kmalloc_node(table_size, GFP_KERNEL, nid);
+			if (!base)
+				base = vmalloc_node(table_size, nid);
+		} else {
+			base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
+				table_size,
 				PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+		}
+	} else {
+		/*
+		 * We don't have to allocate page_cgroup again, but the
+		 * address of the memmap may have changed, so we have to
+		 * initialize it again.
+		 */
+		base = section->page_cgroup + pfn;
+		table_size = 0;
+		/* check whether the address of the memmap has changed */
+		if (base->page == pfn_to_page(pfn))
+			return 0;
 	}
 
 	if (!base) {
@@ -218,18 +228,23 @@
 		ret = online_page_cgroup(mn->start_pfn,
 				mn->nr_pages, mn->status_change_nid);
 		break;
-	case MEM_CANCEL_ONLINE:
 	case MEM_OFFLINE:
 		offline_page_cgroup(mn->start_pfn,
 				mn->nr_pages, mn->status_change_nid);
 		break;
+	case MEM_CANCEL_ONLINE:
 	case MEM_GOING_OFFLINE:
 		break;
 	case MEM_ONLINE:
 	case MEM_CANCEL_OFFLINE:
 		break;
 	}
-	ret = notifier_from_errno(ret);
+
+	if (ret)
+		ret = notifier_from_errno(ret);
+	else
+		ret = NOTIFY_OK;
+
 	return ret;
 }
mm/slub.c  +4 -2
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2931,8 +2931,10 @@
 	case MEM_CANCEL_OFFLINE:
 		break;
 	}
-
-	ret = notifier_from_errno(ret);
+	if (ret)
+		ret = notifier_from_errno(ret);
+	else
+		ret = NOTIFY_OK;
 	return ret;
 }