Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: fix aio performance regression for database caused by THP

I am working with a tool that simulates the I/O workload of an Oracle
database. This tool (orion, to be specific -
<http://docs.oracle.com/cd/E11882_01/server.112/e16638/iodesign.htm#autoId24>)
allocates hugetlbfs pages using shmget() with the SHM_HUGETLB flag. It then
does aio into these pages from flash disks using various block sizes
commonly used by databases. I am looking at performance with two of the
most common block sizes - 1M and 64K. aio performance with these two block
sizes plunged after Transparent HugePages was introduced in the kernel.
Here are the performance numbers:

              pre-THP      2.6.39     3.11-rc5
1M read     8384 MB/s   5629 MB/s    6501 MB/s
64K read    7867 MB/s   4576 MB/s    4251 MB/s

I have narrowed the performance impact down to the overhead introduced by
THP in the __get_page_tail() and put_compound_page() routines. perf top
shows >40% of cycles being spent in these two routines. Every time direct
I/O to hugetlbfs pages starts, the kernel calls get_page() to grab a
reference to the pages, and calls put_page() when the I/O completes to drop
the reference. THP introduced a significant amount of locking overhead to
get_page() and put_page() when dealing with compound pages, because
transparent hugepages can be split underneath get_page() and put_page().
It added this overhead irrespective of whether it is dealing with hugetlbfs
pages or transparent hugepages. This resulted in a 20%-45% drop in aio
performance when using hugetlbfs pages.

Since hugetlbfs pages cannot be split, there is no reason to go through
all the locking overhead for these pages as far as I can see. I added
code to __get_page_tail() and put_compound_page() to bypass all the
locking code when working with hugetlbfs pages. This improved performance
significantly. Performance numbers with this patch:

              pre-THP     3.11-rc5   3.11-rc5 + Patch
1M read     8384 MB/s   6501 MB/s          8371 MB/s
64K read    7867 MB/s   4251 MB/s          6510 MB/s

Performance with 64K reads is still lower than it was before THP, but this
is nevertheless a 53% improvement over unpatched 3.11-rc5 (6510 MB/s vs.
4251 MB/s). It does mean there is more work to be done, but I will take a
53% improvement for now.

Please take a look at the following patch and let me know if it looks
reasonable.

[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Khalid Aziz, committed by Linus Torvalds (7cb2ef56, 3a7200af)

+52 -25
mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -31,6 +31,7 @@
 #include <linux/memcontrol.h>
 #include <linux/gfp.h>
 #include <linux/uio.h>
+#include <linux/hugetlb.h>
 
 #include "internal.h"
 
@@ -81,6 +82,19 @@
 
 static void put_compound_page(struct page *page)
 {
+	/*
+	 * hugetlbfs pages cannot be split from under us. If this is a
+	 * hugetlbfs page, check refcount on head page and release the page if
+	 * the refcount becomes zero.
+	 */
+	if (PageHuge(page)) {
+		page = compound_head(page);
+		if (put_page_testzero(page))
+			__put_compound_page(page);
+
+		return;
+	}
+
 	if (unlikely(PageTail(page))) {
 		/* __split_huge_page_refcount can run under us */
 		struct page *page_head = compound_trans_head(page);
@@ -184,38 +198,51 @@ bool __get_page_tail(struct page *page)
 	 * proper PT lock that already serializes against
 	 * split_huge_page().
 	 */
-	unsigned long flags;
 	bool got = false;
-	struct page *page_head = compound_trans_head(page);
+	struct page *page_head;
 
-	if (likely(page != page_head && get_page_unless_zero(page_head))) {
+	/*
+	 * If this is a hugetlbfs page it cannot be split under us. Simply
+	 * increment refcount for the head page.
+	 */
+	if (PageHuge(page)) {
+		page_head = compound_head(page);
+		atomic_inc(&page_head->_count);
+		got = true;
+	} else {
+		unsigned long flags;
 
-		/* Ref to put_compound_page() comment. */
-		if (PageSlab(page_head)) {
+		page_head = compound_trans_head(page);
+		if (likely(page != page_head &&
+			   get_page_unless_zero(page_head))) {
+
+			/* Ref to put_compound_page() comment. */
+			if (PageSlab(page_head)) {
+				if (likely(PageTail(page))) {
+					__get_page_tail_foll(page, false);
+					return true;
+				} else {
+					put_page(page_head);
+					return false;
+				}
+			}
+
+			/*
+			 * page_head wasn't a dangling pointer but it
+			 * may not be a head page anymore by the time
+			 * we obtain the lock. That is ok as long as it
+			 * can't be freed from under us.
+			 */
+			flags = compound_lock_irqsave(page_head);
+			/* here __split_huge_page_refcount won't run anymore */
 			if (likely(PageTail(page))) {
 				__get_page_tail_foll(page, false);
-				return true;
-			} else {
-				put_page(page_head);
-				return false;
+				got = true;
 			}
+			compound_unlock_irqrestore(page_head, flags);
+			if (unlikely(!got))
+				put_page(page_head);
 		}
-
-		/*
-		 * page_head wasn't a dangling pointer but it
-		 * may not be a head page anymore by the time
-		 * we obtain the lock. That is ok as long as it
-		 * can't be freed from under us.
-		 */
-		flags = compound_lock_irqsave(page_head);
-		/* here __split_huge_page_refcount won't run anymore */
-		if (likely(PageTail(page))) {
-			__get_page_tail_foll(page, false);
-			got = true;
-		}
-		compound_unlock_irqrestore(page_head, flags);
-		if (unlikely(!got))
-			put_page(page_head);
 	}
 	return got;
 }