
sched, cpuset: rework sched domains and CPU hotplug handling (v4)

This is an updated version of my previous cpuset patch, on top of
the latest mainline git.
The patch fixes CPU hotplug handling issues in the current cpusets code,
namely circular locking in rebuild_sched_domains() and unsafe access to
the cpu_online_map in the cpuset CPU hotplug handler.
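
To make the "circular locking" part concrete, the two conflicting lock
orders look roughly like this (a hand-reduced sketch with made-up function
names, not code taken from this patch):

#include <linux/cgroup.h>	/* cgroup_lock()/cgroup_unlock() */
#include <linux/cpu.h>		/* get_online_cpus()/put_online_cpus() */
#include <linux/notifier.h>	/* NOTIFY_OK */

/* Path 1: a cpuset attribute write. cgroup_mutex is already held by the
 * cgroup write handler, and the old rebuild_sched_domains() then took
 * the CPU hotplug lock via get_online_cpus(). */
static void cpuset_write_path(void)
{
        cgroup_lock();          /* lock A */
        get_online_cpus();      /* ... then lock B */
        put_online_cpus();
        cgroup_unlock();
}

/* Path 2: the cpuset CPU hotplug callback. The hotplug core already holds
 * the CPU hotplug lock when the notifier runs, and the handler then took
 * cgroup_mutex - the reverse nesting, i.e. a potential ABBA deadlock. */
static int cpuset_hotplug_path(void)
{
        /* lock B is already held here */
        cgroup_lock();          /* ... then lock A */
        cgroup_unlock();
        return NOTIFY_OK;
}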

This version includes changes suggested by Paul Jackson (naming, comments,
style, etc.). I also got rid of the separate workqueue thread because it is
now safe to call get_online_cpus() from workqueue callbacks.

Here are some more details:

rebuild_sched_domains() is the only way to rebuild sched domains
correctly based on the current cpuset settings. This means we need
to be able to call it from different contexts, such as the CPU
hotplug path.
Also, the latest scheduler code in -tip now calls rebuild_sched_domains()
directly from functions like arch_reinit_sched_domains().

In order to support that properly, we need to rework the cpuset locking
rules to avoid circular dependencies, which is what this patch does.
The new lock nesting rules are explained in the comments.
We can now safely call rebuild_sched_domains() from virtually any
context. The only requirement is that it must be called under
get_online_cpus(). This allows CPU hotplug handlers and the scheduler
to call rebuild_sched_domains() directly.
The rest of the cpuset code now offloads sched domain rebuilds to
a workqueue (async_rebuild_sched_domains()).
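
For reference, here is the resulting call flow, condensed from the patch
below (generate_sched_domains() is the new helper introduced by the patch;
error handling and the domain-generation details are left out):

#include <linux/cpu.h>		/* get_online_cpus()/put_online_cpus() */
#include <linux/cgroup.h>	/* cgroup_lock()/cgroup_unlock() */
#include <linux/workqueue.h>	/* DECLARE_WORK(), schedule_work() */
#include <linux/sched.h>	/* partition_sched_domains() */

/* New helper added by the patch (prototype shown here for completeness) */
static int generate_sched_domains(cpumask_t **domains,
                                  struct sched_domain_attr **attributes);

/* Runs from the workqueue with no locks held, so it is free to nest
 * cgroup_lock() inside get_online_cpus() - the one allowed order. */
static void do_rebuild_sched_domains(struct work_struct *unused)
{
        struct sched_domain_attr *attr;
        cpumask_t *doms;
        int ndoms;

        get_online_cpus();                              /* outer: CPU hotplug lock */

        cgroup_lock();                                  /* inner: cgroup_mutex */
        ndoms = generate_sched_domains(&doms, &attr);
        cgroup_unlock();

        partition_sched_domains(ndoms, doms, attr);     /* scheduler rebuilds domains */

        put_online_cpus();
}

static DECLARE_WORK(rebuild_sched_domains_work, do_rebuild_sched_domains);

/* Called from cpuset code that already holds cgroup_mutex: defer the
 * actual rebuild instead of taking get_online_cpus() in the wrong order. */
static void async_rebuild_sched_domains(void)
{
        schedule_work(&rebuild_sched_domains_work);
}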

This version of the patch addresses comments from the previous review.
I fixed all misformatted comments and trailing spaces.

I also factored out the code that builds the domain masks
(generate_sched_domains()) and split up CPU and memory hotplug handling.
This was needed to simplify the locking, to avoid unsafe access to the
cpu_online_map from the memory hotplug handler, and in general to make
things cleaner.
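
Condensed from the patch below, the two resulting hotplug handlers look
like this (both are kernel/cpuset.c internals; generate_sched_domains() and
scan_for_empty_cpusets() are helpers defined in the same file):

/* CPU hotplug: runs within get_online_cpus(), so reading cpu_online_map
 * and rebuilding the sched domains directly from here is safe. */
static int cpuset_track_online_cpus(struct notifier_block *unused_nb,
                                    unsigned long phase, void *unused_cpu)
{
        struct sched_domain_attr *attr;
        cpumask_t *doms;
        int ndoms;

        switch (phase) {
        case CPU_ONLINE:
        case CPU_ONLINE_FROZEN:
        case CPU_DEAD:
        case CPU_DEAD_FROZEN:
                break;
        default:
                return NOTIFY_DONE;
        }

        cgroup_lock();
        top_cpuset.cpus_allowed = cpu_online_map;       /* track what is online */
        scan_for_empty_cpusets(&top_cpuset);
        ndoms = generate_sched_domains(&doms, &attr);
        cgroup_unlock();

        /* Have the scheduler rebuild the domains */
        partition_sched_domains(ndoms, doms, attr);

        return NOTIFY_OK;
}

/* Memory hotplug: only tracks memory nodes; it never touches
 * cpu_online_map and never rebuilds sched domains. */
void cpuset_track_online_nodes(void)
{
        cgroup_lock();
        top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
        scan_for_empty_cpusets(&top_cpuset);
        cgroup_unlock();
}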

The patch passes moderate testing (building a kernel with -j 16, creating &
removing domains, and bringing CPUs offline/online at the same time) on a
quad-Core2 based machine.

It passes lockdep checks, even with preemptible RCU enabled.
This time I also tested the suspend/resume path, and everything works
as expected.

Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Acked-by: Paul Jackson <pj@sgi.com>
Cc: menage@google.com
Cc: a.p.zijlstra@chello.nl
Cc: vegard.nossum@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Authored by Max Krasnyansky, committed by Ingo Molnar
cf417141 b635acec

+185 -133
kernel/cpuset.c
···
   * 2003-10-22 Updates by Stephen Hemminger.
   * 2004 May-July Rework by Paul Jackson.
   * 2006 Rework by Paul Menage to use generic cgroups
+  * 2008 Rework of the scheduler domains and CPU hotplug handling
+  *      by Max Krasnyansky
   *
   * This file is subject to the terms and conditions of the GNU General Public
   * License. See the file COPYING in the main directory of the Linux
···

  static DEFINE_MUTEX(callback_mutex);

- /* This is ugly, but preserves the userspace API for existing cpuset
+ /*
+  * This is ugly, but preserves the userspace API for existing cpuset
   * users. If someone tries to mount the "cpuset" filesystem, we
-  * silently switch it to mount "cgroup" instead */
+  * silently switch it to mount "cgroup" instead
+  */
  static int cpuset_get_sb(struct file_system_type *fs_type,
                           int flags, const char *unused_dev_name,
                           void *data, struct vfsmount *mnt)
···
  }

  /*
-  * Helper routine for rebuild_sched_domains().
+  * Helper routine for generate_sched_domains().
   * Do cpusets a, b have overlapping cpus_allowed masks?
   */
-
  static int cpusets_overlap(struct cpuset *a, struct cpuset *b)
  {
          return cpus_intersects(a->cpus_allowed, b->cpus_allowed);
···
  }

  /*
-  * rebuild_sched_domains()
+  * generate_sched_domains()
   *
-  * This routine will be called to rebuild the scheduler's dynamic
-  * sched domains:
-  * - if the flag 'sched_load_balance' of any cpuset with non-empty
-  *   'cpus' changes,
-  * - or if the 'cpus' allowed changes in any cpuset which has that
-  *   flag enabled,
-  * - or if the 'sched_relax_domain_level' of any cpuset which has
-  *   that flag enabled and with non-empty 'cpus' changes,
-  * - or if any cpuset with non-empty 'cpus' is removed,
-  * - or if a cpu gets offlined.
-  *
-  * This routine builds a partial partition of the systems CPUs
-  * (the set of non-overlappping cpumask_t's in the array 'part'
-  * below), and passes that partial partition to the kernel/sched.c
-  * partition_sched_domains() routine, which will rebuild the
-  * schedulers load balancing domains (sched domains) as specified
-  * by that partial partition. A 'partial partition' is a set of
-  * non-overlapping subsets whose union is a subset of that set.
+  * This function builds a partial partition of the systems CPUs
+  * A 'partial partition' is a set of non-overlapping subsets whose
+  * union is a subset of that set.
+  * The output of this function needs to be passed to kernel/sched.c
+  * partition_sched_domains() routine, which will rebuild the scheduler's
+  * load balancing domains (sched domains) as specified by that partial
+  * partition.
   *
   * See "What is sched_load_balance" in Documentation/cpusets.txt
   * for a background explanation of this.
···
   * domains when operating in the severe memory shortage situations
   * that could cause allocation failures below.
   *
-  * Call with cgroup_mutex held. May take callback_mutex during
-  * call due to the kfifo_alloc() and kmalloc() calls. May nest
-  * a call to the get_online_cpus()/put_online_cpus() pair.
-  * Must not be called holding callback_mutex, because we must not
-  * call get_online_cpus() while holding callback_mutex. Elsewhere
-  * the kernel nests callback_mutex inside get_online_cpus() calls.
-  * So the reverse nesting would risk an ABBA deadlock.
+  * Must be called with cgroup_lock held.
   *
   * The three key local variables below are:
   *  q - a linked-list queue of cpuset pointers, used to implement a
···
   *    element of the partition (one sched domain) to be passed to
   *    partition_sched_domains().
   */
-
- void rebuild_sched_domains(void)
+ static int generate_sched_domains(cpumask_t **domains,
+                         struct sched_domain_attr **attributes)
  {
-         LIST_HEAD(q);           /* queue of cpusets to be scanned*/
+         LIST_HEAD(q);           /* queue of cpusets to be scanned */
          struct cpuset *cp;      /* scans q */
          struct cpuset **csa;    /* array of all cpuset ptrs */
          int csn;                /* how many cpuset ptrs in csa so far */
···
          int ndoms;              /* number of sched domains in result */
          int nslot;              /* next empty doms[] cpumask_t slot */

-         csa = NULL;
+         ndoms = 0;
          doms = NULL;
          dattr = NULL;
+         csa = NULL;

          /* Special case for the 99% of systems with one, full, sched domain */
          if (is_sched_load_balance(&top_cpuset)) {
-                 ndoms = 1;
                  doms = kmalloc(sizeof(cpumask_t), GFP_KERNEL);
                  if (!doms)
-                         goto rebuild;
+                         goto done;
+
                  dattr = kmalloc(sizeof(struct sched_domain_attr), GFP_KERNEL);
                  if (dattr) {
                          *dattr = SD_ATTR_INIT;
                          update_domain_attr_tree(dattr, &top_cpuset);
                  }
                  *doms = top_cpuset.cpus_allowed;
-                 goto rebuild;
+
+                 ndoms = 1;
+                 goto done;
          }

          csa = kmalloc(number_of_cpusets * sizeof(cp), GFP_KERNEL);
···
                  }
          }

-         /* Convert <csn, csa> to <ndoms, doms> */
+         /*
+          * Now we know how many domains to create.
+          * Convert <csn, csa> to <ndoms, doms> and populate cpu masks.
+          */
          doms = kmalloc(ndoms * sizeof(cpumask_t), GFP_KERNEL);
-         if (!doms)
-                 goto rebuild;
+         if (!doms) {
+                 ndoms = 0;
+                 goto done;
+         }
+
+         /*
+          * The rest of the code, including the scheduler, can deal with
+          * dattr==NULL case. No need to abort if alloc fails.
+          */
          dattr = kmalloc(ndoms * sizeof(struct sched_domain_attr), GFP_KERNEL);

          for (nslot = 0, i = 0; i < csn; i++) {
                  struct cpuset *a = csa[i];
+                 cpumask_t *dp;
                  int apn = a->pn;

-                 if (apn >= 0) {
-                         cpumask_t *dp = doms + nslot;
-
-                         if (nslot == ndoms) {
-                                 static int warnings = 10;
-                                 if (warnings) {
-                                         printk(KERN_WARNING
-                                          "rebuild_sched_domains confused:"
-                                          " nslot %d, ndoms %d, csn %d, i %d,"
-                                          " apn %d\n",
-                                          nslot, ndoms, csn, i, apn);
-                                         warnings--;
-                                 }
-                                 continue;
-                         }
-
-                         cpus_clear(*dp);
-                         if (dattr)
-                                 *(dattr + nslot) = SD_ATTR_INIT;
-                         for (j = i; j < csn; j++) {
-                                 struct cpuset *b = csa[j];
-
-                                 if (apn == b->pn) {
-                                         cpus_or(*dp, *dp, b->cpus_allowed);
-                                         b->pn = -1;
-                                         if (dattr)
-                                                 update_domain_attr_tree(dattr
-                                                                 + nslot, b);
-                                 }
-                         }
-                         nslot++;
+                 if (apn < 0) {
+                         /* Skip completed partitions */
+                         continue;
                  }
+
+                 dp = doms + nslot;
+
+                 if (nslot == ndoms) {
+                         static int warnings = 10;
+                         if (warnings) {
+                                 printk(KERN_WARNING
+                                  "rebuild_sched_domains confused:"
+                                  " nslot %d, ndoms %d, csn %d, i %d,"
+                                  " apn %d\n",
+                                  nslot, ndoms, csn, i, apn);
+                                 warnings--;
+                         }
+                         continue;
+                 }
+
+                 cpus_clear(*dp);
+                 if (dattr)
+                         *(dattr + nslot) = SD_ATTR_INIT;
+                 for (j = i; j < csn; j++) {
+                         struct cpuset *b = csa[j];
+
+                         if (apn == b->pn) {
+                                 cpus_or(*dp, *dp, b->cpus_allowed);
+                                 if (dattr)
+                                         update_domain_attr_tree(dattr + nslot, b);
+
+                                 /* Done with this partition */
+                                 b->pn = -1;
+                         }
+                 }
+                 nslot++;
          }
          BUG_ON(nslot != ndoms);

- rebuild:
-         /* Have scheduler rebuild sched domains */
-         get_online_cpus();
-         partition_sched_domains(ndoms, doms, dattr);
-         put_online_cpus();
-
  done:
          kfree(csa);
-         /* Don't kfree(doms) -- partition_sched_domains() does that. */
-         /* Don't kfree(dattr) -- partition_sched_domains() does that. */
+
+         *domains = doms;
+         *attributes = dattr;
+         return ndoms;
+ }
+
+ /*
+  * Rebuild scheduler domains.
+  *
+  * Call with neither cgroup_mutex held nor within get_online_cpus().
+  * Takes both cgroup_mutex and get_online_cpus().
+  *
+  * Cannot be directly called from cpuset code handling changes
+  * to the cpuset pseudo-filesystem, because it cannot be called
+  * from code that already holds cgroup_mutex.
+  */
+ static void do_rebuild_sched_domains(struct work_struct *unused)
+ {
+         struct sched_domain_attr *attr;
+         cpumask_t *doms;
+         int ndoms;
+
+         get_online_cpus();
+
+         /* Generate domain masks and attrs */
+         cgroup_lock();
+         ndoms = generate_sched_domains(&doms, &attr);
+         cgroup_unlock();
+
+         /* Have scheduler rebuild the domains */
+         partition_sched_domains(ndoms, doms, attr);
+
+         put_online_cpus();
+ }
+
+ static DECLARE_WORK(rebuild_sched_domains_work, do_rebuild_sched_domains);
+
+ /*
+  * Rebuild scheduler domains, asynchronously via workqueue.
+  *
+  * If the flag 'sched_load_balance' of any cpuset with non-empty
+  * 'cpus' changes, or if the 'cpus' allowed changes in any cpuset
+  * which has that flag enabled, or if any cpuset with a non-empty
+  * 'cpus' is removed, then call this routine to rebuild the
+  * scheduler's dynamic sched domains.
+  *
+  * The rebuild_sched_domains() and partition_sched_domains()
+  * routines must nest cgroup_lock() inside get_online_cpus(),
+  * but such cpuset changes as these must nest that locking the
+  * other way, holding cgroup_lock() for much of the code.
+  *
+  * So in order to avoid an ABBA deadlock, the cpuset code handling
+  * these user changes delegates the actual sched domain rebuilding
+  * to a separate workqueue thread, which ends up processing the
+  * above do_rebuild_sched_domains() function.
+  */
+ static void async_rebuild_sched_domains(void)
+ {
+         schedule_work(&rebuild_sched_domains_work);
+ }
+
+ /*
+  * Accomplishes the same scheduler domain rebuild as the above
+  * async_rebuild_sched_domains(), however it directly calls the
+  * rebuild routine synchronously rather than calling it via an
+  * asynchronous work thread.
+  *
+  * This can only be called from code that is not holding
+  * cgroup_mutex (not nested in a cgroup_lock() call.)
+  */
+ void rebuild_sched_domains(void)
+ {
+         do_rebuild_sched_domains(NULL);
  }

  /**
···
                  return retval;

          if (is_load_balanced)
-                 rebuild_sched_domains();
+                 async_rebuild_sched_domains();
          return 0;
  }

···
          if (val != cs->relax_domain_level) {
                  cs->relax_domain_level = val;
                  if (!cpus_empty(cs->cpus_allowed) && is_sched_load_balance(cs))
-                         rebuild_sched_domains();
+                         async_rebuild_sched_domains();
          }

          return 0;
···
          mutex_unlock(&callback_mutex);

          if (cpus_nonempty && balance_flag_changed)
-                 rebuild_sched_domains();
+                 async_rebuild_sched_domains();

          return 0;
  }
···
          default:
                  BUG();
          }
+
+         /* Unreachable but makes gcc happy */
+         return 0;
  }

  static s64 cpuset_read_s64(struct cgroup *cont, struct cftype *cft)
···
          default:
                  BUG();
          }
+
+         /* Unrechable but makes gcc happy */
+         return 0;
  }


···
  }

  /*
-  * Locking note on the strange update_flag() call below:
-  *
   * If the cpuset being removed has its flag 'sched_load_balance'
   * enabled, then simulate turning sched_load_balance off, which
-  * will call rebuild_sched_domains(). The get_online_cpus()
-  * call in rebuild_sched_domains() must not be made while holding
-  * callback_mutex. Elsewhere the kernel nests callback_mutex inside
-  * get_online_cpus() calls. So the reverse nesting would risk an
-  * ABBA deadlock.
+  * will call async_rebuild_sched_domains().
   */

  static void cpuset_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
···
  struct cgroup_subsys cpuset_subsys = {
          .name = "cpuset",
          .create = cpuset_create,
-         .destroy  = cpuset_destroy,
+         .destroy = cpuset_destroy,
          .can_attach = cpuset_can_attach,
          .attach = cpuset_attach,
          .populate = cpuset_populate,
···
  }

  /*
-  * If common_cpu_mem_hotplug_unplug(), below, unplugs any CPUs
+  * If CPU and/or memory hotplug handlers, below, unplug any CPUs
   * or memory nodes, we need to walk over the cpuset hierarchy,
   * removing that CPU or node from all cpusets. If this removes the
   * last CPU or node from a cpuset, then move the tasks in the empty
···
  }

  /*
-  * The cpus_allowed and mems_allowed nodemasks in the top_cpuset track
-  * cpu_online_map and node_states[N_HIGH_MEMORY]. Force the top cpuset to
-  * track what's online after any CPU or memory node hotplug or unplug event.
-  *
-  * Since there are two callers of this routine, one for CPU hotplug
-  * events and one for memory node hotplug events, we could have coded
-  * two separate routines here. We code it as a single common routine
-  * in order to minimize text size.
-  */
-
- static void common_cpu_mem_hotplug_unplug(int rebuild_sd)
- {
-         cgroup_lock();
-
-         top_cpuset.cpus_allowed = cpu_online_map;
-         top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
-         scan_for_empty_cpusets(&top_cpuset);
-
-         /*
-          * Scheduler destroys domains on hotplug events.
-          * Rebuild them based on the current settings.
-          */
-         if (rebuild_sd)
-                 rebuild_sched_domains();
-
-         cgroup_unlock();
- }
-
- /*
   * The top_cpuset tracks what CPUs and Memory Nodes are online,
   * period. This is necessary in order to make cpusets transparent
   * (of no affect) on systems that are actively using CPU hotplug
···
   *
   * This routine ensures that top_cpuset.cpus_allowed tracks
   * cpu_online_map on each CPU hotplug (cpuhp) event.
+  *
+  * Called within get_online_cpus(). Needs to call cgroup_lock()
+  * before calling generate_sched_domains().
   */
-
- static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,
+ static int cpuset_track_online_cpus(struct notifier_block *unused_nb,
                                  unsigned long phase, void *unused_cpu)
  {
+         struct sched_domain_attr *attr;
+         cpumask_t *doms;
+         int ndoms;
+
          switch (phase) {
-         case CPU_UP_CANCELED:
-         case CPU_UP_CANCELED_FROZEN:
-         case CPU_DOWN_FAILED:
-         case CPU_DOWN_FAILED_FROZEN:
          case CPU_ONLINE:
          case CPU_ONLINE_FROZEN:
          case CPU_DEAD:
          case CPU_DEAD_FROZEN:
-                 common_cpu_mem_hotplug_unplug(1);
                  break;
+
          default:
                  return NOTIFY_DONE;
          }
+
+         cgroup_lock();
+         top_cpuset.cpus_allowed = cpu_online_map;
+         scan_for_empty_cpusets(&top_cpuset);
+         ndoms = generate_sched_domains(&doms, &attr);
+         cgroup_unlock();
+
+         /* Have scheduler rebuild the domains */
+         partition_sched_domains(ndoms, doms, attr);

          return NOTIFY_OK;
  }
···
  #ifdef CONFIG_MEMORY_HOTPLUG
  /*
   * Keep top_cpuset.mems_allowed tracking node_states[N_HIGH_MEMORY].
-  * Call this routine anytime after you change
-  * node_states[N_HIGH_MEMORY].
-  * See also the previous routine cpuset_handle_cpuhp().
+  * Call this routine anytime after node_states[N_HIGH_MEMORY] changes.
+  * See also the previous routine cpuset_track_online_cpus().
   */
-
  void cpuset_track_online_nodes(void)
  {
-         common_cpu_mem_hotplug_unplug(0);
+         cgroup_lock();
+         top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
+         scan_for_empty_cpusets(&top_cpuset);
+         cgroup_unlock();
  }
  #endif

···
          top_cpuset.cpus_allowed = cpu_online_map;
          top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];

-         hotcpu_notifier(cpuset_handle_cpuhp, 0);
+         hotcpu_notifier(cpuset_track_online_cpus, 0);
  }

  /**