
vmstat: Use per cpu atomics to avoid interrupt disable / enable

Currently the operations to increment vm counters must disable interrupts
in order not to corrupt the per-cpu bookkeeping of the counters.

So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
count on preemption being disabled, we still have some minor issues:
the fetching of the counter thresholds is racy.
A threshold from another cpu may be applied if we happen to be
rescheduled on another cpu. However, the following vmstat operation
will then bring the counter back under the threshold limit.

The operations for __xxx_zone_state are not changed since the caller
has taken care of the synchronization needs (and therefore the cycle
count is even less than the optimized version for the irq disable case
provided here).

The optimization using this_cpu_cmpxchg will only be used if the arch
supports efficient this_cpu_ops (must have CONFIG_CMPXCHG_LOCAL set!)

The use of this_cpu_cmpxchg reduces the cycle count for the counter
operations by about 80% (inc_zone_page_state goes from 170 cycles to 32).

Signed-off-by: Christoph Lameter <cl@linux.com>

Authored by Christoph Lameter, committed by Tejun Heo
7c839120 20b87691

+87 -14
mm/vmstat.c
···
 EXPORT_SYMBOL(__mod_zone_page_state);

 /*
- * For an unknown interrupt state
- */
-void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
-					int delta)
-{
-	unsigned long flags;
-
-	local_irq_save(flags);
-	__mod_zone_page_state(zone, item, delta);
-	local_irq_restore(flags);
-}
-EXPORT_SYMBOL(mod_zone_page_state);
-
-/*
  * Optimized increment and decrement functions.
  *
  * These are only for a single page and therefore can take a struct page *
···
 }
 EXPORT_SYMBOL(__dec_zone_page_state);

+#ifdef CONFIG_CMPXCHG_LOCAL
+/*
+ * If we have cmpxchg_local support then we do not need to incur the overhead
+ * that comes with local_irq_save/restore if we use this_cpu_cmpxchg.
+ *
+ * mod_state() modifies the zone counter state through atomic per cpu
+ * operations.
+ *
+ * Overstep mode specifies how overstep should be handled:
+ *         0       No overstepping
+ *         1       Overstepping half of threshold
+ *        -1       Overstepping minus half of threshold
+ */
+static inline void mod_state(struct zone *zone,
+	enum zone_stat_item item, int delta, int overstep_mode)
+{
+	struct per_cpu_pageset __percpu *pcp = zone->pageset;
+	s8 __percpu *p = pcp->vm_stat_diff + item;
+	long o, n, t, z;
+
+	do {
+		z = 0;  /* overflow to zone counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong cpu if we get
+		 * rescheduled while executing here. However, the following
+		 * will apply the threshold again and therefore bring the
+		 * counter under the threshold.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (n > t || n < -t) {
+			int os = overstep_mode * (t >> 1);
+
+			/* Overflow must be added to zone counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		zone_page_state_add(z, zone, item);
+}
+
+void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+					int delta)
+{
+	mod_state(zone, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_zone_page_state);
+
+void inc_zone_state(struct zone *zone, enum zone_stat_item item)
+{
+	mod_state(zone, item, 1, 1);
+}
+
+void inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_state(page_zone(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_zone_page_state);
+
+void dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_state(page_zone(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_zone_page_state);
+#else
+/*
+ * Use interrupt disable to serialize counter updates
+ */
+void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+					int delta)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_zone_page_state(zone, item, delta);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(mod_zone_page_state);
+
 void inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	unsigned long flags;
···
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL(dec_zone_page_state);
+#endif

 /*
  * Update the zone counters for one cpu.