commits

Pull networking fixes from David Miller:

1) The value chosen for the new SO_MAX_PACING_RATE socket option on
parisc was very poorly chosen, let's fix it while we still can.
From Eric Dumazet.

2) Our generic reciprocal divide was found to handle some edge cases
incorrectly, and part of this is encoded into BPF as deep as the JIT
engines themselves. Just use a real divide throughout for now.
From Eric Dumazet.

3) Because the initial lookup is lockless, the TCP metrics engine can
end up creating two entries for the same lookup key. Fix this by
doing a second lookup under the lock before we actually create the
new entry. From Christoph Paasch.

4) Fix scatter-gather list init in usbnet driver, from Bjørn Mork.

5) Fix unintended 32-bit truncation in cxgb4 driver's bit shifting.
From Dan Carpenter.

6) Netlink socket dumping uses the wrong socket state for timewait
sockets. Fix from Neal Cardwell.

7) Fix netlink memory leak in ieee802154_add_iface(), from Christian
Engelmayer.

8) Multicast forwarding in ipv4 can overflow the per-rule reference
counts, causing all multicast traffic to cease. Fix from Hannes
Frederic Sowa.

9) via-rhine needs to stop all TX queues when it resets the device,
from Richard Weinberger.

10) Fix RDS per-cpu accesses broken by the this_cpu_* conversions. From
Gerald Schaefer.
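
Item 3 above describes a lockless first lookup followed by a second lookup under the lock before the entry is created. A minimal user-space sketch of that pattern follows. It is not the kernel's tcp_metrics code: the names (metrics_entry, metrics_get, table_lock) are hypothetical, and a C11 atomic_flag spinlock stands in for the kernel's locking and RCU.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 16

/* Hypothetical cache entry keyed by destination address. */
struct metrics_entry {
    uint32_t   daddr;
    atomic_int in_use;   /* published with release, read with acquire */
};

static struct metrics_entry table[MAX_ENTRIES];
static atomic_flag table_lock = ATOMIC_FLAG_INIT;

/* Scan for a key; usable both with and without the lock held. */
static struct metrics_entry *metrics_find(uint32_t daddr)
{
    for (size_t i = 0; i < MAX_ENTRIES; i++) {
        if (atomic_load_explicit(&table[i].in_use, memory_order_acquire) &&
            table[i].daddr == daddr)
            return &table[i];
    }
    return NULL;
}

/* Return the entry for daddr, creating it at most once. NULL if full. */
struct metrics_entry *metrics_get(uint32_t daddr)
{
    struct metrics_entry *e = metrics_find(daddr);   /* lockless fast path */
    if (e)
        return e;

    while (atomic_flag_test_and_set(&table_lock))
        ;                                            /* spin until acquired */

    /* Second lookup under the lock: another thread may have created the
     * entry between our lockless miss and taking the lock. */
    e = metrics_find(daddr);
    if (!e) {
        for (size_t i = 0; i < MAX_ENTRIES; i++) {
            if (!atomic_load_explicit(&table[i].in_use, memory_order_relaxed)) {
                table[i].daddr = daddr;
                atomic_store_explicit(&table[i].in_use, 1, memory_order_release);
                e = &table[i];
                break;
            }
        }
    }

    atomic_flag_clear(&table_lock);
    return e;
}

/* Number of live entries, for inspection. */
int metrics_count(void)
{
    int n = 0;
    for (size_t i = 0; i < MAX_ENTRIES; i++)
        n += atomic_load_explicit(&table[i].in_use, memory_order_acquire);
    return n;
}
```

Calling metrics_get() twice with the same key returns the same entry and leaves one entry in the table; without the second lookup, two racing callers could each insert their own.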

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
s390/bpf,jit: fix 32 bit divisions, use unsigned divide instructions
parisc: fix SO_MAX_PACING_RATE typo
ipv6: simplify detection of first operational link-local address on interface
tcp: metrics: Avoid duplicate entries with the same destination-IP
net: rds: fix per-cpu helper usage
e1000e: Fix compilation warning when !CONFIG_PM_SLEEP
bpf: do not use reciprocal divide
be2net: add dma_mapping_error() check for dma_map_page()
bnx2x: Don't release PCI bars on shutdown
net,via-rhine: Fix tx_timeout handling
batman-adv: fix batman-adv header overhead calculation
qlge: Fix vlan netdev features.
net: avoid reference counter overflows on fib_rules in multicast forwarding
dm9601: add USB IDs for new dm96xx variants
MAINTAINERS: add virtio-dev ML for virtio
ieee802154: Fix memory leak in ieee802154_add_iface()
net: usbnet: fix SG initialisation
inet_diag: fix inet_diag_dump_icsk() to use correct state for timewait sockets
cxgb4: silence shift wrapping static checker warning

12y ago

Robert Richter

bee09ed9

perf/x86/amd/ibs: Fix waking up from S3 for AMD family 10h

12y ago

Linus Torvalds

7e22e911

Linux 3.13-rc8 v3.13-rc8

12y ago

Linus Torvalds

48ba620a

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

12y ago

Heiko Carstens

3af57f78

s390/bpf,jit: fix 32 bit divisions, use unsigned divide instructions

12y ago

Peter Zijlstra

c026b359

x86, mm, perf: Allow recursive faults from interrupts

12y ago

Steven Rostedt

3dc91d43

SELinux: Fix possible NULL pointer dereference in selinux_inode_permission()

While running stress tests on adding and deleting ftrace instances I hit
this bug:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: selinux_inode_permission+0x85/0x160
PGD 63681067 PUD 7ddbe067 PMD 0
Oops: 0000 [#1] PREEMPT
CPU: 0 PID: 5634 Comm: ftrace-test-mki Not tainted 3.13.0-rc4-test-00033-gd2a6dde-dirty #20
Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
task: ffff880078375800 ti: ffff88007ddb0000 task.ti: ffff88007ddb0000
RIP: 0010:[<ffffffff812d8bc5>] [<ffffffff812d8bc5>] selinux_inode_permission+0x85/0x160
RSP: 0018:ffff88007ddb1c48 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000800000 RCX: ffff88006dd43840
RDX: 0000000000000001 RSI: 0000000000000081 RDI: ffff88006ee46000
RBP: ffff88007ddb1c88 R08: 0000000000000000 R09: ffff88007ddb1c54
R10: 6e6576652f6f6f66 R11: 0000000000000003 R12: 0000000000000000
R13: 0000000000000081 R14: ffff88006ee46000 R15: 0000000000000000
FS: 00007f217b5b6700(0000) GS:ffffffff81e21000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000020 CR3: 000000006a0fe000 CR4: 00000000000007f0
Call Trace:
security_inode_permission+0x1c/0x30
__inode_permission+0x41/0xa0
inode_permission+0x18/0x50
link_path_walk+0x66/0x920
path_openat+0xa6/0x6c0
do_filp_open+0x43/0xa0
do_sys_open+0x146/0x240
SyS_open+0x1e/0x20
system_call_fastpath+0x16/0x1b
Code: 84 a1 00 00 00 81 e3 00 20 00 00 89 d8 83 c8 02 40 f6 c6 04 0f 45 d8 40 f6 c6 08 74 71 80 cf 02 49 8b 46 38 4c 8d 4d cc 45 31 c0 <0f> b7 50 20 8b 70 1c 48 8b 41 70 89 d9 8b 78 04 e8 36 cf ff ff
RIP selinux_inode_permission+0x85/0x160
CR2: 0000000000000020

Investigating, I found that the inode->i_security was NULL, and the
dereference of it caused the oops.

in selinux_inode_permission():

isec = inode->i_security;

rc = avc_has_perm_noaudit(sid, isec->sid, isec->sclass, perms, 0, &avd);

Note, the crash came from stressing the deletion and reading of debugfs
files. I was not able to recreate this via normal files. But I'm not
sure they are safe. It may just be that the race window is much harder
to hit.

What seems to have happened (and what I have traced), is the file is
being opened at the same time the file or directory is being deleted.
As the dentry and inode locks are not held during the path walk, nor are
the inode's ref counts being incremented, there is nothing saving these
structures from being discarded except for an rcu_read_lock().

The rcu_read_lock() protects against freeing of the inode, but it does
not protect freeing of the inode_security_struct. Now if the freeing of
the i_security happens with a call_rcu(), and the i_security field of
the inode is not changed (it gets freed as the inode gets freed) then
there will be no issue here. (Linus Torvalds suggested not setting the
field to NULL such that we do not need to check if it is NULL in the
permission check).

Note, this is a hack, but it fixes the problem at hand. A real fix is
to restructure the destroy_inode() to call all the destructor handlers
from the RCU callback. But that is a major job to do, and requires a
lot of work. For now, we just band-aid this bug with this fix (it
works), and work on a more maintainable solution in the future.
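
The idea described above (defer the free of the security blob until readers are done, and leave the inode's pointer untouched) can be illustrated with a toy, single-threaded sketch. This is not RCU and not the SELinux patch: reader_lock(), defer_free(), run_deferred(), struct inode_like and struct security_blob are all hypothetical stand-ins, and a simple reader counter plays the role of the grace period.

```c
#include <stdlib.h>

/* One queued release: call func(obj) once no reader is active. */
struct deferred {
    struct deferred *next;
    void (*func)(void *);
    void *obj;
};

static int active_readers;          /* toy stand-in for read-side sections */
static struct deferred *pending;    /* releases waiting for readers to finish */

void reader_lock(void)   { active_readers++; }
void reader_unlock(void) { active_readers--; }

/* Queue obj for release instead of freeing it now. Returns 0 on success. */
int defer_free(void *obj, void (*func)(void *))
{
    struct deferred *d = malloc(sizeof(*d));
    if (!d)
        return -1;
    d->obj = obj;
    d->func = func;
    d->next = pending;
    pending = d;
    return 0;
}

/* Run queued releases if no reader is active; returns how many ran. */
int run_deferred(void)
{
    int n = 0;
    if (active_readers > 0)
        return 0;
    while (pending) {
        struct deferred *d = pending;
        pending = d->next;
        d->func(d->obj);
        free(d);
        n++;
    }
    return n;
}

/* Hypothetical miniature of an inode and its security blob. */
struct security_blob { unsigned int sid; };
struct inode_like    { struct security_blob *sec; };

/* Teardown leaves inode->sec pointing at the blob and only defers the
 * free, so a concurrent reader never sees NULL and never sees freed memory. */
int inode_teardown(struct inode_like *inode)
{
    return defer_free(inode->sec, free);
}
```

A reader that entered before inode_teardown() can still dereference inode->sec safely; the memory is released only by a later run_deferred() call made after that reader has left.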

Link: http://lkml.kernel.org/r/20140109101932.0508dec7@gandalf.local.home
Link: http://lkml.kernel.org/r/20140109182756.17abaaa8@gandalf.local.home

Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

12y ago

Linus Torvalds

8f211b6c

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

12y ago

Eric W. Biederman

41301ae7

vfs: Fix a regression in mounting proc

12y ago

Eric Dumazet

75b99dbd

parisc: fix SO_MAX_PACING_RATE typo

12y ago

Linus Torvalds

85ce70fd

Merge branches 'sched-urgent-for-linus' and 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

12y ago

Hugh Dickins

eecc1e42

thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once only

12y ago

Linus Torvalds

8b6d79f5

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

12y ago

Andrew Jones

0dce7cd6

kvm: x86: fix apic_base enable check

12y ago

Eric W. Biederman

1f7f4dde

fork: Allow CLONE_PARENT after setns(CLONE_NEWPID)

12y ago

Hannes Frederic Sowa

11ffff75

ipv6: simplify detection of first operational link-local address on interface

12y ago

Linus Torvalds

9b6c4ea9

Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

12y ago

Rik van Riel

9722c2da

sched: Calculate effective load even if local weight is 0

12y ago

Ingo Molnar

e59da0ae

Merge branch 'clockevents/3.13-fixes' of git://git.linaro.org/people/daniel.lezcano/linux into timers/urgent

12y ago

Ming Lei

518d00b7

block: null_blk: fix queue leak inside removing device

12y ago

Hugh Dickins

d1969a84

percpu_counter: unbreak __percpu_counter_add()

12y ago

Catalin Marinas

4ce00dfc

Revert "arm64: Fix memory shareability attribute for ioremap_wc/cache"

12y ago

Eric W. Biederman

f48cfddc

vfs: In d_path don't call d_dname on a mount point

12y ago

Christoph Paasch

77f99ad1

tcp: metrics: Avoid duplicate entries with the same destination-IP

12y ago

Linus Torvalds

93a11c8f

Merge tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging

12y ago

John Stultz

7a06c41c

sched_clock: Disable seqlock lockdep usage in sched_clock()

12y ago

Linus Torvalds

228fdc08

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:
"Famous last words: "final pull request" :-)

I'm sending this because Jason Wang's fixes are pretty important.

1) Add missing per-cpu stats initialization to ip6_vti. Otherwise
lockdep spits out a call trace. From Li RongQing.

2) Fix NULL oops in wireless hwsim, from Javier Lopez

3) TIPC deferred packet queue unlink must NULL out skb->next to avoid
crashes. From Erik Hugne

4) Fix access to uninitialized buffer in nf_nat netfilter code, from
Daniel Borkmann

5) Fix lifetime of ipv6 loopback and SIT tunnel addresses, otherwise
they basically timeout immediately. From Hannes Frederic Sowa

6) Fix DMA unmapping of TSO packets in bnx2x driver, from Michal
Schmidt

7) Do not allow L2 forwarding offload via macvtap device, the way
things are now it will not end up being forwarded at all. From
Jason Wang

8) Fix transmit queue selection via ndo_dfwd_start_xmit(), fixing
things like applying NETIF_F_LLTX to the wrong device (!!) and
eliding the proper transmit watchdog handling

9) qlcnic driver was not updating tx statistics at all, from Manish
Chopra"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
qlcnic: Fix ethtool statistics length calculation
qlcnic: Fix bug in TX statistics
net: core: explicitly select a txq before doing l2 forwarding
macvlan: forbid L2 fowarding offload for macvtap
bnx2x: fix DMA unmapping of TSO split BDs
ipv6: add link-local, sit and loopback address with INFINITY_LIFE_TIME
bnx2x: prevent WARN during driver unload
tipc: correctly unlink packets from deferred packet queue
ipv6: pcpu_tstats.syncp should be initialised in ip6_vti.c
netfilter: only warn once on wrong seqadj usage
netfilter: nf_nat: fix access to uninitialized buffer in IRC NAT helper
NFC: Fix target mode p2p link establishment
iwlwifi: add new devices for 7265 series
mac80211: move "bufferable MMPDU" check to fix AP mode scan
mac80211_hwsim: Fix NULL pointer dereference

12y ago

Linus Torvalds

a6da83f9

Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc

12y ago

Soren Brinkmann

c1dcc927

clocksource: cadence_ttc: Fix mutex taken inside interrupt context

When the kernel is compiled with:
CONFIG_HIGH_RES_TIMERS=no
CONFIG_HZ_PERIODIC=yes
CONFIG_DEBUG_ATOMIC_SLEEP=yes

The following WARN appears:

WARNING: CPU: 1 PID: 0 at linux/kernel/mutex.c:856 mutex_trylock+0x70/0x1fc()
DEBUG_LOCKS_WARN_ON(in_interrupt())
Modules linked in:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.12.0-xilinx-dirty #93
[<c0014a78>] (unwind_backtrace+0x0/0x11c) from [<c0011b6c>] (show_stack+0x10/0x14)
[<c0011b6c>] (show_stack+0x10/0x14) from [<c039120c>] (dump_stack+0x7c/0xc0)
[<c039120c>] (dump_stack+0x7c/0xc0) from [<c001fda4>] (warn_slowpath_common+0x60/0x84)
[<c001fda4>] (warn_slowpath_common+0x60/0x84) from [<c001fe48>] (warn_slowpath_fmt+0x2c/0x3c)
[<c001fe48>] (warn_slowpath_fmt+0x2c/0x3c) from [<c0392658>] (mutex_trylock+0x70/0x1fc)
[<c0392658>] (mutex_trylock+0x70/0x1fc) from [<c02dfc08>] (clk_prepare_lock+0xc/0xe4)
[<c02dfc08>] (clk_prepare_lock+0xc/0xe4) from [<c02e099c>] (clk_get_rate+0xc/0x44)
[<c02e099c>] (clk_get_rate+0xc/0x44) from [<c02d0394>] (ttc_set_mode+0x34/0x78)
[<c02d0394>] (ttc_set_mode+0x34/0x78) from [<c005f794>] (clockevents_set_mode+0x28/0x5c)
[<c005f794>] (clockevents_set_mode+0x28/0x5c) from [<c00607fc>] (tick_broadcast_on_off+0x190/0x1c0)
[<c00607fc>] (tick_broadcast_on_off+0x190/0x1c0) from [<c005f168>] (clockevents_notify+0x58/0x1ac)
[<c005f168>] (clockevents_notify+0x58/0x1ac) from [<c02b99dc>] (cpuidle_setup_broadcast_timer+0x20/0x24)
[<c02b99dc>] (cpuidle_setup_broadcast_timer+0x20/0x24) from [<c006cd04>] (generic_smp_call_function_single_interrupt+0)
[<c006cd04>] (generic_smp_call_function_single_interrupt+0xe0/0x130) from [<c00138c8>] (handle_IPI+0x88/0x118)
[<c00138c8>] (handle_IPI+0x88/0x118) from [<c0008504>] (gic_handle_irq+0x58/0x60)
[<c0008504>] (gic_handle_irq+0x58/0x60) from [<c0012644>] (__irq_svc+0x44/0x78)
Exception stack(0xef099fa0 to 0xef099fe8)
9fa0: 00000001 ef092100 00000000 ef092100 ef098000 00000015 c0399f2c c0579d74
9fc0: 0000406a 413fc090 00000000 00000000 00000000 ef099fe8 c00666ec c000f46c
9fe0: 20000113 ffffffff
[<c0012644>] (__irq_svc+0x44/0x78) from [<c000f46c>] (arch_cpu_idle+0x34/0x3c)
[<c000f46c>] (arch_cpu_idle+0x34/0x3c) from [<c0053980>] (cpu_startup_entry+0xa8/0x10c)
[<c0053980>] (cpu_startup_entry+0xa8/0x10c) from [<000085a4>] (0x85a4)

We are in an interrupt context (IPI) and we are calling clk_get_rate() in
the set_mode function, which in turn ends up taking a mutex. Even if that
does not hang, it is a potential kernel deadlock.

It is not allowed to call clk_get_rate() from interrupt context. To
avoid such calls the timer input frequency is stored in the driver's
data struct which makes it accessible to the driver in any context.
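
A minimal sketch of the approach described above: read the clock rate once where sleeping is allowed, keep it in the driver's data struct, and use only the cached value on the interrupt path. The names here (fake_clk_get_rate, struct timer_dev, timer_set_periodic) are hypothetical, and the counter merely records calls that would have been illegal in interrupt context; this is not the cadence_ttc driver code.

```c
#include <stdbool.h>

/* Test scaffolding: pretend context flag and a violation counter. */
bool in_irq_context;
int  sleeping_calls_in_irq;

/* Stand-in for a clock API call that may take a mutex (may sleep). */
unsigned long fake_clk_get_rate(void)
{
    if (in_irq_context)
        sleeping_calls_in_irq++;   /* would trigger the WARN in a real kernel */
    return 100000000UL;            /* assumed 100 MHz input clock */
}

/* Hypothetical per-device data with the cached input frequency. */
struct timer_dev {
    unsigned long freq;       /* cached at init time */
    unsigned long interval;   /* ticks per period, programmed on mode change */
};

/* Process context: the only place the clock API is called. */
void timer_init(struct timer_dev *t)
{
    t->freq = fake_clk_get_rate();
}

/* May run in interrupt context: uses the cached value only. */
void timer_set_periodic(struct timer_dev *t, unsigned int hz)
{
    t->interval = (t->freq + hz / 2) / hz;   /* rounded division */
}
```

The trade-off is that the cached value goes stale if the clock rate changes later; a real driver would need some way to refresh it from a context that may sleep.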

[dlezcano] completed the changelog with the WARN trace and added a more
detailed description. Tested on Zynq ZC702.

Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Tested-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Soren Brinkmann <soren.brinkmann@xilinx.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>

12y ago

Will Deacon

cdc27c27

arm64: ptrace: avoid using HW_BREAKPOINT_EMPTY for disabled events

12y ago

Linus Torvalds

0e4b0743

Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

12y ago

Gerald Schaefer

c196403b

net: rds: fix per-cpu helper usage

12y ago

Linus Torvalds

9826dbb1

Merge branch 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm

12y ago

Jean Delvare

3f9aec76

hwmon: (coretemp) Fix truncated name of alarm attributes

12y ago

John Stultz

0c3351d4

seqlock: Use raw_ prefix instead of _no_lockdep

12y ago

Linus Torvalds

e2bc4470

Merge tag 'xfs-for-linus-v3.13-rc8' of git://oss.sgi.com/xfs/xfs

12y ago

Shahed Shaikh

d6e9c89a

qlcnic: Fix ethtool statistics length calculation

12y ago

Linus Torvalds

061f49ec

Merge branch 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

12y ago

Benjamin Herrenschmidt

10348f59

powerpc: Check return value of instance-to-package OF call

12y ago

Linus Torvalds

b0031f22

Merge tag 's2mps11-build' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

12y ago

Linus Torvalds

319e2e3f

Linux 3.13-rc4 v3.13-rc4

12y ago

Linus Torvalds

3af4977e

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

12y ago

Stephen Warren

a31ab44e

ARM: bcm2835: add missing #xxx-cells to I2C nodes

12y ago

David S. Miller

8c12ec74

Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge

12y ago

Linus Torvalds

70b23ce3

Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

12y ago

Taras Kondratiuk

b25f3e1c

ARM: 7938/1: OMAP4/highbank: Flush L2 cache before disabling

12y ago

Linus Torvalds

324c66ff

Merge branch 'leds-fixes-for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds

12y ago

Chuansheng Liu

1f4a63bf

xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()

12y ago

Manish Chopra

1ac6762a

qlcnic: Fix bug in TX statistics

12y ago

Linus Torvalds

26bef131

x86, fpu, amd: Clear exceptions in AMD FXSAVE workaround

12y ago

Benjamin Herrenschmidt

f991db1c

Merge remote-tracking branch 'agust/merge' into merge

12y ago

Linus Torvalds

941ef73d

Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

12y ago

Krzysztof Kozlowski

1b1ccee1

mfd: s2mps11: Fix build after regmap field rename in sec-core.c

12y ago

Matias Bjorling

57053d8c

null_blk: mem garbage on NUMA systems during init

12y ago

Linus Torvalds

b9548514

Merge tag 'ntb-3.13' of git://github.com/jonmason/ntb

12y ago

Linux 3.13 v3.13

d8ec26d7

Linus Torvalds

12y

drm/nouveau/mxm: fix null deref on load

72de1823

Ilia Mirkin

12y

Merge tag 'acpi-3.13-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

4d935402

Linus Torvalds

12y

Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

16ec54ad

Linus Torvalds

12y

Revert "ACPI: Add BayTrail SoC GPIO and LPSS ACPI IDs"

2b844ba7

Rafael J. Wysocki

12y

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

7d0d46da

Linus Torvalds

12y

perf/x86/amd/ibs: Fix waking up from S3 for AMD family 10h

bee09ed9

Robert Richter

12y

Linux 3.13-rc8 v3.13-rc8

7e22e911

Linus Torvalds

12y

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

48ba620a

Linus Torvalds

12y

s390/bpf,jit: fix 32 bit divisions, use unsigned divide instructions

3af57f78

Heiko Carstens

12y

x86, mm, perf: Allow recursive faults from interrupts

c026b359

Peter Zijlstra

12y

SELinux: Fix possible NULL pointer dereference in selinux_inode_permission()

3dc91d43

Steven Rostedt

12y

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

8f211b6c

Linus Torvalds

12y

vfs: Fix a regression in mounting proc

41301ae7

Eric W. Biederman

12y

parisc: fix SO_MAX_PACING_RATE typo

75b99dbd

Eric Dumazet

12y

Merge branches 'sched-urgent-for-linus' and 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

85ce70fd

Linus Torvalds

12y

thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once only

We see General Protection Fault on RSI in copy_page_rep: that RSI is
what you get from a NULL struct page pointer.

RIP: 0010:[<ffffffff81154955>] [<ffffffff81154955>] copy_page_rep+0x5/0x10
RSP: 0000:ffff880136e15c00 EFLAGS: 00010286
RAX: ffff880000000000 RBX: ffff880136e14000 RCX: 0000000000000200
RDX: 6db6db6db6db6db7 RSI: db73880000000000 RDI: ffff880dd0c00000
RBP: ffff880136e15c18 R08: 0000000000000200 R09: 000000000005987c
R10: 000000000005987c R11: 0000000000000200 R12: 0000000000000001
R13: ffffea00305aa000 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f195752f700(0000) GS:ffff880c7fc20000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000093010000 CR3: 00000001458e1000 CR4: 00000000000027e0
Call Trace:
copy_user_huge_page+0x93/0xab
do_huge_pmd_wp_page+0x710/0x815
handle_mm_fault+0x15d8/0x1d70
__do_page_fault+0x14d/0x840
do_page_fault+0x2f/0x90
page_fault+0x22/0x30

do_huge_pmd_wp_page() tests is_huge_zero_pmd(orig_pmd) four times: but
since shrink_huge_zero_page() can free the huge_zero_page, and we have
no hold of our own on it here (except where the fourth test holds
page_table_lock and has checked pmd_same), it's possible for it to
answer yes the first time, but no to the second or third test. Change
all those last three to tests for NULL page.
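
The pattern behind this fix, evaluating a racy predicate once and deriving every later decision from a local captured at that point, can be sketched in user space as follows. All names (shared_zero_page, simulate_shrinker, handle_write_fault) are hypothetical and the callback only imitates a concurrent shrinker; this is not the do_huge_pmd_wp_page() code.

```c
#include <stddef.h>

struct page { int id; };

enum fault_path { ZERO_PAGE_PATH, COPY_PATH };

/* Shared state that another context may reset at any time. */
struct page *shared_zero_page;

/* Imitates the shrinker releasing the shared zero page. */
void simulate_shrinker(void)
{
    shared_zero_page = NULL;
}

/* Racy predicate: its answer can change between two calls. */
static int is_zero_page(const struct page *p)
{
    return shared_zero_page != NULL && p == shared_zero_page;
}

/*
 * Decide how to handle a write fault on 'mapped'. The predicate is
 * evaluated exactly once; 'page' stays NULL for the zero-page case and
 * every later branch tests the local instead of asking again.
 */
enum fault_path handle_write_fault(struct page *mapped,
                                   void (*concurrent)(void))
{
    struct page *page = NULL;

    if (!is_zero_page(mapped))      /* the only evaluation */
        page = mapped;

    if (concurrent)
        concurrent();               /* shared state may change here */

    if (!page)                      /* stable: derived from the local */
        return ZERO_PAGE_PATH;
    return COPY_PATH;
}
```

Had the last branch called is_zero_page(mapped) again, the zero-page case would fall through to the copy path with a NULL page after the shrinker ran, which is the shape of the fault reported above.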

(Note: this is not the same issue as trinity's DEBUG_PAGEALLOC BUG
in copy_page_rep with RSI: ffff88009c422000, reported by Sasha Levin
in https://lkml.org/lkml/2013/3/29/103. I believe that one is due
to the source page being split, and a tail page freed, while copy
is in progress; and not a problem without DEBUG_PAGEALLOC, since
the pmd_same check will prevent a miscopy from being made visible.)

Fixes: 97ae17497e99 ("thp: implement refcounting for huge zero page")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # v3.10 v3.11 v3.12
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

eecc1e42

Hugh Dickins

12y

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

8b6d79f5

Linus Torvalds

12y

kvm: x86: fix apic_base enable check

0dce7cd6

Andrew Jones

12y

fork: Allow CLONE_PARENT after setns(CLONE_NEWPID)

Serge Hallyn <serge.hallyn@ubuntu.com> writes:
> Hi Oleg,
>
> commit 40a0d32d1eaffe6aac7324ca92604b6b3977eb0e :
> "fork: unify and tighten up CLONE_NEWUSER/CLONE_NEWPID checks"
> breaks lxc-attach in 3.12. That code forks a child which does
> setns() and then does a clone(CLONE_PARENT). That way the
> grandchild can be in the right namespaces (which the child was
> not) and be a child of the original task, which is the monitor.
>
> lxc-attach in 3.11 was working fine with no side effects that I
> could see. Is there a real danger in allowing CLONE_PARENT
> when current->nsproxy->pidns_for_children is not our pidns,
> or was this done out of an "over-abundance of caution"? Can we
> safely revert that new extra check?

The two fundamental things I know we can not allow are:
- A shared signal queue aka CLONE_THREAD. Because we compute the pid
and uid of the signal when we place it in the queue.

- Changing the pid and by extension pid_namespace of an existing
process.

From a parent's perspective there is nothing special about the pid
namespace, to deny CLONE_PARENT, because the parent simply won't know or
care.

From the child's perspective all that is special really are shared signal
queues.

User mode threading with CLONE_PARENT|CLONE_VM|CLONE_SIGHAND and tasks
in different pid namespaces is almost certainly going to break because
it is complicated. But shared signal handlers can look at per thread
information to know which pid namespace a process is in, so I don't know
of any reason not to support CLONE_PARENT|CLONE_VM|CLONE_SIGHAND threads
at the kernel level. It would be absolutely stupid to implement but
that is a different thing.

So hmm.

Because it can do no harm, and because it is a regression, let's remove
the CLONE_PARENT check and send it to stable.

Cc: stable@vger.kernel.org
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>