commits

This reverts commit 55956b59df336f6738da916dbb520b6e37df9fbd.

commit 55956b59df33 ("vfs: Allow userns root to call mknod on owned filesystems.")
enabled mknod() in user namespaces for userns root if CAP_MKNOD is
available. However, these device nodes are useless since any filesystem
mounted from a non-initial user namespace will set the SB_I_NODEV flag on
the filesystem. Now, when a device node is created in a non-initial user
namespace a call to open() on said device node will fail due to:

bool may_open_dev(const struct path *path)
{
	return !(path->mnt->mnt_flags & MNT_NODEV) &&
		!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
}

The problem with this is that as of the aforementioned commit mknod()
creates partially functional device nodes in non-initial user namespaces.
In particular, it has the consequence that as of the aforementioned commit
open() will be more privileged with respect to device nodes than mknod().
Before it was the other way around. Specifically, if mknod() succeeded
then it was transparent for any userspace application that a fatal error
must have occurred when open() failed.

All of this breaks multiple userspace workloads and a widespread assumption
about how to handle mknod(). Basically, all container runtimes and systemd
live by the slogan "ask for forgiveness not permission" when running user
namespace workloads. For mknod() the assumption is that if the syscall
succeeds the device nodes are useable irrespective of whether it succeeds
in a non-initial user namespace or not. This logic was chosen explicitly
to allow for the glorious day when mknod() will actually be able to create
fully functional device nodes in user namespaces.
A specific problem people are already running into when running 4.18 rc
kernels is failing systemd services. For any distro that is run in a
container systemd services started with the PrivateDevices= property set
will fail to start since the device nodes in question cannot be
opened (cf. the arguments in [1]).

Full disclosure, Seth made the very sound argument that it is already
possible to end up with partially functional device nodes. Any filesystem
mounted with MS_NODEV set will allow mknod() to succeed but will not allow
open() to succeed. The difference to the case here is that the MS_NODEV
case is transparent to userspace since it is an explicitly set mount option
while the SB_I_NODEV case is an implicit property enforced by the kernel
and hence opaque to userspace.
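The distinction between the two NODEV cases can be sketched in user space.
The following is only a model, not kernel code: struct model_mount, the
MODEL_* constants, and both helper functions are hypothetical names
introduced for illustration, and the flag values are stand-ins. It assumes
that the explicit nodev mount option ends up in the per-mount flags, while
SB_I_NODEV lives in the internal superblock flags that userspace did not
ask for.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel flag bits; the values are assumptions. */
#define MODEL_MNT_NODEV   0x02u
#define MODEL_SB_I_NODEV  0x04u

/* Hypothetical, flattened view of the two fields may_open_dev() consults. */
struct model_mount {
	unsigned int mnt_flags; /* per-mount flags, derived from mount options */
	unsigned int s_iflags;  /* internal superblock flags, set by the kernel */
};

enum nodev_reason {
	NODEV_NONE,         /* device nodes may be opened */
	NODEV_MOUNT_OPTION, /* explicit option, visible to userspace */
	NODEV_SUPERBLOCK,   /* implicit kernel property, opaque to userspace */
};

/* Same predicate as may_open_dev(), applied to the flattened model. */
static bool model_may_open_dev(const struct model_mount *m)
{
	return !(m->mnt_flags & MODEL_MNT_NODEV) &&
	       !(m->s_iflags & MODEL_SB_I_NODEV);
}

/* Report which of the two conditions would make open() fail, if any. */
static enum nodev_reason model_nodev_reason(const struct model_mount *m)
{
	if (m->mnt_flags & MODEL_MNT_NODEV)
		return NODEV_MOUNT_OPTION;
	if (m->s_iflags & MODEL_SB_I_NODEV)
		return NODEV_SUPERBLOCK;
	return NODEV_NONE;
}
```

In this model a mount with only MODEL_SB_I_NODEV set fails the open check
even though no mount option requested it, which is the opaque case the
revert is concerned with.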

[1]: https://github.com/systemd/systemd/pull/9483

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7y ago

Mans Rullgard

9bc30ab8

auxdisplay: charlcd: fix x/y command parsing

7y ago

Linus Torvalds

40e020c1

Linux 4.20-rc6 v4.20-rc6

7y ago

Varun Prakash

ed076c55

scsi: target: iscsi: cxgbit: fix csk leak

7y ago

David Howells

4584ae96

afs: Fix missing net error handling

7y ago

Christoph Hellwig

0cd60eb1

dma-mapping: fix flags in dma_alloc_wc

7y ago

Linus Torvalds

d48f782e

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:
"A decent batch of fixes here. I'd say about half are for problems that
have existed for a while, and half are for new regressions added in
the 4.20 merge window.

1) Fix 10G SFP phy module detection in mvpp2, from Baruch Siach.

2) Revert bogus emac driver change, from Benjamin Herrenschmidt.

3) Handle BPF exported data structure with pointers when building
32-bit userland, from Daniel Borkmann.

4) Memory leak fix in act_police, from Davide Caratti.

5) Check RX checksum offload in RX descriptors properly in aquantia
driver, from Dmitry Bogdanov.

6) SKB unlink fix in various spots, from Edward Cree.

7) ndo_dflt_fdb_dump() only works with ethernet, enforce this, from
Eric Dumazet.

8) Fix FID leak in mlxsw driver, from Ido Schimmel.

9) IOTLB locking fix in vhost, from Jean-Philippe Brucker.

10) Fix SKB truesize accounting in ipv4/ipv6/netfilter frag memory
limits otherwise namespace exit can hang. From Jiri Wiesner.

11) Address block parsing length fixes in x25 from Martin Schiller.

12) IRQ and ring accounting fixes in bnxt_en, from Michael Chan.

13) For tun interfaces, only iface delete works with rtnl ops, enforce
this by disallowing add. From Nicolas Dichtel.

14) Use after free in liquidio, from Pan Bian.

15) Fix SKB use after passing to netif_receive_skb(), from Prashant
Bhole.

16) Static key accounting and other fixes in XPS from Sabrina Dubroca.

17) Partially initialized flow key passed to ip6_route_output(), from
Shmulik Ladkani.

18) Fix RTNL deadlock during reset in ibmvnic driver, from Thomas
Falcon.

19) Several small TCP fixes (off-by-one on window probe abort, NULL
deref in tail loss probe, SNMP mis-estimations) from Yuchung
Cheng"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (93 commits)
net/sched: cls_flower: Reject duplicated rules also under skip_sw
bnxt_en: Fix _bnxt_get_max_rings() for 57500 chips.
bnxt_en: Fix NQ/CP rings accounting on the new 57500 chips.
bnxt_en: Keep track of reserved IRQs.
bnxt_en: Fix CNP CoS queue regression.
net/mlx4_core: Correctly set PFC param if global pause is turned off.
Revert "net/ibm/emac: wrong bit is used for STA control"
neighbour: Avoid writing before skb->head in neigh_hh_output()
ipv6: Check available headroom in ip6_xmit() even without options
tcp: lack of available data can also cause TSO defer
ipv6: sr: properly initialize flowi6 prior passing to ip6_route_output
mlxsw: spectrum_switchdev: Fix VLAN device deletion via ioctl
mlxsw: spectrum_router: Relax GRE decap matching check
mlxsw: spectrum_switchdev: Avoid leaking FID's reference count
mlxsw: spectrum_nve: Remove easily triggerable warnings
ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
sctp: frag_point sanity check
tcp: fix NULL ref in tail loss probe
tcp: Do not underestimate rwnd_limited
net: use skb_list_del_init() to remove from RX sublists
...

7y ago

Martin K. Petersen

60a89a3c

scsi: t10-pi: Return correct ref tag when queue has no integrity profile

7y ago

David Howells

ae3b7361

afs: Fix validation/callback interaction

7y ago

Linus Torvalds

23203e3f

Merge branch 'akpm' (patches from Andrew)

7y ago

Linus Torvalds

8586ca8a

Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

7y ago

Or Gerlitz

35cc3cef

net/sched: cls_flower: Reject duplicated rules also under skip_sw

7y ago

Dan Carpenter

9ae4f842

scsi: bnx2fc: Fix NULL dereference in error handling

7y ago

Al Viro

78e1f386

iov_iter: teach csum_and_copy_to_iter() to handle pipe-backed ones

7y ago

Linus Torvalds

6cafab50

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc

7y ago

Oscar Salvador

17e2e7d7

mm, page_alloc: fix has_unmovable_pages for HugePages

7y ago

Linus Torvalds

ebbd3000

Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

7y ago

Nick Desaulniers

ac3e233d

x86/vdso: Drop implicit common-page-size linker flag

7y ago

David S. Miller

d4b60e94

Merge branch 'bnxt_en-Bug-fixes'

7y ago

Himanshu Madhani

c64a87f9

Revert "scsi: qla2xxx: Fix NVMe Target discovery"

7y ago

Pan Bian

2084ac6c

exportfs: do not read dentry after free

7y ago

Linus Torvalds

87935eee

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

7y ago

Yangtao Li

d430aff8

serial/sunsu: fix refcount leak

7y ago

Rik van Riel

5eed6f1d

fork,memcg: fix crash in free_thread_stack on memcg charge fail

Commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting") will
result in fork failing if allocating a kernel stack for a task in
dup_task_struct exceeds the kernel memory allowance for that cgroup.

Unfortunately, it also results in a crash.

This is due to the code jumping to free_stack and calling
free_thread_stack when the memcg kernel stack charge fails, but without
tsk->stack pointing at the freshly allocated stack.

This in turn results in the vfree_atomic in free_thread_stack oopsing
with a backtrace like this:

#5 [ffffc900244efc88] die at ffffffff8101f0ab
#6 [ffffc900244efcb8] do_general_protection at ffffffff8101cb86
#7 [ffffc900244efce0] general_protection at ffffffff818ff082
[exception RIP: llist_add_batch+7]
RIP: ffffffff8150d487 RSP: ffffc900244efd98 RFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff88085ef55980 RCX: 0000000000000000
RDX: ffff88085ef55980 RSI: 343834343531203a RDI: 343834343531203a
RBP: ffffc900244efd98 R8: 0000000000000001 R9: ffff8808578c3600
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88029f6c21c0
R13: 0000000000000286 R14: ffff880147759b00 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#8 [ffffc900244efda0] vfree_atomic at ffffffff811df2c7
#9 [ffffc900244efdb8] copy_process at ffffffff81086e37
#10 [ffffc900244efe98] _do_fork at ffffffff810884e0
#11 [ffffc900244eff10] sys_vfork at ffffffff810887ff
#12 [ffffc900244eff20] do_syscall_64 at ffffffff81002a43
RIP: 000000000049b948 RSP: 00007ffcdb307830 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000000000896030 RCX: 000000000049b948
RDX: 0000000000000000 RSI: 00007ffcdb307790 RDI: 00000000005d7421
RBP: 000000000067370f R8: 00007ffcdb3077b0 R9: 000000000001ed00
R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000040
R13: 000000000000000f R14: 0000000000000000 R15: 000000000088d018
ORIG_RAX: 000000000000003a CS: 0033 SS: 002b

The simplest fix is to assign tsk->stack right where it is allocated.
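A user-space sketch of the idea, under stated assumptions: struct
model_task and the model_* functions are hypothetical stand-ins for
task_struct, the stack allocation, free_thread_stack(), and
dup_task_struct(), and malloc()/free() replace the kernel allocators. It is
not the actual patch; it only shows why recording the pointer at allocation
time makes the error path safe.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the part of task_struct that matters here. */
struct model_task {
	void *stack;
};

/* Allocate a stack and record it in the task right away (the fix). */
static void *model_alloc_thread_stack(struct model_task *tsk)
{
	void *stack = malloc(4096);

	tsk->stack = stack;
	return stack;
}

/* The cleanup path only ever looks at tsk->stack. */
static void model_free_thread_stack(struct model_task *tsk)
{
	free(tsk->stack);
	tsk->stack = NULL;
}

/*
 * charge_ok models the memcg kernel stack charge. When it fails, the
 * cleanup path runs, and it must find the freshly allocated stack.
 */
static int model_dup_task(struct model_task *tsk, bool charge_ok)
{
	if (!model_alloc_thread_stack(tsk))
		return -1;

	if (!charge_ok) {
		model_free_thread_stack(tsk);
		return -1;
	}
	return 0;
}
```

If the assignment were instead deferred until after the charge, the failure
path in this model would free whatever tsk->stack happened to contain, which
mirrors the stale pointer reaching vfree_atomic in the backtrace.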

Link: http://lkml.kernel.org/r/20181214231726.7ee4843c@imladris.surriel.com
Fixes: 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7y ago

Linus Torvalds

4b04e73a

Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

7y ago

Andrea Righi

a50480cb

kprobes/x86: Blacklist non-attachable interrupt functions

7y ago

Masahiro Yamada

25896d07

x86/build: Fix compiler support check for CONFIG_RETPOLINE

7y ago

Tarick Bedeir

bd5122cd

net/mlx4_core: Correctly set PFC param if global pause is turned off.

7y ago

Michael Chan

e30fbc33

bnxt_en: Fix _bnxt_get_max_rings() for 57500 chips.

7y ago

Dexuan Cui

c9675904

scsi: storvsc: Fix a race in sub-channel creation that can cause panic

7y ago

YueHaibing

909e22e0

exportfs: fix 'passing zero to ERR_PTR()' warning

7y ago

Linus Torvalds

5092adb2

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

7y ago

Daniele Palmas

d667044f

qmi_wwan: Fix qmap header retrieval in qmimux_rx_fixup

7y ago

Corentin Labbe

afaffac3

sparc: Set "ARCH: sunxx" information on the same line

7y ago

Peter Xu

2e83ee1d

mm: thp: fix flags for pmd migration when split

7y ago

Linus Torvalds

0844895a

Merge tag 'char-misc-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

7y ago

YiFei Zhu

79c2206d

x86/earlyprintk/efi: Fix infinite loop on some screen widths

7y ago

Masami Hiramatsu

43a1b0cb

kprobes/x86: Fix instruction patching corruption when copying more than one RIP-relative instruction

After copy_optimized_instructions() copies several instructions
to the working buffer it tries to fix up the real RIP address, but it
adjusts the RIP-relative instruction with an incorrect RIP address
for the 2nd and subsequent instructions due to a bug in the logic.

This will break the kernel pretty badly (with likely outcomes such as
a kernel freeze, a crash, or worse) because probed instructions can refer
to the wrong data.

For example putting kprobes on cpumask_next() typically hits this bug.

cpumask_next() is normally like below if CONFIG_CPUMASK_OFFSTACK=y
(in this case nr_cpumask_bits is an alias of nr_cpu_ids):

<cpumask_next>:
48 89 f0 mov %rsi,%rax
8b 35 7b fb e2 00 mov 0xe2fb7b(%rip),%esi # ffffffff82db9e64 <nr_cpu_ids>
55 push %rbp
...

If we put a kprobe on it and it gets jump-optimized, it gets
patched by the kprobes code like this:

<cpumask_next>:
e9 95 7d 07 1e jmpq 0xffffffffa000207a
7b fb jnp 0xffffffff81f8a2e2 <cpumask_next+2>
e2 00 loop 0xffffffff81f8a2e9 <cpumask_next+9>
55 push %rbp

This shows that the first two MOV instructions were copied to a
trampoline buffer at 0xffffffffa000207a.

Here is the disassembled result of the trampoline, skipping
the optprobe template instructions:

# Dump of assembly code from 0xffffffffa000207a to 0xffffffffa00020ea:

54 push %rsp
...
48 83 c4 08 add $0x8,%rsp
9d popfq
48 89 f0 mov %rsi,%rax
8b 35 82 7d db e2 mov -0x1d24827e(%rip),%esi # 0xffffffff82db9e67 <nr_cpu_ids+3>

This dump shows that the second MOV accesses *(nr_cpu_ids+3) instead of
the original *nr_cpu_ids. This leads to a kernel freeze because
cpumask_next() always returns 0 and for_each_cpu() never ends.

Fix this by adding 'len' correctly to the real RIP address while
copying.
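The arithmetic behind the fix can be sketched as follows. This is an
illustrative model, not the kernel function: model_fixup_disp() and
model_rip_target() are hypothetical names. It assumes the new displacement
is computed as the original instruction address plus the old displacement
minus the address the copy will execute from; the instruction length cancels
out because it is the same on both sides.

```c
#include <stdint.h>

/*
 * Recompute a RIP-relative displacement for an instruction that moves
 * from orig_ip to real_ip. Both must be the address of this particular
 * instruction, not of the first instruction in the copied block.
 * The conversions assume two's complement, as on all Linux targets.
 */
static int32_t model_fixup_disp(uint64_t orig_ip, uint64_t real_ip,
				int32_t disp)
{
	int64_t delta = (int64_t)(orig_ip - real_ip);

	return (int32_t)(delta + disp);
}

/* Address a RIP-relative operand refers to when executed at ip. */
static uint64_t model_rip_target(uint64_t ip, unsigned int insn_len,
				 int32_t disp)
{
	return ip + insn_len + (uint64_t)(int64_t)disp;
}
```

Working backwards from the dumps above, the second MOV sits at
cpumask_next+3 (0xffffffff81f8a2e3) and its copy at 0xffffffffa00020df.
Passing 0xffffffffa00020dc, the address of the first copied instruction,
as real_ip reproduces the displacement -0x1d24827e seen in the trampoline
and a target of nr_cpu_ids+3; passing the copy's own address gives
-0x1d248281, which resolves to nr_cpu_ids again.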

[ mingo: Improved the changelog. ]

Reported-by: Michael Rodin <michael@rodin.online>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org # v4.15+
Fixes: 63fef14fc98a ("kprobes/x86: Make insn buffer always ROX and use text_poke()")
Link: http://lkml.kernel.org/r/153504457253.22602.1314289671019919596.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>

7y ago

Juergen Gross

182ddd16

x86/boot: Clear RSDP address in boot_params for broken loaders

7y ago

Benjamin Herrenschmidt

5b3279e2

Revert "net/ibm/emac: wrong bit is used for STA control"

7y ago

Michael Chan

c0b8cda0

bnxt_en: Fix NQ/CP rings accounting on the new 57500 chips.

7y ago

Cathy Avery

02f425f8

scsi: vmw_pscsi: Rearrange code to avoid multiple calls to free_irq during unload

7y ago

Jens Axboe

53fffe29

aio: fix failure to put the file pointer

7y ago

Linus Torvalds

e572fa0e

Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

7y ago

Vitaly Kuznetsov

3cf85f9f

KVM: x86: nSVM: fix switch to guest mmu

7y ago

Jörgen Storvist

7c3db410

qmi_wwan: Add support for Fibocom NL678 series

7y ago

ndesaulniers@google.com

0ff70f62

sparc: vdso: Drop implicit common-page-size linker flag

7y ago

Mikhail Zaslonko

2830bf6f

mm, memory_hotplug: initialize struct pages for the full memory section

If memory end is not aligned with the sparse memory section boundary,
the mapping of such a section is only partly initialized. This may lead
to VM_BUG_ON due to uninitialized struct page access from
is_mem_section_removable() or test_pages_in_a_zone() function triggered
by memory_hotplug sysfs handlers:

Here are the panic examples:
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_VM_PGFLAGS=y

kernel parameter mem=2050M
--------------------------
page:000003d082008000 is uninitialized and poisoned
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
Call Trace:
( test_pages_in_a_zone+0xde/0x160)
show_valid_zones+0x5c/0x190
dev_attr_show+0x34/0x70
sysfs_kf_seq_show+0xc8/0x148
seq_read+0x204/0x480
__vfs_read+0x32/0x178
vfs_read+0x82/0x138
ksys_read+0x5a/0xb0
system_call+0xdc/0x2d8
Last Breaking-Event-Address:
test_pages_in_a_zone+0xde/0x160
Kernel panic - not syncing: Fatal exception: panic_on_oops

kernel parameter mem=3075M
--------------------------
page:000003d08300c000 is uninitialized and poisoned
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
Call Trace:
( is_mem_section_removable+0xb4/0x190)
show_mem_removable+0x9a/0xd8
dev_attr_show+0x34/0x70
sysfs_kf_seq_show+0xc8/0x148
seq_read+0x204/0x480
__vfs_read+0x32/0x178
vfs_read+0x82/0x138
ksys_read+0x5a/0xb0
system_call+0xdc/0x2d8
Last Breaking-Event-Address:
is_mem_section_removable+0xb4/0x190
Kernel panic - not syncing: Fatal exception: panic_on_oops

Fix the problem by initializing the last memory section of each zone in
memmap_init_zone() till the very end, even if it goes beyond the zone end.
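As a rough illustration of "till the very end", the end of the range to
initialize can be thought of as the zone end rounded up to the next section
boundary. The helper below is a hypothetical sketch, not the code in
memmap_init_zone(); the section size is configuration dependent, so
pages_per_section is passed in as a parameter and is assumed to be a power
of two.

```c
#include <stdint.h>

/*
 * Round end_pfn up to the next section boundary, so that every struct
 * page of the last (partial) section falls inside the initialized range.
 * pages_per_section must be a power of two.
 */
static uint64_t model_section_end_pfn(uint64_t end_pfn,
				      uint64_t pages_per_section)
{
	return (end_pfn + pages_per_section - 1) & ~(pages_per_section - 1);
}
```

For example, assuming 4 KiB pages and 256 MiB sections (65536 pages per
section), mem=2050M ends at pfn 524800, and the helper returns 589824, the
end of the ninth section.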

Michal said:

: This has always been a problem AFAIU. It just went unnoticed because we
: have zeroed memmaps during allocation before f7f99100d8d9 ("mm: stop
: zeroing memory during allocation in vmemmap") and so the above test
: would simply skip these ranges as belonging to zone 0 or provided a
: garbage.
:
: So I guess we do care for post f7f99100d8d9 kernels mostly and
: therefore Fixes: f7f99100d8d9 ("mm: stop zeroing memory during
: allocation in vmemmap")

Link: http://lkml.kernel.org/r/20181212172712.34019-2-zaslonko@linux.ibm.com
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7y ago

Linus Torvalds

47dcb080

Merge tag 'staging-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

7y ago

Linux 4.20 v4.20

8fe28cb5

Linus Torvalds

Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

3c730b10

Linus Torvalds

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

9105b8aa

Linus Torvalds

proc/sysctl: don't return ENOMEM on lookup when a table is unregistering

ea5751cc

Ivan Delalande

Merge tag 'compiler-attributes-for-linus-v4.20' of https://github.com/ojeda/linux

1104bd96

Linus Torvalds

scsi: sd: use mempool for discard special page

61cce6f6

Jens Axboe

aio: fix spectre gadget in lookup_ioctx

0afa9964

Jeff Moyer

Merge tag 'auxdisplay-for-linus-v4.20' of https://github.com/ojeda/linux

38c0ecf6

Linus Torvalds

include/linux/compiler_types.h: don't pollute userspace with macro definitions

71391bdd

Xiaozhou Liu

scsi: target: iscsi: cxgbit: add missing spin_lock_init()

9e6371d3

Varun Prakash

afs: Use d_instantiate() rather than d_add() and don't d_drop()

73116df7

David Howells

Revert "vfs: Allow userns root to call mknod on owned filesystems."

94f82008

Christian Brauner

auxdisplay: charlcd: fix x/y command parsing

9bc30ab8

Mans Rullgard

Linux 4.20-rc6 v4.20-rc6

40e020c1

Linus Torvalds

scsi: target: iscsi: cxgbit: fix csk leak

ed076c55

Varun Prakash

afs: Fix missing net error handling

4584ae96

David Howells

dma-mapping: fix flags in dma_alloc_wc

0cd60eb1

Christoph Hellwig

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

d48f782e

Linus Torvalds

scsi: t10-pi: Return correct ref tag when queue has no integrity profile

60a89a3c

Martin K. Petersen

afs: Fix validation/callback interaction

ae3b7361

David Howells

Merge branch 'akpm' (patches from Andrew)

23203e3f

Linus Torvalds

Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

8586ca8a

Linus Torvalds

net/sched: cls_flower: Reject duplicated rules also under skip_sw

35cc3cef

Or Gerlitz

scsi: bnx2fc: Fix NULL dereference in error handling

9ae4f842

Dan Carpenter

iov_iter: teach csum_and_copy_to_iter() to handle pipe-backed ones

78e1f386

Al Viro

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc

6cafab50

Linus Torvalds

mm, page_alloc: fix has_unmovable_pages for HugePages

While playing with gigantic hugepages and memory_hotplug, I triggered
the following #PF when "cat memoryX/removable":

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
#PF error: [normal kernel read fault]
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 1 PID: 1481 Comm: cat Tainted: G E 4.20.0-rc6-mm1-1-default+ #18
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:has_unmovable_pages+0x154/0x210
Call Trace:
is_mem_section_removable+0x7d/0x100
removable_show+0x90/0xb0
dev_attr_show+0x1c/0x50
sysfs_kf_seq_show+0xca/0x1b0
seq_read+0x133/0x380
__vfs_read+0x26/0x180
vfs_read+0x89/0x140
ksys_read+0x42/0x90
do_syscall_64+0x5b/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9

The reason is that we do not pass the head page to page_hstate(), so the
call to compound_order() in page_hstate() returns 0 and we end up checking
every hstate's size against PAGE_SIZE.

Obviously, we do not find any hstate matching that size, and we return
NULL. Then, we dereference that NULL pointer in
hugepage_migration_supported() and we get the #PF from above.

Fix that by getting the head page before calling page_hstate().

Also, since gigantic pages span several pageblocks, re-adjust the logic
for skipping pages. While at it, we can also get rid of the
round_up().
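The failure mode and the fix can be modelled in user space. Everything
below is a hypothetical simplification: struct model_page, struct
model_hstate, the model_* helpers, and the two orders in the table are
illustrative and do not match the kernel's data structures. The model
assumes that only the head page of a compound page carries the order and
that tail pages report order 0.

```c
#include <stddef.h>

/* Hypothetical huge page size descriptor, identified by its order. */
struct model_hstate {
	unsigned int order;
};

/* Hypothetical page: tail pages point at their head and report order 0. */
struct model_page {
	struct model_page *head; /* points to itself for a head page */
	unsigned int order;      /* meaningful on the head page only */
};

/* Illustrative table of supported huge page orders. */
static struct model_hstate model_hstates[] = { { 9 }, { 18 } };

/* Find the hstate matching an order; NULL if there is none. */
static struct model_hstate *model_size_to_hstate(unsigned int order)
{
	size_t n = sizeof(model_hstates) / sizeof(model_hstates[0]);

	for (size_t i = 0; i < n; i++)
		if (model_hstates[i].order == order)
			return &model_hstates[i];
	return NULL;
}

/* Looks only at the page it is given, so a tail page yields NULL. */
static struct model_hstate *model_page_hstate(const struct model_page *page)
{
	return model_size_to_hstate(page->order);
}

/* The fix in miniature: resolve the head page before the lookup. */
static struct model_hstate *model_hstate_of(const struct model_page *page)
{
	return model_page_hstate(page->head);
}
```

Calling model_page_hstate() on a tail page returns NULL in this model,
which is the pointer the caller went on to dereference; going through
model_hstate_of() returns the hstate of the enclosing huge page.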

[osalvador@suse.de: remove round_up(), adjust skip pages logic per Michal]
Link: http://lkml.kernel.org/r/20181221062809.31771-1-osalvador@suse.de
Link: http://lkml.kernel.org/r/20181217225113.17864-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>