"mem" in the name already indicates the root, similar to
release_mem_region() and devm_request_mem_region(). Make it implicit.
The only single caller always passes iomem_resource, other parents are not
applicable.
Suggested-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Link: https://lkml.kernel.org/r/20200916073041.10355-1-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some add_memory*() users add memory in small, contiguous memory blocks.
Examples include virtio-mem, hyper-v balloon, and the XEN balloon.
This can quickly result in a lot of memory resources, whereby the actual
resource boundaries are not of interest (e.g., it might be relevant for
DIMMs, exposed via /proc/iomem to user space). We really want to merge
added resources in this scenario where possible.
Let's provide a flag (MEMHP_MERGE_RESOURCE) to specify that a resource
either created within add_memory*() or passed via add_memory_resource()
shall be marked mergeable and merged with applicable siblings.
To implement that, we need a kernel/resource interface to mark selected
System RAM resources mergeable (IORESOURCE_SYSRAM_MERGEABLE) and trigger
merging.
Note: We really want to merge after the whole operation succeeded, not
directly when adding a resource to the resource tree (it would break
add_memory_resource() and require splitting resources again when the
operation failed - e.g., due to -ENOMEM).
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Julien Grall <julien@xen.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Leonardo Bras <leobras.c@gmail.com>
Cc: Libor Pechacek <lpechacek@suse.cz>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Link: https://lkml.kernel.org/r/20200911103459.10306-6-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "selective merging of system ram resources", v4.
Some add_memory*() users add memory in small, contiguous memory blocks.
Examples include virtio-mem, hyper-v balloon, and the XEN balloon.
This can quickly result in a lot of memory resources, whereby the actual
resource boundaries are not of interest (e.g., it might be relevant for
DIMMs, exposed via /proc/iomem to user space). We really want to merge
added resources in this scenario where possible.
Resources are effectively stored in a list-based tree. Having a lot of
resources not only wastes memory, it also makes traversing that tree more
expensive, and makes /proc/iomem explode in size (e.g., requiring
kexec-tools to manually merge resources when creating a kdump header. The
current kexec-tools resource count limit does not allow for more than
~100GB of memory with a memory block size of 128MB on x86-64).
Let's allow to selectively merge system ram resources by specifying a new
flag for add_memory*(). Patch #5 contains a /proc/iomem example. Only
tested with virtio-mem.
This patch (of 8):
Let's make sure splitting a resource on memory hotunplug will never fail.
This will become more relevant once we merge selected System RAM resources
- then, we'll trigger that case more often on memory hotunplug.
In general, this function is already unlikely to fail. When we remove
memory, we free up quite a lot of metadata (memmap, page tables, memory
block device, etc.). The only reason it could really fail would be when
injecting allocation errors.
All other error cases inside release_mem_region_adjustable() seem to be
sanity checks if the function would be abused in different context - let's
add WARN_ON_ONCE() in these cases so we can catch them.
[natechancellor@gmail.com: fix use of ternary condition in release_mem_region_adjustable]
Link: https://lkml.kernel.org/r/20200922060748.2452056-1-natechancellor@gmail.com
Link: https://github.com/ClangBuiltLinux/linux/issues/1159
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Julien Grall <julien@xen.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Leonardo Bras <leobras.c@gmail.com>
Cc: Libor Pechacek <lpechacek@suse.cz>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Roger Pau Monn <roger.pau@citrix.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Wei Liu <wei.liu@kernel.org>
Link: https://lkml.kernel.org/r/20200911103459.10306-2-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In support of detecting whether a resource might have been been claimed,
report the parent to the walk_iomem_res_desc() callback. For example, the
ACPI HMAT parser publishes "hmem" platform devices per target range.
However, if the HMAT is disabled / missing a fallback driver can attach
devices to the raw memory ranges as a fallback if it sees unclaimed /
orphan "Soft Reserved" resources in the resource tree.
Otherwise, find_next_iomem_res() returns a resource with garbage data from
the stack allocation in __walk_iomem_res_desc() for the res->parent field.
There are currently no users that expect ->child and ->sibling to be
valid, and the resource_lock would be needed to traverse them. Use a
compound literal to implicitly zero initialize the fields that are not
being returned in addition to setting ->parent.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Link: https://lkml.kernel.org/r/159643097166.4062302.11875688887228572793.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Close the hole of holding a mapping over kernel driver takeover event of
a given address range.
Commit 90a545e981 ("restrict /dev/mem to idle io memory ranges")
introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the
kernel against scenarios where a /dev/mem user tramples memory that a
kernel driver owns. However, this protection only prevents *new* read(),
write() and mmap() requests. Established mappings prior to the driver
calling request_mem_region() are left alone.
Especially with persistent memory, and the core kernel metadata that is
stored there, there are plentiful scenarios for a /dev/mem user to
violate the expectations of the driver and cause amplified damage.
Teach request_mem_region() to find and shoot down active /dev/mem
mappings that it believes it has successfully claimed for the exclusive
use of the driver. Effectively a driver call to request_mem_region()
becomes a hole-punch on the /dev/mem device.
The typical usage of unmap_mapping_range() is part of
truncate_pagecache() to punch a hole in a file, but in this case the
implementation is only doing the "first half" of a hole punch. Namely it
is just evacuating current established mappings of the "hole", and it
relies on the fact that /dev/mem establishes mappings in terms of
absolute physical address offsets. Once existing mmap users are
invalidated they can attempt to re-establish the mapping, or attempt to
continue issuing read(2) / write(2) to the invalidated extent, but they
will then be subject to the CONFIG_IO_STRICT_DEVMEM checking that can
block those subsequent accesses.
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 90a545e981 ("restrict /dev/mem to idle io memory ranges")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/159009507306.847224.8502634072429766747.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Patch series "mm/memory_hotplug: online_pages() cleanups", v2.
Some cleanups (+ one fix for a special case) in the context of
online_pages().
This patch (of 5):
This makes it clearer that we will never call func() with duplicate PFNs
in case we have multiple sub-page memory resources. All unaligned parts
of PFNs are completely discarded.
Link: http://lkml.kernel.org/r/20190814154109.3448-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Factor out the guts of devm_request_free_mem_region so that we can
implement both a device managed and a manually release version as tiny
wrappers around it.
Link: https://lore.kernel.org/r/20190818090557.17853-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
find_next_iomem_res() shows up to be a source for overhead in dax
benchmarks.
Improve performance by not considering children of the tree if the top
level does not match. Since the range of the parents should include the
range of the children such check is redundant.
Running sysbench on dax (pmem emulation, with write_cache disabled):
sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
--file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run
Provides the following results:
events (avg/stddev)
-------------------
5.2-rc3: 1247669.0000/16075.39
w/patch: 1286320.5000/16402.72 (+3%)
Link: http://lkml.kernel.org/r/20190613045903.4922-3-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since resources can be removed, locking should ensure that the resource
is not removed while accessing it. However, find_next_iomem_res() does
not hold the lock while copying the data of the resource.
Keep holding the lock while the data is copied. While at it, change the
return value to a more informative value. It is disregarded by the
callers.
[akpm@linux-foundation.org: fix find_next_iomem_res() documentation]
Link: http://lkml.kernel.org/r/20190613045903.4922-2-namit@vmware.com
Fixes: ff3cc952d3 ("resource: Add remove_resource interface")
Signed-off-by: Nadav Amit <namit@vmware.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Keep the physical address allocation that hmm_add_device does with the
rest of the resource code, and allow future reuse of it without the hmm
wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Replace the /sys/class/dax device model with /sys/bus/dax, and include
a compat driver so distributions can opt-in to the new ABI.
* Allow for an alternative driver for the device-dax address-range
* Introduce the 'kmem' driver to hotplug / assign a device-dax
address-range to the core-mm.
* Arrange for the device-dax target-node to be onlined so that the newly
added memory range can be uniquely referenced by numa apis.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJchWpGAAoJEB7SkWpmfYgCJk8P/0Q1DINszUDO/vKjJ09cDs9P
Jw3it6GBIL50rDOu9QdcprSpwYDD0h1mLAV/m6oa3bVO+p4uWGvnxaxRx2HN2c/v
vhZFtUDpHlqR63vzWMNVKRprYixCRJDUr6xQhhCcE3ak/ELN6w7LWfikKVWv15UL
MfR96IQU38f+xRda/zSXnL9606Dvkvu/inEHj84lRcHIwj3sQAUalrE8bR3O32gZ
bDg/l5kzT49o8ZXUo/TegvRSSSZpJmOl2DD0RW+ax5q3NI2bOXFrVDUKBKxf/hcQ
E/V9i57TrqQx0GqRhnU7rN/v53cFZGGs31TEEIB/xs3bzCnADxwXcjL5b5K005J6
vJjBA2ODBewHFK3uVx46Hy1iV4eCtZWj4QrMnrjdSrjXOfbF5GTbWOhPFgoq7TWf
S7VqFEf3I2gDPaMq4o8Ej1kLH4HMYeor2NSOZjyvGn87rSZ3ZIQguwbaNIVl+itz
gdDt0ZOU0BgOBkV+rZIeZDaGdloWCHcDPL15CkZaOZyzdWhfEZ7dod6ad+9udilU
EUPH62RgzXZtfm5zpebYyjNVLbb9pLZ0nT+UypyGR6zqWx1SqU3mXi63NFXPco+x
XA9j//edPeI6NHg2CXLEh8DLuCg3dG1zWRJANkiF+niBwyCR8CHtGWAoY6soXbKe
2UrXGcIfXxyJ8V9v8v4q
=hfa3
-----END PGP SIGNATURE-----
Merge tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull device-dax updates from Dan Williams:
"New device-dax infrastructure to allow persistent memory and other
"reserved" / performance differentiated memories, to be assigned to
the core-mm as "System RAM".
Some users want to use persistent memory as additional volatile
memory. They are willing to cope with potential performance
differences, for example between DRAM and 3D Xpoint, and want to use
typical Linux memory management apis rather than a userspace memory
allocator layered over an mmap() of a dax file. The administration
model is to decide how much Persistent Memory (pmem) to use as System
RAM, create a device-dax-mode namespace of that size, and then assign
it to the core-mm. The rationale for device-dax is that it is a
generic memory-mapping driver that can be layered over any "special
purpose" memory, not just pmem. On subsequent boots udev rules can be
used to restore the memory assignment.
One implication of using pmem as RAM is that mlock() no longer keeps
data off persistent media. For this reason it is recommended to enable
NVDIMM Security (previously merged for 5.0) to encrypt pmem contents
at rest. We considered making this recommendation an actively enforced
requirement, but in the end decided to leave it as a distribution /
administrator policy to allow for emulation and test environments that
lack security capable NVDIMMs.
Summary:
- Replace the /sys/class/dax device model with /sys/bus/dax, and
include a compat driver so distributions can opt-in to the new ABI.
- Allow for an alternative driver for the device-dax address-range
- Introduce the 'kmem' driver to hotplug / assign a device-dax
address-range to the core-mm.
- Arrange for the device-dax target-node to be onlined so that the
newly added memory range can be uniquely referenced by numa apis"
NOTE! I'm not entirely happy with the whole "PMEM as RAM" model because
we currently have special - and very annoying rules in the kernel about
accessing PMEM only with the "MC safe" accessors, because machine checks
inside the regular repeat string copy functions can be fatal in some
(not described) circumstances.
And apparently the PMEM modules can cause that a lot more than regular
RAM. The argument is that this happens because PMEM doesn't necessarily
get scrubbed at boot like RAM does, but that is planned to be added for
the user space tooling.
Quoting Dan from another email:
"The exposure can be reduced in the volatile-RAM case by scanning for
and clearing errors before it is onlined as RAM. The userspace tooling
for that can be in place before v5.1-final. There's also runtime
notifications of errors via acpi_nfit_uc_error_notify() from
background scrubbers on the DIMM devices. With that mechanism the
kernel could proactively clear newly discovered poison in the volatile
case, but that would be additional development more suitable for v5.2.
I understand the concern, and the need to highlight this issue by
tapping the brakes on feature development, but I don't see PMEM as RAM
making the situation worse when the exposure is also there via DAX in
the PMEM case. Volatile-RAM is arguably a safer use case since it's
possible to repair pages where the persistent case needs active
application coordination"
* tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
device-dax: "Hotplug" persistent memory for use like normal RAM
mm/resource: Let walk_system_ram_range() search child resources
mm/memory-hotplug: Allow memory resources to be children
mm/resource: Move HMM pr_debug() deeper into resource code
mm/resource: Return real error codes from walk failures
device-dax: Add a 'modalias' attribute to DAX 'bus' devices
device-dax: Add a 'target_node' attribute
device-dax: Auto-bind device after successful new_id
acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node
device-dax: Add /sys/class/dax backwards compatibility
device-dax: Add support for a dax override driver
device-dax: Move resource pinning+mapping into the common driver
device-dax: Introduce bus + driver model
device-dax: Start defining a dax bus model
device-dax: Remove multi-resource infrastructure
device-dax: Kill dax_region base
device-dax: Kill dax_region ida
In the process of onlining memory, we use walk_system_ram_range()
to find the actual RAM areas inside of the area being onlined.
However, it currently only finds memory resources which are
"top-level" iomem_resources. Children are not currently
searched which causes it to skip System RAM in areas like this
(in the format of /proc/iomem):
a0000000-bfffffff : Persistent Memory (legacy)
a0000000-afffffff : System RAM
Changing the true->false here allows children to be searched
as well. We need this because we add a new "System RAM"
resource underneath the "persistent memory" resource when
we use persistent memory in a volatile mode.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
HMM consumes physical address space for its own use, even
though nothing is mapped or accessible there. It uses a
special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
to uniquely identify these areas.
When HMM consumes address space, it makes a best guess about
what to consume. However, it is possible that a future memory
or device hotplug can collide with the reserved area. In the
case of these conflicts, there is an error message in
register_memory_resource().
Later patches in this series move register_memory_resource()
from using request_resource_conflict() to __request_region().
Unfortunately, __request_region() does not return the conflict
like the previous function did, which makes it impossible to
check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
resource.
Instead of warning in register_memory_resource(), move the
check into the core resource code itself (__request_region())
where the conflicting resource _is_ available. This has the
added bonus of producing a warning in case of HMM conflicts
with devices *or* RAM address space, as opposed to the RAM-
only warnings that were there previously.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
walk_system_ram_range() can return an error code either becuase
*it* failed, or because the 'func' that it calls returned an
error. The memory hotplug does the following:
ret = walk_system_ram_range(..., func);
if (ret)
return ret;
and 'ret' makes it out to userspace, eventually. The problem
s, walk_system_ram_range() failues that result from *it* failing
(as opposed to 'func') return -1. That leads to a very odd
-EPERM (-1) return code out to userspace.
Make walk_system_ram_range() return -EINVAL for internal
failures to keep userspace less confused.
This return code is compatible with all the callers that I
audited.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Since commit c40dd2f766 ("powerpc: Add System RAM to /proc/iomem")
it is possible to use the generic walk_system_ram_range() and
the generic page_is_ram().
To enable the use of walk_system_ram_range() by the IBM EHEA ethernet
driver, we still need an export of the generic function.
As powerpc was the only user of CONFIG_ARCH_HAS_WALK_MEMORY, the
ifdef around the generic walk_system_ram_range() has become useless
and can be dropped.
Fixes: c40dd2f766 ("powerpc: Add System RAM to /proc/iomem")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Keep the EXPORT_SYMBOL_GPL in powerpc code]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is a preparation for the next patch.
Currently, we only call release_mem_region_adjustable() in __remove_pages
if the zone is not ZONE_DEVICE, because resources that belong to HMM/devm
are being released by themselves with devm_release_mem_region.
Since we do not want to touch any zone/page stuff during the removing of
the memory (but during the offlining), we do not want to check for the
zone here. So we need another way to tell release_mem_region_adjustable()
to not realease the resource in case it belongs to HMM/devm.
HMM/devm acquires/releases a resource through
devm_request_mem_region/devm_release_mem_region.
These resources have the flag IORESOURCE_MEM, while resources acquired by
hot-add memory path (register_memory_resource()) contain
IORESOURCE_SYSTEM_RAM.
So, we can check for this flag in release_mem_region_adjustable, and if
the resource does not contain such flag, we know that we are dealing with
a HMM/devm resource, so we can back off.
Link: http://lkml.kernel.org/r/20181127162005.15833-3-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The first group of warnings is caused by a "/**" kernel-doc notation
marker but the function comments are not in kernel-doc format.
Also add another error return value here.
../kernel/resource.c:337: warning: Function parameter or member 'start' not described in 'find_next_iomem_res'
../kernel/resource.c:337: warning: Function parameter or member 'end' not described in 'find_next_iomem_res'
../kernel/resource.c:337: warning: Function parameter or member 'flags' not described in 'find_next_iomem_res'
../kernel/resource.c:337: warning: Function parameter or member 'desc' not described in 'find_next_iomem_res'
../kernel/resource.c:337: warning: Function parameter or member 'first_lvl' not described in 'find_next_iomem_res'
../kernel/resource.c:337: warning: Function parameter or member 'res' not described in 'find_next_iomem_res'
Add the missing function parameter documentation for the other warnings:
../kernel/resource.c:409: warning: Function parameter or member 'arg' not described in 'walk_iomem_res_desc'
../kernel/resource.c:409: warning: Function parameter or member 'func' not described in 'walk_iomem_res_desc'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: b69c2e20f6 ("resource: Clean it up a bit")
Link: http://lkml.kernel.org/r/dda2e4d8-bedd-3167-20fe-8c7d2d35b354@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Drop BUG_ON()s and do normal error handling instead, in
find_next_iomem_res().
- Align function arguments on opening braces.
- Get rid of local var sibling_only in find_next_iomem_res().
- Shorten unnecessarily long first_level_children_only arg name.
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Bjorn Helgaas <bhelgaas@google.com>
CC: Brijesh Singh <brijesh.singh@amd.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Lianbo Jiang <lijiang@redhat.com>
CC: Takashi Iwai <tiwai@suse.de>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: Vivek Goyal <vgoyal@redhat.com>
CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
CC: bhe@redhat.com
CC: dan.j.williams@intel.com
CC: dyoung@redhat.com
CC: kexec@lists.infradead.org
CC: mingo@redhat.com
Link: <new submission>
Previously find_next_iomem_res() used "*res" as both an input parameter for
the range to search and the type of resource to search for, and an output
parameter for the resource we found, which makes the interface confusing.
The current callers use find_next_iomem_res() incorrectly because they
allocate a single struct resource and use it for repeated calls to
find_next_iomem_res(). When find_next_iomem_res() returns a resource, it
overwrites the start, end, flags, and desc members of the struct. If we
call find_next_iomem_res() again, we must update or restore these fields.
The previous code restored res.start and res.end, but not res.flags or
res.desc.
Since the callers did not restore res.flags, if they searched for flags
IORESOURCE_MEM | IORESOURCE_BUSY and found a resource with flags
IORESOURCE_MEM | IORESOURCE_BUSY | IORESOURCE_SYSRAM, the next search would
incorrectly skip resources unless they were also marked as
IORESOURCE_SYSRAM.
Fix this by restructuring the interface so it takes explicit "start, end,
flags" parameters and uses "*res" only as an output parameter.
Based on a patch by Lianbo Jiang <lijiang@redhat.com>.
[ bp: While at it:
- make comments kernel-doc style.
-
Originally-by: http://lore.kernel.org/lkml/20180921073211.20097-2-lijiang@redhat.com
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Brijesh Singh <brijesh.singh@amd.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Lianbo Jiang <lijiang@redhat.com>
CC: Takashi Iwai <tiwai@suse.de>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: Vivek Goyal <vgoyal@redhat.com>
CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
CC: bhe@redhat.com
CC: dan.j.williams@intel.com
CC: dyoung@redhat.com
CC: kexec@lists.infradead.org
CC: mingo@redhat.com
CC: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/153805812916.1157.177580438135143788.stgit@bhelgaas-glaptop.roam.corp.google.com
find_next_iomem_res() finds an iomem resource that covers part of a range
described by "start, end". All callers expect that range to be inclusive,
i.e., both start and end are included, but find_next_iomem_res() doesn't
handle the end address correctly.
If it finds an iomem resource that contains exactly the end address, it
skips it, e.g., if "start, end" is [0x0-0x10000] and there happens to be an
iomem resource [mem 0x10000-0x10000] (the single byte at 0x10000), we skip
it:
find_next_iomem_res(...)
{
start = 0x0;
end = 0x10000;
for (p = next_resource(...)) {
# p->start = 0x10000;
# p->end = 0x10000;
# we *should* return this resource, but this condition is false:
if ((p->end >= start) && (p->start < end))
break;
Adjust find_next_iomem_res() so it allows a resource that includes the
single byte at the end of the range. This is a corner case that we
probably don't see in practice.
Fixes: 58c1b5b079 ("[PATCH] memory hotadd fixes: find_next_system_ram catch range fix")
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Brijesh Singh <brijesh.singh@amd.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Lianbo Jiang <lijiang@redhat.com>
CC: Takashi Iwai <tiwai@suse.de>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: Vivek Goyal <vgoyal@redhat.com>
CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
CC: bhe@redhat.com
CC: dan.j.williams@intel.com
CC: dyoung@redhat.com
CC: kexec@lists.infradead.org
CC: mingo@redhat.com
CC: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/153805812254.1157.16736368485811773752.stgit@bhelgaas-glaptop.roam.corp.google.com
* DAX broke a fundamental assumption of truncate of file mapped pages.
The truncate path assumed that it is safe to disconnect a pinned page
from a file and let the filesystem reclaim the physical block. With DAX
the page is equivalent to the filesystem block. Introduce
dax_layout_busy_page() to enable filesystems to wait for pinned DAX
pages to be released. Without this wait a filesystem could allocate
blocks under active device-DMA to a new file.
* DAX arranges for the block layer to be bypassed and uses
dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
However, the memcpy_mcsafe() facility is available through the pmem
block driver. In order to safely handle media errors, via the DAX
block-layer bypass, introduce copy_to_iter_mcsafe().
* Fix cache management policy relative to the ACPI NFIT Platform
Capabilities Structure to properly elide cache flushes when they are not
necessary. The table indicates whether CPU caches are power-fail
protected. Clarify that a deep flush is always performed on
REQ_{FUA,PREFLUSH} requests.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbGxI7AAoJEB7SkWpmfYgCDjsP/2Lcibu9Kf4tKIzuInsle6iE
6qP29qlkpHVTpDKbhvIxTYTYL9sMU0DNUrpPCJR/EYdeyztLWDFC5EAT1wF240vf
maV37s/uP331jSC/2VJnKWzBs2ztQxmKLEIQCxh6aT0qs9cbaOvJgB/WlVu+qtsl
aGJFLmb6vdQacp31noU5plKrMgMA1pADyF5qx9I9K2HwowHE7T368ZEFS/3S//c3
LXmpx/Nfq52sGu/qbRbu6B1CTJhIGhmarObyQnvBYoKntK1Ov4e8DS95wD3EhNDe
FuRkOCUKhjl6cFy7QVWh1ct1bFm84ny+b4/AtbpOmv9l/+0mveJ7e+5mu8HQTifT
wYiEe2xzXJ+OG/xntv8SvlZKMpjP3BqI0jYsTutsjT4oHrciiXdXM186cyS+BiGp
KtFmWyncQJgfiTq6+Hj5XpP9BapNS+OYdYgUagw9ZwzdzptuGFYUMSVOBrYrn6c/
fwqtxjubykJoW0P3pkIoT91arFSea7nxOKnGwft06imQ7TwR4ARsI308feQ9itJq
2P2e7/20nYMsw2aRaUDDA70Yu+Lagn1m8WL87IybUGeUDLb1BAkjphAlWa6COJ+u
PhvAD2tvyM9m0c7O5Mytvz7iWKG6SVgatoAyOPkaeplQK8khZ+wEpuK58sO6C1w8
4GBvt9ri9i/Ww/A+ppWs
=4bfw
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"This adds a user for the new 'bytes-remaining' updates to
memcpy_mcsafe() that you already received through Ingo via the
x86-dax- for-linus pull.
Not included here, but still targeting this cycle, is support for
handling memory media errors (poison) consumed via userspace dax
mappings.
Summary:
- DAX broke a fundamental assumption of truncate of file mapped
pages. The truncate path assumed that it is safe to disconnect a
pinned page from a file and let the filesystem reclaim the physical
block. With DAX the page is equivalent to the filesystem block.
Introduce dax_layout_busy_page() to enable filesystems to wait for
pinned DAX pages to be released. Without this wait a filesystem
could allocate blocks under active device-DMA to a new file.
- DAX arranges for the block layer to be bypassed and uses
dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
However, the memcpy_mcsafe() facility is available through the pmem
block driver. In order to safely handle media errors, via the DAX
block-layer bypass, introduce copy_to_iter_mcsafe().
- Fix cache management policy relative to the ACPI NFIT Platform
Capabilities Structure to properly elide cache flushes when they
are not necessary. The table indicates whether CPU caches are
power-fail protected. Clarify that a deep flush is always performed
on REQ_{FUA,PREFLUSH} requests"
* tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
dax: Use dax_write_cache* helpers
libnvdimm, pmem: Do not flush power-fail protected CPU caches
libnvdimm, pmem: Unconditionally deep flush on *sync
libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
acpi, nfit: Remove ecc_unit_size
dax: dax_insert_mapping_entry always succeeds
libnvdimm, e820: Register all pmem resources
libnvdimm: Debug probe times
linvdimm, pmem: Preserve read-only setting for pmem devices
x86, nfit_test: Add unit test for memcpy_mcsafe()
pmem: Switch to copy_to_iter_mcsafe()
dax: Report bytes remaining in dax_iomap_actor()
dax: Introduce a ->copy_to_iter dax operation
uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
xfs, dax: introduce xfs_break_dax_layouts()
xfs: prepare xfs_break_layouts() for another layout type
xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
mm, fs, dax: handle layout changes to pinned dax mappings
mm: fix __gup_device_huge vs unmap
mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
...
There is currently a mismatch between the resources that will trigger
the e820_pmem driver to register/load and the resources that will
actually be surfaced as pmem ranges. register_e820_pmem() uses
walk_iomem_res_desc() which includes children and siblings. In contrast,
e820_pmem_probe() only considers top level resources. For example the
following resource tree results in the driver being loaded, but no
resources being registered:
398000000000-39bfffffffff : PCI Bus 0000:ae
39be00000000-39bf07ffffff : PCI Bus 0000:af
39be00000000-39beffffffff : 0000:af:00.0
39be10000000-39beffffffff : Persistent Memory (legacy)
Fix this up to allow definitions of "legacy" pmem ranges anywhere in
system-physical address space. Not that it is a recommended or safe to
define a pmem range in PCI space, but it is useful for debug /
experimentation, and the restriction on being a top-level resource was
arbitrary.
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We've got a bug report indicating a kernel panic at booting on an x86-32
system, and it turned out to be the invalid PCI resource assigned after
reallocation. __find_resource() first aligns the resource start address
and resets the end address with start+size-1 accordingly, then checks
whether it's contained. Here the end address may overflow the integer,
although resource_contains() still returns true because the function
validates only start and end address. So this ends up with returning an
invalid resource (start > end).
There was already an attempt to cover such a problem in the commit
47ea91b405 ("Resource: fix wrong resource window calculation"), but
this case is an overseen one.
This patch adds the validity check of the newly calculated resource for
avoiding the integer overflow problem.
Bugzilla: http://bugzilla.opensuse.org/show_bug.cgi?id=1086739
Link: http://lkml.kernel.org/r/s5hpo37d5l8.wl-tiwai@suse.de
Fixes: 23c570a674 ("resource: ability to resize an allocated resource")
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Reported-by: Michael Henders <hendersm@shaw.ca>
Tested-by: Michael Henders <hendersm@shaw.ca>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make iomem_is_exclusive return bool due to this particular function only
using either one or zero as its return value.
No functional change.
Link: http://lkml.kernel.org/r/1513266622-15860-5-git-send-email-baiyaowei@cmss.chinamobile.com
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Set resource structs inserted by __reserve_region_with_split() to have the
correct type. Setting the type doesn't fix any functional problem but
makes %pR on the resource work better.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
When we reserve regions because the user specified a "reserve=" parameter,
set the resource type to either IORESOURCE_IO (for regions below 0x10000)
or IORESOURCE_MEM. The test for 0x10000 is just a heuristic; obviously
there can be memory below 0x10000 as well.
Improve documentation of the "reserve=" parameter.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
In order for memory pages to be properly mapped when SEV is active, it's
necessary to use the PAGE_KERNEL protection attribute as the base
protection. This ensures that memory mapping of, e.g. ACPI tables,
receives the proper mapping attributes.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: kvm@vger.kernel.org
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: https://lkml.kernel.org/r/20171020143059.3291-11-brijesh.singh@amd.com
In preperation for a new function that will need additional resource
information during the resource walk, update the resource walk callback to
pass the resource structure. Since the current callback start and end
arguments are pulled from the resource structure, the callback functions
can obtain them from the resource structure directly.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: kvm@vger.kernel.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: linuxppc-dev@lists.ozlabs.org
Link: https://lkml.kernel.org/r/20171020143059.3291-10-brijesh.singh@amd.com
The walk_iomem_res_desc(), walk_system_ram_res() and walk_system_ram_range()
functions each have much of the same code.
Create a new function that consolidates the common code from these
functions in one place to reduce the amount of duplicated code.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: kvm@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20171020143059.3291-9-brijesh.singh@amd.com
In commit c4004b02f8 ("x86: remove the kernel code/data/bss resources
from /proc/iomem") I was hoping to remove the phyiscal kernel address
data from /proc/iomem entirely, but that had to be reverted because some
system programs actually use it.
This limits all the detailed resource information to properly
credentialed users instead.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
insert_resource() and remove_resouce() are called by producers
of resources, such as FW modules and bus drivers. These modules
may be implemented as loadable modules.
Export insert_resource() and remove_resouce() so that they can
be called from such modules.
link: https://lkml.org/lkml/2016/3/8/872
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
insert_resource() and insert_resource_conflict() are called
by resource producers to insert a new resource. When there
is any conflict, they move conflicting resources down to the
children of the new resource. There is no destructor of these
interfaces, however.
Add remove_resource(), which removes a resource previously
inserted by insert_resource() or insert_resource_conflict(),
and moves the children up to where they were before.
__release_resource() is changed to have @release_child, so
that this function can be used for remove_resource() as well.
Also add comments to clarify that these functions are intended
for producers of resources to avoid any confusion with
request/release_resource() for consumers.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
__request_region() sets 'flags' of a new resource from @parent
as it inherits the parent's attribute. When a target resource
has a conflict, this function inserts the new resource entry
under the conflicted entry by updating @parent. In this case,
the new resource entry needs to inherit attribute from the updated
parent. This conflict is a typical case since __request_region()
is used to allocate a new resource from a specific resource range.
For instance, request_mem_region() calls __request_region() with
@parent set to &iomem_resource, which is the root entry of the
whole iomem range. When this request results in inserting a new
entry "DEV-A" under "BUS-1", "DEV-A" needs to inherit from the
immediate parent "BUS-1" as it holds specific attribute for the
range.
root (&iomem_resource)
:
+ "BUS-1"
+ "DEV-A"
Change __request_region() to set 'flags' and 'desc' of a new entry
from the immediate parent.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In __request_region, if a conflict with a BUSY and MUXED resource is
detected, then the caller goes to sleep and waits for the resource to be
released. A pointer on the conflicting resource is kept. At wake-up
this pointer is used as a parent to retry to request the region.
A first problem is that this pointer might well be invalid (if for
example the conflicting resource have already been freed). Another
problem is that the next call to __request_region() fails to detect a
remaining conflict. The previously conflicting resource is passed as a
parameter and __request_region() will look for a conflict among the
children of this resource and not at the resource itself. It is likely
to succeed anyway, even if there is still a conflict.
Instead, the parent of the conflicting resource should be passed to
__request_region().
As a fix, this patch doesn't update the parent resource pointer in the
case we have to wait for a muxed region right after.
Reported-and-tested-by: Vincent Pelletier <plr.vincent@gmail.com>
Signed-off-by: Simon Guinot <simon.guinot@sequanux.org>
Tested-by: Vincent Donnefort <vdonnefort@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
walk_iomem_res_desc() replaced walk_iomem_res() and there is no
caller to walk_iomem_res() any more. Kill it. Also remove @name
from find_next_iomem_res() as it is no longer used.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hanjun Guo <hanjun.guo@linaro.org>
Cc: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/1453841853-11383-17-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add a new interface, walk_iomem_res_desc(), which walks through
the iomem table by identifying a target with @flags and @desc.
This interface provides the same functionality as
walk_iomem_res(), but does not use strcmp() to @name for better
efficiency.
walk_iomem_res() is deprecated and will be removed in a later
patch.
Requested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
[ Fixup comments. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hanjun Guo <hanjun.guo@linaro.org>
Cc: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/1453841853-11383-14-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change region_intersects() to identify a target with @flags and
@desc, instead of @name with strcmp().
Change the callers of region_intersects(), memremap() and
devm_memremap(), to set IORESOURCE_SYSTEM_RAM in @flags and
IORES_DESC_NONE in @desc when searching System RAM.
Also, export region_intersects() so that the ACPI EINJ error
injection driver can call this function in a later patch.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/1453841853-11383-13-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that all System RAM resource entries have been initialized
to IORESOURCE_SYSTEM_RAM type, change walk_system_ram_res() and
walk_system_ram_range() to call find_next_iomem_res() by setting
@res.flags to IORESOURCE_SYSTEM_RAM and @name to NULL. With this
change, they walk through the iomem table to find System RAM
ranges without the need to do strcmp() on the resource names.
No functional change is made to the interfaces.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
[ Boris: fixup comments. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/1453841853-11383-11-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
walk_iomem_res() and region_intersects() still need to use
strcmp() for searching a resource entry by @name in the iomem
table.
This patch introduces I/O resource descriptor 'desc' in struct
resource for the iomem search interfaces. Drivers can assign
their unique descriptor to a range when they support the search
interfaces.
Otherwise, 'desc' is set to IORES_DESC_NONE (0). This avoids
changing most of the drivers as they typically allocate resource
entries statically, or by calling alloc_resource(), kzalloc(),
or alloc_bootmem_low(), which set the field to zero by default.
A later patch will address some drivers that use kmalloc()
without zero'ing the field.
Also change release_mem_region_adjustable() to set 'desc' when
its resource entry gets separated. Other resource interfaces are
also changed to initialize 'desc' explicitly although
alloc_resource() sets it to 0.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/1453841853-11383-4-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I/O resource flags consist of I/O resource types and modifier
bits. Therefore, checking an I/O resource type in 'flags' must
be performed with a bitwise operation.
Fix find_next_iomem_res() and region_intersects() that simply
compare 'flags' against a given value.
Also change __request_region() to set 'res->flags' from
resource_type() and resource_ext_type() of the parent, so that
children nodes will inherit the extended I/O resource type.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/1453841853-11383-3-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This effectively promotes IORESOURCE_BUSY to IORESOURCE_EXCLUSIVE
semantics by default. If userspace really believes it is safe to access
the memory region it can also perform the extra step of disabling an
active driver. This protects device address ranges with read side
effects and otherwise directs userspace to use the driver.
Persistent memory presents a large "mistake surface" to /dev/mem as now
accidental writes can corrupt a filesystem.
In general if a device driver is busily using a memory region it already
informs other parts of the kernel to not touch it via
request_mem_region(). /dev/mem should honor the same safety restriction
by default. Debugging a device driver from userspace becomes more
difficult with this enabled. Any application using /dev/mem or mmap of
sysfs pci resources will now need to perform the extra step of either:
1/ Disabling the driver, for example:
echo <device id> > /dev/bus/<parent bus>/drivers/<driver name>/unbind
2/ Rebooting with "iomem=relaxed" on the command line
3/ Recompiling with CONFIG_IO_STRICT_DEVMEM=n
Traditional users of /dev/mem like dosemu are unaffected because the
first 1MB of memory is not subject to the IO_STRICT_DEVMEM restriction.
Legacy X configurations use /dev/mem to talk to graphics hardware, but
that functionality has since moved to kernel graphics drivers.
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
region_is_ram() is used to prevent the establishment of aliased mappings
to physical "System RAM" with incompatible cache settings. However, it
uses "-1" to indicate both "unknown" memory ranges (ranges not described
by platform firmware) and "mixed" ranges (where the parameters describe
a range that partially overlaps "System RAM").
Fix this up by explicitly tracking the "unknown" vs "mixed" resource
cases and returning REGION_INTERSECTS, REGION_MIXED, or REGION_DISJOINT.
This re-write also adds support for detecting when the requested region
completely eclipses all of a resource. Note, the implementation treats
overlaps between "unknown" and the requested memory type as
REGION_INTERSECTS.
Finally, other memory types can be passed in by name, for now the only
usage "System RAM".
Suggested-by: Luis R. Rodriguez <mcgrof@suse.com>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
region_is_ram() looks up the iomem_resource table to check if
a target range is in RAM. However, it always returns with -1
due to invalid range checks. It always breaks the loop at the
first entry of the table.
Another issue is that it compares p->flags and flags, but it always
fails. flags is declared as int, which makes it as a negative value
with IORESOURCE_BUSY (0x80000000) set while p->flags is unsigned long.
Fix the range check and flags so that region_is_ram() works as
advertised.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Travis <travis@sgi.com>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roland Dreier <roland@purestorage.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1437088996-28511-4-git-send-email-toshi.kani@hp.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
All users of __check_region(), check_region(), and check_mem_region() are
gone. We got rid of the last user in v4.0-rc1. Remove them.
bloat-o-meter on x86_64 shows:
add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-102 (-102)
function old new delta
__kstrtab___check_region 15 - -15
__ksymtab___check_region 16 - -16
__check_region 71 - -71
Signed-off-by: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently ACPI, PCI and pnp all implement the same resource list
management with different data structure. We need to transfer from
one data structure into another when passing resources from one
subsystem into another subsystem. So move struct resource_list_entry
from ACPI into resource core and rename it as resource_entry,
then it could be reused by different subystems and avoid the data
structure conversion.
Introduce dedicated header file resource_ext.h instead of embedding
it into ioport.h to avoid header file inclusion order issues.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Acked-by: Vinod Koul <vinod.koul@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We have a large university system in the UK that is experiencing very long
delays modprobing the driver for a specific I/O device. The delay is from
8-10 minutes per device and there are 31 devices in the system. This 4 to
5 hour delay in starting up those I/O devices is very much a burden on the
customer.
There are two causes for requiring a restart/reload of the drivers. First
is periodic preventive maintenance (PM) and the second is if any of the
devices experience a fatal error. Both of these trigger this excessively
long delay in bringing the system back up to full capability.
The problem was tracked down to a very slow IOREMAP operation and the
excessively long ioresource lookup to insure that the user is not
attempting to ioremap RAM. These patches provide a speed up to that
function.
The modprobe time appears to be affected quite a bit by previous activity
on the ioresource list, which I suspect is due to cache preloading. While
the overall improvement is impacted by other overhead of starting the
devices, this drastically improves the modprobe time.
Also our system is considerably smaller so the percentages gained will not
be the same. Best case improvement with the modprobe on our 20 device
smallish system was from 'real 5m51.913s' to 'real 0m18.275s'.
This patch (of 2):
Since the ioremap operation is verifying that the specified address range
is NOT RAM, it will search the entire ioresource list if the condition is
true. To make matters worse, it does this one 4k page at a time. For a
128M BAR region this is 32 passes to determine the entire region does not
contain any RAM addresses.
This patch provides another resource lookup function, region_is_ram, that
searches for the entire region specified, verifying that it is completely
contained within the resource region. If it is found, then it is checked
to be RAM or not, within a single pass.
The return result reflects if it was found or not (-1), and whether it is
RAM (1) or not (0). This allows the caller to fallback to the previous
page by page search if it was not found.
[akpm@linux-foundation.org: fix spellos and typos in comment]
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Alex Thorlton <athorlton@sgi.com>
Reviewed-by: Cliff Wickman <cpw@sgi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide device-managed implementations of the request_resource() and
release_resource() functions. Upon failure to request a resource, the new
devm_request_resource() function will output an error message for
consistent error reporting.
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Richard and Daniel reported that UML is broken due to changes to
resource traversal functions. Problem is that iomem_resource.child can
be null and new code does not consider that possibility. Old code used
a for loop and that loop will not even execute if p was null.
Revert back to for() loop logic and bail out if p is null.
I also moved sibling_only check out of resource_lock. There is no
reason to keep it inside the lock.
Following is backtrace of the UML crash.
RIP: 0033:[<0000000060039b9f>]
RSP: 0000000081459da0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 00000000219b3fff RCX: 000000006010d1d9
RDX: 0000000000000001 RSI: 00000000602dfb94 RDI: 0000000081459df8
RBP: 0000000081459de0 R08: 00000000601b59f4 R09: ffffffff0000ff00
R10: ffffffff0000ff00 R11: 0000000081459e88 R12: 0000000081459df8
R13: 00000000219b3fff R14: 00000000602dfb94 R15: 0000000000000000
Kernel panic - not syncing: Segfault with no mm
CPU: 0 PID: 1 Comm: swapper Not tainted 3.16.0-10454-g58d08e3 #13
Stack:
00000000 000080d0 81459df0 219b3fff
81459e70 6010d1d9 ffffffff 6033e010
81459e50 6003a269 81459e30 00000000
Call Trace:
[<6010d1d9>] ? kclist_add_private+0x0/0xe7
[<6003a269>] walk_system_ram_range+0x61/0xb7
[<6000e859>] ? proc_kcore_init+0x0/0xf1
[<6010d574>] kcore_update_ram+0x4c/0x168
[<6010d72e>] ? kclist_add+0x0/0x2e
[<6000e943>] proc_kcore_init+0xea/0xf1
[<6000e859>] ? proc_kcore_init+0x0/0xf1
[<6000e859>] ? proc_kcore_init+0x0/0xf1
[<600189f0>] do_one_initcall+0x13c/0x204
[<6004ca46>] ? parse_args+0x1df/0x2e0
[<6004c82d>] ? parameq+0x0/0x3a
[<601b5990>] ? strcpy+0x0/0x18
[<60001e1a>] kernel_init_freeable+0x240/0x31e
[<6026f1c0>] kernel_init+0x12/0x148
[<60019fad>] new_thread_handler+0x81/0xa3
Fixes 8c86e70ace ("resource: provide new functions to walk
through resources").
Reported-by: Daniel Walter <sahne@0x90.at>
Tested-by: Richard Weinberger <richard@nod.at>
Tested-by: Toralf Förster <toralf.foerster@gmx.de>
Tested-by: Daniel Walter <sahne@0x90.at>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I have added two more functions to walk through resources.
Currently walk_system_ram_range() deals with pfn and /proc/iomem can
contain partial pages. By dealing in pfn, callback function loses the
info that last page of a memory range is a partial page and not the full
page. So I implemented walk_system_ram_res() which returns u64 values to
callback functions and now it properly return start and end address.
walk_system_ram_range() uses find_next_system_ram() to find the next ram
resource. This in turn only travels through siblings of top level child
and does not travers through all the nodes of the resoruce tree. I also
need another function where I can walk through all the resources, for
example figure out where "GART" aperture is. Figure out where ACPI memory
is.
So I wrote another function walk_iomem_res() which walks through all
/proc/iomem resources and returns matches as asked by caller. Caller can
specify "name" of resource, start and end and flags.
Got rid of find_next_system_ram_res() and instead implemented more generic
find_next_iomem_res() which can be used to traverse top level children
only based on an argument.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: WANG Chao <chaowang@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The resource map sanity check message is a bit confusing. Change it to be
more readable:
-resource map sanity check conflict: 0xfed10000 0xfed15fff 0xfed10000 0xfed13fff pnp 00:01
+resource sanity check: requesting [mem 0xfed10000-0xfed15fff], which spans more than pnp 00:01 [mem 0xfed10000-0xfed13fff]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
sparse says:
kernel/resource.c:518:5: warning:
symbol 'reallocate_resource' was not declared. Should it be static?
Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't set the type (I/O, memory, etc.) of resources added by
__request_region(), which leads to confusing messages like this:
address space collision: [io 0x1000-0x107f] conflicts with ACPI CPU throttle [??? 0x00001010-0x00001015 flags 0x80000000]
Set the type of a new resource added by __request_region() (used by
request_region() and request_mem_region()) to the type of its parent. This
makes the resource tree internally consistent and fixes messages like the
above, where the ACPI CPU throttle resource really is an I/O port region,
but request_region() didn't fill in the type, so %pR didn't know how to
print it.
Sample dmesg showing the issue at the link below.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=71611
Reported-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
We have two identical copies of resource_contains() already, and more
places that could use it. This moves it to ioport.h where it can be
shared.
resource_contains(struct resource *r1, struct resource *r2) returns true
iff r1 and r2 are the same type (most callers already checked this
separately) and the r1 address range completely contains r2.
In addition, the new resource_contains() checks that both r1 and r2 have
addresses assigned to them. If a resource is IORESOURCE_UNSET, it doesn't
have a valid address and can't contain or be contained by another resource.
Some callers already check this or for res->start.
No functional change.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
This line was introduced by fcb11918 ("resources: add arch hook for
preventing allocation in reserved areas"). But the struct tmp was already
assigned to *new in the above line, so this seems superfluous. Just
remove it.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When param1 is enabled in EINJ but not assigned with a valid
value, sometimes it will cause the error like below:
APEI: Can not request [mem 0x7aaa7000-0x7aaa7007] for APEI EINJ Trigger registers
It is because some firmware will access target address specified in
param1 to trigger the error when injecting memory error. This will
cause resource conflict with regular memory. So It must be removed
from trigger table resources, but incorrect param1/param2
combination will stop this action. Add extra check to avoid
this kind of error.
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
When hot removing memory presented at boot time, following messages are shown:
kernel BUG at mm/slub.c:3409!
invalid opcode: 0000 [#1] SMP
Modules linked in: ebtable_nat ebtables xt_CHECKSUM iptable_mangle bridge stp llc ipmi_devintf ipmi_msghandler sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode pcspkr sg i2c_i801 lpc_ich mfd_core igb i2c_algo_bit i2c_core e1000e ptp pps_core tpm_infineon ioatdma dca sr_mod cdrom sd_mod crc_t10dif usb_storage megaraid_sas lpfc scsi_transport_fc scsi_tgt scsi_mod
CPU 0
Pid: 5091, comm: kworker/0:2 Tainted: G W 3.9.0-rc6+ #15
RIP: kfree+0x232/0x240
Process kworker/0:2 (pid: 5091, threadinfo ffff88084678c000, task ffff88083928ca80)
Call Trace:
__release_region+0xd4/0xe0
__remove_pages+0x52/0x110
arch_remove_memory+0x89/0xd0
remove_memory+0xc4/0x100
acpi_memory_device_remove+0x6d/0xb1
acpi_device_remove+0x89/0xab
__device_release_driver+0x7c/0xf0
device_release_driver+0x2f/0x50
acpi_bus_device_detach+0x6c/0x70
acpi_ns_walk_namespace+0x11a/0x250
acpi_walk_namespace+0xee/0x137
acpi_bus_trim+0x33/0x7a
acpi_bus_hot_remove_device+0xc4/0x1a1
acpi_os_execute_deferred+0x27/0x34
process_one_work+0x1f7/0x590
worker_thread+0x11a/0x370
kthread+0xee/0x100
ret_from_fork+0x7c/0xb0
RIP [<ffffffff811c41d2>] kfree+0x232/0x240
RSP <ffff88084678d968>
The reason why the messages are shown is to release a resource
structure, allocated by bootmem, by kfree(). So when we release a
resource structure, we should check whether it is allocated by bootmem
or not.
But even if we know a resource structure is allocated by bootmem, we
cannot release it since SLxB cannot treat it. So for reusing a resource
structure, this patch remembers it by using bootmem_resource as follows:
When releasing a resource structure by free_resource(), free_resource()
checks whether the resource structure is allocated by bootmem or not.
If it is allocated by bootmem, free_resource() adds it to
bootmem_resource. If it is not allocated by bootmem, free_resource()
release it by kfree().
And when getting a new resource structure by get_resource(),
get_resource() checks whether bootmem_resource has released resource
structures or not. If there is a released resource structure,
get_resource() returns it. If there is not a releaed resource
structure, get_resource() returns new resource structure allocated by
kzalloc().
[akpm@linux-foundation.org: s/get_resource/alloc_resource/]
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add release_mem_region_adjustable(), which releases a requested region
from a currently busy memory resource. This interface adjusts the
matched memory resource accordingly even if the requested region does
not match exactly but still fits into.
This new interface is intended for memory hot-delete. During bootup,
memory resources are inserted from the boot descriptor table, such as
EFI Memory Table and e820. Each memory resource entry usually covers
the whole contigous memory range. Memory hot-delete request, on the
other hand, may target to a particular range of memory resource, and its
size can be much smaller than the whole contiguous memory. Since the
existing release interfaces like __release_region() require a requested
region to be exactly matched to a resource entry, they do not allow a
partial resource to be released.
This new interface is restrictive (i.e. release under certain
conditions), which is consistent with other release interfaces,
__release_region() and __release_resource(). Additional release
conditions, such as an overlapping region to a resource entry, can be
supported after they are confirmed as valid cases.
There is no change to the existing interfaces since their restriction is
valid for I/O resources.
[akpm@linux-foundation.org: use GFP_ATOMIC under write_lock()]
[akpm@linux-foundation.org: switch back to GFP_KERNEL, less buggily]
[akpm@linux-foundation.org: remove unneeded and wrong kfree(), per Toshi]
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by : Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Cc: T Makphaibulchoke <tmac@hp.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add __adjust_resource(), which is called by adjust_resource() internally
after the resource_lock is held. There is no interface change to
adjust_resource(). This change allows other functions to call
__adjust_resource() internally while the resource_lock is held.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: T Makphaibulchoke <tmac@hp.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using a recursive call add a non-conflicting region in
__reserve_region_with_split() could result in a stack overflow in the case
that the recursive calls are too deep. Convert the recursive calls to an
iterative loop to avoid the problem.
Tested on a machine containing 135 regions. The kernel no longer panicked
with stack overflow.
Also tested with code arbitrarily adding regions with no conflict,
embedding two consecutive conflicts and embedding two non-consecutive
conflicts.
Signed-off-by: T Makphaibulchoke <tmac@hp.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@gmail.com>
Cc: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the requested range is outside of the root range the logic in
__reserve_region_with_split will cause an infinite recursion which will
overflow the stack as seen in the warning bellow.
This particular stack overflow was caused by requesting the
(100000000-107ffffff) range while the root range was (0-ffffffff). In
this case __request_resource would return the whole root range as
conflict range (i.e. 0-ffffffff). Then, the logic in
__reserve_region_with_split would continue the recursion requesting the
new range as (conflict->end+1, end) which incidentally in this case
equals the originally requested range.
This patch aborts looking for an usable range when the request does not
intersect with the root range. When the request partially overlaps with
the root range, it ajust the request to fall in the root range and then
continues with the new request.
When the request is modified or aborted errors and a stack trace are
logged to allow catching the errors in the upper layers.
[ 5.968374] WARNING: at kernel/sched.c:4129 sub_preempt_count+0x63/0x89()
[ 5.975150] Modules linked in:
[ 5.978184] Pid: 1, comm: swapper Not tainted 3.0.22-mid27-00004-gb72c817 #46
[ 5.985324] Call Trace:
[ 5.987759] [<c1039dfc>] ? console_unlock+0x17b/0x18d
[ 5.992891] [<c1039620>] warn_slowpath_common+0x48/0x5d
[ 5.998194] [<c1031758>] ? sub_preempt_count+0x63/0x89
[ 6.003412] [<c1039644>] warn_slowpath_null+0xf/0x13
[ 6.008453] [<c1031758>] sub_preempt_count+0x63/0x89
[ 6.013499] [<c14d60c4>] _raw_spin_unlock+0x27/0x3f
[ 6.018453] [<c10c6349>] add_partial+0x36/0x3b
[ 6.022973] [<c10c7c0a>] deactivate_slab+0x96/0xb4
[ 6.027842] [<c14cf9d9>] __slab_alloc.isra.54.constprop.63+0x204/0x241
[ 6.034456] [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[ 6.039842] [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[ 6.045232] [<c10c7dc9>] kmem_cache_alloc_trace+0x51/0xb0
[ 6.050710] [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[ 6.056100] [<c103f78f>] kzalloc.constprop.5+0x29/0x38
[ 6.061320] [<c17b45e9>] __reserve_region_with_split+0x1c/0xd1
[ 6.067230] [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
...
[ 7.179057] [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
[ 7.184970] [<c17b4779>] reserve_region_with_split+0x30/0x42
[ 7.190709] [<c17a8ebf>] e820_reserve_resources_late+0xd1/0xe9
[ 7.196623] [<c17c9526>] pcibios_resource_survey+0x23/0x2a
[ 7.202184] [<c17cad8a>] pcibios_init+0x23/0x35
[ 7.206789] [<c17ca574>] pci_subsys_init+0x3f/0x44
[ 7.211659] [<c1002088>] do_one_initcall+0x72/0x122
[ 7.216615] [<c17ca535>] ? pci_legacy_init+0x3d/0x3d
[ 7.221659] [<c17a27ff>] kernel_init+0xa6/0x118
[ 7.226265] [<c17a2759>] ? start_kernel+0x334/0x334
[ 7.231223] [<c14d7482>] kernel_thread_helper+0x6/0x10
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a resource has no parent, allow its start/end to be set arbitrarily
as long as any children are still contained within the new range.
[bhelgaas: changelog]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
In the comment of allocate_resource(), the explanation of parameter max
and min is not correct.
Actually, these two parameters are used to specify the range of the
resource that will be allocated, not the min/max size that will be
allocated.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
EXPORT_SYMBOL(adjust_resource) should be right after adjust_resource().
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The changed files were only including linux/module.h for the
EXPORT_SYMBOL infrastructure, and nothing else. Revector them
onto the isolated export header for faster compile times.
Nothing to see here but a whole lot of instances of:
-#include <linux/module.h>
+#include <linux/export.h>
This commit is only changing the kernel dir; next targets
will probably be mm, fs, the arch dirs, etc.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
__find_resource() incorrectly returns a resource window which overlaps
an existing allocated window. This happens when the parent's
resource-window spans 0x00000000 to 0xffffffff and is entirely allocated
to all its children resource-windows.
__find_resource() looks for gaps in resource allocation among the
children resource windows. When it encounters the last child window it
blindly tries the range next to one allocated to the last child. Since
the last child's window ends at 0xffffffff the calculation overflows,
leading the algorithm to believe that any window in the range 0x0000000
to 0xfffffff is available for allocation. This leads to a conflicting
window allocation.
Michal Ludvig reported this issue seen on his platform. The following
patch fixes the problem and has been verified by Michal. I believe this
bug has been there for ages. It got exposed by git commit 2bbc694227
("PCI : ability to relocate assigned pci-resources")
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Tested-by: Michal Ludvig <mludvig@logix.net.nz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a function to find an existing resource by a resource start address.
This allows to implement simple allocators (with a malloc/free-alike API)
on top of the resource system.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Provides the ability to resize a resource that is already allocated.
This functionality is put in place to support reallocation needs of
pci resources.
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds arch_remove_reservations(), which an arch can implement if it
needs to protect part of the address space from allocation.
Sometimes that can be done by just putting a region in the resource tree,
but there are cases where that doesn't work well. For example, x86 BIOS
E820 reservations are not related to devices, so they may overlap part of,
all of, or more than a device resource, so they may not end up at the
correct spot in the resource tree.
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (27 commits)
x86: allocate space within a region top-down
x86: update iomem_resource end based on CPU physical address capabilities
x86/PCI: allocate space from the end of a region, not the beginning
PCI: allocate bus resources from the top down
resources: support allocating space within a region from the top down
resources: handle overflow when aligning start of available area
resources: ensure callback doesn't allocate outside available space
resources: factor out resource_clip() to simplify find_resource()
resources: add a default alignf to simplify find_resource()
x86/PCI: MMCONFIG: fix region end calculation
PCI: Add support for polling PME state on suspended legacy PCI devices
PCI: Export some PCI PM functionality
PCI: fix message typo
PCI: log vendor/device ID always
PCI: update Intel chipset names and defines
PCI: use new ccflags variable in Makefile
PCI: add PCI_MSIX_TABLE/PBA defines
PCI: add PCI vendor id for STmicroelectronics
x86/PCI: irq and pci_ids patch for Intel Patsburg DeviceIDs
PCI: OLPC: Only enable PCI configuration type override on XO-1
...
If the same resource is inserted to the resource tree (maybe not on
purpose), a dead loop will be created. In this situation, The kernel does
not report any warning or error :(
The command below will show a endless print.
#cat /proc/iomem
[akpm@linux-foundation.org: add WARN_ON()]
Signed-off-by: Huang Shijie <shijie8@gmail.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allocate space from the top of a region first, then work downward,
if an architecture desires this.
When we allocate space from a resource, we look for gaps between children
of the resource. Previously, we always looked at gaps from the bottom up.
For example, given this:
[mem 0xbff00000-0xf7ffffff] PCI Bus 0000:00
[mem 0xbff00000-0xbfffffff] gap -- available
[mem 0xc0000000-0xdfffffff] PCI Bus 0000:02
[mem 0xe0000000-0xf7ffffff] gap -- available
we attempted to allocate from the [mem 0xbff00000-0xbfffffff] gap first,
then the [mem 0xe0000000-0xf7ffffff] gap.
With this patch an architecture can choose to allocate from the top gap
[mem 0xe0000000-0xf7ffffff] first.
We can't do this across the board because iomem_resource.end is initialized
to 0xffffffff_ffffffff on 64-bit architectures, and most machines can't
address the entire 64-bit physical address space. Therefore, we only
allocate top-down if the arch requests it by clearing
"resource_alloc_from_bottom".
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
If tmp.start is near ~0, ALIGN(tmp.start) may overflow, which would
make us think there's more available space than there really is. We
would likely return something that conflicts with a previous resource,
which would cause a failure when allocate_resource() requests the newly-
allocated region.
Reference: https://bugzilla.redhat.com/show_bug.cgi?id=646027
Reported-by: Fabrice Bellet <fabrice@bellet.info>
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
The alignment callback returns a proposed location, which may have been
adjusted to avoid ISA aliases or for other architecture-specific reasons.
We already had a check ("tmp.start < tmp.end") to make sure the callback
doesn't return an area that extends past the available area. This patch
reworks the check to make sure it doesn't return an area that extends
either below or above the available area.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
This factors out the min/max clipping to simplify find_resource().
No functional change.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
This removes a test from find_resource(), which is getting cluttered.
No functional change.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
SuperIO devices share regions and use lock/unlock operations to chip
select. We therefore need to be able to request a resource and wait for
it to be freed by whichever other SuperIO device currently hogs it.
Right now you have to poll which is horrible.
Add a MUXED field to IO port resources. If the MUXED field is set on the
resource and on the request (via request_muxed_region) then we block
until the previous owner of the muxed resource releases their region.
This allows us to implement proper resource sharing and locking for
superio chips using code of the form
enable_my_superio_dev() {
request_muxed_region(0x44, 0x02, "superio:watchdog");
outb() ..sequence to enable chip
}
disable_my_superio_dev() {
outb() .. sequence of disable chip
release_region(0x44, 0x02);
}
Signed-off-by: Giel van Schijndel <me@mortis.eu>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
request_resource() and insert_resource() only return success or failure,
which no information about what existing resource conflicted with the
proposed new reservation. This patch adds request_resource_conflict()
and insert_resource_conflict(), which return the conflicting resource.
Callers may use this for better error messages or to adjust the new
resource and retry the request.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
The System RAM walk shall skip partial RAM pages and avoid calling
func() on them. So that page_is_ram() return 0 for a partial RAM page.
In particular, it shall not call func() with len=0.
This fixes a boot time bug reported by Sachin and root caused by Thomas:
> >>> WARNING: at arch/x86/mm/ioremap.c:111 __ioremap_caller+0x169/0x2f1()
> >>> Hardware name: BladeCenter LS21 -[79716AA]-
> >>> Modules linked in:
> >>> Pid: 0, comm: swapper Not tainted 2.6.33-git6-autotest #1
> >>> Call Trace:
> >>> [<ffffffff81047cff>] ? __ioremap_caller+0x169/0x2f1
> >>> [<ffffffff81063b7d>] warn_slowpath_common+0x77/0xa4
> >>> [<ffffffff81063bb9>] warn_slowpath_null+0xf/0x11
> >>> [<ffffffff81047cff>] __ioremap_caller+0x169/0x2f1
> >>> [<ffffffff813747a3>] ? acpi_os_map_memory+0x12/0x1b
> >>> [<ffffffff81047f10>] ioremap_nocache+0x12/0x14
> >>> [<ffffffff813747a3>] acpi_os_map_memory+0x12/0x1b
> >>> [<ffffffff81282fa0>] acpi_tb_verify_table+0x29/0x5b
> >>> [<ffffffff812827f0>] acpi_load_tables+0x39/0x15a
> >>> [<ffffffff8191c8f8>] acpi_early_init+0x60/0xf5
> >>> [<ffffffff818f2cad>] start_kernel+0x397/0x3a7
> >>> [<ffffffff818f2295>] x86_64_start_reservations+0xa5/0xa9
> >>> [<ffffffff818f237a>] x86_64_start_kernel+0xe1/0xe8
> >>> ---[ end trace 4eaa2a86a8e2da22 ]---
> >>> ioremap reserve_memtype failed -22
The return code is -EINVAL, so it failed in the is_ram check, which is
not too surprising
> BIOS-provided physical RAM map:
> BIOS-e820: 0000000000000000 - 000000000009c000 (usable)
> BIOS-e820: 000000000009c000 - 00000000000a0000 (reserved)
> BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
> BIOS-e820: 0000000000100000 - 00000000cffa3900 (usable)
> BIOS-e820: 00000000cffa3900 - 00000000cffa7400 (ACPI data)
The ACPI data is not starting on a page boundary and neither does the
usable RAM area end on a page boundary. Very useful !
> ACPI: DSDT 00000000cffa3900 036CE (v01 IBM SERLEWIS 00001000 INTL 20060912)
ACPI is trying to map DSDT at cffa3900, which results in a check
vs. cffa3000 which is the relevant page boundary. The generic is_ram
check correctly identifies that as RAM because it's in the usable
resource area. The old e820 based is_ram check does not take
overlapping resource areas into account. That's why it works.
CC: Sachin Sant <sachinp@in.ibm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
LKML-Reference: <20100301135551.GA9998@localhost>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, mm: Unify kernel_physical_mapping_init() API
x86, mm: Allow highmem user page tables to be disabled at boot time
x86: Do not reserve brk for DMI if it's not going to be used
x86: Convert tlbstate_lock to raw_spinlock
x86: Use the generic page_is_ram()
x86: Remove BIOS data range from e820
Move page_is_ram() declaration to mm.h
Generic page_is_ram: use __weak
resources: introduce generic page_is_ram()
Useful for freeing a portion of the resource tree, e.g. when trying to
reallocate resources more efficiently.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Now that we return the new resource start position, there is no
need to update "struct resource" inside the align function.
Therefore, mark the struct resource as const.
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
As suggested by Linus, align functions should return the start
of a resource, not void. An update of "res->start" is no longer
necessary.
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
x86/mm is on 32-rc4 and missing the spinlock namespace changes which
are needed for further commits into this topic.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use __weak instead of __attribute__((weak)).
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
It's based on walk_system_ram_range(), for archs that don't have
their own page_is_ram().
The static verions in MIPS and SCORE are also made global.
v4: prefer plain 1 instead of PAGE_IS_RAM (H. Peter Anvin)
v3: add comment (KAMEZAWA Hiroyuki)
"AFAIK, this "System RAM" information has been used for kdump to
grab valid memory area and seems good for the kernel itself."
v2: add PAGE_IS_RAM macro (Américo Wang)
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Américo Wang <xiyou.wangcong@gmail.com>
Cc: linux-mips@linux-mips.org
Cc: Yinghai Lu <yinghai@kernel.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
LKML-Reference: <20100122081619.GA6431@localhost>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The second parameter to alignf() in allocate_resource() must
reflect what new resource is attempted to be allocated, else
functions like pcibios_align_resource() (at least on x86) or
pcmcia_align() can't work correctly.
Commit 1e5ad96790 broke this by
setting the "new" resource until we're about to return success.
To keep the resource untouched when allocate_resource() fails,
a "tmp" resource is introduced.
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Acked-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When "allocate_resource(root, new, size, ...)" fails, we currently
clobber "new". This is inconvenient for the caller, who might care
about the original contents of the resource.
For example, when pci_bus_alloc_resource() fails, the "can't allocate
mem resource %pR" message from pci_assign_resources() currently contains
junk for the resource start/end.
This patch delays the "new" update until we're about to return success.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Originally, walk_memory_resource() was introduced to traverse all memory
of "System RAM" for detecting memory hotplug/unplug range. For doing so,
flags of IORESOUCE_MEM|IORESOURCE_BUSY was used and this was enough for
memory hotplug.
But for using other purpose, /proc/kcore, this may includes some firmware
area marked as IORESOURCE_BUSY | IORESOUCE_MEM. This patch makes the
check strict to find out busy "System RAM".
Note: PPC64 keeps their own walk_memory_resouce(), which walk through
ppc64's lmb informaton. Because old kclist_add() is called per lmb, this
patch makes no difference in behavior, finally.
And this patch removes CONFIG_MEMORY_HOTPLUG check from this function.
Because pfn_valid() just show "there is memmap or not* and cannot be used
for "there is physical memory or not", this function is useful in generic
to scan physical memory range.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Américo Wang <xiyou.wangcong@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the 32-bit signed quantities get assigned to the u64 resource_size_t,
they are incorrectly sign-extended.
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13253
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=9905
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Reported-by: Leann Ogasawara <leann@ubuntu.com>
Cc: Pierre Ossman <drzeus@drzeus.cx>
Reported-by: <pablomme@googlemail.com>
Tested-by: <pablomme@googlemail.com>
Cc: <stable@kernel.org>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function is not actually used right now, since the original use
case for it was done with insert_resource_expand_to_fit() instead.
However, we now have another usage case that wants to basically do a
"reserve IO resource, splitting around existing resources", however that
one doesn't actually want the "recurse into the conflicting resource"
logic at all.
And since recursing into the conflicting resource was the most complex
part, and isn't wanted, just remove it. Maybe we'll some day want both
versions, but we can just resurrect the logic then.
Tested-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>