mirror of
https://git.proxmox.com/git/mirror_ubuntu-kernels.git
synced 2025-11-10 15:34:34 +00:00
ae146d9b76
326 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
c00eb15a89 |
mm/zonelist: enumerate zonelists array index
Hardcoding index to zonelists array in gfp_zonelist() is not a good idea, let's enumerate it to improve readability. No functional change. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix CONFIG_NUMA=n build] [n-horiguchi@ah.jp.nec.com: fix warning in comparing enumerator] Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
06640290bf |
include/linux/mmzone.h: remove unused is_unevictable_lru()
Since commit
|
||
|
|
8990332760 |
include/linux/mmzone.h: reflow comment
Someone has an 86 column display. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
0aaa29a56e |
mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand
High-order watermark checking exists for two reasons -- kswapd high-order awareness and protection for high-order atomic requests. Historically the kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order free pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC that reserves pageblocks for high-order atomic allocations on demand and avoids using those blocks for order-0 allocations. This is more flexible and reliable than MIGRATE_RESERVE was. A MIGRATE_HIGHORDER pageblock is created when an atomic high-order allocation request steals a pageblock but limits the total number to 1% of the zone. Callers that speculatively abuse atomic allocations for long-lived high-order allocations to access the reserve will quickly fail. Note that SLUB is currently not such an abuser as it reclaims at least once. It is possible that the pageblock stolen has few suitable high-order pages and will need to steal again in the near future but there would need to be strong justification to search all pageblocks for an ideal candidate. The pageblocks are unreserved if an allocation fails after a direct reclaim attempt. The watermark checks account for the reserved pageblocks when the allocation request is not a high-order atomic allocation. The reserved pageblocks can not be used for order-0 allocations. This may allow temporary wastage until a failed reclaim reassigns the pageblock. This is deliberate as the intent of the reservation is to satisfy a limited number of atomic high-order short-lived requests if the system requires them. The stutter benchmark was used to evaluate this but while it was running there was a systemtap script that randomly allocated between 1 high-order page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This is much larger than the potential reserve and it does not attempt to be realistic. It is intended to stress random high-order allocations from an unknown source, show that there is a reduction in failures without introducing an anomaly where atomic allocations are more reliable than regular allocations. The amount of memory reserved varied throughout the workload as reserves were created and reclaimed under memory pressure. The allocation failures once the workload warmed up were as follows; 4.2-rc5-vanilla 70% 4.2-rc5-atomic-reserve 56% The failure rate was also measured while building multiple kernels. The failure rate was 14% but is 6% with this patch applied. Overall, this is a small reduction but the reserves are small relative to the number of allocation requests. In early versions of the patch, the failure rate reduced by a much larger amount but that required much larger reserves and perversely made atomic allocations seem more reliable than regular allocations. [yalin.wang2010@gmail.com: fix redundant check and a memory leak] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: yalin wang <yalin.wang2010@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
974a786e63 |
mm, page_alloc: remove MIGRATE_RESERVE
MIGRATE_RESERVE preserves an old property of the buddy allocator that existed prior to fragmentation avoidance -- min_free_kbytes worth of pages tended to remain contiguous until the only alternative was to fail the allocation. At the time it was discovered that high-order atomic allocations relied on this property so MIGRATE_RESERVE was introduced. A later patch will introduce an alternative MIGRATE_HIGHATOMIC so this patch deletes MIGRATE_RESERVE and supporting code so it'll be easier to review. Note that this patch in isolation may look like a false regression if someone was bisecting high-order atomic allocation failures. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
f77cf4e4cc |
mm, page_alloc: delete the zonelist_cache
The zonelist cache (zlc) was introduced to skip over zones that were
recently known to be full. This avoided expensive operations such as the
cpuset checks, watermark calculations and zone_reclaim. The situation
today is different and the complexity of zlc is harder to justify.
1) The cpuset checks are no-ops unless a cpuset is active and in general
are a lot cheaper.
2) zone_reclaim is now disabled by default and I suspect that was a large
source of the cost that zlc wanted to avoid. When it is enabled, it's
known to be a major source of stalling when nodes fill up and it's
unwise to hit every other user with the overhead.
3) Watermark checks are expensive to calculate for high-order
allocation requests. Later patches in this series will reduce the cost
of the watermark checking.
4) The most important issue is that in the current implementation it
is possible for a failed THP allocation to mark a zone full for order-0
allocations and cause a fallback to remote nodes.
The last issue could be addressed with additional complexity but as the
benefit of zlc is questionable, it is better to remove it. If stalls due
to zone_reclaim are ever reported then an alternative would be to
introduce deferring logic based on a timeout inside zone_reclaim itself
and leave the page allocator fast paths alone.
The impact on page-allocator microbenchmarks is negligible as they don't
hit the paths where the zlc comes into play. Most page-reclaim related
workloads showed no noticeable difference as a result of the removal.
The impact was noticeable in a workload called "stutter". One part uses a
lot of anonymous memory, a second measures mmap latency and a third copies
a large file. In an ideal world the latency application would not notice
the mmap latency. On a 2-node machine the results of this patch are
stutter
4.3.0-rc1 4.3.0-rc1
baseline nozlc-v4
Min mmap 20.9243 ( 0.00%) 20.7716 ( 0.73%)
1st-qrtle mmap 22.0612 ( 0.00%) 22.0680 ( -0.03%)
2nd-qrtle mmap 22.3291 ( 0.00%) 22.3809 ( -0.23%)
3rd-qrtle mmap 25.2244 ( 0.00%) 25.2396 ( -0.06%)
Max-90% mmap 48.0995 ( 0.00%) 28.3713 ( 41.02%)
Max-93% mmap 52.5557 ( 0.00%) 36.0170 ( 31.47%)
Max-95% mmap 55.8173 ( 0.00%) 47.3163 ( 15.23%)
Max-99% mmap 67.3781 ( 0.00%) 70.1140 ( -4.06%)
Max mmap 24447.6375 ( 0.00%) 12915.1356 ( 47.17%)
Mean mmap 33.7883 ( 0.00%) 27.7944 ( 17.74%)
Best99%Mean mmap 27.7825 ( 0.00%) 25.2767 ( 9.02%)
Best95%Mean mmap 26.3912 ( 0.00%) 23.7994 ( 9.82%)
Best90%Mean mmap 24.9886 ( 0.00%) 23.2251 ( 7.06%)
Best50%Mean mmap 22.0157 ( 0.00%) 22.0261 ( -0.05%)
Best10%Mean mmap 21.6705 ( 0.00%) 21.6083 ( 0.29%)
Best5%Mean mmap 21.5581 ( 0.00%) 21.4611 ( 0.45%)
Best1%Mean mmap 21.3079 ( 0.00%) 21.1631 ( 0.68%)
Note that the maximum stall latency went from 24 seconds to 12 which is
still bad but an improvement. The milage varies considerably 2-node
machine on an earlier test went from 494 seconds to 47 seconds and a
4-node machine that tested an earlier version of this patch went from a
worst case stall time of 6 seconds to 67ms. The nature of the benchmark
is inherently unpredictable as it is hammering the system and the milage
will vary between machines.
There is a secondary impact with potentially more direct reclaim because
zones are now being considered instead of being skipped by zlc. In this
particular test run it did not occur so will not be described. However,
in at least one test the following was observed
1. Direct reclaim rates were higher. This was likely due to direct reclaim
being entered instead of the zlc disabling a zone and busy looping.
Busy looping may have the effect of allowing kswapd to make more
progress and in some cases may be better overall. If this is found then
the correct action is to put direct reclaimers to sleep on a waitqueue
and allow kswapd make forward progress. Busy looping on the zlc is even
worse than when the allocator used to blindly call congestion_wait().
2. There was higher swap activity as direct reclaim was active.
3. Direct reclaim efficiency was lower. This is related to 1 as more
scanning activity also encountered more pages that could not be
immediately reclaimed
In that case, the direct page scan and reclaim rates are noticeable but
it is not considered a problem for a few reasons
1. The test is primarily concerned with latency. The mmap attempts are also
faulted which means there are THP allocation requests. The ZLC could
cause zones to be disabled causing the process to busy loop instead
of reclaiming. This looks like elevated direct reclaim activity but
it's the correct action to take based on what processes requested.
2. The test hammers reclaim and compaction heavily. The number of successful
THP faults is highly variable but affects the reclaim stats. It's not a
realistic or reasonable measure of page reclaim activity.
3. No other page-reclaim intensive workload that was tested showed a problem.
4. If a workload is identified that benefitted from the busy looping then it
should be fixed by having direct reclaimers sleep on a wait queue until
woken by kswapd instead of busy looping. We had this class of problem before
when congestion_waits() with a fixed timeout was a brain damaged decision
but happened to benefit some workloads.
If a workload is identified that relied on the zlc to busy loop then it
should be fixed correctly and have a direct reclaimer sleep on a waitqueue
until woken by kswapd.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
016c13daa5 |
mm, page_alloc: use masks and shifts when converting GFP flags to migrate types
This patch redefines which GFP bits are used for specifying mobility and the order of the migrate types. Once redefined it's possible to convert GFP flags to a migrate type with a simple mask and shift. The only downside is that readers of OOM kill messages and allocation failures may have been used to the existing values but scripts/gfp-translate will help. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
e2b19197ff |
mm, page_alloc: remove unnecessary parameter from zone_watermark_ok_safe
Overall, the intent of this series is to remove the zonelist cache which was introduced to avoid high overhead in the page allocator. Once this is done, it is necessary to reduce the cost of watermark checks. The series starts with minor micro-optimisations. Next it notes that GFP flags that affect watermark checks are abused. __GFP_WAIT historically identified callers that could not sleep and could access reserves. This was later abused to identify callers that simply prefer to avoid sleeping and have other options. A patch distinguishes between atomic callers, high-priority callers and those that simply wish to avoid sleep. The zonelist cache has been around for a long time but it is of dubious merit with a lot of complexity and some issues that are explained. The most important issue is that a failed THP allocation can cause a zone to be treated as "full". This potentially causes unnecessary stalls, reclaim activity or remote fallbacks. The issues could be fixed but it's not worth it. The series places a small number of other micro-optimisations on top before examining GFP flags watermarks. High-order watermarks enforcement can cause high-order allocations to fail even though pages are free. The watermark checks both protect high-order atomic allocations and make kswapd aware of high-order pages but there is a much better way that can be handled using migrate types. This series uses page grouping by mobility to reserve pageblocks for high-order allocations with the size of the reservation depending on demand. kswapd awareness is maintained by examining the free lists. By patch 12 in this series, there are no high-order watermark checks while preserving the properties that motivated the introduction of the watermark checks. This patch (of 10): No user of zone_watermark_ok_safe() specifies alloc_flags. This patch removes the unnecessary parameter. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Christoph Lameter <cl@linux.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
b171e40930 |
mm/page_alloc: remove unused parameter in init_currently_empty_zone()
Commit
|
||
|
|
12f03ee606 |
libnvdimm for 4.3:
1/ Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
mechanism for adding device-driver-discovered memory regions to the
kernel's direct map. This facility is used by the pmem driver to
enable pfn_to_page() operations on the page frames returned by DAX
('direct_access' in 'struct block_device_operations'). For now, the
'memmap' allocation for these "device" pages comes from "System
RAM". Support for allocating the memmap from device memory will
arrive in a later kernel.
2/ Introduce memremap() to replace usages of ioremap_cache() and
ioremap_wt(). memremap() drops the __iomem annotation for these
mappings to memory that do not have i/o side effects. The
replacement of ioremap_cache() with memremap() is limited to the
pmem driver to ease merging the api change in v4.3. Completion of
the conversion is targeted for v4.4.
3/ Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
driver, update the VFS DAX implementation and PMEM api to provide
persistence guarantees for kernel operations on a DAX mapping.
4/ Convert the ACPI NFIT 'BLK' driver to map the block apertures as
cacheable to improve performance.
5/ Miscellaneous updates and fixes to libnvdimm including support
for issuing "address range scrub" commands, clarifying the optimal
'sector size' of pmem devices, a clarification of the usage of the
ACPI '_STA' (status) property for DIMM devices, and other minor
fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJV6Nx7AAoJEB7SkWpmfYgCWyYQAI5ju6Gvw27RNFtPovHcZUf5
JGnxXejI6/AqeTQ+IulgprxtEUCrXOHjCDA5dkjr1qvsoqK1qxug+vJHOZLgeW0R
OwDtmdW4Qrgeqm+CPoxETkorJ8wDOc8mol81kTiMgeV3UqbYeeHIiTAmwe7VzZ0C
nNdCRDm5g8dHCjTKcvK3rvozgyoNoWeBiHkPe76EbnxDICxCB5dak7XsVKNMIVFQ
NuYlnw6IYN7+rMHgpgpRux38NtIW8VlYPWTmHExejc2mlioWMNBG/bmtwLyJ6M3e
zliz4/cnonTMUaizZaVozyinTa65m7wcnpjK+vlyGV2deDZPJpDRvSOtB0lH30bR
1gy+qrKzuGKpaN6thOISxFLLjmEeYwzYd7SvC9n118r32qShz+opN9XX0WmWSFlA
sajE1ehm4M7s5pkMoa/dRnAyR8RUPu4RNINdQ/Z9jFfAOx+Q26rLdQXwf9+uqbEb
bIeSQwOteK5vYYCstvpAcHSMlJAglzIX5UfZBvtEIJN7rlb0VhmGWfxAnTu+ktG1
o9cqAt+J4146xHaFwj5duTsyKhWb8BL9+xqbKPNpXEp+PbLsrnE/+WkDLFD67jxz
dgIoK60mGnVXp+16I2uMqYYDgAyO5zUdmM4OygOMnZNa1mxesjbDJC6Wat1Wsndn
slsw6DkrWT60CRE42nbK
=o57/
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"This update has successfully completed a 0day-kbuild run and has
appeared in a linux-next release. The changes outside of the typical
drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
the introduction of ZONE_DEVICE + devm_memremap_pages().
Summary:
- Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
mechanism for adding device-driver-discovered memory regions to the
kernel's direct map.
This facility is used by the pmem driver to enable pfn_to_page()
operations on the page frames returned by DAX ('direct_access' in
'struct block_device_operations').
For now, the 'memmap' allocation for these "device" pages comes
from "System RAM". Support for allocating the memmap from device
memory will arrive in a later kernel.
- Introduce memremap() to replace usages of ioremap_cache() and
ioremap_wt(). memremap() drops the __iomem annotation for these
mappings to memory that do not have i/o side effects. The
replacement of ioremap_cache() with memremap() is limited to the
pmem driver to ease merging the api change in v4.3.
Completion of the conversion is targeted for v4.4.
- Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
driver, update the VFS DAX implementation and PMEM api to provide
persistence guarantees for kernel operations on a DAX mapping.
- Convert the ACPI NFIT 'BLK' driver to map the block apertures as
cacheable to improve performance.
- Miscellaneous updates and fixes to libnvdimm including support for
issuing "address range scrub" commands, clarifying the optimal
'sector size' of pmem devices, a clarification of the usage of the
ACPI '_STA' (status) property for DIMM devices, and other minor
fixes"
* tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
libnvdimm, pmem: direct map legacy pmem by default
libnvdimm, pmem: 'struct page' for pmem
libnvdimm, pfn: 'struct page' provider infrastructure
x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
add devm_memremap_pages
mm: ZONE_DEVICE for "device memory"
mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
dax: drop size parameter to ->direct_access()
nd_blk: change aperture mapping from WC to WB
nvdimm: change to use generic kvfree()
pmem, dax: have direct_access use __pmem annotation
dax: update I/O path to do proper PMEM flushing
pmem: add copy_from_iter_pmem() and clear_pmem()
pmem, x86: clean up conditional pmem includes
pmem: remove layer when calling arch_has_wmb_pmem()
pmem, x86: move x86 PMEM API to new pmem.h header
libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
pmem: switch to devm_ allocations
devres: add devm_memremap
libnvdimm, btt: write and validate parent_uuid
...
|
||
|
|
4e6dab4233 |
mm: remove struct node_active_region
struct node_active_region is not used anymore. Remove it. Signed-off-by: minkyung88.kim <minkyung88.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
033fbae988 |
mm: ZONE_DEVICE for "device memory"
While pmem is usable as a block device or via DAX mappings to userspace there are several usage scenarios that can not target pmem due to its lack of struct page coverage. In preparation for "hot plugging" pmem into the vmemmap add ZONE_DEVICE as a new zone to tag these pages separately from the ones that are subject to standard page allocations. Importantly "device memory" can be removed at will by userspace unbinding the driver of the device. Having a separate zone prevents allocation and otherwise marks these pages that are distinct from typical uniform memory. Device memory has different lifetime and performance characteristics than RAM. However, since we have run out of ZONES_SHIFT bits this functionality currently depends on sacrificing ZONE_DMA. Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Jerome Glisse <j.glisse@gmail.com> [hch: various simplifications in the arch interface] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> |
||
|
|
3a80a7fa79 |
mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set
This patch initalises all low memory struct pages and 2G of the highest zone on each node during memory initialisation if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set. That config option cannot be set but will be available in a later patch. Parallel initialisation of struct page depends on some features from memory hotplug and it is necessary to alter alter section annotations. Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Nate Zimmer <nzimmer@sgi.com> Tested-by: Waiman Long <waiman.long@hp.com> Tested-by: Daniel J Blueman <daniel@numascale.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Robin Holt <robinmholt@gmail.com> Cc: Nate Zimmer <nzimmer@sgi.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Waiman Long <waiman.long@hp.com> Cc: Scott Norton <scott.norton@hp.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
75a592a471 |
mm: meminit: inline some helper functions
early_pfn_in_nid() and meminit_pfn_in_nid() are small functions that are unnecessarily visible outside memory initialisation. As well as unnecessary visibility, it's unnecessary function call overhead when initialising pages. This patch moves the helpers inline. [akpm@linux-foundation.org: fix build] [mhocko@suse.cz: fix build] Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Nate Zimmer <nzimmer@sgi.com> Tested-by: Waiman Long <waiman.long@hp.com> Tested-by: Daniel J Blueman <daniel@numascale.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Robin Holt <robinmholt@gmail.com> Cc: Nate Zimmer <nzimmer@sgi.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Waiman Long <waiman.long@hp.com> Cc: Scott Norton <scott.norton@hp.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
8a942fdea5 |
mm: meminit: make __early_pfn_to_nid SMP-safe and introduce meminit_pfn_in_nid
__early_pfn_to_nid() use static variables to cache recent lookups as memblock lookups are very expensive but it assumes that memory initialisation is single-threaded. Parallel initialisation of struct pages will break that assumption so this patch makes __early_pfn_to_nid() SMP-safe by requiring the caller to cache recent search information. early_pfn_to_nid() keeps the same interface but is only safe to use early in boot due to the use of a global static variable. meminit_pfn_in_nid() is an SMP-safe version that callers must maintain their own state for. Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Nate Zimmer <nzimmer@sgi.com> Tested-by: Waiman Long <waiman.long@hp.com> Tested-by: Daniel J Blueman <daniel@numascale.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Robin Holt <robinmholt@gmail.com> Cc: Nate Zimmer <nzimmer@sgi.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Waiman Long <waiman.long@hp.com> Cc: Scott Norton <scott.norton@hp.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
d7e4a2ea51 |
mm: refactor zone_movable_is_highmem()
All callers of zone_movable_is_highmem are under #ifdef CONFIG_HIGHMEM, so the else branch return 0 is not needed. Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a368ab67aa |
mm: move zone lock to a different cache line than order-0 free page lists
Huang Ying reported the following problem due to commit |
||
|
|
05891fb065 |
mm: microoptimize zonelist operations
next_zones_zonelist() returns a zoneref pointer, as well as a zone pointer via extra parameter. Since the latter can be trivially obtained by dereferencing the former, the overhead of the extra parameter is unjustified. This patch thus removes the zone parameter from next_zones_zonelist(). Both callers happen to be in the same header file, so it's simple to add the zoneref dereference inline. We save some bytes of code size. add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-105 (-105) function old new delta nr_free_zone_pages 129 115 -14 __alloc_pages_nodemask 2300 2285 -15 get_page_from_freelist 2652 2576 -76 add/remove: 0/0 grow/shrink: 1/0 up/down: 10/0 (10) function old new delta try_to_compact_pages 569 579 +10 Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
44628d9755 |
mm: fix typo of MIGRATE_RESERVE in comment
Found it when I want to jump to the definition of MIGRATE_RESERVE ctags. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
eefa864b70 |
mm/page_ext: resurrect struct page extending code for debugging
When we debug something, we'd like to insert some information to every page. For this purpose, we sometimes modify struct page itself. But, this has drawbacks. First, it requires re-compile. This makes us hesitate to use the powerful debug feature so development process is slowed down. And, second, sometimes it is impossible to rebuild the kernel due to third party module dependency. At third, system behaviour would be largely different after re-compile, because it changes size of struct page greatly and this structure is accessed by every part of kernel. Keeping this as it is would be better to reproduce errornous situation. This feature is intended to overcome above mentioned problems. This feature allocates memory for extended data per page in certain place rather than the struct page itself. This memory can be accessed by the accessor functions provided by this code. During the boot process, it checks whether allocation of huge chunk of memory is needed or not. If not, it avoids allocating memory at all. With this advantage, we can include this feature into the kernel in default and can avoid rebuild and solve related problems. Until now, memcg uses this technique. But, now, memcg decides to embed their variable to struct page itself and it's code to extend struct page has been removed. I'd like to use this code to develop debug feature, so this patch resurrect it. To help these things to work well, this patch introduces two callbacks for clients. One is the need callback which is mandatory if user wants to avoid useless memory allocation at boot-time. The other is optional, init callback, which is used to do proper initialization after memory is allocated. Detailed explanation about purpose of these functions is in code comment. Please refer it. Others are completely same with previous extension code in memcg. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
1306a85aed |
mm: embed the memcg pointer directly into struct page
Memory cgroups used to have 5 per-page pointers. To allow users to
disable that amount of overhead during runtime, those pointers were
allocated in a separate array, with a translation layer between them and
struct page.
There is now only one page pointer remaining: the memcg pointer, that
indicates which cgroup the page is associated with when charged. The
complexity of runtime allocation and the runtime translation overhead is
no longer justified to save that *potential* 0.19% of memory. With
CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
after the page->private member and doesn't even increase struct page,
and then this patch actually saves space. Remaining users that care can
still compile their kernels without CONFIG_MEMCG.
text data bss dec hex filename
8828345 1725264 983040 11536649 b00909 vmlinux.old
8827425 1725264 966656 11519345 afc571 vmlinux.new
[mhocko@suse.cz: update Documentation/cgroups/memory.txt]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
ad53f92eb4 |
mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype
Before describing bugs itself, I first explain definition of freepage.
1. pages on buddy list are counted as freepage.
2. pages on isolate migratetype buddy list are *not* counted as freepage.
3. pages on cma buddy list are counted as CMA freepage, too.
Now, I describe problems and related patch.
Patch 1: There is race conditions on getting pageblock migratetype that
it results in misplacement of freepages on buddy list, incorrect
freepage count and un-availability of freepage.
Patch 2: Freepages on pcp list could have stale cached information to
determine migratetype of buddy list to go. This causes misplacement of
freepages on buddy list and incorrect freepage count.
Patch 4: Merging between freepages on different migratetype of
pageblocks will cause freepages accouting problem. This patch fixes it.
Without patchset [3], above problem doesn't happens on my CMA allocation
test, because CMA reserved pages aren't used at all. So there is no
chance for above race.
With patchset [3], I did simple CMA allocation test and get below
result:
- Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
- run kernel build (make -j16) on background
- 30 times CMA allocation(8MB * 30 = 240MB) attempts in 5 sec interval
- Result: more than 5000 freepage count are missed
With patchset [3] and this patchset, I found that no freepage count are
missed so that I conclude that problems are solved.
On my simple memory offlining test, these problems also occur on that
environment, too.
This patch (of 4):
There are two paths to reach core free function of buddy allocator,
__free_one_page(), one is free_one_page()->__free_one_page() and the
other is free_hot_cold_page()->free_pcppages_bulk()->__free_one_page().
Each paths has race condition causing serious problems. At first, this
patch is focused on first type of freepath. And then, following patch
will solve the problem in second type of freepath.
In the first type of freepath, we got migratetype of freeing page
without holding the zone lock, so it could be racy. There are two cases
of this race.
1. pages are added to isolate buddy list after restoring orignal
migratetype
CPU1 CPU2
get migratetype => return MIGRATE_ISOLATE
call free_one_page() with MIGRATE_ISOLATE
grab the zone lock
unisolate pageblock
release the zone lock
grab the zone lock
call __free_one_page() with MIGRATE_ISOLATE
freepage go into isolate buddy list,
although pageblock is already unisolated
This may cause two problems. One is that we can't use this page anymore
until next isolation attempt of this pageblock, because freepage is on
isolate buddy list. The other is that freepage accouting could be wrong
due to merging between different buddy list. Freepages on isolate buddy
list aren't counted as freepage, but ones on normal buddy list are
counted as freepage. If merge happens, buddy freepage on normal buddy
list is inevitably moved to isolate buddy list without any consideration
of freepage accouting so it could be incorrect.
2. pages are added to normal buddy list while pageblock is isolated.
It is similar with above case.
This also may cause two problems. One is that we can't keep these
freepages from being allocated. Although this pageblock is isolated,
freepage would be added to normal buddy list so that it could be
allocated without any restriction. And the other problem is same as
case 1, that it, incorrect freepage accouting.
This race condition would be prevented by checking migratetype again
with holding the zone lock. Because it is somewhat heavy operation and
it isn't needed in common case, we want to avoid rechecking as much as
possible. So this patch introduce new variable, nr_isolate_pageblock in
struct zone to check if there is isolated pageblock. With this, we can
avoid to re-check migratetype in common case and do it only if there is
isolated pageblock or migratetype is MIGRATE_ISOLATE. This solve above
mentioned problems.
Changes from v3:
Add one more check in free_one_page() that checks whether migratetype is
MIGRATE_ISOLATE or not. Without this, abovementioned case 1 could happens.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Heesub Shin <heesub.shin@samsung.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Ritesh Harjani <ritesh.list@gmail.com>
Cc: Gioh Kim <gioh.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
5705465174 |
mm: clean up zone flags
Page reclaim tests zone_is_reclaim_dirty(), but the site that actually sets this state does zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY), sending the reader through layers indirection just to track down a simple bit. Remove all zone flag wrappers and just use bitops against zone->flags directly. It's just as readable and the lines are barely any longer. Also rename ZONE_TAIL_LRU_DIRTY to ZONE_DIRTY to match ZONE_WRITEBACK, and remove the zone_flags_t typedef. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
4ffeaf3560 |
mm: page_alloc: reduce cost of the fair zone allocation policy
The fair zone allocation policy round-robins allocations between zones
within a node to avoid age inversion problems during reclaim. If the
first allocation fails, the batch counts are reset and a second attempt
made before entering the slow path.
One assumption made with this scheme is that batches expire at roughly
the same time and the resets each time are justified. This assumption
does not hold when zones reach their low watermark as the batches will
be consumed at uneven rates. Allocation failure due to watermark
depletion result in additional zonelist scans for the reset and another
watermark check before hitting the slowpath.
On UMA, the benefit is negligible -- around 0.25%. On 4-socket NUMA
machine it's variable due to the variability of measuring overhead with
the vmstat changes. The system CPU overhead comparison looks like
3.16.0-rc3 3.16.0-rc3 3.16.0-rc3
vanilla vmstat-v5 lowercost-v5
User 746.94 774.56 802.00
System 65336.22 32847.27 40852.33
Elapsed 27553.52 27415.04 27368.46
However it is worth noting that the overall benchmark still completed
faster and intuitively it makes sense to take as few passes as possible
through the zonelists.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
0d5d823ab4 |
mm: move zone->pages_scanned into a vmstat counter
zone->pages_scanned is a write-intensive cache line during page reclaim
and it's also updated during page free. Move the counter into vmstat to
take advantage of the per-cpu updates and do not update it in the free
paths unless necessary.
On a small UMA machine running tiobench the difference is marginal. On
a 4-node machine the overhead is more noticable. Note that automatic
NUMA balancing was disabled for this test as otherwise the system CPU
overhead is unpredictable.
3.16.0-rc3 3.16.0-rc3 3.16.0-rc3
vanillarearrange-v5 vmstat-v5
User 746.94 759.78 774.56
System 65336.22 58350.98 32847.27
Elapsed 27553.52 27282.02 27415.04
Note that the overhead reduction will vary depending on where exactly
pages are allocated and freed.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
3484b2de94 |
mm: rearrange zone fields into read-only, page alloc, statistics and page reclaim lines
The arrangement of struct zone has changed over time and now it has
reached the point where there is some inappropriate sharing going on.
On x86-64 for example
o The zone->node field is shared with the zone lock and zone->node is
accessed frequently from the page allocator due to the fair zone
allocation policy.
o span_seqlock is almost never used by shares a line with free_area
o Some zone statistics share a cache line with the LRU lock so
reclaim-intensive and allocator-intensive workloads can bounce the cache
line on a stat update
This patch rearranges struct zone to put read-only and read-mostly
fields together and then splits the page allocator intensive fields, the
zone statistics and the page reclaim intensive fields into their own
cache lines. Note that the type of lowmem_reserve changes due to the
watermark calculations being signed and avoiding a signed/unsigned
conversion there.
On the test configuration I used the overall size of struct zone shrunk
by one cache line. On smaller machines, this is not likely to be
noticable. However, on a 4-node NUMA machine running tiobench the
system CPU overhead is reduced by this patch.
3.16.0-rc3 3.16.0-rc3
vanillarearrange-v5r9
User 746.94 759.78
System 65336.22 58350.98
Elapsed 27553.52 27282.02
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
1a4dc5bc7c |
mem-hotplug: improve zone_movable_is_highmem logic
In original code, zone_movable_is_highmem() assumes ZONE_MOVABLE not highmem if CONFIG_HAVE_MEMBLOCK_NODE_MAP is not set. In online_pages, it extracts pages from the previous zone before ZONE_MOVABLE. Which is logically inconsistent: If HAVE_MEMBLOCK_NODE_MAP is turned off but HIGHMEM is on, zone_movable_is_highmem() makes movable zone not highmem, but online_pages() extracts pages from ZONE_HIGHMEM. This inconsistency doesn't cause real problem currently, because all architectures support online_pages also have HAVE_MEMBLOCK_NODE_MAP. However, fixing it makes code clear, and also helps futher coding. Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: Zhang Zhen <zhangzhen@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Jiang Liu <liuj97@gmail.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
7aeb09f910 |
mm: page_alloc: use unsigned int for order in more places
X86 prefers the use of unsigned types for iterators and there is a tendency to mix whether a signed or unsigned type if used for page order. This converts a number of sites in mm/page_alloc.c to use unsigned int for order where possible. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
dc4b0caff2 |
mm: page_alloc: reduce number of times page_to_pfn is called
In the free path we calculate page_to_pfn multiple times. Reduce that. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
e58469bafd |
mm: page_alloc: use word-based accesses for get/set pageblock bitmaps
The test_bit operations in get/set pageblock flags are expensive. This
patch reads the bitmap on a word basis and use shifts and masks to isolate
the bits of interest. Similarly masks are used to set a local copy of the
bitmap and then use cmpxchg to update the bitmap if there have been no
other changes made in parallel.
In a test running dd onto tmpfs the overhead of the pageblock-related
functions went from 1.27% in profiles to 0.5%.
In addition to the performance benefits, this patch closes races that are
possible between:
a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
reads part of the bits before and other part of the bits after
set_pageblock_migratetype() has updated them.
b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
read-modify-update set bit operation in set_pageblock_skip() will cause
lost updates to some bits changed in the set_pageblock_migratetype().
Joonsoo Kim first reported the case a) via code inspection. Vlastimil
Babka's testing with a debug patch showed that either a) or b) occurs
roughly once per mmtests' stress-highalloc benchmark (although not
necessarily in the same pageblock). Furthermore during development of
unrelated compaction patches, it was observed that frequent calls to
{start,undo}_isolate_page_range() the race occurs several thousands of
times and has resulted in NULL pointer dereferences in move_freepages()
and free_one_page() in places where free_list[migratetype] is
manipulated by e.g. list_move(). Further debugging confirmed that
migratetype had invalid value of 6, causing out of bounds access to the
free_list array.
That confirmed that the race exist, although it may be extremely rare,
and currently only fatal where page isolation is performed due to
memory hot remove. Races on pageblocks being updated by
set_pageblock_migratetype(), where both old and new migratetype are
lower MIGRATE_RESERVE, currently cannot result in an invalid value
being observed, although theoretically they may still lead to
unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
Furthermore, things could get suddenly worse when memory isolation is
used more, or when new migratetypes are added.
After this patch, the race has no longer been observed in testing.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-and-tested-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
35979ef339 |
mm, compaction: add per-zone migration pfn cache for async compaction
Each zone has a cached migration scanner pfn for memory compaction so that subsequent calls to memory compaction can start where the previous call left off. Currently, the compaction migration scanner only updates the per-zone cached pfn when pageblocks were not skipped for async compaction. This creates a dependency on calling sync compaction to avoid having subsequent calls to async compaction from scanning an enormous amount of non-MOVABLE pageblocks each time it is called. On large machines, this could be potentially very expensive. This patch adds a per-zone cached migration scanner pfn only for async compaction. It is updated everytime a pageblock has been scanned in its entirety and when no pages from it were successfully isolated. The cached migration scanner pfn for sync compaction is updated only when called for sync compaction. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Greg Thelen <gthelen@google.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
bfc8c90139 |
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
5f7a75acdb |
mm: page_alloc: do not cache reclaim distances
pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by zone_reclaim due to its distance. As it is expected that zone_reclaim_mode will be rarely enabled it is unreasonable for all machines to take a penalty. Fortunately, the zone_reclaim_mode() path is already slow and it is the path that takes the hit. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
449dd6984d |
mm: keep page cache radix tree nodes in check
Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a528910e12 |
mm: thrash detection-based file cache sizing
The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
e97ca8e5b8 |
mm: fix GFP_THISNODE callers and clarify
GFP_THISNODE is for callers that implement their own clever fallback to remote nodes. It restricts the allocation to the specified node and does not invoke reclaim, assuming that the caller will take care of it when the fallback fails, e.g. through a subsequent allocation request without GFP_THISNODE set. However, many current GFP_THISNODE users only want the node exclusive aspect of the flag, without actually implementing their own fallback or triggering reclaim if necessary. This results in things like page migration failing prematurely even when there is easily reclaimable memory available, unless kswapd happens to be running already or a concurrent allocation attempt triggers the necessary reclaim. Convert all callsites that don't implement their own fallback strategy to __GFP_THISNODE. This restricts the allocation a single node too, but at the same time allows the allocator to enter the slowpath, wake kswapd, and invoke direct reclaim if necessary, to make the allocation happen when memory is full. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Jan Stancek <jstancek@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
1c5e9c27cb |
mm: numa: limit scope of lock for NUMA migrate rate limiting
NUMA migrate rate limiting protects a migration counter and window using a lock but in some cases this can be a contended lock. It is not critical that the number of pages be perfect, lost updates are acceptable. Reduce the importance of this lock. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
943dca1a1f |
mm: get rid of unnecessary pageblock scanning in setup_zone_migrate_reserve
Yasuaki Ishimatsu reported memory hot-add spent more than 5 _hours_ on
9TB memory machine since onlining memory sections is too slow. And we
found out setup_zone_migrate_reserve spent >90% of the time.
The problem is, setup_zone_migrate_reserve scans all pageblocks
unconditionally, but it is only necessary if the number of reserved
block was reduced (i.e. memory hot remove).
Moreover, maximum MIGRATE_RESERVE per zone is currently 2. It means
that the number of reserved pageblocks is almost always unchanged.
This patch adds zone->nr_migrate_reserve_block to maintain the number of
MIGRATE_RESERVE pageblocks and it reduces the overhead of
setup_zone_migrate_reserve dramatically. The following table shows time
of onlining a memory section.
Amount of memory | 128GB | 192GB | 256GB|
---------------------------------------------
linux-3.12 | 23.9 | 31.4 | 44.5 |
This patch | 8.3 | 8.3 | 8.6 |
Mel's proposal patch | 10.9 | 19.2 | 31.3 |
---------------------------------------------
(millisecond)
128GB : 4 nodes and each node has 32GB of memory
192GB : 6 nodes and each node has 32GB of memory
256GB : 8 nodes and each node has 32GB of memory
(*1) Mel proposed his idea by the following threads.
https://lkml.org/lkml/2013/10/30/272
[akpm@linux-foundation.org: tweak comment]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
6e543d5780 |
mm: vmscan: fix do_try_to_free_pages() livelock
This patch is based on KOSAKI's work and I add a little more description, please refer https://lkml.org/lkml/2012/6/14/74. Currently, I found system can enter a state that there are lots of free pages in a zone but only order-0 and order-1 pages which means the zone is heavily fragmented, then high order allocation could make direct reclaim path's long stall(ex, 60 seconds) especially in no swap and no compaciton enviroment. This problem happened on v3.4, but it seems issue still lives in current tree, the reason is do_try_to_free_pages enter live lock: kswapd will go to sleep if the zones have been fully scanned and are still not balanced. As kswapd thinks there's little point trying all over again to avoid infinite loop. Instead it changes order from high-order to 0-order because kswapd think order-0 is the most important. Look at |
||
|
|
81c0a2bb51 |
mm: page_alloc: fair zone allocator policy
Each zone that holds userspace pages of one workload must be aged at a
speed proportional to the zone size. Otherwise, the time an individual
page gets to stay in memory depends on the zone it happened to be
allocated in. Asymmetry in the zone aging creates rather unpredictable
aging behavior and results in the wrong pages being reclaimed, activated
etc.
But exactly this happens right now because of the way the page allocator
and kswapd interact. The page allocator uses per-node lists of all zones
in the system, ordered by preference, when allocating a new page. When
the first iteration does not yield any results, kswapd is woken up and the
allocator retries. Due to the way kswapd reclaims zones below the high
watermark while a zone can be allocated from when it is above the low
watermark, the allocator may keep kswapd running while kswapd reclaim
ensures that the page allocator can keep allocating from the first zone in
the zonelist for extended periods of time. Meanwhile the other zones
rarely see new allocations and thus get aged much slower in comparison.
The result is that the occasional page placed in lower zones gets
relatively more time in memory, even gets promoted to the active list
after its peers have long been evicted. Meanwhile, the bulk of the
working set may be thrashing on the preferred zone even though there may
be significant amounts of memory available in the lower zones.
Even the most basic test -- repeatedly reading a file slightly bigger than
memory -- shows how broken the zone aging is. In this scenario, no single
page should be able stay in memory long enough to get referenced twice and
activated, but activation happens in spades:
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 0
nr_active_file 8
nr_inactive_file 1582
nr_active_file 11994
$ cat data data data data >/dev/null
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 70
nr_inactive_file 258753
nr_active_file 443214
nr_inactive_file 149793
nr_active_file 12021
Fix this with a very simple round robin allocator. Each zone is allowed a
batch of allocations that is proportional to the zone's size, after which
it is treated as full. The batch counters are reset when all zones have
been tried and the allocator enters the slowpath and kicks off kswapd
reclaim. Allocation and reclaim is now fairly spread out to all
available/allowable zones:
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 174
nr_active_file 4865
nr_inactive_file 53
nr_active_file 860
$ cat data data data data >/dev/null
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 666622
nr_active_file 4988
nr_inactive_file 190969
nr_active_file 937
When zone_reclaim_mode is enabled, allocations will now spread out to all
zones on the local node, not just the first preferred zone (which on a 4G
node might be a tiny Normal zone).
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paul Bolle <paul.bollee@gmail.com>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Tested-by: Kevin Hilman <khilman@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
b21fbccd4b |
mm: remove unused functions is_{normal_idx, normal, dma32, dma}
These functions are nowhere used, so remove them. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
55878e88c5 |
sparsemem: add BUILD_BUG_ON when sizeof mem_section is non-power-of-2
Instead of leaving a hidden trap for the next person who comes along and wants to add something to mem_section, add a big fat warning about it needing to be a power-of-2, and insert a BUILD_BUG_ON() in sparse_init() to catch mistakes. Right now non-power-of-2 mem_sections cause a number of WARNs at boot (which don't clearly point to the size of mem_section as an issue), but the system limps on (temporarily, at least). This is based upon Dave Hansen's earlier RFC where he ran into the same issue: "sparsemem: fix boot when SECTIONS_PER_ROOT is not power-of-2" http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/03077.html Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jiang Liu <liuj97@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
c3d5f5f0c2 |
mm: use a dedicated lock to protect totalram_pages and zone->managed_pages
Currently lock_memory_hotplug()/unlock_memory_hotplug() are used to protect totalram_pages and zone->managed_pages. Other than the memory hotplug driver, totalram_pages and zone->managed_pages may also be modified at runtime by other drivers, such as Xen balloon, virtio_balloon etc. For those cases, memory hotplug lock is a little too heavy, so introduce a dedicated lock to protect totalram_pages and zone->managed_pages. Now we have a simplified locking rules totalram_pages and zone->managed_pages as: 1) no locking for read accesses because they are unsigned long. 2) no locking for write accesses at boot time in single-threaded context. 3) serialize write accesses at runtime by acquiring the dedicated managed_page_count_lock. Also adjust zone->managed_pages when freeing reserved pages into the buddy system, to keep totalram_pages and zone->managed_pages in consistence. [akpm@linux-foundation.org: don't export adjust_managed_page_count to modules (for now)] Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: <sworddragon2@aol.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
114d4b79f7 |
mmzone: note that node_size_lock should be manipulated via pgdat_resize_lock()
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
72c3b51bda |
mm: fix comment referring to non-existent size_seqlock, change to span_seqlock
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
283aba9f9e |
mm: vmscan: block kswapd if it is encountering pages under writeback
Historically, kswapd used to congestion_wait() at higher priorities if
it was not making forward progress. This made no sense as the failure
to make progress could be completely independent of IO. It was later
replaced by wait_iff_congested() and removed entirely by commit
|
||
|
|
d43006d503 |
mm: vmscan: have kswapd writeback pages based on dirty pages encountered, not priority
Currently kswapd queues dirty pages for writeback if scanning at an elevated priority but the priority kswapd scans at is not related to the number of unqueued dirty encountered. Since commit "mm: vmscan: Flatten kswapd priority loop", the priority is related to the size of the LRU and the zone watermark which is no indication as to whether kswapd should write pages or not. This patch tracks if an excessive number of unqueued dirty pages are being encountered at the end of the LRU. If so, it indicates that dirty pages are being recycled before flusher threads can clean them and flags the zone so that kswapd will start writing pages until the zone is balanced. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
5d434fcb25 |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina: "Usual stuff, mostly comment fixes, typo fixes, printk fixes and small code cleanups" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (45 commits) mm: Convert print_symbol to %pSR gfs2: Convert print_symbol to %pSR m32r: Convert print_symbol to %pSR iostats.txt: add easy-to-find description for field 6 x86 cmpxchg.h: fix wrong comment treewide: Fix typo in printk and comments doc: devicetree: Fix various typos docbook: fix 8250 naming in device-drivers pata_pdc2027x: Fix compiler warning treewide: Fix typo in printks mei: Fix comments in drivers/misc/mei treewide: Fix typos in kernel messages pm44xx: Fix comment for "CONFIG_CPU_IDLE" doc: Fix typo "CONFIG_CGROUP_CGROUP_MEMCG_SWAP" mmzone: correct "pags" to "pages" in comment. kernel-parameters: remove outdated 'noresidual' parameter Remove spurious _H suffixes from ifdef comments sound: Remove stray pluses from Kconfig file radio-shark: Fix printk "CONFIG_LED_CLASS" doc: put proper reference to CONFIG_MODULE_SIG_ENFORCE ... |
||
|
|
8761e31c22 |
mmzone: correct "pags" to "pages" in comment.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> |
||
|
|
f9228b204f |
mm: zone_end_pfn is too small
Booting with 32 TBytes memory hits BUG at mm/page_alloc.c:552! (output
below).
The key hint is "page 4294967296 outside zone".
4294967296 = 0x100000000 (bit 32 is set).
The problem is in include/linux/mmzone.h:
530 static inline unsigned zone_end_pfn(const struct zone *zone)
531 {
532 return zone->zone_start_pfn + zone->spanned_pages;
533 }
zone_end_pfn is "unsigned" (32 bits). Changing it to "unsigned long"
(64 bits) fixes the problem.
zone_end_pfn() was added recently in commit
|
||
|
|
da3649e133 |
mmzone: add pgdat_{end_pfn,is_empty}() helpers & consolidate.
Add pgdat_end_pfn() and pgdat_is_empty() helpers which match the similar zone_*() functions. Change node_end_pfn() to be a wrapper of pgdat_end_pfn(). Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
2a6e3ebee2 |
mm: add zone_is_empty() and zone_is_initialized()
Factoring out these 2 checks makes it more clear what we are actually checking for. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
108bcc96ef |
mm: add & use zone_end_pfn() and zone_spans_pfn()
Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code duplication. This also switches to using them in compaction (where an additional variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and kmemleak. Note that in compaction.c I avoid calling zone_end_pfn() repeatedly because I expect at some point the sycronization issues with start_pfn & spanned_pages will need fixing, either by actually using the seqlock or clever memory barrier usage. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
bbeae5b05e |
mm: move page flags layout to separate header
This is a preparation patch for moving page->_last_nid into page->flags that moves page flag layout information to a separate header. This patch is necessary because otherwise there would be a circular dependency between mm_types.h and mm.h. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
194159fbcc |
mm: remove MIGRATE_ISOLATE check in hotpath
Several functions test MIGRATE_ISOLATE and some of those are hotpath but MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION(ie, CMA, memory-hotplug and memory-failure) which are not common config option. So let's not add unnecessary overhead and code when we don't enable CONFIG_MEMORY_ISOLATION. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a458431e17 |
mm: fix zone_watermark_ok_safe() accounting of isolated pages
Commit |
||
|
|
3d59eebc5e |
Automatic NUMA Balancing V11
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIcBAABAgAGBQJQx0kQAAoJEHzG/DNEskfi4fQP/R5PRovayroZALBMLnVJDaLD
Ttr9p40VNXbiJ+MfRgatJjSSJZ4Jl+fC3NEqBhcwVZhckZZb9R2s0WtrSQo5+ZbB
vdRfiuKoCaKM4cSZ08C12uTvsF6xjhjd27CTUlMkyOcDoKxMEFKelv0hocSxe4Wo
xqlv3eF+VsY7kE1BNbgBP06SX4tDpIHRxXfqJPMHaSKQmre+cU0xG2GcEu3QGbHT
DEDTI788YSaWLmBfMC+kWoaQl1+bV/FYvavIAS8/o4K9IKvgR42VzrXmaFaqrbgb
72ksa6xfAi57yTmZHqyGmts06qYeBbPpKI+yIhCMInxA9CY3lPbvHppRf0RQOyzj
YOi4hovGEMJKE+BCILukhJcZ9jCTtS3zut6v1rdvR88f4y7uhR9RfmRfsxuW7PNj
3Rmh191+n0lVWDmhOs2psXuCLJr3LEiA0dFffN1z8REUTtTAZMsj8Rz+SvBNAZDR
hsJhERVeXB6X5uQ5rkLDzbn1Zic60LjVw7LIp6SF2OYf/YKaF8vhyWOA8dyCEu8W
CGo7AoG0BO8tIIr8+LvFe8CweypysZImx4AjCfIs4u9pu/v11zmBvO9NO5yfuObF
BreEERYgTes/UITxn1qdIW4/q+Nr0iKO3CTqsmu6L1GfCz3/XzPGs3U26fUhllqi
Ka0JKgnWvsa6ez6FSzKI
=ivQa
-----END PGP SIGNATURE-----
Merge tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma
Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
"There are three implementations for NUMA balancing, this tree
(balancenuma), numacore which has been developed in tip/master and
autonuma which is in aa.git.
In almost all respects balancenuma is the dumbest of the three because
its main impact is on the VM side with no attempt to be smart about
scheduling. In the interest of getting the ball rolling, it would be
desirable to see this much merged for 3.8 with the view to building
scheduler smarts on top and adapting the VM where required for 3.9.
The most recent set of comparisons available from different people are
mel: https://lkml.org/lkml/2012/12/9/108
mingo: https://lkml.org/lkml/2012/12/7/331
tglx: https://lkml.org/lkml/2012/12/10/437
srikar: https://lkml.org/lkml/2012/12/10/397
The results are a mixed bag. In my own tests, balancenuma does
reasonably well. It's dumb as rocks and does not regress against
mainline. On the other hand, Ingo's tests shows that balancenuma is
incapable of converging for this workloads driven by perf which is bad
but is potentially explained by the lack of scheduler smarts. Thomas'
results show balancenuma improves on mainline but falls far short of
numacore or autonuma. Srikar's results indicate we all suffer on a
large machine with imbalanced node sizes.
My own testing showed that recent numacore results have improved
dramatically, particularly in the last week but not universally.
We've butted heads heavily on system CPU usage and high levels of
migration even when it shows that overall performance is better.
There are also cases where it regresses. Of interest is that for
specjbb in some configurations it will regress for lower numbers of
warehouses and show gains for higher numbers which is not reported by
the tool by default and sometimes missed in treports. Recently I
reported for numacore that the JVM was crashing with
NullPointerExceptions but currently it's unclear what the source of
this problem is. Initially I thought it was in how numacore batch
handles PTEs but I'm no longer think this is the case. It's possible
numacore is just able to trigger it due to higher rates of migration.
These reports were quite late in the cycle so I/we would like to start
with this tree as it contains much of the code we can agree on and has
not changed significantly over the last 2-3 weeks."
* tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
mm/rmap: Convert the struct anon_vma::mutex to an rwsem
mm: migrate: Account a transhuge page properly when rate limiting
mm: numa: Account for failed allocations and isolations as migration failures
mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
mm: numa: Add THP migration for the NUMA working set scanning fault case.
mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
mm: sched: numa: Control enabling and disabling of NUMA balancing
mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
mm: numa: migrate: Set last_nid on newly allocated page
mm: numa: split_huge_page: Transfer last_nid on tail page
mm: numa: Introduce last_nid to the page frame
sched: numa: Slowly increase the scanning period as NUMA faults are handled
mm: numa: Rate limit setting of pte_numa if node is saturated
mm: numa: Rate limit the amount of memory that is migrated between nodes
mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
mm: numa: Migrate pages handled during a pmd_numa hinting fault
mm: numa: Migrate on reference policy
...
|
||
|
|
9feedc9d83 |
mm: introduce new field "managed_pages" to struct zone
Currently a zone's present_pages is calcuated as below, which is inaccurate and may cause trouble to memory hotplug. spanned_pages - absent_pages - memmap_pages - dma_reserve. During fixing bugs caused by inaccurate zone->present_pages, we found zone->present_pages has been abused. The field zone->present_pages may have different meanings in different contexts: 1) pages existing in a zone. 2) pages managed by the buddy system. For more discussions about the issue, please refer to: http://lkml.org/lkml/2012/11/5/866 https://patchwork.kernel.org/patch/1346751/ This patchset tries to introduce a new field named "managed_pages" to struct zone, which counts "pages managed by the buddy system". And revert zone->present_pages to count "physical pages existing in a zone", which also keep in consistence with pgdat->node_present_pages. We will set an initial value for zone->managed_pages in function free_area_init_core() and will adjust it later if the initial value is inaccurate. For DMA/normal zones, the initial value is set to: (spanned_pages - absent_pages - memmap_pages - dma_reserve) Later zone->managed_pages will be adjusted to the accurate value when the bootmem allocator frees all free pages to the buddy system in function free_all_bootmem_node() and free_all_bootmem(). The bootmem allocator doesn't touch highmem pages, so highmem zones' managed_pages is set to the accurate value "spanned_pages - absent_pages" in function free_area_init_core() and won't be updated anymore. This patch also adds a new field "managed_pages" to /proc/zoneinfo and sysrq showmem. [akpm@linux-foundation.org: small comment tweaks] Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Maciej Rutecki <maciej.rutecki@gmail.com> Tested-by: Chris Clayton <chris2553@googlemail.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
bc357f431c |
mm: cma: remove watermark hacks
Commits |
||
|
|
8177a420ed |
mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
This defines the per-node data used by Migrate On Fault in order to rate limit the migration. The rate limiting is applied independently to each destination node. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> |
||
|
|
bea8c150a7 |
memcg: fix hotplugged memory zone oops
When MEMCG is configured on (even when it's disabled by boot option), when adding or removing a page to/from its lru list, the zone pointer used for stats updates is nowadays taken from the struct lruvec. (On many configurations, calculating zone from page is slower.) But we have no code to update all the lruvecs (per zone, per memcg) when a memory node is hotadded. Here's an extract from the oops which results when running numactl to bind a program to a newly onlined node: BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60 IP: __mod_zone_page_state+0x9/0x60 Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0) Call Trace: __pagevec_lru_add_fn+0xdf/0x140 pagevec_lru_move_fn+0xb1/0x100 __pagevec_lru_add+0x1c/0x30 lru_add_drain_cpu+0xa3/0x130 lru_add_drain+0x2f/0x40 ... The natural solution might be to use a memcg callback whenever memory is hotadded; but that solution has not been scoped out, and it happens that we do have an easy location at which to update lruvec->zone. The lruvec pointer is discovered either by mem_cgroup_zone_lruvec() or by mem_cgroup_page_lruvec(), and both of those do know the right zone. So check and set lruvec->zone in those; and remove the inadequate attempt to set lruvec->zone from lruvec_init(), which is called before NODE_DATA(node) has been allocated in such cases. Ah, there was one exceptionr. For no particularly good reason, mem_cgroup_force_empty_list() has its own code for deciding lruvec. Change it to use the standard mem_cgroup_zone_lruvec() and mem_cgroup_get_lru_size() too. In fact it was already safe against such an oops (the lru lists in danger could only be empty), but we're better proofed against future changes this way. I've marked this for stable (3.6) since we introduced the problem in 3.5 (now closed to stable); but I have no idea if this is the only fix needed to get memory hotadd working with memcg in 3.6, and received no answer when I enquired twice before. Reported-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
e46a28790e |
CMA: migrate mlocked pages
Presently CMA cannot migrate mlocked pages so it ends up failing to allocate contiguous memory space. This patch makes mlocked pages be migrated out. Of course, it can affect realtime processes but in CMA usecase, contiguous memory allocation failing is far worse than access latency to an mlocked page being variable while CMA is running. If someone wants to make the system realtime, he shouldn't enable CMA because stalls can still happen at random times. [akpm@linux-foundation.org: tweak comment text, per Mel] Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
957f822a0a |
mm, numa: reclaim from all nodes within reclaim distance
RECLAIM_DISTANCE represents the distance between nodes at which it is deemed too costly to allocate from; it's preferred to try to reclaim from a local zone before falling back to allocating on a remote node with such a distance. To do this, zone_reclaim_mode is set if the distance between any two nodes on the system is greather than this distance. This, however, ends up causing the page allocator to reclaim from every zone regardless of its affinity. What we really want is to reclaim only from zones that are closer than RECLAIM_DISTANCE. This patch adds a nodemask to each node that represents the set of nodes that are within this distance. During the zone iteration, if the bit for a zone's node is set for the local node, then reclaim is attempted; otherwise, the zone is skipped. [akpm@linux-foundation.org: fix CONFIG_NUMA=n build] Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
62997027ca |
mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity
Compaction caches if a pageblock was scanned and no pages were isolated so that the pageblocks can be skipped in the future to reduce scanning. This information is not cleared by the page allocator based on activity due to the impact it would have to the page allocator fast paths. Hence there is a requirement that something clear the cache or pageblocks will be skipped forever. Currently the cache is cleared if there were a number of recent allocation failures and it has not been cleared within the last 5 seconds. Time-based decisions like this are terrible as they have no relationship to VM activity and is basically a big hammer. Unfortunately, accurate heuristics would add cost to some hot paths so this patch implements a rough heuristic. There are two cases where the cache is cleared. 1. If a !kswapd process completes a compaction cycle (migrate and free scanner meet), the zone is marked compact_blockskip_flush. When kswapd goes to sleep, it will clear the cache. This is expected to be the common case where the cache is cleared. It does not really matter if kswapd happens to be asleep or going to sleep when the flag is set as it will be woken on the next allocation request. 2. If there have been multiple failures recently and compaction just finished being deferred then a process will clear the cache and start a full scan. This situation happens if there are multiple high-order allocation requests under heavy memory pressure. The clearing of the PG_migrate_skip bits and other scans is inherently racy but the race is harmless. For allocations that can fail such as THP, they will simply fail. For requests that cannot fail, they will retry the allocation. Tests indicated that scanning rates were roughly similar to when the time-based heuristic was used and the allocation success rates were similar. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
c89511ab2f |
mm: compaction: Restart compaction from near where it left off
This is almost entirely based on Rik's previous patches and discussions with him about how this might be implemented. Order > 0 compaction stops when enough free pages of the correct page order have been coalesced. When doing subsequent higher order allocations, it is possible for compaction to be invoked many times. However, the compaction code always starts out looking for things to compact at the start of the zone, and for free pages to compact things to at the end of the zone. This can cause quadratic behaviour, with isolate_freepages starting at the end of the zone each time, even though previous invocations of the compaction code already filled up all free memory on that end of the zone. This can cause isolate_freepages to take enormous amounts of CPU with certain workloads on larger memory systems. This patch caches where the migration and free scanner should start from on subsequent compaction invocations using the pageblock-skip information. When compaction starts it begins from the cached restart points and will update the cached restart points until a page is isolated or a pageblock is skipped that would have been scanned by synchronous compaction. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
bb13ffeb9f |
mm: compaction: cache if a pageblock was scanned and no pages were isolated
When compaction was implemented it was known that scanning could potentially be excessive. The ideal was that a counter be maintained for each pageblock but maintaining this information would incur a severe penalty due to a shared writable cache line. It has reached the point where the scanning costs are a serious problem, particularly on long-lived systems where a large process starts and allocates a large number of THPs at the same time. Instead of using a shared counter, this patch adds another bit to the pageblock flags called PG_migrate_skip. If a pageblock is scanned by either migrate or free scanner and 0 pages were isolated, the pageblock is marked to be skipped in the future. When scanning, this bit is checked before any scanning takes place and the block skipped if set. The main difficulty with a patch like this is "when to ignore the cached information?" If it's ignored too often, the scanning rates will still be excessive. If the information is too stale then allocations will fail that might have otherwise succeeded. In this patch o CMA always ignores the information o If the migrate and free scanner meet then the cached information will be discarded if it's at least 5 seconds since the last time the cache was discarded o If there are a large number of allocation failures, discard the cache. The time-based heuristic is very clumsy but there are few choices for a better event. Depending solely on multiple allocation failures still allows excessive scanning when THP allocations are failing in quick succession due to memory pressure. Waiting until memory pressure is relieved would cause compaction to continually fail instead of using reclaim/compaction to try allocate the page. The time-based mechanism is clumsy but a better option is not obvious. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
753341a4b8 |
revert "mm: have order > 0 compaction start off where it left"
This reverts commit |
||
|
|
d1ce749a0d |
cma: count free CMA pages
Add NR_FREE_CMA_PAGES counter to be later used for checking watermark in __zone_watermark_ok(). For simplicity and to avoid #ifdef hell make this counter always available (not only when CONFIG_CMA=y). [akpm@linux-foundation.org: use conventional migratetype naming] Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
5515061d22 |
mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage
If swap is backed by network storage such as NBD, there is a risk that a large number of reclaimers can hang the system by consuming all PF_MEMALLOC reserves. To avoid these hangs, the administrator must tune min_free_kbytes in advance which is a bit fragile. This patch throttles direct reclaimers if half the PF_MEMALLOC reserves are in use. If the system is routinely getting throttled the system administrator can increase min_free_kbytes so degradation is smoother but the system will keep running. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
702d1a6e07 |
memory-hotplug: fix kswapd looping forever problem
When hotplug offlining happens on zone A, it starts to mark freed page as MIGRATE_ISOLATE type in buddy for preventing further allocation. (MIGRATE_ISOLATE is very irony type because it's apparently on buddy but we can't allocate them). When the memory shortage happens during hotplug offlining, current task starts to reclaim, then wake up kswapd. Kswapd checks watermark, then go sleep because current zone_watermark_ok_safe doesn't consider MIGRATE_ISOLATE freed page count. Current task continue to reclaim in direct reclaim path without kswapd's helping. The problem is that zone->all_unreclaimable is set by only kswapd so that current task would be looping forever like below. __alloc_pages_slowpath restart: wake_all_kswapd rebalance: __alloc_pages_direct_reclaim do_try_to_free_pages if global_reclaim && !all_unreclaimable return 1; /* It means we did did_some_progress */ skip __alloc_pages_may_oom should_alloc_retry goto rebalance; If we apply KOSAKI's patch[1] which doesn't depends on kswapd about setting zone->all_unreclaimable, we can solve this problem by killing some task in direct reclaim path. But it doesn't wake up kswapd, still. It could be a problem still if other subsystem needs GFP_ATOMIC request. So kswapd should consider MIGRATE_ISOLATE when it calculate free pages BEFORE going sleep. This patch counts the number of MIGRATE_ISOLATE page block and zone_watermark_ok_safe will consider it if the system has such blocks (fortunately, it's very rare so no problem in POV overhead and kswapd is never hotpath). Copy/modify from Mel's quote " Ideal solution would be "allocating" the pageblock. It would keep the free space accounting as it is but historically, memory hotplug didn't allocate pages because it would be difficult to detect if a pageblock was isolated or if part of some balloon. Allocating just full pageblocks would work around this, However, it would play very badly with CMA. " [1] http://lkml.org/lkml/2012/6/14/74 [akpm@linux-foundation.org: simplify nr_zone_isolate_freepages(), rework zone_watermark_ok_safe() comment, simplify set_pageblock_isolate() and restore_pageblock_isolate()] [akpm@linux-foundation.org: fix CONFIG_MEMORY_ISOLATION=n build] Signed-off-by: Minchan Kim <minchan@kernel.org> Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: Aaditya Kumar <aaditya.kumar.30@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
9adb62a5df |
mm/hotplug: correctly setup fallback zonelists when creating new pgdat
When hotadd_new_pgdat() is called to create new pgdat for a new node, a
fallback zonelist should be created for the new node. There's code to try
to achieve that in hotadd_new_pgdat() as below:
/*
* The node we allocated has no zone fallback lists. For avoiding
* to access not-initialized zonelist, build here.
*/
mutex_lock(&zonelists_mutex);
build_all_zonelists(pgdat, NULL);
mutex_unlock(&zonelists_mutex);
But it doesn't work as expected. When hotadd_new_pgdat() is called, the
new node is still in offline state because node_set_online(nid) hasn't
been called yet. And build_all_zonelists() only builds zonelists for
online nodes as:
for_each_online_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);
build_zonelists(pgdat);
build_zonelist_cache(pgdat);
}
Though we hope to create zonelist for the new pgdat, but it doesn't. So
add a new parameter "pgdat" the build_all_zonelists() to build pgdat for
the new pgdat too.
Signed-off-by: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
fe03025db3 |
mm: CONFIG_HAVE_MEMBLOCK_NODE -> CONFIG_HAVE_MEMBLOCK_NODE_MAP
|
||
|
|
7db8889ab0 |
mm: have order > 0 compaction start off where it left
Order > 0 compaction stops when enough free pages of the correct page order have been coalesced. When doing subsequent higher order allocations, it is possible for compaction to be invoked many times. However, the compaction code always starts out looking for things to compact at the start of the zone, and for free pages to compact things to at the end of the zone. This can cause quadratic behaviour, with isolate_freepages starting at the end of the zone each time, even though previous invocations of the compaction code already filled up all free memory on that end of the zone. This can cause isolate_freepages to take enormous amounts of CPU with certain workloads on larger memory systems. The obvious solution is to have isolate_freepages remember where it left off last time, and continue at that point the next time it gets invoked for an order > 0 compaction. This could cause compaction to fail if cc->free_pfn and cc->migrate_pfn are close together initially, in that case we restart from the end of the zone and try once more. Forced full (order == -1) compactions are left alone. [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: s/laste/last/, use 80 cols] Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Jim Schutt <jaschut@sandia.gov> Tested-by: Jim Schutt <jaschut@sandia.gov> Cc: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ca28ddc908 |
mm: remove unused LRU_ALL_EVICTABLE
Signed-off-by: Wanpeng Li <liwp.linux@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
c255a45805 |
memcg: rename config variables
Sanity: CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM [mhocko@suse.cz: fix missed bits] Cc: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
d14b7a419a |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina: "Trivial updates all over the place as usual." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits) Fix typo in include/linux/clk.h . pci: hotplug: Fix typo in pci iommu: Fix typo in iommu video: Fix typo in drivers/video Documentation: Add newline at end-of-file to files lacking one arm,unicore32: Remove obsolete "select MISC_DEVICES" module.c: spelling s/postition/position/g cpufreq: Fix typo in cpufreq driver trivial: typo in comment in mksysmap mach-omap2: Fix typo in debug message and comment scsi: aha152x: Fix sparse warning and make printing pointer address more portable. Change email address for Steve Glendinning Btrfs: fix typo in convert_extent_bit via: Remove bogus if check netprio_cgroup.c: fix comment typo backlight: fix memory leak on obscure error path Documentation: asus-laptop.txt references an obsolete Kconfig item Documentation: ManagementStyle: fixed typo mm/vmscan: cleanup comment error in balance_pgdat mm: cleanup on the comments of zone_reclaim_stat ... |
||
|
|
d8adde17e5 |
memory hotplug: fix invalid memory access caused by stale kswapd pointer
kswapd_stop() is called to destroy the kswapd work thread when all memory of a NUMA node has been offlined. But kswapd_stop() only terminates the work thread without resetting NODE_DATA(nid)->kswapd to NULL. The stale pointer will prevent kswapd_run() from creating a new work thread when adding memory to the memory-less NUMA node again. Eventually the stale pointer may cause invalid memory access. An example stack dump as below. It's reproduced with 2.6.32, but latest kernel has the same issue. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81051a94>] exit_creds+0x12/0x78 PGD 0 Oops: 0000 [#1] SMP last sysfs file: /sys/devices/system/memory/memory391/state CPU 11 Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285 RIP: 0010:exit_creds+0x12/0x78 RSP: 0018:ffff8806044f1d78 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502 RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000 RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10 R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000 R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001 FS: 00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600) Stack: ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1 0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003 Call Trace: __put_task_struct+0x5d/0x97 kthread_stop+0x50/0x58 offline_pages+0x324/0x3da memory_block_change_state+0x179/0x1db store_mem_state+0x9e/0xbb sysfs_write_file+0xd0/0x107 vfs_write+0xad/0x169 sys_write+0x45/0x6e system_call_fastpath+0x16/0x1b Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 <8b> 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0 RIP exit_creds+0x12/0x78 RSP <ffff8806044f1d78> CR2: 0000000000000000 [akpm@linux-foundation.org: add pglist_data.kswapd locking comments] Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
59f91e5dd0 |
Merge branch 'master' into for-next
Conflicts: include/linux/mmzone.h Synced with Linus' tree so that trivial patch can be applied on top of up-to-date code properly. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> |
||
|
|
46028e6d10 |
mm: cleanup on the comments of zone_reclaim_stat
Signed-off-by: Wanpeng Li <liwp.linux@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz> |
||
|
|
7f5e86c2cc |
mm: add link from struct lruvec to struct zone
This is the first stage of struct mem_cgroup_zone removal. Further patches replace struct mem_cgroup_zone with a pointer to struct lruvec. If CONFIG_CGROUP_MEM_RES_CTLR=n lruvec_zone() is just container_of(). Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
89abfab133 |
mm/memcg: move reclaim_stat into lruvec
With mem_cgroup_disabled() now explicit, it becomes clear that the zone_reclaim_stat structure actually belongs in lruvec, per-zone when memcg is disabled but per-memcg per-zone when it's enabled. We can delete mem_cgroup_get_reclaim_stat(), and change update_page_reclaim_stat() to update just the one set of stats, the one which get_scan_count() will actually use. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
f3fd4a6192 |
mm: remove lru type checks from __isolate_lru_page()
After patch "mm: forbid lumpy-reclaim in shrink_active_list()" we can completely remove anon/file and active/inactive lru type filters from __isolate_lru_page(), because isolation for 0-order reclaim always isolates pages from right lru list. And pages-isolation for lumpy shrink_inactive_list() or memory-compaction anyway allowed to isolate pages from all evictable lru lists. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
d484864dd9 |
Merge branch 'for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping
Pull CMA and ARM DMA-mapping updates from Marek Szyprowski: "These patches contain two major updates for DMA mapping subsystem (mainly for ARM architecture). First one is Contiguous Memory Allocator (CMA) which makes it possible for device drivers to allocate big contiguous chunks of memory after the system has booted. The main difference from the similar frameworks is the fact that CMA allows to transparently reuse the memory region reserved for the big chunk allocation as a system memory, so no memory is wasted when no big chunk is allocated. Once the alloc request is issued, the framework migrates system pages to create space for the required big chunk of physically contiguous memory. For more information one can refer to nice LWN articles: - 'A reworked contiguous memory allocator': http://lwn.net/Articles/447405/ - 'CMA and ARM': http://lwn.net/Articles/450286/ - 'A deep dive into CMA': http://lwn.net/Articles/486301/ - and the following thread with the patches and links to all previous versions: https://lkml.org/lkml/2012/4/3/204 The main client for this new framework is ARM DMA-mapping subsystem. The second part provides a complete redesign in ARM DMA-mapping subsystem. The core implementation has been changed to use common struct dma_map_ops based infrastructure with the recent updates for new dma attributes merged in v3.4-rc2. This allows to use more than one implementation of dma-mapping calls and change/select them on the struct device basis. The first client of this new infractructure is dmabounce implementation which has been completely cut out of the core, common code. The last patch of this redesign update introduces a new, experimental implementation of dma-mapping calls on top of generic IOMMU framework. This lets ARM sub-platform to transparently use IOMMU for DMA-mapping calls if one provides required IOMMU hardware. For more information please refer to the following thread: http://www.spinics.net/lists/arm-kernel/msg175729.html The last patch merges changes from both updates and provides a resolution for the conflicts which cannot be avoided when patches have been applied on the same files (mainly arch/arm/mm/dma-mapping.c)." Acked by Andrew Morton <akpm@linux-foundation.org>: "Yup, this one please. It's had much work, plenty of review and I think even Russell is happy with it." * 'for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: (28 commits) ARM: dma-mapping: use PMD size for section unmap cma: fix migration mode ARM: integrate CMA with DMA-mapping subsystem X86: integrate CMA with DMA-mapping subsystem drivers: add Contiguous Memory Allocator mm: trigger page reclaim in alloc_contig_range() to stabilise watermarks mm: extract reclaim code from __alloc_pages_direct_reclaim() mm: Serialize access to min_free_kbytes mm: page_isolation: MIGRATE_CMA isolation functions added mm: mmzone: MIGRATE_CMA migration type added mm: page_alloc: change fallbacks array handling mm: page_alloc: introduce alloc_contig_range() mm: compaction: export some of the functions mm: compaction: introduce isolate_freepages_range() mm: compaction: introduce map_pages() mm: compaction: introduce isolate_migratepages_range() mm: page_alloc: remove trailing whitespace ARM: dma-mapping: add support for IOMMU mapper ARM: dma-mapping: use alloc, mmap, free from dma_ops ARM: dma-mapping: remove redundant code and do the cleanup ... Conflicts: arch/x86/include/asm/dma-mapping.h |
||
|
|
49f223a9cd |
mm: trigger page reclaim in alloc_contig_range() to stabilise watermarks
alloc_contig_range() performs memory allocation so it also should keep track on keeping the correct level of memory watermarks. This commit adds a call to *_slowpath style reclaim to grab enough pages to make sure that the final collection of contiguous pages from freelists will not starve the system. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> CC: Michal Nazarewicz <mina86@mina86.com> Tested-by: Rob Clark <rob.clark@linaro.org> Tested-by: Ohad Ben-Cohen <ohad@wizery.com> Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org> Tested-by: Robert Nelson <robertcnelson@gmail.com> Tested-by: Barry Song <Baohua.Song@csr.com> |
||
|
|
47118af076 |
mm: mmzone: MIGRATE_CMA migration type added
The MIGRATE_CMA migration type has two main characteristics: (i) only movable pages can be allocated from MIGRATE_CMA pageblocks and (ii) page allocator will never change migration type of MIGRATE_CMA pageblocks. This guarantees (to some degree) that page in a MIGRATE_CMA page block can always be migrated somewhere else (unless there's no memory left in the system). It is designed to be used for allocating big chunks (eg. 10MiB) of physically contiguous memory. Once driver requests contiguous memory, pages from MIGRATE_CMA pageblocks may be migrated away to create a contiguous block. To minimise number of migrations, MIGRATE_CMA migration type is the last type tried when page allocator falls back to other migration types when requested. Signed-off-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Rob Clark <rob.clark@linaro.org> Tested-by: Ohad Ben-Cohen <ohad@wizery.com> Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org> Tested-by: Robert Nelson <robertcnelson@gmail.com> Tested-by: Barry Song <Baohua.Song@csr.com> |
||
|
|
35fca53e15 |
mmzone: fix comment typo coelesce -> coalesce
Signed-off-by: Wang YanQing <udknight@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> |
||
|
|
aff622495c |
vmscan: only defer compaction for failed order and higher
Currently a failed order-9 (transparent hugepage) compaction can lead to memory compaction being temporarily disabled for a memory zone. Even if we only need compaction for an order 2 allocation, eg. for jumbo frames networking. The fix is relatively straightforward: keep track of the highest order at which compaction is succeeding, and only defer compaction for orders at which compaction is failing. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
4111304dab |
mm: enum lru_list lru
Mostly we use "enum lru_list lru": change those few "l"s to "lru"s. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
c824493528 |
mm: compaction: make isolate_lru_page() filter-aware again
Commit
|
||
|
|
6290df5458 |
mm: collect LRU list heads into struct lruvec
Having a unified structure with a LRU list set for both global zones and per-memcg zones allows to keep that code simple which deals with LRU lists and does not care about the container itself. Once the per-memcg LRU lists directly link struct pages, the isolation function and all other list manipulations are shared between the memcg case and the global LRU case. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ab8fabd46f |
mm: exclude reserved pages from dirtyable memory
Per-zone dirty limits try to distribute page cache pages allocated for
writing across zones in proportion to the individual zone sizes, to reduce
the likelihood of reclaim having to write back individual pages from the
LRU lists in order to make progress.
This patch:
The amount of dirtyable pages should not include the full number of free
pages: there is a number of reserved pages that the page allocator and
kswapd always try to keep free.
The closer (reclaimable pages - dirty pages) is to the number of reserved
pages, the more likely it becomes for reclaim to run into dirty pages:
+----------+ ---
| anon | |
+----------+ |
| | |
| | -- dirty limit new -- flusher new
| file | | |
| | | |
| | -- dirty limit old -- flusher old
| | |
+----------+ --- reclaim
| reserved |
+----------+
| kernel |
+----------+
This patch introduces a per-zone dirty reserve that takes both the lowmem
reserve as well as the high watermark of the zone into account, and a
global sum of those per-zone values that is subtracted from the global
amount of dirtyable pages. The lowmem reserve is unavailable to page
cache allocations and kswapd tries to keep the high watermark free. We
don't want to end up in a situation where reclaim has to clean pages in
order to balance zones.
Not treating reserved pages as dirtyable on a global level is only a
conceptual fix. In reality, dirty pages are not distributed equally
across zones and reclaim runs into dirty pages on a regular basis.
But it is important to get this right before tackling the problem on a
per-zone level, where the distance between reclaim and the dirty pages is
mostly much smaller in absolute numbers.
[akpm@linux-foundation.org: fix highmem build]
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
0ee332c145 |
memblock: Kill early_node_map[]
Now all ARCH_POPULATES_NODE_MAP archs select HAVE_MEBLOCK_NODE_MAP - there's no user of early_node_map[] left. Kill early_node_map[] and replace ARCH_POPULATES_NODE_MAP with HAVE_MEMBLOCK_NODE_MAP. Also, relocate for_each_mem_pfn_range() and helper from mm.h to memblock.h as page_alloc.c would no longer host an alternative implementation. This change is ultimately one to one mapping and shouldn't cause any observable difference; however, after the recent changes, there are some functions which now would fit memblock.c better than page_alloc.c and dependency on HAVE_MEMBLOCK_NODE_MAP instead of HAVE_MEMBLOCK doesn't make much sense on some of them. Further cleanups for functions inside HAVE_MEMBLOCK_NODE_MAP in mm.h would be nice. -v2: Fix compile bug introduced by mis-spelling CONFIG_HAVE_MEMBLOCK_NODE_MAP to CONFIG_MEMBLOCK_HAVE_NODE_MAP in mmzone.h. Reported by Stephen Rothwell. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Chen Liqin <liqin.chen@sunplusct.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: "H. Peter Anvin" <hpa@zytor.com> |
||
|
|
49ea7eb65e |
mm: vmscan: immediately reclaim end-of-LRU dirty pages when writeback completes
When direct reclaim encounters a dirty page, it gets recycled around the LRU for another cycle. This patch marks the page PageReclaim similar to deactivate_page() so that the page gets reclaimed almost immediately after the page gets cleaned. This is to avoid reclaiming clean pages that are younger than a dirty page encountered at the end of the LRU that might have been something like a use-once page. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <jweiner@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ee72886d8e |
mm: vmscan: do not writeback filesystem pages in direct reclaim
Testing from the XFS folk revealed that there is still too much I/O from the end of the LRU in kswapd. Previously it was considered acceptable by VM people for a small number of pages to be written back from reclaim with testing generally showing about 0.3% of pages reclaimed were written back (higher if memory was low). That writing back a small number of pages is ok has been heavily disputed for quite some time and Dave Chinner explained it well; It doesn't have to be a very high number to be a problem. IO is orders of magnitude slower than the CPU time it takes to flush a page, so the cost of making a bad flush decision is very high. And single page writeback from the LRU is almost always a bad flush decision. To complicate matters, filesystems respond very differently to requests from reclaim according to Christoph Hellwig; xfs tries to write it back if the requester is kswapd ext4 ignores the request if it's a delayed allocation btrfs ignores the request As a result, each filesystem has different performance characteristics when under memory pressure and there are many pages being dirtied. In some cases, the request is ignored entirely so the VM cannot depend on the IO being dispatched. The objective of this series is to reduce writing of filesystem-backed pages from reclaim, play nicely with writeback that is already in progress and throttle reclaim appropriately when writeback pages are encountered. The assumption is that the flushers will always write pages faster than if reclaim issues the IO. A secondary goal is to avoid the problem whereby direct reclaim splices two potentially deep call stacks together. There is a potential new problem as reclaim has less control over how long before a page in a particularly zone or container is cleaned and direct reclaimers depend on kswapd or flusher threads to do the necessary work. However, as filesystems sometimes ignore direct reclaim requests already, it is not expected to be a serious issue. Patch 1 disables writeback of filesystem pages from direct reclaim entirely. Anonymous pages are still written. Patch 2 removes dead code in lumpy reclaim as it is no longer able to synchronously write pages. This hurts lumpy reclaim but there is an expectation that compaction is used for hugepage allocations these days and lumpy reclaim's days are numbered. Patches 3-4 add warnings to XFS and ext4 if called from direct reclaim. With patch 1, this "never happens" and is intended to catch regressions in this logic in the future. Patch 5 disables writeback of filesystem pages from kswapd unless the priority is raised to the point where kswapd is considered to be in trouble. Patch 6 throttles reclaimers if too many dirty pages are being encountered and the zones or backing devices are congested. Patch 7 invalidates dirty pages found at the end of the LRU so they are reclaimed quickly after being written back rather than waiting for a reclaimer to find them I consider this series to be orthogonal to the writeback work but it is worth noting that the writeback work affects the viability of patch 8 in particular. I tested this on ext4 and xfs using fs_mark, a simple writeback test based on dd and a micro benchmark that does a streaming write to a large mapping (exercises use-once LRU logic) followed by streaming writes to a mix of anonymous and file-backed mappings. The command line for fs_mark when botted with 512M looked something like ./fs_mark -d /tmp/fsmark-2676 -D 100 -N 150 -n 150 -L 25 -t 1 -S0 -s 10485760 The number of files was adjusted depending on the amount of available memory so that the files created was about 3xRAM. For multiple threads, the -d switch is specified multiple times. The test machine is x86-64 with an older generation of AMD processor with 4 cores. The underlying storage was 4 disks configured as RAID-0 as this was the best configuration of storage I had available. Swap is on a separate disk. Dirty ratio was tuned to 40% instead of the default of 20%. Testing was run with and without monitors to both verify that the patches were operating as expected and that any performance gain was real and not due to interference from monitors. Here is a summary of results based on testing XFS. 512M1P-xfs Files/s mean 32.69 ( 0.00%) 34.44 ( 5.08%) 512M1P-xfs Elapsed Time fsmark 51.41 48.29 512M1P-xfs Elapsed Time simple-wb 114.09 108.61 512M1P-xfs Elapsed Time mmap-strm 113.46 109.34 512M1P-xfs Kswapd efficiency fsmark 62% 63% 512M1P-xfs Kswapd efficiency simple-wb 56% 61% 512M1P-xfs Kswapd efficiency mmap-strm 44% 42% 512M-xfs Files/s mean 30.78 ( 0.00%) 35.94 (14.36%) 512M-xfs Elapsed Time fsmark 56.08 48.90 512M-xfs Elapsed Time simple-wb 112.22 98.13 512M-xfs Elapsed Time mmap-strm 219.15 196.67 512M-xfs Kswapd efficiency fsmark 54% 56% 512M-xfs Kswapd efficiency simple-wb 54% 55% 512M-xfs Kswapd efficiency mmap-strm 45% 44% 512M-4X-xfs Files/s mean 30.31 ( 0.00%) 33.33 ( 9.06%) 512M-4X-xfs Elapsed Time fsmark 63.26 55.88 512M-4X-xfs Elapsed Time simple-wb 100.90 90.25 512M-4X-xfs Elapsed Time mmap-strm 261.73 255.38 512M-4X-xfs Kswapd efficiency fsmark 49% 50% 512M-4X-xfs Kswapd efficiency simple-wb 54% 56% 512M-4X-xfs Kswapd efficiency mmap-strm 37% 36% 512M-16X-xfs Files/s mean 60.89 ( 0.00%) 65.22 ( 6.64%) 512M-16X-xfs Elapsed Time fsmark 67.47 58.25 512M-16X-xfs Elapsed Time simple-wb 103.22 90.89 512M-16X-xfs Elapsed Time mmap-strm 237.09 198.82 512M-16X-xfs Kswapd efficiency fsmark 45% 46% 512M-16X-xfs Kswapd efficiency simple-wb 53% 55% 512M-16X-xfs Kswapd efficiency mmap-strm 33% 33% Up until 512-4X, the FSmark improvements were statistically significant. For the 4X and 16X tests the results were within standard deviations but just barely. The time to completion for all tests is improved which is an important result. In general, kswapd efficiency is not affected by skipping dirty pages. 1024M1P-xfs Files/s mean 39.09 ( 0.00%) 41.15 ( 5.01%) 1024M1P-xfs Elapsed Time fsmark 84.14 80.41 1024M1P-xfs Elapsed Time simple-wb 210.77 184.78 1024M1P-xfs Elapsed Time mmap-strm 162.00 160.34 1024M1P-xfs Kswapd efficiency fsmark 69% 75% 1024M1P-xfs Kswapd efficiency simple-wb 71% 77% 1024M1P-xfs Kswapd efficiency mmap-strm 43% 44% 1024M-xfs Files/s mean 35.45 ( 0.00%) 37.00 ( 4.19%) 1024M-xfs Elapsed Time fsmark 94.59 91.00 1024M-xfs Elapsed Time simple-wb 229.84 195.08 1024M-xfs Elapsed Time mmap-strm 405.38 440.29 1024M-xfs Kswapd efficiency fsmark 79% 71% 1024M-xfs Kswapd efficiency simple-wb 74% 74% 1024M-xfs Kswapd efficiency mmap-strm 39% 42% 1024M-4X-xfs Files/s mean 32.63 ( 0.00%) 35.05 ( 6.90%) 1024M-4X-xfs Elapsed Time fsmark 103.33 97.74 1024M-4X-xfs Elapsed Time simple-wb 204.48 178.57 1024M-4X-xfs Elapsed Time mmap-strm 528.38 511.88 1024M-4X-xfs Kswapd efficiency fsmark 81% 70% 1024M-4X-xfs Kswapd efficiency simple-wb 73% 72% 1024M-4X-xfs Kswapd efficiency mmap-strm 39% 38% 1024M-16X-xfs Files/s mean 42.65 ( 0.00%) 42.97 ( 0.74%) 1024M-16X-xfs Elapsed Time fsmark 103.11 99.11 1024M-16X-xfs Elapsed Time simple-wb 200.83 178.24 1024M-16X-xfs Elapsed Time mmap-strm 397.35 459.82 1024M-16X-xfs Kswapd efficiency fsmark 84% 69% 1024M-16X-xfs Kswapd efficiency simple-wb 74% 73% 1024M-16X-xfs Kswapd efficiency mmap-strm 39% 40% All FSMark tests up to 16X had statistically significant improvements. For the most part, tests are completing faster with the exception of the streaming writes to a mixture of anonymous and file-backed mappings which were slower in two cases In the cases where the mmap-strm tests were slower, there was more swapping due to dirty pages being skipped. The number of additional pages swapped is almost identical to the fewer number of pages written from reclaim. In other words, roughly the same number of pages were reclaimed but swapping was slower. As the test is a bit unrealistic and stresses memory heavily, the small shift is acceptable. 4608M1P-xfs Files/s mean 29.75 ( 0.00%) 30.96 ( 3.91%) 4608M1P-xfs Elapsed Time fsmark 512.01 492.15 4608M1P-xfs Elapsed Time simple-wb 618.18 566.24 4608M1P-xfs Elapsed Time mmap-strm 488.05 465.07 4608M1P-xfs Kswapd efficiency fsmark 93% 86% 4608M1P-xfs Kswapd efficiency simple-wb 88% 84% 4608M1P-xfs Kswapd efficiency mmap-strm 46% 45% 4608M-xfs Files/s mean 27.60 ( 0.00%) 28.85 ( 4.33%) 4608M-xfs Elapsed Time fsmark 555.96 532.34 4608M-xfs Elapsed Time simple-wb 659.72 571.85 4608M-xfs Elapsed Time mmap-strm 1082.57 1146.38 4608M-xfs Kswapd efficiency fsmark 89% 91% 4608M-xfs Kswapd efficiency simple-wb 88% 82% 4608M-xfs Kswapd efficiency mmap-strm 48% 46% 4608M-4X-xfs Files/s mean 26.00 ( 0.00%) 27.47 ( 5.35%) 4608M-4X-xfs Elapsed Time fsmark 592.91 564.00 4608M-4X-xfs Elapsed Time simple-wb 616.65 575.07 4608M-4X-xfs Elapsed Time mmap-strm 1773.02 1631.53 4608M-4X-xfs Kswapd efficiency fsmark 90% 94% 4608M-4X-xfs Kswapd efficiency simple-wb 87% 82% 4608M-4X-xfs Kswapd efficiency mmap-strm 43% 43% 4608M-16X-xfs Files/s mean 26.07 ( 0.00%) 26.42 ( 1.32%) 4608M-16X-xfs Elapsed Time fsmark 602.69 585.78 4608M-16X-xfs Elapsed Time simple-wb 606.60 573.81 4608M-16X-xfs Elapsed Time mmap-strm 1549.75 1441.86 4608M-16X-xfs Kswapd efficiency fsmark 98% 98% 4608M-16X-xfs Kswapd efficiency simple-wb 88% 82% 4608M-16X-xfs Kswapd efficiency mmap-strm 44% 42% Unlike the other tests, the fsmark results are not statistically significant but the min and max times are both improved and for the most part, tests completed faster. There are other indications that this is an improvement as well. For example, in the vast majority of cases, there were fewer pages scanned by direct reclaim implying in many cases that stalls due to direct reclaim are reduced. KSwapd is scanning more due to skipping dirty pages which is unfortunate but the CPU usage is still acceptable In an earlier set of tests, I used blktrace and in almost all cases throughput throughout the entire test was higher. However, I ended up discarding those results as recording blktrace data was too heavy for my liking. On a laptop, I plugged in a USB stick and ran a similar tests of tests using it as backing storage. A desktop environment was running and for the entire duration of the tests, firefox and gnome terminal were launching and exiting to vaguely simulate a user. 1024M-xfs Files/s mean 0.41 ( 0.00%) 0.44 ( 6.82%) 1024M-xfs Elapsed Time fsmark 2053.52 1641.03 1024M-xfs Elapsed Time simple-wb 1229.53 768.05 1024M-xfs Elapsed Time mmap-strm 4126.44 4597.03 1024M-xfs Kswapd efficiency fsmark 84% 85% 1024M-xfs Kswapd efficiency simple-wb 92% 81% 1024M-xfs Kswapd efficiency mmap-strm 60% 51% 1024M-xfs Avg wait ms fsmark 5404.53 4473.87 1024M-xfs Avg wait ms simple-wb 2541.35 1453.54 1024M-xfs Avg wait ms mmap-strm 3400.25 3852.53 The mmap-strm results were hurt because firefox launching had a tendency to push the test out of memory. On the postive side, firefox launched marginally faster with the patches applied. Time to completion for many tests was faster but more importantly - the "Avg wait" time as measured by iostat was far lower implying the system would be more responsive. It was also the case that "Avg wait ms" on the root filesystem was lower. I tested it manually and while the system felt slightly more responsive while copying data to a USB stick, it was marginal enough that it could be my imagination. This patch: do not writeback filesystem pages in direct reclaim. When kswapd is failing to keep zones above the min watermark, a process will enter direct reclaim in the same manner kswapd does. If a dirty page is encountered during the scan, this page is written to backing storage using mapping->writepage. This causes two problems. First, it can result in very deep call stacks, particularly if the target storage or filesystem are complex. Some filesystems ignore write requests from direct reclaim as a result. The second is that a single-page flush is inefficient in terms of IO. While there is an expectation that the elevator will merge requests, this does not always happen. Quoting Christoph Hellwig; The elevator has a relatively small window it can operate on, and can never fix up a bad large scale writeback pattern. This patch prevents direct reclaim writing back filesystem pages by checking if current is kswapd. Anonymous pages are still written to swap as there is not the equivalent of a flusher thread for anonymous pages. If the dirty pages cannot be written back, they are placed back on the LRU lists. There is now a direct dependency on dirty page balancing to prevent too many pages in the system being dirtied which would prevent reclaim making forward progress. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
f80c067361 |
mm: zone_reclaim: make isolate_lru_page() filter-aware
In __zone_reclaim case, we don't want to shrink mapped page. Nonetheless, we have isolated mapped page and re-add it into LRU's head. It's unnecessary CPU overhead and makes LRU churning. Of course, when we isolate the page, the page might be mapped but when we try to migrate the page, the page would be not mapped. So it could be migrated. But race is rare and although it happens, it's no big deal. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
39deaf8585 |
mm: compaction: make isolate_lru_page() filter-aware
In async mode, compaction doesn't migrate dirty or writeback pages. So, it's meaningless to pick the page and re-add it to lru list. Of course, when we isolate the page in compaction, the page might be dirty or writeback but when we try to migrate the page, the page would be not dirty, writeback. So it could be migrated. But it's very unlikely as isolate and migration cycle is much faster than writeout. So, this patch helps cpu overhead and prevent unnecessary LRU churning. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
4356f21d09 |
mm: change isolate mode from #define to bitwise type
Change ISOLATE_XXX macro with bitwise isolate_mode_t type. Normally, macro isn't recommended as it's type-unsafe and making debugging harder as symbol cannot be passed throught to the debugger. Quote from Johannes " Hmm, it would probably be cleaner to fully convert the isolation mode into independent flags. INACTIVE, ACTIVE, BOTH is currently a tri-state among flags, which is a bit ugly." This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
60063497a9 |
atomic: use <linux/atomic.h>
This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
bb2a0de92c |
memcg: consolidate memory cgroup lru stat functions
In mm/memcontrol.c, there are many lru stat functions as..
mem_cgroup_zone_nr_lru_pages
mem_cgroup_node_nr_file_lru_pages
mem_cgroup_nr_file_lru_pages
mem_cgroup_node_nr_anon_lru_pages
mem_cgroup_nr_anon_lru_pages
mem_cgroup_node_nr_unevictable_lru_pages
mem_cgroup_nr_unevictable_lru_pages
mem_cgroup_node_nr_lru_pages
mem_cgroup_nr_lru_pages
mem_cgroup_get_local_zonestat
Some of them are under #ifdef MAX_NUMNODES >1 and others are not.
This seems bad. This patch consolidates all functions into
mem_cgroup_zone_nr_lru_pages()
mem_cgroup_node_nr_lru_pages()
mem_cgroup_nr_lru_pages()
For these functions, "which LRU?" information is passed by a mask.
example:
mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON))
And I added some macro as ALL_LRU, ALL_LRU_FILE, ALL_LRU_ANON.
example:
mem_cgroup_nr_lru_pages(mem, ALL_LRU)
BTW, considering layout of NUMA memory placement of counters, this patch seems
to be better.
Now, when we gather all LRU information, we scan in following orer
for_each_lru -> for_each_node -> for_each_zone.
This means we'll touch cache lines in different node in turn.
After patch, we'll scan
for_each_node -> for_each_zone -> for_each_lru(mask)
Then, we'll gather information in the same cacheline at once.
[akpm@linux-foundation.org: fix warnigns, build error]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
c6830c2260 |
Fix node_start/end_pfn() definition for mm/page_cgroup.c
commit
|