Since pte_flags() is much cheaper than pte_val() in some virtualized
environments (namely, Xen), use the former whereever possible.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: "Nick Piggin" <npiggin@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When nr_range gets decremented, the same slot must be considered for
coalescing with its new successor again.
The issue is apparently pretty benign to native code, but surfaces as a
boot time crash in our forward ported Xen tree (where the page table
setup overall works differently than in native).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The check prevents flags on mappings from being changed, which is not
desireable. There's no need to check for replacing a mapping, and
x86-32 does not do this check.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
early_ioremap() is also used to map normal memory when constructing
the linear memory mapping. However, since we sometimes need to be able
to distinguish between actual IO mappings and normal memory mappings,
add a early_memremap() call, which maps with PAGE_KERNEL (as opposed
to PAGE_KERNEL_IO for early_ioremap()), and use it when constructing
pagetables.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use one of the software-defined PTE bits to indicate that a mapping is
intended for an IO address. On native hardware this is irrelevent,
since a physical address is a physical address. But in a virtual
environment, physical addresses are also virtualized, so there needs
to be some way to distinguish between pseudo-physical addresses and
actual hardware addresses; _PAGE_IOMAP indicates this intent.
By default, __supported_pte_mask masks out _PAGE_IOMAP, so it doesn't
even appear in the final pagetable.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move the prototypes from the generic kernel.h header to the more
appropriate include/asm-x86/bios_ebda.h header file.
Also, remove the check from the power management code - this is a
pure x86 matter for now.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The x86 implementation of early_ioremap has an off by one error. If we get
an object which ends on the first byte of a page we undermap by one page and
this causes a crash on boot with the ASUS P5QL whose DMI table happens to fit
this alignment.
The size computation is currently
last_addr = phys_addr + size - 1;
npages = (PAGE_ALIGN(last_addr) - phys_addr)
(Consider a request for 1 byte at alignment 0...)
Closes#11693
Debugging work by Ian Campbell/Felix Geyer
Signed-off-by: Alan Cox <alan@rehat.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jeremy Fitzhardinge wrote:
> I'd noticed that current tip/master hasn't been booting under Xen, and I
> just got around to bisecting it down to this change.
>
> commit 065ae73c5462d42e9761afb76f2b52965ff45bd6
> Author: Suresh Siddha <suresh.b.siddha@intel.com>
>
> x86, cpa: make the kernel physical mapping initialization a two pass sequence
>
> This patch is causing Xen to fail various pagetable updates because it
> ends up remapping pagetables to RW, which Xen explicitly prohibits (as
> that would allow guests to make arbitrary changes to pagetables, rather
> than have them mediated by the hypervisor).
Instead of making init a two pass sequence, to satisfy the Intel's TLB
Application note (developer.intel.com/design/processor/applnots/317080.pdf
Section 6 page 26), we preserve the original page permissions
when fragmenting the large mappings and don't touch the existing memory
mapping (which satisfies Xen's requirements).
Only open issue is: on a native linux kernel, we will go back to mapping
the first 0-1GB kernel identity mapping as executable (because of the
static mapping setup in head_64.S). We can fix this in a different
patch if needed.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Track the memtype for RAM pages in page struct instead of using the
memtype list. This avoids the explosion in the number of entries in
memtype list (of the order of 20,000 with AGP) and makes the PAT
tracking simpler.
We are using PG_arch_1 bit in page->flags.
We still use the memtype list for non RAM pages.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Do a global flush tlb after splitting the large page and before we do the
actual change page attribute in the PTE.
With out this, we violate the TLB application note, which says
"The TLBs may contain both ordinary and large-page translations for
a 4-KByte range of linear addresses. This may occur if software
modifies the paging structures so that the page size used for the
address range changes. If the two translations differ with respect
to page frame or attributes (e.g., permissions), processor behavior
is undefined and may be implementation-specific."
And also serialize cpa() (for !DEBUG_PAGEALLOC which uses large identity
mappings) using cpa_lock. So that we don't allow any other cpu, with stale
large tlb entries change the page attribute in parallel to some other cpu
splitting a large page entry along with changing the attribute.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: arjan@linux.intel.com
Cc: venkatesh.pallipadi@intel.com
Cc: jeremy@goop.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Interrupt context no longer splits large page in cpa(). So we can do away
with cpa memory pool code.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: arjan@linux.intel.com
Cc: venkatesh.pallipadi@intel.com
Cc: jeremy@goop.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
No alias checking needed for setting present/not-present mapping. Otherwise,
we may need to break large pages for 64-bit kernel text mappings (this adds to
complexity if we want to do this from atomic context especially, for ex:
with CONFIG_DEBUG_PAGEALLOC). Let's keep it simple!
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: arjan@linux.intel.com
Cc: venkatesh.pallipadi@intel.com
Cc: jeremy@goop.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Don't use large pages for kernel identity mapping with DEBUG_PAGEALLOC.
This will remove the need to split the large page for the
allocated kernel page in the interrupt context.
This will simplify cpa code(as we don't do the split any more from the
interrupt context). cpa code simplication in the subsequent patches.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: arjan@linux.intel.com
Cc: venkatesh.pallipadi@intel.com
Cc: jeremy@goop.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In the first pass, kernel physical mapping will be setup using large or
small pages but uses the same PTE attributes as that of the early
PTE attributes setup by early boot code in head_[32|64].S
After flushing TLB's, we go through the second pass, which setups the
direct mapped PTE's with the appropriate attributes (like NX, GLOBAL etc)
which are runtime detectable.
This two pass mechanism conforms to the TLB app note which says:
"Software should not write to a paging-structure entry in a way that would
change, for any linear address, both the page size and either the page frame
or attributes."
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: arjan@linux.intel.com
Cc: venkatesh.pallipadi@intel.com
Cc: jeremy@goop.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Export set_memory_ro() and set_memory_rw() calls for use by drivers that need
to have more debug information about who might be writing to memory space.
this was initially developed for use while debugging a memory corruption
problem with e1000e.
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Go through the iomem resource tree to check if any of the ioremap()
requests span more than any slot in the iomem resource tree and do
a WARN_ON() if we hit this check.
This will raise a red-flag, if some driver is mapping more than what
is needed. And hopefully identify possible corruptions much earlier.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
They were already called once in arch/x86/kernel/setup.c - we don't need to call them again.
fixes:
http://bugzilla.kernel.org/show_bug.cgi?id=11485
Signed-off-by: Alex Nixon <alex.nixon@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Perodically check for corruption in low phusical memory. Don't bother
checking at fault time, since it won't show anything useful.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some BIOSes have been observed to corrupt memory in the low 64k. This
change:
- Reserves all memory which does not have to be in that area, to
prevent it from being used as general memory by the kernel. Things
like the SMP trampoline are still in the memory, however.
- Clears the reserved memory so we can observe changes to it.
- Adds a function check_for_bios_corruption() which checks and reports on
memory becoming unexpectedly non-zero. Currently it's called in the
x86 fault handler, and the powermanagement debug output.
Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since the fourth PDPT entry cannot be shared under Xen,
vmalloc_sync_all() must iterate over pmd-s rather than pgd-s here.
Luckily, the code isn't used for native PAE (SHARED_KERNEL_PMD is 1)
and the change is benign to non-PAE.
Also do a little more cleanup in that function.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
They were already called once in arch/x86/kernel/setup.c - we don't need to call them again.
Signed-off-by: Alex Nixon <alex.nixon@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The CPA test code uses _PAGE_UNUSED1, so make sure its obvious.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix the start addr for free_memtype calls in the error path.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Acked-by: Rene Herman <rene.herman@keyaccess.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
early_io{re,un}map() are __init and hence can't be called from __meminit
functions.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While I don't have a hotplug capable system at hand, I think two issues need
fixing:
- pud_phys (in kernel_physical_ampping_init()) would remain uninitialized in
the after_bootmem case
- the locking done just around phys_pmd_{init,update}() would leave out pgd
updates, and it was needlessly covering code portions that do allocations
(perhaps using a more friendly gfp value in alloc_low_page() would then be
possible)
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Actually, might as well simply reconstruct the memtype list at free time
I guess. How is this for a coalescing version of the array functions?
Compiles, boots and provides me with:
root@7ixe4:~# wc -l /debug/x86/pat_memtype_list
53 /debug/x86/pat_memtype_list
otherwise (down from 16384+).
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The new set_memory_array_{uc,wb}() pass virtual addresses to
{reserve,free}_memtype() it seems.
Signed-off-by: Rene Herman <rene.herman@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We changed the interface of some pageattr, this patch makes
pageattr-test compile.
Signed-off-by: Dave Airlie <airlied@gmail.com>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add array interface APIs of pageattr. page based cache flush is quite
slow for a lot of pages. If pages are more than 1024 (4M), the patch
will use a wbinvd(). We have a simple test here (run a 3d game - open
arena), nearly all agp memory allocation are small (< 1M), so suppose
this will not impact runtime performance.
Signed-off-by: Dave Airlie <airlied@gmail.com>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
All kernel mappings like ioremap(), etc uses UC_MINUS as the type. /dev/mem
mappings with /dev/mem being opened with O_SYNC however was using UC,
resulting in a conflict with /dev/mem mmap failing. This seems to be
affecting some apps (one being flashrom) which are using O_SYNC and which were
working before.
Switch /dev/mem with O_SYNC also to UC_MINUS.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Big thinko in pat memtype tracking code. reserve_memtype should be called
with physical address and not virtual address.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
WARNING: vmlinux.o(.text+0x180af): Section mismatch in reference from the function leave_uniprocessor() to the function .cpuinit.text:cpu_up()
The function leave_uniprocessor() references
the function __cpuinit cpu_up().
This is often because leave_uniprocessor lacks a __cpuinit
annotation or the annotation of cpu_up is wrong.
leave_uniprocessor calls cpu_up only when CONFIG_HOTPLUG_CPU is set,
so it can be safely annotated as __ref
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Pekka Paalanen <pq@iki.fi>
Booting kernel with vmalloc=[any size<=16m] will oops on my pc (i386/1G memory).
BUG_ON in arch/x86/mm/init_32.c triggered:
BUG_ON((unsigned long)high_memory > VMALLOC_START);
It's due to the vm area hole.
In include/asm-x86/pgtable_32.h:
#define VMALLOC_OFFSET (8 * 1024 * 1024)
#define VMALLOC_START (((unsigned long)high_memory + 2 * VMALLOC_OFFSET - 1) \
& ~(VMALLOC_OFFSET - 1))
There's several related point:
1. MAXMEM :
(-__PAGE_OFFSET - __VMALLOC_RESERVE).
The space after VMALLOC_END is included as well, I set it to
(VMALLOC_END - PAGE_OFFSET - __VMALLOC_RESERVE)
2. VMALLOC_OFFSET is not considered in __VMALLOC_RESERVE
fixed by adding VMALLOC_OFFSET to it.
3. VMALLOC_START :
(((unsigned long)high_memory + 2 * VMALLOC_OFFSET - 1) & ~(VMALLOC_OFFSET - 1))
So it's not always 8M, bigger than 8M possible.
I set it to ((unsigned long)high_memory + VMALLOC_OFFSET)
4. the VMALLOC_RESERVE is an unused macro, so remove it here.
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Cc: akpm@linux-foundation.org
Cc: hidave.darkstar@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use WARN() instead of a printk+WARN_ON() pair; this way the message
becomes part of the warning section for better reporting/collection.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: akpm@linux-foundation.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rene Herman reported significant Xorg startup/shutdown slowdown due
to PAT. It turns out that the memtype list has thousands of entries.
Add cached_entry to list add routine, in order to speed up the
lookup for sequential reserve_memtype calls.
Reported-by: Rene Herman <rene.herman@keyaccess.nl>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (32 commits)
x86: add MAP_STACK mmap flag
x86: fix section mismatch warning - spp_getpage()
x86: change init_gdt to update the gdt via write_gdt, rather than a direct write.
x86-64: fix overlap of modules and fixmap areas
x86, geode-mfgpt: check IRQ before using MFGPT as clocksource
x86, acpi: cleanup, temp_stack is used only when CONFIG_SMP is set
x86: fix spin_is_contended()
x86, nmi: clean UP NMI watchdog failure message
x86, NMI: fix watchdog failure message
x86: fix /proc/meminfo DirectMap
x86: fix readb() et al compile error with gcc-3.2.3
arch/x86/Kconfig: clean up, experimental adjustement
x86: invalidate caches before going into suspend
x86, perfctr: don't use CCCR_OVF_PMI1 on Pentium 4Ds
x86, AMD IOMMU: initialize dma_ops after sysfs registration
x86m AMD IOMMU: cleanup: replace LOW_U32 macro with generic lower_32_bits
x86, AMD IOMMU: initialize device table properly
x86, AMD IOMMU: use status bit instead of memory write-back for completion wait
x86: silence mmconfig printk
x86, msr: fix NULL pointer deref due to msr_open on nonexistent CPUs
...
WARNING: vmlinux.o(.text+0x17a3e): Section mismatch in reference from the function set_pte_vaddr_pud() to the function .init.text:spp_getpage()
The function set_pte_vaddr_pud() references
the function __init spp_getpage().
This is often because set_pte_vaddr_pud lacks a __init
annotation or the annotation of spp_getpage is wrong.
spp_getpage is called from __init (__init_extra_mapping) and
non __init (set_pte_vaddr_pud) functions, so it can't be __init.
Unfortunately it calls alloc_bootmem_pages which is __init,
but does it only when bootmem allocator is available (after_bootmem == 0).
So annotate it accordingly.
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Jean Delvare's machine triggered this BUG
acpi_os_map_memory phys ffff0000 size 65535
------------[ cut here ]------------
kernel BUG at arch/x86/mm/pat.c:233!
with ACPI in the backtrace.
Adding some debugging output showed that ACPI calls
acpi_os_map_memory phys ffff0000 size 65535
And ioremap/PAT does this check in 32bit, so addr+size wraps and the BUG
in reserve_memtype() triggers incorrectly.
BUG_ON(start >= end); /* end is exclusive */
But reserve_memtype already uses u64:
int reserve_memtype(u64 start, u64 end,
so the 32bit truncation must happen in the caller. Presumably in ioremap
when it passes this information to reserve_memtype().
This patch does this computation in 64bit.
http://bugzilla.kernel.org/show_bug.cgi?id=11346
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Highmem code can leave ptes and tlb entries around for a given page even after
kunmap, and after it has been freed.
>From what I can gather, the PAT code may change the cache attributes of
arbitrary physical addresses (ie. including highmem pages), which would result
in aliases in the case that it operates on one of these lazy tlb highmem
pages.
Flushing kmaps should solve the problem.
I've also just added code for conditional flushing if we haven't got
any dangling highmem aliases -- this should help performance if we
change page attributes frequently or systems that aren't using much
highmem pages (eg. if < 4G RAM). Should be turned into 2 patches, but
just for RFC...
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce two APIs for page attribute. flushing tlb/cache in every page
attribute is expensive. AGP gart usually will do a lot of operations to
change a page to uc, new APIs can reduce flush.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: airlied@linux.ie
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Do we actually want these DirectMap lines in the x86 /proc/meminfo?
I can see they're interesting to CPA developers and TLB optimizers,
but they don't fit its usual "where has all my memory gone?" usage.
If they are to stay, here are some fixes.
1. On x86_32 without PAE, they're not 2M but 4M pages: no need to
mess with the internal enum, but show the right name to users.
2. Many machines can never show anything but 0 for DirectMap1G,
so suppress that line unless direct_gbpages are really enabled.
3. The unit in /proc/meminfo is kB not number of pages: HugePages
messed that up, but they're an example to regret not to follow.
4. Once we use kB, it's easy to see that 1GB has gone missing (which
explains why CONFIG_CPA_DEBUG=y soon wraps DirectMap2M negative):
because head_64.S's level2_ident_pgt entries were not counted.
My fix is not ideal, but works for more and for less than 1G,
and avoids interfering with early bootup pagetable contortions.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
so we don't get warning on 32bit system with 64g RAM or more
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use WARN() instead of a printk+WARN_ON() pair; this way the message becomes
part of the warning section for better reporting/collection.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: akpm@linux-foundation.org
Cc: arjan@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
fix spinlock recursion in hvc_console
stop_machine: remove unused variable
modules: extend initcall_debug functionality to the module loader
export virtio_rng.h
lguest: use get_user_pages_fast() instead of get_user_pages()
mm: Make generic weak get_user_pages_fast and EXPORT_GPL it
lguest: don't set MAC address for guest unless specified
Out of line get_user_pages_fast fallback implementation, make it a weak
symbol, get rid of CONFIG_HAVE_GET_USER_PAGES_FAST.
Export the symbol to modules so lguest can use it.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Recently I've run a kernel w/o PAT support and wondered why there was
a file "x86/pat_memtype_list" in debugfs. Of course it's empty if PAT
is disabled ... so just get rid of this interface if PAT is disabled.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Simon Horman reported that gcc-3.4.x crashes when compiling
pgd_prepopulate_pmd() when PREALLOCATED_PMDS == 0 and CONFIG_DEBUG_INFO
is enabled.
Adding an extra check for PREALLOCATED_PMDS == 0 [which is compiled out
by gcc] seems to avoid the problem.
Reported-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Alexey Dobriyan reported trouble with LTP with the new fast-gup code,
and Johannes Weiner debugged it to non-page-aligned addresses, where the
new get_user_pages_fast() code would do all the wrong things, including
just traversing past the end of the requested area due to 'addr' never
matching 'end' exactly.
This is not a pretty fix, and we may actually want to move the alignment
into generic code, leaving just the core code per-arch, but Alexey
verified that the vmsplice01 LTP test doesn't crash with this.
Reported-and-tested-by: Alexey Dobriyan <adobriyan@gmail.com>
Debugged-by: Johannes Weiner <hannes@saeurebad.de>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove arch-specific show_mem() in favor of the generic version.
This also removes the following redundant information display:
- pages in swapcache, printed by show_swap_cache_info()
- dirty pages, writeback pages, mapped pages, slab pages,
pagetable pages, printed by show_free_areas()
where show_mem() calls show_free_areas(), which calls
show_swap_cache_info().
Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement get_user_pages_fast without locking in the fastpath on x86.
Do an optimistic lockless pagetable walk, without taking mmap_sem or any
page table locks or even mmap_sem. Page table existence is guaranteed by
turning interrupts off (combined with the fact that we're always looking
up the current mm, means we can do the lockless page table walk within the
constraints of the TLB shootdown design). Basically we can do this
lockless pagetable walk in a similar manner to the way the CPU's pagetable
walker does not have to take any locks to find present ptes.
This patch (combined with the subsequent ones to convert direct IO to use
it) was found to give about 10% performance improvement on a 2 socket 8
core Intel Xeon system running an OLTP workload on DB2 v9.5
"To test the effects of the patch, an OLTP workload was run on an IBM
x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel. Comparing
runs with and without the patch resulted in an overall performance
benefit of ~9.8%. Correspondingly, oprofiles showed that samples from
__up_read and __down_read routines that is seen during thread contention
for system resources was reduced from 2.8% down to .05%. Monitoring the
/proc/vmstat output from the patched run showed that the counter for
fast_gup contained a very high number while the fast_gup_slow value was
zero."
(fast_gup is the old name for get_user_pages_fast, fast_gup_slow is a
counter we had for the number of times the slowpath was invoked).
The main reason for the improvement is that DB2 has multiple threads each
issuing direct-IO. Direct-IO uses get_user_pages, and thus the threads
contend the mmap_sem cacheline, and can also contend on page table locks.
I would anticipate larger performance gains on larger systems, however I
think DB2 uses an adaptive mix of threads and processes, so it could be
that thread contention remains pretty constant as machine size increases.
In which case, we stuck with "only" a 10% gain.
The downside of using get_user_pages_fast is that if there is not a pte
with the correct permissions for the access, we end up falling back to
get_user_pages and so the get_user_pages_fast is a bit of extra work.
However this should not be the common case in most performance critical
code.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: Kconfig fix]
[akpm@linux-foundation.org: Makefile fix/cleanup]
[akpm@linux-foundation.org: warning fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add an hugepagesz=... option similar to IA64, PPC etc. to x86-64.
This finally allows to select GB pages for hugetlbfs in x86 now that all
the infrastructure is in place.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Straight forward extensions for huge pages located in the PUD instead of
PMDs.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The goal of this patchset is to support multiple hugetlb page sizes. This
is achieved by introducing a new struct hstate structure, which
encapsulates the important hugetlb state and constants (eg. huge page
size, number of huge pages currently allocated, etc).
The hstate structure is then passed around the code which requires these
fields, they will do the right thing regardless of the exact hstate they
are operating on.
This patch adds the hstate structure, with a single global instance of it
(default_hstate), and does the basic work of converting hugetlb to use the
hstate.
Future patches will add more hstate structures to allow for different
hugetlbfs mounts to have different page sizes.
[akpm@linux-foundation.org: coding-style fixes]
Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to be able to debug things like the X server and programs using
the PPC Cell SPUs, the debugger needs to be able to access device memory
through ptrace and /proc/pid/mem.
This patch:
Add the generic_access_phys access function and put the hooks in place
to allow access_process_vm to access device or PPC Cell SPU memory.
[riel@redhat.com: Add documentation for the vm_ops->access function]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Benjamin Herrensmidt <benh@kernel.crashing.org>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are a lot of places that define either a single bootmem descriptor or an
array of them. Use only one central array with MAX_NUMNODES items instead.
Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kyle McMartin <kyle@parisc-linux.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
declared do_page_fault() in asm-x86/trap.h for both X86_32 and X86_64
removed do_invalid_op declaration from mm/fault.c as it is already declared in asm-x86/trap.h
Signed-off-by: Jaswinder Singh <jaswinder@infradead.org>
included <asm/smp.h> in mm/init_32.c for zap_low_mappings()
declared free_initmem() in asm-x86/page_XX.h
Signed-off-by: Jaswinder Singh <jaswinder@infradead.org>
PTE_PFN_MASK was getting lonely, so I made it a friend.
Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rusty, in his peevish way, complained that macros defining constants
should have a name which somewhat accurately reflects the actual
purpose of the constant.
Aside from the fact that PTE_MASK gives no clue as to what's actually
being masked, and is misleadingly similar to the functionally entirely
different PMD_MASK, PUD_MASK and PGD_MASK, I don't really see what the
problem is.
But if this patch silences the incessent noise, then it will have
achieved its goal (TODO: write test-case).
Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There are a couple of places where (P)Dprintk is used which is an old
compile time enabled printk wrapper. Convert it to the generic
pr_debug().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
From: Arjan van de Ven <arjan@infradead.org>
Date: Sat, 19 Jul 2008 15:47:17 -0700
CONFIG_NONPROMISC_DEVMEM was a rather confusing name - but renaming it
to CONFIG_PROMISC_DEVMEM causes problems on architectures that do not
support this feature; this patch renames it to CONFIG_STRICT_DEVMEM,
so that architectures can opt-in into it.
( the polarity of the option is still the same as it was originally; it
needs to be for now to not break architectures that don't have the
infastructure yet to support this feature)
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: "V.Radhakrishnan" <rk@atr-labs.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
Add a debugfs interface to list out all the PAT memtype reservations.
Appears at debugfs x86/pat_memtype_list and output format is
type @ <start addr>-<end addr>
We do not hold the lock while printing the entire list. So, the list may not be
a consistent copy in case where regions are getting added or deleted
at the same time.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
it's separate functionality that deserves its own file.
This also prepares 32-bit memtest support.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Linus observed:
> The real bug is that we shouldn't have "double negatives", and
> certainly not negative config options. Making that "promiscuous
> /dev/mem" option a negated thing as a config option was bad.
right ... lets rename this option. There should never be a negation
in config options.
[ that reminds me of CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER, but that
is for another commit ;-) ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: fix asm/e820.h for userspace inclusion
x86: fix numaq_tsc_disable
x86: fix kernel_physical_mapping_init() for large x86 systems
Synchronized tables with current specifications.
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Fix bug in kernel_physical_mapping_init() that causes kernel
page table to be built incorrectly for systems with greater
than 512GB of memory.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
optimization: try to merge the range with same page size in
init_memory_mapping, to get the best possible linear mappings set up.
thus when GBpages is not there, we could do 2M pages.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
tighten the boundary checks around max_low_pfn_mapped - dont overmap
nor undermap into holes.
also print out tseg for AMD cpus, for diagnostic purposes.
(this is an SMM area, and we split up any big mappings around that area)
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
when more than 4g memory is installed, don't map the big hole below 4g.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
save the SLIT, in case we are using fixmap to read it, and that fixmap
could be cleared by others.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add ioremap_default(), which gives a sane mapping without worrying about
type conflicts.
Use it in /dev/mem read in place of ioremap(), as with ioremap(),
any mapping of the region (other than UC_MINUS) will cause a conflict
and failure of /dev/mem read.
Should address the vbetest failure reported at:
http://bugzilla.kernel.org/show_bug.cgi?id=11057
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix phys_pmd_init to make sure not to return bigger value than end.
also print out range split:1G/2M/4K in init_memory_mapping().
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
handle head and tail that are not aligned to big pages (2MB/1GB boundary).
with this patch, on system that support gbpages, change:
last_map_addr: 1080000000 end: 1078000000
to:
last_map_addr: 1078000000 end: 1078000000
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
prepare for overmapped patch
also printout last_map_addr together with end
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/mm/pgtable_32.c:144: warning: 'fixmaps' defined but not used
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
we already have the same srat handling interface for 32bit.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rather than using _PAGE_GLOBAL - which not all CPUs support - to test
CPA, use one of the reserved-for-software-use PTE flags instead. This
allows CPA testing to work on CPUs which don't support PGD.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Older x86-32 processors do not support global mappings (PGD), so must
only use it if the processor supports it.
The _PAGE_KERNEL* flags always have _PAGE_KERNEL set, since logically
we always want it set.
This is OK even on processors which do not support PGD, since all
_PAGE flags are masked with __supported_pte_mask before being turned
into a real in-pagetable pte. On 32-bit systems, __supported_pte_mask
is initialized to not contain _PAGE_GLOBAL, and it is then added if
the CPU is found to support it.
The x86-32 code used to use __PAGE_KERNEL/__PAGE_KERNEL_EXEC for this
purpose, but they're now redundant and can be removed.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When allocating a new pud, unconditionally populate the pgd (why did
we bother to create a new pud if we weren't going to populate it?).
This will only happen if the pgd slot was empty, since any existing
pud will be reused.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
move out e820_register_active_regions from non numa zones_sizes_init()
and remove numa version zones_sizes_init().
and let 32 bit call remove_all_active_ranges() in setup_arch() directly
like 64-bit
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kva ram already mapped right after away, so don't need to get that for low ram.
avoid wasting one copy of pgdat.
also add node id in early_res name in case we get it from find_e820_area.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
use PMD_SHIFT to calculate boundary also adjust size for pre-allocated
table size
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
some ram-end boundary only has page alignment, instead of 2M alignment.
v2: make init_memory_mapping more solid: start could be any value other than 0
v3: fix NON PAE by handling left over in kernel_physical_mapping
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar wrote:
> that fixed the build but now we've got a boot crash with this config:
>
> time.c: Detected 2010.304 MHz processor.
> spurious 8259A interrupt: IRQ7.
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<0000000000000000>]
> PGD 0
> Thread overran stack, or stack corrupted
> Oops: 0010 [1] SMP
> CPU 0
>
I don't know if this will fix this bug, but it's definitely a bugfix.
It was trashing random pages by overwriting them with pagetables...
Don't trash a large pmd's data when mapping physical memory.
This is a bugfix for "x86_64: adjust mapping of physical pagetables
to work with Xen".
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
do that in init_memory_mapping
also remove one init_ohci1394_dma_on_all_controllers
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The patch "x86: introduce init_memory_mapping for 32bit" does not allocate
enough space for PTEs if the CPU does not implement PSE.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We will need to set a pte on l3_user_pgt. Extract set_pte_vaddr_pud()
from set_pte_vaddr(), that will accept the l3 page table as parameter.
This change should be a no-op for existing code.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If PSE is not available, then fall back to 4k page mappings for the
vmemmap area.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This makes a few of changes to the construction of the initial
pagetables to work better with paravirt_ops/Xen. The main areas
are:
1. Support non-PSE mapping of memory, since Xen doesn't currently
allow 2M pages to be mapped in guests.
2. Make sure that the ioremap alias of all pages are dropped before
attaching the new page to the pagetable. This avoids having
writable aliases of pagetable pages.
3. Preserve existing pagetable entries, rather than overwriting. Its
possible that a fair amount of pagetable has already been constructed,
so reuse what's already in place rather than ignoring and overwriting it.
The algorithm relies on the invariant that any page which is part of
the kernel pagetable is itself mapped in the linear memory area. This
way, it can avoid using ioremap on a pagetable page.
The invariant holds because it maps memory from low to high addresses,
and also allocates memory from low to high. Each allocated page can
map at least 2M of address space, so the mapped area will always
progress much faster than the allocated area. It relies on the early
boot code mapping enough pages to get started.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jan Beulich points out that vmalloc_sync_all() assumes that the
kernel's pmd is always expected to be present in the pgd. The current
pgd construction code will add the pgd to the pgd_list before its pmds
have been pre-populated, thereby making it visible to
vmalloc_sync_all().
However, because pgd_prepopulate_pmd also does the allocation, it may
block and cannot be done under spinlock.
The solution is to preallocate the pmds out of the spinlock, then
populate them while holding the pgd_list lock.
This patch also pulls the pmd preallocation and mop-up functions out
to be common, assuming that the compiler will generate no code for
them when PREALLOCTED_PMDS is 0. Also, there's no need for pgd_ctor
to clear the pgd again, since it's allocated as a zeroed page.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add hooks which are called at pgd_alloc/free time. The pgd_alloc hook
may return an error code, which if non-zero, causes the pgd allocation
to be failed. The hooks may be used to allocate/free auxillary
per-pgd information.
also fix:
> * Ingo Molnar <mingo@elte.hu> wrote:
>
> include/asm/pgalloc.h: In function ‘paravirt_pgd_free':
> include/asm/pgalloc.h:14: error: parameter name omitted
> arch/x86/kernel/entry_64.S: In file included from
> arch/x86/kernel/traps_64.c:51:include/asm/pgalloc.h: In function ‘paravirt_pgd_free':
> include/asm/pgalloc.h:14: error: parameter name omitted
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
vmalloc_sync_all() is only called from register_die_notifier and
alloc_vm_area. Neither is on any performance-critical paths, so
vmalloc_sync_all() itself is not on any hot paths.
Given that the optimisations in vmalloc_sync_all add a fair amount of
code and complexity, and are fairly hard to evaluate for correctness,
it's better to just remove them to simplify the code rather than worry
about its absolute performance.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
that range is from find_e820_area, so don't try to use end_pfn to see
if out of boundary...use table_top instead to avoid possible strange
result while cross the boundary...
also change early_printk to printk, because init_memory_mapping is after
early param parsing, and console=uart8250 already working at that time.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
... so can we use mem below max_low_pfn earlier.
this allows us to move several functions more early instead of waiting
to after paging_init.
That includes moving relocate_initrd() earlier in the bootup, and kva
related early setup done in initmem_init. (in followup patches)
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The 32-bit early_ioremap will work equally well for 64-bit, so just use it.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use the _populate() functions to attach new pages to a pagetable, to
make sure the right paravirt_ops calls get called.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
those function depend on paging setup pgtable, so they could access
the ram in bootmem region but just get mapped.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
for 32bit
we already had early_res support, so don't need to track min_low_pfn.
keep it to 0 always.
also use init_bootmem_node instead of init_bootmem, so don't touch
min_low_pfn.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
so that max_low_pfn is not changed after it is set.
so we can move that early and out of initmem_init.
could call find_low_pfn_range just after max_pfn is set.
also could move reserve_initrd out of setup_bootmem_allocator
so 32bit is more like 64bit.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
want to remove arch_get_ram_range, and use early_node_map instead.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Making a variable page-aligned by using
__attribute__((section(".data.page_aligned"))) is fragile because if
sizeof(variable) is not also a multiple of page size, it leaves
variables in the remainder of the section unaligned.
This patch introduces two new qualifiers, __page_aligned_data and
__page_aligned_bss to set the section *and* the alignment of
variables. This makes page-aligned variables more robust because the
linker will make sure they're aligned properly. Unfortunately it
requires *all* page-aligned data to use these macros...
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds a 'flags' parameter to reserve_bootmem_generic() like it
already has been added in reserve_bootmem() with commit
72a7fe3967.
It also changes all users to use BOOTMEM_DEFAULT, which doesn't effectively
change the behaviour. Since the change is x86-specific, I don't think it's
necessary to add a new API for migration. There are only 4 users of that
function.
The change is necessary for the next patch, using reserve_bootmem_generic()
for crashkernel reservation.
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
sparse mutters:
arch/x86/mm/numa_64.c:195:27: warning: symbol 'end_pfn' shadows an earlier one
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
plat_node_bdata, cmdline, nodemap_addr, nodemap_size are local to
numa_64.c. Make them static
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Consolidate node_to_cpumask operations and remove the 256k
byte node_to_cpumask_map. This is done by allocating the
node_to_cpumask_map array after the number of possible nodes
(nr_node_ids) is known.
* Debug printouts when CONFIG_DEBUG_PER_CPU_MAPS is active have
been increased. It now shows faults when calling node_to_cpumask()
and node_to_cpumask_ptr().
For inclusion into sched-devel/latest tree.
Based on:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
+ sched-devel/latest .../mingo/linux-2.6-sched-devel.git
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is
used by some per_cpu variables that are initialized and accessed
before there are per_cpu areas allocated.
["Early" in respect to per_cpu variables is "earlier than the per_cpu
areas have been setup".]
This patchset adds these new macros:
DEFINE_EARLY_PER_CPU(_type, _name, _initvalue)
EXPORT_EARLY_PER_CPU_SYMBOL(_name)
DECLARE_EARLY_PER_CPU(_type, _name)
early_per_cpu_ptr(_name)
early_per_cpu_map(_name, _idx)
early_per_cpu(_name, _cpu)
The DEFINE macro defines the per_cpu variable as well as the early
map and pointer. It also initializes the per_cpu variable and map
elements to "_initvalue". The early_* macros provide access to
the initial map (usually setup during system init) and the early
pointer. This pointer is initialized to point to the early map
but is then NULL'ed when the actual per_cpu areas are setup. After
that the per_cpu variable is the correct access to the variable.
The early_per_cpu() macro is not very efficient but does show how to
access the variable if you have a function that can be called both
"early" and "late". It tests the early ptr to be NULL, and if not
then it's still valid. Otherwise, the per_cpu variable is used
instead:
#define early_per_cpu(_name, _cpu) \
(early_per_cpu_ptr(_name) ? \
early_per_cpu_ptr(_name)[_cpu] : \
per_cpu(_name, _cpu))
A better method is to actually check the pointer manually. In the
case below, numa_set_node can be called both "early" and "late":
void __cpuinit numa_set_node(int cpu, int node)
{
int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map);
if (cpu_to_node_map)
cpu_to_node_map[cpu] = node;
else
per_cpu(x86_cpu_to_node_map, cpu) = node;
}
* Add a flag "arch_provides_topology_pointers" that indicates pointers
to topology cpumask_t maps are available. Otherwise, use the function
returning the cpumask_t value. This is useful if cpumask_t set size
is very large to avoid copying data on to/off of the stack.
* The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while
the non-debug case has been optimized a bit.
* Remove an unreferenced compiler warning in drivers/base/topology.c
* Clean up #ifdef in setup.c
For inclusion into sched-devel/latest tree.
Based on:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
+ sched-devel/latest .../mingo/linux-2.6-sched-devel.git
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
so don't punish all other cpus without that problem when init highmem
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
use early_node_map to init high pages, so we can remove page_is_ram() and
page_is_reserved_early() in the big loop with add_one_highpage
also remove page_is_reserved_early(), it is not needed anymore.
v2: fix the build of other platforms
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
in case we have kva before ramdisk on a node, we still need to use
those ranges.
v2: reserve_early kva ram area, in case there are holes in highmem, to avoid
those area could be treat as free high pages.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
1. add reserve_bootmem_generic for 32bit
2. change len to unsigned long
3. make early_res_to_bootmem to use it
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds a 'flags' parameter to reserve_bootmem_generic() like it
already has been added in reserve_bootmem() with commit
72a7fe3967.
It also changes all users to use BOOTMEM_DEFAULT, which doesn't effectively
change the behaviour. Since the change is x86-specific, I don't think it's
necessary to add a new API for migration. There are only 4 users of that
function.
The change is necessary for the next patch, using reserve_bootmem_generic()
for crashkernel reservation.
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use PAGE_OFFSET macro instead of using 0xffff810000000000UL directly.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: hpa@zytor.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
1) Remove __meminit from update_pages_count. It is used inside
split_pages()
2) Make the code depend on PROC_FS. Doing statistics for nothing is
useless and not adding useless code is nice to the Linux tiny folks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add information about the mapping state of the direct mapping to
/proc/meminfo. I chose /proc/meminfo because that is where all the other
memory statistics are too and it is a generally useful metric even
outside debugging situations. A lot of split kernel pages means the
kernel will run slower.
This way we can see how many large pages are really used for it and how
many are split.
Useful for general insight into the kernel.
v2: Add hotplug locking to 64bit to plug a very obscure theoretical race.
32bit doesn't need it because it doesn't support hotadd for lowmem.
Fix some typos
v3: Rename dpages_cnt
Add CONFIG ifdef for count update as requested by tglx
Expand description
v4: Fix stupid bugs added in v3
Move update_page_count to pageattr.c
Signed-off-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the #ifdef conditional because this comparison is already done in
user_mode_vm().
Signed-off-by: Gustavo F. Padovan <gustavo@las.ic.unicamp.br>
Cc: akpm@osdl.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix this warning:
arch/x86/mm/init_64.c: In function 'early_memtest':
arch/x86/mm/init_64.c:524: warning: passing argument 2 of 'find_e820_area_size' from incompatible pointer type
Signed-off-by: Ingo Molnar <mingo@elte.hu>
'man 3 printf' tells me that %p should be printed as if by %#x, but
this is not true for the kernel, which does not use the '0x' prefix
for the %p conversion specifier.
A small cast to (void *) is also prettier than #ifdef/#else/#endif.
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
WARNING: arch/x86/mm/built-in.o(.text+0x3a1): Section mismatch in
reference from the function set_pte_phys() to the function
.init.text:spp_getpage()
The function set_pte_phys() references
the function __init spp_getpage().
This is often because set_pte_phys lacks a __init
annotation or the annotation of spp_getpage is wrong.
arch/x86/mm/init_64.c: In function 'early_memtest':
arch/x86/mm/init_64.c:520: warning: passing argument 2 of
'find_e820_area_size' from incompatible pointer type
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's not even passed on to smp_call_function() anymore, since that
was removed. So kill it.
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
... to strip down loop body in reserve_memtype.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Suresh B Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
... and move last debug message out of locked section.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Suresh B Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- add BUG statement to catch invalid start and end parameters
- No need to track the actual type in both req_type and actual_type --
keep req_type unchanged.
- removed (IMHO) superfluous comments
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Suresh B Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- parse/ml => entry (within list_for_each and friends)
- new_entry => new
- ret_type => new_type (to avoid confusion with req_type)
(... to make it more readable...)
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Suresh B Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix build failure:
arch/x86/mm/pgtable.c:280: warning: ‘enum fixed_addresses’ declared inside parameter list
arch/x86/mm/pgtable.c:280: warning: its scope is only this definition or declaration, which is probably not what you want
arch/x86/mm/pgtable.c:280: error: parameter 1 (‘idx’) has incomplete type
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In both cases, I went with the 32-bit behaviour.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I've removed the test from phys_to_nid and made a function from __phys_addr
only when the debugging is enabled (on x86_32).
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: tglx@linutronix.de
Cc: hpa@zytor.com
Cc: Mike Travis <travis@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: <x86@kernel.org>
Cc: linux-mm@kvack.org
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add some (configurable) expensive sanity checking to catch wrong address
translations on x86.
- create linux/mmdebug.h file to be able include this file in
asm headers to not get unsolvable loops in header files
- __phys_addr on x86_32 became a function in ioremap.c since
PAGE_OFFSET, is_vmalloc_addr and VMALLOC_* non-constasts are undefined
if declared in page_32.h
- add __phys_addr_const for initializing doublefault_tss.__cr3
Tested on 386, 386pae, x86_64 and x86_64 numa=fake=2.
Contains Andi's enable numa virtual address debug patch.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Clean up over-complications in pat_x_mtrr_type().
And if reserve_memtype() ignores stray req_type bits when
pat_enabled, it's better to mask them off when not also.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Changed the call to find_e820_area_size to pass u64 instead of unsigned long.
Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix an incompatible pointer type warning on x86_64 compilations.
early_memtest() is passing a u64* to find_e820_area_size() which is expecting
an unsigned long. Change t_start and t_size to unsigned long as those are
also 64-bit types on x88_64.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Page faults in kernel address space between PAGE_OFFSET up to
VMALLOC_START should not try to map as vmalloc.
Fix rarely endless page faults inside mount_block_root for root
filesystem at boot time.
All 32bit kernels up to 2.6.25 can fail into this hole.
I can not present this under native linux kernel. I see, that the 64bit
has fixed the problem. I copied the same lines into 32bit part.
Recorded debugs are from coLinux kernel 2.6.22.18 (virtualisation):
http://www.henrynestler.com/colinux/testing/pfn-check-0.7.3/20080410-antinx/bug16-recursive-page-fault-endless.txt
The physicaly memory was trimmed down to 192MB to better catch the bug.
More memory gets the bug more rarely.
Details, how every x86 32bit system can fail:
Start from "mount_block_root",
http://lxr.linux.no/linux/init/do_mounts.c#L297
There the variable "fs_names" got one memory page with 4096 bytes.
Variable "p" walks through the existing file system types. The first
string is no problem.
But, with the second loop in mount_block_root the offset of "p" is not
at beginning of page, the offset is for example +9, if "reiserfs" is the
first in list.
Than calls do_mount_root, and lands in sys_mount.
Remember: Variable "type_page" contains now "fs_type+9" and not contains
a full page.
The sys_mount copies 4096 bytes with function "exact_copy_from_user()":
http://lxr.linux.no/linux/fs/namespace.c#L1540
Mostly exist pages after the buffer "fs_names+4096+9" and the page fault
handler was not called. No problem.
In the case, if the page after "fs_names+4096" is not mapped, the page
fault handler was called from http://lxr.linux.no/linux/fs/namespace.c#L1320
The do_page_fault gots an address 0xc03b4000.
It's kernel address, address >= TASK_SIZE, but not from vmalloc! It's
from "__getname()" alias "kmem_cache_alloc".
The "error_code" is 0. "vmalloc_fault" will be call:
http://lxr.linux.no/linux/arch/i386/mm/fault.c#L332
"vmalloc_fault" tryed to find the physical page for a non existing
virtual memory area. The macro "pte_present" in vmalloc_fault()
got a next page fault for 0xc0000ed0 at:
http://lxr.linux.no/linux/arch/i386/mm/fault.c#L282
No PTE exist for such virtual address. The page fault handler was trying
to sync the physical page for the PTE lockup.
This called vmalloc_fault() again for address 0xc000000, and that also
was not existing. The endless began...
In normal case the cpu would still loop with disabled interrrupts. Under
coLinux this was catched by a stack overflow inside printk debugs.
Signed-off-by: Henry Nestler <henry.nestler@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
BTW, what does pat_wc_enabled stand for? Does it mean
"write-combining"?
Currently it is used to globally switch on or off PAT support.
Thus I renamed it to pat_enabled.
I think this increases readability (and hope that I didn't miss
something).
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Starting with commit 8d4a430085 (x86:
cleanup PAT cpu validation) the PAT CPU feature flag is not cleared
anymore. Now the error message
"PAT enabled, but CPU feature cleared"
in pat_init() is misleading.
Furthermore the current code does not check for existence of the PAT
CPU feature flag if a CPU is whitelisted in validate_pat_support.
This patch clears pat_wc_enabled if boot CPU has no PAT feature flag
and adapts the paranoia check.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Clarify the usage of mtrr_lookup() in PAT code, and to make PAT code
resilient to mtrr lookup problems.
Specifically, pat_x_mtrr_type() is restructured to highlight, under what
conditions we look for mtrr hint. pat_x_mtrr_type() uses a default type
when there are any errors in mtrr lookup (still maintaining the pat
consistency). And, reserve_memtype() highlights its usage ot mtrr_lookup
for request type of '-1' and also defaults in a sane way on any mtrr
lookup failure.
pat.c looks at mtrr type of a range to get a hint on what mapping type
to request when user/API: (1) hasn't specified any type (/dev/mem
mapping) and we do not want to take performance hit by always mapping
UC_MINUS. This will be the case for /dev/mem mappings used to map BIOS
area or ACPI region which are WB'able. In this case, as long as MTRR is
not WB, PAT will request UC_MINUS for such mappings.
(2) user/API requests WB mapping while in reality MTRR may have UC or
WC. In this case, PAT can map as WB (without checking MTRR) and still
effective type will be UC or WC. But, a subsequent request to map same
region as UC or WC may fail, as the region will get trackked as WB in
PAT list. Looking at MTRR hint helps us to track based on effective type
rather than what user requested. Again, here mtrr_lookup is only used as
hint and we fallback to WB mapping (as requested by user) as default.
In both cases, after using the mtrr hint, we still go through the
memtype list to make sure there are no inconsistencies among multiple
users.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Rufus & Azrael <rufus-azrael@numericable.fr>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is a SLIT sanity checking patch. It moves slit_valid() function to
generic ACPI code and does sanity checking for both x86 and ia64. It sets up
node_distance with LOCAL_DISTANCE and REMOTE_DISTANCE when hitting invalid
SLIT table on ia64. It also cleans up unused variable localities in
acpi_parse_slit() on x86.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
don't assume we can use RAM near the end of every node.
Esp systems that have few memory and they could have
kva address and kva RAM all below max_low_pfn.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now we are using register_e820_active_regions() instead of
add_active_range() directly. So end_pfn could be different between the
value in early_node_map to node_end_pfn.
So we need to make shrink_active_range() smarter.
shrink_active_range() is a generic MM function in mm/page_alloc.c but
it is only used on 32-bit x86. Should we move it back to some file in
arch/x86?
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch makes early reserved highmem pages become reserved
pages. This can be used for highmem pages allocated by bootloader such
as EFI memory map, linked list of setup_data, etc.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: andi@firstfloor.org
Cc: mingo@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix this:
WARNING: vmlinux.o(.text+0x114bb): Section mismatch in reference from
the function nopat() to the function .cpuinit.text:pat_disable()
The function nopat() references
the function __cpuinit pat_disable().
This is often because nopat lacks a __cpuinit
annotation or the annotation of pat_disable is wrong.
Reported-by: "Fabio Comolli" <fabio.comolli@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Clarify the usage of mtrr_lookup() in PAT code, and to make PAT code
resilient to mtrr lookup problems.
Specifically, pat_x_mtrr_type() is restructured to highlight, under what
conditions we look for mtrr hint. pat_x_mtrr_type() uses a default type
when there are any errors in mtrr lookup (still maintaining the pat
consistency). And, reserve_memtype() highlights its usage ot mtrr_lookup
for request type of '-1' and also defaults in a sane way on any mtrr
lookup failure.
pat.c looks at mtrr type of a range to get a hint on what mapping type
to request when user/API: (1) hasn't specified any type (/dev/mem
mapping) and we do not want to take performance hit by always mapping
UC_MINUS. This will be the case for /dev/mem mappings used to map BIOS
area or ACPI region which are WB'able. In this case, as long as MTRR is
not WB, PAT will request UC_MINUS for such mappings.
(2) user/API requests WB mapping while in reality MTRR may have UC or
WC. In this case, PAT can map as WB (without checking MTRR) and still
effective type will be UC or WC. But, a subsequent request to map same
region as UC or WC may fail, as the region will get trackked as WB in
PAT list. Looking at MTRR hint helps us to track based on effective type
rather than what user requested. Again, here mtrr_lookup is only used as
hint and we fallback to WB mapping (as requested by user) as default.
In both cases, after using the mtrr hint, we still go through the
memtype list to make sure there are no inconsistencies among multiple
users.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Rufus & Azrael <rufus-azrael@numericable.fr>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Changed the call to find_e820_area_size to pass u64 instead of unsigned long.
Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
OGAWA Hirofumi and Fede have reported rare pmd_ERROR messages:
mm/memory.c:127: bad pmd ffff810000207xxx(9090909090909090).
Initialization's cleanup_highmap was leaving alignment filler
behind in the pmd for MODULES_VADDR: when vmalloc's guard page
would occupy a new page table, it's not allocated, and then
module unload's vfree hits the bad 9090 pmd entry left over.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mika Kukkonen noticed that the nesting check in early_iounmap() is not
actually done.
Reported-by: Mika Kukkonen <mikukkon@srv1-m700-lanp.koti>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: torvalds@linux-foundation.org
Cc: arjan@linux.intel.com
Cc: mikukkon@iki.fi
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
this way 32-bit is more similar to 64-bit, and smarter e820 and numa.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
when 1/3 user/kernel split is used, and less memory is installed, or if
we have a big hole below 4g, max_low_pfn is still using 3g-128m
try to go down from max_low_pfn until we get it. otherwise will panic.
need to make 32-bit code to use register_e820_active_regions ... later.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
on 32-bit in head_32.S after initial page table is done, we get initial
max_pfn_mapped, and then kernel_physical_mapping_init will give us
a final one.
We need to use that to make sure find_e820_area will get valid addresses
for boot_map and for NODE_DATA(0) on numa32.
XEN PV and lguest may need to assign max_pfn_mapped too.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
we don't need to call memory_present that early.
numa and sparse will call memory_present later and might
even fail, it will call memory_present for the full range.
also for sparse it will call alloc_bootmem ... before we set up bootmem.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
... otherwise alloc_remap will not get node_mem_map from kva area, and
alloc_node_mem_map has to alloc_bootmem_node to get mem_map.
It will use two low address copies ...
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
so every element will represent 64M instead of 256M.
AMD opteron could have HW memory hole remapping, so could have
[0, 8g + 64M) on node0. Reduce element size to 64M to keep that on node 0
Later we need to use find_e820_area() to allocate memory_node_map like
on 64-bit. But need to move memory_present out of populate_mem_map...
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
on Summit it's possible to have:
CONFIG_ACPI_SRAT=y
CONFIG_HAVE_ARCH_PARSE_SRAT=y
in which case acpi.h defines the acpi_numa_slit_init() and
acpi_numa_processor_affinity_init() methods as a macro.
reserve early numa kva, so it will not clash with new RAMDISK
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
(Updated with a common max-stack-used checker that knows about
the canary, as suggested by Joe Perches)
Use a canary at the end of the stack to clearly indicate
at oops time whether the stack has ever overflowed.
This is a very simple implementation with a couple of
drawbacks:
1) a thread may legitimately use exactly up to the last
word on the stack
-- but the chances of doing this and then oopsing later seem slim
2) it's possible that the stack usage isn't dense enough
that the canary location could get skipped over
-- but the worst that happens is that we don't flag the overrun
-- though this happens fairly often in my testing :(
With the code in place, an intentionally-bloated stack oops might
do:
BUG: unable to handle kernel paging request at ffff8103f84cc680
IP: [<ffffffff810253df>] update_curr+0x9a/0xa8
PGD 8063 PUD 0
Thread overran stack or stack corrupted
Oops: 0000 [1] SMP
CPU 0
...
... unless the stack overrun is so bad that it corrupts some other
thread.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Remove three extern declarations for routines
that don't exist. Fix a typo in a comment.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
memory_add_physaddr_to_nid() is only used in the
CONFIG_MEMORY_HOTPLUG_SPARSE || CONFIG_ACPI_HOTPLUG_MEMORY case.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
free_initrd_mem needs a prototype, which is in linux/initrd.h
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
sparse mutters:
arch/x86/mm/k8topology_64.c:108:7: warning: symbol 'nodeid' shadows an earlier one
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
k8_scan_nodes is global and needs a prototype. Add the header file
which contains it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/mm/pat.c: In function 'phys_mem_access_prot_allowed':
arch/x86/mm/pat.c:526: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type
arch/x86/mm/pat.c:526: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type
arch/x86/mm/pat.c:527: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type
arch/x86/mm/pat.c:527: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type
arch/x86/mm/pat.c:528: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type
arch/x86/mm/pat.c:528: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type
arch/x86/mm/pat.c:529: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type
arch/x86/mm/pat.c:529: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type
Don't open-code test_bit() on a __u32
Cc: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
mmiotrace_ioremap() expects to receive the original unaligned map phys address
and size. Also fix {un,}register_kmmio_probe() to deal properly with
unaligned size.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From 8979ee55cb6a429c4edd72ebec2244b849f6a79a Mon Sep 17 00:00:00 2001
From: Pekka Paalanen <pq@iki.fi>
Date: Sat, 12 Apr 2008 00:18:57 +0300
Mmiotrace is not reliable with multiple CPUs and may
miss events. Drop to single CPU when mmiotrace is activated.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix gcc printk format warnings:
next-20080415/arch/x86/mm/mmio-mod.c: In function 'print_pte':
next-20080415/arch/x86/mm/mmio-mod.c:154: warning: format '%lx' expects type 'long unsigned int', but argument 3 has type 'pteval_t'
next-20080415/arch/x86/mm/mmio-mod.c:154: warning: format '%lx' expects type 'long unsigned int', but argument 4 has type 'pteval_t'
next-20080415/arch/x86/mm/mmio-mod.c: At top level:
next-20080415/arch/x86/mm/mmio-mod.c:403: warning: 'downed_cpus' defined but not used
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This had become a no-op.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Kconfig.debug, Makefile and testmmiotrace.c style fixes.
Use real mutex instead of mutex.
Fix failure path in register probe func.
kmmio: RCU read-locked over single stepping.
Generate mapping id's.
Make mmio-mod.c built-in and rewrite its locking.
Add debugfs file to enable/disable mmiotracing.
kmmio: use irqsave spinlocks.
Lots of cleanups in mmio-mod.c
Marker file moved from /proc into debugfs.
Call mmiotrace entrypoints directly from ioremap.c.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
kmmio.c handles the list of mmio probes with callbacks, list of traced
pages, and attaching into the page fault handler and die notifier. It
arms, traps and disarms the given pages, this is the core of mmiotrace.
mmio-mod.c is a user interface, hooking into ioremap functions and
registering the mmio probes. It also decodes the required information
from trapped mmio accesses via the pre and post callbacks in each probe.
Currently, hooking into ioremap functions works by redefining the symbols
of the target (binary) kernel module, so that it calls the traced
versions of the functions.
The most notable changes done since the last discussion are:
- kmmio.c is a built-in, not part of the module
- direct call from fault.c to kmmio.c, removing all dynamic hooks
- prepare for unregistering probes at any time
- make kmmio re-initializable and accessible to more than one user
- rewrite kmmio locking to remove all spinlocks from page fault path
Can I abuse call_rcu() like I do in kmmio.c:unregister_kmmio_probe()
or is there a better way?
The function called via call_rcu() itself calls call_rcu() again,
will this work or break? There I need a second grace period for RCU
after the first grace period for page faults.
Mmiotrace itself (mmio-mod.c) is still a module, I am going to attack
that next. At some point I will start looking into how to make mmiotrace
a tracer component of ftrace (thanks for the hint, Ingo). Ftrace should
make the user space part of mmiotracing as simple as
'cat /debug/trace/mmio > dump.txt'.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The custom page fault handler list is replaced with a single function
pointer. All related functions and variables are renamed for
mmiotrace.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: pq@iki.fi
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use lookup_address() from pageattr.c instead of doing the same
manually. Also had to EXPORT_SYMBOL_GPL(lookup_address) to make this
work for modules. This also fixes "undefined symbol 'init_mm'"
compile error for x86_32.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Without CONFIG_DYNAMIC_FTRACE, mark_rodata_ro() would mark a wrong
number of pages as no-execute. The bug was introduced in the patch
"ftrace: dont write protect kernel text". The symptom was machine reboot
after a CPU hotplug.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Provides kernel modules a way to register custom page fault handlers.
On every page fault this will call a list of registered functions. The
functions may handle the fault and force do_page_fault() to return
immediately.
This functionality is similar to the now removed page fault notifiers.
Custom page fault handlers are used by debugging and reverse engineering
tools. Mmiotrace is one such tool and a patch to add it into the tree
will follow.
The custom page fault handlers are called earlier in do_page_fault()
than the page fault notifiers were.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Dynamic ftrace cant work when the kernel has its text write protected.
This patch keeps the kernel from being write protected when
dynamic ftrace is in place.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When allocating the pgdat's for numa nodes on x86_32 we attempt to place
them in the numa remap space for that node. However should the node not
have any remap space allocated (such as due to having non-ram pages in
the remap location in the node) then we will incorrectly place the pgdat
at zero. Check we have remap available, falling back to node 0 memory
where we do not.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Recent kernels have been panic'ing trying to allocate memory early in boot,
in __alloc_pages:
BUG: unable to handle kernel paging request at 00001568
IP: [<c10407b6>] __alloc_pages+0x33/0x2cc
*pdpt = 00000000013a5001 *pde = 0000000000000000
Oops: 0000 [#1] SMP
Modules linked in:
Pid: 1, comm: swapper Not tainted (2.6.25 #78)
EIP: 0060:[<c10407b6>] EFLAGS: 00010246 CPU: 0
EIP is at __alloc_pages+0x33/0x2cc
EAX: 00001564 EBX: 000412d0 ECX: 00001564 EDX: 000005c3
ESI: f78012a0 EDI: 00000001 EBP: 00001564 ESP: f7871e50
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process swapper (pid: 1, ti=f7870000 task=f786f670 task.ti=f7870000)
Stack: 00000000 f786f670 00000010 00000000 0000b700 000412d0 f78012a0 00000001
00000000 c105b64d 00000000 000412d0 f78012a0 f7803120 00000000 c105c1c5
00000010 f7803144 000412d0 00000001 f7803130 f7803120 f78012a0 00000001
Call Trace:
[<c105b64d>] kmem_getpages+0x94/0x129
[<c105c1c5>] cache_grow+0x8f/0x123
[<c105c689>] ____cache_alloc_node+0xb9/0xe4
[<c105c999>] kmem_cache_alloc_node+0x92/0xd2
[<c1018929>] build_sched_domains+0x536/0x70d
[<c100b63c>] do_flush_tlb_all+0x0/0x3f
[<c100b63c>] do_flush_tlb_all+0x0/0x3f
[<c10572d6>] interleave_nodes+0x23/0x5a
[<c105c44f>] alternate_node_alloc+0x43/0x5b
[<c1018b47>] arch_init_sched_domains+0x46/0x51
[<c136e85e>] kernel_init+0x0/0x82
[<c137ac19>] sched_init_smp+0x10/0xbb
[<c136e8a1>] kernel_init+0x43/0x82
[<c10035cf>] kernel_thread_helper+0x7/0x10
Debugging this showed that the NODE_DATA() for nodes other than node 0
were all NULL. Tracing this back showed that the NODE_DATA() pointers
were being initialised to each nodes remap space. However under
SPARSEMEM remap is disabled which leads to the pgdat's being placed
incorrectly at kernel virtual address 0. Leading to the panic when
attempting to allocate memory from these nodes.
Numa remap was disabled in the commit below. This occured while fixing
problems triggered when attempting to boot x86_32 NUMA SPARSEMEM kernels
on non-numa hardware.
x86: make NUMA work on 32-bit
commit 1b000a5dbe
The real problem is believed to be related to other alignment issues in
the regions blocked out from the bootmem allocator for small memory
systems, and has been fixed separately. Therefore re-enable remap for
SPARSMEM, which fixes pgdat allocation issues. Testing confirms that
SPARSMEM NUMA kernels will boot correctly with this part of the change
reverted.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
pat_disable() is __init, which means it goes away after booting is complete.
Unfortunately it is used by the hotplug code if the machine is not
pat-capable, causing a crash.
Fix by marking pat_disable() as __cpuinit.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix this warning:
arch/x86/mm/pat.c: In function `phys_mem_access_prot_allowed':
arch/x86/mm/pat.c:558: warning: long long unsigned int format, long
unsigned int arg (arg 6)
arch/x86/mm/pat.c: In function `map_devmem':
arch/x86/mm/pat.c:580: warning: long long unsigned int format, long
unsigned int arg (arg 6)
Signed-off-by: D Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
After resume on a 2cpu laptop, kernel builds collapse with a sed hang,
sh or make segfault (often on 20295564), real-time signal to cc1 etc.
Several hurdles to jump, but a manually-assisted bisect led to -rc1's
d2bcbad5f3 x86: do not zap_low_mappings
in __smp_prepare_cpus. Though the low mappings were removed at bootup,
they were left behind (with Global flags helping to keep them in TLB)
after resume or cpu online, causing the crashes seen.
Reinstate zap_low_mappings (with local __flush_tlb_all) for each cpu_up
on x86_32. This used to be serialized by smp_commenced_mask: that's now
gone, but a low_mappings flag will do. No need for native_smp_cpus_done
to repeat the zap: let mem_init zap BSP's low mappings just like on UP.
(In passing, fix error code from native_cpu_up: do_boot_cpu returns a
variety of diagnostic values, Dprintk what it says but convert to -EIO.
And save_pg_dir separately before zap_low_mappings: doesn't matter now,
but zapping twice in succession wiped out resume's swsusp_pg_dir.)
That worked well on the duo and one quad, but wouldn't boot 3rd or 4th
cpu on P4 Xeon, oopsing just after unlock_ipi_call_lock. The TLB flush
IPI now being sent reveals a long-standing bug: the booting cpu has its
APIC readied in smp_callin at the top of start_secondary, but isn't put
into the cpu_online_map until just before that unlock_ipi_call_lock.
So native_smp_call_function_mask to online cpus would send_IPI_allbutself,
including the cpu just coming up, though it has been excluded from the
count to wait for: by the time it handles the IPI, the call data on
native_smp_call_function_mask's stack may well have been overwritten.
So fall back to send_IPI_mask while cpu_online_map does not match
cpu_callout_map: perhaps there's a better APICological fix to be
made at the start_secondary end, but I wouldn't know that.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
use CONFIG_MEMTEST only. if it is set, will have memtest=0 (disabled)
need to have memtest=4 in command line to test more patterns.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Move the scattered checks for PAT support to a single function. Its
moved to addon_cpuid_features.c as this file is shared between 32 and
64 bit.
Remove the manipulation of the PAT feature bit and just disable PAT in
the PAT layer, based on the PAT bit provided by the CPU and the
current CPU version/model white list.
Change the boot CPU check so it works on Voyager somewhere in the
future as well :) Also panic, when a secondary has PAT disabled but
the primary one has alrady switched to PAT. We have no way to undo
that.
The white list is kept for now to ensure that we can rely on known to
work CPU types and concentrate on the software induced problems
instead of fighthing CPU erratas and subtle wreckage caused by not yet
verified CPUs. Once the PAT code has stabilized enough, we can remove
the white list and open the can of worms.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix warning from pmd_bad() at bootup on a HIGHMEM64G HIGHPTE x86_32.
That came from 9fc34113f6 x86: debug pmd_bad();
but we understand now that the typecasting was wrong for PAE in the previous
version: pagetable pages above 4GB looked bad and stopped Arjan from booting.
And revert that cded932b75 x86: fix pmd_bad
and pud_bad to support huge pages. It was the wrong way round: we shouldn't
weaken every pmd_bad and pud_bad check to let huge pages slip through - in
part they check that we _don't_ have a huge page where it's not expected.
Put the x86 pmd_bad() and pud_bad() definitions back to what they have long
been: they can be improved (x86_32 should use PTE_MASK, to stop PAE thinking
junk in the upper word is good; and x86_64 should follow x86_32's stricter
comparison, to stop thinking any subset of required bits is good); but that
should be a later patch.
Fix Hans' good observation that follow_page() will never find pmd_huge()
because that would have already failed the pmd_bad test: test pmd_huge in
between the pmd_none and pmd_bad tests. Tighten x86's pmd_huge() check?
No, once it's a hugepage entry, it can get quite far from a good pmd: for
example, PROT_NONE leaves it with only ACCESSED of the KERN_PGTABLE bits.
However... though follow_page() contains this and another test for huge
pages, so it's nice to keep it working on them, where does it actually get
called on a huge page? get_user_pages() checks is_vm_hugetlb_page(vma) to
to call alternative hugetlb processing, as does unmap_vmas() and others.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Earlier-version-tested-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jeff Chua <jeff.chua.linux@gmail.com>
Cc: Hans Rosenfeld <hans.rosenfeld@amd.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
arch/x86/pci/Makefile_32 has a nasty detail. VISWS and NUMAQ build
override the generic pci-y rules. This needs a proper cleanup, but
that needs more thoughts. Undo
commit 895d30935e
x86: numaq fix
do not override the existing pci-y rule when adding visws or
numaq rules.
There is also a stupid init function ordering problem vs. acpi.o
Add comments to the Makefile to avoid tripping over this again.
Remove the srat stub code in discontig_32.c to allow a proper NUMAQ
build.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
bdd3cee2e4 (x86: ioremap(), extend check
to all RAM pages) breaks OLPC's ioremap call. The ioremap that OLPC uses is:
romsig = ioremap(0xffffffc0, 16);
The commit that breaks it is basically:
- for (pfn = phys_addr >> PAGE_SHIFT; pfn < max_pfn_mapped &&
- (pfn << PAGE_SHIFT) < last_addr; pfn++) {
+ for (pfn = phys_addr >> PAGE_SHIFT;
+ (pfn << PAGE_SHIFT) < last_addr; pfn++) {
+
Previously, the 'pfn < max_pfn_mapped' check would've caused us to not
enter the loop. Removing that check means we loop infinitely. The
reason for that is because pfn is 0xfffff, and last_addr is 0xffffffcf.
The remaining check that is used to exit the loop is not sufficient;
when pfn<<PAGE_SHIFT is 0xfffff000, that is less than 0xffffffcf; when
we increment pfn and it overflows (pfn == 0x100000), pfn<<PAGE_SHIFT
ends up being 0. That, of course, is less than last_addr. In effect,
pfn<<PAGE_SHIFT is never lower than last_addr.
The simple fix for this is to limit the last_addr check to the PAGE_MASK;
a patch is below.
Signed-off-by: Andres Salomon <dilinger@debian.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use UC_MINUS for ioremap(), ioremap_nocache() instead of strong UC.
Once all the X drivers move to ioremap_wc(), we can go back to strong
UC semantics for ioremap() and ioremap_nocache().
To avoid attribute aliasing issues, pci_mmap_page_range() will also
use UC_MINUS for default non write-combining mapping request.
Next steps:
a) change all the video drivers using ioremap() or ioremap_nocache()
and adding WC MTTR using mttr_add() to ioremap_wc()
b) for strict usage, we can go back to strong uc semantics
for ioremap() and ioremap_nocache() after some grace period for
completing step-a.
c) user level X server needs to use the appropriate method for setting
up WC mapping (like using resourceX_wc sysfs file instead of
adding MTRR for WC and using /dev/mem or resourceX under /sys)
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Vegard Nossum reported a large (150 seconds) boot delay during bootup,
and bisected it to "x86: ioremap(), extend check to all RAM pages"
(commit bdd3cee2e4). Revert this commit for now.
Bisected-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch removes the no longer used export of kmap_atomic_to_page.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-pci:
x86: add pci=check_enable_amd_mmconf and dmi check
x86: work around io allocation overlap of HT links
acpi: get boot_cpu_id as early for k8_scan_nodes
x86_64: don't need set default res if only have one root bus
x86: double check the multi root bus with fam10h mmconf
x86: multi pci root bus with different io resource range, on 64-bit
x86: use bus conf in NB conf fun1 to get bus range on, on 64-bit
x86: get mp_bus_to_node early
x86 pci: remove checking type for mmconfig probe
x86: remove unneeded check in mmconf reject
driver core: try parent numa_node at first before using default
x86: seperate mmconf for fam10h out from setup_64.c
x86: if acpi=off, force setting the mmconf for fam10h
x86_64: check MSR to get MMCONFIG for AMD Family 10h
x86_64: check and enable MMCONFIG for AMD Family 10h
x86_64: set cfg_size for AMD Family 10h in case MMCONFIG
x86: mmconf enable mcfg early
x86: clear pci_mmcfg_virt when mmcfg get rejected
x86: validate against acpi motherboard resources
Fixed up fairly trivial conflicts in arch/x86/pci/{init.c,pci.h} due to
OLPC support manually.
All architectures use an effectively identical definition of online_page(), so
just make it common code. x86-64, ia64, powerpc and sh are actually
identical; x86-32 is slightly different.
x86-32's differences arise because it puts its hotplug pages in the highmem
zone. We can handle this in the generic code by inspecting the page to see if
its in highmem, and update the totalhigh_pages count appropriately. This
leaves init_32.c:free_new_highpage with a single caller, so I folded it into
add_one_highpage_init.
I also removed an incorrect comment referring to the NUMA case; any NUMA
details have already been dealt with by the time online_page() is called.
[akpm@linux-foundation.org: fix indenting]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Dave Hansen <dave@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamez.hiroyu@jp.fujitsu.com>
Tested-by: KAMEZAWA Hiroyuki <kamez.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Christoph Lameter <clameter@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ingo already fixed one of these at my request (in "x86 PAT: tone down
debugging messages", commit 1ebcc654f0),
but there was another one he missed.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[mingo@elte.hu: split from "x86_64: get boot_cpu_id as early for k8_scan_nodes]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On big systems with lots of memory, don't print out too much during
bootup, and make it easy to find if it is continuous.
on 256G 8 sockets system will get
[ffffe20000000000-ffffe20002bfffff] PMD -> [ffff810001400000-ffff810003ffffff] on node 0
[ffffe2001c700000-ffffe2001c7fffff] potential offnode page_structs
[ffffe20002c00000-ffffe2001c7fffff] PMD -> [ffff81000c000000-ffff8100255fffff] on node 0
[ffffe20038700000-ffffe200387fffff] potential offnode page_structs
[ffffe2001c800000-ffffe200387fffff] PMD -> [ffff810820200000-ffff81083c1fffff] on node 1
[ffffe20040000000-ffffe2007fffffff] PUD ->ffff811027a00000 on node 2
[ffffe20038800000-ffffe2003fffffff] PMD -> [ffff811020200000-ffff8110279fffff] on node 2
[ffffe20054700000-ffffe200547fffff] potential offnode page_structs
[ffffe20040000000-ffffe200547fffff] PMD -> [ffff811027c00000-ffff81103c3fffff] on node 2
[ffffe20070700000-ffffe200707fffff] potential offnode page_structs
[ffffe20054800000-ffffe200707fffff] PMD -> [ffff811820200000-ffff81183c1fffff] on node 3
[ffffe20080000000-ffffe200bfffffff] PUD ->ffff81202fa00000 on node 4
[ffffe20070800000-ffffe2007fffffff] PMD -> [ffff812020200000-ffff81202f9fffff] on node 4
[ffffe2008c700000-ffffe2008c7fffff] potential offnode page_structs
[ffffe20080000000-ffffe2008c7fffff] PMD -> [ffff81202fc00000-ffff81203c3fffff] on node 4
[ffffe200a8700000-ffffe200a87fffff] potential offnode page_structs
[ffffe2008c800000-ffffe200a87fffff] PMD -> [ffff812820200000-ffff81283c1fffff] on node 5
[ffffe200c0000000-ffffe200ffffffff] PUD ->ffff813037a00000 on node 6
[ffffe200a8800000-ffffe200bfffffff] PMD -> [ffff813020200000-ffff8130379fffff] on node 6
[ffffe200c4700000-ffffe200c47fffff] potential offnode page_structs
[ffffe200c0000000-ffffe200c47fffff] PMD -> [ffff813037c00000-ffff81303c3fffff] on node 6
[ffffe200c4800000-ffffe200e07fffff] PMD -> [ffff813820200000-ffff81383c1fffff] on node 7
instead of a very long print out...
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
typical case: four sockets system, every node has 4g ram, and we are using:
memmap=10g$4g
to mask out memory on node1 and node2
when numa is enabled, early_node_mem is used to get node_data and node_bootmap.
if it can not get memory from the same node with find_e820_area(), it will
use alloc_bootmem to get buff from previous nodes.
so check it and print out some info about it.
need to move early_res_to_bootmem into every setup_node_bootmem.
and it takes range that node has. otherwise alloc_bootmem could return addr
that reserved early.
depends on "mm: make reserve_bootmem can crossed the nodes".
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
"mm: make reserve_bootmem can crossed the nodes" provides new
reserve_bootmem(), let reserve_bootmem_generic() use that.
reserve_bootmem_generic() is used to reserve initramdisk, so this way
we can make sure even when bootloader or kexec load ranges cross the
node memory boundaries, reserve_bootmem still works.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
disable /dev/mem mmap of RAM with PAT. It makes things safer and
eliminates aliasing. A future improvement would be to avoid the
range_is_allowed duplication.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-fixes:
x86 PAT: decouple from nonpromisc devmem
x86 PAT: tone down debugging messages
It is claimed that NexGen CPUs were never shipped:
http://lkml.org/lkml/2008/4/20/179
Also, the kernel support for these chips has been broken for
a long time, the code intended to support NexGen thereby being
essentially dead.
As an outcome of the discussion that can be found using the URL
above, this patch removes the NexGen support altogether.
The changes in this patch survived a defconfig build for i386, a
couple of successful randconfig builds, as well as a runtime test,
which consisted in booting a 32-bit x86 box up to the shell prompt.
Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Linus reported these excessive debug printouts:
> Overlap at 0xe0300000-0xe0400000
> Overlap at 0xe0300000-0xe0380000
> Overlap at 0xe0300000-0xe0400000
> Overlap at 0xe0300000-0xe0400000
> Overlap at 0xe0300000-0xe0400000
> Overlap at 0xe0300000-0xe0400000
> Overlap at 0xe0300000-0xe0400000
turn that into a pr_debug().
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-pat:
generic: add ioremap_wc() interface wrapper
/dev/mem: make promisc the default
pat: cleanups
x86: PAT use reserve free memtype in mmap of /dev/mem
x86: PAT phys_mem_access_prot_allowed for dev/mem mmap
x86: PAT avoid aliasing in /dev/mem read/write
devmem: add range_is_allowed() check to mmap of /dev/mem
x86: introduce /dev/mem restrictions with a config option
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-xen-next: (52 commits)
xen: add balloon driver
xen: allow compilation with non-flat memory
xen: fold xen_sysexit into xen_iret
xen: allow set_pte_at on init_mm to be lockless
xen: disable preemption during tlb flush
xen pvfb: Para-virtual framebuffer, keyboard and pointer driver
xen: Add compatibility aliases for frontend drivers
xen: Module autoprobing support for frontend drivers
xen blkfront: Delay wait for block devices until after the disk is added
xen/blkfront: use bdget_disk
xen: Make xen-blkfront write its protocol ABI to xenstore
xen: import arch generic part of xencomm
xen: make grant table arch portable
xen: replace callers of alloc_vm_area()/free_vm_area() with xen_ prefixed one
xen: make include/xen/page.h portable moving those definitions under asm dir
xen: add resend_irq_on_evtchn() definition into events.c
Xen: make events.c portable for ia64/xen support
xen: move events.c to drivers/xen for IA64/Xen support
xen: move features.c from arch/x86/xen/features.c to drivers/xen
xen: add missing definitions in include/xen/interface/vcpu.h which ia64/xen needs
...
All pagetables need fundamentally the same setup and destruction, so
just use the same code for everything.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Make KERNEL_PGD_PTRS common, as previously it was only being defined
for 32-bit.
There are a couple of follow-on changes from this:
- KERNEL_PGD_PTRS was being defined in terms of USER_PGD_PTRS. The
definition of USER_PGD_PTRS doesn't really make much sense on x86-64,
since it can have two different user address-space configurations.
I renamed USER_PGD_PTRS to KERNEL_PGD_BOUNDARY, which is meaningful
for all of 32/32, 32/64 and 64/64 process configurations.
- USER_PTRS_PER_PGD was also defined and was being used for similar
purposes. Converting its users to KERNEL_PGD_BOUNDARY left it
completely unused, and so I removed it.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Zach Amsden <zach@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Rename (alloc|release)_(pt|pd) to pte/pmd to explicitly match the name
of the appropriate pagetable level structure.
[ x86.git merge work by Mark McLoughlin <markmc@redhat.com> ]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Common definitions for 3-level pagetable functions.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Common definitions for 2-level pagetable functions.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add a common arch/x86/mm/pgtable.c file for common pagetable functions.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Convert asm-x86/pgalloc_64.h from macros into functions (#include hell
prevents __*_free_tlb from being inline, but they're probably a bit
big to inline anyway).
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use reserve_memtype and free_memtype wrappers for /dev/mem mmaps. The memtype
is slightly complicated here, given that we have to support existing X mappings.
We fallback on UC_MINUS for that.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce phys_mem_access_prot_allowed(), which checks whether the mapping
is possible, without any conflicts and returns success or failure based on that.
phys_mem_access_prot() by itself does not allow failure case. This ability
to return error is needed for PAT where we may have aliasing conflicts.
x86 setup __HAVE_PHYS_MEM_ACCESS_PROT and move x86 specific code out of
/dev/mem into arch specific area.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add xlate and unxlate around /dev/mem read/write. This sets up the mapping
that can be used for /dev/mem read and write without aliasing worries.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch introduces a restriction on /dev/mem: Only non-memory can be
read or written unless the newly introduced config option is set.
The X server needs access to /dev/mem for the PCI space, but it doesn't need
access to memory; both the file permissions and SELinux permissions of /dev/mem
just make X effectively super-super powerful. With the exception of the
BIOS area, there's just no valid app that uses /dev/mem on actual memory.
Other popular users of /dev/mem are rootkits and the like.
(note: mmap access of memory via /dev/mem was already not allowed since
a really long time)
People who want to use /dev/mem for kernel debugging can enable the config
option.
The restrictions of this patch have been in the Fedora and RHEL kernels for
at least 4 years without any problems.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel: (62 commits)
sched: build fix
sched: better rt-group documentation
sched: features fix
sched: /debug/sched_features
sched: add SCHED_FEAT_DEADLINE
sched: debug: show a weight tree
sched: fair: weight calculations
sched: fair-group: de-couple load-balancing from the rb-trees
sched: fair-group scheduling vs latency
sched: rt-group: optimize dequeue_rt_stack
sched: debug: add some debug code to handle the full hierarchy
sched: fair-group: SMP-nice for group scheduling
sched, cpuset: customize sched domains, core
sched, cpuset: customize sched domains, docs
sched: prepatory code movement
sched: rt: multi level group constraints
sched: task_group hierarchy
sched: fix the task_group hierarchy for UID grouping
sched: allow the group scheduler to have multiple levels
sched: mix tasks and groups
...
* Move large array "struct bootnode nodes" from stack to _initdata
section to reduce amount of stack space required.
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move dma_ops structure definition to pci-dma.c, where it
belongs.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
For example, If the physical address layout on a two node system with 8 GB
memory is something like:
node 0: 0-2GB, 4-6GB
node 1: 2-4GB, 6-8GB
Current kernels fail to boot/detect this NUMA topology.
ACPI SRAT tables can expose such a topology which needs to be supported.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
this function doesnt just 'find' the max_pfn - it also has
other side-effects such as registering sparse memory maps.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix printk formats in x86/mm/ioremap.c:
next-20080410/arch/x86/mm/ioremap.c:137: warning: format '%llx' expects type 'long long unsigned int', but argument 2 has type 'resource_size_t'
next-20080410/arch/x86/mm/ioremap.c:188: warning: format '%llx' expects type 'long long unsigned int', but argument 2 has type 'resource_size_t'
next-20080410/arch/x86/mm/ioremap.c:188: warning: format '%llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Remove old comments that include the old arch/i386 directory.
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
do not return a -EINVAL when mmap()-ing PCI holes.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cleanup references to the early cpu maps for the non-SMP configuration
and remove some functions called for SMP configurations only.
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Increase the number of bits in an apicid from 8 to 32.
By default, MP_processor_info() gets the APICID from the
mpc_config_processor structure. However, this structure limits
the size of APICID to 8 bits. This patch allows the caller of
MP_processor_info() to optionally pass a larger APICID that will
be used instead of the one in the mpc_config_processor struct.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch renames VM_MASK to X86_VM_MASK (which
in turn defined as alias to X86_EFLAGS_VM) to better
distinguish from virtual memory flags. We can't just
use X86_EFLAGS_VM instead because it is also used
for conditional compilation
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Intel recommends to not use large pages for the first 1MB
of the physical memory because there are fixed size MTRRs there
which cause splitups in the TLBs.
On AMD doing so is also a good idea.
The implementation is a little different between 32bit and 64bit.
On 32bit I just taught the initial page table set up about this
because it was very simple to do. This also has the advantage
that the risk of a prefetch ever seeing the page even
if it only exists for a short time is minimized.
On 64bit that is not quite possible, so use set_memory_4k() a little
later (in check_bugs) instead.
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: andreas.herrmann3@amd.com
Cc: mingo@elte.hu
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a new function to force split large pages into 4k pages.
This is needed for some followup optimizations.
I had to add a new field to cpa_data to pass down the information
that try_preserve_large_page should not run.
Right now no set_page_4k() because I didn't need it and all the
specialized users I have in mind would be more comfortable with
pure addresses. I also didn't export it because it's unlikely
external code needs it.
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: andreas.herrmann3@amd.com
Cc: mingo@elte.hu
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When end_pfn is not aligned to 2MB (or 1GB) then the kernel might
map more memory than end_pfn. Account this in max_pfn_mapped.
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: andreas.herrmann3@amd.com
Cc: mingo@elte.hu
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Even on 32bit 2MB pages can map more memory than is in the true
max_low_pfn if end_pfn is not highmem and not aligned to 2MB.
Add a end_pfn_map similar to x86-64 that accounts for this
fact. This is important for code that really needs to know about
all mapping aliases.
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: andreas.herrmann3@amd.com
Cc: mingo@elte.hu
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make the PAT related printks in ioremap pr_debug.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Bug fixes for reserve_memtype() call in __ioremap and pci_mmap_page_range().
If reserve_memtype returns non-zero, then it is an error and subsequent free is
not required. Requested and returned prot value check should be done when
reserve_memtype returns success.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
make known_pat_cpu to think amd k8 and fam10h is ok too.
also make tom2 below to be WRBACK
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Adds debug prints at critical code. Adds enough info in dmesg to allow us to
do effective first round of analysis of any issues that may result due to PAT
patch series.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce ioremap_wc for wc remap.
(generic wrapper is in a later patch)
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a set_memory_wc interface(), similar to set_memory_uc interface.
Callers has to call set_memory_uc, set_memory_wb and
set_memory_wc, set_memory_wb as pairs.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use reserve_memtype and free_memtype interfaces in set_memory_uc/set_memory_wb
interfaces to avoid aliasing.
Usage model of set_memory_uc and set_memory_wb is for RAM memory and users
will first call set_memory_uc and call set_memory_wb after use to reset the
attribute.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use reserve_memtype and free_memtype interfaces in ioremap/iounmap to avoid
aliasing.
If there is an existing alias for the region, inherit the memory type from
the alias. If there are conflicting aliases for the entire region, then fail
ioremap.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make ioremap_change_attr() non-static and use prot_val in place of ioremap_mode.
This interface is used in subsequent PAT patches.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Sets up pat_init() infrastructure.
PAT MSR has following setting.
PAT
|PCD
||PWT
|||
000 WB _PAGE_CACHE_WB
001 WC _PAGE_CACHE_WC
010 UC- _PAGE_CACHE_UC_MINUS
011 UC _PAGE_CACHE_UC
We are effectively changing WT from boot time setting to WC.
UC_MINUS is used to provide backward compatibility to existing /dev/mem
users(X).
reserve_memtype and free_memtype are new interfaces for maintaining alias-free
mapping. It is currently implemented in a simple way with a linked list and
not optimized. reserve and free tracks the effective memory type, as a result
of PAT and MTRR setting rather than what is actually requested in PAT.
pat_init piggy backs on mtrr_init as the rules for setting both pat and mtrr
are same.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
do simple memtest after init_memory_mapping
use find_e820_area_size to find all ram range that is not reserved.
and do some simple bits test to find some bad ram.
if find some bad ram, use reserve_early to exclude that range.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
otherwise Vmemmap and High Kernel Mapping string is not showing up.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Roland Dreier reported in http://lkml.org/lkml/2008/2/27/194
[ 8425.915139] BUG: unable to handle kernel paging request at ffffc20001a0a000
[ 8425.919087] IP: [<ffffffff8021dacc>] clflush_cache_range+0xc/0x25
[ 8425.919087] PGD 1bf80e067 PUD 1bf80f067 PMD 1bb497067 PTE 80000047000ee17b
This is on a Intel machine with 36bit physical address space. The PTE
entry references 47000ee000, which is outside of it.
Add a check for the physical address space and warn/printk about the
stupid caller.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Don't we have a special section for page-aligned data so it doesn't
> waste most of two pages?
We have .bss.page_aligned and it seems appropriate to use it.
text data bss dec hex filename
- 3388 8236 4 11628 2d6c ../build-32/arch/x86/mm/ioremap.o
+ 3388 48 4100 7536 1d70 ../build-32/arch/x86/mm/ioremap.o
Signed-off-by: Ian Campbell <ijc@hellion.org.uk>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
we don't need get that so early.
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The AMD Fam10h CPUs support new Gigabyte page table entry for
mapping 1GB at a time. Use this for the kernel direct mapping.
Only done for 64bit because i386 does not support GB page tables.
This only applies to the data portion of the direct mapping; the
kernel text mapping stays with 2MB pages because the AMD Fam10h
microarchitecture does not support GB ITLBs and AMD recommends
against using GB mappings for code.
Can be disabled with disable_gbpages on the kernel command line
[ tglx@linutronix.de: simplify enable code ]
[ Yinghai Lu <yinghai.lu@sun.com>: boot fix on 256 GB RAM ]
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
These new controls toggle experimental support for a new CPU feature,
the straightforward extension of largepages from the pmd level to the
pud level, which allows 1GB (kernel) TLBs instead of 2MB TLBs.
Turn it off by default, as this code has not been tested well enough yet.
Use the CONFIG_DIRECT_GBPAGES=y .config option or gbpages on the
boot line can be used to enable it. If enabled in the .config then
nogbpages boot option disables it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Clean up the page table dumper (fix boundary conditions, table driven
address ranges, some formatting changes since it is no longer using
the kernel log but a separate virtual file), and generalize to 32
bits.
[ mingo@elte.hu: x86: fix the pagetable dumper ]
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds code to the kernel to have an (optional)
/proc/kernel_page_tables debug file that basically dumps the kernel
pagetables; this allows us kernel developers to verify that nothing fishy is
going on and that the various mappings are set up correctly. This was quite
useful in finding various change_page_attr() bugs, and is very likely to be
useful in the future as well.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: mingo@elte.hu
Cc: tglx@tglx.de
Cc: hpa@zytor.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Unify arch/x86/mm/Makefile between 32 and 64 bits.
All configuration variables that are protected by Kconfig constraints
have been put in the common part of the Makefile; however, the NUMA
files are totally different between 32 and 64 bits and are handled via
an ifdef.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add debug information for DEBUG_PAGEALLOC to get some statistics about
the pool usage and split status.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
I believe http://bugzilla.kernel.org/show_bug.cgi?id=10318 is a false
positive. There's no way in which networking will be using highmem pages
here, so it won't be taking the KM_USER0 kmap slot, so there's no point in
performing these checks.
Cc: Pawel Staszewski <pstaszewski@artcom.pl>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Really sad. We lose almost all real-life coverage of the debug tests
with this patch. Now it will only report problems for the cases where
people actually end up using a HIGHMEM page, not when they just _might_
use one. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus noticed a second bug and an uncleanliness:
- we'd return on any instruction fetch fault
- we'd use both the value of 16 and the PF_INSTR symbol which are
the same and make no sense
the cleanup nicely unifies this piece of logic.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The first page of the compound page is determined in follow_huge_addr()
but then PageCompound() only checks if the page is part of a compound page.
PageHead() allows checking if this is indeed the first page of the
compound.
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
some early Athlon XP's and Opterons generate bogus faults on prefetch
instructions. The workaround for this regressed over .24 - reinstate it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix the 3D performance drop reported at:
http://bugzilla.kernel.org/show_bug.cgi?id=10328
fb drivers are using ioremap()/ioremap_nocache(), followed by mtrr_add with
WC attribute. Recent changes in page attribute code made both
ioremap()/ioremap_nocache() mappings as UC (instead of previous UC-). This
breaks the graphics performance, as the effective memory type is UC instead
of expected WC.
The correct way to fix this is to add ioremap_wc() (which uses UC- in the
absence of PAT kernel support and WC with PAT) and change all the
fb drivers to use this new ioremap_wc() API.
We can take this correct and longer route for post 2.6.25. For now,
revert back to the UC- behavior for ioremap/ioremap_nocache.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
we could call find_max_pfn() directly instead of setup_memory() to get
max_pfn needed for mtrr trimming.
otherwise setup_memory() is called two times... that is duplicated...
[ mingo@elte.hu: both Thomas and me simulated a double call to
setup_bootmem_allocator() and can confirm that it is a real bug
which can hang in certain configs. It's not been reported yet but
that is probably due to the relatively scarce nature of
MTRR-trimming systems. ]
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It appears that 64-bit PCI resources cannot possibly ever have worked on
x86-32 even when the RESOURCES_64BIT config option was set, because any
driver that tried to [pci_]ioremap() the resource would have been unable
to do so because the high 32 bits would have been silently dropped on
the floor by the ioremap() routines that only used "unsigned long".
Change them to use "resource_size_t" instead, which properly encodes the
whole 64-bit resource data if RESOURCES_64BIT is enabled.
Acked-by: H. Peter Anvin <hpa@kernel.org>
Acked-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
so use nodedata_phys directly.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
quicklists cause a serious memory leak on 32-bit x86,
as documented at:
http://bugzilla.kernel.org/show_bug.cgi?id=9991
the reason is that the quicklist pool is a special-purpose
cache that grows out of proportion. It is not accounted for
anywhere and users have no way to even realize that it's
the quicklists that are causing RAM usage spikes. It was
supposed to be a relatively small pool, but as demonstrated
by KOSAKI Motohiro, they can grow as large as:
Quicklists: 1194304 kB
given how much trouble this code has caused historically,
and given that Andrew objected to its introduction on x86
(years ago), the best option at this point is to remove them.
[ any performance benefits of caching constructed pgds should
be implemented in a more generic way (possibly within the page
allocator), while still allowing constructed pages to be
allocated by other workloads. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
resolve boot problem reported by Mel Gorman:
http://lkml.org/lkml/2008/2/13/404
init_cpu_to_node will use cpu->apic (from MADT or mptable) and
apic->node(from SRAT or AMD config space with k8_bus_64.c) to have
cpu->node mapping, and later identify_cpu will overwrite them
again...(with nearby_node...)
this patch checks if the node is online, otherwise it will not
update cpu_node map. so keep cpu_node map to online node before
identify_cpu..., to prevent possible error.
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Revert:
commit 8be8f54bae
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Sat Feb 23 20:43:21 2008 +0100
x86: CPA: avoid split of alias mappings
because it clearly mishandles the case when __change_page_attr(), called
from __change_page_attr_set_clr(), changes cpa->processed to 1 and
cpa_process_alias(cpa) is executed right after that.
This crashes my x86-64 test box early in the boot process
(ref. http://bugzilla.kernel.org/show_bug.cgi?id=10140#c4).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
avoid over-eager large page splitup.
When the target area needs to be split or is split already (ioremap)
then the current code enforces the split of large mappings in the alias
regions even if we could avoid it.
Use a separate variable processed in the cpa_data structure to carry
the number of pages which have been processed instead of reusing the
numpages variable. This keeps numpages intact and gives the alias code
a chance to keep large mappings intact.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jan Beulich noticed it during code review that if a driver's ioremap()
fails (say due to -ENOMEM) then we might leak the struct vm_area.
Free it properly.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
recently the 64-bit allyesconfig bzImage kernel started spontaneously
rebooting during early bootup.
after a few fun hours spent with early init debugging, it turns out
that we've got this rather annoying limit on the size of the kernel
image:
#define KERNEL_TEXT_SIZE (40*1024*1024)
which limit my vmlinux just happened to pass:
text data bss dec hex filename
29703744 4222751 8646224 42572719 2899baf vmlinux
40 MB is 42572719 bytes, so my vmlinux was just 1.5% above this limit :-/
So it happily crashed right in head_64.S, which - as we all know - is
the most debuggable code in the whole architecture ;-)
So increase the limit to allow an up to 128MB kernel image to be mapped.
(should anyone be that crazy or lazy)
We have a full 4K of pagetable (level2_kernel_pgt) allocated for these
mappings already, so there's no RAM overhead and the limit was rather
pointless and arbitrary.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use PF_MEMALLOC to prevent recursive calls in the DBEUG_PAGEALLOC
case. This makes the code simpler and more robust against allocation
failures.
This fixes the following fallback to non-mmconfig:
http://lkml.org/lkml/2008/2/20/551http://bugzilla.kernel.org/show_bug.cgi?id=10083
Also, for DEBUG_PAGEALLOC=n reduce the pool size to one page.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Make hibernation work with CONFIG_DEBUG_PAGEALLOC set on x86, by
checking if the pages to be copied are marked as present in the
kernel mapping and temporarily marking them as present if that's not
the case. No functional modifications are introduced if
CONFIG_DEBUG_PAGEALLOC is unset.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
* 'agp-patches' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6:
agp: fix missing casts that produced a warning.
agp: add support for 662/671 to agp driver
fix historic ioremap() abuse in AGP
agp/sis: Suspend support for SiS AGP
agp/sis: Clear bit 2 from aperture size byte as well
page_is_ram() has a special case for the 640k-1M bios area, however
due to a thinko the special case checks the e820 table entry and
not the memory the user has asked for. This patch fixes the bug.
[ mingo@elte.hu: this too is better solved in the e820 space, but those
fixes are too intrusive for v2.6.25. ]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch teaches page_is_ram() about the fact that the first
4Kb of memory are special on x86, even though the E820 table
normally doesn't exclude it.
This fixes the WARN_ON() reported by Laurent Riffard who was also
very helpful in diagnosing the issue.
[ mingo@elte.hu: we are working on doing this properly in the e820
space, but for 2.6.25 this is the better fix. ]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
reserve_hotadd() are only used by __init acpi_numa_memory_affinity_init().
Annotate reserve_hotadd() with __init is the trivial fix.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
New implementation does not use lru for anything so there is no need
to reject pages that are in the LRU. Similar for compound pages (which
were checked because they also use page->lru)
[ tglx@linutronix.de: removed unused variable ]
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Several AGP drivers right now use ioremap_nocache() on kernel ram in order
to turn a page of regular memory uncached.
There are two problems with this:
1) This is a total nightmare for the ioremap() implementation to keep
various mappings of the same page coherent.
2) It's a total nightmare for the AGP code since it adds a ton of
complexity in terms of keeping track of 2 different pointers to
the same thing, in terms of error handling etc etc.
This patch fixes this by making the AGP drivers use the new
set_memory_XX APIs instead.
Note: amd-k7-agp.c is built on Alpha too, and generic.c is built
on ia64 as well, which do not yet have the set_memory_*() APIs,
so for them some we have a few ugly #ifdefs - hopefully they'll
be fixed soon.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Dave Airlie <airlied@linux.ie>
When the CPA code is called with an virtual address in the range of
the direct mapping or the high alias then we do not need to run
through the alias check for this range.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
NX settings are not required to be consistent across alias mappings.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The early boot code maps KERNEL_TEXT_SIZE (currently 40MB) starting
from __START_KERNEL_map. The kernel itself only needs _text to _end
mapped in the high alias. On relocatible kernels the ASM setup code
adjusts the compile time created high mappings to the relocation. This
creates invalid pmd entries for negative offsets:
0xffffffff80000000 -> pmd entry: ffffffffff2001e3
It points outside of the physical address space and is marked present.
This starts at the virtual address __START_KERNEL_map and goes up to
the point where the first valid physical address (0x0) is mapped.
Zap the mappings before _text and after _end right away in early
boot. This removes also the invalid entries.
Furthermore it simplifies the range check for high aliases.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
c_p_a() did not discover all aliases correctly. (such as when called
on vmalloc()-ed areas or ioremap()-ed areas)
Push the alias checks to the lower, physical level and consistently
discover all aliases that might exist: the low direct mappings and
the high linear kernel-text mappings (on 64-bit).
Thanks to Andi Kleen for pointing out that this was buggy.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
the cpa API is page aligned - warn about any weird alignments.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
extern should not appear in C files. Also, the definitions
do not match the prototype currently, not sure what way you
want to go with this, I've switched the prototype to return
int, but I can see going to the void return as well.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
dump_pagetable() can now become static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[ mingo@elte.hu: while gbpages cannot be enabled on mainline currently,
keep the code uptodate and this fix is easy enough. ]
Use correct page sizes and masks for GB pages in try_preserve_large_page()
This prevents a boot hang on a GB capable system with CONFIG_DIRECT_GBPAGES
enabled.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
At the early stages of boot, before the kernel pagetable has been
fully initialized, a Xen kernel will still be running off the
Xen-provided pagetables rather than swapper_pg_dir[]. Therefore,
readback cr3 to determine the base of the pagetable rather than
assuming swapper_pg_dir[].
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Tested-by: Jody Belka <knew-linux@pimb.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
pageattr-test.c contains a noisy debug printk that people reported.
The condition under which it prints (randomly tapping into a mem_map[]
hole and not being able to c_p_a() there) is valid behavior and not
interesting to report.
Remove it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Right now, we check only the first 4k page for static required protections.
This does not take overlapping regions into account. So we might end up
setting the wrong permissions/protections for other parts of this large page.
This can be optimized further, but correctness is the important part.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Switch the split page code to use the page pool. We do this
unconditionally to avoid different behaviour with and without
DEBUG_PAGEALLOC enabled.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
DEBUG_PAGEALLOC was not possible on 64-bit due to its early-bootup
hardcoded reliance on PSE pages, and the unrobustness of the runtime
splitup of large pages. The splitup ended in recursive calls to
alloc_pages() when a page for a pte split was requested.
Avoid the recursion with a preallocated page pool, which is used to
split up large mappings and gets refilled in the return path of
kernel_map_pages after the split has been done. The size of the page
pool is adjusted to the available memory.
This part just implements the page pool and the initialization w/o
using it yet.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some important parts of f6df72e71e got
dropped along the way, reintroduce them.
Only affects paravirt guests.
Signed-off-by: Ian Campbell <ijc@hellion.org.uk>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Specifically the boot time page tables in a CONFIG_X86_PAE=y enabled
kernel are in PAE format.
early_ioremap is updated to use the standard page table accessors.
Clear any mappings beyond max_low_pfn from the boot page tables in
native_pagetable_setup_start because the initial mappings can extend
beyond the range of physical memory and into the vmalloc area.
Derived from patches by Eric Biederman and H. Peter Anvin.
[ jeremy@goop.org: PAE swapper_pg_dir needs to be page-sized fix ]
Signed-off-by: Ian Campbell <ijc@hellion.org.uk>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Mika Penttilä <mika.penttila@kolumbus.fi>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Adjust the definition of lookup_address to take an unsigned long
level argument. Adjust callers in xen/mmu.c that pass in a
dummy variable.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Background: I've implemented 1K/2K page tables for s390. These sub-page
page tables are required to properly support the s390 virtualization
instruction with KVM. The SIE instruction requires that the page tables
have 256 page table entries (pte) followed by 256 page status table entries
(pgste). The pgstes are only required if the process is using the SIE
instruction. The pgstes are updated by the hardware and by the hypervisor
for a number of reasons, one of them is dirty and reference bit tracking.
To avoid wasting memory the standard pte table allocation should return
1K/2K (31/64 bit) and 2K/4K if the process is using SIE.
Problem: Page size on s390 is 4K, page table size is 1K or 2K. That means
the s390 version for pte_alloc_one cannot return a pointer to a struct
page. Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one
cannot return a pointer to a pte either, since that would require more than
32 bit for the return value of pte_alloc_one (and the pte * would not be
accessible since its not kmapped).
Solution: The only solution I found to this dilemma is a new typedef: a
pgtable_t. For s390 pgtable_t will be a (pte *) - to be introduced with a
later patch. For everybody else it will be a (struct page *). The
additional problem with the initialization of the ptl lock and the
NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and
a destructor pgtable_page_dtor. The page table allocation and free
functions need to call these two whenever a page table page is allocated or
freed. pmd_populate will get a pgtable_t instead of a struct page pointer.
To get the pgtable_t back from a pmd entry that has been installed with
pmd_populate a new function pmd_pgtable is added. It replaces the pmd_page
call in free_pte_range and apply_to_pte_range.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset adds a flags variable to reserve_bootmem() and uses the
BOOTMEM_EXCLUSIVE flag in crashkernel reservation code to detect collisions
between crashkernel area and already used memory.
This patch:
Change the reserve_bootmem() function to accept a new flag BOOTMEM_EXCLUSIVE.
If that flag is set, the function returns with -EBUSY if the memory already
has been reserved in the past. This is to avoid conflicts.
Because that code runs before SMP initialisation, there's no race condition
inside reserve_bootmem_core().
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix powerpc build]
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Cc: <linux-arch@vger.kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lockdep just caught this one:
=================================
[ INFO: inconsistent lock state ]
2.6.24 #38
---------------------------------
inconsistent {in-softirq-W} -> {softirq-on-W} usage.
swapper/1 [HC0[0]:SC0[0]:HE1:SE1] takes:
(pgd_lock){-+..}, at: [<ffffffff8022a9ea>] mm_init+0x1da/0x250
{in-softirq-W} state was registered at:
[<ffffffffffffffff>] 0xffffffffffffffff
irq event stamp: 394559
hardirqs last enabled at (394559): [<ffffffff80267f0a>] get_page_from_freelist+0x30a/0x4c0
hardirqs last disabled at (394558): [<ffffffff80267d25>] get_page_from_freelist+0x125/0x4c0
softirqs last enabled at (393952): [<ffffffff80232f8e>] __do_softirq+0xce/0xe0
softirqs last disabled at (393945): [<ffffffff8020c57c>] call_softirq+0x1c/0x30
other info that might help us debug this:
no locks held by swapper/1.
stack backtrace:
Pid: 1, comm: swapper Not tainted 2.6.24 #38
Call Trace:
[<ffffffff8024e1fb>] print_usage_bug+0x18b/0x190
[<ffffffff8024f55d>] mark_lock+0x53d/0x560
[<ffffffff8024fffa>] __lock_acquire+0x3ca/0xed0
[<ffffffff80250ba8>] lock_acquire+0xa8/0xe0
[<ffffffff8022a9ea>] ? mm_init+0x1da/0x250
[<ffffffff809bcd10>] _spin_lock+0x30/0x70
[<ffffffff8022a9ea>] mm_init+0x1da/0x250
[<ffffffff8022aa99>] mm_alloc+0x39/0x50
[<ffffffff8028b95a>] bprm_mm_init+0x2a/0x1a0
[<ffffffff8028d12b>] do_execve+0x7b/0x220
[<ffffffff80209776>] sys_execve+0x46/0x70
[<ffffffff8020c214>] kernel_execve+0x64/0xd0
[<ffffffff8020901e>] ? _stext+0x1e/0x20
[<ffffffff802090ba>] init_post+0x9a/0xf0
[<ffffffff809bc5f6>] ? trace_hardirqs_on_thunk+0x35/0x3a
[<ffffffff8024f75a>] ? trace_hardirqs_on+0xba/0xd0
[<ffffffff8020c1a8>] ? child_rip+0xa/0x12
[<ffffffff8020bcbc>] ? restore_args+0x0/0x44
[<ffffffff8020c19e>] ? child_rip+0x0/0x12
turns out that pgd_lock has been used on 64-bit x86 in an irq-unsafe
way for almost two years, since commit 8c914cb704.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
delay the CPA self-test so that any impact (corruption) of
user-space pagetables can be triggered. Repeat the test
every 30 seconds.
this would have prevented the bug fixed by 8cb2a7c1e9,
at its source.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The .rodata section really should just be read only; the config option
is there to make breaking up the 2Mb page an option (so people whos machines
give more performance for the 2Mb case can opt to do so).
But when the page gets split anyway, this is no longer an issue, so
clean up the code and remove the ifdefs
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The .rodata section shouldn't just be read-only,
but also non-executable. This is free since we've broken
up the 2MB page already anyway.
also update test_nx to check for this.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In very rare cases, on certain CPUs, we could end up in the spurious
fault handler and ignore a large pud/pmd mapping. The resulting pte
pointer points into the mapped physical space and dereferencing it
will fault recursively.
Make the code aware of large mappings and do the permission check
on the pmd/pud entry, when a large pud/pmd mapping is detected.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When change_page_attr splits a large page on x86_32 (without PAE), it is
currently corrupting every process's page directory: fix that by removing
the thinko which passes down a physical instead of a virtual address.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(with Martin Schwidefsky <schwidefsky@de.ibm.com>)
The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as
first argument. The free functions do not get the mm_struct argument. This
is 1) asymmetrical and 2) to do mm related page table allocations the mm
argument is needed on the free function as well.
[kamalesh@linux.vnet.ibm.com: i386 fix]
[akpm@linux-foundation.org: coding-syle fixes]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use set_pte() for setting up the 2MB pages in the direct mapping.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
pud and pmd entries in the RAM area might be marked as non present.
Do not try to modify them in the selftest.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move the readout of the large entry into the spinlock section to
prevent an unlikely but possible race.
Mark the pmd/pud entry present after the split. We preserved the
non present bit in the new split mapping.
Remove the stale gfp_flags double initialization.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
lookup_address() returns a wrong level and a wrong pointer to a non
existing pte, when pmd or pud entries are marked !present. This
happens for example due to boot time mapping of GART into the low
memory space.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
An Athlon 64 X2 test system showed hard hangs shortly after marking
the kernel text read-only, if we tried to preserve largepages and
changed the PSE entry from RW to RO. The pagetable code itself is
correct, it's the CPU that locked up hard (and not even the NMI
watchdog could punch through that hard hang).
So be conservative and always do splitups - like we did in the past.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When CPA is called on a range which fits into a large page mapping,
avoid to split the page when:
1) There is no change of attributes
2) The range to change is a complete large mapping
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The number of arguments which need to be transported is increasing
and we want to add flush optimizations and large page preserving.
Create struct cpa data and pass a pointer instead of increasing the
number of arguments further.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We only need to flush the caches in cpa() if the the caching attributes
have changed. Otherwise only flush the TLBs.
This checks the PAT bits too although they are currently not used by
the kernel.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Mask out the not supported bits (e.g. NX). If the clr/set masks
are empty after the mask return without changing anything.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When an ioremap is unmapped, do not change the page attributes. There might
be another mapping of the same physical address. PAT might detect a conflicting
mapping attribute for no good reason. The mapping is removed anyway.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that cpa works on non-direct mappings as well, we can safely
remove the range check in ioremap_change_attr().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove tons of castings which make the code hard to read.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When splitting large pages, we ge the pfn from the existing entry
instead of calculating it ourself.
This removes the last remaining range restriction of the cpa code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When changing the attributes of a pte, we should use the PFN from the
existing PTE rather than going through hoops calculating what we think
it might have been; this is both fragile and totally unneeded. It also
makes it more hairy to call any of these functions on non-direct maps
for no good reason whatsover.
With this change, __change_page_attr() no longer takes a pfn as argument,
which simplifies all the callers.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@tglx.de>
Right now, enforcing that the high mapping of the kernel text doesn't
get the NX bit is done deep in the guts of CPA, rather than in the
static_protection() function that enforces all other per-arch sanity
checks.
This patch moves this sanity check into the central static_protection()
function instead, and makes it apply ONLY to the kernel text, not to all
other areas in the high mapping.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Revert "defer cr3 reload when doing pud_clear()" since I'm going to
replace it.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The constructors for PAE and non-PAE pgd_ctors are more or less
identical, and can be made into the same function.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: William Irwin <wli@holomorphy.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use the _ASM_EXTABLE macro from <asm/asm.h>, instead of open-coding
__ex_table entires in arch/x86/mm/init_32.c.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
print out node_data addr and bootmap_start addr.
helpful for debugging early crashes on high-end NUMA systems.
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In split_large_page we clear the NX bit for the new split ptes, but we
need to preserve the original setting of it for the split ptes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Kevin Winchester reported the loss of direct rendering, due to:
[ 0.588184] agpgart: Detected AGP bridge 0
[ 0.588184] agpgart: unable to get memory for graphics translation table.
[ 0.588184] agpgart: agp_backend_initialize() failed.
[ 0.588207] agpgart-amd64: probe of 0000:00:00.0 failed with error -12
and bisected it down to:
commit 266b9f8727
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Wed Jan 30 13:34:06 2008 +0100
x86: fix ioremap RAM check
this check was too strict and caused an ioremap() failure.
the problem is due to the somewhat unclean way of how the GART code
reserves a memory range for its aperture, and how it utilizes it
later on.
Allow RAM pages to be ioremap()-ed too, as long as they are reserved.
Bisected-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
swsusp_pg_dir[] is used for suspend, but not for hibernation.
clean-up the ifdefs which worked by accident, while implying the opposite.
Delete the __nosavedata, which also implied the opposite.
Some day we may optimize CONFIG_ACPI_SLEEP to build minimal kernels
for just hibernate or just suspend but not both,
but today isn't that day.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
This patch fixes a bug of early_ioremap_reset(), which had been fixed
before by "convert the boot time page table to the kernels native
format" patch. But that patch has been reverted now.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch replaces __change_page_attr_set_clr() with
change_page_attr_set_clr() in change_page_attr_clear() to flush the
TLB/cache properly.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
memnode.map is s16 array because of nodeid is 16 bit now.
so need to increase the nodemap_size according to that bits.
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
one early crash on one 8 node 256g machine:
Command line: console=uart8250,io,0x3f8,115200n8 initrd=kernel.org/mydisk11_x86_64.gz rw root=/dev/ram0 debug initcall_debug apic=debug acpi.debug_level=0x0000000f pci=routeirq ip=dhcp load_ramdisk=1 ramdisk_size=131072 BOOT_IMAGE=kernel.org/bzImage_2.6.25_k8.1
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009bc00 (usable)
BIOS-e820: 000000000009bc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e6000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000dffe0000 (usable)
BIOS-e820: 00000000dffe0000 - 00000000dffee000 (ACPI data)
BIOS-e820: 00000000dffee000 - 00000000dffff050 (ACPI NVS)
BIOS-e820: 00000000dffff050 - 00000000e0000000 (reserved)
BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000ff700000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000004020000000 (usable)
Early serial console at I/O port 0x3f8 (options '115200n8')
console [uart0] enabled
end_pfn_map = 67239936
Kernel panic - not syncing: Duplicated early reservation d40000-e42000
Pid: 0, comm: swapper Not tainted 2.6.24-smp-g5a514e21-dirty #3
Call Trace:
[<ffffffff80221545>] lapic_get_maxlvt+0x0/0x10
[<ffffffff80221657>] clear_local_APIC+0x5/0xcf
[<ffffffff80221726>] disable_local_APIC+0x5/0x17
[<ffffffff8021fe16>] smp_send_stop+0x46/0x4c
[<ffffffff80235293>] panic+0x94/0x13e
[<ffffffff80bc3b03>] sctp_eps_proc_init+0x12/0x34
[<ffffffff80b9f1c5>] reserve_early+0x30/0x6c
[<ffffffff80803925>] init_memory_mapping+0x2cd/0x2dc
[<ffffffff80b9dc01>] setup_arch+0x21f/0x44e
[<ffffffff80b978be>] start_kernel+0x6f/0x2c7
[<ffffffff80b971cc>] _sinittext+0x1cc/0x1d3
it turns out there is overlap between pgtable and bss...
in System.map we have
ffffffff80d40420 b rsi_table
ffffffff80d40620 B krb5_seq_lock
ffffffff80d40628 b i.20437
ffffffff80d40630 b xprt_rdma_inline_write_padding
ffffffff80d40638 b sunrpc_table_header
ffffffff80d40640 b zero
ffffffff80d40644 b min_memreg
ffffffff80d40648 b rpcrdma_tk_lock_g
ffffffff80d40650 B sctp_assocs_id_lock
ffffffff80d40658 B proc_net_sctp
ffffffff80d40660 B sctp_assocs_id
ffffffff80d40680 B sysctl_sctp_mem
ffffffff80d40690 B sysctl_sctp_rmem
ffffffff80d406a0 B sysctl_sctp_wmem
ffffffff80d406b0 b sctp_ctl_socket
ffffffff80d406b8 b sctp_pf_inet6_specific
ffffffff80d406c0 b sctp_pf_inet_specific
ffffffff80d406c8 b sctp_af_v4_specific
ffffffff80d406d0 b sctp_af_v6_specific
ffffffff80d406d8 b sctp_rand.33270
ffffffff80d406dc b sctp_memory_pressure
ffffffff80d406e0 b sctp_sockets_allocated
ffffffff80d406e4 b sctp_memory_allocated
ffffffff80d406e8 b sctp_sysctl_header
ffffffff80d406f0 b zero
ffffffff80d406f4 A __bss_stop
ffffffff80d406f4 A _end
need to round up table_start to PAGE_SIZE.
also make the panic more informative.
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This just adds the PCI IDs of AMD's family 10h and 11h CPU's northbridges to
k8topology discovery.
Signed-off-by: Joachim Deguara <joachim.deguara@amd.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Put appropriate pagetable update hooks in so that paravirt knows
what's going on in there.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use a standard list threaded through page->lru for maintaining the pgd
list on PAE. This is the same as 64-bit, and seems saner than using a
non-standard list via page->index.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In x86 PAE mode, stop treating pmds as a special case. Previously
they were always allocated and freed with the pgd. The modifies the
code to be the same as 64-bit mode, where they are allocated on
demand.
This is a step on the way to unifying 32/64-bit pagetable allocation
as much as possible.
There is a complicating wart, however. When you install a new
reference to a pmd in the pgd, the processor isn't guaranteed to see
it unless you reload cr3. Since reloading cr3 also has the
side-effect of flushing the tlb, this is an expense that we want to
avoid whereever possible.
This patch simply avoids reloading cr3 unless the update is to the
current pagetable. Later patches will optimise this further.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: William Irwin <wli@holomorphy.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The change from current to tsk in do_page_fault is safe as
this is set at the very beginning of the function.
Removes a likely() annotation from the 64-bit version, this
could have instead been added to 32-bit.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When changing a kernel page from RO->RW, it's OK to leave stale TLB
entries around, since doing a global flush is expensive and they pose
no security problem. They can, however, generate a spurious fault,
which we should catch and simply return from (which will have the
side-effect of reloading the TLB to the current PTE).
This can occur when running under Xen, because it frequently changes
kernel pages from RW->RO->RW to implement Xen's pagetable semantics.
It could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
doing a global TLB flush after changing page permissions.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On !PAE 32-bit, _PAGE_NX will be 0, making is_prefetch always
return early. The test is sufficient on PAE as __supported_pte_mask
is updated in the same places as nx_enabled in init_32.c which also
takes disable_nx into account.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Unify includes in moved fault.c.
Modify Makefiles to pick up unified file.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Elimination of these ifdefs can be done in a unified file.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It's about time to get on with unifying these files, elimination
of the ugly ifdefs can occur in the unified file.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
printk fixes. NOP in terms of functionality, but strings got
a bit larger due to the KERN_ markers that were added.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This changes the oops dumping format for page faults to
be similar between X86_32 and 64.
This is the first user of printk_address on X86_32.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This will help when unifying the oops dumping code on 32/64
bit. No functional changes.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Further towards unifying these files, add another helper
in same spirit as is_errata93.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Further towards unifying these files, add another helper
in same spirit as is_errata93.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cleanup the address calculations, which are necessary to identify the
high/low alias mappings of the kernel on 64 bit machines. Instead of
calling __pa/__va back and forth, calculate the physical address once
and base the other calculations on it. Add understandable constants so
we can use the already available within() helper. Also add comments,
which help mere mortals to understand what this code does.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
debug incorrect/late access to init memory, by permanently unmapping
the init memory ranges. Depends on CONFIG_DEBUG_PAGEALLOC=y.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It looks like a mismerge put the rodata self-check in the wrong spot; move
it to the right place after marking the .rodata section read only.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
clflush is sufficient to be issued on one CPU. The invalidation is
broadcast throughout the coherence domain.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
clflush is an unordered operation with respect to other memory
traffic, including other CLFLUSH instructions. This needs proper
fencing with mfence.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The function name global_flush_tlb() suggests something different from
what the function really does. Rename it to cpa_flush_all(), which is an
understandable counterpart to cpa_flush_range().
no global visibility of the old API anymore.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use clflush on CPUs which support this.
clflush is only used when the page attribute operation has been
successful. On CPUs which do not support clflush and in the case of
error the old fashioned global_flush_tlb() is called.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Convert cpa_set and cpa_clear to call the new set_clr function.
Seperate out the debug helpers.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Create a set_and_clr function to avoid the duplicate loops. Allows
also to do combined operations for optimization.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
To avoid the modification of the flush code for the clflush
implementation, move the flush into the set and clear functions and
provide helper functions for the debugging code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Latest update; I now have 4 NX tests, but 2 fail so they're #if 0'd.
I also cleaned up the NX test code quite a bit, and got rid of the ugly
exception table sorting stuff.
From: Arjan van de Ven <arjan@linux.intel.com>
This patch adds testcases for the CONFIG_DEBUG_RODATA configuration option
as well as the NX CPU feature/mappings. Both testcases can move to tests/
once that patch gets merged into mainline.
(I'm half considering moving the rodata test into mm/init.c but I'll
wait with that until init.c is unified)
As part of this I had to fix a not-quite-right alignment in the vmlinux.lds.h
for the RODATA sections, which lead to 1 page less being marked read only.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When we free initmem, various rodata and CPA checks may have left
memory read only.. this patch ensures that the memory is writable
before we free it.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In Ingo's testing, he found a bug in the CPA selftest code. What would
happen is that the test would call change_page_attr_addr on a range of
memory, part of which was read only, part of which was writable. The
only thing the test wanted to change was the global bit...
What actually happened was that the selftest would take the permissions
of the first page, and then the change_page_attr_addr call would then
set the permissions of the entire range to this first page. In the
rodata section case, this resulted in pages after the .rodata becoming
read only... which made the kernel rather unhappy in many interesting
ways.
This is just another example of how dangerous the cpa API is (was); this
patch changes the test to use the incremental clear/set APIs
instead, and it changes the clear/set implementation to work on a 1 page
at a time basis.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The set_memory_* and set_pages_* family of API's currently requires the
callers to do a global tlb flush after the function call; forgetting this is
a very nasty deathtrap. This patch moves the global tlb flush into
each of the callers
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
change_page_attr_add is only used in pageattr.c now, so we can
make this function static.
change_page_attr() isn't used anywere at all anymore; this function
is a really bad API anyway so just remove the bloat entirely.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
page_is_ram has a FIXME since ages, which reminds to sanity check the
BIOS area between 640k and 1M, which is sometimes falsely reported as
RAM in the e820 tables.
Implement the sanity check. Move the BIOS range defines from
pageattr.c into e820.h to avoid duplicate defines.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With the introduction of the new API, no driver or non-archcore code needs
to use c-p-a anymore, so this patch also deprecates the EXPORT_SYMBOL of CPA
(it's a horrible API after all).
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch converts various users of change_page_attr() to the new,
more intent driven set_page_*/set_memory_* API set.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Right now, if drivers or other code want to change, say, a cache attribute of a
page, the only API they have is change_page_attr(). c-p-a is a really bad API
for this, because it forces the caller to know *ALL* the attributes he wants
for the page, not just the 1 thing he wants to change. So code that wants to
set a page uncachable, needs to be aware of the NX status as well etc etc etc.
This patch introduces a set of new APIs for this, set_pages_<attr> and
set_memory_<attr>, that offer a logical change to the user, and leave all
attributes not implied by the requested logical change alone.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Unify the now identical ioremap_32.c and ioremap_64.c into the
same ioremap.c file. No code changed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When ioremap_page_range fails, then we can use remove_vm_area instead
of vunmap safely.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use change_page_attr_addr() instead of change_page_attr(), which
simplifies the code significantly and matches the 64bit
implementation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make c_p_a unconditional for ioremap and iounmap. This ensures
complete consistency of the flags which are handed to
ioremap_page_range and the real flags in the mappings.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
64bit uses end_pfn_map and 32bit uses max_low_pfn. There are several
files which have #ifdef'ed defines which map either to end_pfn_map or
max_low_pfn. Replace this by a universal define and clean up all the
other instances.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Get rid of the douplicate define of ISA_START/END_ADDRESS and use the
same headers in 32 and 64 bit code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The pgprot flags which are handed into ioremap_page_range() are
different to those which are set in change_page_attr(). The
ioremap_page_range flags are executable, while the c_p_a flags are
not.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The pgprot flags which are handed into ioremap_page_range() are
different to those which are set in change_page_attr(). The
ioremap_page_range flags are executable, while the c_p_a flags are
not. Also make the mappings global (which is a NOP currently on 32bit,
although CPUs from PPRO+ onwards support it, but that's a separate
fix.)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
What the check_exec() function really is trying to do is enforce certain
bits in the pgprot that are required by the x86 architecture, but that
callers might not be aware of (such as NX bit exclusion of the BIOS
area for BIOS based PCI access; it's not uncommon to ioremap the BIOS
region for various purposes and normally ioremap() memory has the NX bit
set).
This patch turns the check_exec() function into static_protections()
which also is now used to make sure the kernel text area remains non-NX
and that the .rodata section remains read-only. If the architecture
ends up requiring more such mandatory prot settings for specific areas,
this is now a reasonable place to add these.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch fixes a bug of ioremap_nocache. ioremap_nocache() will call
__ioremap() with flags != 0 to do the real work, which will call
change_page_attr_addr() if phys_addr + size - 1 < (end_pfn_map << PAGE_SHIFT).
But some pages between 0 ~ end_pfn_map << PAGE_SHIFT are not mapped by
identity map, this will make change_page_attr_addr failed.
This patch is based on latest x86 git and has been tested on x86_64 platform.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch fixes a bug of change_page_attr/change_page_attr_addr on
Intel i386/x86_64 CPUs. After changing page attribute to be
executable with these functions, the page remains un-executable on
Intel i386/x86_64 CPU. Because on Intel i386/x86_64 CPU, only if the
"NX" bits of all three level page tables are cleared (PAE is enabled),
the corresponding page is executable (refer to section 4.13.2 of Intel
64 and IA-32 Architectures Software Developer's Manual). So, the bug
is fixed through clearing the "NX" bit of PMD when splitting the huge
PMD.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
do some leftover cleanups in the now unified arch/x86/mm/pageattr.c
file.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
unify the now perfectly identical pageattr_32/64.c files - no code changed.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
backmerge 64-bit details into 32-bit pageattr.c.
the pageattr_32.c and pageattr_64.c files are now identical.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
careful: might change driver behavior - but this is the right
return value.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
prepare for the unification of the cpa code, by unifying the
lookup_address() logic between 32-bit and 64-bit.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
prepare for the unification of the cpa code, by unifying the
lookup_address() logic between 32-bit and 64-bit.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
get more testing of the c_p_a() code done by not turning off
PSE on DEBUG_PAGEALLOC.
this simplifies the early pagetable setup code, and tests
the largepage-splitup code quite heavily.
In the end, all the largepages will be split up pretty quickly,
so there's no difference to how DEBUG_PAGEALLOC worked before.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
further simplify cpa locking: since the largepage-split is a
slowpath, use the pgd_lock for the whole operation, intead
of the mmap_sem.
This also makes it suitable for DEBUG_PAGEALLOC purposes again.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>