redonkable/alistair23-linux

Author	SHA1	Message	Date
Linus Torvalds	35b740e466	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (106 commits) perf kvm: Fix copy & paste error in description perf script: Kill script_spec__delete perf top: Fix a memory leak perf stat: Introduce get_ratio_color() helper perf session: Remove impossible condition check perf tools: Fix feature-bits rework fallout, remove unused variable perf script: Add generic perl handler to process events perf tools: Use for_each_set_bit() to iterate over feature flags perf tools: Unify handling of features when writing feature section perf report: Accept fifos as input file perf tools: Moving code in some files perf tools: Fix out-of-bound access to struct perf_session perf tools: Continue processing header on unknown features perf tools: Improve macros for struct feature_ops perf: builtin-record: Document and check that mmap_pages must be a power of two. perf: builtin-record: Provide advice if mmap'ing fails with EPERM. perf tools: Fix truncated annotation perf script: look up thread using tid instead of pid perf tools: Look up thread names for system wide profiling perf tools: Fix comm for processes with named threads ...	2012-01-06 08:02:58 -08:00
Linus Torvalds	423d091dfe	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits) cpu: Export cpu_up() rcu: Apply ACCESS_ONCE() to rcu_boost() return value Revert "rcu: Permit rt_mutex_unlock() with irqs disabled" docs: Additional LWN links to RCU API rcu: Augment rcu_batch_end tracing for idle and callback state rcu: Add rcutorture tests for srcu_read_lock_raw() rcu: Make rcutorture test for hotpluggability before offlining CPUs driver-core/cpu: Expose hotpluggability to the rest of the kernel rcu: Remove redundant rcu_cpu_stall_suppress declaration rcu: Adaptive dyntick-idle preparation rcu: Keep invoking callbacks if CPU otherwise idle rcu: Irq nesting is always 0 on rcu_enter_idle_common rcu: Don't check irq nesting from rcu idle entry/exit rcu: Permit dyntick-idle with callbacks pending rcu: Document same-context read-side constraints rcu: Identify dyntick-idle CPUs on first force_quiescent_state() pass rcu: Remove dynticks false positives and RCU failures rcu: Reduce latency of rcu_prepare_for_idle() rcu: Eliminate RCU_FAST_NO_HZ grace-period hang rcu: Avoid needlessly IPIing CPUs at GP end ...	2012-01-06 08:02:40 -08:00
Linus Torvalds	4a2164a7db	Merge branch 'core-memblock-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'core-memblock-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) memblock: Reimplement memblock allocation using reverse free area iterator memblock: Kill early_node_map[] score: Use HAVE_MEMBLOCK_NODE_MAP s390: Use HAVE_MEMBLOCK_NODE_MAP mips: Use HAVE_MEMBLOCK_NODE_MAP ia64: Use HAVE_MEMBLOCK_NODE_MAP SuperH: Use HAVE_MEMBLOCK_NODE_MAP sparc: Use HAVE_MEMBLOCK_NODE_MAP powerpc: Use HAVE_MEMBLOCK_NODE_MAP memblock: Implement memblock_add_node() memblock: s/memblock_analyze()/memblock_allow_resize()/ and update users memblock: Track total size of regions automatically powerpc: Cleanup memblock usage memblock: Reimplement memblock_enforce_memory_limit() using __memblock_remove() memblock: Make memblock functions handle overflowing range @size memblock: Reimplement __memblock_remove() using memblock_isolate_range() memblock: Separate out memblock_isolate_range() from memblock_set_node() memblock: Kill memblock_init() memblock: Kill sentinel entries at the end of static region arrays memblock: Add __memblock_dump_all() ...	2012-01-06 07:54:53 -08:00
Eric Dumazet	ceb7b40b65	x86: Fix atomic64_xxx_cx8() functions It appears about all functions in arch/x86/lib/atomic64_cx8_32.S are wrong in case cmpxchg8b must be restarted, because LOCK_PREFIX macro defines a label "1" clashing with other local labels : 1: some_instructions LOCK_PREFIX cmpxchg8b (%ebp) jne 1b / jumps to beginning of LOCK_PREFIX ! A possible fix is to use a magic label "672" in LOCK_PREFIX asm definition, similar to the "671" one we defined in LOCK_PREFIX_HERE. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Jan Beulich <JBeulich@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1325608540.2320.103.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-01-04 15:01:56 +01:00
Jan Beulich	cdcd629869	x86: Fix and improve cmpxchg_double{,_local}() Just like the per-CPU ones they had several problems/shortcomings: Only the first memory operand was mentioned in the asm() operands, and the 2x64-bit version didn't have a memory clobber while the 2x32-bit one did. The former allowed the compiler to not recognize the need to re-load the data in case it had it cached in some register, while the latter was overly destructive. The types of the local copies of the old and new values were incorrect (the types of the pointed-to variables should be used here, to make sure the respective old/new variable types are compatible). The __dummy/__junk variables were pointless, given that local copies of the inputs already existed (and can hence be used for discarded outputs). The 32-bit variant of cmpxchg_double_local() referenced cmpxchg16b_local(). At once also: - change the return value type to what it really is: 'bool' - unify 32- and 64-bit variants - abstract out the common part of the 'normal' and 'local' variants Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/4F01F12A020000780006A19B@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-01-04 15:01:54 +01:00
Ingo Molnar	adaf4ed2ab	Merge commit 'v3.2-rc7' into x86/asm Merge reason: Update from -rc4 to -rc7. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-01-04 15:01:28 +01:00
Al Viro	f4ae40a6a5	switch debugfs to umode_t Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:56 -05:00
Al Viro	2c9ede55ec	switch device_get_devnode() and ->devnode() to umode_t * both callers of device_get_devnode() are only interested in lower 16bits and nobody tries to return anything wider than 16bit anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:55 -05:00
Linus Torvalds	7578ed02e4	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Fix raw_spin_unlock_irqrestore() usage oprofile, arm/sh: Fix oprofile_arch_exit() linkage issue	2011-12-29 17:09:16 -08:00
Alan Cox	7c9c3a1e5f	x86/intel config: Fix the APB_TIMER selection Seems Kconfig SELECT isn't selecting things hierarchically when selected. config APB_TIMER def_bool y if X86_INTEL_MID prompt "Intel MID APB Timer Support" if X86_INTEL_MID select DW_APB_TIMER depends on X86_INTEL_MID && SFI when we select APB_TIMER doesn't select DW_APB_TIMER so do it by hand. Signed-off-by: Alan Cox <alan@linux.intel.com> Link: http://lkml.kernel.org/n/tip-kpnaimplltk6d1lolusqj3ae@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-29 21:53:17 +01:00
Avi Kivity	222d21aa07	KVM: x86 emulator: implement RDPMC (0F 33) Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:24:43 +02:00
Avi Kivity	80bdec64c0	KVM: x86 emulator: fix RDPMC privilege check RDPMC is only privileged if CR4.PCE=0. check_rdpmc() already implements this, so all we need to do is drop the Priv flag. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:24:41 +02:00
Gleb Natapov	a6c06ed1a6	KVM: Expose the architectural performance monitoring CPUID leaf Provide a CPUID leaf that describes the emulated PMU. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:24:40 +02:00
Avi Kivity	fee84b079d	KVM: VMX: Intercept RDPMC Intercept RDPMC and forward it to the PMU emulation code. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:24:38 +02:00
Avi Kivity	332b56e484	KVM: SVM: Intercept RDPMC Intercept RDPMC and forward it to the PMU emulation code. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:24:37 +02:00
Avi Kivity	022cd0e840	KVM: Add generic RDPMC support Add a helper function that emulates the RDPMC instruction operation. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:24:35 +02:00
Gleb Natapov	f5132b0138	KVM: Expose a version 2 architectural PMU to a guests Use perf_events to emulate an architectural PMU, version 2. Based on PMU version 1 emulation by Avi Kivity. [avi: adjust for cpuid.c] [jan: fix anonymous field initialization for older gcc] Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:24:29 +02:00
Avi Kivity	8934208221	KVM: Expose kvm_lapic_local_deliver() Needed to deliver performance monitoring interrupts. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:23:39 +02:00
Takuya Yoshikawa	e0dac408d0	KVM: x86 emulator: Use opcode::execute for Group 9 instruction Group 9: 0F C7 Rename em_grp9() to em_cmpxchg8b() and register it. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:23:38 +02:00
Takuya Yoshikawa	c04ec8393f	KVM: x86 emulator: Use opcode::execute for Group 4/5 instructions Group 4: FE Group 5: FF Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:23:36 +02:00
Takuya Yoshikawa	c15af35f54	KVM: x86 emulator: Use opcode::execute for Group 1A instruction Group 1A: 8F Register em_pop() directly and remove em_grp1a(). Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:23:35 +02:00
Gleb Natapov	d546cb406e	KVM: drop bsp_vcpu pointer from kvm struct Drop bsp_vcpu pointer from kvm struct since its only use is incorrect anyway. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-12-27 11:22:32 +02:00
Jan Kiszka	a647795efb	KVM: x86: Consolidate PIT legacy test Move the test for KVM_PIT_FLAGS_HPET_LEGACY into create_pit_timer instead of replicating it on the caller site. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-12-27 11:22:30 +02:00
Jan Kiszka	bb5a798ad5	KVM: x86: Do not rely on implicit inclusions Works so far by change, but it is not guaranteed to stay like this. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-12-27 11:22:29 +02:00
Avi Kivity	43771ebfc9	KVM: Make KVM_INTEL depend on CPU_SUP_INTEL PMU virtualization needs to talk to Intel-specific bits of perf; these are only available when CPU_SUP_INTEL=y. Fixes arch/x86/built-in.o: In function `atomic_switch_perf_msrs': vmx.c:(.text+0x6b1d4): undefined reference to `perf_guest_get_msrs' Reported-by: Ingo Molnar <mingo@elte.hu> Reported-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-12-27 11:22:27 +02:00
Avi Kivity	9e31905f29	Merge remote-tracking branch 'tip/perf/core' into kvm-updates/3.3 * tip/perf/core: (66 commits) perf, x86: Expose perf capability to other modules perf, x86: Implement arch event mask as quirk x86, perf: Disable non available architectural events jump_label: Provide jump_label_key initializers jump_label, x86: Fix section mismatch perf, core: Rate limit perf_sched_events jump_label patching perf: Fix enable_on_exec for sibling events perf: Remove superfluous arguments perf, x86: Prefer fixed-purpose counters when scheduling perf, x86: Fix event scheduler for constraints with overlapping counters perf, x86: Implement event scheduler helper functions perf: Avoid a useless pmu_disable() in the perf-tick x86/tools: Add decoded instruction dump mode x86: Update instruction decoder to support new AVX formats x86/tools: Fix insn_sanity message outputs x86/tools: Fix instruction decoder message output x86: Fix instruction decoder to handle grouped AVX instructions x86/tools: Fix Makefile to build all test tools perf test: Soft errors shouldn't stop the "Validate PERF_RECORD_" test perf test: Validate PERF_RECORD_ events and perf_sample fields ... Signed-off-by: Avi Kivity <avi@redhat.com> * commit 'b3d9468a8bd218a695e3a0ff112cd4efd27b670a': (66 commits) perf, x86: Expose perf capability to other modules perf, x86: Implement arch event mask as quirk x86, perf: Disable non available architectural events jump_label: Provide jump_label_key initializers jump_label, x86: Fix section mismatch perf, core: Rate limit perf_sched_events jump_label patching perf: Fix enable_on_exec for sibling events perf: Remove superfluous arguments perf, x86: Prefer fixed-purpose counters when scheduling perf, x86: Fix event scheduler for constraints with overlapping counters perf, x86: Implement event scheduler helper functions perf: Avoid a useless pmu_disable() in the perf-tick x86/tools: Add decoded instruction dump mode x86: Update instruction decoder to support new AVX formats x86/tools: Fix insn_sanity message outputs x86/tools: Fix instruction decoder message output x86: Fix instruction decoder to handle grouped AVX instructions x86/tools: Fix Makefile to build all test tools perf test: Soft errors shouldn't stop the "Validate PERF_RECORD_" test perf test: Validate PERF_RECORD_ events and perf_sample fields ...	2011-12-27 11:22:24 +02:00
Sasha Levin	ff5c2c0316	KVM: Use memdup_user instead of kmalloc/copy_from_user Switch to using memdup_user when possible. This makes code more smaller and compact, and prevents errors. Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:21 +02:00
Sasha Levin	cdfca7b346	KVM: Use kmemdup() instead of kmalloc/memcpy Switch to kmemdup() in two places to shorten the code and avoid possible bugs. Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:20 +02:00
Jan Kiszka	234b639206	KVM: x86 emulator: Remove set-but-unused cr4 from check_cr_write This was probably copy&pasted from the cr0 case, but it's unneeded here. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:16 +02:00
Jan Kiszka	3d56cbdf35	KVM: MMU: Drop unused return value of kvm_mmu_remove_some_alloc_mmu_pages freed_pages is never evaluated, so remove it as well as the return code kvm_mmu_remove_some_alloc_mmu_pages so far delivered to its only user. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:15 +02:00
Alex,Shi	086c985501	KVM: use this_cpu_xxx replace percpu_xxx funcs percpu_xxx funcs are duplicated with this_cpu_xxx funcs, so replace them for further code clean up. And in preempt safe scenario, __this_cpu_xxx funcs has a bit better performance since __this_cpu_xxx has no redundant preempt_disable() Signed-off-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:13 +02:00
Xiao Guangrong	e37fa7853c	KVM: MMU: audit: inline audit function inline audit function and little cleanup Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:12 +02:00
Xiao Guangrong	d750ea2886	KVM: MMU: remove oos_shadow parameter The unsync code should be stable now, maybe it is the time to remove this parameter to cleanup the code a little bit Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:10 +02:00
Xiao Guangrong	e459e3228d	KVM: MMU: move the relevant mmu code to mmu.c Move the mmu code in kvm_arch_vcpu_init() to kvm_mmu_create() Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:09 +02:00
Xiao Guangrong	9edb17d55f	KVM: x86: remove the dead code of KVM_EXIT_HYPERCALL KVM_EXIT_HYPERCALL is not used anymore, so remove the code Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:07 +02:00
Xiao Guangrong	0375f7fad9	KVM: MMU: audit: replace mmu audit tracepoint with jump-label The tracepoint is only used to audit mmu code, it should not be exposed to user, let us replace it with jump-label. Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:05 +02:00
Sasha Levin	831bf664e9	KVM: Refactor and simplify kvm_dev_ioctl_get_supported_cpuid This patch cleans and simplifies kvm_dev_ioctl_get_supported_cpuid by using a table instead of duplicating code as Avi suggested. This patch also fixes a bug where kvm_dev_ioctl_get_supported_cpuid would return -E2BIG when amount of entries passed was just right. Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:02 +02:00
Liu, Jinsong	fb215366b3	KVM: expose latest Intel cpu new features (BMI1/BMI2/FMA/AVX2) to guest Intel latest cpu add 6 new features, refer http://software.intel.com/file/36945 The new feature cpuid listed as below: 1. FMA CPUID.EAX=01H:ECX.FMA[bit 12] 2. MOVBE CPUID.EAX=01H:ECX.MOVBE[bit 22] 3. BMI1 CPUID.EAX=07H,ECX=0H:EBX.BMI1[bit 3] 4. AVX2 CPUID.EAX=07H,ECX=0H:EBX.AVX2[bit 5] 5. BMI2 CPUID.EAX=07H,ECX=0H:EBX.BMI2[bit 8] 6. LZCNT CPUID.EAX=80000001H:ECX.LZCNT[bit 5] This patch expose these features to guest. Among them, FMA/MOVBE/LZCNT has already been defined, MOVBE/LZCNT has already been exposed. This patch defines BMI1/AVX2/BMI2, and exposes FMA/BMI1/AVX2/BMI2 to guest. Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:01 +02:00
Avi Kivity	00b27a3efb	KVM: Move cpuid code to new file The cpuid code has grown; put it into a separate file. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:21:49 +02:00
Takuya Yoshikawa	2b5e97e1fa	KVM: x86 emulator: Use opcode::execute for INS/OUTS from/to port in DX INSB : 6C INSW/INSD : 6D OUTSB : 6E OUTSW/OUTSD: 6F The I/O port address is read from the DX register when we decode the operand because we see the SrcDX/DstDX flag is set. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:17:46 +02:00
Xiao Guangrong	28a37544fb	KVM: introduce id_to_memslot function Introduce id_to_memslot to get memslot by slot id Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:17:39 +02:00
Xiao Guangrong	be6ba0f096	KVM: introduce kvm_for_each_memslot macro Introduce kvm_for_each_memslot to walk all valid memslot Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:17:37 +02:00
Xiao Guangrong	be593d6286	KVM: introduce update_memslots function Introduce update_memslots to update slot which will be update to kvm->memslots Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:17:35 +02:00
Xiao Guangrong	93a5cef07d	KVM: introduce KVM_MEM_SLOTS_NUM macro Introduce KVM_MEM_SLOTS_NUM macro to instead of KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:17:34 +02:00
Takuya Yoshikawa	ff227392cd	KVM: x86 emulator: Use opcode::execute for BSF/BSR BSF: 0F BC BSR: 0F BD Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-12-27 11:17:32 +02:00
Takuya Yoshikawa	e940b5c20f	KVM: x86 emulator: Use opcode::execute for CMPXCHG CMPXCHG: 0F B0, 0F B1 Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-12-27 11:17:31 +02:00
Takuya Yoshikawa	e1e210b0a7	KVM: x86 emulator: Use opcode::execute for WRMSR/RDMSR WRMSR: 0F 30 RDMSR: 0F 32 Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-12-27 11:17:29 +02:00
Takuya Yoshikawa	bc00f8d2c2	KVM: x86 emulator: Use opcode::execute for MOV to cr/dr MOV: 0F 22 (move to control registers) MOV: 0F 23 (move to debug registers) Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-12-27 11:17:28 +02:00
Takuya Yoshikawa	d4ddafcdf2	KVM: x86 emulator: Use opcode::execute for CALL CALL: E8 Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-12-27 11:17:26 +02:00
Takuya Yoshikawa	ce7faab24f	KVM: x86 emulator: Use opcode::execute for BT family BT : 0F A3 BTS: 0F AB BTR: 0F B3 BTC: 0F BB Group 8: 0F BA Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-12-27 11:17:25 +02:00
Takuya Yoshikawa	d7841a4b1b	KVM: x86 emulator: Use opcode::execute for IN/OUT IN : E4, E5, EC, ED OUT: E6, E7, EE, EF Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-12-27 11:17:23 +02:00
Gleb Natapov	46199f33c2	KVM: VMX: remove unneeded vmx_load_host_state() calls. vmx_load_host_state() does not handle msrs switching (except MSR_KERNEL_GS_BASE) since commit `26bb0981b3`. Remove call to it where it is no longer make sense. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:17:22 +02:00
Takuya Yoshikawa	95d4c16ce7	KVM: Optimize dirty logging by rmap_write_protect() Currently, write protecting a slot needs to walk all the shadow pages and checks ones which have a pte mapping a page in it. The walk is overly heavy when dirty pages in that slot are not so many and checking the shadow pages would result in unwanted cache pollution. To mitigate this problem, we use rmap_write_protect() and check only the sptes which can be reached from gfns marked in the dirty bitmap when the number of dirty pages are less than that of shadow pages. This criterion is reasonable in its meaning and worked well in our test: write protection became some times faster than before when the ratio of dirty pages are low and was not worse even when the ratio was near the criterion. Note that the locking for this write protection becomes fine grained. The reason why this is safe is descripted in the comments. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:17:20 +02:00
Takuya Yoshikawa	7850ac5420	KVM: Count the number of dirty pages for dirty logging Needed for the next patch which uses this number to decide how to write protect a slot. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:17:19 +02:00
Takuya Yoshikawa	9b9b149236	KVM: MMU: Split gfn_to_rmap() into two functions rmap_write_protect() calls gfn_to_rmap() for each level with gfn fixed. This results in calling gfn_to_memslot() repeatedly with that gfn. This patch introduces __gfn_to_rmap() which takes the slot as an argument to avoid this. This is also needed for the following dirty logging optimization. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:17:17 +02:00
Takuya Yoshikawa	d6eebf8b80	KVM: MMU: Clean up BUG_ON() conditions in rmap_write_protect() Remove redundant checks and use is_large_pte() macro. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:17:13 +02:00
Chris Wright	fb92045843	KVM: MMU: remove KVM host pv mmu support The host side pv mmu support has been marked for feature removal in January 2011. It's not in use, is slower than shadow or hardware assisted paging, and a maintenance burden. It's November 2011, time to remove it. Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:17:10 +02:00
Chris Wright	5202397df8	KVM guest: remove KVM guest pv mmu support This has not been used for some years now. It's time to remove it. Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:17:08 +02:00
Jan Kiszka	3f2e5260f5	KVM: x86: Simplify kvm timer handler The vcpu reference of a kvm_timer can't become NULL while the timer is valid, so drop this redundant test. This also makes it pointless to carry a separate __kvm_timer_fn, fold it into kvm_timer_fn. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-12-27 11:17:05 +02:00
Xiao Guangrong	a30f47cb15	KVM: MMU: improve write flooding detected Detecting write-flooding does not work well, when we handle page written, if the last speculative spte is not accessed, we treat the page is write-flooding, however, we can speculative spte on many path, such as pte prefetch, page synced, that means the last speculative spte may be not point to the written page and the written page can be accessed via other sptes, so depends on the Accessed bit of the last speculative spte is not enough Instead of detected page accessed, we can detect whether the spte is accessed after it is written, if the spte is not accessed but it is written frequently, we treat is not a page table or it not used for a long time Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:17:02 +02:00
Xiao Guangrong	5d9ca30e96	KVM: MMU: fix detecting misaligned accessed Sometimes, we only modify the last one byte of a pte to update status bit, for example, clear_bit is used to clear r/w bit in linux kernel and 'andb' instruction is used in this function, in this case, kvm_mmu_pte_write will treat it as misaligned access, and the shadow page table is zapped Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:17:01 +02:00
Xiao Guangrong	889e5cbced	KVM: MMU: split kvm_mmu_pte_write function kvm_mmu_pte_write is too long, we split it for better readable Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:59 +02:00
Xiao Guangrong	f8734352c6	KVM: MMU: remove unnecessary kvm_mmu_free_some_pages In kvm_mmu_pte_write, we do not need to alloc shadow page, so calling kvm_mmu_free_some_pages is really unnecessary Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:58 +02:00
Xiao Guangrong	f57f2ef58f	KVM: MMU: fast prefetch spte on invlpg path Fast prefetch spte for the unsync shadow page on invlpg path Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:56 +02:00
Xiao Guangrong	505aef8f30	KVM: MMU: cleanup FNAME(invlpg) Directly Use mmu_page_zap_pte to zap spte in FNAME(invlpg), also remove the same code between FNAME(invlpg) and FNAME(sync_page) Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:54 +02:00
Xiao Guangrong	d01f8d5e02	KVM: MMU: do not mark accessed bit on pte write path In current code, the accessed bit is always set when page fault occurred, do not need to set it on pte write path Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:53 +02:00
Xiao Guangrong	6f6fbe98c3	KVM: x86: cleanup port-in/port-out emulated Remove the same code between emulator_pio_in_emulated and emulator_pio_out_emulated Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:51 +02:00
Xiao Guangrong	1cb3f3ae5a	KVM: x86: retry non-page-table writing instructions If the emulation is caused by #PF and it is non-page_table writing instruction, it means the VM-EXIT is caused by shadow page protected, we can zap the shadow page and retry this instruction directly The idea is from Avi Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:50 +02:00
Xiao Guangrong	d5ae7ce835	KVM: x86: tag the instructions which are used to write page table The idea is from Avi: \| tag instructions that are typically used to modify the page tables, and \| drop shadow if any other instruction is used. \| The list would include, I'd guess, and, or, bts, btc, mov, xchg, cmpxchg, \| and cmpxchg8b. This patch is used to tag the instructions and in the later path, shadow page is dropped if it is written by other instructions Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:48 +02:00
Xiao Guangrong	f759e2b4c7	KVM: MMU: avoid pte_list_desc running out in kvm_mmu_pte_write kvm_mmu_pte_write is unsafe since we need to alloc pte_list_desc in the function when spte is prefetched, unfortunately, we can not know how many spte need to be prefetched on this path, that means we can use out of the free pte_list_desc object in the cache, and BUG_ON() is triggered, also some path does not fill the cache, such as INS instruction emulated that does not trigger page fault Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:47 +02:00
Nadav Har'El	51cfe38ea5	KVM: nVMX: Fix warning-causing idt-vectoring-info behavior When L0 wishes to inject an interrupt while L2 is running, it emulates an exit to L1 with EXIT_REASON_EXTERNAL_INTERRUPT. This was explained in the original nVMX patch 23, titled "Correct handling of interrupt injection". Unfortunately, it is possible (though rare) that at this point there is valid idt_vectoring_info in vmcs02. For example, L1 injected some interrupt to L2, and when L2 tried to run this interrupt's handler, it got a page fault - so it returns the original interrupt vector in idt_vectoring_info. The problem is that if this is the case, we cannot exit to L1 with EXTERNAL_INTERRUPT like we wished to, because the VMX spec guarantees that idt_vectoring_info and exit_reason_external_interrupt can never happen together. This is not just specified in the spec - a KVM L1 actually prints a kernel warning "unexpected, valid vectoring info" if we violate this guarantee, and some users noticed these warnings in L1's logs. In order to better emulate a processor, which would never return the external interrupt and the idt-vectoring-info together, we need to separate the two injection steps: First, complete L1's injection into L2 (i.e., enter L2, injecting to it the idt-vectoring-info); Second, after entry into L2 succeeds and it exits back to L0, exit to L1 with the EXIT_REASON_EXTERNAL_INTERRUPT. Most of this is already in the code - the only change we need is to remain in L2 (and not exit to L1) in this case. Note that the previous patch ensures (by using KVM_REQ_IMMEDIATE_EXIT) that although we do enter L2 first, it will exit immediately after processing its injection, allowing us to promptly inject to L1. Note how we test vmcs12->idt_vectoring_info_field; This isn't really the vmcs12 value (we haven't exited to L1 yet, so vmcs12 hasn't been updated), but rather the place we save, at the end of vmx_vcpu_run, the vmcs02 value of this field. This was explained in patch 25 ("Correct handling of idt vectoring info") of the original nVMX patch series. Thanks to Dave Allan and to Federico Simoncelli for reporting this bug, to Abel Gordon for helping me figure out the solution, and to Avi Kivity for helping to improve it. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:45 +02:00
Nadav Har'El	d6185f20a0	KVM: nVMX: Add KVM_REQ_IMMEDIATE_EXIT This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT. This bit requests that when next entering the guest, we should run it only for as little as possible, and exit again. We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1 to continue running so it can inject an event to it, we unfortunately cannot just pretend to have run L2 for a little while - We must really launch L2, otherwise certain one-off vmcs12 parameters (namely, L1 injection into L2) will be lost. So the existing code runs L2 in this case. But L2 could potentially run for a long time until it exits, and the injection into L1 will be delayed. The new KVM_REQ_IMMEDIATE_EXIT allows us to request that L2 will be entered, as necessary, but will exit as soon as possible after entry. Our implementation of this request uses smp_send_reschedule() to send a self-IPI, with interrupts disabled. The interrupts remain disabled until the guest is entered, and then, after the entry is complete (often including processing an injection and jumping to the relevant handler), the physical interrupt is noticed and causes an exit. On recent Intel processors, we could have achieved the same goal by using MTF instead of a self-IPI. Another technique worth considering in the future is to use VM_EXIT_ACK_INTR_ON_EXIT and a highest-priority vector IPI - to slightly improve performance by avoiding the useless interrupt handler which ends up being called when smp_send_reschedule() is used. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:43 +02:00
Jan Kiszka	4d25a066b6	KVM: Don't automatically expose the TSC deadline timer in cpuid Unlike all of the other cpuid bits, the TSC deadline timer bit is set unconditionally, regardless of what userspace wants. This is broken in several ways: - if userspace doesn't use KVM_CREATE_IRQCHIP, and doesn't emulate the TSC deadline timer feature, a guest that uses the feature will break - live migration to older host kernels that don't support the TSC deadline timer will cause the feature to be pulled from under the guest's feet; breaking it - guests that are broken wrt the feature will fail. Fix by not enabling the feature automatically; instead report it to userspace. Because the feature depends on KVM_CREATE_IRQCHIP, which we cannot guarantee will be called, we expose it via a KVM_CAP_TSC_DEADLINE_TIMER and not KVM_GET_SUPPORTED_CPUID. Fixes the Illumos guest kernel, which uses the TSC deadline timer feature. [avi: add the KVM_CAP + documentation] Reported-by: Alexey Zaytsev <alexey.zaytsev@gmail.com> Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-26 13:27:44 +02:00
Rafael J. Wysocki	b7ba68c4a0	Merge branch 'pm-sleep' into pm-for-linus * pm-sleep: (51 commits) PM: Drop generic_subsys_pm_ops PM / Sleep: Remove forward-only callbacks from AMBA bus type PM / Sleep: Remove forward-only callbacks from platform bus type PM: Run the driver callback directly if the subsystem one is not there PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers PM / Sleep: Merge internal functions in generic_ops.c PM / Sleep: Simplify generic system suspend callbacks PM / Hibernate: Remove deprecated hibernation snapshot ioctls PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled() PM / Sleep: Recommend [un]lock_system_sleep() over using pm_mutex directly PM / Sleep: Replace mutex_[un]lock(&pm_mutex) with [un]lock_system_sleep() PM / Sleep: Make [un]lock_system_sleep() generic PM / Sleep: Use the freezer_count() functions in [un]lock_system_sleep() APIs PM / Freezer: Remove the "userspace only" constraint from freezer[_do_not]_count() PM / Hibernate: Replace unintuitive 'if' condition in kernel/power/user.c with 'else' Freezer / sunrpc / NFS: don't allow TASK_KILLABLE sleeps to block the freezer PM / Sleep: Unify diagnostic messages from device suspend/resume ACPI / PM: Do not save/restore NVS on Asus K54C/K54HR PM / Hibernate: Remove deprecated hibernation test modes PM / Hibernate: Thaw processes in SNAPSHOT_CREATE_IMAGE ioctl test path ... Conflicts: kernel/kmod.c	2011-12-25 23:42:20 +01:00
Jan Kiszka	0924ab2cfa	KVM: x86: Prevent starting PIT timers in the absence of irqchip support User space may create the PIT and forgets about setting up the irqchips. In that case, firing PIT IRQs will crash the host: BUG: unable to handle kernel NULL pointer dereference at 0000000000000128 IP: [<ffffffffa10f6280>] kvm_set_irq+0x30/0x170 [kvm] ... Call Trace: [<ffffffffa11228c1>] pit_do_work+0x51/0xd0 [kvm] [<ffffffff81071431>] process_one_work+0x111/0x4d0 [<ffffffff81071bb2>] worker_thread+0x152/0x340 [<ffffffff81075c8e>] kthread+0x7e/0x90 [<ffffffff815a4474>] kernel_thread_helper+0x4/0x10 Prevent this by checking the irqchip mode before starting a timer. We can't deny creating the PIT if the irqchips aren't set up yet as current user land expects this order to work. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-12-25 17:13:18 +02:00
Suresh Siddha	c284b42aba	x86: Skip cpus with apic-ids >= 255 in !x2apic_mode If the x2apic mode is disabled for reasons like interrupt-remapping not available etc, then we need to skip the logical cpu bringup of apic-id's >= 255. Otherwise as the platform is in xapic mode, init/startup IPI's will consider only the low 8-bits and there is a possibility of re-sending init/startup IPI's to the logical cpu that is already online. This will avoid potential reboots/unpredictable behavior etc. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/20111222014632.702932458@sbsiddha-desk.sc.intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-12-23 11:01:49 -08:00
Yinghai Lu	a31bc32760	x86, x2apic: Allow "nox2apic" to disable x2apic mode setup by BIOS Currently "nox2apic" boot parameter was not enabling x2apic mode if the cpu, kernel are all capable of enabling x2apic mode and the OS handover happened in xapic mode. However If the bios enabled x2apic prior to OS handover, using "nox2apic" boot parameter had no effect. If the boot cpu's apicid is < 255, enable "nox2apic" boot parameter to disable the x2apic mode setup by the bios. This will enable the kernel to fallback to xapic mode and bringup only the cpu's which has apic-id < 255. -v2: fix patch error and two compiling warning make disable_x2apic to be __init Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/CAE9FiQUeB-3uxJAMiHsz=uPWoFv5Hg1pVepz7aU6YtqOxMC-=Q@mail.gmail.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-12-23 11:01:43 -08:00
Yinghai Lu	fb209bd891	x86, x2apic: Fallback to xapic when BIOS doesn't setup interrupt-remapping On some of the recent Intel SNB platforms, by default bios is pre-enabling x2apic mode in the cpu with out setting up interrupt-remapping. This case was resulting in the kernel to panic as the cpu is already in x2apic mode but the OS was not able to enable interrupt-remapping (which is a pre-req for using x2apic capability). On these platforms all the apic-ids are < 255 and the kernel can fallback to xapic mode if the bios has not enabled interrupt-remapping (which is mostly the case if the bios has not exported interrupt-remapping tables to the OS). Reported-by: Berck E. Nash <flyboy@gmail.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/20111222014632.600418637@sbsiddha-desk.sc.intel.com Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-12-23 11:01:01 -08:00
Yinghai Lu	a35fd28256	x86, acpi: Skip acpi x2apic entries if the x2apic feature is not present If the x2apic feature is not present (either the cpu is not capable of it or the user has disabled the feature using boot-parameter etc), ignore the x2apic MADT and SRAT entries provided by the ACPI tables. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/20111222014632.540896503@sbsiddha-desk.sc.intel.com Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-12-23 11:00:50 -08:00
Yinghai Lu	e8524b2f43	x86, apic: Add probe() for apic_flat Currently we start with the default apic_flat mode and switch to some other apic model depending on the apic drivers acpi_madt_oem_check() routines and later followed by the apic drivers probe() routines. Once we selected non flat mode there was no case where we fall back to flat mode again. Upcoming changes allow bios-enabled x2apic mode to be disabled by the OS if interrupt-remapping etc is not setup properly by the bios. We now has a case for the apic to fall back to legacy flat mode during apic driver probe() seqeuence. Add a simple flat_probe() which allows the apic_flat mode to be the last fallback option. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/20111222014632.484984298@sbsiddha-desk.sc.intel.com Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-12-23 11:00:45 -08:00
Robert Richter	2e64694de2	perf/x86: Fix raw_spin_unlock_irqrestore() usage Use raw_spin_unlock_irqrestore() as equivalent to raw_spin_lock_irqsave(). Signed-off-by: Robert Richter <robert.richter@amd.com> Cc: Stephane Eranian <eranian@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1324646665-13334-1-git-send-email-robert.richter@amd.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-23 17:57:01 +01:00
Christoph Lameter	933393f58f	percpu: Remove irqsafe_cpu_xxx variants We simply say that regular this_cpu use must be safe regardless of preemption and interrupt state. That has no material change for x86 and s390 implementations of this_cpu operations. However, arches that do not provide their own implementation for this_cpu operations will now get code generated that disables interrupts instead of preemption. -tj: This is part of on-going percpu API cleanup. For detailed discussion of the subject, please refer to the following thread. http://thread.gmane.org/gmane.linux.kernel/1222078 Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <alpine.DEB.2.00.1112221154380.11787@router.home>	2011-12-22 10:40:20 -08:00
Linus Torvalds	ecefc36b41	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: net: Add a flow_cache_flush_deferred function ipv4: reintroduce route cache garbage collector net: have ipconfig not wait if no dev is available sctp: Do not account for sizeof(struct sk_buff) in estimated rwnd asix: new device id davinci-cpdma: fix locking issue in cpdma_chan_stop sctp: fix incorrect overflow check on autoclose r8169: fix Config2 MSIEnable bit setting. llc: llc_cmsg_rcv was getting called after sk_eat_skb. net: bpf_jit: fix an off-one bug in x86_64 cond jump target iwlwifi: update SCD BC table for all SCD queues Revert "Bluetooth: Revert: Fix L2CAP connection establishment" Bluetooth: Clear RFCOMM session timer when disconnecting last channel Bluetooth: Prevent uninitialized data access in L2CAP configuration iwlwifi: allow to switch to HT40 if not associated iwlwifi: tx_sync only on PAN context mwifiex: avoid double list_del in command cancel path ath9k: fix max phy rate at rate control init nfc: signedness bug in __nci_request() iwlwifi: do not set the sequence control bit is not needed	2011-12-21 18:29:26 -08:00
Kay Sievers	edbaa603eb	driver-core: remove sysdev.h usage. The sysdev.h file should not be needed by any in-kernel code, so remove the .h file from these random files that seem to still want to include it. The sysdev code will be going away soon, so this include needs to be removed no matter what. Cc: Jiandong Zheng <jdzheng@broadcom.com> Cc: Scott Branden <sbranden@broadcom.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Kukjin Kim <kgene.kim@samsung.com> Cc: David Brown <davidb@codeaurora.org> Cc: Daniel Walker <dwalker@fifo99.com> Cc: Bryan Huntsman <bryanh@codeaurora.org> Cc: Ben Dooks <ben-linux@fluff.org> Cc: Wan ZongShun <mcuos.com@gmail.com> Cc: Haavard Skinnemoen <hskinnemoen@gmail.com> Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: "Venkatesh Pallipadi Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Grant Likely <grant.likely@secretlab.ca> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Matthew Garrett <mjg@redhat.com> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>	2011-12-21 16:26:03 -08:00
Kay Sievers	8a25a2fd12	cpu: convert 'cpu' and 'machinecheck' sysdev_class to a regular subsystem This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem and converts the devices to regular devices. The sysdev drivers are implemented as subsystem interfaces now. After all sysdev classes are ported to regular driver core entities, the sysdev implementation will be entirely removed from the kernel. Userspace relies on events and generic sysfs subsystem infrastructure from sysdev devices, which are made available with this conversion. Cc: Haavard Skinnemoen <hskinnemoen@gmail.com> Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Borislav Petkov <bp@amd64.org> Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk> Cc: Len Brown <lenb@kernel.org> Cc: Zhang Rui <rui.zhang@intel.com> Cc: Dave Jones <davej@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2011-12-21 14:29:42 -08:00
Rafael J. Wysocki	b00f4dc5ff	Merge branch 'master' into pm-sleep * master: (848 commits) SELinux: Fix RCU deref check warning in sel_netport_insert() binary_sysctl(): fix memory leak mm/vmalloc.c: remove static declaration of va from __get_vm_area_node ipmi_watchdog: restore settings when BMC reset oom: fix integer overflow of points in oom_badness memcg: keep root group unchanged if creation fails nilfs2: potential integer overflow in nilfs_ioctl_clean_segments() nilfs2: unbreak compat ioctl cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask evm: prevent racing during tfm allocation evm: key must be set once during initialization mmc: vub300: fix type of firmware_rom_wait_states module parameter Revert "mmc: enable runtime PM by default" mmc: sdhci: remove "state" argument from sdhci_suspend_host x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT IB/qib: Correct sense on freectxts increment and decrement RDMA/cma: Verify private data length cgroups: fix a css_set not found bug in cgroup_attach_proc oprofile: Fix uninitialized memory access when writing to writing to oprofilefs Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel" ... Conflicts: kernel/cgroup_freezer.c	2011-12-21 21:59:45 +01:00
Steven Rostedt	42181186ad	x86: Add counter when debug stack is used with interrupts enabled Mathieu Desnoyers pointed out a case that can cause issues with NMIs running on the debug stack: int3 -> interrupt -> NMI -> int3 Because the interrupt changes the stack, the NMI will not see that it preempted the debug stack. Looking deeper at this case, interrupts only happen when the int3 is from userspace or in an a location in the exception table (fixup). userspace -> int3 -> interurpt -> NMI -> int3 All other int3s that happen in the kernel should be processed without ever enabling interrupts, as the do_trap() call will panic the kernel if it is called to process any other location within the kernel. Adding a counter around the sections that enable interrupts while using the debug stack allows the NMI to also check that case. If the NMI sees that it either interrupted a task using the debug stack or the debug counter is non-zero, then it will have to change the IDT table to make the int3 not change stacks (which will corrupt the stack if it does). Note, I had to move the debug_usage functions out of processor.h and into debugreg.h because of the static inlined functions to inc and dec the debug_usage counter. __get_cpu_var() requires smp.h which includes processor.h, and would fail to build. Link: http://lkml.kernel.org/r/1323976535.23971.112.camel@gandalf.stny.rr.com Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Paul Turner <pjt@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-12-21 15:38:56 -05:00
Steven Rostedt	ccd49c2391	x86: Allow NMIs to hit breakpoints in i386 With i386, NMIs and breakpoints use the current stack and they do not reset the stack pointer to a fix point that might corrupt a previous NMI or breakpoint (as it does in x86_64). But NMIs are still not made to be re-entrant, and need to prevent the case that an NMI hitting a breakpoint (which does an iret), doesn't allow another NMI to run. The fix is to let the NMI be in 3 different states: 1) not running 2) executing 3) latched When no NMI is executing on a given CPU, the state is "not running". When the first NMI comes in, the state is switched to "executing". On exit of that NMI, a cmpxchg is performed to switch the state back to "not running" and if that fails, the NMI is restarted. If a breakpoint is hit and does an iret, which re-enables NMIs, and another NMI comes in before the first NMI finished, it will detect that the state is not in the "not running" state and the current NMI is nested. In this case, the state is switched to "latched" to let the interrupted NMI know to restart the NMI handler, and the nested NMI exits without doing anything. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Paul Turner <pjt@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-12-21 15:38:55 -05:00
Steven Rostedt	228bdaa95f	x86: Keep current stack in NMI breakpoints We want to allow NMI handlers to have breakpoints to be able to remove stop_machine from ftrace, kprobes and jump_labels. But if an NMI interrupts a current breakpoint, and then it triggers a breakpoint itself, it will switch to the breakpoint stack and corrupt the data on it for the breakpoint processing that it interrupted. Instead, have the NMI check if it interrupted breakpoint processing by checking if the stack that is currently used is a breakpoint stack. If it is, then load a special IDT that changes the IST for the debug exception to keep the same stack in kernel context. When the NMI is done, it puts it back. This way, if the NMI does trigger a breakpoint, it will keep using the same stack and not stomp on the breakpoint data for the breakpoint it interrupted. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-12-21 15:38:55 -05:00
Steven Rostedt	3f3c8b8c4b	x86: Add workaround to NMI iret woes In x86, when an NMI goes off, the CPU goes into an NMI context that prevents other NMIs to trigger on that CPU. If an NMI is suppose to trigger, it has to wait till the previous NMI leaves NMI context. At that time, the next NMI can trigger (note, only one more NMI will trigger, as only one can be latched at a time). The way x86 gets out of NMI context is by calling iret. The problem with this is that this causes problems if the NMI handle either triggers an exception, or a breakpoint. Both the exception and the breakpoint handlers will finish with an iret. If this happens while in NMI context, the CPU will leave NMI context and a new NMI may come in. As NMI handlers are not made to be re-entrant, this can cause havoc with the system, not to mention, the nested NMI will write all over the previous NMI's stack. Linus Torvalds proposed the following workaround to this problem: https://lkml.org/lkml/2010/7/14/264 "In fact, I wonder if we couldn't just do a software NMI disable instead? Hav ea per-cpu variable (in the _core_ percpu areas that get allocated statically) that points to the NMI stack frame, and just make the NMI code itself do something like NMI entry: - load percpu NMI stack frame pointer - if non-zero we know we're nested, and should ignore this NMI: - we're returning to kernel mode, so return immediately by using "popf/ret", which also keeps NMI's disabled in the hardware until the "real" NMI iret happens. - before the popf/iret, use the NMI stack pointer to make the NMI return stack be invalid and cause a fault - set the NMI stack pointer to the current stack pointer NMI exit (not the above "immediate exit because we nested"): clear the percpu NMI stack pointer Just do the iret. Now, the thing is, now the "iret" is atomic. If we had a nested NMI, we'll take a fault, and that re-does our "delayed" NMI - and NMI's will stay masked. And if we didn't have a nested NMI, that iret will now unmask NMI's, and everything is happy." I first tried to follow this advice but as I started implementing this code, a few gotchas showed up. One, is accessing per-cpu variables in the NMI handler. The problem is that per-cpu variables use the %gs register to get the variable for the given CPU. But as the NMI may happen in userspace, we must first perform a SWAPGS to get to it. The NMI handler already does this later in the code, but its too late as we have saved off all the registers and we don't want to do that for a disabled NMI. Peter Zijlstra suggested to keep all variables on the stack. This simplifies things greatly and it has the added benefit of cache locality. Two, faulting on the iret. I really wanted to make this work, but it was becoming very hacky, and I never got it to be stable. The iret already had a fault handler for userspace faulting with bad segment registers, and getting NMI to trigger a fault and detect it was very tricky. But for strange reasons, the system would usually take a double fault and crash. I never figured out why and decided to go with a simple "jmp" approach. The new approach I took also simplified things. Finally, the last problem with Linus's approach was to have the nested NMI handler do a ret instead of an iret to give the first NMI NMI-context again. The problem is that ret is much more limited than an iret. I couldn't figure out how to get the stack back where it belonged. I could have copied the current stack, pushed the return onto it, but my fear here is that there may be some place that writes data below the stack pointer. I know that is not something code should depend on, but I don't want to chance it. I may add this feature later, but for now, an NMI handler that loses NMI context will not get it back. Here's what is done: When an NMI comes in, the HW pushes the interrupt stack frame onto the per cpu NMI stack that is selected by the IST. A special location on the NMI stack holds a variable that is set when the first NMI handler runs. If this variable is set then we know that this is a nested NMI and we process the nested NMI code. There is still a race when this variable is cleared and an NMI comes in just before the first NMI does the return. For this case, if the variable is cleared, we also check if the interrupted stack is the NMI stack. If it is, then we process the nested NMI code. Why the two tests and not just test the interrupted stack? If the first NMI hits a breakpoint and loses NMI context, and then it hits another breakpoint and while processing that breakpoint we get a nested NMI. When processing a breakpoint, the stack changes to the breakpoint stack. If another NMI comes in here we can't rely on the interrupted stack to be the NMI stack. If the variable is not set and the interrupted task's stack is not the NMI stack, then we know this is the first NMI and we can process things normally. But in order to do so, we need to do a few things first. 1) Set the stack variable that tells us that we are in an NMI handler 2) Make two copies of the interrupt stack frame. One copy is used to return on iret The other is used to restore the first one if we have a nested NMI. This is what the stack will look like: +-------------------------+ \| original SS \| \| original Return RSP \| \| original RFLAGS \| \| original CS \| \| original RIP \| +-------------------------+ \| temp storage for rdx \| +-------------------------+ \| NMI executing variable \| +-------------------------+ \| Saved SS \| \| Saved Return RSP \| \| Saved RFLAGS \| \| Saved CS \| \| Saved RIP \| +-------------------------+ \| copied SS \| \| copied Return RSP \| \| copied RFLAGS \| \| copied CS \| \| copied RIP \| +-------------------------+ \| pt_regs \| +-------------------------+ The original stack frame contains what the HW put in when we entered the NMI. We store %rdx as a temp variable to use. Both the original HW stack frame and this %rdx storage will be clobbered by nested NMIs so we can not rely on them later in the first NMI handler. The next item is the special stack variable that is set when we execute the rest of the NMI handler. Then we have two copies of the interrupt stack. The second copy is modified by any nested NMIs to let the first NMI know that we triggered a second NMI (latched) and that we should repeat the NMI handler. If the first NMI hits an exception or breakpoint that takes it out of NMI context, if a second NMI comes in before the first one finishes, it will update the copied interrupt stack to point to a fix up location to trigger another NMI. When the first NMI calls iret, it will instead jump to the fix up location. This fix up location will copy the saved interrupt stack back to the copy and execute the nmi handler again. Note, the nested NMI knows enough to check if it preempted a previous NMI handler while it is in the fixup location. If it has, it will not modify the copied interrupt stack and will just leave as if nothing happened. As the NMI handle is about to execute again, there's no reason to latch now. To test all this, I forced the NMI handler to call iret and take itself out of NMI context. I also added assemble code to write to the serial to make sure that it hits the nested path as well as the fix up path. Everything seems to be working fine. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Paul Turner <pjt@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-12-21 15:38:54 -05:00
Steven Rostedt	1fd466efc8	x86: Document the NMI handler about not using paranoid_exit Linus cleaned up the NMI handler but it still needs some comments to explain why it uses save_paranoid but not paranoid_exit. Just to keep others from adding that in the future, document why it's not used. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-12-21 15:38:53 -05:00
Linus Torvalds	549c89b98c	x86: Do not schedule while still in NMI context The NMI handler uses the paranoid_exit routine that checks the NEED_RESCHED flag, and if it is set and the return is for userspace, then interrupts are enabled, the stack is swapped to the thread's stack, and schedule is called. The problem with this is that we are still in an NMI context until an iret is executed. This means that any new NMIs are now starved until an interrupt or exception occurs and does the iret. As NMIs can not be masked and can interrupt any location, they are treated as a special case. NEED_RESCHED should not be set in an NMI handler. The interruption by the NMI should not disturb the work flow for scheduling. Any IPI sent to a processor after sending the NEED_RESCHED would have to wait for the NMI anyway, and after the IPI finishes the schedule would be called as required. There is no reason to do anything special leaving an NMI. Remove the call to paranoid_exit and do a simple return. This not only fixes the bug of starved NMIs, but it also cleans up the code. Link: http://lkml.kernel.org/r/CA+55aFzgM55hXTs4griX5e9=v_O+=ue+7Rj0PTD=M7hFYpyULQ@mail.gmail.com Acked-by: Andi Kleen <ak@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Paul Turner <pjt@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-12-21 15:38:52 -05:00
Stephane Eranian	9c1497ea59	perf events: Add Intel x86 mapping for PERF_COUNT_HW_REF_CPU_CYCLES Add event maps for Intel x86 processors (with architected PMU v2 or later). On AMD, there is frequency scaling but no Turbo. There is no core cycle event not subject to frequency scaling, therefore we do not provide a mapping. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1323559734-3488-4-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-21 10:26:39 +01:00
Stephane Eranian	cd09c0c40a	perf events: Enable raw event support for Intel unhalted_reference_cycles event This patch adds the encoding and definitions necessary for the unhalted_reference_cycles event avaialble since Intel Core 2 processors. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1323559734-3488-2-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-21 10:26:32 +01:00
Kevin Winchester	141168c36c	x86: Simplify code by removing a !SMP #ifdefs from 'struct cpuinfo_x86' Several fields in struct cpuinfo_x86 were not defined for the !SMP case, likely to save space. However, those fields still have some meaning for UP, and keeping them allows some #ifdef removal from other files. The additional size of the UP kernel from this change is not significant enough to worry about keeping up the distinction: text data bss dec hex filename 4737168 506459 972040 6215667 5ed7f3 vmlinux.o.before 4737444 506459 972040 6215943 5ed907 vmlinux.o.after for a difference of 276 bytes for an example UP config. If someone wants those 276 bytes back badly then it should be implemented in a cleaner way. Signed-off-by: Kevin Winchester <kjwinchester@gmail.com> Cc: Steffen Persvold <sp@numascale.com> Link: http://lkml.kernel.org/r/1324428742-12498-1-git-send-email-kjwinchester@gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-21 09:25:09 +01:00
Konrad Rzeszutek Wilk	cb85f123cd	Merge commit 'v3.2-rc3' into stable/for-linus-3.3 * commit 'v3.2-rc3': (412 commits) Linux 3.2-rc3 virtio-pci: make reset operation safer virtio-mmio: Correct the name of the guest features selector virtio: add HAS_IOMEM dependency to MMIO platform bus driver eCryptfs: Extend array bounds for all filename chars eCryptfs: Flush file in vma close eCryptfs: Prevent file create race condition regulator: TPS65910: Fix VDD1/2 voltage selector count i2c: Make i2cdev_notifier_call static i2c: Delete ANY_I2C_BUS i2c: Fix device name for 10-bit slave address i2c-algo-bit: Generate correct i2c address sequence for 10-bit target drm: integer overflow in drm_mode_dirtyfb_ioctl() Revert "of/irq: of_irq_find_parent: check for parent equal to child" drivers/gpu/vga/vgaarb.c: add missing kfree drm/radeon/kms/atom: unify i2c gpio table handling drm/radeon/kms: fix up gpio i2c mask bits for r4xx for real ttm: Don't return the bo reserved on error path mount_subtree() pointless use-after-free iio: fix a leak due to improper use of anon_inode_getfd() ...	2011-12-20 17:01:18 -05:00
Ingo Molnar	d87f69a16e	Merge commit 'v3.2-rc6' into perf/core Merge reason: Update with the latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-20 20:32:11 +01:00
Ingo Molnar	45aa0663cc	Merge branch 'memblock-kill-early_node_map' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into core/memblock	2011-12-20 12:14:26 +01:00
Jussi Kivilinna	7ba8babf84	crypto: serpent-sse2 - remove unneeded LRW/XTS #ifdefs Since LRW & XTS are selected by serpent-sse2, we don't need these #ifdefs anymore. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2011-12-20 15:20:08 +08:00
Jussi Kivilinna	88715b9ade	crypto: twofish-x86_64-3way - remove unneeded LRW/XTS #ifdefs Since LRW & XTS are selected by twofish-x86_64-3way, we don't need these #ifdefs anymore. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2011-12-20 15:20:07 +08:00
Clemens Ladisch	13f541c10b	x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT When printing the code bytes in show_registers(), the markers around the byte at the fault address could make the printk() format string look like a valid log level and facility code. This would prevent this byte from being printed and result in a spurious newline: [ 7555.765589] Code: 8b 32 e9 94 00 00 00 81 7d 00 ff 00 00 00 0f 87 96 00 00 00 48 8b 83 c0 00 00 00 44 89 e2 44 89 e6 48 89 df 48 8b 80 d8 02 00 00 [ 7555.765683] 8b 48 28 48 89 d0 81 e2 ff 0f 00 00 48 c1 e8 0c 48 c1 e0 04 Add KERN_CONT where needed, and elsewhere in show_registers() for consistency. Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Link: http://lkml.kernel.org/r/4EEFA7AE.9020407@ladisch.de Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-12-19 13:09:56 -08:00
Markus Kötter	a03ffcf873	net: bpf_jit: fix an off-one bug in x86_64 cond jump target x86 jump instruction size is 2 or 5 bytes (near/long jump), not 2 or 6 bytes. In case a conditional jump is followed by a long jump, conditional jump target is one byte past the start of target instruction. Signed-off-by: Markus Kötter <nepenthesdev@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-19 15:47:29 -05:00
Martin Schwidefsky	612ef28a04	Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into cputime-tip Conflicts: drivers/cpufreq/cpufreq_conservative.c drivers/cpufreq/cpufreq_ondemand.c drivers/macintosh/rack-meter.c fs/proc/stat.c fs/proc/uptime.c kernel/sched/core.c	2011-12-19 19:23:15 +01:00
Fernando Luis Vazquez Cao	b49d7d877f	x86: Convert per-cpu counter icr_read_retry_count into a member of irq_stat LAPIC related statistics are grouped inside the per-cpu structure irq_stat, so there is no need for icr_read_retry_count to be a standalone per-cpu variable. This patch moves icr_read_retry_count to where it belongs. Suggested-y: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp> Cc: Jörn Engel <joern@logfs.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-18 10:46:48 +01:00
Ingo Molnar	6e5ed27637	Merge commit 'v3.2-rc6' into x86/platform	2011-12-18 10:35:16 +01:00
Michael Demeter	d79a8869d8	x86/mrst: Add additional debug prints for pb_keys Added additional debug output that we always seem to add during power ons to validate firmware operation. Signed-off-by: Michael Demeter <michael.demeter@intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Alan Cox <alan@linux.intel.com> Link: http://lkml.kernel.org/r/20111215223116.10166.50803.stgit@bob.linux.org.uk [ fixed line breaks, formatting and commit title. ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-18 09:37:07 +01:00
Ingo Molnar	a228b5892b	Merge branch 'mce-inject' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/mce	2011-12-18 09:18:45 +01:00
Alan Cox	933b9463a0	x86/intel config: Revamp configuration to allow for Moorestown and Medfield This sets all up the other bits that need to be INTEL_MID specific rather than Moorestown specific. Signed-off-by: Alan Cox <alan@linux.intel.com> Link: http://lkml.kernel.org/r/20111217174318.7207.91543.stgit@bob.linux.org.uk Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-18 09:17:02 +01:00
Jesper Juhl	1affc46cff	x86: Use "do { } while(0)" for empty lock_cmos()/unlock_cmos() macros gcc noticed (when using -Wempty-body) that our use of lock_cmos() and unlock_cmos() in arch/x86/include/asm/mach_traps.h is potentially problematic : arch/x86/include/asm/mach_traps.h:32:15: warning: suggest braces around empty body in an ¡else¢ statement [-Wempty-body] arch/x86/include/asm/mach_traps.h:40:16: warning: suggest braces around empty body in an ¡else¢ statement [-Wempty-body] Let's just use the standard 'do {} while (0)' solution. That shuts up gcc and also prevents future problems if the macros should end up being used in a similar situation elsewhere. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1112180103130.21784@swampdragon.chaosbits.net Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-18 09:14:31 +01:00
Jesper Juhl	2ac13462b6	x86: Use "do { } while(0)" for empty flush_tlb_fix_spurious_fault() macro If one builds the kernel with -Wempty-body one gets this warning: mm/memory.c:3432:46: warning: suggest braces around empty body in an ¡if¢ statement [-Wempty-body] due to the fact that 'flush_tlb_fix_spurious_fault' is a macro that can sometimes be defined to nothing. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: linux-mm@kvack.org Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1112180128070.21784@swampdragon.chaosbits.net Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-18 09:14:18 +01:00
Alan Cox	a0c3832a57	x86/apb: Fix configuration constraints The APB timer requires SFI, SCU and MID support Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Alan Cox <alan@linux.intel.com> Link: http://lkml.kernel.org/r/20111217215719.3743.93550.stgit@bob.linux.org.uk Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-18 09:09:47 +01:00
Chen Gong	2c29d9dd57	x86: add IRQ context simulation in module mce-inject mce-inject provides a mechanism to simulate errors so that test scripts can check for correct operation of the kernel without requiring any specialized hardware to create rare events. The existing code can simulate events in normal process context and also in NMI context - but not in IRQ context. This patch fills that gap. Link: https://lkml.org/lkml/2011/12/7/537 Signed-off-by: Chen Gong <gong.chen@linux.intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2011-12-16 11:20:02 -08:00
Maarten Lankhorst	2d2da60fb4	x86, efi: Break up large initrd reads The efi boot stub tries to read the entire initrd in 1 go, however some efi implementations hang if too much if asked to read too much data at the same time. After some experimentation I found out that my asrock p67 board will hang if asked to read chunks of 4MiB, so use a safe value. elilo reads in chunks of 16KiB, but since that requires many read calls I use a value of 1 MiB. hpa suggested adding individual blacklists for when systems are found where this value causes a crash. Signed-off-by: Maarten Lankhorst <m.b.lankhorst@gmail.com> Link: http://lkml.kernel.org/r/4EEB3A02.3090201@gmail.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-12-16 08:34:35 -08:00
Alan Cox	3e8f9451d3	x86: Fix INTEL_MID silly Doh.. pass the brown paper bags - preferably filled with mince pies.. This fixes occasional build failures. Signed-off-by: Alan Cox <alan@linux.intel.com> Link: http://lkml.kernel.org/n/tip-r0oc1knlvzuqr69artaeq8s8@git.kernel.org [ extended the changelog a bit ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-16 09:33:17 +01:00
David Howells	ca3d30cc02	x86_64, asm: Optimise fls(), ffs() and fls64() fls(N), ffs(N) and fls64(N) can be optimised on x86_64. Currently they use a CMOV instruction after the BSR/BSF to set the destination register to -1 if the value to be scanned was 0 (in which case BSR/BSF set the Z flag). Instead, according to the AMD64 specification, we can make use of the fact that BSR/BSF doesn't modify its output register if its input is 0. By preloading the output with -1 and incrementing the result, we achieve the desired result without the need for a conditional check. The Intel x86_64 specification, however, says that the result of BSR/BSF in such a case is undefined. That said, when queried, one of the Intel CPU architects said that the behaviour on all Intel CPUs is that: (1) with BSRQ/BSFQ, the 64-bit destination register is written with its original value if the source is 0, thus, in essence, giving the effect we want. And, (2) with BSRL/BSFL, the lower half of the 64-bit destination register is written with its original value if the source is 0, and the upper half is cleared, thus giving us the effect we want (we return a 4-byte int). Further, it was indicated that they (Intel) are unlikely to get away with changing the behaviour. It might be possible to optimise the 32-bit versions of these functions, but there's a lot more variation, and so the effective non-destructive property of BSRL/BSRF cannot be relied on. [ hpa: specifically, some 486 chips are known to NOT have this property. ] I have benchmarked these functions on my Core2 Duo test machine using the following program: #include <stdlib.h> #include <stdio.h> #ifndef __x86_64__ #error #endif #define PAGE_SHIFT 12 typedef unsigned long long __u64, u64; typedef unsigned int __u32, u32; #define noinline __attribute__((noinline)) static __always_inline int fls64(__u64 x) { long bitpos = -1; asm("bsrq %1,%0" : "+r" (bitpos) : "rm" (x)); return bitpos + 1; } static inline unsigned long __fls(unsigned long word) { asm("bsr %1,%0" : "=r" (word) : "rm" (word)); return word; } static __always_inline int old_fls64(__u64 x) { if (x == 0) return 0; return __fls(x) + 1; } static noinline // __attribute__((const)) int old_get_order(unsigned long size) { int order; size = (size - 1) >> (PAGE_SHIFT - 1); order = -1; do { size >>= 1; order++; } while (size); return order; } static inline __attribute__((const)) int get_order_old_fls64(unsigned long size) { int order; size--; size >>= PAGE_SHIFT; order = old_fls64(size); return order; } static inline __attribute__((const)) int get_order(unsigned long size) { int order; size--; size >>= PAGE_SHIFT; order = fls64(size); return order; } unsigned long prevent_optimise_out; static noinline unsigned long test_old_get_order(void) { unsigned long n, total = 0; long rep, loop; for (rep = 1000000; rep > 0; rep--) { for (loop = 0; loop <= 16384; loop += 4) { n = 1UL << loop; total += old_get_order(n); } } return total; } static noinline unsigned long test_get_order_old_fls64(void) { unsigned long n, total = 0; long rep, loop; for (rep = 1000000; rep > 0; rep--) { for (loop = 0; loop <= 16384; loop += 4) { n = 1UL << loop; total += get_order_old_fls64(n); } } return total; } static noinline unsigned long test_get_order(void) { unsigned long n, total = 0; long rep, loop; for (rep = 1000000; rep > 0; rep--) { for (loop = 0; loop <= 16384; loop += 4) { n = 1UL << loop; total += get_order(n); } } return total; } int main(int argc, char **argv) { unsigned long total; switch (argc) { case 1: total = test_old_get_order(); break; case 2: total = test_get_order_old_fls64(); break; default: total = test_get_order(); break; } prevent_optimise_out = total; return 0; } This allows me to test the use of the old fls64() implementation and the new fls64() implementation and also to contrast these to the out-of-line loop-based implementation of get_order(). The results were: warthog>time ./get_order real 1m37.191s user 1m36.313s sys 0m0.861s warthog>time ./get_order x real 0m16.892s user 0m16.586s sys 0m0.287s warthog>time ./get_order x x real 0m7.731s user 0m7.727s sys 0m0.002s Using the current upstream fls64() as a basis for an inlined get_order() [the second result above] is much faster than using the current out-of-line loop-based get_order() [the first result above]. Using my optimised inline fls64()-based get_order() [the third result above] is even faster still. [ hpa: changed the selection of 32 vs 64 bits to use CONFIG_X86_64 instead of comparing BITS_PER_LONG, updated comments, rebased manually on top of `83d99df7c4` x86, bitops: Move fls64.h inside __KERNEL__ ] Signed-off-by: David Howells <dhowells@redhat.com> Link: http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-12-15 15:16:49 -08:00
H. Peter Anvin	83d99df7c4	x86, bitops: Move fls64.h inside __KERNEL__ We would include <asm-generic/bitops/fls64.h> even without __KERNEL__, but that doesn't make sense, as: 1. That file provides fls64(), but the corresponding function fls() is not exported to user space. 2. The implementation of fls64.h uses kernel-only symbols. 3. fls64.h is not exported to user space. This appears to have been a bug introduced in checkin: `d57594c203` bitops: use __fls for fls64 on 64-bit archs Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Alexander van Heukelum <heukelum@mailshack.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Link: http://lkml.kernel.org/r/4EEA77E1.6050009@zytor.com	2011-12-15 15:04:07 -08:00
Linus Torvalds	42ebfc61cf	Merge branch 'stable/for-linus-fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen * 'stable/for-linus-fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/swiotlb: Use page alignment for early buffer allocation. xen: only limit memory map to maximum reservation for domain 0.	2011-12-15 10:52:40 -08:00
Ian Campbell	d3db728125	xen: only limit memory map to maximum reservation for domain 0. `d312ae878b` "xen: use maximum reservation to limit amount of usable RAM" clamped the total amount of RAM to the current maximum reservation. This is correct for dom0 but is not correct for guest domains. In order to boot a guest "pre-ballooned" (e.g. with memory=1G but maxmem=2G) in order to allow for future memory expansion the guest must derive max_pfn from the e820 provided by the toolstack and not the current maximum reservation (which can reflect only the current maximum, not the guest lifetime max). The existing algorithm already behaves this correctly if we do not artificially limit the maximum number of pages for the guest case. For a guest booted with maxmem=512, memory=128 this results in: [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable) [ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved) -[ 0.000000] Xen: 0000000000100000 - 0000000008100000 (usable) -[ 0.000000] Xen: 0000000008100000 - 0000000020800000 (unusable) +[ 0.000000] Xen: 0000000000100000 - 0000000020800000 (usable) ... [ 0.000000] NX (Execute Disable) protection: active [ 0.000000] DMI not present or invalid. [ 0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved) [ 0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable) -[ 0.000000] last_pfn = 0x8100 max_arch_pfn = 0x1000000 +[ 0.000000] last_pfn = 0x20800 max_arch_pfn = 0x1000000 [ 0.000000] initial memory mapped : 0 - 027ff000 [ 0.000000] Base memory trampoline at [c009f000] 9f000 size 4096 -[ 0.000000] init_memory_mapping: 0000000000000000-0000000008100000 -[ 0.000000] 0000000000 - 0008100000 page 4k -[ 0.000000] kernel direct mapping tables up to 8100000 @ 27bb000-27ff000 +[ 0.000000] init_memory_mapping: 0000000000000000-0000000020800000 +[ 0.000000] 0000000000 - 0020800000 page 4k +[ 0.000000] kernel direct mapping tables up to 20800000 @ 26f8000-27ff000 [ 0.000000] xen: setting RW the range 27e8000 - 27ff000 [ 0.000000] 0MB HIGHMEM available. -[ 0.000000] 129MB LOWMEM available. -[ 0.000000] mapped low ram: 0 - 08100000 -[ 0.000000] low ram: 0 - 08100000 +[ 0.000000] 520MB LOWMEM available. +[ 0.000000] mapped low ram: 0 - 20800000 +[ 0.000000] low ram: 0 - 20800000 With this change "xl mem-set <domain> 512M" will successfully increase the guest RAM (by reducing the balloon). There is no change for dom0. Reported-and-Tested-by: George Shuklin <george.shuklin@gmail.com> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: stable@kernel.org Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-12-15 11:24:02 -05:00
Timo Teräs	cb3f718de8	x86, centaur: Enable cx8 for VIA Eden too My box with following cpuinfo needs the cx8 enabling still: vendor_id : CentaurHauls cpu family : 6 model : 13 model name : VIA Eden Processor 1200MHz stepping : 0 cpu MHz : 1199.940 cache size : 128 KB This fixes valgrind to work on my box (it requires and checks cx8 from cpuinfo). Signed-off-by: Timo Teräs <timo.teras@iki.fi> Link: http://lkml.kernel.org/r/1323961888-10223-1-git-send-email-timo.teras@iki.fi Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2011-12-15 08:04:42 -08:00
Ingo Molnar	6a54aebf69	Merge commit 'v3.2-rc5' into sched/core Merge reason: Pick up the latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-15 08:21:30 +01:00
Jan Beulich	cebef5beed	x86: Fix and improve percpu_cmpxchg{8,16}b_double() They had several problems/shortcomings: Only the first memory operand was mentioned in the 2x32bit asm() operands, and 2x64-bit version had a memory clobber. The first allowed the compiler to not recognize the need to re-load the data in case it had it cached in some register, and the second was overly destructive. The memory operand in the 2x32-bit asm() was declared to only be an output. The types of the local copies of the old and new values were incorrect (as in other per-CPU ops, the types of the per-CPU variables accessed should be used here, to make sure the respective types are compatible). The __dummy variable was pointless (and needlessly initialized in the 2x32-bit case), given that local copies of the inputs already exist. The 2x64-bit variant forced the address of the first object into %rsi, even though this is needed only for the call to the emulation function. The real cmpxchg16b can operate on an memory. At once also change the return value type to what it really is - 'bool'. Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Howells <dhowells@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/4EE86D6502000078000679FE@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-15 08:17:14 +01:00
Joerg Roedel	969df4b829	x86: Report cpb and eff_freq_ro flags correctly Add the flags to get rid of the [9] and [10] feature names in cpuinfo's 'power management' fields and replace them with meaningful names. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Link: http://lkml.kernel.org/r/1323875574-17881-1-git-send-email-joerg.roedel@amd.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-15 08:14:49 +01:00
Ingo Molnar	715a43182a	Merge branch 'early-mce-decode' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp into x86/mce	2011-12-15 08:13:40 +01:00
Ingo Molnar	304fb45374	Merge branch 'ucode-verify-size' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp into x86/microcode	2011-12-15 08:12:15 +01:00
Fenghua Yu	29e9bf1841	x86, mce, therm_throt: Don't report power limit and package level thermal throttle events in mcelog Thermal throttle and power limit events are not defined as MCE errors in x86 architecture and should not generate MCE errors in mcelog. Current kernel generates fake software defined MCE errors for these events. This may confuse users because they may think the machine has real MCE errors while actually only thermal throttle or power limit events happen. To make it worse, buggy firmware on some platforms may falsely generate the events. Therefore, kernel reports MCE errors which users think as real hardware errors. Although the firmware bugs should be fixed, on the other hand, kernel should not report MCE errors either. So mcelog is not a good mechanism to report these events. To report the events, we count them in respective counters (core_power_limit_count, package_power_limit_count, core_throttle_count, and package_throttle_count) in /sys/devices/system/cpu/cpu#/thermal_throttle/. Users can check the counters for each event on each CPU. Please note that all CPU's on one package report duplicate counters. It's user application's responsibity to retrieve a package level counter for one package. This patch doesn't report package level power limit, core level power limit, and package level thermal throttle events in mcelog. When the events happen, only report them in respective counters in sysfs. Since core level thermal throttle has been legacy code in kernel for a while and users accepted it as MCE error in mcelog, core level thermal throttle is still reported in mcelog. In the mean time, the event is counted in a counter in sysfs as well. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Acked-by: Borislav Petkov <bp@amd64.org> Acked-by: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/20111215001945.GA21009@linux-os.sc.intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-12-14 16:25:26 -08:00
Borislav Petkov	0937195715	x86, MCE: Drain mcelog buffer Add a function which drains whatever MCEs were logged in already during boot and before the decoder chains were registered. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>	2011-12-14 12:50:13 +01:00
Borislav Petkov	3653ada5d3	x86, mce: Add wrappers for registering on the decode chain No functionality change, this is done so that in a follow-on patch all queued-up MCEs can be decoded after registering on the chain. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>	2011-12-14 12:50:12 +01:00
Borislav Petkov	597e11a367	x86, microcode, AMD: Update copyrights Add Andreas and me as current maintainers. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>	2011-12-14 12:46:52 +01:00
Borislav Petkov	d733689ad5	x86, microcode, AMD: Exit early on success Once we've found and validated the ucode patch for the current CPU, there's no need to iterate over the remaining patches in the binary image. Exit then and save us a bunch of cycles. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>	2011-12-14 12:46:52 +01:00
Borislav Petkov	be62adb492	x86, microcode, AMD: Simplify ucode verification Basically, what we did until now is take out a chunk of the firmware image, vmalloc space for it and inspect it before application. And repeat. This patch changes all that so that we look at each ucode patch from the firmware image, check it for sanity and copy it to local buffer for application only once and if it passes all checks. Thus, vmalloc-ing for each piece is gone, we can do proper size checking only of the patch which is destined for the CPU of the current machine instead of each single patch, which is clearly wrong. Oh yeah, simplify and cleanup the code while at it, along with adding comments as to what actually happens. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>	2011-12-14 12:46:51 +01:00
Borislav Petkov	96b0ee4588	x86, microcode, AMD: Add a reusable buffer Add a simple 4K page which gets allocated on driver init and freed on driver exit instead of vmalloc'ing small buffers for each ucode patch. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>	2011-12-14 12:46:50 +01:00
Borislav Petkov	f72c1a5765	x86, microcode, AMD: Add a vendor-specific exit function This will be used to do cleanup work before the driver exits. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>	2011-12-14 12:46:47 +01:00
Fernando Luis Vázquez Cao	346b46be5f	x86: Add per-cpu stat counter for APIC ICR read tries In the IPI delivery slow path (NMI delivery) we retry the ICR read to check for delivery completion a limited number of times. [ The reason for the limited retries is that some of the places where it is used (cpu boot, kdump, etc) IPI delivery might not succeed (due to a firmware bug or system crash, for example) and in such a case it is better to give up and resume execution of other code. ] This patch adds a new entry to /proc/interrupts, RTR, which tells user space the number of times we retried the ICR read in the IPI delivery slow path. This should give some insight into how well the APIC message delivery hardware is working - if the counts are way too large then we are hitting a (very-) slow path way too often. Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp> Cc: Jörn Engel <joern@logfs.org> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/n/tip-vzsp20lo2xdzh5f70g0eis2s@git.kernel.org [ extended the changelog ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-14 09:32:05 +01:00
Ingo Molnar	919b83452b	Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu	2011-12-14 08:16:43 +01:00
Matt Fleming	291f36325f	x86, efi: EFI boot stub support There is currently a large divide between kernel development and the development of EFI boot loaders. The idea behind this patch is to give the kernel developers full control over the EFI boot process. As H. Peter Anvin put it, "The 'kernel carries its own stub' approach been very successful in dealing with BIOS, and would make a lot of sense to me for EFI as well." This patch introduces an EFI boot stub that allows an x86 bzImage to be loaded and executed by EFI firmware. The bzImage appears to the firmware as an EFI application. Luckily there are enough free bits within the bzImage header so that it can masquerade as an EFI application, thereby coercing the EFI firmware into loading it and jumping to its entry point. The beauty of this masquerading approach is that both BIOS and EFI boot loaders can still load and run the same bzImage, thereby allowing a single kernel image to work in any boot environment. The EFI boot stub supports multiple initrds, but they must exist on the same partition as the bzImage. Command-line arguments for the kernel can be appended after the bzImage name when run from the EFI shell, e.g. Shell> bzImage console=ttyS0 root=/dev/sdb initrd=initrd.img v7: - Fix checkpatch warnings. v6: - Try to allocate initrd memory just below hdr->inird_addr_max. v5: - load_options_size is UTF-16, which needs dividing by 2 to convert to the corresponding ASCII size. v4: - Don't read more than image->load_options_size v3: - Fix following warnings when compiling CONFIG_EFI_STUB=n arch/x86/boot/tools/build.c: In function ‘main’: arch/x86/boot/tools/build.c:138:24: warning: unused variable ‘pe_header’ arch/x86/boot/tools/build.c:138:15: warning: unused variable ‘file_sz’ - As reported by Matthew Garrett, some Apple machines have GOPs that don't have hardware attached. We need to weed these out by searching for ones that handle the PCIIO protocol. - Don't allocate memory if no initrds are on cmdline - Don't trust image->load_options_size Maarten Lankhorst noted: - Don't strip first argument when booted from efibootmgr - Don't allocate too much memory for cmdline - Don't update cmdline_size, the kernel considers it read-only - Don't accept '\n' for initrd names v2: - File alignment was too large, was 8192 should be 512. Reported by Maarten Lankhorst on LKML. - Added UGA support for graphics - Use VIDEO_TYPE_EFI instead of hard-coded number. - Move linelength assignment until after we've assigned depth - Dynamically fill out AddressOfEntryPoint in tools/build.c - Don't use magic number for GDT/TSS stuff. Requested by Andi Kleen - The bzImage may need to be relocated as it may have been loaded at a high address address by the firmware. This was required to get my macbook booting because the firmware loaded it at 0x7cxxxxxx, which triggers this error in decompress_kernel(), if (heap > ((-__PAGE_OFFSET-(128<<20)-1) & 0x7fffffff)) error("Destination address too large"); Cc: Mike Waychison <mikew@google.com> Cc: Matthew Garrett <mjg@redhat.com> Tested-by: Henrik Rydberg <rydberg@euromail.se> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/r/1321383097.2657.9.camel@mfleming-mobl1.ger.corp.intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-12-12 14:26:10 -08:00
Alexey Dobriyan	890890cb8e	x86/i386: Use less assembly in strlen(), speed things up a bit Current i386 strlen() hardcodes NOT/DEC sequence. DEC is mentioned to be suboptimal on Core2. So, put only REPNE SCASB sequence in assembly, compiler can do the rest. The difference in generated code is like below (MCORE2=y): <strlen>: push %edi mov $0xffffffff,%ecx mov %eax,%edi xor %eax,%eax repnz scas %es:(%edi),%al not %ecx - dec %ecx - mov %ecx,%eax + lea -0x1(%ecx),%eax pop %edi ret Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jan Beulich <JBeulich@suse.com> Link: http://lkml.kernel.org/r/20111211181319.GA17097@p183.telecom.by Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-12 18:33:42 +01:00
Keith Packard	e1ad783b12	Revert "x86, efi: Calling __pa() with an ioremap()ed address is invalid" This hangs my MacBook Air at boot time; I get no console messages at all. I reverted this on top of -rc5 and my machine boots again. This reverts commit `e8c7106280`. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Keith Packard <keithp@keithp.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Matthew Garrett <mjg@redhat.com> Cc: Zhang Rui <rui.zhang@intel.com> Cc: Huang Ying <huang.ying.caritas@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1321621751-3650-1-git-send-email-matt@console Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-12 18:25:56 +01:00
Frederic Weisbecker	1268fbc746	nohz: Remove tick_nohz_idle_enter_norcu() / tick_nohz_idle_exit_norcu() Those two APIs were provided to optimize the calls of tick_nohz_idle_enter() and rcu_idle_enter() into a single irq disabled section. This way no interrupt happening in-between would needlessly process any RCU job. Now we are talking about an optimization for which benefits have yet to be measured. Let's start simple and completely decouple idle rcu and dyntick idle logics to simplify. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-12-11 10:31:57 -08:00
Frederic Weisbecker	98ad1cc14a	x86: Call idle notifier after irq_enter() Interrupts notify the idle exit state before calling irq_enter(). But the notifier code calls rcu_read_lock() and this is not allowed while rcu is in an extended quiescent state. We need to wait for irq_enter() -> rcu_idle_exit() to be called before doing so otherwise this results in a grumpy RCU: [ 0.099991] WARNING: at include/linux/rcupdate.h:194 __atomic_notifier_call_chain+0xd2/0x110() [ 0.099991] Hardware name: AMD690VM-FMH [ 0.099991] Modules linked in: [ 0.099991] Pid: 0, comm: swapper Not tainted 3.0.0-rc6+ #255 [ 0.099991] Call Trace: [ 0.099991] <IRQ> [<ffffffff81051c8a>] warn_slowpath_common+0x7a/0xb0 [ 0.099991] [<ffffffff81051cd5>] warn_slowpath_null+0x15/0x20 [ 0.099991] [<ffffffff817d6fa2>] __atomic_notifier_call_chain+0xd2/0x110 [ 0.099991] [<ffffffff817d6ff1>] atomic_notifier_call_chain+0x11/0x20 [ 0.099991] [<ffffffff81001873>] exit_idle+0x43/0x50 [ 0.099991] [<ffffffff81020439>] smp_apic_timer_interrupt+0x39/0xa0 [ 0.099991] [<ffffffff817da253>] apic_timer_interrupt+0x13/0x20 [ 0.099991] <EOI> [<ffffffff8100ae67>] ? default_idle+0xa7/0x350 [ 0.099991] [<ffffffff8100ae65>] ? default_idle+0xa5/0x350 [ 0.099991] [<ffffffff8100b19b>] amd_e400_idle+0x8b/0x110 [ 0.099991] [<ffffffff810cb01f>] ? rcu_enter_nohz+0x8f/0x160 [ 0.099991] [<ffffffff810019a0>] cpu_idle+0xb0/0x110 [ 0.099991] [<ffffffff817a7505>] rest_init+0xe5/0x140 [ 0.099991] [<ffffffff817a7468>] ? rest_init+0x48/0x140 [ 0.099991] [<ffffffff81cc5ca3>] start_kernel+0x3d1/0x3dc [ 0.099991] [<ffffffff81cc5321>] x86_64_start_reservations+0x131/0x135 [ 0.099991] [<ffffffff81cc5412>] x86_64_start_kernel+0xed/0xf4 Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Andy Henroid <andrew.d.henroid@intel.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-12-11 10:31:38 -08:00
Frederic Weisbecker	e37e112de3	x86: Enter rcu extended qs after idle notifier call The idle notifier, called by enter_idle(), enters into rcu read side critical section but at that time we already switched into the RCU-idle window (rcu_idle_enter() has been called). And it's illegal to use rcu_read_lock() in that state. This results in rcu reporting its bad mood: [ 1.275635] WARNING: at include/linux/rcupdate.h:194 __atomic_notifier_call_chain+0xd2/0x110() [ 1.275635] Hardware name: AMD690VM-FMH [ 1.275635] Modules linked in: [ 1.275635] Pid: 0, comm: swapper Not tainted 3.0.0-rc6+ #252 [ 1.275635] Call Trace: [ 1.275635] [<ffffffff81051c8a>] warn_slowpath_common+0x7a/0xb0 [ 1.275635] [<ffffffff81051cd5>] warn_slowpath_null+0x15/0x20 [ 1.275635] [<ffffffff817d6f22>] __atomic_notifier_call_chain+0xd2/0x110 [ 1.275635] [<ffffffff817d6f71>] atomic_notifier_call_chain+0x11/0x20 [ 1.275635] [<ffffffff810018a0>] enter_idle+0x20/0x30 [ 1.275635] [<ffffffff81001995>] cpu_idle+0xa5/0x110 [ 1.275635] [<ffffffff817a7465>] rest_init+0xe5/0x140 [ 1.275635] [<ffffffff817a73c8>] ? rest_init+0x48/0x140 [ 1.275635] [<ffffffff81cc5ca3>] start_kernel+0x3d1/0x3dc [ 1.275635] [<ffffffff81cc5321>] x86_64_start_reservations+0x131/0x135 [ 1.275635] [<ffffffff81cc5412>] x86_64_start_kernel+0xed/0xf4 [ 1.275635] ---[ end trace a22d306b065d4a66 ]--- Fix this by entering rcu extended quiescent state later, just before the CPU goes to sleep. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-12-11 10:31:37 -08:00
Frederic Weisbecker	2bbb6817c0	nohz: Allow rcu extended quiescent state handling seperately from tick stop It is assumed that rcu won't be used once we switch to tickless mode and until we restart the tick. However this is not always true, as in x86-64 where we dereference the idle notifiers after the tick is stopped. To prepare for fixing this, add two new APIs: tick_nohz_idle_enter_norcu() and tick_nohz_idle_exit_norcu(). If no use of RCU is made in the idle loop between tick_nohz_enter_idle() and tick_nohz_exit_idle() calls, the arch must instead call the new *_norcu() version such that the arch doesn't need to call rcu_idle_enter() and rcu_idle_exit(). Otherwise the arch must call tick_nohz_enter_idle() and tick_nohz_exit_idle() and also call explicitly: - rcu_idle_enter() after its last use of RCU before the CPU is put to sleep. - rcu_idle_exit() before the first use of RCU after the CPU is woken up. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: David Miller <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-12-11 10:31:36 -08:00
Frederic Weisbecker	280f06774a	nohz: Separate out irq exit and idle loop dyntick logic The tick_nohz_stop_sched_tick() function, which tries to delay the next timer tick as long as possible, can be called from two places: - From the idle loop to start the dytick idle mode - From interrupt exit if we have interrupted the dyntick idle mode, so that we reprogram the next tick event in case the irq changed some internal state that requires this action. There are only few minor differences between both that are handled by that function, driven by the ts->inidle cpu variable and the inidle parameter. The whole guarantees that we only update the dyntick mode on irq exit if we actually interrupted the dyntick idle mode, and that we enter in RCU extended quiescent state from idle loop entry only. Split this function into: - tick_nohz_idle_enter(), which sets ts->inidle to 1, enters dynticks idle mode unconditionally if it can, and enters into RCU extended quiescent state. - tick_nohz_irq_exit() which only updates the dynticks idle mode when ts->inidle is set (ie: if tick_nohz_idle_enter() has been called). To maintain symmetry, tick_nohz_restart_sched_tick() has been renamed into tick_nohz_idle_exit(). This simplifies the code and micro-optimize the irq exit path (no need for local_irq_save there). This also prepares for the split between dynticks and rcu extended quiescent state logics. We'll need this split to further fix illegal uses of RCU in extended quiescent states in the idle loop. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: David Miller <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-12-11 10:31:35 -08:00
Matt Fleming	f7d7d01be5	x86: Don't use magic strings for EFI loader signature Introduce a symbol, EFI_LOADER_SIGNATURE instead of using the magic strings, which also helps to reduce the amount of ifdeffery. Cc: Matthew Garrett <mjg@redhat.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/r/1318848017-12301-1-git-send-email-matt@console-pimps.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-12-09 17:35:38 -08:00
Matt Fleming	8af21e7e71	x86: Add missing bzImage fields to struct setup_header commit `37ba7ab5e3` ("x86, boot: make kernel_alignment adjustable; new bzImage fields") introduced some new fields into the bzImage header but struct setup_header was not updated accordingly. Add the missing 'pref_address' and 'init_size' fields. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/r/1318848017-12301-1-git-send-email-matt@console-pimps.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-12-09 17:35:33 -08:00
Matt Fleming	6d3e32e63f	x86, efi: Make efi_call_phys_{prelog,epilog} CONFIG_RELOCATABLE-aware efi_call_phys_prelog() sets up a 1:1 mapping of the physical address range in swapper_pg_dir. Instead of replacing then restoring entries in swapper_pg_dir we should be using initial_page_table which already contains the 1:1 mapping. It's safe to blindly switch back to swapper_pg_dir in the epilog because the physical EFI routines are only called before efi_enter_virtual_mode(), e.g. before any user processes have been forked. Therefore, we don't need to track which pgd was in %cr3 when we entered the prelog. The previous code actually contained a bug because it assumed that the kernel was loaded at a physical address within the first 8MB of ram, usually at 0x100000. However, this isn't the case with a CONFIG_RELOCATABLE=y kernel which could have been loaded anywhere in the physical address space. Also delete the ancient (and bogus) comments about the page table being restored after the lock is released. There is no locking. Cc: Matthew Garrett <mjg@redhat.com> Cc: Darrent Hart <dvhart@linux.intel.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/r/1323346250.3894.74.camel@mfleming-mobl1.ger.corp.intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-12-09 16:39:11 -08:00
Linus Torvalds	a776878d6c	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, efi: Calling __pa() with an ioremap()ed address is invalid x86, hpet: Immediately disable HPET timer 1 if rtc irq is masked x86/intel_mid: Kconfig select fix x86/intel_mid: Fix the Kconfig for MID selection	2011-12-09 14:45:12 -08:00
H. Peter Anvin	47db9e7c80	x86, um: Fix typo in 32-bit system call modifications We override sys_iopl(), not stub_iopl(); the latter is a 64-bitism that doesn't apply to i386 in the first place. Reported-by: Richard Weinberger <richard@nod.at> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2011-12-09 11:13:59 -08:00
Youquan Song	b6999b1912	thp: add compound tail page _mapcount when mapped With the 3.2-rc kernel, IOMMU 2M pages in KVM works. But when I tried to use IOMMU 1GB pages in KVM, I encountered an oops and the 1GB page failed to be used. The root cause is that 1GB page allocation calls gup_huge_pud() while 2M page calls gup_huge_pmd. If compound pages are used and the page is a tail page, gup_huge_pmd() increases _mapcount to record tail page are mapped while gup_huge_pud does not do that. So when the mapped page is relesed, it will result in kernel oops because the page is not marked mapped. This patch add tail process for compound page in 1GB huge page which keeps the same process as 2M page. Reproduce like: 1. Add grub boot option: hugepagesz=1G hugepages=8 2. mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages 3. qemu-kvm -m 2048 -hda os-kvm.img -cpu kvm64 -smp 4 -mem-path /dev/hugepages -net none -device pci-assign,host=07:00.1 kernel BUG at mm/swap.c:114! invalid opcode: 0000 [#1] SMP Call Trace: put_page+0x15/0x37 kvm_release_pfn_clean+0x31/0x36 kvm_iommu_put_pages+0x94/0xb1 kvm_iommu_unmap_memslots+0x80/0xb6 kvm_assign_device+0xba/0x117 kvm_vm_ioctl_assigned_device+0x301/0xa47 kvm_vm_ioctl+0x36c/0x3a2 do_vfs_ioctl+0x49e/0x4e4 sys_ioctl+0x5a/0x7c system_call_fastpath+0x16/0x1b RIP put_compound_page+0xd4/0x168 Signed-off-by: Youquan Song <youquan.song@intel.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-12-09 07:50:28 -08:00
Wang YanQing	819a693b5a	typo fixes: aera -> area, exntension -> extension One printk and one comment typo fix. Signed-off-by: Wang YanQing <udknight@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-12-09 15:22:07 +01:00
Matt Fleming	e8c7106280	x86, efi: Calling __pa() with an ioremap()ed address is invalid If we encounter an efi_memory_desc_t without EFI_MEMORY_WB set in ->attribute we currently call set_memory_uc(), which in turn calls __pa() on a potentially ioremap'd address. On CONFIG_X86_32 this is invalid, resulting in the following oops on some machines: BUG: unable to handle kernel paging request at f7f22280 IP: [<c10257b9>] reserve_ram_pages_type+0x89/0x210 [...] Call Trace: [<c104f8ca>] ? page_is_ram+0x1a/0x40 [<c1025aff>] reserve_memtype+0xdf/0x2f0 [<c1024dc9>] set_memory_uc+0x49/0xa0 [<c19334d0>] efi_enter_virtual_mode+0x1c2/0x3aa [<c19216d4>] start_kernel+0x291/0x2f2 [<c19211c7>] ? loglevel+0x1b/0x1b [<c19210bf>] i386_start_kernel+0xbf/0xc8 A better approach to this problem is to map the memory region with the correct attributes from the start, instead of modifying it after the fact. The uncached case can be handled by ioremap_nocache() and the cached by ioremap_cache(). Despite first impressions, it's not possible to use ioremap_cache() to map all cached memory regions on CONFIG_X86_64 because EFI_RUNTIME_SERVICES_DATA regions really don't like being mapped into the vmalloc space, as detailed in the following bug report, https://bugzilla.redhat.com/show_bug.cgi?id=748516 Therefore, we need to ensure that any EFI_RUNTIME_SERVICES_DATA regions are covered by the direct kernel mapping table on CONFIG_X86_64. To accomplish this we now map E820_RESERVED_EFI regions via the direct kernel mapping with the initial call to init_memory_mapping() in setup_arch(), whereas previously these regions wouldn't be mapped if they were after the last E820_RAM region until efi_ioremap() was called. Doing it this way allows us to delete efi_ioremap() completely. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Matthew Garrett <mjg@redhat.com> Cc: Zhang Rui <rui.zhang@intel.com> Cc: Huang Ying <huang.ying.caritas@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1321621751-3650-1-git-send-email-matt@console-pimps.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-09 08:32:26 +01:00
Petr Holasek	54eed6cb16	x86/numa: Add constraints check for nid parameters This patch adds constraint checks to the numa_set_distance() function. When the check triggers (this should not happen normally) it emits a warning and avoids a store to a negative index in numa_distance[] array - i.e. avoids memory corruption. Negative ids can be passed when the pxm-to-nids mapping is not properly filled while parsing the SRAT. Signed-off-by: Petr Holasek <pholasek@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Anton Arapov <anton@redhat.com> Link: http://lkml.kernel.org/r/20111208121640.GA2229@dhcp-27-244.brq.redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-09 08:03:34 +01:00
Borislav Petkov	d059f24a96	x86, CPU: Drop superfluous get_cpu_cap() prototype The get_cpu_cap() external function prototype was declared twice so lose one of them. Clean up the header guard while at it. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Link: http://lkml.kernel.org/r/1322594083-14507-1-git-send-email-bp@amd64.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-09 08:00:53 +01:00
Mark Langsdorf	2ded6e6a94	x86, hpet: Immediately disable HPET timer 1 if rtc irq is masked When HPET is operating in RTC mode, the TN_ENABLE bit on timer1 controls whether the HPET or the RTC delivers interrupts to irq8. When the system goes into suspend, the RTC driver sends a signal to the HPET driver so that the HPET releases control of irq8, allowing the RTC to wake the system from suspend. The switchover is accomplished by a write to the HPET configuration registers which currently only occurs while servicing the HPET interrupt. On some systems, I have seen the system suspend before an HPET interrupt occurs, preventing the write to the HPET configuration register and leaving the HPET in control of the irq8. As the HPET is not active during suspend, it does not generate a wake signal and RTC alarms do not work. This patch forces the HPET driver to immediately transfer control of the irq8 channel to the RTC instead of waiting until the next interrupt event. Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com> Link: http://lkml.kernel.org/r/20111118153306.GB16319@alberich.amd.com Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com> Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org	2011-12-08 21:47:22 +01:00
Tejun Heo	0ee332c145	memblock: Kill early_node_map[] Now all ARCH_POPULATES_NODE_MAP archs select HAVE_MEBLOCK_NODE_MAP - there's no user of early_node_map[] left. Kill early_node_map[] and replace ARCH_POPULATES_NODE_MAP with HAVE_MEMBLOCK_NODE_MAP. Also, relocate for_each_mem_pfn_range() and helper from mm.h to memblock.h as page_alloc.c would no longer host an alternative implementation. This change is ultimately one to one mapping and shouldn't cause any observable difference; however, after the recent changes, there are some functions which now would fit memblock.c better than page_alloc.c and dependency on HAVE_MEMBLOCK_NODE_MAP instead of HAVE_MEMBLOCK doesn't make much sense on some of them. Further cleanups for functions inside HAVE_MEMBLOCK_NODE_MAP in mm.h would be nice. -v2: Fix compile bug introduced by mis-spelling CONFIG_HAVE_MEMBLOCK_NODE_MAP to CONFIG_MEMBLOCK_HAVE_NODE_MAP in mmzone.h. Reported by Stephen Rothwell. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Chen Liqin <liqin.chen@sunplusct.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: "H. Peter Anvin" <hpa@zytor.com>	2011-12-08 10:22:09 -08:00
Tejun Heo	1aadc0560f	memblock: s/memblock_analyze()/memblock_allow_resize()/ and update users The only function of memblock_analyze() is now allowing resize of memblock region arrays. Rename it to memblock_allow_resize() and update its users. * The following users remain the same other than renaming. arm/mm/init.c::arm_memblock_init() microblaze/kernel/prom.c::early_init_devtree() powerpc/kernel/prom.c::early_init_devtree() openrisc/kernel/prom.c::early_init_devtree() sh/mm/init.c::paging_init() sparc/mm/init_64.c::paging_init() unicore32/mm/init.c::uc32_memblock_init() * In the following users, analyze was used to update total size which is no longer necessary. powerpc/kernel/machine_kexec.c::reserve_crashkernel() powerpc/kernel/prom.c::early_init_devtree() powerpc/mm/init_32.c::MMU_init() powerpc/mm/tlb_nohash.c::__early_init_mmu() powerpc/platforms/ps3/mm.c::ps3_mm_add_memory() powerpc/platforms/embedded6xx/wii.c::wii_memory_fixups() sh/kernel/machine_kexec.c::reserve_crashkernel() * x86/kernel/e820.c::memblock_x86_fill() was directly setting memblock_can_resize before populating memblock and calling analyze afterwards. Call memblock_allow_resize() before start populating. memblock_can_resize is now static inside memblock.c. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: "H. Peter Anvin" <hpa@zytor.com>	2011-12-08 10:22:08 -08:00
Tejun Heo	fe091c208a	memblock: Kill memblock_init() memblock_init() initializes arrays for regions and memblock itself; however, all these can be done with struct initializers and memblock_init() can be removed. This patch kills memblock_init() and initializes memblock with struct initializer. The only difference is that the first dummy entries don't have .nid set to MAX_NUMNODES initially. This doesn't cause any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: "H. Peter Anvin" <hpa@zytor.com>	2011-12-08 10:22:07 -08:00
Dan Carpenter	3d6240b53e	x86, NMI: Add to_cpumask() to silence compile warning Gcc complains if we don't cast this to a struct cpumask pointer. arch/x86/kernel/nmi_selftest.c:93:2: warning: passing argument 1 of ‘cpumask_empty’ from incompatible pointer type [enabled by default] Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Don Zickus <dzickus@redhat.com> Link: http://lkml.kernel.org/r/20111207110612.GA3437@mwanda Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-07 23:26:57 +01:00
Don Zickus	4f941c57fe	x86, NMI: NMI selftest depends on the local apic The selftest doesn't work with out a local apic for now. Reported-by: Randy Durlap <rdunlap@xenotime.net> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20111207210630.GI1669@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-07 23:26:55 +01:00
Mitsuo Hayasaka	d2db661021	x86: Add stack top margin for stack overflow checking It seems that a margin for stack overflow checking is added to top of a kernel stack but is not added to IRQ and exception stacks in stack_overflow_check(). Therefore, the overflows of IRQ and exception stacks are always detected only after they actually occurred and data corruption might occur due to them. This patch adds the margin to top of IRQ and exception stacks as well as a kernel stack to enhance reliability. Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: yrl.pp-manager.tt@hitachi.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20111207082910.9847.3359.stgit@ltc219.sdl.hitachi.co.jp [ removed the #undef - we typically don't do that for uncommon names ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-07 09:27:11 +01:00
H Hartley Sweeten	79f1ddd064	x86: Use the same node_distance for 32 and 64-bit The node_distance function is not x86 64-bit specific. Having the #ifdef around the extern function declaration and the #define causes the default node_distance macro to be used in asm-generic/topology.h. This also causes a sparse warning in arch/x86/mm/numa.c when CONFIG_X86_64 is not set: warning: symbol '__node_distance' was not declared. Should it be static? Remove the #ifdef to fix both issues. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1112061220310.28251@chino.kir.corp.google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-07 09:19:10 +01:00
Don Zickus	565cbc3e93	x86, NMI: NMI-selftest should handle the UP case properly If no remote cpus are online, then just quietly skip the remote IPI test for now. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: andi@firstfloor.org Cc: torvalds@linux-foundation.org Cc: peterz@infradead.org Cc: robert.richter@amd.com Link: http://lkml.kernel.org/r/20111206180859.GR1669@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-06 21:02:39 +01:00
Gleb Natapov	b3d9468a8b	perf, x86: Expose perf capability to other modules KVM needs to know perf capability to decide which PMU it can expose to a guest. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1320929850-10480-8-git-send-email-gleb@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-06 20:41:08 +01:00
Peter Zijlstra	c1d6f42f1a	perf, x86: Implement arch event mask as quirk Implement the disabling of arch events as a quirk so that we can print a message along with it. This creates some visibility into the problem space and could allow us to work on adding more work-around like the AAJ80 one. Requested-by: Ingo Molnar <mingo@elte.hu> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-wcja2z48wklzu1b0nkz0a5y7@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-06 20:41:06 +01:00
Gleb Natapov	ffb871bc91	x86, perf: Disable non available architectural events Intel CPUs report non-available architectural events in cpuid leaf 0AH.EBX. Use it to disable events that are not available according to CPU. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1320929850-10480-7-git-send-email-gleb@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-06 20:41:05 +01:00
Peter Zijlstra	9cdbe1cbac	jump_label, x86: Fix section mismatch WARNING: arch/x86/kernel/built-in.o(.text+0x4c71): Section mismatch in reference from the function arch_jump_label_transform_static() to the function .init.text:text_poke_early() The function arch_jump_label_transform_static() references the function __init text_poke_early(). This is often because arch_jump_label_transform_static lacks a __init annotation or the annotation of text_poke_early is wrong. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jason Baron <jbaron@redhat.com> Link: http://lkml.kernel.org/n/tip-9lefe89mrvurrwpqw5h8xm8z@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-06 20:41:02 +01:00
Alan Cox	4e2b1c4f56	x86/intel_mid: Kconfig select fix If we select a symbol it should have a type declared first otherwise in some situations the config tools get upset. They are currently perhaps a bit too resilient which is why this wasn't noticed initially. Signed-off-by: Alan Cox <alan@linux.intel.com> Link: http://lkml.kernel.org/r/20111206132811.4041.32549.stgit@bob.linux.org.uk Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-06 14:40:50 +01:00
Alan Cox	dd13752537	x86/intel_mid: Fix the Kconfig for MID selection We currently fail to build on CONFIG_X86_INTEL_MID=y and CONFIG_X86_MRST unset. We could build all the bits to make generic MID work if you picked MID platform alone but that's really silly. Instead use select and two variables. This looks a bit daft right now but once we add a Medfield selection it'll start to look a good deal more sensible. Reported-by: Ingo Molnar <mingo@elte.hu> Reported-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Alan Cox <alan@linux.intel.com> Link: http://lkml.kernel.org/r/20111205231433.28811.51297.stgit@bob.linux.org.uk Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-06 11:28:36 +01:00
Seiichi Ikarashi	1cf8343f55	x86: Fix rflags in FAKE_STACK_FRAME The x86_64 kernel pushes the fake kernel stack in arch/x86/kernel/entry_64.S:FAKE_STACK_FRAME, and rflags register in it does not conform to the specification. Although Intel's manual[1] says bit 1 of it shall be set to 1, this bit is cleared to 0 on pushing the fake stack. [1] Intel(R) 64 and IA-32 Architectures Software Developer's Manual Vol.1 3-21 Figure 3-8. EFLAGS Register If it is not on purpose, it is better to be fixed, because it can lead some tools misunderstanding the stack frame. For example, "crash" utility[2] actually detects it and warns you like below: RIP: ffffffff8005dfa2 RSP: ffff8104ce0c7f58 RFLAGS: 00000200 [...] bt: WARNING: possibly bogus exception frame Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com> Tested-by: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-06 10:02:38 +01:00
Stanislaw Gruszka	54c29c635a	mm, x86: Remove debug_pagealloc_enabled When (no)bootmem finish operation, it pass pages to buddy allocator. Since debug_pagealloc_enabled is not set, we will do not protect pages, what is not what we want with CONFIG_DEBUG_PAGEALLOC=y. To fix remove debug_pagealloc_enabled. That variable was introduced by commit `12d6f21e` "x86: do not PSE on CONFIG_DEBUG_PAGEALLOC=y" to get more CPA (change page attribude) code testing. But currently we have CONFIG_CPA_DEBUG, which test CPA. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/1322582711-14571-1-git-send-email-sgruszka@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-06 09:24:07 +01:00
Stanislaw Gruszka	855c743a27	x86/mm: Initialize high mem before free_all_bootmem() Patch fixes a boot crash with pagealloc debugging enabled: Initializing HighMem for node 0 (000377fe:0003fff0) BUG: unable to handle kernel paging request at f6fefe80 IP: [<c1621ab5>] find_range_array+0x5e/0x69 [...] Call Trace: [<c1622064>] __get_free_all_memory_range+0x39/0xb4 [<c1620dd0>] add_highpages_with_active_regions+0x18/0x9b [<c1621a2e>] set_highmem_pages_init+0x70/0x90 [<c162122b>] mem_init+0x50/0x21b [<c16155bd>] start_kernel+0x1bf/0x31c [<c1615065>] i386_start_kernel+0x65/0x67 The crash happens when memblock wants to allocate big area for temporary "struct range" array and reuses pages from top of low memory, which were already passed to the buddy allocator. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: linux-mm@kvack.org Cc: Mel Gorman <mgorman@suse.de> Link: http://lkml.kernel.org/r/20111206080833.GB3105@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-06 09:23:53 +01:00
Glauber Costa	3292beb340	sched/accounting: Change cpustat fields to an array This patch changes fields in cpustat from a structure, to an u64 array. Math gets easier, and the code is more flexible. Signed-off-by: Glauber Costa <glommer@parallels.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Paul Tuner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1322498719-2255-2-git-send-email-glommer@parallels.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-06 09:06:38 +01:00
Peter Zijlstra	4defea8559	perf, x86: Prefer fixed-purpose counters when scheduling This avoids a scheduling failure for cases like: cycles, cycles, instructions, instructions (on Core2) Which would end up being programmed like: PMC0, PMC1, FP-instructions, fail Because all events will have the same weight. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-8tnwb92asqj7xajqqoty4gel@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-06 08:33:58 +01:00
Robert Richter	bc1738f6ee	perf, x86: Fix event scheduler for constraints with overlapping counters The current x86 event scheduler fails to resolve scheduling problems of certain combinations of events and constraints. This happens if the counter mask of such an event is not a subset of any other counter mask of a constraint with an equal or higher weight, e.g. constraints of the AMD family 15h pmu: counter mask weight amd_f15_PMC30 0x09 2 <--- overlapping counters amd_f15_PMC20 0x07 3 amd_f15_PMC53 0x38 3 The scheduler does not find then an existing solution. Here is an example: event code counter failure possible solution 0x02E PMC[3,0] 0 3 0x043 PMC[2:0] 1 0 0x045 PMC[2:0] 2 1 0x046 PMC[2:0] FAIL 2 The event scheduler may not select the correct counter in the first cycle because it needs to know which subsequent events will be scheduled. It may fail to schedule the events then. To solve this, we now save the scheduler state of events with overlapping counter counstraints. If we fail to schedule the events we rollback to those states and try to use another free counter. Constraints with overlapping counters are marked with a new introduced overlap flag. We set the overlap flag for such constraints to give the scheduler a hint which events to select for counter rescheduling. The EVENT_CONSTRAINT_OVERLAP() macro can be used for this. Care must be taken as the rescheduling algorithm is O(n!) which will increase scheduling cycles for an over-commited system dramatically. The number of such EVENT_CONSTRAINT_OVERLAP() macros and its counter masks must be kept at a minimum. Thus, the current stack is limited to 2 states to limit the number of loops the algorithm takes in the worst case. On systems with no overlapping-counter constraints, this implementation does not increase the loop count compared to the previous algorithm. V2: * Renamed redo -> overlap. * Reimplementation using perf scheduling helper functions. V3: * Added WARN_ON_ONCE() if out of save states. * Changed function interface of perf_sched_restore_state() to use bool as return value. Signed-off-by: Robert Richter <robert.richter@amd.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/1321616122-1533-3-git-send-email-robert.richter@amd.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-06 08:33:56 +01:00
Robert Richter	1e2ad28f80	perf, x86: Implement event scheduler helper functions This patch introduces x86 perf scheduler code helper functions. We need this to later add more complex functionality to support overlapping counter constraints (next patch). The algorithm is modified so that the range of weight values is now generated from the constraints. There shouldn't be other functional changes. With the helper functions the scheduler is controlled. There are functions to initialize, traverse the event list, find unused counters etc. The scheduler keeps its own state. V3: * Added macro for_each_set_bit_cont(). * Changed functions interfaces of perf_sched_find_counter() and perf_sched_next_event() to use bool as return value. * Added some comments to make code better understandable. V4: * Fix broken event assignment if weight of the first event is not wmin (perf_sched_init()). Signed-off-by: Robert Richter <robert.richter@amd.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/1321616122-1533-2-git-send-email-robert.richter@amd.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-06 08:33:54 +01:00
Srikar Dronamraju	cc3a1bf52a	x86: Clean up and extend do_int3() Since there is a possibility of !KPROBES int3 listeners (such as kgdb) and since DIE_TRAP is currently not being used by anybody, notify all listeners with DIE_INT3. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20111025142159.GB21225@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-06 08:20:37 +01:00
Srikar Dronamraju	3596ff4e6b	x86: Call do_notify_resume() with interrupts enabled do_notify_resume() gets called with interrupts disabled on x86_32. This is different from the x86_64 behavior, where interrupts are enabled at the time. Queries on lkml on this issue hasn't yielded any clear answer. Lets make x86_32 behave the same as x86_64, unless there is a real reason to maintain status quo. Please refer https://lkml.org/lkml/2011/9/27/130 for more details. A similar change was suggested in ARM: https://lkml.org/lkml/2011/8/25/231 My 32-bit machine works fine (tm) with this patch. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20111025141812.GA21225@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-06 08:20:34 +01:00
H. Peter Anvin	a074335a37	x86, um: Mark system call tables readonly Mark the system call tables readonly, as they already are on native, and the 32-bit UM version was in the previous assembly version. The 32-bit version lost it due to copy and paste from the 64-bit version, which was missing the const. Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Link: http://lkml.kernel.org/r/tip-45db1c6176c8171d9ae6fa6d82e07d115a5950ca@git.kernel.org Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2011-12-05 22:56:38 -08:00
Steffen Persvold	e4a02b4a95	x86: Fix the !CONFIG_NUMA build of the new CPU ID fixup code support I used "ifdef CONFIG_NUMA" simply because it doesn't make sense in a non-numa configuration even with SMP enabled. Besides, the only place where it is called right now is in kernel/cpu/amd.c:srat_detect_node() within the "CONFIG_NUMA" protected part. Signed-off-by: Steffen Persvold <sp@numascale.com> Cc: Daniel J Blueman <daniel@numascale-asia.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Link: http://lkml.kernel.org/r/1323073238-32686-2-git-send-email-daniel@numascale-asia.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-06 07:03:37 +01:00
Ingo Molnar	d6c1c49de5	Merge branch 'perf/urgent' into perf/core Merge reason: Add these cherry-picked commits so that future changes on perf/core don't conflict. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-06 06:43:49 +01:00
Linus Torvalds	45e713efe2	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: intr_remapping: Fix section mismatch in ir_dev_scope_init() intel-iommu: Fix section mismatch in dmar_parse_rmrr_atsr_dev() x86, amd: Fix up numa_node information for AMD CPU family 15h model 0-0fh northbridge functions x86, AMD: Correct align_va_addr documentation x86/rtc, mrst: Don't register a platform RTC device for for Intel MID platforms x86/mrst: Battery fixes x86/paravirt: PTE updates in k(un)map_atomic need to be synchronous, regardless of lazy_mmu mode x86: Fix "Acer Aspire 1" reboot hang x86/mtrr: Resolve inconsistency with Intel processor manual x86: Document rdmsr_safe restrictions x86, microcode: Fix the failure path of microcode update driver init code Add TAINT_FIRMWARE_WORKAROUND on MTRR fixup x86/mpparse: Account for bus types other than ISA and PCI x86, mrst: Change the pmic_gpio device type to IPC mrst: Added some platform data for the SFI translations x86,mrst: Power control commands update x86/reboot: Blacklist Dell OptiPlex 990 known to require PCI reboot x86, UV: Fix UV2 hub part number x86: Add user_mode_vm check in stack_overflow_check	2011-12-05 16:54:15 -08:00
Linus Torvalds	232ea34455	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix loss of notification with multi-event perf, x86: Force IBS LVT offset assignment for family 10h perf, x86: Disable PEBS on SandyBridge chips trace_events_filter: Use rcu_assign_pointer() when setting ftrace_event_call->filter perf session: Fix crash with invalid CPU list perf python: Fix undefined symbol problem perf/x86: Enable raw event access to Intel offcore events perf: Don't use -ENOSPC for out of PMU resources perf: Do not set task_ctx pointer in cpuctx if there are no events in the context perf/x86: Fix PEBS instruction unwind oprofile, x86: Fix crash when unloading module (nmi timer mode) oprofile: Fix crash when unloading module (hr timer mode)	2011-12-05 16:54:00 -08:00
Linus Torvalds	7125faceab	Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched, x86: Avoid unnecessary overflow in sched_clock sched: Fix buglet in return_cfs_rq_runtime() sched: Avoid SMT siblings in select_idle_sibling() if possible sched: Set the command name of the idle tasks in SMP kernels sched, rt: Provide means of disabling cross-cpu bandwidth sharing sched: Document wait_for_completion_*() return values sched_fair: Fix a typo in the comment describing update_sd_lb_stats sched: Add a comment to effective_load() since it's a pain	2011-12-05 16:50:24 -08:00
H. Peter Anvin	45db1c6176	x86, um: Use the same style generated syscall tables as native Now when the native kernel uses a single style of generated system call table, follow suite for UML and implement the same style, all in C. This requires __NR_syscall_max and NR_syscalls to be generated; on native this is done in asm-headers.h but that file is common to all UML architectures; therefore put it in user-headers.h instead which already have accommodations for architecture-specific values. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-12-05 16:08:49 -08:00
Thomas Gleixner	0518469d0a	Merge branch 'fortglx/3.3/tip/timers/core' of git://git.linaro.org/people/jstultz/linux into timers/core	2011-12-05 22:13:49 +01:00
Alessandro Rubini	f9b15df466	x86/Kconfig: Cyclone-timer depends on x86-summit CONFIG_X86_CYCLONE_TIMER depends on CONFIG_X86_32_NON_STANDARD, which forces drivers/clocksource/cyclone.c to be compiled. The file doesn't do anything unless enabled by arch/x86/kernel/apic/summit_32.c Make CONFIG_X86_CYCLONE_TIMER depend by X86_SUMMIT instead, to avoid unnecessary code in other non-standard systems. Signed-off-by: Alessandro Rubini <rubini@gnudd.com> Cc: john stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/20111028224842.GA7582@mail.gnudd.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-05 18:18:15 +01:00
Sebastian Andrzej Siewior	668b448466	x86/div64: Add a micro-optimization shortcut if base is power of two In the target code I have a do_div(x, PAGE_SIZE). The x86-64 version of it was doing a shift and a mask which is clever. The 32bit version of it had a div operation in it which made me think. After digging I noticed that x86 has an optimized version of it. This patch adds this shift and mask optimization if base is constant so we don't have any runtime "checking" overhead since most users use a power of ten. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1322649814-544-1-git-send-email-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-05 18:16:11 +01:00
Andreas Herrmann	f62ef5f3e9	x86, amd: Fix up numa_node information for AMD CPU family 15h model 0-0fh northbridge functions I've received complaints that the numa_node attribute for family 15h model 00-0fh (e.g. Interlagos) northbridge functions shows -1 instead of the proper node ID. Correct this with attached quirks (similar to quirks for other AMD CPU families used in multi-socket systems). Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: Frank Arnold <frank.arnold@amd.com> Cc: Borislav Petkov <borislav.petkov@amd.com> Link: http://lkml.kernel.org/r/20111202072143.GA31916@alberich.amd.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-05 18:13:11 +01:00
Suresh Siddha	28a00184be	x86, tsc: Skip TSC synchronization checks for tsc=reliable tsc=reliable boot parameter is supposed to skip all the TSC stablility checks during boot time. On a 8-socket system where we want to run an experiment with the "tsc=reliable" boot option, TSC synchronization checks are not getting skipped and marking the TSC as not stable. Check for tsc_clocksource_reliable (which is set via tsc=reliable or for platforms supporting synthetic TSC_RELIABLE feature bit etc) and when set, skip the TSC synchronization tests during boot. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1320446537.15071.14.camel@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-05 18:00:31 +01:00
Jan Beulich	f6b2bc8476	x86-64: Cleanup some assembly entry points system_call_after_swapgs doesn't really benefit from forcing alignment from it - quite the opposite, native code needlessly so far got a big NOP instruction inserted in front of it. Xen being the only user of the separate entry point can well live with the branch going to three bytes into a cache line. The compatibility mode ptregs entry points for one can make use of the GLOBAL() macro, and should be suitably aligned. Their shared continuation point (ia32_ptregs_common) otoh doesn't need to be global at all, but should continue to be properly aligned. Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/4ED4CEEA020000780006407D@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-05 17:24:43 +01:00
Jan Beulich	46db09d3fd	x86-64: Slightly shorten line system call entry and exit paths GET_THREAD_INFO() involves a memory read immediately followed by an "sub" on the value read, in turn (in several cases) immediately followed by a use of the calculated value as the base address of a memory access. This combination of instructions has a non-negligible potential for stalls. In the system call entry point code, however, the (fixed) offset of the stack pointer from the end of the stack is generally known, and hence we can instead avoid the memory load and subtract, and instead do the memory reference using %rsp as the base register. To do so in a legible fashion, introduce a THREAD_INFO() macro which, provided a register (generally %rsp) and the known offset from the end of the stack, produces a suitable memory access operand. The patch attempts to only touch the fast paths (no auditing and alike), but manages to do so only in the 64-bit entry point case; the compatibility mode entry points have so many interdependencies between their various branch targets that it was necessary to also adjust the slow paths to eliminate the risk of having missed some register dependency during code analysis. Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/4ED4CD690200007800064075@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-05 17:24:41 +01:00
Jan Beulich	39e9543344	x86-64: Reduce amount of redundant code generated for invalidate_interruptNN Previously these up to 32 entry points, consisting of all the same code except for their very first instruction, consumed 0x70 bytes per instance. Just like for device interrupt entry points, fold them together so that they all use a single instance of the code after having pushed their vector indicator (resulting in 0x10 bytes per instance, to retain 16-byte alignment of the individual entry points). Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/4ED4CA230200007800064065@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-05 17:24:39 +01:00
Jan Beulich	70ea6855d3	x86-64: Slightly shorten int_ret_from_sys_call Testing for a return to ring 0 was necessary here solely because of the branch out of ret_from_fork. That branch, however, can be directed to retint_restore_args, and thus the test-and-branch can be eliminated here. Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/4ED4C7EE0200007800064028@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-05 17:24:37 +01:00
Steffen Persvold	44b111b519	x86: Add NumaChip support Adds support for Numascale NumaChip large-SMP systems. It is needed to enable the booting of more than ~168 cores. v2: - [Steffen] enumerate only accessible northbridges - [Daniel] rediffed and validated against 3.1-rc10 v3: - [Daniel] use x86_init core numbering override - [Daniel] cleanups as per feedback v4: - [Daniel] use updated x86_cpuinit override v5: - drop disabling interrupts locally, as ISR write is atomic; drop delay - added read-mostly annotations where appropriate - require CONFIG_SMP, so drop conditional path Workload tested on 96 cores/16 sockets. Signed-off-by: Steffen Persvold <sp@numascale.com> Signed-off-by: Daniel J Blueman <daniel@numascale-asia.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Link: http://lkml.kernel.org/r/1323101246-2400-1-git-send-email-daniel@numascale-asia.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-05 17:17:24 +01:00
Daniel J Blueman	64be4c1c24	x86: Add x86_init platform override to fix up NUMA core numbering Add an x86_init vector for handling inconsistent core numbering. This is useful for multi-fabric platforms, such as Numascale NumaConnect. v2: - use struct x86_cpuinit_ops - provide default fall-back function to warn Signed-off-by: Daniel J Blueman <daniel@numascale-asia.com> Cc: Steffen Persvold <sp@numascale.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Link: http://lkml.kernel.org/r/1323073238-32686-2-git-send-email-daniel@numascale-asia.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-05 17:17:21 +01:00
Daniel J Blueman	9a0ebfbe3f	x86: Make flat_init_apic_ldr() available Allow flat_init_apic_ldr() to be used outside the compilation unit for similar APIC implementations. Signed-off-by: Daniel J Blueman <daniel@numascale-asia.com> Cc: Steffen Persvold <sp@numascale.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Link: http://lkml.kernel.org/r/1323073238-32686-1-git-send-email-daniel@numascale-asia.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-05 17:17:07 +01:00
Jack Steiner	b565201cf7	x86: Reduce clock calibration time during slave cpu startup Reduce the startup time for slave cpus. Adds hooks for an arch-specific function for clock calibration. These hooks are used on x86. If a newly started cpu has the same phys_proc_id as a core already active, uses the TSC for the delay loop and has a CONSTANT_TSC, use the already-calculated value of loops_per_jiffy. This patch reduces the time required to start slave cpus on a 4096 cpu system from: 465 sec OLD 62 sec NEW This reduces boot time on a 4096p system by almost 7 minutes. Nice... Signed-off-by: Jack Steiner <steiner@sgi.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: John Stultz <john.stultz@linaro.org> [fix CONFIG_SMP=n build] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-05 17:12:43 +01:00
H Hartley Sweeten	2d070eff6b	arch/x86/mm/pageattr.c: Quiet sparse noise; local functions should be static Local functions should be marked static. This also quiets the following sparse noise: warning: symbol '_set_memory_array' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: hartleys@visionengravers.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-05 17:11:05 +01:00
H Hartley Sweeten	706d9a9c8b	arch/x86/kernel/e820.c: quiet sparse noise about plain integer as NULL pointer The last parameter to sort() is a pointer to the function used to swap items. This parameter should be NULL, not 0, when not used. This quiets the following sparse warning: warning: Using plain integer as NULL pointer Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: hartleys@visionengravers.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-05 17:10:20 +01:00
Mathias Nyman	35d4769962	x86/rtc, mrst: Don't register a platform RTC device for for Intel MID platforms Intel MID x86 platforms have a memory mapped virtual RTC instead. No MID platform have the default ports (and accessing them may do weird stuff). Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Alan Cox <alan@linux.intel.com> Cc: feng.tang@intel.com Cc: Feng Tang <feng.tang@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-05 17:09:21 +01:00
Mike Ditto	d1bbdd6692	arch/x86/kernel/e820.c: Eliminate bubble sort from sanitize_e820_map() Replace the bubble sort in sanitize_e820_map() with a call to the generic kernel sort function to avoid pathological performance with large maps. On large (thousands of entries) E820 maps, the previous code took minutes to run; with this change it's now milliseconds. Signed-off-by: Mike Ditto <mditto@google.com> Cc: sassmann@kpanic.de Cc: yuenn@google.com Cc: Stefan Assmann <sassmann@kpanic.de> Cc: Nancy Yuen <yuenn@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-12-05 17:08:14 +01:00

... 2 3 4 5 6 ...

14568 commits