Selected topics on how to make best use of Intel® Processors and Intel® Developer Products
Forschungszentrum Juelich, May 2011

Agenda
• Update Intel® Fortran Compiler
• Improvements for Automatic Vectorization in Intel® Compilers 12.x
• Simultaneous Multi-Threading in latest Intel processors
• Intel® VTune™ Amplifier XE and Performance Monitoring Unit
• [Programming Environment for Intel® Many Integrated Core]

Copyright © 2011, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Intel® Parallel Studio XE 2011

Phase | Productivity Tool | Feature | Benefit
Advanced Build & Debug | Intel® Composer XE | C/C++ and Fortran compilers, performance libraries, and parallel models | Driving application performance and scalability benefits of multicore and forward scale to manycore. Additionally providing code robustness and security.
Advanced Verify | Intel® Inspector XE | Memory & threading error checking tool for higher code reliability & quality | Increases productivity and lowers cost, by catching memory and threading defects early
Advanced Tune | Intel® VTune™ Amplifier XE | Performance Profiler to optimize performance and scalability | Removes guesswork, saves time, makes it easier to find performance and scalability bottlenecks. Combines ease of use with deeper insights.
Intel’s Family of Parallel Programming Models

Intel® Parallel Building Blocks (PBB): Intel® Cilk Plus, Intel® Threading Building Blocks (TBB), Intel® Array Building Blocks (ArBB)
Fixed Function Libraries: Intel® Math Kernel Library (MKL), Intel® Integrated Performance Primitives (IPP)
Established Standards: MPI, OpenMP*, CAF of F2008, OpenCL*
Research and Exploration: Intel® Concurrent Collections (CnC)

• Intel® Cilk Plus, Intel® TBB, Intel® MKL and Intel® IPP are part of both Intel® Parallel Studio and Intel® Parallel Studio XE
• Intel® ArBB is in beta; to be released later as part of Intel® Parallel Studio XE
• See preview version on whatif.intel.com for CnC and OpenCL

F2003 Standard Compliance
Fortran 2003 implementation mostly complete
Added in 12.0 (not in 11.1):
• Complete type-bound procedures (GENERIC, OPERATOR, ...)
• FINAL procedures
Remaining major features of F2003 not implemented:
• User-defined derived type I/O
• Parameterized derived types

5/17/2011

Fortran 2008 Features
• Co-arrays
• DO CONCURRENT
• CONTIGUOUS
• I/O enhancements
• New constants in ISO_FORTRAN_ENV
• New intrinsic functions
• Increase maximum rank from 7 to 31 (F2008 requires only 15)
Fortran Standards Support

Fortran 2003 Support (columns: Intel Fortran Compiler 12.0, then competitors A, B, C, D, E, F, G)
IEEE Arithmetic | Complete | Complete | No | Partial | Complete | Complete | Complete | No
Allocatable Enhancements | Complete | Complete | Complete | Complete | Complete | Complete | Complete | Complete

[Table, remaining rows. Fortran 2003 Support: Data Enhancements & Object-Orientation, Miscellaneous, Input/Output, C Interoperability. Fortran 2008 Support: Sub-modules, Co-Arrays, Performance Enhancements, Data Enhancements, Accessing Data Objects, Input/Output, Execution Controls, Intrinsic Procedures for Bit Processing, Intrinsic Procedures and Modules, Programs and Procedures. Cell values are Complete, Nearly Complete, All But 1 Feature, Partial, Almost None, or No; partial entries name individual features such as Stop Code, Long Integers, Recursive Input/Output, Block Construct, Counting Bits, Error & Gamma, Parity Functions, and internal procedure as an actual argument.]

Key to Standards Coverage:
• Complete Support or All But 1 Feature
• Nearly Complete
• No = None, Almost None, Partial, Fragmentary

Intel® Fortran is the only implementation with full VAX Fortran and CVF functionality
F2008 DO CONCURRENT
A new Parallel Loop Construct
• Syntax uses elements of the Fortran 95 FORALL:
    DO [,] CONCURRENT <forall-header>
• Semantically, however, there is a key difference to FORALL:
  – No dependencies between the iterations of the loop body are permitted (no “loop-carried dependencies”)
• The semantics of DO CONCURRENT make it easier to parallelize
• Parallel execution of DO CONCURRENT requires option -parallel (/Qparallel)
• No requirement that the loop be done in parallel
• The implementation in Intel® Compiler 12.0 will execute the iterations in parallel using OpenMP*

F2008 DO CONCURRENT
Example:
    DO CONCURRENT (i=1:m)
      a(k+i) = a(k+i) + factor*a(l+i)
    END DO
Unlike with FORALL, with DO CONCURRENT the programmer guarantees that the values of m, k and l will never cause a(l+i) to reference an element of the array that is defined on the LHS. In other words: the array sections a(l+1:l+m) and a(k+1:k+m) do not overlap.
This allows the compiler to generate very efficient parallel code.

Co-Array Fortran Fundamentals
Simple extension to Fortran that makes it a robust and efficient parallel programming language
Single-Program, Multiple-Data programming model (SPMD):
• A single program is replicated a fixed number of times
• Each program instance has its own set of data objects and is called an “IMAGE”
• Each image executes asynchronously and normal Fortran rules apply
• Extensions to normal Fortran array syntax allow images to reference data in other image(s)
Part of the Fortran 2008 standard
Targeting both shared and distributed memory architectures (cluster)
No language elements related to co-existence with other parallel models (e.g. mixing explicit MPI, OpenMP and coarray coding)
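The non-overlap guarantee from the DO CONCURRENT example can be made concrete with a small C sketch. This is only an illustration under stated assumptions: the function names `sections_overlap` and `scaled_add_sections` are hypothetical, the array is indexed from 1 to mirror the Fortran example (element 0 is unused), and the check encodes the slide's condition that the sections a(k+1:k+m) and a(l+1:l+m) share no element. It does not show how the Intel compiler implements the construct.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the index ranges [k+1, k+m] and
   [l+1, l+m] share at least one element, 0 otherwise. */
static int sections_overlap(long k, long l, long m)
{
    long d = (k > l) ? k - l : l - k;   /* distance between the section starts */
    return m > 0 && d < m;
}

/* C analogue of the slide's loop: a(k+i) = a(k+i) + factor*a(l+i), i = 1..m.
   Indices are used 1-based as in Fortran, so a[0] is never touched.
   The assert states the guarantee the programmer gives with DO CONCURRENT. */
static void scaled_add_sections(double *a, long k, long l, long m, double factor)
{
    assert(a != NULL);
    assert(!sections_overlap(k, l, m));
    for (long i = 1; i <= m; i++)
        a[k + i] += factor * a[l + i];  /* iterations are independent */
}
```

With the sections disjoint, no iteration reads a value that another iteration writes, so the iterations may run in any order or in parallel.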
CAF Memory Model
    real :: x(n)[*]
[Figure: every image (p, q, ...) holds its own copy x(1) ... x(n); x(1)[q] refers to element 1 on image q, x(n)[p] refers to element n on image p.]

Coarrays in Intel Fortran 12.0
• Coarray implementation uses Intel MPI 4.0.1
  – MPI library is part of the compiler installation
  – No need for any additional library
  – No support for older Intel MPI releases or non-Intel MPI
• Support for 32-bit (IA-32) and 64-bit (Intel® 64)
• Shared memory supported on both Windows* and Linux*
• Distributed memory support currently only for Linux
  – Windows support added in next release (12.1)
• Development for distributed memory requires Intel® Cluster Tool Kit license
  – No license required for execution

Coarrays in Intel Fortran 12.0
• Must compile with -coarray[=shared | distributed]
  – shared is default*
• To set the number of images to “n”: -coarray-num-images=n
  – The default is the number of cores/processors at run-time
  – Shared memory: includes cores and logical processors of hyper-threading
  – Distributed memory: same rules as for an MPI application
  – Environment variable FOR_COARRAY_NUM_IMAGES can be used too, without re-compilation
• The option -coarray-config-file=<filename> names a file that defines MPI-specific parameters
• Mixing explicit MPI and CAF is currently not supported
  – Support will be added in summer 2011
*Please note the difference to the current version of the manual, which describes the default as depending on the license installed; the release notes are correct
Refresh: Intel Instruction Set Extensions

Year | Extension | Instructions | Focus
1999 | SSE | 70 instr | Single-Precision Vectors, Streaming operations
2000 | SSE2 | 144 instr | Double-precision Vectors, 8/16/32/64/128-bit vector integer
2004 | SSE3 | 13 instr | Complex Data
2006 | SSSE3 | 32 instr | Decode
2007 | SSE4.1 | 47 instr | Video, Graphics building blocks, Advanced vector instr
2008 | SSE4.2 (Nehalem) | 8 instr | String/XML processing, POP-Count, CRC

Continued by
• Intel® AES New Instructions: Intel® AES-NI (2009)
• Intel® Advanced Vector Extensions: Intel® AVX (2010/11)

Many Ways to “vectorize”
From ease of use (top) to programmer control (bottom):
• Fully automatic vectorization
• Auto vectorization hints (#pragma ivdep)
• User Mandated Vectorization (SIMD Pragma/Directive), new in 12.0 !!
• SIMD intrinsic class (F32vec4 add)
• Vector intrinsic (_mm_add_ps())
• ASM code (addps)

Automatic Vectorization
Transforming sequential code to exploit the vector (SIMD, SSE) processing capabilities
• Manually by explicit source code modification
• Automatically by tools like a compiler
    for (i=0;i<MAX;i++)
      c[i]=a[i]+b[i];
[Figure: with 128-bit registers, four element pairs are added at once: A[3..0] + B[3..0] = C[3..0].]
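To make the transformation behind automatic vectorization tangible, the following C sketch writes out by hand the loop shape a vectorizer aims for with 128-bit registers: a main loop that processes four floats per trip and a remainder loop for the leftover elements. The function names and the `LANES` constant are assumptions for illustration. The code is plain scalar C and does not use SIMD instructions itself; whether a given compiler emits SIMD code for either form depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* four 32-bit floats fit into one 128-bit register */

/* Scalar reference: one addition per iteration, as in the slide's loop. */
static void add_scalar(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Strip-mined form: the main loop handles LANES elements per trip,
   the remainder loop handles the last n % LANES elements. */
static void add_strip_mined(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        /* A vectorizing compiler could map these four adds to one SIMD add. */
        c[i + 0] = a[i + 0] + b[i + 0];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)          /* remainder loop */
        c[i] = a[i] + b[i];
}
```

Both functions compute the same result; the second only makes the grouping into register-sized chunks explicit.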
Compiler Based Vectorization Switches: -x<EXTENSION>

Feature | Extension
Intel® Streaming SIMD Extensions 2 (Intel® SSE2) as available in initial Pentium® 4 or compatible non-Intel processors | SSE2
Intel® Streaming SIMD Extensions 3 (Intel® SSE3) as available in Pentium® 4 or compatible non-Intel processors | SSE3
Intel® Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) as available in Intel® Core™2 Duo processors | SSSE3
Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture | SSE4.1
Intel® SSE4.2 Accelerated String and Text Processing instructions supported first by Intel® Core™ i7 processors | SSE4.2
Extensions offered by Intel® ATOM™ processor: Intel® SSSE3 (!!) and MOVBE instruction | SSE3_ATOM
Intel® Advanced Vector Extensions (Intel® AVX) as available in 2nd generation Intel Core processor family | AVX

Selecting Right Extensions makes a Difference !
    static double A[1000], B[1000], C[1000];
    void add() {
      int i;
      for (i=0; i<1000; i++)
        if (A[i]>0)
          A[i] += B[i];
        else
          A[i] += C[i];
    }

SSE2:
    .B1.2::
      movaps   xmm2, A[rdx*8]
      xorps    xmm0, xmm0
      cmpltpd  xmm0, xmm2
      movaps   xmm1, B[rdx*8]
      andps    xmm1, xmm0
      andnps   xmm0, C[rdx*8]
      orps     xmm1, xmm0
      addpd    xmm2, xmm1
      movaps   A[rdx*8], xmm2
      add      rdx, 2
      cmp      rdx, 1000
      jl       .B1.2

SSE4.1:
    .B1.2::
      movaps   xmm2, A[rdx*8]
      xorps    xmm0, xmm0
      cmpltpd  xmm0, xmm2
      movaps   xmm1, C[rdx*8]
      blendvpd xmm1, B[rdx*8], xmm0
      addpd    xmm2, xmm1
      movaps   A[rdx*8], xmm2
      add      rdx, 2
      cmp      rdx, 1000
      jl       .B1.2

AVX:
    .B1.2::
      vmovaps   ymm3, A[rdx*8]
      vmovaps   ymm1, C[rdx*8]
      vcmpgtpd  ymm2, ymm3, ymm0
      vblendvpd ymm4, ymm1, B[rdx*8], ymm2
      vaddpd    ymm5, ymm3, ymm4
      vmovaps   A[rdx*8], ymm5
      add       rdx, 4
      cmp       rdx, 1000
      jl        .B1.2

Many Improvements in 12.0
Sample: Mixed Data Type
    void foo(int n, float *A, double *B){
      int i;
      float t = 0.0f;
    #pragma ivdep
      for (i=0; i<n; i++) {
        A[i] = t;
        B[i] = t;
        t += 1.0f;
      }
    }
• The 11.1 compiler wants to use a “full vector” for each statement. Using XMM (128-bit) as an example:
  – A[i] = … for 4 elements at a time
  – B[i] = … for 2 elements at a time
  – 4 != 2. Give up.
    mixed.c(5): (col. 3) remark: loop was not vectorized: unsupported data type.
• The 12.0 compiler matches the number of elements:
  – A[i] = … for 2 or 4 elements at a time
  – B[i] = … for 2 or 4 elements at a time
  – 2=2. Good. 4=2x2. Good.
    mixed.c(5) (col. 3): remark: LOOP WAS VECTORIZED.

User-Mandated Vectorization
• User-mandated vectorization (also called “SIMD Vectorization”) is based on a new SIMD directive (or “pragma”)
• The SIMD directive provides additional information to the compiler to enable vectorization of loops (at this time only the inner loop)
• It supplements automatic vectorization, but works differently from the traditional directives (“automatic vectorization hints”) such as IVDEP, VECTOR ALWAYS, etc.
  – Traditional directives: a hint; does not necessarily override the compiler’s heuristics
  – New SIMD directive: more like an assertion; if vectorization still fails, it is considered a fault (an option controls whether it is really treated as an error)
• The relationship is similar to OpenMP* versus automatic parallelization: user-mandated vectorization relates to pure automatic vectorization as OpenMP* relates to automatic parallelization

Clauses of SIMD Directive
• vectorlength(num1, num2, …, numN)
  Each iteration of the vector loop executes the computation equivalent to VL iterations of the scalar loop, where VL is one of the listed numbers.
• private(var1, var2, …, varN)
  The variables are private to each iteration. The initial value is broadcast to all private instances, and the last value is copied out from the last iteration instance.
• linear(var1:step1, var2:step2, …, varN:stepN)
  For every iteration of the scalar loop, varX is incremented by stepX. Every iteration of the vector loop therefore increments varX by VL*stepX.
• reduction(operator:var1, var2, …, varN)
  A vector reduction of kind operator is applied to varX.
• [no]assert
  To assert or not to assert when the vectorization fails. Default is to assert for SIMD pragma.

Sample: SIMD Directive Linear Clause
        do 10 i=n1,n,n3
          k = k + j
          a(i) = a(i) + b(n-k+1)
    10  continue
Vectorization fails: “Subscript too complex”

    !DIR$ SIMD linear(k:j)
        do 10 i=n1,n,n3
          k = k + j
          a(i) = a(i) + b(n-k+1)
    10  continue
Vectorization succeeds now: the compiler receives the additional information that k is an induction variable being incremented by j in each iteration. This is sufficient to enable vectorization.

Array Section Notation for C/C++
Array Section Notation:
    <array base> [ <lower bound> : <length> [: <stride>] ] [ <lower bound> : <length> [: <stride>] ] ...
Note that the second field is a length, not an upper bound as in Fortran [lower bound : upper bound].
    A[:]          // All elements of vector A
    B[2:6]        // Elements 2 to 7 of vector B
    C[:][5]       // Column 5 of matrix C
    D[0:3:2]      // Elements 0,2,4 of vector D
    E[0:3][0:4]   // 12 elements from E[0][0] to E[2][3]
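Because the section notation takes a length rather than an upper bound, it can help to see which indices a section selects. The sketch below expands `[lower : length : stride]` into explicit indices in standard C. The function names `section_indices` and `section_last` are hypothetical helpers introduced only for this illustration; they are not part of Intel® Cilk Plus or of any compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes the indices selected by
   [lower : length : stride] into out and returns how many were written.
   The caller must provide room for at least `length` entries. */
static size_t section_indices(size_t lower, size_t length, size_t stride,
                              size_t *out)
{
    assert(out != NULL);
    for (size_t j = 0; j < length; j++)
        out[j] = lower + j * stride;
    return length;
}

/* Hypothetical helper: index of the last element of a non-empty section.
   Shows why B[2:6] ends at element 7 and not at element 6. */
static size_t section_last(size_t lower, size_t length, size_t stride)
{
    assert(length >= 1);
    return lower + (length - 1) * stride;
}
```

For example, expanding `D[0:3:2]` with `section_indices(0, 3, 2, idx)` yields the indices 0, 2, 4, and `section_last(2, 6, 1)` gives 7 for `B[2:6]`, matching the comments on the slide.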
Operator Maps
• Most arithmetic and logic operators for C/C++ basic data types are available for array sections:
    +, -, *, /, %, <, ==, !=, >, |, &, ^, &&, ||, !, -(unary), +(unary), ++, --, +=, -=, *=, /=, *(p)
• An operator is implicitly mapped to all the elements of the array section operands:
    a[0:s]+b[2:s] => {a[i]+b[i+2], forall (i=0;i<s;i++)}
  – Operations are parallel among all the elements
  – Array operands must have the same rank
• A scalar operand is automatically expanded to fill the whole section:
    a[0:s]*c => {a[i]*c, forall (i=0;i<s;i++)}

Assignment Map
• Assignment maps to all elements of the LHS array section in parallel:
    a[:][:] = b[:][2][:] + c;
    e[:] = d;
    e[:] = b[:][1][:];   // error
    a[:][:] = e[:];      // error
• The LHS of an assignment defines an array context in which the RHS is evaluated
  – The rank of the RHS array section must be the same as that of the LHS
  – The length of each rank must match the corresponding LHS rank
  – A scalar is expanded automatically

Assignment: Semantics
• The RHS is evaluated before any value is assigned to the LHS
• The compiler ensures these semantics in the generated code, potentially by introducing a temporary variable to store the RHS
    a[1:s] = a[0:s]+1   // old value of a[1:s-1] is used
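The rule that the RHS is evaluated before any element of the LHS is assigned matters whenever both sides overlap, as in `a[1:s] = a[0:s]+1`. The following C sketch contrasts a version that evaluates the whole RHS into a temporary with a naive ascending loop that reads values it has just written. The function names are hypothetical, and the temporary buffer is only one possible way to obtain these semantics; the slides do not state how the compiler actually generates the code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* a[1:s] = a[0:s] + 1 with array-section semantics: the whole RHS is
   evaluated into a temporary first. Returns 0 on success, -1 if the
   temporary cannot be allocated. a must hold at least s + 1 elements. */
static int assign_with_temp(int *a, size_t s)
{
    assert(a != NULL);
    if (s == 0)
        return 0;
    int *tmp = malloc(s * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    for (size_t i = 0; i < s; i++)
        tmp[i] = a[i] + 1;               /* reads only old values of a */
    memcpy(a + 1, tmp, s * sizeof *tmp); /* then assigns the LHS */
    free(tmp);
    return 0;
}

/* Naive ascending loop: iteration i reads a[i], which iteration i-1 has
   already overwritten, so the result differs from the section semantics. */
static void assign_naive(int *a, size_t s)
{
    assert(a != NULL);
    for (size_t i = 0; i < s; i++)
        a[i + 1] = a[i] + 1;
}
```

Starting from {10, 20, 30, 40} with s = 3, the first version yields {10, 11, 21, 31}, while the naive loop yields {10, 11, 12, 13}.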
Simultaneous Multi-Threading (SMT)
“Intel Hyper-Threading – HT”
• SMT: Run 2 threads at the very same time per core
• Take advantage of the 4-wide execution engine
  – Keep it fed with multiple threads
  – Hide latency of a single thread
• Most power efficient performance feature
  – Very low die area cost
  – Can provide significant performance benefit depending on application
  – Much more efficient than adding an entire core
• Nehalem advantages
  – Larger caches
  – Massive memory BW
Simultaneous multi-threading enhances performance and energy efficiency
[Figure: execution unit occupancy over time (proc. cycles), w/o SMT versus SMT. Note: Each box represents a processor execution unit]

SMT Implementation Details
• Replicated: duplicate state for SMT
  – Register state
  – Renamed RSB
  – Large page ITLB
• Partitioned: statically allocated between threads
  – Key buffers: Load, Store, Reorder
  – Small page ITLB
• Competitively shared: depends on thread’s dynamic behavior
  – Reservation station
  – Caches
  – Data TLBs, 2nd level TLB
• Unaware
  – Execution units
Applications that will benefit:
• Complex memory access (memory access stalls)
• Mix of instruction types (e.g. integer and FP computation)

SMT Thread Selection Points
The execution pipeline has multiple thread selection points where the architecture can select to work for one of the 2 logical threads
[Figure: pipeline stages Predict/Fetch, Decode, Alloc, Schedule, EX, Retire, with the structures IQ, RS and ROB]
• Select thread to fetch instructions from
• Select instruction to decode
• Select u-operation to allocate
• Select instruction to retire
• Additional selection points in the memory pipeline, like scheduling of MOB entries (memory order buffer)
SMT Performance Chart
[Chart: performance gain, SMT enabled vs. disabled, on Intel® Core™ i7. Benchmarks shown: Floating Point, 3dsMax*, Integer, Cinebench* 10, POV-Ray* 3.7 beta 25, 3DMark* Vantage* CPU. The gains shown range from 7% to 34%.]

SPEC, SPECint, SPECfp, and SPECrate are trademarks of the Standard Performance Evaluation Corporation. For more information on SPEC benchmarks, see: http://www.spec.org
Floating Point is based on SPECfp_rate_base2006* estimate. Integer is based on SPECint_rate_base2006* estimate.
Source: Intel. Configuration: pre-production Intel® Core™ i7 processor with 3 channel DDR3 memory. Performance tests and ratings are measured using specific computer systems and / or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/
Intel® VTune™ Amplifier XE
• Integrates popular and mature features of Intel® VTune™ Performance Analyzer, Intel® Parallel Amplifier, Intel® Thread Profiler and Intel® Performance Tuning Utility
  – But not a super-set in all cases
  – Some additional features are being worked on and will be added later; some are still being evaluated and might be added in future updates
• Standalone GUI on Linux* and Windows*
• GUI in all environments based on wxWidgets:
  – Very fast and stable
  – Same look-and-feel for Linux & Windows
  – Fast and native implementation on Linux
  – No sluggish and fragile emulations anymore !!
• Comprehensive command line interface
• New instrumentation technologies for data collection

Event Based Sampling and Counting
• Captures program performance information in terms of hardware specific (PMU) events
• Uses event multiplexing to collect as much information as possible during a single run
• Initially only Sampling; support for Counting (EBC) will be added in a future update
• EBC gains relevance since most UNCORE events of the latest Intel processors cannot be “sampled”
  – EBC doesn't capture CPU state information
  – EBC data cannot be attributed to the program flow
  – EBC mode has lower overhead and collects smaller trace files (tb5 / tb6)

Event Based Sampling
• Pre-defined event sets for supported processors
  – Top level triage
  – Cache analysis and false sharing
  – Branching issues
  – Long-latency computation operations
  – Structural hazards
  – Working set size
  – Data access patterns
  – Memory latency
  – Memory bandwidth
• Any other events can be collected
  – Some 1100 for Nehalem architecture
• Pre-defined displays (viewpoints) transform data into information
Event Based Sampling
[Figure: Event Based Sampling]

Core Pipeline Overview
[Figure: core pipeline. Front End: 32K Instruction Cache, Branch Target Buffer, Decode Pipeline, MSROM, micro-op queue. Execution: Register Allocation Table (RAT), Out Of Order Rename/retirement, Scheduler, INT / SSE / X87 Register Stacks; Integer, MMX/SSE and X87 units on ports 0, 1 and 5; Load on port 2; Store (address) on port 3; Store data on port 4. Caches: 32K Data Cache, 256K L2 Cache, uncore. Events marked in the pipeline: UOPS_ISSUED, UOPS_EXECUTED, UOPS_RETIRED.]

Nehalem: Last Level Cache Miss (L3 Miss)?
• Probably the most wanted single, non-trivial “event”
• The complexity of the memory structure makes it difficult to define an LLC miss
  – An L3 miss can be a hit in another socket’s L3 (modified or non-modified), in any of the LLCs, or in one of the remote L1 or L2 caches
  – A modified hit in another socket’s L3 is more expensive than accessing local DRAM
• If it existed in hardware, it would be an UNCORE event
  – Very difficult (impossible?) to sample

Nehalem LLC Miss: The Solution
• Latency events for memory accesses:
    MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_<X>
  – <X> can be 4, 8, 16, 32, 128, 256, … 32768
• E.g. all memory accesses taking more than 128 cycles:
    MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_128
  – In fact, the pseudo-event LLC MISS is an alias for this event
Memory Bandwidth
• Delivered + speculative traffic to local memory
• Precise totals can be measured in the IMC but cannot be broken down per source
    UNC_IMC_NORMAL_READS.ANY
    UNC_IMC_WRITES.FULL.ANY
• Intel provides a patch/script for Intel® Performance Tuning Utility to simplify bandwidth measurements
  – Available from Premier.intel.com, see product “VTune Performance Analyzer”
  – For PTU see whatif.intel.com
  – Measures the total bandwidth load for a selected time period
  – See forums on software.intel.com for more details and tips

Introduction to Core architecture
Execution Unit: Port Mapping
[Figure: port mapping. Port 0: ALU1; Port 1: ALU2; Port 5: ALU3; Port 2: LOAD AGU; Port 3: STORE AGU; Port 4: STORE DATA (MIU). Further units shown on the ports: SHIFT1, SHIFT2, IMUL, LEA, SIMD FP, SIALU1, SIALU2, SIMUL, SISHIFT, SISHUF, JEU, FMUL, FADD, FDIV, FPREM, FSHUF. Surrounding structures: INT Reservation station (36 entries), ROB, MOB, LB (48 entries), SB (32 entries), DTLB, PMH, L1-D$, L2$/LL$, Result Bus.]

Intel® Architecture Code Analyzer
User Interface
    > iaca -f matrix_multiply.exe
    Analysis Report
    ---------------
    Total Throughput: 4 Cycles
    Total Latency: 12 Cycles
    Throughput Bottleneck: Port 5
    Total number of Uops: 13

    Port Binding in cycles:
    -------------------------------------------------------
    | Port   | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |
    -------------------------------------------------------
    | Cycles | 1 | 0 | 1 | 3 | 2 | 2 | 2 | 2 | 4 |
    -------------------------------------------------------
    (port 5 is the bottleneck)

    N  - port number or number of cycles port was bound
    DV - Divider pipe (on port 0)
    D  - Data fetch pipe (on ports 2 and 3)
    CP - on a critical path
    X  - other ports that can be used by this instruction
    F  - Macro Fusion with the next instruction occurred
    *  - instruction micro-ops not bound to a port
    @  - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
    !  - instruction not supported, was not accounted in Analysis

    | Num of |           Ports pressure in cycles           |    |
    | Uops   | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |    |
    ------------------------------------------------------------
    | 1  | | | | 1 | 1 | X | X | | | CP | vmovsd xmm0, qword ptr [rax+rbx*1]
    | 2  | | | | X | X | 1 | 1 | | 1 | CP | vunpcklpd xmm0, xmm0, xmmword ptr [rax+rbx*1+0x20]
    | 1  | | | | 1 | 1 | X | X | | | CP | vmovsd xmm1, qword ptr [rax+rbx*1+0x40]
    | 2  | | | | X | X | 1 | 1 | | 1 | CP | vunpcklpd xmm1, xmm1, xmmword ptr [rax+rbx*1+0x60]
    | 1  | | | | | | | | | 1 | CP | vinsertf128 ymm0, ymm0, xmm1, 0x1
    | 1* | | | | | | | | | | | vxorps ymm1, ymm1, ymm1
    | 1  | | | 1 | | | | | | | CP | vmaxpd ymm1, ymm1, ymm0
    | 2  | | | | 1 | | X | | 2 | | CP | vmovaps ymmword ptr [rcx+rbx*4], ymm1
    | 1  | 1 | | X | | | | | | X | | add rbx, 0x8
    | 0  | | | | | | | | | F | | cmp rbx, 0x20
    | 1  | | | | | | | | | 1 | | jnz 0xffffffcc

Annotations on the slide:
• Numbers show the actual port binding; X marks an alternative port binding
• CP identifies instructions in the critical path
• vxorps (1*) is not bound to a port
• CMP & JNZ are macro-fused

Optimization Notice
Intel compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors.
For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel Compiler User and Reference Guides” under “Compiler Options.” Many library routines that are part of Intel compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.

Notice revision #20110307

Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.
Copyright © 2011. Intel Corporation.
http://intel.com/software/products