Selected topics on how to make best use of Intel® Processors and Intel® Developer Products
Forschungszentrum Juelich, May 2011
Agenda
• Update: Intel® Fortran Compiler
• Improvements for Automatic Vectorization in Intel® Compilers 12.x
• Simultaneous Multi-Threading in the latest Intel processors
• Intel® VTune™ Amplifier XE and the Performance Monitoring Unit
• [Programming Environment for Intel® Many Integrated Core]
Copyright © 2011, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Intel® Parallel Studio XE 2011
• Advanced Build & Debug — Intel® Composer XE: C/C++ and Fortran compilers, performance libraries, and parallel models. Drives application performance and scalability on multicore, forward-scales to manycore, and additionally provides code robustness and security.
• Advanced Verify — Intel® Inspector XE: memory & threading error checking tool for higher code reliability & quality. Increases productivity and lowers cost by catching memory and threading defects early.
• Advanced Tune — Intel® VTune™ Amplifier XE: performance profiler to optimize performance and scalability. Removes guesswork, saves time, and makes it easier to find performance and scalability bottlenecks; combines ease of use with deeper insights.
Intel's Family of Parallel Programming Models
Intel® Parallel Building Blocks (PBB):
• Intel® Threading Building Blocks (TBB)
• Intel® Array Building Blocks (ArBB)
• Intel® Cilk Plus
Established standards:
• MPI
• OpenMP*
• OpenCL*
• Coarrays (CAF) of Fortran 2008
• Intel® Math Kernel Library (MKL)
• Intel® Integrated Performance Primitives (IPP) and fixed-function libraries
Research and exploration:
• Intel® Concurrent Collections (CnC)

• Intel® Cilk Plus, Intel® TBB, Intel® MKL and Intel® IPP are part of both Intel® Parallel Studio and Intel® Parallel Studio XE
• Intel® ArBB is in beta; to be released later as part of Intel® Parallel Studio XE
• See whatif.intel.com for preview versions of CnC and OpenCL
F2003 Standard Compliance
The Fortran 2003 implementation is mostly complete.
Added in 12.0 (not in 11.1):
• Complete type-bound procedures (GENERIC, OPERATOR, ...)
• FINAL procedures
Remaining major F2003 features not implemented:
• User-defined derived-type I/O
• Parameterized derived types
Fortran 2008 Features
• Coarrays
• DO CONCURRENT
• CONTIGUOUS
• I/O enhancements
• New constants in ISO_FORTRAN_ENV
• New intrinsic functions
• Maximum rank increased from 7 to 31 (F2008 requires only 15)
Fortran Standards Support
[Table: Fortran 2003 and Fortran 2008 feature coverage of the Intel Fortran Compiler 12.0 compared with seven competing compilers (A–G). F2003 rows cover IEEE arithmetic, allocatable enhancements, data enhancements & object orientation, miscellaneous features, input/output, and C interoperability; F2008 rows cover submodules, coarrays, performance enhancements, data enhancements, accessing data objects, input/output, execution control (stop codes, BLOCK construct), intrinsic procedures for bit processing (counting bits, parity), intrinsic procedures and modules (error & gamma functions), internal procedures as actual arguments, long integers, and recursive I/O. Intel Fortran 12.0 is "Complete" or "Nearly Complete" in nearly every category, while competitor coverage ranges from "Complete" down to "None".]
Intel® Fortran is the only implementation with full VAX Fortran and CVF functionality.
Key to standards coverage: "Complete" = complete support or all but one feature; "Nearly Complete"; "No" = none, almost none, or only fragmentary partial support.
F2008 DO CONCURRENT
A new parallel loop construct.
• The syntax uses elements of the Fortran 90 FORALL:
  DO [,] CONCURRENT <forall-header>
• Semantically there is a key difference from FORALL, however: no dependencies between the iterations of the loop body are permitted (no "loop-carried dependencies")
• The semantics of DO CONCURRENT make it easier to parallelize
• Parallel execution of DO CONCURRENT requires the option -parallel (/Qparallel)
• There is no requirement that the loop actually be executed in parallel
• The implementation in Intel® Compiler 12.0 executes the iterations in parallel using OpenMP*
F2008 DO CONCURRENT
Example:
DO CONCURRENT (i=1:m)
   a(k+i) = a(k+i) + factor*a(l+i)
END DO
Unlike FORALL, with DO CONCURRENT the programmer guarantees that the values of m, k and l will never cause a(l+i) to reference an element of the array defined on the LHS — in other words, that the array sections a(l+1:l+m) and a(k+1:k+m) do not overlap.
This allows the compiler to generate very efficient parallel code.
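The guarantee can be pictured in scalar C terms (our own sketch, not from the deck; the function name is hypothetical): the loop body touches two sections of a, and the caller promises they do not overlap, so every iteration is independent and may run in any order or in parallel.

```c
/* C analogue of the DO CONCURRENT loop above (0-based): the caller
   guarantees that a[k..k+m-1] and a[l..l+m-1] do not overlap, so the
   iterations carry no dependencies. */
void saxpy_sections(double *a, int m, int k, int l, double factor)
{
    for (int i = 0; i < m; i++)
        a[k + i] += factor * a[l + i];
}
```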
Co-Array Fortran Fundamentals
A simple extension that turns Fortran into a robust and efficient parallel programming language.
Single-Program, Multiple-Data (SPMD) programming model:
• A single program is replicated a fixed number of times
• Each program instance has its own set of data objects and is called an "image"
• Each image executes asynchronously, and normal Fortran rules apply
• Extensions to normal Fortran array syntax allow images to reference data in other images
Part of the Fortran 2008 standard.
Targets both shared and distributed memory architectures (clusters).
No language elements address co-existence with other parallel models (e.g. mixing explicit MPI, OpenMP and coarray coding).
CAF Memory Model

real :: x(n)[*]

[Figure: each image (e.g. image p and image q) holds its own copy of x(1)...x(n); remote data is addressed with the codimension in square brackets, e.g. x(1)[q] refers to x(1) on image q, and x(n)[p] refers to x(n) on image p.]
Coarrays in Intel Fortran 12.0
The coarray implementation uses Intel MPI 4.0.1:
• The MPI library is part of the compiler installation
• No need for any additional library!
• No support for older Intel MPI releases or non-Intel MPI!
Support for 32-bit (IA-32) and 64-bit (Intel® 64).
Shared memory is supported on both Windows* and Linux*.
Distributed memory is currently supported only on Linux:
• Windows support will be added in the next release (12.1)
Development for distributed memory requires an Intel® Cluster Toolkit license:
• No license is required for execution!
Coarrays in Intel Fortran 12.0
Must compile with -coarray[=shared | distributed]; shared is the default*.
To set the number of images to n: -coarray-num-images=n
• The default is the number of cores/processors at run-time
• Shared memory: includes the cores and logical processors of Hyper-Threading
• Distributed memory: same rules as for an MPI application
• The environment variable FOR_COARRAY_NUM_IMAGES can also be used, without re-compilation
The option -coarray-config-file=<filename> allows definition of MPI-specific parameters.
Mixing explicit MPI and CAF is currently not supported:
• Support will be added in summer 2011
*Please note the difference to the current version of the manual, which describes the default as depending on the installed license; the release notes are correct.
Refresh: Intel Instruction Set Extensions
• SSE (1999): 70 instructions; single-precision vectors; streaming operations
• SSE2 (2000): 144 instructions; double-precision vectors; 8/16/32-bit and 64/128-bit vector integer
• SSE3 (2004): 13 instructions; complex data
• SSSE3 (2006): 32 instructions; decode
• SSE4.1 (2007): 47 instructions; video and graphics building blocks; advanced vector instructions
• SSE4.2 (2008, Nehalem): 8 instructions; string/XML processing; population count (POPCNT); CRC
Continued by:
• Intel® AES New Instructions — Intel® AES-NI (2009)
• Intel® Advanced Vector Extensions — Intel® AVX (2010/11)
Many Ways to "Vectorize"
Ordered from greatest ease of use to greatest programmer control:
• Fully automatic vectorization
• Auto-vectorization hints (#pragma ivdep)
• User-mandated vectorization (SIMD pragma/directive) — new in 12.0!
• SIMD intrinsic classes (e.g. F32vec4 add)
• Vector intrinsics (e.g. _mm_add_ps())
• ASM code (e.g. addps)
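The rungs of this ladder can be illustrated in C. The sketch below is our own illustration (function names are made up for the example) and shows the two extremes that still live in C source: a plain loop the compiler may auto-vectorize, and the same addition written manually with the SSE intrinsic _mm_add_ps().

```c
#include <xmmintrin.h>  /* SSE intrinsics; x86 only */

/* Plain loop: a candidate for fully automatic vectorization. */
void add_auto(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Same operation with vector intrinsics: 4 floats per 128-bit
   _mm_add_ps, plus a scalar loop for the remainder. */
void add_sse(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Both functions compute the same result; the intrinsic version simply fixes the vectorization strategy in the source instead of leaving it to the compiler.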
Automatic Vectorization
Transforming sequential code to exploit the vector (SIMD, SSE) processing capabilities:
• Manually, by explicit source code modification
• Automatically, by tools like a compiler

for (i=0;i<MAX;i++)
   c[i]=a[i]+b[i];

[Figure: with 128-bit registers, a single vector add computes C[0..3] = A[0..3] + B[0..3] in one instruction.]
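A compilable version of the loop above, as a minimal sketch: the restrict qualifiers are our addition (not in the slide) and tell the compiler the arrays do not alias, which is often what makes automatic vectorization legal.

```c
/* The compiler can turn this loop into SIMD adds because restrict
   guarantees that a, b and c do not overlap. */
void vec_add(float * restrict c, const float * restrict a,
             const float * restrict b, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```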
Compiler Based Vectorization
Switches -x<EXTENSION>:
• SSE2 — Intel® Streaming SIMD Extensions 2 (Intel® SSE2), as available in the initial Pentium® 4 or compatible non-Intel processors
• SSE3 — Intel® Streaming SIMD Extensions 3 (Intel® SSE3), as available in Pentium® 4 or compatible non-Intel processors
• SSSE3 — Intel® Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3), as available in Intel® Core™2 Duo processors
• SSE4.1 — Intel® SSE4.1, as first introduced in the Intel® 45nm Hi-k next-generation Intel® Core™ microarchitecture
• SSE4.2 — Intel® SSE4.2 accelerated string and text processing instructions, first supported by Intel® Core™ i7 processors
• SSE3_ATOM — extensions offered by the Intel® Atom™ processor: Intel® SSSE3 (!) and the MOVBE instruction
• AVX — Intel® Advanced Vector Extensions (Intel® AVX), as available in the 2nd generation Intel® Core™ processor family
Selecting the Right Extensions Makes a Difference!

static double A[1000], B[1000], C[1000];
void add() {
   int i;
   for (i=0; i<1000; i++)
      if (A[i]>0)
         A[i] += B[i];
      else
         A[i] += C[i];
}

AVX (4 doubles per iteration):
.B1.2::
        vmovaps   ymm3, A[rdx*8]
        vmovaps   ymm1, C[rdx*8]
        vcmpgtpd  ymm2, ymm3, ymm0
        vblendvpd ymm4, ymm1, B[rdx*8], ymm2
        vaddpd    ymm5, ymm3, ymm4
        vmovaps   A[rdx*8], ymm5
        add       rdx, 4
        cmp       rdx, 1000
        jl        .B1.2

SSE2 (2 doubles per iteration; the blend is emulated with and/andn/or):
.B1.2::
        movaps    xmm2, A[rdx*8]
        xorps     xmm0, xmm0
        cmpltpd   xmm0, xmm2
        movaps    xmm1, B[rdx*8]
        andps     xmm1, xmm0
        andnps    xmm0, C[rdx*8]
        orps      xmm1, xmm0
        addpd     xmm2, xmm1
        movaps    A[rdx*8], xmm2
        add       rdx, 2
        cmp       rdx, 1000
        jl        .B1.2

SSE4.1 (2 doubles per iteration, using blendvpd):
.B1.2::
        movaps    xmm2, A[rdx*8]
        xorps     xmm0, xmm0
        cmpltpd   xmm0, xmm2
        movaps    xmm1, C[rdx*8]
        blendvpd  xmm1, B[rdx*8], xmm0
        addpd     xmm2, xmm1
        movaps    A[rdx*8], xmm2
        add       rdx, 2
        cmp       rdx, 1000
        jl        .B1.2
Many Improvements in 12.0
Sample: mixed data types

void foo(int n, float *A, double *B){
   int i;
   float t = 0.0f;
#pragma ivdep
   for (i=0; i<n; i++) {
      A[i] = t;
      B[i] = t;
      t += 1.0f;
   }
}

Using XMM (128-bit) registers as an example:
• The 11.1 compiler wants to use a "full vector" for each statement: A[i] = ... is done for 4 float elements at a time, but B[i] = ... for only 2 double elements at a time. 4 != 2, so it gives up:
  mixed.c(5): (col. 3) remark: loop was not vectorized: unsupported data type.
• The 12.0 compiler matches the number of elements: A[i] = ... and B[i] = ... are each done for 2 (2 = 2, good) or 4 (4 = 2x2, good) elements at a time:
  mixed.c(5) (col. 3): remark: LOOP WAS VECTORIZED.
User-Mandated Vectorization
User-mandated vectorization (also called "SIMD vectorization") is based on a new SIMD directive (or "pragma").
The SIMD directive provides additional information to the compiler to enable vectorization of loops (at this time, only inner loops).
It supplements automatic vectorization, but works differently from traditional directives ("automatic vectorization hints") like IVDEP, VECTOR ALWAYS, etc.:
• Traditional directives: a hint; they do not necessarily override the compiler's heuristics
• New SIMD directive: more like an assertion — if vectorization still fails, it is considered a fault (an option controls whether it is really treated as an error)
The relationship of user-mandated vectorization to pure automatic vectorization is similar to that of OpenMP* to automatic parallelization.
Clauses of the SIMD Directive
• vectorlength(num1, num2, ..., numN): each iteration of the vector loop executes the computation equivalent to VL iterations of the scalar loop
• private(var1, var2, ..., varN): variables private to each iteration; the initial value is broadcast to all private instances, and the last value is copied out from the last iteration instance
• linear(var1:step1, var2:step2, ..., varN:stepN): for every iteration of the scalar loop, varX is incremented by stepX; every iteration of the vector loop therefore increments varX by VL*stepX
• reduction(operator:var1, var2, ..., varN): a vector reduction of the given operator kind is applied to each varX
• [no]assert: whether to assert when vectorization fails; the default for the SIMD pragma is to assert
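As a sketch of how these clauses read in C source (the loop below is our own example, not from the deck): the reduction clause tells the compiler that sum may be computed as several partial vector sums. Compilers that do not recognize the Intel SIMD pragma simply ignore it and run the loop scalar; the result is the same.

```c
/* dot() is a hypothetical example. The pragma asserts the loop is
   safe to vectorize with a sum reduction on `sum`; without the clause
   the accumulation would look like a loop-carried dependency. */
float dot(const float *a, const float *b, int n)
{
    float sum = 0.0f;
#pragma simd reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```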
Sample: SIMD Directive linear Clause

   do 10 i=n1,n,n3
      k = k + j
      a(i) = a(i) + b(n-k+1)
10 continue

Vectorization fails: "Subscript too complex"

!DIR$ SIMD linear(k:j)
   do 10 i=n1,n,n3
      k = k + j
      a(i) = a(i) + b(n-k+1)
10 continue

Vectorization now succeeds: the compiler receives the additional information that k is an induction variable incremented by j in each iteration. This is sufficient to enable vectorization.
Array Section Notation for C/C++
<array base>[<lower bound>:<length>[:<stride>]][<lower bound>:<length>[:<stride>]]...
Note that a length is given — not an upper bound as in Fortran's [lower bound : upper bound].

A[:]        // All elements of vector A
B[2:6]      // Elements 2 to 7 of vector B
C[:][5]     // Column 5 of matrix C
D[0:3:2]    // Elements 0, 2, 4 of vector D
E[0:3][0:4] // 12 elements from E[0][0] to E[2][3]
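In scalar C terms, a section <base>[lb:len:stride] denotes the elements base[lb], base[lb+stride], ..., base[lb+(len-1)*stride]. A minimal sketch of that meaning (the helper name is ours, not part of the notation):

```c
/* Gather the section src[lb : len : stride] into dst[0 : len]. */
void copy_section(const double *src, double *dst,
                  int lb, int len, int stride)
{
    for (int i = 0; i < len; i++)
        dst[i] = src[lb + i * stride];
}
```

For example, copy_section(D, tmp, 0, 3, 2) collects D[0], D[2], D[4] — exactly the elements named by D[0:3:2] above.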
Operator Maps
Most arithmetic and logic operators for C/C++ basic data types are available for array sections:
+, -, *, /, %, <, ==, !=, >, |, &, ^, &&, ||, !, -(unary), +(unary), ++, --, +=, -=, *=, /=, *(p)
An operator is implicitly mapped to all the elements of the array section operands:
   a[0:s]+b[2:s] => {a[i]+b[i+2], forall (i=0;i<s;i++)}
Operations are parallel among all the elements.
Array operands must have the same rank; a scalar operand is automatically expanded to fill the whole section:
   a[0:s]*c => {a[i]*c, forall (i=0;i<s;i++)}
Assignment Map
An assignment maps to all elements of the LHS array section in parallel:
   a[:][:] = b[:][2][:] + c;
   e[:] = d;
   e[:] = b[:][1][:]; // error
   a[:][:] = e[:];    // error
The LHS of an assignment defines an array context in which the RHS is evaluated:
• The rank of the RHS array section must be the same as the LHS
• The length of each rank must match the corresponding LHS rank
• A scalar is expanded automatically
Assignment — Semantics
The RHS is evaluated before any value is assigned to the LHS. The compiler ensures these semantics in the generated code, potentially by introducing a temporary variable to store the RHS:
   a[1:s] = a[0:s]+1   // the old values of a[1:s-1] are used
Simultaneous Multi-Threading (SMT)
"Intel Hyper-Threading (HT)"
SMT: run 2 threads per core at the very same time.
[Figure: execution-unit occupancy over time (processor cycles) without SMT vs. with SMT; each box represents a processor execution unit.]
• Keeps the core fed with multiple threads
• Hides the latency of a single thread
• The most power-efficient performance feature: very low die-area cost, much more efficient than adding an entire core
• Can provide significant performance benefit, depending on the application
• Takes advantage of the 4-wide execution engine
Nehalem advantages: larger caches, massive memory bandwidth.
Simultaneous multi-threading enhances performance and energy efficiency.
SMT Implementation Details
Replicated — duplicate state per thread:
• Register state
• Renamed RSB
• Large-page ITLB
Partitioned — statically allocated between threads:
• Key buffers: load, store, reorder
• Small-page ITLB
Competitively shared — depends on the threads' dynamic behavior:
• Reservation station
• Caches
• Data TLBs, 2nd-level TLB
Unaware:
• Execution units
Applications that will benefit:
• Complex memory access (memory access stalls)
• A mix of instruction types (e.g. integer and FP computation)
SMT Thread Selection Points
The execution pipeline has multiple thread selection points where the architecture can select to work for one of the 2 logical threads.
[Figure: pipeline stages Predict/Fetch → IQ → Decode → Alloc → RS → Schedule/EX → ROB → Retire]
• Select the thread to fetch instructions from
• Select the instruction to decode
• Select the u-operation to allocate
• Select the instruction to retire
• Additional selection points exist in the memory pipeline, e.g. scheduling of MOB (memory order buffer) entries
SMT Performance Chart
Performance gain with SMT enabled vs. disabled on Intel® Core™ i7:
[Chart: gains range from about 7% to 34% across the benchmarks Floating Point (based on a SPECfp_rate_base2006* estimate), Integer (based on a SPECint_rate_base2006* estimate), 3dsMax*, Cinebench* 10, POV-Ray* 3.7 beta 25, and 3DMark* Vantage* CPU.]
SPEC, SPECint, SPECfp, and SPECrate are trademarks of the Standard Performance Evaluation Corporation. For more information on SPEC benchmarks, see: http://www.spec.org
Source: Intel. Configuration: pre-production Intel® Core™ i7 processor with 3-channel DDR3 memory. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/
Intel® VTune™ Amplifier XE
Integrates popular and mature features of the Intel® VTune™ Performance Analyzer, Intel® Parallel Amplifier, Intel® Thread Profiler, and the Intel® Performance Tuning Utility:
• But not a super-set in all cases
• Some additional features are being worked on and will be added later; others are still being evaluated and might be added in future updates
Standalone GUI on Linux* and Windows*:
• The GUI in all environments is based on wxWidgets: very fast and stable
• Same look-and-feel for Linux & Windows
• Fast and native implementation on Linux — no sluggish and fragile emulations anymore!
Comprehensive command-line interface.
New instrumentation technologies for data collection.
Event Based Sampling and Counting
Captures program performance information in terms of hardware-specific (PMU) events.
Uses event multiplexing to collect as much information as possible during a single run.
Initially only sampling; support for event-based counting (EBC) will be added in a future update:
• EBC gains relevance since most UNCORE events of the latest Intel processors cannot be "sampled"
• EBC doesn't capture CPU state information
• EBC data cannot be attributed to the program flow
• EBC mode has lower overhead and collects smaller trace files (tb5/tb6)
Event Based Sampling
Pre-defined event sets for supported processors:
• Top-level triage
• Cache analysis and false sharing
• Branching issues
• Long-latency computation operations
• Structural hazards
• Working-set size
• Data access patterns
• Memory latency
• Memory bandwidth
Any other events can be collected — some 1100 for the Nehalem architecture.
Pre-defined displays ("viewpoints") transform data into information.
Event Based Sampling
[Screenshot: an event-based sampling view in the VTune™ Amplifier XE GUI]
Core Pipeline Overview
[Diagram: the front end (32K instruction cache, branch target buffer, decode pipeline, MSROM) feeds a micro-op queue. UOPS_ISSUED counts micro-ops issued through the Register Allocation Table (RAT) into the out-of-order rename/retirement stage; the scheduler dispatches micro-ops to the execution units — integer, MMX/SSE and x87 stacks on ports 0, 1 and 5, loads and stores on the remaining ports (UOPS_EXECUTED) — and up to 4 micro-ops retire per cycle (UOPS_RETIRED). Caches: 32K data cache, 256K L2 cache, and the uncore beyond.]
Nehalem — Last Level Cache Miss (L3 Miss)?
Probably the most wanted single, non-trivial "event".
The complexity of the memory structure makes it difficult to define an LLC miss:
• An L3 miss can be a hit in another socket's L3 — modified or non-modified — or in one of the remote L1 or L2 caches
• A modified hit in another socket's L3 is more expensive than accessing local DRAM
If such an event existed in hardware, it would be an UNCORE event — very difficult (impossible?) to sample.
Nehalem LLC Miss — The Solution
Latency events for memory accesses:
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_<X>
where <X> can be 4, 8, 16, 32, 128, 256, ... 32768.
E.g., all memory accesses taking more than 128 cycles:
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_128
In fact, this is the event aliased to the pseudo-event "LLC MISS".
Memory Bandwidth
Delivered plus speculative traffic to local memory. Precise totals can be measured in the IMC (integrated memory controller) but cannot be broken down per source:
UNC_IMC_NORMAL_READS.ANY
UNC_IMC_WRITES.FULL.ANY
Intel provides a patch/script for the Intel® Performance Tuning Utility to simplify bandwidth measurements:
• Available from premier.intel.com, see product "VTune Performance Analyzer"
• For PTU, see whatif.intel.com
• Measures the total bandwidth load for a selected time period
• See the forums on software.intel.com for more details and tips
Introduction to Core Architecture
Execution Unit: Port Mapping
[Diagram: the reservation station (36 entries) dispatches to six ports. Ports 0, 1 and 5 feed the integer units (ALU1/2/3, SHIFT1/2, IMUL, LEA, JEU), the SIMD integer units (SIALU1/2, SIMUL, SISHIFT, SISHUF) and the FP units (FMUL, FADD, FDIV, FPREM, FSHUF); port 2 feeds the load AGU, port 3 the store-address AGU, and port 4 store data (MIU). The memory side comprises a 48-entry load buffer (LB), a 32-entry store buffer (SB), the DTLB, MOB, PMH, L1 data cache and L2/last-level cache; results return on the result bus, and the ROB tracks in-flight micro-ops.]
Intel® Architecture Code Analyzer User Interface

> iaca -f matrix_multiply.exe
Analysis Report
---------------
Total Throughput: 4 Cycles
Total Latency: 12 Cycles
Throughput Bottleneck: Port 5    <- port 5 is the bottleneck
Total number of Uops: 13

Port Binding in cycles (actual port binding):
-------------------------------------------------------
| Port   | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |
| Cycles | 1   0  | 1 | 3   2 | 2   2 | 2 | 4 |
-------------------------------------------------------

Legend:
N - port number, or number of cycles the port was bound; DV - divider pipe (on port 0)
D - data fetch pipe (on ports 2 and 3); CP - on a critical path
X - other ports that can be used by this instruction (alternative port binding)
F - macro fusion with the next instruction occurred
* - instruction micro-ops not bound to a port
@ - SSE instruction followed an AVX256 instruction; a penalty of dozens of cycles is expected
! - instruction not supported, not accounted for in the analysis

[Table: per-instruction port pressure in cycles; the report identifies the instructions on the critical path (CP).]

Analyzed code, with the tool's annotations:
vmovsd      xmm0, qword ptr [rax+rbx*1]
vunpcklpd   xmm0, xmm0, xmmword ptr [rax+rbx*1+0x20]
vmovsd      xmm1, qword ptr [rax+rbx*1+0x40]
vunpcklpd   xmm1, xmm1, xmmword ptr [rax+rbx*1+0x60]
vinsertf128 ymm0, ymm0, xmm1, 0x1
vxorps      ymm1, ymm1, ymm1      ; micro-ops not bound to a port
vmaxpd      ymm1, ymm1, ymm0
vmovaps     ymmword ptr [rcx+rbx*4], ymm1
add         rbx, 0x8
cmp         rbx, 0x20             ; CMP & JNZ are macro-fused
jnz         0xffffffcc
Optimization Notice
Intel compilers, associated libraries and associated development tools may include or utilize options that
optimize for instruction sets that are available in both Intel and non-Intel microprocessors (for example SIMD
instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler
options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for
Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and
specific microprocessors they implicate, please refer to the “Intel Compiler User and Reference Guides” under
“Compiler Options." Many library routines that are part of Intel compiler products are more highly optimized
for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel compiler
products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options
you select, your code and other factors, you likely will get extra performance on Intel microprocessors.
Intel compilers, associated libraries and associated development tools may or may not optimize to the same
degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These
optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions
3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel SSSE3) instruction sets and other
optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are
intended for use with Intel microprocessors.
While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best
performance on Intel and non-Intel microprocessors, Intel recommends that you evaluate other compilers and
libraries to determine which best meet your requirements. We hope to win your business by striving to offer
the best performance of any compiler or library; please let us know if you find we do not.
Notice revision #20110307
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY
ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS
DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS
OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR
INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Performance tests and ratings are measured using specific computer systems and/or components
and reflect the approximate performance of Intel products as measured by those tests. Any
difference in system hardware or software design or configuration may affect actual performance.
Buyers should consult other sources of information to evaluate the performance of systems or
components they are considering purchasing. For more information on performance tests and on
the performance of Intel products, reference www.intel.com/software/products.
BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino
Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386,
Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside,
Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel
NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel
XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside,
skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside
are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2011. Intel Corporation.
http://intel.com/software/products