How to Spice Up Java with the VTune™ Performance Analyzer?

by Levent Akyil
1. Introduction
Managed environments enable developers to introduce their products to the market quickly while reducing, if not eliminating, the need to spend valuable resources on porting efforts. One of the key advantages of managed environments is their reach through platform independence, allowing many different systems to run the same software (Figure 1). In this article, we will focus on the Java programming language and platform.
Java is an object-oriented programming language that was designed to be small, simple, and portable across
platforms and operating systems at the source and binary levels. It is easy to use, and enables the user to
develop platform independent applications. On the flip side, applications written in Java have a reputation for
being slower and requiring more memory than those written in natively compiled languages such as Fortran, C
or C++.
Figure 1
The execution speed of Java programs has improved significantly over the years due to advancements in Just-In-Time (JIT) compilation, adaptive optimization techniques, and language features supporting better code analysis. The Java Virtual Machine (JVM) itself is continuously optimized. Java's platform-independent applications depend heavily on the JVM to deliver optimal performance for the platform. Therefore, the efficiency with which the JVM handles code generation, thread management, memory allocation and garbage collection is critical in determining the performance of Java applications.
There is no easy way to present definitive advice on the performance of Java applications, because applications have diverse performance characteristics with different Java development tools, such as compilers and virtual machines, on various operating systems. The Java programming language is still evolving, and its performance continues to improve. The aim of this article is to promote awareness of Java performance issues and to lay out a methodology so that developers can make appropriate choices for the performance analysis of their applications.
2. Scope
The aim of this article is to provide a top-down methodology for analyzing applications written in Java programming
language, with a special focus on micro-architectural optimizations. In this article, I’ll show how the Intel® VTune™
Performance Analyzer can be used to analyze Java applications. This article is not an in-depth look at the expected
performance of managed environments, associated runtime engines and system architectures. This article also does
not intend to address all performance issues or discuss all types of tools available for Java performance analysis.
3. Top-down Approach
Software optimization is the process of improving software by eliminating bottlenecks so that it operates more
efficiently on a given system and optimally uses resources. Identifying the bottlenecks in the target application
and eliminating them appropriately is the key to efficient optimization. There are many optimization methodologies that help developers answer the questions of why, what and how to optimize, and these methods aid developers in reaching their performance requirements. In this article, I’ll use a top-down approach (Figure 2), meaning that I’ll start at a very high level, taking a look at the overall environment, and then successively drill down into more detail as I begin to tune the individual components within the system. This approach is targeted towards Java server applications, but can be applied to client applications as well.
In a nutshell, the performance of a Java application depends on:
• the database and I/O configuration, if used;
• the choice of operating system;
• the choice of JVM and JVM parameters;
• the algorithms used;
• the choice of hardware.
Figure 2
3.1. System and Application Level Analysis
If I/O and database accesses are part of a Java application, then the constraints introduced by the I/O devices, such as bandwidth and latency, have a bigger impact on performance than the constraints introduced by the micro-architecture. Although tuning and optimizing system-level parameters is critical, database, I/O and OS tuning are outside the scope of this article.
Java code, or managed code more generally, refers to an executable image that runs under the supervision of a runtime execution engine. The top-down approach is reasonable because of unique language features such as dynamic class loading and validation, runtime exception checking, automatic garbage collection and multithreading, in addition to the memory footprint and the choice of JVM configuration.
3.1.1. JVMs and Just-in-Time Compilers
Just-in-time compilation (JIT), also known as dynamic translation, is a technique for improving the runtime performance of a program by converting bytecode into native code at runtime before executing it (Figure 3). Initially, the JVM interprets the bytecode and, based on certain criteria, dynamically compiles it. JIT-compiled code generally offers far better performance than interpretation. In addition, it can in some or many cases offer better performance than static compilation, as many optimizations are only possible at runtime; this resembles the profile-guided optimization support provided by static compilers.
With JIT compilation, the code can be recompiled and re-optimized for the target CPU and the operating system model where the application runs. During runtime, the JIT can choose to generate SSE (Streaming SIMD Extensions) instructions whenever the underlying CPU supports them. With static compilers, if the code is compiled with SSE support, the generated binary might not execute on target processors that don’t support the appropriate SSE level.
However, due to the nature of dynamic translation, a slight delay is introduced in the initial execution of an application, simply because of the bytecode compilation. This start-up delay is usually not a big concern for server Java applications, but it can be for client applications. In general, the more optimization a JIT compiler performs, the better the code it generates, but the longer the start-up delay. In client mode, less compilation and optimization is performed to minimize start-up time. In server mode, since server applications are usually started once and run for an extended period of time, more compilation and optimization is performed to maximize performance.
Figure 3
More recent Java platforms have introduced many performance improvements, including faster memory
allocation, improved dynamic code generation, improved garbage collection, and reduction of class sizes. The
improvements will significantly help the Java applications but understanding and tuning key parameters in the
JVMs will get you closer to optimal performance.
3.1.2. Tuning Java Virtual Machine Parameters
Many JVM options can significantly impact performance. Improved performance is possible with proper
configuration of JVM parameters, particularly those related to memory usage and garbage collection.
Unfortunately, it is not possible to go over all the parameters and their usage; therefore I’ll try to introduce a
few useful ones.
3.1.2.1. Heap Size and Garbage Collection
All objects created by the executing Java program(s) are stored in the JVM’s heap, whereas all local variables live on the Java stack, and each thread has its own stack. Objects are created with the “new” operator, and their memory is allocated from the heap. Garbage collection is the process of cleaning up unused (unreferenced) objects that were created in the heap. An object is considered garbage when it can no longer be reached. Therefore, the garbage collector (GC) must detect unreferenced objects, free the heap space and make it available again to the application.
This functionality however doesn’t come for free. The JVM has to keep track of which objects are being
referenced, and then finalize and free unreferenced objects during the runtime. This activity steals precious
CPU time. Therefore, having optimal heap size and garbage collection strategy is vital for optimal application
performance.
Generally, JVMs provide different GC strategies (or combinations of strategies), and choosing the correct GC type for the type of application is important (Figure 4).
Figure 4: GC strategies: stop-the-world, concurrent, and parallel (stop-the-world).
The problems associated with garbage collection (GC) can be summarized as:
• Increased latency: the application is paused during GC;
• Decreased throughput: GC sequential overhead (serialization) leads to low throughput and decreases efficiency and scalability;
• Non-deterministic behaviour: GC pauses make the application behaviour non-deterministic.
All of the above problems affect both client- and server-side Java applications. Client Java applications generally require rapid response times and low pause times, whereas server applications, in addition to the client-side requirements, also require high throughput. If we take server applications as an example, we usually see big heap sizes and, as a result, increased GC pause times.
Excessive object allocation increases the pressure on the heap and memory. Reducing the allocation rate will help performance, in addition to observing the GC behaviour and identifying the proper heap size:
• Observe garbage collection behaviour:
  o -verbose:gc: this option will log valuable information about GC pause times, the frequency of GC, application run times, size of objects created and destroyed, memory recycled at each GC, and the rate of object creation.
  o The -XX:+PrintGCDetails and -XX:+PrintGCTimeStamps options[1] give further information about GC.
• For applications requiring low pause times:
  o Use -Xconcgc.
• For throughput applications:
  o Use -XX:+AggressiveHeap.
• Avoid explicit calls to System.gc():
  o Use -XX:+DisableExplicitGC.
• Avoid undersized old-generation heaps:
  o An undersized heap reduces collection time, but leads to a lot of other problems, such as fragmentation.
Since all objects live in the JVM’s heap and its heap size affects the execution of GC, tuning heap size has a
strong impact on performance. Heap size affects GC frequency and collection times, number of short and long
term objects, fragmentation and locality.
• Starting size (-Xms) too small causes resizing.
• Max size (-Xmx) too small causes GC to run often without recovering much heap.
• Max size too large: GC runs longer and less efficiently.
• Identify the proper young generation size: -Xmn, -XX:NewSize, -XX:MaxNewSize.
Also, options such as Trace, Verbose, VerboseGC, NoJIT, NoClassGC, NoAsyncGC, MaxJStack and Verify (depending on the JVM), which are usually used for debugging Java applications, will hurt performance.
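To make the flags above concrete, here is a purely illustrative launch command for a hypothetical server application (MyServerApp.jar is not from this article; exact option names and defaults vary by JVM vendor and version). Setting -Xms equal to -Xmx avoids heap resizing, and the GC logging options make the collector's behaviour visible:

java -server -Xms1024m -Xmx1024m -Xmn256m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+DisableExplicitGC -jar MyServerApp.jar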
3.1.2.2. Tools for GC and Heap analysis
Some of the useful tools for tuning the JVM are JStack, VisualGC, GCPortal, JConsole, jstat, jps, the NetBeans Profiler, HeapAnalyzer, and many others. I am planning to write more on these later.
3.2. Java Programming Tips
Some basic programming tips can be given as follows[2]:
• Choose the right algorithm(s) and data structures.
  o An algorithm with O(N²) complexity will be slower than an algorithm with O(N) or O(N log N) complexity.
• Use the fastest JVM, and a JVM that takes advantage of the underlying processor architecture.
• Compile with the optimization flag, javac -O.
• Use multithreading on multi-core and multi-processor systems.
  o For single-threaded applications, avoid using synchronized methods (e.g. Vector vs. ArrayList; see the sketch below).
  o Keep synchronized methods outside of loops.
• Use private and static methods, and final classes, to encourage inlining.
• Use local variables as much as possible. Local variables are faster than instance variables, which are faster than array elements.
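A tiny sketch of the Vector vs. ArrayList point above (my own illustrative example, not from the article): every Vector.add() call is synchronized, so a single-threaded loop pays a locking cost that ArrayList avoids.

import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

public class ListTip {
    public static void main(String[] args) {
        int n = 1000000;  // hypothetical element count

        // Vector: every add() is a synchronized method call.
        List<Integer> synced = new Vector<Integer>();
        for (int i = 0; i < n; i++) synced.add(i);

        // ArrayList: no synchronization; preferable when only one thread uses the list.
        List<Integer> plain = new ArrayList<Integer>(n);
        for (int i = 0; i < n; i++) plain.add(i);
    }
}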
Memory Usage:
• Significant time is spent in memory allocation. A generic variable-size allocator, such as the one behind the new operator, is much slower than more specialized memory allocators.
• Variable-size memory allocators degrade performance under heavy use because of memory fragmentation.
• Complex memory usage can result in delayed or missed opportunities for object deletion by the garbage collector. If significant time is being spent in memory allocation for a class or structure, replace the allocation for that class with a more appropriate memory allocation routine.
• Choose an algorithm which will reuse deleted space and also reduce memory fragmentation.
• Use fixed-size allocation to manage blocks of fixed size, such as objects of a single class. A fixed-size allocator maintains a linked list of blocks of a fixed size: allocation takes a block off the list, and deallocation adds a block to the list. Allocation and deallocation are very fast with fixed-size allocators and do not degrade under heavy use.
• Reuse obsolete objects whenever possible to avoid allocating new objects (see the pool sketch below).
• Make it clear to the garbage collector that an object is no longer being used by assigning null (or another object) to each object reference after its last use.
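A minimal, hypothetical object-pool sketch combining the fixed-size-allocation and object-reuse tips above (the Point and PointPool classes are mine, not from the article):

import java.util.ArrayDeque;
import java.util.Deque;

class Point {
    double x, y, z;
}

// A tiny pool: acquire() hands out a recycled Point when one is available,
// and release() returns it for reuse instead of letting it become garbage.
// Not thread-safe; use one pool per thread or add synchronization if shared.
class PointPool {
    private final Deque<Point> free = new ArrayDeque<Point>();

    Point acquire() {
        Point p = free.poll();
        return (p != null) ? p : new Point();
    }

    void release(Point p) {
        free.push(p);  // the object stays reachable, so the GC has nothing to collect
    }
}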
Object Creation:
• Avoid creating objects in frequently used routines; this prevents frequent object creation and the negative impact that such object cycling has on overall performance.
• Group frequently accessed fields together so that they end up in a minimum number of cache lines, often together with the object header.
• Experience shows that scalar fields should be grouped together, separately from object reference fields.
• Do not declare an object twice.
Strings Usage:
• Consider declaring a single StringBuffer object once at the beginning of the program, which can then be reused every time concatenation is required (see the sketch below).
• The StringBuffer methods setLength(), append(), and toString() can then be used to initialize it, to append one or more Strings, and to convert the result back to a String each time concatenation is needed. If concatenation is being used to format strings for stream output, avoid concatenation altogether by writing each string to the I/O buffer separately. For instance, if the result is being printed using System.out.println(), print each operand individually using System.out.print(), and System.out.println() for the last one.
• Use of a general-purpose StringBuffer saves the time needed to allocate and free a temporary buffer every time concatenation is used.
• Separately writing each string to the output buffer avoids any concatenation overhead altogether, resulting in somewhat faster program execution. The benefit is partially offset, however, by the added overhead of the additional I/O calls.
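A small sketch of the reusable StringBuffer idea above (the class and method names are illustrative, not from the article):

class Report {
    // One buffer, allocated once and reused for every concatenation.
    // Note: not safe to share across threads without synchronization.
    private final StringBuffer buf = new StringBuffer(256);

    String line(String name, int value) {
        buf.setLength(0);                           // reset instead of allocating a new buffer
        buf.append(name).append('=').append(value); // append pieces, no '+' concatenation
        return buf.toString();
    }
}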
I/O:
• Use buffered I/O whenever possible. If the amount of data being moved is large, consider using the buffered I/O classes for increased performance through data buffering.
• Data buffering can save a substantial amount of execution time by making fewer total accesses to the physical I/O device, while also allowing parallel operations to occur through multithreading. The more data being moved, the greater the benefit.
• The use of readFully() instead of buffered I/O can significantly improve the performance of programs. The overhead of the synchronization employed by buffered I/O routines can be avoided by using readFully() to read into a large buffer, and then managing and interpreting the data yourself (see the sketch below).
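A minimal sketch contrasting the two approaches above, assuming a hypothetical input file name: BufferedInputStream is the usual choice for incremental reads, while DataInputStream.readFully() pulls the whole file into one array in a single call so you can interpret the bytes yourself.

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class IoTip {
    public static void main(String[] args) throws IOException {
        File f = new File("data.bin");  // hypothetical input file

        // Option 1: buffered, incremental reads.
        BufferedInputStream in = new BufferedInputStream(new FileInputStream(f));
        int b;
        while ((b = in.read()) != -1) { /* process one byte at a time */ }
        in.close();

        // Option 2: one readFully() into a large buffer, then interpret the data yourself.
        byte[] buf = new byte[(int) f.length()];
        DataInputStream din = new DataInputStream(new FileInputStream(f));
        din.readFully(buf);
        din.close();
    }
}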
3.3. Micro-architectural optimization
Performance tuning at the micro-architecture level usually focuses on reducing the time it takes to complete a well-defined workload. Performance events can be used to measure the elapsed time; therefore, reducing the elapsed time of completing a workload is equivalent to reducing the measured processor cycles (clockticks).
Intel® VTune™ Performance Analyzer is one of the most powerful tools available for software developers who
are interested in this type of performance analysis. The easiest way to identify the hotspots in a given
application is to sample the application with processor cycles, and VTune™ analyzer utilizes two profiling
techniques, sampling and call graph, to help the developer identify where most of the clockticks are spent, in
addition to many other processor events[1]. The sampling can be in two forms: processor event based and time
based sampling. Event based sampling (EBS) relies on the performance monitoring unit (PMU) supported by the
processors. From this point forward, event based sampling (EBS) will be referred to as sampling.
3.3.1. VTune Performance Analyzer basics
In a compatible Java development environment, the VTune™ Performance Analyzer can be used to monitor and
sample JIT-compiled Java code. The VTune analyzer gets the names of the active methods along with their load
addresses, sizes, and debug information from the JVM and keeps this data in an internal file for later processing
when viewing results. When your Java application executes, the Just-in-Time (JIT) compiler converts your VM
bytecode to native machine code. Depending on your Java environment, either the VM or the JIT provides the
VTune analyzer with information about active Java classes and methods, such as their memory addresses, sizes,
and symbol information. The VTune analyzer uses this information to keep track of all the classes and methods
loaded into memory and the processes that are executed. It also uses this information for final analysis. In
summary, the VTune analyzer can identify performance bottlenecks, code emitted by the JIT Compiler (no
bytecode support with EBS) and analyze the flow control.
3.3.1.1. How to start analyzing with the VTune analyzer
Basically, there are two ways the VTune analyzer can analyze your Java application: the VTune analyzer starts the JVM and the application to be analyzed, or the VTune analyzer starts the analysis without starting the application and the user starts the application outside the VTune analyzer separately. The latter method is available only for sampling and is useful for analyzing applications such as daemons, services, long-running applications, etc.
3.3.1.2. Analyzing applications started with the VTune analyzer
The first and easiest way for the VTune analyzer to analyze your application is to start it from within the VTune analyzer (Figure 6). The VTune analyzer allows three types of applications or configurations: Application (.class or .jar), Script, and Applet. The following steps show how to set up the VTune analyzer to analyze a .class application.
• Start the VTune™ Performance Analyzer.
• Click the Sampling/Call Graph Wizard (please see the following sections for more detail on these methods).
• Select Java Profiling.
• Select one of the following methods: Application (.class or .jar), Script, or Applet. We’ll assume Application is selected for the following steps.
  o Application: the VTune™ Performance Analyzer will launch a Java application. You must specify the Java launcher and the application.
  o Script: a launching script invokes a specified Java application.
  o Applet: the VTune analyzer invokes a Java applet. You must specify the applet viewer and the applet.
• Select the Java Launcher and enter any other special JVM arguments.
• Select the main class or jar file. If there are any command line arguments, enter them. Also select any components (.jar/directory) needed in the Classpath.
• Click Finish. The VTune analyzer will now launch the application.
Figure 5
3.3.1.3. Analyzing applications started outside the VTune Analyzer
Sometimes it is not possible or desirable to start the Java application (e.g. daemons, services, etc.) with the VTune analyzer; however, it is still possible to perform analysis on such applications (Figure 7).
• Start the VTune™ Performance Analyzer.
• Click the Sampling/Call Graph Wizard (please see the following sections for more detail on these methods).
• Select Windows*/Windows* CE/Linux* Profiling. Uncheck Automatically generate tuning advice if it is selected.
• Uncheck (de-select) No application to launch.
• Check (select) Modify default configuration when done with wizard, and then click Finish.
• In the Advanced Activity Configuration window, select Start data collection paused if it is not desired to start collection right away.
• Resume the collection before starting the application, or while the application is running.
• Start the application in the usual manner. Wait for your software to complete and/or run it until you have executed the code path(s) of interest.
Note 1: Selecting Windows*/Windows* CE/Linux* Profiling is not a mistake, because this is the only option where we can tell the VTune analyzer not to launch any application.
Note 2: If the Java application is executed outside the VTune analyzer, please make sure to pass the correct argument to the JVM:
• For Java version 1.4.x, use -Xrunjavaperf
• For Java version 1.5 and higher, use -agentlib:javaperf
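For example (a purely illustrative invocation; the class name is hypothetical), a Java 5 or later application started outside the analyzer would be launched as:

java -agentlib:javaperf -cp . MyMainClass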
Figure 6
For Java applications (running on BEA, Sun or IBM JVMs), all the Java methods are combined and displayed as java.exe.jit on Windows* and java.jit on Linux* in the Module view. You can view the individual methods and drill down to the Hotspot view by double-clicking the module. The IBM and Sun JVMs use both interpreted and jitted modes of Java code execution. When sampling, only jitted-code profiling is associated with the executing Java methods; when the JVM interprets the code, the samples are attributed to the JVM itself. You can use the call graph collector to obtain a complete view of all executed methods.
3.3.1.4. Identifying hotspots, Using Event Based Sampling
I used SciMark2[3] for this example (Figure 7). SciMark2 is a Java benchmark for scientific and numerical
computing. It measures several computational kernels and reports a composite score in approximate MFLOPS.
Figure 7
After the analysis, the VTune analyzer will display the information about the processes and modules (Figure 8).
Figure 8
When the sampling wizard is used, the VTune™ analyzer by default uses processor cycles (clockticks) and instructions retired[4] to analyze the application. The count of cycles, also known as clockticks, forms the fundamental basis for measuring how long a program takes to execute. The total cycles measurement is the start-to-finish view of the total number of cycles needed to complete the application of interest. In typical performance tuning situations, the metric “Total Cycles” can be measured by the event CPU_CLK_UNHALTED.CORE[5].
The instructions retired event indicates the number of instructions that retired, i.e. executed completely. This does not include partially processed instructions executed due to branch mis-predictions. The ratio of clockticks (non-halted) to instructions retired is called “clocks per instruction” (CPI), and it can be a good indicator of performance problems (an indicator of the efficiency of the instructions generated by the compiler and/or of low CPU utilization)[6]. It is also possible to change the processor performance events used for sampling.
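As a purely hypothetical worked example of the CPI ratio: if a module accumulates samples corresponding to 2.0 billion clockticks (CPU_CLK_UNHALTED.CORE) and 4.0 billion instructions retired, its CPI is 2.0 / 4.0 = 0.5, twice the theoretical best of 0.25 on Core™ micro-architecture, which hints that some tuning headroom remains.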
The “java.exe.jit” module is the one of interest to us. If we drill down further (double-click on the module or click the Hotspot View button), the hotspot view will show us all the functions executed during the benchmark for which we have samples.
Figure 9
From the hotspot view it is clear that the jitted bytecode produced very efficient code, considering that the theoretically achievable CPI ratio is 0.25 (Figure 9). This is also natural considering that the benchmark, when run without any command line arguments, uses problem sizes that fit into the cache.
It is also possible to further drill down and see the source and associated sample or total event counts. Simply
double-clicking on a function of interest will take you to the java source of that method (Figure 10).
Figure 10
3.3.1.5. Identifying hotspots, Using Call Graph Analysis
Creating a call graph activity is similar to creating a sampling activity and one can follow the similar process. In
this step it is possible to change the Java Launcher and enter special JVM arguments as well. For call graph
analysis “-agentlib:javaperf=cg” JVM parameter will be picked up automatically by the VTune analyzer.
Please note that the VTune analyzer enables you to distinguish between JIT-compiled, interpreted Java and
inlined Java methods. You can then examine the timing differences between each type of method execution.
JIT-compiled methods are grouped together into modules with the .jit extension, interpreted methods into
modules with the .interpreted and inlined methods with .inlined extensions. Also note that call site[7]
information is not collected for Java call graphs.
Figure 11
The red arrows show us the critical path during the analysis; in other words, the flow path that consumed most
of the time (Figure 11).
3.3.1.6. When do JVMs decide to JIT?
Call graph analysis gives us very valuable information. In addition to the flow control of the Java application, it is also easy to see how many times one particular method was interpreted before getting jitted by the JVM. To see this information, simply group the view by Class and then sort by Function (Figure 12).
After these arrangements, one can easily see that the matmult function in SparseCompRow.java was interpreted 3 times and then it was jitted. It is also possible to see that the jitted version of matmult took ~524 microseconds (Self Time / Calls) whereas the interpreted version took ~952.66 microseconds. Similar calculations can also be done for other methods. We can see a more dramatic difference in the inverse function: while the interpreted version of inverse runs in 73 microseconds, the jitted version runs in only 7 microseconds.
Figure 12
Figure 13
From the same information, we can also find out how many times a certain function has to be executed before getting jitted (Figure 13).
3.3.2. Identifying Memory Problems
No single issue affects software performance more than the speed of memory. Slow memory or inefficient memory accesses hurt performance by forcing the processor to wait for instruction operands. Identifying memory access issues can be the first step in analyzing memory-related performance problems.
Before going into memory related issues, it is time to give a basic formula used in micro-architecture performance analysis. It is accurate to say that the total number of cycles an application takes is the sum of the cycles spent dispatching μops[8] and the cycles not dispatching μops (stalls). This can be formulated with the Intel® Core™ architecture processor event names as shown below. The formula is explained in greater detail in the Intel® 64 and IA-32 Intel® Architecture Optimization Reference Manual; therefore, for a more complete analysis please refer to the optimization manual. In this approach,

Total Cycles = Cycles dispatching μops + Cycles not dispatching μops   (Formula 1)

CPU_CLK_UNHALTED.CORE ~ RS_UOPS_DISPATCHED.CYCLES_ANY + RS_UOPS_DISPATCHED.CYCLES_NONE   (Formula 2)
Cycles dispatching μops can be counted with the RS_UOPS_DISPATCHED.CYCLES_ANY event, while cycles where no μops were dispatched (stalls) can be counted with the RS_UOPS_DISPATCHED.CYCLES_NONE event. Therefore, the equation given in Formula 1 can be re-written as shown in Formula 2. The ratio of RS_UOPS_DISPATCHED.CYCLES_NONE to CPU_CLK_UNHALTED.CORE tells you the percentage of cycles wasted due to stalls. These stalls can turn the execution unit of a processor into a major bottleneck. The execution unit is, by definition, always the bottleneck, because it defines the throughput, and an application will only perform as fast as its bottleneck. Consequently, it is extremely critical to identify the causes of the stall cycles and remove them, if possible.
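As a hypothetical worked example of this ratio (the numbers are invented for illustration): if a run shows RS_UOPS_DISPATCHED.CYCLES_NONE of about 6 billion against CPU_CLK_UNHALTED.CORE of about 10 billion, roughly 60% of the cycles were spent dispatching no μops, i.e. stalled.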
Our goal is to determine how we can minimize the causes of the stalls and let the “bottleneck” (i.e. the execution unit) do what it is designed to do. In short, the execution unit should not sit idle.
There are many contributing factors to stall cycles and sub-optimal usage of the execution unit. Some of them are memory accesses (e.g. cache misses), branch mis-predictions (and the resulting pipeline flushes), floating-point (FP) operations (e.g. long-latency operations such as division, FP control word changes, etc.), and μops not retiring due to the out-of-order (OOO) engine.
I will focus on memory related issues and how to identify them with the VTune analyzer. The memory related issues that can happen in Java programs are no different from the ones that can happen in other programming languages. However, the JVM can overcome some data locality issues through the runtime optimizations it performs during jitting.
There are three causes for a processor to access main memory (i.e. cache loads): conflict, capacity and compulsory.
• Capacity loads occur when data that was already in the cache and is still being used has to be reloaded. Using a smaller working set of data can reduce capacity loads.
• Conflict loads occur because every cache row can hold only specific memory addresses. These can be avoided by changing the alignment.
• Compulsory loads occur when data is loaded for the first time. The number of compulsory loads can be reduced, but they can't be avoided entirely (this should be taken care of by the hardware prefetchers or prefetch instructions).
For this section, I decided to focus on a sneaky memory related problem which usually occurs in multi-threaded applications running on SMP systems. I wrote a Java application called FalseShare.java for this purpose (Figure 14 shows the section of interest). This Java application is multi-threaded (it creates threads based on the number of physical cores) and runs the same algorithm with two different data distribution models. The platform used for this example is an Intel® 45nm Core™2 Quad processor running Red Hat Enterprise Linux 5.
startindx = tid;
for (loops = 0; loops < Constants.ITERS; loops++)
    for (p = startindx; p < Constants.NUMPOINTS; p += Constants.THREADS)
    {
        double d = java.lang.Math.sqrt(points[p].x*points[p].x + points[p].y*points[p].y + points[p].z*points[p].z);
        points[p].x /= d;
        points[p].y /= d;
        points[p].z /= d;
    }
Figure 14
If we recall the formula given earlier and collect the related events, we can estimate the number of stall cycles. After running the first version of the program, the stall cycles amount to ~88% of the useful cycles. Most of the time, stall cycles are a symptom of something going wrong in the execution stage.
Figure 15
If we drill down to the source, we can see the code contributing to the stall cycles. With careful analysis of the code section contributing heavily to the stall cycles, we see a phenomenon called false sharing in version 1. Unlike true sharing, where threads share variables, in false sharing the threads do not share a global variable but rather share the cache line it lives on.
Figure 16
False sharing occurs when multiple threads write to the same cache line over and over. In many cases, modified data sharing implies that two or more threads race on using and modifying data in one cache line, which is 64 bytes in size. Frequent occurrences of modified data sharing cause demand misses that have a high penalty. When false sharing is removed, code performance can improve dramatically.
Having said this, if we look carefully at the source code we can see that access to points[] follows a cyclic distribution of the data over each iteration. Re-writing it as shown in Figure 18 (Version 2) changes this to a block distribution. If we run the second version (case: VERSION2) and compare it to the first version, we can see that Version 1 runs roughly ~6.19 times slower (if we simply compare the clockticks).
Figure 17
switch (testType)
{
    case VERSION1:
        startindx = tid;
        for (loops = 0; loops < Constants.ITERS; loops++)
            for (p = startindx; p < Constants.NUMPOINTS; p += Constants.THREADS)
            {
                double d = java.lang.Math.sqrt(points[p].x*points[p].x + points[p].y*points[p].y + points[p].z*points[p].z);
                points[p].x /= d;
                points[p].y /= d;
                points[p].z /= d;
            }
        break;
    case VERSION2:
        indx = tid;
        int delta = (int) ((float)Constants.NUMPOINTS / Constants.THREADS);
        startindx = indx * delta;
        endindx = startindx + delta;
        if (indx == Constants.THREADS - 1) endindx = Constants.NUMPOINTS;
        for (loops = 0; loops < Constants.ITERS; loops++)
            for (p = startindx; p < endindx; p++)
            {
                double d = java.lang.Math.sqrt(points[p].x * points[p].x + points[p].y * points[p].y + points[p].z*points[p].z);
                points[p].x /= d;
                points[p].y /= d;
                points[p].z /= d;
            }
        break;
}
Figure 18
Figure 19
So let’s look at the L2 cache misses to see the impact. After sampling both versions with two key processor events, MEM_LOAD_RETIRED.L2_LINE_MISS[9] and L2_LINES_IN.SELF[10], we can clearly see the difference between the two versions. The second version uses a block distribution of the data to eliminate the cache line sharing, i.e. the false sharing.
Figure 20 (Version 1 and Version 2)
Clearly, the memory access issues in this example will prevent the application from scaling as the number of cores increases. Plotting the non-false-sharing and false-sharing versions shows the impact of false sharing on scaling.
Figure 21: Run times in seconds of the false-sharing (FS) and non-false-sharing (NOFS) versions on a Core 2 Extreme QX9650 with 1, 2, and 4 threads.
3.3.3. Identifying SIMD and Vectorization Usage
Leveraging the SIMD and SSE (Streaming SIMD Extensions) support available on target processors is one of the key optimization techniques JVMs use (or should use). The question is how to identify which jitted methods use SSE; this has been a common question from many Java developers.
The VTune analyzer’s event based sampling can help users pinpoint exactly which methods are optimized to use SSE. One can simply use the SIMD_INST_RETIRED.ANY event (Retired Streaming SIMD instructions (precise event)) to count the overall number of SIMD instructions retired. The events in the table below give a further break-down of this one event; they are the events available on Core™ architecture based processors.
Symbol Name[11]: Description
FP_MMX_TRANS.TO_FP: Transitions from MMX™ instructions to floating-point instructions.
FP_MMX_TRANS.TO_MMX: Transitions from floating-point to MMX™ instructions.
SIMD_ASSIST: SIMD assists invoked.
SIMD_COMP_INST_RETIRED.PACKED_DOUBLE: Retired computational Streaming SIMD Extensions 2 (SSE2) packed-double instructions.
SIMD_COMP_INST_RETIRED.PACKED_SINGLE: Retired computational Streaming SIMD Extensions (SSE) packed-single instructions.
SIMD_COMP_INST_RETIRED.SCALAR_DOUBLE: Retired computational Streaming SIMD Extensions 2 (SSE2) scalar-double instructions.
SIMD_COMP_INST_RETIRED.SCALAR_SINGLE: Retired computational Streaming SIMD Extensions (SSE) scalar-single instructions.
SIMD_INSTR_RETIRED: SIMD instructions retired.
SIMD_INST_RETIRED.ANY: Retired Streaming SIMD instructions (precise event).
SIMD_INST_RETIRED.PACKED_DOUBLE: Retired Streaming SIMD Extensions 2 (SSE2) packed-double instructions.
SIMD_INST_RETIRED.PACKED_SINGLE: Retired Streaming SIMD Extensions (SSE) packed-single instructions.
SIMD_INST_RETIRED.SCALAR_DOUBLE: Retired Streaming SIMD Extensions 2 (SSE2) scalar-double instructions.
SIMD_INST_RETIRED.SCALAR_SINGLE: Retired Streaming SIMD Extensions (SSE) scalar-single instructions.
SIMD_INST_RETIRED.VECTOR: Retired Streaming SIMD Extensions 2 (SSE2) vector integer instructions.
SIMD_SAT_INSTR_RETIRED: Saturated arithmetic instructions retired.
SIMD_SAT_UOP_EXEC: SIMD saturated arithmetic micro-ops executed.
SIMD_UOPS_EXEC: SIMD micro-ops executed (excluding stores).
SIMD_UOP_TYPE_EXEC.ARITHMETIC: SIMD packed arithmetic micro-ops executed.
SIMD_UOP_TYPE_EXEC.LOGICAL: SIMD packed logical micro-ops executed.
SIMD_UOP_TYPE_EXEC.MUL: SIMD packed multiply micro-ops executed.
SIMD_UOP_TYPE_EXEC.PACK: SIMD pack micro-ops executed.
SIMD_UOP_TYPE_EXEC.SHIFT: SIMD packed shift micro-ops executed.
SIMD_UOP_TYPE_EXEC.UNPACK: SIMD unpack micro-ops executed.
Figure 22
If we collect some of these events on SciMark2 again (Figures 22 and 23), we’ll see that all of our benchmarks actually use the SIMD (Single Instruction Multiple Data) unit. It is also important to note that SIMD_INST_RETIRED.ANY should be equal to the sum of all its sub-events.
Figure 23
3.4. Parallelization, threads and more
As we mentioned earlier, Java uses the heap to allocate memory for objects; however, in multi-threaded cases it becomes quite clear that access to the heap can quickly become a significant concurrency bottleneck, as every allocation would involve acquiring a lock that guards the heap. Luckily, JVMs use thread-local allocation blocks (TLABs), where each thread allocates a larger chunk of memory from the heap and services small allocation requests sequentially out of that thread-local block.
As a result, this greatly enhances scalability and reduces the number of times a thread must acquire the shared heap lock, improving concurrency. A TLAB[12] enables a thread to do object allocation using thread-local top and limit pointers, which is faster than doing serialized access to the heap shared across threads. However, as the number of threads exceeds the number of processors, the cost of committing memory to local-allocation buffers becomes a challenge, and sophisticated sizing policies must be employed.
The single-threaded copying collector can become a bottleneck in an application which is parallelized to take advantage of multiple processors. To take full advantage of all available CPUs on a multiprocessor machine, the HotSpot JVM (e.g. version 1.4.1) offers an optional multithreaded collector[13]. The parallel collector tries to keep related objects together to improve memory locality and cache utilization. This is accomplished by copying objects in depth-first order.
It is hard not to think about multi-threading when the current generation of processors has more and more cores. As more and more multi-core processors become available, utilizing all these cores becomes increasingly important. Luckily, the Java language and JVMs are inherently multi-threaded. Threading can help the user increase throughput and determinism (GC threads can find more hardware threads/cores available). Now, we’ll look at how to increase performance and improve resource utilization by leveraging threading.
I’ll use the Java Grande benchmark suite[14] v1.0 to demonstrate the necessity of parallelism and multithreading for leveraging the potential of any multi-core platform. Some of the benchmark results can be found in the graph below (Figure 24), followed by a small threading sketch.
Figure 24: Java Grande benchmark run times in seconds for 1, 2 and 4 threads (Section2:Crypt:Kernel:SizeB, Section2:SOR:Kernel:SizeB, Section2:SparseMatmult:Kernel:SizeB, Section3:MonteCarlo:Total:SizeA, Section3:MolDyn:Total:SizeA, Section3:RayTracer:Total:SizeA).
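As a minimal sketch of spreading such work across the available cores (my own illustrative example, not part of the Java Grande suite), each worker below gets a contiguous block of the data, mirroring the block distribution used in Version 2 earlier:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BlockedWork {
    public static void main(String[] args) throws InterruptedException {
        final double[] data = new double[1 << 20];   // hypothetical workload
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        int chunk = data.length / threads;
        for (int t = 0; t < threads; t++) {
            final int start = t * chunk;
            final int end = (t == threads - 1) ? data.length : start + chunk;
            pool.execute(new Runnable() {
                public void run() {
                    // Each thread works on its own contiguous block of the array.
                    for (int i = start; i < end; i++) {
                        data[i] = Math.sqrt(i);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}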
3.5. How to use VTune Analyzer Programmatic APIs?
“How to use the VTune analyzer programmatic APIs?” is a very common question, and the solution is already provided by the VTune analyzer. Before I introduce how to use the programmatic APIs, let’s quickly mention what the VTune programmatic APIs are and what they are good for. The VTune analyzer provides Pause and Resume APIs (VTPause() and VTResume()) to allow the user to analyze only certain parts of an application. By using these APIs, the user can skip the uninteresting parts such as initialization, GUI interaction, etc. These APIs are also available for Java and are implemented in VTuneAPI.class, which is located in the <installdir>\analyzer\bin\com\Intel\VTune directory.
The following Pause/Resume APIs enable pausing and resuming data collection for Java applications:
• com.Intel.VTune.VTuneAPI.VTPause(): pauses sampling and call graph data collection for a Java application.
• com.Intel.VTune.VTuneAPI.VTResume(): resumes sampling and call graph data collection for a Java application.
• com.Intel.VTune.VTuneAPI.VTNameThread(String threadName): names the current thread for call graph data collection for a Java application.
Figure 25
In between the calls to VTPause() and VTResume(), performance data is not collected. One can simply select “Start with data collection paused” during the configuration and resume the collection at any time with a VTResume() call. If you run the Java application using these APIs, the VTune analyzer console will report these API usages during the analysis (a usage sketch follows the log output below):
Thu Sep 25 15:32:47 2008 Data collection started...
Thu Sep 25 15:34:12 2008 Data collection paused...
Thu Sep 25 15:34:12 2008 Data collection resumed...
Thu Sep 25 15:34:12 2008 Data collection paused...
Thu Sep 25 15:34:12 2008 Data collection finished...
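A minimal usage sketch of the Pause/Resume APIs listed above, assuming VTuneAPI.class from the analyzer's bin directory is on the classpath and that the methods are static, as the fully qualified calls above suggest (the surrounding class and the work being profiled are hypothetical):

import com.Intel.VTune.VTuneAPI;

public class RegionOfInterest {
    public static void main(String[] args) {
        loadConfiguration();        // setup we don't want in the profile
                                    // (collection was started paused)
        VTuneAPI.VTResume();        // start collecting just before the interesting work
        doInterestingWork();
        VTuneAPI.VTPause();         // stop collecting again

        shutDown();
    }

    static void loadConfiguration() { /* hypothetical */ }
    static void doInterestingWork() { /* hypothetical */ }
    static void shutDown()          { /* hypothetical */ }
}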
However, if you would like to do it your own way and implement your own wrapper for these APIs, JNI (the Java Native Interface) can always help.
/* VTApis.java */
class VTApis
{
    public native void assignThreadName();
    public native void resumeVTune();
    public native void pauseVTune();
    static
    {
        System.loadLibrary("VTApis");
    }
}

/* VTApis.c */
#include <stdio.h>
#include <vtuneapi.h>
#include "VTApis.h" /* this header file was generated by javah */

/*
 * Class: VTApis
 * Method: resumeVTune
 * Signature: ()V
 */
JNIEXPORT void JNICALL Java_VTApis_resumeVTune (JNIEnv *env, jobject obj)
{
    VTResume();
    printf("Resuming VTune collection.\n");
}

/*
 * Class: VTApis
 * Method: pauseVTune
 * Signature: ()V
 */
JNIEXPORT void JNICALL Java_VTApis_pauseVTune (JNIEnv *env, jobject obj)
{
    VTPause();
    printf("Pausing VTune collection.\n");
}

How to compile and generate the wrapper using JNI:
Compile the VTune analyzer API class:
$> javac.exe VTApis.java
Generate the header file VTApis.h:
$> javah.exe -jni VTApis
Compile and generate the .dll:
$> cl /Zi -I"C:\Program Files\Java\jdk1.6.0_04\include" -I"C:\Program Files\Java\jdk1.6.0_04\include\win32" -I"C:\Program Files\Intel\VTune\Analyzer\Include" -LD VTApis.c -fixed:no -FeVTApis.dll VtuneApi.lib
Figure 26
3.5.1. VTune Analyzer’s JIT Profiling API
Optimizing how the JVM generates jitted code is not the goal of this article, but it is worth mentioning a few
things about this topic.
• Tune JIT code generation to match the underlying processor architecture:
  o Take advantage of new architectural features;
    - Example: use efficient SSE instructions to move data.
  o Improve decode/allocation efficiency;
    - Avoid length-changing prefixes to improve decode efficiency.
    - Align branch targets.
  o Eliminate inherent stalls in generated code;
  o Tune register allocation to reduce memory traffic;
    - Better register allocation can reduce stack operations.
  o Use the additional registers afforded by SSE or Intel® 64.
It is also important to note that 64-bit JVMs enable heaps larger than 4 GB, but they are usually slower than 32-bit JVMs, simply due to the extra memory needed and the system pressure caused by using and moving 64-bit pointers. Using 32-bit offsets from a Java heap base address instead of 64-bit pointers can significantly improve performance; on Intel® Xeon® platforms, the resulting 64-bit JVM can be faster than the 32-bit equivalent!
In addition to the Intel® VTune™ Performance Analyzer’s normal Java support, starting with the VTune Performance Analyzer 9.1 the JIT Profiling API became public, providing further functionality and control when profiling runtime-generated code (Figure 27). JIT compilers or JVMs can use this API to insert API calls that gather more detailed information about the interpreted or dynamically generated code.
Figure 27
3.5.2. Instructions to include JIT Profiling Support
• Include the JITProfiling.h file, located under the “C:\Program Files\Intel\VTune\Analyzer\include” directory for Microsoft* operating systems and under /opt/intel/vtune/analyzer/include for Linux* operating systems. This header file provides all API function prototypes and type definitions.
• Link the Virtual Machine (or any code using these APIs) with JITProfiling.lib, located under “C:\Program Files\Intel\VTune\Analyzer\lib” on Windows*, and with JITProfiling.a, located under /opt/intel/vtune/analyzer/bin on Linux* operating systems. On Linux*, please also link with the standard libraries libdl.so and libpthread.so.
Note: JITProfiling.a, which comes with the VTune analyzer, is compiled with g++ and not with gcc; therefore either compile your code with g++, or compile with gcc and link with the -lstdc++ library.
In order to function properly, a VM that uses the JITProfiling API should implement a mode-change callback
function and register it using iJIT_RegisterCallbackEx. The callback function is executed every time the profiling
mode changes. This ensures that the VM issues appropriate notifications when mode changes happen.
To enable JIT profiling support, set the environment variable ENABLE_JITPROFILING=1.
On Windows:
set ENABLE_JITPROFILING=1
On Linux:
export ENABLE_JITPROFILING=1
On Linux, JIT profiling can only be used with the command line interface (vtl), and the jitprofiling option needs to be used.
For call graph analysis:
vtl activity jitcg -c callgraph -o jitprofiling -app ./jitprof run
For sampling analysis:
vtl activity jitsamp -c sampling -o jitprofiling -app ./jitprof run
If you wish to perform JIT profiling on a remote Linux OS system, define the
BISTRO_COLLECTORS_DO_JIT_PROFILING environment variable in the shell where vtserver executes.
export BISTRO_COLLECTORS_DO_JIT_PROFILING=1
End Notes and References
1. Please note that options that begin with -X are non-standard, while -XX options are not stable. These options are not guaranteed to be supported on all VM implementations, and are subject to change without notice in subsequent releases of the JDK. Please check http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp for more information.
3. http://math.nist.gov/scimark2/
4. Instructions Retired: recent generations of Intel 64 and IA-32 processors feature micro-architectures using an out-of-order execution engine. They are also accompanied by an in-order front end and retirement logic that enforces program order. Instructions executed to completion are referred to as instructions retired. Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B: System Programming Guide, Part 2.
5. For these examples, I have used Intel Core™ micro-architecture based systems.
6. The Intel Core™ micro-architecture is capable of reaching a CPI as low as 0.25 in ideal situations. A greater CPI value for a given workload indicates that there are more opportunities for code tuning to improve performance. Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B: System Programming Guide, Part 2.
7. A call site is defined as the location in the caller function from where a call is made to a callee.
8. Micro-operations, also known as micro-ops or μops, are simple, RISC-like microprocessor instructions used by some CISC processors to implement more complex instructions. Wikipedia: http://en.wikipedia.org/wiki/Micro-operation
9. Counts the number of retired load operations that missed the L2 cache.
10. Counts the number of cache lines allocated in the L2 cache. Cache lines are allocated in the L2 cache as a result of requests from the L1 data and instruction caches and the L2 hardware prefetchers for cache lines that are missing in the L2 cache.
11. Taken from the Intel® VTune™ Performance Analyzer help. Intel® VTune™ Performance Analyzer (http://www3.intel.com/cd/software/products/asmo-na/eng/index.htm)
12. http://blogs.sun.com/jonthecollector/entry/the_real_thing
13. http://java.sun.com/products/hotspot/whitepaper.html
14. The Java Grande Forum is a community initiative to promote the use of Java for so-called “Grande” applications. A Grande application is an application which has large requirements for any or all of: memory, bandwidth and processing power. The benchmark suite can be downloaded from http://www2.epcc.ed.ac.uk/computing/research_activities/java_grande/index_1.html