High-level Optimizations (HLO) exploit the properties of source code constructs (for example, loops and arrays) in applications developed in high-level programming languages, such as C++. The high-level optimizations include loop interchange, loop fusion, loop unrolling, loop distribution, unroll-and-jam, blocking, prefetching, scalar replacement, and data layout optimizations.
While the -O2 (Linux*) or /O2 (Windows*) option performs some high-level optimizations (for example, prefetching, complete unrolling, etc.), using the -O3 (Linux) or /O3 (Windows) option provides the best chance for performing loop transformations to optimize memory accesses; the scope of optimizations enabled by these options is different for IA-32, Itanium®, and Intel® EM64T architectures. See Optimization Options Summary.
The -O3 (Linux) or /O3 (Windows) option enables the -O2 (Linux) or /O2 (Windows) option and adds more aggressive optimizations, such as loop transformations; -O3 optimizes for maximum speed but may not improve performance for some programs.
In conjunction with the vectorization options, -ax and -x (Linux) or /Qax and /Qx (Windows), the -O3 (Linux) or /O3 (Windows) option causes the compiler to perform more aggressive data dependency analysis than the default -O2 (Linux) or /O2 (Windows). This may result in longer compilation times.
The -ivdep-parallel (Linux) or /Qivdep-parallel (Windows) option asserts there is no loop-carried dependency in the loop where an IVDEP directive is specified. This is useful for sparse matrix applications.
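For illustration, here is a minimal sketch of the directive in use; the function, array names, and bounds are hypothetical. Because the compiler cannot prove that idx[] never repeats a value, it would otherwise assume a loop-carried dependence through a[]:

void scale_gather(double *a, const double *b, const int *idx, int n)
{
#pragma ivdep
    for (int i = 0; i < n; i++)
        a[idx[i]] += b[i];   /* IVDEP asserts that idx[] introduces no loop-carried dependence */
}

Compiling with the -ivdep-parallel (Linux) or /Qivdep-parallel (Windows) option strengthens the directive to assert that there is no loop-carried dependence at all.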
Follow these steps to tune applications on Itanium®-based systems:
Compile your program with -O3 (Linux) or /O3 (Windows) and -ipo (Linux) or /Qipo (Windows). Use profile-guided optimization whenever possible. (See Understanding Profile-Guided Optimization.)
Identify hot spots in your code. (See Using Intel® Performance Analysis Tools.)
Generate a high-level optimization report.
Check why loops are not software pipelined.
Make the changes indicated by the results of the previous steps.
Repeat these steps until you have achieved a satisfactory optimization level.
In general, you can use the following strategies to tune applications; illustrative code sketches for several of these strategies follow the list:
Use #pragma ivdep to indicate that there is no loop-carried dependence. You might also need to compile with the -ivdep-parallel (Linux) or /Qivdep-parallel (Windows) option to assert that there is no loop-carried dependence (see the sketch after the -ivdep-parallel paragraph above).
Use #pragma swp to enable software pipelining (useful for loops with lop-sided control flow or an unknown loop count).
Use #pragma loop count(n) when needed.
Use of -ansi-alias (Linux) or /Qansi-alias (Windows) is helpful.
Add the restrict keyword to assert that there is no aliasing, and compile with -restrict (Linux) or /Qrestrict (Windows).
Use -alias-args- (Linux) or /Qalias-args- (Windows) to indicate that arguments are not aliased.
Use #pragma distribute point to split large loops (normally this is done automatically).
For C code, do not use unsigned int for loop indexes; HLO may skip an optimization because of possible subscript overflow. If an upper bound is a pointer reference, assign it to a local variable whenever possible.
Check that the prefetch distance is correct. Use #pragma prefetch to override the distance when it is needed.
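A minimal sketch of the #pragma swp and #pragma loop count strategies, assuming a hypothetical loop and a typical trip count of 1000 (both the function and the count are illustrative, not prescriptive):

void scaled_add(float *y, const float *x, float alpha, int n)
{
#pragma loop count (1000)   /* typical trip count for this hypothetical workload */
#pragma swp                 /* request software pipelining for the loop below */
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}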
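A sketch of the restrict strategy, assuming a hypothetical vector-add routine; compile it with -restrict (Linux) or /Qrestrict (Windows), or use -alias-args- (Linux) or /Qalias-args- (Windows) if you prefer to make the no-aliasing promise for all arguments:

void vec_add(double *restrict c, const double *restrict a,
             const double *restrict b, int n)
{
    /* restrict promises that c, a, and b do not overlap, so stores to
       c[i] cannot alias loads from a[i] or b[i] */
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}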
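A sketch of #pragma distribute point on a hypothetical loop; placed inside the loop body it marks where the loop should be split, while placing it before the loop leaves the split point to the compiler:

void update(double *a, double *b, const double *c, const double *d, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] += c[i] * d[i];
#pragma distribute point
        /* the loop is split here into two smaller loops */
        b[i] -= c[i] * d[i];
    }
}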
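A sketch of the loop-index and upper-bound advice, using a hypothetical structure: the index is a signed int, and the pointer-referenced bound is copied into a local variable before the loop:

struct block { int count; double *data; };

void scale_block(struct block *blk)
{
    int n = blk->count;              /* local copy of the pointer-referenced upper bound */
    for (int i = 0; i < n; i++)      /* signed int index, as recommended above */
        blk->data[i] *= 2.0;
}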
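A sketch of #pragma prefetch on a hypothetical streaming loop. Only the bare form of the pragma is shown, because the syntax for specifying an explicit hint and distance varies by compiler version; consult the pragma reference before overriding the distance this way:

double stream_sum(const double *a, int n)
{
    double s = 0.0;
#pragma prefetch a               /* request prefetching for a[] */
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}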