High-Level Optimization (HLO) Report

Generate the High-Level Optimization (HLO) report by entering a command similar to the following:

Platform	Example Command
Linux*	icpc -opt-report -opt-report-phase hlo -O3 a.c b.c
Windows*	icl /Qopt-report /Qopt-report-phase hlo /O3 a.c b.c

HLO performs specific optimizations based on the usefulness and applicability of each optimization. See Loop Transformations for specific optimizations HLO performs.

The HLO report covers the loop transformations the compiler applied, as well as structure splitting and loop-carried scalar replacement. The following is an example of the HLO report for a matrix multiply program:

Example

multiply_d_lx.c

HLO REPORT LOG OPENED
================================================
Total #of lines prefetched in multiply_d for loop in line 15=2, dist=74
Block, Unroll, Jam Report:
(loop line numbers, unroll factors and type of transformation)
Loop at line 15 unrolled without remainder by 8
================================================
...
-out:matrix1.exe

These report results demonstrate the following information:

Two cache lines were prefetched 74 loop iterations ahead, that is, with a prefetch distance of 74. The prefetch instructions correspond to the loop at line 15 of the source code.
The compiler unrolled the loop at line 15 by a factor of eight, with no remainder loop.
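To make "unrolled without remainder by 8" concrete, the following is a minimal hand-written sketch of what such a transformation amounts to. It is not the compiler's actual output and not the matrix multiply source from the report; the function names `scale_rolled` and `scale_unrolled8` and the scaling loop itself are hypothetical, chosen only for illustration. The sketch assumes the trip count is a multiple of 8, which is the situation in which no remainder loop is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: one element per iteration (hypothetical example loop). */
void scale_rolled(double *y, const double *x, double a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i];
}

/* Unrolled by 8: each iteration handles eight elements, so the loop
 * runs n / 8 times. Assumes n is a multiple of 8, so no remainder
 * loop is required. */
void scale_unrolled8(double *y, const double *x, double a, size_t n)
{
    assert(n % 8 == 0);
    for (size_t i = 0; i < n; i += 8) {
        y[i]     = a * x[i];
        y[i + 1] = a * x[i + 1];
        y[i + 2] = a * x[i + 2];
        y[i + 3] = a * x[i + 3];
        y[i + 4] = a * x[i + 4];
        y[i + 5] = a * x[i + 5];
        y[i + 6] = a * x[i + 6];
        y[i + 7] = a * x[i + 7];
    }
}
```

Both functions compute the same result; the unrolled form trades a larger loop body for fewer loop-control branches and more independent operations per iteration.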

Manual optimization techniques, such as manual cache blocking, should generally be avoided and used only as a last resort.

The HLO report states explicitly which loop transformations the compiler performed. A transformation that the report does not mention may therefore be one that the developer can apply manually. Some transformations that you might want to try are listed in the following table.

Transformation	Description
Distribution	Distribute or split up one large loop into two smaller loops. This may be advantageous when too many registers are being consumed in a given large loop.
Interchange	Swap the order of execution of two nested loops to gain a cache locality or unit-stride access performance advantage.
Fusion	Fuse two smaller loops with the same trip count together to improve data locality.
Block	Cache blocking arranges a loop so that it will perform as many computations as possible on data already residing in cache. The next block of data is not read into cache until all computations with the first block are finished.
Unroll	Partially disassemble the loop structure so that fewer iterations of the loop are required, at the expense of each loop iteration being larger. Unrolling can be used to hide instruction and data latencies, to take advantage of floating-point loadpair instructions, and to increase the ratio of real work done per memory operation.
Prefetch	Makes a request to bring data in from relatively slow memory to a faster cache several loop iterations ahead of when the data is actually needed.
LoadPair	Makes use of an instruction to bring two floating-point data elements in from memory at a time.
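As an illustration of two entries in the table, the sketch below applies Interchange and Block by hand to a small matrix multiply. It is a hedged example under stated assumptions, not the source of the program in the report above: the dimension `N`, the block size `BS`, and the function names are hypothetical, the matrices are stored row-major in flat arrays, and `BS` is assumed to divide `N` evenly. In keeping with the advice above, hand-written versions like these are worth trying only when the report shows the compiler did not perform the transformation itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N  64   /* matrix dimension (hypothetical, for illustration) */
#define BS 16   /* block size; assumed to divide N evenly */

/* Original loop order: the inner k loop reads b with stride N. */
void matmul_ijk(double *c, const double *a, const double *b)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < N; k++)
                sum += a[i * N + k] * b[k * N + j];
            c[i * N + j] = sum;
        }
}

/* Interchange: the j and k loops are swapped, so the inner loop now
 * accesses both b and c with unit stride. */
void matmul_ikj(double *c, const double *a, const double *b)
{
    /* All-zero bytes represent 0.0 on IEEE 754 platforms. */
    memset(c, 0, sizeof(double) * N * N);
    for (size_t i = 0; i < N; i++)
        for (size_t k = 0; k < N; k++) {
            double aik = a[i * N + k];
            for (size_t j = 0; j < N; j++)
                c[i * N + j] += aik * b[k * N + j];
        }
}

/* Block: work proceeds on BS x BS tiles so that data brought into
 * cache is reused before the next tile is loaded. */
void matmul_blocked(double *c, const double *a, const double *b)
{
    assert(N % BS == 0);
    memset(c, 0, sizeof(double) * N * N);
    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t kk = 0; kk < N; kk += BS)
            for (size_t jj = 0; jj < N; jj += BS)
                for (size_t i = ii; i < ii + BS; i++)
                    for (size_t k = kk; k < kk + BS; k++) {
                        double aik = a[i * N + k];
                        for (size_t j = jj; j < jj + BS; j++)
                            c[i * N + j] += aik * b[k * N + j];
                    }
}
```

All three functions compute the same product. Whether the interchanged or blocked version is actually faster depends on the matrix size, the cache hierarchy, and what the compiler already does at the chosen optimization level, so measure before keeping a manual transformation.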

See Optimizer Report Generation for more information about options you can use to generate reports.
