The SWP report can provide detailed information about loops currently taking advantage of software pipelining available on Itanium®-based systems. Additionally, the report suggests reasons why loops were not pipelined.
The following command syntax examples demonstrate how to generate an SWP report for the Itanium® Compiler Code Generator (ECG) Software Pipeliner (SWP).
Platform | Syntax Examples |
---|---|
Linux* | icc -c -opt-report -opt-report-phase ecg_swp swp.cpp |
Windows* | icl /c /Qopt-report /Qopt-report-phaseecg_swp swp.cpp |
where -c (Linux) or /c (Windows) tells the compiler to stop after generating the object code (no linking occurs), -opt-report (Linux) or /Qopt-report (Windows) invokes the report generator, and -opt-report-phase ecg_swp (Linux) or /Qopt-report-phase ecg_swp (Windows) indicates the phase (ecg_swp) for which to generate the report.
Linux* only: The space between the option and the phase is optional.
Typically, the report contains a line for each software-pipelined loop indicating that the compiler has scheduled the loop for SWP. If the -O3 (Linux) or /O3 (Windows) option is specified, the SWP report also includes the loop transformation summary from the loop optimizer.
You can compile the following example code to generate a sample SWP report, but you must compile it using -c -restrict (Linux) or /c /Qrestrict (Windows). A sample report is also shown below.
Example

    #define NUM 1024
    void multiply_d(double a[][NUM], double b[][NUM],
                    double c[restrict][NUM]){
      int i,j,k;
      double temp;
      for(i=0;i<NUM;i++) {
        for(j=0;j<NUM;j++) {
          for(k=0;k<NUM;k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
          }
        }
      }
    }

The following sample report shows the output that results from compiling the example code shown above with the ecg_swp phase.
Sample SWP Report

    Swp report for loop at line 8 in _Z10multiply_dPA1024_dS0_S0_ in file SWP report.cpp
    Resource II = 2
    Recurrence II = 2
    Minimum II = 2
    Scheduled II = 2
    Estimated GCS II = 7
    Percent of Resource II needed by arithmetic ops = 100%
    Percent of Resource II needed by memory ops = 50%
    Percent of Resource II needed by floating point ops = 50%
    Number of stages in the software pipeline = 6

One fast way to determine whether specific loops are being software pipelined is to search the report output for the phrase “Number of stages in the software pipeline”; if this phrase is present in the report, it indicates that software pipelining succeeded for the associated loop.
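
The search described above can be sketched in C as follows. This is only an illustration: count_pipelined_loops is a hypothetical helper name (not part of the compiler or its tools), and it assumes the report text has already been captured into a string.

```c
#include <stddef.h>
#include <string.h>

/* Phrase that appears in the report for each loop that was pipelined. */
static const char SWP_MARKER[] = "Number of stages in the software pipeline";

/* Hypothetical helper: count how many times the marker phrase occurs
   in report_text. Each occurrence corresponds to one pipelined loop. */
size_t count_pipelined_loops(const char *report_text)
{
    size_t count = 0;
    const char *p = report_text;

    while ((p = strstr(p, SWP_MARKER)) != NULL) {
        ++count;
        p += sizeof SWP_MARKER - 1;  /* continue after this match */
    }
    return count;
}
```

A result of zero for a given report suggests that none of the reported loops were pipelined, in which case the report's reasons for not pipelining are the next thing to read.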
To understand the SWP report results, you must know something about the terminology used and the related concepts. The following table describes some of the terminology used in the SWP report.
Term | Definition |
---|---|
II | Initiation Interval (II). The number of cycles between the start of one iteration and the start of the next in the software-pipelined loop. The presence of the term II in any SWP report indicates that SWP succeeded for the loop in question. If you also know the number of iterations, you can use II to quickly estimate how many cycles your loop will take: total cycle time is approximately N * Scheduled II + Number of stages, where N is the number of iterations of the loop. This is an approximation because it considers only the kernel of the SWP loop and does not take into account the ramp-up and ramp-down of the prolog and epilog. As you modify your code, it is generally better to see Scheduled II go down, though the ultimate figure of merit is N * Scheduled II + Number of stages in the software pipeline. |
Resource II | Resource II indicates what the Initiation Interval should be when considering the number of functional units available. |
Recurrence II | Recurrence II indicates what the Initiation Interval should be when there is a recurrence relationship in the loop. A recurrence relationship is a particular kind of data dependency called a flow dependency, such as a[i] = a[i-1], where a[i] cannot be computed until a[i-1] is known. If Recurrence II is non-zero and there is no flow dependency in the code, this indicates either non-unit stride access or memory aliasing. See Helping the Compiler for more information. |
Minimum II | Minimum II is the theoretical minimum Initiation Interval that could be achieved. |
Scheduled II | Scheduled II is the Initiation Interval that the compiler actually scheduled for the SWP. |
number of stages | Indicates the number of stages in the software pipeline. For example, a report line such as Number of stages in the software pipeline = 3 indicates three stages of work, which appear in the assembly as a load, an FMA instruction, and a store. |
loop-carried memory dependence edges | Indicates that the compiler avoided a WAR (Write After Read) dependency. Loop-carried memory dependence edges can indicate problems with memory aliasing. See Helping the Compiler. |
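
The cycle estimate given in the II definition can be written as a small C function. This is a minimal sketch of the approximation only; estimate_swp_cycles is a hypothetical name introduced for illustration, and the result ignores prolog and epilog effects, as noted in the table.

```c
/* Approximate cycle count of a software-pipelined loop:
   N * Scheduled II + Number of stages.
   Hypothetical helper for illustration; it is an estimate, not a measurement. */
unsigned long estimate_swp_cycles(unsigned long iterations,
                                  unsigned long scheduled_ii,
                                  unsigned long stages)
{
    return iterations * scheduled_ii + stages;
}
```

For instance, assuming the inner loop of the example runs NUM = 1024 iterations, the sample report's values (Scheduled II = 2, 6 stages) give 1024 * 2 + 6 = 2054 cycles for one pass through that loop.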
The most efficient way to solve problems is to analyze the loops that were not software pipelined in order to determine how to enable SWP.
If the compiler reports the Loop was not SWP because..., see the following table for suggestions about how to mitigate the problems:
Message in Report | Suggested Action |
---|---|
acyclic global scheduler can achieve a better schedule: => loop not pipelined | Indicates that the most likely cause is memory aliasing. For memory aliasing problems, see memory aliasing (restrict, #pragma ivdep). Might also indicate that the application is accessing memory in a non-unit stride fashion. Non-unit stride issues may be indicated by an artificially high Recurrence II: if you know there is no recurrence relationship in the loop (a[i] = a[i-1] + b[i], for example), then a high Recurrence II (greater than 0) is a sign that you are accessing memory with a non-unit stride. Rearranging code, perhaps with a loop interchange, might help mitigate this problem. |
Loop body has a function call | Indicates that inlining the function might help solve the problem. |
Not enough static registers | Indicates that you should distribute the loop by separating it into two or more loops. On Itanium®-based systems you may use #pragma distribute point. |
Not enough rotating registers | Indicates that the loop-carried values use the rotating registers. Distribute the loop. On Itanium®-based systems you may use #pragma distribute point. |
Loop too large | Indicates that you should distribute the loop. On Itanium®-based systems you may use #pragma distribute point. |
Loop has a constant trip count < 4 | Indicates that unrolling was insufficient. Attempt to fully unroll the loop. However, for small loops, fully unrolling the loop is not likely to affect performance significantly. |
Too much flow control | Indicates complex loop structure. Attempt to simplify the loop. |
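
Several of the suggested actions call for distributing the loop. The following sketch shows what manual loop distribution can look like in C. The function names and the loop body are hypothetical and chosen only for illustration. The two versions are equivalent only under the assumption that x and y do not overlap in memory.

```c
/* Original form: a single loop body updates two unrelated arrays. */
void scale_and_offset_fused(double *x, double *y, int n, double s, double t)
{
    int i;
    for (i = 0; i < n; i++) {
        x[i] = x[i] * s;
        y[i] = y[i] + t;
    }
}

/* Distributed form: the work is split into two loops, each with a
   smaller body. Assumes x and y do not overlap. */
void scale_and_offset_distributed(double *x, double *y, int n, double s, double t)
{
    int i;
    for (i = 0; i < n; i++) {
        x[i] = x[i] * s;
    }
    for (i = 0; i < n; i++) {
        y[i] = y[i] + t;
    }
}
```

Whether distribution actually enables SWP for a particular loop depends on the compiler and the loop, so it is worth regenerating the SWP report after such a change.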
The type of the loop index variable can greatly impact performance. In some cases, using loop index variables of type short or unsigned int can prevent software pipelining. If the report indicates performance problems in loops where the index variable is not int and there are no other obvious causes, try changing the loop index variable to type int.
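
As an illustration of the index-type change described above, the two hypothetical functions below perform the same work and differ only in the type of the loop index and trip count. The names are invented for this sketch. Whether the first form actually prevents SWP for a given loop is something to confirm in the SWP report.

```c
/* Loop index of type short: the form the text suggests avoiding. */
void accumulate_short_index(double *dst, const double *src, short n)
{
    short i;
    for (i = 0; i < n; i++) {
        dst[i] += src[i];
    }
}

/* Same loop with an int index: the suggested alternative. */
void accumulate_int_index(double *dst, const double *src, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        dst[i] += src[i];
    }
}
```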
See Optimizer Report Generation for more information about options you can use to generate reports.