Data prefetching refers to loading data from a relatively slow memory into a relatively fast cache before the data is needed by the application. Data prefetch behavior depends on the architecture:
Itanium® processors: the Intel® compiler generally issues prefetch instructions when you specify -O1, -O2, or -O3 (Linux*) or /O1, /O2, or /O3 (Windows*).
Pentium® 4 processors: these processors prefetch data in hardware, so the compiler does not issue prefetch instructions when targeting Pentium® 4 processors.
Pentium® III processors: the Intel® compiler issues prefetches when you specify -xK (Linux) or /QxK (Windows).
Issuing prefetches improves performance in most cases, but in some cases prefetch instructions can slow application performance. Experiment with prefetching; to isolate a suspected prefetch performance issue, it can be helpful to turn prefetching on or off with a compiler option while leaving all other optimizations unaffected. See Prefetching with Options for information on using compiler options for prefetching data.
There are two primary methods of issuing prefetch instructions. One is by using compiler directives and the other is by using compiler intrinsics.
The prefetch and noprefetch directives are supported by Itanium® processors only. These directives assert that the data prefetches be generated or not generated for some memory references. This affects the heuristics used in the compiler. The general syntax for these pragmas is shown below:
Syntax:

```c
#pragma noprefetch
#pragma prefetch
#pragma prefetch a,b
```
If a loop includes the expression A(j), placing `#pragma prefetch A` in front of the loop instructs the compiler to insert prefetches for A(j + d) within the loop, where d is the number of iterations ahead to prefetch the data and is determined by the compiler. These directives are supported when you specify option -O1, -O2, or -O3 (Linux*) or /O1, /O2, or /O3 (Windows*). Remember that -O2 or /O2 is the default optimization level.
Example:

```c
#pragma noprefetch b
#pragma prefetch a
for (i = 0; i < m; i++) {
    a[i] = b[i] + 1;
}
```
The following example, which is for Itanium®-based systems only, demonstrates how to use the prefetch, noprefetch, and memref_control pragmas together:
Example:

```c
#include <stdio.h>

#define SIZE 10000

int prefetch(int *a, int *b)
{
    int i, sum = 0;
#pragma memref_control a:l2
#pragma noprefetch a
#pragma prefetch b
    for (i = 0; i < SIZE; i++)
        sum += a[i] * b[i];
    return sum;
}

int main()
{
    int i, arr1[SIZE], arr2[SIZE];
    for (i = 0; i < SIZE; i++) {
        arr1[i] = i;
        arr2[i] = i;
    }
    printf("Demonstrating the use of prefetch, noprefetch,\n"
           "and memref_control pragma together.\n");
    prefetch(arr1, arr2);
    return 0;
}
```
The memref_control pragma is supported by Itanium® processors only. This pragma provides a method for controlling load latency and temporal locality at the variable level, allowing you to specify locality and latency at the array level. For example, using this pragma allows you to control the following:
The location (cache level) to store data for future access.
The most appropriate latency value to be used for a load, or the latency that has to be overlapped if a prefetch is issued for this reference.
The syntax for this pragma is shown below:
Syntax:

```c
#pragma memref_control name1[:<locality>[:<latency>]][,name2...]
```
The following table lists the supported arguments.
| Argument | Description |
|---|---|
| name1, name2 | Specifies the name of an array or pointer. You must specify at least one name; each name can have associated locality and latency values. |
| locality | An optional integer value that indicates the desired cache level in which to store the data for future access. This determines the load/store hint (or prefetch hint) used for the reference. To use this argument, you must also specify name. |
| latency | An optional integer value that indicates the load latency (or the latency that has to be overlapped if a prefetch is issued for this address). To use this argument, you must also specify name and locality. |
When you specify latency and data locality information at a high level for a particular data access, the compiler decides how best to use this information. If the compiler can prefetch profitably for the reference, it issues a prefetch with a distance that covers the specified latency and then schedules the corresponding load with a smaller latency. It also uses the hints on the prefetch and load appropriately to keep the data in the specified cache level.
If the compiler cannot compute the address in advance, or decides that the overheads for prefetching are too high, it uses the specified latency to separate the load and its use (in a pipelined loop or a Global Code Scheduler loop). The hint on the load/store will correspond to the cache level passed with the locality argument.
You can use this pragma with prefetch and noprefetch to further tune the hints and prefetch strategies. When using memref_control with noprefetch, keep the following guidelines in mind:
Specifying noprefetch along with memref_control causes the compiler to not issue prefetches; instead, the latency value specified in memref_control is used to schedule the load.
There are no ordering requirements for using the two pragmas together. Specify them in either order, as long as both appear consecutively just before the loop to which they apply. Issuing a prefetch with one hint and loading the data later using a different hint can provide greater control over the hints used for specific architectures.
memref_control is handled differently from prefetch or noprefetch: even if the load cannot be prefetched, the reference can still be loaded using the non-default load latency passed to the latency argument.
The following example illustrates a case where the address is not known in advance, so prefetching is not possible. In this case, the compiler schedules the loads of the tab array with an L3 load latency of 15 cycles (inside a software-pipelined loop or GCS loop).
Example: gather

```c
#pragma memref_control tab : l2 : l3_latency
for (i = 0; i < n; i++) {
    x = <generate 64 random bits inline>;
    dum += tab[x & mask]; x >>= 6;
    dum += tab[x & mask]; x >>= 6;
    dum += tab[x & mask]; x >>= 6;
}
```
The following example illustrates one way of using memref_control, prefetch, and noprefetch together.
Example: sparse matrix

```c
if (size <= 1000) {
#pragma noprefetch cp, vp
#pragma memref_control x:l2:l3_latency
#pragma noprefetch yp, bp, rp
#pragma noprefetch xp
    for (iii = 0; iii < rag1m0; iii++) {
        if (ip < rag2) {
            sum -= vp[ip] * x[cp[ip]];
            ip++;
        } else {
            xp[i] = sum * yp[i];
            i++;
            sum = bp[i];
            rag2 = rp[i+1];
        }
    }
    xp[i] = sum * yp[i];
} else {
#pragma prefetch cp, vp
#pragma memref_control x:l2:mem_latency
#pragma prefetch yp, bp, rp
#pragma noprefetch xp
    for (iii = 0; iii < rag1m0; iii++) {
        if (ip < rag2) {
            sum -= vp[ip] * x[cp[ip]];
            ip++;
        } else {
            xp[i] = sum * yp[i];
            i++;
            sum = bp[i];
            rag2 = rp[i+1];
        }
    }
    xp[i] = sum * yp[i];
}
```
Before inserting compiler intrinsics, experiment with all other supported compiler options and pragmas. Compiler intrinsics are less portable and less flexible than either compiler options or compiler pragmas.
Pragmas enable compiler optimizations, while intrinsics perform the optimizations themselves. As a result, programs with pragmas are more portable, because the compiler can adapt to different processors, while programs with intrinsics may have to be rewritten or ported for different processors; intrinsics are closer to assembly programming.
Some prefetching intrinsics are:

| Intrinsic | Description |
|---|---|
| `__lfetch` | Generates the `lfetch.lfhint` instruction. |
| `__lfetch_fault` | Generates the `lfetch.fault.lfhint` instruction. |
| `__lfetch_excl` | Generates the `lfetch.excl.lfhint` instruction. |
| `__lfetch_fault_excl` | Generates the `lfetch.fault.excl.lfhint` instruction. |
| `_mm_prefetch` | Loads one cache line of data from address a to a location closer to the processor. |
See Operating System Related Intrinsics and Cacheability Support Using Streaming SIMD Extensions in the Compiler Reference for more information about these intrinsics.
The following example demonstrates how to generate an lfetch.nt2 instruction using prefetch intrinsics:
Example:

```c
for (i = i0; i != i1; i += is) {
    float sum = b[i];
    int ip = srow[i];
    int c = col[ip];
    for (; ip < srow[i+1]; c = col[++ip]) {
        __lfetch(2, &value[ip+40]);  /* or: _mm_prefetch(&value[ip+40], 2); */
        sum -= value[ip] * x[c];
    }
    y[i] = sum;
}
```
For SSE-enabled processors you could also use the following SSE intrinsics:
_mm_prefetch
_mm_stream_pi
_mm_stream_ps
_mm_sfence
See Intel® Itanium® Architecture Software Developer's Manual, Volume 3: Instruction Set Reference, Revision 2.1, Part I: Intel® Itanium® Instruction Set Descriptions.