Parallelism Overview

This section discusses the three major features of parallel programming supported by the Intel® compiler:

  • OpenMP*

  • Auto-parallelization

  • Auto-vectorization

Each of these features contributes to application performance depending on the number of processors, the target architecture (IA-32 or Itanium® architecture), and the nature of the application. These features can also be combined to further improve performance.

Parallel programming can be explicit, that is, defined by a programmer using OpenMP directives. Parallel programming can also be implicit, that is, detected automatically by the compiler. Implicit parallelism covers auto-parallelization of outer-most loops, auto-vectorization of inner-most loops, or both.

Parallelism defined with OpenMP and auto-parallelization directives is based on thread-level parallelism (TLP). Parallelism defined with auto-vectorization techniques is based on instruction-level parallelism (ILP).

The Intel® compiler supports OpenMP and auto-parallelization for IA-32, Intel EM64T, and Itanium architectures on multiprocessor systems, dual-core processor systems, and systems with Hyper-Threading Technology (HT Technology) enabled.

Auto-vectorization is supported on the families of the Pentium®, Pentium with MMX™ technology, Pentium II, Pentium III, and Pentium 4 processors. To enhance auto-vectorization, users can also add vectorizer directives to their programs. A closely related technique, software pipelining (SWP), is available on Itanium-based systems.

The following table summarizes the different ways in which parallelism can be exploited with the Intel® compiler.

Explicit: parallelism programmed by the user

  OpenMP* (thread-level parallelism); IA-32 and Itanium® architectures. Supported on:

  • IA-32, Intel EM64T, and Itanium-based multiprocessor systems and dual-core processors

  • Hyper-Threading Technology-enabled systems

Implicit: parallelism generated by the compiler and by user-supplied hints

  Auto-parallelization (thread-level parallelism) of outer-most loops; IA-32 and Itanium architectures. Supported on:

  • IA-32, Intel EM64T, and Itanium-based multiprocessor systems and dual-core processors

  • Hyper-Threading Technology-enabled systems

  Auto-vectorization (instruction-level parallelism) of inner-most loops; IA-32 and Itanium architectures. Supported on:

  • Pentium®, Pentium with MMX™ technology, Pentium II, Pentium III, and Pentium 4 processors

Parallel Program Development

OpenMP

The Intel® compiler supports the OpenMP* version 2.5 API specification for C/C++, available from the OpenMP web site (http://www.openmp.org). The OpenMP directives relieve the user from having to deal with the low-level details of iteration space partitioning, data sharing, and thread scheduling and synchronization.
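As a minimal sketch of this idea (the array size and loop body are illustrative, not taken from any particular application), a single pragma distributes the loop iterations across threads and combines the per-thread partial sums; no explicit thread creation, scheduling, or synchronization code is needed.

Example

#include <stdio.h>

#define N 1000

int main()
{
  double sum = 0.0;
  int i;

  // The reduction clause gives each thread a private partial sum and
  // combines the results when the loop completes; the compiler and the
  // OpenMP runtime handle iteration-space partitioning and scheduling.
  #pragma omp parallel for reduction(+:sum)
  for (i = 0; i < N; i++)
  {
    sum += i * 0.5;
  }
  printf("sum = %f\n", sum);
  return 0;
}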

Auto-Parallelization

The auto-parallelization feature of the Intel® compiler automatically translates serial portions of the input program into semantically equivalent multithreaded code. Automatic parallelization determines the loops that are good worksharing candidates, performs the dataflow analysis to verify correct parallel execution, and partitions the data for threaded code generation as needed in programming with OpenMP directives. Both OpenMP and auto-parallelization exploit shared-memory parallelism on multiprocessor and dual-core systems, and on IA-32 processors with Hyper-Threading Technology.
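For illustration, the following hypothetical loop is the kind of candidate the auto-parallelizer looks for: no iteration reads a value written by another iteration, so when the file is compiled with -parallel (Linux) or /Qparallel (Windows) the compiler can distribute the iterations across threads without any source changes.

Example

#include <stdio.h>

#define N 100000

static float a[N], b[N], c[N]; // Static arrays are zero-initialized.

int main()
{
  int i;

  // Independent iterations: the dataflow analysis can prove that
  // parallel execution is safe.
  for (i = 0; i < N; i++)
  {
    a[i] = b[i] + c[i];
  }
  printf("%f\n", a[0]); // Use a result so the loop is not optimized away.
  return 0;
}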

Auto-Vectorization

Auto-vectorization detects low-level operations in the program that can be done in parallel, and then converts the sequential program to process 2, 4, 8, or up to 16 elements in one operation, depending on the data type. In some cases, auto-parallelization and auto-vectorization can be combined for better performance results.

The following example demonstrates how code can be designed to explicitly benefit from parallelization and vectorization. Assuming you compile the code shown below using -parallel -xP (Linux*) or /Qparallel /QxP (Windows*), the compiler will parallelize the outer loop and vectorize the innermost loop.

Example

#include <stdio.h>

#define ARR_SIZE 500 // Array dimension

int main()
{
  int matrix[ARR_SIZE][ARR_SIZE];
  int arrA[ARR_SIZE] = {10}; // First element is 10; the rest are zero-initialized.
  int arrB[ARR_SIZE] = {30}; // First element is 30; the rest are zero-initialized.
  int i, j;

  // The iterations are independent and the inner loop accesses memory with
  // unit stride, so the compiler can thread the outer loop and vectorize
  // the inner loop.
  for (i = 0; i < ARR_SIZE; i++)
  {
    for (j = 0; j < ARR_SIZE; j++)
    {
      matrix[i][j] = arrB[i] * (arrA[i] % 2 + 10);
    }
  }
  return 0;
}

When you compile the example code with these options, the compiler reports results similar to the following:

vectorization.c(18) : (col. 6) remark: LOOP WAS VECTORIZED.

vectorization.c(16) : (col. 3) remark: LOOP WAS AUTO-PARALLELIZED.

Auto-vectorization can help improve performance of an application that runs on systems based on Pentium®, Pentium with MMX™ technology, Pentium II, Pentium III, and Pentium 4 processors.

The following tables summarize the options that enable auto-vectorization, auto-parallelization, and OpenMP support.

Auto-vectorization: IA-32 only

  • /Qx (Windows*), -x (Linux*): Generates specialized code to run exclusively on processors with the extensions specified by {K,W,N,B,P,T}. P is the only valid value on Mac OS* systems.

  • /Qax (Windows), -ax (Linux): Generates, in a single binary, code specialized to the extensions specified by {K,W,N,B,P,T} as well as generic IA-32 code. The generic code is usually slower.

  • /Qvec-report (Windows), -vec-report (Linux): Controls the diagnostic messages from the vectorizer (example command lines follow this list).
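For example, assuming a source file named vectorization.c (the file name is illustrative), command lines similar to the following generate code specialized for the P extension value and raise the vectorizer's diagnostic level so that loops that were not vectorized are also reported:

Example

icl /QxP /Qvec-report2 vectorization.c   (Windows)

icc -xP -vec-report2 vectorization.c   (Linux)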

Auto-parallelization: IA-32 and Itanium® architectures

  • /Qparallel (Windows*), -parallel (Linux*): Enables the auto-parallelizer to generate multithreaded code for loops that can be safely executed in parallel. On Intel® Itanium®-based systems only, this option implies /Qopt-mem-bandwidth1 (Windows) or -opt-mem-bandwidth1 (Linux).

  • /Qpar-threshold[:n] (Windows), -par-threshold{n} (Linux): Sets a threshold for the auto-parallelization of loops based on the probability of profitable parallel execution of the loop, n=0 to 100.

  • /Qpar-report (Windows), -par-report (Linux): Controls the auto-parallelizer's diagnostic levels (example command lines follow this list).
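As a usage sketch (the file name sample.c is an assumption), command lines similar to the following enable the auto-parallelizer, require a high probability of profitable parallel execution before a loop is threaded, and raise the diagnostic level:

Example

icl /Qparallel /Qpar-threshold:99 /Qpar-report2 sample.c   (Windows)

icc -parallel -par-threshold99 -par-report2 sample.c   (Linux)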

OpenMP: IA-32 and Itanium® architectures

  • /Qopenmp (Windows*), -openmp (Linux*): Enables the parallelizer to generate multithreaded code based on the OpenMP directives. On Intel® Itanium®-based systems only, this option implies /Qopt-mem-bandwidth1 (Windows) or -opt-mem-bandwidth1 (Linux).

  • /Qopenmp-report (Windows), -openmp-report (Linux): Controls the OpenMP parallelizer's diagnostic levels.

  • /Qopenmp-stubs (Windows), -openmp-stubs (Linux): Enables compilation of OpenMP programs in sequential mode. The OpenMP directives are ignored and a stub OpenMP library is linked (a usage sketch follows this list).
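The stub library makes it possible to keep a single source base for sequential and parallel builds. The following is a minimal sketch of this (the program itself is illustrative): built with /Qopenmp (Windows) or -openmp (Linux), the parallel region prints one line per thread; built with /Qopenmp-stubs or -openmp-stubs, the directive is ignored and the stub version of omp_get_thread_num() returns 0.

Example

#include <stdio.h>
#include <omp.h> // OpenMP runtime routines; resolved by the stub library in sequential builds

int main()
{
  #pragma omp parallel
  {
    printf("Hello from thread %d\n", omp_get_thread_num());
  }
  return 0;
}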

Note

When both -openmp (Linux) or /Qopenmp (Windows) and -parallel (Linux) or /Qparallel (Windows) are specified on the command line, the -parallel (Linux) or /Qparallel (Windows) option is applied only in routines that do not contain OpenMP directives. For routines that contain OpenMP directives, only the -openmp (Linux) or /Qopenmp (Windows) option is applied.

With the right choice of options, you can exploit all of these forms of parallelism. Additionally, with the relatively small effort of adding OpenMP directives to existing code, you can transform a sequential program into a parallel program.

The following example demonstrates one method of using the OpenMP pragmas within code.

Example

#include <stdio.h>

#define ARR_SIZE 100 // Array dimension

void foo(int ma[][ARR_SIZE], int mb[][ARR_SIZE], int *a, int *b, int *c);

int main()
{
  int arr_a[ARR_SIZE];
  int arr_b[ARR_SIZE];
  int arr_c[ARR_SIZE];
  int i, j;
  int matrix_a[ARR_SIZE][ARR_SIZE];
  int matrix_b[ARR_SIZE][ARR_SIZE];

  // Initialize the arrays and matrices in parallel. The inner loop
  // counter j must be declared private so that each thread gets its
  // own copy.
  #pragma omp parallel for private(j)
  for (i = 0; i < ARR_SIZE; i++)
  {
    arr_a[i] = i;
    arr_b[i] = i;
    arr_c[i] = ARR_SIZE - i - 1; // Keep values within the valid index range 0..ARR_SIZE-1.
    for (j = 0; j < ARR_SIZE; j++)
    {
      matrix_a[i][j] = j;
      matrix_b[i][j] = i;
    }
  }
  foo(matrix_a, matrix_b, arr_a, arr_b, arr_c);
  return 0;
}

void foo(int ma[][ARR_SIZE], int mb[][ARR_SIZE], int *a, int *b, int *c)
{
  int i, num, arr_x[ARR_SIZE];

  // Expresses the parallelism using the OpenMP pragma: parallel for.
  // The pragma guides the compiler in generating multithreaded code.
  // Arrays arr_x, ma, mb, a, b, and c are shared among threads based on
  // the OpenMP data-sharing rules. Scalar num is specified as private
  // for each thread.
  #pragma omp parallel for private(num)
  for (i = 0; i < ARR_SIZE; i++)
  {
    num = ma[b[i]][c[i]];
    arr_x[i] = mb[a[i]][num];
    printf("Values: %d\n", arr_x[i]); // Prints the values 0..ARR_SIZE-1 in nondeterministic order.
  }
}
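To build this example, you might use command lines similar to the following (the file name openmp_sample.c is an assumption). With OpenMP diagnostics enabled, the compiler typically reports a remark such as "OpenMP DEFINED LOOP WAS PARALLELIZED" for each of the two loops.

Example

icl /Qopenmp openmp_sample.c   (Windows)

icc -openmp openmp_sample.c   (Linux)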