OpenMP: For

The for loop construct is probably one of the most widely used features of the OpenMP. The construct’s aim is to parallelize an explicitly written for loop.

For loop construct

The syntax of the loop construct is

#pragma omp parallel [clauses]
{
    #pragma omp for [clauses]
    for (...)
    {
        // body 
    }
}

The parallel construct specifies the region which should be executed in parallel. A program without the parallel construct will be executed sequentially.

The for loop construct (or simply the loop construct) specifies that the iterations of the following for loop will be executed in parallel. The iterations are distributed across the threads that already exist.

If there is only one #pragma omp for inside #pragma omp parallel we can simplify both constructs into the combined construct.

#pragma omp parallel for [clauses]
for (...)
{

}

Shared clause

The clauses are additional options which we can set to the constructs.

An example of a clause for parallel construct is shared(...) clause. When a program encounters the parallel construct, the program forks a team of threads. The variables, which are listed in the shared(...) clause, are then shared between all the threads.

Example

Let us write a first parallel loop with the OpenMP loop construct.

std::vector<int> iterations(omp_get_max_threads(), 0);
int n = 100;

#pragma omp parallel shared(iterations, n)
{    
    #pragma omp for 
    for (int i = 0; i != n; i++)
    {
        iterations[omp_get_thread_num()]++;
    }
}

print_iterations(iterations);

We define the vector of ints. Each thread increments its corresponding entry in the vector. At the end, the i-th entry of the vector tells us how many iterations was executed by the i-th thread.

The parallel construct creates a team of threads which execute in parallel. The variables iterations and n are shared between all the threads.

The loop construct specifies that the for loop should be executed in parallel.

We use a couple of OpenMP functions. The omp_get_max_threads() returns an upper bound on the number of threads that could form a new team of threads. This upper bound is valid only if we later do not explicitly specify the number of threads in the team. Additionally, we also used omp_get_thread_num() which returns the number of the calling thread.

Canonical loop form

When we compile the upper program, the compilation fails with an error. The error is

In function ‘int main()’:
error: invalid controlling predicate
         for (int i = 0; i != n; i++)

What is wrong with our program? The printings suggests that the condition of the for loop is invalid.

Well, OpenMP is able to parallelize a loop only if it has a certain structure. The structure is

for (initialize; test; increment)
{
...
}

The initialize expression is of the form var = lb, where var is an integer or a random access iterator and lb is a loop invariant.

The test expression must have the form var operator b or b operator var, where b is the loop invariant and operator is one of the following

<,
<=,
>,
>=.

The increment expression has to be one of the following

++var,
var++,
--var,
var--,
var += incr,
var -= incr,
var = var + incr,
var = incr + var,
var = var - incr,

where incr is a loop invariant integer expression.

The loop, which satisfies these conditions, has the canonical loop form (defined in the OpenMP specification). You can find the conditions and the precise definition of the canonical loop form in the OpenMP specification on the page 53.

Corrections

Having the knowledge about the canonical loop form, we can correct the test in the for loop. The program now looks like

std::vector<int> iterations(omp_get_max_threads(), 0);
int n = 100;

#pragma omp parallel shared(iterations, n)
{    
    #pragma omp for 
    for (int i = 0; i < n; i++)
    {
        iterations[omp_get_thread_num()]++;
    }
}

print_iterations(iterations);

The whole source code is available here.

The output of the program is

$ ./ompFor
Thread 0 -> 13 iterations
Thread 1 -> 13 iterations
Thread 2 -> 13 iterations
Thread 3 -> 13 iterations
Thread 4 -> 12 iterations
Thread 5 -> 12 iterations
Thread 6 -> 12 iterations
Thread 7 -> 12 iterations

The program used eight threads to execute the loop. OpenMP uniformly divided the iterations between all the threads – each one executed 12 or 13 iterations.

Summary

In this article, we looked at the basics of the OpenMP loop construct. In order to parallelize the for loop, it must be in the canonical loop form. We examined the definition of the canonical loop form. At the end, we parallelized a for loop with the OpenMP loop clause.

Links: