OpenMP: Monte Carlo method for Pi
In the previous article we studied a sequential Monte Carlo algorithm for approximating π.
In this article we briefly repeat the idea of the sequential algorithm. Then, we parallelize the sequential version with the help of OpenMP.
The sequential algorithm
A Monte Carlo algorithm for approximating π uniformly generates points in the square [-1, 1] x [-1, 1]. Then it counts the points which lie inside the unit circle.
An approximation of π is then computed by the following formula:
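π ≈ 4 · (number of samples inside the circle) / (number of all samples)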
For more details about the formula, see the previous article.
The source code for the approximation algorithm is shown below.
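The listing is a minimal sketch; the UniformDistribution helper shown here, a thin wrapper around std::mt19937 and std::uniform_real_distribution, is one possible implementation.

```cpp
#include <random>

// One possible implementation of the UniformDistribution helper:
// it produces uniform samples in [min, max].
class UniformDistribution
{
public:
    UniformDistribution(double min, double max)
        : engine(std::random_device{}()), distribution(min, max)
    {}

    double sample() { return distribution(engine); }

private:
    std::mt19937 engine;
    std::uniform_real_distribution<double> distribution;
};

double approximatePi(int numSamples)
{
    UniformDistribution distribution(-1.0, 1.0);

    int counter = 0;
    for (int s = 0; s < numSamples; s++)
    {
        double x = distribution.sample();
        double y = distribution.sample();

        // The point (x, y) lies inside the unit circle if x^2 + y^2 < 1.
        if (x * x + y * y < 1.0)
        {
            counter++;
        }
    }

    // The approximation formula from above.
    return 4.0 * counter / numSamples;
}
```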
It uses UniformDistribution to generate samples in [-1, 1]. Then it counts the number of points inside the circle and computes an approximation of π with the approximation formula.
Parallelization with OpenMP
The task is to parallelize the approximation algorithm with the help of OpenMP.
Looking at the function approximatePi, we might be tempted to simply parallelize its for loop with #pragma omp parallel for. Unfortunately, this approach has a problem: it introduces data races into the program.
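As a sketch, the tempting (and incorrect) version would look like this (the name naiveApproximatePi is only for illustration):

```cpp
// Incorrect: approximatePi with a naively parallelized loop.
double naiveApproximatePi(int numSamples)
{
    UniformDistribution distribution(-1.0, 1.0);
    int counter = 0;

    #pragma omp parallel for
    for (int s = 0; s < numSamples; s++)
    {
        double x = distribution.sample();   // data race: shared generator state
        double y = distribution.sample();
        if (x * x + y * y < 1.0)
        {
            counter++;                      // data race: shared counter
        }
    }

    return 4.0 * counter / numSamples;
}
```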
The reasons for the data races are the variables distribution and counter, which are updated by every iteration. If we parallelized the for loop, the variables might be updated by different threads at the same time.
Note that distribution is updated in every iteration, because drawing a sample updates the internal state of the underlying random number generator.
Since a simple parallelization of approximatePi is not possible, we slightly change the existing code.
Modifications
Our strategy is to divide the sampling into chunks. This strategy introduces two nested loops:
- an outer loop over the chunks and
- an inner loop over the samples for each chunk.
The code for the outer loop is presented below.
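The following is a sketch of the function; the parameter name numTotalSamples is chosen for this example:

```cpp
int samplesInsideCircle(int numSamples);   // defined below

double parallelPi(int numTotalSamples)
{
    int numChunks = 8;
    int chunk = numTotalSamples / numChunks;   // samples per chunk

    int counter = 0;
    for (int i = 0; i < numChunks; i++)
    {
        counter += samplesInsideCircle(chunk);
    }

    // The known approximation formula.
    return 4.0 * counter / numTotalSamples;
}
```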
The function parallelPi divides the total number of samples into eight chunks. The chunk variable holds the number of samples for each chunk. The function then loops over the chunks. In the for loop, the function samplesInsideCircle generates the samples and counts how many of them lie inside the unit circle. In the end, parallelPi computes an approximation of π with the help of the known formula.
The missing piece is the function samplesInsideCircle. It is presented below.
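A sketch of the function:

```cpp
int samplesInsideCircle(int numSamples)
{
    // Each call creates its own distribution, so every chunk gets
    // an independent random number generator.
    UniformDistribution distribution(-1.0, 1.0);

    int counter = 0;
    for (int s = 0; s < numSamples; s++)
    {
        double x = distribution.sample();
        double y = distribution.sample();
        if (x * x + y * y < 1.0)
        {
            counter++;
        }
    }

    return counter;
}
```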
This function is very similar to approximatePi: samplesInsideCircle generates numSamples uniform samples and counts how many of them lie inside the circle.
Parallelization
Now, we are ready to parallelize the algorithm. We choose to parallelize the for loop over the chunks, since it is the outer loop.
In this for loop, we have three shared variables: numChunks, counter and chunk. We do not modify numChunks and chunk; they are read-only variables, so they pose no problem. On the other hand, we do modify counter, so we must deal with the data races on this variable.
Before we continue, let us recall the sequential version of the program. There we had problems (data races) with two variables: distribution and counter. The counter is still problematic, but we no longer have a problem with distribution. We avoided it by having a different distribution for each chunk. In other words, the sequential version initializes one object of UniformDistribution, while the parallel version initializes numChunks objects of UniformDistribution.
Now, we continue with the problem of counter.
OpenMP has a special clause of the loop construct which solves this kind of problem: the reduction clause, which can specify a reduction. An example of a reduction is
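sum = a0 + a1 + a2 + … + a9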
Here, + is the operation which reduces a0, a1, …, a9 into sum.
The reduction clause requires two things to specify a reduction:
- the reduction operation and
- the variable which will hold the result of the reduction.
In our for loop, counter is the variable which will hold the result, and + is the reduction operation. The loop construct then looks like this:
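```cpp
// The outer loop of parallelPi with the reduction clause.
#pragma omp parallel for reduction(+ : counter)
for (int i = 0; i < numChunks; i++)
{
    counter += samplesInsideCircle(chunk);
}
```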
This loop construct correctly parallelizes the for loop. OpenMP gives each thread a private copy of counter and safely combines the copies into the shared counter at the end of the loop, so no data race occurs.
Example
We run the sequential and parallel versions of the Monte Carlo algorithm for approximating π and measure the execution time of both versions. The source code is available here.
We can see that the execution time of the parallel version (3.6 seconds) is lower than the execution time of the sequential version (20 seconds). Both results are accurate up to four decimal places. I ran the computations on my machine, which has 4 two-threaded cores.
In the next article, we will look in more detail at the reduction clause.
Summary
We parallelized a Monte Carlo method for approximating π. We saw that the parallelization can decrease the execution time.