Execution on Intel® Xeon Phi™ co-processor

From Medusa: Coordinate Free Mehless Method implementation
Revision as of 17:21, 1 March 2017 by Mkolman (talk | contribs) (Speedup by vectorization)

Jump to: navigation, search

Speedup by parallelization

We tested the speedups on the Intel® Xeon Phi™ with the following code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <omp.h>
#include <math.h>

int main(int argc, char *argv[]) {
    int numthreads;
    int n;

    assert(argc == 3 && "args: numthreads n");
    sscanf(argv[1], "%d", &numthreads);
    sscanf(argv[2], "%d", &n);

    printf("Init...\n");
    printf("Start (%d threads)...\n", numthreads);
    printf("%d test cases\n", n);

    int m = 1000000;
    double ttime = omp_get_wtime();

    int i;
    double d = 0;
#pragma offload target(mic:0)
    {
#pragma omp parallel for private (i) schedule(static) num_threads(numthreads)
        for(i = 0; i < n; ++i) {
            for(int j = 0; j < m; ++j) {
                d = sin(d) + 0.1 + j;
                d = pow(0.2, d)*j;
            }
        }
    }
    double time = omp_get_wtime() - ttime;
    fprintf(stderr, "%d %d %.6f\n", n, numthreads, time);
    printf("time: %.6f s\n", time);
    printf("Done d = %.6lf.\n", d);

    return 0;
}

The code essentially distributes a problem of size $n\cdot m$ among numthreads cores, We tested the time of execution for $n$ from the set $\{1, 10, 20, 50, 100, 200, 500, 1000\}$ and numthreads from $1$ to $350$. The plots of exectuion times and performance speeups are shown below.

A square of nodes coloured according to the solution(with smaller and larger node density)
Figure 1: A picture of our solution (with smaller and larger node density)


A square of nodes coloured according to the solution(with smaller and larger node density)
Figure 2: A picture of our solution (with smaller and larger node density)



The code was compiled using:
icc -openmp -O3 -qopt-report=2 -qopt-report-phase=vec -o test test.cpp
without warnings or errors. Then, in order to offload to Intel Phi, user must be logged in as root:
sudo su
To run correctly, intel compiler and runtime variables must be sourced:
source /opt/intel/bin/compilervars.sh intel64
Finally, the code was tested using the following command, where test is the name of the compiled executable:
for n in 1 10 20 50 100 200 500 1000; do for nt in {1..350}; echo $nt $n; ./test $nt $n 2>> speedups.txt; done; done


Speedup by vectorization

Intel Xeon Phi has a 512 bit of space for simultaneous computation, which means it can calculate 8 double (or 16 single) operations at the same time. This is called vectorization and greatly improves code execution.

Consider the following code of speedtest.cpp:

#include <cmath>
#include <iostream>

int main() {
    const int N = 104;
    double a[N];
    for (int i = 0; i < 1e5; i++)
        for (int j = 0; j < N; j++)
            a[j] = std::sin(std::exp(a[j]-j)*3 * i + i*j);
    std::cout << a[4] << "\n";
    return 0;
}

Intel's C++ compiler ICPC will successfully vectorize the inner for loop, so that it will run significantly faster than with vectorization disabled.

The code can be compiled with or without vectorization

$ icpc speedtest.cpp -o vectorized_speedtest -O3
$ icpc speedtest.cpp -o unvectorized_speedtest -O3 -no-vec
Machine ASUS ZenBook Pro UX501VW Intel® Xeon® CPU E5-2620 v3 Intel® Xeon® CPU E5-2620 v3 Intel® Xeon® CPU E5-2620 v3 Intel® Xeon Phi™ Coprocessor SE10/7120 Intel® Xeon Phi™ Coprocessor SE10/7120
Compiler g++-6.3.1 g++-4.8.5 icpc-16.0.1 icpc-16.0.1 -no-vec icpc-16.0.1 icpc-16.0.1 -no-vec
Execution time[s] 0.63 - 0.66 0.65 - 0.66 0.155 - 0.160 0.50 - 0.51 0.25 - 0.26 11.1 - 11.2

Code incapable of vectorization

On the other hand there is a very similar code that can not be vectorized.

#include <cmath>
#include <iostream>

int main() {
    const int N = 104;
    double a;
    for (int i = 0; i < 1e5; i++)
        for (int j = 0; j < N; j++)
            a = std::sin(std::exp(a-j)*3 * i + i*j);
    std::cout << a << "\n";
    return 0;
}
Machine ASUS ZenBook Pro UX501VW Intel® Xeon® CPU E5-2620 v3 Intel® Xeon® CPU E5-2620 v3 Intel® Xeon® CPU E5-2620 v3 Intel® Xeon Phi™ Coprocessor SE10/7120 Intel® Xeon Phi™ Coprocessor SE10/7120
Compiler g++-6.3.1 g++-4.8.5 icpc-16.0.1 icpc-16.0.1 -no-vec icpc-16.0.1 icpc-16.0.1 -no-vec
Execution time[s] 0.80 - 0.82 0.72 - 0.73 0.58 - 0.59 0.58 - 0.59 10.9 - 11.0 10.9 - 11.0