Execution on Intel® Xeon Phi™ co-processor

Speedup by parallelization

We tested the speedups on the Intel® Xeon Phi™ with the following code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <omp.h>
#include <math.h>

int main(int argc, char *argv[]) {
    int numthreads;
    int n;

    assert(argc == 3 && "args: numthreads n");
    sscanf(argv[1], "%d", &numthreads);
    sscanf(argv[2], "%d", &n);

    printf("Init...\n");
    printf("Start (%d threads)...\n", numthreads);
    printf("%d test cases\n", n);

    int m = 1000000;
    double ttime = omp_get_wtime();

    int i;
    double d = 0;
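    // Note: d is shared by all threads and updated without synchronization,
    // so its final value is not deterministic; it only serves as a timing workload.
#pragma offload target(mic:0)
    {
#pragma omp parallel for private (i) schedule(static) num_threads(numthreads)
        for(i = 0; i < n; ++i) {
            for(int j = 0; j < m; ++j) {
                d = sin(d) + 0.1 + j;
                d = pow(0.2, d)*j;
            }
        }
    }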
    double time = omp_get_wtime() - ttime;
    fprintf(stderr, "%d %d %.6f\n", n, numthreads, time);
    printf("time: %.6f s\n", time);
    printf("Done d = %.6lf.\n", d);

    return 0;
}

The code essentially distributes a problem of size $n\cdot m$ among numthreads threads. We measured the execution time for $n$ from the set $\{1, 10, 20, 50, 100, 200, 500, 1000\}$ and numthreads from $1$ to $350$. The plots of execution times and speedups are shown below.


Figure 1: A picture of our solution (with smaller and larger node density)

Figure 2: A picture of our solution (with smaller and larger node density)

The code was compiled without warnings or errors using:
icc -openmp -O3 -qopt-report=2 -qopt-report-phase=vec -o test test.cpp
In order to offload to the Intel Xeon Phi, the user must be logged in as root:
sudo su
For the program to run correctly, the Intel compiler and runtime environment variables must be sourced:
source /opt/intel/bin/compilervars.sh intel64
Finally, the code was tested using the following command, where test is the name of the compiled executable:
for n in 1 10 20 50 100 200 500 1000; do for nt in {1..350}; do echo $nt $n; ./test $nt $n 2>> speedups.txt; done; done

Speedup by vectorization

The Intel Xeon Phi has 512-bit vector registers, which means it can perform 8 double-precision (or 16 single-precision) floating-point operations at the same time. Using these registers is called vectorization, and it can greatly speed up code execution.

Consider the following code of speedtest.cpp:

#include <cmath>
#include <iostream>

int main() {
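    const int N = 104;
    double a[N] = {};  // zero-initialize: reading uninitialized values is undefined behavior
    for (int i = 0; i < 1e5; i++)
        // iterations over j touch different elements, so this loop can be vectorized
        for (int j = 0; j < N; j++)
            a[j] = std::sin(std::exp(a[j]-j)*3 * i + i*j);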
    std::cout << a[4] << "\n";
    return 0;
}

Intel's C++ compiler, icpc, successfully vectorizes the inner for loop, so the code runs significantly faster than with vectorization disabled.

The code can be compiled with or without vectorization:

$ icpc speedtest.cpp -o vectorized_speedtest -O3
$ icpc speedtest.cpp -o unvectorized_speedtest -O3 -no-vec

The table below shows the execution times of the code above on different machines with different settings. The two rows of times give the execution time with the double and float data types, respectively.

Machine	ASUS ZenBook Pro UX501VW	Intel® Xeon® CPU E5-2620 v3	Intel® Xeon® CPU E5-2620 v3	Intel® Xeon® CPU E5-2620 v3	Intel® Xeon Phi™ Coprocessor SE10/7120	Intel® Xeon Phi™ Coprocessor SE10/7120
Compiler	g++-6.3.1	g++-4.8.5	icpc-16.0.1	icpc-16.0.1 -no-vec	icpc-16.0.1 -mmic	icpc-16.0.1 -mmic -no-vec
Double time[s]	0.63 - 0.66	0.65 - 0.66	0.155 - 0.160	0.50 - 0.51	0.25 - 0.26	11.1 - 11.2
Float time[s]	0.65 - 0.71	0.53 - 0.55	0.155 - 0.160	0.17 - 0.19	0.37 - 0.38	4.2 - 4.3

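On the Xeon Phi, vectorization yields a massive 44-fold speedup for doubles (about 11.1 s without and 0.25 s with vectorization) and roughly an 11-fold speedup for floats (4.2 s versus 0.37 s). On the host Xeon the gain is smaller, about 3-fold for doubles.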

Code incapable of vectorization

On the other hand, there is very similar code that cannot be vectorized. Here all iterations of the inner loop access the same variable instead of each accessing its own array element, so every iteration depends on the result of the previous one. ICPC is unable to vectorize this code, so compiling with the -no-vec flag makes no difference.

#include <cmath>
#include <iostream>

int main() {
    const int N = 104;
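    double a = 0;  // initialize: reading an uninitialized value is undefined behavior
    for (int i = 0; i < 1e5; i++)
        // each iteration needs the value of a from the previous one (loop-carried dependency)
        for (int j = 0; j < N; j++)
            a = std::sin(std::exp(a-j)*3 * i + i*j);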
    std::cout << a << "\n";
    return 0;
}

Machine	ASUS ZenBook Pro UX501VW	Intel® Xeon® CPU E5-2620 v3	Intel® Xeon® CPU E5-2620 v3	Intel® Xeon® CPU E5-2620 v3	Intel® Xeon Phi™ Coprocessor SE10/7120	Intel® Xeon Phi™ Coprocessor SE10/7120
Compiler	g++-6.3.1	g++-4.8.5	icpc-16.0.1	icpc-16.0.1 -no-vec	icpc-16.0.1 -mmic	icpc-16.0.1 -mmic -no-vec
Double time[s]	0.80 - 0.82	0.72 - 0.73	0.58 - 0.59	0.58 - 0.59	10.9 - 11.0	10.9 - 11.0
Float time[s]	0.69 - 0.72	0.66 - 0.67	0.32 - 0.33	0.32 - 0.34	4.1 - 4.2	4.1 - 4.2

Speedup by vectorization and parallelization

Consider the following code:

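#include <stdio.h>
#include <stdlib.h>  // posix_memalign, free

int main(){
    double *a, *b, *c;
    int i,j,k, ok, n=1000;  // or n=10000
    // allocate memory on the heap, aligned to a 64-byte boundary
    ok  = posix_memalign((void**)&a, 64, n*n*sizeof(double));
    ok |= posix_memalign((void**)&b, 64, n*n*sizeof(double));
    ok |= posix_memalign((void**)&c, 64, n*n*sizeof(double));
    if (ok != 0) {
        fprintf(stderr, "memory allocation failed\n");
        return 1;
    }
    // initialize matrices
    for (i = 0; i < n*n; ++i) {
        a[i] = 1;
        b[i] = 1;
        c[i] = 1;
    }
    // parallelize via OpenMP on MIC; j and k are declared outside the loop,
    // so they must be made private to avoid a data race between threads
    #pragma omp parallel for private(j, k)
    for( i = 0; i < n; i++ ) {
        for( k = 0; k < n; k++ ) {
            // tell the compiler that accesses are aligned and independent
            #pragma vector aligned
            #pragma ivdep
            for( j = 0; j < n; j++ ) {
                //c[i][j] = c[i][j] + a[i][k]*b[k][j];
                c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];
            }
        }
    }
    printf("%f\n", c[n]);
    free(a);
    free(b);
    free(c);
    return 0;
}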

It is designed to be both parallelizable and vectorizable. However, for the smaller system ($N=1000$) the benefit of vectorization is lost as the number of threads grows.

Moreover, even the larger test case ($N=10000$) is barely faster than execution on the host processor. With 61 threads and vectorization it runs for one minute, whereas the host needs 1 minute and 15 seconds with 24 threads.

It is interesting to note that the total processor time used (the CPU time summed over all threads, as opposed to the elapsed wall-clock time) behaves differently for vectorized and nonvectorized code. For $N=10^4$, nonvectorized code consistently uses 133 minutes regardless of the number of threads, whereas vectorized code goes from 33 minutes with one thread to an hour of total processing time with 61 threads. Similarly, for $N=10^3$ and 61 threads, nonvectorized code uses 125% of its single-thread processing time, whereas for vectorized code that figure is 600%.

Figure 3: Times used for the sample problem with and without vectorization for two different N.
