Execution on Intel® Xeon Phi™ co-processor

== Speedup by parallelization ==

We tested the speedups on the Intel® Xeon Phi™ with the following code:

<syntaxhighlight lang="c++">
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <omp.h>
#include <math.h>

int main(int argc, char *argv[]) {
    int numthreads;
    int n;

    assert(argc == 3 && "args: numthreads n");
    sscanf(argv[1], "%d", &numthreads);
    sscanf(argv[2], "%d", &n);

    printf("Init...\n");
    printf("Start (%d threads)...\n", numthreads);
    printf("%d test cases\n", n);

    int m = 1000000;
    double ttime = omp_get_wtime();

    int i;
    double d = 0;
// Offload the enclosed block to the first Xeon Phi coprocessor.
#pragma offload target(mic:0)
    {
// Distribute the n outer iterations statically among numthreads threads.
#pragma omp parallel for private (i) schedule(static) num_threads(numthreads)
        for(i = 0; i < n; ++i) {
            for(int j = 0; j < m; ++j) {
                // m iterations of dummy floating-point work; the result d is printed at the end
                d = sin(d) + 0.1 + j;
                d = pow(0.2, d)*j;
            }
        }
    }
    double time = omp_get_wtime() - ttime;
    fprintf(stderr, "%d %d %.6f\n", n, numthreads, time);
    printf("time: %.6f s\n", time);
    printf("Done d = %.6lf.\n", d);

    return 0;
}
</syntaxhighlight>

The code essentially distributes a problem of size $n\cdot m$ among numthreads cores. We tested the execution time for $n$ from the set $\{1, 10, 20, 50, 100, 200, 500, 1000\}$ and numthreads from $1$ to $350$. The plots of execution times and speedups are shown below.
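
Speedup is understood here in the usual sense: for a fixed problem size $n$, $S(p) = T(1)/T(p)$, where $T(p)$ is the execution time measured with $p$ threads.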

Figure 1: A picture of our solution (with smaller and larger node density)

Figure 2: A picture of our solution (with smaller and larger node density)



The code was compiled using:
icc -openmp -O3 -qopt-report=2 -qopt-report-phase=vec -o test test.cpp
without warnings or errors. Then, in order to offload to the Intel Xeon Phi, the user must be logged in as root:
sudo su
To run correctly, the Intel compiler and runtime variables must be sourced:
source /opt/intel/bin/compilervars.sh intel64
Finally, the code was tested using the following command, where test is the name of the compiled executable:
for n in 1 10 20 50 100 200 500 1000; do for nt in {1..350}; do echo $nt $n; ./test $nt $n 2>> speedups.txt; done; done
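
Each run appends one line of the form n numthreads time to speedups.txt (this is what the fprintf call in the test program writes to stderr). As a minimal post-processing sketch, assuming this file format and taking the numthreads = 1 run as the baseline, the speedups could be computed with a small helper such as the following (the helper is not part of the original test and is only illustrative; compile with -std=c++11):

<syntaxhighlight lang="c++">
#include <fstream>
#include <iostream>
#include <map>
#include <utility>

int main() {
    // Each line of speedups.txt holds one run: problem size n, number of threads, time in seconds.
    std::ifstream in("speedups.txt");
    std::map<int, double> baseline;               // time with 1 thread, per problem size
    std::map<std::pair<int, int>, double> times;  // (n, numthreads) -> time
    int n, nt;
    double t;
    while (in >> n >> nt >> t) {
        times[{n, nt}] = t;
        if (nt == 1) baseline[n] = t;
    }
    // Speedup for a given n and thread count: T(1 thread) / T(numthreads threads).
    for (const auto& run : times) {
        int size = run.first.first;
        int threads = run.first.second;
        if (baseline.count(size))
            std::cout << size << " " << threads << " "
                      << baseline[size] / run.second << "\n";
    }
    return 0;
}
</syntaxhighlight>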


== Speedup by vectorization ==

The Intel Xeon Phi has 512-bit wide vector registers, which means it can perform 8 double-precision (or 16 single-precision) floating-point operations simultaneously. This is called vectorization and can greatly improve execution speed.
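
As a quick sanity check of those numbers: a double occupies 64 bits, so one 512-bit register holds $512/64 = 8$ doubles, while a single-precision float occupies 32 bits, giving $512/32 = 16$ lanes.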

Consider the following code, speedtest.cpp:

<syntaxhighlight lang="c++">
#include <cmath>
#include <iostream>

int main() {
    const int N = 104;
    double a[N] = {};  // zero-initialized so the first iteration reads defined values
    for (int i = 0; i < 1e5; i++)
        for (int j = 0; j < N; j++)
            // each iteration of the inner loop writes to its own element a[j]
            a[j] = std::sin(std::exp(a[j]-j)*3 * i + i*j);
    std::cout << a[4] << "\n";
    return 0;
}
</syntaxhighlight>

Intel's C++ compiler ICPC will successfully vectorize the inner for loop, so the code runs significantly faster than with vectorization disabled.

The code can be compiled with or without vectorization:

$ icpc speedtest.cpp -o vectorized_speedtest -O3
$ icpc speedtest.cpp -o unvectorized_speedtest -O3 -no-vec
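
To confirm what the compiler actually did, the same vectorization report flags used above (-qopt-report=2 -qopt-report-phase=vec) can be added to these commands; the generated optimization report then indicates whether each loop was vectorized.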

{| class="wikitable"
! Machine
| ASUS ZenBook Pro UX501VW
| Intel® Xeon® CPU E5-2620 v3
| Intel® Xeon® CPU E5-2620 v3
| Intel® Xeon® CPU E5-2620 v3
| Intel® Xeon Phi™ Coprocessor SE10/7120
| Intel® Xeon Phi™ Coprocessor SE10/7120
|-
! Compiler
| g++-6.3.1
| g++-4.8.5
| icpc-16.0.1
| icpc-16.0.1 -no-vec
| icpc-16.0.1 -mmic
| icpc-16.0.1 -mmic -no-vec
|-
! Double time[s]
| 0.63 - 0.66
| 0.65 - 0.66
| 0.155 - 0.160
| 0.50 - 0.51
| 0.25 - 0.26
| 11.1 - 11.2
|-
! Float time[s]
| 0.65 - 0.71
| 0.53 - 0.55
| 0.155 - 0.160
| 0.17 - 0.19
| 0.37 - 0.38
| 4.2 - 4.3
|}

Comparing the Xeon Phi runs with and without vectorization (0.25 s versus 11.1 s for doubles), we can see a massive 44-fold speedup from vectorization.

=== Code incapable of vectorization ===

On the other hand, here is a very similar piece of code that cannot be vectorized. Now every iteration of the inner loop updates the same scalar variable instead of writing to its own array element, so each iteration depends on the result of the previous one. ICPC is therefore unable to vectorize the loop, and the -no-vec compile flag makes no difference.

<syntaxhighlight lang="c++">
#include <cmath>
#include <iostream>

int main() {
    const int N = 104;
    double a = 0;  // initialized so the first iteration reads a defined value
    for (int i = 0; i < 1e5; i++)
        for (int j = 0; j < N; j++)
            // every iteration updates the same scalar a, creating a loop-carried dependence
            a = std::sin(std::exp(a-j)*3 * i + i*j);
    std::cout << a << "\n";
    return 0;
}
</syntaxhighlight>

{| class="wikitable"
! Machine
| ASUS ZenBook Pro UX501VW
| Intel® Xeon® CPU E5-2620 v3
| Intel® Xeon® CPU E5-2620 v3
| Intel® Xeon® CPU E5-2620 v3
| Intel® Xeon Phi™ Coprocessor SE10/7120
| Intel® Xeon Phi™ Coprocessor SE10/7120
|-
! Compiler
| g++-6.3.1
| g++-4.8.5
| icpc-16.0.1
| icpc-16.0.1 -no-vec
| icpc-16.0.1 -mmic
| icpc-16.0.1 -mmic -no-vec
|-
! Double time[s]
| 0.80 - 0.82
| 0.72 - 0.73
| 0.58 - 0.59
| 0.58 - 0.59
| 10.9 - 11.0
| 10.9 - 11.0
|-
! Float time[s]
| 0.69 - 0.72
| 0.66 - 0.67
| 0.32 - 0.33
| 0.32 - 0.34
| 4.1 - 4.2
| 4.1 - 4.2
|}