| Number of doubles | Orig. Bcast / p=1 | New Bcast / p=1 | Orig. Bcast / p=2 | New Bcast / p=2 | Orig. Bcast / p=3 | New Bcast / p=3 | Orig. Bcast / p=4 | New Bcast / p=4 |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 1 | 0.000004 | 0.000004 | 0.000223 | 0.000262 | 0.000275 | 0.000267 | 0.000459 | 0.000290 |
| 10 | 0.000004 | 0.000004 | 0.000268 | 0.000268 | 0.000290 | 0.000276 | 0.000457 | 0.000312 |
| 100 | 0.000004 | 0.000007 | 0.000746 | 0.000741 | 0.000818 | 0.000739 | 0.001220 | 0.000894 |
| 1000 | 0.000006 | 0.000037 | 0.002422 | 0.002329 | 0.003103 | 0.003093 | 0.005784 | 0.003748 |
| 10000 | 0.000052 | 0.000379 | 0.017470 | 0.015790 | 0.029754 | 0.027422 | 0.049730 | 0.040134 |

These are the results of the tests I ran with the original broadcast function provided by the MPI framework (Orig. Bcast) and with the functions I wrote myself (New Bcast), which use point-to-point (P2P) communication. Each test was run 100 times, and the table reports the average. The number of processes used for each test is given in the table as p=x, where x is the number of processes. Because it is not necessary for performance testing, I did not implement my own functions for all datatypes and all reduction operations, only for MPI_DOUBLE and MPI_SUM.
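
The averaging procedure described above can be sketched as follows. This is only a minimal illustration in Python, not the code used for the measurements: `mean_runtime`, `operation` and `dummy_broadcast` are hypothetical names, and `dummy_broadcast` merely stands in for a single broadcast call.

```python
import time


def mean_runtime(operation, repeats=100):
    """Return the average wall-clock time of `operation` over `repeats` runs."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        total += time.perf_counter() - start
    return total / repeats


def dummy_broadcast():
    # Hypothetical stand-in for one broadcast of 10000 doubles:
    # it only copies a local buffer and does no communication.
    buffer = [0.0] * 10000
    return list(buffer)


average = mean_runtime(dummy_broadcast, repeats=100)
print(f"average over 100 runs: {average:.6f} s")
```

Timing each run separately and dividing the sum by the number of runs gives one value per table cell.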
The results show that the performance of the original and the self-written functions mostly does not differ much. An interesting point is that the original functions seem to be a little faster with few processes, whereas my own functions perform better with more processes: with one process the original is faster from 100 doubles upward, while with three and four processes my own functions are faster at every message size. One possible cause is a restriction on broadcast (BC) traffic on the machines I used for testing; I ran the tests on the university machines in the late evening. As I do not know which algorithms the original implementation uses, I cannot point out the differences between the original and the new implementation.
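
To make the comparison concrete, the following sketch computes the ratio of original to new broadcast time for every cell of the table. The timings are copied from the table; the names `timings` and `speedup` are illustrative assumptions. A ratio above 1 means the self-written function was faster in that measurement.

```python
# Timings copied from the table: {processes: {doubles: (orig, new)}}
timings = {
    1: {1: (0.000004, 0.000004), 10: (0.000004, 0.000004),
        100: (0.000004, 0.000007), 1000: (0.000006, 0.000037),
        10000: (0.000052, 0.000379)},
    2: {1: (0.000223, 0.000262), 10: (0.000268, 0.000268),
        100: (0.000746, 0.000741), 1000: (0.002422, 0.002329),
        10000: (0.017470, 0.015790)},
    3: {1: (0.000275, 0.000267), 10: (0.000290, 0.000276),
        100: (0.000818, 0.000739), 1000: (0.003103, 0.003093),
        10000: (0.029754, 0.027422)},
    4: {1: (0.000459, 0.000290), 10: (0.000457, 0.000312),
        100: (0.001220, 0.000894), 1000: (0.005784, 0.003748),
        10000: (0.049730, 0.040134)},
}


def speedup(orig, new):
    """Ratio of original to new time; above 1 means the new function is faster."""
    return orig / new


for p, rows in sorted(timings.items()):
    for n, (orig, new) in sorted(rows.items()):
        print(f"p={p} n={n:>5}: orig/new = {speedup(orig, new):.2f}")
```

For p=4 every ratio is above 1, and for p=1 the ratio falls below 1 from 100 doubles upward, which matches the trend described above.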