Title of project: | Communications and mapping in parallel computers |

Major field of science: | Technical Science |

Research area: | Information and Communication Technologies |

Duration of project: | 3 years (1996-98) |

Total work load in hours: | 11.313 |

Principal investigator: | Dr. Roman Trobec |

Project summary: | In the project, different types of communications among co-operating processors, and processes mapping on parallel computers will be investigated; in a general case, some faulty processors will also be considered. Our results are intended to be implemented in software for the future parallel operating systems as well as in custom designed VLSI circuits. To confirm research results, the theory will be verified with some important computationally intensive applications. New methods will be contributed to parallel integration in molecular dynamics, solutions of difference equations, and parallel parsing. |

Recent advances in technology have enabled the construction of computers whose architectures no longer conform to the traditional von Neumann model of computation. Parallelism is the dominating feature these machines have in common. Parallel computing is one of the most rapidly growing research fields in computer science and engineering. The earliest parallel machines were SIMD computers and vector computers which generally relied on internal parallelism and pipelining to improve data throughput. In the next generation, several independently operating processors were connected in order to co-operate on the solution of a common problem (MIMD). Those multiprocessors generally provided the programmer with shared memory, enabling code to be written in a similar style as for conventional von Neumann computers. However, architectural complexity imposes a limit on the size of computers being built on a pure shared memory basis, and more attention was paid to the design and construction of multiprocessor systems with memory distributed among independent processors (Distributed Memory Multiprocessors - DMMP). Among the first commercial systems that were built as DMMP were the Intel iPSC hypercube series, the Floating Point Systems T-Series and a number of Inmos transputer based systems. Recently, well known supercomputer companies such as Cray Research and Convex have been testing the parallel machines CRAY T3D and CONVEX Exemplar. Besides, workstation networks have become a suitable platform for parallel processing. The concept of such systems offers virtually unlimited design flexibility and scalability. We hope to be able to build computers with orders of magnitude more computing power than we have now. Moreover, those computers would provide high performance at a comparatively low cost. One of the most fundamental challenges computer science is facing with today is need for developing algorithms, programming tools, and applications to harness the vast computing power of parallel computers; moreover, a theoretical basis for this new technology has to be improved. Since it is expected that parallel computers will gain an ever increasing share of the market, these developments will have important commercial consequences. The basic project goals are the following. First, to widen the knowledge of parallel computing area, particularly communications and mapping, and systems with faulty nodes. Second, some important computationally intensive applications will be implemented in order to test our research results. Third, the project will establish minimal scientific potential capable to make researches into future computer systems.

Parallel computer systems represent a promising tool for increasing computational power of computers, hence the theoretical basis for parallel computation has to be matured. The proposed research project is concentrated on the following topics:

**Mapping, Communications and Software Tools in Parallel Systems.**
Assigning processes to processors, data distribution, and communication
among the processors is fundamental for all applications running on parallel
computers. Efficient execution of parallel programs on an existing parallel
computer requires proper mapping of programs onto the underlying machine
architecture which can either be general-purpose (*Bokhari, 1981; Robic
et al., 1992*) or special-purpose (*Ozimek and Tasic, 1995*). A
practical model of a parallel computer has at each of its nodes a fixed
size buffer allowing only limited link contention during message routing.
Broadcasting (one-to-all), multicasting (one-to-more) and gossiping (all-to-all)
are communication problems arising in distributed memory multicomputers
(*Hedetniemi et al., 1986*). The need for selective broadcasting arises
in many important applications, yet there is still no appropriate mechanism
for such broadcasting currently available in parallel computers. Study
of fault tolerant communication algorithms is also a very important topic
of parallel computing (*Chen and Shing, 1990; Robic and Silc, 1994*).
The main reason for delayed practical usage of parallel computers is the
lack of software for developing parallel programs. Functional programming
languages are among the most promising platforms for parallelisation, because
they are not inherently sequential as conventional imperative language
are (*Peyton Jones, 1987*). A tool for designing platform-independent
parallel programs is, for example, PCL - parallel computation library (*Jerebic
et al., 1992*). It enables writing portable programs for mesh and torus
networks of processors, and developing programs on sequential computers.

**Failures in Parallel Systems.** A popular approach for handling
faults in parallel systems is to restore regularity of a system, deteriorated
by failures. Many authors have discussed hardware redundancy with spare
units and switches combined with internal coding of computational data.
These methods can be used for improving reliability, since they also cover
intermittent faults. However, they cannot be used for dealing with fault
localisation. Some authors have investigated multi-level diagnosis and
configuration in the presence of faults. In papers dealing with distributed
diagnosability, the assumption is made that each fault-free unit is capable
of determining the diagnostic state of all other units in the system (*Somani
and Agarwal, 1992*). In the case of massively parallel systems, this
approach is not feasible due to a large number of components in the system
and the amount of communication required. Also, if the number of faults
is greater than some upper bound that depends on connectivity, a false
diagnosis can be obtained. These facts represent a serious drawback for
reusability of earlier developed methods in the area of parallel system-level
diagnosability. In several papers, it has been shown that a local procedure
could be used in massively parallel systems. Some recent works (e.g. *Blough
and Pelc, 1993*) on the diagnosis of regular constant-degree systems
have been done using probability models. Promising results for the strategy
that uses only local information for system diagnosis have been reported.
Our work (*Trobec and Jerebic, 1994*) is based on the assumption that
random production and run-time failures are present. The run-time diagnostic
is local and organised as a parallel procedure.

**Parallel Numerical and Non-numerical Applications.** *Parallel
Numerical Linear Algebra* is an important tool for solving computationally
intensive problems arising in various scientific areas. Talking about the
development of parallel algorithms, we have in mind the design of scaleable
routines that can be included into existing libraries and run on various
High Performance Computing platforms. Recently, parallel implementations
have been directed by the development of scaleable and platform-independent
libraries based on Basic Linear Algebra Subroutines (BLAS) (*Choi et
al., 1994*) and introduction of message-passing standard. This enables
the design of parallel applications not only on supercomputers but also
on clusters of workstations. Our main research interest has been in the
area of parallel algorithms for solving large sparse systems of linear
equations with the emphasis on the system structure (*Evans and Caf,
1994*) and use of updating techniques (*Blaznik et al., 1995*).
*Parallel Molecular Dynamics Integration.* Demands for the extension
of computer simulation methods to molecular systems of increasing size
and complexity cannot be met solely by developing better hardware and consuming
more computer time; since the needs for conceptual improvements of existing
algorithms are becoming more and more obvious. Enlargement of the scope
of existing methods of molecular dynamics of complex molecular systems
and their parallel implementations on the ring topology for distributed
memory computers is presented in (*Trobec and Janezic, 1995*). *Parallel
Parsing:* The membership problem for formal languages is one of the
basic problems in the theory of computation. It is well solved for sequential
computers, but no significant progress has been made in the field of parallel
computers yet. Syntax analysis of context-free languages represents an
important subproblem because syntax analysers or parsers form the skeleton
of the majority of compilers and other program processing tools (*Jain
and Chuadhari, 1994*). The syntax analysis should be split to a number
of independent parsing processes, which should be carried out by embedded
LR(k) parsers. An algorithm for splitting LR(k) parsing tables and generating
tables for embedded LR(k) parsers is still to be found. *Parallel Heat
Transfer Simulation:* Computer simulation and other scientific applications
that have until recently been beyond the capability of conventional computers
are becoming reality on parallel computers. These applications are an important
tool in searching for new phenomena in nature, worst-case predictions,
investigations of dangerous situations, etc. In a parallel computer, a
physical domain is evenly partitioned among $p$ nodes. The same program
in each node solves its own subdomain in parallel, some data from domain
boundaries are exchanged and the solution is ideally reached $p$ times
faster than on a single processor (*Fox et al., 1988*). The above
description follows the nature that is in foundation parallel. We have
developed a simple parallel computer simulation method that is reliable
enough for practical use in medicine (*Trobec and Slivnik, 1994*).
We have analysed a heat distribution in a human heart during a cardiac
surgery. The human heart is an irregularly shaped three dimensional object.
Any real measurements and experiments are dangerous to be performed on
humans, therefore, the computer simulation results will be of crucial importance
in investigating and improving practical methods in medicine.

P.Blaznik, D.J.Evans, J.Tasic. Parallel updating techniques in image
restoration. *Parallel Algorith. Appl.* **9** (1995)

D.M.Blough, A.Pelc. Diagnosis and repair in multiprocessor systems. *IEEE
Trans. Comput.* **42** (1993)

S.H.Bokhari. On the mapping problem. *IEEE Trans. Comput.* **30**
(1981)

M.Chen, K.Shing. Depth-first search approach for fault-tolerant routing
in hypercube multicomputers. *IEEE Trans. Parallel Distributed Syst.*
**1** (1990)

J.Choi, J.J.Dongarra, D.W.Walker. PB-BLAS: A set of parallel block basic
linear algebra subprograms. *Proc. Scalable High-Performance Comput.
Conf.* (IEEE, Los Alamitos, CA, 1994)

D.J.Evans, D.Caf. A sparse preconditioner for symmetric positive definite
banded circulant and Toeplitz linear systems. *Inter. J. Computer Math.*
**54** (1994)

G.C.Fox et al. *Solving Problems on Concurrent Processors.* (Prentice-Hall,
1988)

S.M.Hedetniemi, T.Hedetniemi, A.L.Liestman. A survey of gossiping and broadcasting
in communication networks. *Networks* **18** (1986)

A.Jain, N.S.Chaudhari. Efficient parallel recognition of contex-free languages.
*Parallel Comput.* **20** (1994)

I.Jerebic, B.Slivnik, R.Trobec. Library for programming on torus-connected
multiprocessors. M.Valero et al. (Eds.) *PACTA 92* (IOS Press, 1992)

I.Ozimek, J.Tasic. New approach to parallel systolic algorithm implementations
- RLSL example. *Signal Processing* **42** (1995)

S.L.Peyton Jones. *The Implementation of Functional Programming Languages.*
(Prentice-Hall, 1987)

B.Robic, P.Kolbezen, and J.Silc. Area optimization of dataflow-graph mappings.
*Parallel Comput.* **18** (1992)

B.Robic, J.Silc. Fault tolerant mapping onto VLSI/WSI processor arrays.
*Proc. EUROMICRO 94* (IEEE Computer Society Press, 1994)

A.K.Somani, V.K.Agarwal. Distributed diagnosis algorithm for regular interconnected
structures. *IEEE Trans. Comput.* **41** (1992)

R.Trobec, D.Janezic. Comparison of parallel Verlet and implicit Runge-Kutta
methods for molecular dynamics integration. *J. Chem. Inf. Comput. Sc.*
**35** (1995)

R.Trobec, I.Jerebic. Optimization of diagnostic examination. *Lect. Notes
Comput. Sc.* **854** (1994)

R.Trobec, B.Slivnik. Parallel heat transfer computation on generally shaped
bodies. *Proc. Int'l Workshop Parallel Numerics '94* (1994)

One of the most fundamental challenges computer science is facing with
today is need for developing algorithms, programming tools, and applications
to harness the vast computing power of parallel computers; moreover, a
theoretical basis for this new technology has to be improved. Since it
is expected that parallel computers will gain an ever increasing share
of the market, these developments will have important commercial consequences.
Research in the field of concurrent computer systems is a great challenge,
especially for the following reasons:

1.) There is constant and strong human desire to posses even stronger computer
systems. This can be met only by parallel architectures, where the set
of subtasks is carried out by independent processors communicating by message
passing.

2.) Computer systems consist of enormous number of electronic components
and malfunction of only one of them may result in a total system crash.
Since electronic components will never be totally reliable, the possibility
of failures must always be considered. Taking this fact into account, a
computer system should be developed, capable of self-diagnosis and fault-tolerance.

As claimed by many computer scientists dealing with distributed computation,
interconnection of more and more processors will be effective only if the
implemented mapping and message routing algorithms possess fault tolerance
properties. Having this in mind, the problems of process-to-processor mapping,
efficient selective broadcasting methods, and message routing with limited
link contention are of great theoretical and practical significance. To
implement a high speed custom design application in the area of digital
communications, a parallel algorithm must be implemented in silicon as
a VLSI circuit. To simplify the development of such special machines, it
is important to use software tools for automatic mapping of mathematical
(computing) algorithms to the structure. For a given algorithm (described
as a set of equations), the proposed tool will produce a description of
its parallel fine grain implementation. This description is suitable for
further processing with integrated circuit development packages. We expect
that the project will contribute to the theoretical foundation of parallel
computing especially in the areas of communication and algorithm mapping,
even in the presence of faulty units. We will promptly verify the theory
with several computationally intensive applications (molecular dynamics,
difference equations, and parallel parsing) in order to confirm our results.

Our project will contribute new results to theoretical foundations of parallel computing. Research will be focused on communications and algorithm mapping. New theoretical results are also expected in handling faults and increasing reliability of parallel systems. Theory will be verified with some computationally intensive applications. With new approaches to parallel computing, our recent results in molecular dynamics, solutions of partial differential equations and parallel syntax analysis will be improved. In the framework of the project, we expect a research group capable of studying and using future computer systems will be formed.

In practice, we expect faster implementations of computationally intensive applications and simulation of natural phenomena that could not be performed experimentally. By increasing the number of processors, the computation time will decrease. In molecular dynamics, only short time intervals of order several femtoseconds can be simulated today. The time interval could be widen by using parallel computers, thus researchers would be able to study matter in more detail. Another example is simulation of heart cooling process during the surgery. Often, to perform in-vivo experiments is dangerous or impossible, while simulation can give us the insight into physical behaviour of the cooling process without any harm.

The main goal of the project is the design and implementation of new approaches to solving numerical and non-numerical computationally intensive applications on parallel computers. This goal will be achieved by using analysis and simulation methods and by practical implementations. The basic parallel computer model used will be a system composed of processing units that communicate with messages and each unit is capable of executing a different program independently of other units (DMMP). In many applications, we will try to develop scaleable programs (by increasing the number of processors computing time is reduced) for a Single Program Multiple Data (SPMD) models. The following parallel computers will be used: a network of 32 transputers, clusters of workstations with PVM (Parallel Virtual Machine) and Convex Exemplar (32 processors). We will concentrate on parallel computer system support and we will analyse general approaches and tools for design and implementation of parallel programs. Process mapping and data distribution among processors are crucial for successful implementation of parallel algorithms. This is not a trivial task and it becomes even harder if there are faulty nodes in the system. In designing new mapping and implementation strategies, we will use our experience as well as new discoveries of similar research groups in the world. We will verify theoretical results in practical applications. Our attention will be focused especially in the following applications we have examined recently. First, in the field of molecular dynamics we will co-operate with the Chemical Institute in parallelisation of a new symplectic method. Second, we will extend the heat transfer simulation with other applications such as multichannel EKG signal processing, heart modelling, and simulation of blood flow through veins and arteries with Medical Centre Ljubljana. Third, we will work on non-numerical applications such as parallel syntax analysis.

- 1st year:
- Diagnostic examination in parallel systems.
- PVM installation for the development and verification of parallel algorithms.
- Definition of a theoretical model for mapping arbitrary parallel algorithms onto parallel architectures with the consideration of link contention.
- Analysis of procedures for automatic mapping of regular algorithms to fine grain array computing structures.
- Definition of an embedded LR(k) parser.
- Parallel numerical applications focused on parallel molecular dynamics integration.
- 2nd year:
- Testing of algorithms for multicasting and gossiping (also in faulty environment).
- Development of mapping procedure software tools for transforming a given algorithm to a form, suitable to be used for VLSI implementation.
- Development of parallel algorithms for solving large sparse systems of linear equations.
- Using lambda calculus and high-level functional programming languages for parallel programming.
- Derivation of a parallel LR(k) parsing algorithm and parallel LR(k) parsing table construction.
- Parallel numerical applications focused on parallel simulations.
- 3rd year:
- Measures for local fault-tolerance and diagnosability for massively parallel systems.
- Implementation of algorithms for message routing with limited link contention.
- Experimental mapping of equalisation system, computer simulations, realisability and suitability from the technological and commercial viewpoint.
- Implementation of parallel algorithms for solving large sparse systems of linear equations.
- Establishing the rate of parallelism achieved for some common types of grammars.
- Parallel non-numerical applications focused on the parallel parsing.

- Convex Exemplar (32 processors)
- Transputer mesh (32 processors)
- UNIX workstation network (6 x Sun, 2 x HP)
- PC network (5 x 486)