Project J2-7648

Title of project:	Communications and mapping in parallel computers
Major field of science:	Technical Science
Research area:	Information and Communication Technologies
Duration of project:	3 years (1996-98)
Total work load in hours:	11.313
Principal investigator:	Dr. Roman Trobec
Project summary:	In the project, different types of communications among co-operating processors, and processes mapping on parallel computers will be investigated; in a general case, some faulty processors will also be considered. Our results are intended to be implemented in software for the future parallel operating systems as well as in custom designed VLSI circuits. To confirm research results, the theory will be verified with some important computationally intensive applications. New methods will be contributed to parallel integration in molecular dynamics, solutions of difference equations, and parallel parsing.

Scientific background and purpose of the research

Recent advances in technology have enabled the construction of computers whose architectures no longer conform to the traditional von Neumann model of computation. Parallelism is the dominating feature these machines have in common. Parallel computing is one of the most rapidly growing research fields in computer science and engineering. The earliest parallel machines were SIMD computers and vector computers which generally relied on internal parallelism and pipelining to improve data throughput. In the next generation, several independently operating processors were connected in order to co-operate on the solution of a common problem (MIMD). Those multiprocessors generally provided the programmer with shared memory, enabling code to be written in a similar style as for conventional von Neumann computers. However, architectural complexity imposes a limit on the size of computers being built on a pure shared memory basis, and more attention was paid to the design and construction of multiprocessor systems with memory distributed among independent processors (Distributed Memory Multiprocessors - DMMP). Among the first commercial systems that were built as DMMP were the Intel iPSC hypercube series, the Floating Point Systems T-Series and a number of Inmos transputer based systems. Recently, well known supercomputer companies such as Cray Research and Convex have been testing the parallel machines CRAY T3D and CONVEX Exemplar. Besides, workstation networks have become a suitable platform for parallel processing. The concept of such systems offers virtually unlimited design flexibility and scalability. We hope to be able to build computers with orders of magnitude more computing power than we have now. Moreover, those computers would provide high performance at a comparatively low cost. One of the most fundamental challenges computer science is facing with today is need for developing algorithms, programming tools, and applications to harness the vast computing power of parallel computers; moreover, a theoretical basis for this new technology has to be improved. Since it is expected that parallel computers will gain an ever increasing share of the market, these developments will have important commercial consequences. The basic project goals are the following. First, to widen the knowledge of parallel computing area, particularly communications and mapping, and systems with faulty nodes. Second, some important computationally intensive applications will be implemented in order to test our research results. Third, the project will establish minimal scientific potential capable to make researches into future computer systems.

Present state of knowledge in the proposed research field

Parallel computer systems represent a promising tool for increasing computational power of computers, hence the theoretical basis for parallel computation has to be matured. The proposed research project is concentrated on the following topics:

Mapping, Communications and Software Tools in Parallel Systems. Assigning processes to processors, data distribution, and communication among the processors is fundamental for all applications running on parallel computers. Efficient execution of parallel programs on an existing parallel computer requires proper mapping of programs onto the underlying machine architecture which can either be general-purpose (Bokhari, 1981; Robic et al., 1992) or special-purpose (Ozimek and Tasic, 1995). A practical model of a parallel computer has at each of its nodes a fixed size buffer allowing only limited link contention during message routing. Broadcasting (one-to-all), multicasting (one-to-more) and gossiping (all-to-all) are communication problems arising in distributed memory multicomputers (Hedetniemi et al., 1986). The need for selective broadcasting arises in many important applications, yet there is still no appropriate mechanism for such broadcasting currently available in parallel computers. Study of fault tolerant communication algorithms is also a very important topic of parallel computing (Chen and Shing, 1990; Robic and Silc, 1994). The main reason for delayed practical usage of parallel computers is the lack of software for developing parallel programs. Functional programming languages are among the most promising platforms for parallelisation, because they are not inherently sequential as conventional imperative language are (Peyton Jones, 1987). A tool for designing platform-independent parallel programs is, for example, PCL - parallel computation library (Jerebic et al., 1992). It enables writing portable programs for mesh and torus networks of processors, and developing programs on sequential computers.

Failures in Parallel Systems. A popular approach for handling faults in parallel systems is to restore regularity of a system, deteriorated by failures. Many authors have discussed hardware redundancy with spare units and switches combined with internal coding of computational data. These methods can be used for improving reliability, since they also cover intermittent faults. However, they cannot be used for dealing with fault localisation. Some authors have investigated multi-level diagnosis and configuration in the presence of faults. In papers dealing with distributed diagnosability, the assumption is made that each fault-free unit is capable of determining the diagnostic state of all other units in the system (Somani and Agarwal, 1992). In the case of massively parallel systems, this approach is not feasible due to a large number of components in the system and the amount of communication required. Also, if the number of faults is greater than some upper bound that depends on connectivity, a false diagnosis can be obtained. These facts represent a serious drawback for reusability of earlier developed methods in the area of parallel system-level diagnosability. In several papers, it has been shown that a local procedure could be used in massively parallel systems. Some recent works (e.g. Blough and Pelc, 1993) on the diagnosis of regular constant-degree systems have been done using probability models. Promising results for the strategy that uses only local information for system diagnosis have been reported. Our work (Trobec and Jerebic, 1994) is based on the assumption that random production and run-time failures are present. The run-time diagnostic is local and organised as a parallel procedure.

Parallel Numerical and Non-numerical Applications. Parallel Numerical Linear Algebra is an important tool for solving computationally intensive problems arising in various scientific areas. Talking about the development of parallel algorithms, we have in mind the design of scaleable routines that can be included into existing libraries and run on various High Performance Computing platforms. Recently, parallel implementations have been directed by the development of scaleable and platform-independent libraries based on Basic Linear Algebra Subroutines (BLAS) (Choi et al., 1994) and introduction of message-passing standard. This enables the design of parallel applications not only on supercomputers but also on clusters of workstations. Our main research interest has been in the area of parallel algorithms for solving large sparse systems of linear equations with the emphasis on the system structure (Evans and Caf, 1994) and use of updating techniques (Blaznik et al., 1995). Parallel Molecular Dynamics Integration. Demands for the extension of computer simulation methods to molecular systems of increasing size and complexity cannot be met solely by developing better hardware and consuming more computer time; since the needs for conceptual improvements of existing algorithms are becoming more and more obvious. Enlargement of the scope of existing methods of molecular dynamics of complex molecular systems and their parallel implementations on the ring topology for distributed memory computers is presented in (Trobec and Janezic, 1995). Parallel Parsing: The membership problem for formal languages is one of the basic problems in the theory of computation. It is well solved for sequential computers, but no significant progress has been made in the field of parallel computers yet. Syntax analysis of context-free languages represents an important subproblem because syntax analysers or parsers form the skeleton of the majority of compilers and other program processing tools (Jain and Chuadhari, 1994). The syntax analysis should be split to a number of independent parsing processes, which should be carried out by embedded LR(k) parsers. An algorithm for splitting LR(k) parsing tables and generating tables for embedded LR(k) parsers is still to be found. Parallel Heat Transfer Simulation: Computer simulation and other scientific applications that have until recently been beyond the capability of conventional computers are becoming reality on parallel computers. These applications are an important tool in searching for new phenomena in nature, worst-case predictions, investigations of dangerous situations, etc. In a parallel computer, a physical domain is evenly partitioned among $p$ nodes. The same program in each node solves its own subdomain in parallel, some data from domain boundaries are exchanged and the solution is ideally reached $p$ times faster than on a single processor (Fox et al., 1988). The above description follows the nature that is in foundation parallel. We have developed a simple parallel computer simulation method that is reliable enough for practical use in medicine (Trobec and Slivnik, 1994). We have analysed a heat distribution in a human heart during a cardiac surgery. The human heart is an irregularly shaped three dimensional object. Any real measurements and experiments are dangerous to be performed on humans, therefore, the computer simulation results will be of crucial importance in investigating and improving practical methods in medicine.

P.Blaznik, D.J.Evans, J.Tasic. Parallel updating techniques in image restoration. Parallel Algorith. Appl. 9 (1995)
D.M.Blough, A.Pelc. Diagnosis and repair in multiprocessor systems. IEEE Trans. Comput. 42 (1993)
S.H.Bokhari. On the mapping problem. IEEE Trans. Comput. 30 (1981)
M.Chen, K.Shing. Depth-first search approach for fault-tolerant routing in hypercube multicomputers. IEEE Trans. Parallel Distributed Syst. 1 (1990)
J.Choi, J.J.Dongarra, D.W.Walker. PB-BLAS: A set of parallel block basic linear algebra subprograms. Proc. Scalable High-Performance Comput. Conf. (IEEE, Los Alamitos, CA, 1994)
D.J.Evans, D.Caf. A sparse preconditioner for symmetric positive definite banded circulant and Toeplitz linear systems. Inter. J. Computer Math. 54 (1994)
G.C.Fox et al. Solving Problems on Concurrent Processors. (Prentice-Hall, 1988)
S.M.Hedetniemi, T.Hedetniemi, A.L.Liestman. A survey of gossiping and broadcasting in communication networks. Networks 18 (1986)
A.Jain, N.S.Chaudhari. Efficient parallel recognition of contex-free languages. Parallel Comput. 20 (1994)
I.Jerebic, B.Slivnik, R.Trobec. Library for programming on torus-connected multiprocessors. M.Valero et al. (Eds.) PACTA 92 (IOS Press, 1992)
I.Ozimek, J.Tasic. New approach to parallel systolic algorithm implementations - RLSL example. Signal Processing 42 (1995)
S.L.Peyton Jones. The Implementation of Functional Programming Languages. (Prentice-Hall, 1987)
B.Robic, P.Kolbezen, and J.Silc. Area optimization of dataflow-graph mappings. Parallel Comput. 18 (1992)
B.Robic, J.Silc. Fault tolerant mapping onto VLSI/WSI processor arrays. Proc. EUROMICRO 94 (IEEE Computer Society Press, 1994)
A.K.Somani, V.K.Agarwal. Distributed diagnosis algorithm for regular interconnected structures. IEEE Trans. Comput. 41 (1992)
R.Trobec, D.Janezic. Comparison of parallel Verlet and implicit Runge-Kutta methods for molecular dynamics integration. J. Chem. Inf. Comput. Sc. 35 (1995)
R.Trobec, I.Jerebic. Optimization of diagnostic examination. Lect. Notes Comput. Sc. 854 (1994)
R.Trobec, B.Slivnik. Parallel heat transfer computation on generally shaped bodies. Proc. Int'l Workshop Parallel Numerics '94 (1994)

Hypothesis presentation

One of the most fundamental challenges computer science is facing with today is need for developing algorithms, programming tools, and applications to harness the vast computing power of parallel computers; moreover, a theoretical basis for this new technology has to be improved. Since it is expected that parallel computers will gain an ever increasing share of the market, these developments will have important commercial consequences. Research in the field of concurrent computer systems is a great challenge, especially for the following reasons:
1.) There is constant and strong human desire to posses even stronger computer systems. This can be met only by parallel architectures, where the set of subtasks is carried out by independent processors communicating by message passing.
2.) Computer systems consist of enormous number of electronic components and malfunction of only one of them may result in a total system crash. Since electronic components will never be totally reliable, the possibility of failures must always be considered. Taking this fact into account, a computer system should be developed, capable of self-diagnosis and fault-tolerance.
As claimed by many computer scientists dealing with distributed computation, interconnection of more and more processors will be effective only if the implemented mapping and message routing algorithms possess fault tolerance properties. Having this in mind, the problems of process-to-processor mapping, efficient selective broadcasting methods, and message routing with limited link contention are of great theoretical and practical significance. To implement a high speed custom design application in the area of digital communications, a parallel algorithm must be implemented in silicon as a VLSI circuit. To simplify the development of such special machines, it is important to use software tools for automatic mapping of mathematical (computing) algorithms to the structure. For a given algorithm (described as a set of equations), the proposed tool will produce a description of its parallel fine grain implementation. This description is suitable for further processing with integrated circuit development packages. We expect that the project will contribute to the theoretical foundation of parallel computing especially in the areas of communication and algorithm mapping, even in the presence of faulty units. We will promptly verify the theory with several computationally intensive applications (molecular dynamics, difference equations, and parallel parsing) in order to confirm our results.

Significance of the results for science development

Our project will contribute new results to theoretical foundations of parallel computing. Research will be focused on communications and algorithm mapping. New theoretical results are also expected in handling faults and increasing reliability of parallel systems. Theory will be verified with some computationally intensive applications. With new approaches to parallel computing, our recent results in molecular dynamics, solutions of partial differential equations and parallel syntax analysis will be improved. In the framework of the project, we expect a research group capable of studying and using future computer systems will be formed.

Significance of the results for application

In practice, we expect faster implementations of computationally intensive applications and simulation of natural phenomena that could not be performed experimentally. By increasing the number of processors, the computation time will decrease. In molecular dynamics, only short time intervals of order several femtoseconds can be simulated today. The time interval could be widen by using parallel computers, thus researchers would be able to study matter in more detail. Another example is simulation of heart cooling process during the surgery. Often, to perform in-vivo experiments is dangerous or impossible, while simulation can give us the insight into physical behaviour of the cooling process without any harm.

Detailed description of experimental methods and procedures

The main goal of the project is the design and implementation of new approaches to solving numerical and non-numerical computationally intensive applications on parallel computers. This goal will be achieved by using analysis and simulation methods and by practical implementations. The basic parallel computer model used will be a system composed of processing units that communicate with messages and each unit is capable of executing a different program independently of other units (DMMP). In many applications, we will try to develop scaleable programs (by increasing the number of processors computing time is reduced) for a Single Program Multiple Data (SPMD) models. The following parallel computers will be used: a network of 32 transputers, clusters of workstations with PVM (Parallel Virtual Machine) and Convex Exemplar (32 processors). We will concentrate on parallel computer system support and we will analyse general approaches and tools for design and implementation of parallel programs. Process mapping and data distribution among processors are crucial for successful implementation of parallel algorithms. This is not a trivial task and it becomes even harder if there are faulty nodes in the system. In designing new mapping and implementation strategies, we will use our experience as well as new discoveries of similar research groups in the world. We will verify theoretical results in practical applications. Our attention will be focused especially in the following applications we have examined recently. First, in the field of molecular dynamics we will co-operate with the Chemical Institute in parallelisation of a new symplectic method. Second, we will extend the heat transfer simulation with other applications such as multichannel EKG signal processing, heart modelling, and simulation of blood flow through veins and arteries with Medical Centre Ljubljana. Third, we will work on non-numerical applications such as parallel syntax analysis.

Timescale

1st year:

Diagnostic examination in parallel systems.
PVM installation for the development and verification of parallel algorithms.
Definition of a theoretical model for mapping arbitrary parallel algorithms onto parallel architectures with the consideration of link contention.
Analysis of procedures for automatic mapping of regular algorithms to fine grain array computing structures.
Definition of an embedded LR(k) parser.
Parallel numerical applications focused on parallel molecular dynamics integration.

2nd year:

Testing of algorithms for multicasting and gossiping (also in faulty environment).
Development of mapping procedure software tools for transforming a given algorithm to a form, suitable to be used for VLSI implementation.
Development of parallel algorithms for solving large sparse systems of linear equations.
Using lambda calculus and high-level functional programming languages for parallel programming.
Derivation of a parallel LR(k) parsing algorithm and parallel LR(k) parsing table construction.
Parallel numerical applications focused on parallel simulations.

3rd year:

Measures for local fault-tolerance and diagnosability for massively parallel systems.
Implementation of algorithms for message routing with limited link contention.
Experimental mapping of equalisation system, computer simulations, realisability and suitability from the technological and commercial viewpoint.
Implementation of parallel algorithms for solving large sparse systems of linear equations.
Establishing the rate of parallelism achieved for some common types of grammars.
Parallel non-numerical applications focused on the parallel parsing.

Research equipment

Convex Exemplar (32 processors)
Transputer mesh (32 processors)
UNIX workstation network (6 x Sun, 2 x HP)
PC network (5 x 486)