This chapter describes some of the techniques the HPF language and the PGHPF compiler use to distribute data among the processors of a parallel system, and notes some data distribution limitations in the current version of PGHPF. The PGHPF compiler distributes data and generates the necessary communication with the assistance of the PGHPF runtime library. For a given expression, the data communicated depends on the data type, the distribution and alignment specifications, and the data each computation must access. Data distribution is based on the data layout specified in the HPF program, the design of the parallel computer system, and the layout and number of processors used. Some mapping of data to processors is specified by the programmer; the rest is determined by the compiler.
The PGHPF compiler targets an SPMD (single program, multiple data) programming model. In the SPMD model, each processor executes the same program but operates on different data. This is implemented by loading the same program image onto each processor. Each processor then allocates and operates on its own local portion of distributed arrays, according to the distributions, array sizes, and number of processors as determined at runtime. Special attention is required to address the unique communication characteristics of many parallel systems. The PGHPF runtime library handles HPF data distribution tasks in a generic manner, so that HPF programs work on both distributed-memory and shared-memory systems (some parallel systems use shared memory, others use distributed memory, and there are also hybrid systems; lower levels of the PGHPF runtime library are customized for different parallel architectures). Figure 5-1, "Distributed Memory Parallel System", shows a conceptual view of a parallel system running an HPF program.
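The SPMD model can be illustrated with a small sketch (Python is used here purely for illustration; the function name `local_block` is hypothetical, not part of PGHPF): every "processor" runs the same code, and each one computes the bounds of its own local slice of a distributed array from its rank.

```python
def local_block(global_n, nprocs, rank):
    """Return the 1-based (start, end) global indices of the BLOCK slice
    owned by `rank`, or None if this rank owns no elements."""
    block = -(-global_n // nprocs)      # ceiling division: block size
    start = rank * block + 1
    end = min(start + block - 1, global_n)
    return (start, end) if start <= global_n else None

# Every "processor" runs the same program; only `rank` differs (SPMD).
# For a 100-element array over 4 processors, each rank owns 25 elements:
for rank in range(4):
    print(rank, local_block(100, 4, rank))
```

Running this prints `(1, 25)`, `(26, 50)`, `(51, 75)`, and `(76, 100)` for ranks 0 through 3, showing how identical code yields different local data on each processor.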
The PGHPF runtime library takes into account the communications to be performed and is optimized at two levels: the transport-independent level, where efficient communications are generated based on the type and pattern of data access for each computation, and the transport-dependent level, where the runtime library performs communication using a particular communications mechanism and the system's hardware. To generate efficient code, the compiler must manage data locality, parallelism, and communications. This chapter describes the principles of data distribution that HPF and the PGHPF compiler use; the HPF programmer needs to be aware of some details of data distribution to write efficient parallel code.
The data distribution phase of the PGHPF compiler has two important tasks that map data to a parallel system's memory and enable computations on that data:
The following sections describe these tasks in more detail.
The PGHPF compiler distributes data for several classes of variables:
The compiler uses HPF directives as a guide for distributing data that has a user-specified distribution. Data without distribution directives is replicated across all processors. Compiler-created temporaries are distributed according to their required usage.
Under the compiler's default distribution, all unspecified data (data without an explicit HPF distribution) is replicated among the available processors. For example, if the integer array BARRAY is used in a program and no HPF directives are supplied for distributing or aligning BARRAY, the default distribution is used and BARRAY is replicated. PROG1 and PROG2 in Example 5-1 show the default distribution. In PROG1, the compiler generates code using the default distribution because BARRAY is specified without a distribution; PROG2 shows an equivalent user-specified distribution in which BARRAY is also replicated.
! distribution directives not supplied - replication
PROG1
      INTEGER BARRAY(100)  ! default distribution

! directives supplied - replication
PROG2
      INTEGER BARRAY(100)
!HPF$ DISTRIBUTE BARRAY(*)
As described in Chapters 4 and 5 of The High Performance Fortran Handbook, PGHPF distributes data according to the supplied HPF directives. The ALIGN and DISTRIBUTE directives allow data to be distributed over processors in a variety of patterns. For example, the following code shows a distribution in which the computation is partitioned over the available processors. With the given ALIGN directive, this computation involves no communication.
      REAL X(15), Y(16)
!HPF$ DISTRIBUTE Y(BLOCK)
!HPF$ ALIGN X(I) WITH Y(I+1)
      FORALL(I=1:15) X(I)=Y(I+1)
The next example is very similar, but uses a CYCLIC distribution. A block distribution divides the data into contiguous chunks, as evenly as possible, over the available processors. A cyclic distribution divides the data among the processors in round-robin fashion, so that each processor gets one element from each group of n elements, where n is the number of processors.
Figure 5-2 shows block and cyclic distributions for a one-dimensional array. Depending on the computation performed, different data distributions may be advantageous. For this computation, a CYCLIC distribution would involve communication for each element computed.
      REAL X(15), Y(16)
!HPF$ DISTRIBUTE Y(CYCLIC)
!HPF$ ALIGN X(I) WITH Y(I+1)
      FORALL(I=1:15) X(I)=Y(I+1)
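The difference between the two layouts comes down to which processor owns each index. The following sketch (illustrative Python, with hypothetical helper names, not PGHPF internals) computes the owner of each element under BLOCK and CYCLIC distributions, matching Figure 5-2:

```python
def block_owner(i, n, nprocs):
    """Owning processor (0-based) of 1-based index i under BLOCK."""
    block = -(-n // nprocs)             # block size, rounded up
    return (i - 1) // block

def cyclic_owner(i, nprocs):
    """Owning processor of 1-based index i under CYCLIC: round-robin."""
    return (i - 1) % nprocs

# A 16-element array over 2 processors:
print([block_owner(i, 16, 2) for i in range(1, 17)])  # eight 0s, then eight 1s
print([cyclic_owner(i, 2) for i in range(1, 17)])     # alternating 0, 1, 0, 1, ...
```

Because CYCLIC owners alternate, Y(I+1) always lives on a different processor than Y(I), which is why the CYCLIC version of this assignment communicates for every element, while the BLOCK version communicates only at block boundaries.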
In the next example, a similar distribution represents a computation that would be partitioned over the available processors (for this example, call them processor one and processor two). Because of the alignment specified in these ALIGN and DISTRIBUTE directives, the computation involves communication: when I is 8, the value of Y(9) must be communicated so it can be assigned to X(8), because X(8) is stored on processor one and Y(9) is stored on processor two.
      REAL X(15), Y(16)
!HPF$ DISTRIBUTE Y(BLOCK)
!HPF$ ALIGN X(I) WITH Y(I)
      FORALL(I=1:15) X(I)=Y(I+1)
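This boundary effect can be checked mechanically. With X(I) aligned with Y(I), X(I) lives wherever Y(I) lives, so the assignment X(I) = Y(I+1) requires communication exactly where the owners of index I and index I+1 differ. A minimal sketch (illustrative Python; `block_owner` is a hypothetical helper, not a PGHPF routine):

```python
def block_owner(i, n, nprocs):
    """Owning processor (0-based) of 1-based index i under BLOCK."""
    block = -(-n // nprocs)             # block size, rounded up
    return (i - 1) // block

n_y, nprocs = 16, 2
# Communication is needed for each I where Y(I+1) is on a different
# processor than X(I) (which shares Y(I)'s home):
comm = [i for i in range(1, 16)
        if block_owner(i, n_y, nprocs) != block_owner(i + 1, n_y, nprocs)]
print(comm)
```

This prints `[8]`: only X(8) = Y(9) crosses the processor boundary, matching the description above.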
The following example shows an erroneous distribution that programmers should avoid. According to the HPF specification, the value of the dummy index variable (I in this example) must be valid for every subscript value possible for the aligned data (X in this example). When the ALIGN dummy index I ranges over all possible subscripts of X, 1 to 16 in this example, the alignment would reference Y(16+1), which is outside the bounds of Y. This produces a runtime error.
      REAL X(16), Y(16)
!HPF$ DISTRIBUTE Y(BLOCK)
!HPF$ ALIGN X(I) WITH Y(I+1)
      FORALL(I=1:15) X(I)=Y(I+1)
This code produces the following runtime error message:

0: set_aligment: invalid alignment
1: set_aligment: invalid alignment
For more details on different data distributions and examples showing more HPF data mapping directives, refer to Chapter 4 of The High Performance Fortran Handbook.
Allocatable arrays can be distributed in a manner similar to standard arrays (arrays without the ALLOCATABLE attribute). The directives that determine the distribution and alignment of an allocatable array are evaluated on entry to the allocatable array's scoping unit and are used throughout the scoping unit for creation of the array, although the arrays may later be realigned or redistributed.
When using allocatable arrays, it is important to keep in mind that the object being aligned with, the align target, must already exist when the aligned object is allocated. Thus, in the following example, the order of the ALLOCATE statements is correct; the opposite ordering, with B allocated before A, would produce a runtime alignment error.
      REAL, ALLOCATABLE:: A(:), B(:)
!HPF$ ALIGN B(I) WITH A(I)
!HPF$ DISTRIBUTE A(BLOCK)
      ALLOCATE (A(16))
      ALLOCATE (B(16))

The incorrect ordering produces the following runtime error message:

0: TEMPLATE: invalid align-target descriptor
The distribution of procedure arguments is described in detail in Chapter 5 of The High Performance Fortran Handbook. An important HPF principle is that the alignment an argument has when a procedure is called is restored when the procedure returns, regardless of the distribution of the argument within the procedure. Thus, the compiler may need to redistribute the variable on entry to the procedure and again on exit.
The PGHPF compiler creates a distribution for compiler-created temporary variables. Compiler-created temporaries are distributed corresponding to the required usage. The compiler creates temporaries for several reasons:
Distribution of temporaries and of user variables is performed identically; the use of temporaries is transparent from the HPF programmer's point of view (though the temporaries are visible in the intermediate code).
The algorithm PGHPF uses to determine the distribution of temporaries takes into account the statement in which the temporary is used. Temporaries are allocated just before the statement in which they are used and deallocated immediately after it. For example, an array assignment such as:
INTEGER, DIMENSION(100,100):: A,B,C,D
A = SUM(B) + MATMUL(C,D)
would generate intermediate code using a temporary array.
For this class of temporaries, distribution is based on the usage of the temporary. If a temporary is used as the argument to an intrinsic, the compiler first tries to determine the distribution from the other intrinsic arguments; failing that, it tries to assign a distribution based on the value assigned to the temporary; otherwise, the temporary is replicated across all processors.
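The fallback order just described can be sketched as a tiny decision function (illustrative Python only; the function and parameter names are hypothetical and do not correspond to PGHPF internals):

```python
def choose_temp_distribution(intrinsic_arg_dists=None, assigned_value_dist=None):
    """Pick a distribution for a compiler temporary, mirroring the
    fallback order in the text: match the other intrinsic arguments
    first, then the assigned value, else replicate."""
    if intrinsic_arg_dists:
        return intrinsic_arg_dists[0]   # follow the other intrinsic arguments
    if assigned_value_dist:
        return assigned_value_dist      # follow the value being assigned
    return "REPLICATED"                 # default: a copy on every processor

print(choose_temp_distribution(["BLOCK", "BLOCK"]))   # BLOCK
print(choose_temp_distribution(None, "CYCLIC"))       # CYCLIC
print(choose_temp_distribution())                     # REPLICATED
```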
Numerous factors, including array alignment, array distribution, array subsection usage, and argument usage, must be taken into account in determining a temporary's distribution. For example, consider the following:
A(1:m:3) = SUM(B(1:n:2,:) + C(:,1:n:4), dim = 2)
The section of A is passed directly to the SUM intrinsic to receive the result. A temporary is needed to compute the argument to SUM. The distribution of that temporary has two possibly conflicting goals: minimize communication in the B+C expression, or minimize communication in the SUM computation and in the assignment to A.
Computations are partitioned when PGHPF applies the owner-computes rule. Under this rule, a computation is partitioned according to the distribution of the assigned portion of the computation; localization is based on the left-hand side (lhs) of an array assignment statement.
The bounds of a FORALL statement are localized according to the array elements owned by the left-hand side.
For BLOCK partitioned dimensions, the loop bounds are adjusted to index the slice of data owned by the current processor.
For CYCLIC partitioning, two loops are required. The outer loop iterates over the cycles of the data, and the inner loop iterates over the data items in the cycle.
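The localization described above can be sketched for both cases (illustrative Python, hypothetical function names): for BLOCK, the loop bounds shrink to the rank's local slice; for CYCLIC(k), an outer loop steps over the cycles of the data while an inner loop visits this rank's k items within each cycle.

```python
def block_local_indices(n, nprocs, rank):
    """Global 1-based indices this rank iterates over under BLOCK:
    the loop bounds are simply adjusted to the local slice."""
    block = -(-n // nprocs)                 # block size, rounded up
    start = rank * block + 1
    end = min(start + block - 1, n)
    return list(range(start, end + 1))

def cyclic_local_indices(n, nprocs, rank, k=1):
    """Global indices under CYCLIC(k), using the two-loop structure:
    outer loop over cycles, inner loop over this rank's items per cycle."""
    out = []
    stride = nprocs * k                     # one full cycle spans nprocs*k elements
    for cycle_start in range(1, n + 1, stride):   # outer loop: the cycles
        for j in range(k):                        # inner loop: items in the cycle
            i = cycle_start + rank * k + j
            if i <= n:
                out.append(i)
    return out

print(block_local_indices(10, 3, 1))        # rank 1 of 3 owns indices 5..8
print(cyclic_local_indices(10, 3, 1))       # rank 1 of 3 owns 2, 5, 8
print(cyclic_local_indices(10, 2, 1, k=2))  # CYCLIC(2): rank 1 owns 3,4,7,8
```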
The PGHPF compiler has an Interprocedural Analysis (IPA) phase. With the option -Mipa, the compiler checks routines across subroutine boundaries and reports a number of errors and warnings that would not otherwise be detected. Optimizations are performed across procedure boundaries when possible.
The format of this option is:
The IPA phase performs the following types of analysis:
Once IPA checking is complete, the IPA lib directory contains a number of files. It is the programmer's task to remove these files; if they are not removed, they remain in the directory and are used in future compilations whenever the same lib directory is selected for the IPA phase (see the following subsections).
The IPA Phases
The compiler runs three phases to support Interprocedural Analysis. The command:
% pghpf -Mipa=lib source.hpf
is equivalent to running all three phases. The following subsections describe these phases.
phase 1: analysis
This phase analyzes each procedure and creates the following files. To run only this phase, use the -ca IPA option, for example: -Mipa=lib -ca source.hpf.
phase 2: propagation
This is the propagation phase, which analyzes the entire program. To run only this phase, use the -cp IPA option, for example: -Mipa=lib -cp.
This phase creates the following files:
phase 3: inheriting
This phase compiles the routines in source.hpf and creates no new files. To run only this phase, include the .ipa source file on the command line, for example:
% pghpf -Mipa=lib source.ipa
Data that is distributed with a !HPF$ DISTRIBUTE directive and initialized with a DATA statement is valid in PGHPF. A new option, -Mkeepstatic, keeps the intermediate file, which is normally removed by the compiler. This option has been added to the definition of -Mg, so that the intermediate file is retained when flags are set for debugging.
To support certain HPF features, including static data initialization, PGHPF now implements a prelink phase. If you are familiar with the output PGHPF 2.2 produced with the -v option, this change is very noticeable with newer PGHPF compilers, since a number of new prelink steps have been added to the compilation process. The prelinker collects the following information about the program being linked and generates a new subroutine (pghpf$static$init) to implement it:
PGHPF saves the necessary information about the routines in source.hpf in a file named source.d. The prelinker reads the appropriate .d files to generate the initialization subroutine pghpf$static$init. If -Mkeepstatic is set, this subroutine is written to the file pghpf.prelink.f and saved.
Additionally, with -Mnoprelink, distributed arrays in modules or common blocks will generate less efficient code.
A new option has been added to support variations of the prelink phase. The option -W9 passes switches to the prelink phase but not to the regular Fortran compilation.