HPF, in conjunction with the Fortran 90 array features, gives the programmer several ways to express parallelism that the PGHPF compiler detects and parallelizes. The HPF INDEPENDENT directive lets the programmer specify the degree of dependence, and thus parallelism, between iterations of a DO loop or a FORALL. This chapter shows how to express parallelism in an HPF program and gives examples of the FORALL statement.
The PGHPF compiler treats Fortran array expressions as parallel expressions. If the arrays on the left-hand side of the expression are distributed, each node or processor of the parallel system executes its part of the computation. The compiler internally converts array constructs to an equivalent FORALL statement, which it then parallelizes by localizing array indices. For example, the following Fortran 90 array statement is parallelized by the compiler and produces the Fortran code shown:
Assuming the array Y is distributed, the code would be generated and run locally on each processor.
Note that calls to the PGHPF runtime library routines are found in the generated Fortran. One task of the runtime routines is to generate the bounds of the index space for the part of an array that resides on the local processor. For example, on a four-processor system, a call would generate different loop bounds depending on the processor on which the call is made and on the portion of the array stored on that processor, as shown below:
! Processor 1
do i1 = 1, 4
   y(i1) = x(i1) + 1
enddo
! Processor 2
do i1 = 5, 8
   y(i1) = x(i1) + 1
enddo
! Processor 3
do i1 = 9, 12
   y(i1) = x(i1) + 1
enddo
! Processor 4
do i1 = 13, 16
   y(i1) = x(i1) + 1
enddo
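The per-processor bounds above can be reproduced with a short serial program. This is only a sketch: it assumes a BLOCK distribution of 16 elements over 4 processors with a ceiling-based block size, and it does not call or imitate the internals of the PGHPF runtime. The program name and the variables blk, lo and hi are made up for illustration.

```fortran
program block_bounds_demo
  implicit none
  ! Assumed sizes, matching the 16-element, four-processor example.
  integer, parameter :: n = 16, nprocs = 4
  integer :: p, lo, hi, blk

  ! Block size for a BLOCK distribution: ceiling(n / nprocs).
  blk = (n + nprocs - 1) / nprocs

  do p = 1, nprocs
     ! First and last global index owned by processor p.
     lo = (p - 1) * blk + 1
     hi = min(p * blk, n)
     print '(a,i0,a,i0,a,i0)', 'processor ', p, ': i1 = ', lo, ', ', hi
  end do
end program block_bounds_demo
```

Compiled with any Fortran 90 compiler, the program prints the bounds 1, 4; 5, 8; 9, 12; and 13, 16 for processors 1 through 4.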
The WHERE statement is a Fortran 90 statement that conveys parallelism in a manner similar to array assignment described in the previous section. The compiler adds a conditional statement to mask the elements of the array's index space that are assigned (or not assigned) a particular value. For example, given that X and Y are distributed arrays, the following WHERE statement produces code similar to the Fortran output shown.
WHERE(X/=0) Y=X

call pghpf_localize_bounds(x$d1,1,1,16,1,i$l,i$u)
do i1 = i$l, i$u
   if (x(i1) .ne. 0) then
      y(i1) = x(i1)
   endif
enddo
The generated code is similar to the node code for an array expression, with the addition of the conditional within the DO loop.
The WHERE construct is a Fortran 90 construct that applies a conditional mask to a block of statements and, optionally, to an alternative block (ELSEWHERE) for the elements where the mask is false. Because of the way the WHERE construct is defined, the generated code uses a temporary array of logicals that holds the mask result for all local array elements. This logical array is computed before the WHERE block is executed.
WHERE(X/=0)
   X=0
END WHERE
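The role of the logical temporary can be seen in a serial sketch. The program below is an illustration written for this section, not compiler output; the array values and the names x2 and mask are assumptions. It compares a WHERE construct that has an ELSEWHERE branch with a loop that first evaluates the whole mask and then applies it.

```fortran
program where_mask_demo
  implicit none
  integer :: x(6), x2(6), i
  logical :: mask(6)

  x  = [3, 0, -2, 0, 5, 0]
  x2 = x

  ! Fortran 90 WHERE construct with an ELSEWHERE branch.
  where (x /= 0)
     x = 0
  elsewhere
     x = 1
  end where

  ! Serial equivalent: the mask is computed once, before any assignment,
  ! so later assignments to x2 cannot change which branch an element takes.
  mask = (x2 /= 0)
  do i = 1, size(x2)
     if (mask(i)) then
        x2(i) = 0
     else
        x2(i) = 1
     end if
  end do

  print '(6(i0,1x))', x
  print '(6(i0,1x))', x2
  if (any(x /= x2)) error stop 'WHERE and masked loop disagree'
end program where_mask_demo
```

If the mask were re-evaluated after the first assignment, every element would appear to be zero and the ELSEWHERE branch would be applied to the whole array, which is why the mask has to be saved in advance.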
The FORALL statement allows specification of a set of index values and an assignment expression utilizing the index values (or using a masked subset of the index values). The computation involving the index values for the assignment expression may be performed in an unspecified order on a scalar machine, or in parallel on a parallel system. For more details on the definition of FORALL, refer to The High Performance Fortran Handbook. The following example shows a simple masked FORALL.
FORALL(I=1:15, I>5) X(I)=Y(I)
Note that HPF intrinsic functions can be called from the expression part of a FORALL statement.
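As a hedged illustration of the masked FORALL above, the following serial program checks that the statement assigns the same elements as a DO loop guarded by the mask. The array contents and the name x2 are assumptions chosen for the example.

```fortran
program forall_mask_demo
  implicit none
  integer :: x(15), x2(15), y(15), i

  y  = [(10 * i, i = 1, 15)]
  x  = 0
  x2 = 0

  ! Masked FORALL: only index values with i > 5 are assigned.
  forall (i = 1:15, i > 5) x(i) = y(i)

  ! Serial loop with the same mask; valid here because the
  ! right-hand side does not depend on any element of x.
  do i = 1, 15
     if (i > 5) x2(i) = y(i)
  end do

  if (any(x /= x2)) error stop 'FORALL and masked loop disagree'
  print '(15(i0,1x))', x
end program forall_mask_demo
```

Elements 1 through 5 of X keep their initial value, and elements 6 through 15 receive the corresponding values of Y.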
The FORALL construct provides a parallel mechanism to assign values to the elements of arrays. The FORALL construct is interpreted essentially as a series of single statement FORALLs.
FORALL (I = 1:3)
   A(I) = D(I)
   B(I) = C(I) * 2
END FORALL
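The interpretation of a FORALL construct as a series of single-statement FORALLs can be checked with a small serial program. This is a sketch with made-up data; the names a1, b1, a2 and b2 are assumptions. The body deliberately reads A(I-1) so that the order of evaluation matters.

```fortran
program forall_construct_demo
  implicit none
  integer :: a1(4), b1(4), a2(4), b2(4), i

  a1 = [10, 20, 30, 40]
  a2 = a1
  b1 = 0
  b2 = 0

  ! FORALL construct: the first statement completes for every
  ! index value before the second statement starts.
  forall (i = 2:4)
     a1(i) = a1(i-1) + 1
     b1(i) = a1(i)
  end forall

  ! The same computation as two single-statement FORALLs.
  forall (i = 2:4) a2(i) = a2(i-1) + 1
  forall (i = 2:4) b2(i) = a2(i)

  if (any(a1 /= a2) .or. any(b1 /= b2)) error stop 'forms disagree'
  print '(4(i0,1x))', a1
  print '(4(i0,1x))', b1
end program forall_construct_demo
```

Both forms read the old values of A on the right-hand side of the first statement, so A becomes 10, 11, 21, 31. An ordinary DO loop with the same body would instead propagate each new value into the next iteration.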
Many of the HPF library routines and intrinsics provide functions that are executed in parallel. The Fortran 90 array-valued intrinsics also execute in parallel when possible. Refer to the PGHPF Reference Manual for a list of the HPF and Fortran 90 intrinsics and the HPF library routines.
Many of the standard 3F library routines are supported on platforms running PGHPF. An include file is available, named lib3f.h, that supports using these routines. Using the lib3f.h include file, programs can call standard 3F routines. The statement INCLUDE "lib3f.h" is required when using 3F procedures.
Programs that use getarg() or iargc() in PGHPF require this INCLUDE statement.
An INDEPENDENT DO loop is designated by the programmer by preceding it with the INDEPENDENT directive. For example:
The compiler accepts the above, or any of the standard HPF directive prefixes, as well as additional INDEPENDENT clauses (section 7.5.2 "INDEPENDENT Clauses").
No command-line switches are needed to invoke parallelization of INDEPENDENT loops. The -Mnoindependent switch is available to inhibit parallelization of all INDEPENDENT loops. The -Minfo command-line switch reports which loops have been parallelized.
At present, only INDEPENDENT loops with Fortran 77 constructs can be parallelized. In particular, the presence of array assignments, WHERE statements, FORALL statements, and ALLOCATE statements eliminates a loop from consideration for parallelization. INDEPENDENT loops can be nested (currently up to seven levels deep), but there can be at most one INDEPENDENT loop directly nested within another INDEPENDENT loop. For example, the following loop nest will not be parallelized because two INDEPENDENT loops are present at the same level.
!HPF$ INDEPENDENT
DO i = 1, n
!HPF$ INDEPENDENT
   DO 10 j = 1, m
10 A(j,i) = (j-1) * n + i
!HPF$ INDEPENDENT
   DO 20 k = m, 1, -1
20 B(k,i) = A(m-k+1,i)
ENDDO
This restriction has been added to ensure that a unique home array can be found for the entire INDEPENDENT loop nest (see section 7.5.2 "The On Home Clause" for a discussion of home arrays). For the same reason, trip counts and strides for non-outermost INDEPENDENT loops must be invariant with respect to the entire loop nest.
There are additional cases where INDEPENDENT loops are not parallelized or are only parallelized if an INDEPENDENT clause is used (refer to the following section for a description of INDEPENDENT clauses). To describe these cases, we must first define several terms. Each INDEPENDENT DO loop defines an INDEPENDENT index, which is the DO loop's index. In processing INDEPENDENT loops, the compiler will replicate those variables that do not contain subscripts that are functions of INDEPENDENT indices. As a degenerate case, all scalars will be replicated. Variables that the compiler replicates may originally be distributed. To perform parallelization, the compiler will create replicated copies. The resulting variables are compiler-replicated.
The compiler must ensure that values of compiler-replicated variables will be identical across all processors. If a compiler-replicated variable can be modified within an INDEPENDENT loop, and is used outside the loop, the loop will not be parallelized.
Modifications to compiler-replicated variables can be made through assignment statements or through procedure calls. Any modification to a compiler-replicated variable disables parallelization of the INDEPENDENT loop unless a NEW or REDUCTION clause is specified for the modified variable or the variable has no uses (refer to section 7.5.2). INTERFACE blocks that describe the INTENTs of procedure parameters help the compiler identify variables that are not modified across procedure calls (refer to section 7.5.3 "Procedure calling").
Uses of variables may be explicit, and can occur either after the INDEPENDENT loop nest, or within the same loop nest. For example, the following INDEPENDENT loop has a likely programming error because variable j is both read and written on different iterations, violating Bernstein's conditions (refer to page 193 of The High Performance Fortran Handbook).
!HPF$ INDEPENDENT
DO 10 i = 1, n
10 j = j + A(i)
Implicit uses of variables arise either because the variables exist in COMMON blocks, or because the variables occur as dummy parameters with INTENT INOUT or INTENT OUT.
Another reason that INDEPENDENT loops may not be parallelized is the presence of array aliases: there may be distinct array references, where at least one is a store, that refer to the same array locations on certain iterations. When the compiler must copy programmer-defined arrays to compiler-created arrays and array aliasing arises, the compiler cannot determine how to replace a given array reference. This problem can arise in the following INDEPENDENT loop.
!HPF$ INDEPENDENT
DO i = 1, n
   A(J1(I)) = 0
   A(J2(I)) = 1
ENDDO
If the first reference to array A is replaced with A$TMP1 and the second is replaced by A$TMP2, the compiler cannot determine which temporary array to copy back to array A in order to set its final value after the loop.
The full syntax of the PGHPF implementation of the INDEPENDENT directive is the following.
INDEPENDENT [, ON HOME ( home-array )] [, NEW ( var-list )] [, REDUCTION ( var-list )]
The following sections describe the NEW, ON HOME, and REDUCTION clauses.
The NEW clause specifies a list of compiler-replicated variable names (separated with commas). Assignment to a compiler-replicated variable violates Bernstein's conditions (the variable will be assigned values in multiple iterations), and will prevent parallelization (see Section 2). However, when the variable is present in a NEW clause, the loop is treated as if a new instance of the variable is created for each iteration of the INDEPENDENT loop, and Bernstein's conditions are discharged.
The following example demonstrates use of the NEW clause.
!HPF$ INDEPENDENT, NEW (S)
DO I = 1, n
   s = SQRT(A(i)**2 + B(i)**2)
   C(i) = s
ENDDO
After execution of the INDEPENDENT loop, values of compiler-replicated variables appearing in NEW clauses may be different across different processors, causing errors if these variables are used without intervening assignments.
The ON HOME clause specifies an array reference to be used to localize loop iterations for an INDEPENDENT loop nest. The ON HOME clause associates INDEPENDENT indices to dimensions of the home array.
The ON HOME clause is optional, and if not specified, the compiler will select a suitable home array from array references within the INDEPENDENT loop, or will create a home array (without actually allocating space for it).
Each INDEPENDENT index of a loop nest should be a subscript in a mapped dimension of the home array reference in the ON HOME clause. Valid distribution attributes are BLOCK and BLOCK(N). The home-array should reference valid array locations for all values of the INDEPENDENT indices. A subscript that is not an INDEPENDENT index can be a subscript triplet. The following example demonstrates use of the ON HOME clause.
DIMENSION A(0:n+1,1:m)
!HPF$ DISTRIBUTE A(BLOCK,*)
!HPF$ INDEPENDENT, ON HOME (A(i,:))
DO 1 i = 1, n
1 B(i) = i
The REDUCTION clause specifies a list of accumulator variable names (separated with commas). When an accumulator is compiler-replicated, its appearance in a reduction statement will violate Bernstein's conditions in the same way that other assignments to compiler-replicated variables violate these conditions. It is not correct to place accumulators in NEW clauses because their values must be accumulated across processors. The REDUCTION clause specifies that reduction statements do not violate Bernstein's conditions.
!HPF$ INDEPENDENT, REDUCTION (S)
DO I = 1, n
   s = s + A(I)
ENDDO
A reduction statement is an assignment statement in one of the forms below:
A = A + E
A = A * E
A = A .or. E
A = A .and. E
A = A .neqv. E
A = iand(A, E1, ..., En)
A = ior(A, E1, ..., En)
A = ieor(A, E1, ..., En)
A = min(A, E1, ..., En)
A = max(A, E1, ..., En)
In these reduction statements, A is an accumulator appearing in a REDUCTION clause, and expressions E, E1, ..., En do not contain A. The compiler produces statements to perform reductions locally on all processors, then combines all local accumulators globally.
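The two-step scheme described above (local reduction, then global combination) can be sketched serially. The program below only simulates four processors with an outer loop; it is not the code that PGHPF generates, and the names local_s, blk and nprocs, the array contents, and the even block partition are assumptions made for the illustration.

```fortran
program reduction_sketch
  implicit none
  ! Assumed sizes: 16 elements, 4 simulated processors, even blocks.
  integer, parameter :: n = 16, nprocs = 4, blk = n / nprocs
  integer :: a(n), local_s(nprocs), s, i, p

  a = [(i, i = 1, n)]

  ! Step 1: each simulated processor reduces its own block of A
  ! into a private accumulator.
  local_s = 0
  do p = 1, nprocs
     do i = (p - 1) * blk + 1, p * blk
        local_s(p) = local_s(p) + a(i)
     end do
  end do

  ! Step 2: the local accumulators are combined into the global result.
  s = 0
  do p = 1, nprocs
     s = s + local_s(p)
  end do

  print '(a,4(i0,1x))', 'local sums: ', local_s
  print '(a,i0)', 'global sum: ', s
  if (s /= sum(a)) error stop 'reduction result is wrong'
end program reduction_sketch
```

The local sums are 10, 26, 42 and 58, and the combined result is 136, the same value a sequential loop over all of A would produce. The scheme relies on the reduction operator being associative, which holds for the statement forms listed above when applied to integers and logicals.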
Calls to subroutines, functions, and most intrinsics can occur within INDEPENDENT loops. Because their implementations have side effects, intrinsics of class "Subroutine" (for example random_number()) prevent parallelization of INDEPENDENT loops. All called subroutines and functions must be PURE; by declaring a procedure PURE, the programmer specifies that no communication will be generated within the called program unit. If a called subroutine or function is not PURE, as described in an INTERFACE block, the compiler issues a warning message.
DO loops are marked with INDEPENDENT directives to inform the compiler that the loops can be executed in parallel, thereby attaining improved performance. To achieve correct program behavior in the presence of parallelism, the compiler must first analyze INDEPENDENT loops, and then may need to perform transformations on the loops. Some of these transformations may impede performance to the point that the loops execute more slowly than similar loops running sequentially on replicated data. While the compiler attempts to reduce the number of such transformations, the programmer has a large role to play in eliminating them. This section discusses strategies that programmers can use to improve the performance of INDEPENDENT loops.
Every INDEPENDENT loop nest is assigned a home array by the compiler. All array references in an INDEPENDENT loop nest are examined to see if they are aligned with the home array. Array references that are not aligned are replaced with new temporary arrays which are aligned with the home array. The time to allocate and deallocate new temporary arrays, as well as the time to copy data both to the temporary arrays and then back to the original arrays can be substantial, and is the primary cause of slowdown in performance of INDEPENDENT loops.
The compiler's -Minfo command-line switch informs programmers about the presence of temporary arrays for which performance overhead of array copying may be substantial. In this case, the compiler produces messages such as the following:
14, Independent loop parallelized
    expensive communication: all-to-all communication (copy_section)
18, expensive communication: all-to-all communication (copy_section)
The first "expensive communication" message is produced for the copy into a temporary array, and is associated with the first line of the INDEPENDENT loop nest (line 14 in the above message). The second "expensive communication" message is produced for the copy from the temporary array to the original array, and is associated with the last line of the INDEPENDENT loop nest (line 18 in the above message).
Small changes to programs can lead to substantial reductions in the number of temporary arrays created by the compiler. There are two primary strategies that can be followed:
In the following loop nest, no suitable home array can be found because its only array is distributed over just one dimension, while both loops in the nest are INDEPENDENT.
!HPF$ DISTRIBUTE (BLOCK,*) :: A
!HPF$ INDEPENDENT
DO 1 i = 1, m
!HPF$ INDEPENDENT
DO 1 j = 1, n
1 A(i,j) = (i-1) * n + j
For this loop nest, a temporary copy will be created for A. The temporary array will be distributed over both of its dimensions. This temporary can be eliminated in one of two ways:
PGHPF includes support for inlining procedures within loops. The compiler inlines procedure calls within loops when -Minline is used. Inlining is an essential step in allowing the compiler to determine whether INDEPENDENT loops that contain calls to non-PURE procedures can be parallelized.
When inlining is to occur within loops, use -Minline on the compilation command line. Inlining requires a preliminary extraction phase which saves compiler information about procedures. You can allow the compiler to create a temporary extraction, thus handling the inlining automatically, or you can create and maintain a directory of "extract" files using -Mextract. The compiler produces a message when it is creating its database of extracted procedures. For example, the following message indicates that the compiler is extracting the routine scatter_count.
pghpfc_ex: extracting scatter_count
All forms of the PGHPF -Minline switch only inline procedures within DO loops or loop nests. The full syntax of the -Minline switch is as follows:
For most users, supplying -Minline on the command line will suffice, for example,
% pghpf -Minline filename.hpf
This instructs the compiler to perform inlining within loops and to use a temporary directory for the extract phase. Note that while the extract phase extracts all possible procedures, the inliner will only inline procedures in a DO loop.
Including the lib:dir parameter assumes that an extract phase has been completed, and that the extracted procedures, if any, will be taken from directory dir (created using the -Mextract switch described below).
If the levels:n parameter is specified, inlining is repeated up to n times within any inlined loop so that calls up to n levels deep can be removed (the default value is 1).
If the name:fun parameter is provided, only the function or subroutine fun will be inlined within the loop. Multiple name parameters can be provided in order to inline multiple procedures.
The full syntax of the -Mextract PGHPF switch is as follows:
-Mextract[=name1,name2...] -o dir
This switch instructs the compiler to extract inlineable functions and subroutines using directory dir to store the extract files. Names of procedures to be extracted may be specified as parameters to the -Mextract switch. Note that while the extract phase extracts all possible procedures, the inliner will only inline procedures in a DO loop.
If you are using the PGHPF inlining capability and you want to keep track of which functions are inlined, use the -Minfo=inline switch. For example:
% pghpf -Minline -Minfo=inline test1.hpf
12, Inlining f
When the -Mextract switch is specified, an extract directory is created for holding extract files. An extract file is an ASCII file holding information created by the HPF compiler about a single procedure. The procedure's name is in the first line of the extract file. The extract directory contains a special table-of-contents file, named TOC. This ASCII file associates procedure names with extract file names.
The inliner first inserts the statements of a called procedure into the calling program unit at the point of the call. If a function has been inlined, the function call is replaced by the variable holding the function's return value. If there are name conflicts between variables local to the inlined procedure and variables within the calling program unit, the inlined procedure's local variables will be renamed.
Assignments of actual arguments to dummy parameters with INTENT IN or INTENT OUT are made at the beginning of the inlined statements. Actual arguments that are identifiers or subscript expressions associated with dummy parameters with INTENT INOUT are textually substituted for the dummy parameters wherever they occur within the inlined statements. If necessary, adjustments are made to array subscripts to accommodate array bounds that differ between the calling program unit and the called procedure.
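The effect of inlining a function call inside a loop can be pictured with a hand-written before-and-after pair. The sketch below is an assumption-laden illustration, not output from the PGHPF inliner: the function f, the variables f_x, f_y and f_result, and the data are all hypothetical, and the names the compiler actually generates will differ.

```fortran
program inline_sketch
  implicit none
  integer, parameter :: n = 4
  real :: a(n), b(n), c(n), c_inl(n)
  real :: f_x, f_y      ! copies of the INTENT(IN) actual arguments
  real :: f_result      ! variable holding the function's return value
  integer :: i

  a = [3.0, 6.0, 5.0, 8.0]
  b = [4.0, 8.0, 12.0, 15.0]

  ! Before inlining: the loop body contains a function call.
  do i = 1, n
     c(i) = f(a(i), b(i))
  end do

  ! After inlining (written by hand): the arguments are assigned first,
  ! the body of f follows, and the call is replaced by f_result.
  do i = 1, n
     f_x = a(i)
     f_y = b(i)
     f_result = sqrt(f_x**2 + f_y**2)
     c_inl(i) = f_result
  end do

  if (any(abs(c - c_inl) > 1.0e-6)) error stop 'inlined loop differs'
  print '(4f6.1)', c_inl

contains

  pure function f(x, y) result(r)
    real, intent(in) :: x, y
    real :: r
    r = sqrt(x**2 + y**2)
  end function f

end program inline_sketch
```

Once the call is gone, the second loop contains only Fortran 77 style scalar and array-element assignments. In an INDEPENDENT loop, scalars such as f_x, f_y and f_result would be compiler-replicated variables assigned on every iteration, so they would be candidates for a NEW clause.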
Extracting to a temporary extract directory and inlining.
% pghpf -Minline test1.hpf
Creating an extract library.
% pghpf -Mextract -o exlib test1.hpf
Inlining with extract library exlib.
% pghpf -Minline=lib:exlib test1.hpf