Efficient parallel reduction to bidiagonal form

Bruno Lang

doi:10.1016/S0167-8191(99)00041-1

1999, Parallel Computing

https://doi.org/10.1016/S0167-8191(99)00041-1

18 pages

Abstract

Most methods for calculating the SVD (singular value decomposition) require the matrix to be bidiagonalized first. The blocked reduction of a general, dense matrix to bidiagonal form, as implemented in ScaLAPACK, does about one half of the operations with BLAS3. By subdividing the reduction into two stages, dense → banded and banded → bidiagonal, with cubic and quadratic arithmetic costs, respectively, we are able to carry out a much higher portion of the calculations in matrix-matrix multiplications. Thus, higher performance can be expected. This paper presents and compares three parallel techniques for reducing a full matrix to banded form. (The second reduction stage is described in another paper [B. Lang, Parallel Comput. 22 (1996) 1–18].) Numerical experiments on the Intel Paragon and IBM SP/1 distributed memory parallel computers demonstrate that the two-stage reduction approach can be significantly superior if only the singular values are required.

This work was partially funded by Deutsche Forschungsgemeinschaft, Geschäftszeichen Fr 755/6-1 and Fr 755/6-2.

0167-8191/99/$ - see front matter © 1999 Elsevier Science B.V. All rights reserved. PII: S0167-8191(99)00041-1

… parallel computers [1,2,4] and to novel accuracy issues, do most of the work on a full or triangular matrix.
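The first stage (dense → banded) can be illustrated with a small serial sketch. It is not the paper's parallel code: it assumes NumPy, forms each orthogonal factor explicitly instead of using a blocked WY representation, and the names `reduce_to_band` and `outside_band_norm` are hypothetical. It alternates a QR factorization of a block column (applied from the left) with an LQ factorization of a block row (applied from the right), which leaves an upper banded matrix with `b` superdiagonals and the same singular values.

```python
import numpy as np

def reduce_to_band(A, b):
    """Serial sketch: reduce an m x n matrix (m >= n) to upper banded form
    with b superdiagonals using orthogonal transformations on both sides."""
    B = np.array(A, dtype=float)
    m, n = B.shape
    for k in range(0, n, b):
        # Left transformation: QR of the current block column,
        # then update the whole trailing matrix with Q^T.
        Q, _ = np.linalg.qr(B[k:, k:k + b], mode="complete")
        B[k:, k:] = Q.T @ B[k:, k:]
        if k + b < n:
            # Right transformation: LQ of the current block row,
            # obtained from a QR factorization of its transpose.
            P, _ = np.linalg.qr(B[k:k + b, k + b:].T, mode="complete")
            B[k:, k + b:] = B[k:, k + b:] @ P
    return B

def outside_band_norm(B, b):
    """Largest absolute entry outside the band 0 <= j - i <= b."""
    band = np.triu(B) - np.triu(B, b + 1)
    return float(np.max(np.abs(B - band)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((12, 9))
    B = reduce_to_band(A, b=3)
    print("max entry outside band:", outside_band_norm(B, 3))
    print("singular values preserved:",
          np.allclose(np.linalg.svd(A, compute_uv=False),
                      np.linalg.svd(B, compute_uv=False)))
```

Because both updates are full matrix-matrix products, almost all of the arithmetic in this stage falls into BLAS3-type operations, which is the property the abstract relies on. Entries outside the band are zero only up to rounding error.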

Figures (11)

Approximate flop counts for the bidiagonalization methods discussed in this paper

The remainder of the paper is structured as follows. In Section 2 we briefly summarize the features of the direct bidiagonalization algorithm and describe the 2D block cyclic data layout, which is used in the direct method as well as in the reduction to banded form. Section 3 presents three different realizations of the first reduction stage. Since the final bidiagonalization of the banded matrix relies on a one-dimensional data layout, the data must be redistributed between the two stages, as discussed in Section 4. Section 5 presents the numerical results, and a summary of our findings concludes the paper.

Fig. 1. Distribution of a matrix with five block rows and four block columns on a 2 × 2 process grid. Top left: global view ("To which processor does each block A_ij of the matrix belong?"). Top right: column view ("Which blocks lie in each processor column P_·j?"). Bottom left: row view ("Which blocks lie in each processor row P_i·?"). Bottom right: local view ("Which blocks does each processor P_ij hold?").
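The four views in Fig. 1 all follow from one index rule. The sketch below is a minimal illustration, assuming 0-based block and process indices and that block (0, 0) is assigned to process (0, 0); the function names `owner` and `local_blocks` are hypothetical and are not ScaLAPACK routines.

```python
def owner(i, j, prows, pcols):
    """Grid coordinates of the process that holds block (i, j)
    in a 2D block cyclic layout (global view)."""
    return (i % prows, j % pcols)

def local_blocks(p, q, nbrows, nbcols, prows, pcols):
    """Blocks held by process (p, q) (local view): every prows-th block row
    starting at p, combined with every pcols-th block column starting at q."""
    return [(i, j)
            for i in range(p, nbrows, prows)
            for j in range(q, nbcols, pcols)]

if __name__ == "__main__":
    # Five block rows and four block columns on a 2 x 2 process grid.
    for i in range(5):
        print(" ".join("P%d%d" % owner(i, j, 2, 2) for j in range(4)))
    print(local_blocks(1, 1, 5, 4, 2, 2))
```

With five block rows on two process rows, process row 0 holds three block rows and process row 1 holds two, so the per-process load differs by at most one block row or column.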

Fig. 2. Third step of Algorithm 1 (standard). The transformations Q_3 and P_3 are represented by the corresponding Householder pairs.

Fig. 3. Snapshots of Algorithm 2 (PxLEFTRED). The routine updates the whole submatrix [A(QRfact) | A(QRupd) | …

3.2. Rank-2b updates

Fig. 4. Third step of Algorithm 3 (rk2b).

Algorithm 3. rk2b (reduction to b upper and b lower diagonals)

Fig. 5. Local (top) and global (bottom) phases in the "split" QR decomposition.

Fig. 6. Distribution of the banded matrix at the end of Stage I (top) and at the beginning of Stage II (bottom).

Fig. 7. Speedup of the two-stage reduction technique (standard variant) over the direct bidiagonalization algorithm PDGEBRD.

Detailed timings for the two-stage bidiagonalization

… cyclic data layout, which leads to a well-balanced work load and to favorable communication patterns, and therefore they scale almost perfectly. The good scaling characteristics of Stage II are discussed in [16]. As a result, the overall two-stage algorithm inherits this property.

Fig. 8. Comparison of scaling properties. Top: direct bidiagonalization. Middle: overall two-stage algorithm. Bottom: standard reduction to banded form (Stage I).

Comparison of the overall execution times for the two-stage algorithm (standard and splitfac variants, both run with b = 24) and the direct algorithm

References (17)

[1] M. Bečka, S. Robert, M. Vajteršic, Experiments with parallel one-sided and two-sided algorithms for SVD, in: P. Zinterhof, M. Vajteršic, A. Uhl (Eds.), Parallel Computation, Springer, Berlin, 1999, pp. 48–57.
[2] M. Bečka, M. Vajteršic, Block-Jacobi SVD algorithms for distributed memory systems I: Hypercubes and rings, Parallel Algorithms Appl. 13 (1999) 265–287.
[3] M.W. Berry, J.J. Dongarra, Y. Kim, A parallel algorithm for the reduction of a nonsymmetric matrix to block upper-Hessenberg form, Parallel Comput. 21 (8) (1995) 1184–1200.
[4] C. Bischof, Computing the singular value decomposition on a distributed system of vector processors, Parallel Comput. 11 (1989) 171–186.
[5] C. Bischof, C. Van Loan, The WY representation for products of Householder matrices, SIAM J. Sci. Stat. Comput. 8 (1) (1987) s2–s13.
[6] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, R.C. Whaley, ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance, Comput. Phys. Comm. 97 (1996) 1–15.
[7] J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker, R.C. Whaley, A proposal for a set of parallel basic linear algebra subprograms, in: J. Dongarra, K. Madsen, J. Waśniewski (Eds.), Applied Parallel Computing, Springer, Berlin, 1995, pp. 107–114.
[8] J. Choi, J.J. Dongarra, L.S. Ostrouchov, A.P. Petitet, D.W. Walker, R.C. Whaley, The design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines, Sci. Programming 5 (1996) 173–184.
[9] J. Choi, J.J. Dongarra, D.W. Walker, The design of a parallel dense linear algebra software library: Reduction to Hessenberg, tridiagonal, and bidiagonal form, Numer. Alg. 10 (1995) 379–399.
[10] J.J. Dongarra, J. Du Croz, S. Hammarling, I. Duff, A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Soft. 16 (1) (1990) 1–17.
[11] J.J. Dongarra, J. Du Croz, S. Hammarling, R.J. Hanson, An extended set of FORTRAN basic linear algebra subprograms, ACM Trans. Math. Soft. 14 (1) (1988) 1–17.
[12] J.J. Dongarra, R.C. Whaley, LAPACK Working Note 94: A user's guide to the BLACS v1.0, Technical Report CS-95-281, University of Tennessee at Knoxville, March 1995.
[13] B. Großer, Parallele zweistufige Verfahren zur Reduktion auf Bidiagonalgestalt, Diplomarbeit, Fachbereich Mathematik, Bergische Universität GH Wuppertal, 1997.
[14] M.R. Hestenes, Inversion of matrices by biorthogonalization and related results, SIAM J. Appl. Math. 6 (1958) 51–90.
[15] E.G. Kogbetliantz, Solution of linear equations by diagonalization of coefficients matrix, Quart. Appl. Math. 13 (1955) 123–132.
[16] B. Lang, Parallel reduction of banded matrices to bidiagonal form, Parallel Comput. 22 (1996) 1–18.
[17] R. Schreiber, C. Van Loan, A storage-efficient WY representation for products of Householder transformations, SIAM J. Sci. Stat. Comput. 10 (1) (1989) 53–57.
