Papers by José R. Herrero

arXiv (Cornell University), Apr 27, 2023
This paper advocates for an intertwined design of the dense linear algebra software stack that breaks down the strict barriers between the high-level, blocked algorithms in LAPACK (Linear Algebra PACKage) and the low-level, architecture-dependent kernels in BLAS (Basic Linear Algebra Subprograms). Specifically, we propose customizing the GEMM (general matrix multiplication) kernel, which is invoked from the blocked algorithms for relevant matrix factorizations in LAPACK, to improve performance on modern multicore processors with hierarchical cache memories. To achieve this, we leverage an analytical model to dynamically adapt the cache configuration parameters of the GEMM to the shape of the matrix operands. Additionally, we accommodate a flexible development of architecture-specific micro-kernels that allow us to further improve the utilization of the cache hierarchy. Our experiments on two platforms, equipped with ARM (NVIDIA Carmel, Neon) and x86 (AMD EPYC, AVX2) multicore processors, demonstrate the benefits of this approach in terms of better cache utilization and, in general, higher performance. However, they also reveal the delicate balance between optimizing for multi-threaded parallelism versus cache usage. CCS Concepts: • Mathematics of computing → Mathematical software performance; • Computer systems organization → Multicore architectures.
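
A minimal sketch (not the paper's implementation) of the central idea follows: a blocked GEMM whose cache blocking parameters are clamped to the shape of its operands. The routine name, the default blocking values, and the scalar macro-kernel are illustrative assumptions; the paper derives the parameters from an analytical cache model and relies on packed, architecture-specific micro-kernels instead of the plain triple loop shown here.

```c
#include <stddef.h>

/* Illustrative defaults; a real library would derive these from an
 * analytical model of the cache hierarchy. */
enum { MC_DEF = 256, NC_DEF = 4096, KC_DEF = 384 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C(m x n) += A(m x k) * B(k x n), all column-major with leading dimensions. */
void gemm_blocked(size_t m, size_t n, size_t k,
                  const double *A, size_t lda,
                  const double *B, size_t ldb,
                  double *C, size_t ldc)
{
    /* Clamp the blocking parameters to the actual operand shape, so that the
     * tall-and-skinny or short-and-wide panels typical of factorizations do
     * not waste cache capacity. */
    size_t mc = min_sz(MC_DEF, m);
    size_t nc = min_sz(NC_DEF, n);
    size_t kc = min_sz(KC_DEF, k);

    for (size_t jc = 0; jc < n; jc += nc) {
        size_t nb = min_sz(nc, n - jc);
        for (size_t pc = 0; pc < k; pc += kc) {
            size_t kb = min_sz(kc, k - pc);
            for (size_t ic = 0; ic < m; ic += mc) {
                size_t mb = min_sz(mc, m - ic);
                /* "Macro-kernel": a real implementation would pack the block
                 * operands and call an architecture-specific micro-kernel. */
                for (size_t j = 0; j < nb; ++j)
                    for (size_t p = 0; p < kb; ++p) {
                        double b = B[(pc + p) + (jc + j) * ldb];
                        for (size_t i = 0; i < mb; ++i)
                            C[(ic + i) + (jc + j) * ldc] +=
                                A[(ic + i) + (pc + p) * lda] * b;
                    }
            }
        }
    }
}
```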

Matrix computations lie at the heart of most scientific computational tasks. The solution of linear systems of equations is a very frequent operation in many fields in science, engineering, surveying, physics and others. Other matrix operations occur frequently in many other fields such as pattern recognition and classification, or multimedia applications. Therefore, it is important to perform matrix operations efficiently. The work in this thesis focuses on the efficient execution on commodity processors of matrix operations which arise frequently in different fields. We study some important operations which appear in the solution of real-world problems: some sparse and dense linear algebra codes and a classification algorithm. In particular, we focus our attention on the efficient execution of the following operations: sparse Cholesky factorization; dense matrix multiplication; dense Cholesky factorization; and Nearest Neighbor Classification. A lot of research has been conducted on...

Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores, 2017
Asymmetric multicore processors (AMPs), such as those present in ARM big.LITTLE technology, have been proposed as a means to address the end of Dennard power scaling. The idea of these architectures is to activate only the type (and number) of cores that satisfy the quality of service requested by the application(s) in execution while delivering high energy efficiency. For dense linear algebra problems, though, performance is of paramount importance, asking for an efficient use of all computational resources in the AMP. In response to this, we investigate how to exploit the asymmetric cores of an ARMv7 big.LITTLE AMP in order to attain high performance for the reduction to tridiagonal form, an essential step towards the solution of dense symmetric eigenvalue problems. The routine for this purpose in LAPACK is especially challenging, since half of its floating-point arithmetic operations (flops) are cast in terms of compute-bound kernels while the remaining half correspond to memory-bound kernels. To deal with this scenario: 1) we leverage a tuned implementation of the compute-bound kernels for AMPs; 2) we develop and parallelize new architecture-aware micro-kernels for the memory-bound kernels; and 3) we carefully adjust the type and number of cores to use at each step of the reduction procedure.
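
A sketch of the third ingredient only, under the assumption of a simple classification of kernels into compute- and memory-bound; the function names and the policy itself are illustrative, not the paper's exact heuristics.

```c
#include <omp.h>

enum kernel_kind { COMPUTE_BOUND, MEMORY_BOUND };

/* Pick a core budget per step: compute-bound kernels (e.g., the symmetric
 * rank-2k update of the tridiagonal reduction) can exploit every core, while
 * memory-bound kernels (e.g., the symmetric matrix-vector product) tend to
 * saturate memory bandwidth with the big cores alone. */
static int threads_for(enum kernel_kind kind, int n_big, int n_little)
{
    return (kind == COMPUTE_BOUND) ? (n_big + n_little) : n_big;
}

void run_step(enum kernel_kind kind, int n_big, int n_little,
              void (*kernel)(void *), void *args)
{
    /* The tuned kernel is assumed to open its own parallel region, so fixing
     * the thread count here selects how many cores it will use; pinning the
     * threads to the desired cluster would be handled via OMP_PLACES /
     * OMP_PROC_BIND or a similar affinity mechanism. */
    omp_set_num_threads(threads_for(kind, n_big, n_little));
    kernel(args);
}
```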

Proceedings of the 12th International Workshop on Programming Models and Applications for Multicores and Manycores, 2021
We take advantage of the new tasking features in OpenMP to propose advanced task-parallel algorithms for the inversion of dense matrices via Gauss-Jordan elimination. Our algorithms perform a partitioning of the matrix operand into two levels of tasks: the matrix is first divided vertically, by column blocks (or panels), in order to accommodate the standard partial pivoting scheme that ensures the numerical stability of the method. In addition, depending on the particular kernel to be applied, each panel is partitioned either horizontally by row blocks (tiles) or vertically by µ-panels (of columns), in order to extract sufficient task parallelism to feed a many-threaded general-purpose processor (CPU). The results of the experimental evaluation show the performance benefits of the advanced tasking algorithms on an Intel Xeon Gold processor with 20 cores.
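
A structural sketch of the two-level tasking (not the paper's code): the kernels are hypothetical, a sentinel array carries the panel-level dependencies, and a taskloop splits each panel update into row tiles. Default data-sharing rules make the loop indices firstprivate inside the tasks.

```c
#include <omp.h>
#include <stdlib.h>

/* Hypothetical kernels: panel factorization with partial pivoting, and the
 * update of one row tile of another panel. Signatures are assumptions. */
void gje_panel(double *A, int n, int k, int b, int *piv);
void gje_update_tile(double *A, int n, int k, int b, int i, int j, const int *piv);

void gje_invert_tasks(double *A, int n, int b, int *piv)
{
    int nb = (n + b - 1) / b;                 /* panels per dimension      */
    char *panel = calloc((size_t)nb, 1);      /* pure dependency sentinels */

    #pragma omp parallel
    #pragma omp single
    {
        for (int k = 0; k < nb; ++k) {
            /* First level: factor column panel k (partial pivoting). */
            #pragma omp task depend(inout: panel[k])
            gje_panel(A, n, k, b, piv);

            /* Second level: update every other panel; each panel is further
             * split into row tiles via taskloop to expose more parallelism. */
            for (int j = 0; j < nb; ++j) {
                if (j == k) continue;
                #pragma omp task depend(in: panel[k]) depend(inout: panel[j])
                {
                    #pragma omp taskloop
                    for (int i = 0; i < nb; ++i)
                        gje_update_tile(A, n, k, b, i, j, piv);
                }
            }
        }
    }
    free(panel);
}
```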

Computer Physics Communications, 2021
The calculation of overlaps between many-electron wave functions at different nuclear geometries during nonadiabatic dynamics simulations requires the evaluation of a large number of determinants of matrices that differ only in a few rows/columns. While this calculation is fast for small systems, its cost grows faster than that of the alternative electronic structure calculation used to obtain the wave functions. For wave functions that can be written as a CIS expansion, all determinants can be computed using the set of level-2 minors of the reference matrix. However, this is still a costly computation for large systems. In this paper, we provide an algorithm for efficiently calculating all level-2 minors of a matrix by reutilizing and updating the LU factorization for the determinants of the minors. This approach results in a parallel version of the algorithm that is up to an order of magnitude faster than the current best parallel implementation. The algorithm thus allows the computation of exact wave function overlaps for relatively large systems, with a high density of states, at virtually no cost compared with the electronic structure calculations. Furthermore, the new algorithm opens the path to further investigations into efficient computation of exact wave function overlaps for complex wave functions such as MR-CIS and MR-CISD.
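
The starting point of the approach is the standard fact that a determinant can be read off an LU factorization; the sketch below shows that baseline computation with LAPACKE, not the paper's minor-update machinery. The function name is illustrative.

```c
#include <lapacke.h>
#include <stdlib.h>

/* Determinant of an n x n column-major matrix; A is overwritten by its LU
 * factors. det(A) = (+/-1 from the row swaps) * product of U's diagonal. */
double det_from_lu(double *A, int n)
{
    lapack_int *ipiv = malloc(sizeof(lapack_int) * (size_t)n);
    lapack_int info = LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, A, n, ipiv);
    if (info != 0) {             /* info > 0: singular; < 0: bad argument */
        free(ipiv);
        return 0.0;              /* sketch-level simplification */
    }

    double det = 1.0;
    for (int i = 0; i < n; ++i) {
        det *= A[i + i * n];               /* product of U's diagonal      */
        if (ipiv[i] != i + 1) det = -det;  /* each row swap flips the sign */
    }
    free(ipiv);
    return det;
}
```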

Cluster Computing, 2019
We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using OpenMP, that departs from the legacy (or conventional) solution, which simply extracts concurrency from a multithreaded version of BLAS. This approach is also different from the more sophisticated runtime-assisted implementations, which decompose the operation into tasks and identify dependencies via directives and runtime support. Instead, our strategy attains high performance by explicitly embedding a static look-ahead technique into the DMF code, in order to overcome the performance bottleneck of the panel factorization, and by realizing the trailing update via a cache-aware multi-threaded implementation of the BLAS. Although the parallel algorithms are specified with a high level of abstraction, the actual implementation can be easily derived from them, paving the road to a high-performance implementation of a considerable fraction of LAPACK functionality on any multicore platform with an OpenMP-like runtime.
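
A structural sketch of the static look-ahead idea with hypothetical kernels; this is not the paper's code, and the nested, cache-aware multi-threaded BLAS that the paper uses inside the trailing update is only hinted at in a comment.

```c
/* factor_panel: factorize block column k.
 * update_block: update block columns [j_begin, j_end) with panel k
 *               (an empty range is a no-op). Both are assumed kernels. */
void factor_panel(double *A, int n, int k, int b);
void update_block(double *A, int n, int k, int b, int j_begin, int j_end);

void factorization_with_lookahead(double *A, int n, int b)
{
    int nb = (n + b - 1) / b;
    factor_panel(A, n, 0, b);                 /* first panel, no overlap */

    for (int k = 0; k < nb - 1; ++k) {
        #pragma omp parallel sections
        {
            #pragma omp section
            {   /* Look-ahead branch: bring panel k+1 up to date with panel k
                 * and factorize it, hiding the panel bottleneck. */
                update_block(A, n, k, b, k + 1, k + 2);
                factor_panel(A, n, k + 1, b);
            }
            #pragma omp section
            {   /* Remainder branch: update the rest of the trailing matrix;
                 * in the paper this runs on a cache-aware multi-threaded BLAS
                 * so it uses several cores on its own. */
                update_block(A, n, k, b, k + 2, nb);
            }
        }
    }
}
```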

IEEE Access, 2019
We propose two novel techniques for overcoming the load imbalance encountered when implementing so-called look-ahead mechanisms in relevant dense matrix factorizations for the solution of linear systems. Both techniques target the scenario where two thread teams are created/activated during the factorization, with each team in charge of performing an independent task/branch of execution. The first technique promotes worker sharing (WS) between the two tasks, allowing the threads of the task that completes first to be reallocated for use by the costlier task. The second technique allows a fast task to alert the slower task of its completion, enforcing the early termination (ET) of the second task and a smooth transition of the factorization procedure into the next iteration. The two mechanisms are instantiated via a new malleable thread-level implementation of the Basic Linear Algebra Subprograms (BLAS), and their benefits are illustrated via an implementation of the LU factorization with partial pivoting enhanced with look-ahead. Concretely, our experimental results on an Intel Xeon system with 12 cores show the benefits of combining WS+ET, reporting competitive performance in comparison with a task-parallel runtime-based solution. Index Terms: solution of linear systems, multi-threading, workload balancing, thread malleability, basic linear algebra subprograms (BLAS), linear algebra package (LAPACK).
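
A sketch of the early-termination (ET) mechanism alone, with assumed kernel names: the trailing update polls a completion flag between block columns and returns the point where it stopped, so the caller can fold the leftover columns into the next iteration's update. The worker-sharing (WS) side and the malleable BLAS are not shown.

```c
#include <stdatomic.h>

/* Assumed kernel: update one block column j of the trailing matrix with panel k. */
void update_block_column(double *A, int n, int k, int b, int j);

/* Returns the first block column that was NOT updated; the caller merges the
 * remaining columns into the next iteration's trailing update. */
int trailing_update_et(double *A, int n, int k, int b,
                       int j_begin, int j_end,
                       atomic_int *panel_done)
{
    int j;
    for (j = j_begin; j < j_end; ++j) {
        if (atomic_load(panel_done))   /* fast branch finished: stop early */
            break;
        update_block_column(A, n, k, b, j);
    }
    return j;
}
```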

Numerical Algorithms, 2018
We address the reduction to compact band forms, via unitary similarity transformations, for the solution of symmetric eigenvalue problems and the computation of the singular value decomposition (SVD). Concretely, in the first case we revisit the reduction to symmetric band form while, for the second case, we propose a similar alternative, which transforms the original matrix to (unsymmetric) band form, replacing the conventional reduction method that produces a triangular-band output. In both cases, we describe algorithmic variants of the standard Level-3 BLAS-based procedures, enhanced with look-ahead, to overcome the performance bottleneck imposed by the panel factorization. Furthermore, our solutions employ an algorithmic block size that differs from the target bandwidth, illustrating the important performance benefits of this decision. Finally, we show that our alternative compact band form for the SVD is key to introducing an effective look-ahead strategy into the corresponding reduction procedure.

Parallel Computing, 2018
We investigate how to leverage the heterogeneous resources of an Asymmetric Multicore Processor (AMP) in order to deliver high performance in the reduction to condensed forms for the solution of dense eigenvalue and singular-value problems. The routines that realize this type of two-sided orthogonal reductions (TSOR) in LAPACK are especially challenging, since a significant fraction of their floating-point operations are cast in terms of memory-bound kernels while the remaining part corresponds to efficient compute-bound kernels. To deal with this scenario: 1) we leverage implementations of memory-bound and compute-bound kernels specifically tuned for AMPs; 2) we select the algorithmic block size for the TSOR routines via a practical model; and 3) we adjust the type and number of cores to use at each step of the reduction. Our experiments validate the model and assess the performance of our asymmetry-aware TSOR routines, using an ARMv7 big.LITTLE AMP, for three key operations: the reduction to tridiagonal form for symmetric eigenvalue problems, the reduction to Hessenberg form for non-symmetric eigenvalue problems, and the reduction to bidiagonal form for singular-value problems.

Journal of Computational Science, 2016
Dense linear algebra libraries, such as BLAS and LAPACK, provide a relevant collection of numerical tools for many scientific and engineering applications. While there exist high-performance implementations of the BLAS (and LAPACK) functionality for many current multi-threaded architectures, the adaptation of these libraries for asymmetric multicore processors (AMPs) is still pending. In this paper we address this challenge by developing an asymmetry-aware implementation of the BLAS, based on the BLIS framework, and tailored for AMPs equipped with two types of cores: fast/power-hungry versus slow/energy-efficient. For this purpose, we integrate coarse-grain and fine-grain parallelization strategies into the library routines which, respectively, dynamically distribute the workload between the two core types and statically repartition this work among the cores of the same type. Our results on an ARM big.LITTLE processor embedded in the Exynos 5422 SoC, using the asymmetry-aware version of the BLAS and a plain migration of the legacy version of LAPACK, experimentally assess the benefits, limitations, and potential of this approach.
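
For intuition only, a sketch of how a loop range could be split between the two clusters in proportion to their relative throughput; the paper's coarse-grain strategy is dynamic rather than this static, ratio-based split, and the speedup parameter here is an assumed, measured value. Each cluster would then repartition its share statically among its own cores.

```c
typedef struct { int m_begin, m_end; } range_t;

/* Split the m-dimension of an operation between the big and LITTLE clusters
 * according to their aggregate throughput (cores * per-core speedup). */
void split_big_little(int m, double big_speedup,   /* e.g., a measured ratio */
                      int n_big, int n_little,
                      range_t *big, range_t *little)
{
    double w_big = n_big * big_speedup;
    double w_lit = (double)n_little;
    int m_big = (int)(m * (w_big / (w_big + w_lit)) + 0.5);

    big->m_begin = 0;        big->m_end    = m_big;
    little->m_begin = m_big; little->m_end = m;
}
```
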
The sparse hypermatrix storage scheme produces a recursive 2D partitioning of a sparse matrix. Data subblocks are stored as dense matrices. Since we are dealing with sparse matrices, some zeros can be stored in those dense blocks. The overhead introduced by the operations on zeros can become large and considerably degrade performance. In this paper, we present several techniques for reducing the operations on zeros in a sparse hypermatrix Cholesky factorization. By associating a bit with each column within a data submatrix we create a bit vector. We can avoid computations when the bitwise AND of the corresponding bit vectors is null. By keeping information about the actual space within a data submatrix which stores non-zeros (the dense window) we can reduce both storage and computation.
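
A sketch of the bit-vector test under an assumed data layout (the structure and field names are illustrative): for an update of the form C −= A·Bᵀ, which is the typical block operation in the sparse Cholesky factorization, the product contributes nothing unless A and B share a nonzero column.

```c
#include <stdint.h>

typedef struct {
    double  *val;        /* dense storage; may contain stored zeros */
    uint64_t col_bits;   /* bit j set iff column j holds a nonzero  */
} subblock_t;

/* The update C -= A * B^T has a nonzero term only for columns that are
 * nonzero in both A and B, so a null bitwise AND lets us skip the block. */
static int update_needed(const subblock_t *A, const subblock_t *B)
{
    return (A->col_bits & B->col_bits) != 0;
}
```
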
The sparse Cholesky factorization of some large matrices can require a two-dimensional partitioning of the matrix. The sparse hypermatrix storage scheme produces a recursive 2D partitioning of a sparse matrix. The subblocks are stored as dense matrices so BLAS3 routines can be used. However, since we are dealing with sparse matrices, some zeros may be stored in those dense blocks. The overhead introduced by the operations on zeros can become large and considerably degrade performance. In this paper we present an improvement to our sequential in-core implementation of a sparse Cholesky factorization based on a hypermatrix storage structure. We compare its performance with several codes and analyze the results.
The use of highly optimized inner kernels is of paramount importance for obtaining efficient numerical algorithms. Often, such kernels are created by hand. In this paper, however, we present an alternative way to produce efficient matrix multiplication kernels based on a set of simple codes which can be parameterized at compilation time. Using the resulting kernels we have been able to produce high-performance sparse and dense linear algebra codes on a variety of platforms.
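
A sketch of what such a parameterized code could look like (the macro names and default values are illustrative): the block dimensions are fixed at compile time via preprocessor definitions, so each instantiation can be fully unrolled and specialized by the compiler for one small matrix size.

```c
#ifndef MB
#define MB 4
#endif
#ifndef NB
#define NB 4
#endif
#ifndef KB
#define KB 4
#endif

/* C(MB x NB) += A(MB x KB) * B(KB x NB), all column-major and packed, so the
 * leading dimensions coincide with the block sizes. */
void mxm_fixed(const double *A, const double *B, double *C)
{
    for (int j = 0; j < NB; ++j)
        for (int k = 0; k < KB; ++k)
            for (int i = 0; i < MB; ++i)
                C[i + j * MB] += A[i + k * MB] * B[k + j * KB];
}

/* Built, e.g., as: cc -O3 -funroll-loops -DMB=4 -DNB=8 -DKB=4 -c kernel.c */
```
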
We present the way in which we adapt data and computations to the underlying memory hierarchy by means of a hierarchical data structure known as a hypermatrix. The application of orthogonal block forms produced the best performance for the platforms used.
In this paper we present our work on the parallelization of a matrix multiplication code based on the hypermatrix data structure. We have used OpenMP for the parallelization. We have added OpenMP directives to a few loops and experimented with several features available with OpenMP in the Intel Fortran Compiler: scheduling algorithms, chunk sizes and nested parallelism. We found that the load imbalance introduced by the hypermatrix structure could not be solved by any of those OpenMP features.
We present two implementations of dense matrix multiplication based on two different non-canonical array layouts: one based on a hypermatrix data structure (HM) where data submatrices are stored using a recursive layout; the other based on a simple block data layout with square blocks (SB) where blocks are arranged in column-major order. We show that the iterative code using SB outperforms a recursive code using HM and obtains competitive results on a variety of platforms.
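
For illustration, a sketch of element addressing in a square-block layout of the SB kind described above (the function name is an assumption, and blocks are assumed padded to full nb×nb size): blocks are arranged in column-major order, and each block is itself stored column-major.

```c
#include <stddef.h>

/* Offset of element (i, j) of an n x n matrix stored in SB layout with
 * nb x nb blocks. */
static size_t sb_index(size_t i, size_t j, size_t n, size_t nb)
{
    size_t mblocks = (n + nb - 1) / nb;   /* blocks per column of blocks */
    size_t bi = i / nb, bj = j / nb;      /* block coordinates           */
    size_t ii = i % nb, jj = j % nb;      /* coordinates inside block    */
    /* Blocks laid out in column-major order, each holding nb*nb entries. */
    return (bj * mblocks + bi) * nb * nb + jj * nb + ii;
}
```
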
Dense linear algebra codes are often expressed and coded in terms of BLAS calls. This approach, however, achieves suboptimal performance due to the overheads associated with such calls. Taking as an example the dense Cholesky factorization of a symmetric positive definite matrix, we show that the potential of non-canonical data structures for dense linear algebra can be better exploited with the use of specialized inner kernels. The use of non-canonical data structures together with specialized inner kernels has low overhead and can produce excellent performance. Experiments were conducted on an Intel Itanium 2 processor running at 1.3 GHz with a theoretical peak performance of 5.2 GFlops.

Lecture Notes in Computer Science, 2003
This paper shows how a sparse hypermatrix Cholesky factorization can be improved. This is accomplished by means of efficient codes which operate on very small dense matrices. Different matrix sizes or target platforms may require different codes to obtain good performance. We write a set of codes for each matrix operation using different loop orders and unroll factors. Then, for each matrix size, we automatically compile each code, fixing matrix leading dimensions and loop sizes, run the resulting executable, and record its Mflops. The best combination is then used to produce the object code included in a library. Thus, a routine for each desired matrix size is available from the library. The large overhead incurred by the hypermatrix Cholesky factorization of sparse matrices can therefore be lessened by reducing the block size when those routines are used. Using the routines (e.g., matrix multiplication) from our small-matrix library produced significant speed-ups in our sparse Cholesky code.
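
A sketch of the selection step only (code generation is omitted, the variant names are hypothetical, and the timing scheme is simplified): each pre-compiled variant is timed for a given matrix size, and the one with the highest Mflops rate is kept for the library.

```c
#include <time.h>   /* POSIX clock_gettime */

typedef void (*kernel_fn)(const double *A, const double *B, double *C);

/* Hypothetical variants differing in loop order / unroll factor, each
 * compiled separately with fixed loop bounds and leading dimensions. */
void mxm_jki_u4(const double *, const double *, double *);
void mxm_ikj_u2(const double *, const double *, double *);
void mxm_kji_u8(const double *, const double *, double *);

kernel_fn pick_best(kernel_fn *cand, int ncand,
                    const double *A, const double *B, double *C,
                    double flops_per_call, int reps)
{
    kernel_fn best = cand[0];
    double best_mflops = 0.0;
    for (int v = 0; v < ncand; ++v) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < reps; ++r) cand[v](A, B, C);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        double mflops = reps * flops_per_call / (secs * 1e6);
        if (mflops > best_mflops) { best_mflops = mflops; best = cand[v]; }
    }
    return best;
}

/* Usage: kernel_fn cand[] = { mxm_jki_u4, mxm_ikj_u2, mxm_kji_u8 };
 *        kernel_fn best = pick_best(cand, 3, A, B, C, 2.0 * m * n * k, 100); */
```
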
Data prefetching and multilevel blocking for linear algebra operations
Proceedings of the 10th international conference on Supercomputing - ICS '96, 1996
Much effort has been directed towards obtaining near-peak performance for linear algebra operations on current high-performance workstations. The large number of data accesses, however, makes performance highly dependent on the behavior of the memory hierarchy. Techniques such as Multilevel Blocking (Tiling), Data Precopying, Software Pipelining and Software Prefetching have been applied in order to improve performance. Nevertheless, to our knowledge, no other work has been done considering the...
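
A small sketch combining tiling and software prefetching on a matrix transpose; the tile size, the prefetch distance and hints, and the use of the GCC/Clang __builtin_prefetch intrinsic are illustrative choices, not the paper's code.

```c
#define TILE 32

/* B = A^T for column-major n x n matrices: tiling improves locality on the
 * strided accesses, and a software prefetch pulls in the next column slice
 * of A while the current one is being transposed. */
void transpose_tiled(const double *A, double *B, int n)
{
    for (int jj = 0; jj < n; jj += TILE)
        for (int ii = 0; ii < n; ii += TILE) {
            int jmax = (jj + TILE < n) ? jj + TILE : n;
            int imax = (ii + TILE < n) ? ii + TILE : n;
            for (int j = jj; j < jmax; ++j) {
                if (j + 1 < jmax)
                    __builtin_prefetch(&A[ii + (j + 1) * n], 0, 1); /* read hint */
                for (int i = ii; i < imax; ++i)
                    B[j + i * n] = A[i + j * n];
            }
        }
}
```
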
Lecture Notes in Computer Science, 2014
This contribution describes a Square Block (SB) format for storing a banded symmetric matrix. This is possible by rearranging the LAPACK band layout "in place" to become an SB layout, storing submatrices as a set of square blocks. The new format reduces storage space, provides higher locality of memory accesses, results in regular access patterns, and exposes parallelism.