Papers by Sivasankaran Rajamanickam
Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout
ACM Transactions on Mathematical Software, 2008
CHOLMOD is a set of routines for factorizing sparse symmetric positive definite matrices of the form A or AA^T, updating/downdating a sparse Cholesky factorization, solving linear systems, updating/downdating the solution to the triangular system Lx = b, and many other sparse matrix functions for both symmetric and unsymmetric matrices. Its supernodal Cholesky factorization relies on LAPACK and the Level-3 BLAS, and obtains a substantial fraction of the peak performance of the BLAS. Both real and complex matrices are supported. CHOLMOD is written in ANSI/ISO C, with both C and MATLAB interfaces. It appears in MATLAB 7.2 as x=A\b when A is sparse symmetric positive definite, as well as in several other sparse matrix functions.
Zoltan2: Next-Generation Combinatorial Toolkit
BFS and Coloring-Based Parallel Algorithms for Strongly Connected Components and Related Problems
2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014
A Hybrid Approach for Parallel Transistor-Level Full-Chip Circuit Simulation
Lecture Notes in Computer Science, 2015
High-Performance Graph Analytics on Manycore Processors
2015 IEEE International Parallel and Distributed Processing Symposium, 2015
Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster
SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, 2014

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13, 2013
Scalable parallel computing is essential for processing large scale-free (power-law) graphs. The distribution of data across processes becomes important on distributed-memory computers with thousands of cores. It has been shown that two-dimensional layouts (edge partitioning) can have significant advantages over traditional one-dimensional layouts. However, simple 2D block distribution does not use the structure of the graph, and more advanced 2D partitioning methods are too expensive for large graphs. We propose a new two-dimensional partitioning algorithm that combines graph partitioning with 2D block distribution. The computational cost of the algorithm is essentially the same as 1D graph partitioning. We study the performance of sparse matrix-vector multiplication (SpMV) for scale-free graphs from the web and social networks using several different partitioners and both 1D and 2D data layouts. We show that SpMV run time is reduced by exploiting the graph's structure. Contrary to popular belief, we observe that current graph and hypergraph partitioners often yield relatively good partitions on scale-free graphs. We demonstrate that our new 2D partitioning method consistently outperforms the other methods considered, for both SpMV and an eigensolver, on matrices with up to 1.6 billion nonzeros using up to 16,384 cores.
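The difference between the one-dimensional and the simple two-dimensional block layouts mentioned in this abstract can be sketched as follows. This is only an illustration of the baseline block distributions, not the paper's graph-partitioning-based method; the function names `owner_1d` and `owner_2d` and the toy sizes are assumptions made for the example.

```python
def owner_1d(i, j, n, p):
    """1D row-block layout: the process that owns row i owns nonzero (i, j)."""
    block = -(-n // p)  # ceil(n / p) rows per process
    return i // block


def owner_2d(i, j, n, pr, pc):
    """Simple 2D block layout on a pr x pc process grid.

    Nonzero (i, j) goes to the process at (row block of i, column block of j).
    """
    row_block = -(-n // pr)  # ceil(n / pr)
    col_block = -(-n // pc)  # ceil(n / pc)
    return (i // row_block, j // col_block)


# Toy example (hypothetical sizes): an 8 x 8 matrix on 4 processes.
# Row 0 plays the role of a high-degree "hub" vertex with a nonzero in every column.
n = 8
hub_row = [(0, j) for j in range(n)]

owners_1d = {owner_1d(i, j, n, 4) for i, j in hub_row}
owners_2d = {owner_2d(i, j, n, 2, 2) for i, j in hub_row}
```

In this toy case the 1D layout places the whole hub row on a single process, while the 2D layout splits it across the two processes in one row of the process grid. Neither block layout looks at the graph's structure, which is the gap the abstract says the proposed method addresses.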
Using architecture information and real-time resource state to reduce power consumption and communication costs in parallel applications
Multi-Jagged: A Scalable Parallel Spatial Partitioning Algorithm
IEEE Transactions on Parallel and Distributed Systems, 2015
With the ubiquity of multicore processors, it is crucial that solvers adapt to the hierarchical structure of modern architectures. We present ShyLU, a "hybrid-hybrid" solver for general sparse linear systems that is hybrid in two ways: First, it combines direct and iterative methods. The iterative part is based on approximate Schur complements where we compute the approximate Schur complement using a value-based dropping strategy or structure-based probing strategy.
PuLP: Scalable multi-objective multi-constraint partitioning for small-world networks
2014 IEEE International Conference on Big Data (Big Data), 2014
2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012
We design, implement, and evaluate algorithms for computing a matching of maximum cardinality in a bipartite graph on multicore and massively multithreaded computers. As computers with larger numbers of slower cores dominate the commodity processor market, the design of multithreaded algorithms to solve large matching problems becomes a necessity. Recent work on serial algorithms for the matching problem has shown that their performance is sensitive to the order in which the vertices are processed for matching. In a multithreaded environment, imposing a serial order in which vertices are considered for matching would lead to loss of concurrency and performance. But this raises the question: Would parallel matching algorithms on multithreaded machines improve performance over a serial algorithm?

Exploiting Geometric Partitioning in Task Mapping for Parallel Computers
2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014
We present a new method for mapping applications' MPI tasks to cores of a parallel computer such that communication and execution time are reduced. We consider the case of sparse node allocation within a parallel machine, where the nodes assigned to a job are not necessarily located within a contiguous block nor within close proximity to each other in the network. The goal is to assign tasks to cores so that interdependent tasks are performed by "nearby" cores, thus lowering the distance messages must travel, the amount of congestion in the network, and the overall cost of communication. Our new method applies a geometric partitioning algorithm to both the tasks and the processors, and assigns task parts to the corresponding processor parts. We show that, for the structured finite difference mini-app MiniGhost, our mapping method reduced execution time by 34% on average on 65,536 cores of a Cray XE6. In a molecular dynamics mini-app, MiniMD, our mapping method reduced communication time by 26% on average on 6144 cores. We also compare our mapping with graph-based mappings from the LibTopoMap library and show that our mappings reduced the communication time on average by 15% in MiniGhost and 10% in MiniMD.
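The core idea in this abstract, partitioning tasks and processors with the same geometric algorithm and pairing up corresponding parts, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: recursive coordinate bisection is swapped in as a simple stand-in for the geometric partitioner, the names `rcb_parts` and `map_tasks_to_cores` are hypothetical, and the sketch assumes at least as many tasks as cores.

```python
def rcb_parts(points, num_parts):
    """Recursive coordinate bisection (a simple stand-in geometric partitioner).

    Returns num_parts lists of point indices, in a consistent spatial order.
    Assumes len(points) >= num_parts.
    """
    def split(ids, k):
        if k == 1:
            return [ids]
        # Cut along the coordinate with the largest spatial extent.
        ndims = len(points[ids[0]])
        dim = max(
            range(ndims),
            key=lambda d: max(points[i][d] for i in ids)
            - min(points[i][d] for i in ids),
        )
        ordered = sorted(ids, key=lambda i: points[i][dim])
        left_k = k // 2
        cut = len(ordered) * left_k // k  # keep part sizes proportional
        return split(ordered[:cut], left_k) + split(ordered[cut:], k - left_k)

    return split(list(range(len(points))), num_parts)


def map_tasks_to_cores(task_coords, core_coords):
    """Partition tasks and cores the same way, then pair part k with part k."""
    num_parts = len(core_coords)
    task_parts = rcb_parts(task_coords, num_parts)
    core_parts = rcb_parts(core_coords, num_parts)  # one core per part
    mapping = {}
    for tasks, cores in zip(task_parts, core_parts):
        for t in tasks:
            mapping[t] = cores[0]
    return mapping


# Toy example (hypothetical coordinates): 8 tasks on a line, 4 cores whose
# network coordinates are listed in scrambled order, as in a sparse allocation.
tasks = [(x, 0) for x in range(8)]
cores = [(3, 0), (0, 0), (2, 0), (1, 0)]
mapping = map_tasks_to_cores(tasks, cores)
```

In the toy example, neighboring tasks end up on the same core or on cores that are adjacent in the network coordinates, regardless of the order in which the cores were listed.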
Electrical modeling and simulation for stockpile stewardship
XRDS: Crossroads, The ACM Magazine for Students, 2013
Towards Extreme-Scale Simulations for Low Mach Fluids with Second-Generation Trilinos
Parallel Processing Letters, 2014

2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014
Trilinos is an object-oriented software framework for the solution of large-scale, complex multiphysics engineering and scientific problems. While the original version of Trilinos was designed for highly scalable solutions for large problems, the need for increasingly higher fidelity simulations has pushed the problem sizes beyond what could have been envisioned two decades ago. When problem sizes exceed a billion elements even highly scalable applications and solver stacks require a complete revision. The next-generation Trilinos employs C++ templates in order to solve arbitrarily large problems and enable extreme-scale simulations. We present a case study that involves integration of Trilinos with an engineering application (Sierra low Mach module/Nalu), involving the simulation of low Mach fluid flow for problems of size up to nine billion elements. Through the use of improved algorithms and better software engineering practices, we demonstrate good weak scaling for the matrix assembly and solve for the engineering application for up to a nine billion element fluid flow large eddy simulation (LES) problem on unstructured meshes with a 27 billion row matrix on 131,072 cores of a Cray XE6 platform.
Poster: a hybrid-hybrid solver for manycore platforms
Tutorial: the Zoltan toolkit
The Energy Citations Database (ECD) provides access to historical and current research (1948 to the present) from the Department of Energy (DOE) and predecessor agencies.
Load balancing with Zoltan and Isorropia