Papers by Grzegorz Kwasniewski
High-Performance and Programmable Attentional Graph Neural Networks with Global Tensor Formulations
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Extreme-scale plasma turbulence simulations on the world's top supercomputers
IEEE Conference Proceedings, 2016
A PCIe congestion-aware performance model for densely populated accelerator servers
IEEE Conference Proceedings, 2016

Geoscientific Model Development, May 2, 2018
The best hope for reducing long-standing global climate model biases is to increase resolution to the kilometer scale. Here we present results from an ultrahigh-resolution non-hydrostatic climate model for a near-global setup running on the full Piz Daint supercomputer on 4888 GPUs (graphics processing units). The dynamical core of the model has been completely rewritten using a domain-specific language (DSL) for performance portability across different hardware architectures. Physical parameterizations and diagnostics have been ported using compiler directives. To our knowledge this represents the first complete atmospheric model being run entirely on accelerators at this scale. At a grid spacing of 930 m (1.9 km), we achieve a simulation throughput of 0.043 (0.23) simulated years per day and an energy consumption of 596 MWh per simulated year. Furthermore, we propose a new memory usage efficiency (MUE) metric that considers how efficiently the memory bandwidth, the dominant bottleneck of climate codes, is being used.
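The abstract does not spell out the MUE formula, but it suggests a metric combining two ratios: how close the actual data movement is to the necessary minimum, and how close the achieved bandwidth is to the hardware peak. The sketch below is a hedged reconstruction under that assumption; the function name and the exact combination are illustrative, not the paper's definition.

```python
# A minimal sketch of a memory-usage-efficiency-style metric, assuming it is
# the product of I/O efficiency (necessary vs. actual traffic) and bandwidth
# efficiency (achieved vs. peak). Both the name and the formula are assumed.

def mue(necessary_bytes: float, actual_bytes: float,
        achieved_bw: float, peak_bw: float) -> float:
    """Memory usage efficiency in [0, 1]."""
    io_efficiency = necessary_bytes / actual_bytes   # 1.0 = no redundant traffic
    bw_efficiency = achieved_bw / peak_bw            # 1.0 = bandwidth-saturated
    return io_efficiency * bw_efficiency

# Example: a stencil kernel that moves 1.6x the necessary data and reaches
# 70% of peak memory bandwidth.
print(mue(necessary_bytes=1.0e9, actual_bytes=1.6e9,
          achieved_bw=0.7, peak_bw=1.0))   # ~0.44
```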

Performance modeling can be utilized in a number of scenarios, from finding performance bugs to studying the scalability of applications. Existing dynamic and static approaches for automating the generation of performance models have limitations in precision and overhead. In this work, we explore a combination of static and dynamic analyses for lifelong performance modeling and investigate accuracy, reduction of the model search space, and performance improvements over previous approaches on a wide range of parallel benchmarks. We develop static and dynamic schemes such as kernel clustering, batched model updates, and regulation of modeling frequency for reducing the cost of measurements, model generation, and updates. Our hybrid approach improves the accuracy of the performance models by 4.3% on average (maximum 10%) and reduces the overhead by 25% (maximum 65%) compared to previous approaches.
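To make the idea of automated model generation concrete, here is a minimal sketch of the usual empirical approach: fit candidate terms of the form c · p^i · log2(p)^j to measured runtimes and keep the best-fitting hypothesis. The search space and model form are illustrative assumptions, not the paper's exact scheme.

```python
# Hedged sketch: least-squares fit of single-term scaling hypotheses to
# measured runtimes t over core counts p. Exponent ranges are assumptions.
import numpy as np

def fit_model(p, t, exponents=(0, 1, 2, 3), log_exponents=(0, 1)):
    best = None
    for i in exponents:
        for j in log_exponents:
            basis = p**i * np.log2(p)**j
            c, *_ = np.linalg.lstsq(basis[:, None], t, rcond=None)
            err = np.linalg.norm(basis * c[0] - t)
            if best is None or err < best[0]:
                best = (err, i, j, c[0])
    err, i, j, c = best
    return f"t(p) ~ {c:.3g} * p^{i} * log2(p)^{j}  (residual {err:.3g})"

p = np.array([2., 4., 8., 16., 32.])
t = 0.5 * p * np.log2(p) + np.random.default_rng(0).normal(0, 0.1, p.size)
print(fit_model(p, t))   # recovers the p*log2(p) term
```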
Software for Exascale Computing - SPPEXA 2013-2015

arXiv (Cornell University), Aug 24, 2022
Important graph mining problems such as Clustering are computationally demanding. To significantly accelerate these problems, we propose ProbGraph: a graph representation that enables simple and fast approximate parallel graph mining with strong theoretical guarantees on work, depth, and result accuracy. The key idea is to represent sets of vertices using probabilistic set representations such as Bloom filters. These representations are much faster to process than the original vertex sets thanks to vectorizability and small size. We use these representations as building blocks in important parallel graph mining algorithms such as Clique Counting or Clustering. When enhanced with ProbGraph, these algorithms significantly outperform tuned parallel exact baselines (up to nearly 50× on 32 cores) while ensuring accuracy of more than 90% for many input graph datasets. Our novel bounds and algorithms based on probabilistic set representations with desirable statistical properties are of separate interest for the data analytics community.
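The core mechanism is easy to illustrate: store a neighborhood as a Bloom filter so that set intersection (the inner loop of clique counting and clustering coefficients) becomes a bitwise AND. The estimator below, popcount of the AND divided by the number of hash functions, is a deliberately naive stand-in; the paper derives estimators with accuracy guarantees.

```python
# Toy sketch of approximate set intersection via Bloom filters. Filter size,
# hash count, and the collision-ignoring estimator are illustrative choices.
import hashlib

M = 256   # bits per filter
K = 3     # hash functions

def bloom(s):
    bits = 0
    for x in s:
        for k in range(K):
            h = int(hashlib.blake2b(f"{k}:{x}".encode(),
                                    digest_size=4).hexdigest(), 16)
            bits |= 1 << (h % M)
    return bits

a = set(range(0, 60))
b = set(range(40, 100))
approx = bin(bloom(a) & bloom(b)).count("1") / K   # each shared element sets K bits
print("exact:", len(a & b), "approx:", approx)
```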

arXiv (Cornell University), Mar 5, 2021
We propose GraphMineSuite (GMS): the first benchmarking suite for graph mining that facilitates evaluating and constructing high-performance graph mining algorithms. First, GMS comes with a benchmark specification based on extensive literature review, prescribing representative problems, algorithms, and datasets. Second, GMS offers a carefully designed software platform for seamless testing of different fine-grained elements of graph mining algorithms, such as graph representations or algorithm subroutines. The platform includes parallel implementations of more than 40 considered baselines, and it facilitates developing complex and fast mining algorithms. High modularity is possible by harnessing set algebra operations such as set intersection and difference, which enables breaking complex graph mining algorithms into simple building blocks that can be separately experimented with. GMS is supported with a broad concurrency analysis for portability in performance insights, and a novel performance metric to assess the throughput of graph mining algorithms, enabling more insightful evaluation. As use cases, we harness GMS to rapidly redesign and accelerate state-of-the-art baselines of core graph mining problems: degeneracy reordering (by up to >2×), maximal clique listing (by up to >9×), k-clique listing (by 1.1×), and subgraph isomorphism (by up to 2.5×), also obtaining better theoretical performance bounds.
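One of the building blocks named above, degeneracy reordering, is the classic peeling procedure: repeatedly remove a minimum-degree vertex. Many mining algorithms (maximal clique listing among them) gain tighter bounds when vertices are processed in this order. A minimal sketch, assuming a plain adjacency-set graph representation:

```python
# Degeneracy (k-core) ordering via lazy-deletion min-heap peeling.
import heapq

def degeneracy_order(adj):
    deg = {v: len(n) for v, n in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed, order = set(), []
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue                      # stale heap entry, skip
        removed.add(v)
        order.append(v)
        for u in adj[v]:                  # peeling lowers neighbor degrees
            if u not in removed:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
    return order

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(degeneracy_order(adj))   # [3, 0, 1, 2]
```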

Determining I/O lower bounds is a crucial step in obtaining communication-efficient parallel algorithms, both across the memory hierarchy and between processors. Current approaches either study specific algorithms individually, disallow programmatic motifs such as recomputation, or produce asymptotic bounds that exclude important constants. We propose a novel approach for obtaining precise I/O lower bounds on a general class of programs, which we call Simple Overlap Access Programs (SOAP). SOAP analysis covers a wide variety of algorithms, from ubiquitous computational kernels to full scientific computing applications. Using the red-blue pebble game and combinatorial methods, we are able to bound the I/O of the SOAP-induced Computational Directed Acyclic Graph (CDAG), taking into account multiple statements, input/output reuse, and optimal tiling. To deal with programs that are outside of our representation (e.g., non-injective access functions), we describe methods to approximate them with SOAP. To demonstrate our method, we analyze 38 different applications, including kernels from the Polybench benchmark suite, deep learning operators, and, for the first time, applications in unstructured physics simulations, numerical weather prediction stencil compositions, and full deep neural networks. We derive tight I/O bounds for several linear algebra kernels, such as Cholesky decomposition, improving the existing reported bounds by a factor of two. For stencil applications, we improve the existing bounds by a factor of up to 14. We implement our method as an open-source tool, which can derive lower bounds directly from provided C code.
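A small worked example shows the kind of constant-preserving bound at stake. For classical matrix multiplication with n·m·k scalar multiplications and a fast memory of S words, a tight I/O lower bound of roughly 2nmk/√S is known; comparing it against the traffic of a square-tiled schedule shows how close simple tiling gets. The traffic formula below is a textbook accounting, not the paper's tool.

```python
# Hedged sketch: evaluate the ~2nmk/sqrt(S) lower bound against the I/O of a
# square-tiled matrix-multiplication schedule.
from math import sqrt, isqrt

def mmm_io_lower_bound(n, m, k, S):
    return 2 * n * m * k / sqrt(S)

def tiled_mmm_io(n, m, k, S):
    b = isqrt(S // 3)                          # square tiles of A, B, C fit in S
    tiles = (n // b) * (m // b) * (k // b)     # tile triples visited
    loads = 2 * tiles * b * b                  # one A and one B tile per triple
    stores = n * m                             # each C tile written once
    return loads + stores

n = m = k = 1024
S = 3 * 64 * 64
print("lower bound:", mmm_io_lower_bound(n, m, k, S))   # ~1.9e7 words
print("tiled I/O  :", tiled_mmm_io(n, m, k, S))         # ~3.5e7 words
```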

We propose COSMA: a parallel matrix-matrix multiplication algorithm that is near communication-optimal for all combinations of matrix dimensions, processor counts, and memory sizes. The key idea behind COSMA is to derive an optimal (up to a factor of 0.03% for 10 MB of fast memory) sequential schedule and then parallelize it, preserving I/O optimality. To achieve this, we use the red-blue pebble game to precisely model MMM dependencies and derive constructive and tight sequential and parallel I/O lower bound proofs. Compared to 2D or 3D algorithms, which fix the processor decomposition upfront and then map it to the matrix dimensions, it reduces communication volume by up to √3 times. COSMA outperforms the established ScaLAPACK, CARMA, and CTF algorithms in all scenarios, by up to 12.8x (2.2x on average), achieving up to 88% of Piz Daint's peak performance. Our work does not require any hand tuning and is maintained as an open-source implementation. We focus only on "classical" MMM algorithms, which perform n³ multiplications and additions; we do not analyze Strassen-like routines [53], as in practice they are often slower [19].
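To see why a fixed processor decomposition can be suboptimal, consider the baseline COSMA improves on: a p1 × p2 × p3 grid whose per-processor communication volume is roughly mk/(p1·p3) + kn/(p2·p3) + mn/(p1·p2). Searching all factorizations shows the best grid tracks the matrix shape. (COSMA itself derives its schedule from the I/O-optimal sequential one rather than from a grid; this sketch only illustrates the contrast.)

```python
# Hedged sketch: brute-force the 3D processor grid that minimizes the
# per-processor communication volume for C(m,n) = A(m,k) @ B(k,n).
from itertools import product

def best_grid(m, n, k, P):
    best = None
    for p1, p2 in product(range(1, P + 1), repeat=2):
        if P % (p1 * p2):
            continue
        p3 = P // (p1 * p2)
        comm = m * k / (p1 * p3) + k * n / (p2 * p3) + m * n / (p1 * p2)
        if best is None or comm < best[0]:
            best = (comm, (p1, p2, p3))
    return best

print(best_grid(m=2**16, n=2**16, k=2**8, P=64))  # tall-skinny favors a flat grid
print(best_grid(m=2**12, n=2**12, k=2**12, P=64)) # cube favors 4 x 4 x 4
```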

IEEE International Conference on High Performance Computing, Data, and Analytics, Nov 13, 2016

The doubling of cores every two years requires programmers to expose maximum parallelism. Applications that are developed on today's machines will often be required to run on many more cores. Thus, it is necessary to understand how much parallelism codes can expose. The work and depth model provides a convenient mental framework to assess the required work and the maximum parallelism of algorithms and their parallel efficiency. We propose an automatic analysis to extract work and depth from source code. We do this by statically counting the number of loop iterations depending on the set of input parameters. The resulting expression can be used to assess work and depth with respect to the program inputs. Our method supports the large class of practically relevant loops with affine update functions and generates additional parameters for other expressions. We demonstrate how this method can be used to determine work and depth of several real-world applications. Our technique enables us to prove whether the theoretically maximum parallelism is exposed in a practical implementation of a problem. This will be most important for future-proof software development.
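For affine loop nests, the counting described above has a closed form: work is the symbolic number of body executions over the input parameters, and depth follows from which loops are sequential. A minimal sketch of this counting with sympy, as an illustrative stand-in for the paper's static analysis:

```python
# Hedged sketch: symbolic work/depth of a triangular loop nest.
import sympy as sp

n = sp.symbols("n", positive=True, integer=True)
i = sp.symbols("i", integer=True)

# for i in range(n):        # parallel over i
#     for j in range(i):    # sequential chain of length i
#         body()
work = sp.summation(i, (i, 0, n - 1))   # total body executions
depth = n - 1                           # longest sequential chain (i = n-1)

print("work :", sp.factor(work))        # n*(n - 1)/2
print("depth:", depth)
```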

Data movement is the dominating factor affecting performance and energy in modern computing systems. Consequently, many algorithms have been developed to minimize the number of I/O operations for common computing patterns. Matrix multiplication is no exception, and lower bounds have been proven and implemented both for shared and distributed memory systems. Reconfigurable hardware platforms are a lucrative target for I/O minimizing algorithms, as they offer full control of memory accesses to the programmer. While bounds developed in the context of fixed architectures still apply to these platforms, the spatially distributed nature of their computational and memory resources requires a decentralized approach to optimize algorithms for maximum hardware utilization. We present a model to optimize matrix multiplication for FPGA platforms, simultaneously targeting maximum performance and minimum off-chip data movement, within constraints set by the hardware. We map the model to a concrete architecture using a high-level synthesis tool, maintaining a high level of abstraction, allowing us to support arbitrary data types, and enabling maintainability and portability across FPGA devices. Kernels generated from our architecture are shown to offer competitive performance in practice, scaling with both compute and memory resources. We offer our design as an open-source project to encourage the open development of linear algebra and I/O minimizing algorithms on reconfigurable hardware platforms.
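The flavor of the model is easy to sketch: pick tile sizes that minimize off-chip traffic subject to an on-chip memory budget. The buffer-footprint and traffic formulas below (stream the reduction dimension, keep one output tile resident) are illustrative assumptions, not the paper's model.

```python
# Hedged sketch: search matrix-multiply tile sizes (bm, bn) that minimize
# off-chip traffic under an assumed on-chip memory budget in words.
def best_tiles(n, mem_words, max_tile=512):
    best = None
    for bm in range(8, max_tile + 1, 8):
        for bn in range(8, max_tile + 1, 8):
            # Resident footprint: C tile plus double-buffered A column / B row.
            if bm * bn + 2 * (bm + bn) > mem_words:
                continue
            # Traffic: every output tile streams full A rows and B columns,
            # i.e., n^3 * (1/bm + 1/bn) words, plus writing C once.
            traffic = (n // bm) * (n // bn) * (n * bm + n * bn) + n * n
            if best is None or traffic < best[0]:
                best = (traffic, bm, bn)
    return best

print(best_tiles(n=4096, mem_words=128 * 1024))  # favors near-square tiles
```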


arXiv (Cornell University), Nov 13, 2019
Synchronization overheads pose a major challenge as applications advance towards extreme scales. In current large-scale algorithms, synchronization as well as data communication delay the parallel computations at each time step in a time-dependent partial differential equation (PDE) solver. This creates a new scaling wall when moving towards exascale. We present a weakly synchronous algorithm based on novel asynchrony-tolerant (AT) finite-difference schemes that relax synchronization at a mathematical level. We utilize remote memory access programming schemes, which have been shown to provide significant speedup on modern supercomputers, to efficiently implement communications suitable for AT schemes, and compare to two-sided communications that are state of practice. We present results from simulations of Burgers' equation as a model of multi-scale, strongly non-linear dynamical systems. Our algorithm demonstrates excellent scalability of the new AT schemes for large-scale computing, with a speedup of up to 3.3x in communication time and 2.19x in total runtime. We expect that such schemes can form the basis for exascale PDE algorithms.
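The problem being relaxed is concrete: in a domain-decomposed explicit solver, each rank needs its neighbor's boundary (halo) value every step, and skipping the synchronization means updating with a stale value. The toy below, on a 1D diffusion step standing in for Burgers' equation, only shows where the stale value enters; actual AT schemes modify the stencil coefficients to cancel the error such delays introduce.

```python
# Toy illustration (not the paper's AT scheme): an explicit 1D diffusion step
# whose left halo value may lag a few time steps behind the local solution.
import numpy as np

def step(u, left_halo, nu=0.1):
    """One explicit diffusion step; left_halo may be stale."""
    left = np.concatenate(([left_halo], u[:-1]))   # stale neighbor enters here
    right = np.concatenate((u[1:], [u[-1]]))
    return u + nu * (left - 2 * u + right)

rng = np.random.default_rng(1)
u = rng.random(16)
fresh = step(u, left_halo=0.5)          # synchronous halo exchange
stale = step(u, left_halo=0.5 + 0.05)   # halo delayed by a few steps
print("error introduced by the stale halo:", np.abs(fresh - stale).max())
```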
Proceedings of the VLDB Endowment, Jul 1, 2021

arXiv (Cornell University), Dec 28, 2020
Function-as-a-Service (FaaS) is one of the most promising directions for the future of cloud services, and serverless functions have immediately become a new middleware for building scalable and cost-efficient microservices and applications. However, the quickly moving technology hinders reproducibility, and the lack of a standardized benchmarking suite leads to ad-hoc solutions and microbenchmarks being used in serverless research, further complicating meta-analysis and comparison of research solutions. To address this challenge, we propose the Serverless Benchmark Suite: the first benchmark for FaaS computing that systematically covers a wide spectrum of cloud resources and applications. Our benchmark consists of the specification of representative workloads, the accompanying implementation and evaluation infrastructure, and the evaluation methodology that facilitates reproducibility and enables interpretability. We demonstrate that the abstract model of a FaaS execution environment ensures the applicability of our benchmark to multiple commercial providers such as AWS, Azure, and Google Cloud. Our work facilitates experimental evaluation of serverless systems and delivers a standardized, reliable, and evolving evaluation methodology of performance, efficiency, scalability, and reliability of middleware FaaS platforms.
Lecture Notes in Computational Science and Engineering, 2016

Simple graph algorithms such as PageRank have been the target of numerous hardware accelerators. Yet, there also exist much more complex graph mining algorithms for problems such as clustering or maximal clique listing. These algorithms are memory-bound and thus could be accelerated by hardware techniques such as Processing-in-Memory (PIM). However, they also come with non-straightforward parallelism and complicated memory access patterns. In this work, we address this problem with a simple yet surprisingly powerful observation: operations on sets of vertices, such as intersection or union, form a large part of many complex graph mining algorithms, and can offer rich and simple parallelism at multiple levels. This observation drives our cross-layer design, in which we (1) expose set operations using a novel programming paradigm, (2) express and execute these operations efficiently with carefully designed set-centric ISA extensions called SISA, and (3) use PIM to accelerate SISA instructions. The key design idea is to alleviate the bandwidth needs of SISA instructions by mapping set operations to two types of PIM: in-DRAM bulk bitwise computing for bitvectors representing high-degree vertices, and near-memory logic layers for integer arrays representing low-degree vertices. Set-centric SISA-enhanced algorithms are efficient and outperform hand-tuned baselines, offering more than 10× speedup over the established Bron-Kerbosch algorithm for listing maximal cliques. We deliver more than 10 SISA set-centric algorithm formulations, illustrating SISA's wide applicability.
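The dual representation at the heart of the design is simple to sketch in software: bitvectors for high-degree neighborhoods, sorted integer arrays for low-degree ones, and an intersection that dispatches on the combination. In hardware the bitvector AND maps to in-DRAM bulk bitwise operations and the merge to near-memory logic; here both are plain Python stand-ins, and the threshold is an assumed tuning knob.

```python
# Hedged sketch of a SISA-style dual set representation and intersection.
THRESHOLD = 32   # assumed degree cutoff between array and bitvector storage

def to_bitvector(neighbors):
    bits = 0
    for v in neighbors:
        bits |= 1 << v
    return bits

def intersect(a, b):
    """Set-intersection cardinality across the two representations."""
    if isinstance(a, int) and isinstance(b, int):
        return bin(a & b).count("1")            # bitvector AND (PIM-friendly)
    if isinstance(a, int):
        a, b = b, a                             # ensure a is the integer array
    if isinstance(b, int):
        return sum(1 for v in a if b >> v & 1)  # array probes into bitvector
    return len(set(a) & set(b))                 # two arrays: plain merge

dense = to_bitvector(range(0, 64))   # high-degree neighborhood
sparse = [3, 40, 70, 91]             # low-degree neighborhood
print(intersect(dense, sparse))      # 2 common neighbors (3 and 40)
```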