Publications

Presentations from the Denver SOC for HPC workshop are available here (access permission required).

June 2016 release: Abstract Machine Models and Proxy Architectures for Exascale Computing (full document), v2.0.
May 2014 release: Abstract Machine Models and Proxy Architectures for Exascale Computing (full document), v1.1.


1.     J.A. Ang, R.F. Barrett, R.E. Benner, D. Burke, C. Chan, J. Cook, C.S. Daley, D. Donofrio, S.D. Hammond, K.S. Hemmert, R.J. Hoekstra, K. Ibrahim, S.M. Kelly, H. Le, V.J. Leung, G. Michelogiannakis, D.R. Resnick, A.F. Rodrigues, J. Shalf, D. Stark, D. Unat, N.J. Wright, G.R. Voskuilen, “Abstract Machine Models and Proxy Architectures for Exascale Computing Version 2.0,” DOE Technical Report (joint report of Sandia National Laboratories and Lawrence Berkeley National Laboratory), June 2016.

Status: published


2. C. Chan, D. Unat, M. Lijewski, W. Zhang, J. B. Bell, J. Shalf, “Software Design Space Exploration for Exascale Co-Design,” International Supercomputing Conference (ISC), 2013.

Status: accepted

Abstract: The design of hardware for next-generation exascale computing systems will require a deep understanding of how software optimizations impact hardware design trade-offs. In order to characterize how co-tuning hardware and software parameters affects the performance of combustion simulation codes, we created ExaSAT, a compiler-driven static analysis and performance-modeling framework. Our framework can evaluate hundreds of hardware/software configurations in seconds, providing an essential speed advantage over simulators and dynamic analysis techniques during the co-design process. Our analytic performance model shows that advanced code transformations, such as cache blocking and loop fusion, can have a significant impact on choices for cache and memory architecture. Our modeling helped us identify tuned configurations that achieve a 90% reduction in memory traffic, which could significantly improve performance and reduce energy consumption. These techniques will also be useful for the development of advanced programming models and runtimes, which must reason about these optimizations to deliver better performance and energy efficiency.
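
Code sketch: To make the kind of transformation the abstract refers to concrete, the fragment below contrasts two separate array sweeps with a fused version; the unfused form streams the intermediate array through DRAM twice, while the fused form keeps each intermediate value in a register. This is an illustrative example only; the arrays, sizes, and kernels are hypothetical and are not taken from the combustion codes ExaSAT analyzes.

/* Illustrative loop-fusion example (hypothetical kernel, not from ExaSAT). */
#include <stddef.h>

void unfused(const double *a, double *tmp, double *out, size_t n)
{
    for (size_t i = 1; i + 1 < n; i++)      /* sweep 1: writes tmp to memory */
        tmp[i] = 0.5 * (a[i - 1] + a[i + 1]);
    for (size_t i = 1; i + 1 < n; i++)      /* sweep 2: re-reads tmp from memory */
        out[i] = tmp[i] * tmp[i];
}

void fused(const double *a, double *out, size_t n)
{
    for (size_t i = 1; i + 1 < n; i++) {    /* one sweep: tmp never touches DRAM */
        double t = 0.5 * (a[i - 1] + a[i + 1]);
        out[i] = t * t;
    }
}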

 

3.     S. Hammond, K. S. Hemmert, A. Rodrigues, S. Yalamanchili, J. Wang, “Towards a Standard Architectural Simulation Framework: A Call to Arms,” Workshop on Modeling & Simulation of Exascale Systems & Applications, Sept. 2013.

Status: published

 

4.    J.A. Ang, R.F. Barrett, R.E. Benner, D. Burke, C. Chan, D. Donofrio, S.D. Hammond, K.S. Hemmert, S.M. Kelly, H. Le, V.J. Leung, D.R. Resnick, A.F. Rodrigues, J. Shalf, D. Stark, D. Unat, N.J. Wright, “Abstract Machine Models and Proxy Architectures for Exascale Computing,” DOE Technical Report (joint report of Sandia National Laboratories and Lawrence Berkeley National Laboratory), May 2014.

Status: Published

Abstract: In this report we present an alternative view of industry’s Exascale system hardware architectures. Instead of providing highly detailed models of each potential architecture, as may be presented by any individual processor vendor, we propose initially to utilize simpler, abstract models of a compute node that allow an application developer to reason about data structure placement in the memory system and the location at which computational kernels may be run. This Abstract Machine Model (AMM) will provide software developers with sufficient detail of gross architectural features so they may begin tailoring their codes for these new high performance machines, as well as avoid pitfalls when creating new codes or porting existing codes to Exascale machines.

 

5.     K. Ibrahim, E. Strohmaier, J. Shalf, “Simulation Acceleration for Extreme Concurrency Parallel Applications Using Representative Sampling,” Workshop on Modeling & Simulation of Exascale Systems & Applications, Oct. 2013.

Status: published

 

6.     G. Michelogiannakis, A. Williams, S. Williams, J. Shalf, “Collective Memory Transfers for Multi-Core Chips,” International Conference on Supercomputing (ICS), 2014.

Status: accepted (to appear)

Abstract: Future performance improvements for microprocessors have shifted from clock frequency scaling towards increases in on-chip parallelism. Performance improvements for a wide variety of parallel applications require domain decomposition of data arrays from a contiguous arrangement in memory to a tiled layout for on-chip L1 caches and scratchpads. However, DRAM performance suffers under the non-streaming access patterns generated by many independent cores. In this paper, we propose collective memory scheduling (CMS) that uses simple software and inexpensive hardware to identify collective transfers and guarantee that loads and stores arrive in memory address order to the memory controller. CMS actively takes charge of collective transfers and pushes or pulls data to or from the on-chip processors according to memory address order. CMS reduces application execution time by up to 55% (20% average) compared to a state-of-the-art architecture where each processor reads and writes its data independently. CMS also reduces DRAM read power by 2.2× and write power by 50%.
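
Code sketch: The access-pattern problem CMS targets can be seen in a small toy model: when a row-major array is decomposed into per-core tiles, each core's independent reads are strided in DRAM address space, whereas a collective transfer can stream the array in ascending address order and route each element to the core that owns its tile. The code below only illustrates that ordering idea and is not the hardware scheduler described in the paper; the array and tile sizes are hypothetical.

/* Toy model of address-ordered collective delivery vs. per-core strided reads. */
#include <stdio.h>

#define N    8
#define TILE 4   /* 2 x 2 grid of tiles -> 4 "cores" */

int main(void)
{
    /* Collective, address-ordered delivery: one streaming pass over the array. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int core = (i / TILE) * (N / TILE) + (j / TILE);   /* owner of element */
            printf("addr %3d -> core %d\n", i * N + j, core);
        }

    /* Independent per-core reads: core 1's addresses jump by N at every row. */
    int core = 1, ti = 0, tj = 1;   /* tile row/column owned by core 1 */
    for (int i = 0; i < TILE; i++)
        for (int j = 0; j < TILE; j++)
            printf("core %d reads addr %3d\n",
                   core, (ti * TILE + i) * N + (tj * TILE + j));
    return 0;
}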

 

 

7.     G. Michelogiannakis, J. Shalf, “Variable Width Datapath for On-Chip Network Static Power Reduction,” International Symposium on Networks-on-Chip (NOCS), 2014.

Status: accepted (to appear)

Abstract: With the tight power budgets in modern large-scale chips and the unpredictability of application traffic, on-chip network designers are faced with the dilemma of designing for worst-case bandwidth demands and incurring high static power overheads, or designing for an average traffic pattern and risking degraded performance. This paper proposes adaptive bandwidth networks (ABNs), which divide channels and switches into lanes such that the network provides just the bandwidth necessary in each hop. ABNs also activate input virtual channels (VCs) individually and take advantage of drowsy SRAM cells to eliminate false VC activations. In addition, ABNs readily apply to silicon defect tolerance with just the extra cost for detecting faults. For application traffic, ABNs reduce total power consumption by an average of 45% relative to single-lane power-gated networks, and by 33% relative to multi-network designs, with comparable performance.

 

8.     G. Michelogiannakis, X.S. Li, D. H. Bailey, J. Shalf, “Extending Summation Precision for Network Reduction Operations,” International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2014.

Status: published

Abstract: Double precision summation is at the core of numerous important algorithms such as Newton-Krylov methods and other operations involving inner products, but the effectiveness of summation is limited by the accumulation of rounding errors, which are an increasing problem with the scaling of modern HPC systems and data sets. To reduce the impact of precision loss, researchers have proposed increased- and arbitrary-precision libraries that provide reproducible error or even bounded error accumulation for large sums, but do not guarantee an exact result. Such libraries can also increase computation time significantly. We propose big integer (BigInt) expansions of double precision variables that enable arbitrarily large summations without error and provide exact and reproducible results. This is feasible with performance comparable to that of double-precision floating-point summation, by the inclusion of simple and inexpensive logic into modern NICs to accelerate performance on large-scale systems.
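
Code sketch: A minimal illustration of the fixed-point idea behind such big-integer expansions is shown below. It assumes every input is an exact multiple of 2^-60 with magnitude below 2^30 (so the power-of-two scaling is lossless) and uses the GCC/Clang __int128 extension; it demonstrates the concept only and is not the NIC-accelerated reduction design proposed in the paper.

/* Exact summation via a 128-bit fixed-point accumulator (concept sketch). */
#include <stdio.h>

static __int128 to_fixed(double x)    { return (__int128)(x * 0x1p60); }  /* exact: power-of-two scale */
static double   to_double(__int128 v) { return (double)v * 0x1p-60; }     /* rounds once, at the end */

int main(void)
{
    double   naive = 1.0;
    __int128 exact = to_fixed(1.0);
    for (int i = 0; i < 1000000; i++) {
        naive += 0x1p-53;             /* each add rounds back to 1.0 */
        exact += to_fixed(0x1p-53);   /* integer add: no rounding */
    }
    printf("naive double sum: %.17g\n", naive);             /* prints 1 */
    printf("fixed-point sum : %.17g\n", to_double(exact));  /* recovers ~1.0000000001110223 */
    return 0;
}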

 

9.     D. Unat, G. Michelogiannakis, W. Zhang, J. B. Bell, J. Shalf, “TiDA: An API for Data-Centric Topology-Aware Scientific Computing,” International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2014.

Status: under review

Abstract: This paper uses CAL architectural simulators to understand how locality-aware programming constructs would benefit future chip architectures with thousands of cores. Contemporary HPC processor designs are moving towards massively parallel chips. This, combined with the relatively high energy costs for data movement compared to computation, gives paramount importance to data locality management in programs. Programming models play a crucial role in providing the necessary tools to express locality and minimize data movement, while also abstracting complexity from programmers. Unfortunately, existing compute-centric programming environments provide few abstractions to manage data movement and locality, and rely on a large shared cache to virtualize data movement. We introduce the TiDA (Tiling as a Durable Abstraction) library that provides a simple API to naturally express data locality and layout. TiDA elevates tiling to the programming model to expose high degrees of parallelism through domain decomposition. We use the BookSim cycle-accurate architectural simulator to demonstrate how TiDA automates both cache-locality and topology optimizations with minimal coding effort on current and future NUMA node architectures.
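
Code sketch: The tiling abstraction can be pictured with a short example: instead of one flat loop nest over a 2-D domain, the domain is decomposed into tiles, and the tile loop becomes the unit that a library or runtime can map to cores, caches, or NUMA domains. The fragment below is a hypothetical illustration of that idea in plain C; it does not use TiDA's actual API, and the kernel and tile sizes are made up.

/* Hypothetical tiled 2-D smoothing kernel (not TiDA's real interface). */
#include <stddef.h>

#define NX 1024
#define NY 1024
#define TX 64      /* tile sizes: tunable locality/layout parameters */
#define TY 64

static size_t idx(size_t i, size_t j) { return i * NY + j; }

void smooth_tiled(const double *in, double *out)
{
    for (size_t ti = 1; ti < NX - 1; ti += TX)         /* loop over tiles */
        for (size_t tj = 1; tj < NY - 1; tj += TY)
            /* work within one tile: sized to fit an L1 cache or scratchpad */
            for (size_t i = ti; i < ti + TX && i < NX - 1; i++)
                for (size_t j = tj; j < tj + TY && j < NY - 1; j++)
                    out[idx(i, j)] = 0.25 * (in[idx(i - 1, j)] + in[idx(i + 1, j)]
                                           + in[idx(i, j - 1)] + in[idx(i, j + 1)]);
}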

 

10.     D. Unat, C. Chan, W. Zhang, S. Williams, J. Bachan, J. Bell, and J. Shalf, “ExaSAT: An Exascale Co-Design Tool for Performance Modeling,” International Journal of High Performance Computing Applications (IJHPCA).

Status: Under revision

Abstract: One of the emerging challenges in designing HPC systems is to understand and project the requirements of exascale applications. In order to determine the performance consequences of different hardware designs, analytical models are essential because they can provide fast feedback to the co-design centers and chip designers without costly simulations. However, current attempts to analytically model program performance typically rely on the user manually specifying a performance model. We introduce the ExaSAT framework that automates the extraction of parameterized performance models directly from source code using compiler analysis. The parameterized analytic model enables quantitative evaluation of a broad range of hardware design trade-offs and software optimizations on a variety of different performance metrics, with a primary focus on data movement as a metric. We demonstrate the ExaSAT framework’s ability to perform deep code analysis of a proxy application from the DOE Combustion Co-design Center to illustrate its value to the exascale co-design process. ExaSAT analysis provides insights into the hardware and software tradeoffs and lays the groundwork for exploring a more targeted set of design points using cycle-accurate architectural simulators.
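
Code sketch: The flavor of such an analytic model can be conveyed with a back-of-the-envelope version: given per-sweep flop and byte counts extracted from the loop structure, the predicted time is the larger of the compute time and the data-movement time on a candidate machine. The code below is a deliberately simplified stand-in for illustration; the structure, parameter names, and numbers are hypothetical and do not reproduce ExaSAT's actual model.

/* Toy parameterized performance model (illustrative numbers only). */
#include <stdio.h>

typedef struct { double peak_gflops, dram_gbs; } machine_t;  /* co-design knobs */
typedef struct { double flops, bytes; } kernel_t;            /* from static analysis */

static double predict_seconds(kernel_t k, machine_t m)
{
    double t_compute = k.flops / (m.peak_gflops * 1e9);
    double t_memory  = k.bytes / (m.dram_gbs   * 1e9);
    return t_compute > t_memory ? t_compute : t_memory;      /* bound by the slower resource */
}

int main(void)
{
    kernel_t  stencil = { .flops = 8e9, .bytes = 16e9 };        /* hypothetical sweep */
    machine_t a = { .peak_gflops = 1000.0, .dram_gbs = 100.0 }; /* compute-rich design */
    machine_t b = { .peak_gflops =  500.0, .dram_gbs = 400.0 }; /* bandwidth-rich design */
    printf("design A: %.3f s, design B: %.3f s\n",
           predict_seconds(stencil, a), predict_seconds(stencil, b));
    return 0;
}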

 

11.  D. Unat, C. Chan, S. Williams, D. Quinlan, P. McCormick, S. Pakin, J. Shalf, “Compiler-Based Approach for Automated Performance Modeling,” Workshop on Modeling & Simulation of Exascale Systems & Applications, Sept. 2013.

Status: published

Abstract: Compiler-driven performance analysis (CDA) tools would enable rapid exploration of hardware and software design impacts and help bridge the gap between application developers and hardware designers. Examples of existing CDA tools include Byfl, Pbound, and ExaSAT. The CDA approach has the following main features: 1) The CDA-generated model extracts key application characteristics such as data movement, arithmetic intensity, and degree of parallelism in a hardware-independent fashion. 2) The model is parameterized to help estimate code performance on different hardware. 3) The model allows exploring the impact of source code transformations. 4) The model is generated automatically by the compiler, and is thus less labor-intensive and more easily applied to large codes. 5) The model is lightweight and can be integrated into a source-to-source compiler, an auto-tuner, or an adaptive runtime system. This approach would enable tools to reason about the benefits of software and hardware optimizations and provide feedback on performance bottlenecks.

 

12.  J. Shalf and P. Kogge, “Exascale Computing Trends: Adjusting to the New Normal in Computer Architecture,” Computing in Science & Engineering, September 2013.

Status: Published

Abstract: We now have 20 years of data on the performance of supercomputers against at least a single floating-point benchmark from dense linear algebra. Until about 2004, a single model of parallel programming, bulk synchronous using the MPI model, was for the most part sufficient to permit translation of this into reasonable parallel programs for more complex applications. Starting in 2004, however, a confluence of events changed forever the architectural landscape that underpinned MPI. The first part of this paper goes into the underlying reasons for these changes and what they mean for system architectures. Next, we describe an Abstract Machine Model that clarifies the current technology roadmaps leading towards exascale. Finally, the paper addresses what this means going forward for our standard scaling models and the profound implications for programming and algorithm design on future systems.

 

 

13.  M. Jung, W. Choi, J. Shalf, M.T. Kandemir, “Triple-A: A Non-SSD Based Autonomic All-Flash Array for High Performance Storage Systems,” International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014, pp. 441-454.

Status: Published

 

14.  M. Jung, J. Shalf, M.T. Kandemir, “Design of a Large-Scale Storage-Class RRAM System,” International Conference on Supercomputing (ICS), 2013, pp. 103-114.

Status: Published

 

15.  M. Jung, E.H. Wilson, W. Choi, J. Shalf, H.M. Aktulga, C. Yang, E. Saule, Ü.V. Çatalyürek, M.T. Kandemir, “Exploring the Future of Out-of-Core Computing with Compute-Local Non-Volatile Memory,” International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2013, Article 75.

Status: Published

 

16.  M. Jung, E.H. Wilson, D. Donofrio, J. Shalf, M.T. Kandemir, “NANDFlashSim: Intrinsic Latency Variation Aware NAND Flash Memory System Modeling and Simulation at Microarchitecture Level,” IEEE Symposium on Mass Storage Systems and Technologies (MSST), 2012, pp. 1-12.

Status: Published

 

 

*** Papers under preparation (all information tentative) ***

 

17.  J. Bachan, G. Michelogiannakis, J. Shalf, “Scaling Study of Cache Coherency for Future Exascale Processor Chips” < in preparation >

 

18.  G. Michelogiannakis, J. Shalf, “Bandwidth Adaptivity in Future Large-Scale Networks” < in preparation >

 

19.  G. Michelogiannakis, J. Shalf, “Collective Prefetching for Maximizing Memory Bandwidth” < in preparation >