SC22 BoF131: Disaggregated Heterogeneous Architectures

The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC22), Nov 13–18, 2022, Dallas, Texas. Link: https://sc22.supercomputing.org/
BoF Session 131. Schedule: Wednesday, November 16, 12:15–13:15

Abstract

This BoF will be a forum to discuss the most recent research topics around disaggregated heterogeneous architectures, their operation, and their use. “Disaggregated”, also known as “modular supercomputing”, refers to a system-level architecture in which heterogeneous resources are organised in partitions or modules, each with a different type of node configuration. This approach is gaining traction in the HPC landscape, with Perlmutter, LUMI and JUWELS being just some examples. The BoF discusses the challenges seen by operators, vendors, developers of system software, programming models and tools, as well as by application developers adapting their codes to make use of such machines.

Details

Today’s HPC systems are highly heterogeneous machines combining different processor, network, memory, and storage technologies. This diversification is expected to grow further with the integration of disruptive technologies, such as AI accelerators, neuromorphic devices, or even quantum computers. Orchestrating and using this hardware zoo poses enormous challenges. System developers and operators require scalable ways to interconnect the different technologies, advanced scheduling and management techniques, and I/O and data management mechanisms to deal with increasingly data-intensive workflows. Users, for their part, need methods to transfer data efficiently between compute, memory and storage elements, and strategies for programming thousands of devices with partially different instruction set architectures and vendor-specific environments.

The exact manifestation of the above challenges depends on how the hardware resources are organised at system level. Some experts advocate a monolithic approach in which all nodes are equal, each node containing a variety of computing elements. Others go in exactly the opposite direction and segregate the resources at system level, grouping the different types into partitions or modules. This latter category is the focus of this BoF.

“Disaggregated”, also known as “modular supercomputing”, refers to a system-level architecture in which heterogeneous resources are organised in partitions or modules, each with a different type of node configuration. This approach is gaining traction in the HPC landscape, with Perlmutter, LUMI and JUWELS being just some examples. This BoF will be a forum to discuss the most recent research topics around disaggregated heterogeneous architectures, their operation, and their use. Discussions will include the challenges seen by operators, vendors, developers of system software, programming models and tools, as well as by application developers adapting their codes to make use of such machines.
The target audience comprises HPC centres operating or planning to deploy modular/disaggregated supercomputers, vendors building them, storage and network administrators, developers of system software, programming models and tools that address system-level heterogeneity, and application developers who are adapting their codes to make use of such machines. The panel of speakers represents these sectors and will raise their respective challenges.

Questions & Discussion

This BoF is a forum for open and constructive discussion. We encourage every participant to speak up directly during the session. If you would rather not speak in public, you can post your questions in this Google Doc, and our moderator will pose them for you:

https://docs.google.com/document/d/1xxmwy4XmPhK7tpQqNNWeTeHy-X2mlfTWDaCQeOidxTU/edit#

Session Format

  • Pitch presentations: 
    • 5 min: motivation for heterogeneous disaggregation (Nick Wright, NERSC)  
    • 5 min: main challenges from operation perspective (Kengo Nakajima, Univ. Tokyo and RIKEN R-CCS)
    • 5 min: main challenges for data management (Philippe Deniel, CEA)
    • 5 min: main challenges for programming and use (Anshu Dubey, ANL)
  • 40 min moderated discussion between panel and audience. 

Panel Members

Moderator: Maike Guilliot (CEA)

Nick Wright
(NERSC)

Kengo Nakajima
(Univ. Tokyo)

Philippe Deniel
(CEA)

Anshu Dubey
(ANL)

Organizers

Estela Suarez
(Forschungszentrum Juelich, University of Bonn)

Philippe Deniel
(CEA)

Kengo Nakajima
(Univ. of Tokyo and RIKEN R-CCS)

Anshu Dubey
(Argonne National Laboratory and University of Chicago) 

Said Derradi
(Bull Atos)

Nick Wright
(Lawrence Berkeley National Laboratory and NERSC) 

EuroHPC project cooperation: DEEP-SEA, IO-SEA, RED-SEA 

To register for this event please visit the following URL: https://sc22.supercomputing.org/attend/registration/