Internet-Draft HP-WAN STATE OF ART October 2024
King, et al. Expires 10 April 2025
Workgroup:
Network Working Group
Internet-Draft:
draft-kcrh-state-of-art-hpwan-00
Published:
Intended Status:
Informational
Expires:
Authors:
D. King
Lancaster University
T. Chown
Jisc
C. Rapier
Pittsburgh Supercomputing Center
D. Huang
ZTE Corporation

Current State of the Art for High Performance Wide Area Networks

Abstract

High Performance Wide Area Networks (HP-WANs) represent a critical infrastructure for the modern global research and education community, facilitating collaboration across national and international boundaries. These networks, such as Janet, ESnet, GÉANT, Internet2, CANARIE, and others, are designed to support not only the general needs of the research and education users they serve but also the transmission of vast amounts of data generated by scientific research, high-performance computing, distributed AI training, and large-scale simulations.

This document provides an overview of the terminology and techniques used for existing HP-WANs. It also explores the technological advancements, operational tools, and future directions for HP-WANs, emphasising their role in enabling cutting-edge scientific research, big data analysis, AI training and massive industrial data analysis.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 10 April 2025.

Table of Contents

1. Introduction

High Performance Wide Area Networks (HP-WANs) are the backbone of global research and education infrastructure, enabling the seamless transfer of vast amounts of data and supporting advanced scientific collaborations worldwide. These networks are designed to meet the demanding requirements of data-intensive research fields, including high-energy physics, climate modeling, genomics, and artificial intelligence.

The evolution of HP-WANs is deeply intertwined with the growing need for advanced scientific research and the increasing globalisation of collaboration. Traditional WANs, which were sufficient for general business and communication needs, quickly became inadequate for the specialised requirements of research institutions. As scientific endeavours began to generate larger datasets, ranging from terabytes to petabytes, there arose a need for networks capable of transferring these massive volumes of data reliably and securely across large distances.

The first HP-WANs emerged as specialised research networks, such as ESnet in the United States, Janet in the UK, and GÉANT in Europe, developed to support the unique needs of the scientific community. These networks were designed to provide high bandwidth and ensure low latency, high reliability, and robust security, which are critical for applications like real-time data analysis, distributed computing, and remote instrumentation.

Today, HP-WANs are foundational to the research community and are leading the way in demonstrating how advanced networking technologies can be applied to other sectors. They serve as testbeds for innovations in networking that eventually trickle down to broader commercial applications. As we look toward the future, HP-WANs will continue to play a critical role in enabling scientific discoveries and fostering international collaboration, particularly as emerging technologies such as quantum computing and the Internet of Things (IoT) push the boundaries of what these networks must support.

This document explores the current state of the art in HP-WANs, examining the technological advancements, operational challenges, and emerging trends shaping the future of networks built for research, education, massive data analysis and collaborative AI training at scale and speed. Through this exploration, we aim to provide a better understanding of the current state of the art in high performance computing across wide area networking.

1.1. Background

[Editor's note - to add a historical development of HP-WANs description.]

[Editor's note - to add description of the role of HP-WANs in supporting scientific research and education.]

2. Terminology

This section provides a lexicon of terminology relating to high performance WANs.

CERN:
The European Organization for Nuclear Research, housing the Large Hadron Collider (LHC).
High Performance Computing (HPC):
A general term for computing with a high level of performance. It often refers specifically to running highly parallel jobs, often on hundreds or even thousands of cores.
High Performance Wide Area Network (HP-WAN):
A type of Wide Area Network (WAN) designed specifically to meet the high-speed, low-latency, and high-capacity needs of scientific research, education, and data-intensive applications. These networks connect research institutions, universities, and data centers across large geographical areas.
Infiniband:
Traditionally, a localised data interconnect used by many high performance computing (HPC) systems providing high bandwidth and low latency.
National Research and Education Network (NREN):
A specialised network supporting the research and education community within a specific country or region. NRENs provide high-speed connectivity and other services tailored to the needs of academic and research institutions.
Remote direct memory access (RDMA):
Enables one networked node to access another networked node's memory without involving either computer's operating system or interrupting either node's processing. This helps minimise latency and maximise throughput, reducing memory bandwidth bottlenecks.
RDMA over Converged Ethernet (RoCE):
Traditionally, a network protocol which allows remote direct memory access (RDMA) over a local Ethernet network. There are multiple RoCE versions. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol which means that RoCE v2 packets can be routed.
Worldwide LHC Computing Grid (WLCG):
A global network of over 170 computing centres across more than 40 countries, designed to process, store, and analyse the vast amounts of data generated by the Large Hadron Collider (LHC) at CERN.
Performance Service Oriented Network monitoring Architecture (PerfSONAR):
A network performance monitoring toolkit designed to provide end-to-end performance measurement and monitoring across multi-domain network infrastructures.
Science DMZ:
A model for deployment of infrastructure at a site (campus) to optimise the performance of data transfers in and out of data transfer nodes (DTNs) at the site – see https://fasterdata.es.net/science-dmz/. Elements of the model include the local network architecture, tuning of DTNs, choice of data transfer software, efficient security policy implementation and persistent monitoring.

3. Example Use Cases for HP-WANs

HP-WAN applications have become synonymous with large scale research and experimentation, big data, and AI. HPC, and therefore HP-WAN, is driving continuous innovation in use cases across the following industries.

The data rates required by HPC applications vary significantly based on the application type and data scale.

Scientific simulations, such as climate modeling and molecular dynamics, typically demand data rates from 10 Gbps to over 100 Gbps due to the large volumes of data processed and moved between nodes and storage systems.

In high-energy physics, such as experiments at CERN, data rates can reach hundreds of gigabits per second during intensive data processing, with aggregate peaks between sites currently exceeding 1 Tbps and predicted to rise to 10 Tbps.

Healthcare, Genomics, and Life Sciences might typically operate at rates between 1 Gbps and 40 Gbps. These applications require high throughput to handle large datasets efficiently, often through parallel data streams.

AI learning tasks, particularly those involving deep learning, require data rates ranging from 10 Gbps to 100 Gbps to ensure efficient data movement, keeping GPUs and other accelerators fully utilised.

These varying data rates underscore the high demands of HPC applications, which are expected to grow as the field evolves and datasets become larger.

4. Current Technologies Used in HP-WANs: Key Components

High Performance Computing (HPC) networks are specialised networks designed to connect supercomputers and other high-performance computing resources, enabling them to collaborate on computational tasks that require significant processing power, memory, and data storage. These networks are essential for facilitating large-scale scientific research, complex simulations, and data-intensive tasks beyond standard computing systems' capabilities.

The following sub-sections outline typical characteristics and requirements for HP-WANs. These technical requirements ensure that wide-area interconnects can meet the demanding needs of distributed HPC environments, enabling researchers and scientists to collaborate effectively across the globe.

4.1. Topology

HPC networks can be broadly categorised into intra-site networks, which connect components within a single HPC site, such as a data centre, and inter-site networks, which link multiple HPC sites across different geographical locations. Intra-site networks typically use high-speed, low-latency non-Internet interconnects like InfiniBand or high-speed Ethernet. In contrast, inter-site networks rely on dedicated high-capacity wide area networks (WANs) to facilitate distributed computing and data sharing on a regional and global scale.

Each NREN operator, e.g., Jisc in the case of Janet in the UK, will build and operate the NREN infrastructure for its research and education users. This may typically take the form of a well-provisioned backbone, with regional access networks extending to the end sites (campuses, research organisations, etc). The NREN demarcation is typically at the campus edge. In some countries the regional networks are separately operated.

The NRENs then typically have interconnects to other NRENs, forming a worldwide RE network infrastructure. In Europe, GÉANT provides connectivity between the European NRENs and then wider connectivity to the rest of the world. NRENs will also have other interconnects to non-RE networks, e.g., via one or more national IXs, direct peerings to content providers (including the big cloud providers) and then "catch-all" commodity connectivity via one or more Tier 1 ISPs.

Dedicated infrastructure is commonly used in HPC environments where performance, security, and reliability are paramount. In these cases, the network infrastructure is built exclusively for HPC applications, including dedicated fibre-optic connections, private data centres, and specialised network transport like RDMA over Converged Ethernet (RoCE) and InfiniBand nodes. The primary benefits of dedicated infrastructure are its ability to provide optimised performance for HPC tasks, ensure high levels of security by preventing unauthorised access, and maintain consistent reliability by avoiding congestion or performance issues caused by other network traffic.

Usually, the responsibility for networking within an end site or campus lies with that organisation, e.g., a university IT department, while the operation of an HPC facility may have dedicated (separate) staff. With the additional administrative domains of the NRENs and inter-NREN backbones like GÉANT, end-to-end traffic may pass through many networks operated by different organisations. To achieve optimal end-to-end performance, every operator along the path needs to implement best practice.

4.2. Bandwidth and Latency

The technical requirements for wide area interconnects between HPC sites are stringent, given the unique demands of distributed high-performance computing. High bandwidth is a primary requirement, as these interconnects must support the rapid transfer of large datasets between sites, ensuring that data movement does not become a bottleneck in computational workflows. HPC data flows might typically consume from 1 Gbit/s to beyond 400 Gbit/s.
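
As an illustration of why these rates matter, the following minimal sketch computes how long a dataset takes to move at a given sustained rate. The function name, the "efficiency" parameter, and the one-petabyte example size are assumptions made for illustration; they are not taken from any specific HP-WAN deployment.

```python
def transfer_time_seconds(dataset_bytes: float,
                          rate_bits_per_second: float,
                          efficiency: float = 1.0) -> float:
    """Return the time needed to move dataset_bytes at the given rate.

    efficiency is a hypothetical fraction (0, 1] of the line rate that the
    transfer actually sustains; 1.0 means the full rate is achieved.
    """
    if rate_bits_per_second <= 0 or not 0 < efficiency <= 1:
        raise ValueError("rate must be positive and efficiency in (0, 1]")
    return dataset_bytes * 8 / (rate_bits_per_second * efficiency)


PETABYTE = 10**15  # decimal petabyte, in bytes (illustrative dataset size)

for gbps in (1, 10, 100, 400):
    hours = transfer_time_seconds(PETABYTE, gbps * 10**9) / 3600
    print(f"{gbps:>3} Gbit/s: {hours:8.1f} hours per petabyte")
```

Under these assumptions, a petabyte takes roughly 22 hours at a sustained 100 Gbit/s, and about ten times longer at 10 Gbit/s.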

Low latency is equally critical for many HPC applications. Latency requirements for inter-DC locations will be in the low-millisecond range. This low latency is essential for applications that require real-time or near-real-time data processing.
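
Bandwidth and latency interact through the bandwidth-delay product (BDP): the amount of data that must be in flight to keep a path full. The sketch below is illustrative only; the round-trip times and the 100 Gbit/s rate are assumed values, not measurements from any network described in this document.

```python
def bandwidth_delay_product_bytes(rate_bits_per_second: float,
                                  rtt_seconds: float) -> float:
    """Bytes that must be in flight to fill a path of the given rate and RTT."""
    if rate_bits_per_second < 0 or rtt_seconds < 0:
        raise ValueError("rate and RTT must be non-negative")
    return rate_bits_per_second * rtt_seconds / 8


RATE = 100 * 10**9  # 100 Gbit/s, an assumed example rate

for rtt_ms in (1, 5, 10):  # assumed low-millisecond round-trip times
    bdp = bandwidth_delay_product_bytes(RATE, rtt_ms / 1000)
    print(f"RTT {rtt_ms:>2} ms: about {bdp / 10**6:.1f} MB in flight")
```

For a window-based transport such as TCP, a sender whose window or socket buffer is smaller than the BDP cannot fill the path, which is one reason host tuning matters even when the RTT is only a few milliseconds.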

4.3. Data Movement Protocols

Network-intensive applications like networked storage or cluster computing need a network infrastructure with high bandwidth and low latency.

These interconnects may need to support specialised communication protocols designed for HPC environments, such as Remote Direct Memory Access (RDMA) [RFC5040] [RFC7306], which optimises the performance of distributed HPC applications by reducing overhead and improving data transfer efficiency.

InfiniBand (IB) is another computer networking communications standard used in high-performance computing that features very high throughput and very low latency. InfiniBand is also used as either a direct or switched interconnect between servers and storage systems, as well as an interconnect between storage systems.

The advantages of RDMA and IB over other network application programming interfaces are lower latency, lower CPU load, and higher bandwidth. The downside of these specialised protocols is that all interfaces and nodes on the end-to-end path must support them.

iWARP is a computer networking protocol that implements remote direct memory access (RDMA) for efficient data transfer over Internet Protocol networks. Several IETF specifications define iWARP:

  • [RFC5040] A Remote Direct Memory Access Protocol Specification is layered over Direct Data Placement Protocol (DDP). It defines how RDMA Send, Read, and Write operations are encoded using DDP into headers on the network.
  • [RFC5041] Direct Data Placement over Reliable Transports is layered over MPA/TCP or SCTP. It defines how received data can be directly placed into an upper layer protocol's receive buffer without intermediate buffers.
  • [RFC5042] Direct Data Placement Protocol (DDP) / Remote Direct Memory Access Protocol (RDMAP) Security analyzes security issues related to iWARP DDP and RDMAP protocol layers.
  • [RFC5043] Stream Control Transmission Protocol (SCTP) Direct Data Placement (DDP) Adaptation defines an adaptation layer that enables DDP over SCTP.
  • [RFC5044] Marker PDU Aligned Framing for TCP Specification defines an adaptation layer that enables preservation of DDP-level protocol record boundaries layered over the TCP reliable connected byte stream.
  • [RFC6580] IANA Registries for the Remote Direct Data Placement (RDDP) Protocols defines IANA registries for Remote Direct Data Placement (RDDP) error codes, operation codes, and function codes.
  • [RFC6581] Enhanced Remote Direct Memory Access (RDMA) Connection Establishment fixes shortcomings with iWARP connection setup.
  • [RFC7306] Remote Direct Memory Access (RDMA) Protocol Extensions extends [RFC5040] with atomic operations and RDMA Write with Immediate Data.

4.4. Forwarding Optimisation

The scaling of HPC applications, especially across a WAN between multiple sites, requires the ability to route massive volumes of traffic. Specifically, this requires the network infrastructure to accommodate several traffic and forwarding characteristics, detailed below.

  • Low entropy: Compared to traditional data center workloads, the number and diversity of flows are smaller, and flow patterns are usually repetitive and predictable.
  • Burstiness: Flows usually exhibit an "on and off" nature at a time granularity of milliseconds.
  • Jumbo frames: Ethernet frames larger than the standard maximum transmission unit (MTU) size of 1,500 bytes, typically carrying payloads of up to 9,000 bytes. Using jumbo frames can significantly enhance network efficiency and reduce CPU overhead.
  • Elephant flows: For each burst, the intensity of each flow could reach up to the line rate of NICs.

It should be noted that efficiently handling these elephant flows is crucial in HPC as they can otherwise saturate network links, leading to congestion and reduced performance for other network traffic. Strategies to manage elephant flows effectively, such as prioritising these flows or segmenting network traffic, help maintain overall network performance and ensure that large data transfers do not hinder the execution of other critical tasks within the HPC environment.
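
The efficiency gain from jumbo frames listed above can be estimated from per-frame overhead. The following sketch assumes untagged Ethernet framing and IPv4 plus TCP headers without options; other encapsulations (VLAN tags, IPv6, tunnels) would change the constants.

```python
# Per-frame overhead on the wire, in bytes: preamble and start-of-frame
# delimiter (8), Ethernet header (14), frame check sequence (4), and the
# minimum inter-frame gap (12). Assumes no VLAN tag.
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12

# Assumed headers carried inside the MTU: IPv4 (20) + TCP (20), no options.
IP_TCP_HEADERS = 20 + 20


def payload_efficiency(mtu: int) -> float:
    """Fraction of wire capacity that carries application payload."""
    if mtu <= IP_TCP_HEADERS:
        raise ValueError("MTU must exceed the IP and TCP header size")
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)


def frames_per_second(mtu: int, rate_bits_per_second: float) -> float:
    """Full-size frames per second needed to fill the given line rate."""
    return rate_bits_per_second / ((mtu + ETHERNET_OVERHEAD) * 8)


for mtu in (1500, 9000):
    eff = payload_efficiency(mtu)
    fps = frames_per_second(mtu, 100 * 10**9)  # 100 Gbit/s, assumed rate
    print(f"MTU {mtu}: {eff:.1%} payload, {fps / 10**6:.2f} M frames/s")
```

Under these assumptions, payload efficiency rises from about 94.9% to about 99.1%, and the frame rate needed to fill a 100 Gbit/s link drops from roughly 8.1 million to roughly 1.4 million frames per second, which is where much of the CPU saving comes from.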

HPC transport options include IP (both UDP and TCP) and emerging mechanisms such as QUIC. Each transport technology has strengths and weaknesses. In all cases, the primary goal is the effective transmission of massive data sets with high throughput, low latency and jitter, and a low packet loss ratio.

4.5. Reliability and High Availability

In HPC networks, the resilience of the data stream is important due to the critical need for precise, high-speed data transfer. These networks must maintain continuous data flow to support large-scale computations, where even minor interruptions or packet loss can severely impact performance, causing delays or incorrect results. Therefore, resilience must be implemented to ensure the network can recover from disruptions without compromising speed or integrity.

For retransmission and lossless data transfer, HPC networks must have mechanisms to handle data loss efficiently. They must quickly retransmit lost or corrupted packets while maintaining a seamless data flow to avoid performance degradation. The requirement for lossless communication is essential to meet the needs of scientific computations, simulations, and data-intensive tasks.
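
To illustrate why even small loss rates matter over wide-area paths, the sketch below uses the widely cited Mathis et al. approximation for a single loss-based TCP flow, throughput <= (MSS / RTT) * (C / sqrt(p)) with C = sqrt(3/2). This model is not taken from this document, and it does not describe BBR or RDMA-based transports; the parameter values are assumptions chosen only to show the trend.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_seconds: float,
                          loss_probability: float) -> float:
    """Approximate upper bound on single-flow loss-based TCP throughput.

    Implements rate <= (MSS / RTT) * (C / sqrt(p)) with C = sqrt(3/2).
    """
    if rtt_seconds <= 0 or not 0 < loss_probability <= 1:
        raise ValueError("RTT must be positive and loss in (0, 1]")
    c = math.sqrt(3 / 2)
    return (mss_bytes * 8 / rtt_seconds) * (c / math.sqrt(loss_probability))


# Assumed example: jumbo-frame MSS of 8960 bytes on a 10 ms RTT path.
for loss in (1e-6, 1e-5, 1e-4):
    gbps = mathis_throughput_bps(8960, 0.010, loss) / 10**9
    print(f"loss {loss:.0e}: at most about {gbps:.2f} Gbit/s per flow")
```

In this model, throughput falls with the square root of the loss rate and linearly with RTT, so a loss rate that is harmless on a campus LAN can dominate performance on a long-distance path.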

High availability and redundancy are also essential to prevent data loss and ensure continuous operation, especially given that HPC tasks often run for extended periods and involve critical research. These networks must also incorporate advanced security measures, including encryption and secure access controls, to protect the often sensitive or classified data being transmitted.
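
The effect of redundancy on availability can be approximated with a simple model. The sketch below assumes that path failures are independent, which real networks often violate (for example, where fibres share a duct), and the availability figures are illustrative values rather than measurements.

```python
from math import prod
from typing import Iterable


def parallel_availability(path_availabilities: Iterable[float]) -> float:
    """Availability of a service that is up if at least one path is up.

    Assumes independent path failures: 1 - product(1 - a_i).
    """
    values = list(path_availabilities)
    if not values or any(not 0 <= a <= 1 for a in values):
        raise ValueError("provide at least one availability in [0, 1]")
    return 1 - prod(1 - a for a in values)


# Illustrative only: one, two, and three paths, each 99% available.
for n in (1, 2, 3):
    print(f"{n} path(s): {parallel_availability([0.99] * n):.6f}")
```

Under the independence assumption, each additional path multiplies the residual unavailability by that path's own unavailability.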

4.6. Quality of Service

The network should support Quality of Service (QoS) mechanisms to prioritise traffic, ensuring that critical HPC tasks receive the necessary bandwidth and low-latency performance.

An approach may be needed to enable applications to request specific bandwidth or latency guarantees, ensuring that high-priority tasks receive the resources they require.

Differentiated Services (Diffserv) offers a flexible method to manage traffic prioritization without the need for an explicit request-and-grant process. Diffserv operates by marking packets with different levels of priority, allowing the network to prioritize and protect access to capacity for critical tasks. This approach may be useful in HPC environments where dynamic traffic patterns require adaptive resource management.
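
As a minimal sketch of how an application might mark its own traffic, the following code sets the Differentiated Services Code Point (DSCP) on a UDP socket using the standard sockets API. The helper names are hypothetical, and the behaviour of the IP_TOS option varies by operating system. Which code point to use, and whether networks along the path honour or rewrite it, is a matter of operator policy that this example does not address.

```python
import socket


def dscp_to_tos(dscp: int) -> int:
    """Convert a 6-bit DSCP value into the 8-bit IPv4 TOS / DS field byte.

    The DSCP occupies the upper six bits; the lower two bits are ECN.
    """
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp << 2


def open_marked_udp_socket(dscp: int) -> socket.socket:
    """Return an IPv4 UDP socket whose outgoing packets carry the DSCP.

    Hypothetical helper; relies on socket.IP_TOS being supported locally.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock
```

For example, open_marked_udp_socket(46) would request the Expedited Forwarding code point (DSCP 46, DS field byte 184) on platforms that support the option.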

4.7. Congestion Control

Congestion control mechanisms ensure that data transfers between nodes and across networks are efficient and do not overwhelm the HPC network infrastructure. By managing and regulating the flow of data, congestion control mechanisms help prevent bottlenecks, reduce latency, and maintain high throughput, which are essential for the performance and reliability of HPC applications that require the rapid movement of large volumes of data across distributed systems.

Depending on the transport technology used in the HPC environment, several congestion control schemes may be used:

  • InfiniBand Congestion Control
  • RDMA-based Data Center Quantized Congestion Notification (DCQCN)
  • TCP-based Bottleneck Bandwidth and Round-Trip Time (BBRv3)
  • eXplicit Control Protocol (XCP)

4.8. Performance Monitoring

End-to-end performance measurement and monitoring across multi-domains and network infrastructures are important in HPC environments. They provide a method to diagnose and troubleshoot network performance issues that can affect data-intensive applications and distributed computing tasks commonly found in HPC.

PerfSONAR is a commonly used network measurement toolkit designed to provide federated coverage of network paths. It provides an interface for scheduling measurements, storing data, and generating visualisations.

4.9. Scalability

Scalability is another crucial aspect, allowing the network to expand efficiently as computational needs grow, accommodating additional sites or increased capacity without significant reconfiguration. Interoperability is also necessary, ensuring that the network can communicate seamlessly across different types of hardware, software, and protocols used at various HPC sites.

4.10. Resource Scheduling

[Editor's Note - Do we need to discuss service and resource scheduling?]

5. Examples of HP-WANs

The following sub-sections highlight examples of HP-WANs and their technical specifications.

5.1. GÉANT

The GÉANT network is a pan-European data network dedicated to research and education, providing high-speed, high-capacity connectivity across Europe, between European NRENs and to other worldwide NRENs. It is an essential infrastructure for HPC applications, enabling collaboration and data sharing among research institutions, universities, and HPC centers across the continent and beyond.

The core of GÉANT operates at speeds of up to 600 Gbps, using Dense Wavelength Division Multiplexing (DWDM) technology. This provides connectivity suitable for HPC applications, particularly those involving large-scale simulations, scientific research, and real-time data processing. Reliability is provided by using multiple optical underlay paths for data to travel between GÉANT nodes. This design ensures high availability, which is crucial for the continuous operation of HPC environments.

The GÉANT network integrates PerfSONAR for real-time network performance monitoring and reporting of IP performance metrics [RFC6703], allowing HPC users to detect and troubleshoot potential issues that could impact data transfer and overall performance. This ensures that the high-performance requirements of HPC applications are met consistently across the network.

GÉANT provides specialized services for specific HPC projects, such as the LHC Optical Private Network (LHCOPN) and LHC Open Network Environment (LHCONE), which are critical for supporting the data-intensive needs of the Large Hadron Collider (LHC) at CERN. These services offer dedicated, high-bandwidth connections that are optimised for the massive data flows generated by LHC experiments.

The GÉANT network connects over 50 million users across more than 10,000 institutions in 40 countries. This extensive reach supports a wide range of HPC applications by enabling seamless collaboration between geographically dispersed research facilities. Beyond Europe, GÉANT connects to other major research and education networks, including Internet2 in the United States and CANARIE in Canada, allowing for global HPC collaborations and data exchanges.

5.2. Janet

The Janet network is the UK NREN, operated by Jisc. Janet was first established in 1984; its backbone links now run at up to 800 Gbps, with a growing number of sites connected at 100 Gbps, in some cases with multiple 100G links. A typical university site will have multiple 10G links.

Janet connects to other RE networks via a 400G resilient link to GÉANT. It has a presence in multiple IXs, predominantly LINX, connects/peers directly to many content and cloud providers, and has commodity connectivity via Tier 1 ISPs. The total aggregate external capacity is around 4-5 Tbit/s.

Some private, dedicated optical links are used by Janet sites, e.g., the CERN to RAL (UK Tier 1 site) LHCOPN link, which is a 200G path.

To be discussed.

7. IANA Considerations

This document makes no requests for action by IANA.

8. Security Considerations

The security requirements for HPC networks, particularly in inter-data center scenarios, are crucial to ensuring the integrity, confidentiality, and availability of sensitive data and computational resources. These requirements are stringent due to the high-value and often sensitive nature of the data processed within HPC systems, such as research data in fields like national defense, pharmaceuticals, and climate science.

9. Acknowledgements

This document was in part motivated by the discussion occurring on the IETF hp-wan@ietf.org mailing list.

The authors would like to thank Gorry Fairhurst for his review and suggestions.

10. Normative References

11. Informative References

[RFC5040]
Recio, R., Metzler, B., Culley, P., Hilland, J., and D. Garcia, "A Remote Direct Memory Access Protocol Specification", RFC 5040, DOI 10.17487/RFC5040, <https://www.rfc-editor.org/info/rfc5040>.
[RFC5041]
Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct Data Placement over Reliable Transports", RFC 5041, DOI 10.17487/RFC5041, <https://www.rfc-editor.org/info/rfc5041>.
[RFC5042]
Pinkerton, J. and E. Deleganes, "Direct Data Placement Protocol (DDP) / Remote Direct Memory Access Protocol (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, <https://www.rfc-editor.org/info/rfc5042>.
[RFC5043]
Bestler, C., Ed. and R. Stewart, Ed., "Stream Control Transmission Protocol (SCTP) Direct Data Placement (DDP) Adaptation", RFC 5043, DOI 10.17487/RFC5043, <https://www.rfc-editor.org/info/rfc5043>.
[RFC5044]
Culley, P., Elzur, U., Recio, R., Bailey, S., and J. Carrier, "Marker PDU Aligned Framing for TCP Specification", RFC 5044, DOI 10.17487/RFC5044, <https://www.rfc-editor.org/info/rfc5044>.
[RFC6580]
Ko, M. and D. Black, "IANA Registries for the Remote Direct Data Placement (RDDP) Protocols", RFC 6580, DOI 10.17487/RFC6580, <https://www.rfc-editor.org/info/rfc6580>.
[RFC6581]
Kanevsky, A., Ed., Bestler, C., Ed., Sharp, R., and S. Wise, "Enhanced Remote Direct Memory Access (RDMA) Connection Establishment", RFC 6581, DOI 10.17487/RFC6581, <https://www.rfc-editor.org/info/rfc6581>.
[RFC6703]
Morton, A., Ramachandran, G., and G. Maguluri, "Reporting IP Network Performance Metrics: Different Points of View", RFC 6703, DOI 10.17487/RFC6703, <https://www.rfc-editor.org/info/rfc6703>.
[RFC7306]
Shah, H., Marti, F., Noureddine, W., Eiriksson, A., and R. Sharp, "Remote Direct Memory Access (RDMA) Protocol Extensions", RFC 7306, DOI 10.17487/RFC7306, <https://www.rfc-editor.org/info/rfc7306>.

Authors' Addresses

Daniel King
Lancaster University
Tim Chown
Jisc
Chris Rapier
Pittsburgh Supercomputing Center
Daniel Huang
ZTE Corporation