Workgroup: Transport Area Working Group
Internet-Draft: draft-yao-tsvwg-cco-requirement-and-analysis-00
Published: October 2023
Intended Status: Informational
Expires: 25 April 2024
Authors:
K. Yao, China Mobile
S. Xu, China Mobile
Y. Li, Huawei Technologies
H. Huang, Huawei Technologies
D. Kutscher, HKUST (Guangzhou)

Collective Communication Optimization: Requirement and Analysis

Abstract

As mentioned in [CCO PS & USECASE], the most fundamental reason existing protocols cannot meet the high-performance requirements of collective communication is that these distributed applications are not co-designed with the underlying network protocols. There is a semantic gap between inter-process message transportation and packet forwarding, which should be bridged by efficient mapping and optimization.

This draft further presents the technical requirements on how collective communication optimization should be designed, and discusses and analyzes several pieces of related work.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 25 April 2024.


1. Introduction

Optimizing collective communication originates from parallel computing, which was initially confined to High-Performance Computing (HPC). However, with the development of distributed applications, especially distributed Artificial Intelligence (AI), collective communication has become a common inter-process communication pattern that applies to various scenarios, and the use cases are still evolving rapidly: from clusters in controlled environments, to campus networks, and even to the open Internet, as in Federated Learning.

Collective communication describes the behavior of applications, which is currently not co-designed with network protocols. Specific requirements on protocol design should be raised to guide standardization. There is other work related to this topic, but its scope is either too broad or limited to single applications. This document focuses on optimizing collective communication from the network perspective. Solutions may first be introduced in limited domains [RFC8799]; we also consider how to extend the work to Internet-scale applications.

2. Existing Work and Analysis

In this section, we elaborate on work related to collective communication optimization. This work comes from IETF working groups, research groups, open source projects, and industry practice. We also provide some analysis to help explain why collective communication optimization merits further investigation and study in the IETF.

2.1. COIN

According to the charter of the Computing in the Network (COIN) research group, its scope is to investigate how network data plane programmability can improve the Internet architecture. The use cases that COIN defines are relatively broad, including network function offloading, machine learning acceleration, in-network caching, and in-network control. What "computing" means here is very generic and is highly coupled to the programmability of the chips. Collective communication offloading is briefly mentioned as one of the use cases of COIN. Nevertheless, optimizing collective communication not only leverages programmable chips to perform collective operations in network devices, but also includes potential standardization work such as communication protocol enhancement and application interface design.

2.2. SHArP

Scalable Hierarchical Aggregation Protocol (SHArP) [SHArP] is a proprietary solution for optimizing collective communication, owned by Nvidia and realized in its commodity switches. SHArP breaks the end-to-end transport rule by implementing a Target Channel Adapter (TCA) in switches. The TCA supports both Reliable Connection (RC) transport, to enable reliable delivery of data through the aggregation tree, and Unreliable Datagram (UD) transport, to enable multicast distribution of the aggregation result. As stated, SHArP is based on the InfiniBand architecture. Currently it cannot interoperate with other network architectures, which limits its applicability.
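The hierarchical aggregation idea can be illustrated with a small sketch (this is an abstract model of tree-based reduction, not SHArP's actual wire protocol or API; the function name and fanout parameter are our own):

```python
from functools import reduce

def tree_aggregate(values, fanout=2, op=lambda a, b: a + b):
    """Model of hierarchical in-network aggregation: at each level,
    groups of up to `fanout` children are reduced by a switch node,
    so only one partial result per group travels upward instead of
    every worker's full contribution."""
    level = list(values)
    hops = 0  # switch levels traversed by the aggregation
    while len(level) > 1:
        level = [reduce(op, level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
        hops += 1
    return level[0], hops

# 8 workers with fanout 2: the sum is computed over 3 switch levels,
# and each link carries one partial result rather than 8 values.
total, levels = tree_aggregate([1, 2, 3, 4, 5, 6, 7, 8])
```

The reduction in per-link traffic grows with the number of workers, which is the main motivation for moving the reduction into the aggregation tree.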

2.3. Relations with IP multicast

As analyzed in [CCO PS & USECASE], existing IP multicast protocols have not been fully exploited by distributed application systems that require collective communication. Most collective operations still use unicast transmission, which incurs significant overhead in network bandwidth and host CPU consumption. Existing IP multicast forwards packets based on the destination multicast IP address. It is not yet clear how group-based collective communications can leverage multicast to improve performance and save bandwidth.

2.4. Unified Communication X

Unified Communication X (UCX) defines a set of network APIs and their implementation for high-throughput computing. It is oriented towards programming models that need collective communication. The core objective of UCX is to design a framework for collective communication implementation and to address cross-platform functionality and performance portability challenges: there are multiple efforts in the industry to optimize collective-based programming models, but they are either oriented to a specific model or limited to specific hardware, which creates a lot of duplicated work and does not scale well. UCX aims to address these issues by unifying network APIs across different network architectures, such as InfiniBand, RoCE, TCP/IP, and Cray. The goal is challenging because it introduces layers of abstraction that may impact overall system performance.

This draft, by contrast, is concerned with the optimization of collective communication itself, including the protocols that realize message transportation, multicast, and OAM. Cross-platform portability is not a major goal.

2.5. Other Considerations

It is necessary to discuss some other issues, such as where the optimization should be applied and who should operate it. This scope analysis helps situate the new technique in the broader landscape.

2.5.1. Applicable Domain

Since it is debatable whether the end-to-end transport mechanism should be broken to optimize collective communication, there are different technical roadmaps. On the one hand, SHArP is representative of breaking end-to-end transport, but it is based on InfiniBand. On the other hand, transport-transparent solutions like [NetReduce] do not break the end-to-end rule and can be host-friendly, giving them better scalability and interoperability for Internet deployment. However, further investigation is needed into which collective operations beyond Allreduce could be accelerated. Considering that optimizing collective communication may introduce additional overhead in transport upgrades, and that there are security side-effects as stated in [CCO PS & USECASE], collective optimization may first be introduced into limited domains and then extended to the Internet.

2.5.2. Operational considerations

Use cases like AI model training, distributed storage, and big data analysis usually need infrastructure deployed in clusters operated by single entities. In this case, not only the compute and network infrastructure but also the application may be owned by a single service provider. These use cases are typically performance-driven, which means the application and the infrastructure need to be co-designed to reach optimal performance. However, applications need not be co-designed with the underlying network protocols case by case: as long as consensus can be reached across vendors on the definition and realization of the collective operations to be offloaded, such as unified primitives for implementing collective communication, applications can rely on a standardized northbound API to improve performance, even when the applications do not belong to the same service provider [NetRPC].

3. Requirements

3.1. Improve Transport Efficiency

Distributed parallel computing needs high-performance inter-process communication. Remote Memory Access (RMA) reduces the number of memory copies, which can improve the performance of distributed systems.

R1. Transport layer MUST support RMA function.

R2. Memory sharing is RECOMMENDED to support collective operations.
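The memory-sharing idea behind R1 and R2 can be sketched with Python's standard shared-memory facility (a single-host stand-in only: real RMA is one-sided access across hosts over the transport, but the zero-copy access pattern it enables is the same):

```python
from multiprocessing import shared_memory

# A producer writes into a named shared segment; a consumer attaches
# to the same segment by name and reads the data in place, without
# the serialize-send-receive-deserialize copies of message passing.
seg = shared_memory.SharedMemory(create=True, size=16)
try:
    seg.buf[:5] = b"hello"                 # "producer" writes in place

    peer = shared_memory.SharedMemory(name=seg.name)  # "consumer" attaches
    data = bytes(peer.buf[:5])             # direct read, no extra copy
    peer.close()
finally:
    seg.close()
    seg.unlink()
```

Eliminating such copies on the critical path is where RMA-capable transports gain most of their advantage for large collective payloads.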

3.2. Transport Suitability for Different Communication Modes

The communication modes and traffic characteristics of different applications in parallel computing vary, so the transport layer capabilities they require differ as well. The transport layer needs to adapt to these different communication modes. In particular, once collective operation offloading is supported, the communication modes expand from end-to-end to also include end-to-network, which means supporting in-network processing/computing functions.

R3. The implementation of blocking or non-blocking communication of applications MUST be adjusted according to different communication modes.

R4. The transport layer MUST provide appropriate reliability mechanisms to adapt to different communication modes.
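The blocking/non-blocking distinction in R3 can be sketched as follows (a thread-based toy, not a real transport; `allreduce_stub` is a hypothetical stand-in for a collective call, mirroring the blocking-call vs. handle-plus-wait pattern of MPI_Allreduce vs. MPI_Iallreduce):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def allreduce_stub(chunk):
    """Stand-in for a collective operation; just sums a chunk."""
    time.sleep(0.01)  # pretend network latency
    return sum(chunk)

pool = ThreadPoolExecutor(max_workers=2)

# Blocking mode: the caller waits until the collective completes.
blocking_result = allreduce_stub([1, 2, 3])

# Non-blocking mode: the call returns a handle immediately, the
# application overlaps local computation with communication, and
# completion is claimed explicitly later.
future = pool.submit(allreduce_stub, [1, 2, 3])
local_work = sum(range(100))        # overlapped computation
nonblocking_result = future.result()  # explicit completion point
pool.shutdown()
```

Which mode is appropriate depends on whether the application can make progress while the collective is in flight, which is why R3 ties the choice to the communication mode.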

3.3. End-network Collaborative Collective Communication

Collective operation offloading utilizes network devices to perform tasks with modest computational-precision requirements but high I/O, thereby optimizing collective operations. Network devices then not only route and forward packets but also need to process transport layer messages. The transport layer therefore needs to provide the mapping between packets and messages and to complete the delivery of transport layer messages. A transport layer protocol that supports collective communication optimization needs to meet the following requirements:

R5. The transport layer MUST carry messages that network devices can recognize to complete offloaded collective operations processing.

R6. The transport layer MUST support fallback mechanism, in case network devices are not sufficient for collective operations offloading.
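The fallback behavior required by R6 can be sketched as follows (all names here are hypothetical illustrations, not a proposed API; the point is only the control flow of attempting offload and degrading gracefully to host-based execution):

```python
class OffloadUnavailable(Exception):
    """Raised when no in-network device can take the collective."""

def in_network_allreduce(values, device_ready):
    # Hypothetical offload path: a switch with free aggregation
    # resources computes the reduction; otherwise it refuses.
    if not device_ready:
        raise OffloadUnavailable
    return sum(values)

def allreduce_with_fallback(values, device_ready):
    """R6 sketch: attempt offload, fall back to host-based reduction
    so the collective always completes with the same result."""
    try:
        return in_network_allreduce(values, device_ready), "offloaded"
    except OffloadUnavailable:
        return sum(values), "host-fallback"
```

Note that both paths must produce the same result; the fallback changes only where the reduction is computed, never its semantics.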

3.4. Management and Control

Current hardware implementations of offloading collective operations vary, and there is no unified standard for the supported in-network computing primitives, which poses great challenges to the management and control plane. In addition, as the network scale continues to expand, it is necessary to rely on topology-aware algorithms to complete computing-related task allocation, path planning, etc., in order to optimize the control effect.

R7. The management and control plane MUST support unified definition and management of the data types and data structures that network devices support for collective offloading, as well as unified management of network device resources such as memory.

R8. It is RECOMMENDED to achieve topology awareness, task scheduling, and allocation in collaboration between the end and network.
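One way to picture the unified management of R7 is a capability registry the control plane consults before placing a collective onto a device (the device names, fields, and function below are hypothetical illustrations, not a proposed data model):

```python
# Hypothetical per-device capability descriptions: supported reduction
# operators, data types, and available aggregation memory.
DEVICE_CAPS = {
    "switch-a": {"ops": {"sum", "max"}, "dtypes": {"int32", "float32"},
                 "agg_memory_bytes": 4 * 1024 * 1024},
    "switch-b": {"ops": {"sum"}, "dtypes": {"int32"},
                 "agg_memory_bytes": 1 * 1024 * 1024},
}

def eligible_devices(op, dtype, needed_bytes, caps=DEVICE_CAPS):
    """Return devices whose advertised capabilities cover this
    collective; the scheduler only considers these for offloading."""
    return sorted(
        name for name, c in caps.items()
        if op in c["ops"] and dtype in c["dtypes"]
        and c["agg_memory_bytes"] >= needed_bytes
    )
```

Without such a unified description, each vendor's devices would need bespoke scheduling logic, which is exactly the management-plane challenge this section identifies.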

3.5. Make Use of IP Multicast

IP multicast protocols provide native one-to-group communication mechanisms that are well suited for one-to-group and group-to-group collective operations. Even though designing IP multicast protocols is not within the scope of the transport and application areas, extending IP multicast protocols to support collective communication is necessary. Thus:

R9. IP multicast protocols SHOULD be extended to support collective communication. However, whether to design new multicast algorithms and protocols that are dedicated for collective communication is out of the scope.
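The bandwidth argument behind R9 can be made concrete with a first-order count of sender-side transmissions (a deliberately simplified model that ignores per-link replication inside the network):

```python
def sender_transmissions(group_size, multicast):
    """Packets the root must emit to deliver one result to a group:
    unicast sends one copy per receiver, while multicast lets the
    network replicate, so the root emits a single copy."""
    return 1 if multicast else group_size

# Distributing an aggregation result to 64 workers:
unicast_cost = sender_transmissions(64, multicast=False)
multicast_cost = sender_transmissions(64, multicast=True)
```

The gap widens linearly with group size, which is why one-to-group collective operations such as Broadcast are natural candidates for multicast-based distribution.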

3.6. Self-healing

Self-healing is required because a single in-network device may go out of service due to a single point of failure or a link breakdown. Therefore:

R10. The mechanism of choosing alternative node for implementing collective operations MUST be designed, to ensure system robustness and reliability.
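A minimal sketch of the alternative-node selection in R10 (the health map and selection policy below are illustrative assumptions; a real mechanism would also rebuild the aggregation tree and re-synchronize in-flight state):

```python
def choose_alternative(failed, candidates, health):
    """R10 sketch: when aggregation node `failed` goes down, pick a
    healthy replacement from the candidate list; returning None
    signals that the collective must fall back to host execution."""
    healthy = [n for n in candidates if n != failed and health.get(n)]
    return healthy[0] if healthy else None

health = {"sw1": False, "sw2": True, "sw3": True}
replacement = choose_alternative("sw1", ["sw1", "sw2", "sw3"], health)
```

Combined with the fallback requirement of R6, this ensures the collective completes even when no healthy in-network replacement exists.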

4. Security Considerations

Some security concerns have been described in [CCO PS & USECASE].

5. IANA Considerations

TBD.

6. References

6.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.
[RFC8799]
Carpenter, B. and B. Liu, "Limited Domains and Internet Protocols", RFC 8799, DOI 10.17487/RFC8799, July 2020, <https://www.rfc-editor.org/info/rfc8799>.

6.2. Informative References

[NetReduce]
Liu, S., "In-Network Aggregation with Transport Transparency for Distributed Training", DOI 10.1145/3582016.3582037, <https://doi.org/10.1145/3582016.3582037>.
[NetRPC]
Zhao, B., "NetRPC: Enabling In-Network Computation in Remote Procedure Calls".
[SHArP]
Graham, R. L., "Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction", DOI 10.1109/COMHPC.2016.006, <https://doi.org/10.1109/COMHPC.2016.006>.

Authors' Addresses

Kehan Yao
China Mobile
Beijing
100053
China
Shiping Xu
China Mobile
Beijing
100053
China
Yizhou Li
Huawei Technologies
Nanjing, Jiangsu
China
Hongyi Huang
Huawei Technologies
Beijing
China
Dirk KUTSCHER
HKUST (Guangzhou)
Guangzhou
China