Internet-Draft | Deploying AI services | July 2024 |
Hong, et al. | Expires 9 January 2025 | [Page] |
As AI technology has matured and begun to be applied in various fields, it is moving from running only on very high-performance servers to also running on small hardware, including microcontrollers, low-performance CPUs and AI chipsets. In this document, we consider how to configure the network and the system, in terms of AI inference service, to provide AI services in a distributed manner. We also describe the points to be considered in an environment where a client connects to a cloud server and an edge device and requests an AI service. Some use cases of deploying AI services in a distributed manner, such as self-driving cars and network digital twins, are described.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 9 January 2025.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
In the Internet of Things (IoT), the amount of data generated from IoT devices has exploded along with the number of IoT devices due to industrial digitization and the development and dissemination of new devices. Various methods are being tried to effectively process the explosively increasing IoT devices and data of IoT devices. One of them is to provide IoT services in a place located close to IoT devices and users, away from cloud computing that transmits all data generated from IoT devices to a cloud server [RFC9556].¶
IoT services also started to break away from the traditional method of analyzing IoT data collected so far in the cloud and delivering the analyzed results back to IoT objects or devices. In other words, AIoT (Artificial Intelligence of Things) technology, a combination of IoT technology and artificial intelligence (AI) technology, started to be discussed at international standardization organizations such as ITU-T. AIoT technology, discussed by the ITU-T CG-AIoT group, is defined as a technology that combines AI technology and IoT infrastructure to achieve more efficient IoT operations, improve human-machine interaction, and improve data management and analysis [CG-AIoT].¶
The first work started by the IETF to apply IoT technology to the Internet was research on a lightweight protocol stack, instead of the existing TCP/IP protocol stack, so that various types of IoT devices, not only traditional Internet terminals, could access the Internet [RFC6574][RFC7452]. These technologies have been developed by the 6LoWPAN working group, 6lo working group, 6TiSCH working group, CoRE working group, T2TRG research group, etc. As AI technology has matured and begun to be applied in various fields, just as IoT technology was mounted on resource-constrained devices and connected to the Internet, AI technology is also moving away from running only on very high-performance servers. It is being developed to run on small hardware, including microcontrollers, low-performance CPUs and AI chipsets. This direction of technology development is called On-device AI or TinyML [tinyML].¶
In this document, we consider how to configure the network and system, in terms of AI inference service, to provide AI services in the IoT environment. In the IoT environment, the technology of collecting sensing data from various sensors and delivering it to the cloud has already been studied by many standardization organizations including the IETF, and many standards have been developed. Now that an AI model can be created from the collected data to provide AI services, how to configure this AI model as a system has become the main research goal. Until now, it has been common to develop AI services that collect data and perform inference on the servers where the model was trained, but in terms of the spread of AI services, it is not appropriate to use expensive servers to provide AI services. In addition, since the server that collects data and trains models mainly exists in the form of a cloud server, many problems also arise when a large number of terminals connect to these cloud servers to request AI services. Therefore, requesting an AI service from an edge device located nearby, rather than from an AI server located in a distant cloud, may bring benefits such as real-time service support, network traffic reduction, and protection of important data [RFC9556].¶
Even if an edge device is used to serve AI services, it is still important to connect to an AI server in the cloud for tasks that take a lot of time or require a lot of data. Therefore, offloading techniques for properly distributing the workload between the cloud server and the edge device are also being actively studied. In this document, for the network structures proposed below, we derive and describe the points to be considered in an environment where a client connects to a server and an edge device and requests an AI service. That is, the following considerations and options can be derived.¶
AI inference service execution entity¶
Hardware specifications of the machine to perform AI inference services¶
Selection of AI models to perform AI inference services¶
A method of providing AI services from cloud servers or edge devices¶
Communication method to transmit data to request AI inference service¶
The proposed considerations and items can be used to describe the use cases of self-driving cars and network digital twins. Since providing AI services in a distributed manner offers various advantages, it is desirable to apply it to self-driving cars and network digital twins.¶
Since research on AI services has been under way for a long time, systems for providing AI services may take various forms. However, due to the nature of AI technology, a system for providing AI services generally consists of the following steps [AI_inference_archtecture] [Google_cloud_iot].¶
Data collection & Store¶
Data Analysis & Preprocess¶
AI Model Training¶
AI Model Deploy & Inference¶
Monitor & Maintain Accuracy¶
In the data collection step, data required for training is prepared by collecting data from sensors and IoT devices or by using data stored in a database. Equipment involved in this step includes sensors, IoT devices and servers that store them, and database servers. Since the operations performed at this step are conducted through the Internet, many IoT technologies studied by the IETF so far have developed technologies suitable for this step.¶
In the data analysis and pre-processing step, the features of the prepared data are analyzed and pre-processing for training is performed. Equipment involved in this step includes a high-performance server equipped with a GPU and a database server, and this step is mainly performed in a local network.¶
In the model training step, a trained model is created by applying an algorithm suitable for the characteristics of the data and the problem to be solved. Equipment involved in this step includes a high-performance server equipped with a GPU, and this step is mainly performed in a local network.¶
In the model deploying and inference service provision step, the problem to be solved (e.g., classification, regression problem) is solved using AI technology. Equipment involved in this step may include a target machine, a client, a cloud, etc. that provide AI services, and since various equipment is involved in this stage, it is conducted through the Internet. This document summarizes the factors to be considered at this step.¶
In the accuracy monitoring step, if the performance deteriorates due to new data, a new model is created through re-training, and the AI service quality is maintained by using the newly created model. This step repeats the model training step and the model deploying and inference service provision step described above, because re-training and model deploying are performed again.¶
In general, after training an AI model, the AI model can be built on a local machine for AI model deploying and inference services to provide AI services. Alternatively, we can place AI models on cloud servers or edge devices and make AI service requests remotely. In addition, for overall service performance, some AI service requests can be sent to the cloud server and some to edge devices through appropriate load balancing.¶
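As a minimal sketch of the accuracy monitoring step, the class below keeps a sliding window of prediction outcomes and signals when re-training should be triggered. The class name, the window size, and the accuracy threshold are hypothetical choices made for illustration; the document does not prescribe a particular monitoring method.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks recent prediction outcomes and flags when re-training is needed.

    Hypothetical helper for illustration; window and threshold are assumptions.
    """

    def __init__(self, window: int = 100, threshold: float = 0.9):
        self.outcomes = deque(maxlen=window)  # True if a prediction was correct
        self.threshold = threshold            # assumed minimum acceptable accuracy

    def record(self, correct: bool) -> None:
        """Store the outcome of one prediction (oldest outcome is dropped)."""
        self.outcomes.append(correct)

    def accuracy(self) -> float:
        """Accuracy over the current window; 1.0 when nothing is recorded yet."""
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def needs_retraining(self) -> bool:
        # Decide only once the window is full, to avoid reacting to a few samples.
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.threshold
```

When `needs_retraining()` returns true, the model training and model deploying steps would be run again with the newly collected data.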
The following figure shows a case where a client module requesting AI service on the same local machine requests AI service from an AI server module on the same machine.¶
This method is often used when configuring a system focused on training AI models to improve their inference accuracy and performance, without particularly considering AI services or AI model deployment and inference. In this case, since the client module that requests the AI inference service and the AI server module that directly performs the AI inference service are on the same machine, it is not necessary to consider the communication/network environment or service provision method in much detail. Alternatively, this method can be used when we simply want to set up the AI inference service on one machine, such as an embedded machine or a customized machine, without changing the AI service in the future.¶
In this case, the high level of hardware performance needed to train an AI model is not required, but hardware performance sufficient to run the AI inference service is, so this configuration is possible on a machine with a moderate amount of hardware performance.¶
The following figure shows the case where the client module that requests AI service and the AI server module that directly performs AI service run on different machines.¶
In this case, the client module requesting the AI inference service runs on the local machine, and the AI server module that directly performs the AI inference service runs on a separate server machine, and this server machine is in the cloud network. In this case, the performance of the local machine does not need to be high because the local machine simply needs to request the AI inference service and, if necessary, deliver only the data required for the AI service request. For the AI server module that directly performs AI inference service, we can set up our own AI server, or we can use commercial clouds such as Amazon, Microsoft, and Google.¶
The following figure shows the case where the client module that requests AI service and the AI server module that directly performs AI service are separated, and the AI server module is located in the edge device.¶
Even in this case, the client module that requests the AI inference service runs on the local machine, the AI server module that directly performs the AI inference service runs on the edge device, and the edge device is in the edge network. For the AI server module that directly performs the AI inference service on the edge device, we can configure the edge device ourselves or use a commercial edge computing module.¶
The difference from the above case, where the AI server module is in the cloud, is that the edge device is usually close to the client but has lower performance than a server in the cloud. There are therefore advantages in data transfer time and overall inference time, but the inference service performance per unit time is lower.¶
The following figure shows the case where AI server modules that directly perform AI services are distributed in the cloud and edge devices.¶
The AI server module running in the cloud and the AI server module running on the edge device differ in AI inference service performance. Therefore, the client requesting the AI inference service may distribute its requests between the cloud and the edge device as appropriate for the desired AI service. For example, for an AI service that tolerates lower inference accuracy but needs a short inference time, we can request the AI inference service from the edge device.¶
In the previous section, the network configuration for providing AI inference service, consisting of local machines, edge devices, and cloud servers, is a kind of vertical hierarchy. Because the capabilities of each machine are different, the overall performance of a network using a vertical hierarchy depends on each machine. Generally, a cloud server has the most powerful performance, followed by an edge device.¶
In this network configuration, an AI service may perform differently according to the load level of the server, the computing capability of the server machine, and the link state between the local machine and the server machines at the same horizontal level. Thus, to find the server machine that can best support the AI service, two elements are necessary: a network element that can monitor the network link state and the current computing capability of the server machines, and a network load balancer that can apply a load-balancing scheduling policy. The following figure shows the case where the local machine requests AI service from multiple horizontal cloud servers.¶
Collecting and preprocessing data and training an AI model require high-performance resources such as CPU, GPU, power, and storage. To mitigate this requirement, we can utilize a network-side configuration. Typically, federated learning is a machine learning technique that trains an AI model across multiple decentralized servers. This contrasts with traditional centralized machine learning techniques, where all the local datasets are uploaded to one server. Federated learning enables multiple network nodes to build a common machine learning model.¶
Transfer learning is a machine learning technique that focuses on storing information gained while solving one problem and applying it to a different but related problem. With transfer learning, we can utilize a network configuration to transfer common information and knowledge between different network nodes.¶
As described in the previous section, the AI server module that directly performs AI inference services by utilizing AI models can run on a local machine, a cloud server, or an edge device.¶
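The distribution of requests between cloud and edge described above can be sketched as a small selection function. This is only an illustration under stated assumptions: the `Target` type, the `choose_target` function, and the accuracy and latency figures are hypothetical, and a real client would obtain such figures by measurement.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    accuracy: float    # expected inference accuracy (0..1), assumed known
    latency_ms: float  # expected end-to-end latency, assumed measured


def choose_target(targets, min_accuracy, max_latency_ms):
    """Return the lowest-latency target meeting both requirements, or None."""
    ok = [t for t in targets
          if t.accuracy >= min_accuracy and t.latency_ms <= max_latency_ms]
    return min(ok, key=lambda t: t.latency_ms) if ok else None


# Illustrative values only, not measurements.
targets = [Target("edge", 0.85, 20.0), Target("cloud", 0.95, 120.0)]
fast = choose_target(targets, min_accuracy=0.80, max_latency_ms=50.0)       # edge
accurate = choose_target(targets, min_accuracy=0.90, max_latency_ms=200.0)  # cloud
```

A request that needs both high accuracy and a very short latency may have no suitable target, in which case the function returns `None` and the client has to relax one requirement.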
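A load-balancing policy of this kind could, for example, rank servers by an estimated delay that combines the three factors named above. The sketch below is one possible scoring rule, not a method defined by this document; the `ServerState` fields and the delay formula are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    name: str
    load: float          # current utilisation, 0.0 (idle) to 1.0 (saturated)
    capacity_rps: float  # inference requests per second when idle
    rtt_ms: float        # measured round-trip time to the server (link state)


def estimated_delay_ms(s: ServerState) -> float:
    """Rough estimate: link RTT plus service time scaled by remaining capacity."""
    free = max(1.0 - s.load, 1e-6)  # avoid division by zero when saturated
    service_ms = 1000.0 / (s.capacity_rps * free)
    return s.rtt_ms + service_ms


def pick_server(servers):
    """Select the server with the smallest estimated delay."""
    return min(servers, key=estimated_delay_ms)
```

With this rule, a powerful but half-loaded server (100 requests/s, 10 ms RTT, estimated 30 ms) loses to a weaker idle server that is closer (50 requests/s, 5 ms RTT, estimated 25 ms).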
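To make the idea of building a common model from several nodes concrete, the sketch below shows a weighted average of model parameters, in the style of the widely used federated averaging (FedAvg) approach. The document does not name a specific aggregation algorithm, so the choice of FedAvg and the function name `federated_average` are assumptions.

```python
def federated_average(client_weights, client_sizes):
    """Weighted average of per-client weight vectors (lists of floats).

    client_weights: one parameter vector per node, all of equal length.
    client_sizes:   number of local training samples per node (the weights).
    Only parameters are exchanged; the local datasets stay on each node.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two nodes with 1 and 3 local samples respectively.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
# global_weights == [2.5, 3.5]
```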
In theory, if the AI inference service is performed on a local machine, the AI service can be provided without communication delay or packet loss, but a certain amount of hardware performance is required to perform AI inference. So, in a future environment where AI services become popular and widely disseminated, the cost of a machine that performs AI services is important.¶
If so, whether the AI inference service is performed on a cloud server or, at a lower price, on an edge device can be a determining factor in the system configuration.¶
When an AI inference service request is made to a distant cloud server, transmission may take a long time, but the server has the advantage of being able to process many AI inference service requests in a short time, and the accuracy of AI inference is higher. Conversely, when an AI service request is made to a nearby edge device, the transmission time is short, but many AI inference service requests cannot be processed at once, and the accuracy of AI inference is lower.¶
Therefore, by analyzing the characteristics and requirements of the AI service to be performed, it is necessary to determine whether to perform the AI inference service on a local machine, a cloud server, or an edge device.¶
The hardware characteristics of the machines performing the AI service vary. In general, machines on cloud servers are viewed as having higher performance than edge devices. However, the performance of the AI inference service varies depending on how hardware such as CPU, RAM, GPU, and network interface is configured for each cloud server and edge device. If we do not think about cost, it is good to configure a system for performing AI services with the machine with the best hardware performance, but in reality, we should always consider the cost when configuring the system. So, the performance of the local machine, cloud server, and edge device must be determined according to the characteristics and requirements of the AI service to be performed.¶
Performance can be evaluated using the performance metrics presented in the ETSI standard [MEC.IEG006]. These metrics are divided into two groups: functional metrics, which assess the user performance and include classical indexes such as latency in task execution, device energy efficiency, bit-rate, loss rate, jitter, Quality of Service (QoS), etc.; and non-functional metrics, which instead focus on the MEC (Mobile Edge Computing) network deployment and management. Non-functional metrics include the following indexes: service life-cycle (instantiation, service deployment, service provisioning, service update (e.g., service scalability and elasticity), service disposal), service availability and fault tolerance (aka reliability), service processing/computational load, global mobile equipment host load, number of API requests (more generally, number of events) processed per second on the mobile equipment host, delay to process an API request (north and south), and number of failed API requests. The sum of the service instantiation, service deployment, and service provisioning times gives the service boot-time.¶
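The trade-off between transmission time and inference time can be worked through with simple arithmetic. The following sketch assumes a simplified model in which the time for one request is the round-trip time plus the payload transfer time plus the inference time; the function name and all numbers are illustrative assumptions, not measurements.

```python
def request_time_ms(payload_bytes, bandwidth_mbps, rtt_ms, inference_ms):
    """Round-trip time + payload transfer time + inference time, in ms."""
    # bits * 1000 / (bits per second) gives the transfer time in milliseconds
    transfer_ms = payload_bytes * 8 * 1000 / (bandwidth_mbps * 1_000_000)
    return rtt_ms + transfer_ms + inference_ms


# Illustrative values: a 1 MB input sent to a distant cloud server
# (fast inference, slow path) and to a nearby edge device
# (slower inference, fast path).
cloud_ms = request_time_ms(1_000_000, bandwidth_mbps=100, rtt_ms=80, inference_ms=10)
edge_ms = request_time_ms(1_000_000, bandwidth_mbps=1000, rtt_ms=5, inference_ms=40)
# cloud_ms is 170.0 and edge_ms is 53.0 under these assumptions
```

Under these assumed numbers the edge device answers a single request sooner even though its inference is four times slower; the model does not capture queueing when many requests arrive at once, which is where the cloud server has the advantage.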
Although not directly related to communication or the network, the factor with the biggest influence on an AI inference service is the AI model used for it. For example, in AI services such as image classification, there are various types of AI models such as ResNet, EfficientNet, VGG, and Inception. These AI models differ not only in AI inference accuracy, but also in AI model file size and AI inference time. AI models with the highest inference accuracy typically have very large file sizes and long inference times. So, when constructing an AI service system, it is not always good to choose the AI model with the highest AI inference accuracy. Again, it is important to select an AI model according to the characteristics and requirements of the AI service to be performed.¶
Experimentally, it is recommended to use an AI model with high inference accuracy on the cloud server and, on the edge device, an AI model that provides fast inference even though its inference accuracy is slightly lower.¶
Although it is partly an implementation issue, we should also consider how AI services are delivered on cloud servers or edge devices. With current technology, a traditional web server method or a server method specialized for AI inference (e.g., Google's TensorFlow Serving) can be used. Traditional web server methods such as Flask and Django have the advantage of running on various types of machines, but since they are designed to support general web services, their service execution time is not fast. TensorFlow Serving uses the features of TensorFlow to make AI inference services very fast and efficient. However, TensorFlow Serving cannot be used on older CPUs that do not support AVX, because Google's TensorFlow does not run on them. Therefore, rather than unconditionally using the server method specialized for AI inference, it is necessary to choose the AI server module method in consideration of the hardware characteristics of the AI system that can be built.¶
The communication method for transferring data to request an AI inference service is also an important decision in constructing an AI system. The traditional REST method can be used with various machines and services, but its performance is inferior to that of Google's gRPC. Using gRPC for AI inference services has many advantages because gRPC enables large-capacity and more efficient data transfer compared to REST.¶
Cloud-edge collaboration-based AI service development is actively underway. In particular, for AI services that are sensitive to network delays, such as object recognition and autonomous vehicle services, (micro)services for inference are placed on edge devices to obtain fast inference results and provide services. As such, in the development of intelligent IoT services, various devices that can provide computing services within the network, such as edge devices, are being added as network elements, and the number of IoT devices using them is rapidly increasing. Therefore, a new function for computing resource management and operation is required in terms of providing computing services within the network. In addition, to operate distributed AI services on the network, a network policy for collaboration between the edge devices that supply computing resources for AI services is required.¶
In terms of network policy, in order to efficiently support distributed AI services, existing networks must support collaboration on AI services between edge devices, such as multi-edge network configuration and AI-service-aware traffic steering in a multi-edge network, so that distributed AI services can be supported efficiently in network environments where network resources and the computing resources of edge devices vary dynamically. For example, in order to efficiently provide a distributed AI inference service in a multi-edge network environment, AI task messages must be exchangeable between edge devices. Also, there are various delay-sensitive AI services based on edge devices in the network. They are divided into in-time delay AI services, which have a deadline limit, and on-time delay AI services, which have a set time range. In particular, on-time delay AI services need to return prediction results within the time range. Therefore, a distributed AI service must be able to provide the proper delay service through the network. The client of an AI service should be able to receive both in-time and on-time delay services and to interact with the edge devices on which the distributed AI service is built.¶
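Model selection under these constraints can be sketched as choosing the most accurate model that still fits the size and inference-time budget of the target machine. The `ModelProfile` type, the model names, and every number below are placeholders for illustration; real values would come from profiling the candidate models on the actual hardware.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    name: str
    accuracy: float      # measured accuracy on a validation set (0..1)
    size_mb: float       # model file size
    inference_ms: float  # measured per-request inference time on the target


def select_model(profiles, max_size_mb, max_inference_ms):
    """Most accurate model that fits the size and time budget, or None."""
    fits = [p for p in profiles
            if p.size_mb <= max_size_mb and p.inference_ms <= max_inference_ms]
    return max(fits, key=lambda p: p.accuracy) if fits else None


# Placeholder profiles, not measurements of any real model.
profiles = [
    ModelProfile("model_large", accuracy=0.95, size_mb=500.0, inference_ms=200.0),
    ModelProfile("model_small", accuracy=0.88, size_mb=20.0, inference_ms=15.0),
]
cloud_choice = select_model(profiles, max_size_mb=1000.0, max_inference_ms=500.0)
edge_choice = select_model(profiles, max_size_mb=50.0, max_inference_ms=30.0)
```

With a generous budget the larger, more accurate model is selected (as on a cloud server), while a tight budget selects the smaller, faster model (as on an edge device).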
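The hardware check behind this decision can be sketched as follows. The example assumes a Linux x86 machine, where the CPU feature list appears on the `flags` line of `/proc/cpuinfo`; the function names and the returned labels are hypothetical, and other platforms expose CPU features differently.

```python
def supports_avx(cpuinfo_text: str) -> bool:
    """Return True if the 'flags' line of /proc/cpuinfo-style text lists avx."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return "avx" in value.split()
    return False  # no flags line found (e.g., non-x86 or non-Linux text)


def choose_serving_method(cpuinfo_text: str) -> str:
    """Hypothetical policy: use a specialized serving system only with AVX."""
    return "specialized-serving" if supports_avx(cpuinfo_text) else "web-server"


# Usage on a Linux x86 machine (assumption):
#   with open("/proc/cpuinfo") as f:
#       method = choose_serving_method(f.read())
```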
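One reason for the efficiency difference is the encoding of the request data: REST services commonly carry JSON text, whereas gRPC carries a binary encoding (Protocol Buffers). The sketch below gives a rough feel for the size difference using only the Python standard library. It uses `struct` as a stand-in for a binary encoding and is not Protocol Buffers or gRPC itself; the `"instances"` key and the function names are illustrative assumptions.

```python
import json
import struct


def json_size(values):
    """Size in bytes of the values encoded as a JSON request body."""
    return len(json.dumps({"instances": values}).encode("utf-8"))


def binary_size(values):
    """Size in bytes of the values packed as little-endian 32-bit floats."""
    return len(struct.pack(f"<{len(values)}f", *values))


values = [0.123456789] * 1000
text_bytes = json_size(values)   # each number is written out as decimal text
bin_bytes = binary_size(values)  # 4 bytes per value, 4000 bytes in total
```

Note that packing to 32-bit floats reduces precision, which is usually acceptable for inference inputs such as image pixels but is a design choice to verify for each service.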
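The two delay classes can be expressed as simple checks on the elapsed time of a request. This is a sketch based on the description above; the function names and the example values are assumptions.

```python
def meets_in_time(elapsed_ms: float, deadline_ms: float) -> bool:
    """In-time delay service: the result must arrive no later than the deadline."""
    return elapsed_ms <= deadline_ms


def meets_on_time(elapsed_ms: float, earliest_ms: float, latest_ms: float) -> bool:
    """On-time delay service: the result must arrive within the time range."""
    return earliest_ms <= elapsed_ms <= latest_ms


# A result returned after 40 ms satisfies a 50 ms deadline (in-time),
# but not a 45-55 ms delivery window (on-time), because it is too early.
in_time_ok = meets_in_time(40.0, deadline_ms=50.0)
on_time_ok = meets_on_time(40.0, earliest_ms=45.0, latest_ms=55.0)
```

The example shows the practical difference: for an on-time service, an early result is also a violation, so the network or the edge device may have to hold the result until the window opens.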
Various sensors are used in self-driving cars, and the final judgment is made by combining their data. Among them, camera data-based object detection covers cases that expensive equipment such as LiDAR and RADAR cannot handle. Camera-based object detection performs various tasks: in addition to lane recognition for keeping and changing driving lanes, it also supports safe driving and parking assistance by distinguishing shape information such as pedestrians, signs, and parked vehicles along the road.¶
In order to perform such driving assistance and autonomous driving, object detection needs to be performed in real time. The minimum frame rate to be considered real-time in autonomous driving is 30 FPS (Frames Per Second) [Object_detection]. No matter how high its accuracy is, an object detector cannot be used for autonomous driving if it does not meet this reference value.¶
Task offloading refers to a technology or structure that transfers computing tasks to other processing devices or systems to perform them. Task offloading can quickly process tasks that exceed the performance limits of devices that lack resources by delivering tasks from devices with limited computing power, storage space, and power to devices that are rich in computing resources.¶
For devices with low hardware performance (e.g., NVIDIA Jetson Nano board, quad-core ARM A57, 4GB RAM), performing all tasks locally without task offloading results in 4.6 FPS, which makes object detection-based autonomous driving difficult. On the other hand, if task offloading is applied so that object detection runs on a device with high hardware performance (e.g., Intel i7, RTX 3060, 32GB RAM) and the rest of the work is performed on the client, 41.8 FPS is obtained. This result satisfies 30 FPS, the reference FPS for object detection-based autonomous driving.¶
For AI services such as object detection that are difficult to perform on resource-constrained devices, the task offloading structure therefore shows some efficiency. However, when not all operations are performed locally, task offloading between network nodes can affect the total processing time, because the larger the data, the greater the communication latency. Therefore, in such a distributed network environment, the provision of AI services should be designed in consideration of various variables. Figure 7 shows an example of distributed AI deployment in a self-driving car, where the car does not have enough capability to perform the object detection operation in real time and asks edge devices and cloud servers to perform some tasks.¶
Network digital twins also need to build distributed AI services. The purpose of a network digital twin is described in [I-D.irtf-nmrg-network-digital-twin-arch]. In particular, the network digital twin provides network operators with technology that enables stable operation of the physical network and stable execution of optimal network policies and deployment procedures. To achieve this, the network digital twin will use AI capabilities for various purposes.¶
Various AI functions will be applied for optimal network operation and management. However, the actual physical network consists of many network devices and has a complex structure. In addition, in a large-scale network environment, collecting and storing information from many network devices in a centralized manner, and creating and operating network operation policies based on it, incurs very large network overhead.¶
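The FPS figures above translate directly into a per-frame time budget, which makes the real-time requirement easier to reason about. The helper names below are hypothetical; the FPS values are the ones quoted in the text.

```python
def frame_time_ms(fps: float) -> float:
    """Time available (or spent) per frame, in milliseconds."""
    return 1000.0 / fps


def meets_realtime(fps: float, required_fps: float = 30.0) -> bool:
    """True if the achieved frame rate reaches the reference frame rate."""
    return fps >= required_fps


budget_ms = frame_time_ms(30.0)      # per-frame budget at the 30 FPS reference
local_ms = frame_time_ms(4.6)        # all tasks performed locally
offloaded_ms = frame_time_ms(41.8)   # object detection offloaded
```

At 30 FPS each frame must be processed in about 33.3 ms. The local-only case spends about 217 ms per frame, far over the budget, while the offloaded case spends about 23.9 ms per frame and stays within it.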
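The dependence of the offloading benefit on data size can be sketched as a break-even calculation: offloading pays off only while the transfer time stays below the processing time saved. The function name, the simplified model (round-trip time plus transfer time plus remote processing time), and the example numbers are assumptions for illustration.

```python
def max_offload_payload_bytes(local_ms, remote_ms, rtt_ms, bandwidth_mbps):
    """Largest payload (bytes) for which offloading is still faster than local.

    Offloading wins while rtt + transfer + remote < local, so the transfer
    may take at most (local - remote - rtt) milliseconds.
    """
    budget_ms = local_ms - remote_ms - rtt_ms
    if budget_ms <= 0:
        return 0  # offloading cannot win, whatever the payload size
    # budget [ms] * bandwidth [bit/s] / 8 [bit/byte] / 1000 [ms/s]
    return int(budget_ms * bandwidth_mbps * 1_000_000 / 8 / 1000)


# Illustrative values: 200 ms locally, 20 ms remotely, 10 ms RTT, 100 Mbit/s.
limit = max_offload_payload_bytes(200, 20, 10, 100)
# limit == 2125000 (about 2.1 MB per task)
```

If the data to be sent per task exceeds this limit, performing the task locally is faster under this model, even on the slower device.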
Therefore, there is a need for a method to apply AI functions based on a distributed form for network operation and management. In particular, the actual physical network structure is built in a logical hierarchical structure. Therefore, it is necessary to apply a distributed AI method that considers the logical hierarchical network structure environment.¶
In order to optimally perform network operation and management through distributed AI methods, it is necessary both to generate AI function-based network operation and management policy models and to have an operational method for distributing the generated AI function-based network policies. In particular, in order to operate a network digital twin in a large-scale network environment, it is necessary to generate AI-based network policy models in a distributed manner. A federated learning algorithm or a transfer learning algorithm that can learn large-scale networks in a distributed manner can be applied.¶
As shown in Figure 8, in order to learn a large-scale network through a distributed learning method, a local data repository to store network data must be established in each region, for example, based on location or AS (Autonomous System). Therefore, the distributed learning method learns through each worker (agent) based on the local network data stored in the local network data repository, and generates a large-scale network policy model through the master. This distributed learning method can reduce the network overhead of centralized data collection and storage, and reduce the time required to create AI models for network operation and management policies for large-scale networks. In addition, the network policy model generated by the worker can be used as a locally optimized network policy model to provide AI-based network operation and management policy services optimized for local network operations.¶
Trained AI network policy models can be deployed in a distributed manner on network devices that can manage and operate the local network, in order to minimize network data movement. For example, in a large-scale network consisting of multiple ASes, AI network policy models can be deployed per AS to optimize network operation and management. Figure 9 shows an example of operating and managing a network by distributing AI network policy models by AS.¶
There are no IANA considerations related to this document.¶
When an AI service is performed on a local machine, there is no network-related security issue, but when an AI service is provided through a cloud server or edge device, the IP address and port number may become known to outsiders, who can then attack the service. Therefore, when providing AI services by utilizing machines on the network such as cloud servers and edge devices, it is necessary to carefully analyze the characteristics of the modules to be used, identify security vulnerabilities, and take countermeasures.¶
TBA¶