Algorithm_Innovation_Lab_Research_ResourceManagement3--HUAWEI CLOUD

Introduction to Resource Management (3)

Handwriting reveals character. The calligrapher Wang Xizhi held that a person's handwriting is a portrait of their personality. People's Daily Online has published an article evaluating the chairman's calligraphy, "The handwriting is like the man", which observes that his calligraphy from the revolutionary period is full of momentum. Today, I would like to introduce a couplet, "Know yourself and know the enemy, and you will win every battle", adopted by the Chairman of the Central Military Commission at the Beijing meeting on 7 May 1950. The "Attack by Stratagem" chapter of "The Art of War" points out: "One who knows the enemy and knows himself will not be endangered in a hundred battles; one who knows himself but not the enemy will win one and lose one; one who knows neither the enemy nor himself will be endangered in every battle." This reveals a universal law of war: only when commanders understand the situation on both sides can they formulate the correct strategy and win. The chairman creatively changed "never endangered" in "The Art of War" to "always win", further emphasizing the importance of intelligence work in war. If resource planning, calculation, and scheduling in Huawei's next-generation smart cloud brain "Alkaid" are about knowing oneself, then the prediction and profiling of resource usage are about knowing the enemy.

In the previous two issues, we introduced the resource planning, resource calculation, and resource scheduling algorithms, together with the iterative optimization platforms. In terms of the resource lifecycle, that covered roughly the first half: we considered and solved problems from the perspective of data center resource preparation and the static resource quotas declared by tenants. In the second half of the lifecycle, resource usage is much closer to the tenants' actual services, so we need to pay attention to how resources are really used and how that usage affects services.

Analysis of resources in real data centers shows that the allocation rate is sometimes high while the actual load on the VMs or hosts, and therefore the resource usage, is low. In other cases allocation is normal, but the service load of some VMs rises and degrades the performance of their neighboring VMs. Both cases indicate that there is no necessary relationship between the amount of resources a tenant claims and how those resources are actually used. If we schedule and manage resources according to actual usage, we can improve resource utilization and provide a better performance experience. This is the purpose of utilization-based scheduling and task scheduling. Intelligent prediction and profiling of resources and service loads not only guide better scheduling and raise utilization. They can also be offered to customers as intelligent capabilities, such as personalization ("a thousand faces for a thousand people"), intelligent recommendation, cost and performance optimization suggestions, and service SLA assurance algorithms, which greatly improve resource efficiency at the tenant level and provide a better cloud experience. Research on these algorithms has been carried out and incorporated into the planning for Alkaid.

Yaoguang, known in the West as Alkaid, is the seventh star of the Northern Dipper, at the tip of its handle, and is said to be able to pry up the world. Huawei's "Alkaid" solution is designed to address the core challenges of all-domain scheduling, dynamic negotiation and governance, multi-objective optimization, intelligent matching of diverse computing capabilities, and full-stack trustworthiness. As an important part of the "Alkaid" smart cloud brain, the resource management and scheduling algorithms are bound to be deployed worldwide.

"Through a hundred battles in the desert sands our golden armor has worn thin, yet we will not return until Loulan is defeated." We wish Alkaid invincibility in the 5G + cloud + AI era.

1. Resource forecast

Based on the historical usage of hosts or tenant VMs in the data center, the system uses a time series prediction algorithm to forecast their future usage trends so that measures can be taken in advance. This supports usage-based elastic scheduling decisions, hotspot identification and migration, and dynamic threshold alarms. The solution builds on Google's research results and uses a "decomposition-combination" prediction method based on EEMD (ensemble empirical mode decomposition): the complex original series is decomposed into several stationary subsequences, each subsequence is predicted separately, and the individual predictions are combined into the final result. Compared with the traditional HW (Holt-Winters) algorithm, this long-term prediction algorithm reduces the error by 29.8% to 41.7%.
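The "decompose, predict each part, then combine" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a real EEMD needs a dedicated library, so a simple additive moving-average decomposition (trend, seasonal, residual) is swapped in here, and the function names, the `period` parameter, and the sample data are all hypothetical rather than part of Alkaid.

```python
from statistics import mean


def decompose(series, period):
    """Additively split a series into trend, seasonal, and residual parts.

    Stand-in for EEMD (an assumption of this sketch). Needs len(series) >= period.
    """
    n = len(series)
    half = period // 2
    # Trend: centered moving average, with the window clipped at the edges.
    trend = [mean(series[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]
    detrended = [x - t for x, t in zip(series, trend)]
    # Seasonal: average detrended value at each position within the period.
    profile = [mean(detrended[p::period]) for p in range(period)]
    seasonal = [profile[i % period] for i in range(n)]
    residual = [x - t - s for x, t, s in zip(series, trend, seasonal)]
    return trend, seasonal, residual


def forecast(series, period, horizon):
    """Forecast each component separately, then sum the component forecasts.

    Requires period >= 2 so that the trend line can be fitted.
    """
    n = len(series)
    trend, seasonal, residual = decompose(series, period)

    # Trend: least-squares line through the last `period` trend points.
    tail = trend[-period:]
    k = len(tail)
    x_mean, y_mean = (k - 1) / 2, mean(tail)
    denom = sum((x - x_mean) ** 2 for x in range(k))
    slope = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(tail)) / denom
    trend_fc = [y_mean + slope * (k - 1 - x_mean + h) for h in range(1, horizon + 1)]

    # Seasonal: repeat the periodic pattern. Residual: carry its recent mean.
    seasonal_fc = [seasonal[(n + h) % period] for h in range(horizon)]
    residual_fc = [mean(residual[-period:])] * horizon

    return [t + s + r for t, s, r in zip(trend_fc, seasonal_fc, residual_fc)]


# Hypothetical CPU utilisation samples (%) with a repeating 4-step pattern.
cpu = [30, 50, 70, 50] * 6
next_steps = forecast(cpu, period=4, horizon=4)
```

Each component is smoother than the original series, which is why a simple model per component can be enough. The method described above presumably applies stronger predictors to the EEMD subsequences.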

2. Resource profile

Intelligent VM profiling is a key part of fine-grained scheduling in the secondary system. By analyzing and profiling VM data, the system characterizes the real service behavior in each resource dimension and extracts useful patterns from the data, providing fine-grained decision feedback to the scheduling system.

The intelligent profiling system introduces machine learning technologies. Based on mining and analysis of real data, it identifies a series of key decision points in the scheduling system and provides a corresponding component for each. These components are the core of the resource profiling system and serve a large number of service applications.
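As a rough illustration of what a per-VM profile might contain, the sketch below condenses CPU utilization samples into a few features and a coarse label. The feature set, the thresholds, and the label names are assumptions made for this example. The source does not specify them, and its profiling components rely on machine learning rather than fixed rules.

```python
from statistics import mean, pstdev


def profile_vm(cpu_samples, busy_threshold=60.0):
    """Summarise a VM's CPU utilisation samples (percent) into profile features."""
    ordered = sorted(cpu_samples)
    n = len(ordered)
    # Nearest-rank 95th percentile: rank = ceil(0.95 * n), in integer arithmetic.
    p95 = ordered[(95 * n + 99) // 100 - 1]
    avg = mean(cpu_samples)
    return {
        "mean": avg,
        "p95": p95,
        "peak": ordered[-1],
        # Coefficient of variation as a rough burstiness measure.
        "burstiness": pstdev(cpu_samples) / avg if avg else 0.0,
        # Fraction of samples at or above the (hypothetical) busy threshold.
        "busy_ratio": sum(s >= busy_threshold for s in cpu_samples) / n,
    }


def label(profile):
    """Map a profile to a coarse class using hypothetical thresholds."""
    if profile["p95"] < 20:
        return "idle"
    if profile["burstiness"] > 0.5:
        return "bursty"
    return "steady"


# Hypothetical VM that is mostly quiet with two short spikes.
spiky = profile_vm([10] * 18 + [90] * 2)
spiky_label = label(spiky)
```

A scheduler could use such labels, for instance, to avoid co-locating several bursty VMs on one host.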

3. Association analysis

In cloud computing, resources in different dimensions each have their own characteristics, but they also share similarities. Understanding both the similarities and the differences in resource data across dimensions is crucial to service analysis and resource scheduling.

By analyzing resource data, we propose a series of algorithms based on association analysis and prediction, for example to handle the "cold start" problem. These algorithms use the information already present in the data to predict and analyze related resources. This gives the resource scheduling system information in more dimensions and strengthens its control capability.
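One common way to approach a cold start is to borrow usage from similar resources that already have history. The sketch below estimates the CPU utilization of a brand-new VM from its nearest neighbours in declared size. The nearest-neighbour approach, the feature choice (vCPUs and memory), and all names are assumptions for illustration, since the source does not describe how its algorithms work.

```python
from math import dist


def cold_start_estimate(new_vm, known_vms, k=3):
    """Estimate mean CPU utilisation (%) for a VM that has no history yet.

    new_vm: (vcpus, memory_gib) of the new VM.
    known_vms: list of ((vcpus, memory_gib), observed_mean_cpu) pairs.
    """
    if not known_vms:
        raise ValueError("need at least one VM with history")
    # Rank existing VMs by Euclidean distance in the declared-size space.
    nearest = sorted(known_vms, key=lambda item: dist(item[0], new_vm))[:k]
    # Use the average observed usage of the k most similar VMs.
    return sum(usage for _, usage in nearest) / len(nearest)


# Hypothetical history: two small VMs and one large VM.
history = [((2, 4), 10.0), ((2, 4), 20.0), ((16, 64), 80.0)]
estimate = cold_start_estimate((2, 4), history, k=2)
```

In practice the features would likely be normalized and extended with other attributes, such as tenant or image type, so that no single dimension dominates the distance.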

4. Computing task scheduling

A short task is a user job that can be split into multiple tasks, which are completed by the computing resources (nodes) in a cluster. Short tasks are mainly compute-intensive workloads such as TensorFlow training and Spark computation. A job submitted by a user is divided into tasks that transmit data to each other, and the nodes that execute those tasks are physically connected to each other. The scheduling algorithm must map the tasks decomposed from a job onto nodes for processing. The amount of data transmitted between tasks is unknown, while the physical connections and bandwidth (the upper limit of data transmission) between nodes are known. Under multiple constraints, such as the processing capability of a single node, the internal data transmission capacity of a single node, and the data transmission bandwidth between nodes, the scheduling algorithm seeks the optimal task allocation so as to shorten the completion time of the entire job.

The data transmission relationships between tasks, the physical topology between nodes, and the resource limits are described as quantized elements of matrices, and the objective function is designed specifically for this scenario. Under the resource constraints, the optimal solution of the objective function gives the mapping between tasks and nodes, that is, the scheduling and allocation solution.

The objective function we designed takes service constraints into account and keeps the load balanced across nodes even as tasks are allocated to nodes with high bandwidth.

To speed up the solution process, we developed a heuristic search method based on business logic and a method based on partitioning the topology between nodes. Together they shorten the time needed to solve the objective function enough to meet the requirements of a commercial environment.
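The matrix formulation can be made concrete with a toy example. The sketch below is an assumption-laden illustration, not the objective function or solver described above: it assumes an estimated traffic matrix between tasks (the text notes the real amounts are unknown), a bandwidth matrix between nodes, a per-node task capacity, and a simple "communication time plus load imbalance" cost. It also uses exhaustive search, which only works for tiny instances and is exactly what the heuristic and topology-partitioning methods are meant to avoid.

```python
from itertools import product


def job_cost(assignment, traffic, bandwidth, balance_weight=1.0):
    """Communication time plus a load-imbalance penalty for one assignment.

    assignment[i] is the node running task i. traffic[i][j] is the estimated
    data volume sent from task i to task j. bandwidth[a][b] is the capacity
    between nodes a and b, with bandwidth[a][a] the intra-node capacity.
    """
    n_tasks = len(assignment)
    comm = sum(
        traffic[i][j] / bandwidth[assignment[i]][assignment[j]]
        for i in range(n_tasks)
        for j in range(n_tasks)
        if traffic[i][j]
    )
    loads = [assignment.count(node) for node in range(len(bandwidth))]
    return comm + balance_weight * (max(loads) - min(loads))


def best_assignment(traffic, bandwidth, capacity, balance_weight=1.0):
    """Try every task-to-node mapping that respects per-node task capacity."""
    n_tasks, n_nodes = len(traffic), len(bandwidth)
    best, best_cost = None, float("inf")
    for assignment in product(range(n_nodes), repeat=n_tasks):
        if any(assignment.count(node) > capacity[node] for node in range(n_nodes)):
            continue  # violates a node's capacity constraint
        cost = job_cost(assignment, traffic, bandwidth, balance_weight)
        if cost < best_cost:
            best, best_cost = assignment, cost
    return best, best_cost


# Hypothetical job: tasks 0-1 and 2-3 exchange a lot of data, 1-2 very little.
traffic = [[0, 100, 0, 0], [0, 0, 1, 0], [0, 0, 0, 100], [0, 0, 0, 0]]
bandwidth = [[100, 10], [10, 100]]  # fast inside a node, slow between nodes
mapping, cost = best_assignment(traffic, bandwidth, capacity=[2, 2])
```

With these numbers the search keeps each heavily communicating pair on the same node and sends only the light 1-to-2 transfer across the slower inter-node link. The search space grows as nodes to the power of tasks, which is why production schedulers need heuristics.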
