Toward Massive Distribution of Intelligence for 6G Network Management Using Double Deep Q-Networks

Abstract—In future 6G networks, the deployment of network elements is expected to be highly distributed, going beyond the level of distribution of existing 5G deployments. To fully exploit the benefits of such a distributed architecture, there needs to be a paradigm shift from centralized to distributed management. To enable distributed management, Reinforcement Learning (RL) is a promising choice, due to its ability to learn dynamic changes in environments and to deal with complex problems. However, the deployment of highly distributed RL – termed massive distribution of intelligence – still faces a few unsolved challenges. Existing RL solutions, based on Q-Learning (QL) and Deep Q-Network (DQN) do not scale with the number of agents. Therefore, current limitations, i.e., convergence, system performance and training stability, need to be addressed, to facilitate a practical deployment of massive distribution. To this end, we propose improved Double Deep Q-Network (IDDQN), addressing the long-term stability of the agents’ training behavior. We evaluate the effectiveness of IDDQN for a beyond 5G/6G use case: auto-scaling virtual resources in a network slice. Simulation results show that IDDQN improves the training stability over DQN and converges at least 2 times sooner than QL. In terms of the number of users served by a slice, IDDQN shows good performance and only deviates on average 8% from the optimal solution. Further, IDDQN is robust and resource-efficient after convergence. We argue that IDDQN is a better alternative than QL and DQN, and holds immense potential for efficiently managing 6G networks.

Index Terms—6G, network management, network automation, reinforcement learning, machine learning, distributed intelligence, model training stability, scalability.

I. INTRODUCTION
In 2017, the Third Generation Partnership Project (3GPP), an organization that standardizes mobile communication systems, proposed a novel architecture for the Fifth Generation (5G) Core Network (CN) [1], which relies increasingly on service- and software-based concepts. That is, functionalities (software), called Network Functions (NFs), are decoupled from the underlying physical infrastructure, enhancing flexibility of deployment. To meet the requirements of 5G use cases, these CN NFs have to be increasingly deployed in a distributed manner [2]. For example, CN user plane NFs, e.g., User Plane Functions (UPFs), are deployed to edge sites [3], to keep user plane latency low. Similarly, distributed installation of the Network Data Analytics Function (NWDAF), a control plane NF, has been proposed [4], to optimize consumption of network resources and ensure data security.
For such distributed deployments of NFs, distributed management is more suitable than a centralized approach. A centralized orchestrator needs a global view of its managed entities (by means of extensive data collection) to make localized management decisions - thereby requiring enormous communication and computation resources [5]. For example, resource allocation decisions of an edge server A are not necessarily dependent on an edge server B located thousands of kilometers apart - resulting in unnecessary data collection at the orchestrator for the specific, local decision at A. With a distributed approach, in contrast, each management function (MF) is located close to the source of the data, i.e., the NF, enabling fast, local decision-making [6]. Therefore, for distributed management, 5G provides Management Services (MnS) [7], which split the management tasks of an orchestrator into smaller blocks providing local, more efficient management functionalities, e.g., monitoring data, performance reporting and so on.
With a distributed approach, each MF needs to exchange information periodically with other functions that influence its actions, in order to encourage cooperative decision-making, thereby improving system Key Performance Indicators (KPIs) [6]. Consequently, the complexity increases with increasing levels of distribution. Solutions based on traditional mathematical optimization or rule-based automation are not feasible anymore, as the complexity of the overall problem grows with the number (abbreviated as no.) of MFs. Artificial Intelligence (AI) [8] is a promising solution approach in this direction, as it solves complex problems by learning patterns in the data (not parameters). It leverages the enormous data available and derives optimal decisions in real-time [9]. Further, Reinforcement Learning (RL) [10], a branch of AI, is particularly suitable for learning complex environments - as RL continuously improves its learning based on the constant feedback
received from its environment. We define a framework of multiple RL agents as distributed intelligence. This trend of distributed intelligence will continue for the Sixth Generation (6G) system [11]. For example, in a V2X management scenario [12], vehicles must efficiently manage their respective computation tasks (e.g., inference of large neural network models), where tasks should be offloaded to edge servers if needed, to avoid overloading the local computation unit of the vehicles. For this use case, distributed intelligence has shown improvements over the state-of-the-art [13] and over centralized task offloading, as a centralized orchestrator does not scale with an increasing no. of vehicles.
However, distributed intelligence for 6G network management is expected to become more challenging, as more NFs are deployed in an increasingly distributed manner [14]. "Continuous" orchestration of resources (on which these NFs are deployed) would then be essential, meaning that Intelligent Agents (IAs) would manage highly distributed deployments of network elements, e.g., all the way to the edge nodes and even the end devices [15]. Therefore, in this work, we define massive distribution of intelligence as each network element being managed by its own, individual IA.
However, distributed intelligence based on RL suffers from a no. of practical issues that must be addressed to enable massive distribution. First, the IAs must converge in their learning, and learn the optimal action for a given state, before they can be deployed in the network. As the total convergence time is determined by the last IA to converge, the overall convergence time increases with the no. of distributed IAs [6]. Hence, scalability in terms of convergence time is an issue. Second, with an increasing no. of IAs, the overall performance of distributed intelligence, measured w.r.t. system KPIs, may degrade [6], due to insufficient data collected from other IAs, inappropriately located physical resources, etc. Finally, instability during training is a critical problem, as IAs may diverge over time from their learned optimal behavior, destabilizing and degrading system performance [16].
The above discussion, therefore, leads us to the novelty of this manuscript. Our work is novel in two aspects. i) We address the identified three dimensions of scalability together - convergence time, performance w.r.t. system KPIs and training stability - towards massive distribution of intelligence in 6G. To the best of our knowledge, these dimensions have not been combined and holistically investigated in any prior art. We consider the most extreme (massive) case of distribution, i.e., each NF managed by an individual IA, where each IA is trained and performs inference in a fully distributed manner. This proposal grants generality to each IA. ii) We also propose the application of improved DDQN (IDDQN), in which we combine Double DQN and reward scaling, in order to achieve a good balance among the dimensions of scalability.
We summarize the main contributions of this study in Fig. 1: 1 Solution Design: We identify the instability problem in using neural network-based RL, Deep Q-Networks (DQN). By identifying the root cause of DQN instability, we propose the application of improved DDQN (IDDQN), in which we combine two solutions for improving DQN - Sol. #1 Double DQN (DDQN) [17] and Sol. #2 reward scaling [18]. 2 Methodology: For the use case under study, auto-scaling virtual CPU resources of NFs in a Network Slice (NS), we implement a software platform to simulate the behavior of auto-scaling in a realistic 6G NS. The system performance is measured by the no. of User Equipments (UEs) served by the NS per unit time, referred to as served load. The goal of auto-scaling is to maximize the served load while keeping the VNF CPU utilization as close as possible to a pre-defined target. We implement IDDQN and other RL-based distributed auto-scaling algorithms - Q-Learning for Cooperation (QLC) [19] and Deep Q-Network for Cooperation (DQNC) [20] - for comparison. We consider the following benchmarks: 1) a no-auto-scaling algorithm, NO_AUT, which is the lower bound of the system performance, 2) a centralized Mixed Integer Optimization, MIO, which allocates the optimal no. of CPUs, hence is the upper bound, and 3) a scaling algorithm, THR, inspired by existing threshold-based implementations in open source orchestration frameworks, e.g., Open Source MANO [21].
3 IDDQN Effectiveness Proof & Tuning: We show the effectiveness of Sol. #1 DDQN in terms of learning behavior. Compared to DQNC, DDQN shows greater training stability. Then, we investigate Sol. #2, i.e., reward clipping and scaling, showing the effectiveness of reward scaling over clipping. We find a suitable parameter set for further IDDQN (reward scaling combined with DDQN) evaluation.
4 Solution Comparison: We conduct a broad investigation of the RL algorithms - QLC, DQNC, IDDQN. The algorithms are compared to the NO_AUT, MIO and THR benchmarks. We explore a large parameter and problem space to obtain a holistic picture of IDDQN's benefits. From the evaluations, we infer several conclusions about the behavior of IDDQN during training and deployment over a varied KPI set. IDDQN is observed to be more stable than DQN, and ensures that the mean absolute error of the performance (served load), compared to the optimum achieved by MIO, is only 8%. This is a promising figure for a neural network-based approach that trades off performance with convergence time. In comparison, IDDQN converges at least 2× sooner than QLC. We show that for the selected deployment scenarios, IDDQN achieves scalability in all three dimensions and proves to be a better candidate than existing solutions.
5 Implications for 6G: Our work can be considered a practical guide for deploying distributed intelligence in 6G networks, unveiling the possible influence factors w.r.t. stability, system performance and convergence time. IDDQN in particular brings higher reliability of network management owing to improved training stability, and better scalability due to reduced convergence time - allowing the network operator a higher degree of deployment flexibility. This holistic interpretation may thus help the operator to determine which algorithm is the better fit, given their specific requirements and the characteristics of their networks.
The rest of the article is structured as follows. Section II discusses background and related work. Section III describes the automated system and the RL algorithms, motivates and outlines the objectives of the manuscript. Section IV describes IDDQN. Section V outlines the simulation setup and the platform design, while Section VI presents the IDDQN evaluation. Section VII summarizes lessons learned and discusses limitations. Finally, Section VIII concludes the article.

II. BACKGROUND & RELATED WORK
In this section, we provide background on the management of virtualized resources (VR) in 5G, highlighting existing standardization efforts that encourage the adoption of distributed management. Next, we review related work that applies RL-based centralized intelligence to manage VR, showing the suitability and relevance of RL for network management. Although comparison to centralized learning is out of scope in this work, we introduce these works as competing alternatives to the distributed intelligence approach. Then, we carefully analyze gaps in the usage of distributed intelligence, reiterating our contributions and highlighting the novelty of this paper beyond our previously published works.

A. Standardization Efforts Towards Distributed Management
In legacy mobile networks, the software and associated logic were embedded within dedicated telecommunication equipment, for example, in the Evolved Packet Core (EPC) in the Fourth Generation (4G) mobile system [22]. In the Fifth Generation System (5GS), this type of monolithic architecture was decomposed by Network Function Virtualization (NFV). Being a technology that decouples software from hardware resources, NFV allows vendors to offer their software solutions, known as Network Functions (NFs), to run on commercial off-the-shelf hardware as Virtual Network Functions (VNFs). Due to this shift towards virtualization in 5G, the 5GPPP Architecture Working Group foresees the adoption of software-based, distributed concepts not only in 5G but also in future 6G networks [23]. Therefore, a Service-Based Management Architecture (SBMA) was introduced for 5G network management [24] by 3GPP. For example, in SBMA, Management Data Analytics Services (MDASs), offering management capabilities, may be produced by service producers and accessed by consumers via a standardized interface [7].
The Management Data Analytics Function (MDAF) exposes one or more of these MDAS(s). Although the 5G management architecture encourages a distributed approach using the SBMA, existing deployments of the MDAF are still predominantly centralized [25].
On the other hand, the ETSI Industry Specification Group (ISG) for NFV also introduced an architectural framework for the management and orchestration (MANO) of NFV resources and their associated interfaces [26]. Currently, deployments of the ETSI NFV MANO framework are also mostly centralized [15], despite a clear architectural interaction between the 3GPP SBMA and ETSI NFV MANO [24] that encourages distributed management. Therefore, there is a clear trend towards distributed management in 5G standardization efforts, even though early deployments resort to the more straightforward centralized option. This trend of distributing intelligence will not be limited to 5G, and is expected to continue through the evolution to 6G [23].

B. Suitability of RL-Based Network Management
Several previous works apply RL models to orchestrate VR using a centralized approach. Based on the O-RAN architecture, Murti et al. [27] formulate a virtualized RAN reconfiguration problem and propose a Dueling Double Deep Q-Network (D3QN)-based framework to solve it. Evaluation results on trace-driven simulations show that the proposal learns the optimal policy and achieves at least 35% cost savings compared to benchmarks. Kim et al. [28] propose an efficient Cloud-native Network Function (CNF) placement scheme in the control plane based on a centralized DQN formulation, known as DQN-CFPA. The objective of DQN-CFPA is to minimize the total cost incurred by backhaul control traffic and the costs to launch and operate CNFs. Nouruzi et al. [29] address the challenge of provisioning online service requests in an NFV-enabled network, to minimize the costs incurred in resource utilization while fulfilling the QoS under limited resources. The authors propose a DQN-based resource allocation algorithm to that end, achieving up to a 14% increase in the no. of admitted requests. Lee et al. [30] propose a method based on DQN to provide the optimal no. of VNF instances in a Service Function Chain (SFC), factoring in the tier to be scaled and the node used for scaling. The solution minimizes the no. of Service Level Objective violations. Based on the ETSI MANO architecture, the authors develop an auto-scaling module in an OpenStack environment. These works, therefore, show that RL algorithms, i.e., QL and DQN, are well-suited for management tasks such as the orchestration of VR.

C. Distributed Intelligence and Identified Gaps
Recent studies such as [31] have demonstrated the applicability of DRL for distributed or federated resource optimization. We analyze some works in detail. Dalgkitsis et al. [32] introduce SCHEMA, a distributed RL framework that addresses the SFC placement problem for low-latency URLLC services. Being one of the few works to address the scalability of distributed learning, this study assigns a local agent to each domain-level network graph.
Each agent implements a bidding mechanism and solves a local placement problem, which ensures scalability across domains. However, the agents need to share the enormous problem state space to participate in the bidding, introducing huge communication overhead in the network. Moreover, the evaluation shows the latency improvement only up to 11 local domains, without an outlook on either the convergence of the DQN agents in SCHEMA or the associated overhead. Another work that addresses the scalability of DRL is Liu et al. [33]. The authors propose a decentralized DRL-based resource orchestration system, known as "EdgeSlice", that dynamically provisions end-to-end network slices, with the objective to ensure the SLA of the slices. The network can be partitioned into Resource Autonomies (RAs), where each RA is defined as a set of base stations and edge servers in a geographical area. Evaluation results show that as the number of RAs and slices increases, "EdgeSlice" scales in terms of system performance, and performs better compared to baseline algorithms. "EdgeSlice" also outperforms the baselines under different performance functions, showing that it is compatible with the different scenarios considered by the authors. Although "EdgeSlice" is a good proposal in terms of scalability and compatibility, differently from our work, it uses a central performance coordinator to optimize network performance on a much larger timescale.
Further, Li et al. [34] propose a weighted distributed DQN to optimize the edge caching replacement problem in a Device-to-Device communications model. They compare the convergence of the centralized and the proposed distributed scheme, reporting faster convergence and lower loss values for the latter. Although "concussions" still exist in the local model training after convergence, no further evidence is provided that shows the disappearance of these fluctuations. Moreover, the no. of base stations (or the no. of DQN agents) is not specified in the paper, introducing ambiguity in the results. In [35], Chergui et al. refer to 6G massive slicing, in which slices span multiple technological domains - the radio access network, core, edge and cloud. The authors emphasize the large monitoring overhead involved in centralized MANO, and the associated delay and subsequent SLA violations. They argue that the energy consumption of training multiple distributed models is much lower than the cost of data transmission. To this end, they propose a novel energy-efficient decentralized AI engine based on Federated Learning (FL). Although this work fails to address FL model instability, it is relevant, as it confirms the high convergence and performance scalability of the decentralized algorithm. For AI-based service provisioning for network slices at the edge, Li et al. [6] compare degrees of centralized and distributed learning (ranging from fully centralized to fully distributed), in terms of the loss function convergence and accuracy of the AI models, by introducing a resource pooling policy. The fewer the resource pools used for model training, the more centralized the training procedure. It is observed that as the number of sub-pools increases, both the accuracy and the convergence rate decrease.
The above review shows that although there have been attempts to address the scalability of distributed AI, no existing work, to the best of our knowledge, unifies the three aspects - performance, convergence and stability of RL - relevant for 6G network management, while investigating massively distributed trained agents.

D. Extensions on Own Previously Published Work
To overcome the pitfalls of centralization, i.e., a single point of failure, signaling of redundant data, etc., in [19] we proposed a distributed intelligence framework for network management. This framework, QLC, was based on Q-Learning, a tabular RL mechanism. Next, we identified a relevant sub-problem in [36]: to understand how long QLC IAs need to learn their environment and converge. This understanding would help network operators to estimate how long they would need to wait before QLC converges and can take over management tasks reliably. To this end, we proposed novel Knowledge Indicators (KIs), metrics indicating the progress of QLC learning, derived from their Q-tables. Further, in [37] we analyzed the scalability of QLC, with an increasing no. of IAs, applied to the NS use case mentioned briefly before. Evaluations showed that QLC is scalable in terms of the system performance, i.e., w.r.t. the no. of users served by the NS. However, the convergence time of the IAs was found to increase exponentially with the no. of IAs, thus limiting the scalability of QLC. To this end, in [20], to improve scalability, we proposed a function approximation-based alternative, Deep Q-Networks (DQNC). We posited that since DQNC estimates Q-values instead of calculating them rigorously like QLC, the performance of QLC (the no. of users served by the NS per unit time) could be traded off against the time it takes to converge. We achieved at least a 37% reduction in convergence time with DQNC, compared to QLC. However, we found a major limitation of DQNC: its training instability, which hinders its ability to be deployed reliably for network management [20]. Then, in [38], we investigated the effort needed to distribute intelligence in 6G, by analyzing its impact on the existing 5G architecture. This work, therefore, concludes our long-term effort on understanding the practical impact of distributing intelligence for 6G network management, along the three dimensions of performance, convergence and training stability, not investigated in prior art before, e.g., for the NS use case.

III. MASSIVELY DISTRIBUTED INTELLIGENCE FOR AUTO-SCALING RESOURCES IN A NETWORK SLICE
In this section, we focus on a beyond 5G/6G use case - auto-scaling virtual resources in a network slice (NS) - to which we apply the various distributed intelligence solutions. We then motivate the auto-scaling problem using an example of the resource allocation conflict and define the objective of the auto-scaling mechanism. Then, we describe the RL algorithms, QLC and DQNC, investigated earlier, highlighting their individual limitations. Finally, we motivate our objective of addressing those limitations.

A. The Automated System
The automated system, shown in Fig. 2, consists of an NS that is composed of a total of N network functions (NFs). User Equipments (UEs) send service requests to the NS. The incoming load of the NS is determined from the no. of UEs requesting services. Based on the architectural principles of ETSI NFV MANO [26], each NF (in the 5G core) is implemented as software on its dedicated virtual network function (VNF), via a virtualization layer (i.e., a hypervisor, represented by the green arrows). We assume that these N NF-VNF pairs share CPU resources from M shared pools; the framework therefore consists of N NF-VNF pairs served by M shared CPU pools. The k-th shared pool, v_k, provisions a no. of CPUs to its N_k VNFs, where 1 ≤ k ≤ M. Within v_k, each NF, denoted by NF_j, is deployed on an individual VNF, VNF_j, where 1 ≤ j ≤ N_k. NS admission control ensures that if admitting a UE would cause the CPU utilization of an individual VNF (VNF_j) to exceed the threshold AC, the NS denies it the service. The goal of auto-scaling is to guarantee that, for a given incoming NS load, the load served by the NS is maximized, while keeping the CPU utilization of VNF_j close to a predefined target U, as stipulated by the operator's infrastructure requirements [39].
To understand the objective of the auto-scaling system, let us consider an example demonstrating the importance of efficient resource (CPU) allocation to the NF-VNFs, for a given incoming NS load. In Fig. 3, for simplicity, we assume two NF-VNF pairs sharing an arbitrary resource pool in the NS. We further assume that 2 CPUs remain in the pool. During high incoming load, at time instant t, both VNFs need two additional CPUs each to serve the load. However, since the resource pool is not sufficient to satisfy the combined demand of the scale-up actions from both VNFs, a conflict is said to occur. Ideally, at high incoming load and considering the total no. of CPUs shared between the VNFs, both VNFs should have sent out scale-up requests of 1 CPU each, enabling a more efficient allocation of the CPUs remaining in the pool. However, in reality, the faster VNF, i.e., the VNF that sends the scale-up request to the pool sooner than the other, is allocated the 2 remaining CPUs from the pool, following a greedy mechanism. The consequence is that the conflict results in disproportionate resource allocation, as the slower VNF gets no additional CPUs, thereby degrading the ability of the NS to serve the maximum no. of UEs per unit time. During low incoming load, the VNFs attempt to scale down, releasing CPUs back to the shared pool. Hence, the problem of conflict does not occur during a scale-down action. In addition, auto-scaling must also ensure that the VNFs do not over-utilize CPUs in low load, which would result in wastage of resources.
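For illustration, the following minimal sketch (not part of our simulation platform; names and values are placeholders) shows how a greedy, first-come-first-served pool allocation leaves the slower VNF without CPUs:

```python
# Illustrative sketch: greedy allocation of a shared CPU pool.
# Two VNFs each request 2 CPUs, but only 2 remain in the pool.

def greedy_allocate(pool_cpus: int, requests: list[tuple[str, int]]) -> dict[str, int]:
    """Serve scale-up requests in arrival order; later requests may get nothing."""
    granted = {}
    for vnf, demand in requests:          # requests ordered by arrival time
        grant = min(demand, pool_cpus)    # greedy: take whatever is left
        granted[vnf] = grant
        pool_cpus -= grant
    return granted

# VNF_1's request arrives marginally earlier than VNF_2's.
print(greedy_allocate(pool_cpus=2, requests=[("VNF_1", 2), ("VNF_2", 2)]))
# {'VNF_1': 2, 'VNF_2': 0}  -> conflict: disproportionate allocation.
# Had both requested 1 CPU each, the outcome would be {'VNF_1': 1, 'VNF_2': 1}.
```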

B. Components of the Distributed RL-Based Algorithms
We revisit the Reinforcement Learning (RL)-based distributed intelligence framework for auto-scaling, depicted in Fig. 4. An IA, denoted by A_j (1 ≤ j ≤ N_k) and accessing v_k, manages an NF-VNF pair. From here on, A_j denotes the j-th IA of v_k. The IAs whose VNFs share a resource pool are defined as Neighbor Intelligent Agents (NIAs). The rationale is that an IA's actions may be affected by those of its NIAs, e.g., scale-up actions causing conflicts; therefore, knowledge of neighbor information encourages cooperation within the given NIA group. The components of both QLC and DQNC are described next. Important notations are summarized in Table I.
1) Monitored Variables: The CPU utilization of an NF-VNF pair is defined as the ratio of the number of CPU cycles utilized to the maximum number of CPU cycles provided. In the design of our framework, at time instant t, the no. of UEs served by the NS is w(t). The local variables monitored by each IA A_j are the load μ_z contributed by the z-th UE (1 ≤ z ≤ w(t)) to the NS and the no. of CPUs n_j allocated to the j-th VNF. The CPU utilization at t is then expressed as a function of these variables,

u_j(t) = f(μ_1, μ_2, . . ., μ_{w(t)}, n_j(t)),   (1)

where u_j(t) is directly proportional to μ_1, μ_2, . . ., μ_z and inversely proportional to n_j(t). The detailed formulation of Eqn. (1) can be found in the Appendix. Additionally, the IA communicates u_j(t) to its NIAs, while receiving their utilization information. This signaling exchange is indicated in Fig. 4 by the red arrows.
2) Discrete State Space: The state of A_j, s_l(t) ∈ S (1 ≤ l ≤ |S|), is calculated based on the variables it monitors, upon which u_j is computed, and the NIA utilization it receives. A proper auto-scaling action must consider that, for a given incoming NS load: i) the served load is maximized, and ii) the distance between u_j(t) and U is minimized. To encode both dimensions, the state is formalized in complex-number representation as

s_l(t) = X(t) + i · Y_j(t),   (2)

where X(t) encodes how loaded the NIAs are overall. For v_k, X(t) is obtained by applying a quantization function h(·) to the aggregate utilization of the NIA group (Eqn. (3)), where h(·) quantizes a real no. Eqn. (3) indicates that X(t) is identical for a given NIA group, hence representing the aggregate loading of the IAs in the k-th group. A detailed explanation of h(·) can be found in the Appendix. In contrast, Y_j(t) is IA-specific and corresponds to the "balancing" part of the state space representation. Its value is +1 when the CPU utilization u_j is higher than the mean utilization of its neighbors, and -1 in the opposite case. Its value is zero when u_j and the mean utilization of its neighbors are equal - indicating perfect balancing of the load among the VNFs. Y_j(t) is therefore computed as

Y_j(t) = sign(u_j(t) − ū_j(t)),   (4)

where ū_j(t) denotes the mean utilization of A_j's neighbors. A benefit of this state space representation is that, with an increasing no. of IAs in the framework, this design avoids a corresponding increase in the size of the state space, since the no. of states depends only on h(·) and is hence independent of N_k. A detailed explanation of the state can be found in the Appendix.
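As an illustration of this encoding, the sketch below composes a state from a quantized group-level load term X and the balancing term Y_j. The quantization step used for h(·) is a placeholder, since the exact definition of h(·) is given only in the Appendix:

```python
# Illustrative sketch of the complex-number state encoding s = X + iY.
# The quantization step in h() is a placeholder; the exact h() is defined in the Appendix.

def h(value: float, step: float = 0.1) -> float:
    """Quantize a real number to a grid of width `step` (placeholder for the paper's h)."""
    return round(value / step) * step

def state(u_j: float, u_neighbors: list[float]) -> complex:
    group = [u_j] + u_neighbors
    x = h(sum(group) / len(group))            # aggregate loading of the NIA group
    mean_nbr = sum(u_neighbors) / len(u_neighbors)
    if u_j > mean_nbr:
        y = +1                                # IA is more loaded than its neighbors
    elif u_j < mean_nbr:
        y = -1                                # IA is less loaded than its neighbors
    else:
        y = 0                                 # perfectly balanced
    return complex(x, y)

print(state(u_j=0.82, u_neighbors=[0.60, 0.55]))   # e.g., (0.7+1j)
```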
A further effect of this state space design is that, for different numbers of neighbors in the framework, the information encoded in each state is also different. For example, the state of an IA sharing a resource pool with 4 other NIAs will encode different information (in terms of the no. of NIAs and their utilization) from the state of an IA having 9 other NIAs.
3) Discrete Action Space: A_j can select and perform an action a_m ∈ A (1 ≤ m ≤ |A|) from one of the following: add (scale up), release (scale down) or maintain (no scaling) the no. of CPUs assigned to its corresponding VNF.
4) Reward Model: The goal of an IA is to maximize the long-term rewards it receives from its environment [10]. In our scenario, the reward accounts for two aspects. For the current state-action pair (s_l, a_m), A_j is given a positive reward if the action brings u_j closer to U; the closer the new utilization u_j is to U, the higher the reward. In addition, an action that causes a conflict is penalized, and all conflicting IAs are affected. When the action is no scaling, the reward is inversely proportional to the difference between u_j(t) and U. The resulting reward function r(s_l, a_m), formalized in [19] as Eqn. (5), involves a flag F representing the occurrence of a conflict, a constant K that shapes the reward and a constant 0 < δ ≪ 1.
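Since Eqn. (5) is only referenced from [19] here, the following sketch captures the qualitative shape of the reward described above (higher reward the closer u_j is to U, a penalty on conflict); the constants and the exact functional form are placeholders, not Eqn. (5) itself:

```python
# Illustrative sketch of the reward shape described in the text.
# K, delta and the exact functional form are placeholders, not Eqn. (5) from [19].

def reward(u_j: float, U: float, conflict: bool, K: float = 1.0, delta: float = 1e-3) -> float:
    if conflict:                       # an action causing a conflict is penalized
        return -K
    # the closer u_j is to the target U, the higher the reward
    return K / (abs(u_j - U) + delta)

print(reward(u_j=0.68, U=0.70, conflict=False))   # close to target -> large reward
print(reward(u_j=0.30, U=0.70, conflict=False))   # far from target -> small reward
print(reward(u_j=0.68, U=0.70, conflict=True))    # conflict -> penalty
```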

C. Algorithm 1: Q-Learning for Cooperation (QLC)
After calculating the state, performing the action and receiving the reward, the IA evaluates how "good" the action was at that particular state. This measure reflects the "quality" of an arbitrary state-action pair (s_l, a_m), constituting the learning of the IA, known as Q-Learning (QL). The quality is represented by a function Q(s_l, a_m) and computed by the Bellman Eqn. [10]

Q(s_l, a_m) ← Q(s_l, a_m) + α [ r(s_l, a_m) + γ · max_{a'_m} Q(s'_l, a'_m) − Q(s_l, a_m) ],   (6)

where a'_m is an arbitrary action at the next state s'_l, α is the learning rate and γ is the discount factor. The Q-value for each state-action pair is stored in a tabular data structure known as a Q-table. The goal of QL is to maximize the long-term rewards using the Bellman Eqn. A sequence of Bellman Eqn. updates is referred to as an episode. The distributed intelligence framework implementing QL is referred to as Q-Learning for Cooperation (QLC).
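For reference, a minimal tabular Q-update implementing Eqn. (6); the learning rate and discount values are illustrative:

```python
# Minimal tabular Q-learning update (Eqn. (6)); hyper-parameter values are illustrative.
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9                      # learning rate, discount factor
Q = defaultdict(float)                       # Q[(state, action)] -> Q-value

def q_update(s, a, r, s_next, actions) -> None:
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

actions = (-2, -1, 0, +1, +2)                # scale-down / maintain / scale-up actions
q_update(s=(0.7, +1), a=+1, r=5.0, s_next=(0.7, 0), actions=actions)
print(Q[((0.7, +1), +1)])                    # 0.5 after a single update
```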
1) Learning Principle: Each IA learns its environment based on a trial-and-error approach. When the IA selects the best action at a given state, i.e., the action with the highest Q-value at the state, it is said to exploit its knowledge. Once in a while, the IA selects an action randomly from the action set, with probability ε. This random action selection constitutes exploration. Only by trying sub-optimal actions would the IA be able to positively reinforce the optimal action (learned so far). Therefore, depending on the dynamics of the environment, the IA needs to keep exploring, while, to perform well, it should exploit what it has already learned [40]. The balance between exploration and exploitation is determined by the probability ε. This learning approach is known as ε-greedy.
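A compact sketch of ε-greedy action selection as used by the IAs (variable names are ours):

```python
# ε-greedy action selection (sketch).
import random
from collections import defaultdict

Q = defaultdict(float)                                   # Q[(state, action)] -> Q-value
actions = (-2, -1, 0, +1, +2)

def select_action(s, epsilon: float):
    if random.random() < epsilon:                        # explore with probability ε
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])         # exploit: action with highest Q-value

print(select_action(s=(0.7, +1), epsilon=0.1))
```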
2) Proposed Knowledge Indicators for QLC: A Q-table is said to have converged when all Q-values have reached their optimal or sub-optimal values [41]. Balancing exploration and exploitation, or tuning ε, to ensure convergence is not straightforward. In realistic deployments, the network operator needs an understanding of how long to wait until the Q-tables of all IAs have converged, before they can be deployed for 6G network management. Since all cells of a Q-table may not be visited proportionately, confidence in the Q-values would also vary. Therefore, we proposed novel Knowledge Indicators (KIs) in [37], which track the progress of the learning at every state of each IA. KIs are metrics derived from Q-tables and are updated at the beginning of every training episode x (1 ≤ x ≤ E), where E is the total no. of episodes. We define two KIs at s_l; their full definitions can be found in [37]. When KI_l(t) can be neglected, ε_l for A_j at the beginning of episode x is updated according to Eqn. (7), where t* = t_0 + (x − 1) · T, and p_1 and p_2 are parameters that tune KI_l(t) and Eqn. (7).

D. Algorithm 2: Deep Q-Network for Cooperation (DQNC)
QLC iteratively calculates Q-values using the Bellman Eqn. An alternative is using function approximation (FA), where, instead of directly computing the Q-values, the algorithm estimates them by approximating a function representing the Q-values. Typically, neural networks (NNs) [8] are widely applied as effective FA tools. When an NN uses the Bellman Eqn. to estimate Q-values, the algorithm is called a Deep Q-Network (DQN). In [20], we proposed Deep Q-Network for Cooperation (DQNC) IAs, the FA counterpart of QLC.
Each DQNC IA consists of an NN, denoted as main, that is composed of an input layer, hidden layer(s) and an output layer of neurons. The main NN learns the environment by adjusting the parameters - weights and biases, denoted by θ - of the neurons, with the goal of minimizing the loss between its outputs and the target outputs. The loss calculated at the output layer is back-propagated to the internal layers. The parameters are updated to minimize the loss in the direction of the steepest slope or gradient, and this optimization process is known as stochastic gradient descent (SGD) [8]. Moreover, according to the architecture of DQN first proposed in the seminal paper [42], the input layer of main is the state, hence consists of |S| neurons, while the output is the Q-values of the actions, hence contains |A| neurons. Therefore, a forward pass of the input state s_l through main retrieves the Q-values q = [q_1, q_2, . . ., q_|A|]. Further, DQN consists of a second NN, target, with parameters θ̄, whose architecture is identical to that of main. "Freezing" θ̄ of target for a fixed no. of iterations helps to stabilize the training of main. Finally, each IA also requires a data structure known as the Replay Buffer (RB) [42], which stores the transition e = (s_l, a_m, r(s_l, a_m), s'_l) at t. This sequence is known as the IA's experience. The RB allows experiences to be picked randomly for training, thereby breaking temporal correlations.
At a given iteration, a mini-batch of experiences of size b is sampled randomly from the RB. The estimated Q-values are calculated by a forward pass of (s_l, a_m) through main, denoted by Q(s_l, a_m; θ). The target Q-values are calculated as

Y^DQNC = r(s_l, a_m) + γ · max_{a'_m} Q(s'_l, a'_m; θ̄).   (8)

Then, the error between the estimated Q-values and the target Q-values for the current mini-batch b, denoted by Δ(s_l, a_m; θ), is computed by a loss function L = f(Δ(s_l, a_m; θ)) (Eqn. (9)), e.g., Mean Square Error (MSE) or Mean Absolute Error (MAE). Eqn. (9) is the Bellman Eqn. (6) reflecting NN-based learning.
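For concreteness, a minimal PyTorch-style sketch of one DQNC training step; the network sizes, buffer handling and hyper-parameters are illustrative and do not reflect our actual configuration (Table VI):

```python
# Minimal DQN training step (sketch); sizes and hyper-parameters are illustrative.
import random
import torch
import torch.nn as nn

S_DIM, A_DIM, GAMMA = 2, 5, 0.9
main = nn.Sequential(nn.Linear(S_DIM, 24), nn.ReLU(), nn.Linear(24, A_DIM))
target = nn.Sequential(nn.Linear(S_DIM, 24), nn.ReLU(), nn.Linear(24, A_DIM))
target.load_state_dict(main.state_dict())                # identical architecture and weights
optimizer = torch.optim.SGD(main.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(replay_buffer, batch_size=32):
    batch = random.sample(replay_buffer, batch_size)     # random sampling breaks temporal correlations
    s, a, r, s_next = (torch.stack(x) for x in zip(*batch))
    q_est = main(s).gather(1, a.long().unsqueeze(1)).squeeze(1)       # Q(s, a; θ)
    with torch.no_grad():                                             # "frozen" target network
        y = r + GAMMA * target(s_next).max(dim=1).values              # target Q-values (Eqn. (8))
    loss = loss_fn(q_est, y)                                          # loss L (Eqn. (9))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example: fill the buffer with dummy transitions and run one step.
buffer = [(torch.rand(S_DIM), torch.tensor(random.randrange(A_DIM)),
           torch.tensor(1.0), torch.rand(S_DIM)) for _ in range(100)]
print(train_step(buffer))
```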

E. Motivation for a New Approach for Massive Distribution
We defined massive distribution of intelligence as each network element (i.e., NF-VNF) having its own intelligent entity (i.e., IA) for its management. We show that existing RL solutions, i.e., QLC and DQNC, are not applicable for this purpose. Let us consider the NS use case defined earlier. In Fig. 5(a), we compare the combined convergence time of QLC and DQNC in terms of the no. of training episodes versus the no. of IAs sharing one resource pool. We observe that although the performance of QLC is scalable, the convergence time is not, as it increases with the no. of IAs [37], in a near-exponential trend. In contrast, DQNC improves the scalability of distributed intelligence [20]. Higher cumulative rewards of DQN over QL are also demonstrated in our previous work [20], and confirmed more recently by the findings in [43]. Hoa et al. [43] in particular compare DQN with QL, showing that DQN achieves higher cumulative rewards in the long term (i.e., better convergence behavior) than QL. This finding further motivates the need for new approaches for massive distribution. Further, in Fig. 5(b) we demonstrate the evolution of average rewards across training episodes, for an arbitrary scenario with two IAs, trained up to 500 episodes. We observe that, when allowed to train further, DQNC moves away from the convergence point or the learned optimal behavior. Moving away from the optimum is referred to as divergence, which consequently imparts instability to the training process. Due to the training instability, one of the IAs is unable to maximize its average rewards in the long term, as it is "trapped" in a sub-optimal zone [10]. Therefore, considering the issues that each RL algorithm brings, our objective in this work is to address these problems of convergence scalability, training stability and performance scalability of the IAs, not investigated before.

IV. IMPROVED DDQN FOR ACHIEVING MASSIVE DISTRIBUTION OF INTELLIGENCE
In this section, we discuss a drawback of DQNC, the over-estimation bias. Then, we propose a novel combination of two distinct improvements to DQNC: i) Double Deep Q-Network (DDQN) to enhance both performance and training stability, and ii) reward scaling to further improve stability. We denote our algorithm as improved DDQN (IDDQN), and propose its application to massively distribute intelligence in 6G. Finally, we outline how we train IDDQN and state its computational complexity.

Fig. 5. Demonstrating unsuitability of QLC and DQNC for massive distribution.

TABLE II: IMPACT OF Q-VALUE OVER-ESTIMATION

A. Limitation of DQNC: The Over-Estimation Bias
So far, in both QLC and DQNC, the calculation of the target Q-values has involved a max operation. A maximum of the estimated values can inherently lead to a maximization bias [10]. Consequently, over-estimations make these algorithms susceptible to noisy Q-value updates, which might eventually lead to long-term inaccurate greedy action selection. However, over-estimations might not necessarily degrade the performance of QL or DQN in general, as long as the relative correlations or "degrees of preference" among the actions are preserved [17]. For example, we assume that the true Q-values q_1, q_2 and q_3 for three possible actions in an arbitrary state are known, according to Table II. The over-estimated values of q_1, q_2 and q_3 in Case 2 would not negatively affect the performance of the algorithms, as q_1 and q_3 are both 1/10 of q_2. The relative preference to select the action corresponding to q_2 has been preserved even in the over-estimated Q-values in Case 2, similar to Case 1. However, in Case 3, the over-estimation has disrupted the relationship between the Q-values, where q_1 and q_3 are now both 1/2 of q_2. Intuitively, this disruption might cause training instability in the long term, if the Q-values in Case 3 continue to evolve in the same manner.
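For instance, one set of values consistent with the ratios described above (the numbers themselves are illustrative, not those of Table II) is:

```latex
% Illustrative values consistent with the ratios stated in the text:
\begin{align*}
\text{Case 1 (true values):}  &\quad q_1 = 1,\; q_2 = 10,\; q_3 = 1  &&\Rightarrow q_1 = q_3 = \tfrac{1}{10}\, q_2\\
\text{Case 2 (benign bias):}  &\quad q_1 = 3,\; q_2 = 30,\; q_3 = 3  &&\Rightarrow \text{ratios and greedy action preserved}\\
\text{Case 3 (harmful bias):} &\quad q_1 = 5,\; q_2 = 10,\; q_3 = 5  &&\Rightarrow q_1 = q_3 = \tfrac{1}{2}\, q_2
\end{align*}
```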
An open issue in our previous work [20] was training instability in DQNC. In some iterations, after converging to the optimal behavior, when the algorithm was allowed to train further, one or more DQNC IAs were observed to diverge (move away from the optimal behavior), causing instability in the training. This led to sub-optimal auto-scaling, degrading system performance. In fact, the divergence problem of DQN is a widespread issue in the research community [16]. Several improvements have been proposed earlier, and one of them is reducing the over-estimation bias to stabilize DQN training and improve performance [17].

B. Improvement 1: Double Deep Q-Network (DDQN)
The target Q-values of DQNC, Y^DQNC, in the Bellman Eqn. (8), involve the max operator. Alternatively, Eqn. (8) may be expressed as

Y^DQNC = r(s_l, a_m) + γ · Q(s'_l, arg max_{a_k} Q(s'_l, a_k; θ̄); θ̄),   (10)

where the term arg max_{a_k} Q(s'_l, a_k; θ̄) represents the action selection, as DQNC selects the action with the maximum Q-value from target. The Q-value of the selected action, Q(·), is then evaluated by the same target NN.
To reduce the adverse impact of over-estimations on performance and stability, an improvement of this equation was proposed in [44]. The idea in [44] was to decouple action selection (arg max) and action evaluation Q(·), by decomposing the max operator in Eqn. (8). The rationale is that, instead of using the target NN to both select the action and estimate its Q-value, the main NN selects the action while target estimates the corresponding Q-value. This decoupling avoids the induced over-estimation bias, as different networks are used to select the action and estimate the action value, and requires minimal computation overhead [17]. The target Q-value update therefore becomes

Y^DDQN = r(s_l, a_m) + γ · Q(s'_l, arg max_{a_k} Q(s'_l, a_k; θ); θ̄).   (11)

Therefore, in this paper, we adopt the DDQN approach, aimed at stabilizing NN training and improving performance by reducing the over-estimation bias. We remark that all algorithm components described previously for DQNC are valid for DDQN, except the target Q-value computation.
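In code, the decoupling amounts to one extra forward pass through main for action selection. The sketch below (reusing the main and target networks from the DQNC sketch above) contrasts the two targets:

```python
# DQN target (Eqn. (8)) vs. DDQN target (Eqn. (11)) -- sketch, reusing main/target defined earlier.
import torch

def dqn_target(r, s_next, gamma=0.9):
    with torch.no_grad():
        return r + gamma * target(s_next).max(dim=1).values              # select & evaluate with target

def ddqn_target(r, s_next, gamma=0.9):
    with torch.no_grad():
        a_star = main(s_next).argmax(dim=1)                              # action selection with main (θ)
        q_eval = target(s_next).gather(1, a_star.unsqueeze(1)).squeeze(1)  # evaluation with target (θ̄)
        return r + gamma * q_eval
```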

C. Improvement 2: Reward Scaling
Previous works [42], [45] have adopted the mechanism of reward clipping, in which r(s_l, a_m) is clipped according to

r_cl(s_l, a_m) = max(−η, min(+η, r(s_l, a_m))),   (12)

where r_cl(s_l, a_m) is the reward that has been clipped within the range [−η, +η], and η is a predefined threshold. Reward clipping ensures that the target Q-values do not increase or decrease in an unprecedented manner, avoiding huge fluctuations in the gradients of main and unstable training. The drawback of clipping is that η needs to be tuned according to the algorithm hyper-parameters. Moreover, clipping disrupts the relative scale of the rewards, which can adversely affect the performance of the IAs. When rewards are clipped, e.g., to η = ±100, r(s_l, a_m) = 10^4 has the same impact on the Q-value evaluation as r(s_l, a_m) = 10^3. Therefore, as an improvement, reward scaling was proposed [18], where, instead of clipping based
on hard thresholds, rewards are "scaled" according to the relationship

r_sc(s_l, a_m) = sign(r(s_l, a_m)) · log(1 + |r(s_l, a_m)|).   (13)

Eqn. (13) achieves three benefits. i) The relative scale of the individual rewards is preserved. ii) Extremely high reward values are converted to their logarithmic counterparts, which prevents huge variations in the gradients of the SGD algorithm. We posit that both these benefits ultimately stabilize training, a critical issue remaining in distributed intelligence. Finally, we know that normalizing the rewards within a range [a, b] is also possible, which has been shown to bring some stability to the training [42]. We argue that iii) reward scaling brings generalizability to the model, as, even in scenarios where the reward model is unknown, and hence a and b are unknown, any arbitrary reward can be scaled using the logarithm in Eqn. (13).
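Eqn. (13) translates directly into a one-line transformation; a small sketch:

```python
# Reward scaling per Eqn. (13): sign-preserving logarithmic compression of the reward.
import math

def scale_reward(r: float) -> float:
    return math.copysign(math.log1p(abs(r)), r) if r != 0 else 0.0

print(scale_reward(10_000))   # ≈ 9.21  -- still larger than...
print(scale_reward(1_000))    # ≈ 6.91  -- ...so the relative order is preserved
print(scale_reward(-50))      # ≈ -3.93 -- the sign is kept
```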

D. Our Proposal: Improved DDQN (IDDQN)
From the previous discussions, we argued that DDQN improves the performance and training stability of DQNC, while reward scaling further stabilizes the training. Moreover, we assert that convergence scalability, necessary for massive distribution, will be achieved by DDQN, as it is an FA approach. Therefore, in this paper, we propose the adoption of both DDQN and reward scaling, denoting this algorithm improved DDQN or IDDQN - which ultimately addresses scalability in all three dimensions: convergence, training stability and performance. The target Q-values for an IDDQN IA are then expressed as

Y^IDDQN = r_sc(s_l, a_m) + γ · Q(s'_l, arg max_{a_k} Q(s'_l, a_k; θ); θ̄).   (14)

E. IDDQN Training & Computational Complexity
Here, we describe the training procedure of an IDDQN IA, which is identical for all IAs, and outline the computational complexity of training each IA.
1) Algorithm: Algorithm 1 outlines the training procedure of each IDDQN IA. IDDQN IAs learn according to the ε-greedy approach. Typically, at the start of the training, the value of ε is configured to a high value, close to 1, enforcing a higher likelihood of exploration in the beginning. Then, as the IAs explore and learn the environment, ε may be decayed every episode by d%. Moreover, the objective of target is to stabilize the training of main. Therefore, we update target at every iteration by a very small amount ψ ≪ 1 (line 25 of Algorithm 1). Instead of directly copying θ to θ̄, soft updates constrain the target values to change very slowly, greatly improving the stability of learning [46]. The updated θ̄ takes a fraction ψ of θ and a fraction (1 − ψ) of its old values. This mechanism is called a soft network update [47].
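The soft update of line 25 of Algorithm 1 can be sketched as follows (the value of ψ is illustrative):

```python
# Soft (Polyak) update of the target parameters: θ̄ ← ψ·θ + (1 − ψ)·θ̄.
import torch

def soft_update(main_net: torch.nn.Module, target_net: torch.nn.Module, psi: float = 0.005):
    with torch.no_grad():
        for p_main, p_target in zip(main_net.parameters(), target_net.parameters()):
            p_target.mul_(1.0 - psi).add_(psi * p_main)
```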
2) Computational Complexity: The computational complexity of the matrix multiplication C = A × B, where A is an n × m matrix, B is m × p and C is n × p, is O(nmp). Training a neural network for one iteration - computing Y^IDDQN according to Eqn. (14), performing a forward pass and back-propagating the loss L through the layers of main - involves a feed-forward and a back-propagation step whose complexities, in the simplest form of matrix multiplication, are identical. With an input layer of |S| neurons, a hidden layer of c neurons and an output layer of |A| neurons, one training iteration on a mini-batch therefore costs on the order of O(b · c · (|S| + |A|)) operations, and training over all episodes costs O(E · b · c · (|S| + |A|)), where E is the no. of training episodes, b is the batch size trained at every episode and c is the no. of neurons in the hidden layer. The higher the no. of training episodes, the greater the computational complexity.

V. METHODOLOGY
In this section, we first describe how we model the NS and the simulation setup. Then, we define benchmarks to compare the performance of the RL algorithms: 1) a centralized Mixed Integer Optimization (MIO) formulation, 2) an algorithm that does not perform auto-scaling, referred to as "no automation" (NO_AUT), and 3) threshold-based distributed scaling, THR. We then configure the RL algorithms, QLC and IDDQN. Finally, we describe the design of our software platform.

A. System Modeling & Simulation Setup

1) NS Load Generation: UE arrivals are characterized by the parameter τ. To simulate dynamic or time-varying arrivals of UEs, Λ_in(t) is varied periodically over an episode duration T. Λ_in(t) reflects an alternating smooth increase and decrease of UE arrivals over T, inspired by the load fluctuation trends in [48]. We set two load peaks across T = 10^4 s in simulation time. This variation represents a peak-hour traffic scenario in a business area on a weekday. We refer to this load profile as the peak-hour load profile (PHLP), depicted in Fig. 6 (top).
We know that increasing the no. of IAs increases the total convergence time of the algorithms [37]. To conduct extensive simulations of massively distributed scenarios, we increase the no. of IAs up to the limits of our hardware infrastructure. In this respect, the PHLP poses a significant challenge to the evaluation. This is because our simulator is event-based, where all the steps of the RL algorithms are executed when a new UE is admitted to the NS. Since the high-load occurrences constitute around 60% of the PHLP, the total wall-clock time of the algorithms for an increased no. of IAs (e.g., N ≥ 11) [37] becomes intractable. Therefore, we consider a load profile that is complementary to the PHLP, consisting of 60% low-load occurrences instead of high-load ones. We denote this load profile as the complementary load profile, or CLP, shown in Fig. 6 (bottom). The CLP is also justified by [48], where it reflects the load profile of an entertainment scenario on a weekday. All evaluations in this paper are conducted on the CLP, unless otherwise stated.
2) NS Configuration: The system is configured according to Table III. We implement distributed admission control in each NF of the NS. When admitting a UE to the NS, if u_j would exceed the admission threshold AC, the UE is denied the service. Further, the CPU utilization target of a VNF is U. To allow a fair comparison of the RL algorithms, the initial no. of CPUs allocated to VNF_j is configured to n_j^ini in episode 1. Therefore, all IAs in all NS deployment scenarios start from identical initial states. For any other episode, each VNF starts with the no. of CPUs it had at the last time instant of the previous episode. Finally, the no. of available CPUs in the k-th resource pool, v_k, at episode 1, is adjusted according to the no. of NF-VNF pairs in the NS, e.g., as in Table IV. This is because, for a direct comparison of different scenarios, system KPIs should be compared with the same optimum value.
3) NS Deployment: To evaluate massively distributed scenarios, we consider up to N = 20 NFs, managed by N = 20 IAs in the NS. We remark that this figure is practically reasonable, owing to i) the functional split of the EPC into 5G NFs [49], consisting of 10 control plane NFs and one user plane NF (i.e., the User Plane Function) [2], and ii) the limits of our hardware infrastructure. The deployment of the NFs in the NS is classified into two groups. According to Table IV, in Group 1, we scale the no. of NFs by considering M = 1 shared pool. In Group 2, shown in Table V, the total no. of NFs is constant, with N = 20, while the no. of shared resource pools M is varied. Moreover, v_k is adjusted according to the deployment.
4) Benchmarks: We consider three benchmarks for comparing the performance of the RL algorithms. First, the mechanism that never scales any CPUs based on the incoming NS load, the "no automation" NO_AUT algorithm, indicates the lower bound of the performance. NO_AUT always serves the maximum no. of UEs possible with the no. of CPUs initially allocated to the VNFs, i.e., n_j^ini. Second, the upper bound of the performance is MIO. At t, for a given Λ_in(t), MIO solves an optimization problem aiming to maximize the load served by the NS, while allocating CPUs among the VNFs such that the individual CPU utilization u_j is as close to the target U as possible. Then, it allocates this optimum no. of CPUs to each VNF. A detailed formulation of MIO (omitted in this work for brevity) is found in [37]. Finally, we implement a distributed, threshold-based scaling baseline, THR, similar to existing implementations in open source orchestration frameworks, such as Open Source MANO [21]. During high incoming NS load, when u_j exceeds a "high" threshold u_H, THR attempts to scale up by the no. of CPUs required to reach U. A conflict is potentially inevitable, as THR is a greedy mechanism and all VNFs would attempt to scale up simultaneously. Conversely, when u_j falls below a "low" threshold u_L, THR scales down by the no. of CPUs needed to reach U. The scaling thresholds are configured according to Table III.
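A sketch of the THR decision logic described above; the threshold and target values are illustrative, the actual values being configured in Table III:

```python
# Threshold-based scaling (THR) -- sketch; u_H, u_L and U values are illustrative.
import math

def thr_decision(u_j: float, n_j: int, U: float = 0.7, u_H: float = 0.9, u_L: float = 0.4) -> int:
    """Return the requested change in the no. of CPUs for one VNF."""
    if u_j > u_H:
        # scale up by as many CPUs as needed to bring utilization back to the target U
        target_cpus = math.ceil(n_j * u_j / U)
        return target_cpus - n_j
    if u_j < u_L:
        # scale down towards the target U, never below 1 CPU
        target_cpus = max(1, math.ceil(n_j * u_j / U))
        return target_cpus - n_j
    return 0  # utilization within [u_L, u_H]: no scaling

print(thr_decision(u_j=0.95, n_j=4))   # +2: request two extra CPUs
print(thr_decision(u_j=0.20, n_j=4))   # -2: release two CPUs
```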

5) RL Algorithm Configuration:
The action space of any QLC or IDDQN IA is A = {−2, −1, 0, +1, +2}, i.e., |A| = 5. An IA is configured according to Table VI. In IDDQN, the parameters of target are updated according to the soft update principle. The exploration probability ε is decayed by d% at the start of every episode and is constant throughout the episode. In QLC, ε is decayed based on KI updates. Since QLC IAs are slow to converge, we train QLC IAs longer than IDDQN.

B. Simulation Platform Design
We develop a platform that simulates the system model defined earlier. It simulates the arrival of the UEs, the admission control logic and the behavior of the system based on the benchmark algorithms' and the IAs' decisions - MIO, QLC, DQNC and IDDQN. The Unified Modeling Language (UML) class diagram of the platform is depicted in Fig. 7. Class SimulationFactory defines, instantiates and implements all attributes and behavior related to simulating a UE. It instantiates one object each of the classes LoadGenerator, AdmissionControlManager, MetricUpdateManager and Training. Based on the input configuration, LoadGenerator generates the load profile (PHLP/CLP).
The following sequence of steps is triggered by a UE arrival. AdmissionControlManager decides whether to admit or reject the UE to the NS. MetricUpdateManager updates all necessary metrics when a UE is admitted to or rejected from the NS. When the algorithm type is MIO or NO_AUT, the processUser() method of class SimulationFactory logs UE statistics and updates NS metrics, without invoking automateUser(). If the algorithm type is THR, QLC, DQNC or IDDQN, the automateUser() method is invoked, and the corresponding objects are instantiated. For THR, each object of class AgentTHR chooses and performs the action. For AgentQLC, the Q-table is initialized and updated, in addition to the calculation of the state and the action selection. Similarly, AgentDeep inherits the methods of AgentQLC. In addition, the method checkWhichDeepAlgorithm() determines whether the algorithm is DQNC or IDDQN. The method forward() defines one forward pass of main. Next, class ReplayBuffer defines methods to push and sample experiences, and class QValues to calculate the current Q-values of main and the next Q-values of target. An object of each of these classes is instantiated by the instance of Training, and invoked by trainDeepAlgorithms(). The concept of inheritance is leveraged for code re-usability.
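The class relationships described above can be summarized in the following structural skeleton (method bodies omitted; method names not mentioned in the text are our own simplification of Fig. 7):

```python
# Structural skeleton mirroring the UML relationships described in the text.
class AgentTHR:            # threshold-based scaling decision per VNF
    def choose_and_perform_action(self): ...

class AgentQLC:            # tabular IA: Q-table init/update, state calculation, action selection
    def compute_state(self): ...
    def select_action(self): ...
    def update_q_table(self): ...

class AgentDeep(AgentQLC): # inherits the QLC methods; adds NN-specific behavior
    def checkWhichDeepAlgorithm(self): ...   # DQNC or IDDQN
    def forward(self): ...                   # one forward pass of main

class ReplayBuffer:
    def push(self, experience): ...
    def sample(self, batch_size): ...

class QValues:
    def current(self, main, states, actions): ...   # current Q-values of main
    def next(self, target, next_states): ...        # next Q-values of target

class Training:            # instantiates ReplayBuffer and QValues; drives deep training
    def trainDeepAlgorithms(self): ...

class SimulationFactory:   # instantiates LoadGenerator, AdmissionControlManager,
    def processUser(self): ...               # MetricUpdateManager and Training
    def automateUser(self): ...
```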
The platform is developed in Python [50]. The UE arrivals are simulated using the discrete event simulator SimPy [51], while the Deep RL classes are written in PyTorch [52]. All simulations run on an Intel Xeon workstation with 48 CPUs and a processor speed of 2.7 GHz.

VI. SOLUTION EVALUATION
This section evaluates the performance, stability and convergence of IDDQN. First, the convergence and stability of the deep RL algorithms are analyzed. The performance of IDDQN is evaluated, along with the no. of conflicts after convergence. We demonstrate the superior robustness of IDDQN over DQNC, i.e., how well IDDQN performs in an environment it has not been trained in. Finally, a discussion on the resource-efficiency of IDDQN concludes the evaluation.

A. Convergence and Stability Analysis
In general, the learning of an IA is considered stable if, in the long term, its rewards are maximized and its loss is minimized, with minimal fluctuations across episodes. All IAs must exhibit this stable behavior. We conduct the analysis of the convergence and stability of DQNC, DDQN and IDDQN IAs, testing these algorithms with two IAs, in three phases.
1) Learning Behavior of DQNC vs. DDQN IAs: To observe the learning behavior of DQNC and DDQN IAs, we begin with an arbitrary clipping threshold η = 10. All input conditions and hyper-parameters of both algorithms are identical. Fig. 8 depicts the average reward and average loss per IA for 500 training episodes. Even after 500 episodes, DQNC IAs (in Fig. 8(a)) are unable to maximize long-term rewards. The average loss (in Fig. 8(c)) decreases after around the 200th episode, but tends to fluctuate, indicating an overall unstable training process. However, DDQN IAs are seen to mitigate this unstable behavior, as they successfully maximize long-term rewards and minimize both the loss function and its fluctuations as the learning evolves, in Fig. 8(b) and 8(d). We remark that the instability of DQNC is due to the over-estimation bias of the maximization operator, which is mitigated by DDQN. Further, we tested several clipping thresholds and observed similar behavior for DQNC and IDDQN for any setting. For the sake of brevity, we omit these results here. Since DDQN is clearly more stable, from now on, we focus on DDQN.
2) Impact of Reward Clipping on DDQN Stability: In the second phase, we compare using raw rewards (no clipping) versus applying different clipping thresholds η, to evaluate the impact of clipping and its magnitude on the learning stability of DDQN IAs. We select arbitrary thresholds, i.e., 300, 100 and 50. Similar to the first phase, the evolution of the average reward and loss across episodes is given in Fig. 9(a) to 9(d) and in Fig. 9(e) to 9(h), respectively. We remark that in Fig. 9(e) to 9(h), showing the evolution of the loss function, the y-axes have different scales. We observe that as the magnitude of the clipping threshold decreases, the DDQN IAs progressively improve the stability of their learning. Moreover, we see high fluctuations for the no-clipping case, as the gradients fluctuate drastically over consecutive iterations in order to minimize the loss, making the loss function extremely sensitive to outliers. From this discussion, we infer that i) reward clipping improves the training stability compared to no clipping, and ii) the lower the clipping threshold, the better the stability.
3) Stability Over the Long Term: Stability of Deep RL is subject to ongoing research in the RL community [53]. We therefore validate whether reward clipping, with an appropriate threshold, is able to maintain the stability of DDQN over the long term. To this end, we train the IAs for a practically long period of time, i.e., in our case, up to 2500 episodes. Again, we plot the evolution of the average reward and loss per IA across episodes in Fig. 10. Figure 10 shows that clipping indeed does not ensure stability indefinitely. The average loss in Fig. 10(c) shows that from the 700th episode onward, the DDQN IAs start to move towards sub-optimal CPU allocation. The sudden, drastic fall in the average reward at the 2000th episode (Fig. 10(a)) is when the IAs diverge completely, as only IA A_1 manages to minimize its loss function, at the cost of increasing the loss of the other. We refer to this phenomenon as training instability in the long term.
IDDQN, however, scales the rewards. We see the impact of scaling, as applied by IDDQN, in Fig. 10(b) and 10(d). We observe that reward scaling mitigates the problem of training instability early, around the 650th episode, thereby avoiding instability in the long term. The mild fluctuations observed after convergence correspond to the IAs moving around the optimum, but not away from it, over the whole remaining horizon of up to 2500 episodes. Therefore, we validate the importance of reward scaling for the learning stability of DDQN, underlining the need for our proposed IDDQN.
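As a minimal sketch of this distinction, reward scaling shrinks the magnitude of every reward by a constant factor instead of hard-clipping it, so the relative ordering of rewards is preserved; the helper name and the scale value below are assumptions chosen for illustration.

```python
def preprocess_reward(reward, scale=100.0, eta=None):
    """Reward pre-processing: scaling (as in IDDQN) keeps the ordering of rewards,
    whereas clipping (as in the DDQN experiments above) truncates large rewards."""
    if eta is not None:
        return max(-eta, min(eta, reward))   # clipping into [-eta, +eta]
    return reward / scale                    # scaling by a constant factor
```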

B. Performance Analysis and Evolution of the no. of Conflicts
At time instant t and for a given incoming NS load Λ_in(t), the performance of an algorithm ALG (i.e., MIO, NO_AUT, THR, QLC, IDDQN) is evaluated by measuring the load served, Λ^ALG_out(t), by the corresponding algorithm. The optimum load served by the NS is determined by MIO. This implies that at any t, the closer the served load of an algorithm to that of MIO, the better its performance. We demonstrate the performance gain of IDDQN using three metrics:
• The instantaneous load served Λ^ALG_out(t) in an episode.
• The Mean Absolute Percentage Error (MAPE). With D samples in an episode, taken from MIO as true values (i.e., Λ^MIO_out), we compute the MAPE of an algorithm ALG as MAPE = (100/D) · Σ_{d=1}^{D} |Λ^MIO_out(t_d) − Λ^ALG_out(t_d)| / Λ^MIO_out(t_d) (see the computation sketch below).
• The absolute distance between the optimum load served and the load served by ALG in an episode, denoted by Λ^ALG_dist(t).
1) Aggregated Performance of All IDDQN IAs: Figure 11 shows the MAPE of the performance and the convergence in terms of episodes for both Group 1 and Group 2 QLC and IDDQN IAs. We plot the average MAPE, along with 95% confidence intervals over 50 episodes after convergence, on the y-axis. The x-axis shows the corresponding episode of convergence. In general, as QLC is a tabular approach, QLC IAs are more accurate, in terms of the MAPE, than IDDQN IAs. In both Group 1 and Group 2, after convergence, the performance of QLC IAs deviates by 2±1% from the optimum, while that of IDDQN deviates by 8±2%. However, IDDQN IAs converge sooner than QLC, as IDDQN is a function approximation approach and hence more scalable with the no. of IAs. Further, IDDQN may be considered stable over subsequent training episodes after convergence, if the tolerance level of the error is at most 10%. From these results, we infer that increasing the no. of IAs, e.g., from 10 to 20 in Fig. 11(a), does not increase the MAPE beyond the tolerance level. Since there is no major performance gap between 10 and 20 IAs, we likewise do not expect a major performance gap for a no. of IAs beyond 20.
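The two error metrics defined above could be computed, for instance, as follows; this is a short sketch assuming the per-episode load samples of MIO and of the evaluated algorithm are available as arrays (names are illustrative).

```python
import numpy as np

def mape(lambda_mio_out, lambda_alg_out):
    """Mean Absolute Percentage Error over the D samples of an episode,
    with the MIO served load taken as the ground truth."""
    mio = np.asarray(lambda_mio_out, dtype=float)
    alg = np.asarray(lambda_alg_out, dtype=float)
    return 100.0 * np.mean(np.abs((mio - alg) / mio))

def abs_distance(lambda_mio_out, lambda_alg_out):
    """Absolute distance between the optimum load served and the load served by ALG."""
    return np.abs(np.asarray(lambda_mio_out, dtype=float)
                  - np.asarray(lambda_alg_out, dtype=float))
```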
2) Evolution of the no. of Conflicts: Both QLC and IDDQN IAs should ideally learn to avoid conflicts after convergence. In practice, conflicts are not eliminated completely. Figure 12 depicts the average no. of conflicts after convergence, along with 95% confidence intervals, versus the episode of convergence for the given NS deployment scenario. We observe that, in general, IDDQN IAs are able to avoid more conflicts than QLC after convergence. Further, after convergence of the Group 1 IAs in Fig. 12(a), the no. of conflicts increases mostly linearly with the no. of IAs. In Fig. 12(b), for the Group 2 IAs, although a clear trend is less obvious due to the varied deployments, we observe that the scenario with N = 20 IAs sharing one resource pool exhibits the highest no. of conflicts among the scenarios for both algorithms. This behavior is attributed to the fact that the aggregate CPU utilization of the neighbors loses granularity as the no. of IAs sharing a given resource pool increases. Nevertheless, we infer that for a given no. of IAs, IDDQN achieves faster convergence and a reduced no. of conflicts compared to QLC.
3) Performance (Evolving) Over Time: Next, we observe the performance of the algorithms evolving in time for an arbitrary NS deployment with N = 20 IAs. We depict Λ_in(t) and Λ^ALG_out(t) for the corresponding algorithms - MIO, NO_AUT, THR, QLC and IDDQN - in Fig. 13 for four episodes of interest (Fig. 13 shows the performance of the distributed algorithms in terms of Λ^ALG_out(t) and Λ^ALG_dist(t) in s^−1 for selected episodes). The upper and lower bounds are MIO and NO_AUT, respectively. We remark that at certain time instants, Λ^MIO_out(t) is greater than Λ_in(t), due to the granularity of the time window within which the served and incoming load are measured. In episode 1, shown in Fig. 13(a), the performance of both QLC and IDDQN is poor and comparable to NO_AUT, as they have just begun to learn their environment, with a high probability of exploration ε. This behavior is also evident from Fig. 13(b), where Λ_dist(t) shows that the RL algorithms are far away from MIO. In subsequent episodes, the learning is incremental. At episode 501, IDDQN converges, as its performance is close to MIO, but QLC is yet to converge. At a later episode, i.e., episode 2361, QLC converges. Fig. 14 shows this episode, zoomed into the 4000th to 6000th time interval. We observe further, from Fig. 13(b), that THR performs consistently worse than the RL algorithms, as it is a threshold-based algorithm and scales up only when the individual utilization exceeds u_H at high load. Finally, after a long period of training, e.g., at episode 2500 in Fig. 13, the IDDQN IAs retain their learning, validating that the training is reliable.

C. Demonstrating Robustness of IDDQN
The robustness of an RL model is defined as the performance of the model when applied in an environment for which it has not been trained [54]. Here, we demonstrate the robustness of IDDQN, which has been trained on CLP, in an untrained, inference (test) environment, i.e., the PHLP. Hence, to test the robustness of IDDQN on PHLP, we do not retrain our IDDQN models on PHLP. Instead, for a given deployment scenario, we measure the performance Λ^IDDQN_out(t) on PHLP by using the individual, already trained IDDQN PyTorch models. We also measure the performance Λ^DQNC_out(t) of the corresponding, already trained DQNC models. Figure 15 shows the average MAPE of Λ^IDDQN_out(t) and Λ^DQNC_out(t), along with 95% confidence intervals over 20 episodes after convergence, on PHLP. We observe that, on average, the performance of IDDQN deviates by 4±1% from the optimum, while DQNC deviates by 10±6%. Besides being closer to the optimum on average, IDDQN also has less variance than DQNC; hence the performance of IDDQN is more stable, reliable and robust than that of DQNC.
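This robustness check is inference-only; conceptually, it could look like the following sketch, where the model paths, the environment object and its methods are hypothetical placeholders rather than the platform's actual interfaces.

```python
import torch

def evaluate_robustness(model_paths, test_env):
    """Load the per-IA PyTorch models trained on CLP and run them greedily
    (no exploration, no retraining) on the untrained PHLP load profile."""
    agents = [torch.load(path) for path in model_paths]   # already trained per-IA models
    for net in agents:
        net.eval()                                        # inference mode only
    with torch.no_grad():
        while not test_env.done():                        # hypothetical environment API
            for j, net in enumerate(agents):
                state = test_env.observe(j)               # per-IA state tensor
                action = int(net(state).argmax())         # greedy action from the trained model
                test_env.apply(j, action)
```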

D. Resource Efficiency
Although resource efficiency is not the main focus of this paper, we stress its importance here. An auto-scaling algorithm is considered resource efficient when, for a given Λ_in(t), i) Λ^IDDQN_out(t) is maximized (with higher priority) and ii) the individual VNF CPU utilization is close to the target U. When the CPU allocation is balanced among the VNFs, Λ^IDDQN_out(t) is maximized. Moreover, maximizing Λ^IDDQN_out(t) implies minimizing the rejection ratio rr of UEs to the NS. We define, at an arbitrary episode, the NS rejection ratio rr(t) at t as g_t(n_rej)/g_t(n_req), where g_t is the moving average for a window of size w at t, n_rej is the no. of UEs rejected and n_req is the no. of UEs that requested admission to the NS.
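A small sketch of this rejection-ratio bookkeeping follows, assuming the moving average g_t is a simple sliding-window mean over the last w measurement intervals (the windowing details are an assumption).

```python
from collections import deque

class RejectionRatio:
    """NS rejection ratio rr(t) = g_t(n_rej) / g_t(n_req), with g_t a moving
    average over a window of size w."""
    def __init__(self, w):
        self.rejected = deque(maxlen=w)
        self.requested = deque(maxlen=w)

    def update(self, n_rej, n_req):
        """Record the counts of rejected and requested UEs for one interval."""
        self.rejected.append(n_rej)
        self.requested.append(n_req)

    def value(self):
        total_req = sum(self.requested)
        return sum(self.rejected) / total_req if total_req else 0.0
```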
We consider the scenario of N = 3 IDDQN IAs sharing M = 1 resource pool to demonstrate resource efficiency. In Table VII, the bold figures indicate the better score of a given comparison. We observe the impact of training on the individual CPU utilization u_j(t) and the NS rr(t), for low and high Λ_in(t), at two arbitrary episodes, before and after convergence. According to Table VII, for a given Λ_in(t), we observe that not only does rr(t) decrease after convergence, but the individual u_j(t) are also more balanced than before convergence. u_j(t) is also closer to the target U = 0.5 after convergence, but the reduction of rr(t) is prioritized. We show that the reduction of rr(t) is prioritized by referring to the example case - high incoming load, before convergence - from Table VII. For this case, we observe that even though u_1(t) and u_2(t) are close to U, u_3(t) is far from U, implying that the CPU allocation is unbalanced and hence rr(t) is high. For the same high incoming load case, after convergence, the tendency of IDDQN to be resource efficient is evident: the individual u_j(t) are balanced, achieved by a proportionate allocation of CPU resources to the VNFs, resulting in a lower NS rr(t) than before convergence. Therefore, IDDQN enables the NS to serve more load after convergence, for any incoming load.

VII. DISCUSSION

A. Lessons Learned From Massively Distributed Intelligence
The improvements of IDDQN in terms of stability, system performance and convergence time lead us to a few practical implications for 6G. 1) Higher Reliability of Management: We show that IDDQN is more stable (in long-term training) than existing comparable RL solutions. Once an IDDQN IA is sufficiently trained, we can trust that its performance will be robust, even in the long term and in untrained environments. The IA will readily adapt to network conditions and scale CPUs in a stable manner, implying that an operator will have greater confidence in intelligence-based decision-making for managing the 6G network, thereby reducing the manual effort for network operation and maintenance.
2) Better Scalability: IDDQN has enabled us to simulate up to 20 IAs, compared to only 10 IAs, which was a key limitation in our previous work. We observe that IDDQN significantly improves the scalability of distributed intelligence. With faster convergence, the operator does not need to wait indefinitely for IAs to be deployed in the network. Further, the performance of IDDQN remains good even with an increasing no. of IAs. These practical aspects improve the overall efficiency of deploying distributed intelligence.
3) Higher Degree of Flexibility of Deployment: Finally, the lower convergence time, and hence better scalability, allows the operator to deploy a higher no. of IAs without increased effort in maintaining the network - ultimately providing a higher degree of flexibility of deployment.

B. Limitations in Our Approach
We highlight three limitations. First, the system model is robust and reliable, but relatively simple. In particular, the state space takes into account the loading (X) and balancing (Y_j) factors of the VNFs. This design provides an accurate representation of the state without exploding the state space and, consequently, the Q-table size in QLC. With an increasing no. of IAs in the framework, an increase in the size of the state space is avoided, ensuring that QLC converges in a reasonable period of time. To simulate a system with more complex conditions, such as multiple network slices with shared and dedicated NFs, multiple auto-scaling goals, or the inclusion of other system variables, e.g., memory, disk usage, etc., the state space design would become more complex and would no longer be scalable in QLC. Nevertheless, the existing system model is sufficient and effective for the purpose of evaluating and fairly comparing the behavior of inherently different DRL algorithms (i.e., tabular vs. neural network). We stress that our goal was to conduct a comparative study of the performance and convergence aspects of the algorithms, including QLC, and the improvements achieved by IDDQN, evaluated in a selected no. of scenarios, would still hold in more complex deployments, albeit after minor hyper-parameter tuning.
Second, in our evaluations, we consider the NFs to be identical in their characteristics, and the UEs "load" or impact the NFs in an identical manner. These settings imply that, under identical conditions, identical decisions will be made by the IAs after convergence. When some heterogeneity is introduced, e.g., different NF characteristics, additional considerations may arise for IDDQN investigations, beyond the identified dimensions of convergence, performance and training stability.
Finally, the simulator is event-based, meaning that upon the admission of every UE, all subsequent measurement and computation steps, such as the calculation of the utilization, state, action, reward, etc., are invoked. A timer-based implementation would help to tune the granularity of the measurements over a given time window. However, this consideration is implementation-specific, affects all approaches similarly, and has no expected impact on the conclusions drawn in this manuscript.

VIII. CONCLUSION
In this manuscript, we investigated the applicability of our proposed RL-based solution, improved DDQN or IDDQN, to massively distribute intelligence for 6G network management. A distributed intelligence framework consists of multiple IAs. Existing RL solutions suffer from different issues. The performance of QLC, based on Q-Learning, is near-optimal, but QLC is not scalable in terms of convergence time when the no. of IAs increases. DQNC, based on Deep Q-Networks, performs well and converges sooner than QLC, but suffers from training instability. Therefore, for massive distribution, our aim was to achieve scalability in the three KPI dimensions - system performance, convergence and stability of RL. To this end, we proposed our solution IDDQN, which combines Double DQN and reward scaling, to achieve a good balance among the three dimensions. Applying IDDQN to the use case of auto-scaling resources in a network slice, we evaluated the algorithm for several massively distributed and realistic IA deployments. Our results show that IDDQN is more stable than DQNC and converges at least 2× faster than QLC. The system performance, expressed as the no. of UEs served by the slice, is good, with only 8% mean absolute deviation from the optimum values. IDDQN also proves to be robust and resource efficient after convergence. We derived some positive implications of using IDDQN - higher reliability of management, better scalability and a higher degree of flexibility for the operator - making it a promising solution for distributing intelligence in practical 6G deployments.

APPENDIX

A. CPU Utilization Formula
The detailed formulation of Eqn. (1) for the CPU utilization of the j-th VNF (dropping t for simplicity) involves three constants K_1, K_2 and K_3. K_1 ≥ 0 represents all background processes necessary for the machine to run, and K_2 > K_1 is a constant to shape the formula. K_3 > 0 is a constant to shape the utilization w.r.t. the no. of CPUs.

B. Quantizing the Mean CPU Utilization of the System or Defining h(•)
For the j-th IA, the function h(•) determines the loading factor X (from Eqn. (3)) of the state s_l in Eqn. (2). Here, we elaborate on the definition of h(•).
The mean utilization of the VNFs in the k-th resource pool is ū = (1/N_k) Σ_{j=1}^{N_k} u_j. To quantize ū, where 0 ≤ ū ≤ 1, we consider ν thresholds. The no. of quantized steps, therefore, is ν + 1. The function h(•) from Eqn. (3), representing the overall loading X of the VNFs, maps ū to one of these steps, where o_p is the p-th threshold used to quantize ū within the range [0, 1]. Since the quantized step returned by the function h(•) depends only on ū, X is identical for all IAs.
In contrast, Y_j, representing the balancing part of the overall state (Eqn. (4)), is IA-specific and can take one of three values: +1, 0, or −1. Hence, the total no. of states is (ν + 1) · 3. For example, in this work, we define thresholds around the target utilization U = 0.5. We consider ν = 4 thresholds: 0.2, 0.4, 0.6, 0.8, shown in Fig. 16. In our work, there is a total of |S| = 15 states for each NF.
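To illustrate, the following sketch quantizes ū against the four thresholds and assembles the complex-valued state of Eqn. (2); the mapping of quantized steps to integer labels (centered so that 0 corresponds to the favorable region around U) and the balancing tolerance are illustrative assumptions.

```python
import numpy as np

THRESHOLDS = (0.2, 0.4, 0.6, 0.8)   # the nu = 4 thresholds used in this work

def h(u_bar, thresholds=THRESHOLDS):
    """Loading factor X: quantize u_bar in [0, 1] into nu + 1 steps.
    Labels are centered so that 0 is the step containing the target U = 0.5
    (the exact labeling is an assumption)."""
    step = int(np.searchsorted(thresholds, u_bar, side="right"))
    return step - len(thresholds) // 2

def balancing_factor(u_j, neighbors_mean, tol=0.05):
    """Balancing factor Y_j in {+1, 0, -1}: above, close to, or below the
    mean utilization of the neighbors (the tolerance is an assumption)."""
    if u_j > neighbors_mean + tol:
        return 1
    if u_j < neighbors_mean - tol:
        return -1
    return 0

def state(u_values, j):
    """State of the j-th IA as the complex number s_l = X + iY_j (Eqn. (2))."""
    u_bar = float(np.mean(u_values))
    neighbors_mean = float(np.mean([u for k, u in enumerate(u_values) if k != j]))
    return complex(h(u_bar), balancing_factor(u_values[j], neighbors_mean))
```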

C. State Interpretation
The state of the j-th IA, s_l, encodes two dimensions, i.e., the loading and the balancing factors, and is represented as a complex number according to Eqn. (2) as s_l = X + iY_j, where X is the loading factor of the VNFs and Y_j represents the balancing factor. The overall state s_j = 0 + i0 can be interpreted as follows. The first part, X = 0, indicates that the mean utilization of the NFs, ū, is within the favorable region (Fig. 16). The second part, Y_j = 0, indicates that all NFs are perfectly balanced, i.e., their individual utilization values are equal. This is the ideal or goal state of the NF. Similarly, the state s_j = 0 − i, where Y_j = −1, indicates that although ū is within the favorable region, the utilization of the j-th NF, u_j, is lower than the mean utilization of its neighbors.
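Using the sketch above, the two interpretations just described could be reproduced as follows (the utilization values are chosen only for illustration).

```python
# Goal state: mean utilization in the favorable region and all NFs balanced.
print(state([0.50, 0.50, 0.50], j=0))    # -> 0j   (X = 0, Y_0 = 0)

# Favorable mean utilization, but the 0-th NF is below its neighbors' mean.
print(state([0.40, 0.55, 0.55], j=0))    # -> -1j  (X = 0, Y_0 = -1)
```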

Fig. 1. Overview of the contributions in this manuscript.

1) KI^(1)_l(t) is the discrete derivative of the variance of the Q-values over the time interval t = t_0 + x · T, where t_0 is the start time instant of QLC and T is the episode duration, and 2) KI^(2)_l(t) is the no. of times s_l has been visited in t. Assuming the training data traverses S uniformly, such that the impact of KI ...

Algorithm 1. Training Procedure of Each IDDQN IA A_j
Begin:
1: initialize RB capacity to B experiences
2: initialize main with random parameters θ
3: initialize target parameters θ' = θ
4: initialize exploration probability ε = ε_ini
5: for episode ← 1, E do
6:   initialize counter c = 0 to count experiences in RB
7:   if episode > 1 then
8:     ε ← (1 − d/100) · ε
...
12:   select a_m randomly with probability ε, otherwise select a_m = arg max_{a_r} Q(s_l, a_r)
13:   execute action a_m
14:   observe reward r(s_l, a_m), next state s_l'
15:   store experience {s_l, a_m, r(s_l, a_m), s_l'} in B
16:   c ← c + 1
17:   if c ≥ B then
18:     sample random mini-batch of size b from B
19:     compute Q(s_l, a_m; θ) using main
20:   ...
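A single update of this procedure could be sketched in PyTorch as follows, mirroring steps 17-19 of Algorithm 1; the remaining steps (target formation, loss and target-network update schedule) are truncated in the listing above, so the corresponding parts of the sketch are assumptions consistent with the DDQN-plus-reward-scaling design of IDDQN.

```python
import torch
import torch.nn.functional as F

def train_step(main_net, target_net, optimizer, buffer, batch_size, gamma, scale):
    """One IDDQN update: sample a mini-batch, form DDQN targets on scaled rewards,
    and take a gradient step on the main network."""
    if len(buffer) < batch_size:
        return
    batch = buffer.sample(batch_size)
    states = torch.stack([e.state for e in batch])
    actions = torch.tensor([e.action for e in batch])
    rewards = torch.tensor([e.reward for e in batch], dtype=torch.float32) / scale
    next_states = torch.stack([e.next_state for e in batch])

    # Q(s, a; θ) of the actions taken, from the main network
    q_current = main_net(states).gather(1, actions.unsqueeze(-1)).squeeze(-1)
    with torch.no_grad():
        # DDQN target: action selection by main, evaluation by target
        next_actions = main_net(next_states).argmax(dim=1, keepdim=True)
        q_next = target_net(next_states).gather(1, next_actions).squeeze(-1)
    loss = F.mse_loss(q_current, rewards + gamma * q_next)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```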

Fig. 8. Comparison of the learning behavior of DQNC and DDQN IAs.

Fig. 9. Impact of different clipping thresholds on the learning behavior of DDQN IAs.

Fig. 10. Impact of reward scaling on the learning stability of DDQN IAs for 2500 episodes.

TABLE I. Table of Notations.
Table III outlines the configuration parameters of the simulations. The incoming load to the NS is represented by arrivals of UEs in time, simulated by a Poisson arrival process. At a discrete time instant t, UEs from a population P request admission to the NS, with aggregate arrival rate Λ_in(t). Therefore, Λ_in(t) represents the overall incoming NS load at t. The mean service duration of each ...

TABLE III. Simulation Configuration Parameters.
Fig. 6. PHLP and CLP as incoming NS load Λ_in(t) in s^−1.

TABLE VI. Configuration for Each IA.

TABLE VII. Resource Efficiency of IDDQN for Scenario N = 3, M = 1 and U = 0.5.