Resource allocation for UAV-assisted 5G mMTC slicing networks using deep reinforcement learning

The range of Internet of Things (IoT) application scenarios is expanding rapidly due to the quick evolution of smart devices and fifth-generation (5G) network slicing technologies. Hence, IoT is becoming significantly important in 5G/6G networks. However, communication with IoT devices is particularly vulnerable in disasters because the network depends on the main power supply and the devices are fragile. In this paper, we consider an Unmanned Aerial Vehicle (UAV) as a flying base station (BS) for an emergency communication system with 5G mMTC network slicing to improve the quality of user experience. The UAV-assisted mMTC system uses a base station selection method to maximize system energy efficiency. The system model is then reduced to a stochastic optimization problem using Markov Decision Process (MDP) theory. We propose a reinforcement-learning-based Dueling Deep Q-Network (DDQN) technique to maximize energy efficiency and optimize resource allocation. We compare the proposed model with DQN and Q-Learning models and find that the proposed DDQN-based model performs better for resource allocation in terms of low transmission power and maximum energy efficiency.


Introduction
The rapid innovation and evolution of 5G networking have had a significant impact on standardization bodies in academia and industry, with extensive use cases supported by 5G network slicing such as self-driving cars; augmented, mixed and virtual reality (AR/MR/VR); UAVs, etc. [1]. The most important technology enablers of the network slicing technique are SDN and NFV. SDN decouples the control plane from the data plane and provides programmability to network applications. NFV is a key technology for industry virtualization, separating network functions from proprietary hardware appliances [2,3]. Each network slice provides specific functionalities covering the core network and the RAN [4].
With the rapid development of remote mobile technology, unmanned aerial vehicles (UAVs) are becoming more intelligent with machine learning, which further broadens the range of UAV use cases. In the past few years, UAVs have been used widely in traffic monitoring, aerial photography and disaster rescue due to their high flexibility [5]. Future-generation networking aims to provide wide user coverage and ensure the interconnection of everything [6]. Given the high mobility and versatility of UAVs and the integration of communication modules into wireless networks, UAVs can serve as aerial wireless access points for communication platforms in beyond-fifth-generation (B5G) systems [7]. This enables the creation of more flexible networks for next-generation communication, such as UAV-assisted emergency communications. In disaster areas, using UAVs as access points for communication is an effective choice, as users can continuously communicate with others in a device-to-device (D2D) multi-cast manner [8]. First, the disaster area can achieve a fast response this way. Second, UAVs can take advantage of line-of-sight (LoS) scheduling and mobility, which makes the response faster and more flexible compared to traditional networks.
The heterogeneous ultra-dense network (UDN) is a promising architecture for coping with massive data traffic in 5G and beyond networks. Resource allocation needs to be addressed to meet quality-of-service (QoS) requirements and to ensure the service level agreement (SLA) in 5G networks [16,17]. Efficiently managing wireless resources in a heterogeneous UDN poses challenges in reducing wireless interference. The mMTC-based dense network deploys a large number of low-power, short-range small cells under macro base station coverage, enabling flexible and high-throughput services in 5G networks [18]. UAV-assisted future networks have attracted interest in various applications thanks to increased airborne flight time, robust manoeuvrability, rapid deployment, and payload capabilities, which are highly influenced by technologies such as machine learning, artificial intelligence, reinforcement learning, SDN and mobile edge computing (MEC) [19]. The 5G UDN has been recognized as a key solution for the consequential energy consumption and sudden increase in mobile traffic. Further, popular content caching at the edge in mMTC can address the challenges of energy and traffic [20]. However, UAV-assisted 5G mMTC networks bring new challenges for maximizing system energy efficiency (EE) to ensure the overall quality of user experience.
We study the optimization problem of UAV-assisted resource allocation in the 5G mMTC slice system to address the above-mentioned challenges. We first design a link selection method that solves the base station selection problem. Then, an energy efficiency maximization problem with low transmission power is formulated, which aims to optimize the resource allocation strategy. The original problem is reduced to a stochastic optimization problem using the MDP concept, and a deep reinforcement learning (DRL) approach is used to solve it. The massive generated data can be processed more efficiently by DRL in the mMTC slice than by traditional machine learning techniques, so DRL is well suited to the investigation of this study. The main contributions of this paper are as follows.
• We investigate the UAV-assisted resource allocation problem in the 5G mMTC slice network system. UAVs are used as flying base stations to handle emergency communications, in which the flying base station communicates with ground base stations so that users can continuously communicate with others in a device-to-device (D2D) multi-cast manner, increasing the quality of experience (QoE). Compared with fixed-infrastructure communications, UAVs have some salient attributes, such as strong LoS connection links, flexible deployment and controlled mobility with additional degrees of design freedom.
• To optimize the space-ground emergency communication system performance, we consider the joint power allocation and link selection problem. We design a base station selection method based on the transmission power and the user's distance, using the locations of the UAV and base stations. The user selects the best base station for data transmission based on an optimal path, which improves the overall quality of user communication.
• We create a UAV-assisted 5G mMTC slice scenario to study the feasibility of this scheme. The energy efficiency maximization problem is reduced to an MDP. We propose a Dueling DQN (DDQN) based dynamic resource management algorithm built on reinforcement learning to solve the MDP problem. We also discuss the Q-Learning and DQN models to compare with the proposed DDQN-based model for a comprehensive study.
• We conduct extensive experiments to evaluate the system performance of the UAV-assisted 5G mMTC network slice scenario for emergency communication with reinforcement learning approaches. The simulation results show that resource allocation based on the Dueling DQN (DDQN) scheme improves system energy efficiency over the Q-Learning and DQN models. Further, we also investigate the relationship between the number of users, the number of base stations and system energy efficiency for Q-Learning, DQN and the proposed model.
The remainder of this paper is organized as follows. In Sect. 2, we discuss related work. In Sect. 3, we present the system model, in which we describe the channel model, network model, and computational model. In Sect. 4, we formulate the problem and transform it into an MDP, which is solved by reinforcement-learning-based DQN and DDQN techniques. We then present the performance evaluation and analysis in Sect. 5. Finally, we conclude this paper in Sect. 6 with the future scope.

Related work
In the literature, continuous and sincere efforts have been made toward network performance improvement through resource allocation strategies such as game theory and deep-Q networks. In 5G networks, UAV assistance and resource allocation are of great significance and have continuously received more attention during 5G development. UAV applications in 5G networks include energy transmission and communication buffering with the aim of improving system flexibility. For instance, the authors of [21] investigate multi-objective optimization for resource management and massive access in the mMTC slice scenario in a UDN environment, formulating a throughput maximization problem solved by a machine-learning-based user clustering method. The paper [22] studied CPU-cycle and stochastic task allocation for MEC systems, aiming to minimize energy consumption with an upper-bounded task queue length, and proposes a stochastic optimization algorithm. The paper [23] proposes MEC-UAV cooperation on task offloading with the aim of minimizing energy consumption using a reinforcement learning approach. The authors of [24] aim to extend 5G network slicing using UAVs with MEC by proposing reinforcement-learning-based power optimization to maximize objective functions. The paper [25] studied cooperative wireless-powered UAV-supported MEC systems, optimizing the energy consumption and trajectory of UAV-assisted systems. The authors of [26] investigate UAV-only scenarios and UAV scenarios with fixed base stations, aiming to maximize load balance and minimize the number of UAVs. The authors of [27] studied base station bandwidth allocation and UAV access selection problems with a game framework in UAV-assisted IoT communication networks. In [28], a joint power splitting and precoding vector optimization technique is proposed to secure UAV-assisted NOMA networks. The paper [29] proposes a multi-UAV-assisted NOMA scheme for better spectrum and energy efficiency in uplink cellular communication systems. The work [30] jointly optimizes bandwidth allocation, transmission power and UAV route to maximize energy efficiency while meeting different quality-of-experience (QoE) requirements. The paper [31] proposes to utilize relay nodes and computing nodes to improve user latency in UAV-assisted MEC networks.
In addition, considering intelligence as the main characteristic of future wireless communication, much work has been done recently. UAV-enabled networks, in which a flying object is used as the access point for a given flight period, seek to maximize the common throughput across ground users [32][33][34]. In [35], to address massively connected devices, UAVs are used as aerial base stations, aiming to increase performance by improving placement and power allocation. In [36], two UAVs are used; one moves to interact with various users on the ground, while the other jams eavesdroppers to safeguard the desired users' communications. Another work is based on a UAV-enabled network system model in which a UAV is used to connect and communicate with ground nodes in the presence of jammers, aiming to optimize the trajectory of the UAV and maximize energy efficiency [37]. The paper [38] studies the UAV's line of sight in 3D space for air-to-ground communication and proposes a UAV-assisted data collection strategy that aims to reduce the total time by optimizing the UAV's altitude, velocity, trajectory and data links with ground users. Deployed UAVs are used to collect sensor data in the work [39], in which the authors optimize the 3D trajectories over sensor nodes.
A cellular network with a UAV-mounted base station and numerous terrestrial base stations is considered in the article [40], with each base station servicing multiple customers. The authors design the 3D placement of the aerial base station and the transmission power allotment for all nodes in the uplink and downlink under a probabilistic channel environment. The problem of drone BS placement with resource allocation in a given hotspot area is investigated in the paper [41] to ensure QoS for users within the hotspot. The paper [42] investigates UAV-assisted networks in which multiple UAVs are considered as base stations. In this scenario, the UAV-BS connects with the ground BS, the ground BS forwards data between users and the core network, and the authors propose a framework to maximize overall user throughput while maintaining fairness among users within the flight time of the base stations.
Some works are close to ours. In [43], the authors aim at power optimization by maximizing social group utility while satisfying users' downlink QoS. UAV-supported communication can provide better coverage in remote areas with the capability to manage high traffic. In case of damage to the terrestrial network, UAVs can be used for emergency communication.
The authors of [44] discuss a UAV-assisted emergency communication model in heterogeneous IoT with distributed NOMA schemes. The IoT has problems with spectrum resources due to its widespread deployment and with power consumption due to battery-powered devices. In [45], the authors aim to maximize the energy efficiency of UAV-assisted UDN networks. Through flexible system deployment, UAVs can be used as flying base stations when a disaster occurs. The authors of [46] study mMTC-based UAV-assisted optimization for energy efficiency.

System and channel models
We consider a natural disaster scenario where the ground base station is not fully functional, and UAVs are used as flying base stations for emergency communication. As shown in Fig. 1, a disaster can occur suddenly within the wireless network zone; the affected area is depicted as the dotted circular region. A heterogeneous IoT is considered in this region, where all devices and users are battery-powered and use the network to communicate. Moreover, we also assume that some of the base stations and cluster heads are down because they are mains-powered and have no battery backup. In this case, a UAV-assisted BS is deployed for message transmission between the devices in the disaster region and the working BSs outside it. In order to deliver messages from the base station into the disaster region, the UAVs outside the disaster region relay through the UAVs above it. The terminal UAVs communicate with the devices and cellular users within their coverage area. Further, users within the disaster region can communicate with other disturbed regions and with outside users.
LoS links are preferred and can easily be obtained within the UAV coverage region. The LoS link depends on the elevation angle $\theta$ at which the UAV's signals reach the ground [25]. If the value of $\theta$ is small, the probability that the UAV transmits signals over LoS links is low. Thus, we define $\theta_{TH}$ as the threshold elevation angle for LoS. If the elevation angle of the signal is less than $\theta_{TH}$, no valid link exists from the UAV to users on the ground. Hence, the coverage of the UAV region with valid LoS links can be found once the height and location of the UAV are known, as illustrated in Fig. 1 by the solid circles.
In the case of emergency communication, many messages need to be delivered urgently, and exclusive spectrum allocation for devices and users is inefficient. Hence, we allow the same spectrum to be shared between D2D links outside the LoS region and the links between the UAV and users or devices. Figure 2 shows the model of the coexisting links around a single UAV in the disaster region. $\theta_{LoS}$ and $\theta_{cov}$ denote the maximum angles, measured from the vertical, of the LoS region and the UAV coverage region, respectively. Note that $\theta_{LoS}$ is the complementary angle of $\theta_{TH}$.
Furthermore, it is reasonable to assume that at least one device transmitter is located within the UAV's LoS coverage region. Therefore, this transmitter cannot share the spectrum with the other devices and users. Since the air-to-ground (A2G) channel is more favourable for a device in the LoS region of the UAV, this device can be used as the sink node of the IoT. Using the sink node as a relay, messages from the base station can be disseminated to the devices. Suppose the unit bandwidth of one subchannel is $B_0$, while the total number of subchannels available for sharing among these devices and users is $Z_T$; the corresponding power gains of the links can then be computed using Eqs. (1)-(4). We use the notation of the channel model given in [25]. $G^{A2C}_{u,z}$ and $G^{D2D}_{v,z}$ are the power gains of the transmitting links from the UAV to user $u$ and between device pair $v$ on subchannel $z$, respectively. The interference power gains from the UAV to the receiving device and from the transmitting device to the cellular user on subchannel $z$ are $g^{A2D}_{v,z}$ and $g^{D2C}_{u,v,z}$.
The distance between the UAV and user $u$ is denoted by $d_{u,A}$, and $d_v$ is the distance between the devices of pair $v$. Further, the distance between the UAV and the receiving device must be considered; it is denoted $d_{v,A}$, and $d_{u,v}$ is the distance between transmitting device $u$ and user $v$. $\rho^{(LoS)}$ and $\rho^{(NLoS)}$ define the LoS gain of message transmission from the UAV to cellular users and the NLoS gain of the UAV's interference to the ground receiver sharing the same spectrum with the device receiver. They denote additional attenuation factors; for simplicity, $\rho^{(LoS)}$ is normalised to 0 dB. $\alpha$ and $\beta$ stand for the path loss factors of the A2G and G2G channels, respectively. $H_z$ is the small-scale channel gain caused by multipath fading, modelled by a Rayleigh distribution, for G2G transmissions on subchannel $z$.

Network model
We consider a UAV-assisted 5G mMTC system in which there is a set of base stations together with a UAV. The UAV is considered a flying base station within the system. A central access point (CAP) is positioned at the system's centre, as depicted in Fig. 3, and is used to manage base stations and UAVs. The set of UAVs is denoted by $A$ and contains only one element $a$, such that $A = \{a\}$. The set of base stations is denoted by $B = \{1, 2, \ldots, b, \ldots, B\}$. The set of users in the mMTC network is denoted by $U$, such that $U = \{1, 2, \ldots, u, \ldots, U\}$. In the system, users are assigned randomly to the base stations and UAVs. Further, we denote by $F = A \cup B$ the total feasible set. Let $F_u$ be the feasible set of user $u$, represented as $F_u = \{f_a, f_b\}$, where $f_a$ denotes that user $u$ selects UAV $a$ for data transmission and $f_b$ denotes that user $u$ selects base station $b$ for data transmission. The notation of symbols with detailed descriptions is provided in Table 1. We consider a system model in which the central access point can capture various environmental data within the system. The central access point then chooses the best base station for a given user based on the data-link selection method and manages the base station's resource allocation policy. In the UAV-assisted case, the user connects directly to the UAV, and the UAV is connected to the nearest central access point, so that the user can communicate both outside and within the region for data transmission, as shown in Fig. 3.

Computational model
In order to represent the communication connection status, we define a function $c_u$, where $c_u \in F_u$, which expresses the connection of user $u$ with base station $b$ or UAV $a$. We assume that the base station locations are fixed while the UAV stays in the air. We define the UAV location as $P_a = [x_a, y_a, z_a]$, where $x_a$, $y_a$ and $z_a$ are the coordinates of UAV $a$. The base station location is defined as $P_b = [x_b, y_b, z_b]$, where $x_b$, $y_b$ and $z_b$ are the coordinates of base station $b$. Users can be randomly distributed over the whole system, and the location of user $u$ is defined as $P_u = [x_u, y_u, z_u]$, where $x_u$, $y_u$ and $z_u$ are the coordinates of user $u$. Channel capacity is the tight upper bound on the rate at which information can be transmitted with high reliability over a communication channel, while the channel gain mainly captures the relationship between the devices and the base station. The channel gain at time slot $t$ from base station $b$ to user $u$ is defined as
$$H^t_{u,b} = r_{bs} \cdot \|P^t_b - P^t_u\|^{-\alpha},$$
where $\|P^t_b - P^t_u\|$ is the distance between base station $b$ and user $u$ at time slot $t$, $\alpha$ represents the path loss index from user to base station and $r_{bs}$ is the Rayleigh fading factor, i.e. the initial channel gain. Similarly, the channel gain at time slot $t$ from UAV $a$ to user $u$ is defined as
$$H^t_{u,a} = r_{uav} \cdot \|P^t_a - P^t_u\|^{-\beta},$$
where $\|P^t_a - P^t_u\|$ is the distance between UAV $a$ and user $u$ at time slot $t$, $\beta$ represents the path loss index from user to UAV and $r_{uav}$ is the channel gain of the UAV at a reference distance of 1 m (see Table 1 for the full notation and Fig. 3 for the UAV-assisted 5G mMTC slicing model).
where $E_a$ and $E_b$ are the transmission powers of the UAV to the user and the base station to the user, respectively. The distances between UAV $a$ and user $u$ and between base station $b$ and user $u$ are denoted $d^u_a$ and $d^u_b$, respectively. The link selection rule is as follows: when $\frac{E_a}{E_b} \cdot d^u_b$ is greater than the distance $d^u_a$ from user $u$ to UAV $a$, the user accesses base station $b$; when $\frac{E_a}{E_b} \cdot d^u_b$ is less than $d^u_a$, the user accesses UAV $a$. Based on the user's connection status, we derive the signal-to-interference-plus-noise ratio (SINR) for each user. The SINR in the downlink from UAV $a$ to user $u$ at time slot $t$ is defined as
$$\Gamma^t_{u,a} = \frac{E^t_{u,a} H^t_{u,a}}{\sum_{v \neq u} \mathbb{1}\{c_v = c_u\}\, E^t_{v,a} H^t_{v,a} + \sigma^2},$$
where $E^t_{u,a}$ denotes the transmission power assigned by UAV $a$ to user $u$, $H^t_{u,a}$ is the channel gain between user $u$ and UAV $a$, the user connection status is determined by the indicator function $\mathbb{1}\{c_v = c_u\}$, and $\sigma^2$ is the additive white Gaussian noise (AWGN). The SINR in the downlink from base station $b$ to user $u$ at time slot $t$ is defined analogously as
$$\Gamma^t_{u,b} = \frac{E^t_{u,b} H^t_{u,b}}{\sum_{v \neq u} \mathbb{1}\{c_v = c_u\}\, E^t_{v,b} H^t_{v,b} + \sigma^2},$$
where $E^t_{u,b}$ denotes the transmission power assigned by base station $b$ to user $u$ and $H^t_{u,b}$ is the channel gain between user $u$ and base station $b$. To calculate the channel capacity or data rate (DR), we use Shannon's channel capacity formula. The data rate in the downlink from UAV $a$ to user $u$ at time slot $t$ is defined as
$$DR^t_{u,a} = B_a \log_2\!\left(1 + \Gamma^t_{u,a}\right),$$
where $B_a$ is the bandwidth of the wireless channel of UAV $a$ and the assigned power satisfies $E^t_{u,a} \le E^a_{max}$, the maximum transmission power of UAV $a$. Similarly, the data rate in the downlink from base station $b$ to user $u$ at time slot $t$ is defined as
$$DR^t_{u,b} = B_b \log_2\!\left(1 + \Gamma^t_{u,b}\right),$$
where $B_b$ is the bandwidth of the wireless channel of base station $b$ and $E^t_{u,b} \le E^b_{max}$, the maximum transmission power of base station $b$. As a result, the total data transmission rate of the UAV and the base stations is defined as
$$DR^t_{total} = \sum_{u \in U} \left[(1-\omega(u))\, DR^t_{u,a} + \omega(u)\, DR^t_{u,b}\right], \tag{13}$$
where $\omega(u)$ is the connection-differentiating indicator: $\omega(u) = 0$ shows that user $u$ is connected to UAV $a$, while $\omega(u) = 1$ shows that user $u$ is connected to base station $b$. The total power consumption $E^t_{total}$ contains the power consumption $E^t_{u,a}$ of the UAV and the power consumption $E^t_{u,b}$ of the base station, both of which are affected by the user link mode. Therefore, at time slot $t$ the total power consumption can be defined as
$$E^t_{total} = \sum_{u \in U} \left[(1-\omega(u))\, E^t_{u,a} + \omega(u)\, E^t_{u,b}\right]. \tag{14}$$
Using Eqs. (13) and (14), the energy efficiency ($\eta$) of the system is given as
$$\eta = \frac{DR^t_{total}}{E^t_{total}}. \tag{15}$$
The resource allocation problem in the UAV-assisted 5G mMTC network slice system, as discussed above, can then be stated as
$$\max_{c_u,\, E^t_{u,a},\, E^t_{u,b}} \ \eta \quad \text{s.t.} \tag{16}$$

C1: $0 \le E^t_{u,b} \le E^b_{max}, \ \forall u \in U$; C2: $0 \le E^t_{u,a} \le E^a_{max}, \ \forall u \in U$; C3: $c_u \in \{0, 1\}, \ \forall u \in U$; C4: $E^t_{u,a},\, E^t_{u,b} \le E_{user}, \ \forall u \in U$,
where C1 expresses the limit on transmission power from the ground base station to the associated user within the coverage region. The limit on transmission power from the UAV to a user within the coverage region is represented by C2. In C3, $c_u = 0$ indicates that user $u$ is linked to the ground base station, while $c_u = 1$ indicates that user $u$ is connected to UAV $a$. We denote the maximum power capacity received by the user as $E_{user}$ in C4; the user receives at most $E_{user}$ from the UAV or from the ground base station. The overall energy efficiency of the system should be maximized with the minimum transmission power supplied. The UAV-assisted mMTC system ensures user connectivity and data transmission to achieve the overall quality of user experience.
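To make the link selection rule and the energy efficiency objective of Eqs. (13)-(15) concrete, the following is a minimal numerical sketch in Python. The positions, power values, the simplified noise-only SINR (the paper's SINR also sums co-channel interference) and the helper name `energy_efficiency` are illustrative assumptions, not part of the system specification.

```python
import numpy as np

rng = np.random.default_rng(0)

U = 15                      # number of users
alpha, beta = 3.0, 2.0      # path loss indices (G2G and A2G)
B_a, B_b = 1e6, 1e6         # bandwidths of UAV and BS channels (Hz)
sigma2 = 1e-13              # AWGN power
r_bs, r_uav = 1.0, 1.0      # reference channel gains

# Illustrative positions: one UAV, one BS, U users on the ground.
P_a = np.array([0.0, 0.0, 100.0])
P_b = np.array([250.0, 0.0, 25.0])
P_u = np.c_[rng.uniform(-500, 500, (U, 2)), np.zeros(U)]

d_ua = np.linalg.norm(P_u - P_a, axis=1)
d_ub = np.linalg.norm(P_u - P_b, axis=1)

H_ua = r_uav * d_ua ** -beta          # channel gains, UAV -> user
H_ub = r_bs * d_ub ** -alpha          # channel gains, BS -> user

E_ua = np.full(U, 0.5)                # assigned powers (W), assumed uniform
E_ub = np.full(U, 1.0)

# Link selection rule from the paper: compare (E_a/E_b) * d_ub with d_ua.
omega = ((E_ua / E_ub) * d_ub > d_ua).astype(int)   # 1 -> BS, 0 -> UAV

def energy_efficiency(omega, E_ua, E_ub, H_ua, H_ub):
    """Eqs. (13)-(15): total data rate over total power consumption
    (interference terms omitted here for brevity)."""
    sinr_a = E_ua * H_ua / sigma2
    sinr_b = E_ub * H_ub / sigma2
    dr = np.where(omega == 0, B_a * np.log2(1 + sinr_a),
                              B_b * np.log2(1 + sinr_b))
    e = np.where(omega == 0, E_ua, E_ub)
    return dr.sum() / e.sum()         # eta, bits/joule

print(f"eta = {energy_efficiency(omega, E_ua, E_ub, H_ua, H_ub):.3e} bits/J")
```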

Lemma 1
The path loss from the UAV to the ground devices can be defined as
$$g_{u,a} = \Pr(LoS)\,\rho^{(LoS)}\, d_{u,A}^{-\alpha} + \Pr(NLoS)\,\rho^{(NLoS)}\, d_{u,A}^{-\alpha}.$$
Proof The received signal strength at the user location for LoS and NLoS connections from the UAV to a ground user is
$$g_{u,a}^{(LoS)} = \rho^{(LoS)}\, d_{u,A}^{-\alpha} \ \text{(LoS link)}, \qquad g_{u,a}^{(NLoS)} = \rho^{(NLoS)}\, d_{u,A}^{-\alpha} \ \text{(NLoS link)}, \tag{17}$$
where $d_{u,A}$ is the distance between the ground user and the UAV, $\alpha$ is the path loss exponent from the UAV to the ground, and $\rho^{(LoS)}$ and $\rho^{(NLoS)}$ are the additional attenuation factors for LoS and NLoS links. The probability of an LoS connection is mathematically described as [47]
$$\Pr(LoS) = \frac{1}{1 + S \exp\!\left(-R\,(\theta_i - S)\right)}, \tag{18}$$
where $S$ and $R$ are parameters that depend on the environment (such as urban, rural or dense urban), and $\theta_i$ is the elevation angle between the user and the UAV. Here, $\theta_i$ can be calculated as $\theta_i = \frac{180}{\pi} \tan^{-1}\!\left(\frac{height}{radius_i}\right)$, where $radius_i = \sqrt{d_{u,A}^2 - height^2}$ [48]. According to Eq. (18), the probability of the line-of-sight (LoS) region increases as the elevation angle increases, so the probability of the non-line-of-sight (NLoS) region can be represented as $\Pr(NLoS) = 1 - \Pr(LoS)$. Fundamentally, the UAV altitude determines the elevation angle and the signal propagation distance, so both NLoS and LoS attenuation jointly impact the path loss from the UAV to the ground devices [43]. Hence, the lemma follows by averaging the two link gains in Eq. (17) with their respective probabilities.
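As a quick illustration, the following sketch evaluates Eq. (18) and the lemma's averaged path loss for a range of UAV altitudes. The environment parameters `S` and `R` and the attenuation factors are illustrative values of the kind fitted per environment in [47], not values taken from this paper.

```python
import numpy as np

def prob_los(height, ground_radius, S=9.61, R=0.16):
    """Eq. (18): sigmoid LoS probability vs. elevation angle (degrees).
    S, R are environment-dependent fit parameters (assumed values)."""
    theta = np.degrees(np.arctan2(height, ground_radius))
    return 1.0 / (1.0 + S * np.exp(-R * (theta - S)))

def mean_path_gain(height, ground_radius, alpha=2.5,
                   rho_los=1.0, rho_nlos=0.01):
    """Lemma 1: LoS/NLoS-averaged gain g_{u,a}; rho_los = 0 dB and
    rho_nlos = -20 dB are assumed attenuation factors."""
    d = np.hypot(height, ground_radius)     # UAV-user distance d_{u,A}
    p = prob_los(height, ground_radius)
    return p * rho_los * d**-alpha + (1 - p) * rho_nlos * d**-alpha

for h in (50, 100, 200):                    # altitudes in metres
    print(h, prob_los(h, 300.0), mean_path_gain(h, 300.0))
```

Raising the altitude increases the LoS probability but also the propagation distance, which is exactly the trade-off the lemma's averaged path loss captures.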

Lemma 2
The total number of subchannels from the UAV to the ground users and devices can be represented as $Z^t_{total} = \frac{E_{total} \cdot Z_T}{\sum_{l=1}^{U} E_l}$, in which the total number of available subchannels is assumed to be $Z_T$.
Proof According to the channel model, assigning subchannels to users can be simplified to assigning bandwidths, which is identical to finding the needed number of subchannels for each user in the system, because the transmission gain is directly related only to the path loss. As a result, maximizing the minimum of $DR^t_u / E^t_u$ is the same as satisfying Eq. (20).
As a result, if the available power is sufficient, then based on Shannon's capacity theory the user's sum rate is defined as in Eq. (21), where $\Gamma^t_{eq}$ is an SINR benchmark for all users on each subchannel; consequently, Eq. (20) becomes Eq. (22). To enhance data rates for all users equally, a simple and intuitive way is to start the random subchannel assignment with the entire subchannel number as the starting point, so the total number of subchannels from the UAV to the users and devices can be defined as stated in the lemma.
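A small sketch of the power-proportional subchannel split implied by Lemma 2 follows. The rounding policy and the function name `subchannels_per_user` are assumptions for illustration, since the lemma itself gives only the continuous ratio.

```python
import numpy as np

def subchannels_per_user(E, Z_T):
    """Lemma 2 sketch: split Z_T subchannels in proportion to each
    user's assigned power E_l (E_total = sum of E). Rounding to whole
    subchannels is an assumed policy."""
    E = np.asarray(E, dtype=float)
    share = E / E.sum() * Z_T
    z = np.floor(share).astype(int)
    # Hand out the leftover subchannels to the largest remainders.
    for i in np.argsort(share - z)[::-1][: Z_T - z.sum()]:
        z[i] += 1
    return z

print(subchannels_per_user([0.5, 1.0, 1.5, 2.0], Z_T=20))  # -> [2 4 6 8]
```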

Lemma 3 The minimum number of subchannels required for each device pair can be given as $Z_{D2D}$, derived from the QoS constraints via Eqs. (26)-(28).
Proof According to the device receivers' quality-of-service (QoS) restrictions, the sum data rate over the device-to-device subchannels must be greater than or equal to the total data rate required at the users to meet the quality of experience. Based on Shannon's capacity theory, the users' sum data rate is defined accordingly. Hence, the number of assigned subchannels can be reduced, and the minimum number of required subchannels can be calculated using Eqs. (26), (27) and (28).

Problem formulation
The problem of allocating resources with UAVs goes beyond conventional optimization techniques due to the huge number of users in 5G mMTC systems. Considering the discontinuous nature of the data, traditional resource allocation algorithms cannot quickly find an appropriate solution for allocating spectrum and power. In order to solve the resource allocation problems in our system, we first reduce the problem to a Markov Decision Process (MDP), which can then be solved by reinforcement learning algorithms. We propose a Dueling-DQN-based algorithm for resource allocation to maximize overall system energy efficiency. The DDQN technique is a reinforcement learning technique based on an improved version of the Q-Learning algorithm. We also consider the Q-Learning and DQN models to compare with the DDQN model. The DDQN reinforcement learning technique does not rely on prior knowledge and can successfully address the data explosion problem generated by a vast state and action space. It can also solve the resource allocation problem so as to maximize energy efficiency. Hence, we use the reinforcement-learning-based DDQN model in the UAV-assisted 5G mMTC system for resource allocation.

MDP model
Considering the system model of UAV-assisted 5G mMTC with its large action space, the reward feedback should be generated instantaneously, and the characteristics of an MDP are well suited to describing the states, rewards and actions. The Markov property is stated as "the future is independent of the past given the current state". In this model, the decision maker, i.e. the agent, is surrounded by the environment. The environment provides a reward and the next state based on the agent's action. We define the state space and reward function below.
• State space: The state indicates the agent's position in the environment at a given time stamp $t$ per user, and the updated state $s_{t+1}$ is given to the agent for the next action. Since the data transmission rate is variable, we allocate different data rates to each user, and the state at time $t$ records this per-user allocation.
• Reward function: In our system, the reward is calculated from the total data transmission rate of Eq. (13).

The environment and the agent are the two components of an MDP. The agent is either a decision-maker or a learner, and the agent's surroundings are referred to as the environment. A new circumstance arises as a result of the agent's action; these various conditions are referred to as states. In the MDP, the agent "interacts" with the environment: it chooses a course of action and then observes how the environment responds to it. It then obtains a reward correlated with the action and the state it transitioned to. The agent simulates this interaction repeatedly to determine the best course of action for each state. In this study, our aim is to solve the resource allocation optimization problem, in which the task is to maximize the energy efficiency subject to the given lower bound on the data transmission rate.
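The following is a minimal sketch of how such an MDP environment could be encoded: the state holds per-user assigned power levels, actions adjust one user's power by one step, and the reward is the energy efficiency of Eq. (15). The class name `MmtcSliceEnv`, the action encoding and all numeric values are illustrative assumptions rather than the paper's exact definitions.

```python
import numpy as np

class MmtcSliceEnv:
    """Toy MDP for UAV-assisted resource allocation (illustrative).
    State: per-user assigned power levels. Action: pick a user and
    raise/lower its power one step. Reward: energy efficiency (15)."""

    def __init__(self, n_users=5, levels=8, p_step=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.h = rng.uniform(1e-9, 1e-7, n_users)   # assumed channel gains
        self.levels, self.p_step = levels, p_step
        self.sigma2, self.bw = 1e-13, 1e6
        self.n_actions = 2 * n_users                # (user, up/down) pairs
        self.reset()

    def reset(self):
        self.p = np.ones_like(self.h)               # start at 1 W per user
        return self.p.copy()

    def step(self, action):
        u, direction = divmod(action, 2)
        delta = self.p_step if direction else -self.p_step
        self.p[u] = np.clip(self.p[u] + delta, self.p_step,
                            self.levels * self.p_step)
        rate = self.bw * np.log2(1 + self.p * self.h / self.sigma2)
        reward = rate.sum() / self.p.sum()          # eta, Eq. (15)
        return self.p.copy(), reward, False, {}
```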

Q-Learning
Reinforcement learning (RL) involves states, actions and rewards, in which the agent interacts with the environment through the current state, and the environment sends the next reward and state to the agent. Q-Learning is an off-policy reinforcement learning method in which the agent takes the best action based on the current state; that is, to receive the maximum reward, the agent chooses the best action. RL can be categorised into several classes depending on the problem: model-based (the environment can be understood by the actor), model-free on-policy (the same actor interacts and learns) and model-free off-policy (different actors for learning and interacting). The dynamic resource allocation algorithm based on Q-Learning is given in Algorithm 1, in which $\delta$ is the learning rate, $\gamma$ is the discount factor, $E_{min}$ is the minimum transmission power, $H_u$ and $H_a$ are channel gains, and $\sigma^2$ is the environmental noise. The Q-table values decide the next reward for taking the best action by the agent. For energy efficiency maximization, the action is chosen as $a_t = \arg\max Q_t(s_t, a_t), \ a_t \in \pi$. The Bellman equation is used to continually update the value of Q:
$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \delta \left[ r_t + \gamma \max_{a} Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \right].$$
Here, $\delta$ denotes the learning rate and $\gamma$ denotes the discount factor. $Q_t(s_t, a_t)$ is the Q value for time slot $t$, i.e. the anticipated revenue generated by taking action $a$ ($a \in A$) in state $s$ ($s \in S$) at time slot $t$. A new state is produced by the agent's action, together with the appropriate reward $R$. Here $0 \le \gamma < 1$; the value of $\gamma$ is determined by whether current or future output matters more.

Algorithm 1: Q-Learning-based dynamic resource allocation
1: Initialize the Q table, learning rate $\delta$ and discount factor $\gamma$;
2: for each episode do
3:   for each step of the episode do
4:     Select the initial system state randomly from the state space $S$;
5:     Choose an action $a_t$ from the set of possible allocation actions in the current action space $A$;
6:     Execute action $a_t$ and observe the reward $r_t$ and next state $s_{t+1}$;
7:     Update the parameters of $Q(s_t, a_t)$;
8:     Terminate when the expected state is reached.
9:   end for
10: end for
11: The optimal resource allocation and power distribution strategy is obtained, maximizing the energy efficiency of the system.
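A compact tabular sketch of Algorithm 1 follows. The epsilon-greedy exploration schedule, the function name `q_learning` and the `env.reset()`/`env.step()` interface returning `(next_state, reward, done)` are assumptions made for illustration, not the paper's exact environment.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, steps=100,
               delta=0.1, gamma=0.99, eps=0.1, seed=0):
    """Algorithm 1 sketch: tabular Q-Learning with epsilon-greedy
    exploration over discrete state/action indices."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        for _ in range(steps):
            # Explore with probability eps, otherwise act greedily.
            a = (rng.integers(n_actions) if rng.random() < eps
                 else int(Q[s].argmax()))
            s_next, r, done = env.step(a)
            # Bellman update with learning rate delta, discount gamma.
            Q[s, a] += delta * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
            if done:
                break
    return Q  # greedy allocation policy: Q.argmax(axis=1)
```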

DQN model
The Deep Q-Network (DQN) model combines deep learning and reinforcement learning to improve decision-making. DRL combines RL and deep learning, pairing RL's advantage in decision-making with the perceptual benefits of deep learning. As a result, agents are able to perceive more complicated environmental circumstances and develop more sophisticated action plans. We employ the DQN framework, which is based on the model-free DRL framework, to solve the MDP problem; it does not require any prior knowledge. DQN combines convolutional neural networks (CNNs) with Q-Learning. Based on the Q-Learning method, DQN builds a learnable deep objective function: a CNN generates the target Q value, and a second neural network approximates the target to determine the Q value of the following state in a high-dimensional and continuous state space. A neural network function approximator with weights $\theta$ is referred to as a Q-network. A Q-network can be trained by minimising a sequence of loss functions $L(\theta)$ that vary with each iteration:
$$L(\theta) = \mathbb{E}\left[\left(y_t - Q(s_t, a_t; \theta)\right)^2\right], \quad \text{where } y_t = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta^-).$$
In the CNN, the current and target parameters, $\theta$ and $\theta^-$ respectively, are used to estimate the target Q value based on the current Q value. Once the loss function $L(\theta)$ has been determined, the weight parameters of the predicted CNN model are updated using the gradient descent technique. The parameters of the predicted network are copied to the target network after every $C$ iterations. The Q value update for DQN proceeds as in Algorithm 2.

Algorithm 2: DQN-based dynamic resource allocation
1: Initialize the replay memory $R$, the Q-network with weights $\theta$ and the target network with weights $\theta^-$;
2: for each episode do
3:   for each step of the episode do
4:     Choose an action $a_t$ by the epsilon-greedy approach;
5:     Execute the state-action pair and observe the reward $r_t$ and next state $s_{t+1}$;
6:     Save the transition $(s_t, a_t, r_t, s_{t+1})$ in $R$;
7:     if the size of $R$ has reached its maximum capacity then
8:       Select a batch of transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$ at random;
9:     end if
10:    if the next state $s_{t+1} == s_M$ then
11:      Update the state value function and observe the reward;
12:    end if
13:   end for
14: end for
15: The optimal power allocation strategy $\pi^*$ is obtained.

DDQN model
The Dueling DQN (DDQN) is a type of Q-Learning network that contains two streams with separate estimators, where both streams share a common convolutional module. Dueling DQN is an extension of DQN. It uses two CNNs in order to obtain better policy evaluation in the presence of many similar-valued actions. The first network is the primary network, used for selecting an action. The second network, the target network, generates the Q value for that action. During training, the target Q values are computed for every action's loss function. The target network is a copy of the primary network whose weights remain fixed; they are updated after a constant number of iterations, initialised as a hyperparameter. During training, experience replay is used and the tuple $(s_t, a_t, r_t, s_{t+1})$ is stored in the buffer. Given the agent's policy $\pi$, the state value and advantage functions are defined as [49]
$$V^{\pi}(s) = \mathbb{E}\left[Q^{\pi}(s, a)\right], \qquad A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s).$$
The Q function can then be expressed by aggregating the two streams as
$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left(A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a'; \theta, \alpha)\right).$$

Algorithm 3: DDQN-based dynamic resource allocation
1: Initialize the experience buffer $M$, the primary network and the target network;
2: for each episode do
3:   for each step of the episode do
4:     Choose an action $a_t$ by the epsilon-greedy approach;
5:     Execute the state-action pair and observe the reward $r_t$ and next state $s_{t+1}$;
6:     Store $(s_t, a_t, r_t, s_{t+1})$ in the experience buffer $M$;
7:     Sample a batch from the experience buffer $M$;
8:     Calculate the target Q value;
9:     Calculate the loss value by the loss function $L(\theta) = \mathbb{E}\left[\left(r_t + \gamma \max_{a} Q(s_{t+1}, a; \theta^-) - Q(s_t, a_t; \theta)\right)^2\right]$;
10:    Update $\epsilon$ and the parameters with the Adam optimizer;
11:   end for
12: end for
13: The optimal power allocation strategy $\pi^*$ is obtained.

The low-level convolutional structure of the DDQN is identical to that of DQN: three convolutional layers followed by fully connected layers. The first convolutional layer has 32 filters of size 8 × 8 with a stride of 4, the second has 64 filters of size 4 × 4 with a stride of 2, and the third has 64 filters of size 3 × 3 with a stride of 1. In the duelling network, two streams of fully connected layers then separate; both the value and advantage streams contain a 512-unit fully connected layer. The final hidden layers feed the value and advantage streams, with the value stream producing one output and the advantage stream producing one output per action.
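A minimal PyTorch sketch of the dueling head described above follows. Since the states in this resource allocation problem are low-dimensional vectors rather than images, fully connected layers are assumed in place of the convolutional stack, and the class name `DuelingQNet` and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling architecture: shared trunk, then separate value V(s)
    and advantage A(s, a) streams, aggregated with the mean-centred
    advantage so that V and A remain identifiable."""

    def __init__(self, state_dim, n_actions, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Sequential(nn.Linear(hidden, 512), nn.ReLU(),
                                   nn.Linear(512, 1))
        self.advantage = nn.Sequential(nn.Linear(hidden, 512), nn.ReLU(),
                                       nn.Linear(512, n_actions))

    def forward(self, s):
        z = self.trunk(s)
        v, a = self.value(z), self.advantage(z)
        # Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a')
        return v + a - a.mean(dim=1, keepdim=True)

q_net = DuelingQNet(state_dim=15, n_actions=30)
target_net = DuelingQNet(state_dim=15, n_actions=30)
target_net.load_state_dict(q_net.state_dict())   # fixed-weight copy
```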

Algorithm analysis
We consider the optimal action-value function of reinforcement learning with the discounted reward in Markov decision processes (MDPs). One of the key elements in reinforcement learning is the agent, which communicates with the environment and records the outcome of that communication in the experience replay buffer. The initialization phase creates our environment with all necessary wrappers applied, the primary neural network that we train, and our target network with the same design. Additionally, we establish an experience replay buffer of the necessary size and give it to the agent. The last steps before entering the training loop are to build an optimizer, a buffer for the full episode payouts, a frame counter, and a variable to monitor the best mean reward attained. At the start of each iteration of the training loop, we count the total number of iterations finished and update epsilon. The agent then moves one step forward in the environment. The first step is to randomly choose a small batch of transitions from the replay memory. The method then moves the individual NumPy arrays with batch data to the GPU by wrapping them in PyTorch tensors. Then, using tensor operations, we pass the data to the first model and extract the precise Q values for the executed actions via gather(): its first argument is the index of the dimension we wish to gather over, and its second is a tensor of the element indices to be picked. The function max() returns the maximum values as well as their indices; we take the first entry of the result because in this instance we are interested only in the values. The vector of targets, i.e. the expected state-action value for each transition in the batch, may then be used to compute the Bellman approximation value, as sketched below.
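The following is a minimal sketch of that batch update in PyTorch, matching the gather()/max() mechanics described above. The function name `dqn_loss_step`, the replay batch format and the hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

def dqn_loss_step(q_net, target_net, optimizer, batch, gamma=0.99,
                  device="cpu"):
    """One training step: gather Q(s,a) for the executed actions,
    bootstrap the target from the fixed target network, and apply
    the Bellman-error MSE loss. `batch` is (s, a, r, s_next, done)
    as NumPy arrays -- an assumed replay-buffer format."""
    s, a, r, s_next, done = (torch.as_tensor(x).to(device) for x in batch)

    # Q values of the actions actually taken: index dim 1 by `a`.
    q_sa = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # max() returns (values, indices); keep only the values.
        q_next = target_net(s_next.float()).max(dim=1).values
        y = r.float() + gamma * q_next * (1 - done.float())

    loss = nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```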

Simulation environment
For the simulation experiments and analysis, we used a Tyrone DIT400TR-48RL workstation with 128 GB RAM, equipped with an NVIDIA Quadro RTX 5000 GPU card on an Intel C621 chipset. We perform extensive experiments in order to give a proper argument for our system model. We employ a conda environment and packages on Linux, running Python, to assess the performance of the proposed UAV-assisted 5G mMTC resource allocation approach. A convolutional neural network (CNN) is used in the model, comprising two hidden layers with 256 and 64 neurons, respectively. The memory bank's value is determined by its memory capacity; at any one time, the batch size refers to the number of samples gathered from the memory bank. Memory and batch size have an impact on the DDQN's accuracy and training rate. We created simulation settings that comply with 5G specifications and recommendations per the 3GPP standards. The system begins with a single UAV and a large number of base stations. The system radius has been set to 500 metres in our scenario. The important simulation parameters are listed in Table 2.

Simulation results and parameter analysis
In this section, we discuss the simulation results and performance analysis by investigating the power consumption, throughput and energy efficiency of the systems. We consider an emergency communication scenario in which the communication link has broken down due to a disaster. A UAV works as a base station in this region to establish communication with the outside, since the ground base stations are mostly mains-powered and might be disturbed in a disaster. The UAV-assisted 5G mMTC network has been considered for resource allocation, aiming to maximize energy efficiency. We simulated the proposed Dueling DQN model for 5000 epochs. The learning rate and discount factor are 0.01 and 0.99, respectively. We perform the experiments with 15 users, 4 ground base stations and 1 UAV. Figure 4 shows the system throughput for different numbers of base stations (2, 4, 6, …, 20), in which the UAV-assisted network provides better throughput than the network without UAV assistance. As the number of base stations increases, the UAV-assisted setup keeps providing better overall throughput than the normal environment. Figure 5 shows the energy consumption for different numbers of base stations (2, 4, 6, …, 20), in which UAV-assisted networks consume more energy than networks without UAV assistance. The plot is linear because we provision a uniform type of UAV and base station power in the system model. Operational expense and throughput are two significant performance indicators, as excessive consumption raises operating costs and insufficient throughput negatively impacts the user's experience. The energy efficiency versus different numbers of users for the different Q-learning algorithms is shown in Fig. 6. We considered the numbers of users (10, 15, 20, 25 and 30) for the DDQN model and show a comparison with the DQN, Q-Learning and Random algorithms. The performance with 10 users is better than the others because of the low-density network environment. The overall energy efficiency declines as the number of users increases, and the DDQN model performs better in each case.
The effect of different learning rates on the energy efficiency of the DDQN and DQN models is shown in Fig. 7. We consider learning rates of (0.001, 0.01, 0.1), of which 0.01 performs best, while a learning rate of 0.001 takes more iterations to reach stable performance. We fix the learning rate to 0.01 for the rest of our experiments. We run 5000 episodes for analysis, but the DDQN and DQN models become stable after about 1000 episodes.
The learning rate is one of the essential hyperparameters of reinforcement learning algorithms; changing it affects the neural network's weights, and changing the depth affects the algorithm's performance. If the learning rate tends to 0, the most recent feedback may not be acquired by the agent; iteration will take a long time and the convergence speed will slow. If it is high, on the other hand, convergence will be too fast to reach the ideal resource allocation strategy, resulting in decreased system performance. The performance with a learning rate of 0.1 is not as good as with the other values. Hence, we have used 0.01 as the learning rate for our experiments. The overall performance of the DDQN model is better than that of the DQN model; therefore, we use the DDQN model for our experiments and the remaining reinforcement learning techniques for comparison.
The effect of different discount factors on the energy efficiency of the DDQN and DQN models is shown in Fig. 8. We consider discount factors of (0.99, 0.79, 0.59), of which 0.99 performs best, while 0.79 and 0.59 give lower performance. Hence, we have used 0.99 as the discount factor for our experiments. Again, the overall performance of the DDQN model is better than that of the DQN model, which further supports using the DDQN model for our experiments, with the remaining reinforcement learning techniques used for comparison.

Performance comparisons
We study the UAV-assisted framework in the 5G mMTC, where many connected IoT devices act as users connected to the ground base stations. The ground base stations can be compromised in a disaster because they are mains-powered. We considered a UAV-assisted base station which can connect to the outside region for emergency communication. The UAV can establish line-of-sight (LoS) links over which it connects to the users. In Fig. 9 we show the UAV throughput versus different $\theta_{LoS}$ ranges for the given device and UAV power and hover height. The UAV provides better throughput when the hover height is lower and the UAV and device power is higher. Here, increasing the $\theta_{LoS}$ range increases the throughput performance of the two networks. This is a very interesting phenomenon because widely distributed users are close to the NLoS region [25]. Furthermore, since increasing the distance between user and UAV decreases the link gain, a larger $\theta_{LoS}$ decreases the interference to the user.
The transmission power of the UAV-assisted base station plays a major role and can affect the overall wireless communication. Figure 10 shows the system's performance versus transmission power for several numbers of users. It is observed that an increment in transmission power brings a decrement in the energy efficiency of the system. We used a transmission power of 3 dBm with 15 users for the rest of the experiments. The system energy efficiency decreases when the user's minimum transmission power rises, as seen in Fig. 10. We can also see that the system energy efficiency declines as the number of users rises: given the system's limited resources, an increase in user count results in higher energy usage and lower system energy efficiency. We simulated our system models with the Dueling DQN, DQN, Q-Learning and Random allocation algorithms for a broader performance comparison. In Fig. 11 we show the energy efficiency for different numbers of users. The peak distribution indicates that the base station sends data to the user at the highest possible transmission power, while the term random refers to the base station assigning transmission power to the user at random. Dueling DQN provides better energy efficiency compared to the DQN and Q-Learning models.
Computation time is an important parameter for reasoning about scalability. We perform the computation analysis with the DDQN algorithm as shown in Fig. 12. The numbers of base stations considered are (2, 4, 6, 8, 10,

Conclusion
This paper studied the UAV-assisted 5G mMTC slice for emergency communication in a disaster scenario. We investigated the resource allocation maximization problem in our UAV-assisted wireless network environment. We considered the UAV as a flying base station for the emergency communication system with 5G mMTC network slicing to overcome communication limits induced by natural disasters. We formulated our problem to improve overall energy efficiency and separated the resource allocation problem into two modules: a user link selection strategy and a power control method. Then, we reduced the problem to a stochastic optimization problem using Markov Decision Process (MDP) theory. We proposed a Dueling Deep Q-Network (DDQN) based algorithm built on reinforcement learning for dynamic resource allocation. We performed extensive experiments with the proposed model, Q-Learning and DQN to present a thorough analysis, and found that the overall performance of the DDQN model is better. In the future, we will look at other deep reinforcement learning techniques and at UAVs as auxiliary equipment, such as UAVs that act as handover points and UAVs that act as wireless charging equipment.

Author Contributions
The study, implementation and experiments were carried out by all of the authors. RKG prepared the materials, ran the simulations and conducted the analysis, and SK helped with the simulations under the supervision of Dr. RM. RKG wrote the first draft of the manuscript, and all contributors provided feedback. The final manuscript was read and approved by all of the authors.

Funding
The preparation of this manuscript was not funded.
Data availability The code and data used in this research are available from the corresponding author on reasonable request.