Joint Optimization Framework for Maximization of Instantaneous Transmission Rate in Signal-to-Interference-plus-Noise-Ratio-Constrained UAVs-Supported Self-Organized Device-to-Device Networks


 Due to their high maneuverability, flexible deployment, and line-of-sight (LoS) transmission, unmanned aerial vehicles (UAVs) are an attractive option for reliable device-to-device (D2D) communication when a direct link between source and destination devices is unavailable due to obstacles in the signal propagation path. Therefore, in this paper, we propose a UAVs-supported self-organized device-to-device (USSD2D) network in which multiple UAVs are employed as aerial relays. We develop a novel optimization framework that maximizes the total instantaneous transmission rate of the network by jointly optimizing the deployed locations of the UAVs, the device association, and the UAVs' channel selection, while ensuring that every device achieves a given signal-to-interference-plus-noise ratio (SINR) constraint. As this joint optimization problem is nonconvex and combinatorial, we adopt a reinforcement learning (RL) based solution methodology that effectively decouples it into three individual optimization problems. The formulated problem is transformed into a Markov decision process (MDP) in which the UAVs learn the system parameters from the current state and the corresponding action, aiming to maximize the reward generated under the current policy. Finally, we adopt SARSA, a low-complexity iterative algorithm for updating the current policy in the case of randomly deployed device pairs, which achieves a good computational complexity-optimality tradeoff. Numerical results validate the analysis and provide various insights into the optimal deployment of UAVs. The proposed methodology improves the total instantaneous transmission rate of the network by 75.37%, 52.08%, and 14.77% compared with the RS-FORD, ES-FIRD, and AOIV schemes respectively.


Introduction
The fifth-generation mobile communication (5G) standard is obliged to enable new features, including critical and disaster recovery operations, in modern communication systems. These operations demand consistent network performance by maintaining the required quality of service (QoS), which incorporates latency, transmission rate, and reliability throughout urban and rural environments [1]. However, maintaining the required QoS is challenging because radio coverage varies due to obstacles in the signal propagation path, and high user density significantly reduces network capabilities. These limitations can be combated by deploying unmanned aerial vehicles (UAVs), which have the ability to establish a line-of-sight (LoS) dominant air-to-ground channel in a controllable manner [2]. Specifically, the UAV-enabled wireless sensor network (WSN) utilizes UAVs as computing hubs and mobile data collectors to perform several analyses on the data collected from spatially separated sensor nodes (SNs) and communicating devices [3].
The use of UAVs makes existing wireless communication systems more efficient in terms of coverage, capacity, and energy consumption. Hence, flying cell site technology employs UAVs as aerial base stations to serve regions beyond the coverage of the cellular network [4]. Google launched Project Loon [5], where aerial platforms are deployed as flying base stations (BSs) to provide broadband service to rural and remote areas. The ABSOLUTE project, based on a Long Term Evolution-Advanced (LTE-A) aerial base station, was inaugurated to support services for disaster relief activities by providing wide coverage and connectivity [6]. Moreover, owing to their agility and mobility, UAV relays can be beneficial in zones where constructing cellular and mobile infrastructure is very expensive [7]. However, when UAVs operate as aerial relays, not only the link from the ground source device to the UAV but also the link from the UAV to the destination device influences the network performance. Therefore, the deployment of UAVs as relays for device-to-device (D2D) communication is a more practical concern.
The existing literature addresses the deployment of UAV relays for D2D communication from a broad range of aspects [8]-[16]. In [8], the authors estimated the optimal UAV relay position in a multi-rate communication system using theoretical and simulated analysis. The work in [9] investigates the optimal deployment and movement of a UAV relay to improve the connectivity of ground users. The authors of [10] and [11] maximize the lower bound of the uplink transmission rate and enhance the quality of the link between the UAV relay and ground devices using dynamic heading adjustment approaches. For throughput maximization of a mobile relaying system, an iterative algorithm was developed [12], [13] that jointly optimizes the trajectory of the relays and the transmit power of sources and UAVs subject to practical constraints. Chen et al. [14] suggested two relaying schemes under different channel models for the deployment of multiple UAVs, but no resource competition is considered among the source-destination pairs in the network. Underlay spectrum sharing [15] between three-dimensional drone cells and traditional cellular networks has been modeled using a two-dimensional Poisson point process, where the coverage probability and achievable throughput of the drone cells are maximized subject to a cellular network efficiency constraint. However, online spectrum allocation corresponding to varying UAV locations has not been considered. In [16], the authors maximize the throughput of a UAV relay network by jointly optimizing transmit power, bandwidth, transmission rate, and relay deployment. However, that work uses a model-based centralized approach with only a single relay, where all necessary system parameters are assumed known. A research gap therefore remains on enhancing network performance for source-destination device pair communication via optimal placement of UAV relays. In such existing work, the following open problems arise.
• Due to the limited number of UAV relays in the D2D network, relay competition occurs among the source-destination device pairs. Moreover, a device receives a different QoS from different positions of the multiple UAVs. Therefore, device association with a UAV relay is a significant factor influencing network capacity, which is overlooked by current articles.
• In a practical scenario, the spectrum resources that can be used for UAV communication networks are limited. Whenever multiple UAV relays and multiple device pairs are assigned the same channel, mutual interference occurs among the communicating nodes. In particular, the effect of interference grows as the number of communicating devices and UAVs in the network increases. Therefore, channel assignment needs to be considered for a UAV communication system.
• The relationship among device association with a UAV relay, the UAV's channel selection, and the optimal deployment of the UAV relay is intractable, since a change in one variable may impact the others. Furthermore, the existing optimization approaches related to these variables are all model-based: some initial information, e.g., the channel model and the locations of the ground users, is utilized to drive the optimization process. But in reality the communication environment is unknown, which means the exact channel model and the locations of the ground units are hard to obtain. Therefore, a model-free approach must be designed in which the UAVs can estimate the channel model from field measurements and statistics.
During recent decades, machine learning (ML) algorithms have gained remarkable attention in the communication and networking sector. In particular, reinforcement learning (RL), as a branch of ML, can estimate the channel model and learn the system behavior in an unknown environment in UAV-assisted wireless networks [17], [18]. Liu et al. [19] proposed a Q-learning based deployment and movement algorithm that maximizes the achieved sum mean opinion score of ground users. A multi-agent RL framework was adopted by Cui et al. [20] to investigate dynamic resource allocation, where the aim is to maximize the long-term expected reward by automatically selecting the power level and subchannel allocation when there is no information exchange among the UAVs.
The authors of [21] presented a deep reinforcement learning (DRL) algorithm to control the trajectories of UAVs and minimize the energy consumption of all user equipment by jointly optimizing user association, resource allocation, and the trajectories of the UAVs.
It is observed from the existing literature that UAV relays are used for D2D pair communication when a direct link cannot be established. Furthermore, UAVs help low-transmit-power devices convey data to intended receivers located at long distances, with the UAVs' channel selection and device association executed in a decentralized manner. However, deploying UAVs using conventional methodology [7] faces difficulties in satisfying network coverage and capacity requirements. In particular, the intractable relationship among UAV deployment, relay assignment, and resource allocation cannot be jointly optimized using a centralized model-based approach in an unknown communication environment. Therefore, in this paper, we propose a UAVs-supported self-organized device-to-device (USSD2D) network containing multiple source-destination device pairs and multiple UAVs, where the aim is to find the optimal deployed locations of the UAVs, the device association with the UAVs, and the UAVs' channel selection that simultaneously support reliable data transmission between source and destination device pairs. To the best of our knowledge, this is the first work that considers signal-to-interference-plus-noise ratio (SINR) constrained maximization of the total instantaneous transmission rate of the USSD2D network by jointly optimizing device association, the UAVs' channel selection, and their deployed locations at every time slot, while considering the random deployment of static source-destination device pairs in an unknown communication environment. Based on the proposed framework, the major contributions are summarized as follows:
• Considering a realistic scenario, we formulate an SINR-constrained joint optimization problem for maximizing the total instantaneous transmission rate of the USSD2D network over the entire flight period of the UAVs.
Because of its time-slot-dependent, nonconvex, and combinatorial nature, the formulated optimization problem is non-trivial. To solve this challenging problem, a model-free RL-based solution methodology is proposed that decouples the joint optimization problem into three individual optimization problems. The UAVs, acting as RL agents, learn an adaptive strategy from their own experience earned through interaction with an unknown environment.
• We decompose the formulated nonconvex and combinatorial optimization problem into a multi-period decision-making process with three individual optimization problems: individual device association with a UAV, the UAVs' channel selection, and their deployed locations at every time slot. Specifically, we model the trajectory optimization problem as a Markov decision process (MDP), where the states and actions represent the locations of the UAVs at the current time slot and the corresponding movements of the UAVs respectively. As each device selects a single UAV according to its association probability, the total transmission rate achieved by all devices in a time slot over the allocated channels is treated as the instantaneous reward attained by the UAVs.
• The state-action-reward-state-action (SARSA) algorithm and an ε-greedy action selection scheme are adopted to obtain the optimal policy, allowing the UAVs to find their deployed locations at every time slot without the need for system identification. The proposed methodology obtains the current policy by using the Q-value of all state-action pairs. Furthermore, to enhance the learning process, we analyze both the convergence and the complexity of the proposed algorithm.
• We provide numerical results that validate the analysis and give insights into the impact of various system parameters on the optimal-UAV-deployment versus total-instantaneous-transmission-rate trade-off.
Finally, to corroborate the importance of the proposed joint optimization framework, we present a performance comparison against benchmark schemes to quantify the achievable improvement in the total instantaneous transmission rate of the USSD2D network. The rest of the article is organized as follows. In section 2, the proposed USSD2D network, including the formulation of the objective problem, is elaborated. Afterward, we discuss the key techniques of the SARSA algorithm and the optimal decision-making policy in section 3 and section 4 respectively. Simulation results and the corresponding inferences are presented in section 5. Finally, the conclusion is outlined in section 6, followed by the references.

System model
Consider a UAV-supported self-organized device-to-device (USSD2D) network, as depicted in Fig. 1, operating on a discrete-time axis. There is a set of randomly deployed source and destination device pairs on the ground with fixed locations in the target area. Some of the device pairs can communicate via direct links because of the short distances and good channel conditions between them; these are identified as direct D2D pairs. However, in the case of long-distance communication or obstruction by obstacles in the signal propagation path, a direct link cannot be established for all communicating device pairs. To create links for the source-destination device pairs when a direct link is unavailable, multiple UAVs, each equipped with a single half-duplex antenna, are deployed at a fixed altitude to operate as relays for their communication. We refer to those device pairs as UAV-assisted D2D pairs.
According to the practical scenario [22], one source device can associate with only one UAV in a time slot, but one UAV can be shared by multiple source devices. Furthermore, each UAV can occupy only a single orthogonal channel in each time slot, and the direct D2D pairs are assigned fixed channels during the entire flight period of the UAVs. The transmission link from a source device to its associated UAV and from the UAV to the destination device follows a round-robin protocol [22], characterized by an amplify-and-forward relaying model. To indicate whether device k ∈ 𝒦 (where 𝒦 is the union of the sources and destinations of the direct and UAV-assisted D2D pairs) associates with UAV m ∈ ℳ at time slot t, an indicator ā_{k,m}(t) ∈ {0, 1} is introduced, with ā_{k,m}(t) = 1 if and only if device k associates with UAV m at time slot t; the number of devices associated with UAV m at time slot t is then Σ_{k∈𝒦} ā_{k,m}(t). Similarly, an indicator ã_{m,c}(t) ∈ {0, 1} represents whether UAV m ∈ ℳ selects channel c ∈ 𝒞 at time slot t. The expected path loss between device k ∈ 𝒦 and UAV m ∈ ℳ at time slot t follows the probabilistic LoS model [23], combining the free-space path loss 20 log₁₀(4π f_c d_{k,m}(t)/c) with the LoS and NLoS attenuation factors η_LoS and η_NLoS weighted by the LoS probability, where c is the velocity of light, f_c is the carrier frequency of the channel selected by the UAV, and c₁ and c₂ are the environment-dependent constants of the LoS-probability model. The LoS probability depends on the elevation angle θ_{k,m}(t) = (180/π) sin⁻¹(H/d_{k,m}(t)) between device k and UAV m, where H is the UAV altitude and the instantaneous distance between device k and UAV m is d_{k,m}(t) = √((x_m(t) − x_k)² + (y_m(t) − y_k)² + H²). The instantaneous channel gain between source device k ∈ 𝒦 and relay UAV m ∈ ℳ is then obtained from the path loss L_{k,m}(t) as g_{k,m}(t) = 10^{−L_{k,m}(t)/10}.
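The air-to-ground model above can be sketched numerically. This is a minimal illustration of the standard probabilistic LoS path loss form used in the UAV literature; the constants `eta_los`, `eta_nlos`, `c1`, and `c2` are illustrative defaults, not values taken from this paper.

```python
import math

def atg_path_loss_db(d_3d, height, f_c, eta_los=1.0, eta_nlos=20.0,
                     c1=9.61, c2=0.16):
    """Expected air-to-ground path loss (dB) under the probabilistic LoS model.

    d_3d: 3-D device-UAV distance (m); height: UAV altitude (m);
    f_c: carrier frequency (Hz). eta_*, c1, c2 are environment-dependent
    constants (values here are illustrative).
    """
    theta = math.degrees(math.asin(height / d_3d))           # elevation angle
    p_los = 1.0 / (1.0 + c1 * math.exp(-c2 * (theta - c1)))  # LoS probability
    fspl = 20.0 * math.log10(4.0 * math.pi * f_c * d_3d / 3.0e8)  # free space
    return fspl + p_los * eta_los + (1.0 - p_los) * eta_nlos

def channel_gain(d_3d, height, f_c):
    """Linear channel gain corresponding to the expected path loss."""
    return 10.0 ** (-atg_path_loss_db(d_3d, height, f_c) / 10.0)
```

As expected, the gain decays with device-UAV distance, and a higher elevation angle (device nearly under the UAV) raises the LoS probability and lowers the expected loss.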

Transmission model
At time slot t, the received signal-to-interference-plus-noise ratio (SINR) of UAV m over channel c ∈ 𝒞 from source device k ∈ 𝒦 can be expressed as [22]
Γ_{k,m,c}(t) = P_k g_{k,m}(t) / (I_{m,c}(t) + N₀),
where P_k is the transmit power of ground source device k, N₀ is the power of the additive white Gaussian noise (AWGN), and I_{m,c}(t) is the instantaneous interference at the receiver of UAV m while it receives the signal of source device k over channel c. This interference can be categorized into two parts. The first part is the interference Ĩ_{m,c}(t) from the sources of the other relays that occupy the same channel as UAV m, and the second part is the interference Ī_{m,c}(t) from the sources of the direct D2D pairs that occupy the same channel as UAV m; the total interference received by UAV m is I_{m,c}(t) = Ĩ_{m,c}(t) + Ī_{m,c}(t). Since the number of source devices contributing to Ĩ_{m,c}(t) varies in each time slot and the sources of the direct D2D pairs transmit probabilistically, the expected instantaneous SINR is used instead; its approximation averages the interference over the set 𝒦_{m̃} of source devices associated with each co-channel UAV m̃, where |𝒦_{m̃}| is the number of devices in that set.
Similarly, the expected instantaneous interference at UAV m caused by the source devices of the direct D2D pairs is estimated over the set 𝒦̄_c ⊆ 𝒦̄ of direct D2D sources that select channel c, and the total expected interference at time slot t is the sum of these two expectations. The expected instantaneous SINR received by destination device k+N ∈ 𝒦 from UAV m ∈ ℳ over channel c ∈ 𝒞 can be expressed as
E[Γ_{m,c,k+N}(t)] = P_m g_{m,k+N}(t) / (E[I_{c,k+N}(t)] + N₀),
where P_m is the transmit power of UAV m, N is the total number of source-destination device pairs in the target area, and E[I_{c,k+N}(t)] is the expected instantaneous interference at destination device k+N while it receives the signal of UAV m over channel c. This interference occurs when other UAVs and direct D2D pairs select the same channel as UAV m, and the expected interference received by the destination device at time slot t is estimated accordingly. Since source device k̄ ∈ 𝒦̄ and its destination device k̄+N are both on the ground, the amplify-and-forward channel model is not suitable for their communication; hence we adopt the conventional channel model for ground user communication. Under this channel model, the instantaneous channel gain between devices k̄ and k̄+N can be expressed as
g_{k̄,k̄+N}(t) = β₀ d_{k̄,k̄+N}^{−α},
where β₀ = (c/(4π f_c))² is the free-space path loss at a distance of 1 m and α is the path loss exponent. The expected instantaneous SINR of direct D2D communication from source device k to destination device k+N over channel c ∈ 𝒞 takes the same form, with the expected interference E[I_{c,k+N}(t)] at the receiver of device k+N measured over all co-channel transmitters at time slot t. For reliable communication and to maintain the required QoS, the expected instantaneous SINR received by a destination device over its channel must exceed the threshold: E[Γ_{c,k+N}(t)] > Γ_th, ∀k ∈ 𝒦, c ∈ 𝒞. The expected instantaneous transmission rate of a link between the source and destination device of a direct D2D pair over channel c ∈ 𝒞 can be expressed as [24]
R̄_{k,c}(t) = B log₂(1 + E[Γ_{c,k+N}(t)]),
where B is the bandwidth of the channel.
The total instantaneous transmission rate of the links between the direct D2D pairs, R̄(t), is the sum of these rates over all direct pairs. The expected instantaneous transmission rate of the link between the source and destination device of a UAV-assisted D2D pair, via UAV m on its selected channel c ∈ 𝒞, follows the same Shannon form, halved to account for the two half-duplex relaying phases [24]; the total instantaneous transmission rate of the UAV-assisted D2D links, R̃(t), is its sum over all UAV-assisted pairs. The overall instantaneous transmission rate of the USSD2D network is obtained as R(t) = R̄(t) + R̃(t).
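The rate expressions above can be sketched as follows. This is a minimal numeric illustration assuming Shannon-rate links; taking the end-to-end relay SINR as the bottleneck hop via `min()` is a simplification of the exact amplify-and-forward expression, and all names are illustrative.

```python
import math

def sinr(p_tx, gain, interference, noise=1e-13):
    """Received SINR: desired power over interference plus noise."""
    return p_tx * gain / (interference + noise)

def rate_direct(bandwidth, sinr_val):
    """Shannon rate of a direct D2D link over its channel."""
    return bandwidth * math.log2(1.0 + sinr_val)

def rate_relayed(bandwidth, sinr_hop1, sinr_hop2):
    """Two-hop half-duplex relay rate: the factor 1/2 reflects the two
    round-robin phases (source->UAV, UAV->destination); the bottleneck
    min() is an illustrative simplification of amplify-and-forward."""
    return 0.5 * bandwidth * math.log2(1.0 + min(sinr_hop1, sinr_hop2))
```

For equal per-hop SINR, the relayed rate is exactly half the direct rate over the same channel, which is the price of half-duplex relaying.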

Problem formulation
When the UAVs fly towards some devices to acquire a better channel environment, the remaining devices of the network cannot receive adequate service from them. Moreover, the transmission rate of devices far away from the UAVs becomes lower than that of devices closer to the UAVs. We consider that one source device can associate with only one UAV for aid, but one UAV can be shared by multiple devices. Furthermore, each UAV can occupy only a single orthogonal channel in each time slot. To maximize the total instantaneous transmission rate of the USSD2D network, we aim to jointly optimize the deployed locations of the UAVs, the device association, and the UAVs' channel selection at each slot, while ensuring that the expected instantaneous SINR received by each destination device achieves a threshold for maintaining the required QoS. The corresponding optimization problem, denoted (22), is formulated subject to the following constraints. Here, C1 ensures that the USSD2D network maintains the required QoS, i.e., the received instantaneous SINR of each destination device is greater than the given threshold; C2 specifies the binary constraints on the device association indicator and the UAVs' channel selection indicator at every time slot; C3 defines that at every time slot each device associates with a single UAV; and C4 implies that each UAV selects an orthogonal channel from the available channel set in each time slot. The formulated problem is complicated and intractable because the variables (x_m(t), y_m(t)), ā_{k,m}(t), and ã_{m,c}(t) are coupled: a change in one variable impacts the optimization of the other variables and the value of the objective function. Moreover, due to the absence of a central controller, the path loss and the gain of the selected channel, as well as the locations of the communication nodes, are unknown a priori. Hence, this nonconvex and combinatorial optimization problem cannot be solved by classical nonconvex optimization methods.
To tackle these challenges, an online model-free learning-based UAV deployment strategy is needed, whose aim is to find the optimal deployed locations of the UAVs by obtaining the required system parameters through real-time measurements and statistics of collected information, without prior knowledge of the channel model or device locations in an unknown communication environment.

Dynamics of optimal decision making policy
Advanced automatic systems are trained to make intelligent decisions using RL schemes, in which an RL agent communicates directly with an unknown environment and receives feedback as a reward or penalty corresponding to the quality of the action taken to perform a specific job [25]. Being an RL agent, each UAV adapts to the environment through a self-learning procedure and decides on an action according to its learning experience. Conventionally [26], the optimal deployment of the UAVs, the device association, and the UAVs' channel selection have been obtained by UAV-assisted relaying systems under an offline framework, where the devices know the exact ground-to-air channel conditions. In our model, however, devices know only causal information, such as their association probability, the UAVs' instantaneous locations, and the channel selection probability at the current time slot; the statistical information corresponding to these parameters cannot be obtained for all time slots. Therefore, we treat the objective function in (22) as a nonconvex and combinatorial optimization problem, whose corresponding suboptimal solution can be regarded as a sequential decision-making policy modeled by a Markov decision process (MDP). The suboptimal solutions obtained from the MDP for each time slot characterize the throughput of the proposed USSD2D network in terms of the total instantaneous transmission rate.

Analysis of model-free MDP
In order to solve this MDP, we adopt the online RL method because it does not require statistical information. At each time slot, devices can only obtain causal information. The action selected by a UAV depends entirely on its instantaneous position, and the current state of a UAV is related only to the state and action of the previous time slot. Thus, the system satisfies the Markov property and can be composed as the quintuple 〈𝒮, 𝒜, 𝒫, ℛ, 𝒯〉, where 𝒮 and 𝒜 denote the state and action space respectively, 𝒫 is the state transition probability, ℛ is the reward, and 𝒯 is the set of time slots in which decisions need to be made. Specifically, for our optimization problem, the elements of each tuple are defined as follows.

State space
In our scenario, we can only obtain x_m(t) and y_m(t), ∀m ∈ ℳ. Hence, we define the state of UAV m at time slot t as the two-element vector s_m(t) = (x_m(t), y_m(t)) ∈ 𝒮, which represents the current location of UAV m. The elements of the state space are independent and identically distributed random variables across the horizon of time slots, and 𝒮 is the set of all state elements arranged as the combination of all possible values.

Action space
To solve the optimization problem described in (22), the UAVs estimate their deployed locations at every time slot. We define the instantaneous action of UAV m, ∀m ∈ ℳ, as a change of its location, denoted a_m(t) ∈ 𝒜. These location changes are measured with respect to the X and Y coordinates of the UAV obtained at the current time slot. Each UAV explores the sensing region by selecting a moving direction and travelling in that direction for the distance covered in one slot of duration τ. For better understanding, the moving directions of each UAV are illustrated in Fig. 2: a UAV has a maximum of eight possible moving directions at each state, i.e., {east, north-east, north, north-west, west, south-west, south, south-east}. After selecting a moving direction, the corresponding X and Y coordinate changes of UAV m at time slot t are Δx_m(t) ∈ {0, ±v_m(t)τ} and Δy_m(t) ∈ {0, ±v_m(t)τ} respectively, ∀a_m(t) = {Δx_m(t), Δy_m(t)} ∈ 𝒜, t ∈ 𝒯, where v_m(t) is the velocity of UAV m at time slot t and 𝒜 is the set in which all possible actions are stored. The X and Y coordinates of UAV m at time slot t + 1 are therefore obtained as x_m(t+1) = x_m(t) + Δx_m(t) and y_m(t+1) = y_m(t) + Δy_m(t).
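The eight-direction action space and the per-slot position update can be sketched as a small lookup table; the direction ordering below is one possible convention, chosen for illustration.

```python
# Eight compass moving directions (E, NE, N, NW, W, SW, S, SE), expressed
# as unit (dx, dy) steps; the per-slot displacement of each coordinate is
# the unit step scaled by v_m(t) * tau (velocity times slot duration).
DIRECTIONS = [
    (1, 0), (1, 1), (0, 1), (-1, 1),
    (-1, 0), (-1, -1), (0, -1), (1, -1),
]

def next_position(x, y, action, velocity, slot_len):
    """Apply an action (index into DIRECTIONS) to the current UAV position."""
    dx, dy = DIRECTIONS[action]
    step = velocity * slot_len
    return x + dx * step, y + dy * step
```

Note that each coordinate changes by 0 or ±v·τ, matching the action definition above (diagonal moves change both coordinates at once).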

Reward formulation
The generated reward depends on the device association, the UAVs' channel selection, and their current state, as well as the action taken at each time slot. As a device can associate with a single UAV and a UAV is shared by multiple devices over a particular channel, we need to find locations of the UAVs that maximize the total instantaneous transmission rate of the USSD2D network. The expected deployed location of UAV m, ∀m ∈ ℳ, is obtained more precisely when the reward it generates at the current time slot is beneficial in the long term. To reflect this property, we model the instantaneous reward function of UAV m as the total instantaneous transmission rate of the USSD2D network achieved at the current time slot under the current device association and channel selection.

State transition probability
It is defined as the probability obtained by UAV m, ∀m ∈ ℳ, of the transition from state s_m(t) to s_m(t+1) after selecting an action, where λ₁ and λ₂ are the learning step sizes. m̃_k(t) denotes the current best UAV for device k when the channel selected by that UAV is fixed for that time slot, and c̃_m(t) denotes the current best channel of UAV m for its associated devices at that time slot; both are expressed in (27) and (28). From (27) and (28), it is observed that the update of the selection probability vectors depends on the instantaneous transmission rate, which does not require any prior information. Thus, the process of device association and UAVs' channel selection at each time slot is completely distributed and model-free.
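The exact update rules (27)-(28) are not reproduced here; the following is a hedged sketch of a generic linear probability-reinforcement update of the kind described, in which the entry for the currently best UAV (or channel) is strengthened with a step size λ and the vector remains a valid distribution. The function name and the precise rule are assumptions for illustration.

```python
def reinforce(probs, best_idx, step):
    """Linear reward-style update: move probability mass toward the index
    that currently yields the highest instantaneous transmission rate,
    shrinking all entries by (1 - step) and adding step to the best one,
    so the vector still sums to one."""
    assert 0.0 < step < 1.0
    updated = [(1.0 - step) * p for p in probs]
    updated[best_idx] += step
    return updated
```

Repeated application concentrates probability on whichever choice keeps winning, while every option retains nonzero probability, which preserves exploration.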

Time steps
The optimal deployment of the UAVs is cast as a multi-period decision-making problem in which the UAVs learn to make decisions through interaction with the environment in episodic tasks. At time slot t, UAV m, ∀m ∈ ℳ, forms a strategy to select a_m(t) according to s_m(t) and obtains a new state s_m(t+1). We define the transition of UAV m from state s_m(t) to s_m(t+1) as one step, and this transition is repeated until the end of the time slots. Thus, the sequential decision-making policy corresponding to the locations of the UAVs acquires the suboptimal solution that maximizes the total instantaneous transmission rate of the USSD2D network over the time slots.

On policy approach
To obtain a suboptimal decision-making policy, we use an RL-based method. The RL model is illustrated in Fig. 3, where each UAV acts as an RL agent, building up its knowledge of the surrounding environment through an agent-environment interaction process. In the formulated optimization problem, each UAV decides its location at every time slot. As UAV m, ∀m ∈ ℳ, does not know the communication environment, it tentatively selects an action a_m(t) in state s_m(t), obtains the corresponding Q(s_m(t), a_m(t)) value of the state-action pair, and immediately receives the expected reward ℛ(s_m(t), a_m(t)). This reward measures the effectiveness of the selected action for the current state. Then, the state s_m(t) is updated to the next state s_m(t+1). From Fig. 3, we can see that the agent continuously interacts with the environment and finally finds a suboptimal decision-making policy by learning the characteristics of the environment. Since in our proposed scenario the UAVs lack full knowledge of the environment, the state-action-reward-state-action (SARSA) algorithm, an on-policy temporal difference (TD) learning algorithm, is adopted, where the initial conditions for updating the policy need to be assumed [27]. Being a form of iterative dynamic programming, SARSA is computationally inexpensive and finds the suboptimal solution under the currently defined policy.

Mapping between state and action for optimal decision
The multi-period decision-making problem is handled by the model-free TD method, which aims to approximate the state-action value function via RL. The on-policy variant of the TD method is used to control the MDP. A policy is a mapping from the state space to a probability distribution over the possible actions; therefore, a UAV's policy indicates the probability distribution of its location change in each time slot for each possible state. UAV m, ∀m ∈ ℳ, experiences the environment by taking a suitable action in a particular state following policy π ∈ Π, where the expected mapping value between state and action can be expressed using the Bellman equation [28] as
Q^{π*}(s_m(t), a_m(t)) = E[ℛ(s_m(t), a_m(t)) + γ Q^{π*}(s_m(t+1), a_m(t+1))],
where π* is the optimal policy of UAV m under an arbitrary state s_m(t) ∈ 𝒮 and γ is the discount factor. This optimal policy generates a larger discounted cumulative reward than any other policy in the policy space Π. For any policy π ∈ Π, the state value function is the expected discounted cumulative reward obtained when starting from the given state and following π. The optimal decision-making policy could be achieved by the value iteration method, but the state transition probability would then be required to solve the MDP. Thus, we develop an online model-free MDP-RL algorithm to find the suboptimal decision-making policy.
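The on-policy TD backup underlying SARSA can be written as a few lines of code. This is a minimal sketch of the textbook update Q(s,a) ← Q(s,a) + α[r + γQ(s',a') − Q(s,a)] over a dictionary-backed Q-table; the default α and γ are illustrative.

```python
def sarsa_update(q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    """One on-policy temporal-difference (SARSA) backup on Q-table `q`,
    a dict mapping (state, action) pairs to values (missing pairs read
    as 0.0):  Q(s,a) <- Q(s,a) + alpha * [r + gamma*Q(s',a') - Q(s,a)]."""
    td_target = reward + gamma * q.get((s_next, a_next), 0.0)
    td_error = td_target - q.get((s, a), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * td_error
    return q
```

Because the bootstrap term uses the action a' actually chosen by the current policy (rather than the greedy maximum, as in Q-learning), the update is on-policy, matching the SARSA scheme adopted here.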

(Algorithm 1: SARSA-based joint optimization of UAV deployment, device association, and channel selection; full listing omitted.)
It is clear from the above analysis that SARSA is a table-based algorithm that needs a Q-table to record the value of each state-action pair. In this algorithm, the UAVs select actions according to the ε-greedy action selection scheme at each time slot. As the total number of episodes is E and each episode estimates the contribution of the UAVs over T steps, the computational complexity depends on the total number of steps together with the dimensions of the state and action spaces of the proposed RL model. In our scenario, there are two-dimensional state locations and eight possible actions at each time slot; therefore, the computational complexity of Algorithm 1 is 𝒪(16ET), which includes the complexity of the ε-greedy policy in each step. Since SARSA estimates the optimal Q-value in an iterative manner, its convergence relies on the learning scheme. The Q-table converges if the learning rate α_t satisfies the conditions Σ_{t=0}^{+∞} α_t = ∞ and Σ_{t=0}^{+∞} α_t² < ∞ (0 ≤ α_t ≤ 1), regardless of the initial settings. Moreover, the velocity v_m(t) of each UAV at time slot t influences the action selection probability. When v_m(t) → 0, the UAVs select random actions with equal probability, which indicates an inefficient search technique; conversely, if v_m(t) → ∞, the UAVs always select the action with the maximum reward. To find the optimal deployed location of each UAV efficiently, its velocity should therefore be a decreasing function of the time slot index, decaying from the initial velocity v₀ of the UAVs.
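The explore/exploit trade-off described above can be sketched as an ε-greedy selector whose exploration rate decays across episodes; the decay schedule, `eps0`, and `decay` values are illustrative assumptions, not the paper's exact schedule.

```python
import random

def epsilon_greedy(q, state, n_actions, episode, eps0=1.0, decay=0.995):
    """Select an action for `state` from the dict-backed Q-table `q`
    (missing pairs read as 0.0). The exploration rate decays with the
    episode index: early episodes favor random search of the region,
    later episodes favor the currently best-valued action."""
    eps = eps0 * (decay ** episode)
    if random.random() < eps:
        return random.randrange(n_actions)                    # explore
    q_vals = [q.get((state, a), 0.0) for a in range(n_actions)]
    return max(range(n_actions), key=q_vals.__getitem__)      # exploit
```

With a large episode index the exploration rate is effectively zero, so the selector becomes purely greedy; ties (e.g., an empty Q-table) resolve to the lowest action index.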

Simulation Results and Discussions
In this section, we perform the required simulation experiments and present the corresponding numerical results to validate the performance of the proposed optimization framework. We first investigate the convergence behavior of the proposed SARSA algorithm for optimal deployment, and subsequently evaluate the performance of reliable communication between the source and destination device pairs. Finally, the effectiveness and superiority of the proposed design are compared with the existing baselines [22], i.e., random selection with fixed optimal relay deployment (RS-FORD), exhaustive search for relay assignment and channel allocation with fixed initial relay deployment (ES-FIRD), and alternating optimization of the individual variables (AOIV), considering randomly deployed direct D2D pair and UAV-assisted D2D pair devices within a square area of size 4 km × 4 km. Furthermore, we adopt the primary simulation parameters from [22] and [30], which are summarized in Table 1.

Convergence analysis
For the considered case of randomly deployed static device pairs in the USSD2D network, we perform the training operations, and the corresponding results are illustrated in Fig. 4. At the beginning of the simulation, we assign two fixed channels to the direct D2D device pairs. The figure shows that the UAVs carry out their actions iteratively and learn from mistakes to improve the instantaneous transmission rate of the USSD2D network. These simulations make it evident that the total instantaneous transmission rate achieved by the proposed SARSA algorithm outperforms those of the RS-FORD, ES-FIRD, and AOIV schemes. The reason is that the proposed SARSA algorithm with the ε-greedy policy explores a larger state space, which helps the UAVs cover the target region more efficiently; consequently, mutual interference among UAVs and devices is reduced by jointly optimizing device association, UAVs' channel selection, and their deployment. From Fig. 4, RL based on the SARSA methodology improves the overall instantaneous transmission rate of the USSD2D network by 71.39%, 48.55%, and 14.45% compared with the RS-FORD, ES-FIRD, and AOIV schemes, respectively.

Fig. 5(a) shows the variation of the total instantaneous transmission rate of the network with the number of UAVs for the proposed and benchmark schemes when the numbers of UAV-assisted D2D pairs, direct D2D pairs, and available orthogonal channels are set to 10, 2, and 7, respectively. The proposed SARSA algorithm outperforms the benchmark schemes, and the total instantaneous transmission rate increases with the number of UAVs because the UAVs utilize all the available orthogonal channels efficiently at their optimal deployed locations. The rate does not change significantly once the number of UAVs exceeds 7: with a limited number of orthogonal channels, additional UAVs reuse the allocated spectrum, which increases mutual interference among UAVs and source-destination device pairs. Here, RL based on the SARSA methodology improves the overall rate by 73.53%, 50.05%, and 13.94% compared with the RS-FORD, ES-FIRD, and AOIV schemes, respectively.

Fig. 5(b) plots the total instantaneous transmission rate of the network for different numbers of channels when the numbers of UAV-assisted D2D pairs, direct D2D pairs, and UAVs are fixed at 10, 2, and 5, respectively. The proposed SARSA algorithm improves the rate over the benchmark schemes owing to its efficient estimation of the channel selection probability vector. Moreover, the rate no longer increases once the number of available orthogonal channels exceeds 7, because at that point the resources are sufficient to mitigate the mutual interference completely. From Fig. 5(b), RL based on the SARSA methodology improves the overall rate by 77.33%, 55.45%, and 13.93% compared with the RS-FORD, ES-FIRD, and AOIV schemes, respectively.

Fig. 5(c) shows the total instantaneous transmission rate of the network versus the number of UAV-assisted D2D pairs while the numbers of direct D2D pairs, available orthogonal channels, and UAVs are held constant at 2, 7, and 5, respectively. Fig. 5(c) confirms that the rate remains almost unchanged as the number of UAV-assisted device pairs increases.
Since all the communication nodes in the network utilize a fixed amount of resources, no such variation appears in this figure; RL based on the SARSA methodology improves the overall instantaneous transmission rate of the USSD2D network by 71.85%, 48.89%, and 14.66% compared with the RS-FORD, ES-FIRD, and AOIV schemes, respectively.

The variation of the total instantaneous transmission rate of the network with the number of direct D2D pairs is illustrated in Fig. 5(d), where the numbers of UAV-assisted D2D pairs, available orthogonal channels, and UAVs are set to 10, 7, and 5, respectively. The rate decreases as the number of direct D2D pairs increases: the fixed set of orthogonal channels must then be allocated to a larger number of communication nodes, so mutual interference among devices and UAVs grows and the network throughput falls due to the limited resources. From Fig. 5(d), RL based on the SARSA methodology improves the overall rate by 78.46%, 54.2%, and 13.52% compared with the RS-FORD, ES-FIRD, and AOIV schemes, respectively.

Fig. 6 presents the impact of the UAVs' altitude on the total instantaneous transmission rate of the USSD2D network for the proposed and conventional schemes. The rate increases up to a certain altitude and then decreases gradually as the altitude grows further. The LoS probability increases with the UAVs' altitude, and the instantaneous transmission rate increases accordingly; however, once the altitude exceeds a certain limit, the transmission distance between each UAV and its associated device grows, which reduces the rate. Moreover, a gap remains between the proposed and conventional schemes: the proposed optimization method efficiently estimates the locations of the devices and deploys the UAVs at a given altitude accordingly, whereas the conventional schemes deploy UAVs using successive learning techniques without a proper approximation of the devices' locations. From Fig. 6, RL based on the SARSA methodology improves the overall rate by 79.67%, 55.36%, and 18.11% compared with the RS-FORD, ES-FIRD, and AOIV schemes, respectively.

Fig. 6: Impact of various UAVs' altitudes on network throughput.

Impact of learning parameters on deployment strategy
According to (27) and (28), the learning step sizes and the initial velocity of the UAVs influence the evolution of the instantaneous transmission rate of the network and the convergence properties of the proposed SARSA algorithm. Fig. 7(a) plots the total instantaneous transmission rate of the USSD2D network per episode for different learning step sizes when v_0 = 100 m/s. With decreasing values of the two step sizes, the convergence rate of the proposed SARSA algorithm decreases but the total instantaneous transmission rate of the network increases, because small step sizes allow SARSA to update the device association probability and the UAVs' channel selection probability more precisely.
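The step-size conditions quoted in the convergence discussion (the partial sum of the rates must diverge while the partial sum of their squares stays bounded) can be checked numerically. The harmonic schedule below, with a constant on the order of the step sizes shown in Fig. 7(a), is purely an illustrative assumption.

```python
# Numerical check of the step-size conditions for alpha_t = c / t:
# sum(alpha_t) grows without bound (logarithmically), while
# sum(alpha_t^2) stays bounded above by c^2 * pi^2 / 6.
c = 0.004  # illustrative constant, matching the order of the Fig. 7(a) step sizes

s1 = sum(c / t for t in range(1, 100_001))         # divergent partial sum
s2 = sum((c / t) ** 2 for t in range(1, 100_001))  # convergent partial sum

print(f"sum alpha_t   over 1e5 terms: {s1:.4f}")   # keeps growing with more terms
print(f"sum alpha_t^2 over 1e5 terms: {s2:.8f}")   # approaches c^2 * pi^2 / 6
```

Any schedule meeting these two conditions keeps updating the estimates forever (first sum) while the injected noise remains summable (second sum), which is why small step sizes trade slower convergence for a better final value.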
Fig. 7: Total instantaneous transmission rate of the USSD2D network with different learning parameters: (a) different learning step sizes; (b) different initial velocities of UAVs.

Fig. 7(b) plots the total instantaneous transmission rate of the USSD2D network per episode for different initial velocities of the UAVs when the learning step sizes are fixed at 0.004 and 0.002. Both too small and too large an initial velocity yield a low convergence rate and poor performance, while moderate initial velocities achieve a fast convergence rate and good performance. This is because extreme initial velocities drive the UAVs toward random movement, leading to an incomplete exploration of the environment in which the UAVs do not get a chance to choose other actions; consequently, the probability of selecting the best action becomes indistinguishable from that of the worst.
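The velocity-dependent selection behavior described earlier (equal-probability choices as v(t) → 0, greedy choices as v(t) → ∞) can be illustrated with a Boltzmann (softmax) rule in which the velocity plays the role of an inverse temperature. The reward values below are invented for demonstration and are not taken from the paper.

```python
import math


def action_probs(rewards, v):
    # Softmax over estimated rewards with inverse temperature v:
    # v -> 0 gives the uniform distribution (random exploration),
    # large v concentrates probability on the highest-reward action.
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp(v * (r - m)) for r in rewards]
    z = sum(weights)
    return [w / z for w in weights]


rewards = [1.0, 2.0, 4.0]  # illustrative per-action reward estimates

print(action_probs(rewards, 0.01))  # near-uniform: inefficient random search
print(action_probs(rewards, 10.0))  # near-greedy: best action dominates
```

A moderate, decaying velocity such as v(t) = v_0/t therefore anneals the selection from exploitation toward exploration gradually, which matches the observation that moderate initial velocities converge fastest in Fig. 7(b).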

Conclusion
In this article, we have presented a UAVs-supported self-organized device-to-device (USSD2D) network in which multiple UAVs are deployed to support reliable communication between source-destination device pairs. The above analysis shows that the performance of the USSD2D network depends on the UAVs' positions, and the goal is to maximize the total instantaneous transmission rate of the network by jointly optimizing the deployed locations of the UAVs, the device association, and the UAVs' channel selection under an SINR constraint. The formulated objective is cast as a multi-period decision-making problem that is solved by RL based on the SARSA algorithm: the UAVs, acting as RL agents, select the optimal policy that maximizes the long-term reward by converging toward the maximum values. Numerical results indicate that the proposed methodology performs considerably better than the existing baselines, improving the total instantaneous transmission rate of the network by 75.37%, 52.08%, and 14.77% compared with the RS-FORD, ES-FIRD, and AOIV schemes, respectively. We hope that this work offers meaningful guidance and inspiration for developing future multi-UAV networks, especially for enhancing network throughput in 5G and beyond communication systems.

Figure 1: System model of the UAVs-supported self-organized device-to-device network.
Figure: Variation of the total transmission rate of the USSD2D network for each episode.
Figure 5: Total instantaneous transmission rate of the USSD2D network with different network parameters.
Figure 7: Total instantaneous transmission rate of the USSD2D network with different learning parameters.