This section presents the proposed MAPQ scheme for mobility-aware flow rule placement in SDN. Its operation is then analyzed, showing how it frees space for high-level flow rules and ensures efficient flow table management.
3.1 Overview
As depicted in Fig. 4, MAPQ operates in three steps: flow rule classification, position prediction, and flow entry update. The first step, flow rule classification, uses the cost-sensitive AdaBoost method to divide the flow rules into two distinct types; the accuracy of this classification strongly influences the performance of the MAPQ scheme. Mice (inactive) flows are excluded from consideration, and the proactive rule placement policy allocates the remaining flows to the switches. The Q-learning model is then used to predict the node's position relative to the corresponding flow entry, so that rule placement can be scheduled in advance. Finally, the controller updates the flow entries according to the node position. The notations employed in this paper are summarized in Table 2.
Table 2. The list of notations.

| Parameter | Description |
| --- | --- |
| li | i-th position of the device in a sequence of positions |
| ti | arrival time of the device at the i-th position |
| wij | weight of the visit from li to lj |
| hi | i-th basic classifier |
| Pij | transition probability from li to lj, i ≠ j |
| α | learning rate |
| γ | discount factor |
3.2 Flow Classification
Flow classification is the first step of MAPQ; it classifies each flow as either an elephant flow or a mice flow. A traditional machine learning algorithm uses a training dataset, (Xtra,Ytra)={(x1,y1),(x2,y2),…,(xm,ym)}, to generate a classifier, which is then employed to identify the category of the test dataset, (Xtes,Ytes). With AdaBoost, a trained classifier is used to differentiate the two kinds of flows: elephant (active) flows and mice (inactive) flows. Usually, the distribution of the training dataset is assumed to be the same as that of the test dataset, and a training dataset of a different distribution is regarded as redundant. Unlike traditional machine learning algorithms, the cost-sensitive algorithm minimizes the misclassification cost to construct a full-scale classifier.
Based on the cost-sensitive framework, an improved AdaBoost algorithm, known as cost-sensitive AdaBoost, is employed in MAPQ. AdaBoost is a conventional machine learning algorithm that combines multiple weak classifiers. Each weak classifier is assigned a weight based on its classification error, and a strong classifier is produced by aggregating the weak classifiers with these weights. The operational sequence is as follows.
1) Initialize the weight of each training sample to the same value, 1/m, where m is the total number of training samples. Thus, the initial weights are
$${w}_{1}={w}_{2}=\dots ={w}_{m}=\frac{1}{m}\tag{1}$$
2) Update the weights of the training data over T iterations.
A. Choose the weak classifier with the minimal error rate as the t-th basic classifier, ht, and calculate its error rate.
$${\epsilon }_{t}={\Pr}_{i\sim {D}_{t}}\left[{h}_{t}\left({x}_{i}\right)\ne {y}_{i}\right]=\sum _{i=1}^{m}{w}_{i}^{t}\left[1-I\left({h}_{t}\left({x}_{i}\right)={y}_{i}\right)\right]\tag{2}$$
where \({w}_{i}^{t}\) denotes the weight of the i-th sample at the t-th iteration and I(•) is the indicator function.
$$I\left(y=x\right)=\begin{cases}1, & y=x\\ 0, & y\ne x\end{cases}\tag{3}$$
B. Calculate the weight of the basic classifier for the final ensemble.
$${\alpha }_{t}=\frac{1}{2}\ln\left(\frac{1-{\epsilon }_{t}}{{\epsilon }_{t}}\right)\tag{4}$$
C. Update the weight of each training data.
$${w}_{t+1}\left(i\right)=\frac{{w}_{t}\left(i\right)\exp\left(-{\alpha }_{t}{y}_{i}{h}_{t}\left({x}_{i}\right)\right)}{{Z}_{t}}\tag{5}$$
where Zt is the normalization factor.
$${Z}_{t}=2\sqrt{{\epsilon }_{t}\left(1-{\epsilon }_{t}\right)}\tag{6}$$
3) Combine the basic classifiers using the following equation into a strong classifier.
$$H\left(x\right)=\operatorname{sign}\left(\sum _{t=1}^{T}{\alpha }_{t}{h}_{t}\left(x\right)\right)\tag{7}$$
The overall steps are summarized in Procedure 1.
Procedure 1: AdaBoost

Input: (Xtra,Ytra)={(x1,y1),(x2,y2),…,(xm,ym)} where xi∈X, yi∈{−1,1}.
Initialize: \({w}_{1}={w}_{2}=\dots ={w}_{m}=\frac{1}{m}\)
1. FOR t = 1 to T
2. Train weak learners using the current weight distribution, Dt.
3. Obtain weak hypothesis ht: X→{−1,1}.
4. Select ht of the smallest weighted error:
\({\epsilon }_{t}={\Pr}_{i\sim {D}_{t}}\left[{h}_{t}\left({x}_{i}\right)\ne {y}_{i}\right]\).
5. Choose
\({\alpha }_{t}=\frac{1}{2}\ln\left(\frac{1-{\epsilon }_{t}}{{\epsilon }_{t}}\right)\).
6. Update for i = 1,…,m:
\({w}_{t+1}\left(i\right)=\frac{{w}_{t}\left(i\right)\exp\left(-{\alpha }_{t}{y}_{i}{h}_{t}\left({x}_{i}\right)\right)}{{Z}_{t}}\)
where Zt is a normalization factor chosen so that w(t+1) remains a weight distribution.
7. Compute the final hypothesis:
\(H\left(x\right)=\operatorname{sign}\left(\sum _{t=1}^{T}{\alpha }_{t}{h}_{t}\left(x\right)\right)\)
Output: H(x)

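The steps of Procedure 1 can be sketched in Python. This is a minimal illustration with decision stumps as the weak learners; the names (fit_adaboost, predict) and the stump search are illustrative, not part of the paper's scheme.

```python
import numpy as np

def fit_adaboost(X, y, n_rounds=10):
    """X: (m, d) features; y: labels in {-1, +1}."""
    m = X.shape[0]
    w = np.full(m, 1.0 / m)           # step 1: uniform initial weights, Eq. (1)
    ensemble = []                     # list of (alpha_t, feature, thresh, sign)
    for _ in range(n_rounds):
        # step 2A: pick the stump with the smallest weighted error, Eq. (2)
        best = None
        for f in range(X.shape[1]):
            for thresh in np.unique(X[:, f]):
                for s in (+1, -1):
                    pred = s * np.where(X[:, f] <= thresh, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, f, thresh, s)
        err, f, thresh, s = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)          # step 2B, Eq. (4)
        pred = s * np.where(X[:, f] <= thresh, 1, -1)
        w = w * np.exp(-alpha * y * pred)              # step 2C, Eq. (5)
        w /= w.sum()                                   # normalization (Z_t)
        ensemble.append((alpha, f, thresh, s))
    return ensemble

def predict(ensemble, X):
    """Strong classifier H(x) of Eq. (7)."""
    agg = np.zeros(X.shape[0])
    for alpha, f, thresh, s in ensemble:
        agg += alpha * s * np.where(X[:, f] <= thresh, 1, -1)
    return np.sign(agg)
```

For instance, fitting on a one-dimensional separable set and predicting recovers the training labels.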
AdaBoost, like traditional machine learning algorithms, assumes that the training and test datasets share the same distribution. Cost-sensitive AdaBoost preserves plain AdaBoost for training data of the same distribution, while a cost-sensitive function is applied to training data of a different distribution for flow classification. To differentiate the importance of the samples, each sample is associated with a cost: the higher the cost, the more important it is to classify the sample correctly. Let {(x1,y1,c1),…,(xm,ym,cm)} be the training dataset, where ci∈[0,+∞) is the associated cost, and let c(yi,yj) be the cost of predicting class yj when the true class of a sample is yi. Its value is decided as follows.
$$c\left({y}_{i},{y}_{j}\right)=\begin{cases}\frac{p}{N}\times acc, & {y}_{i}\ne {y}_{j}\text{ and }{y}_{i}=1\\ \frac{1-p}{N}\times \left(1-acc\right), & {y}_{i}\ne {y}_{j}\text{ and }{y}_{i}=-1\\ 0, & {y}_{i}={y}_{j}\end{cases}\tag{8}$$
where p is the number of negative samples, N is the total number of samples, and acc is the classification accuracy. c(0,0) is the cost of correctly predicting a negative value for an actual negative value, CostTN. Four combinations exist regarding the correctness of the positive/negative prediction, as listed in Table 3.
Table 3. The combinations of prediction of the values.

| Prediction | Actual Negative | Actual Positive |
| --- | --- | --- |
| Negative | CostTN: c(0,0) | CostFN: c(1,0) |
| Positive | CostFP: c(0,1) | CostTP: c(1,1) |
With the cost included, the weight is updated as follows.
$${w}_{t+1}\left(i\right)=\frac{{w}_{t}\left(i\right)\exp\left(-{\alpha }_{t}{y}_{i}{h}_{t}\left({x}_{i}\right)c\left({y}_{i},{h}_{t}\left({x}_{i}\right)\right)\right)}{{Z}_{t}}\tag{9}$$
The final ensemble formula is
$$H\left(x\right)=\operatorname{sign}\left[\sum _{y\in \left\{0,1\right\}}L\left(x,y\right)\left(\sum _{\tau :{h}_{\tau }\left(x\right)=y}{\alpha }_{\tau }{h}_{\tau }\left(x\right)\right)\right]\tag{10}$$
where L(x, yi) is the expected loss when the actual class of x is yi.
$$L\left(x,{y}_{i}\right)={\sum }_{j}p\left(j\mid x\right)c\left({y}_{i},{y}_{j}\right)\tag{11}$$
As the cost factor is involved, the proposed scheme can be regarded as a cost-sensitive boosting algorithm, summarized in Procedure 2. For AdaBoost, selecting a suitable value for the weight update parameter is essential in transforming weak classifiers into a strong one. When the cost parameters are added to the weight update formula of AdaBoost, the data distribution is affected by them, and the efficiency of AdaBoost is not guaranteed if the cost is considered without proper weight updates. The value of the weight update parameter therefore needs to be chosen to minimize the overall training error of the combined classifier. The process of flow classification is shown in Fig. 5.
Procedure 2: Cost-Sensitive AdaBoost

Input: (Xtra,Ytra)={(x1,y1),(x2,y2),…,(xm,ym)} where xi∈X, yi∈{−1,1}.
Initialize: \({w}_{1}={w}_{2}=\dots ={w}_{m}=\frac{1}{m}\)
1. FOR t = 1 to T
2. Train weak learners using the current weight distribution, Dt.
3. Get weak hypothesis ht: X→{−1,1}.
4. Select ht of the lowest weighted error:
\({\epsilon }_{t}={\Pr}_{i\sim {D}_{t}}\left[{h}_{t}\left({x}_{i}\right)\ne {y}_{i}\right]\).
5. Choose
\({\alpha }_{t}=\frac{1}{2}\ln\left(\frac{1-{\epsilon }_{t}}{{\epsilon }_{t}}\right)\).
6. Update for i = 1,…,m:
\({w}_{t+1}\left(i\right)=\frac{{w}_{t}\left(i\right)\exp\left(-{\alpha }_{t}{y}_{i}{h}_{t}\left({x}_{i}\right)c\left({y}_{i},{h}_{t}\left({x}_{i}\right)\right)\right)}{{Z}_{t}}\)
where Zt is a normalization factor chosen so that w(t+1) remains a weight distribution.
7. Compute the final hypothesis:
\(H\left(x\right)=\operatorname{sign}\left[\sum _{y\in \left\{0,1\right\}}L\left(x,y\right)\left(\sum _{\tau :{h}_{\tau }\left(x\right)=y}{\alpha }_{\tau }{h}_{\tau }\left(x\right)\right)\right]\)
Output: H(x)

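The cost-weighted update of Eq. (9) can be sketched as follows. One assumption is made for the illustration: labels are in {−1, +1} and p is treated as the fraction of negative samples, so that the costs of Eq. (8) stay non-negative; the helper names (cost, cost_sensitive_update) are illustrative.

```python
import numpy as np

def cost(y_true, y_pred, p, acc):
    """c(y_i, y_j) of Eq. (8), with p taken as the fraction of negatives."""
    if y_true == y_pred:
        return 0.0                        # no cost for a correct prediction
    return p * acc if y_true == 1 else (1 - p) * (1 - acc)

def cost_sensitive_update(w, y, pred, alpha, p, acc):
    """One pass of Eq. (9): w_{t+1}(i) ∝ w_t(i) exp(-α_t y_i h_t(x_i) c(y_i, h_t(x_i)))."""
    c = np.array([cost(yi, pi, p, acc) for yi, pi in zip(y, pred)])
    w_new = w * np.exp(-alpha * y * pred * c)
    return w_new / w_new.sum()            # normalize by Z_t
```

Because c(yi, yi) = 0, correctly classified samples keep their relative weight, while misclassified samples gain weight in proportion to their cost, so later weak learners concentrate on the costly mistakes.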
3.3 Prediction of Location
The Q-learning (QL) algorithm is used to predict the future position of the end device. The QL model is a 5-tuple <L,A,R,P,γ>, where L is a finite set of states and A is a finite set of actions. R denotes the feedback information obtained by the controller, and P is the transition probability between the states. The feedback for each end device is the same, and the discount factor γ is applied in the calculation of the reward value. First, two tuples, <H, P>, need to be elaborated.

H: A set of data recording the movement history of an end device as a three-tuple, <L,T,S>, where L={l1, l2,…,ln} are the locations visited by the device, T={t1,t2,…,tn} are the arrival times, and S={s1,s2,…,sn} are the durations of stay at each location. n is the number of positions visited.

P: The transition probabilities between locations, with Pij the transition probability from li to lj, i ≠ j.
The QL algorithm is typically composed of a series of actions carried out by agents and states, together with a reward function and a transfer function. In the proposed QL model, the controller and the SDN are the agent and the environment in which the agent operates, respectively. The agent first assesses the state of the SDN and determines the appropriate action to interact with it. Once an action is taken, a reward is obtained and the environment transitions to a new state. Through ongoing interactions with the environment, the agent observes a sequence of state-action-reward events and assesses the reward in conjunction with the corresponding state-action pair. The ultimate objective is to obtain the expected maximum cumulative discounted reward, which the agent seeks by computing the optimal action-value function.
$$Q\left(s,a\right)=\left(1-\alpha \right)Q\left(s,a\right)+\alpha \left[R\left(s,a\right)+\gamma \max {Q}^{{\prime }}\left({s}^{{\prime }},{a}^{{\prime }}\right)\right]\tag{12}$$
where Q(•) is the expected reward and α is the learning rate, set to the same value as in QLAODV [48]. γ (0 < γ < 1) is the discount factor, which determines the importance of future rewards. Q′(s′, a′) is the maximum reward attainable at the new state, s′, over all possible actions at s′. The proposed scheme defines the state, action, and reward as follows.
Definition 7 (State). A state is a location that the end device has previously visited, with each position signifying a semantic locale such as the office, home, station, etc. As previously mentioned, L denotes the set of states.
Definition 8 (Action). An action in a given state is the duration for which the device remains stationary in its present position (state). This duration is measured in seconds, from one to ten, so the action space is A = {1, 2,…,10}; the device's state is assessed at intervals of ten seconds. Note that a smaller number of actions enhances the precision of prediction, since the number of predicted actions is curtailed.
Definition 9 (Reward). The reward is given for a transition between two states with an action. The reward at state si, Ri, is represented as follows.
$${R}_{i}=\sum _{j=1}^{n}{w}_{ij}{P}_{ij}\tag{13}$$
where wij is the probability of a visit to lj from li.
$${w}_{ij}=P\left({l}_{t+1}={l}_{j}\mid L\right)=\frac{N(a\,{l}_{j},L)}{N(a,L)}\tag{14}$$
Here a = l1,…,li is the sequence of the i most recently visited positions, and N(•, L) is the number of occurrences of a given sequence within the visited-location history, L.
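Eq. (14) can be sketched as a simple frequency estimate over the logged history. The helper names (count_subseq, seq_weight) are illustrative, not from the paper.

```python
def count_subseq(history, pattern):
    """N(pattern, L): occurrences of `pattern` as a contiguous run in `history`."""
    k = len(pattern)
    return sum(1 for s in range(len(history) - k + 1)
               if tuple(history[s:s + k]) == tuple(pattern))

def seq_weight(history, recent, next_loc):
    """w_ij = N(a·l_j, L) / N(a, L), with a = recent sequence of positions."""
    denom = count_subseq(history, recent)
    if denom == 0:
        return 0.0                       # sequence never observed
    return count_subseq(history, list(recent) + [next_loc]) / denom
```

For example, if 'office' appears three times in the history and is followed by 'home' twice, seq_weight returns 2/3 for that transition.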
In SDN, the controller verifies the location of each end device and its connectivity to a switch every 10 seconds. The OpenFlow protocol facilitates the collection of network information, so the proposed approach imposes no additional burden in acquiring network information.
Prediction operation
To predict the next location, the Q-values are evaluated. The input is the current location, li, and the output is a vector, Q(li)=(Q(li,a1), Q(li,a2),…,Q(li,a10)), consisting of 10 estimated Q-values, one per possible action. The next action at li, ai, is chosen from the vector.
$${a}_{i}=\underset{j\in [1,\dots ,10]}{\arg\max}\; Q({l}_{i},{a}_{j})\tag{15}$$
Given the possibility of encountering a local optimum, the decision-making process employs the ε-greedy policy to determine the subsequent action: an action is selected uniformly at random with a small probability ε, while the best action is chosen from the Q-table based on Eq. (15) with probability (1−ε). The controller thus explores the environment with low probability. The parameter ε is not constant but varies with the operational status of the scheme: ε is set to 1/c, where c denotes the number of successful predictions. The value of ε decreases gradually as successful predictions accumulate, promoting exploration in the early stages of training while leaving the degree of exploitation in later stages to experience.
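The ε-greedy selection with the decaying ε = 1/c can be sketched as follows; the function name and argument layout are illustrative.

```python
import random

def select_action(q_row, successes, rng=random):
    """q_row: list of Q(l_i, a_j) for the 10 candidate actions.
    successes: c, the number of successful predictions so far."""
    eps = 1.0 / max(successes, 1)          # ε = 1/c; ε = 1 before any success
    if rng.random() < eps:
        return rng.randrange(len(q_row))   # explore: uniform random action
    # exploit: argmax over the Q-row, i.e. Eq. (15)
    return max(range(len(q_row)), key=lambda j: q_row[j])
```

Early on (c small, ε close to 1) the controller mostly explores; as predictions succeed, ε shrinks and the argmax branch dominates.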
Update operation: The model receives <L, T, S> of the end device, delivered by the switch. The state of the end device is then updated, and the corresponding reward for the action is computed. Lastly, the previous state, current state, previous action, and its corresponding reward are logged in the training dataset for future use.
Train: Data are periodically sampled from the training dataset to perform experience replay, which reuses previous data to break the correlations between samples. With past states as input, the Q-value of the pair (s,a) is updated with the previous pair using the Bellman equation, while the other Q-values are unchanged; the updated results are stored as the output. The key to using the Bellman equation is that, if the Q-values of all subsequent states are known, the optimal strategy is to take the action maximizing the expected value of R(s,a) + γ·max Q′(s′,a′).
The proposed QL algorithm deviates significantly from other reinforcement learning algorithms in that the subsequent state and corresponding reward are not obtained immediately after an action is selected; rather, they are obtained after the controller observes the relocation of the end device. This implies that the QL parameters do not change until the end device moves. Moreover, the choice of the next action is not contingent on the current action. The procedure for predicting the location of a device is as follows.
Procedure 3: Prediction of location

Input: H and current location, li
Initialize: the number of predictions, t = 0.
1. Decide the optimal action according to the action function of Eq. (15);
2. Take the action based on the ε-greedy policy;
3. Determine the next state, lj;
4. Compute the reward
\({R}_{i}=\sum _{j=1}^{n}{w}_{ij}{P}_{ij}\)
5. Update the Q-value
\(Q\left(s,a\right)=\left(1-\alpha \right)Q\left(s,a\right)+\alpha \left[R\left(s,a\right)+\gamma \max {Q}^{{\prime }}\left({s}^{{\prime }},{a}^{{\prime }}\right)\right]\)
6. Formulate continuous actions through the ε-greedy policy;
7. t = t + 1;
8. Repeat from Step 1
Output: Sequence of predicted locations

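The update step of Procedure 3 can be sketched with a dictionary-based Q-table, where states are semantic locations and actions are the dwell times 1..10. The function name and default parameter values are illustrative; α and γ would be set as described above.

```python
from collections import defaultdict

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """One application of Eq. (12):
    Q(s,a) <- (1-α)·Q(s,a) + α·[R(s,a) + γ·max_a' Q(s',a')]."""
    best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (reward + gamma * best_next)
    return Q[s][a]

# Q-table: location -> {dwell-time action -> Q-value}, initialized to 0
Q = defaultdict(lambda: defaultdict(float))
q_update(Q, 'home', 3, reward=1.0, s_next='office')
```

Starting from an all-zero table, a reward of 1.0 for dwelling 3 time steps at 'home' raises Q('home', 3) to α·1.0 = 0.1; subsequent transitions into 'home' then inherit part of that value through the γ·max term.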
3.4 Rule Placement
In an SDN-based IoT environment, the network comprises numerous end devices connected through distinct switches. The switches maintain flow tables that require consistent updates to optimize the hit ratio as the end devices move. The proposed MAPQ scheme employs a proactive flow rule placement approach to update the flow table: the future location of an end device is predicted from its past locations, and the flow rule linked to the end device is then placed in the flow table of the corresponding switch.