Deep learning has accelerated progress in RL; the use of deep learning algorithms within RL is termed Deep Reinforcement Learning (DRL). Covering both the seminal and the more recent developments in DRL means covering the innovative ways in which neural networks are used to bring us closer to developing autonomous agents. A comprehensive view of recent DRL efforts also includes applications of DRL in areas such as natural language processing. Because the perception capability of deep learning is combined with the decision-making of RL, DRL is a method of AI that is closer to human thinking and is therefore regarded as the real AI [15, 16].
Consider an RL setup with a learning agent that interacts with its environment. The setup is described by a set of variables wherein \(S\) denotes the set of states, \(A\) denotes the set of actions, \(p(s_{0})\) is the distribution over initial states, \(r:S\times A\rightarrow \mathbb{R}\) is the reward function, \(p(s_{t+1}\mid s_{t},a_{t})\) are the transition probabilities, and \(\gamma \in \left[0,1\right]\) is the discount factor [24].
A deterministic policy guides the mapping of states to actions: \(\pi :S\rightarrow A\). Each episode starts from an initial state \(s_{0}\). At every time step \(t\), the agent performs an action based on the current state: \(a_{t}=\pi \left(s_{t}\right)\). The executed action yields a reward \(r_{t}=r(s_{t},a_{t})\), and the new state of the environment is sampled from the distribution \(p(\cdot \mid s_{t},a_{t})\). The return is \(R_{t}={\sum }_{i=t}^{\infty }{\gamma }^{i-t}r_{i}\). The goal of the agent is to maximize the expected return \(E[R_{t}\mid s_{t},a_{t}]\). An optimal policy \({\pi }^{*}\) is a policy such that \(Q^{{\pi }^{*}}(s,a)\ge Q^{\pi }(s,a)\) for every \(s\in S\), \(a\in A\) and any other policy \(\pi\). The optimal Q-function \(Q^{*}\), i.e. the Q-function of an optimal policy, satisfies the Bellman equation (1) [24]:
$$Q^{*}\left(s,a\right)={E}_{s'\sim p\left(\cdot \mid s,a\right)}\left[r\left(s,a\right)+\gamma \underset{a'\in A}{\max }Q^{*}\left(s',a'\right)\right]$$
1
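As a concrete illustration of Eq. (1), the following minimal sketch performs the Bellman optimality backup on a small tabular MDP; the array names `P` and `R` and the helper `bellman_backup` are illustrative choices, not taken from the source.

```python
import numpy as np

def bellman_backup(Q, P, R, gamma):
    """One Bellman optimality backup over a tabular Q (Eq. 1).

    Q: array [n_states, n_actions] - current Q estimates
    P: array [n_states, n_actions, n_states] - transition probabilities p(s'|s,a)
    R: array [n_states, n_actions] - rewards r(s,a)
    gamma: discount factor in [0, 1]
    """
    # max over a' of Q(s', a') for every next state s'
    max_next = Q.max(axis=1)              # shape [n_states]
    # expectation over s' ~ p(.|s,a) of max_a' Q(s', a')
    expected_next = P @ max_next          # shape [n_states, n_actions]
    return R + gamma * expected_next

# Repeating this backup until convergence yields Q*, whose greedy policy is optimal.
```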
Deep Q-Networks (DQN) is a model-free reinforcement learning algorithm for discrete action spaces. In DQN, a neural network \(Q\) is maintained that approximates \(Q^{*}\). \({\pi }_{Q}\left(s\right)={argmax}_{a\in A}Q(s,a)\) denotes the greedy policy with respect to \(Q\), while the \(\epsilon\)-greedy policy takes a random action with probability \(\epsilon\) and the action \({\pi }_{Q}\left(s\right)\) with probability \(1-\epsilon\). Episodes are generated during training with the \(\epsilon\)-greedy policy. A replay buffer stores the transition tuples \((s_{t},a_{t},r_{t},s_{t+1})\) experienced during training, and the training of the neural network is interleaved with the generation of new episodes. The loss \(\mathcal{L}\), computed on tuples \((s_{t},a_{t},r_{t},s_{t+1})\) sampled from the replay buffer, is \(\mathcal{L}=E\left[{\left(Q(s_{t},a_{t})-{y}_{t}\right)}^{2}\right]\), where \({y}_{t}={r}_{t}+\gamma \underset{a'\in A}{\max }Q\left({s}_{t+1},a'\right)\).
The targets \({y}_{t}\) are computed with a target network, whose weights change more slowly than those of the main network. The weights of the target network are periodically set to the current weights of the main network; Polyak-averaged parameters may also be used.
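A minimal sketch of the DQN loss and the slowly changing target network described above, assuming PyTorch Q-networks; the function names and the terminal-state mask `done` are illustrative additions, not taken from the source.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma):
    """Squared TD error L = E[(Q(s_t, a_t) - y_t)^2] with a target network."""
    s, a, r, s_next, done = batch  # tensors sampled from the replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t)
    with torch.no_grad():
        # y_t = r_t + gamma * max_a' Q_target(s_{t+1}, a')
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)

def polyak_update(target_net, q_net, tau):
    """Slowly track the main network: theta_target <- tau*theta + (1-tau)*theta_target."""
    for p_t, p in zip(target_net.parameters(), q_net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```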
Two neural networks are used, termed an actor and a critic. The actor network is a target policy \(\pi :S\rightarrow A\), and the critic network is an action-value function approximator \(Q:S\times A\rightarrow \mathbb{R}\). The critic network \(Q(s,a\mid {\theta }^{Q})\) and the actor network \(\mu (s\mid {\theta }^{\mu })\) are adjusted through their weights \({\theta }^{Q}\) and \({\theta }^{\mu }\).
The behavioural policy used to generate episodes can be a noisy variant of the target policy, \({\pi }_{b}(s)=\pi (s)+\mathcal{N}\left(0,1\right)\). The critic network is trained like the Q-function in DQN, except that the target \({y}_{t}\) is computed as \({y}_{t}={r}_{t}+\gamma Q\left({s}_{t+1},\pi \left({s}_{t+1}\right)\right)\), where \(\gamma\) is the discount factor. The actor network is trained with the loss \({\mathcal{L}}_{a}=-{E}_{s}\left[Q(s,\pi \left(s\right))\right]\).
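A minimal sketch of the critic target, the actor loss, and the noisy behavioural policy described above, again assuming PyTorch networks; the function and argument names are illustrative, and batch shapes are assumed to be consistent.

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, target_critic, target_actor, batch, gamma):
    """Critic trained as in DQN, but with y_t = r_t + gamma * Q(s_{t+1}, pi(s_{t+1}))."""
    s, a, r, s_next = batch
    q_sa = critic(s, a)
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    return F.mse_loss(q_sa, y)

def actor_loss(critic, actor, s):
    """L_a = -E[Q(s, pi(s))]: push the actor toward actions the critic values highly."""
    return -critic(s, actor(s)).mean()

def behaviour_policy(actor, s):
    """Noisy variant of the target policy: pi_b(s) = pi(s) + N(0, 1)."""
    return actor(s) + torch.randn_like(actor(s))
```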
The metaheuristic algorithms iterate through the space of hyperparameter values to maximize task performance while minimizing the number of training episodes. The hyperparameters optimized are the discount factor \(\gamma\), the Polyak-averaging coefficient \(\tau\), and the learning rates of the networks. In this work, the PSO-TABU, Firefly-TABU, PSO-Mutated, and Firefly Mutated algorithms are used to optimize these parameters.
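A minimal sketch of how a candidate solution for the metaheuristics could encode these hyperparameters and be scored; the bounds, the `train_agent` callback, and the weighting of return against episode count are assumptions for illustration, not values from the source.

```python
import random

# Each candidate solution encodes the DRL hyperparameters being tuned:
# discount factor gamma, Polyak coefficient tau, and network learning rate.
BOUNDS = {"gamma": (0.90, 0.999), "tau": (0.001, 0.1), "lr": (1e-5, 1e-2)}

def random_candidate():
    return {k: random.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}

def fitness(candidate, train_agent):
    """train_agent is assumed to train a DRL agent with the given hyperparameters
    and return (mean_return, episodes_used); higher return with fewer episodes is better."""
    mean_return, episodes_used = train_agent(**candidate)
    return mean_return - 0.01 * episodes_used   # illustrative weighting of the two objectives
```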
3.1 Hybrid PSO-TS Algorithm
PSO simulates the behaviour of a flock of birds searching a multi-dimensional space for an optimum. The swarm is made up of particles that are randomly initialized and updated during the iterations to search for optimal solutions. The candidate solutions, encoded as particles, traverse the search area (covering both current and possible solutions), moving toward promising regions to reach the global optimum; the movement combines two components, a cognitive one and a social one [17].
Tabu Search (TS) is based on the idea that problem-solving, to qualify as intelligent, should integrate adaptive memory with responsive exploration. The adaptive-memory feature of TS permits the implementation of procedures that search the solution space both effectively and economically. Because local choices are guided by information collected during the search, TS contrasts with memoryless designs that rely on semi-random processes implementing a form of sampling [18].
A drawback of PSO is that it may become trapped in a local solution and, in some cases, converge slowly. To prevent the particles from jumping among different local regions, TS is incorporated into PSO to assign a Tabu attribute to visited regions. TS and PSO can therefore be employed synergistically in a hybridized fashion in Deep Reinforcement Learning (DRL). To accelerate the convergence of PSO, it is combined with the TABU search algorithm: at every epoch, PSO provides the best solution, and the other paths are then examined by applying TABU [19, 20]. The objective function used by PSO is the minimization of the Mean Square Error (MSE), calculated using Eq. (2).
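A minimal form of this objective, assuming the standard average of squared errors over the \(r\) samples taken with sampling time \(T\), is:

$$MSE=\frac{1}{r}\sum _{k=1}^{r}e\left(k\right)^{2}$$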
where \(r\) is the sample size, \(T\) is the sampling time, and \(e(k)=d(k)-m(k)\) is the difference between the desired output \(d(k)\) and the computed output \(m(k)\) of the process under control.
The basis of the proposed PSO-TS is as follows: good solutions are first identified using PSO, after which a "critical path" is marked as TABU. To improve the PSO solution, the TS procedure then tries to enhance the quality of the gbest. PSO and TS are repeated iteratively until a stopping condition is met, as shown in Fig. 1.
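A minimal sketch of the PSO-TS loop described above; the neighbourhood size, step scale, and the rounding used to build tabu keys are illustrative choices, not taken from the source.

```python
import random

def pso_ts(fitness, dim, bounds, n_particles=20, iters=50, tabu_size=10, w=0.7, c1=1.5, c2=1.5):
    """PSO proposes a gbest; TS refines it while marking visited regions as tabu."""
    lo, hi = bounds
    X = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    V = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in X]
    gbest = min(pbest, key=fitness)
    tabu = []                                    # short-term memory of visited regions

    for _ in range(iters):
        for i in range(n_particles):             # standard PSO velocity/position update
            for d in range(dim):
                V[i][d] = (w * V[i][d]
                           + c1 * random.random() * (pbest[i][d] - X[i][d])
                           + c2 * random.random() * (gbest[d] - X[i][d]))
                X[i][d] = min(max(X[i][d] + V[i][d], lo), hi)
            if fitness(X[i]) < fitness(pbest[i]):
                pbest[i] = X[i][:]
        gbest = min(pbest + [gbest], key=fitness)

        # TS step: examine neighbours of gbest, skipping moves marked as tabu
        for _ in range(10):
            cand = [min(max(g + random.gauss(0.0, 0.1 * (hi - lo)), lo), hi) for g in gbest]
            key = tuple(round(c, 2) for c in cand)
            if key in tabu:
                continue
            tabu.append(key)
            tabu = tabu[-tabu_size:]
            if fitness(cand) < fitness(gbest):
                gbest = cand
    return gbest
```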
3.2 Hybrid FA-TS Algorithm
The Firefly Algorithm (FA) imitates the social behaviour of fireflies. The two basic functions of the fireflies' flashing light are to attract mates and to attract potential prey. In the FA, the search strategy is multi-agent, regulated through randomization, and combined with an effective local search that selects the best available solutions [21]. In this work, a hybrid FA-TS technique is proposed for solving task scheduling under limited constraints. The primary objective is to reduce the makespan while improving the solutions and their quality. A complete algorithmic scheme for task scheduling is designed, with the TS local search looking for a local optimum for each individual and moving the search directly to a more promising region of the space [22].
The proposed hybrid FA-TS approach is used to determine a global optimal value with a high success rate, working both effectively and efficiently. Its primary advantage is that it operates on real-valued and random numbers. Metaheuristic algorithms often struggle with stochastic test functions, but the FA handles them well, even for noisy optimization problems. The workload of the critical resources, the total workload of all resources, and the makespan are all taken into account.
Despite the hybrid approach proposed here, the work has some limitations. Considerable care is needed in the choice of parameters, since they can affect the results significantly and a poor choice may yield poor results. The flowchart of the proposed FA-TS is depicted in Fig. 2.
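A minimal sketch of the hybrid FA-TS idea in a continuous search space (as used here for hyperparameter tuning); the attractiveness parameters, neighbourhood size, and tabu-key rounding are illustrative assumptions, not values from the source.

```python
import math
import random

def fa_ts(fitness, dim, bounds, n_fireflies=15, iters=40, beta0=1.0, gamma=1.0, alpha=0.2):
    """Fireflies move toward brighter (better) ones; a short tabu search refines each one."""
    lo, hi = bounds
    X = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_fireflies)]
    tabu, tabu_size = [], 20

    for _ in range(iters):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if fitness(X[j]) < fitness(X[i]):          # firefly j is brighter
                    r2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
                    beta = beta0 * math.exp(-gamma * r2)   # attractiveness decays with distance
                    X[i] = [min(max(xi + beta * (xj - xi) + alpha * (random.random() - 0.5), lo), hi)
                            for xi, xj in zip(X[i], X[j])]
            # TS local refinement of firefly i, avoiding tabu neighbours
            for _ in range(5):
                cand = [min(max(x + random.gauss(0.0, 0.05 * (hi - lo)), lo), hi) for x in X[i]]
                key = tuple(round(c, 2) for c in cand)
                if key in tabu:
                    continue
                tabu.append(key)
                tabu = tabu[-tabu_size:]
                if fitness(cand) < fitness(X[i]):
                    X[i] = cand
    return min(X, key=fitness)
```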
3.3 PSO-Mutated
To improve the performance of PSO and to mitigate the local-minima issue, different mutation operators for PSO have been proposed. Some works mutate the gbest particle and others the local best particle, using various methods that prevent PSO from stagnating in local minima. Several PSO variants based on mutation operators have been discussed; one variant uses a Cauchy mutation. The mutated global best particle is compared with the original on the basis of fitness, and the fitter of the two is kept. Equation (3) below is used to mutate the particle.
\(gbest\left(i\right)=gbest\left(i\right)+W\left(i\right)\ast N\left({x}_{min},{x}_{max}\right)\) (3)
where \(N\) is a Cauchy-distributed function with scale parameter \(t=1\), \(N\left({x}_{min},{x}_{max}\right)\) is a random number within the domain \(\left({x}_{min},{x}_{max}\right)\) of the test function, and \(W\left(i\right)\) is given by (4),
$$W\left(i\right)=\frac{1}{popsize}\sum _{j=1}^{popsize}V\left[j\right]\left[i\right]$$
4
where \(V\left[j\right]\left[i\right]\) is the \({i}^{th}\) component of the velocity vector of the \({j}^{th}\) particle, and popsize is the population size.
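A minimal sketch of the Cauchy mutation of gbest in Eqs. (3) and (4); the clipping of the Cauchy sample to the test-function domain and the function names are illustrative choices.

```python
import math
import random

def cauchy(x_min, x_max, t=1.0):
    """Cauchy-distributed random number (scale t), clipped to the test-function domain."""
    n = t * math.tan(math.pi * (random.random() - 0.5))   # standard Cauchy sample
    return min(max(n, x_min), x_max)

def mutate_gbest(gbest, V, x_min, x_max):
    """Eqs. (3)-(4): gbest(i) += W(i) * N(x_min, x_max), where W(i) is the average
    i-th velocity component over the population."""
    popsize, dim = len(V), len(gbest)
    mutated = gbest[:]
    for i in range(dim):
        W_i = sum(V[j][i] for j in range(popsize)) / popsize   # Eq. (4)
        mutated[i] = gbest[i] + W_i * cauchy(x_min, x_max)     # Eq. (3)
    return mutated

# The mutated gbest replaces the original only if its fitness is better.
```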
3.4 Firefly Mutated
The strength of an optimization algorithm lies in how quickly it can explore new solutions and in how those solutions are exploited so that they become better and more efficient. The FA uses step movements that combine exploitation and exploration effectively, with good outcomes in a short period of time; a mutation operator is therefore added to the FA. Under mutation, new solutions are examined in a step-by-step manner, moving toward fireflies emitting brighter light. Adopting the features of the better fireflies allows faster convergence and helps avoid local optima. A new mutation operator selects certain elements with a mutation probability \({p}_{m}\), which helps the fireflies escape from local optima. Whether a firefly participates in mutation depends on the mutation probability \({p}_{m}\), which is not fixed but is recomputed at each iteration. The choice of \({p}_{m}\) can significantly affect the performance of the algorithm; generic values of \({p}_{m}\) lie between 0.001 and 0.05. The mutation progress used to adapt the mutation probability in the MFA is computed using (5):
$$MP={f}_{new}-{f}_{old}$$
5
where \({f}_{new}\) is the fitness of the new firefly and \({f}_{old}\) is the fitness of the original firefly. For a generation that undergoes \({n}_{m}\) mutation operations, the average mutation progress value \(\widetilde{MP}\) is given by (6):
$$\widetilde{MP}=\frac{1}{{n}_{m}}\sum MP$$
6
The mutation rate is adjusted before the end of every generation using the average progress value, on the basis of Eqs. (5) and (6) [23].
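A minimal sketch of this adjustment, under the assumption that fitness is being maximized (so a positive average progress encourages more mutation) and with an illustrative step size and clipping range; the exact update rule is not specified in the source.

```python
def adapt_mutation_rate(p_m, progress_values, scale=0.01, p_min=0.001, p_max=0.05):
    """Adjust p_m at the end of a generation from the average mutation progress
    (Eqs. 5-6): MP = f_new - f_old, averaged over the n_m mutations performed."""
    if not progress_values:          # no mutations this generation
        return p_m
    avg_mp = sum(progress_values) / len(progress_values)   # Eq. (6)
    # Assumed rule: if mutations improved fitness on average, mutate more; otherwise less.
    p_m = p_m + scale * avg_mp
    return min(max(p_m, p_min), p_max)

# progress_values collects MP = f_new - f_old (Eq. 5) for each mutated firefly.
```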