This section reviews research efforts that use Q-learning to improve the solution quality of metaheuristic optimization algorithms. In these hybrids, each search action is governed by a reward-and-penalty mechanism that keeps the agents moving toward the globally optimal solution.
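All of the hybrids reviewed below rely on the standard tabular Q-learning machinery: a table of values Q(s, a) updated as Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)], plus a policy for picking the next action. The following minimal sketch illustrates this shared core; the class name, the ε-greedy policy, and all parameter values are illustrative choices rather than details taken from any of the surveyed papers. Later sketches in this section reuse this QTable helper.

```python
import numpy as np

class QTable:
    """Minimal tabular Q-learning controller of the kind used by the hybrids
    reviewed in this section (an illustrative sketch, not any single paper's design)."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, epsilon=0.1, seed=None):
        self.q = np.zeros((n_states, n_actions))       # Q(s, a), initialized to zero
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = np.random.default_rng(seed)

    def select_action(self, state):
        # Epsilon-greedy policy: usually exploit the best-known action, occasionally explore.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.q.shape[1]))
        return int(np.argmax(self.q[state]))

    def update(self, state, action, reward, next_state):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        td_target = reward + self.gamma * np.max(self.q[next_state])
        self.q[state, action] += self.alpha * (td_target - self.q[state, action])
```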
Kim and Lee [28] developed a hybrid PSO (HPSO) algorithm that combines PSO with the DE process. Coupling the proposed HPSO with Q-learning increases its search capability by selecting the algorithm's control parameters adaptively. A Q-table is used to map the three-dimensional state (s1, s2, and s3) to the three actions (MR, CR, and BB).
Gao et al. [29] presented a positioning system for underwater vehicles in which Q-learning is injected into the PSO search equation to improve it. They chose Q-learning because it can compare the expected values of different actions without requiring a model of the environment.
Using Q-learning, Rakshit et al. [30] created a new variant of the Differential Evolution (DE) algorithm called TDQL. In the proposed algorithm, the Q-table controls only two parameters, the DE scaling factors (F1 and F2), and is updated based on the DE agents. For optimal performance, the scaling factors should not be equal for all search agents: the agent with the best fitness should search its local neighborhood, whereas the worst-performing agent should contribute to the global search. Thus, a good search agent should have small scaling factors, while a poor one should have relatively large scaling factors. According to the reported results, TDQL outperformed other DE variants.
Samma et al. [27] introduced a new reinforcement-learning-based Particle Swarm Optimization (PSO) algorithm named RLMPSO. Each search agent is exposed to five operations under the supervision of Q-learning: exploration, convergence, high-jump, low-jump, and fine-tuning, and follows these operations according to the Q-learning activity. RLMPSO was evaluated on four unimodal and multimodal benchmark problems, six composite benchmark problems, five shifted and rotated benchmark problems, and two benchmark application problems. The experimental results show that the proposed model outperforms a range of state-of-the-art PSO-based algorithms.
The QSO algorithm was created by Hsieh and Su [8], who merged Q-learning with PSO. In QSO, the top global search agent is chosen based on its cumulative performance rather than its performance in a single test. The QSO population contains N search agents, each with an external and an internal state. The external state identifies the individual within the population and remains unchanged throughout the optimization phase, whereas the internal state, which reflects the agent's current position, changes whenever the agent takes an action. In each state, an agent can perform two kinds of actions (imitation and disturbance). Because an agent is often motivated by the success of its neighbors, and therefore aspires to mimic the behavior of the globally best agent, every agent in the population can benefit from the discoveries made by the other agents during their interactions.
Watchanupaporn and Pudtuan [31] utilized Q-learning to enhance PSO for solving the multi-robot target problem. Each robot searches for the desired path by assigning rewards and penalties to specific actions. The Q-table is shared by all robots, and each robot moves in the direction with the highest Q-table value.
Ma and Zhang [32] introduced an optimization technique known as QABC, which is built by incorporating Q-learning into the ABC solution search equation. The basic principle of QABC is that the employed bees and onlooker bees discover the best nectar source location using a solution search equation chosen by Q-learning. To update the position of the nectar source, the method selects the solution search equation corresponding to the best Q-table value. This strategy improves the algorithm's exploitation ability, and a ranking-based selection probability helps keep the population diverse.
Zamli et al. [33] used Q-learning to dynamically determine the best search operator among four options (sine, cosine, Lévy flight, and crossover) at runtime. The new algorithm is called QLSCA. Q-learning toggles between these four operators, selecting the most suitable one based on prior rewards. The method uses a single 4 × 4 Q-table (four actions and four states). Two additional operators, Lévy flight and crossover, were added to the original SCA. According to the experimental data, QLSCA outperformed five state-of-the-art algorithms: the original SCA, the particle swarm test generator (PSTG), adaptive particle swarm optimization (APSO), cuckoo search (CS), and DPSO.
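As an illustration of how such Q-learning-driven operator switching can be wired into a metaheuristic iteration, the following sketch reuses the QTable helper above. The operator implementations, the choice of the previously applied operator as the state, and the ±1 reward are assumptions made for brevity, not the exact QLSCA design.

```python
import numpy as np

def levy_flight(x, rng):
    # Placeholder heavy-tailed perturbation standing in for a Levy-flight move.
    return x + 0.01 * rng.standard_cauchy(x.shape)

def make_operators(dest, rng):
    # Four candidate operators (sine, cosine, Levy flight, crossover); the
    # sine/cosine forms only approximate the SCA update equations.
    r = lambda: 2.0 * rng.random()
    return [
        lambda x: x + r() * np.sin(r()) * np.abs(r() * dest - x),
        lambda x: x + r() * np.cos(r()) * np.abs(r() * dest - x),
        lambda x: levy_flight(x, rng),
        lambda x: np.where(rng.random(x.shape) < 0.5, x, dest),
    ]

def operator_switch_step(q, x, fx, dest, f_obj, state, rng):
    """One iteration: Q-learning picks an operator, applies it, and is rewarded
    (+1) if the candidate improves the fitness, penalized (-1) otherwise."""
    action = q.select_action(state)
    x_new = make_operators(dest, rng)[action](x)
    fx_new = f_obj(x_new)
    reward = 1.0 if fx_new < fx else -1.0
    q.update(state, action, reward, next_state=action)   # next state = operator just used
    if fx_new < fx:
        return x_new, fx_new, action
    return x, fx, action
```

With rng = np.random.default_rng() and q = QTable(n_states=4, n_actions=4), repeatedly calling operator_switch_step and feeding the returned operator index back in as the state reproduces the select-apply-reward loop described above.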
To assist robots in determining the optimal path through an unknown environment and in learning about it, Meerza et al. [34] combined Q-learning and PSO into a single algorithm termed QL + PSO. The action set contains only four actions: forward, backward, left, and right, and each particle receives a reward based on the action taken. The optimal position of each particle in the swarm is determined using the Q-table, and the particles keep learning their optimal positions from the Q-table until the end of the trial. Simulation results indicate that the proposed algorithm outperforms Q-learning and PSO used alone.
Li et al. [13] proposed a novel differential evolution approach based on reinforcement learning and fitness ranking, named DE-RLFR. Q-learning directs the DE search agents toward the optimal solution, and every individual in the population is regarded as an agent. The ranking of each agent's multi-fitness function values determines its hierarchical state variable, and three typical DE mutation operations are available as actions. Within a generation, agents that reach or remain in a higher hierarchical state zone are rewarded; each agent's experience is maintained in the Q-table and updated at every iteration to help all agents select the optimal mutation strategy for the following iteration.
Liu et al. [26] proposed a Q-learning-based PSO, dubbed QLPSO, which tunes the PSO parameters using Q-learning. At the heart of the QLPSO algorithm is a Q-table of states and actions. By adjusting various parameters, the actions regulate the exploration and exploitation behavior of the particles. Based on past rewards and punishments, the Q-table helps the particles choose the best action to take next.
Xu and Pi [35] proposed a dynamic communication topology for PSO based on Q-learning, also dubbed QLPSO. When solving more complex problems, the primary advantage of a dynamic topology is that it avoids becoming trapped in local optima. In the proposed algorithm each particle acts independently, selecting the optimal topology under the control of Q-learning at each iteration. The performance of QLPSO was compared with static and dynamic topologies on 28 CEC 2013 benchmark functions, and the reported results demonstrate that QLPSO outperforms several state-of-the-art methods.
Chen et al. [36] introduced a reinforcement learning mechanism into the genetic algorithm (GA); the proposed algorithm is called RLGA. The crossover and mutation operations are carried out using Q-learning, which determines the crossover fragments and mutation points to be used. RLGA contains two Q-tables, one dedicated to crossover and the other to mutation. RLGA was compared with the traditional GA and with state-of-the-art methods, and the experimental results indicate that RLGA achieves higher performance.
Samma et al. [37] proposed an optimization model called QLSA that combines Q-learning with simulated annealing (SA). Q-learning is integrated into SA to improve its performance by adaptively adjusting its parameters at run time and to keep track of the parameter values that yield the best performance. Seven constrained engineering design problems were employed to evaluate the efficacy of the proposed QLSA method. For further investigation, QLSA was compared with state-of-the-art population-based optimization methods such as PSO, GWO, CLPSO, Harmony Search, and ABC. The results reveal that QLSA substantially outperforms the other algorithms tested.
Zhang et al. [38] utilized Q-learning to improve the PSO search process by choosing among three actions: exploration, exploitation, and jump. Based on the selected action, the PSO velocity and position are updated, giving the PSO agent a new search area. Using the values in the Q-table, DQN-PSO evaluates the reward of the agent's action and selects the action expected to generate the greatest reward.
Oztop et al. [39] proposed a new optimization algorithm based on General Variable Neighborhood Search (GVNS) and Q-learning, named GVNS-QL. Rather than using constant parameter values, the proposed GVNS-QL algorithm uses Q-learning to determine the GVNS parameters.
To solve the job-shop scheduling problem, Chen et al. [40] proposed a self-learning genetic algorithm (SLGA) in which the GA's key parameters are adjusted intelligently using reinforcement learning. Two reinforcement learning methods are utilized in this algorithm: SARSA and Q-learning. They are applied as the learning methods in the initial and final stages of optimization, respectively, and a conversion condition between them is designed. In addition, the state determination and reward methods are tailored to reinforcement learning in a GA setting.
Huynh et al. [41] used Q-learning to improve the performance of Differential Evolution; the proposed algorithm is called qlDE. The Q-learning model is incorporated into DE as an adaptive parameter controller that adjusts the algorithm's control parameters at each search iteration to improve its behavior across diverse search domains. The performance of the proposed algorithm is improved by automatically adjusting the balance between the exploration and exploitation phases. Five benchmark instances of truss structural weight minimization were used in this work to validate the efficiency of qlDE against the traditional DE and many other methods from the literature.
Seyyedabbasi et al. [42] presented three Q-learning-based optimization methods for solving global optimization problems, built on three metaheuristic algorithms: I-GWO [43], Ex-GWO [43], and the Whale Optimization Algorithm (WOA) [44]. The suggested algorithms are known as RLI-GWO, RLEx-GWO, and RLWOA. The search agents of these algorithms use the Q-table values to choose between the exploration and exploitation phases. The Q-table controls these two phases in order to make more effective decisions and to help the search agents discover new areas of the global search space. A control mechanism determines the reward and punishment values for each action. Each suggested method aims to handle global optimization problems as efficiently as possible while avoiding the local-optimum trap. The proposed algorithms were tested on 30 benchmark functions from CEC 2014 and 2015, and the results were compared against three metaheuristics as well as the traditional GWO and WOA. The presented approaches were also applied to the inverse kinematics problem for robot arms. The results showed that RLWOA is the most effective of the three at solving the test problems.
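The adaptive-parameter-control pattern used by qlDE (and by several of the other hybrids in this section) can be sketched as follows, again reusing the QTable helper from the first sketch. The discretized grid of (F, CR) pairs, the single aggregated state, and the ±1 reward are assumptions made for illustration rather than the settings reported in [41].

```python
import numpy as np

F_CR_PAIRS = [(0.3, 0.1), (0.5, 0.5), (0.8, 0.9), (1.0, 0.3)]   # illustrative (F, CR) grid

def de_generation(pop, fit, f_obj, F, CR, rng):
    """One DE/rand/1/bin generation with greedy selection."""
    n, d = pop.shape
    for i in range(n):
        a, b, c = pop[rng.choice(n, size=3, replace=False)]
        trial = np.where(rng.random(d) < CR, a + F * (b - c), pop[i])
        f_trial = f_obj(trial)
        if f_trial < fit[i]:
            pop[i], fit[i] = trial, f_trial
    return pop, fit

def qlde_like_optimize(f_obj, lower, upper, n=30, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.uniform(lower, upper, size=(n, len(lower)))
    fit = np.array([f_obj(x) for x in pop])
    q = QTable(n_states=1, n_actions=len(F_CR_PAIRS))    # single-state simplification
    state = 0
    for _ in range(iters):
        action = q.select_action(state)                   # pick an (F, CR) pair
        F, CR = F_CR_PAIRS[action]
        best_before = fit.min()
        pop, fit = de_generation(pop, fit, f_obj, F, CR, rng)
        reward = 1.0 if fit.min() < best_before else -1.0  # did this setting help?
        q.update(state, action, reward, next_state=state)
    return pop[np.argmin(fit)], float(fit.min())
```

Calling qlde_like_optimize(lambda x: float(np.sum(x**2)), lower=[-5.0] * 10, upper=[5.0] * 10) would, under these assumptions, let the controller gradually favor the (F, CR) pair that most often improves the best fitness.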
Li et al. [45] combined the Q-learning and GA algorithms to create the QGA algorithm. The core idea is to treat the GA's gene space as the Q-learning algorithm's action-strategy space; in other words, the selection action in Q-learning corresponds to the genetic selection operator in GA. QGA was applied to task scheduling, and the experimental results demonstrated the algorithm's effectiveness.
Lu et al. [46] proposed a reinforcement-learning-based PSO (RLPSO) for optimizing wastewater treatment processes (WWTP). RLPSO is composed of four fundamental components: the agent is a population particle, the environment is the WWTP, the state is the position of each particle in the population, and the action is the strategy for predicting each particle's velocity, which is determined by the Q-table.
To solve dynamic optimization problems, Gölcük and Ozsoydan [47] proposed a Q-learning-based algorithm recommendation architecture that guides the selection of the most appropriate metaheuristic algorithm in changing environments. Because of the abundance of design options, choosing an appropriate metaheuristic becomes a challenge in its own right. To this end, the Artificial Bee Colony (ABC), Manta Ray Foraging Optimization (MRFO), Salp Swarm Algorithm (SSA), and Whale Optimization Algorithm (WOA) were used as low-level optimizers, with Q-learning selecting among them based on their behavior. The results show the effectiveness of the Q-learning-based algorithm recommender in solving the dynamic multidimensional knapsack problem.
To increase the efficacy of brain storm optimization (BSO), Zhao et al. [48] proposed a new algorithm called the reinforcement learning brain storm optimization algorithm (RLBSO). RLBSO is equipped with four mutation strategies that balance its local and global search capabilities, and a Q-learning mechanism guides the selection of a mutation strategy based on the feedback of historical data. RLBSO was evaluated on the CEC 2017 benchmark and outperformed advanced BSO variants and state-of-the-art algorithms.
Based on DE, Hu and Gong [49] introduced a new optimization technique called RL-CORCO, in which Q-learning chooses between two mutation strategies. The proposed algorithm has nine hierarchical populations; hence the Q-table is a 9 × 2 matrix whose rows and columns represent the states and actions, respectively. After each strategy selection, the reward is obtained from the resulting change in the population's state and the Q-table is updated.
Liao and Li [50] proposed ERLDE, a new DE variant based on Q-learning that employs a Q-table with four actions and four states. Q-learning directs the DE search agents by controlling the mutation strategy and the neighborhood size based on rewards and punishments.
Wang et al. [51] infused Q-learning into an adaptive artificial bee colony (ABC) algorithm to select a search operator dynamically. The Q-table has twelve states and six actions, and Q-learning chooses among the six actions, which represent six distinct search operators. Different search operators can therefore be used in different generations, which lowers the chance of getting stuck in a locally optimal solution.
Wu et al. [52] modified the teaching-learning-based optimization (TLBO) algorithm by incorporating Q-learning, yielding the RLTLBO algorithm. First, the teacher phase of the basic TLBO is completed; the Q-learning mechanism then operates in the learner phase to enhance the search process. This facilitates comprehensive learning by the agents, thereby accelerating the convergence rate. A random opposition-based learning (ROBL) technique is added to make it easier to escape local optima.
Hamad et al. [24] employed Q-learning to improve the original SCA. Q-learning guides the search agents toward more efficient exploration and a global solution by escaping local optima. It was introduced into SCA to control the values of two important parameters (r1 and r3), which are crucial for directing the movement of a search agent within the search area; their values are adjusted according to the entries stored in the Q-table. The algorithm has five search agents, each with its own Q-table, resulting in five Q-tables. The table size in QLESCA is 9 × 9 (nine states and nine actions). The Q-table governs the switching between exploration and exploitation in QLESCA, which can therefore start with exploration and then move to exploitation, or the other way around.
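To make the 9 × 9 table concrete, the following fragment shows one way such a table could map onto the two SCA parameters, reusing the QTable helper from earlier. The 3 × 3 level grids for r1 and r3 and the state encoding from search progress and population diversity are hypothetical illustrations, not the discretization actually used in QLESCA.

```python
from itertools import product

# Assumed 3-level grids for r1 and r3; the 9 actions are their combinations.
R1_LEVELS = (0.5, 1.0, 2.0)
R3_LEVELS = (0.5, 1.0, 2.0)
ACTIONS = list(product(R1_LEVELS, R3_LEVELS))        # 9 (r1, r3) pairs

def encode_state(progress, diversity):
    """Map normalized search progress and population diversity (both in [0, 1])
    onto one of 9 discrete states (a 3 x 3 grid); an assumed discretization."""
    p = min(int(progress * 3), 2)
    d = min(int(diversity * 3), 2)
    return 3 * p + d

q = QTable(n_states=9, n_actions=9)                  # one such table per search agent
state = encode_state(progress=0.1, diversity=0.8)
r1, r3 = ACTIONS[q.select_action(state)]             # parameter values for this iteration
```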
A new PSO variant based on Q-learning for ship damage stability design was proposed by Huang et al. [53]; it is also called QLPSO. Q-learning identifies the optimal PSO settings (w, c1, and c2). The PSO particles serve as the Q-learning agents, and the search space is the Q-learning environment. The agent's current search operation represents the Q-learning state; the search operations in this algorithm are exploration, convergence, and jump. The Q-learning action is the agent's next search operation. If the agent achieves a good result, it receives a positive reward; otherwise, it receives a penalty.
Wang et al. [54] suggested a PSO variant termed reinforcement learning level-based PSO (RLLPSO) for high-dimensional problems. They apply the level-based population structure inspired by LLSO [55] to enhance the exploration capability of PSO and prevent premature convergence. In addition, they present a reinforcement learning technique for level-number control that automatically adjusts the population's number of levels. Based on the environmental reward, RLLPSO can identify the level structure that best enhances the population's search potential and increases search efficiency. RLLPSO also uses a level competition mechanism to speed up convergence and manage the large search space by making it more likely that top-performing agents are chosen as learning exemplars.
Table 3 summarizes the main characteristics of each Q-learning-based optimization algorithm.
Table 3
Description of Q-learning-based optimization algorithms.
Reference | Year | Proposed Algorithm | Metaheuristic Algorithm | Characteristics |
Kim & Lee [28] | 2009 | - | HPSO | To improve the search capabilities of HSPO algorithm, Q-learning is used to select HSPO control parameters adaptively. |
Gao et al.[29] | 2009 | - | PSO | The PSO search equation is updated using Q-learning, so each agent updates its position based on the best value stored in the Q-table. |
Rakshit et al.[30] | 2013 | TDQL | DE | The DE scaling factors of each search agent are controlled through Q-learning. |
Samma et al.[27] | 2016 | RLMPSO | PSO | Q-learning is used to control the operations of each PSO search agent. Each agent has five operations: exploration, convergence, high-jump, low-jump, and fine-tuning. |
Hsieh & Su [8] | 2016 | QSO | PSO | The QSO algorithm chooses the best agents in the population based on their cumulative performances. The Q-table in the QSO consists of two states (external state and internal state) and two actions (imitation and disturbance) to direct the search agent to the optimal global solution. |
Watchanupaporn & Pudtuan [56] | 2016 | - | PSO | Q-learning is employed to improve PSO for solving the multirobot target problem. |
Ma & Zhang [32] | 2016 | QABC | ABC | Several different search strategies that change based on Q-learning were mapped into an action that was used to change the location of the nectar source location. |
Zamli et al.[33] | 2018 | QLSCA | SCA | During runtime, Q-learning dynamically identifies the best-performing SCA operation, eliminating the probability-based switching between the sine and cosine equations. Q-learning switches among four operators (sine, cosine, Lévy flight, and crossover) and picks the best option based on previous rewards. |
Meerza et al.[34] | 2019 | QL + PSO | PSO | Each PSO particle is guided by the Q-table to find the optimal path until the end of the trial by selecting one of four actions (forward, backward, left, or right) based on previous rewards. |
Li et al.[13] | 2019 | DE-RLFR | DE | The values of each agent's fitness ranking are used to select a state. Three standard DE mutation operations are utilized as optional agent actions. Based on the stored reward and penalty from the Q-table, each agent chooses the best mutation operation for the next round. |
Liu et al.[26] | 2019 | QLPSO | PSO | Q-learning guides the PSO particles to choose the best action at each iteration. These actions control the exploration and exploitation behavior of these particles. |
Xu & Pi [35] | 2020 | QLPSO | PSO | Q-Learning was incorporated into the PSO in order to determine the optimal communication topology for each particle at each iteration based on the problem that the PSO was aimed to resolve. |
Chen et al.[36] | 2020 | RLGA | GA | Two Q-tables were used, one to determine the crossover fragments and another to determine mutation points. |
Samma et al.[37] | 2020 | QLSA | SA | Q-learning is used to modify SA parameters adaptively during run time. The optimum performance values of SA parameters are tracked using Q-learning. |
Zhang et al.[38] | 2020 | DQN-PSO | PSO | Q-learning is used to choose between three actions (Exploration, Exploitation, and Jump). Q-learning attempts to choose the most rewarding action. The position of the PSO agent is updated based on the selected action. |
Oztop et al.[39] | 2020 | GVNS-QL | GVNS | The suggested approach uses Q-learning to define the GVNS algorithm's parameters. |
Chen et al.[40] | 2020 | SLGA | GA | The SARSA and Q-learning algorithms are applied in the learning module of SLGA. SLGA integrates the two methods to gain the benefits of both: faster learning speed and higher solution precision. |
Huynh et al.[41] | 2021 | qlDE | DE | Q-learning is implemented in DE as an adaptive parameter controller, adjusting the algorithm's control parameters at each search iteration to enhance its performance over a variety of search domains. |
Seyyedabbasi et al.[42] | 2021 | RLI-GWO | I-GWO | Q-learning is integrated into three optimization algorithms (I-GWO, Ex-GWO, and WOA) to increase their performance. The search agents of these algorithms use the Q-table values to choose between the exploration and exploitation phases, and the Q-table is filled in according to the reward and penalty values of each search agent's actions. |
Seyyedabbasi et al.[42] | 2021 | RLEx-GWO | Ex-GWO | |
Seyyedabbasi et al.[42] | 2021 | RLWOA | WOA | |
Li et al.[45] | 2021 | QGA | GA | The genetic algorithm chooses the best genes for the next generation based on the Q-table values. |
Lu et al.[46] | 2021 | RLPSO | PSO | The Q-table is used to determine the strategy for predicting each particle's velocity in the PSO. |
Gölcük & Ozsoydan [47] | 2021 | - | ABC, MRFO, SSA, and WOA | From these four algorithms, Q-learning selects which is the best algorithm based on their previous performance. |
Zhao et al.[48] | 2022 | RLBSO | BSO | At each iteration, Q-learning is used to select one of four mutation strategies based on the historical record of rewards and penalties. |
Hu & Gong [49] | 2022 | RL-CORCO | DE | To pick between two mutation techniques, Q-learning is applied. There are nine states and two actions in the Q-table. |
Liao & Li [50] | 2022 | ERLDE | DE | Q-learning is used to choose the DE algorithm's mutation strategy and neighborhood size. |
Wang et al.[51] | 2022 | QABC | ABC | Q-learning is used to dynamically select between six different search operators of ABC. |
Wu et al.[52] | 2022 | RLTLBO | TLBO | In the learner phase of TLBO, Q-Learning is implemented to create a system for transitioning between two different learning modes. Then, ROBL is introduced to the algorithm to improve its avoidance of local optima. |
Hamad et al.[24] | 2022 | QLESCA | SCA | Q-learning is used to guide the SCA’s search agent to the best search area. The Q-table has nine states and nine actions to control the values of two of SCA’s important parameters (r1 and r3). By changing these parameters, QLESCA can make the search phase go from exploration to exploitation and back again. |
Huang et al.[53] | 2022 | QLPSO | PSO | Q-learning works to determine the optimal PSO parameter settings. |
Wang et al.[54] | 2022 | RLLPSO | PSO | In lieu of manual control, a Q-learning approach for level-number control auto-adaptively alters the population's level structure. Owing to the adaptability and flexibility provided by Q-learning, the algorithm improves search efficiency by taking the most promising actions based on the rewards earned during the learning process. |
The number of metaheuristic optimization algorithms published between 2009 and 2022 that used the Q-learning algorithm to improve their performance is depicted in Fig. 6. As the graph shows, the number of such algorithms has grown over the past three years, with 6, 7, and 8 algorithms published in 2020, 2021, and 2022, respectively.
Figure 7 illustrates the metaheuristic optimization algorithms that have been enhanced with the Q-learning algorithm. As shown in the graph, researchers have used Q-learning to enhance 14 metaheuristic optimization algorithms. PSO ranks first with the greatest number of variants (eleven proposed algorithms), DE second with five variants, and ABC third with three variants, followed by SCA, GA, WOA, and GWO with two variants each. Only one variant exists for each of TLBO, SA, BSO, MRFO, SSA, GVNS, and HPSO.