5.1. Simulation results
In this section, we randomly generate instances of several sizes (5 × 5, 6 × 6, 8 × 8, 10 × 10, 15 × 15 and 20 × 20). For each size, we randomly generate 100 instances, with the processing time of each operation drawn as an integer in the range 1–99, to train and test the proposed algorithm. Finally, the trained algorithm is applied to the FT [30], LA [31] and Taillard [32] benchmark instances to compare its performance. The PPO algorithm is implemented in PyTorch and trained on a Windows 10 64-bit operating system with an Intel Xeon Bronze 3106 CPU and an NVIDIA Quadro P6000 GPU.
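As a minimal sketch of how such instances can be generated (assuming, as in the benchmarks above, that each job visits every machine exactly once in a random order; all function and variable names are illustrative, not the paper's code):

```python
import numpy as np

def generate_instance(n_jobs, n_machines, low=1, high=99, seed=None):
    """Generate one random JSSP instance: each job visits every machine once
    in a random order, with integer processing times drawn from [low, high]."""
    rng = np.random.default_rng(seed)
    times = rng.integers(low, high + 1, size=(n_jobs, n_machines))
    machines = np.array([rng.permutation(n_machines) for _ in range(n_jobs)])
    return times, machines

# Example: 100 training instances of size 6 x 6
train_set = [generate_instance(6, 6, seed=i) for i in range(100)]
```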
In deep reinforcement learning, the algorithm is very sensitive to its hyperparameters, and the training results and convergence differ under different hyperparameter settings. Therefore, taking the training of a randomly generated 6 × 6 instance as an example, this paper analyses the influence of the clip value, the network dimension and the learning rate on the training process of the reinforcement learning algorithm. As shown in Fig. 2, the abscissa of each point indicates the training iteration and the ordinate indicates the makespan; the number of training iterations is set to 10,000.
In the PPO algorithm, the clip value bounds the probability ratio between the new and old policies in the surrogate objective, keeping each update inside a trust region so that a single update cannot change the policy too drastically. This also prevents the accumulation of errors when the same batch of samples is reused and cuts off the gradient of samples that would change the output of the current policy too much. The clip value \(\epsilon\) is a small positive number: for the same input sample, the probability ratio produced by a parameter update is clipped to the interval \([1-\epsilon, 1+\epsilon]\). The smaller \(\epsilon\) is, the more stable the training; the larger \(\epsilon\) is, the more samples contribute to the update. In this paper, we compare three values: 0.1, 0.2 and 0.3. As shown in Fig. 2(a), the algorithm converges best and yields the smallest makespan when clip = 0.1, so we select clip = 0.1.
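A minimal PyTorch sketch of this clipped objective is given below, using the coefficients from Table 2; how the coefficients are combined into the total loss is our assumption, and all tensor names are illustrative:

```python
import torch

def ppo_loss(new_logp, old_logp, advantage, value, value_target, entropy,
             clip_eps=0.1, p_coef=2.0, v_coef=2.0, ent_coef=0.1):
    """Clipped surrogate loss: the probability ratio is kept in [1 - eps, 1 + eps]."""
    ratio = torch.exp(new_logp - old_logp)                       # pi_new / pi_old
    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    policy_loss = -torch.min(surr1, surr2).mean()                # clipped policy objective
    value_loss = torch.nn.functional.mse_loss(value, value_target)
    # Entropy bonus is subtracted to encourage exploration.
    return p_coef * policy_loss + v_coef * value_loss - ent_coef * entropy.mean()
```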
Similarly, the network structure has a great impact on the training process. In this paper, we set up an actor network \({\pi }_{\theta }\) and a critic network \({v}_{\varphi }\) that share the GIN layers; each GIN layer contains two MLPs, and both \({\pi }_{\theta }\) and \({v}_{\varphi }\) are four-layer MLPs. As shown in Fig. 2(c), by adjusting the dimension of each layer, we compare four hidden dimensions: 64, 128, 256 and 512. The performance is best when the dimension is 128 or 256, but an excessively high dimension reduces the training efficiency of the algorithm. Therefore, the dimension selected in this paper is 128.
Figure 2(b) shows the impact of different learning rates; the algorithm performs best when lr = 1e-3.
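A simplified PyTorch sketch of this actor-critic structure with a 128-dimensional hidden layer and the selected learning rate is shown below; the shared GIN encoder is replaced by a plain MLP stand-in, and the feature dimension and graph pooling are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden, out_dim, n_layers=4):
    """Stack of Linear + Tanh layers; four layers are used for the actor and critic heads."""
    layers, d = [], in_dim
    for _ in range(n_layers - 1):
        layers += [nn.Linear(d, hidden), nn.Tanh()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class ActorCritic(nn.Module):
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        # Shared embedding (GIN layers omitted; plain MLP stand-in for illustration)
        self.encoder = mlp(feat_dim, hidden, hidden, n_layers=2)
        self.actor = mlp(hidden, hidden, 1)     # per-operation score -> policy logits
        self.critic = mlp(hidden, hidden, 1)    # state value from pooled embedding

    def forward(self, node_feats):
        h = self.encoder(node_feats)            # (n_ops, hidden)
        logits = self.actor(h).squeeze(-1)      # (n_ops,)
        value = self.critic(h.mean(dim=0))      # mean-pooled graph embedding -> scalar
        return logits, value

model = ActorCritic(feat_dim=2, hidden=128)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```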
Table 2. Parameter settings of the PPO and GA algorithms.

| Parameters | Values |
| --- | --- |
| Proximal Policy Optimization | |
| Learning rate | 1e-3 |
| Clip | 0.1 |
| Entropy coefficient | 0.1 |
| Value function coefficient | 2 |
| Policy loss coefficient | 2 |
| Number of steps per update | 32 |
| Discount factor | 0.95 |
| Genetic Algorithm | |
| Iteration | 5000 |
| Crossover probability | 0.8 |
To further evaluate the performance of the algorithm, we verify it on public benchmark instances and compare it with a genetic algorithm and several dispatching rules. The genetic algorithm is the most classical metaheuristic in the field of job shop scheduling: it produces high-quality solutions, but its solution time is relatively long. The parameter values of the PPO and GA algorithms are listed in Table 2.
Finally, the results of each algorithm and dispatching rule on the public benchmark instances are shown in Table 3 and Fig. 3. The makespan obtained by the proposed method is shorter than that of the heuristic rules on every instance, and it is better than or equal to the metaheuristic GA on 80% of the instances, while the DRL algorithm solves much faster than the metaheuristic. Therefore, the overall effect of the DRL method is better than that of the heuristic dispatching rules and the metaheuristic algorithm. Compared with the best-known optimum, the relative error of DRL is very small: DRL obtains the optimal solution on 60% of the public instances, and it converges particularly well on instances with a small number of machines.
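The relative error quoted here is the optimality gap with respect to the best-known makespan listed in the OPT column of Table 3:

$$\text{Gap}=\frac{C_{\max}^{\mathrm{PPO}}-C_{\max}^{\mathrm{OPT}}}{C_{\max}^{\mathrm{OPT}}}\times 100\%$$

For example, on La15 the gap is (1212 − 1207)/1207 ≈ 0.4%.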
Table 3. Makespan of the dispatching rules, GA and the proposed PPO method on public benchmark instances.

| Instance | Size | OPT | SPT | LPT | SSO | LSO | GA | PPO |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| La01.txt | (10×5) | 666 | 920 | 889 | 844 | 806 | 706 | 666 |
| La02.txt | (10×5) | 655 | 901 | 894 | 952 | 815 | 732 | 655 |
| La03.txt | (10×5) | 597 | 770 | 748 | 904 | 793 | 680 | 597 |
| La04.txt | (10×5) | 590 | 916 | 848 | 937 | 868 | 637 | 639 |
| La05.txt | (10×5) | 593 | 827 | 787 | 894 | 633 | 593 | 593 |
| La06.txt | (15×5) | 926 | 1369 | 1105 | 1201 | 1009 | 948 | 926 |
| La07.txt | (15×5) | 890 | 1128 | 1145 | 1187 | 1104 | 985 | 894 |
| La08.txt | (15×5) | 863 | 1168 | 1061 | 1190 | 1131 | 932 | 863 |
| La09.txt | (15×5) | 951 | 1289 | 1105 | 1319 | 1174 | 1001 | 951 |
| La10.txt | (15×5) | 958 | 1345 | 1136 | 1604 | 976 | 958 | 958 |
| La11.txt | (20×5) | 1222 | 1654 | 1476 | 1470 | 1289 | 1304 | 1222 |
| La12.txt | (20×5) | 1039 | 1352 | 1222 | 1324 | 1173 | 1087 | 1047 |
| La13.txt | (20×5) | 1150 | 1747 | 1298 | 1592 | 1350 | 1207 | 1150 |
| La14.txt | (20×5) | 1292 | 1757 | 1360 | 1895 | 1341 | 1292 | 1292 |
| La15.txt | (20×5) | 1207 | 1476 | 1510 | 1627 | 1411 | 1403 | 1212 |
| La16.txt | (10×10) | 945 | 1588 | 1238 | 1565 | 1274 | 979 | 980 |
| La17.txt | (10×10) | 784 | 1094 | 1157 | 1135 | 1060 | 809 | 794 |
| La18.txt | (10×10) | 848 | 1259 | 1264 | 1232 | 1090 | 903 | 851 |
| La19.txt | (10×10) | 842 | 1339 | 1140 | 1225 | 1134 | 878 | 872 |
| La20.txt | (10×10) | 902 | 1331 | 1293 | 1453 | 1219 | 964 | 938 |
| Ft06.txt | (6×6) | 55 | 87 | 73 | 88 | 61 | 55 | 55 |
| Ta01.txt | (15×15) | 1231 | 1872 | 1812 | 2148 | 1957 | 1419 | 1315 |
| Ta11.txt | (20×15) | 1357 | 2273 | 2117 | 2243 | 2216 | 1633 | 1788 |
| Ta21.txt | (20×20) | 1642 | 2443 | 2691 | 2610 | 2647 | 1991 | 1952 |
| Ta31.txt | (30×15) | 1764 | 2670 | 2589 | 2916 | 2478 | 2226 | 1986 |
5.2. Case study
To further verify the effect of the proposed method in a real scenario, we take the personalized customized car assembly line in our laboratory as an example. As shown in Fig. 4, we establish the framework of the digital twin workshop according to the five-dimensional model proposed by Tao et al. [8], which consists of a physical layer, a virtual layer, a service layer, DT data and connections (CN). The lower part of Fig. 4 introduces the production content of our laboratory, a workshop that produces eight models of customized cars. The production of each model includes eight processes: chassis, front wheel, rear wheel, frame, left door, right door, front cover and logo. These are assembled by eight different robot arms, and the assembly times differ. The lower right part of Fig. 4 shows the customized ordering system we developed, through which the model and quantity to be customized can be selected. After an order is confirmed, it is transmitted to the database. When a batch of orders is entered, we solve it with the PPO algorithm described above. According to the obtained result, the automated guided vehicle (AGV) transports the materials; after the AGV arrives at the designated station, the robot arm program is started manually to assemble the car. The job processing time in this paper is the sum of the AGV transportation time and the robot arm assembly time.
The solution process of the job shop scheduling problem based on the digital twin is shown in Fig. 5. First, we establish a digital twin scenario and then schedule tasks with the PPO algorithm described above. The scheduling result is simulated and verified interactively in the digital twin scenario; if the simulation meets the preset requirements, the result is fed back to the physical workshop for execution. Because the personalized customized ordering system is used continuously for task distribution, an order insertion changes the DT data, which is perceived by the DT workshop and triggers the rescheduling mechanism. When rescheduling occurs, the newly inserted orders and the original unfinished orders are solved together, and the final result is simulated and verified in the digital twin workshop. If it meets the requirements, it is fed back to the physical workshop for assembly.
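A schematic sketch of this DT-driven scheduling and rescheduling loop is given below; all DT interface and agent method names are illustrative assumptions, not an actual API:

```python
def scheduling_loop(dt_workshop, ppo_agent):
    """Schematic DT-driven (re)scheduling loop; dt_workshop and ppo_agent are assumed objects."""
    plan = ppo_agent.solve(dt_workshop.current_orders())          # initial schedule via trained PPO
    while not dt_workshop.all_orders_finished():
        if dt_workshop.simulate(plan).meets_requirements():       # verify plan in the DT scenario
            dt_workshop.dispatch_to_physical(plan)                # feed back to the physical workshop
        if dt_workshop.detect_order_insertion():                  # DT data change triggers rescheduling
            pending = dt_workshop.unfinished_orders() + dt_workshop.new_orders()
            plan = ppo_agent.solve(pending)                       # reschedule old and inserted orders jointly
```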
Next, we take the 8 car models in our laboratory as an example for verification. We train the algorithm with randomly generated 8 × 8 instances; the makespan during training is shown in Fig. 6. The trained algorithm is then used to solve the laboratory instance. Taking the 8 jobs as an example, the processing data are given in Formula (13), where the first element of each pair is the number of the equipment (here, the robot arm) that processes the operation and the second element is the required time (s).
$$J=\left[\begin{array}{cccccccc}
(3,38) & (5,67) & (1,62) & (7,54) & (8,59) & (6,73) & (4,65) & (2,49)\\
(8,49) & (4,71) & (5,71) & (7,41) & (1,51) & (2,49) & (3,62) & (6,45)\\
(1,48) & (4,63) & (6,51) & (5,59) & (3,42) & (7,43) & (2,61) & (8,52)\\
(4,60) & (8,53) & (3,63) & (1,72) & (2,76) & (7,69) & (6,59) & (5,69)\\
(6,68) & (8,74) & (4,66) & (5,69) & (2,42) & (3,70) & (1,49) & (7,52)\\
(7,57) & (2,48) & (8,49) & (3,66) & (1,68) & (6,79) & (4,56) & (5,41)\\
(1,48) & (5,79) & (6,83) & (2,85) & (8,73) & (7,52) & (4,43) & (3,62)\\
(2,50) & (6,86) & (3,74) & (4,53) & (1,63) & (8,71) & (5,66) & (7,88)
\end{array}\right] \tag{13}$$
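For implementation, the job data in Formula (13) can be stored as a nested list of (robot arm, time) pairs; the variable name below is illustrative, and the values are transcribed directly from (13):

```python
# Each row is one job (model); each pair is (robot arm number, processing time in seconds).
jobs = [
    [(3, 38), (5, 67), (1, 62), (7, 54), (8, 59), (6, 73), (4, 65), (2, 49)],
    [(8, 49), (4, 71), (5, 71), (7, 41), (1, 51), (2, 49), (3, 62), (6, 45)],
    [(1, 48), (4, 63), (6, 51), (5, 59), (3, 42), (7, 43), (2, 61), (8, 52)],
    [(4, 60), (8, 53), (3, 63), (1, 72), (2, 76), (7, 69), (6, 59), (5, 69)],
    [(6, 68), (8, 74), (4, 66), (5, 69), (2, 42), (3, 70), (1, 49), (7, 52)],
    [(7, 57), (2, 48), (8, 49), (3, 66), (1, 68), (6, 79), (4, 56), (5, 41)],
    [(1, 48), (5, 79), (6, 83), (2, 85), (8, 73), (7, 52), (4, 43), (3, 62)],
    [(2, 50), (6, 86), (3, 74), (4, 53), (1, 63), (8, 71), (5, 66), (7, 88)],
]
```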
The initial scheduling plan generated by the PPO algorithm is shown in Fig. 9. The abscissa represents time (s), and the ordinate represents the equipment, i.e., the robot arm number. In the figure, the first number of each label is the model and the second is the corresponding process; for example, 7-3 denotes the third process of the seventh model. The resulting makespan is 730 s.
Assume that a dynamic event occurs at t = 500 s: orders for Model 2 and Model 8 are issued through the personalized customization order system. Relying on the real-time mapping of the digital twin workshop, the order insertion is detected in the DT data layer, the rescheduling solution is started, and the rescheduling scheme is simulated in the digital twin workshop. If the scheme meets the preset requirements, it is fed back to the physical workshop for execution. The obtained rescheduling scheme is shown in Fig. 10.
Finally, we compare the results with the heuristic rules and the genetic algorithm; the parameter settings are those given in Section 5.1 (Table 2). The execution time and makespan of each scheduling result are listed in Table 4. The running times of the heuristic rules and the PPO algorithm are much shorter than that of the genetic algorithm, and once the DRL model is trained, the best value is obtained in a very short time. Therefore, we consider the results obtained by DRL to be satisfactory.
Table 4. Execution time and makespan of each method in the case study.

| | SPT | LPT | SSO | LSO | GA | DRL |
| --- | --- | --- | --- | --- | --- | --- |
| Execution time (s) | 2.1 | 2.1 | 2.1 | 2.1 | 89.4 | 1.6 |
| Makespan (s) | 992 | 1011 | 992 | 885 | 810 | 730 |