5.1. Simulation results
In this section, we randomly generate instances of several sizes (5 × 5, 6 × 6, 8 × 8, 10 × 10, 15 × 15 and 20 × 20). For each size, we generate 100 instances, with the processing time of each operation an integer drawn from the range 1–99, to train and test the proposed algorithm. Finally, the trained algorithm is applied to the FT [30], LA [31] and Taillard [32] benchmark instances to compare its performance. We implemented the PPO algorithm in PyTorch, running on the Windows 10 64-bit operating system with an Intel Xeon Bronze 3106 CPU and an NVIDIA Quadro P6000 GPU.
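The instance-generation procedure can be sketched as follows (a minimal illustration; the function name and the use of Python's `random` module are our assumptions, not the authors' code):

```python
import random

def generate_instance(n_jobs, n_machines, seed=None):
    """One random JSSP instance: each job visits every machine once in a
    random order, with integer processing times drawn uniformly from 1-99."""
    rng = random.Random(seed)
    # One random machine permutation per job (the job's routing).
    machines = [rng.sample(range(n_machines), n_machines) for _ in range(n_jobs)]
    # Integer processing times in [1, 99], as in the paper.
    times = [[rng.randint(1, 99) for _ in range(n_machines)] for _ in range(n_jobs)]
    return machines, times

# 100 training instances per size, e.g. the 6 x 6 case:
dataset = [generate_instance(6, 6, seed=s) for s in range(100)]
```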
Deep reinforcement learning algorithms are very sensitive to hyperparameters: training results and convergence differ under different settings. Therefore, taking the training of a randomly generated 6 × 6 instance as an example, this paper analyses the influence of the clip value, the network structure and the learning rate on the training process of the reinforcement learning algorithm. As shown in Fig. 2, the abscissa of each point indicates the number of training iterations, and the ordinate indicates the makespan. The number of training iterations is set to 10,000.
In the PPO algorithm, the clip value keeps each policy update inside a trust region so that excessively large changes do not degrade the algorithm. It also prevents errors from accumulating when the same batch of samples is reused, by cutting off the gradient of samples that would change the output of the current policy too much. Concretely, the probability ratio between the new and old policies, which multiplies the advantage function, is clipped to the interval \([1-\epsilon , 1+\epsilon ]\), where \(\epsilon\) is a small positive number. The smaller \(\epsilon\) is, the more stable the training; the larger \(\epsilon\) is, the more samples are retained. In this paper, we compare three values: 0.1, 0.2 and 0.3. As shown in Fig. 2 (a), clip = 0.1 yields the best convergence and the smallest makespan, so we select clip = 0.1.
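The clipping operation described above can be written out for a single sample (a scalar sketch of the standard PPO clipped surrogate, not the paper's exact implementation):

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.1):
    """PPO clipped surrogate for one sample: the probability ratio
    pi_new / pi_old is clamped to [1 - eps, 1 + eps], so samples whose
    policy output changes too much contribute no extra gradient."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # PPO keeps the pessimistic (minimum) of the two surrogates.
    return min(ratio * advantage, clipped * advantage)
```

A smaller `eps` (e.g. 0.1) clamps more ratios and stabilizes the updates, which matches the behaviour observed in Fig. 2 (a).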
Similarly, the network structure has a great impact on training. We set up an actor network \({\pi }_{\theta }\) and a critic network \({v}_{\varphi }\) that share the GIN layers; each GIN layer contains two MLPs, and \({\pi }_{\theta }\) and \({v}_{\varphi }\) each consist of four MLP layers. As shown in Fig. 2 (c), by adjusting the dimension of each layer, we compare four hidden dimensions: 64, 128, 256 and 512. The performance is best at 128 and 256, but an excessively high dimension reduces the training efficiency of the algorithm, so we select a dimension of 128.
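The actor/critic heads can be sketched as plain PyTorch MLPs (an illustrative reconstruction: the four-layer depth and 128-dimensional hidden size follow the text, while the activation choice and function name are our assumptions):

```python
import torch
import torch.nn as nn

def mlp_head(in_dim, hidden=128, out_dim=1, n_layers=4):
    """Four-layer MLP head, as used for both the actor pi_theta and the
    critic v_phi on top of the shared GIN embedding."""
    layers, d = [], in_dim
    for _ in range(n_layers - 1):
        layers += [nn.Linear(d, hidden), nn.Tanh()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))  # logits (actor) or value (critic)
    return nn.Sequential(*layers)

critic = mlp_head(in_dim=64)        # scalar state value per input
value = critic(torch.zeros(3, 64))  # batch of 3 graph embeddings
```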
Figure 2 (b) shows the impact of different learning rates on algorithm performance; the performance is best when lr = 1e-3.
Table 2. Parameter settings of the PPO and GA algorithms.

Proximal Policy Optimization
  Learning rate               1e-3
  Clip                        0.1
  Entropy coefficient         0.1
  Value function coefficient  2
  Policy loss coefficient     2
  Number of steps per update  32
  Discount factor             0.95

Genetic Algorithm
  Iteration                   5000
  Crossover probability       0.8
To further evaluate the performance of the algorithm, we verify it on public benchmark instances, comparing it against a genetic algorithm (GA) and several heuristic scheduling rules. The genetic algorithm is the classical metaheuristic in the field of job shop scheduling: it delivers good solution quality on these instances, but its solution time is relatively long. The parameter values of the PPO and GA algorithms are listed in Table 2.
Finally, the results of each algorithm and scheduling rule on the public benchmarks are shown in Table 3 and Fig. 3. The makespan of the proposed method is shorter than that of the heuristic rules, and on 80% of the instances it is better than or equal to that of the metaheuristic GA, while the DRL algorithm is much faster than the metaheuristic in solving speed. Therefore, the overall effect of the DRL method is better than both the heuristic scheduling rules and the metaheuristic algorithm. Compared with the optimal solution, the relative error of DRL is very small and close to the theoretical optimum; the DRL method obtains the theoretical optimal solution on 60% of the benchmarks, and it converges especially well on instances with a small number of machines.
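The relative error quoted above is simply the gap to the best-known optimum; using the Table 3 values as examples (a trivial helper, name ours):

```python
def relative_error(makespan, optimum):
    """Relative gap of a schedule's makespan to the best-known optimum."""
    return (makespan - optimum) / optimum

# From Table 3: La04 has OPT = 590 and PPO = 639, a gap of about 8.3%;
# La01 (OPT = PPO = 666) reaches the theoretical optimum exactly.
gap_la04 = relative_error(639, 590)
gap_la01 = relative_error(666, 666)
```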
Table 3. Makespan of each method on the public benchmark instances.

Instance   Size      OPT    SPT    LPT    SSO    LSO    GA     PPO
La01.txt   (10×5)    666    920    889    844    806    706    666
La02.txt   (10×5)    655    901    894    952    815    732    655
La03.txt   (10×5)    597    770    748    904    793    680    597
La04.txt   (10×5)    590    916    848    937    868    637    639
La05.txt   (10×5)    593    827    787    894    633    593    593
La06.txt   (15×5)    926    1369   1105   1201   1009   948    926
La07.txt   (15×5)    890    1128   1145   1187   1104   985    894
La08.txt   (15×5)    863    1168   1061   1190   1131   932    863
La09.txt   (15×5)    951    1289   1105   1319   1174   1001   951
La10.txt   (15×5)    958    1345   1136   1604   976    958    958
La11.txt   (20×5)    1222   1654   1476   1470   1289   1304   1222
La12.txt   (20×5)    1039   1352   1222   1324   1173   1087   1047
La13.txt   (20×5)    1150   1747   1298   1592   1350   1207   1150
La14.txt   (20×5)    1292   1757   1360   1895   1341   1292   1292
La15.txt   (20×5)    1207   1476   1510   1627   1411   1403   1212
La16.txt   (10×10)   945    1588   1238   1565   1274   979    980
La17.txt   (10×10)   784    1094   1157   1135   1060   809    794
La18.txt   (10×10)   848    1259   1264   1232   1090   903    851
La19.txt   (10×10)   842    1339   1140   1225   1134   878    872
La20.txt   (10×10)   902    1331   1293   1453   1219   964    938
Ft06.txt   (6×6)     55     87     73     88     61     55     55
Ta01.txt   (15×15)   1231   1872   1812   2148   1957   1419   1315
Ta11.txt   (20×15)   1357   2273   2117   2243   2216   1633   1788
Ta21.txt   (20×20)   1642   2443   2691   2610   2647   1991   1952
Ta31.txt   (30×15)   1764   2670   2589   2916   2478   2226   1986
5.2. Case study
To further verify the effect of the proposed method in a real-world scenario, we take the personalized customized car assembly line in our laboratory as an example. As shown in Fig. 4, we establish the framework of the digital twin workshop according to the five-dimensional model proposed by Tao et al. [8], which comprises a physical layer, a virtual layer, a service layer, DT data and a connection (CN) layer. The lower part of Fig. 4 introduces the production content of our laboratory: a workshop producing eight models of customized cars. The production of each model includes eight processes (chassis, front wheel, rear wheel, frame, left door, right door, front cover and logo), which are assembled on eight different robot arms with different assembly times. The lower right part of Fig. 4 shows the customized ordering system we developed. The model and quantity to be customized are selected through the software; after the order is confirmed, it is transmitted to the database. When a batch of orders is entered, we solve it with the PPO algorithm described above. According to the obtained results, an automated guided vehicle (AGV) transports the materials; after the AGV reaches the designated station, the robot arm program is started manually to assemble the car. The job processing time in this paper is the sum of the AGV transportation time and the robot arm assembly time.
The solution process of the job shop scheduling problem based on the digital twin is shown in Fig. 5. First, we establish a digital twin scenario and then schedule tasks with the PPO algorithm described above. The scheduling results are simulated and verified interactively in the digital twin scenario; if the simulation results meet our preset requirements, they are fed back to the physical workshop for execution. Because the personalized customized ordering system is used continuously for task distribution during the solution process, an order insertion may occur; owing to the perception capability of the DT workshop, the rescheduling mechanism is triggered once the DT data change. When rescheduling occurs, the newly inserted orders and the original unfinished orders are solved together, and the final result is simulated and verified in the digital twin workshop. If it meets the requirements, it is fed back to the physical workshop for assembly.
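The rescheduling trigger can be sketched as follows (a hypothetical interface: the `scheduler` callable, the operation dictionaries and the field names are our illustration, not the paper's code):

```python
def reschedule_on_insertion(pending_ops, inserted_ops, now, scheduler):
    """When the DT data layer perceives an order insertion at time `now`,
    keep every operation that has not started yet and solve it together
    with the newly inserted operations using the same scheduler."""
    unfinished = [op for op in pending_ops
                  if op["start"] is None or op["start"] >= now]
    return scheduler(unfinished + inserted_ops, t0=now)

# Toy usage with a stand-in scheduler that reports what it was given:
count_jobs = lambda ops, t0: (len(ops), t0)
ops = [{"start": 450}, {"start": None}, {"start": 600}]
result = reschedule_on_insertion(ops, [{"start": None}], now=500,
                                 scheduler=count_jobs)
```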
Next, we take the 8 models in our laboratory as an example for verification. We train the algorithm on randomly generated 8 × 8 instances; the makespan during training is shown in Fig. 6. The trained algorithm is then used to solve the laboratory case. Taking 8 jobs as an example, the job data are shown in Formula (13): the first parameter of each entry is the number of the equipment that can process the operation, which here refers to the robot arm, and the second parameter is the required time (s).
$${J}_{1}=\left[\begin{array}{cccccccc}\left(\text{3,38}\right)& \left(\text{5,67}\right)& \left(\text{1,62}\right)& \left(\text{7,54}\right)& \left(\text{8,59}\right)& \left(\text{6,73}\right)& \left(\text{4,65}\right)& \left(\text{2,49}\right)\\ \left(\text{8,49}\right)& \left(\text{4,71}\right)& \left(\text{5,71}\right)& \left(\text{7,41}\right)& \left(\text{1,51}\right)& \left(\text{2,49}\right) & \left(\text{3,62}\right)& \left(\text{6,45}\right)\\ \left(\text{1,48}\right)& \left(\text{4,63}\right)& \left(\text{6,51}\right)& \left(\text{5,59}\right)& \left(\text{3,42}\right)& \left(\text{7,43}\right) & \left(\text{2,61}\right)& \left(\text{8,52}\right)\\ \left(\text{4,60}\right)& \left(\text{8,53}\right)& \left(\text{3,63}\right)& \left(\text{1,72}\right)& \left(\text{2,76}\right)& \left(\text{7,69}\right)& \left(\text{6,59}\right)& \left(\text{5,69}\right)\\ \left(\text{6,68}\right)& \left(\text{8,74}\right)& \left(\text{4,66}\right)& \left(\text{5,69}\right)& \left(\text{2,42}\right)& \left(\text{3,70}\right)& \left(\text{1,49}\right)& \left(\text{7,52}\right)\\ \left(\text{7,57}\right)& \left(\text{2,48}\right)& \left(\text{8,49}\right)& \left(\text{3,66}\right)& \left(\text{1,68}\right)& \left(\text{6,79}\right)& \left(\text{4,56}\right)& \left(\text{5,41}\right)\\ \left(\text{1,48}\right)& \left(\text{5,79}\right)& \left(\text{6,83}\right)& \left(\text{2,85}\right)& \left(\text{8,73}\right)& \left(\text{7,52}\right)& \left(\text{4,43}\right)& \left(\text{3,62}\right)\\ \left(\text{2,50}\right)& \left(\text{6,86}\right)& \left(\text{3,74}\right)& \left(\text{4,53}\right)& \left(\text{1,63}\right)& \left(\text{8,71}\right)& \left(\text{5,66}\right)& \left(\text{7,88}\right)\end{array}\right]$$
(13)
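For illustration, the first row of the matrix in Formula (13) (the first job's operations) can be stored directly as (robot arm, time) pairs; summing the times gives a simple lower bound on the makespan, since a job's operations run sequentially:

```python
# First row of Formula (13): (robot-arm id, processing time in seconds).
J1 = [(3, 38), (5, 67), (1, 62), (7, 54), (8, 59), (6, 73), (4, 65), (2, 49)]

# A job's operations run one after another, so its total processing
# time is a lower bound on any schedule's makespan.
j1_total = sum(t for _, t in J1)
```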
The initial scheduling plan generated by the PPO algorithm is shown in Fig. 9. The abscissa represents the time (s), and the ordinate represents the equipment, i.e., the robot arm number. The first parameter in the figure is the model, and the second parameter is the corresponding process; 7-3, for example, represents the third process of the seventh model. The obtained makespan is 730 s.
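The makespan read off such a Gantt chart can be reproduced from any operation sequence by a semi-active schedule evaluation (a generic sketch of this standard decoding, not the paper's own decoder):

```python
def makespan(jobs, order):
    """Evaluate a semi-active schedule: `jobs[j]` lists job j's
    (machine, time) operations in order; `order` is the global dispatch
    sequence of job ids, one entry per operation. Each operation starts
    as soon as both its job and its machine are free."""
    job_ready = [0] * len(jobs)   # earliest start per job
    mach_ready = {}               # earliest start per machine
    next_op = [0] * len(jobs)     # next operation index per job
    end = 0
    for j in order:
        m, t = jobs[j][next_op[j]]
        start = max(job_ready[j], mach_ready.get(m, 0))
        job_ready[j] = mach_ready[m] = start + t
        next_op[j] += 1
        end = max(end, start + t)
    return end

# Two jobs, two machines: interleaving them gives makespan 7.
toy = [[(0, 3), (1, 2)], [(1, 2), (0, 4)]]
```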
Assuming that a dynamic event occurs at t = 500 s, we issue orders for Model 2 and Model 8 through the personalized customization ordering system. Relying on the real-time mapping of the digital twin workshop, the order insertion information is perceived in the DT data layer, the rescheduling solution is started, and the rescheduling scheme is then simulated in the digital twin workshop. If the scheme meets our preset requirements, it is fed back to the physical workshop for execution. The obtained rescheduling scheme is shown in Fig. 10.
Finally, we compare the results with the heuristic rules and the genetic algorithm; the parameter settings are given in Section 5.1. The running time and minimum makespan of each method are listed in Table 4. The table shows that the running time of the heuristic rules and the PPO algorithm is much shorter than that of the genetic algorithm, and once the DRL model is trained, the best value is obtained in a very short time. Therefore, the results obtained by DRL are compelling.

Table 4. Running time and makespan of each method on the laboratory case.

Method              SPT    LPT    SSO    LSO    GA     DRL
Execution time (s)  2.1    2.1    2.1    2.1    89.4   1.6
Makespan (s)        992    1011   992    885    810    730