Mathematical model
We introduce the FDSSP through the production scheduling problem of a hydraulic cylinder. The hydraulic cylinder process flow is simplified into the production scheduling model of Fig. 1 (integrated production of the hydraulic cylinder: (a) structure chart; (b) production layout; (c) job shop; (d) assembly shop).
Table 2
Model parameters and indices.

| Sets | Description | Indices | Description |
| --- | --- | --- | --- |
| J | Set of jobs | j | Index of a job, \(j\in J\) |
| M | Set of machines | m | Index of a machine, \(m\in M\) |
| O | Set of operations | k | Index of an operation, \(k\in O\) |

| Job-related parameters | Description |
| --- | --- |
| at | Arrival time of the products |
| d | Delivery time of the products |
| apt | Assembly time of the products |
| \({o_{jf}}\) | The first operation of the process path |
| \({o_{jl}}\) | The last operation of the process path |
| \({t_{jkm}}\) | Processing time of operation \({k_j}\) on machine m |
| \({S_{jkm}}\) | Start time of operation \({k_j}\) on machine m |
| \({C_{jkm}}\) | Completion time of operation \({k_j}\) on machine m |
| \({C_j}\) | Completion time of the last operation of job \(j\) |
| \({C_{\hbox{max} }}\) | Makespan (maximum completion time of the last operation over all jobs \(j\)) |
| L | Extreme value (a very large positive constant) |
Each cylinder31–32 is assembled from several components: body, bottom, piston, piston rod, lifting lug, O-ring, seal ring, piston pin, and wiper, as shown in Fig. 1(a). The cylinder body 3 is generally made of seamless steel pipe, and its internal surfaces require high machining accuracy. Piston 4 and piston rod 6 are connected by snap ring 2. Piston rod 6 is guided by guide sleeve 7 and sealed by seal ring 5. Cylinder bottom 1 and body 3 each have an oil inlet/outlet port. When the right chamber of the hydraulic cylinder is filled with oil, the piston moves left; conversely, it moves right.
The shop floor is divided into two areas: the job shop and the assembly shop. A product starts from an order and is finished with assembly (Fig. 1(b)). The job shop is equipped with three machines (fine turning, CNC milling, and electric spark) (Fig. 1(c)). The assembly shop has two assembly robots (\({A_1}\), \({A_2}\)) (Fig. 1(d)). Each operation can be completed by multiple alternative machines. After each operation k is completed, job j enters the quality control center for inspection. If the quality is acceptable, the job moves to the next operation k + 1; otherwise, it is returned to the current operation to be queued and reworked. The assembly operation is a complete-kit assembly, meaning that it does not begin until all jobs are completed.
Since the assembly operation is relatively short and its duration fixed, the planned start time of assembly can be derived backwards from the delivery date of the order. In this paper, we therefore concentrate on scheduling the job shop so that the completion time of each job is as close as possible to the planned start time of assembly. The assembly relationship is captured by the assembly constraint level defined below.
Based on the above example, the FDSSP can be described as follows. Suppose n jobs are to be processed in a job shop equipped with \(m\) machines. Each job j (\(j \in \{1,2,...,N\}\)), consisting of \(O\) operations k (\(k \in \{1,2,...,O\}\)), must be processed according to its specified route. Each operation k can be processed on any capable machine m (\(m \in \{1,2,...,{M_{ij}}\}\)) among the \({M_{ij}}\) alternative machines, and a machine m can process different operations k of different jobs. Hence, the processing time of an operation \(k\) may differ greatly across machines, which makes the study of scheduling algorithms particularly significant. The model parameters and indices are listed in Table 2.
Assembly constraint level definition:
A job that follows assembly is referred to as a constrained job, and a job that precedes assembly is referred to as a front job. First, according to the assembly constraint relationships, the constraint level of every job that has no tight front constraint is set to 1. The jobs with undefined constraint levels form a job set, denoted by U. Then, jobs \({J_k}\) are taken out of U in sequence, and the set \({J_{set}}\) of all tight front jobs of \({J_k}\) is formed. If the constraint levels of all jobs in \({J_{set}}\) have already been determined, the level of job \({J_k}\) is set to \(L({J_k})=\hbox{max} (L({J_{set}}))+1\); otherwise, job \({J_k}\) is put back into U. This is repeated until the constraint levels of all jobs have been determined. A minimal Python sketch of this labelling procedure is given below.
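The following sketch illustrates the labelling rule under assumed data structures: `predecessors[j]` is a hypothetical mapping from a job to its set of tight front jobs and is not part of the model definition.

```python
# Minimal sketch of the assembly constraint level labelling described above.
def assembly_levels(jobs, predecessors):
    """jobs: iterable of job ids; predecessors[j]: set of tight front jobs of j
    (an empty/absent entry means j has no tight front constraint)."""
    level = {j: 1 for j in jobs if not predecessors.get(j)}   # rule 1: no front constraint -> level 1
    undetermined = [j for j in jobs if j not in level]        # job set U
    while undetermined:
        j = undetermined.pop(0)
        front = predecessors[j]                               # J_set: tight front jobs of J_k
        if all(p in level for p in front):
            level[j] = max(level[p] for p in front) + 1       # L(J_k) = max(L(J_set)) + 1
        else:
            undetermined.append(j)                            # put J_k back into U
    return level

# Example: an assembly 'asm' built from 'body' and 'piston'
# assembly_levels(['body', 'piston', 'asm'], {'asm': {'body', 'piston'}})
# -> {'body': 1, 'piston': 1, 'asm': 2}
```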
Other assumptions are considered as follows:
- The processing time of each operation on each machine is deterministic and known.
- Each job follows exactly one process path, and an operation can be processed by only one machine at a time.
- The sum of the start time and processing time of an operation is less than or equal to the completion time of that operation.
- The completion time of the previous operation is less than or equal to the start time of the next operation.
- The completion time of a product is the sum of its processing time and its assembly time.
- The operation of each machine is cyclic.
- The transfer time of a job from the job shop to the assembly shop is neglected.
Decision variables:
$$x_{jkm}=\begin{cases}1, & \text{if operation } k \text{ of job } j \text{ is processed on machine } m,\\ 0, & \text{otherwise.}\end{cases}$$
According to the reviewed literature6,33–36, makespan is the most extensively studied objective. In this study, the objective of the model is as follows:
Makespan:
$$\hbox{min} ({C_{\hbox{max} }})=\hbox{min} (\mathop {\hbox{max} }\limits_{{1 \leqslant j \leqslant n}} ({C_j}))=\hbox{min} (\sum\nolimits_{1}^{n} {{x_{jkm}}{t_{jkm}}} )$$
(1)
Transformation of scheduling problem
Definition of state-space
The state features reflect the main characteristics of the production environment, and the division of the state space is the basis for the reasonable selection of scheduling rules for the system. Nevertheless, because the production environment changes constantly, the complete system state is continuous and is often described by tens of state characteristics on the job shop.
To describe the state space in detail, the state features are defined to satisfy the following requirements:
- The state features describe the main characteristics and changes of the scheduling environment in detail, including both the global and the local features of the system.
- The states of all problems are represented by a common feature set.
- Different scheduling problems can be represented and summarized by the state features.
- A state feature is a numerical representation of state variables.
- The state should be easy to calculate.
Table 3
State features.

| No. | State features | Description |
| --- | --- | --- |
| 1 | \({x_{m,1}}=\sum\nolimits_{{j=1,k=1,{P_{j,k}}=0}}^{{{O_{jp}}}} {{T_{j,k}}}\) | Total time of the operations to be processed on machine m |
| 2 | \({x_{m,2}}=\sum\nolimits_{{j=1,k=1,{P_{j,k}}=1}}^{{{O_{jp}}}} {{T_{j,k}}}\) | Total time of the operations already processed on machine m |
| 3 | \({x_{m,3}}=J_{{m,1}}^{{{P_{j,k}}=0}}\) | Time of the first operation in the sequence \(list(m)\) to be processed on machine m |
| 4 | \({x_{m,4}}=J_{{m,2}}^{{{P_{j,k}}=0}}\) | Time of the second operation in the sequence \(list(m)\) to be processed on machine m |
| 5 | \({x_{m,5}}=W_{{j,1}}^{{{P_{j,k}}=0}}\) | Among all future operations, the time of the first operation in the sequence \(list(m)\) to be processed on machine m |
| 6 | \({x_{m,6}}=W_{{j,2}}^{{{P_{j,k}}=0}}\) | Among all future operations, the time of the second operation in the sequence \(list(m)\) to be processed on machine m |
| 7 | \({x_{m,7}}=\sum\nolimits_{{i=1}}^{n} {(1 - {P_{j,k}})}\) | Total number of all future operations on machine m |
| 8 | \({x_{m,8}}=\sum\nolimits_{{i=1}}^{n} {(1 - {P_{j,k}})} {T_{j,k}}\) | Total time of all future operations on machine m |
| 9 | \({x_{m,9}}=\begin{cases}0, & \text{if machine } m \text{ is idle}\\ 1, & \text{if machine } m \text{ is processing a job}\end{cases}\) | Machine state |
| 10 | \({x_{m,10}}=\sum\nolimits_{{{J_k}=1}}^{n} {L({J_k})}\) | Sum of the assembly constraint levels of all jobs on machine m |
Table 4
Set of candidate behaviors for each machine.

| No. | SCH | Description |
| --- | --- | --- |
| 1 | First come first served (FIFO) | Process jobs in the order of their arrival |
| 2 | Shortest processing time (SPT) | Sort jobs by their total processing time on all machines, from shortest to longest |
| 3 | Shortest remaining processing time (SRPT) | Sort jobs by their remaining processing time on all machines, from shortest to longest |
| 4 | Most operations remaining (MOR) | Sort jobs by the number of remaining operations on all machines, from most to fewest |
| 5 | Earliest due date (EDD) | Sort jobs by due date, from earliest to latest |
| 6 | Apparent tardiness cost (ATC) | Sort jobs by tardiness cost, from smallest to largest |
| 7 | Total least operations remaining (TLOPR) | Sort jobs by their assembly-related constraints, from smallest to largest |
| 8 | Select no job (SNJ) | The machine does not select any job for processing |
To simplify the formulas, the processing state of an operation is recorded as \({P_{jk}}\in \{0,1\}\): \({P_{jk}}=0\) if the operation has not yet been processed and \({P_{jk}}=1\) if it has been processed. The operations to be processed on a machine are arranged in descending order of processing time, and the resulting sequence is denoted \(list(m)=\{ {J_{m1}},{J_{m2}},...,{J_{m{v_m}}}\}\), where \({v_m}\) is the number of operations to be processed on machine m. As shown in Table 3, we define ten state features of the shop environment; a partial sketch of how they can be computed is given below.
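The sketch below computes a subset of these features for one machine under assumed data structures: the tuple layout of `ops` and the handling of the look-ahead features \(x_{m,5}\) and \(x_{m,6}\) are simplifying assumptions, not the paper's implementation.

```python
import numpy as np

def machine_state(ops, m, machine_busy, assembly_level):
    """ops: list of (job, op_index, machine, proc_time, P_jk) tuples;
    P_jk = 0 means not yet processed, 1 means already processed."""
    pending = sorted((t for j, k, mm, t, p in ops if mm == m and p == 0),
                     reverse=True)                       # list(m), longest time first
    done_time = sum(t for j, k, mm, t, p in ops if mm == m and p == 1)
    x = np.zeros(10)
    x[0] = sum(pending)                                  # x_{m,1}: time still to be processed
    x[1] = done_time                                     # x_{m,2}: time already processed
    x[2] = pending[0] if len(pending) > 0 else 0.0       # x_{m,3}: first operation in list(m)
    x[3] = pending[1] if len(pending) > 1 else 0.0       # x_{m,4}: second operation in list(m)
    # x[4], x[5]: look-ahead variants of the two features above (omitted in this sketch)
    x[6] = len(pending)                                  # x_{m,7}: number of future operations
    x[7] = sum(pending)                                  # x_{m,8}: total future processing time
    x[8] = 1.0 if machine_busy else 0.0                  # x_{m,9}: machine state
    x[9] = sum(assembly_level[j] for j, k, mm, t, p in ops if mm == m)  # x_{m,10}
    return x
```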
Definition of action space
Panwalker and Iskander37, summarizing previous studies, elaborated 113 different dispatching rules together with the problem types and performance measures for which they are useful. A set of scheduling heuristics (SCH) is chosen to define the candidate set of behaviors for each machine, so that reinforcement learning over these priority-assignment rules can overcome their short-sighted nature. Behaviors that are relevant as well as those that are irrelevant to the current transition should both be included, to take full advantage of existing scheduling theory and of the agent's ability to learn from it. In Table 4, eight common behaviors are selected as the candidate set; a minimal sketch of some of these rules follows.
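As an illustration, the sketch below encodes a few of the Table 4 behaviors as priority keys over a machine's waiting queue. The job attribute names are assumptions, and the ATC rule is omitted for brevity.

```python
# Priority keys: the queued job minimizing the key is processed next.
RULES = {
    "FIFO":  lambda job: job["arrival"],          # earliest arrival first
    "SPT":   lambda job: job["total_proc"],       # shortest total processing time
    "SRPT":  lambda job: job["remaining_proc"],   # shortest remaining processing time
    "MOR":   lambda job: -job["remaining_ops"],   # most operations remaining first
    "EDD":   lambda job: job["due"],              # earliest due date first
    "TLOPR": lambda job: job["assembly_level"],   # assembly-related constraint level
}

def dispatch(queue, rule):
    """Return the next job for the machine, or None for SNJ or an empty queue."""
    if rule == "SNJ" or not queue:
        return None
    return min(queue, key=RULES[rule])
```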
Definition of rewards
The definition of the reward function is closely related to the objective function. The agent is rewarded, according to the reward function, for the change in system state that results from executing the combined behavior. The reward function is defined according to the following rules:
- The immediate reward for each state transition reflects the immediate effect of the action performed, i.e., its short-term impact on the scheduling plan.
- The cumulative total reward reflects the long-term outcome of the executed strategy, i.e., the optimal value of the objective function.
- The reward function can be applied to scheduling problems of different sizes.
The literature38 shows a direct relationship between \({C_{\max}}\) and machine utilization (minimizing the makespan is equivalent to maximizing machine utilization). This study is devoted to minimizing the makespan. The immediate reward earned for each state transition reflects the immediate, short-term impact of the action on the scheduling scheme, whereas the cumulative reward reflects the long-term effect, which is what RL maximizes.
$${U_{ave}}(t)=\frac{1}{m}\sum\nolimits_{{k=1}}^{m} {{U_k}(t)} =\frac{1}{m}\,\frac{{\sum\nolimits_{1}^{n} {\sum\nolimits_{{k=1}}^{{{O_i}(t)}} {{t_{jkm}}{x_{jkm}}} } }}{{{C_{\hbox{max} }}(t)}}$$
(2)
where \({U_{ave}}(t)\) is the average machine utilization rate, \({C_{\hbox{max} }}(t)\) denotes the completion time of the last operation assigned to machine m at scheduling point t, and \({O_i}(t)\) is the number of operations of job i that have been assigned so far. The utilization rate of machine m is defined as \({U_k}(t)=\frac{{\sum\nolimits_{1}^{n} {\sum\nolimits_{{k=1}}^{{{O_i}(t)}} {{t_{jkm}}{x_{jkm}}} } }}{{{C_{\hbox{max} }}(t)}}\). Letting \({r_k}={U_k}(t) - {U_{k - 1}}(t)\), the cumulative reward R can be calculated as follows:
$$R{\text{=}}\sum\nolimits_{{k=1}}^{K} {{r_k}} =\sum\nolimits_{{k=1}}^{K} {({U_k}(t) - {U_{k - 1}}(t))} ={U_k}(t)$$
(3)
Proof:
$$\begin{gathered} \sum\nolimits_{1}^{K} {({U_k}(t) - {U_{k - 1}}(t))} \hfill \\ ={U_1}(t) - {U_0}(t)+{U_2}(t) - {U_1}(t)+...+{U_k}(t) - {U_{k - 1}}(t) \hfill \\ ={U_k}(t) - {U_0}(t) \hfill \\ ={U_k}(t)=\frac{P}{{{C_{\hbox{max} }}(t)}} \hfill \\ \end{gathered}$$
(4)
where k counts the allocated operations and can be viewed as the discrete time step of the RL process, \(P=\sum\nolimits_{1}^{n} {\sum\nolimits_{{k=1}}^{{{O_i}(t)}} {{t_{jkm}}{x_{jkm}}} }\), and \({U_k}(t)\) and \({C_{\hbox{max} }}(t)\) are the machine utilization and the makespan at time step k, respectively. A minimal sketch of this utilization-based reward is given below.
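The sketch below assumes the reward is derived from the average machine utilization of Eq. (2); variable names are illustrative.

```python
def utilization(assigned_time, c_max, n_machines):
    # U(t) = P / (m * C_max(t)); P is the total processing time assigned so far
    return assigned_time / (n_machines * c_max) if c_max > 0 else 0.0

def step_reward(prev_u, assigned_time, c_max, n_machines):
    # r_k = U_k(t) - U_{k-1}(t); the rewards telescope to R = U_K(t), cf. Eq. (4)
    u = utilization(assigned_time, c_max, n_machines)
    return u - prev_u, u
```

Because the step rewards telescope, maximizing the cumulative reward is equivalent to maximizing the final utilization, i.e., minimizing the makespan for a fixed total processing load.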
Proposed methods for the FDSSP
Related work of RL
RL is a specific class of machine learning (ML) problems that can achieve global optimality38–40. In an RL model41, the decision-maker observes the environment, chooses an appropriate action, and is rewarded for doing so. RL algorithms do not need to enumerate the states or know the state transition probability matrix during the iterations. RL is formulated as finding the optimal solution of a Markov decision model and is mainly used to solve sequential decision problems. Its most important feature is that there is no labelled correct answer in the learning process; learning is driven by reward signals instead.
Markov decision processes (MDP)
The Markov property states that the next state \({s_{t+1}}\) of an RL system depends only on the current state \({s_t}\). A Markov decision process is described by the 5-tuple
$$E{\text{=}}\{ S,A(s),P,R,\gamma \}$$
(5)
where S is a finite set of states, characterizing the description of the environmental state; A(s) is a finite set of actions, representing the set of behaviors that can be taken; P is the state transition probability function; R is the reward function; and \(\gamma\) is the discount factor.
Algorithm 1 Procedure of TD with evaluating state value

1: For episode = 1 : M do
2:   \(\forall s\in S\), initialize the state value \(V\left(s\right)\);
3:   Set the initial state \({s}_{0}\) as the current state \({s}_{t}\);
4:   Select an action according to \({s}_{t}\), the state value \(V\left(s\right)\), and the strategy \(\pi\);
5:   Perform the action \({a}_{t}\), determine the state \({s}_{t+1}\) at the next decision moment, and calculate the reward \({r}_{t+1}\);
6:   Update \(V\left({s}_{t}\right)\) according to Eq. (8);
7:   If \({s}_{t+1}\) is not the terminal state:
8:     \(t=t+1\), go to step 4;
9:   End if
10: End for
The objective of RL is to let the agent find, through continuous experimentation in the environment, an optimal strategy \({\pi ^ * }\) that maximizes the expected cumulative reward obtained by following the strategy from any state. This is made precise by defining value functions. The state-value function \({v_\pi }(s)\) and the action-value function \({q_\pi }(s,a)\) under strategy \(\pi\) are defined as follows:
$$\begin{gathered} {v_\pi }(s)=E[{G_t}|{S_t}=s] \hfill \\ ={E_\pi }[\sum\nolimits_{{k=0}}^{\infty } {{\gamma ^k}{r_{t+k+1}}|{S_t}=s} ] \hfill \\ =\sum\nolimits_{{a \in A}} {\pi (a|s)} {q_\pi }(s,a) \hfill \\ =\sum\nolimits_{{a \in A}} {\pi (a|s)} [R_{s}^{a}+\gamma \sum\nolimits_{{s' \in S}} {p(s'|s,a)} {v_\pi }(s')] \hfill \\ \end{gathered}$$
(6)
Substituting the optimal strategy into Bellman's expectation equation yields the Bellman optimality equation:
$${V^ * }(s)=\mathop {\hbox{max} }\limits_{\pi } {V_\pi }(s)=\mathop {\hbox{max} }\limits_{a} \bigl[R_{s}^{a}+\gamma \sum\nolimits_{{s'}} {P_{{ss'}}^{a}} {V^ * }(s')\bigr]$$
(7)
Temporal difference algorithm
The TD algorithm42, combining Monte Carlo and dynamic programming methods, uses the classical Bellman formula to iterate until the value function converges. The basic iteration formula is as follows:
$$V({s_t})=V({s_t})+\alpha [{r_{t+1}}+\gamma V({s_{t+1}}) - V({s_t})]$$
(8)
where \({r_{t+1}}+\gamma V({s_{t+1}})\) is the TD target, \({r_{t+1}}+\gamma V({s_{t+1}}) - V({s_t})\) is the TD error, and \(\alpha\) is the learning rate. The procedure for estimating \(V(s)\) is given in Algorithm 1; a minimal tabular sketch is shown below.
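The sketch below illustrates the update of Eq. (8) and the loop of Algorithm 1; the Gym-style `env`, the `policy` callback, and hashable states are assumptions.

```python
from collections import defaultdict

def td0(env, policy, episodes=800, alpha=5e-4, gamma=0.95):
    V = defaultdict(float)                      # V(s), initialized to zero
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s, V)                    # choose an action from s_t, V(s), and pi
            s_next, r, done, _ = env.step(a)
            target = r if done else r + gamma * V[s_next]   # TD target
            # V(s_t) <- V(s_t) + alpha * [target - V(s_t)], as in Eq. (8)
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```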
Deep learning model
Deep neural network
Deep learning43 is a type of representation learning based on artificial neural networks. Deep neural network structures have greater capacity and an exponentially larger representation space, which makes it possible to learn and represent a variety of features with significantly fewer neurons.
The recent success of deep learning relies heavily on massive amounts of training data, flexible models, sufficient computing power, and prior knowledge to fight the curse of dimensionality. Hinton44 proposed combining pre-training and fine-tuning to drastically reduce the time needed to train a multi-layer neural network. Various optimization techniques have since emerged to further alleviate the vanishing-gradient problem; in particular, deep residual connections45 enable networks of more than a hundred layers.
Activation function
The activation function, a central unit in the design of neural networks, gives neurons the ability to learn and adapt46–47. It introduces nonlinearity into the network to overcome the limited expressive power of linear models. Without an activation function, the output of each layer is a linear function of the inputs of the previous layer, so no matter how many layers the network has, the output remains a linear combination of the inputs. Common activation functions include the step, Sigmoid, and Tanh functions, and approximate biological neuronal activation functions such as ReLU, Leaky ReLU, and Softplus. Because the latter outperform the traditional functions in most network applications, ReLU is used in this paper.
Optimization function
The choice of optimization function48 is one of the core problems in neural network training; a good optimizer not only speeds up the solution process but also reduces the influence of the hyperparameters on it. Common optimization algorithms in research applications are stochastic gradient descent (SGD), the adaptive gradient algorithm (AdaGrad), root mean square propagation (RMSProp), and Adam.
In this paper, the deep neural network consists of seven fully connected layers: one input layer, five hidden layers, and one output layer. Figure 2(b) gives the structure of the network; a minimal sketch of such a network is given below.
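The following TensorFlow/Keras sketch builds a network of this shape; the layer widths are illustrative assumptions, while ReLU and Adam follow the choices stated above.

```python
import tensorflow as tf

def build_value_network(state_dim, hidden=64, lr=5e-4):
    layers = [tf.keras.layers.Dense(hidden, activation="relu",
                                    input_shape=(state_dim,))]        # input + first hidden layer
    layers += [tf.keras.layers.Dense(hidden, activation="relu") for _ in range(4)]
    layers += [tf.keras.layers.Dense(1)]                              # scalar state value V(s; theta)
    model = tf.keras.Sequential(layers)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="mse")                                         # squared TD error, cf. Algorithm 2
    return model
```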
Exploration and exploitation
FDSSP can be classified as a multi-stage decision-making problem with a terminal state. To balance exploration and exploitation in the agent's interactions with the environment, an \(\varepsilon\)-greedy strategy is used for selecting behaviors.
Algorithm 2 Procedure of DTDN

1: Input: initialize replay memory \(D\) to capacity \(N\)
2: Initialize the state-value function \(V\) with weights \(\theta\)
3: Initialize the target state-value function \(\widehat{V}\) with weights \({\theta }^{-}=\theta\)
4: For episode = 1, \(M\) do
5:   Initialize the sequence \({s}_{1}=\left\{{x}_{1}\right\}\) and the pre-processed sequence \({\varphi }_{1}=\varphi \left({s}_{1}\right)\)
6:   For \(t=1\), \(T\) do
7:     With probability \(\epsilon\) select a random action \({a}_{t}\) (Eq. (9)),
8:     otherwise select the greedy action \({a}_{t}={a^ * }(\varphi ({s}_{t}))\) according to Eq. (10)
9:     Execute action \({a}_{t}\) in the emulator and observe the reward \({r}_{t}\) and the image \({x}_{t+1}\)
10:    Set \({s}_{t+1}=({s}_{t},{a}_{t},{x}_{t+1})\) and pre-process \({\varphi }_{t+1}=\varphi \left({s}_{t+1}\right)\)
11:    Store the transition \(({\varphi }_{t},{a}_{t},{r}_{t},{\varphi }_{t+1})\) in \(D\)
12:    Sample a random minibatch of transitions \(({\varphi }_{j},{a}_{j},{r}_{j},{\varphi }_{j+1})\) from \(D\)
13:    Set \({y}_{j}=\begin{cases}{r}_{j+1}, & \text{if the episode terminates at step } j+1\\ {r}_{j+1}+\gamma {\max }_{a'}\widehat{V}\left({\varphi }_{j+1};{\theta }^{-}\right), & \text{otherwise}\end{cases}\)
14:    Perform a gradient descent step on \({[{y}_{j}-V(\varphi \left({s}_{j}\right);\theta )]}^{2}\) with respect to the network parameters \(\theta\)
15:    Every \(C\) steps reset \(\widehat{V}=V\)
16:  End for
17: End for
Table 5
Benchmarks.

| Instances | Benchmarks | Source |
| --- | --- | --- |
| Case01–05 | Kacem01–05 | Kacem |
| Case06–15 | Orb01–10 | Hurink-data |
| Case16–20 | Mt10c1–xxx | Barnes |
| Case21–35 | 01a–15a | ChambersBarnes |
| Case36–45 | MK01–10 | Brandimarte |
Table 6
Relative ratio of different computers in studies.

| Studies | DACS | SM | PSO | HTSA | DTDN |
| --- | --- | --- | --- | --- | --- |
| CPU | 2.7 GHz | 2.4 GHz | NaN | 1.7 GHz | 3.7 GHz |
| CPU mark | 7023 | 228 | NaN | 132 | 10203 |
| Relative ratio | 0.6883 | 0.0223 | NaN | 0.0129 | 1.0000 |
Table 7
Results comparison of makespan and CPU time in different methods on Kacem's test cases.

| Prob. | m | n | O | B&B LB | B&B UB | DACS CM | DACS T(s) | SM CM | SM T(s) | PSO CM | PSO T(s) | HTSA CM | HTSA T(s) | DTDN CM | DTDN T(s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Case01 | 4 | 5 | 5 | 11 | NaN | 11 | 0.490 | 12 | 2.580 | NaN | NaN | 11 | 0.150 | 11 | 1.567 |
| Case02 | 8 | 8 | 8 | 11 | NaN | 14 | 2.420 | 14 | 39.370 | 14 | NaN | 14 | 3.080 | 13 | 2.189 |
| Case03 | 10 | 7 | 7 | 11 | NaN | 11 | 2.100 | 11 | 110.000 | NaN | NaN | 11 | 2.580 | 11 | 3.218 |
| Case04 | 10 | 10 | 10 | 7 | NaN | 7 | 2.560 | 7 | 39.740 | 7 | NaN | 7 | 3.120 | 7 | 3.189 |
| Case05 | 15 | 10 | 10 | 11 | NaN | 11 | 3.790 | 11 | 865.230 | 11 | NaN | 11 | 25.130 | 11 | 5.142 |
The \(\varepsilon\)-greedy strategy selects the greedy behavior with probability \(1-\varepsilon\) (\(0<\varepsilon <1\)) and a random optional behavior with probability \(\varepsilon\), where \(\varepsilon\) is the exploration factor. Let \(P(s,a)\) denote the probability of selecting behavior a in the decision state s:
$$P(s,a)=\left\{ {\begin{array}{*{20}{c}} {1 - \varepsilon +\frac{\varepsilon }{{|A(s)|}},a={a^ * }(s)} \\ {\frac{\varepsilon }{{|A(s)|}},a \ne {a^ * }(s)} \end{array}} \right.$$
(9)
where \(A(s)\) is the set of candidate combined behaviors in state \(s\), \(\left|A(s)\right|\) is the number of behaviors that can be chosen in state \(s\), and \({a^*}(s)\) is the greedy behavior of the state, which follows from Eq. (7) as:
$${a^ * }(s)=\arg \hbox{max} [r_{{ss'}}^{a}+\gamma V(s')]$$
(10)
where \({r_{ss'}^{a}}\) is the immediate reward for taking the combined action a from state \(s\) to state \(s'\). A minimal sketch of this \(\varepsilon\)-greedy selection is given below.
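In the sketch, `value_of` is an assumed callback returning \(r_{ss'}^{a}+\gamma V(s')\) for a candidate combined action.

```python
import random

def epsilon_greedy(candidates, value_of, epsilon):
    """Pick a combined action: random with probability epsilon, greedy otherwise."""
    if random.random() < epsilon:
        return random.choice(candidates)      # exploration
    return max(candidates, key=value_of)      # greedy action a*(s), Eq. (10)
```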
Deep temporal difference network model
To briefly describe the implementation process, a workshop visualization (\(m=3\), \(n=3\)) is given in Fig. 2. Hexagons represent jobs, and hexahedra represent waiting queues of sufficient capacity.
At the start of processing, the scheduling system is in the initial state \({S_0}\): all jobs are in the first waiting queue \({Q_1}\) and all machines are free. The first machine then selects an action \(a(k)\) (\(1<k<8\)) and a job in queue \({Q_1}\) is selected for processing, while the other machines select action \(a(8)\). Whenever any machine completes an operation, the system moves to a new state \({S_t}\), in which each machine again selects an action to perform. When another operation is completed, the system moves to state \({S_{t+1}}\) and the agent receives a reward \({r_{t+1}}\), calculated from the interval between the two states. Since at each decision moment every machine simultaneously selects one action to execute, the system actually executes a multidimensional behavior composed of m sub-actions in state \({S_t}\) (\({a_{t+1}}=({a_1},{a_2},...,{a_m})\)).
Table 8
Design of the test case problems in sets: a (O5,4); b (O6,3).

| Case | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 5/19 | 4/42 | 3/85 | 7/59 | 3/87 | -/42 |
| 2 | 4/46 | 5/65 | 2/56 | 3/68 | 8/66 | -/41 |
| 3 | 5/65 | 4/12 | 2/78 | 6/25 | 3/53 | -/51 |
| 4 | 3/- | 5/- | 3/- | 3/- | 7/- | -/- |
Table 9
Results comparison of scheduling score and RPD in different methods on Orb data cases.

| Prob. | FIFO Score | SPT Score | LPT Score | SRPT Score | LRPT Score | MOR Score | EDD Score | QL Score | QL RPD | DDPG Score | DDPG RPD | DTDN Score | DTDN RPD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Case06 | 7.74e2 | 7.17e2 | 7.51e2 | 7.03e2 | 7.41e2 | 7.56e2 | 6.66e2 | 9.18e2 | 1.98e-2 | 8.75e2 | 0.1435 | 8.75e2 | 1.44e-1 |
| Case07 | 8.81e2 | 7.56e2 | 6.87e2 | 6.54e2 | 6.86e2 | 7.64e2 | 6.83e2 | 9.54e2 | 4.48e-2 | 8.86e2 | 0.1284 | 8.86e2 | 1.28e-1 |
| Case08 | 7.15e2 | 8.52e2 | 7.03e2 | 6.96e2 | 7.01e2 | 7.15e2 | 6.75e2 | 9.18e2 | 8.96e-2 | 8.74e2 | 0.1443 | 8.51e2 | 1.75e-1 |
| Case09 | 7.59e2 | 8.13e2 | 7.10e2 | 6.82e2 | 7.32e2 | 7.02e2 | 7.03e2 | 9.41e2 | 6.27e-2 | 8.88e2 | 0.1264 | 8.55e2 | 1.70e-1 |
| Case10 | 7.68e2 | 7.70e2 | 8.07e2 | 7.83e2 | 6.92e2 | 7.31e2 | 7.13e2 | 9.09e2 | 1.00e-1 | 8.49e2 | 0.1701 | 8.48e2 | 1.79e-1 |
| Case11 | 7.59e2 | 8.49e2 | 6.85e2 | 6.67e2 | 7.02e2 | 6.98e2 | 6.82e2 | 9.49e2 | 5.35e-2 | 9.13e2 | 0.0951 | 8.49e2 | 1.78e-1 |
| Case12 | 8.36e2 | 7.88e2 | 8.45e2 | 8.02e2 | 6.99e2 | 7.42e2 | 7.05e2 | 9.36e2 | 6.80e-2 | 8.48e2 | 0.1788 | 8.63e2 | 1.59e-1 |
| Case13 | 7.34e2 | 8.12e2 | 7.65e2 | 7.72e2 | 8.20e2 | 7.92e2 | 7.36e2 | 9.40e2 | 6.34e-2 | 8.80e2 | 0.1369 | 8.14e2 | 2.29e-1 |
| Case14 | 7.86e2 | 7.40e2 | 7.26e2 | 6.41e2 | 6.53e2 | 6.87e2 | 6.93e2 | 9.38e2 | 6.63e-2 | 8.63e2 | 0.1585 | 8.49e2 | 1.78e-1 |
| Case15 | 7.79e2 | 7.89e2 | 7.42e2 | 7.11e2 | 7.13e2 | 7.32e2 | 6.95e2 | 9.34e2 | 7.11e-2 | 9.75e2 | 0.1424 | 8.54e2 | 1.71e-1 |
When the system reaches the termination state \({S_T}\), all queues are empty and all jobs have been processed; a scheduling plan is thus obtained.
The output layer of a deep Q-network (DQN) uses several nodes to represent a finite number of discrete action values, so it cannot cover an exponentially large multidimensional action space. Moreover, when off-policy Q-learning evaluates action values online, it suffers from over-estimation, because the optimal value replaces the actually experienced value. Hence, the method cannot be applied directly to the multidimensional action space problem. On-policy temporal difference learning, in which state values are learned and action values are derived from them indirectly, is therefore proposed to replace Q-learning; this is suitable for selecting multidimensional actions, as shown in Algorithm 2.
Experimental study
To evaluate the validity of the proposed algorithm, experiments are conducted on different test cases in four parts. First, using the standard test set established by Kacem49, we compare eight small-scale cases with other algorithms16,50–52 in Section 5.1. Then, in Section 5.2, we compare the proposed DTDN algorithm with the Q-learning (QL) algorithm (Jiménez53) and the deep deterministic policy gradient (DDPG) algorithm (Liu54) on different performance measures using the Brandimarte55 instances. Moreover, for large-scale instances, we designed our own test cases, comprising 30 FDSSP problems of varying complexity, as shown in Section 5.3. Finally, in Section 5.4, we illustrate in detail the application of the algorithm to a case study of the hydraulic cylinder production scheduling problem.
The DTDN algorithm is coded in Python 3.7 on JetBrains PyCharm Community Edition 2019.2.1 x64 and runs on an Intel Core i9-10900X @ 3.7 GHz CPU with 16 GB RAM. First, we build the FDSSP environment, machine, and job classes in an object-oriented manner on the RL platform OpenAI Gym. Gym specifies the main member methods of an environment class as a framework, including init, reset, step, render, and close; a minimal skeleton is sketched below. An agent that executes the algorithm then interacts iteratively with the environment. The deep neural network model of the agent is implemented with the TensorFlow back end. The experimental data are listed in Table 5.
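The class name, dimensions, and trivial method bodies below are illustrative assumptions, not the implementation used in this study; the sketch only shows the Gym interface the paper mentions.

```python
import gym
from gym import spaces
import numpy as np

class FDSSPEnv(gym.Env):
    def __init__(self, n_machines=3, n_features=10, n_rules=8):
        super().__init__()
        # one combined action = one dispatching rule (Table 4) per machine
        self.action_space = spaces.MultiDiscrete([n_rules] * n_machines)
        # state = the Table 3 features for every machine
        self.observation_space = spaces.Box(low=0.0, high=np.inf,
                                            shape=(n_machines * n_features,),
                                            dtype=np.float32)

    def reset(self):
        # all jobs in the first waiting queue, all machines idle
        return np.zeros(self.observation_space.shape, dtype=np.float32)

    def step(self, action):
        # advance the simulation to the next decision point and compute the
        # utilization-based reward of Eq. (3); omitted in this sketch
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        reward, done, info = 0.0, True, {}
        return obs, reward, done, info

    def render(self, mode="human"):
        pass

    def close(self):
        pass
```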
The selection of parameters may affect the quality of the solution, so general principles are followed. The discount factor \(\gamma\) weighs the contribution of subsequent state values to the total return and therefore generally takes a value close to 1 (here \(\gamma =0.95\)). To allow full exploration of the strategy space during the initial phase of the iterations, the \(\epsilon\)-greedy strategy starts with \(\varepsilon =1\) and decays it at a rate of 0.995. The learning rate is \(\alpha =5 \times {10^{ - 4}}\), the maximum number of episodes is MAX-EPISODE = 800, the capacity of memory D is \(N=6000\), and the sample batch size is BATCH-SIZE = 64. The deep neural network of the agent is shown in Fig. 2(b); its parameters are randomly initialized.
Performance metrics: The relative percentage deviation (RPD) and average relative percentage deviation (ARPD) are described as follows:
\(RPD=\frac{{{C_{\hbox{max} }} - LB}}{{LB}}\)
where \({C_{\hbox{max} }}\) is the best result obtained by an algorithm and \(LB\) is the lower bound obtained by the branch and bound algorithm; the latter represents the most ideal result and may not be achievable. A minimal sketch of these metrics is given below.
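In the sketch, ARPD is assumed to be the mean RPD over a set of runs or instances.

```python
def rpd(c_max, lb):
    return (c_max - lb) / lb

def arpd(c_max_values, lb):
    return sum(rpd(c, lb) for c in c_max_values) / len(c_max_values)

# Example: for Ins01 in Table 11, rpd(423, 416) ~= 0.0168, i.e. the reported 1.68e-2.
```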
Small-scale FDSSP
To prove the validity of the proposed solution process, the cases proposed by Kacem are used for validation. Each case is characterized by the number of jobs (n), the number of machines (m), and the operations of the jobs (\({O_{ij}}\)); for example, \(n \times m\) denotes a case consisting of n jobs and m machines.
Table 10
Results comparison of scheduling score and RPD in different methods on the Barnes, ChambersBarnes, and Brandimarte data cases (Case16–45).

| Prob. | m | n | O | B&B LB | B&B UB | HGA Score | HGA RPD | MG Score | MG RPD | HMEA Score | HMEA RPD | DTDN Score | DTDN RPD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Case16 | 10 | 11 | 1 | 6.55e2 | 9.27e2 | 9.27e2 | 4.15e-1 | 9.28e2 | 4.17e-1 | NaN | NaN | 9.12e2 | 3.92e-1 |
| Case17 | 10 | 12 | 1 | 6.55e2 | 9.14e2 | 9.10e2 | 3.89e-1 | 9.10e2 | 3.89e-1 | NaN | NaN | 9.10e2 | 3.89e-1 |
| Case18 | 10 | 11 | 1 | 6.55e2 | 9.29e2 | 9.18e2 | 4.12e-1 | 9.18e2 | 4.02e-1 | NaN | NaN | 9.18e2 | 4.02e-1 |
| Case19 | 10 | 12 | 1 | 6.55e2 | 9.29e2 | 9.18e2 | 4.02e-1 | 9.18e2 | 4.02e-1 | NaN | NaN | 9.18e2 | 4.02e-1 |
| Case20 | 10 | 13 | 1 | 6.55e2 | 9.36e2 | 9.18e2 | 4.02e-1 | 9.06e2 | 3.83e-1 | NaN | NaN | 9.06e2 | 3.83e-1 |
| Case21 | 10 | 5 | 1 | 2.51e3 | 2.53e3 | 2.52e3 | 5.19e-3 | 2.52e3 | 4.01e-4 | 3.83e3 | 9.22e-2 | 2.52e3 | 6.39e-3 |
| Case22 | 10 | 5 | 1 | 2.23e3 | 2.24e3 | 2.23e3 | 1.36e-3 | 2.23e3 | 1.35e-3 | 3.40e3 | 5.26e-1 | 2.35e3 | 5.66e-2 |
| Case23 | 10 | 5 | 1 | 2.23e3 | 2.24e3 | 2.23e3 | 4.49e-4 | 2.23e3 | 4.49e-4 | 3.01e3 | 3.52e-1 | 2.23e3 | 2.24e-3 |
| Case24 | 10 | 5 | 1 | 2.50e3 | 2.57e3 | 2.52e3 | 4.79e-3 | 2.50e3 | 0.00e0 | 3.80e3 | 5.16e-1 | 2.52e3 | 7.59e-3 |
| Case25 | 10 | 5 | 1 | 2.19e3 | 2.23e3 | 2.22e3 | 1.28e-2 | 2.22e3 | 1.23e-2 | 3.33e3 | 5.21e-1 | 2.26e3 | 1.64e-2 |
| Case26 | 10 | 5 | 1 | 2.16e3 | 2.22e3 | 2.20e3 | 1.57e-2 | 2.20e3 | 1.90e-2 | 3.11e3 | 4.37e-1 | 2.20e3 | 1.94e-2 |
| Case27 | 15 | 8 | 1 | 2.19e3 | 2.41e3 | 2.31e3 | 5.49e-2 | 2.28e3 | 4.39e-2 | 3.88e3 | 7.72e-1 | 2.32e3 | 6.13e-2 |
| Case28 | 15 | 8 | 1 | 2.06e3 | 2.09e3 | 2.07e3 | 5.82e-3 | 2.07e3 | 3.88e-3 | 3.33e3 | 6.18e-1 | 2.08e3 | 9.70e-3 |
| Case29 | 15 | 8 | 1 | 2.06e3 | 2.07e3 | 2.07e3 | 2.43e-3 | 2.07e3 | 2.43e-3 | 2.98e3 | 4.47e-1 | 2.07e3 | 6.31e-3 |
| Case30 | 15 | 8 | 1 | 2.18e3 | 2.36e3 | 2.32e3 | 6.29e-2 | 2.29e3 | 5.19e-2 | 4.00e3 | 8.36e-1 | 2.35e3 | 7.94e-2 |
| Case31 | 15 | 8 | 1 | 2.02e3 | 2.08e3 | 2.07e3 | 0.00e0 | 2.06e3 | 2.28e-2 | 3.17e3 | 5.70e-1 | 2.08e3 | 3.32e-2 |
| Case32 | 15 | 8 | 1 | 1.97e3 | 2.05e3 | 2.03e3 | 3.10e-2 | 2.03e3 | 3.30e-2 | 3.24e3 | 6.46e-1 | 2.04e3 | 3.35e-2 |
| Case33 | 20 | 10 | 1 | 2.16e3 | 2.30e3 | 2.26e3 | 4.44e-2 | 2.26e3 | 4.59e-2 | 3.92e3 | 8.14e-1 | 2.28e3 | 5.51e-2 |
| Case34 | 20 | 10 | 1 | 2.16e3 | 2.18e3 | 2.17e3 | 2.78e-3 | 2.17e3 | 2.78e-3 | 3.45e3 | 5.96e-1 | 2.16e3 | 9.25e-4 |
| Case35 | 20 | 10 | 1 | 2.16e3 | 2.17e3 | 2.17e3 | 1.85e-3 | 2.17e3 | 2.78e-3 | 3.34e3 | 5.44e-1 | 2.17e3 | 5.09e-3 |

| Prob. | m | n | O | B&B LB | B&B UB | HGA Score | HGA RPD | SLGA Score | SLGA RPD | TLBO Score | TLBO RPD | DTDN Score | DTDN RPD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Case36 | 10 | 6 | 2 | 3.60e1 | 4.20e1 | 4.00e1 | 1.11e-1 | 4.00e1 | 1.11e-1 | 6.20e1 | 7.22e-1 | 4.20e1 | 1.67e-1 |
| Case37 | 10 | 6 | 3.5 | 2.40e1 | 3.20e2 | 2.60e1 | 8.33e-2 | 2.70e1 | 1.25e-1 | 4.80e1 | 1.00e0 | 3.00e1 | 2.50e-1 |
| Case38 | 15 | 8 | 3 | 2.04e2 | 2.11e2 | 2.04e2 | 0.00e0 | 2.04e2 | 0.00e0 | 3.74e2 | 8.33e-1 | 2.04e2 | 0.00e0 |
| Case39 | 15 | 8 | 2 | 4.80e1 | 8.10e1 | 6.00e1 | 2.50e-1 | 6.00e1 | 2.50e-1 | 1.36e2 | 2.44e0 | 6.20e1 | 2.92e-1 |
| Case40 | 15 | 4 | 1.5 | 1.68e2 | 1.86e2 | 1.72e2 | 2.38e-2 | 1.72e2 | 2.38e-2 | 2.65e2 | 5.77e-1 | 1.72e2 | 2.38e-2 |
| Case41 | 10 | 15 | 3 | 3.30e1 | 8.60e2 | 5.80e1 | 7.58e-1 | 6.90e1 | 1.10e01 | 9.40e1 | 1.85e0 | 9.00e1 | 1.73e1 |
| Case42 | 20 | 5 | 3 | 1.33e2 | 1.57e2 | 1.39e2 | 4.51e-2 | 1.44e2 | 8.27e-2 | 2.46e2 | 8.50e-1 | 1.58e2 | 1.88e-1 |
| Case43 | 20 | 10 | 1.5 | 5.23e2 | NaN | 5.23e2 | 0.00e0 | 5.23e2 | 0.00e0 | 6.23e2 | 1.91e-1 | 5.23e2 | 0.00e0 |
| Case44 | 20 | 10 | 3 | 2.99e2 | 3.69e2 | 3.07e2 | 2.68e-2 | 3.20e2 | 7.02e-2 | 3.92e2 | 3.11e-1 | 3.20e2 | 7.02e-2 |
| Case45 | 20 | 15 | 3 | 1.65e2 | 2.96e2 | 1.97e2 | 1.94e-1 | 2.54e2 | 5.39e-1 | 2.75e2 | 6.67e-1 | 2.41e2 | 4.61e-1 |
Literature using the same case study as this paper is selected for comparison (Zhang16 (DACS); Xing et al.50 (SM); Moslehi51 (PSO); Li et al.52 (HTSA)), which ensures the credibility of the comparison results. Each case is run ten times to obtain the combined optimal solution. The CPU times of the various algorithms are scaled by the "relative ratio" obtained from https://www.cpubenchmark.net/ (Table 6). The results in Table 7 show that the optimal solutions of DTDN and of the other algorithms coincide with the LB, while the CPU running times differ significantly for the small-scale (Kacem) problems.
For the case designed in Table 8(a), it can be seen that the algorithm progressively generates an optimal production schedule (35). The optimal policy set \(\{ {\pi ^ * }\} =\{ (6,8,8,8),(2,1,8,8),...,(8,8,8,1)\}\) gives the operation sequence of each job on the machines, where each tuple indicates the combined behavior of the four machines, i.e., the behavior numbers adopted in the corresponding state. At each decision time point, most machine waiting queues are empty or the machines are busy, so their feasible action space contains only \(a(8)\), which saves computation time. Moreover, the comparison of test results for Problem 2 (Table 8(b)) shows that the optimal solution of the DTDN algorithm (426) improves on the Nawaz–Enscore–Ham (NEH) algorithm (445) and the NEH-KK algorithm (438) by 4.3% and 2.7%, respectively.
Comparisons with the proposed dispatching rules
To verify the efficiency and generality of the proposed DTDN, the Brandimarte55 data set is adopted. The scores and RPD of the seven dispatching rules on each data case are tallied.
Table 11
Results comparison of makespan and RPD in the extended Nourali test cases.

| Prob. | n | m | O | B&B LB | B&B UB | CP \({C}_{max}\) | CP RPD | PSO \({C}_{max}\) | PSO RPD | DACS \({C}_{max}\) | DACS RPD | DTDN \({C}_{max}\) | DTDN RPD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ins01 | 6 | 2 | 2 | 4.16e2 | 5.23e2 | 4.23e2 | 1.68e-2 | 4.23e2 | 1.68e-2 | 4.23e2 | 1.68e-2 | 4.23e2 | 1.68e-2 |
| Ins02 | 6 | 2 | 2 | 4.56e2 | 5.12e2 | 4.64e2 | 1.75e-2 | 4.64e2 | 1.75e-2 | 4.64e2 | 1.75e-2 | 4.64e2 | 1.75e-2 |
| Ins03 | 6 | 2 | 2 | 5.10e2 | 6.17e2 | 5.17e2 | 1.37e-2 | 5.17e2 | 1.37e-2 | 5.17e2 | 1.37e-2 | 5.17e2 | 1.37e-2 |
| Ins04 | 6 | 2 | 2 | 3.78e2 | 4.42e2 | 3.89e2 | 2.91e-2 | 3.89e2 | 2.91e-2 | 3.89e2 | 2.91e-2 | 3.89e2 | 2.91e-2 |
| Ins05 | 6 | 2 | 2 | 4.31e2 | 5.36e2 | 4.31e2 | 0.00e0 | 4.31e2 | 0.00e0 | 4.31e2 | 0.00e0 | 4.60e2 | 6.73e-2 |
| Ins06 | 6 | 2 | 3 | 3.80e2 | 4.27e2 | 3.84e2 | 1.05e-2 | 3.84e2 | 1.05e-2 | 3.84e2 | 1.05e-2 | 3.82e2 | 5.26e-3 |
| Ins07 | 6 | 2 | 3 | 4.07e2 | 4.97e2 | 4.12e2 | 1.23e-2 | 4.12e2 | 1.23e-2 | 4.12e2 | 1.23e-2 | 4.12e2 | 1.23e-2 |
| Ins08 | 6 | 2 | 3 | 3.89e2 | 4.76e2 | 3.97e2 | 2.06e-2 | 3.97e2 | 2.06e-2 | 3.97e2 | 2.06e-2 | 3.97e2 | 2.06e-2 |
| Ins09 | 6 | 2 | 3 | 4.57e2 | 5.39e2 | 4.68e2 | 2.41e-2 | 4.68e2 | 2.41e-2 | 4.68e2 | 2.41e-2 | 4.68e2 | 2.41e-2 |
| Ins10 | 6 | 2 | 3 | 3.02e2 | 3.99e2 | 3.06e2 | 1.32e-2 | 3.06e2 | 1.32e-2 | 3.06e2 | 1.32e-2 | 3.06e2 | 1.32e-2 |
| Ins11 | 6 | 3 | 2 | 3.14e2 | 4.37e2 | 3.26e2 | 3.92e-2 | 3.26e2 | 3.82e-2 | 3.26e2 | 3.82e-2 | 3.26e2 | 3.82e-2 |
| Ins12 | 6 | 3 | 2 | 3.47e2 | 4.26e2 | 3.53e2 | 1.73e-2 | 3.53e2 | 1.73e-2 | 3.53e2 | 1.73e-2 | 3.53e2 | 1.73e-2 |
| Ins13 | 6 | 3 | 2 | 4.36e2 | 5.33e2 | 4.56e2 | 4.59e-2 | 4.58e2 | 4.95e-2 | 4.56e2 | 4.59e-2 | 4.56e2 | 4.59e-2 |
| Ins14 | 6 | 3 | 2 | 3.94e2 | 4.93e2 | 4.05e2 | 2.79e-2 | 4.05e2 | 2.79e-2 | 4.05e2 | 2.79e-2 | 4.05e2 | 2.79e-2 |
| Ins15 | 6 | 3 | 2 | 2.87e2 | 3.75e2 | 3.04e2 | 5.92e-2 | 3.04e2 | 5.92e-2 | 3.04e2 | 5.92e-2 | 3.04e2 | 5.92e-2 |
| Ins16 | 6 | 3 | 3 | 2.76e2 | 3.63e2 | 2.80e2 | 1.45e-2 | 2.93e2 | 6.16e-2 | 2.80e2 | 1.45e-2 | 2.77e2 | 3.62e-3 |
| Ins17 | 6 | 3 | 3 | 3.25e2 | 3.97e2 | 3.37e2 | 3.69e-2 | 3.37e2 | 3.69e-2 | 3.37e2 | 3.69e-2 | 3.37e2 | 3.69e-2 |
| Ins18 | 6 | 3 | 3 | 3.65e2 | 3.23e2 | 2.71e2 | -2.57e-1 | 2.74e2 | -2.49e-1 | 2.71e2 | -2.58e-1 | 2.71e2 | -2.58e-1 |
| Ins19 | 6 | 3 | 3 | 3.46e2 | 4.33e2 | 3.54e2 | 2.31e-2 | 3.54e2 | 2.31e-2 | 3.54e2 | 2.31e-2 | 3.54e2 | 2.31e-2 |
| Ins20 | 6 | 3 | 3 | 2.98e2 | 3.77e2 | 3.09e2 | 3.69e-2 | 3.09e2 | 3.69e-2 | 3.09e2 | 3.69e-2 | 3.08e2 | 3.36e-2 |
| Ins21 | 10 | 2 | 2 | 4.22e2 | 5.87e2 | 4.38e2 | 3.79e-2 | 4.47e2 | 5.92e-2 | 4.38e2 | 3.79e-2 | 4.36e2 | 3.32e-2 |
| Ins22 | 10 | 2 | 2 | 4.76e2 | 5.98e2 | 4.97e2 | 4.41e-2 | 4.97e2 | 4.41e-2 | 4.97e2 | 4.41e-2 | 4.97e2 | 4.41e-2 |
| Ins23 | 10 | 2 | 2 | 3.43e2 | 3.95e2 | 3.54e2 | 3.21e-2 | 3.54e2 | 3.21e-2 | 3.54e2 | 3.21e-2 | 3.54e2 | 3.21e-2 |
| Ins24 | 10 | 2 | 2 | 4.32e2 | 5.33e2 | 4.46e2 | 3.24e-2 | 4.46e2 | 3.24e-2 | 4.46e2 | 3.24e-2 | 4.46e2 | 3.24e-2 |
| Ins25 | 10 | 2 | 2 | 4.87e2 | 6.23e2 | 5.07e2 | 4.11e-2 | 5.21e2 | 6.98e-2 | 5.07e2 | 4.11e-2 | 5.07e2 | 4.11e-2 |
| Ins26 | 10 | 2 | 3 | 4.21e2 | 5.47e2 | 4.48e2 | 6.41e-2 | 4.69e2 | 1.14e-1 | 4.54e2 | 7.84e-2 | 4.68e2 | 1.12e-2 |
| Ins27 | 10 | 2 | 3 | 4.67e2 | 5.98e2 | 4.86e2 | 4.07e-2 | 5.03e2 | 7.71e-2 | 4.86e2 | 4.07e-2 | 4.86e2 | 4.07e-2 |
| Ins28 | 10 | 2 | 3 | 4.02e2 | 5.88e2 | 4.22e2 | 4.98e-2 | 4.39e2 | 9.20e-2 | 4.24e2 | 5.47e-2 | 4.22e2 | 4.98e-2 |
| Ins29 | 10 | 2 | 3 | 5.03e2 | 6.98e2 | 5.25e2 | 4.37e-2 | 5.34e2 | 6.16e-2 | 5.34e2 | 6.16e-2 | 5.25e2 | 4.37e-2 |
| Ins30 | 10 | 2 | 3 | 4.75e2 | 6.21e2 | 4.79e2 | 8.42e-3 | 4.79e2 | 8.42e-3 | 4.79e2 | 8.42e-3 | 4.79e2 | 8.42e-3 |
| Ins31 | 10 | 3 | 2 | 4.25e2 | 5.22e2 | 4.25e2 | 0.00e0 | 4.27e2 | 4.71e-3 | 4.25e2 | 0.00e0 | 4.25e2 | 0.00e0 |
| Ins32 | 10 | 3 | 2 | 4.24e2 | 5.17e2 | 4.34e2 | 2.36e-2 | 4.34e2 | 2.36e-2 | 4.34e2 | 2.36e-2 | 4.34e2 | 2.36e-2 |
| Ins33 | 10 | 3 | 2 | 3.35e2 | 4.73e2 | 3.68e2 | 9.85e-2 | 3.76e2 | 1.22e-1 | 3.70e2 | 1.05e-1 | 3.70e2 | 1.05e-2 |
| Ins34 | 10 | 3 | 2 | 3.74e2 | 5.27e2 | 4.01e2 | 7.22e-2 | 4.03e2 | 7.75e-2 | 4.01e2 | 7.22e-2 | 4.22e2 | 1.28e-2 |
| Ins35 | 10 | 3 | 2 | 3.86e2 | 4.87e2 | 4.09e2 | 6.00e-2 | 4.12e2 | 6.74e-2 | 4.09e2 | 6.00e-2 | 4.10e2 | 6.22e-2 |
| Ins36 | 10 | 3 | 3 | 3.93e2 | 5.07e2 | 4.15e2 | 5.60e-2 | 4.15e2 | 5.60e-2 | 4.15e2 | 5.60e-2 | 4.15e2 | 5.60e-2 |
| Ins37 | 10 | 3 | 3 | 4.13e2 | 5.37e2 | 4.52e2 | 9.44e-2 | 4.63e2 | 1.21e-2 | 4.52e2 | 9.44e-2 | 4.52e2 | 9.44e-2 |
| Ins38 | 10 | 3 | 3 | 3.87e2 | 5.89e2 | 4.19e2 | 8.27e-2 | 4.21e2 | 8.85e-2 | 4.19e2 | 8.27e-2 | 4.19e2 | 8.27e-2 |
| Ins39 | 10 | 3 | 3 | 4.97e2 | 6.27e2 | 5.21e2 | 4.83e-2 | 5.34e2 | 7.44e-2 | 5.21e2 | 4.83e-2 | 5.21e2 | 4.83e-2 |
| Ins40 | 10 | 3 | 3 | 3.69e2 | 4.79e2 | 3.86e2 | 4.61e-2 | 3.91e2 | 5.96e-2 | 3.88e2 | 5.15e-2 | 3.84e2 | 4.07e-2 |
As shown in Table 9, the proposed DTDN algorithm obtains better solutions than the other algorithms, and some of them are already below the upper bound of the original cases. The actions used in more than 10% of the decisions are FIFO, SPT, LPT, SRPT, LRPT, MOR, and EDD (Fig. 3). These actions contribute more to obtaining the optimal solution and thus have greater utilization value. The frequency distribution of the other actions is relatively even, but their contribution is not obvious. Therefore, other heuristic behaviors could be added to the candidate action space while eliminating some underutilized behaviors to streamline the action set.
Large-scale instances of FDSSP
To study the performance of DTDN on large-scale problems, the results are further compared on problem sizes ranging over the BC, DP, and BR data cases, 30 data cases in total. The solutions of the proposed DTDN are compared with those of Gao et al.56 (HGA), Mastrolilli and Gambardella57 (MG), Sun et al.58 (HMEA), Chen et al.59 (SLGA), and Reddy60 (teaching-learning-based optimization, TLBO), as shown in Table 10 and Fig. 4. The test results show that the proposed DTDN algorithm can find better solutions globally through extensive trial and error during the solution procedure. The obtained performance indices are better than those of traditional optimization methods for cases of different scales, demonstrating the validity of the DTDN algorithm for the FDSSP.
Case study: hydraulic cylinder production scheduling problem
Nourali7–8 proposed a useful FDSSP benchmark comprising 40 different data cases. The solutions of the proposed DTDN are compared with those of Huang et al.61 (particle swarm optimization, PSO), Zhang and Wong6 (constraint programming, CP), and Zhang et al.16 (distributed ant colony system, DACS). The results are displayed in Table 11 and Fig. 4, and the following conclusions can be drawn. The optimal solutions of the proposed algorithm all lie within [LB, UB], indicating that the solutions are valid, and the performance of DTDN is close to that of the other three algorithms. In terms of CPU time, the runtime is about as long as that of the other algorithms for small-scale cases, but DTDN is much more efficient on large-scale problems. Lastly, the algorithm is somewhat less capable of solving large-scale problems, because the scheduling state space is large, the learning error with the same network structure is large, and more iterations and a more optimized network structure are needed to reduce the training error.