Alpha-T: Learning to Traverse over Graphs with An AlphaZero-inspired Self-Play Framework

Abstract. Combinatorial optimization problems on graphs are core and classic problems in artificial intelligence and operations research. For example, the Vehicle Routing Problem (VRP) and the Traveling Salesman Problem (TSP) are not only interesting NP-hard problems but also of great significance for real transportation systems. Traditional methods such as heuristics, exact algorithms, and solution solvers can already find approximate solutions on small-scale graphs; however, they struggle with large-scale graphs and other problems with similar structures. Moreover, traditional methods often require manually designed heuristic functions to assist decision-making. In recent years, more and more work has focused on applying deep learning and reinforcement learning (RL) to learn heuristics, which allows us to learn the internal structure of the graph end-to-end and find the optimal path under the guidance of heuristic rules, but most of these still need manual assistance, and the RL methods used suffer from low sampling efficiency and a small searchable space. In this paper, we propose a novel framework (called Alpha-T) based on AlphaZero, which does not require expert experience or labeled data but is trained through self-play. We divide the learning into two stages: in the first stage we employ a graph attention network (GAT) and a GRU to learn node representations and memorize history trajectories, and in the second stage we employ Monte Carlo tree search (MCTS) and deep RL to search the solution space and train the model.


Introduction
Combinatorial optimization is an important tool for solving core problems in information theory, management science, computer science, artificial intelligence (AI), and other disciplines, and has extremely important applications in engineering, economics, the military, and many other areas [1]. Algorithms and computational complexity theories for solving and exploring combinatorial optimization problems have always been a focus of research in optimization theory, computer science, and other related fields. Well-known classic NP-hard problems such as the vehicle routing problem (VRP) [2] and the traveling salesman problem (TSP) [3] belong to the category of combinatorial optimization problems. Therefore, it is of great theoretical significance and practical value to quickly solve large-scale combinatorial optimization problems.
Over the years, many traditional algorithms such as approximation algorithms and heuristic algorithms have been designed to solve combinatorial optimization problems, including the A* algorithm, simulated annealing, genetic algorithms, nearest neighbor algorithms, and heuristic solution solvers such as "OR-Tools", "LKH3", and "Concorde". These algorithms are often well targeted and accurate for specific problems, but for different instances of similar problems we need to design new algorithms again and again; that is, previous solving experience does not help with the problems to be solved. Traditional algorithms do not make good use of the fact that, in practical applications, most combinatorial optimization problems in the same scenario have similar combinatorial structures and differ only in values and variables [4]. Therefore, researchers hope to design a general method that can mine the essential information of the problem through learning and improve the quality and efficiency of problem solutions by iteratively updating the solution policy.
In recent years, with the rapid development of big data and artificial intelligence technology, machine learning [5], especially deep learning [6] and RL [7] [8], is increasingly applied to solve combinatorial optimization problems [9] [10] [11]. In the face of huge search spaces and numbers of data points, combining the perceptual ability of deep learning with the reasoning ability of RL is a reasonable scheme. Deep RL [7] has made revolutionary breakthroughs and had extensive influence in the field of artificial intelligence, as shown by AlphaGo [12], AlphaGo Zero [13], and AlphaZero [14]. The fundamental motivation for applying deep learning and RL to combinatorial optimization lies in the discovery of and reasoning about new policies. Compared with traditional algorithms, instance-based machine learning methods can discover the internal characteristics of instances by learning and apply the solving experience from existing instances to guide the solving of future instances, which also makes it possible to solve NP-hard problems that were previously difficult to handle.
At present, the application of deep learning and RL to combinatorial optimization has obtained some preliminary results, but it is still at an experimental stage, because the problems they solve, including VRP, TSP, MVC, and so on [4], are often limited to a scale of hundreds of nodes, and the network architecture, reward function, or decoding process must be adjusted to solve problems of the same kind with similar structures, which falls short of complete universality and generalization. Most existing learning-based approaches essentially learn heuristics [15], but they lack the ability to learn heuristics independently and require manually designed heuristics to assist learning. Besides, it remains a challenge to learn heuristics more cheaply, intelligently, and efficiently, and to integrate state-of-the-art models with their training methods, because current RL has problems such as sparse rewards, low sampling efficiency, and limited space exploration.
Combinatorial optimization is similar to Go in that both seek feasible or optimal solutions under constrained conditions in a huge combinatorial space. In terms of search size, a combinatorial optimization problem on a graph can have a much larger search space than Go, so combinatorial optimization on graphs is no lower in complexity. AlphaZero won by thinking smarter rather than faster, discovering the principles of board play on its own to develop a style that reflects the truth of the game rather than programmer priorities and biases. It reveals the fact, and advocates a direction, that in the field of intelligent optimization we are not only competing on computing power, algorithms, and scale, but also on intelligence and insight. Inspired by AlphaZero, we designed a lightweight framework for combinatorial optimization that applies the Q-learning algorithm [16] with MCTS [17] to model the policy network and the value network. We first design a memory component that combines GAT [18] and GRU [19] to process the input graph; it can selectively memorize state trajectories, avoiding pre-training. We then employ the history trajectories to jointly model the policy and value, and finally employ the Q-learning algorithm with MCTS to optimize and update the policy network and value network.
To sum up, the main contributions of this paper are as follows:
- We design a memory component based on GAT and GRU, which can selectively remember historical trajectories for pathfinding and reasoning, omitting the pre-training required in previous work.
- We employ the history trajectories generated by the memory component to jointly model the policy network and the value network, which can be updated and improved simultaneously.
- We employ Q-learning with MCTS to uniformly optimize the policy network and the value network, which is more efficient, more conducive to expanding the exploration space, and better at avoiding sparse rewards than the REINFORCE or actor-critic algorithms [16].
- We design a scoring function based on MCTS to predict the nodes to be searched and reasoned about more efficiently in the testing stage.

Related Work
The application of deep neural networks in combinatorial optimization can be traced back to the Hopfield network [20]. With the rapid development of artificial intelligence technology and hardware (such as GPUs, TPUs, etc.), more and more researchers have committed to applying machine learning [21] to traditional combinatorial optimization problems in recent years [22] [23]. Inspired by the great success of deep RL in games, some researchers have tried to transfer it to combinatorial optimization. To make this paper self-contained, we first introduce the application of deep RL in games, then review the representative learning-based methods used in combinatorial optimization in recent years, and finally explain the inspiration of AlphaGo Zero for combinatorial optimization methods and the relationship between Go and combinatorial optimization problems.

Games with Deep RL
Deep RL [16], which integrates deep learning and RL [24], has become one of the mainstream directions of artificial intelligence. Mnih et al. proposed DQN, which combines a deep neural network, Q-learning, and experience replay to achieve human-level performance on Atari games [25]. DreamerV2 was the first agent to achieve human-level performance on the 55-game Atari benchmark by learning behavior inside a single trained world model [26]. Agent57 was the first deep RL agent to exceed the standard human benchmark on all 57 Atari games [27]. DeepMind designed a series of algorithms for the more complex games of Go, chess, and so on, such as AlphaGo, AlphaGo Zero, and AlphaZero, and applied them to defeat professional human masters, officially ushering in a new era of artificial intelligence [28]. MCTS [29] is one of the core techniques of these Go and chess algorithms; it searches the huge state space while using UCB to balance exploration and exploitation in decision making [17]. AlphaGo employs a two-stage training pipeline consisting of supervised learning and RL: expert experience assists training for a hot start in the initial phase, and then RL iteratively updates and optimizes the policy [12]. AlphaGo Zero and AlphaZero are updated versions of AlphaGo, which do not require any human experience but only the rules of Go; they train themselves with the data generated by self-play and easily beat AlphaGo in tests [13]. Deep RL has been hugely successful in increasingly complex single-agent environments and two-player turn-based games [30]. However, the real world is more complex and consists of multiple agents, each learning and acting independently while cooperating and competing with each other. DeepMind applied tournament-style evaluations to verify that agents could achieve human-level performance in 3D multiplayer first-person video games such as Quake III Arena in Capture the Flag mode [30]. DeepMind's AlphaStar was rated at the Grandmaster level for all three StarCraft II races, placing above 99.8% of officially ranked human players [31].

Combinatorial Optimization with ML/RL
We generally divide the learning-based methods into two categories, attention-based methods and GNN-based methods, although there is some overlap and fusion between them.
Attention-based Methods. Inspired by the seq2seq model [32] in natural language processing (NLP), the pointer network [33] was proposed specifically to solve combinatorial optimization problems. It combines seq2seq and attention to handle the feature that the length of the output sequence depends on the length of the input sequence in combinatorial optimization problems. The pointer network changes the traditional attention mechanism: when the output is predicted, the probability of each city in the input sequence is obtained according to the attention weights.
Bello et al. were the first to argue that reinforcement learning is more suitable than supervised learning for combinatorial optimization problems, because it is expensive or even impossible to obtain large amounts of labeled data, and they applied the actor-critic algorithm [16] to train the pointer network to solve TSP [34].
Based on [33] and [34], Nazari et al. proposed an end-to-end RL method to solve VRP, which simplifies the pointer network and can effectively handle both static and dynamic elements. It extends the pointer network to both VRP and TSP and outperforms traditional heuristics and OR-Tools [35].
Kool et al. built a framework based on the Transformer, using only the attention mechanism [36], to solve combinatorial optimization problems [37]. Instead of using a sequence as input, it uses the graph as input, which eliminates the dependence on the order of the nodes in the input [18]. That is, no matter how we arrange the nodes, the output will not change if the given graph does not change, which is an advantage over sequential approaches.
GNN-based Methods. As a generalization of deep learning to graph data, graph neural networks (GNNs) [38] [39] have been widely used in combinatorial optimization in recent years due to their strong ability to represent the intrinsic structure of graphs and to aggregate information from nodes and edges.
Dai et al. proposed a scheme that learns an evaluation function via graph embedding [40]. They first applied Structure2Vec to embed the graph into a low-dimensional space, and then employed a greedy Q-learning algorithm that maximizes the evaluation function and constructs the solution by adding nodes step by step [4].
Li et al. applied supervised learning to train a GCN [41] that guides a parallel tree search process, which quickly generates many candidate solutions and selects one of them after subsequent optimization [42]. Mittal et al. designed a two-stage training method based on [4] and [42], in which supervised learning followed by RL and a greedy probability distribution are applied to solve combinatorial optimization problems on large-scale graphs [43]. Methods for combinatorial optimization that are designed as an encoder-decoder paradigm based on a GNN framework also include [44] [45] [46] [47].
Chen et al. proposed a neural rewriter that learns a policy to select heuristic algorithms and rewrite local components of the current solution, iteratively improving it until convergence [48]. Lu et al. proposed "L2I", which starts from a random initial solution and learns to iteratively improve the solution with improvement operators selected by an RL-based controller [49].

AlphaGo Zero's Enlightenment on Combinatorial Optimization
AlphaGo Zero [13] applies a single deep neural network to replace the two independent networks in AlphaGo [12], namely the policy network and the value network, and outputs the playing policy and the winning-rate value of the current board situation. This not only saves computing space and reduces operational cost, but the merged network also adapts better to a variety of situations.
Games vs. Combinatorial Optimization. (1) Combinatorial optimization also has a huge solution space, like Go. For example, the solution set of a TSP with $n$ cities can reach an order of magnitude of $O((n-1)!)$. Therefore, we can also establish an appropriate mathematical model for combinatorial optimization problems and adopt MCTS to reduce the solution space. (2) Similar to the winning and losing rules in Go, combinatorial optimization problems also have clear objective functions and constraint conditions (rules) with which to evaluate the current policy and learn from reward signals. (3) A heuristic function designed from expert experience, or a model trained with a small amount of labeled data, may only be locally optimal in its convergence domain, whereas we can achieve self-optimization by generating data and high-quality sample labels through self-play, as in AlphaGo Zero (or AlphaZero).
At present, some scholars have been inspired by the success of AlphaGo Zero in Go and have transferred it to combinatorial optimization to achieve higher accuracy and efficiency [50] [51] [52] [53]. However, a common problem of these methods is their excessive demand on the experimental environment. Besides, the policy or value networks in previous MCTS-based methods were trained independently, without MCTS participation, and were only applied to assist MCTS after training, whereas our method improves the policy by learning from the trajectories generated by MCTS and uses the information of the entire MCTS search tree.

Problem Setup
VRP is a generic term for a class of problems in which a fleet based at one or more warehouses must determine a set of routes for geographically dispersed cities or customers [2]. The goal of VRP is to serve a set of customers with known demands using the lowest-cost vehicle routes starting and ending at a warehouse (Figure 1). TSP is a simple special case of VRP that describes starting from one city and returning to the starting point after traversing all the other cities with the lowest total cost [3]. VRP is one of the most challenging combinatorial optimization (NP-hard) problems, and much of the interest over the years stems from its practical significance and its difficulty.

Figure 1. The demonstration of VRP in a practical application scenario, in which each node can be a different facility or city with its own demand; different paths correspond to different mileages; a vehicle can be a truck, a motorbike, a taxi, or a shuttle bus with its respective load capacity. The goal of VRP is to optimize a set of paths to minimize the total cost.

We formally define VRP and TSP on the graph. Because we employ the MCTS algorithm in the framework, the optimal path problem on the graph can be translated into searching for the path with the minimum cost in the tree [51]. We use $G = (V, E, W)$ to represent an undirected and unlabeled graph, where $V$ is the set of vertices, each vertex representing a node, and the optimal path is composed of a sequence of nodes $\{v_0, \dots, v_t, \dots, v_0\}$ in which $v_0$ represents the starting node; $E$ is the set of edges; and $W$ is the set of weights on the edges, $w_{ij} = w(v_i, v_j)$.

VRP is Converted to TSP. In theory, we could apply distributed multi-agents to simulate multiple vehicles finding paths on the graph; that is, each agent represents a vehicle. However, due to our limited implementation capacity, computing resources, etc., we remain committed to designing a generic and lightweight framework for finding and reasoning about paths. If we apply the subgraph sampling algorithm [54] to divide the VRP graph into subgraphs of the same size, each containing a central node, then the VRP can be converted into TSP. In this way, we can solve both VRP and TSP without defining two Markov decision processes for them separately, and subgraph sampling, like the greedy probability distribution [43] and graph reduction [42], effectively preprocesses large-scale graphs.

Heuristic Functions. Learning-based methods for combinatorial optimization essentially learn the corresponding heuristics, so we can construct a heuristic function for combinatorial optimization problems on the graph to guide learning. At the same time, inspired by Q&A systems [55], we can also turn a combinatorial optimization problem into finding the answer to a certain question. For example, for TSP we can construct a function $f(G, v_0, q)$, whose functional form is generally unknown, where $q$ is a relational query or target such as "shortest distance", "longest distance", or "lowest cost". We need samples such as $(v_0, \dots, v_0)$ (optimal solution sets) to constitute a training data set for learning $f(G, v_0, q)$. Therefore, in theory, we can solve other combinatorial optimization problems [42], such as Minimum Vertex Cover (MVC), Maximum Cut (MC), and Maximal Independent Set (MIS), by setting different relational queries and targets to make the heuristic function $f(\cdot)$ more universal.
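To make the VRP-to-TSP conversion concrete, below is a minimal sketch that partitions customers into a fixed number of equally sized clusters around the depot and treats each cluster, plus the depot, as an independent TSP instance. This is a simple angular-sweep stand-in, not the subgraph sampling algorithm of [54] that the paper actually uses; the function names (`split_vrp_into_tsp`, `tour_length`) are illustrative.

```python
import numpy as np

def split_vrp_into_tsp(coords: np.ndarray, depot: int, k: int):
    """Partition a VRP instance into k TSP sub-instances.

    A stand-in for the subgraph sampling step: customers are grouped by
    polar angle around the depot into k roughly equal slices, and each
    slice together with the depot forms one TSP instance.
    """
    customers = [i for i in range(len(coords)) if i != depot]
    # Sort customers by polar angle around the depot (a simple sweep heuristic).
    angles = np.arctan2(coords[customers, 1] - coords[depot, 1],
                        coords[customers, 0] - coords[depot, 0])
    order = [customers[i] for i in np.argsort(angles)]
    # Chop the sweep order into k contiguous slices of (nearly) equal size.
    slices = np.array_split(np.array(order), k)
    return [[depot] + s.tolist() for s in slices]

def tour_length(coords: np.ndarray, tour: list) -> float:
    """Total length of a closed tour (returns to its first node)."""
    closed = tour + [tour[0]]
    return float(sum(np.linalg.norm(coords[a] - coords[b])
                     for a, b in zip(closed, closed[1:])))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coords = rng.random((21, 2))          # depot + 20 customers in the unit square
    sub_tsps = split_vrp_into_tsp(coords, depot=0, k=5)
    print([len(s) for s in sub_tsps])     # sizes of the 5 TSP sub-instances
```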

Markov Decision Process (MDP)
In this paper, $f(G, v_0, q)$ is modeled by a graph-walking agent, which must intelligently traverse all the remaining nodes in the graph and return to the starting point while keeping the sum of traversed path lengths as short as possible. The agent must learn a search policy from the training set so that, when training is complete, it understands how to traverse all nodes in the graph with minimal cost. The agent is not supervised from beginning to end but receives only delayed evaluation feedback: it is rewarded positively when it correctly predicts the optimal path in the training set. For this purpose, we formulate the optimal path problem as an MDP so that we can train the agent through RL. The MDP is defined by the tuple $(S, A, P, R)$, where $S$ is the set of states, $A$ is the set of actions, $P$ is the state transition probability, and $R$ is the reward function. Our environment is a finite-horizon, deterministic, and partially observable environment defined on a VRP graph.

Actions. The action space refers to all actions that the agent can perform on the graph. Every time the agent selects a node $v_t$, it executes an action $a_t \in A$ at time step $t$.

States. $S$ represents the state space, and $s_t$ represents the set of nodes traversed by the agent up to time step $t$. At $s_t$, the agent can perform two kinds of actions: (1) select an edge and move to the next node; (2) terminate the walk after returning to the start node $v_0$ (the state becomes $s_T$). The agent needs to make correct decisions based on the traversed node-sequence trajectory and the target, so we recursively define

$$s_t = (s_{t-1}, a_{t-1}, v_t, N_{v_t}, E_{v_t}),$$

where $v_t$ is the node visited by the agent at time step $t$, $N_{v_t}$ represents the set of adjacent neighbor nodes of $v_t$, and $E_{v_t}$ represents the set of edges connected to $v_t$.

Transition. $P$ refers to the probability that the state changes after an action is performed. $P(s_{t-1}, a_{t-1}, s_t) = 1$ if and only if $s_t$ represents the partial tour produced by adding $v_t$ to the partial tour in $s_{t-1}$.

Rewards. We define the reward $R(s_t)$ as the length of the tour if $s_t = s_T$ and $v_t = v_0$ (i.e., the tour is complete); otherwise $R(s_t) = 0$.
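The following is a minimal environment sketch of the MDP described above, assuming a symmetric distance matrix; the reward is taken here as the negative tour length at termination so that maximizing reward minimizes length (the text defines the terminal reward as the tour length, so the sign convention is an assumption), and the class name `TSPTraversalEnv` is illustrative.

```python
import numpy as np

class TSPTraversalEnv:
    """Toy MDP for traversing all nodes of a graph and returning to the start.

    State: the sequence of visited nodes; actions: unvisited nodes, plus
    returning to the start node once every node has been visited.
    """
    def __init__(self, dist: np.ndarray, start: int = 0):
        self.dist = dist            # symmetric (n x n) distance matrix
        self.start = start
        self.reset()

    def reset(self):
        self.tour = [self.start]
        return tuple(self.tour)

    def legal_actions(self):
        n = len(self.dist)
        unvisited = [v for v in range(n) if v not in self.tour]
        # Once all nodes are visited, the only action is to return to the start.
        return unvisited if unvisited else [self.start]

    def step(self, action: int):
        self.tour.append(action)
        done = (action == self.start and len(self.tour) == len(self.dist) + 1)
        # Deterministic transition; reward is given only for the complete tour
        # (negative tour length, an assumed sign convention).
        reward = -self._length() if done else 0.0
        return tuple(self.tour), reward, done

    def _length(self):
        return float(sum(self.dist[a, b] for a, b in zip(self.tour, self.tour[1:])))
```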

Memory Components with GRU Encoder and Graph Attention
To handle this finite-horizon, deterministic, and partially observable setting, we introduce the GRU [19] to memorize previously traveled paths (states and actions) and learn stochastic history-dependent policies, instead of using supervised learning for pre-training as in AlphaGo [12] and GCOMB [43]. Pre-training with supervised learning requires many known optimal paths for model training, and such a "hot start" may cause the model to overfit on the given paths. Based on the historical embedding $h_t$, the policy network decides which action to take from all available actions according to the query $q$. Each possible action represents a node or an outgoing edge with its labels and information. In order to take the whole graph as input and give the agent a larger field of vision during traversal, so that it can better judge the weights of paths and nodes, we introduce the GAT [18] to process the information on the graph.
We employ the complete graph as input rather than a sequence of nodes because we can apply a GNN to learn the internal structure of the graph, which lets us aggregate and update the information on the graph even when the nodes and edges are changing dynamically (Figure 1). As an encoder of the graph, we first obtain the original feature of each node, $h_i = F(v_i)$, $h_i \in \mathbb{R}^{d}$, where $d$ is the dimension of the feature vector and $F(\cdot)$ can be a fully connected neural network such as an FCN, MLP, etc. Then each layer of the GAT updates the feature vector of the $i$-th node, and the attention weight from node $i$ to node $j$ is calculated as follows:

$$e_{ij} = a(W h_i, W h_j),$$

where $a$ is a mapping $\mathbb{R}^{d'} \times \mathbb{R}^{d'} \rightarrow \mathbb{R}$ and $W \in \mathbb{R}^{d' \times d}$ is a weight matrix. To ensure that the structural information of the graph is not lost, we apply masked attention, which assigns attention only over the set of adjacent nodes $N_i$ of node $i$. The normalized self-attention is:

$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in N_i} \exp(e_{ik})}.$$

We employ a single-layer neural network as $a$, and the specific calculation process is as follows:

$$\alpha_{ij} = \frac{\exp\big(\sigma\big(a^{\top}[W h_i \,\Vert\, W h_j]\big)\big)}{\sum_{k \in N_i} \exp\big(\sigma\big(a^{\top}[W h_i \,\Vert\, W h_k]\big)\big)},$$

where $a \in \mathbb{R}^{2d'}$ is the parameter vector of the feedforward neural network and $\sigma$ is the activation function. We can then get the updated vector:

$$h_i' = \sigma\Big(\sum_{j \in N_i} \alpha_{ij} W h_j\Big).$$

To improve the representational ability of the model, we apply multiple heads to calculate self-attention in parallel and then combine the results obtained by each head:

$$h_i' = \mathop{\Vert}_{k=1}^{K} \sigma\Big(\sum_{j \in N_i} \alpha_{ij}^{k} W^{k} h_j\Big),$$

where $\Vert$ represents concatenation, and $\alpha_{ij}^{k}$ and $W^{k}$ denote the attention coefficients and weight matrix of the $k$-th head. We can also take the summation instead to get $h_i'$:

$$h_i' = \sigma\Big(\sum_{k=1}^{K} \sum_{j \in N_i} \alpha_{ij}^{k} W^{k} h_j\Big).$$

We can also replace the GAT with the state-of-the-art DAGN [56], a principled approach that incorporates multi-hop adjacent contexts into the attention calculation, supporting long-range interaction at each layer. The state vector at time $t$ is then computed by the GRU from the previous hidden state and the embedding of the current node:

$$h_t = \mathrm{GRU}(h_{t-1}, h'_{v_t}).$$
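A compact PyTorch sketch of the memory component described above: one masked single-head graph-attention layer updates the node embeddings, and a GRU cell carries the history of visited nodes. Module names (`GATLayer`, `MemoryComponent`), the choice of tanh/ReLU activations, and the way the last visited node is fed to the GRU are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GATLayer(nn.Module):
    """Single-head graph attention layer with masked attention (equations above)."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)   # shared linear map W
        self.a = nn.Linear(2 * d_out, 1, bias=False)  # attention vector a

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (n, d_in) node features; adj: (n, n) 0/1 adjacency (with self-loops).
        wh = self.W(h)                                            # (n, d_out)
        n = wh.size(0)
        pairs = torch.cat([wh.unsqueeze(1).expand(n, n, -1),
                           wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = torch.tanh(self.a(pairs)).squeeze(-1)                 # raw scores e_ij
        e = e.masked_fill(adj == 0, float("-inf"))                # masked attention
        alpha = torch.softmax(e, dim=-1)                          # normalize over N_i
        return torch.relu(alpha @ wh)                             # updated h_i'

class MemoryComponent(nn.Module):
    """GAT encoder plus a GRU that memorizes the traversal history."""
    def __init__(self, d_in: int, d_hid: int):
        super().__init__()
        self.gat = GATLayer(d_in, d_hid)
        self.gru = nn.GRUCell(d_hid, d_hid)

    def forward(self, x, adj, tour, h_prev):
        node_emb = self.gat(x, adj)
        # Feed the embedding of the most recently visited node into the GRU.
        return node_emb, self.gru(node_emb[tour[-1]].unsqueeze(0), h_prev)

# Usage on a random fully connected 10-node graph (batch size 1 for the GRU state).
x, adj = torch.rand(10, 16), torch.ones(10, 10)
mem = MemoryComponent(16, 32)
h = torch.zeros(1, 32)
node_emb, h = mem(x, adj, tour=[0], h_prev=h)
```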

Joint Modeling Policy and Value with Self-Play Neural Network
We model both the policy $\pi_\theta(a_t \mid s_t)$ and the value $Q_\theta(s_t, a_t)$ with a single network, where $\theta$ denotes the model parameters.

$\pi_\theta(a_t \mid s_t)$ represents the probability that the agent performs action $a_t$ in the current state $s_t$, which is regarded as a prior to bias the MCTS.

$Q_\theta(s_t, a_t)$ defines the long-term reward obtained by following the optimal policy after taking action $a_t$ in state $s_t$. We aim to learn a policy that maximizes the long-term reward so that the agent can predict nearby target nodes with high probability and eventually return to the original node to form a tour (Figure 3 shows the complete process of our modeling and training). As described in Section 3.3, $h_t$ is composed of the following parts: the vector $h_{A,t}$ encodes $(s_{t-1}, a_{t-1}, v_t, q)$, i.e., the previous state and action, the current node, and the query; $h_{v',t}$ ($v' \in N_{v_t}$) integrates the neighbor nodes $N_{v_t}$ and edges $E_{v_t}$; and $h_{0,t}$ judges whether the agent has arrived at the destination based on $v_t$ and $q$. We can model $Q_\theta$ and $\pi_\theta$ by the following formulas:

$$u_{t,v'} = \langle h_{A,t}, h_{v',t} \rangle \quad (12)$$

$$Q_\theta(s_t, \cdot) = \sigma(u_{t,0}, u_{t,v'_1}, \dots, u_{t,v'_n}) \quad (13)$$

$$\pi_\theta(\cdot \mid s_t) = \mathrm{softmax}\big((u_{t,0}, u_{t,v'_1}, \dots, u_{t,v'_n}) / \tau\big) \quad (14)$$

where $u_{t,0}$ is obtained by stitching $h_{A,t}$ and $h_{0,t}$ together through a fully connected neural network $\mathrm{FC}(\cdot)$, $u_{t,v'}$ is given by the inner product of $h_{A,t}$ and $h_{v',t}$, $u_{t,v'_j}$ refers to the score of the $j$-th neighbor, $Q_\theta$ is obtained by the sigmoid function $\sigma$, and $\pi_\theta$ is obtained by the softmax function with the temperature parameter $\tau$ [57].
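A small sketch of the joint policy/value head as reconstructed above: neighbor scores come from inner products with the history embedding, a separate fully connected score is produced for the stop action, $Q$ is read out with a sigmoid, and $\pi$ with a temperature softmax. The sigmoid readout and the handling of the stop action are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PolicyValueHead(nn.Module):
    """Shared head producing pi(.|s) and Q(s,.) from the same action scores."""
    def __init__(self, d_hid: int, temperature: float = 1.0):
        super().__init__()
        self.stop_score = nn.Linear(2 * d_hid, 1)  # score u_0 for "return to start"
        self.tau = temperature

    def forward(self, h_hist, h_stop, h_neighbors):
        # h_hist: (d,) history embedding; h_stop: (d,); h_neighbors: (m, d).
        u0 = self.stop_score(torch.cat([h_hist, h_stop])).view(1)   # stop-action score
        u_nb = h_neighbors @ h_hist                                 # inner products, Eq. (12)
        u = torch.cat([u0, u_nb])                                   # (m+1,) scores
        q = torch.sigmoid(u)                                        # Eq. (13), sigmoid assumed
        pi = torch.softmax(u / self.tau, dim=-1)                    # Eq. (14)
        return pi, q

# Usage with random embeddings for 4 candidate neighbor actions.
head = PolicyValueHead(d_hid=32)
pi, q = head(torch.rand(32), torch.rand(32), torch.rand(4, 32))
```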

Training Algorithm with MCTS
Inspired by AlphaGo [12], we design a training pipeline that trains the parameters $\theta$ from beginning to end, and, following [13] [28], we further define $f_\theta(s) = (P(s, a), v(s))$, where $P(s, a)$ is the action probability and $v(s)$ is the evaluation score predicted by the network with parameters $\theta$ for the given data set, task, and state $s$. The neural network predicts the probability of taking an action, which in turn addresses a given task on a data set. The network input is a self-play training instance $(s, \pi, z)$, where $z$ is the evaluation of the pipeline at the end of the traversal. The parameters $\theta$ of the network are adjusted by stochastic gradient descent (SGD) on the loss function $l$, which sums the mean squared error and the cross-entropy loss:

$$l = (z - v(s))^2 - \pi^{\top} \log P(s, \cdot) + c \lVert \theta \rVert^2,$$

where the L2 regularization term with coefficient $c$ is used to prevent overfitting. The network outputs are the action probabilities $P(s, a)$ and the evaluation $v(s)$ of pipeline performance. Slightly different from the loss in [28], we use a normalized cumulative reward: instead of comparing two players' odds of winning, the model evaluator generates random graph instances each time and compares their average performance, specifically

$$z = \frac{R - \mu_s}{\sigma_s}.$$

When estimating the value of state $s$, we perform the action that maximizes $v(s)$ and apply $\sigma_s$ and $\mu_s$ to recover the unnormalized value, where $\sigma_s$ and $\mu_s$ are the standard deviation and the mean of the cumulative rewards of the random plays from state $s$.
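A minimal sketch of the AlphaZero-style loss above: squared value error plus policy cross-entropy plus L2 regularization. The weight-decay coefficient and the batching convention are assumptions.

```python
import torch

def alpha_t_loss(pred_value, pred_log_probs, target_value, mcts_policy,
                 params, l2_coef: float = 1e-4) -> torch.Tensor:
    """(z - v)^2 - pi^T log P + c * ||theta||^2, averaged over the batch."""
    value_loss = torch.mean((target_value - pred_value) ** 2)
    policy_loss = -torch.mean(torch.sum(mcts_policy * pred_log_probs, dim=-1))
    l2 = sum(torch.sum(p ** 2) for p in params)
    return value_loss + policy_loss + l2_coef * l2

# Example with a batch of 2 states and 3 candidate actions each.
v_pred = torch.tensor([0.1, 0.4])
log_p = torch.log_softmax(torch.randn(2, 3), dim=-1)
z = torch.tensor([0.3, 0.2])
pi = torch.tensor([[0.2, 0.5, 0.3], [0.6, 0.2, 0.2]])
loss = alpha_t_loss(v_pred, log_p, z, pi, params=[torch.randn(4, 4)])
```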
Our algorithm runs multiple simulations in the MCTS [17], using the neural network predictions $(P(s, a), v(s))$ to search for better evaluations. The search results improve on the predictions given by the network, and the following selection rule is used to improve the network policy:

$$a = \arg\max_{a} \Big[ Q(s, a) + c \, P(s, a) \, \frac{\sqrt{N(s)}}{1 + N(s, a)} \Big],$$

where $Q(s, a)$ is the expected reward of taking action $a$ from state $s$, $P(s, a)$ is the neural network's estimate of the probability of taking action $a$ from state $s$, $N(s, a)$ is the number of times action $a$ has been taken from state $s$, $N(s)$ is the number of times state $s$ has been visited, and $c$ is a constant that determines the amount of exploration. At each step of the simulation, we find the action and state that maximize this quantity; if the new state does not exist in the tree, we add it and evaluate it with the neural network estimate $(P(s, a), v(s))$, otherwise the search is called recursively.
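A sketch of the PUCT-style selection rule reconstructed above, with node statistics kept in plain dictionaries; `c_puct` and the dictionary layout are illustrative assumptions.

```python
import math

def select_action(Q, P, N, state, c_puct: float = 1.5):
    """Pick the action maximizing Q(s,a) + c * P(s,a) * sqrt(N(s)) / (1 + N(s,a))."""
    n_state = sum(N.get((state, a), 0) for a in P[state])
    def score(a):
        return (Q.get((state, a), 0.0)
                + c_puct * P[state][a] * math.sqrt(n_state) / (1 + N.get((state, a), 0)))
    return max(P[state], key=score)

# Usage: priors for three actions at the root, a few visits recorded.
P = {"root": {0: 0.5, 1: 0.3, 2: 0.2}}
N = {("root", 0): 3, ("root", 1): 1}
Q = {("root", 0): 0.2, ("root", 1): 0.4}
print(select_action(Q, P, N, "root"))   # -> 1
```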
We developed a new reverse-link (backup) algorithm that employs MCTS to leverage the MDP transition function. That is, in each MCTS simulation, the trajectory is expanded from the root state $s_0$ by selecting actions according to a variation of the PUCT algorithm [13], in which two constants $c_1$ and $c_2$, playing a role similar to $c$ above, control the amount of exploration. The core idea of Alpha-T is to run multiple MCTS simulations to generate a set of trajectories with more positive rewards, which can be understood as being generated by a promoted policy; learning from these trajectories can in turn promote the policy. Since these trajectories are discrete rather than continuous, we employ Q-learning [58] [59] instead of REINFORCE [16] to update the Q-network from these trajectories in an off-policy way:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big).$$

Since the Q-network (value network) and the policy network share the same set of parameters, as soon as the parameters in the Q-network are updated, the policy network is also improved automatically, and the new policy is applied to control the MCTS in the next iteration.
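A sketch of the off-policy Q-learning update applied to trajectories collected from MCTS simulations. The tabular form, learning rate, and discount factor are illustrative stand-ins for the parametric Q-network update used in the paper.

```python
from collections import defaultdict

def q_learning_from_trajectories(trajectories, alpha=0.1, gamma=1.0):
    """Update Q(s,a) from (state, action, reward, next_state) tuples off-policy."""
    Q = defaultdict(float)
    actions_in = defaultdict(set)           # actions observed from each state
    for traj in trajectories:
        for s, a, r, s_next in traj:
            actions_in[s].add(a)
    for traj in trajectories:
        for s, a, r, s_next in traj:
            best_next = max((Q[(s_next, b)] for b in actions_in[s_next]), default=0.0)
            td_target = r + gamma * best_next
            Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q

# One toy trajectory: visit node 1 then return to the start with reward -3.2
# (negative tour length, matching the sign convention assumed earlier).
traj = [(("0",), 1, 0.0, ("0", "1")), (("0", "1"), 0, -3.2, ("0", "1", "0"))]
Q = q_learning_from_trajectories([traj])
```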
Following [13], the data generator repeatedly generates random graph instances and plays them as self-play records using the current best model with MCTS; these records are the $(s, \pi, z)$ instances mentioned above, indicating that the actions taken from each state follow the MCTS-improved policy. The learner randomly samples from the generators' data and updates the parameters of the best model (the Q-learning update with MCTS described above). The training algorithm with MCTS is described in Algorithm 1.

Predict Nodes with the MCTS Scoring Function
In the test phase, we apply the trained policy to predict the sequence of nodes (the path) on an unseen graph. Previous methods predict a probability distribution over the nodes of the unknown graph, such as the greedy probability distribution scoring function used in GCOMB [43], whereas our method combines the policy and value learned with MCTS to generate an MCTS tree during the training phase. In the testing phase, previous methods often predict multiple paths under different rewards, but in practice we usually need one exact shortest path, which means the result should be unique. When Alpha-T predicts multiple paths, leaf states in the MCTS tree correspond to nodes on different paths, and we combine the predicted results of the MCTS leaf states into a score used to rank the nodes, specifically:

$$\mathrm{score}(v) = \sum_{s_T \in S_v} \frac{N(s_T)}{\sum_{s'_T \in S_v} N(s'_T)} \, Q(s_T, a_{0,T}),$$

where $S_v$ is the set of all leaf states corresponding to the same node $v$, $N(s_T)$ is the visit count of leaf state $s_T$, and $a_{0,T}$ is the action taken in the terminal state (see Section 3.2).

$\mathrm{score}(v)$ is thus the weighted average of the terminal state values associated with the same candidate node $v$. Among the candidate nodes (paths), we choose the one with the highest score:

$$v^{*} = \arg\max_{v} \mathrm{score}(v).$$
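A sketch of the test-time node scoring step: terminal MCTS states that end at the same candidate node are pooled and their values are averaged with visit-count weights, then the highest-scoring candidate is chosen. Using visit counts as weights matches the "weighted average" description but is an assumption about the exact weighting.

```python
from collections import defaultdict

def score_candidates(leaf_states):
    """leaf_states: iterable of (candidate_node, visit_count, terminal_value)."""
    totals = defaultdict(float)   # sum of visit_count * value per node
    counts = defaultdict(float)   # sum of visit_count per node
    for node, n_visits, value in leaf_states:
        totals[node] += n_visits * value
        counts[node] += n_visits
    return {node: totals[node] / counts[node] for node in totals}

def pick_best(leaf_states):
    scores = score_candidates(leaf_states)
    return max(scores, key=scores.get)

# Two candidate nodes; node "B" has the higher visit-weighted value.
leaves = [("A", 10, 0.4), ("A", 5, 0.2), ("B", 8, 0.7)]
print(pick_best(leaves))   # -> "B"
```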

Data Sets and Settings
Alpha-T does not require expert experience; it learns heuristics through self-play, and data can be produced by data generators. In this paper, we focus on path optimization problems, and we generate VRP or TSP instances to train the model as in previous works [37]. Our method is based on MCTS and RL, which theoretically gives it better generalization than supervised learning (which relies on labeled data) and a larger search space than other search methods. Therefore, we can also generate Erdős-Rényi (ER) and Barabási-Albert (BA) graphs to extend the model to other similar combinatorial optimization problems on a larger scale [43] [4].
In terms of baseline selection, we chose some classic heuristics, recent attention- or GNN-based methods, and AlphaGo Zero-based methods. The experimental results of the baselines in the comparison experiments are taken from their respective papers. Due to the difficulty of reproducing previous work (in terms of computing resources and time), we chose experimental tasks and parameter settings as similar to previous works as possible for convenient comparison.
For the memory component, we set the GRU hidden dimension to 200 and the attention dimension to 100; the MLP contains 5 layers, each with a hidden dimension of 100. For MCTS we follow AlphaZero, AlphaGo Zero, and ELF OpenGo [60]: we set the temperature parameter $\tau = 1$ in the training phase and $\tau = 0$ in the test phase. During training, each learner samples 20 trajectories from the self-play records and performs stochastic gradient descent with Adam, with a learning rate of 0.001, a weight decay of 0.0001, and a batch size of 16. The learner saves a new model after every 15 iterations. We set the numbers of data generators, learners, and model evaluators to 20, 6, and 2, respectively. Meanwhile, we incorporate PBT [61] to optimize the parameters in the network, a population-based method for accelerating and improving AlphaZero. The experiments are performed on eight NVIDIA GeForce GTX 1080 Ti GPUs, and part of the code is available at: https://github.com/wangqi798252101/Alpha-T
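For reference, the settings above gathered into a single configuration sketch; the key names are illustrative, not from the released code.

```python
# Hyperparameters reported in this section, gathered for reference (key names are illustrative).
CONFIG = {
    "gru_hidden_dim": 200,
    "attention_dim": 100,
    "mlp_layers": 5,
    "mlp_hidden_dim": 100,
    "mcts_temperature_train": 1.0,
    "mcts_temperature_test": 0.0,
    "trajectories_per_learner_step": 20,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "weight_decay": 1e-4,
    "batch_size": 16,
    "save_model_every_iters": 15,
    "num_data_generators": 20,
    "num_learners": 6,
    "num_evaluators": 2,
}
```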

Performance for Small-Scale TSP
We first verify the efficiency and effectiveness of Alpha-T on small TSP instances. The selected baselines are representative GNN-based methods, including GAT, GCN, Structure2Vec, and the Graph Pointer Network, as well as traditional heuristic search algorithms such as Nearest and MST [37] [47]. The optimal solutions of TSP20/50/100 instances are relatively easy to obtain with solution solvers (such as Concorde, LKH3, and OR-Tools [37]), and the approximation ratio of each method to the optimal solution is used to measure the quality of the solution obtained by the model; the method with the lower approximation ratio is better. Figure 4 shows the experimental results of the different methods. From the trend, we can see that the solution obtained by Alpha-T is the closest to the optimal solution, which indicates its good effect on small-scale path problems. We think that, compared with previous methods based on a GNN alone, Alpha-T applies GAT with GRU, which effectively integrates and remembers the information of neighboring nodes to form a more complete state space and action space to guide the agent to better find nodes and reason about paths. During the experiments, we compared the memory component with GAT alone, and we found that using the GRU to aid attention is more accurate on small scales than using attention only. We believe that attention can infer the importance of the surrounding neighbors so that the GRU can record the historical trajectories more efficiently and avoid many redundant states and actions. Moreover, if the agent can remember the paths already taken, it is helpful for inferring the paths to take.

Performance for Larger-Scale TSP
Real-world transportation networks often contain hundreds of nodes, so it is necessary to verify Alpha-T's performance on larger TSP instances. We train on small-scale graphs and then reason on larger-scale graphs, which is another way to demonstrate the generalization ability of the model. GNN-based methods may be too heavy to train on large graphs because of their many parameters, a phenomenon common in previous methods [4]. At the same time, since the Alpha-T framework uses a GRU, which is a variant of RNN, training entirely on large-scale graphs could lead to vanishing or exploding gradients and an excessive number of parameters; therefore, for very large graphs we can apply the subgraph sampling algorithm [54] to divide them into subgraphs and process these subgraphs iteratively [62]. If there is too much training data, the training time becomes too long, and it is hard to know how long is appropriate: if training runs too long it leads to overfitting, while if it is too short the potential of the model cannot be fully released. To sum up, we trained for one hour on TSP20, TSP50, and TSP100 respectively, and then tested on TSP500, TSP750, and TSP1000. Table 2 shows the comparison with the baselines. From the experimental results, we can see that Alpha-T achieves better results than the previous RL methods, but it is still not as accurate as the traditional heuristic methods. We can also observe that Alpha-T performs better when the training data set is larger, which indicates that increasing the training data within a certain range helps improve its ability and unleash its potential. To analyze the reason, the memory component in the framework helps the agent find the correct node more accurately. Moreover, the Q-learning algorithm based on MCTS further expands the search space and improves search efficiency, because it enhances sampling efficiency and alleviates the problem of sparse rewards.

Performance on VRP
In theory, we can apply Alpha-T to solve other types of combinatorial optimization problems on graphs by simply adapting the GAT to fit other problems such as MVC, graph coloring, the maximum cut problem, and so on [50] [52]. As mentioned above, this paper focuses on VRP on graphs, which can be transformed into TSP through the subgraph sampling algorithm. In the VRP experiment, the number of subgraphs is set to 5; that is, each VRP instance in the experiment is converted into 5 TSP instances. The model trained on TSP20 is used for VRP100, TSP50 for VRP200, and TSP100 for VRP400. We are trying to establish a method system for solving this kind of path problem based on AlphaZero, so whether good results can be maintained on other problems is reserved for future work. The previous works based on AlphaGo Zero are mostly applied to graph coloring problems, 3D packing problems, MVC, etc. [50] [53] [52] [51], which do not overlap with our method in the experiments, and such methods are often too complex to be easily reproduced, so we still chose targeted methods for solving VRP as the baselines.
From the experimental results, we can see that Alpha-T is still very competitive compared with the baselines, which mainly demonstrates the effectiveness of the subgraph sampling algorithm for converting VRP into TSP, because we have previously demonstrated the efficiency of Alpha-T on TSP. We could have handled VRP directly by changing the GAT without the need for conversion, but we are trying to unlock the possibility of fast parallel processing of large-scale graphs. In combinatorial optimization, many problems are mathematically equivalent, for example, Maximal Independent Set (MIS), Minimum Vertex Cover (MVC), and Maximal Clique (MC) on bipartite graphs [42]. We can apply machine learning to convert them into each other, making it possible to solve multiple problems with a single model that does not need to be modified. Besides, we apply Q-learning with MCTS to learn heuristics more efficiently, search over trees and graphs, and update the network so that the agent can find the target node more intelligently.

Table 3. The performance of Alpha-T on the VRP instances. We employ the solution solver OR-Tools as a benchmark and report the performance of each model relative to it, where "Obj" represents the length of the obtained solution and "Gap" represents the gap between each model and OR-Tools. To facilitate the comparison, we choose the most representative recent methods; where a method has versions with different search algorithms, we choose the most effective version.

Effect of Training
So far, there are still some problems in the application of RL, such as sparse rewards, low sampling efficiency, and limited search space, because under normal circumstances agents may explore many useless nodes or paths through constant trial and error, or enter an endless cycle in a certain state. Even the success of RL in the AlphaGo series is generally believed to owe much to the assistance of MCTS and deep learning. We therefore verify the learning effect of Alpha-T: whether it can learn useful paths and find the right nodes efficiently. First, we measure the accuracy with which the agent finds the target node as the number of training episodes increases on TSP20/50/100, and then we compare with the baselines how the learning converges as the number of training steps increases. From Figure 5, we can see that the accuracy of finding the right node increases with the number of training steps, which indicates that the learning and updating process of Q-learning with MCTS is effective. It can be seen from Figure 6 that Alpha-T gradually converges to the optimal solution as the number of training steps increases, and the effect is obvious in the early stage, which indicates the effectiveness and efficiency of the gradient descent. Because GCN-NPEC employs supervised learning in the initial stage, it converges faster at the beginning.

Discussion and Conclusion
Inspired by AlphaZero, we designed Alpha-T, which does not require any expert experience and can learn heuristics entirely from unlabeled data through self-play. Aiming at path optimization problems (including TSP, VRP, etc.), we have built a lightweight self-play framework based on MCTS whose consumption is far lower than that of AlphaZero. Since the memory component records the history trajectories to assist the later modeling of the policy network and value network, it saves a lot of computing resources in the early stage and avoids the supervised pre-training used in AlphaGo. We apply the Q-learning algorithm with MCTS to train the policy network and the value network together so that both can be iteratively updated at the same time to produce the best model, which also effectively integrates the learners, data generators, and model evaluators. Like AlphaZero, which plays not just Go but a variety of board games, Alpha-T can be generalized to other similar combinatorial optimization problems on a much larger scale. This also shows that combinatorial optimization is one of the most cutting-edge problems in the field of artificial intelligence, and its scientific and practical value motivates us to apply state-of-the-art technology to it. In this paper, we propose a general RL framework in which components can be replaced by other models, such as state-of-the-art GNNs and RL algorithms, which will be our future work.