Distributed Multi-Agent Learning is More Effective than Single-Agent Learning




Abstract
Interpretable distributed group intelligence techniques have emerged as an essential topic in artificial intelligence. The mathematical interpretability of prediction outcomes is critical for improving the reliability of machine learning, especially in stochastic scenarios. Although published experimental results show that group intelligence predicts better than individual intelligence, establishing a mathematical foundation for the superiority of distributed group intelligence remains a challenging problem for enhancing the interpretability of learning systems. Using the Rademacher complexity principle, we prove mathematically that, with high probability, the learning quality of group machine intelligence exceeds that of any of its subsets, and is significantly better than that of any individual when the group is large enough. We propose a multi-agent distributed learning method for time series forecasting that incorporates multi-agent cooperation, as observed in cognitive processes, into machine learning. In addition, since the way agents interact affects the training of the model, we provide a generalized interaction approach and prove its convergence. We conduct extensive experiments on time series prediction for classical chaotic systems, and the results indicate that distributed group intelligence significantly improves the prediction accuracy of individual intelligence. The experimental results show that the prediction error decreases substantially as the number of agents increases, confirming the correctness of the theory and the validity of the model. This work provides new ideas for theoretically exploring how group intelligence emerges.

Introduction
Recently, multi-agent learning has gained popularity due to the development of generative adversarial networks and their application to complex systems [1][2][3][4]. Organisms exploit their diversity to adapt to the uncertainty of the real world, and the real world is captured completely only when the whole group is taken together. In multi-agent cognitive learning, the group comprises individuals who share a common objective but have distinct parameters. The differences that individuals exhibit through learning reflect the uncertainty of the world and the diversity of life. The emergence of group learning arises from the individuals' continuous mutual learning.
Time series data, continuously produced in industrial, environmental, social and healthcare applications, has attracted intensive attention from academia and industry. Acquiring accurate and reliable future trends is a perpetual endeavour in many fields, providing the basis for applications such as production planning, control, optimization and prevention [5][6][7][8]. Traditional time series forecasting techniques such as the autoregressive integrated moving average (ARIMA) [9][10][11], filtering-based approaches [12,13], support vector machines [14][15][16], and others are limited in their capacity to handle high-dimensional data and to represent complicated functions correctly. Deep learning has shown significant benefits in this area, with various models based on long short-term memory (LSTM) [17] and recurrent neural networks (RNN) successfully applied to time series forecasting fields, including remote sensing [18], the atmosphere [19], language processing, and speech recognition [20,21].
Most models learn representations of latent features mainly at higher and more abstract levels [22]. However, it is difficult for the upper layers of a deep model to extract correct features from prior learning when the bottom layers do not collect sufficient and trustworthy data. Moreover, the learning result of each layer and the layer-to-layer connection patterns strongly influence the final effect. Dan Shechtman argues that group learning is the most efficient method of acquiring information in the absence of prior knowledge [23]. To increase the model's learning efficiency, we integrate the multi-agent conceptual cognitive learning process seen in social networks into model training. The diverse social roles of cognitive agents create information asymmetry in social concept cognition. Reinforcement learning merges social roles and the unknowable elements outside a specific cognitive agent into an "environment" [24]. This approach can compensate for the shortcomings of the earlier "open-loop learning" of cognitive agents, but it cannot accurately reflect the complexity of the cognitive learning process for social ideas. A dopamine-based distributed reinforcement learning method confirms that the brain does not encode potential future rewards as a single mean value, but efficiently and in parallel as a set of multiple future rewards [25]. This discovery demonstrates the feasibility of distributed reinforcement learning on neural networks.
In contrast to group learning, dopamine-based distributed reinforcement learning does not reflect cooperative learning behaviours, social role distinctions, or agents' learning inclinations. Thus, research on theories and methods of complex multi-agent distributed social concept cognition is urgently required to develop artificial intelligence. Swarm Learning, a decentralized machine learning method that surpasses collaborative learning, has been developed to rapidly and accurately identify patients with severe and heterogeneous illnesses. In four different illness cases, Swarm Learning classifiers outperform single-agent classifiers, showing that multi-agent distributed learning beats single-agent learning [26]. This also echoes the Chinese classic "The Book of Rites, Record on Learning", which says that "one who studies alone and without companions remains isolated and uninformed". Similarly, the literature [27] indicates that when a task is complex, the group solves the problem as fast as the fastest individual, and its efficiency can exceed that of the most efficient individual in the group. However, most of the results above rest on experimental evidence rather than theoretical proofs, and their mathematical foundations have become essential concerns in machine intelligence.
Cooperation is a common phenomenon in group learning: dolphins hunt in groups [28], humans cooperate to win [29], and cooperation exists across biological species and human cultures [30,31]. Mathematical studies have shown that when the leader does not pursue a dominating strategy, the system coevolves and stays in a cooperative state; this phenomenon occurs in the Snowdrift and Stag Hunt games as well [32]. This research offers mathematical evidence that cooperative learning in the leader-follower model reaches win-win outcomes. Cooperation may take many forms, including specialization, sacrifice, and coordination. One evolutionary population model consists of two interacting individuals that reproduce under stochastic environmental conditions; reducing the correlation of fecundity between replicating units acts as an evolutionary selection mechanism for collaboration, which explains how cooperation evolved [31]. Additionally, Swami Iyer et al. demonstrated that structured groups based on complex networks display stronger cooperative behaviour than unstructured groups [33]. That is, when the interaction between individuals forms a network, the network topology significantly affects the outcome of group learning. In this paper, we integrate cooperative learning and network structure into a novel machine learning framework, the multi-agent distributed LSTM, to improve the prediction accuracy of network models.
In recent years, numerous researchers have investigated the mathematical foundations behind machine learning. For time series prediction, the interpretability of the prediction results is essential to improve the reliability of machine learning systems, so interpretable machine learning methods have emerged as a hot topic for future research. In previous work, we established a new condition for the agreement of the generator and discriminator data distributions, and proposed a multi-agent distributed generative adversarial network (MADGAN) to address the multi-agent cognitive consistency problem in complicated distributed systems [34]. Chen et al. made a breakthrough on a critical nonlinear partial differential equation closely related to the optimal transport theory familiar in machine learning; for example, the Wasserstein GAN (W-GAN) is grounded in optimal transport theory [35]. Multimodal learning outperforms learning with subsets of its modalities, since the former provides a more accurate estimation of the latent space representation [36]. In this paper, we show that multi-agent learning reduces the overall risk compared to single-agent learning when the dataset is sufficiently large. We developed a multi-agent distributed learning framework based on LSTM networks to verify the validity of the theory. The experimental results corroborate the theory, demonstrating that reciprocal learning among agents can compensate for an individual's limited sample size and expertise in the cognitive process.
In this paper, we show by minimizing empirical risk that multi-agent learning outperforms learning with any subset of agents. The multi-agent distributed LSTM network (MAD-LSTM) is proposed as a new prediction framework. MAD-LSTM significantly increases prediction accuracy by simulating multi-agent conceptual cognitive learning and incorporating cooperative learning. The agents in multi-agent cognitive learning are LSTM networks with the same goal but distinct parameters, and they interact by exchanging parameter information. Notably, the interaction between agents may be regarded as information sharing between neurons at each layer. We evaluate the results on prominent benchmark datasets and real-world datasets. The experimental results demonstrate that when sufficient data is available, multi-agent distributed learning beats learning with a subset of agents.

Results
Notation. This section first reviews basic facts about Markov chains. $\mathbb{R}$ and $\mathbb{N}$ denote the real and natural numbers, respectively. When $m$ agents learn from each other, the Markov chain state space can be represented as $I = \{1, 2, \dots, m\}$. $G = (V, E, W)$ is an $m$-order weighted directed graph with $V = \{1, 2, \dots, m\}$ the set of all agents. $E = \{e_{ij} \mid i, j \in V\}$ is the set of edges, representing the information interaction or conceptual influence between agents in the network; the edge $e_{ij}$ indicates that agent $j$ can receive information from agent $i$. The weighted adjacency matrix is $W = [w_{ij}]_{m \times m}$, where $w_{ij}$ denotes the weight of edge $e_{ji}$ and also the one-step transfer probability from state $i$ to state $j$; $w_{ij} > 0$ if and only if $e_{ji} \in E$, otherwise $w_{ij} = 0$. $W(n)$ denotes the adjacency matrix of the multi-agent interaction at the $n$th time; when $W$ is the same at every time, $W(n) = W^n$. $N_i = \{j \in V : w_{ij} > 0\}$ denotes the set of neighbors of node $i$. Additionally, if $w_{ii} > 0$, node $i$ has a self-loop. $w_{ij}^{(n)}$ denotes the $n$-step transfer probability from state $i$ to state $j$. If the set $\{n : n \ge 1, w_{ii}^{(n)} > 0\}$ is nonempty, its greatest common divisor $d$ is called the period of state $i$; in particular, when $d = 1$, state $i$ is aperiodic. If states $i$ and $j$ communicate, i.e., $i \leftrightarrow j$, they have the same period. If every pair of states in $I = \{1, 2, \dots, m\}$ communicates, $I$ is an irreducible closed set. An irreducible aperiodic Markov chain on a finite state space has a unique stationary distribution $\{\pi_j, j \in I\}$, which coincides with the limit distribution [37,38]: $\lim_{n \to \infty} w_{ij}^{(n)} = \pi_j$, $j \in I$.
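As a concrete numerical sketch of these limit-distribution facts (the 3-state row-stochastic matrix below is illustrative, chosen by us, not taken from the paper), the stationary distribution can be computed by power iteration:

```python
import numpy as np

# Hypothetical row-stochastic interaction matrix: w_ij is the one-step
# transfer probability from state i to state j (illustrative values).
W = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

def stationary_distribution(W, tol=1e-12, max_iter=10_000):
    """Iterate pi <- pi W; for an irreducible aperiodic finite chain this
    converges to the unique stationary distribution."""
    m = W.shape[0]
    pi = np.full(m, 1.0 / m)
    for _ in range(max_iter):
        nxt = pi @ W
        if np.abs(nxt - pi).max() < tol:
            return nxt
        pi = nxt
    return pi

pi = stationary_distribution(W)
assert np.allclose(pi @ W, pi)       # pi is stationary: pi W = pi
assert abs(pi.sum() - 1.0) < 1e-9    # probabilities sum to 1
# The limit distribution: every row of W^n approaches pi.
assert np.allclose(np.linalg.matrix_power(W, 200), np.tile(pi, (3, 1)))
```

The last assertion checks the statement $\lim_{n \to \infty} w_{ij}^{(n)} = \pi_j$ numerically: after enough steps, the transfer probabilities no longer depend on the starting state $i$.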

Mathematical theorems on the superiority of distributed group learning
Multi-agent learning outperforms single-agent learning mathematically. Suppose $X$ denotes the input domain and $Y$ the target domain. Consider a group of $m$ agents trained on the same data $x \in X$. Let $M = \{1, 2, \dots, m\}$ denote the set of all agents, $|M| = m$, and let $K$ be a subset of $M$. $F$ is the function class, $f^* : X \to Y$ denotes the true mapping from the input domain to the target domain, and $f_i : X \to Y$, $i = 1, 2, \dots, m$, denotes the mapping of agent $i$. The agents learn from each other according to the adjacency matrix $W$. By the condition for the existence of a stationary distribution [37], as the number of training rounds $n \to \infty$ there is a stationary distribution $\{\pi_j, j \in M\}$ with $\sum_{j=1}^m \pi_j = 1$, that is, $\lim_{n\to\infty} w_{ij}^{(n)} = \pi_j$, $j \in M$.

Through mutual reference between agents, the learning result of agent $i$ at the $(n+1)$th time for input $x$ is
$$ f_i^{(n+1)}(x) = \sum_{j=1}^m w_{ij} f_j^{(n)}(x), \quad i = 1, 2, \dots, m. $$
Because a stationary distribution of the opinion shifts exists, group learning is consistent, namely $\lim_{n\to\infty} \bigl| f_i^{(n)}(x) - \sum_{j=1}^m \pi_j f_j^{(0)}(x) \bigr| = 0$ for every agent $i$ and input $x$ (see Methods for details). The common limit defines the group mapping $f_M(x) = \sum_{j=1}^m \pi_j f_j^{(0)}(x)$, and $F_M$ denotes the mapping function class from $X$ to $Y$ containing the $m$ agents.

The data pair $(x, y) \in X \times Y$ is sampled from the total data in the learning task and obeys an unknown distribution $D$; the sample set $Q = \{(x_i, y_i)\}_{i=1}^s$ is obtained by independent sampling from the total dataset. Based on the Empirical Risk Minimization (ERM) principle [39], the goal of learning is to find the mapping function learned jointly by the $m$ agents, $\hat{f}_M \in F_M$, that minimizes the empirical risk:
$$ \hat{f}_M = \arg\min_{f_M \in F_M} \hat{r}(f_M), \qquad \hat{r}(f_M) = \frac{1}{s} \sum_{i=1}^{s} l(f_M(x_i), y_i), $$
where $l(\cdot, \cdot)$ denotes the loss function. Given $\hat{r}(f_M)$, we can define the corresponding population risk as
$$ r(f_M) = \mathbb{E}_{(x, y) \sim D}\, l(f_M(x), y). $$
As in previous research [40,41], we use the population risk to measure the performance of learning, and introduce the representation learning quality to measure the goodness of the learned representations.

Definition 1. Given a data distribution of the form (1), for any learned representation mapping $f \in F$, the representation learning quality is defined as $\phi(f) = r(f) - r(f^*)$, which measures the loss due to the distance between $f$ and $f^*$.
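To make these quantities concrete, the following small sketch (a toy setup of our own: squared-error loss, a known "true" mapping, and a hypothetical learned mapping) estimates the empirical risk on a sample, the population risk by Monte Carlo, and the representation learning quality of Definition 1:

```python
import numpy as np

rng = np.random.default_rng(0)

f_star = lambda x: 2.0 * x            # true mapping f* (toy choice)
f_hat  = lambda x: 1.8 * x            # a hypothetical learned mapping f

def risk(f, xs, ys):
    """Average squared-error loss (1/s) * sum_i l(f(x_i), y_i)."""
    return np.mean((f(xs) - ys) ** 2)

# Small sample Q drawn from the (here, known) distribution D.
xs = rng.uniform(-1, 1, size=50)
ys = f_star(xs) + rng.normal(0, 0.1, size=50)
r_emp = risk(f_hat, xs, ys)           # empirical risk of f

# Monte Carlo stand-in for the population risk r(f) = E l(f(x), y).
bx = rng.uniform(-1, 1, size=200_000)
by = f_star(bx) + rng.normal(0, 0.1, size=200_000)
r_pop  = risk(f_hat, bx, by)
r_star = risk(f_star, bx, by)         # irreducible risk of the true mapping

phi = r_pop - r_star                  # representation learning quality, Def. 1
assert phi > 0                        # f != f*, so the quality gap is positive
```

Here `phi` shrinks to zero exactly when the learned mapping approaches $f^*$, which is the sense in which Definition 1 measures the distance between $f$ and $f^*$.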
The analysis above leads us to the following Theorem 1; the detailed proof is provided in Methods.

Theorem 1. Under Assumptions 1-3, for any subset $K \subset M$ and any $\delta \in (0, 1)$, when the training sample size $s$ is large enough, $\phi(\hat{f}_M) \le \phi(\hat{f}_K)$ holds with probability at least $1 - \delta$.
The former has better representation learning quality than the latter, indicating that learning with $m$ agents is more efficient than learning with $k$ agents. $\hat{f}_M$ explores the space more thoroughly than $\hat{f}_K$, as shown in Fig. 1c. In addition, our experimental results also show that multi-agent groups learn better than their subsets.

Experimental results of distributed predictive models
MAD-LSTM approach. The long short-term memory (LSTM) model is the basic network used to describe the multi-agent distributed learning process (Fig. 1b). Each LSTM is regarded as an agent (a physical or abstract object); the internal details of training the LSTM parameters are not the focus here. An agent can receive and transmit information to the surrounding individuals within a limited range. The training process of multiple LSTM networks is a process of conceptual state transfer between agents. The interactions between agents form a network, and the weights of the network (edges) indicate the degree of influence between two agents. The agents learn from each other through the adjacency matrix W (Fig. 1a).
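The cooperative update can be sketched with a toy stand-in in which each agent is a linear model rather than an LSTM (our own illustration; the matrix W, the data, and all parameters are hypothetical): after each gradient step on the shared loss, every agent's parameter vector is replaced by the W-weighted average of its neighbours' parameters, mirroring the state-transfer rule described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical row-stochastic interaction matrix for m = 3 agents.
W = np.array([[0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2],
              [0.2, 0.2, 0.6]])

# Shared training data: a linear regression task y = X @ theta_true + noise.
m, d, s = 3, 5, 200
X = rng.normal(size=(s, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + rng.normal(0, 0.01, size=s)

thetas = rng.normal(size=(m, d))      # distinct initial parameters per agent
lr = 0.05
for _ in range(500):
    # each agent takes its own gradient step on the shared squared loss ...
    grads = np.stack([2.0 * X.T @ (X @ t - y) / s for t in thetas])
    thetas = thetas - lr * grads
    # ... then mixes its parameters through W (the cooperative step)
    thetas = W @ thetas

assert np.abs(thetas - thetas.mean(axis=0)).max() < 1e-3   # consensus reached
assert np.abs(thetas[0] - theta_true).max() < 0.05         # near the optimum
```

The two assertions show the two effects the MAD-LSTM design relies on: the mixing step drives the agents to consensus, while the shared loss keeps that consensus near the optimum.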
Multi-agent learning is better than single-agent learning. Assuming m = 3, we construct a multi-agent distributed LSTM (MAD-LSTM) network comprising three agents with identical network structures. The three agents learn from each other through a random row-stochastic matrix W.

 
The random matrix W is determined empirically and is kept fixed throughout the training procedure. To verify that multi-agent learning outperforms single-agent learning, we trained MAD-LSTM and LSTM on the Mackey-Glass and Lorenz chaotic time series and the Beijing PM2.5 concentration dataset, respectively. More details about the training data are given in Appendix A. The experimental results in Table 1 indicate that MAD-LSTM achieves a better training effect than a single LSTM. Further, we explore the relationship between the number of agents and the experimental results. We use the three datasets to train MAD-LSTM networks containing different numbers of agents. Table 2 provides the experimental results for different numbers of agents. Fig. 2 shows that distributed learning with two agents is better than with a single agent; similarly, distributed learning with five agents is better than with two agents, verifying the validity of Theorem 1. A detailed analysis of the experimental results is provided in Appendix Fig. A1.
The effectiveness of the MAD-LSTM network structure. The literature [33] shows that groups with a network structure exhibit higher levels of cooperative behaviour. The MAD-LSTM network describes the process by which multiple agents learn from each other and eventually reach a consensus on the mapping functions. The weights connecting the agents represent their degree of influence on one another; we can therefore construct an adjacency matrix based on the degree of influence between the agents. Assuming m = 3, we train three independent agents (LSTM-1, LSTM-2 and LSTM-3), MAD-LSTM and a non-cooperative MAD-LSTM (NC-LSTM) network on the three datasets, respectively. In NC-LSTM, the three agents do not share parameters through the adjacency matrix but average their training results to produce the final prediction. Table 3 shows that the training loss of MAD-LSTM is lower than that of the other two networks, demonstrating the effectiveness of the multi-agent distributed learning architecture.

Discussion
This work demonstrates that multi-agent learning outperforms single-agent learning when the dataset is sufficiently large, as the former leads to better representation learning, and it thereby establishes a theoretical foundation for distributed multi-agent learning. The same conclusion holds in practice even when the training samples differ across agents: for example, multimodal learning is better than unimodal learning, and multi-view learning is better than single-view learning [40]. Multimodal and multi-view learning can each be understood as forms of multi-agent learning. In summary, our conclusions are generalizable.
Collaboration is a common phenomenon in group learning. In this paper, by merging cooperative learning and a network architecture into one machine learning framework, we construct the multi-agent distributed LSTM (MAD-LSTM), a new framework that enhances prediction accuracy. In multi-agent cognitive learning, each agent is an LSTM network with the same goal but different parameters. We evaluate MAD-LSTM on the Mackey-Glass and Lorenz chaotic time series and the Beijing PM2.5 concentration dataset to verify the theoretical correctness and the model's validity. PM2.5 concentration arises from a complex weather system with a distinctly chaotic nature. The experimental results show that multi-agent learning outperforms its subsets for classical chaotic systems.
In particular, the way agents learn from one another significantly affects the final outcome of multi-agent learning. There are various methods of interaction between agents; this paper describes the most prevalent ones. We analyze the convergence of the multi-agent distributed learning framework based on the existence of a stationary distribution of Markov chains in the DeGroot model [42]. In fact, the adjacency matrix can be replaced with a time-varying matrix, and the multi-agent interaction scheme can be changed; both are worth further exploration.

Rademacher complexity principle and proof of Theorem 1
First, we provide a few assumptions.

Assumption 1: The loss function $l(\cdot, \cdot)$ is $L$-smooth with respect to its first coordinate and is bounded by a constant $C$.
Assumption 2: The true mapping $f^*$ is included in the function class $F$.
Assumption 3: For any set $K \subset M$, we have $F_K \subset F_M \subset F$.

In the theoretical analysis, Assumption 1 is a regularity condition on the loss function. Assumption 2 is the realizability condition in representation learning, which guarantees that the function class we optimize over contains the true mapping [40]. Next, we introduce the Rademacher complexity, which describes the complexity of the hypothesis space while taking the data distribution into account; we use it to quantify the population risk attained with different sets of agents. Specifically, let $F_M \subseteq [a, b]^X$ be the function class containing $m$ agents, let $x_1, \dots, x_s$ be independent random variables obeying some distribution $D$ on $\mathbb{R}^d$, and let $Q = (x_1, \dots, x_s)$ denote the sample set. The empirical Rademacher complexity [43] of $F_M$ with respect to $Q$ is
$$ \hat{R}_Q(F_M) = \mathbb{E}_\sigma \left[ \sup_{f \in F_M} \frac{1}{s} \sum_{i=1}^{s} \sigma_i f(x_i) \right], $$
where $\sigma = (\sigma_1, \dots, \sigma_s)^T$ and each $\sigma_i$ is an independent random variable taking values in $\{-1, 1\}$, known as a Rademacher variable. The empirical Rademacher complexity measures the correlation between the function class $F_M$ and the random noise $\sigma$ on the sample set $Q$. The Rademacher complexity of the function class $F_M$ is defined as
$$ R_s(F_M) = \mathbb{E}_{Q \sim D^s} \left[ \hat{R}_Q(F_M) \right]. $$
In addition, we recall McDiarmid's inequality and the properties of Rademacher complexity in the following lemmas.

Lemma 1 [43]. Let the random variables $x, \{x_i\}_{i=1}^s$ be samples independently drawn from the space $X$, and let $F \subseteq [a, b]^X$ be a set of bounded functions. Then
$$ \mathbb{E}_Q \left[ \sup_{f \in F} \left( \mathbb{E}_x[f(x)] - \frac{1}{s} \sum_{i=1}^{s} f(x_i) \right) \right] \le 2 R_s(F). $$

Lemma 2 [43]. Let $F$ and $H$ be real function classes. Then $R_s(\cdot)$ satisfies:
1. If $H \subseteq F$, then $R_s(H) \le R_s(F)$.
2. If $\phi : \mathbb{R} \to \mathbb{R}$ is Lipschitz with constant $L_\phi$ and satisfies $\phi(0) = 0$, then $R_s(\phi \circ F) \le 2 L_\phi R_s(F)$.
3. For any uniformly bounded function $c$, $R_s(F + c) \le R_s(F) + \|c\|_\infty / \sqrt{s}$.

Lemma 3 (McDiarmid's inequality) [43]. Let $x_1, x_2, \dots, x_s$ be independent random variables taking values in a set $A$, and suppose the function $f : A^s \to \mathbb{R}$ satisfies, for any $1 \le i \le s$,
$$ \sup_{x_1, \dots, x_s, x_i'} \left| f(x_1, \dots, x_i, \dots, x_s) - f(x_1, \dots, x_i', \dots, x_s) \right| \le c_i. $$
Then for any $\varepsilon > 0$,
$$ \mathbb{P}\left( f(x_1, \dots, x_s) - \mathbb{E}[f] \ge \varepsilon \right) \le \exp\left( \frac{-2 \varepsilon^2}{\sum_{i=1}^{s} c_i^2} \right). $$

We now bound the generalization gap. Consider the centred quantity
$$ J = \sup_{f_M \in F_M} \left( r(f_M) - \hat{r}(f_M) \right). $$
By Assumption 1, $l$ is bounded by the constant $C$, so for any $(x, y)$ we have $0 \le l(f_M(x), y) \le C$. If one data pair $(x_i, y_i)$ is changed, $J$ changes by at most $2C/s$. According to McDiarmid's inequality, with probability $1 - \delta/2$ we get
$$ J \le \mathbb{E}_Q[J] + C \sqrt{\frac{2 \ln(2/\delta)}{s}}. $$
Let $J_{11} = \mathbb{E}_Q[J]$.
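The empirical Rademacher complexity, and the monotonicity property of Lemma 2(1), can be illustrated numerically for small finite function classes (a toy sketch with classes of our own choosing; shared Rademacher draws make the comparison exact):

```python
import numpy as np

rng = np.random.default_rng(2)

s = 50
xs = rng.uniform(-1, 1, size=s)
sigmas = rng.choice([-1.0, 1.0], size=(2000, s))   # shared Rademacher draws

# Two nested finite function classes: H = {x, -x} and F = H plus two more.
F_small = [lambda x: x, lambda x: -x]
F_large = F_small + [lambda x: x ** 2, lambda x: np.sin(3 * x)]

def empirical_rademacher(fclass, xs, sigmas):
    """Monte Carlo estimate of E_sigma[ sup_f (1/s) sum_i sigma_i f(x_i) ]."""
    vals = np.stack([f(xs) for f in fclass])       # |class| x s matrix f(x_i)
    sups = np.max(sigmas @ vals.T, axis=1)         # per-draw supremum of sums
    return sups.mean() / len(xs)

r_small = empirical_rademacher(F_small, xs, sigmas)
r_large = empirical_rademacher(F_large, xs, sigmas)
assert 0 <= r_small <= r_large   # Lemma 2(1): H subset of F gives R(H) <= R(F)
```

Because the same sigma draws are reused for both classes, the supremum over the larger class dominates pointwise, so the inequality holds exactly, not merely in expectation.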

Consider the function class $l \circ F_M = \{(x, y) \mapsto l(f_M(x), y) : f_M \in F_M\}$ containing the $m$ agents. Taking $F = l \circ F_M$ in Lemma 1, the upper bound of $J_{11}$ is $2 R_s(l \circ F_M)$. To use the hypothesis function class directly, we decompose the Rademacher term built from the loss function class. We center the loss, $l'(f_M(x), y) = l(f_M(x), y) - l(\vec{0}, y)$. According to Lemma 2(3),
$$ R_s(l \circ F_M) \le R_s(l' \circ F_M) + \frac{\| l(\vec{0}, \cdot) \|_\infty}{\sqrt{s}}. $$
Because $l'$ is Lipschitz in its first coordinate with constant $L$ and $l'(\vec{0}, y) = 0$, Lemma 2(2) gives
$$ R_s(l' \circ F_M) \le 2 L\, R_s(F_M). $$
Combining the above Rademacher complexity analysis, we obtain, with probability at least $1 - \delta$,
$$ r(\hat{f}_M) - \hat{r}(\hat{f}_M) \le 4 L\, R_s(F_M) + \frac{2 \| l(\vec{0}, \cdot) \|_\infty}{\sqrt{s}} + C \sqrt{\frac{2 \ln(2/\delta)}{s}}, $$
where $\hat{r}$ denotes the centred empirical loss. Now consider the multi-agent subset $K \subset M$. By Assumption 3, optimizing over a larger function class yields a smaller empirical risk; therefore
$$ \hat{r}(\hat{f}_M) \le \hat{r}(\hat{f}_K). $$
The Rademacher complexity $R_s(F)$ at sample size $s$ is bounded by $C(F)/\sqrt{s}$, where $C(F)$ represents an intrinsic property of the function class $F$. Thus $R_s(F_M) \le C(F_M)/\sqrt{s}$ and $R_s(F_K) \le C(F_K)/\sqrt{s}$, and by Lemma 2(1) we can take $C(F_K) \le C(F_M)$. Applying the concentration bound to both classes and subtracting the risk of the true mapping, with probability at least $1 - \delta$,
$$ \phi(\hat{f}_M) \le \phi(\hat{f}_K) + \frac{c_1 L\, C(F_M)}{\sqrt{s}} + c_2 C \sqrt{\frac{\ln(2/\delta)}{s}} $$
for absolute constants $c_1, c_2$. The above inequality shows that: (1) as the data size $s$ increases, the effect of the intrinsic complexity of the function class decreases; (2) using more agents to learn effectively optimizes the empirical risk and thus improves the representation learning quality. When $K \subset M$ and the training sample size $s$ is large enough, $\phi(\hat{f}_M) \le \phi(\hat{f}_K)$ holds with probability at least $1 - \delta$, indicating that $m$ agents learn better than any subset, and in particular better than a single agent. □
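The key step that optimizing over a larger function class yields a smaller empirical risk can be checked with nested polynomial classes (a toy illustration of ours; least squares plays the role of ERM with squared loss):

```python
import numpy as np

rng = np.random.default_rng(3)

xs = rng.uniform(-1, 1, size=40)
ys = np.sin(2 * xs) + rng.normal(0, 0.05, size=40)

def erm_risk(degree):
    """Least-squares fit = ERM with squared loss over polynomials of `degree`."""
    coeffs = np.polyfit(xs, ys, degree)
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

risk_K = erm_risk(2)   # smaller class F_K (degree-2 polynomials)
risk_M = erm_risk(5)   # larger class F_M, with F_K contained in F_M
assert risk_M <= risk_K + 1e-12   # larger class => smaller empirical risk
```

Since every degree-2 polynomial is also a degree-5 polynomial, the minimum of the empirical risk over the larger class can only be lower, which is exactly the monotonicity used in the proof.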

Construction of distributed systems and proof of consistency (convergence)
Long short-term memory (LSTM). Theoretically, recurrent neural networks can handle sequence inputs of arbitrary length, but they are hard to train on long sequences due to vanishing gradients, exploding gradients and limited computational resources. To address this problem, Hochreiter and Schmidhuber proposed a new network topology, the long short-term memory (LSTM) model, in 1997. The LSTM overcomes this limitation by incorporating a specific "memory cell" structure into the network. This mechanism effectively captures long-term relationships without significantly increasing the number of parameters. LSTM models are a kind of recurrent structure suited to processing and forecasting significant events in time series with very long intervals and delays. The classical LSTM model includes input gates, output gates, forget gates, and memory cell structures. Fig. 1b depicts one of the more prevalent LSTM architectures.
Model architecture of MAD-LSTM. Each LSTM is regarded as an agent. The training process of multiple LSTM networks is a process of conceptual state transfer between agents. Agents can be classified as "leaders" or "followers" based on their degree of influence on one another. The agents learn from each other through a random matrix W whose rows sum to 1. All agents first learn the same dataset separately to form their own mapping functions. Each agent then updates its mapping function based on the random matrix W. The weighted average of all agents' new mapping functions is the output of the whole network system. Finally, the network system updates its parameters based on the loss. In machine learning, the mapping function of an agent is materialized as network parameters or a high-dimensional feature vector. Fig. 1a shows a multi-agent distributed LSTM network structure with three agents.
Let $f_i(n)$ denote the mapping function of agent $i$ at the $n$th time in an influence network $G = (V, E, W)$ composed of $m$ agents, where $n \in \mathbb{N}$. The mapping function of agent $i$ at the $(n+1)$th time is determined by the weighted average of its own and its neighboring agents' mapping functions at the $n$th time [42]; specifically,
$$ f_i(n+1) = \sum_{j=1}^{m} w_{ij} f_j(n), \qquad (10) $$
where $w_{ij}$ is an element of the adjacency matrix $W$ and satisfies $\sum_{j=1}^{m} w_{ij} = 1$.
Let $F(n) = (f_1(n), f_2(n), \dots, f_m(n))^T \in \mathbb{R}^m$ denote the vector of all agents' mapping functions at the $n$th time. The matrix form of equation (10) is
$$ F(n+1) = W F(n), \qquad (11) $$
where $W$ is a random matrix, i.e., a non-negative matrix whose rows sum to 1. The network $G$ depicts the multi-agent interactions in the distributed learning system.

MAD-LSTM convergence analysis. The heterogeneity of participant roles is known to be important for cooperation [32]. Agents are divided into "leaders" and "followers" based on mutual influence. Specifically, if a node $k \in N_i$ satisfies $w_{ik} = \max_{j \in N_i} w_{ij}$, i.e., node $i$ receives the most information from node $k$ (or is most influenced by it) among the nodes in $N_i$, then node $k$ is designated the leader of node $i$, and node $i$ its follower. In fact, the number of leaders influences network convergence [34]. Considering the complexity of the network, this paper focuses on the multi-agent distributed network model with a single leader.
In an influence network $G = (V, E, W)$ consisting of $m$ agents, assume $W$ remains constant throughout the learning process. The initial mapping function of agent $i$ is $f_i(0)$, where $i \in V$. The mapping function of agent $i$ is then updated according to equation (10), i.e., $f_i(1) = \sum_{j=1}^{m} w_{ij} f_j(0)$, written in matrix form as $F(1) = W F(0)$. Iterating equation (11), all agents' mapping functions are updated as
$$ F(n) = W^n F(0). $$
According to the condition for the existence of a stationary distribution [37], as $n \to \infty$ there is a stationary distribution, that is, $\lim_{n \to \infty} w_{ij}^{(n)} = \pi_j$, so every $f_i(n)$ converges to the common limit $\sum_{j=1}^{m} \pi_j f_j(0)$. In general, the multi-agent distributed learning model converges when $n$ is large enough, while the loss function guides the whole network toward the optimum.
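A minimal numerical check of this convergence argument (with an illustrative positive row-stochastic W of our own, not from the paper): iterating F(n+1) = W F(n) drives all agents to the consensus value given by the stationary distribution.

```python
import numpy as np

W = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])       # illustrative positive row-stochastic W
F0 = np.array([1.0, -2.0, 4.0])       # initial mapping values f_i(0)

F = F0.copy()
for _ in range(300):                  # iterate equation (11): F(n+1) = W F(n)
    F = W @ F

# The stationary distribution pi is the left eigenvector of W for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(W.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()

consensus = pi @ F0                   # the common limit sum_j pi_j * f_j(0)
assert np.allclose(F, consensus)      # every agent reaches the consensus value
```

Scalars stand in for the agents' mapping functions here; since equation (11) acts linearly, the same convergence applies coordinate-wise to parameter vectors.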

Hyperparameter setting.
All agents are three-hidden-layer LSTM networks with the same number of neurons in each layer. We utilize Adam [44] as the optimizer, MAE (mean absolute error) as the loss, a batch size of 120, and default values for the other hyperparameters. In addition, we monitor "val_loss" via the keras callback class (keras.callbacks); if "val_loss" does not drop for three epochs, we multiply the learning rate by 0.5. The learning rate is floored at 0.00001 and is not decreased further. For the Mackey-Glass and Lorenz datasets, the hidden layers were set to 32 neurons, and the first 200 data points were used to predict the next one. For the PM2.5 dataset, the hidden layers were configured with 48 neurons, and all monitoring data from the previous five days (220 hours) were used to predict the PM2.5 concentration one day (24 hours) into the future, i.e. the PM2.5 concentration in the early hours of day 7. The Lorenz dataset was trained for 300 epochs, and the other datasets for 200 epochs.
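The learning-rate schedule described above can be written out in plain Python (this mirrors the behaviour of the keras ReduceLROnPlateau callback with factor=0.5, patience=3, min_lr=1e-5; a sketch of the logic, not the authors' exact code):

```python
def schedule_lr(val_losses, lr=1e-3, patience=3, factor=0.5, min_lr=1e-5):
    """Halve lr whenever val_loss has not improved for `patience` epochs,
    never going below min_lr; returns the lr used at each epoch."""
    best, wait, history = float("inf"), 0, []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0          # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:
                lr, wait = max(lr * factor, min_lr), 0
        history.append(lr)
    return history

# A validation loss that stalls for three epochs triggers exactly one halving.
lrs = schedule_lr([1.0, 0.9, 0.9, 0.9, 0.9, 0.8])
assert lrs[:4] == [1e-3] * 4 and abs(lrs[-1] - 5e-4) < 1e-12
```

The initial learning rate of 1e-3 in this sketch is an assumption for illustration; the paper specifies only the reduction factor, patience, and floor.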

Appendix A Several chaotic systems and real datasets
Dataset. To validate our model, we employ two benchmark time series datasets and a real-world dataset. The following is a brief overview of the datasets.
(1) Mackey-Glass: Mackey-Glass time series forecasting is a well-known nonlinear fitting problem that serves as test data for much research. The series is generated by the following time-delay differential equation:
$$ \frac{dx(t)}{dt} = \frac{\alpha\, x(t - \tau)}{1 + x(t - \tau)^{\gamma}} - \beta\, x(t), $$
where $x(t)$ is the generated data, $\tau$ is a time-delay parameter, and $\alpha$, $\beta$, and $\gamma$ are parameters. Setting $\alpha = 0.2$, $\beta = 0.1$, $\gamma = 10$, $\tau = 17$, we obtain the training data.
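The generation of the series can be sketched with a simple Euler discretization of the delay equation (the step size dt and the constant initial history x0 = 1.2 are our own choices, not stated in the paper):

```python
import numpy as np

def mackey_glass(n, alpha=0.2, beta=0.1, gamma=10, tau=17, dt=1.0, x0=1.2):
    """Euler integration of dx/dt = alpha*x(t-tau)/(1+x(t-tau)^gamma) - beta*x(t)."""
    delay = int(round(tau / dt))            # delay expressed in grid steps
    x = [x0] * (delay + 1)                  # constant warm-up history
    for _ in range(n - 1):
        x_tau = x[-1 - delay]               # the delayed value x(t - tau)
        dx = alpha * x_tau / (1.0 + x_tau ** gamma) - beta * x[-1]
        x.append(x[-1] + dt * dx)
    return np.array(x[delay:])              # the last n values of the series

series = mackey_glass(1000)
assert len(series) == 1000 and np.isfinite(series).all() and series.min() > 0
```

The production term is non-negative and the decay term only shrinks the state, so the discretized series stays positive, matching the behaviour of the continuous system.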
(2) Lorenz: In the study of chaos, the Lorenz model has become the most widely accepted model. It is a fully deterministic system of third-order ordinary differential equations proposed by the American meteorologist Lorenz in his research on atmospheric motion, obtained by simplifying a convection model. The equations take the following form:
$$ \frac{dx}{dt} = \sigma (y - x), \qquad \frac{dy}{dt} = x(\rho - z) - y, \qquad \frac{dz}{dt} = xy - \beta z, $$
where $\sigma$ denotes the Prandtl number, $\rho$ the Rayleigh number, and $\beta$ a geometric (aspect-ratio) factor. Setting $\sigma = 10$, $\rho = 28$, and $\beta = 3$, we obtained 1000 points of the 3-D evolutionary trajectory.
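A trajectory with these parameters can be produced by fourth-order Runge-Kutta integration (the step size dt and the initial state are our own choices):

```python
import numpy as np

def lorenz_trajectory(n=1000, dt=0.01, sigma=10.0, rho=28.0, beta=3.0,
                      state=(1.0, 1.0, 1.0)):
    """RK4 integration of the Lorenz system; returns an (n, 3) trajectory."""
    def f(s):
        x, y, z = s
        return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
    traj = [np.array(state, dtype=float)]
    for _ in range(n - 1):
        s = traj[-1]
        k1 = f(s)
        k2 = f(s + dt / 2 * k1)
        k3 = f(s + dt / 2 * k2)
        k4 = f(s + dt * k3)
        traj.append(s + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.stack(traj)                   # the 3-D evolutionary trajectory

traj = lorenz_trajectory()
assert traj.shape == (1000, 3) and np.isfinite(traj).all()
```

RK4 with dt = 0.01 keeps the trajectory on the attractor; a simple Euler step would accumulate noticeably more error on this chaotic system.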
(3) PM2.5: The Beijing air PM2.5 concentration dataset is another widely used public benchmark. This paper considers only data observed in Beijing from January 1, 2010, to December 31, 2014. The dataset contains PM2.5 concentration (pm2.5), dew point (DEWP), temperature (TEMP), atmospheric pressure (PRES), wind direction (cbwd), wind speed (lws), cumulative snowfall (ls), and accumulated rainfall (lr), for 43,800 observations in total. We trained on 80% of the data and validated on the remaining 20%. The goal of the experiment was to use the first 200 observations to predict the PM2.5 concentration after an interval of 24 hours.

Fig. A1 Distribution of experimental results over more than ten runs. Each point in the upper graphs is the mean of more than ten runs; the lower graphs show the results of all runs. The results clearly show that multi-agent distributed learning outperforms its subsets.
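The sliding-window construction used in the experiments can be sketched as follows (our own illustration of the 200-step window with a 24-step horizon; the synthetic ramp series stands in for the PM2.5 readings):

```python
import numpy as np

def make_windows(series, window=200, horizon=24):
    """Each input is `window` consecutive observations; the target is the
    value `horizon` steps after the window ends."""
    X, y = [], []
    for start in range(len(series) - window - horizon + 1):
        X.append(series[start:start + window])
        y.append(series[start + window + horizon - 1])
    return np.array(X), np.array(y)

series = np.arange(500, dtype=float)      # stand-in for PM2.5 readings
X, y = make_windows(series)
assert X.shape == (277, 200)              # 500 - 200 - 24 + 1 usable windows
assert y[0] == 223.0                      # window [0..199] targets index 223
```

The same construction applies to the Mackey-Glass and Lorenz experiments with horizon = 1, since there the first 200 points predict the next one.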