Why do animals swirl and how do they group?

We report a possible solution to the long-standing problem of the biological function of swirling motion, in which a group of animals orbits a common center. We exploit the hypothesis that learning processes in the nervous systems of animals may be modelled by reinforcement learning (RL) and apply it to explain the phenomenon. In contrast to hardly justified models of physical interactions between animals, we propose a small set of rules to be learned by the agents, which results in swirling. The rules are extremely simple and thus applicable to animals with a very limited level of information processing. We demonstrate that swirling may be understood in terms of escort behavior, in which an individual animal tries to reside within a certain distance from the swarm center. Moreover, we reveal the biological function of swirling motion: a swarm trained for swirling is orders of magnitude more resistant to external perturbations than an untrained one. Using our approach, we analyze another class of coordinated animal motion: group locomotion in a viscous fluid. On a model example, we demonstrate that RL provides an optimal disposition of coherently moving animals with minimal dissipation of energy.

I. INTRODUCTION
Swirling is one of the most enigmatic phenomena of the collective behavior of animals. Circular motion around a common center is observed in large groups of animals at very different evolutionary stages, ranging from insects and plant-animal worms to fish. The biological function of such motion is hardly understood and remains under debate [16,18,27,30,31,38]. Moreover, even a realistic model of swirling is presently lacking. The existing approaches exploit artificial mechanical forces acting between animals. These interactions take the form of spring-like forces, gravitation-like forces, and forces of the Morse potential [14,15,26,29]. Certainly, such mechanical forces do not exist but serve to mimic the intention of moving animals to change their velocity. This intention is modeled in the form of Newton's second law, which describes the rate of change of the animals' velocity. The authors of Ref. [2] assumed that all particles move with a constant absolute velocity, whose direction varies in response to torques originating from interactions between the particles. Although milling patterns arise in this model, the functional form of such torques is rather artificial and can hardly be justified.
More realistic is the celebrated Vicsek model [11,41]. Here the intention of a moving agent (animal) is explicitly formulated as a kinematic algorithm that dictates the change of the animal's velocity. It is assumed that all agents move with a constant absolute velocity of varying direction. At each time step the direction is chosen as the average direction of motion of all neighbors located within a certain distance from the agent. Despite its simplicity, the model demonstrates very rich behavior, predicting different modes of collective motion, pattern formation, as in systems of dissipative particles [8,9], and even phase transitions [11,12,26,34,41]. Still, the Vicsek model is too simple to describe swirling motion, which arises in this model neither spontaneously nor in the presence of a circularly moving leader; the reason is the fixed magnitude of the velocity [7]. Hence there is a challenge to propose an intention-based model with a simple kinematic algorithm that results in swirling. Such an algorithm, based on a priori knowledge about the agents, should be as realistic as possible. The aim of the present study is to propose a relevant intention-based model of collectively moving agents which accounts for their perceptual and physical limitations. We expect that such a model will not only predict the emergence of swirling motion but also help to understand the biological function of this enigmatic phenomenon.
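For concreteness, the Vicsek update rule described above can be sketched in a few lines. The parameter values below (speed v0, interaction radius r, noise amplitude eta, box size L) are illustrative, not those of Refs. [11,41].

```python
import numpy as np

def vicsek_step(pos, theta, v0=0.03, r=1.0, eta=0.1, L=10.0, rng=None):
    """One update of the standard Vicsek model: every agent moves at constant
    speed v0 and adopts the mean heading of all neighbours within radius r,
    plus angular noise of amplitude eta, in a periodic box of size L."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(pos)
    # pairwise separations with periodic (minimum-image) boundaries
    d = pos[:, None, :] - pos[None, :, :]
    d -= L * np.round(d / L)
    dist = np.linalg.norm(d, axis=-1)
    new_theta = np.empty(n)
    for i in range(n):
        nb = dist[i] < r                      # neighbours (includes self)
        # circular mean of neighbour headings
        new_theta[i] = np.arctan2(np.sin(theta[nb]).mean(),
                                  np.cos(theta[nb]).mean())
    new_theta += eta * (rng.random(n) - 0.5)  # uniform noise in [-eta/2, eta/2]
    vel = v0 * np.stack([np.cos(new_theta), np.sin(new_theta)], axis=1)
    return (pos + vel) % L, new_theta
```

Note that the speed is strictly constant: only the heading is updated, which is precisely why, as discussed above, the model cannot produce swirling.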
In a seminal paper [33], Reynolds mentioned that an adequate model should reflect the limited perception of animals performing coordinated motion ("fish in murky water", "land animals that see only nearest herdmates", etc. [33]). The problem of multi-agent connectedness under limited information access has been addressed in [22], where the existence of algorithms preserving connectedness has been mathematically proven. That study demonstrates, however, that it is extremely difficult to formulate an explicit action algorithm for systems with limited information access. Furthermore, additional constraints due to the physical limitations of the agents make such a problem even harder and possibly unsolvable.
Nevertheless, nature finds a solution. Myriads of living beings, insects, fish, etc., swirl. It seems extremely improbable that animals follow a sophisticated mathematical algorithm imprinted in their genes. It is more reasonable to assume that animals learn to adapt their velocity, both in magnitude and direction, by a trial-and-error method, receiving a reward for a correct action and some form of punishment for an incorrect one. The reward and punishment are regulated by internal chemical processes in the animals' organisms. Hence, the most plausible assumption is that the response to an action (reward or punishment) and the variability of actions are fundamental features of animals, imprinted in their genes. This is our main hypothesis; see also the discussion below.
Once we accept such a hypothesis, we immediately recognize that the trial-and-error method, supplemented by reward and penalty, is at the heart of reinforcement learning (RL), one of the most powerful machine learning (ML) techniques. This method has been successfully exploited for various transport problems of active particles [10,32,36]. The application of ML to communication problems, including animal communication, has also demonstrated its efficiency [17,19,23,37,40,42]. Moreover, RL applied to biological systems gives new insight into reward-based learning processes [28] and, as we show below, helps to reveal the biological function of swirling. In recent studies [20,21], RL has been exploited to solve the rendezvous and pursuit-evasion problems for communicating agents possessing only local (limited) information. Inverse RL, which allows one to obtain an individual strategy of the agents that yields the observed collective behavior, was also proposed [21].
In the present article, we apply the RL approach to describe collective motion in a swarm under constraints reflecting the very limited perception and the physical limitations of the agents, as dictated by their biological nature. Instead of applying inverse RL, as suggested in [20,21], we explicitly consider different individual action rules that may stem from the very limited information-processing abilities of the animals. This reveals an interesting connection between the agents' strategy and their informational completeness and/or physical limitations.
As follows from our analysis, the swirling motion may be understood as a specific form of escort behavior, in which an individual animal tries to remain within a certain distance from the swarm center. Therefore, to illustrate the basic ideas, we start from the simplest escort problem, when only a few animals (in our case, a pair) are involved. Such escort behavior is observed mainly for mammals [6,38] with high information-processing abilities. For illustrative purposes, however, we consider the escort of animals with different levels of information processing.
The simplest one-to-one escort behavior may be formulated as follows: one animal, the "follower", tries to reside within a certain interval of distances from another, independently moving animal, the "leader". We demonstrate that, depending on the degree of awareness and physical abilities, the follower may choose very different strategies; some of them look rather astonishing. Next, we apply the same methodology to collective motion, supplementing the rules of one-to-one escort motion by a few rules of collective motion in a large group of animals. We show that the swirling motion arises spontaneously and persists. Then we apply random external perturbations to a swarm and observe that a group of animals "trained" to perform swirling demonstrates much stronger resistance to the perturbations than a non-trained group. This justifies our conjecture about the biological function of the swirling motion. To check whether the RL can find the best strategy, we investigate the optimal locomotion of a group of animals coherently moving in a viscous fluid, where they also experience physical interactions through the medium. In this case, the optimal locomotion strategy may be found from a straightforward solution of the coupled equations of motion for the agents. We demonstrate that the RL correctly reproduces all the results of the direct optimization.

II. RESULTS
A. One-to-one escort strategies

In one-to-one escort, a follower tries to reside within some range of distances from an independently moving leader. As we wish to model rather simple animals, we assume a very basic level of perception. For instance, it is hard to believe that such simple beings can perceive the exact distance between themselves and other objects. At the same time, it is natural to assume the ability of a simpler perception: whether they reside within some distance range from one another. We call this interval of distances, which can be rather large, a "comfort zone", see Fig. 1a. The notion of the comfort zone will later be applied to swirling motion, see Fig. 1b and c.
Some additional information is needed to construct a trajectory of the follower (in what follows, also the "agent"). We assume that the follower can perceive its own direction of motion and that of the leader, and can also distinguish between approaching and receding objects. Next, we consider three different scenarios: (A) the follower can perceive its absolute velocity, and there is no limitation on its acceleration; (B) the follower can perceive its absolute velocity, and its acceleration is limited; and (C) the follower does not perceive its absolute velocity but can perceive its acceleration, which is bounded from above. The last scenario is the most realistic. Indeed, it is not easy to measure velocity; however, even primitive beings can perceive the direction of motion and their acceleration, through muscle effort or other biological mechanisms. The underlying physiology dictates the limits of the animals' acceleration.
Leaving the mathematical and algorithmic details for the Methods section and Supplementary Material (SM), we discuss here the general ideas of how to implement RL for the problem at hand. At the heart of RL is the reward, a function of an agent's action in a given system state [5,28]. If the action brings the agent closer to the goal, the reward is positive; otherwise it is negative. This is analogous to positive and negative stimuli in the nervous system of a living being. The efficiency of the action policy is characterized by the average of the sum of all rewards (positive and negative) over all actions. The neural network is trained to choose the action policy that maximizes the reward and thereby reaches the desired goal. We stress that the neural network and the corresponding policy are associated with an individual agent. Moreover, we hypothesize that our RL-based model of agent actions mimics the most prominent features of the real informational processes that determine the behavior of living beings. We investigate different trajectories of the leader: a circle, an ellipse, a figure-eight, a spiral, and a triangle, which is not a smooth curve. For all these trajectories, the RL managed to train the network to develop successful strategies for the follower.
The results of the application of the RL to the escort problem are presented in Fig. 2. It is interesting to note that the optimal strategy drastically depends on the available information and physical limitations. Furthermore, the optimal trajectories of the follower are not necessarily smooth and sometimes look very astonishing. For the case of "abundant" information, when the velocity of the follower is known and there are no physical limitations (A-scenario), the follower first reaches the target distance to the leader and then applies the "frog strategy" of successive jumps: it waits until the leader moves far away and makes a long jump; in the new position it waits again, then jumps, and so on, Fig. 2a. Noteworthy, the jerky motion with non-smooth trajectories is not related to the discontinuity of the step function involved in the estimate of the reward; it persists for other, smooth functions, see SM for more detail. When the limitation on the acceleration is imposed, but the agent still perceives its velocity (B-scenario), the follower changes the strategy and moves on smooth trajectories, Fig. 2b.
For the most realistic C-scenario, when only the acceleration is known (which is also limited), the follower starts to use surprising wrapping trajectories, independently of the shape of the leader's trajectory, Fig. 2c,d,e. In this way it is guaranteed that the follower always remains within the "comfort zone", even if information about the velocity and the distance to the leader is lacking. We have also considered an additional energy-saving condition for the C-scenario of the escort. The resulting optimized trajectory remains wrapping, although the wrapping circles become much smaller. The appearance of the curled trajectories for the most realistic C-scenario gives us a hint that these trajectories can transform into swirling in a multi-agent problem.

B. The emergence of swirling
Now we address the collective motion (see Methods for mathematical and algorithmic details). As in Ref. [20], we assume that all animals (also "agents" here) are identical and follow the same action policy. We also assume that no information about an agent is transmitted to any other agent. That is, n animals learn simultaneously and independently, receiving rewards individually. We believe that such individual training describes the studied phenomenon more adequately and prevents a spurious information exchange between the agents.
To model the collective behavior, we supplement the individual perception rules of the escort problem by collective rules. Namely, the following information may be perceived and processed by an agent in a swarm: (i) whether the closest agent approaches or moves off; (ii) whether the closest agent is within the comfort zone; (iii) the direction of motion of the closest neighbor and of the agent itself; (iv) the acceleration of the agent, i.e., the force exerted by the agent; (v) whether the agent approaches or moves off from the center of the group; (vi) the direction of the velocity of the entire group; (vii) whether the agent resides within a target distance from the group center, which corresponds to the "comfort zone" with respect to (w.r.t.) the group center.
Note that while rules (i)-(iv) describe the one-to-one escort motion, rules (v)-(vii) describe the behavior in a group. This essentially corresponds to the escort of the group center by an individual animal. Indeed, the reward is given when an agent resides within the comfort zones, both w.r.t. its nearest neighbor and w.r.t. the swarm center; otherwise, the agent is penalized. We stress that the animals possess only a very fuzzy knowledge about the location of their neighbors and of the swarm center: they perceive only whether they reside in the corresponding comfort zone, which may be rather large.
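As an illustration of how little information rules (i)-(vii) expose, the agent's state can be encoded as a short, mostly boolean vector. The function below is a hypothetical sketch, not the paper's exact encoding; rule (iv), the agent's own force, would be supplied by its last action and is omitted here.

```python
import numpy as np

def swarm_observation(i, pos, vel, prev_pos, zone, zone_c):
    """Hypothetical encoding of perception rules (i)-(vii) as agent i's state
    vector; zone and zone_c are (min, max) comfort-zone bounds w.r.t. the
    nearest neighbour and the group centre, respectively."""
    center = pos.mean(axis=0)
    others = [j for j in range(len(pos)) if j != i]
    k = min(others, key=lambda j: np.linalg.norm(pos[j] - pos[i]))  # closest agent
    d_now = np.linalg.norm(pos[k] - pos[i])
    d_prev = np.linalg.norm(prev_pos[k] - prev_pos[i])
    dc_now = np.linalg.norm(pos[i] - center)
    dc_prev = np.linalg.norm(prev_pos[i] - prev_pos.mean(axis=0))
    unit = lambda v: v / (np.linalg.norm(v) + 1e-12)
    return np.concatenate([
        [float(d_now < d_prev)],                    # (i)   closest agent approaching?
        [float(zone[0] <= d_now <= zone[1])],       # (ii)  neighbour in comfort zone?
        unit(vel[k]), unit(vel[i]),                 # (iii) neighbour and self headings
        [float(dc_now < dc_prev)],                  # (v)   approaching the group centre?
        unit(vel.mean(axis=0)),                     # (vi)  direction of group velocity
        [float(zone_c[0] <= dc_now <= zone_c[1])],  # (vii) in comfort zone w.r.t. centre?
    ])
```

Everything except the three unit vectors is a single bit, which makes concrete the "very fuzzy knowledge" the agents are assumed to have.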
Here we hypothesize that rules (i)-(vii) have been imprinted in the animals' genes by evolution; they motivate all animals in a group to move in a way that fulfills the requested criteria. The agents choose an appropriate policy using the (very limited) information at hand. We observe that, starting with very different initial conditions, swirling motion of a swarm spontaneously emerges, see Fig. 3a. The swirling around the swarm center is commonly accompanied by a linear motion of the swarm as a whole. The swirling motion may be quantified by the average angular velocity Ω of the particles orbiting their common center of mass (see Sec. III for the definition). As may be seen from Fig. 3a, initially Ω = 0. In the course of time the swarm self-synchronizes, and a steady swirling with non-zero average angular velocity arises and persists.
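The order parameter Ω can be estimated from positions and velocities as the mean angular velocity about the center of mass, measured in the co-moving frame. This is a plausible reconstruction: the paper defers the exact definition to Sec. III.

```python
import numpy as np

def mean_angular_velocity(pos, vel):
    """Average angular velocity of agents about the swarm's centre of mass,
    with the swarm's bulk translation subtracted (co-moving frame)."""
    r = pos - pos.mean(axis=0)
    u = vel - vel.mean(axis=0)
    cross = r[:, 0] * u[:, 1] - r[:, 1] * u[:, 0]  # z-component of r x u
    r2 = (r ** 2).sum(axis=1)
    mask = r2 > 1e-12                              # skip agents at the exact centre
    return (cross[mask] / r2[mask]).mean()
```

For a disordered swarm the individual terms cancel and Ω ≈ 0; for a rigidly rotating one, Ω equals the rotation rate.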

C. The biological function of the swirling motion
Is it possible to understand the biological function of the swirling motion in swarms? We assume that swirling helps to resist external forces that may jeopardize animals; for instance, a wind that may blow an insect swarm far from its habitat. To test this conjecture, we conducted the following experiments. We incorporated an additional external force into the environment and measured the shift of the swarm center in the direction of the force. During the training, we changed the direction of the external force with a uniform angular distribution. We also modulated the force strength with some periodicity, applying a stretched exponential distribution, widely used to describe natural phenomena, see e.g. [24,35] and Sec. III for detail. We performed the averaging of the swarm shift (in the direction of the force) for different values of the average force. For each average force, we repeated our experiments ten times. We checked that varying the parameters of the force distribution does not change the qualitative behavior of the system. Moreover, the behavior of the system persists when the size of the comfort zone or the number of agents in a swarm varies. More details are given in the SM.
In Fig. 3b we plot the dependence of the shift of the swarm center on the average external force for the basic version of the stretched exponential distribution (see Sec. III); the results for the general case may be found in the SM. We compare two groups of agents: one has been trained to swirl, as described in the previous section; the other is untrained. As may be seen from the figure, the group that performs swirling resists up to 100 times more efficiently than the one without swirling. The resistance fades for very strong external forces, comparable with the maximal force which may be exerted by an agent. Hence we come to an important, although surprising, conclusion: the intention to move around a swarm center results in extremely high resistance to external perturbations. It seems astonishing that such a simple strategy helps living beings to cope with a hostile environment. In other words, we conclude that the enigmatic swirling motion is not at all a random arrangement or a behavioral error of a group of animals. On the contrary, it plays a crucial role in their survival. Certainly, animals with a complex organization can learn more efficient strategies, but for very simple beasts this strategy may be the most reliable way to resist environmental threats.

D. Optimal swarm locomotion
As we have illustrated above, the RL allows modeling complicated coordinated motion of swarms, where the agents receive and process very restricted information about themselves and the system as a whole. Moreover, the RL allows one to model systems without formulating an explicit action strategy: the action policy is developed through training with an appropriately chosen reward function. Since an explicit action strategy was lacking for the case of swirling motion, it is worth checking whether the RL provides a truly optimal strategy. Therefore, we consider a problem where the optimal strategy is known beforehand and verify that the RL does find this strategy. Namely, we analyze the optimal locomotion of a group of animals coherently moving in a viscous medium. The total energy dissipated by the moving group sensitively depends on the mutual disposition of the agents. The specific energy dissipation per animal in a group may be significantly smaller than the dissipation of a solitarily moving agent. That is why migrating birds form flocks and cyclists form a compact group in a race: it helps to save a lot of energy. Here we demonstrate that the RL applied to the locomotion problem of a swarm indeed yields the optimal disposition of the swarm members.
Realistic modelling of a flock of birds is extremely difficult due to the complicated shape of birds and the complex flow structure of the surrounding air; moreover, birds fly at high Reynolds numbers. Therefore, to illustrate the concept, we consider a simplified model, where a swarm is comprised of spherical agents moving with velocities corresponding to low Reynolds numbers. These constraints make the model tractable. Indeed, for low Reynolds numbers there exists an analytical theory for the forces acting between spherical particles moving in a viscous fluid (see Methods and SM). Using this theory, the optimal configuration may easily be found for any number of agents, as illustrated in Fig. 4.
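At low Reynolds numbers, the coupled drag forces follow from a linear mobility problem. The sketch below uses the far-field Oseen-tensor approximation for identical spheres moving with a common velocity, restricted to the 2D plane of motion; the viscosity, radius, and this restriction are illustrative simplifications, not the full theory referenced in Methods.

```python
import numpy as np

ETA, A = 1.0, 0.1  # fluid viscosity and sphere radius (assumed units)

def drag_forces(pos, v):
    """Forces needed to move identical spheres at common velocity v through a
    viscous fluid: solve the mobility problem M F = V, where the self-mobility
    is 1/(6 pi eta a) and the cross-mobility is the Oseen tensor
    (I + rhat rhat)/(8 pi eta r)."""
    n = len(pos)
    M = np.zeros((2 * n, 2 * n))
    for i in range(n):
        M[2*i:2*i+2, 2*i:2*i+2] = np.eye(2) / (6 * np.pi * ETA * A)
        for j in range(n):
            if i == j:
                continue
            r = pos[i] - pos[j]
            d = np.linalg.norm(r)
            rhat = r / d
            M[2*i:2*i+2, 2*j:2*j+2] = (np.eye(2) + np.outer(rhat, rhat)) / (8 * np.pi * ETA * d)
    F = np.linalg.solve(M, np.tile(v, n))
    return F.reshape(n, 2)

def dissipation(pos, v):
    """Total dissipated power P = sum_i F_i . v for a rigidly translating group."""
    return float(drag_forces(pos, v).sum(axis=0) @ v)
```

Minimizing `dissipation` over the agents' positions (at fixed separation constraints) reproduces the kind of direct optimization illustrated in Fig. 4: a pair moving together already dissipates less than two isolated spheres.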
Next, we check whether the RL is able to find the optimal configurations for the same setup. Namely, we consider a group with a leading animal, which moves with a fixed velocity v_lead and is not affected by other agents of the group. Here we assume that more abundant information may be received and processed. Namely, we assume that the animals have a high perception level, that is, they can perceive: (i) the relative positions of the agents and the leader; (ii) whether the nearest neighbor is within the comfort zone; (iii) the agent's own velocity in the laboratory frame and the relative velocity with respect to other agents; (iv) the force exerted by the agent. The reward function tends to keep the agents in their mutual comfort zones and to minimize the total dissipation (see Methods and SM for detail). Fig. 5 illustrates the optimal configurations found by the RL, which practically coincide with those obtained by the direct optimization, Fig. 4. These results justify our a priori trust that the RL manages to find the optimal solution, even when it is not known explicitly.

Conclusion and Outlook
We propose an explanation of an enigmatic phenomenon, the swirling motion in large groups of animals at different evolutionary stages, and possibly reveal its biological function. Our approach is based on the hypothesis that the learning processes in the nervous systems of animals may be mimicked by reinforcement learning (RL), with a reward for a beneficial action and a punishment for a harmful one. We apply the RL to understand the collective motion of simple animals. They have very limited abilities to perceive and process information about their kinematic states. These limitations are associated with the rather basic level of their organization. We also consider physical limitations caused by the biological nature of living beings. We formulate a small set of very simple rules which animals strive to follow to conform with beneficial behavior. We hypothesize that such a set of rules is imprinted in the animals' genes.
We assume that among the main rules of beneficial behavior in a swarm is the rule to reside within a certain interval of distances (the comfort zone) from the center of the group, and to reside within the comfort zone with respect to other neighbors. This kind of individual behavior corresponds to escort behavior w.r.t. the group center. Therefore, we start with the analysis of the one-to-one escort problem, with a leader and a follower. The leader moves independently along various trajectories, and the follower tends to reside within the respective comfort zone. We demonstrate that, depending on the information-processing abilities of the follower and its physical limitations, very different escort strategies arise. These include such amazing behaviors as the leap-frog strategy and the strategy of wrapping trajectories.
We demonstrate that the escort strategy of individuals w.r.t. the swarm center results in spontaneous swirling motion. We observe the emergence of swirling for all studied initial conditions and propose a criterion to quantify it. Interestingly, we reveal that the swirling motion leads to a dramatic increase of the resistance of a swarm to external perturbations. We believe that this proves the biological function of swirling: animals swirl to protect themselves in a hostile environment.
We also consider the problem of optimized locomotion of a group of animals moving in a viscous fluid. In this case the energy dissipation of a single animal drastically depends on the mutual disposition of the group members; that is why the organization of the group is so important for coherently moving animals. Using a simplified model, we demonstrate the ability of the RL to find the optimal configuration of a pack. We show that the agents' disposition obtained by the RL practically coincides with the one found by the direct minimization of the total dissipation, as follows from the underlying physics. We believe that this result confirms that the RL is able to find the optimal solution even in cases where such solutions may not be found by traditional mathematical methods.
In our study we use the assumption that the exploited reward rules have been formed in the course of biological evolution, a process continuing on the evolutionary time-scale and comprising myriads of generations of a species. It is governed by random mutations of animals' genes accompanied by natural selection [3]. This has a strong impact on the evolution of the reward rules and drives them (by natural selection) toward the ones most optimal for the survival of a species. The result of the evolution is imprinted in the genes of all animals of a species and persists as long as the species exists. In contrast, the training process has a much shorter time-scale, just the time needed to train an animal. The result of the training is coded in the neurons of a particular living being and disappears with its death. Naturally, it is desirable to have a complete model which describes both the formation of the reward rules during the biological evolution and the training of individual beings. This is, however, a computationally very challenging task, and we leave it for future studies.

A. General principles of the RL
To model the information processing of an agent (animal), we first need to describe the surroundings. That is, we need to formulate the physical properties of the objects and the surrounding medium. The respective model is called the "environment". It describes the system dynamics using predefined parameters as well as distributions of these parameters, e.g., the distribution of the initial agents' positions and velocities. Generally speaking, the environment contains all possible evolution scenarios of the system; its elaboration is at the heart of the approach. Secondly, we need to specify the level of the agents' awareness of the environment. The available information about the surroundings is referred to as the agent's state. It contains information about the medium and the surrounding objects (their speed, location, etc.) and reflects the agent's understanding of itself as well. Thirdly, we need to define an action space: it determines how the agents interact with the environment. The fourth component is called the "policy"; it characterizes the strategy of the agents' actions and is quantified by the probability of a certain action in a given state. Finally, the fifth part refers to the reward. The reward is a mechanism to judge how desirable the present state is compared to another one. It also assesses the benefit of an action that reaches a specific state right now, instead of doing so later.
Once all the above parts are specified, the most computationally intensive part, the learning of the optimal policy, may be performed. Such an approach is called "reinforcement learning" (RL). There are plenty of algorithms to obtain the optimal policy; below we use the policy gradient algorithm [39]. To summarize, our analysis comprises the five components specified above (environment, states, actions, policy, and reward), followed by the learning of the optimal policy. In what follows we discuss these steps in detail; at the end of the section we present the mathematical formulation and algorithms.

B. Environment
We start with a random initialization and then continue with the dynamic simulations. Namely, we generate random initial positions and velocities of all agents. Also, an external force direction (the wind direction) is specified as a random parameter. The random parameters, as well as other specified variables, remain constant during a specified modeling time cycle, called an "epoch". The force magnitude changes independently as a random process in time, according to a specified probability distribution. In our study we exploit the stretched exponential distribution for the magnitude of the random force f, widely used to model natural phenomena [24,35]:

p(f) = β/(f_0 Γ(1/β)) exp[-(f/f_0)^β], f ≥ 0.

Here Γ(x) is the Gamma function, β is the stretching exponent, and f_0 is the scale factor. We mainly use the simplest version with β = 1, which is also called the exponential distribution; the parameter f_0 of the exponential distribution corresponds to its mean value. Other distributions from this class, with a wide range of β, have also been tested; this does not change the qualitative behavior of the system, see SM for more detail. In our numerical experiments we model the evolution of the environment. Namely, we solve time-discretized equations, where the variables are calculated from their values at the previous time step, applying the action policy. Random factors that impact the process are also included. The simulations are repeated from one epoch to another, iteratively improving the action policy according to the learning rules.
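Taking the stretched exponential density in its standard normalized form, p(f) = β/(f_0 Γ(1/β)) exp[-(f/f_0)^β], the force magnitude can be sampled exactly through a Gamma variate, since the change of variables y = (f/f_0)^β turns the density into Gamma(shape = 1/β, scale = 1). This is a sketch; the paper does not specify its sampling procedure.

```python
import numpy as np

def sample_force(n, f0=1.0, beta=1.0, rng=None):
    """Draw n force magnitudes from the stretched exponential density
    p(f) = beta/(f0*Gamma(1/beta)) * exp(-(f/f0)**beta), f >= 0.
    If Y ~ Gamma(shape=1/beta, scale=1), then f = f0 * Y**(1/beta)
    has exactly this density."""
    rng = rng if rng is not None else np.random.default_rng(0)
    y = rng.gamma(1.0 / beta, 1.0, size=n)
    return f0 * y ** (1.0 / beta)
```

For β = 1 the Gamma(1, 1) variate is a standard exponential, so the sample mean approaches f_0, consistent with the exponential special case used in the main text.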

C. Policy
The action policy has been developed for various types of actions. All of these require continuous random variables. Here we use normally distributed variables with a slight modification that prevents the occurrence of very large values forbidden by the underlying physics. It is straightforward to derive the properties of such distributions (we call them "bounded normal distributions"); this is detailed in the SM.
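One simple realization of such a bounded normal distribution is rejection sampling: draw from the normal and redraw any value outside the physical bounds. This is only a sketch of one possible construction; the paper's exact modification is given in the SM.

```python
import numpy as np

def bounded_normal(mu, sigma, lo, hi, size, rng=None):
    """Sample from N(mu, sigma) conditioned on [lo, hi] by rejection:
    values outside the bounds are redrawn until all samples are physical."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = rng.normal(mu, sigma, size)
    bad = (x < lo) | (x > hi)
    while bad.any():
        x[bad] = rng.normal(mu, sigma, bad.sum())
        bad = (x < lo) | (x > hi)
    return x
```

The resulting density is the truncated normal, which keeps the Gaussian shape of the action noise while strictly enforcing the physical limits on the agents' actions.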
Denote by π_θ(a_t | s_t) the policy with parameters θ, which is the probability density function of an action a_t for a given state s_t at time t. The probabilistic environment p(s_{t+1} | s_t, a_t) describes the probability of the next state s_{t+1} at time t+1 for a given state s_t and applied action a_t at time t. The policy gradient algorithm [39] is based on approximating the policy by a class of parametric functions that encode the probability distribution of actions in a given state. This class of functions iteratively approaches the optimal action policy with the use of a Monte-Carlo gradient estimate which maximizes the expected reward. The reward r_t = r_t(a_t, s_t) is a quantitative description of the advantage of the current state: it quantifies how desirable the state s_t caused by the action a_t is. The quality of the whole process, i.e., the quality of the action policy, can be assessed as the average sum of the successive (discounted) rewards,

R_t = Σ_{k=t}^{T} γ^{k-t} r_k,

where 0 < γ < 1 is the discount factor and T is the duration of the training epoch. The overall goal of an agent is to achieve the maximal cumulative reward. Note that the environment is random and the action policy is associated with a probability density function. Hence the maximization of the cumulative reward R_t corresponds to the maximization of the average expected reward, see SM for more detail. An analytical estimate of this quantity is usually not possible, while the numerical integration over an epoch is computationally demanding. Hence the Monte-Carlo gradient estimate method is applied. Knowing the complete history of the state-action distribution, one can find an unbiased estimate of the policy gradient. Performing gradient steps epoch after epoch, the optimal policy is eventually found. We refer to [39] and the SM for a detailed description of this method.
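The cumulative discounted reward R_t that the policy gradient maximizes can be computed for a recorded epoch with a single backward pass over the reward sequence:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Cumulative discounted reward R_t = sum over k >= t of gamma**(k-t) * r_k,
    computed backwards via the recursion R_t = r_t + gamma * R_{t+1}."""
    R = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        R[t] = running
    return R
```

In a REINFORCE-style update, each log-probability log π_θ(a_t | s_t) would then be weighted by R_t (minus a baseline) to form the Monte-Carlo gradient estimate.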

D. Neural network architectures
The policy described above depends on two sets of parameters: the mean values of the original normal distribution and the corresponding variances. Since the action space is two-dimensional (the motion occurs in 2D), we have two means and two variances. The number of observed states strongly depends on the problem; the neural network architecture, however, remains the same. For the action policy it comprises three fully connected layers and two ELU (Exponential Linear Unit) activation functions between them, see Fig. 6a. The exponential activation functions are used to avoid negative values of the variances. The architecture of the baseline function, Fig. 6b, has a similar structure.
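A minimal numpy sketch of this architecture; the hidden width, the initialization scale and the log-variance parametrization are illustrative assumptions, while the layer count, the ELU activations and the exponential output for the variances follow the description above:

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential Linear Unit activation."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

class PolicyNet:
    """Three fully connected layers with ELU activations in between;
    outputs 2 means and 2 variances (variances via exp to stay positive)."""
    def __init__(self, n_states, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        dims = [n_states, hidden, hidden, 4]  # 2 means + 2 log-variances
        self.W = [rng.normal(0.0, 0.1, (dims[i], dims[i + 1])) for i in range(3)]
        self.b = [np.zeros(dims[i + 1]) for i in range(3)]

    def forward(self, s):
        h = elu(s @ self.W[0] + self.b[0])
        h = elu(h @ self.W[1] + self.b[1])
        out = h @ self.W[2] + self.b[2]
        mean, log_var = out[:2], out[2:]
        return mean, np.exp(log_var)  # exp guarantees positive variances
```

The two means and two variances parametrize the bounded normal distribution from which the 2D action is sampled.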

E. Algorithms of the Escort problem
First we consider the escort problem, which consists of a single agent-follower moving in an inert two-dimensional space and a leader moving on a predefined trajectory. The goal of the follower is to pursue the leader; more specifically, to reside within some range of distances from the leader.
Let the leader position be x_l and the follower position be x_f. Then the following information is available to the follower:
1. Whether the follower resides within a range of distances from the leader:
$$S_d = H\!\left(e_t - \big|\,\|x_l - x_f\| - d_t\,\big|\right),$$
where H(x) is the Heaviside step function, d_t is the average distance from the leader and e_t > 0 specifies the range of acceptable distances.
2. Whether the follower approaches the leader:
$$S_a = H\!\left(-\frac{d}{dt}\,\|x_l - x_f\|\right).$$
The goal of the agent is to maximize the reward, which is computed by Algorithm 1. We encode the reward function so that the most desirable state is to reside within the range of target distances from the leader. However, if the agent is not close enough, the reward is given for the pursuit of the leader.

Algorithm 1 Reward - Escort problem
Require: Agent states: 1. Whether the follower resides within the range of target distances from the leader: S_d. 2. Whether the follower approaches the leader: S_a. Ensure: Reward r.
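Since the body of Algorithm 1 is summarized here only in prose, the following sketch encodes the stated priorities; the numerical reward constants are illustrative assumptions (the exact values are in SM):

```python
def escort_reward(S_d, S_a, r_in=1.0, r_pursuit=0.5, r_miss=-1.0):
    """Sketch of Algorithm 1 with assumed constants.
    S_d: follower is within the target range of distances (0/1).
    S_a: follower is approaching the leader (0/1)."""
    if S_d:
        return r_in        # best case: inside the target range
    if S_a:
        return r_pursuit   # otherwise reward the pursuit of the leader
    return r_miss          # neither: penalize
```

The ordering of the branches implements the stated priority: residing in the target range dominates, and pursuit is rewarded only outside it.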
The second important part of the policy refers to the actions which a follower can perform to maximize the reward. We define the leader velocity as v_l and the follower velocity as v_f. We assume that the follower knows the direction of the leader velocity, \hat{v}_l = v_l/‖v_l‖, and the direction of its own velocity, \hat{v}_f = v_f/‖v_f‖. For the scenarios A and B detailed in Section II the follower also knows the magnitude of v_f; that is, the agent possesses complete information about its velocity. Hence, the actions of the agent (follower) refer to the regulation of its own velocity. In real life, however, the velocity is limited, which implies a limitation of actions that needs to be reflected in the policy. For scenario B there exists an additional limitation on the agent acceleration, which is also reflected in the policy, see SM.
Here we discuss in detail the most realistic escort scenario C, where the information about the velocity of an agent (follower) is not available, but the agent can control the force, that is, regulate its acceleration. Namely, the agent possesses the following information characterising the state of the environment: (i) the direction of the leader velocity \hat{v}_l = v_l/‖v_l‖, (ii) the direction of the follower velocity \hat{v}_f = v_f/‖v_f‖ and (iii) the follower force F_f. The available information about the spatial location has been itemised above. The actions of an agent in scenario C refer to the regulation of its force, instead of the velocity as in scenarios A and B. The force, and thus the acceleration, is limited, as in scenario B.
Notably, without knowledge of the follower/leader positions and absolute velocities, it is extremely difficult (if at all possible) to obtain an explicit policy of optimal pursuit. Reinforcement learning allows one to construct an efficient stochastic policy from very limited available information.

F. Algorithms of motion in a swarm and emergence of swirling
Similarly to the escort scenario, each agent moving in a swarm exploits the available information about its state. We additionally assume that the agents strive to reside not far from the group center and supplement the agent state with extra parameters which specify the desired interval of distances from the group center located at r_gc. That is, we use the concept of a target range of distances from the group center: each agent strives to reside within the radius r_cz (i.e. in the comfort zone) around r_gc. Thus, each i-th agent in the swarm perceives the following data:
1. Whether the closest neighbour breaks into the comfort zone r_cz:
$$S^i_{cz} = H\!\left(r_{cz} - \|x_i - x_{ci}\|\right),$$
where x_ci is the coordinate of the nearest neighbour and H(x) is the Heaviside step function.
2. Whether the agent approaches the closest agent:
$$S^i_{ca} = H\!\left(-\frac{d}{dt}\,\|x_i - x_{ci}\|\right).$$
3. Whether the agent resides within the target distance r_cz to the group center r_gc:
$$S^i_{gc} = H\!\left(r_{cz} - \|x_i - r_{gc}\|\right), \qquad r_{gc} = \frac{1}{n}\sum_{j=1}^{n} x_j.$$
4. Whether the agent approaches the group center:
$$S^i_{ga} = H\!\left(-\frac{d}{dt}\,\|x_i - r_{gc}\|\right),$$
where n is the number of the agents.
Based on these data we define the reward, see Algorithm 2. It encodes our assumptions about the agent's striving. The most desirable agent states are realized when the agent resides within the target distance to the group center. However, the agent is penalized if someone breaks into its comfort zone. Finally, if the agent is not within the target distance, we reward the pursuit of the group center.

Algorithm 2 Reward for the motion in a swarm
Require: The agent states: 1. Whether the closest agent breaks into the comfort zone: S^i_cz. 2. Whether the agent approaches the group center: S^i_ga. 3. Whether the agent resides within the target distance from the group center: S^i_gc. Ensure: Reward r.
Additionally, each agent knows the direction of the closest agent velocity \hat{v}_ci = v_ci/‖v_ci‖, the velocity direction of the agent itself \hat{v}_i = v_i/‖v_i‖, the agent force (its action) F_i and the direction of the entire group velocity \hat{v}_gc = v_gc/‖v_gc‖. The emergent swirling motion is quantified by the average angular velocity of the swarm, defined as
$$\omega = \frac{1}{n}\sum_{i=1}^{n} \frac{\left[(x_i - r_{gc})\times(v_i - v_{gc})\right]_z}{\|x_i - r_{gc}\|^2},$$
where the radius vector r_gc and velocity v_gc of the group center (i.e. of the center of mass) have been given above.
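The angular-velocity diagnostic can be sketched as follows; normalizing each agent's contribution by its squared distance to the center is an assumption consistent with an "average angular velocity" of the swarm:

```python
import numpy as np

def swarm_angular_velocity(x, v):
    """Average angular velocity of a 2D swarm about its center of mass.
    For each agent: the z-component of (x_i - r_gc) x (v_i - v_gc),
    divided by |x_i - r_gc|^2, averaged over agents (normalization is
    an assumption). x, v: arrays of shape (n, 2)."""
    r_gc = x.mean(axis=0)                     # group center (center of mass)
    v_gc = v.mean(axis=0)                     # group-center velocity
    dx, dv = x - r_gc, v - v_gc
    cross_z = dx[:, 0] * dv[:, 1] - dx[:, 1] * dv[:, 0]
    r2 = (dx ** 2).sum(axis=1)
    mask = r2 > 1e-12                         # skip agents at the center
    return np.mean(cross_z[mask] / r2[mask])
```

For a rigidly rotating ring of agents this quantity equals the rotation rate, and it vanishes for purely translational motion.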
G. Physical background and algorithms for RL of efficient locomotion of a swarm

A system of spherical particles moving in a viscous fluid at small Reynolds numbers may be described by the so-called Rotne-Prager theory [13], which goes beyond the classical Oseen theory [25]. This theory may be formulated in terms of the velocity of the i-th agent, v_i, resulting from the forces F^e_j, j = 1, 2, ..., n, applied to all n agents by the environment:
$$v_i = \sum_{j=1}^{n} \hat{\zeta}_{ij}\, F^e_j, \qquad (1)$$
with the matrix of friction coefficients
$$\hat{\zeta}_{ij} = \frac{1}{8\pi\eta \|x_{ij}\|}\left[\left(\hat{I} + \hat{r}_{ij}\hat{r}_{ij}\right) + \frac{2a^2}{3\|x_{ij}\|^2}\left(\hat{I} - 3\,\hat{r}_{ij}\hat{r}_{ij}\right)\right] \;(i \neq j), \qquad \hat{\zeta}_{ii} = \frac{\hat{I}}{6\pi\eta a}, \qquad (2)$$
where \hat{I} is the unit matrix, a is the agent radius, η is the fluid viscosity and x_ij = x_i − x_j is the radius vector joining two agents. Finally, \hat{r}_{ij} = x_{ij}/‖x_{ij}‖ is the unit inter-agent vector. We assume that the motion is always over-damped, so that the velocity of an agent immediately relaxes to a uniform, time-independent value corresponding to the set of forces F^e_i, as in Eqs. (1) and (2), see SM.

Then the power dissipated by the i-th agent may be written as (v_i − v_{f,i}) · F^a_i, where F^a_i is the force exerted by the i-th agent on the medium and v_{f,i} is the fluid velocity at the location of the i-th agent. By Newton's third law, the force exerted by a uniformly moving body on the medium equals the force acting back on the body from the medium. Hence the total power dissipated by a swarm of n agents reads
$$P = \sum_{i=1}^{n} \left(v_i - v_{f,i}\right) \cdot F^a_i = \frac{1}{6\pi\eta a}\sum_{i=1}^{n} \|F^a_i\|^2. \qquad (3)$$
To obtain the last part of Eq. (3), we observe that the local fluid velocity follows from Eqs. (1) and (2): v_{f,i} = v_i + \hat{\zeta}_{ii} F^e_i, so that (v_i − v_{f,i}) = −F^e_i/(6πηa) = F^a_i/(6πηa).

Consider a swarm where all animals move with the same absolute velocity in the same direction. Then, as follows from Eqs. (1), (2) and (3), the total dissipation P is determined by the set of vectors x_ij, i, j = 1, ..., n, that define the mutual disposition of the agents. To find the optimal swarm configuration, we need to minimize P. This is a well-posed optimization problem with a straightforward solution; the result is given in Fig. 4.
Interestingly, the optimal configuration for a pack of three animals corresponds to a triangle with a base along the direction of motion, that is, two equivalent configurations are possible, see Fig. 4.
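The dissipated power for a prescribed common velocity can be computed by assembling the mobility matrix of Eq. (2), solving Eq. (1) for the forces and applying Eq. (3). The sketch below restricts the motion to a plane and sets the agent radius and viscosity to unity (assumptions); it illustrates that agents moving close together dissipate less than widely separated ones:

```python
import numpy as np

def mobility_matrix(x, a=1.0, eta=1.0):
    """Rotne-Prager mobility for in-plane motion: v = M F, Eqs. (1)-(2).
    x: array of shape (n, 2) with agent positions (separations > 2a)."""
    n = len(x)
    M = np.zeros((2 * n, 2 * n))
    I = np.eye(2)
    for i in range(n):
        for j in range(n):
            if i == j:
                blk = I / (6 * np.pi * eta * a)          # self-mobility
            else:
                rij = x[i] - x[j]
                r = np.linalg.norm(rij)
                rh = np.outer(rij, rij) / r**2           # r_hat r_hat dyad
                blk = ((I + rh) + (2 * a**2 / (3 * r**2)) * (I - 3 * rh)) \
                      / (8 * np.pi * eta * r)
            M[2 * i:2 * i + 2, 2 * j:2 * j + 2] = blk
    return M

def dissipated_power(x, v, a=1.0, eta=1.0):
    """Total power P = sum_i |F_i|^2 / (6 pi eta a), Eq. (3), for agents
    moving at prescribed velocities v; the forces follow from Eq. (1)
    (by the third law, |F^a_i| = |F^e_i|)."""
    F = np.linalg.solve(mobility_matrix(x, a, eta), v.ravel()).reshape(-1, 2)
    return (F ** 2).sum() / (6 * np.pi * eta * a)
```

For instance, a pair of agents at separation 3a moving side by side along their line of centers dissipates markedly less than the same pair separated by 100a; minimizing P over configurations yields the optimum of Fig. 4.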
Our next goal is to check whether RL is able to find the optimal configurations, which are known. To solve this problem we consider a group of animals comprising a leader moving in a fixed direction with a constant velocity and several followers which aim to retain their comfort zone with respect to the other agents and the leader; simultaneously they strive to minimize the dissipation of energy. The agents (followers) are aware of the following information:
1. Whether the closest agent breaks into the comfort zone r_cz:
$$S^i_{cz} = H\!\left(r_{cz} - \min_{j \neq i}\|x_i - x_j\|\right),$$
where x_i is the i-th agent position and H(x) is the Heaviside step function.
2. Whether the agent approaches the closest agent:
$$S^i_{ca} = H\!\left(-\frac{d}{dt}\,\min_{j \neq i}\|x_i - x_j\|\right).$$
3. Whether the agent resides within the target distance from the leader:
$$S^i_{l} = H\!\left(d_t - \|x_i - x_l\|\right),$$
where x_l is the leader position and d_t specifies the range of target distances to the leader.
We define the reward as explained in Algorithm 3. The most desirable state for the agent is to reside within the target distance to the leader. However, the agent is penalized if someone breaks into its comfort zone. If the agent is not within the target distance to the leader, we reward the pursuit of the leader. If the agent resides within the target distance to the leader and there are no neighbours in its comfort zone, it is penalized for the energy consumption. In our simulations the value of d_t was large enough to accommodate the whole swarm in the optimal configuration. In this scenario, each agent is also aware of its velocity with respect to the "laboratory" system, that is, with respect to the fluid at rest, v_i; the relative agents' velocities v_ij = v_j − v_i; the agent force (its action) F^a_i; and the relative agents' positions.
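The reward logic of Algorithm 3, as described above, can be sketched as follows; the constants and the energy-penalty weighting are illustrative assumptions:

```python
def locomotion_reward(S_cz, S_l, power,
                      r_in=1.0, r_pursuit=0.5, r_crowd=-1.0, w_energy=0.1):
    """Sketch of Algorithm 3 with assumed constants.
    S_cz: a neighbour is inside the comfort zone (0/1).
    S_l: agent is within the target distance from the leader (0/1).
    power: agent's currently dissipated power, penalized only when
    the agent is otherwise satisfied."""
    if S_cz:
        return r_crowd                   # penalize crowding first
    if not S_l:
        return r_pursuit                 # reward the pursuit of the leader
    return r_in - w_energy * power       # satisfied: penalize energy use
```

With this structure the agents first resolve collisions and positioning, and only then compete on energy efficiency, which drives the swarm toward the low-dissipation configuration of Fig. 4.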