## Data

All data were provided by G. Robinson and his group from the University of Illinois at Urbana-Champaign; their detailed data collection methods are reported in their paper (11). To summarize briefly, they used a single-cohort hive consisting of approximately 1000 adult worker bees aged 1 day and one unrelated, naturally mated queen. This type of colony is commonly used in experiments analyzing the division of labor to control the effect of age and, instead, focus on the behavioral development of worker bees as a function of their genetics and social interactions (26).

Similar to other approaches for tracking insects [12–14], their method used custom 2D QR-barcode devices, known as *bCode*. These devices were attached to the thoraxes of individual bees and provided sequences of digital images that enabled reliable identification and tracking of every individual in the hive. The digital images were converted into data with coordinates (x, y) and the orientation of each bee every second, with x ranging from 0 to 6576 and y ranging from 0 to 4384. The hive was installed vertically in a dark room with a consistent temperature (approximately 35°C) and humidity, and it was connected to the outside through a tunnel, whose entrance was closed until 2 or 3 days after the start of the observation and then opened to allow the workers to exit the hive and begin foraging (Fig. S1). The bees produced sufficient honey to feed the entire hive for the duration of the experiment, with sufficient bee nutrition for 2 days. Some bees could not be tracked because they died and because of sublations of *bCode*. The researchers removed the bees that could not be tracked from the tracking data before opening the entrance. The software could identify the bees in 94% of all cases (determined by a manual analysis of 60 images), with an error rate of 1.4% (determined by a manual analysis of 5000 detected barcodes). There were five separate trials (five different hives) of the experiments over July 2012–2013 (Table S1).

### Detection Of A “forager”

After opening the entrance, bees often went outside. It is well known that bees go outside for “foraging.” Although the point at which a bee becomes a forager varies, under normal colony conditions, forager bees are workers with an age of more than 3 weeks, and they shift to perform outside tasks, such as collection of water, nectar, pollen, and resin (27,28).

Because the time duration of our experiment was shorter than 3 weeks, the experimental duration and bees’ age were almost similar, so that there were no “real foragers” in our experiment. We considered bees that went outside as “forager candidates”; however, they could become real foragers after 3 weeks. In this study, we identified a bee that returned from outside within 10 min before each burst as a “forager candidate.” Moreover, to focus on foraging activities, we had to omit “orientation flights” (16,17) that occurred several days before the foraging behavior. To omit orientation flights, the forager candidate’s first day of foraging was defined as the first day on which it has at least six reads of going outside and > 25% of which occurred before 12:00 (based on personal observations, Gene Robinson’s group was aware that in their locality, most orientation flights occur in the afternoon). This criterion for the detection of the orientation flight is similar to that in the previous study by Gene Robinson’s group (18). Any flight activity on or after this day was defined as foraging activity.

### Detection Of “waggle Dance”

By performing a “waggle dance,” foragers can share directions and the distance to patches of flowers and water sources with other hive mates (21,29,30). The dancing bees waggle back and forth as they move forward in a straight line and then circle around to repeat the dance. We developed an algorithm for detecting a waggle dance in which the rotation of the bee was used as an indicator of the waggle dance.

The orientation of a bee at time *t*, where \({{n}_{x}}_{t}\) and \({n}_{{y}_{t}}\) denote the respective \(x\) and \(y\) components of the head direction of each bee, is as follows:

$${n}_{t}={(n}_{{x}_{t}},{n}_{{y}_{t}})$$

If \({n}_{t}\cdot {n}_{t+1}<0\), then the orientation of the bee changed to > 90°. As the first condition was to determine whether a bee was dancing, they used the following parameter: in 5 consecutive seconds, there should be at least four negative scalar products, which implies that, in 5 consecutive seconds, the potential dancer had to change its orientation at least four times to > 90°.

As the second criterion, if \({n}_{t}\times {n}_{t+1} >0\), then the bee made a right turn. When the orientation of the bee alternated between left turn and right turn over 5 consecutive seconds, it was identified as a “potential dancer.” We identified a bee that met both conditions as a “dancer.” To confirm the success rate of the detection, we attempted to validate the results by watching the videos of the detected dances and by considering the features of the “dancer” (e.g., circling, have followers, and trophallaxis) that were related to dance. We found that, in our tested list of 500 dances, the success rate of the detected dances was 75–82%. In Fig. 5, to examine the relationship between a burst and its most recent dance event, bees that were dancing within 10 minutes before each burst are considered dancers for that burst.

### Detection Of “dance Followers”

Dance followers receive information about the feeding area from dancers (21,22). Their existence was observed by von Frisch in 1967 (20). The dance follower tracks the motion of a dancer by keeping its head facing. In this study, we defined a dance follower as a bee that is within 600 pixels (approximately the size of a bee) of the detected dancer when the dancer is dancing and where the direction vector from the follower to the dancer and the vector of the follower’s head direction are always within 90°.

### Detection Of A “burst”

We also required a criterion to detect the “bursting” phase. Kleinberg’s burst detection algorithm was used to detect bursting events (15). This algorithm assumes that the intervals of the events occur independently according to the following exponential distribution:

$$f\left(x\right)=\stackrel{-}{\lambda }{e}^{-\stackrel{-}{\lambda }x}$$

Here,\(\stackrel{-}{\lambda }\) is the mean frequency, defined as \(N/T\), where *N* is the total number of events over the time series and *T* is the total length of the time series. In addition, \(x\) is the interval between consecutive events. Bursts were detected by comparing the expected frequency with the actual event frequency observed within the specific time window.

We defined the burst level at every time point \(t\) of individual events. The burst level was expressed as \(bl\left(t\right)\). When \(bl\left(t\right)\) was ≥ 1, we considered it as a burst. When the local event frequency at time \(t\), shown as \(\lambda\), exceeded a certain threshold, the burst level was updated. When \({\lambda }_{t}\) exceeded \(\stackrel{-}{\lambda }{s}^{1}\), the burst level \(bl\left(t\right)\) became 1 from 0. In the same manner, if \({\lambda }_{t}\) exceeded \(\stackrel{-}{\lambda }{s}^{2}\), \(bl\left(t\right)\) became 2 from 1, and so on. We used \(s=2\), so the burst level would increase by 1 when the frequency doubled the previous one.

In case we had followed this process naively, the algorithm would have detected a large number of bursts, when the actual frequency of events fluctuated around the boundary between two burst levels, 0 and 1. To alleviate this, we introduced another parameter \(\gamma\) into the algorithm. We used \(\gamma =1\) (see reference [15] for details). We defined the “burst period” as a period when the burst level was ≥ 1.

This algorithm originally required timestamps of the sequence of events in question. However, our data are provided as a time series \({K}_{G}\). Here, we consider \({K}_{G}\) to be a frequency-based sequence, and we must convert our frequency-based data into interval-based data. We simply calculated the inverse of the frequency as the interval between events. We applied the algorithm after removing the diurnal cycle of \({K}_{G}\left(t\right)\).

In this study, we explored the bursting behavior detected by Kleinberg’s burst detection algorithm. However, there were also small burst-like activities that were not detected by Kleinberg’s algorithm (depending on the parameters of the algorithm). We speculate that there are some reasons why these small bursts do not grow into larger, global bursts. By examining these small bursting behaviors, we may elucidate a more detailed mechanism for the bursting behavior.

### Classifying Bee Behavior By Activity And Activity Timing

To examine when and which bees were initializing and committing to organizing a global burst, we used NMF, which has become a popular decomposition algorithm (19,20).

Given a nonnegative matrix \(X\), this algorithm finds the nonnegative matrix factors \(W\)and \(H\), where \(W\in {R}^{m \times n}\), \(H\in {R}^{r \times n}\), and \(r\) is the number of components (known as *rank*), as follows:

In practice, \(r\) is often chosen such that \(r\ll \left(m,n\right)\).

A critical parameter of NMF is the factorization rank \(r\) that defines the number of representative bee activities used to approximate the target matrix. The method of setting the value of \(r\) is to attempt different values and compute some quality measures and then select the best value. Several approaches are available to decide the optimal value of \(r\)—for example, Brunet *et al*. proposed choosing the first value of \(r\) for which the cophenetic coefficient starts decreasing (19). Hutchins *et al*. suggested choosing the first value where the residual sum of the squares curve presents an inflection point (35). We decided to use the approach of Brunet *et al*. in this study. The advantage of this algorithm is that it can factorize the input matrix without destroying the cluster structure of the original matrix. To estimate \(r\), we calculated the cophenetic coefficient, changing the value of \(r\) from 2 to 10.

In general, \(H\)is known as the base (feature) matrix, and \(W\) is known as the weight matrix. In the case of our analysis, \(X\in {R}^{m \times n}\) implies the \({K}_{i}\) matrix at the bursting phase, as determined by Kleinberg’s algorithm. In addition, \(n\) is the number of individual bees, \(m\) is the time length of the bursting phase, and the matrix elements are \({K}_{i}\). To extract \(X\) from the original \({K}_{i}\) matrix, we explored the timestamps of the peak points for each burst \(t{s}_{b}\), where \(b\) is the index of the detected bursts. We then defined \({X}_{b}\) as follows:

$${X}_{b}=\left[\begin{array}{c}{k}_{{ts}_{b}-\text{1000,0}} \cdots {k}_{{ts}_{b}+1000,n} \\ ⋮ \ddots ⋮\\ {k}_{{ts}_{b}-\text{1000,0}} \cdots {k}_{{ts}_{b}+1000,n} \end{array}\right]$$

We confirmed that the minimum interval of the peaks of the detected bursts was 2538 s. Moreover, \(H\) describes the representative examples of the time series \({K}_{i}\) (the number of examples [base] depends on \(r\)). The method starts by randomly initializing the matrices \(W\) and \(H\), which are iteratively updated to minimize a functional divergence. To estimate the matrices \(W\) and \(H\), as a local minimum, the NMF algorithm solves the following minimization problem, where \(D\) is a loss function based on the Kullback–Leibler divergence:

$$min\left\{ W,H \ge 0 \right\}\left[ D\left(X ,WH\right)\right]$$

There are some types of update functions for solving the prior minimization problem iteratively. In this study, we used the “multiplicative update rules.” The initial entries for \(W\) and \(H\) were drawn from a uniform distribution. The number of iterations of the update function was 500. The error rate converges sufficiently well at around 500 iterations. We executed it 50 times, changing the initial matrices, and we then selected the solution in which the cost function was minimized.

### Classifying “pioneer Bees” Using Nmf

NMF decomposes the original matrix into a matrix composed of feature vectors and its weight matrix; hence, it is often used as a soft-clustering algorithm. The cluster member was computed as the index of the dominant basis component for each bee. For instance, if the maximum value of the weight vector \(w\) ( \(= {W}_{\left(,beeZ\right)}\)) of bee Z was \({w}_{\left\{i\left(1\le i \le r\right)\right\}}\), bee Z was classified as the *i*th base.

We identified that a “pioneer bee” is an individual corresponding to a base vector whose largest amplitude was observed before that of the global burst. If the largest value of the recomposed \({K}_{i}\) of bee Z at burst \(i\) was smaller than the mean \({X}_{i}\), we did not identify bee Z as a “pioneer bee.”

Such bees were identified for each observed burst. We created the “pioneer-bee binary matrix” \(P{M}_{\left\{i,j\right\}}\), where the columns represent each burst index and the rows represent each individual bee index (Fig. 5). The matrix element was either 0 or 1—if a bee \(i\) was categorized as a “pioneer bee” of the burst \(j\), then the matrix element \(\left(i,j\right)\) was assigned 1; otherwise, it was 0.

### Dimension Reduction Of The “pioneer-bee Binary Matrix”

To examine the similarity between the combination of pioneer bees in different bursts, we conducted a dimensional reduction of \(P{M}_{\left\{i,j\right\}}\) using nonmetric multidimensional scaling (nMDS) (25).

The input data of nMDS is a distance matrix. In this study, we first created the distance matrix from the “pioneer-bee matrix.” We used the Jaccard index for computing the distance of a binary matrix. Next, we reduced the dimension of the distance matrix using nMDS. In the reduced space, the distance between the points exhibited the behavioral similarity of the pioneer bees in each burst. As such, if the same bees were pioneer bees during different bursts, then these bursts are represented close to each other; otherwise, they are presented further apart. We iterated until the stress (the loss function of nMDS) was smaller than 0.2.